Roosroos loon-software

Loon's teams across hardware, software, and operations orgs used postmortems, as was standard practice in their fields for incident response.

The Flight Operations Team, which handled the day-to-day operations of steering launched balloons, captured in-flight issues in a tracking system. The tracking system was part of the anomaly resolution system devised to identify and resolve root cause problems.

Most incidents spanned multiple teams. The Aviation and Systems Safety Team, which focused on safety related to the flight system and flight process, also brought its own tradition and best practices of postmortems. However, because industry standards for postmortems and for handling different types of problems varied across teams, there was some divergence in process.

We proactively encouraged teams to share postmortems between teams, between orgs, and across the company so that anyone could provide feedback and insight into an incident. In that way, anyone at Loon could contribute to a postmortem, see how an incident was handled, and learn about the breadth of challenges that Loon was solving.

While everyone agreed that postmortems were an important practice, in a fast-moving start-up culture it was a struggle to follow through comprehensively on action items. Ideally, we would have prioritized postmortems that focused on best practices and learnings applicable to multiple generations of the platform, but those weren't easy to identify at the time of each incident.

Even though the company was not especially large, the novelty of Loon's platform and the interconnectedness of its operations made it difficult to determine which team was responsible for writing a postmortem and investigating root causes. For example, a 20-minute service disruption on the ground might be caused by a loss of connectivity from the balloon to the backhaul network, a pointing error with the antennae on the payload, insufficient battery levels, or wind that temporarily blew the balloon out of range.

Actual causes could be quite nuanced, and often were attributable to interactions between multiple sub-systems. Thus, we had a chicken-and-egg problem: which team should start the postmortem and investigation, and when should they hand off the postmortem to the teams that likely owned the faulty system or process? Not all teams had a culture of postmortems, so the process could stall depending on the system where the root cause originated.

Much of how Loon used postmortems, especially in software development and the Prod Team, was in line with SRE industry standards. Sharing the postmortems openly and widely across Loon was critical to building a culture of continuous improvement and addressing root causes. To increase cross-team awareness of incidents, we instituted a Postmortem Working Group. In addition to reading and discussing recent postmortems from across the company, the goals of the working group were to make it easier to write postmortems, promote the practice of writing postmortems, increase sharing across teams, and discuss the findings of these incidents in order to learn the patterns of failure.

The use of postmortems became a standardizing factor across Loon's teams, from avionics and manufacturing, to flight operations, to software platforms and network service. Many industries have adopted the use of postmortems; they are fairly common in high-risk fields where mistakes can be fatal or extremely expensive. As the original SRE book states, blameless postmortems are key to "an environment where every 'mistake' is seen as an opportunity to strengthen the system."

To facilitate learning, SRE's postmortem format includes both what went well (acknowledging the successes that should be maintained and expanded) and what went poorly and needs to be changed. The Prod Team had three primary goals, one of which was to own the mission of fielding and providing a reliable commercial service (Loon Library) in the real world.

Seeking to complement the anomaly resolution system, the Flight Operations Team incorporated the SRE software team's postmortem format for incidents that needed further investigation: for example, failure to avoid a storm system, deviations from the simulated expected flight path that led to an incident, and flight operator actions that directly or indirectly caused an incident.

Their motto, "Own our Safety", brought a commitment to continually improving safety performance and building a positive safety culture across the company. This was one of the strengths of Loon's culture: all the organizations were aligned not just on our audacious vision to "connect people everywhere," but also on doing so safely and effectively. This probably comes as no surprise to developers in similar environments — when the platform or services that require investment are rapidly changing or being replaced, it's hard to spend resources on not repeating the same mistakes.

The Postmortem Working Group's founding goal was to "Cultivate a postmortem culture in Loon to encourage thoughtful risk taking, to take advantage of mistakes, and to provide structure to support improvement over time." Meetings of the Prod Team and several other teams had a standing agenda item to review postmortems of interest from across the company, and we sent a semi-annual email celebrating Loon's "best-of" recent incidents: the most interesting or educational outages.

Once we had a standardized postmortem template in place, we could adopt and reuse it to document commercial service field tests. By recording a timeline and incidents, defining a process and space to determine root causes of problems, recording measurements and metrics, and providing the structure for action item tracking, we brought the benefits of postmortem retrospectives to prospective tasks. When Loon began commercial trials in countries like Peru and Kenya, we conducted numerous field tests.

The Prod Team proactively used the postmortem template to document the field tests. It provided a useful format to record the log of test events, results that did and did not match expectations, and links to further investigations into those failures. For a cutting-edge project in a highly variable operating environment, using the postmortem template as our default testing template was an acknowledgement that we were in a state of constant and rapid iteration and improvement.
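As a rough illustration of the kind of structure described above, here is a minimal Python sketch of a postmortem-style record reused for a field test. It is not Loon's actual template: the class names, field names, and helper methods (`FieldTestReport`, `TimelineEvent`, `ActionItem`, `unexpected_events`, `open_action_items`) are hypothetical, chosen only to mirror the elements named in the text (a timeline of events, results that did or did not match expectations, root causes, metrics, links to further investigations, and tracked action items).

```python
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class TimelineEvent:
    """One entry in the log of test or incident events (hypothetical shape)."""
    timestamp: datetime
    description: str
    matched_expectation: bool = True  # False flags a result worth investigating


@dataclass
class ActionItem:
    """A follow-up task with an owner, so it can be tracked to completion."""
    description: str
    owner: str
    done: bool = False


@dataclass
class FieldTestReport:
    """Postmortem-style record reused for a field test (illustrative only)."""
    title: str
    timeline: list[TimelineEvent] = field(default_factory=list)
    root_causes: list[str] = field(default_factory=list)
    metrics: dict[str, float] = field(default_factory=dict)
    investigation_links: list[str] = field(default_factory=list)
    action_items: list[ActionItem] = field(default_factory=list)

    def unexpected_events(self) -> list[TimelineEvent]:
        """Return the events whose results did not match expectations."""
        return [e for e in self.timeline if not e.matched_expectation]

    def open_action_items(self) -> list[ActionItem]:
        """Return the action items that still need follow-up."""
        return [a for a in self.action_items if not a.done]


# Usage with made-up values, purely for illustration.
report = FieldTestReport(title="Example field test")
report.timeline.append(
    TimelineEvent(datetime(2000, 1, 1, 9, 0), "Service link established")
)
report.timeline.append(
    TimelineEvent(datetime(2000, 1, 1, 9, 20), "Service dropped",
                  matched_expectation=False)
)
report.action_items.append(ActionItem("Investigate service drop", owner="team-a"))

for event in report.unexpected_events():
    print(event.timestamp.isoformat(), event.description)
```

Keeping action items as structured entries with an owner and a completion flag, rather than free text, is one way to make follow-through easier to audit.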

These trials took place under the sudden specter of Covid and the subsequent shift towards working from home. The structured communications at the core of Loon's postmortem structure were particularly helpful as we moved from in-person coordination rooms to working from home. Postmortems are widely used in various industries because they are effective. An effective postmortem culture should include a clear process for writing postmortems, clear guidelines for when to conduct a postmortem, and a staffed commitment to follow up on action items.

The many points of failure we observed across the range of postmortems were indicative of both the complexity of Loon's systems and the complexity of some of its supporting infrastructure. Postmortems are adept at finding both flaky tests and fragile processes.

These are complexities familiar to many startups, where postmortems can help manage the tradeoffs of making changes safely. Loon was still operating a superhero culture: across a wide range of issues, a small set of experts was repeatedly called upon to fix the system. Once we identified this pattern, our plan for commercial service was to staff a 24x7 oncall rotation, complemented by Program Managers driving intentional processes to de-risk production.

While the specifics of Loon's journey to standardize postmortems tell the story of one company, we have some tips and takeaways that should be applicable at most organizations. Although the initiative of writing postmortems often originates with a software team, if you want every team to adopt the practice, we suggest trying the following:

Invite people representing different teams to be part of the postmortem working group. They will give insights into what could work better for their respective teams.

Reviewing and consulting on postmortems may be within the scope of their duties, especially while new teams are adopting this practice. During adoption in particular, you want teams to see the benefits of postmortems, not the burden of writing them. Creating a postmortem template with minimum requirements can be helpful. Who should write a postmortem and when?
