Ryan Kitchens on Learning from Incidents at Netflix, the Role of SRE, and Sociotechnical Systems

The InfoQ Podcast - A podcast by InfoQ

Categories:

In today’s podcast we sit down with Ryan Kitchens, a senior site reliability engineer and member of the CORE team at Netflix. This team is responsible for the entire lifecycle of incident management at Netflix, from incident response to memorialising an issue. Why listen to this podcast: - Top level metrics can be used as a proxy for user experience, and can be used to determine that issue should be alerted on an investigated. For example, at Netflix if the customer playback initiation “streams per second” metric declines rapidly, this may be an indication that something has broken. - Focusing on how things go right can provide valuable insight into the resilience within your system e.g. what are people doing everyday that helps us overcome incidents. Finding sources of resilience is somewhat “the story of the incident you didn’t have”. - When conducting an incident postmortem, simply reconstructing an incident is often not sufficient to determine what needs to be fixed; there is no root cause with complex socio-technical systems as found at Netflix and most modern web-based organisations. Instead, teams must dig a little deeper, and look for what went well, what contributed to the problem, and where are the recurring patterns. - Resilience engineering is a multidisciplinary field that was established in the early 2000s, and the associated community that has emerged is both academic and deeply practical. Although much resilience engineering focuses on domains such as aviation, surgery and military agencies, there is much overlap with the domain of software engineering. - Make sure that support staff within an organisation have a feedback loop into the product team, as these people providing support often know where all of the hidden problems are, the nuances of the systems, and the workarounds. More on this: Quick scan our curated show notes on InfoQ https://bit.ly/2LLwk8T You can also subscribe to the InfoQ newsletter to receive weekly updates on the hottest topics from professional software development. bit.ly/24x3IVq Subscribe: www.youtube.com/infoq Like InfoQ on Facebook: bit.ly/2jmlyG8 Follow on Twitter: twitter.com/InfoQ Follow on LinkedIn: www.linkedin.com/company/infoq Check the landing page on InfoQ: https://bit.ly/2LLwk8T

Visit the podcast's native language site