Tanya Reilly on Site Reliability Engineering and the Evolution of the New York City Fire Code

The InfoQ Podcast - A podcast by InfoQ

Categories:

This week on the InfoQ Podcast, Wes Reisz talks to Tanya Reilly (Principal Engineer at Squarespace and previously a staff SRE at Google). Tanya discusses her research into how the fire code evolved in New York and draws on some of the parallels she sees in software. Along the way, she discusses what it means to be an SRE, what effective aspects of the role might look like, and her opinions on what we as an industry should be doing to prevent disasters. This podcast features discussion on paved roads, prevention, testing, firefighting (in software), and reliability questions to ask throughout the software lifecycle. Why listen to this podcast: - Teams increasingly are responsible for the entire software lifecycle. When this happens, they think about the software differently because they know their the ones that will get paged if it fails. This idea is at the core of the “You Build It, You Run It” philosophy in DevOps. - The role of SRE is to define how to do things in a really reliable way. The focus is to make the majority of the operations work go away, and, for the things that can’t go away, it’s as easy as possible. - At the very start of a project (when you’re writing the initial design), you should be thinking about the dependencies for a system and how will those that follow with be able to determine that. A great way to do this is to offer an API that people will want to use and then instrument it. - We can learn a lot from the growth of fire safety regulations as metaphors for software, including: fireproof interior walls, socializing best practices, software inspections, and circuit breakers are all examples. - The work SREs do varies in many places. SREs range from making recommendations on patterns to library creators in other areas. Occasionally, SREs are firefighters of last resort. In these cases, they’re the last resort though. - We use error budgets and SLOs to quantify how many much risk we’re comfortable taking. It’s used to inform how much less (or more risk) we’re willing to take on. - We need to consider software reliability throughout the full cycle of software development. When you build systems. Think about as if there will not be someone on call for it . You can also subscribe to the InfoQ newsletter to receive weekly updates on the hottest topics from professional software development. bit.ly/24x3IVq Subscribe: www.youtube.com/infoq Like InfoQ on Facebook: bit.ly/2jmlyG8 Follow on Twitter: twitter.com/InfoQ Follow on LinkedIn: www.linkedin.com/company/infoq

Visit the podcast's native language site