Martin Mao on Observability, focusing on Alerting, Triage, & RCA

The InfoQ Podcast - A podcast by InfoQ

Observability is a crucial aspect of operating microservices at scale. Today on the InfoQ podcast, Wes Reisz speaks with Chronosphere’s CEO Martin Mao about how he thinks about observability. Specifically, the two discuss Chronosphere’s three-stage strategy for implementing a successful observability program.

The first stage is alerting. Martin discusses how metrics (usually things like RED metrics or Google’s Four Golden Signals) aggregate counts and let operators know when a system is moving towards an incident. In the second stage, operators isolate and triage what is happening so that the system can be restored quickly. In the final stage, root cause analysis (RCA), the goal is to prevent the incident from happening again. Martin uses this three-stage approach, and the questions that should be asked in each stage, as a way of focusing on what’s important in a modern cloud native architecture, such as reducing Mean Time to Recovery.

Observability is the ability to understand the state of a system by observing its outputs. On today’s podcast we talk about a strategy for implementing a meaningful observability program.

Read a transcript of this interview: https://bit.ly/3AZYpkD

Subscribe to our newsletters:
- The InfoQ weekly newsletter: bit.ly/24x3IVq
- The Software Architects’ Newsletter [monthly]: www.infoq.com/software-architects-newsletter/

Upcoming Virtual Events: events.infoq.com/
InfoQ Live: live.infoq.com/
- July 20, 2021
- August 17, 2021

Follow InfoQ:
- Twitter: twitter.com/InfoQ
- LinkedIn: www.linkedin.com/company/infoq
- Facebook: bit.ly/2jmlyG8
- Instagram: @infoqdotcom
- Youtube: www.youtube.com/infoq
