Haley Tucker on Responding to Failures in Playback Features at Netflix
The InfoQ Podcast - A podcast by InfoQ
Categories:
In this week’s podcast, Thomas Betts talks with Haley Tucker, a Senior Software Engineer on the Playback Features team at Netflix. While at QCon San Francisco 2016, Tucker told some production war stories about trying to deliver content to 65 million members. Why listen to this podcast: - Distributed systems fail regularly, often due to unexpected reasons - Data canaries can identify invalid metadata before it can enter and corrupt the production environment - ChAP, the Chaos Automation Platform, can test failure conditions alongside the success conditions - Fallbacks are an important component of system stability, but the fallbacks must be fast and light to not cause secondary failures - Distributed systems are fundamentally social systems, and require a blameless culture to be successful Notes and links can be found on: http://bit.ly/2hqzQ6K 2m:05s - The Video Metadata Service aggregates several sources into a consistent API consumed by other Netflix services. 2m:43s - Several checks and validations were in place within the video metadata service, but it is impossible to predict every way consumers will be using the data. 3m:30s - The access pattern used by the playback service was different than that used in the checks, and led to unexpected results in production. 3m:58s - Now, the services consuming the data are also responsible for testing and verifying the data before it rolls out to production. The Video Metadata Service can orchestrate the testing process. More on this: Quick scan our curated show notes on InfoQ. http://bit.ly/2hqzQ6K You can also subscribe to the InfoQ newsletter to receive weekly updates on the hottest topics from professional software development. bit.ly/24x3IVq