A Reflection On Data Observability As It Reaches Broader Adoption

Data Engineering Podcast - A podcast by Tobias Macey - Sundays

Summary

Data observability is a product category that has seen massive growth and adoption in recent years. Monte Carlo is in the vanguard of companies that have been enabling data teams to observe and understand their complex data systems. In this episode, founders Barr Moses and Lior Gavish rejoin the show to reflect on the evolution and adoption of data observability technologies and the capabilities that are being introduced as the broader ecosystem adopts the practices.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management.

When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don’t forget to thank them for their continued support of this show!

Atlan is the metadata hub for your data ecosystem. Instead of locking your metadata into a new silo, unleash its transformative potential with Atlan’s active metadata capabilities. Push information about data freshness and quality to your business intelligence, automatically scale up and down your warehouse based on usage patterns, and let the bots answer those questions in Slack so that the humans can focus on delivering real value. Go to dataengineeringpodcast.com/atlan today to learn more about how Atlan’s active metadata platform is helping pioneering data teams like Postman, Plaid, WeWork & Unilever achieve extraordinary things with metadata and escape the chaos.

RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their state-of-the-art reverse ETL pipelines enable you to send enriched data to any cloud tool. Sign up free… or just get the free t-shirt for being a listener of the Data Engineering Podcast at dataengineeringpodcast.com/rudder.

The only thing worse than having bad data is not knowing that you have it. With Bigeye’s data observability platform, if there is an issue with your data or data pipelines you’ll know right away and can get it fixed before the business is impacted. Bigeye lets data teams measure, improve, and communicate the quality of your data to company stakeholders. With complete API access, a user-friendly interface, and automated yet flexible alerting, you’ve got everything you need to establish and maintain trust in your data. Go to dataengineeringpodcast.com/bigeye today to sign up and start trusting your analyses.

Your host is Tobias Macey and today I’m interviewing Barr Moses and Lior Gavish about the state of the market for data observability and their own work at Monte Carlo.

Interview

Introduction
How did you get involved in the area of data management?
Can you give the elevator pitch for Monte Carlo?
What are the notable changes in the Monte Carlo product and business since our last conversation in October 2020?
You were one of the early entrants in the market of data quality/data observability products. In your work to gain visibility and traction you invested substantially in content creation (blog posts, presentations, round table conversations, etc.). How would you summarize the focus of your initial efforts?
Why do you think data observability has really taken off? A few years ago, the category barely existed – what’s changed?
There’s a larger debate within the data engineering community regarding whether it makes sense to go deep or go broad when it comes to monitoring your data. In other words, do you start with a few important data sets, or do you attempt to cover the entire ecosystem? What is your take?
For engineers and teams who are just now investigating and investing in observability/quality automation for their data, what are their motivations?
How has the conversation around the value/motivating factors matured or changed over the past couple of years?
In what way have the requirements and capabilities of data observability platforms shifted? What are the forces in the ecosystem that have driven those changes?
How has the scope and vision for your work at Monte Carlo evolved as the understanding and impact of data quality have become more widespread?
When teams invest in data quality/observability what are some of the ways that the insights gained influence their other priorities and design choices? (e.g. platform design, pipeline design, data usage, etc.)
When it comes to selecting what parts of the data stack to invest in, how do data leaders prioritize? For instance, when does it make sense to build or buy a data catalog? A data observability platform?
The adoption of any tool that adds constraints is a delicate balance. What have you found to be the predominant patterns for teams who are incorporating Monte Carlo? (e.g. maintaining delivery velocity and adding safety/trust)
A corollary to the goal of data engineers for higher reliability and visibility is the need by the business/team leadership to identify "return on investment". How do you and your customers think about the useful metrics and measurement goals to justify the time spent on "non-functional" requirements?
What are the most interesting, innovative, or unexpected ways that you have seen Monte Carlo used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Monte Carlo?
When is Monte Carlo the wrong choice?
What do you have planned for the future of Monte Carlo?

Contact Info

Barr: LinkedIn, @BM_DataDowntime on Twitter
Lior: LinkedIn, @lgavish on Twitter

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements

Thank you for listening! Don’t forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email [email protected] with your story.
To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers.

Links

Monte Carlo (Podcast Episode)
App Dynamics
Datadog
New Relic
Data Quality Fundamentals book
State Of Data Quality Survey
dbt (Podcast Episode)
Airflow
Dagster (Podcast Episode)
Episode: Incident Management For Data Teams
Databricks Delta
Patch.tech Snowflake APIs
Hightouch (Podcast Episode)

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA