Democratize Data Cleaning Across Your Organization With Trifacta
Data Engineering Podcast - A podcast by Tobias Macey - Duminică
Categories:
Summary Every data project, whether it’s analytics, machine learning, or AI, starts with the work of data cleaning. This is a critical step and benefits from being accessible to the domain experts. Trifacta is a platform for managing your data engineering workflow to make curating, cleaning, and preparing your information more approachable for everyone in the business. In this episode CEO Adam Wilson shares the story behind the business, discusses the myriad ways that data wrangling is performed across the business, and how the platform is architected to adapt to the ever-changing landscape of data management tools. This is a great conversation about how deliberate user experience and platform design can make a drastic difference in the amount of value that a business can provide to their customers. Announcements Hello and welcome to the Data Engineering Podcast, the show about modern data management You listen to this show to learn about all of the latest tools, patterns, and practices that power data engineering projects across every domain. Now there’s a book that captures the foundational lessons and principles that underly everything that you hear about here. I’m happy to announce I collected wisdom from the community to help you in your journey as a data engineer and worked with O’Reilly to publish it as 97 Things Every Data Engineer Should Know. Go to dataengineeringpodcast.com/97things today to get your copy! When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show! Are you bored with writing scripts to move data into SaaS tools like Salesforce, Marketo, or Facebook Ads? Hightouch is the easiest way to sync data into the platforms that your business teams rely on. The data you’re looking for is already in your data warehouse and BI tools. Connect your warehouse to Hightouch, paste a SQL query, and use their visual mapper to specify how data should appear in your SaaS systems. No more scripts, just SQL. Supercharge your business teams with customer data using Hightouch for Reverse ETL today. Get started for free at dataengineeringpodcast.com/hightouch. Atlan is a collaborative workspace for data-driven teams, like Github for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription Your host is Tobias Macey and today I’m interviewing Adam Wilson about Trifacta, a platform for modern data workers to assess quality, transform, and automate data pipelines Interview Introduction How did you get involved in the area of data management? Can you describe what Trifacta is and the story behind it? Across your site and material you focus on using the term "data wrangling". What is your personal definition of that term, and in what ways do you differentiate from ETL/ELT? How does the deliberate use of that terminology influence the way that you think about the design and features of the Trifacta platform? What is Trifacta’s role in the overall data platform/data lifecycle for an organization? What are some examples of tools that Trifacta might replace? What tools or systems does Trifacta integrate with? Who are the target end-users of the Trifacta platform and how do those personas direct the design and functionality? Can you describe how Trifacta is architected? How have the goals and design of the system changed or evolved since you first began working on it? Can you talk through the workflow and lifecycle of data as it traverses your platform, and the user interactions that drive it? How can data engineers share and encourage proper patterns for working with data assets with end-users across the organization? What are the limits of scale for volume and complexity of data assets that users are able to manage through Trifacta’s visual tools? What are some strategies that you and your customers have found useful for pre-processing the information that enters your platform to increase the accessibility for end-users to self-serve? What are the most interesting, innovative, or unexpected ways that you have seen Trifacta used? What are the most interesting, unexpected, or challenging lessons that you have learned while working on Trifacata? When is Trifacta the wrong choice? What do you have planned for the future of Trifacta? Contact Info LinkedIn @a_adam_wilson on Twitter Parting Question From your perspective, what is the biggest gap in the tooling or technology for data management today? Closing Announcements Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used. Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. If you’ve learned something or tried out a project from the show then tell us about it! Email [email protected]) with your story. To help other people find the show please leave a review on iTunes and tell your friends and co-workers Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat Links Trifacta Informatica UC Berkeley Stanford University Citadel Podcast Episode Stanford Data Wrangler DBT Podcast Episode Pig Databricks Sqoop Flume SPSS Tableau SDLC == Software Delivery Life-Cycle The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA Support Data Engineering Podcast