Enabling Version Controlled Data Collaboration With TerminusDB
Data Engineering Podcast - A podcast by Tobias Macey - Sundays
Summary
As data professionals we have a number of tools available for storing, processing, and analyzing data. We also have tools for collaborating on software and analysis, but collaborating on data is still an underserved capability. Gavin Mendel-Gleason encountered this problem firsthand while working on the Seshat databank, leading him to create TerminusDB and TerminusHub. In this episode he explains how the TerminusDB system is architected to provide a versioned graph storage engine that allows for branching and merging of data sets, and how that opens up new possibilities for individuals and teams to work together on building new data repositories. This is a fascinating conversation on the technical challenges involved, the opportunities that such a system provides, and the complexities inherent to building a successful business on open source.

Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management.
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $60 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Do you want to get better at Python? Now is an excellent time to take an online course. Whether you’re just learning Python or you’re looking for deep dives on topics like APIs, memory management, async and await, and more, our friends at Talk Python Training have a top-notch course for you. If you’re just getting started, be sure to check out the Python for Absolute Beginners course. It’s like the first year of computer science that you never took, compressed into 10 fun hours of Python coding and problem solving. Go to dataengineeringpodcast.com/talkpython today and get 10% off the course that will help you find your next level. That’s dataengineeringpodcast.com/talkpython, and don’t forget to thank them for supporting the show.
You invest so much in your data infrastructure – you simply can’t afford to settle for unreliable data. Fortunately, there’s hope: in the same way that New Relic, DataDog, and other Application Performance Management solutions ensure reliable software and keep application downtime at bay, Monte Carlo solves the costly problem of broken data pipelines. Monte Carlo’s end-to-end Data Observability Platform monitors and alerts for data issues across your data warehouses, data lakes, ETL, and business intelligence. The platform uses machine learning to infer and learn your data, proactively identify data issues, assess their impact through lineage, and notify those who need to know before they impact the business. By empowering data teams with end-to-end data reliability, Monte Carlo helps organizations save time, increase revenue, and restore trust in their data. Visit dataengineeringpodcast.com/montecarlo today to request a demo and see how Monte Carlo delivers data observability across your data infrastructure. The first 25 will receive a free, limited edition Monte Carlo hat!
Your host is Tobias Macey and today I’m interviewing Gavin Mendel-Gleason about TerminusDB, an open source model-driven graph database for knowledge graph representation.

Interview
Introduction
How did you get involved in the area of data management?
Can you start by describing what TerminusDB is and what motivated you to build it?
What are the use cases that TerminusDB and TerminusHub are designed for?
There are a number of different reasons and methods for versioning data, such as the work being done with Datomic, LakeFS, DVC, etc. Where does TerminusDB fit in relation to those and other data versioning systems that are available today?
Can you describe how TerminusDB is implemented?
How has the design changed or evolved since you first began working on it?
What were the decision process and design considerations that led you to choose Prolog as the implementation language?
One of the challenges that have faced other knowledge engines built around RDF is that of scale and performance. How are you addressing those difficulties in TerminusDB?
What are the scaling factors and limitations for TerminusDB? (e.g. volumes of data, clustering, etc.)
How does the use of RDF triples and JSON-LD impact the audience for TerminusDB?
How much overhead is incurred by maintaining a long history of changes for a database?
How do you handle garbage collection/compaction of versions?
How does the availability of branching and merging strategies change the approach that data teams take when working on a project?
What are the edge cases in merging and conflict resolution, and what tools does TerminusDB/TerminusHub provide for working through those situations?
What are some useful strategies that teams should be aware of for working effectively with collaborative datasets in TerminusDB?
Another interesting element of the TerminusDB platform is the query language. What did you use as inspiration for designing it and how much of a learning curve is involved?
What are some of the most interesting, innovative, or unexpected ways that you have seen TerminusDB used?
What are the most interesting, unexpected, or challenging lessons that you have learned while building TerminusDB and TerminusHub?
When is TerminusDB the wrong choice?
What do you have planned for the future of the project?

Contact Info
@GavinMGleason on Twitter
LinkedIn
GavinMendelGleason on GitHub

Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links
TerminusDB
TerminusHub
Chem Informatics
Type Theory
Graph Database
Trinity College Dublin
Seshat Databank: analytics over civilizations in history
PostgreSQL
DGraph
Grakn
Neo4J
Datomic
LakeFS
DVC
Dolt
Persistent Succinct Data Structure
Currying
Prolog
WOQL: TerminusDB query language
RDF
JSON-LD
Semantic Web
Property Graph
Hypergraph
Super Node
Bloom Filters
Data Curation (Podcast Episode)
CRDT == Conflict-Free Replicated Data Types (Podcast Episode)
SPARQL
Datalog
AST == Abstract Syntax Tree

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast