Hello, and welcome to the Data Engineering Podcast, the show about modern data management. If you lead a data team, you know this pain. Every department needs dashboards, reports, custom views, and they all come to you. So you're either the bottleneck slowing everyone down, or you're spending all your time building one-off tools instead of doing actual data work.
Retool gives you a way to break that cycle. Their platform lets people build custom apps on your company data while keeping it all secure. Type a prompt like "build me a self-service reporting tool that lets teams query customer metrics from Databricks," and they get a production-ready app with the permissions and governance built in. They can self-serve, and you get your time back. It's data democratization without the chaos.
Check out Retool at dataengineeringpodcast.com slash Retool today, that's r e t o o l, and see how other data teams are scaling self-service. Because let's be honest, we all need to retool how we handle data requests.
Your host is Tobias Macey, and today I'm interviewing Tim Sehn about Dolt, a version-controlled database engine, and its applications for agentic workflows. So, Tim, can you start by introducing yourself?
Sure. You did a pretty good job already, but I'm Tim. I'm the founder and CEO of DoltHub. Our main product is Dolt. It's the world's first version-controlled SQL database. It's named after Git. "Git" means idiot in British slang, so we needed a word that meant idiot, started with "d" for data, and was short enough to type on the command line. We're not calling our users stupid. That's not the intent.
And do you remember how you first got started working in data?
Well, I've been in tech for twenty-five years now. Obviously, you're always touching databases as you're building software. I ended up being VP of engineering at Snapchat and had a whole data team. I started there as roughly engineer 10, so I built the whole data pipeline and all the metrics, all the way to IPO, where they're audited and all that good stuff. So I have a lot of exposure to building data pipelines from scratch and to OLTP databases for building applications. Dolt is a lot more of the latter: you build an application on top of it. It's an OLTP database. It's not a warehouse or a pipeline, although people do use it sometimes in that context.
And so digging now into Dolt, I'm wondering if you can just describe a bit more about what it is, some of the problems that you're trying to solve with it, and how it got started.
Sure. The tagline that we like to use, because it conveys the most information the most quickly, is: imagine if Git and MySQL had a baby. That's Dolt. So Dolt gives you all the power of MySQL. In fact, it's like a drop-in replacement. We support all the features and syntax of MySQL. You connect with a MySQL client. There is no MySQL code in it, though. It is completely built from the storage engine up to provide Git-style version control on top of that SQL database. So you can create a branch. You can push. You can pull. You can clone. You can make a commit. You can diff. And the target is SQL tables instead of files. You may have heard of some other products that do this. They mostly do that with only schema. We do that with data and schema.
The way we do that is we have a custom storage engine that implements a new data structure we call a Prolly Tree, which allows for this at scale. We have Dolt databases in the terabyte to multi-terabyte range working on this model. We've been doing it for almost a decade, and we just recently got it as fast as MySQL on Sysbench. So, you know, a lot of progress. I'd say it's a production-grade OLTP database that gives you all the features you know and love from Git.
For the Git stuff in SQL, the read operations are mostly exposed as system tables and functions, and the write operations are exposed as procedures. They're all named exactly like they are in Git. So where you run git add on the command line, you run call dolt_add.
When you wanna make a commit, you call dolt_commit, and all the command line arguments are about the same. So the feedback we get from users is mostly, if you know Git and you know MySQL, you kinda know Dolt. It just makes a lot of sense.
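To make the Git-to-SQL mapping above concrete, here is a minimal Python sketch that only builds the SQL text for a "git add" plus "git commit" equivalent and a history query. The helper names (quote, stage_and_commit_sql, recent_history_sql) are hypothetical. The procedure and table names (DOLT_ADD, DOLT_COMMIT, dolt_log) and the '-A' and '-m' arguments reflect my understanding of Dolt's SQL interface and should be checked against the Dolt documentation before use.

```python
from typing import List, Optional, Sequence


def quote(value: str) -> str:
    """Render a Python string as a single-quoted SQL string literal."""
    return "'" + value.replace("'", "''") + "'"


def stage_and_commit_sql(message: str,
                         tables: Optional[Sequence[str]] = None) -> List[str]:
    """Build the SQL equivalent of `git add <tables>` then `git commit -m <message>`.

    Assumption: write operations are stored procedures named after the Git
    commands, taking the same arguments as the command line would.
    """
    if tables:
        add_args = ", ".join(quote(t) for t in tables)
    else:
        # Assumption: '-A' stages every changed table, like `git add -A`.
        add_args = quote("-A")
    return [
        f"CALL DOLT_ADD({add_args});",
        f"CALL DOLT_COMMIT('-m', {quote(message)});",
    ]


def recent_history_sql(limit: int = 5) -> str:
    """Build a read query against the commit log, exposed as a system table."""
    return f"SELECT commit_hash, committer, message FROM dolt_log LIMIT {int(limit)};"


if __name__ == "__main__":
    for statement in stage_and_commit_sql("Add March orders", ["orders"]):
        print(statement)
    print(recent_history_sql())
```

Because the server speaks the MySQL wire protocol, these strings could be sent through any MySQL client library; the sketch stops at building them so that it runs without a database.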
And as far as the use cases that are unlocked by having that version control as a core primitive in the database, obviously, there are some very easy ones that you can think about, but I'm just wondering if you can talk to some of the core problems that you're trying to solve for by having that versioning as a fundamental capability of the engine.
Sure. We've been doing this for a long time, and our original use case was data sharing. We wanted an open data platform like GitHub, where there are databases that you clone and contribute to and push around. For the first three or four years of the company, we were really, really focused on that use case, kind of at the expense of the database use case. Dolt wasn't a very good database. It was slow. It didn't have transactions.
And people just kept showing up and telling us, "Hey, if this was a database I could build something on top of, if I could move my application to this, I could give version control to my customers." So we built a second version of the storage engine, which was a lot more OLTP-focused. We still have the data sharing use case, and people use that. If you go on dolthub.com, you can see a bunch of stock market data. That's our most popular data that people share, and we still love that use case. It's the reason we built the company.
But the thing that most people use it for in production is to power an application that gives version control to their application's customers. So if you wanna add branch and merge to an application, or diffs, or any of these features, you can move the back end of your application to Dolt, and you can then use Django or any of these frameworks to build an application with version control.
The other place that people use it is as a feature store for machine learning training data. When you make a model, you cut a Git-style tag, and then you can always get reproducibility. If tomorrow's model works worse than today's, you can do a diff and see what data changed. So that's a main use case.
The other use case is if you end up having, like, gigs of configuration in a file-based version control system, say as YAML or something like that. You often can't find anything, but you wanna keep the version control. So people move those kinds of systems out of Git or Perforce into Dolt and then get querying on top of this version-controlled data. Basically, files become data at a certain point. There's a spectrum. Right?
And that's mostly popular in the video game space, with these AAA titles. I had a CTO at one of these game companies literally open up a YAML file, and he's like, "I've got one of these for every object in my game. There's, like, 2,000,000,000 objects." So those are kind of the three traditional use cases, and I'll just repeat them for posterity: adding version control to your own application, as a machine learning feature store, or to version control a large amount of configuration that you would traditionally store in files.
What's got us really excited right now, and this touches on the topic you hinted at when you were introducing us, is that we think agentic writes need version control. You would never let a coding agent write your code if it wasn't sitting in Git, where you could just roll it back if it did something stupid. Right? And the persistence layer for kind of every other application on the Internet is some sort of database. Now, we tend to focus on the SQL side, but Dolt's storage model can be ported to other interfaces if it gets really popular. So we're really excited about that use case: basically allowing agents to work through an MCP server or even your API and make writes on a branch. You have another application or agent review those and see the changes. So going forward, that's the use case that really excites us, as we have adopted coding agents pretty aggressively in our company and see the power of them.
Unpacking that concept of versioning and version control, particularly in the context of a database, there are a number of projects in the broader ecosystem, both in recent memory and from several years ago, that have worked to add some form of that versioning or branching and merging capability.
So things that come to mind are LakeFS, which is largely around files in S3 and being able to act as a means of branching and merging some of that data. There's the Nessie project, which is somewhat similar in terms of its overall scope but focused on the Iceberg table format. There's a database, or I guess you could call it an abstraction layer over a database, that I think I first heard about maybe five or ten years ago, called Datomic.
And I'm just wondering if you can unpack some of the nuance around versioning and version control and those semantics and how what you're focused on differs from some of those other projects.
Sure. I wouldn't even call them competitors, because we all do slightly different things. But the people that talk about version control and data today, the ones we tend to hear about the most, are LakeFS, which you mentioned, and Neon, which just got bought by Databricks and offers branching capability. They're the default database for, like, Replit and Lovable. And then PlanetScale, which is a MySQL implementation that's similar to Neon, a little different. They've pioneered schema version control on top of this eventually consistent large database. So those are kind of the three we look at. I'm less familiar with Datomic and Nessie. I'd have to do a little bit more research.
So first off, LakeFS: that is OLAP. They're focused on the data lake, as you said, S3. From a versioning perspective, I'd say LakeFS is the most similar to Dolt in the sense that they are built on the Git model. What they do is each file, a Parquet file or whatever you're sticking in your lake, gets a content address. Then they put a Git-style commit graph on the root of all those files, similar to how Git works. Now, the unit of versioning in LakeFS is the file, and a file is equivalent to a table. Right? These things can be 100 gigs of just log data. So it may be useful to you to know that this 100 gig Parquet log file changed, but finding out what changed in it is still a very complex operation. Right?
Especially because it's m log n. You have to compare both files, and it'll take you a long time. Dolt, on the other hand, breaks that file down into four kilobyte chunks in a data structure called a Prolly Tree, which is basically a content-addressed B-tree. Each 4K chunk of the database gets a content address. Those, through some clever algorithms, are laid out in a B-tree all the way up to the root of the tree, and the root of the tree gets a content address. Then the schema of the table gets a content address. That means the table has a content address. All the tables in the database are then hashed together, and they get a content address. That means you can diff tables, even terabyte databases, and find the single four kilobyte chunk that changed in time on the order of the size of the differences, which, if it's a single chunk, is basically instantly.
So Dolt is built from the storage engine up to be able to provide diff capabilities at the table data layer, basically at the row level of the database. And it gives you OLTP database read and write performance on top of that. So it gives you true Git-style diff, merge, and branch on top of database tables at database speed. But in order to do that, we had to build a brand new database from scratch. It's taken a decade. It's got its own SQL analyzer. It doesn't use Postgres or MySQL. We had to build that from scratch, and that's taken a long time. It's been a very difficult, arduous journey. You don't build a database from scratch very easily.
Now if you look at Neon and PlanetScale, what they've done is Neon's taken Postgres and PlanetScale's taken MySQL, and they've basically separated storage from compute and distributed it. So first and foremost, it's a horizontally scalable database. Right? Dolt is not that. Dolt is a single-node database, just like regular MySQL or Postgres. You put it on a single hard drive, and the way you scale it is you add replicas, just like 1995. Neon and PlanetScale are horizontally scalable databases. They scale to zero. They scale to infinity. That's why it's called PlanetScale. In Neon's case, they put the data itself in a copy-on-write file system, which gives you what they call a branch, but it's really a fork. You can create a copy really quickly because it's a copy-on-write file system, and then you can start writing on that copy and then on another copy. Then they have some merge utilities, usually schema only, to kind of merge the branches. But usually it's more like a hot swap. You have main, you're building a feature on another Neon branch, you get the feature working, and then you hot swap over to the new branch, and that's your new version. So you can't diff between the branches. Or you can maybe diff just the schema, because that's pretty simple, but you certainly can't diff the data, at least not quickly.
So that's the main difference between the products in this space. Neon and PlanetScale, those are actually Postgres and MySQL. Those are the technologies that run those things. They've made some modifications. They've added some version control features on top of that, but I wouldn't call them truly version controlled. LakeFS and Dolt are truly version-controlled databases in the Git model. LakeFS is OLAP. Its unit of versioning is a file, which can be very large. Think of it more like a table. In Dolt, the unit of versioning is a row, and you can diff, merge, and do all the good operations at that granularity. So I hope that made sense as I was saying it. That lays out the competition as I understand it.
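The content-addressing idea described above can be illustrated with a small Python sketch. This is not Dolt's Prolly Tree: it is a plain hash tree over fixed-size groups of rows, with hypothetical names (Node, build_tree, diff), and it only handles in-place updates to tables of the same shape. What it does show is the property that matters for diffing: when two subtrees have the same content address, the diff never descends into them, so the work tracks the size of the change rather than the size of the table.

```python
import hashlib
from dataclasses import dataclass
from typing import List, Sequence, Tuple

Row = Tuple[int, str]  # (primary key, value); a stand-in for a real table row


@dataclass(frozen=True)
class Node:
    address: str                  # content address: hash of everything below
    children: Tuple["Node", ...]  # empty for leaf chunks
    rows: Tuple[Row, ...]         # populated only for leaf chunks


def _hash(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def build_tree(rows: Sequence[Row], chunk_size: int = 4, fanout: int = 4) -> Node:
    """Group sorted rows into leaf chunks, then hash upward to a single root."""
    ordered = sorted(rows)
    level: List[Node] = []
    for i in range(0, len(ordered), chunk_size):
        chunk = tuple(ordered[i:i + chunk_size])
        level.append(Node(_hash(repr(chunk).encode()), (), chunk))
    if not level:
        level = [Node(_hash(b""), (), ())]
    while len(level) > 1:
        parents: List[Node] = []
        for i in range(0, len(level), fanout):
            kids = tuple(level[i:i + fanout])
            address = _hash("".join(k.address for k in kids).encode())
            parents.append(Node(address, kids, ()))
        level = parents
    return level[0]


def diff(old: Node, new: Node, visited: List[str]) -> List[Tuple[Node, Node]]:
    """Return (old, new) pairs of differing chunks, skipping equal subtrees.

    `visited` records every node the walk touches, to show how little work
    a small change requires. Assumes both trees have the same shape.
    """
    visited.append(old.address)
    if old.address == new.address:
        return []  # identical content below this point: nothing to compare
    if not old.children or not new.children:
        return [(old, new)]
    changed: List[Tuple[Node, Node]] = []
    for old_child, new_child in zip(old.children, new.children):
        changed.extend(diff(old_child, new_child, visited))
    return changed


if __name__ == "__main__":
    before = [(i, f"v{i}") for i in range(64)]
    after = list(before)
    after[10] = (10, "changed")
    touched: List[str] = []
    result = diff(build_tree(before), build_tree(after), touched)
    print(len(result), "changed chunk;", len(touched), "of 21 nodes visited")
```

With fixed-size chunks, an inserted row would shift every later chunk boundary; handling inserts and deletes without that cascade is presumably part of what the "clever algorithms" mentioned above are for.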
Yeah. No, that's definitely a very good enumeration of the different variants in terms of versioning and branching and some of the use cases that are enabled by those different approaches. And at least speaking as someone who has been using Git for a long time and data for a long time, I could definitely map those concepts fairly cleanly. I don't know that I could necessarily say that for everyone who comes across this conversation, but we'll take that as read.
Sure.
One of the other interesting aspects of using those Git semantics and that level of granularity and detail in terms of the versioning is that I suspect it also comes along with all of the footguns that we know and love from Git, where you can potentially do a git merge, but you have conflicts, and that leaves you in an indeterminate state where you have to do some form of manual resolution, or git rebases that sometimes work great and sometimes turn into three days' worth of nightmare.
And I'm just wondering if you could talk to the user experience and semantics around how to circumvent or avoid some of those problematic situations that can occur from having that level of power.
Oh, yes. As I said, Dolt is as user-friendly as Git, which is either a compliment or an insult. It follows the Git model exactly. Merges are done at the cell level. Across versions, rows are identified by the primary key. If you and I each make a change on a branch to the same cell, that will throw a conflict. And there's a set of tables and procedures that you can use to resolve those. Most of our users, in the case of a conflict, usually just choose what's on one of the branches, kind of blindly. We haven't really got into a situation where people want to do really complex conflict resolution. I think the standard mode of thinking is, oh, if there's a conflict, there's a problem, and that merge just fails. Think of traditional MVCC, multiversion concurrency control, transactions. Right? If there is a conflict, one of them just rolls back: the second transaction, the one that got there last. A lot of people port that model to conflicts in this case as well, and that works well.
Obviously, rebase is supported. Rebase is just as hairy as it is with Git, although most times people are using rebase in Dolt, it's to compress history. They'll be like, "Oh, I've got these 100 commits that are granular. I just want them to be one single commit," and that tends to be a pretty simple rebase.
I'd say we definitely get questions around the Git model, but the people that tend to adopt Dolt know the Git model and like the Git model, and that's the reason they're choosing Dolt in the first place. So we haven't gotten to the point where, for instance, we need to go into someone's reflog, which we do support. Git power users will know what a reflog is. Most other people won't. We have one of those as well: you go in and find a commit that's no longer materialized in the graph and restore to that. We don't get that many of those types of problems. Honestly, we're much more likely to get "this 10-table join is slow, and it is not slow in MySQL." We're much more likely to get database problems than Git problems.
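The cell-level merge rule described above can be sketched as a three-way merge keyed by primary key. This is an illustration under stated assumptions, not Dolt's merge implementation: tables are plain dictionaries, rows are only updated or added (never deleted), and the names merge_cells and resolve_with are hypothetical. A cell conflicts only when both sides changed it to different values relative to the common ancestor.

```python
from typing import Any, Dict, List, Tuple

Row = Dict[str, Any]
Table = Dict[int, Row]  # primary key -> row


def merge_cells(base: Table, ours: Table,
                theirs: Table) -> Tuple[Table, List[Tuple[int, str]]]:
    """Three-way merge at cell granularity.

    Returns the merged table plus a list of (primary_key, column) conflicts.
    Conflicting cells keep "our" value as a placeholder until resolved.
    """
    merged: Table = {}
    conflicts: List[Tuple[int, str]] = []
    for pk in sorted(set(base) | set(ours) | set(theirs)):
        b, o, t = base.get(pk, {}), ours.get(pk, {}), theirs.get(pk, {})
        row: Row = {}
        for col in sorted(set(b) | set(o) | set(t)):
            bv, ov, tv = b.get(col), o.get(col), t.get(col)
            if ov == tv:        # both sides agree (or neither changed it)
                row[col] = ov
            elif ov == bv:      # only their side changed the cell
                row[col] = tv
            elif tv == bv:      # only our side changed the cell
                row[col] = ov
            else:               # both changed the same cell differently
                conflicts.append((pk, col))
                row[col] = ov
        merged[pk] = row
    return merged, conflicts


def resolve_with(merged: Table, conflicts: List[Tuple[int, str]],
                 side: Table) -> Table:
    """Resolve every conflict by taking one branch's value, 'kind of blindly'."""
    resolved = {pk: dict(row) for pk, row in merged.items()}
    for pk, col in conflicts:
        resolved[pk][col] = side[pk][col]
    return resolved


if __name__ == "__main__":
    base = {1: {"name": "a", "qty": 1}}
    ours = {1: {"name": "a", "qty": 5}}
    theirs = {1: {"name": "A", "qty": 7}}
    merged, conflicts = merge_cells(base, ours, theirs)
    print(conflicts)
    print(resolve_with(merged, conflicts, theirs))
```

The "merge just fails" policy mentioned above corresponds to discarding the merge whenever the conflict list is non-empty instead of calling resolve_with.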
Digging now a bit more into the positioning that you have around Dolt being the database for AI, you mentioned a little bit about the need for mitigating some of the trust factors that come in by having AI be in the path of manipulating data. But I'm just wondering if you can talk to some of the ways that data versioning and these Git-based semantics around versioning in the database engine position Dolt as being something that is functionally beneficial for agentic systems?
So we used to say "the database for agents," but then people kind of didn't know what agents were, so we went back to AI. There's a few use cases in the AI space. Obviously, as a feature store, it's very good for storing data to build models. That's, let's call it, AI circa 2020. I think the easiest way to describe our vision in the AI space is that we think there's going to be Cursor for everything. You're going to have a chat window on the right of every application, or maybe the left, but Cursor does right. You're gonna be able to interact with that chat window in plain English text and have that thing operate on the application view that you traditionally would see on the left. So think of a Google Sheet or your Stripe dashboard or GitHub itself or whatever your normal web application is. Instead of clicking around buttons in that interface, you'll be talking to an agent and have them effectively click the buttons for you.
And then the reason that works in Cursor, the reason I can ask it to write code, is because I can then see in the left window exactly what it changed and say, "Oh, I don't like that," or "Oh, that's wrong." If you don't have that see-what-changed functionality, that chat window with an LLM becomes way too scary to use, especially for important tasks. But if you do have this diff, this kind of PR review workflow with the agent that's making these changes, I think it'll be an on-ramp to many more Cursors for everything.
So the way we see it, what's the backing store of every one of these applications? Right? It's some sort of database, usually SQL, but not always. If you move your data out of whatever database you have into Dolt, you can have the agent work through your API or MCP or whatever interface you want to make changes in your application, make sure the agents do it on a branch, and then build a diff UI via the diff functionality in Dolt, which is very fast and gives you SQL-database-like performance for that diff. So you can absolutely build an application on top of it. We provide the diff and merge primitives that this Cursor-for-everything use case requires for everything that isn't code. So that's the easiest way for me to describe our vision for why Dolt's important in this new world.
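The agent-on-a-branch review loop described above might look like the following sequence of SQL statements, built here as plain strings in Python so the sketch runs without a server. The function name agent_change_plan and the example branch and table names are hypothetical. The procedure and function names (DOLT_CHECKOUT, DOLT_COMMIT, DOLT_DIFF, DOLT_MERGE, DOLT_BRANCH) and their flags reflect my understanding of Dolt's SQL interface and are assumptions to verify against the Dolt documentation.

```python
from typing import Dict, List


def quote(value: str) -> str:
    """Render a Python string as a single-quoted SQL string literal."""
    return "'" + value.replace("'", "''") + "'"


def agent_change_plan(branch: str, table: str, write_sql: str,
                      message: str, base: str = "main") -> Dict[str, List[str]]:
    """Group the statements for each stage of an agent-write review workflow."""
    return {
        # 1. The agent works on its own branch, never directly on `base`.
        "agent": [
            f"CALL DOLT_CHECKOUT('-b', {quote(branch)});",
            write_sql,
            # Assumption: '-a' stages already-tracked, modified tables.
            f"CALL DOLT_COMMIT('-a', '-m', {quote(message)});",
        ],
        # 2. A human or a second agent reviews the row-level diff.
        "review": [
            f"SELECT * FROM DOLT_DIFF({quote(base)}, {quote(branch)}, {quote(table)});",
        ],
        # 3a. Accept: merge the branch into the base branch.
        "accept": [
            f"CALL DOLT_CHECKOUT({quote(base)});",
            f"CALL DOLT_MERGE({quote(branch)});",
        ],
        # 3b. Reject: throw the branch away; `base` was never touched.
        "reject": [
            f"CALL DOLT_CHECKOUT({quote(base)});",
            f"CALL DOLT_BRANCH('-d', '-f', {quote(branch)});",
        ],
    }


if __name__ == "__main__":
    plan = agent_change_plan(
        branch="agent/fix-prices",
        table="products",
        write_sql="UPDATE products SET price = price * 1.1 WHERE category = 'books';",
        message="Agent: raise book prices by 10%",
    )
    for stage, statements in plan.items():
        print(f"-- {stage}")
        for statement in statements:
            print(statement)
```

Keeping "accept" and "reject" as separate statement lists mirrors the pull request model: the review step decides which one runs.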
Another interesting aspect of that situation is when you think about agentic software engineering and just software engineering in general, one of the biggest challenges is that the software works great on my machine until it actually encounters the real world and all of the messy data that happens in the real world day to day operations of any application
that's been running for any period of time. And one of the perennial challenges is, how do I actually get a copy of the production data to verify my bug fixes against? And so I'm wondering how you're seeing some of that functionality, being able to create a branch of production data and verify a change set against it before you even bother committing it to the software repository, or just the branching and diffing functionality of Dolt, as part of that core engineering workflow as well.
Yeah. In Git, there are two layers of write isolation. There's branches, which kind of live on a single machine, server, or repository in the Git sense. You can think of those as living on a single database server. But you also have this extra layer of isolation, in that Git is famously decentralized. You can clone it. You can make a copy and have it completely isolated on your own machine, on your laptop, and make changes. Then when you go to sync with the main copy, GitHub usually, it can easily produce the changes, and you can merge them in. Dolt also has this property. Right? So Dolt is also a decentralized database.
You can have it on your production server, in fact, and then you can have that maybe pushing to GitHub, or you can even use it as a remote in the Git sense. There's a bunch of architectures that you can set up. Like Git, Dolt supports all of these models. But I think the thing you're getting at, which is really interesting, is if you are using Dolt as your database, or even as a replica of your MySQL database, which is totally reasonable, your developers or agents can clone a copy of production, mess with it on their laptop, break it, drop databases, hit it with a ton of load, do whatever they want to it, and make writes, which is kind of an important piece of this. Change the schema, add an index, do some tests, all completely isolated from production. Now, you can do this on a branch too, but you are still sharing the same CPU and memory, so you can't be as reckless. So you can do that all on a clone. Then once you're happy with it, you can push the changes up on a branch and get it reviewed in a pull request. If you like it, you merge it into your repository, your main production database, and you're good to go.
That model definitely is something that works with Dolt. I think it's more important in an agentic world. I think you're probably going to want agents working on clones of your database, not even branches, because they're going to do things, maybe mean things, like drop tables or run really complex queries and hog all the resources on your production database. So having a clone, especially as you scale out these things, becomes really interesting.
It's also just safer, right? It's not gonna drop the database. Or if it does, it doesn't really matter. You just deprovision their container and say, oops. So we do kind of think, in the meta, as we adopt AI for more use cases, what's gonna happen is things that were traditionally human-scale writes are going to become machine scale, and they're gonna be long-running and untrusted writes. So you're going to need a system, a bunch of systems, and we think Dolt is one of them, to manage that additional complexity that's introduced in that world. And clones, like you mentioned, are a big piece of that, I think.
So that actually opens up a whole other set of possibilities that I hadn't previously considered with being able to generate those replicas. You can have your production instance that is the central repository of information, and you can actually create an isolated full copy, including the ability to do additional branches and merges that don't ever get reflected back to that central repository until you care to do so. And that opens up the possibility of doing things like PR deployments, where I've got a set of code changes and I wanna deploy them to my own isolated environment for somebody else to be able to take a look at. Having a replica of the data is often one of the biggest challenges there, where now instead I just say, here's another instance of my Dolt database.
I'm going to run a git pull from RC or the production environment, and this is now my own isolated copy. I can do whatever I want. As soon as I'm done, I blow it away. Similarly,
for agentic use cases where I've got my GitHub Copilot, I wanna be able to have it automatically set up a code space. It runs a git pull of the data from whichever location, whether that's production or pre-prod, and it can then iterate on all of that in a much more informed context versus just whatever automated fixtures I might generate for my local development environment.
I'm just wondering, as people come to grips with the realities of having those Git semantics for their data, what are some of the ways that it changes how they think about the role of that relational engine across the entire life cycle of development, deployment, product features, et cetera?
We're still really early on this. Right? Even the stuff we're talking about, I'd say most of our users only get there after using Dolt for a while. Most of our users are like, "Oh, great. It's a database with branches," and they're using branches to isolate writes. They know it has all these Git features, but it's kind of hard to map onto reality, because no one's ever done it before. And then it's, "Oh, you know what would be great? It'd be great if my production Dolt was automatically pushing changes up to DoltHub so I could get a UI on top of that." And then, "Oh, now that I have that, I can have my developers clone from there whenever they need to fix an issue and have a complete copy of the production database." So people tend to eat the apple slowly, not all at once. And we're still really early in a lot of these journeys with customers.
The one thing that you didn't mention that's really interesting is audit logs. You also get an audit log of all the changes across all of these clones that can be joined together. That introduces some interesting use cases across isolated environments. The thing we haven't talked about yet, which I think is interesting to a lot of customers, is that the database enforces a fully granular, queryable audit log of every cell in your database, and that's also in a decentralized way. So that's another thing that gets kind of interesting for people. It's like, "Internal audit is asking me who changed this and why," and I can just tell them. Right? I don't have to build a separate system. I don't need Kafka listening to changes and sticking them in some finicky system. I know for sure no one's touched this. It's cryptographically provable. So that's another interesting use case.
But yeah, I think we're still at the point where most people don't even know Dolt exists. And then when they see that Dolt exists, they're like, "It must be slow. It can't be possible. It must not work." And I'm here to assure everyone who's listening to this, it does work. It runs in production. It scales to a terabyte. You get MySQL-level query performance.
So getting them to then be like, "Oh, imagine if every developer in my company had a clone of the production database," we're still maybe a few years away from getting people to realize that that's even possible.
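The "cryptographically provable" audit log mentioned above follows from content addressing: if each commit's address covers its parent's address, rewriting any past entry changes every address after it. A minimal Python sketch of that general idea follows. It is a generic hash chain with hypothetical names (Commit, append, verify), not Dolt's commit format.

```python
import hashlib
import json
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Commit:
    parent: Optional[str]  # address of the previous commit, None for the first
    author: str
    message: str
    changes: str           # stand-in for the cell-level changes in this commit

    @property
    def address(self) -> str:
        """Content address covering the parent address and this commit's data."""
        payload = json.dumps([self.parent, self.author, self.message, self.changes])
        return hashlib.sha256(payload.encode()).hexdigest()


def append(log: List[Commit], author: str, message: str, changes: str) -> List[Commit]:
    """Return a new log with one more commit chained onto the current head."""
    parent = log[-1].address if log else None
    return log + [Commit(parent, author, message, changes)]


def verify(log: List[Commit], trusted_head: Optional[str]) -> bool:
    """Check that the log is an unbroken chain ending at a head address we trust."""
    expected_parent: Optional[str] = None
    for commit in log:
        if commit.parent != expected_parent:
            return False
        expected_parent = commit.address
    return expected_parent == trusted_head


if __name__ == "__main__":
    log: List[Commit] = []
    log = append(log, "alice", "create prices", "prices[1].amount: NULL -> 10")
    log = append(log, "bob", "raise price", "prices[1].amount: 10 -> 12")
    print(verify(log, log[-1].address))
```

An auditor who has recorded only the head address can later confirm that nobody has edited who changed what, because any edit to an earlier entry breaks the chain.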
And you briefly mentioned the presence of Doltgres as well, where we've been talking about Dolt being a drop-in for your MySQL engine. I know that over the past few years, you've also been iterating on making a Postgres-compatible version. And I'm just wondering if you can talk to some of the engineering challenges of keeping that core versioning capability as the foundational primitive, but building the appropriate levels of feature sets and functionality that everyone expects from Postgres versus MySQL, and some of the ways that you're able to share commonality and how much of it has to be just a complete rewrite.
Sure. So the first thing I always like to remind people on this topic is Dolt is its own thing. Dolt contains no MySQL code. Doltgres contains no Postgres code. So if you are adopting Dolt or Doltgres, you are adopting a new thing. Now, what is different between Dolt and Doltgres is the SQL dialect. Right? MySQL, I think, is newer. It's from the nineties. It has a trimmed-down SQL dialect. Postgres is from the eighties. It implements an older spec, which has a lot more features. For instance, there's row types. You can return a row. So I'd say the surface area of SQL syntax in Postgres is much bigger. But that's what Dolt and Doltgres implement: the syntax. The engine itself is the same. The way that happens is we take a SQL query as it comes into the engine and parse it into an AST, an abstract syntax tree. Then we do a transformation of that AST from a Postgres AST to a MySQL AST and send that AST down through our engine. So that ends up being the same. And then we kind of do the same with the results as they come out the other end. And so
so it is, in fact, the same engine at its core. The reason we built Dolphin in the first place is so when we started Dolphin, it was 2018. We've known it a long time. We took over for the from these guys called NOMS who had implemented the core data structure of the Polytree. They had been working on it since 2015, in fact. So the the code in Git is in fact, for DOLT, 10 old. So so the re the so MySQL and Postgres, it was unclear in 2018
kind of which format was winning. MySQL was kind of more popular at the time, but over the last seven years, that's kind of changed. I think all the momentum in the open source community is around Postgres, the format, the ecosystem, super base, Neon. These companies are kind of the hottest thing in databases. They all implement Postgres, CockroachDB. So all the momentum is around Postgres. In fact, MySQL is
owned by Oracle and they laid off vast majority of their team. There's talk with them, donating it to a foundation like people that own MySQL don't even like MySQL. And so we're so we saw the writing on the wall a couple years ago, especially because most of our customers, the first thing they say is I wish it was Postgres. And so we set out to build a couple years ago. As I said, it's the same engine. It just there's more features and syntax to to implement. DualCrest is beta,
whereas DOLT is 1.0. In fact, it's gonna go 2.0 this quarter, probably. I'd say if you adopt DOLT and you're running MySQL, you're very, very unlikely to hit a bug, so much so that we make a twenty-four-hour pledge. Like, if you find a correctness bug between MySQL and DOLT, we'll fix it in twenty-four hours or less. We hit that about 80 to 90% of the time. With Doltgres, we also make the same pledge, but it's sometimes harder to hit it. So if you do adopt Doltgres, you'll probably have to work with us on, usually, a handful of syntax incompatibilities or features that we don't support yet in Doltgres to migrate your application. And then once that's done, you'll be able to use Doltgres. Now, the thing we're focusing on with Doltgres: the way that we think about DOLT is, DOLT is Git for data. It has a full Git-style command line. You can push to DOLT Hub. You have the data sharing use case. For Doltgres, the command line isn't implemented. You can just run a server. It's version-controlled Postgres. So we got rid of a lot of the complexity. You can still push and pull, just not to DOLT Hub, but to any sort of cloud storage or something. So we're much more focused on Doltgres being an actual version-controlled database, whereas DOLT kind of has the Git command line and a focus on CSV imports and a bunch of stuff that we were focused on for the data sharing use case that we just kinda aren't focused on for Doltgres.
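To make the AST translation described in this answer concrete, here is a minimal Python sketch. It is not Doltgres's actual code: the node classes, the `TYPE_MAP` table, and the single rewrite rule (a Postgres-style `expr::type` cast becoming a `CAST(expr AS type)` node) are all invented for illustration. It only shows the general shape of the idea: parse into one dialect's tree, rewrite it into the other dialect's tree, and hand that to a shared engine.

```python
from dataclasses import dataclass
from typing import List, Union


# Hypothetical "Postgres-flavored" AST nodes (illustrative only).
@dataclass
class PgColumn:
    name: str


@dataclass
class PgTypeCast:  # models the Postgres `expr::type` syntax
    expr: "PgExpr"
    type_name: str


PgExpr = Union[PgColumn, PgTypeCast]


@dataclass
class PgSelect:
    exprs: List[PgExpr]
    table: str


# Hypothetical "MySQL-flavored" AST nodes the shared engine would consume.
@dataclass
class MyColumn:
    name: str


@dataclass
class MyCast:  # models `CAST(expr AS target)`
    expr: "MyExpr"
    target: str


MyExpr = Union[MyColumn, MyCast]


@dataclass
class MySelect:
    exprs: List[MyExpr]
    table: str


# Assumed mapping from Postgres type names to MySQL cast targets.
TYPE_MAP = {"text": "CHAR", "int4": "SIGNED", "int8": "SIGNED"}


def translate_expr(expr: PgExpr) -> MyExpr:
    """Rewrite one Postgres-style expression node into its MySQL-style form."""
    if isinstance(expr, PgColumn):
        return MyColumn(expr.name)
    if isinstance(expr, PgTypeCast):
        if expr.type_name not in TYPE_MAP:
            raise NotImplementedError(f"unsupported type: {expr.type_name}")
        return MyCast(translate_expr(expr.expr), TYPE_MAP[expr.type_name])
    raise NotImplementedError(type(expr).__name__)


def translate(select: PgSelect) -> MySelect:
    """Rewrite a whole SELECT tree, node by node."""
    return MySelect([translate_expr(e) for e in select.exprs], select.table)


def render(node) -> str:
    """Print the MySQL-style tree back out as SQL text, for inspection."""
    if isinstance(node, MyColumn):
        return node.name
    if isinstance(node, MyCast):
        return f"CAST({render(node.expr)} AS {node.target})"
    if isinstance(node, MySelect):
        cols = ", ".join(render(e) for e in node.exprs)
        return f"SELECT {cols} FROM {node.table}"
    raise TypeError(type(node).__name__)


# Tree a parser might produce for: SELECT id, age::text FROM users
pg_tree = PgSelect([PgColumn("id"), PgTypeCast(PgColumn("age"), "text")], "users")
mysql_sql = render(translate(pg_tree))
```

The unsupported-type branch mirrors the kind of syntax incompatibility mentioned above: anything the translator does not yet know how to rewrite surfaces as an error rather than silently producing a wrong query.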
And as you mentioned, all of this is its own set of technology, its own ecosystem, which is beneficial in that you're able to create these different facades for MySQL or Postgres. I'm wondering if there are any other ecosystems or use cases that you're thinking about bringing this core version control capability to, whether that's something like these various lakehouse formats, OLAP engines, or anything like that.
We fundamentally believe that there's going to be a version-controlled one of every format: document, graph, OLAP, which is generally column-major formats instead of row-major (OLTP tends to be row-major), spreadsheets. We just think that the model is going to be required as we get more semi-trusted, semi-people writing to these things. We're going to need the features that Git-style version control provides. As far as what we're interested in doing, I'd say, depending on the success of DOLT and how much value we can generate economically from it, I'd love to be the company that built all of those. I'd also be happy if someone else used the Prolly Tree. It's free and open source. It's Apache 2. You can just take it and build any one of those on top of it. That helps us. I'd say the format that we're most interested in currently is document, specifically the MongoDB format, and just basically storing JSON. And the reason for that is DOLT stores JSON documents. So SQL supports JSON column types. It has since about 2013, 2015. And DOLT stores those JSON objects in Prolly Trees themselves. And so DOLT is already the best SQL database for storing JSON. You can actually query it fast. Because it's stored in this tree-like structure, we actually, like, parse it and put it in a version-controlled structure, and then you can find things quickly. It shares storage between versions. So if you have, like, a large JSON document that's common across, like, a million columns, portions of that document will only be stored once and content-addressed. So we already believe we have the best database for JSON. And so, MongoDB being a JSON-only database, we were interested in just kind of porting the interface and seeing how fast DOLT is in comparison. And we think it might potentially be not only faster, but then you also have the version control functionality of DOLT on top of Mongo. Now, Mongo has been a little litigious lately, so maybe they'll hear this and send me a cease and desist, and I should keep their name out of my mouth or something. But that doesn't really scare us.
And another aspect of the ecosystem question, particularly when we're talking Postgres, is that a large portion of the, well, some portion of the popularity of Postgres is due to the set of plugins that are available for it and extensions to it. And I'm wondering how much of that you are able to capitalize on by having that facade and how much of it goes too deep into the core of the Postgres engine for you to be able to support without substantial reengineering.
So we're very familiar with Postgres extensions and all of the different ways that you can extend Postgres, including the foreign data wrapper interface, because we initially tried to do Doltgres as one of those things. But the changes that we require and the hooks we require into the engine just aren't there. Like, the fundamental assumption in any database is that there's, say, one schema at a time. And in DOLT, that just can't be the case, because different branches can have different schemas. So we had to fork. Now, we do support extensions. It took about a quarter from one of my more talented engineers to figure out how to support extensions. Native Postgres is obviously where you can just put it in your config file. We need to do work for each extension to expose the hooks that those extensions use in C to make them work. So if you do have an extension and it doesn't work, please come create an issue, and we'll figure out how to support it. We do have the fundamental primitives to support extensions, but we might need to do some work to expose them. Now, there's some interesting version control problems with extensions. Since they're C code and they write things directly to the file store, the things that they write aren't content-addressed in the way that the rest of the stuff in DOLT is. And so, for instance, if you upgrade an extension version, that kind of invalidates the version of the rest of the data that's been stored by that extension. So I'd say there's interesting version control trade-offs. Everything we do in DOLT, for instance, like vectors, right? We implemented vectors in DOLT. Those are version controlled. The vector index is version controlled. You check out a commit, and the index and the vectors that are stored are that version's. Right? So we can't make those same guarantees with the data that's put into Doltgres with extensions. So, again, there's some trade-offs from a version control perspective that we would wanna make people aware of if they were gonna really go deep with Doltgres and extensions.
So you just touched on something that made me very interested in a completely different direction, where you're talking about being able to have that versioning of vectors, where one of the biggest challenges around things like RAG pipelines and the evolution of agentic systems is the need to figure out what is the appropriate chunking strategy, what is the best embedding model, what are some of the ways that I need to be thinking about reranking in the retrieval layer, and having that ability to say, okay, I'm going to try five different embedding models and three different chunking strategies. So now I need to have 15 different versions of these embeddings. Each one of those versions can be its own branch. I can then run an evaluation or A/B testing against all of those. And then whichever one has the highest recall rate and accuracy or precision, that's the one that I keep, and the rest of them get trimmed, where without that native versioning functionality, it ends up being a much bigger operational headache. And I'm wondering how you're seeing people start to capitalize on that type of functionality, whether for vectors specifically or other styles of experimentation and A/B testing that they might be trying to do.
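The five-models-by-three-strategies experiment in the question can be sketched as a branch plan. The Python below only builds the plan as strings and does not connect to a database. `DOLT_CHECKOUT` and `DOLT_COMMIT` are DOLT's SQL stored procedures for branching and committing; the model names, strategy names, branch naming scheme, and the placeholder for the re-embedding step are all hypothetical.

```python
from itertools import product

# Hypothetical experiment axes: 5 embedding models x 3 chunking strategies.
EMBEDDING_MODELS = ["model_a", "model_b", "model_c", "model_d", "model_e"]
CHUNKING_STRATEGIES = ["fixed_512", "sentence", "recursive"]


def branch_name(model: str, chunking: str) -> str:
    """Assumed naming scheme: one branch per (model, chunking) combination."""
    return f"exp-{model}-{chunking}"


def plan_experiments(models, strategies, base="main"):
    """Return {branch: [SQL statements]} for every combination."""
    plans = {}
    for model, chunking in product(models, strategies):
        branch = branch_name(model, chunking)
        plans[branch] = [
            f"CALL DOLT_CHECKOUT('{base}')",          # start from the shared baseline
            f"CALL DOLT_CHECKOUT('-b', '{branch}')",  # create and switch to the branch
            "-- re-chunk and re-embed the documents here (application-specific)",
            f"CALL DOLT_COMMIT('-a', '-m', 'embeddings: {model} / {chunking}')",
        ]
    return plans


def pick_winner(scores):
    """Given {branch: evaluation score}, return the branch to keep."""
    return max(scores, key=scores.get)


plans = plan_experiments(EMBEDDING_MODELS, CHUNKING_STRATEGIES)
```

After an evaluation run produces one score per branch, `pick_winner` names the branch to merge or keep, and the remaining 14 branches can be deleted.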
It's funny you use that example, because that's the exact example we used on launch: testing different embeddings on different branches. I think that, honestly, the even more interesting thing is, usually with these RAG systems, there's new data showing up every day. Right? Like, your documents are getting inserted, and you can get to a point, say, where your users are reporting that the RAG-powered LLM that they were using worked better yesterday than today. Right? And in a traditional database system, that's gonna be a potentially unsolvable problem, to figure out exactly what happened. In DOLT, you can look at the diff and see exactly what documents changed. And we've seen that with one of our customers who runs a machine learning feature store in it. They had a problem where their vision model was all of a sudden over-predicting some classification, and they just rolled it back. And then they ran a query, and they were able to see someone had inserted way too much of one specific label, and it was now over-indexing on that label. So the same features apply to vectors. We haven't got that much traction on the vector use case in DOLT. It's interesting. My impression of RAG pipelines is they're kinda going out of favor as the LLMs get more sophisticated and there's kind of other ways of injecting context. In fact, you can kind of give these things too much context and confuse them. And so I'm not sure RAG is proving to be the be-all and end-all that it was, but I could be wrong. We tried it here at DOLT Hub, and it didn't help for the use case that we were targeting. And so my experience with it was kinda negative. But I'm sure it works for some people.
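The "look at the diff and find the bad insert" workflow from the feature store story can be illustrated with a small in-memory sketch in Python. DOLT exposes diffs through SQL; this version only mimics the idea with two dictionaries keyed by primary key, and the table contents and the `label` field are invented for the example.

```python
from collections import Counter


def diff_rows(before, after):
    """Compare two snapshots keyed by primary key.

    Returns (added, removed, modified), where modified maps each key to a
    (before_row, after_row) pair.
    """
    added = {k: after[k] for k in after.keys() - before.keys()}
    removed = {k: before[k] for k in before.keys() - after.keys()}
    modified = {
        k: (before[k], after[k])
        for k in before.keys() & after.keys()
        if before[k] != after[k]
    }
    return added, removed, modified


def label_counts(rows):
    """Count how often each label appears among a set of rows."""
    return Counter(row["label"] for row in rows.values())


# Hypothetical training-label table at two points in time.
yesterday = {1: {"label": "cat"}, 2: {"label": "dog"}, 3: {"label": "cat"}}
today = dict(yesterday)
today[2] = {"label": "wolf"}         # one row relabeled
del today[3]                         # one row deleted
for pk in range(10, 14):             # a burst of inserts with a single label
    today[pk] = {"label": "truck"}

added, removed, modified = diff_rows(yesterday, today)
skew = label_counts(added)
```

Looking at `skew` is the analogue of the query the customer ran: the inserted rows are dominated by one label, which points at the batch that caused the model to over-predict it. Rolling back corresponds to going back to the `yesterday` snapshot.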
It definitely has been I don't know if falling out of favor is the right term, but it is becoming a tool in the toolbox instead of the tool in the toolbox where a lot of focus now has been more around things like agentic memory patterns and having those tiered memory structures as well as the incorporation of knowledge graphs as more of that semantic map over the contextual documents.
Yeah. I think, obviously, the space is evolving quickly. And the thing I love about vectors is, like, universal similarity search is amazing. Being able to, like, say these three JSON blobs are the closest semantically to this one that you've sent in, that's just super cool, and, like, tech that you would never have thought possible. And then you can do that for images. You can do that for audio, anything. Right? And depending on the model, it'll be human-like. A human will say, yeah, those three are very similar, which is incredibly powerful, and, you know, what a time to be alive. So, you know, I love the vector capability just for that specific use case. But again, yeah, it's becoming yet another thing in the toolbox.
Absolutely. And so as people are incorporating DOLT or Doltgres into their overall system architecture, what are some of the anti-patterns that you see people falling prey to?
It's funny. So I write a weekly email. If you come to DOLT Hub and sign up, so you gotta get an account or sign up for the mailing list, I write an email every Friday, just kinda what's happening this week in DOLT Hub land. And it comes from me, so you can just reply to it. And I get, you know, a handful of replies every week from our ever-growing community. And a user literally asked the same question: can you write a blog about when not to use DOLT? And so over the holidays, I obliged, and I told the list, I bet you didn't know you'd have this kind of power. You can make me write. So the three things I kinda came up with are: first, in order to use DOLT, you kind of have to have control of your database. You know, in a lot of companies, the databases are run by SRE or DevOps. You, as a developer, don't really get to tell them which databases to run. And so if that's the company you live in, you know, you kinda gotta use DOLT for a side project or something, or convince them to use it, which is often an uphill battle, although it gets easier over time as DOLT gets more mature. The second is if you want kind of an out-of-the-box versioning solution. You know, you've got a really complicated Google Sheet, and you think you're gonna port it to DOLT, and then all these nontechnical users are gonna have this version-controlled table. Those tend to be heavier lifts than you think. You're going from spreadsheet to database, and then you're adding versioning on top for these users. Like, that might be a bridge too far in a lot of cases. And then the last thing, and we've kind of touched on this, is, like, DOLT is an OLTP SQL database. So if you don't need that, you're picking the wrong tool. Right? If you want a version-controlled data warehouse or data lake, there's, you know, something else for that, and we're really not applicable. If you want embedded, like, a SQLite database, and this is really popular. Like, a lot of people think version control solves the sync problem on mobile, where you put a DOLT behind a mobile app, and you're writing to it when you're offline, and then when it wakes up, it syncs back up to your server. And so DOLT can be run embedded in Golang, but it is not a C library like SQLite. So it is far less portable. We don't have people building iOS or Android apps with DOLT as the backing store to solve that problem. If you want a graph database, you know, there's this company called Terminus that built a version-controlled graph database. I haven't heard much from them lately, but it existed five years ago, and we talked a lot. I'd say make sure that what you want is something like Postgres or MySQL or Oracle or SQL Server or, like, you know, the closed-source traditional databases, and then DOLT's for you.
And as far as the applications of DOLT that you have seen people using it for, what are some of the most interesting or innovative or unexpected uses that you are aware of?
It's funny. It's my baby. They're all interesting to me. I really didn't expect video games to be a main use case. Like, we've got dozens of video game companies using it to store their video game configuration. I just didn't even know video game configuration got that big. You know, when I think of video game configuration, I think of, like, the player stats in Madden. Right? But, really, every single one of the objects in the game, how it looks, what it does, it's all controlled by configuration. So that's been really kind of shocking and interesting to me. Honestly, having done this for so long, like, you know, you've got a business and it's growing and you're feeling good about it. But then you use Claude Code and you're like, this thing can make writes. And you're sitting there and you're like, I wouldn't let this thing make writes if it wasn't in Git, because it doesn't do them right all the time. Well, I have Git for databases. You know, that's the most exciting thing that's happened to me in a long time. And, you know, when you talk to other engineers, they confirm it. People are even writing that online. Like, Andrej Karpathy is on Dwarkesh talking about it. You've got the Cursor guys telling Lex Fridman that databases need branches. And so, you know, it's not just us, it's the world. And to have spent a good part of seven, eight years building something, and to have this brand new, amazing use case kind of pop up, is really, really exciting for us. And we're just excited to see what the next year holds. We're ready to lock in and keep building for our users. The other thing is, DOLT's free and open source. I'm not trying to sell you anything here. You just pick it up and use it. That's all we want. We want it to be used far and wide and become one of the formats.
And in your own journey of building this technology and company and ecosystem, what are some of the most interesting or unexpected or challenging lessons that you've learned personally?
It's interesting. So, you know, we set out to build a data sharing platform, and then we ended up building a database. And we really didn't wanna build a database, because we thought it was too hard. And it is hard, but it's not too hard. So my advice is, it's not too hard. We did it with a team of 15 people, and it works pretty well. My big advice to people that are listening here and are like, oh, that's cool, I want to build something hard and technical, I want to build a database, I have a good idea for some features that I could add to a database, is just to stick with your conviction. I've been doing this for a long time, and it gets hard. There was a time when I'd open our metrics email and there were five DOLT servers. And you come to work every day and you're like, why is no one using this? And really, the answer was, well, it just wasn't good enough yet. And so now we open up the same email and it's thousands of servers, and that's exciting. And we can feel it in the bug queue and the users showing up and the questions we get. It's all empowering. But you have to be walking through the dark wilderness alone for a long time to get to that point. And you talk to investors and they're like, yeah, Tim, databases take a long time. Don't get discouraged. You're like, really? So I guess the easiest way to lose is to give up. So maybe the advice is, if you have conviction, only give up if you're forced to. So the big surprise is just how long and how hard the journey is and just how emotionally draining it is to just come to work every day and build for your users, you know, no matter how many of those there are. And I just want to make sure that people who are listening to this and are like, oh, that's a cool database, and I have an idea for a cool database, know that I want to encourage them to build it and stick with it, because we need more. We need more stuff. We need cooler tech. Always. It's fun. It's why we do this as engineers.
And as you continue to build and iterate on DOLT and the overall ecosystem of technologies around it, what are some of the things you have planned for the near to medium term or any particular problems you're excited to dig into?
Yeah. Absolutely. So the big focus right now for half the team is DOLT 2.0. And so the difference between DOLT 1.0 and DOLT 2.0: if you use the tip of main, you're kind of getting close to 2.0 already. It has all the features we need. The four goals we had were: first, automatic garbage collection. So DOLT, like Postgres, makes disk garbage. You used to have to manually trigger collection, and it was intrusive. That's all been fixed. I mean, the latest version of DOLT ships with automatic garbage collection turned on by default, so you'd never even notice. The second is we changed the actual disk format to what we call archive format, which saves about 60% of disk space on average for users. So DOLT's even smaller in this format. We wanted to reach MySQL parity on latency performance on Sysbench, which we did. We're actually slightly faster than MySQL now, about 5% faster on Sysbench. And then finally, we wanted to reiterate that the storage and versioning functionality and features are locked. So we're kind of cleaning up a bunch of code and getting that ready. That's the big thing. Stay tuned for DOLT 2.0, which kind of signals to the world that a step-function change and improvement has happened since 1.0, which came out in May 2023. The second thing is that the other half of the team is focused on Doltgres. We wanna get Doltgres to 1.0 this year. We really see that as just a much bigger, more powerful, more interesting database market. It's where everyone's playing, and we want to be part of that. And so that's a lot of syntax bug fixing. If you're listening to this and you want to help us out, just grab Doltgres, take a pg_dump of your schema, and try to jam it into Doltgres. And if you get a bug, like, you get an error, cut an issue and tell us about it. We'll try to fix it. We gotta fix it. So that's kind of my call to action on that side. But yeah, it's an exciting year for us. We just expect more and more people to show up with agentic use cases over the year. In fact, I was just playing with this thing called Gas Town, which is built by this guy I know, Steve Yegge. It's getting a lot of publicity. It's like an agent orchestrator. It basically spins up, like, 50 Claude Codes to do what you want. It's pretty crazy. But it uses an agentic memory system called Beads. Beads currently uses a combo of SQLite and Git, and he's like, DOLT's perfect for this, and he wants to move the back end of that. And, you know, I just talked to him yesterday. So, you know, we expect more and more of that to start happening this year as people realize that all these agent things need version control.
All right. Are there any other aspects of the work that you're doing on DOLT, the use cases that it enables, this overall question of version control for data that we didn't discuss yet that you'd like to cover before we close out the show?
Oh, I feel like I talk too much. So I don't think so. I'm excited about this. The thing we didn't talk about, which is really interesting, and maybe I would just encourage users, if they're interested in how this all works, to go read about it, is that DOLT's built on a brand new data structure called a Prolly Tree. A Prolly Tree is basically a content-addressed B-tree, and I touched a little bit on it, but the idea here is you can break large, large pieces of data down into content-addressed chunks and then search those chunks really quickly and share storage between them. And that's the primitive that enables all of this. And so if you're really interested in computer science and you really want to dig into how this works, like, we didn't just take a Postgres or MySQL and adapt it. We had to do some deep computer science to make this work. And if you have engineers on here, a software engineer will find that really interesting. So if you just search it, the top result is our documentation. Prolly, p-r-o-l-l-y, sort of short for probabilistic. That's a really fun technical deep dive if your listeners wanna geek out.
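The "break data into content-addressed chunks and share storage between them" idea can be shown with a toy Python sketch. This is not DOLT's Prolly Tree: it has no tree levels, and the window size, boundary rule, and SHA-256 addressing are assumptions chosen for brevity. It only demonstrates the primitive: chunk boundaries are picked from the content itself (hence probabilistic), each chunk is stored under the hash of its bytes, and two versions of the same data end up sharing most chunks.

```python
import hashlib
import random

WINDOW = 8    # bytes of trailing context that decide a boundary (toy value)
TARGET = 16   # a boundary occurs roughly once every 16 bytes (toy value)


def split_chunks(data: bytes):
    """Content-defined chunking: cut wherever a hash of the trailing window
    falls in a fixed bucket, so boundaries depend only on nearby bytes."""
    chunks, start = [], 0
    for i in range(len(data)):
        window = data[max(0, i - WINDOW + 1): i + 1]
        if hashlib.sha256(window).digest()[0] % TARGET == 0:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])  # trailing partial chunk
    return chunks


def put(store: dict, data: bytes):
    """Store each chunk under the hash of its content; return the chunk ids."""
    ids = []
    for chunk in split_chunks(data):
        chunk_id = hashlib.sha256(chunk).hexdigest()
        store[chunk_id] = chunk  # identical chunks collapse into one entry
        ids.append(chunk_id)
    return ids


def get(store: dict, ids):
    """Reassemble a value from its list of chunk ids."""
    return b"".join(store[i] for i in ids)


# Two versions of the same value: v2 has a small insertion in the middle.
rng = random.Random(0)
v1 = bytes(rng.randrange(256) for _ in range(4000))
v2 = v1[:2000] + b"EDIT" + v1[2000:]

store = {}
ids_v1 = put(store, v1)
ids_v2 = put(store, v2)
shared = set(ids_v1) & set(ids_v2)
```

Because boundaries depend only on nearby bytes, the insertion disturbs just the chunks around it, and the chunks before and after are stored once and referenced by both versions. Fixed-size chunking would not have this property, since an insertion shifts every later boundary.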
Yeah. I'll definitely add a link to that in the show notes. And so for anybody who wants to get in touch with you and your team and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as the final question, I'd just like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
I'm not allowed to say version control, because we've fixed that one. It's interesting. People haven't figured out the best way to store, retrieve, manage, fork, merge, sync context. And context is, I mean, what an agent or you are sending to an LLM. Because that's kind of the stack trace in this world, and we don't really have good tools to manage and store that and have that across sessions. And there are some attempts, like this thing Beads that I was talking about that Steve built, but I don't even think we agree that that stuff needs to be stored and even shipped along with the artifact that the thing generated, kind of as a stack trace. And so for me, that's the most interesting problem that I've seen that is missing, that, like, many, many, many people are trying to solve, that everyone sees, but we haven't reached a consensus about how this is gonna look and work. And that's partly because Codex, Claude Code, these things don't make it easy for you to kinda see and manage the context, because that's their secret sauce. That's how they differentiate from the other ones. So I think, over time, what kinda data is gonna be generated in the AI world? What are we gonna need to store? What are we gonna need to see? What are the AIs going to need to see? That's the frontier for me. A lot of the other problems in databases that I've experienced in my career seem to be solved. Like, if you want a really big, infinitely scalable database, you can get that. If you want a really fast one for a specific use case, there's a ton of those. If you want a version-controlled one, now there's one of those. And so I think there's the whole set of use cases that AI is unlocking, and context management and storage is the big one that I see that is pretty interesting.
Alright. Well, thank you very much for taking the time today to join me and share all of the great work that you and your team are doing on DOLT and making version control one of these core primitives that's available for data, which is definitely one of the long-running challenges, particularly in data engineering, but also in application design. So I appreciate all of the time and effort that you've put into making that a reality, and I hope you enjoy the rest of your day.
Yeah. I appreciate you having me, and thanks for letting me talk about DOLT.
Thank you for listening, and don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. And the AI Engineering Podcast is your guide to the fast-moving world of building AI systems. Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on Apple Podcasts and tell your friends and coworkers.
