Turing Award Winner: Postgres, Disagreeing with Google, Future Problems | Mike Stonebraker | The Peterman Pod podcast

⁠¶ Intro

00:00

Computer science may well not be a growth industry going forward.

00:04

This is Mike Stonebraker. He's a Turing Award winner, famous for his fundamental contributions to database systems, like creating Postgres and more. What was the hardest part of that implementation?

00:15

The query optimizer.

00:18

How do you identify the people who aren't smart?

00:22

Well, I mean, it's very easy.

00:24

He shared interesting technical takes from his experience.

00:27

On our benchmarks, large language models get zero percent.

00:32

Why did you disagree so much with MapReduce?

00:35

That wasn't the only thing Google was stupid about.

00:38

I'm curious about your thoughts on unsolved problems and what you think the future might look like. Here's the full episode.

00:48

🎵 Music

00:52

The first thing I want to go over is the story of how Postgres got started. But for that, I kind of want to start at the beginning. How did you get into building database systems?

When I graduated, I had the good fortune of being hired at Berkeley. And it was clear that continuing what I did for my PhD was not gonna go anywhere. Then, as well as today, you're way ahead if you get adopted by a mentor who knows the ropes. So Gene Wong, who is still alive and still kicking, took me under his wing and said, Well, let's do something together.

01:40

And this was nineteen seventy one, which was the year after Ted Codd wrote his pioneering paper in CACM. Gene Wong said, Well, let's take a look at database stuff. And at the time, the competitors were a thing called the CODASYL proposal, which you're probably too young to have ever heard of. It was a low-level spaghetti network proposal where you executed queries by following pointers.

02:16

And then the alternative was the IBM proposal, a thing called IMS, which is still available, and it's hierarchical data. You organize your data as trees. And even at the time, IBM realized that trees were not general enough to solve many people's problems. So they hacked on a way to make it a limited network structure. It was clear that was a horrible hack.

02:54

The CODASYL proposal had all kinds of bad properties besides being low level and really hard to debug. It also had the property that if anything changed in what's now called your schema, you basically had to throw away everything and do it all again, because it was absolutely rooted at the physical level. Whereas Ted Codd's stuff made perfect sense. And so Gene said, Well, let's build one of these puppies. That's clearly the next thing to try.

03:28

So we started building Ingres in nineteen seventy two, while I was an assistant professor at Berkeley. As you know, if you're an assistant professor, you get about five years to prove that you're a big shit, and they fire you or they give you tenure. So Ingres was my ticket to getting tenure, which happened in nineteen seventy six. That was where it started. And then again, you know, happenstance.

04:01

At the time, you know, a lot of people would build prototypes which were sort of student-like code, which means you could get it to run, but if you gave it to anybody else, they couldn't. So we put in the first ninety percent to get something we could run. And then, for whatever reason, we put in the next ninety percent to get it to where it really worked. So the University of California version of Ingres really worked.

04:35

And so over the next couple of years about a hundred universities started running it, 'cause Unix became the big thing. And this was a free database system that ran on Unix. So it was quite popular in the academic world. And we started getting, you know, lots of visitors at Berkeley who would say, Gee, this is really nifty looking stuff. What's the biggest Ingres application you have?

05:12

And we'd be forced to say, Not very big, because we... And so this was brought home in spades when Arizona State University considered running Ingres on their student records data, all 40,000 students' worth. They could get over that they had to get an unsupported operating system from Bell Labs. They could also get over that they had to run an unsupported database system from these guys at Berkeley.

05:49

But the project went down in flames when they realized there was no COBOL available for Unix, and they were a COBOL shop. So unsupported operating system, unsupported database system, no COBOL doomed us to irrelevance. And it was clear the only way out of that was to start a company. And so in nineteen eighty, we got venture capital, as it existed then, and started Ingres Corporation, to move Ingres to DEC's VMS, you know, a real

06:34

operating system, and we had a real company that would support Ingres. And that was the start of the commercial journey.

⁠¶ Competing with Oracle

06:43

I saw that Ingres was competing with Larry Ellison's offering at Oracle.

06:49

Yes.

06:50

I saw that Ingres was certainly better than what they were offering. But they were still competing somehow. How did they compete?

07:01

Larry Ellison is a fabulous salesman. And at the time he made present tense and future tense indistinguishable. And so he basically lied to customers. He would ship stuff that didn't work and have his initial customers help him debug it. So I think he engaged in what I consider very shady business practices.

07:33

But lying to customers, I think, is, you know, unconscionable. So for instance, there's a thing called referential integrity, which is, if you fire an employee and he's the last person in a given department, do you wanna delete the department or do you wanna have it be a ghost department? It's all that kind of stuff.

08:04

And so Ingres Corporation implemented referential integrity. Oracle Corporation wrote two manual pages that said, here's the definition of referential integrity, which everybody agreed to. And then down at the bottom it said, not yet implemented.
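To make the referential integrity example concrete, here is a minimal sketch using Python's built-in sqlite3 module. The `department` and `employee` tables and their rows are hypothetical, and SQLite stands in for whatever system you use; nothing here is taken from Ingres or Oracle. It shows the database rejecting an employee who points at a nonexistent department, and it shows that firing the last employee leaves a "ghost" department unless the application decides to delete it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when asked
conn.execute("CREATE TABLE department (name TEXT PRIMARY KEY)")
conn.execute(
    "CREATE TABLE employee ("
    " name TEXT PRIMARY KEY,"
    " dept TEXT NOT NULL REFERENCES department(name))"
)
conn.execute("INSERT INTO department VALUES ('shoe')")
conn.execute("INSERT INTO employee VALUES ('alice', 'shoe')")


def try_insert_orphan(conn):
    """Try to add an employee whose department does not exist."""
    try:
        conn.execute("INSERT INTO employee VALUES ('bob', 'toy')")
        return True
    except sqlite3.IntegrityError:
        return False


orphan_accepted = try_insert_orphan(conn)

# Fire the last employee of the department; the department row stays behind.
conn.execute("DELETE FROM employee WHERE name = 'alice'")
ghost_departments = conn.execute(
    "SELECT d.name FROM department AS d"
    " WHERE NOT EXISTS (SELECT 1 FROM employee AS e WHERE e.dept = d.name)"
).fetchall()
```

Whether the ghost department should be kept or deleted is a policy choice; the point of the constraint is that the database, not each application, enforces whichever rule is declared.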

08:25

Interesting. Yeah, I had interviewed someone who worked at Sun Microsystems, and they had a similar opinion, that Larry Ellison was a little bit shady. So it seems to be a commonality. I also saw, in something else that you had said, that when Oracle acquired MySQL, everyone kind of got afraid of that and moved to Postgres.

08:55

That was the genesis of Postgres replacing MySQL as the preferred open source relational database system.

09:06

So you created Ingres, and there were a lot of technical innovations in it, so that it was better than the incumbents.

⁠¶ What made Postgres special

09:15

But ultimately it went away and you developed Postgres. What was the thing that Ingres didn't do that Postgres would do?

09:24

Well, the big thing that guided us at the very beginning, for the academic version of Ingres, was that we were gonna support a geographic information system that the neighboring professor, Pravin Varaiya, wanted. And so to support a GIS system, you need points, lines, polygons, line groups, that sort of stuff. And it was clear that Ingres couldn't do it, because the data types we put into Ingres were the standard ones: integers, floats, text strings.

10:11

You couldn't efficiently support GIS types on top of that. So as a GIS, the academic version of Ingres was a complete failure. And that was in the back of our mind. The other thing that happened, and this is a little out of chronological sequence, but it helps make the point, involves the commercial version of Ingres, I think around nineteen eighty five. You know, ANSI had just proposed a date and time standard for relational databases.

10:55

And so commercial Ingres implemented date and time, you know, using the standard Gregorian calendar. And I was associated with the commercial version of Ingres while I was still at the University of California as a professor. So I got a call from an Ingres customer who said, You know, you implemented date and time wrong. And I said, Huh? We implemented the Gregorian calendar and you can subtract.

11:30

And, you know, months have thirty or thirty-one days, except for February, except for leap years. So subtraction on dates works exactly the way you would expect it to. But he said, That's not what I want. In his particular world, he was dealing with bond financial instruments, and for some reason you got the same amount of interest on his financial bonds during each month, no matter how long the month was.

12:12

So he had the date you bought the bond and the date you sold the bond. He wanted to do a subtraction, multiply it by the coupon rate, and say, That's the interest we paid you. But of course, his version of subtraction was March fifteenth minus February fifteenth is thirty days, because that's the definition of his calendar. And so he had to retrieve two dates out to user code, do the subtraction in user code, put the answer back, and it cost him a factor of two or three in efficiency.

12:52

And he said, Why can't I just overload your definition of subtraction with what I want? And of course with Ingres it was hard-coded. And the problem was, this is a case where you wanted bond time, just like you wanted points, lines, and polygons. And so Postgres was engineered to have an extendable type system. So you could have whatever data types you wanted,

13:20

and they were very efficient. And that was the main gist of Postgres, that it had that flexibility. In business data processing, most people were happy with the standard data types. But as relational databases started to spread to all kinds of other places, what are called abstract data types, or, you know, stored procedures, there's a bunch of names they're called, had great applicability.
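The bond-calendar request can be sketched as a user-defined type whose subtraction operator is overloaded. This is a minimal Python illustration, not Postgres code: `BondDate` is a hypothetical name, and the rule used is a simplified "every month has thirty days" convention assumed for the example. In Postgres itself the analogous building blocks would be a user-defined type plus an operator definition (`CREATE TYPE`, `CREATE OPERATOR`).

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class BondDate:
    """A date whose subtraction treats every month as 30 days long."""

    year: int
    month: int
    day: int

    def __sub__(self, other: "BondDate") -> int:
        # Day 31 is treated as day 30, so no month is longer than 30 days.
        later_day = min(self.day, 30)
        earlier_day = min(other.day, 30)
        return (
            360 * (self.year - other.year)
            + 30 * (self.month - other.month)
            + (later_day - earlier_day)
        )


# Gregorian subtraction: February 2023 has 28 days.
calendar_days = (date(2023, 3, 15) - date(2023, 2, 15)).days

# Bond-calendar subtraction: the same two dates are exactly one 30-day month apart.
bond_days = BondDate(2023, 3, 15) - BondDate(2023, 2, 15)
```

The design point is that the subtraction runs where the type is defined, so nothing has to be pulled out to user code and pushed back in.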

14:00

And so that was the big thing in Postgres. Postgres also supported what the AI guys at the time wanted in the way of inheritance. We also supported time travel, but the implementation absolutely sucked, and it got taken out after a while. So there were a huge number of really nifty things in Postgres.

14:32

You mentioned you want to hire the extraordinary software engineers, and I think you've said before that you have no trouble finding those people. How do you identify in your hiring that they're the extraordinary ones?

14:51

I mean, I have a good feel for how difficult stuff is. If they get three X the amount done, you know, in school, that I think is reasonable, then they're incredible.

15:05

On the flip side, you had this interesting quote. I wrote it down. You said, I can't stand people who aren't really smart. It's challenging to talk to them. How do you identify the people who aren't smart?

15:20

Well, I mean, it's very easy. You can rapidly surface whether they're smart or not. You know, what was your master's thesis? What did you do? Well, how exactly did it work? Well, how did you deal with error conditions? How many processes did you have? Why didn't you use threads? I mean, you ask them deep technical questions.

15:54

You gave a talk, and I think there's also a paper behind it, on this idea that one-size-fits-all database systems are not optimal. One size actually fits none, and what you really want is database solutions that target specific needs.

⁠¶ One size fits none

16:12

What database offerings do you see today that are one size fits all?

16:15

In 2004, when I wrote the paper, we had an academic project which was building what became StreamBase. And a stream processing engine looks nothing like a relational database. And we had the gist of an idea for column stores for data warehouses, which was popularized by Vertica. That looks nothing like a row store. So here were three wildly different implementations that had no resemblance to each other. And in each case, they were an order of magnitude faster than the other guys.

16:55

So it's pretty clear that, with those three instances, you give up an order of magnitude when you're running a database system that isn't architected for your kind of stuff. I think that's still true. I mean, I think ClickHouse is a column store. Pinecone is faster than user-defined types on text-based vector processing. And so I think it's still very much the case.
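As a rough illustration of why a column store "looks nothing like a row store", here is a toy sketch of the two layouts in Python. The three sales records are made up, and the sketch says nothing about how Vertica or ClickHouse are actually implemented; it only shows that an aggregate over one attribute has to walk whole records in the row layout but reads a single contiguous array in the column layout.

```python
from array import array

# Hypothetical sales records: (id, region, amount).
rows = [
    (1, "east", 120.0),
    (2, "west", 80.0),
    (3, "east", 200.0),
]

# Row layout: the fields of each record sit together, so summing one
# attribute still steps through every full record.
row_total = sum(amount for _id, _region, amount in rows)

# Column layout: each attribute is stored as its own array, so the same
# aggregate reads only the "amount" array and never touches the others.
columns = {
    "id": array("q", [r[0] for r in rows]),
    "region": [r[1] for r in rows],
    "amount": array("d", [r[2] for r in rows]),
}
column_total = sum(columns["amount"])
```

On a warehouse table with hundreds of columns and a query that reads a handful of them, that difference in bytes touched is where the large gap comes from.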

17:36

And I think there's no difficulty putting a common parser on top of multiple implementations. Postgres has so far chosen not to do that. They don't implement a column store. And so I think they are not competitive, you know, on sizable data warehouses.

18:02

They also don't have multi-node support. Again, for people with big data warehouses, that's table stakes. So I think it's just as true today as it ever was. What is true is that if you want to get going and you have a database problem, you know, the answer is choose Postgres. There's a huge programming community, all kinds of, you know, data type implementations, and it's free. And you can find Postgres talent easily and get going.

18:44

And so I think it's a great choice for lowest common denominator. Until you're trying to do a million transactions a second, it works just fine. Until you're trying to support a petabyte data warehouse, it works just fine. I'd say at the low end it's absolutely the right one size fits all. At the high end, that's just not true.

19:12

GPUs, do they make available some new opportunities to optimize databases?

19:19

Probably, but I think the big challenge is that GPUs are, you know, SIMD, single instruction, multiple data. And that's anathema to indexing. And so whenever indexing is the right answer, they're probably not a good idea. And I think you've also gotta architect them so that the bandwidth from storage is not the bottleneck.

19:59

And so if they're an add-on to the CPU, as often as not, the bus connecting the GPU to the CPU is a bottleneck.

20:09

Can you explain why indexing would be not as effective when there's SIMD?

20:17

So let's say I'm looking for Ryan's salary and I have a B-tree. So you go to the root of the B-tree. You find the dividers that sit on both sides of Ryan. You follow the pointer. That's a memory access for sure. Then you do it all again, and you do this like three or four times. So that doesn't parallelize well. So the answer is indexing doesn't parallelize well.
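The lookup just described can be sketched as a loop in which each step needs the result of the previous one. This is a simplified, read-only tree in Python with hypothetical names and salaries (`Node`, `search`, the example keys), not a full B-tree implementation; a real index would typically be three or four levels deep rather than the two shown here.

```python
from bisect import bisect_right
from dataclasses import dataclass, field


@dataclass
class Node:
    keys: list
    children: list = field(default_factory=list)  # empty for leaf nodes
    values: list = field(default_factory=list)    # used only in leaf nodes


def search(root, key):
    """Return (value, pointer_hops) for key, or (None, pointer_hops) if absent."""
    node, hops = root, 0
    while node.children:
        # Which child to read next is known only after comparing against the
        # keys of the node just read, so the hops cannot be issued in parallel.
        node = node.children[bisect_right(node.keys, key)]
        hops += 1
    if key in node.keys:
        return node.values[node.keys.index(key)], hops
    return None, hops


leaves = [
    Node(keys=["Alice", "Bob"], values=[90, 85]),
    Node(keys=["Mike", "Ryan"], values=[120, 100]),
    Node(keys=["Ted", "Zoe"], values=[95, 70]),
]
root = Node(keys=["Mike", "Ted"], children=leaves)

salary, hops = search(root, "Ryan")
missing, _ = search(root, "Carol")
```

Contrast this with a full scan, where every element can be compared against the search key independently; that independent, same-instruction-everywhere shape is what SIMD hardware is built for.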

20:54

You mentioned B-trees. When you first implemented that first version of Ingres, did you write all of that by hand? 'Cause I imagine there was probably not some existing B-tree library or something.

21:08

Yeah, the original version of Ingres was all written from scratch.

21:12

What was the hardest part of that implementation?

21:16

The query optimizer.

21:18

And why was that hard?

21:20

It's just algorithmically difficult. Still, if you ask most any senior database programmer what's the hardest part, they'll still say the optimizer.

21:36

MapReduce came out at some point in the early two thousands and it kind of took the data world by storm. People were really impressed by it. They thought Google really knows what they're doing. This is the best thing since sliced bread.

⁠¶ Why he disagreed with Google

21:50

But it seems like, when I look at the literature and what you thought at the time, you disagreed heavily. Why did you disagree so much with MapReduce?

22:01

Well, I think... I'm not sure. There were a lot of not very enlightened people who said, Google is really smart. They must know what they're doing. And so we'll do whatever they say. And so they would engage with Hadoop. But Hadoop is ridiculously inefficient. And at the time, you know, Dave DeWitt and others who were involved in our two thousand and eleven paper, we understood distributed databases

22:42

and understood that you could beat the heck out of Hadoop with a distributed database system, which is basically what that 2011 paper says. And of course it's true. But that wasn't the only thing Google was stupid about. Google also had the opinion that eventual consistency was the right way to do concurrency control. And so that was postulated from on high by Google all during that same period of time.

23:24

And it wasn't. All the database people said, You know, you're out of your friggin' mind. It solves only one particular kind of problem, and that very rarely occurs in practice.

23:42

Why did they pursue eventual consistency?

23:45

Okay, well, the idea is that you have an East Coast database and a West Coast database and they're replicas. So you want them to be the same. Say I'm going to do a transaction: I'm going to decrement by one the number of widgets in the West Coast warehouse. Then, before I commit that transaction, I'm going to update the East Coast warehouse, and pay for a message over and back.

24:14

And then, to make sure everything goes well, it takes another round trip of messages to make sure that both of them actually do the commit correctly. So it's expensive to do a distributed commit. And it still is. And so the idea was, well, you do the West Coast update, you decrease the widgets by one, and you just send a message asynchronously and not in a transaction, so that eventually the East Coast warehouse gets decremented by one.

24:51

So meanwhile, if you're on the East Coast, you decrement, you know, foodstuffs by one, you send an asynchronous message, eventually the West Coast gets it, and eventually everything settles out. So if you're allowed to go below zero, then what will happen is, if the East Coast guy and the West Coast guy simultaneously sell the last widget, then eventually the state of the warehouse will be minus one and somebody won't get their widget.

25:36

And so

25:37

if you're allowed, like Amazon, to say, usually ships in 24 hours, then maybe you're allowed to oversell. But most enterprises can't do that. And so eventual consistency just doesn't work. We talked a million hours ago about referential integrity. Well, an integrity constraint in a sales system is that stock is greater than minus one. And that fails with eventual consistency. And so Jeff Dean of Google finally figured that out.

26:22

And when they did Spanner, Spanner had a conventional transactional system. And so Google completely abandoned eventual consistency and completely abandoned MapReduce.
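The oversell scenario can be simulated in a few lines. This is a toy model in Python, with two dictionaries standing in for the East Coast and West Coast replicas and a list standing in for asynchronous messages; all names are hypothetical, and the second function only mimics the effect of a coordinated commit (one agreed-upon state) without implementing a commit protocol.

```python
def sell_eventually_consistent():
    """Both coasts sell locally and tell the other replica later."""
    east = {"widgets": 1}
    west = {"widgets": 1}  # replica of the same single widget
    pending = []  # asynchronous "decrement by one" messages

    # Each coast checks only its own copy, so both see the last widget.
    for local, remote in ((east, west), (west, east)):
        if local["widgets"] > 0:
            local["widgets"] -= 1
            pending.append(remote)

    # The messages arrive after both sales have already happened.
    for remote in pending:
        remote["widgets"] -= 1
    return east["widgets"], west["widgets"]


def sell_with_coordinated_commit():
    """Every sale is checked against one agreed-upon stock count."""
    stock = {"widgets": 1}
    sold = 0
    for _coast in ("east", "west"):
        if stock["widgets"] > 0:  # constraint: stock never goes below zero
            stock["widgets"] -= 1
            sold += 1
    return stock["widgets"], sold


eventual_state = sell_eventually_consistent()
coordinated_state = sell_with_coordinated_commit()
```

In the first function both replicas settle at minus one, the state described above; in the second, the second sale is refused, at the cost of every sale waiting on agreement between the sites.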

26:37

So the trade-off's basically correctness for performance.

26:42

So it's performance versus data integrity. And if you don't care about your data, then you're willing to deal with bad things happening.

26:54

So did you ever talk to the Google team while they were doing those things that you thought were so wrong?

27:00

We talked to them before the two thousand eleven paper and said, Why don't we partner up and do some stuff? And they weren't interested. So they declined.

27:19

Have you seen other examples in other big tech companies where you actively disagree with their databases or database solutions? Like maybe Amazon or Facebook.

27:31

Well, I gave a talk at Amazon maybe three years ago and I told them all the things I thought they were doing wrong. And I think Amazon's problem is that they are supporting, you know, fifteen different database systems. And that's about twelve too many. So I think they have their own culture, and I said, You're supporting too many database systems. And at this point they haven't chosen to retire any of them.

28:09

Why do you say that the fifteen should be three?

28:12

They're supporting a graph-based database system, and it's well understood that a graph-based database system is almost never the performant option. And so if you like the idea of having a user interface that deals with nodes and edges, that's fine. Put a layer on top of a relational database system that gives you that user model.
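One way to picture "a layer on top of a relational database system" is a thin function that exposes a graph operation while storing edges in an ordinary table. The sketch below uses Python's sqlite3 module and a recursive common table expression; the `edge` table, the `reachable` helper, and the sample edges are all hypothetical, and it makes no claim about how any Amazon product works.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE edge (src TEXT, dst TEXT)")
conn.executemany(
    "INSERT INTO edge VALUES (?, ?)",
    [("a", "b"), ("b", "c"), ("c", "d"), ("x", "y")],
)


def reachable(conn, start):
    """Graph-style API: every node reachable from start, via relational SQL."""
    rows = conn.execute(
        """
        WITH RECURSIVE reach(node) AS (
            SELECT ?
            UNION
            SELECT edge.dst FROM edge JOIN reach ON edge.src = reach.node
        )
        SELECT node FROM reach WHERE node <> ? ORDER BY node
        """,
        (start, start),
    ).fetchall()
    return [row[0] for row in rows]


from_a = reachable(conn, "a")
from_x = reachable(conn, "x")
```

The caller thinks in nodes and edges; the storage, indexing, and query planning underneath remain the relational engine's.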

28:46

And so for most of their database systems, there's some other one of their database systems that's better at what it does. And so the answer is you should retire any database system that isn't performant or isn't in a big enough market to justify the maintenance.

⁠¶ Why he chose academia over big tech

29:14

You've influenced industry significantly from academia. And my one thought that I had is

29:23

What?

29:24

Why not work directly in industry? Why do you prefer the position of being in academia and having influence in the way that you have, versus just taking a job at AWS or something like that, being a very, you know, distinguished engineer there?

29:42

'Cause that gives you a boss.

29:45

Yeah.

29:46

And that gives you company rules, limits your ability to publish, limits your ability to go talk at conferences, limits your ability to go poke at what various competitors are doing that they won't tell their competitors. But mostly I really like being in startups. After the commercial version of Postgres got acquired by Informix,

30:19

you know, I was working part time for Informix, which was a two thousand person company. And I didn't feel like I could make a difference, 'cause it was bureaucratic and whatever the president wanted he got. So I think I'm just not cut out for politicking. I don't do that very well. And I have a hard time interacting with people I think are dumb. So I guess I have some problems with big companies.

30:57

I wanna talk a little bit about DBOS. I just thought it was a really interesting technical model. Can you explain what DBOS is?

⁠¶ Replacing state in an OS with a DB

31:07

We started the academic project in twenty nineteen, twenty twenty, something like that. And the gist of it was this. At that point, Matei Zaharia, who was on the faculty at Stanford, was also one of the founders of Databricks and was the original creator of Spark. At the time, Databricks, you know, basically was running people's Spark jobs on the cloud. And so he said, At any given time, we might be orchestrating a million Spark jobs.

31:54

And so we have to write a scheduler that's gonna decide who to run next, at a scale of a million. And he said, We tried all the schedulers written by the OS folks, and they didn't scale. So we put all the scheduling data in a Postgres database. And basically a Postgres application was doing scheduling. And then it sort of clicked that, by and large, most everything you do in an operating system is managing data at scale.

32:33

And you should do that using database technology. So why don't we just replace at least the upper half of Linux with a database system? So that was the gist of the academic project. And we worked on it at Berkeley and Stanford in the early twenties. And it was very successful. It clearly worked.
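The "scheduling data in a database" idea can be sketched with a table of jobs and a query that picks the next one. This is a minimal illustration using Python's sqlite3 module in place of Postgres; the `job` table, the `priority` column, and the `run_next` function are assumptions made for the example, not a description of the Databricks scheduler or of DBOS.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE job ("
    " id INTEGER PRIMARY KEY,"
    " priority INTEGER NOT NULL,"
    " state TEXT NOT NULL DEFAULT 'waiting')"
)
conn.executemany(
    "INSERT INTO job (id, priority) VALUES (?, ?)",
    [(1, 5), (2, 9), (3, 7)],
)
# An index lets "who runs next" stay cheap as the table grows.
conn.execute("CREATE INDEX job_pick ON job (state, priority)")


def run_next(conn):
    """Pick the highest-priority waiting job and mark it running, atomically."""
    with conn:  # commits on success, rolls back if anything raises
        row = conn.execute(
            "SELECT id FROM job WHERE state = 'waiting'"
            " ORDER BY priority DESC, id LIMIT 1"
        ).fetchone()
        if row is None:
            return None
        conn.execute("UPDATE job SET state = 'running' WHERE id = ?", (row[0],))
        return row[0]


order = [run_next(conn) for _ in range(4)]
```

The scheduling policy is a query and the scheduler's state is a table, so changing the policy means changing an ORDER BY clause rather than a hand-built in-memory data structure.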

33:03

And in the process, the Stanford folks wrote an extension to JavaScript, because you need some programming world that can talk to your implementation. So if you're doing what amounts to a programming language, and you're running on top of what amounts to an operating system that is a database,

33:35

then the obvious thing to do is put all your state in the database. And that's exactly what they did. And so we had an innovative programming language model and an innovative operating system model. And of course then the idea was, well, can we start a company? And so we talked to the VCs, who to a person said, You're dreaming if you think you're gonna displace Linux. However, this programming language stuff is really nifty. We had what amounted to extensions to JavaScript.

34:19

They would allow any program to have all of the nice features of a database system. You know, stuff was durable. You could have transactions. If it failed, you'd fail over. You know, it was all that nifty stuff. So we got funded to start a company in twenty twenty three. And that was DBOS Incorporated. We decided to keep that name, since that was the name of the project.

34:54

But we were basically in the programming language business. And so at the current time, DBOS has a version of TypeScript, a version of Java, a version of Go, and a version of Python, which are basically seamless. It runs what look like vanilla programs. In the world of the cloud, there's every incentive for you to structure your application as a workflow. And so we decided that we would support a workflow system, period.
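One way to picture a durable workflow: each step's result is recorded in a database as soon as the step finishes, so a run that restarts after a crash skips the steps that already completed. The sketch below is a hand-rolled illustration using Python's sqlite3 module. It is not the DBOS API; `run_workflow`, the `step_log` table, and the example steps are hypothetical names invented for this example, and an in-memory database is used only to keep it self-contained (a file-backed or server database would be needed for real durability).

```python
import sqlite3


def run_workflow(conn, workflow_id, steps):
    """Run steps in order, recording each result so finished steps never rerun."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS step_log ("
        " workflow_id TEXT, step_no INTEGER, result TEXT,"
        " PRIMARY KEY (workflow_id, step_no))"
    )
    results = []
    for step_no, step in enumerate(steps):
        row = conn.execute(
            "SELECT result FROM step_log WHERE workflow_id = ? AND step_no = ?",
            (workflow_id, step_no),
        ).fetchone()
        if row is not None:
            results.append(row[0])  # finished in an earlier run, so skip it
            continue
        result = step()
        with conn:  # the step's outcome is committed before moving on
            conn.execute(
                "INSERT INTO step_log VALUES (?, ?, ?)",
                (workflow_id, step_no, result),
            )
        results.append(result)
    return results


conn = sqlite3.connect(":memory:")
calls = []


def reserve():
    calls.append("reserve")
    return "reserved"


def charge_crashing():
    calls.append("charge")
    raise RuntimeError("simulated crash")


def charge():
    calls.append("charge")
    return "charged"


try:
    run_workflow(conn, "order-1", [reserve, charge_crashing])
except RuntimeError:
    pass  # the process "died" after step 0 was recorded

results = run_workflow(conn, "order-1", [reserve, charge])
```

On the second run, `reserve` is not called again because its result is already in the log; only the step that never finished is retried.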

35:38

And so in the workflow that DBOS supports in those four languages, the steps in a workflow, the individual micro-ops, whatever you want to call them, are transactional. Workflows are durable, so that once you do a step, it's not forgotten. And it's clear that we can make workflows atomic if there was a market for it, which means the whole workflow would either finish or look like it never happened. So it has very, very nice properties. And is

36:27

a great deal faster and a great deal easier to use than the competition. So the company is selling and innovating in this area. And so the idea is that you want to make the state of your application persistent by putting it in the database, and then you figure out how to do it fast. And I think they're not going to be able to do that. The business model, as we were talking about earlier, is very much

37:02

to get leaf-level programmers interested. So it's been very much, you know, leaf-level programmer, tell us what you need that we don't have. We get it quickly and convince people to try it. And

37:24

we've been very, very successful with other startups who want to choose the best thing. And we're starting to be successful with the big boys. So it's an interesting market, and I think the key thing so far is that probably two thirds of the customers are doing agentic AI, which means that they have a large language model surrounded by a bunch of stuff that adds more signal. And so far, most of agentic AI is read-only,

38:09

meaning you want to produce a prediction for whether Ryan is going to be a good customer or not. And so that just runs some stuff and then produces a new thing that's given to somebody. So it's basically read-only, which means that you're not actually updating Ryan's credit rating or anything. And so I think fairly quickly the whole world is gonna move to using, you know, agents to do read-write applications.

38:52

And that's going to make them very, very databasey. And DBOS does that stuff really, really well. And so, you know, for instance, say you want to write an agent, or two agents, that move $100 from my account to your account. So you debit my account, you increment your account. And these two agents have to agree to commit. Or you have to back everything out,

39:28

which is to say the workflow needs to be what I called atomic, which is, it all happens or it looks like it never happened. And so I think the demands in this market will escalate with people wanting stuff to be read-write. And that will bode well for the market and bode well for DBOS.
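The $100 transfer is the classic case for "it all happens or it looks like it never happened". Here is a minimal sketch with Python's sqlite3 module; the `account` table, the balances, and the `transfer` function are assumptions for illustration. It uses a single database transaction, which is simpler than two independent agents agreeing to commit, but it shows the property being asked for: if the debit fails, the credit that was already applied is backed out.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE account ("
    " owner TEXT PRIMARY KEY,"
    " balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO account VALUES (?, ?)", [("mike", 150), ("ryan", 0)])
conn.commit()


def transfer(conn, src, dst, amount):
    """Move amount from src to dst; both updates commit together or not at all."""
    try:
        with conn:  # rolls back every statement inside if one of them fails
            conn.execute(
                "UPDATE account SET balance = balance + ? WHERE owner = ?",
                (amount, dst),
            )
            conn.execute(
                "UPDATE account SET balance = balance - ? WHERE owner = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        return False


first = transfer(conn, "mike", "ryan", 100)   # succeeds: 150 is enough
second = transfer(conn, "mike", "ryan", 100)  # fails: only 50 left
balances = dict(conn.execute("SELECT owner, balance FROM account"))
```

The second transfer credits the destination first and then violates the balance check on the source, so the whole transaction is rolled back and neither account changes.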

39:56

And what's being offered in the market today to application developers differs from the original research project, where that was actually swapping out the guts of the operating system with the database. I see. I mean, that's really cool. I never imagined replacing all the state of an operating system with a database. There's gotta be some trade-off there.

40:22

Well, a file system written on top of a DBMS is faster than the Linux file system. The scheduling engine is competitive with other scheduling engines. You can make everything fail over, so you get high availability without having to do anything else. The answer is there's really no downside.

40:49

Then why wouldn't Linux incorporate that and upgrade itself with this.

40:55

You hope they would. In other words, you should keep all the device driver junk down at the bottom, because there's a lot of it and no one wants to do that, and replace everything else with the database implementation.

41:12

Is that something that you've mentioned to Linux people, and what's their typical reaction?

41:17

Back in the academic project, when I'd mention that to operating system folks, they would get very, very threatened. Which is, this is the database guys trying to take over their turf. And I think the programming language guys did too, you know, which is, you know, the way to implement the runtime for a programming environment is with a database.

41:46

That's interesting. I mean, if it's objectively true, then maybe it will take over.

41:52

Well, I mean it took Java ten years to become widely accepted. I just think the time constant is

42:00

Substantial.

⁠¶ Future problems in databases

42:02

I think we talked a lot about the past of databases and I'm curious your thoughts on unsolved problems in databases and what you think the future might look like.

42:12

Okay, so I think there are two different things that I'd like to talk about. The first one is, like everyone else, three years ago we started to look at what large language models were good for. So we've been trying to get what's now called text-to-SQL to work on real-world databases, especially real-world data warehouses.

42:50

So we've been trying the technology on four different production data warehouses where we've gotten the workload, the actual workload that's run.

43:04

And

43:06

you know, from the actual users using the system. And we've gotten them to reverse engineer the text that corresponds to that SQL. So we have text and SQL. We have four benchmarks.

43:26

When you say text-to-SQL, you mean like a human prompting a model or something? Like I would just say it in English, and that text would be, you know, everyone over four years old.

43:38

Tell me all the professors at MIT who won the Turing Award. And so an LLM is supposedly good at that. And so for the text-to-SQL benchmarks, there's one called Spider, another one called BIRD. And the best LLM systems are pretty good at those benchmarks, you know, like eighty percent accuracy or better.
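To show what a text-to-SQL pair looks like, here is the example question next to one SQL query that would answer it, run against a tiny made-up database with Python's sqlite3 module. The `professor` and `award` tables, their columns, and the placeholder rows are all assumptions for illustration; they are not from Spider, BIRD, or any real dataset.

```python
import sqlite3

# The "text" half of the pair.
question = "Tell me all the professors at MIT who won the Turing Award."

# The "SQL" half: one query a system would be expected to produce.
sql = """
    SELECT p.name
    FROM professor AS p
    JOIN award AS a ON a.professor_id = p.id
    WHERE p.university = 'MIT' AND a.award_name = 'Turing Award'
    ORDER BY p.name
"""

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE professor (id INTEGER PRIMARY KEY, name TEXT, university TEXT);
    CREATE TABLE award (professor_id INTEGER REFERENCES professor(id), award_name TEXT);
    INSERT INTO professor VALUES
        (1, 'Prof A', 'MIT'), (2, 'Prof B', 'MIT'), (3, 'Prof C', 'Elsewhere');
    INSERT INTO award VALUES
        (1, 'Turing Award'), (3, 'Turing Award'), (2, 'Other Award');
    """
)
answer = [row[0] for row in conn.execute(sql)]
```

A benchmark scores a system by comparing what its generated SQL returns with what the reference SQL returns. A query this short, over a schema this clean, is close to the easy end of the spectrum.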

44:04

So not superhuman.

44:06

Not superhuman, but they're pretty good. Like, you would consider using them. The current leaderboard is something like 85% accuracy, which, I mean, it's getting there. You'd say maybe it's not quite ready for prime time, but it certainly looks pretty good. Well, on our benchmark,

44:30

large language models get zero percent. And if you enhance them with RAG and all the tricks, it goes to ten percent. And if you give as a prompt the FROM clause, in other words, all the actual tables that need to be accessed, and all the actual join clauses that need to be joined, then accuracy goes to about thirty-five percent. So definitely this stuff is not ready for prime time and not gonna be for a while, if ever.
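
As an illustration of what "giving the FROM clause and the join clauses as a prompt" could look like, here is a small sketch that assembles such a prompt. The `JoinHint` type, the `build_prompt` function, the table names, and the prompt wording are all assumptions made for this example; the transcript does not describe the actual prompt format used.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class JoinHint:
    """One join predicate supplied to the model (hypothetical structure)."""
    left: str   # e.g. "enrollments.course_id"
    right: str  # e.g. "courses.id"

def build_prompt(question, tables, joins):
    """Assemble a text-to-SQL prompt that spells out the tables and join predicates."""
    parts = [
        "Write one SQL query that answers the question.",
        f"Question: {question}",
        "Use exactly these tables: FROM " + ", ".join(tables),
    ]
    if joins:
        parts.append("Use exactly these join predicates:")
        parts.extend(f"  {j.left} = {j.right}" for j in joins)
    return "\n".join(parts)

# Made-up example tables and columns.
prompt = build_prompt(
    "How many students took a course during J term?",
    ["enrollments", "courses"],
    [JoinHint("enrollments.course_id", "courses.id")],
)
print(prompt)
```

The model still has to work out the SELECT list, filters, and grouping, which is consistent with accuracy improving but staying well short of the public benchmarks.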

45:10

So what's the difference? Number one, data warehouses. You know, LLMs are trained on the pile. Data warehouse data is not in the pile. And there's an adage that if you haven't seen the data a couple times before, you have no chance of regurgitating it. That's number one. Number two, query complexity on Spider and BIRD is maybe 10 to 20 lines of SQL. Real-world data warehouses, it's a hundred lines of SQL. Complexity is bigger. Number three, the schema in Spider and BIRD is clean.

45:57

You know, the table names are mnemonic, the column names are mnemonic, and there's no duplication. In data warehouses, people have materialized views all the time. It means there's redundancy. And column names are often underscore, z, underscore, blah. And so they're not mnemonic. So that makes it a lot harder. And then they also have idiosyncratic data. So J term is a popular thing at MIT. It's a one-month term in January. Not unique to MIT, but not very popular.

46:44

So not in the pile, idiosyncratic data, complex queries, schema is a mess: those make it not work. And those are true of every data warehouse I know of. And so I think the technology simply doesn't work and isn't gonna work any time soon. So what do you do? Well, first of all, we published our benchmark. It's a thing called Beaver, which is an anonymized and abstracted version of these four data warehouses.

47:27

And so if you think you're really good at doing text-to-SQL, try a real benchmark, not a fake one. So number two, borrowing from what I just said: if you don't have all the join terms and you don't have the FROM clause, you're toast. What's more, if you don't break down the query into simpler pieces, you're toast. That says to me that you want to give your retrieval system simpler pieces, which include the FROM clause and include the join terms.

48:13

That's number one. Number two, the minute you want to talk to two different structured databases, you know, like your data warehouse and your CRM system, then it's pretty clear to me that doing a structured data join using an LLM is a bad idea. You're much better off leaving them as tables and doing a join in SQL. So our point of view is we are trying out

48:49

turning everything into tables. You know, we're working with the Department of Transportation in the city of Munich, Germany. And they have six people full time who are answering citizens' complaints, which are of the form: How come I don't have enough time to cross this intersection next to my house before the light turns? All kinds of stuff. How come the trolley doesn't stop for enough time for me to get on the trolley? You know, how come the trolley doesn't come

49:30

more than once an hour? I mean, it's all this stuff. Their database is: the trolley schedule, that's SQL. The light sequencing, that's SQL. The intersections, that's CAD. The federal, you know, country of Germany regulations for this stuff, that's text. City of Munich regulations for this stuff, which is text as well. So you gotta join SQL, SQL, CAD, text, and text. So our point of view is: turn it all into SQL, all into tables, and do a join with what amounts to a query optimizer.
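
A toy sketch of the "turn it all into tables and do a join" idea, using SQLite. It assumes the regulation text has already been extracted into rows by some earlier step that is not shown; the table names, columns, and every value below are invented for illustration and have nothing to do with the real Munich data.

```python
import sqlite3

def find_violations(conn):
    """Join the schedule table against rules that were extracted from text."""
    return conn.execute("""
        SELECT s.stop, s.headway_minutes, r.source, r.max_headway_minutes
        FROM trolley_schedule AS s
        JOIN headway_rules AS r ON r.zone = s.zone
        WHERE s.headway_minutes > r.max_headway_minutes
        ORDER BY s.stop
    """).fetchall()

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Already structured: the trolley schedule.
    CREATE TABLE trolley_schedule (stop TEXT, zone TEXT, headway_minutes INTEGER);
    -- Originally text: rules assumed to have been turned into rows beforehand.
    CREATE TABLE headway_rules (source TEXT, zone TEXT, max_headway_minutes INTEGER);
    INSERT INTO trolley_schedule VALUES ('Stop A', 'inner', 10), ('Stop B', 'outer', 60);
    INSERT INTO headway_rules VALUES ('city regulation (made up)', 'inner', 20),
                                     ('city regulation (made up)', 'outer', 30);
""")

print(find_violations(conn))
```

Once everything is a table, a complaint such as "the trolley only comes once an hour" becomes an ordinary join that the database's own query optimizer can plan.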

50:16

So that's what we're working on. I think other people will have other ideas, but I think it's an extremely fertile area because people really want to do it. So that's number one. Number two, we talked earlier about agentic AI. The minute this becomes read-write, it's a distributed database problem. And you want atomicity, consistency, all that stuff. I think that's a very interesting area. That's pretty much what I'm working on now.
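
A minimal single-node sketch of the atomicity part of that point: a batch of writes issued by an agent either all commit or all roll back. It uses SQLite and a made-up `accounts` table, and `apply_agent_actions` is a hypothetical helper. The distributed case the speaker is referring to (several databases, several agents) needs more machinery than this and is not shown.

```python
import sqlite3

def apply_agent_actions(conn, actions):
    """Apply all (sql, params) writes atomically; return True if committed."""
    try:
        with conn:  # commits if the block succeeds, rolls back on an exception
            for sql, params in actions:
                conn.execute(sql, params)
        return True
    except sqlite3.Error:
        return False

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE accounts (
        name TEXT PRIMARY KEY,
        balance INTEGER CHECK (balance >= 0)
    );
    INSERT INTO accounts VALUES ('a', 50), ('b', 0);
""")

# The first write succeeds, the second violates the CHECK constraint,
# so the whole batch must be undone.
ok = apply_agent_actions(conn, [
    ("UPDATE accounts SET balance = balance + ? WHERE name = ?", (80, "b")),
    ("UPDATE accounts SET balance = balance - ? WHERE name = ?", (80, "a")),
])
balances = dict(conn.execute("SELECT name, balance FROM accounts"))
print(ok, balances)
```

Without the transaction, the first update would have stuck and the data would be left in a half-applied state.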

50:52

On that benchmark where it's zero percent right now, what percent is human? Like, if you took someone who really knows SQL, what would they score? Like the average human?

51:03

So once you disambiguate the text, a knowledgeable SQL programmer with the schema will get very high accuracy.

51:15

Okay, like ninety percent or something.

51:17

At least.

51:18

Okay, okay. Wow, I'm surprised that the LLM scores so low on this kind of benchmark. Maybe when this goes out someone who works at Anthropic will reach out to you or something and say

51:31

But I'd love to find out, 'cause, I mean, it's a terrific success story.

⁠¶ Technical book recommendations to learn databases

51:38

For people who wanna deeply understand databases and they're looking for some material to study, is there a book that you'd recommend that's a top technical book to learn databases?

51:50

And papers in the literature. I think Joey Hellerstein and I published what's called the Red Book, which is called Readings in Database Systems. It's now eight years old. I mean, I think that would be a great set of readings for eight years ago: popular papers from the literature.

52:19

If you could go back to yourself when you just graduated, give yourself some advice knowing what you know today, what would you say?

⁠¶ Advice for younger self

52:27

Back when I first took the job at Berkeley, without thinking about it much, we said, let's write a database system. And we knew nothing about databases, nothing about implementations. We were not skilled programmers like Bill Joy. So starting off doing something like that was really pretty crazy. And, you know, you put in the effort and you make stuff work and you learn along the way. And so I think the answer is: think outside the box. Think crazy thoughts.

53:08

And try and do them. And I think that's a good thing. To me, it's not at all obvious. The better question is: if you were starting out today, what would you major in? 'Cause I think, you know, computer science may well not be a growth industry going forward. And I'm not sure I would recommend eighteen-year-olds to major in computer science. I mean, I think health care and the building trades

53:41

are safe bets and everything else looks much riskier. If you're about to get your PhD and are trying to decide what to do, then I think life is pretty easy. You know, take the most prestigious job you can get and find a mentor who's willing to help you. And then pick some area that isn't... you know, like our stuff, which is called Rubicon, is definitely not going with the flow. So choose something that isn't going with the flow

54:22

and try and make it work. Both my wife and I said: follow your passion, somehow the money will work out. And I don't believe that for a minute, but I think that's what you have to tell your kids. And your grandkids.

54:38

But if you don't believe that then uh why do you have to tell them that?

54:42

My wife is a good example. So she has a master's degree in computer science, an undergraduate degree in computer science. And she wanted to be a teacher, you know, a K-12 teacher. And her parents said, you can't do that, it doesn't pay enough money. And so I think ever since that time she has regretted that decision. She wasn't passionate about doing computer science. It was simply a trade.

55:16

And so I think find something you're passionate about and, you know, you won't starve. You may not make a lot of money, but I think chances are you'll be happier than if you do something you're not passionate about. 'Cause I think a lot of people I know view their job as simply a job, and life is what happens between five PM and eight AM. I don't feel that way at all. I really like what I do. It wouldn't matter whether I made a lot of money or didn't.

⁠¶ Outro

55:52

Awesome. All right. Well, thank you so much for your time. Really appreciate it. Thank you for listening to the podcast. It's a passion project of mine that I really enjoyed building. Another passion project that I've been working on kind of in secret is building an ergonomic keyboard that I wish existed and I finally have a prototype, so I'd love to show you.

56:11

what we've built. It's ultra low profile and ergonomic, and I couldn't find anything like it on the market. So that's why we built it. I'll put a link to the keyboard in the description. You can take a look and learn more about the project there. We could definitely use your support. Also, if you have any feedback for me about the show, I'd love to hear it. Comments on YouTube have led to guests coming on like Ilya Grigorik

56:34

and David Fowler. I wasn't aware of them until someone dropped a comment. Also, feedback in the comments helped me learn to reduce the number of cliffhangers in the intros. So your comments definitely make a difference. Please keep letting me know what you'd like to see more of in the show, and I'll see you in the next episode.
