Hello and welcome to PostgresFM, a weekly show about all things PostgreSQL. I am Michael, founder of pgMustard, and I'm joined as usual by Nik, founder of Postgres.AI. Hey, Nik.
Hi, Michael. And let's welcome our guest.
Yeah, we are joined by a very special guest, Sugu, who is a co-creator of Vitess, co-founded PlanetScale, and is now at Supabase working on an exciting project called Multigres. So welcome, Sugu.
Thank you. Glad to be here.
All right. It's our pleasure. So it's my job to ask you a couple of the easy questions to start off. So what is Multigres and why are you working on it?
Multigres is a Vitess adaptation for Postgres. It's been on my mind for a long time, many years, and we've even had a few false starts with this project. And I guess there is a timing for everything and finally the timing has come. So I'm very excited to get started on this finally.
Yeah timing is an interesting one it feels like for many years I was looking at PlanetScale and Vitess specifically very jealously thinking you can promise the world you can promise this you know horizontal scaling with a relational database for OLTP and it's it you know all of the things that people want and we didn't really have a good answer for it in Postgres. But all of a sudden, in the last few months, it seems almost, there are now 3, 4 competing projects all doing it. So why now?
Why is it all happening now?
I would say, I think there is a timing for every new idea to gather momentum. Like the idea may be good, but the timing may be bad for one reason or another. But once the timing is good, then it is obvious that it is the right time. And it feels like now is that time. Like I would say, for example, PayPal, there have been previous attempts at moving money online, but it never worked. But PayPal just came at the right time and therefore took off during that time.
Many such examples where, you know, some ideas came too early and didn't take off. But for Postgres, this feels like it's the right time.
That's interesting. Let me reveal something. We met in 2018, I think, at Percona conference. And in 2019, we met again in Nutanix at the meetup. Go meetup, right. And discussions about Vitess for Postgres started.
And a couple of times we had a call, I tried to involve a couple of guys to and from from my understanding it never worked because You could not do it yourself being busy with other thing I guess my sequel related and the guys are looking at the complexity other guys looking at the complexity of related and, and didn't proceed. And actually, in 1 case, it was, they decided to build from scratch. It was, it was SPQR project. It's live and there's sharding for Postgres.
Yeah, Andrey Borodin.
Yeah, Andrey Borodin. Borodin. Yeah. Yeah. So and other folks also involved. And so for me, it was disappointing that it doesn't work. And at some point I saw the message in Vitess, I think, that we are not going to do it. So like, don't expect.
I feel so bad because I was so excited about doing it. And then I realized, oh my God, you know.
But now you started and last week PlanetScale decided to support Postgres. So what's happening? Like, I don't understand. Just, just right time. Right. Yeah. Enough, enough number of companies using Postgres, which really needed.
At least 1 horse will win. Yeah.
So yeah, it's great. But, but yeah, long, long story to this point.
Yeah. Sometimes when there are multiple projects there's kind of slight differences in philosophy or approach or trade-offs, willing to trade 1 thing off in relation to another. And I saw your plan, I really liked that you used, you mentioned building incrementally. So Vitess is a huge project, lots and lots of features, But I've heard you talk in the past about building it quite incrementally while at YouTube.
You know, it didn't start off as complex as it is now, obviously, and you did it kind of 1 feature at a time. And it sounds like that's the plan again with Multigres. Is that different to some of the other projects? Or what do you see as your philosophy and how it might differ slightly to some of the others?
I think my philosophy is that I would say I don't want to compromise on what the final project is going to look like. So any path that deviates me from hitting the target on the final project, I do not want to take. But any shortcuts that can get us early results, that we can do, but only those that ensure that it doesn't risk what we become eventually, which is a project that should feel native as if it was for Postgres, by Postgres kind of thing. I want it to be a pure Postgres project.
In this case, I must ask, in this case, are you sure you're still good with choice of Go language? Because sometimes we see projects which have really like sub millisecond latency on average, like 100 microseconds, for example, on average. And Go definitely will bring a few hundred microseconds of latency overhead. So usually it's not a big deal, but maybe in some cases it's some deal, right? Are you happy with Go?
Yeah, I mean- Because you were 1 of the first big Go language users building Vitess, as we know from various interviews and so on. Yeah. So it's still a good choice because now there is Rust, right?
Yes. I would say, by the way, when we started Go, compared to where Go is today, it was a nightmare.
Yeah, like 10 milliseconds.
Like, yeah, like 10 milliseconds or something round trip is what we were paying for. Those days we had hard disks, by the way. So that's another 3 to 5 milliseconds just within the database. But things are a lot better now. And at this point, the way I would put it is, Like the trade-offs are in favor of Go, let's put it that way. Mainly because there is a huge amount of existing code that you can just lift and port. And rewriting all of that in Rust is going to just delay us.
And at least in Vitess, it has proven itself to scale for, you know, like multi, hundreds of terabytes. And the latencies that people see are, they are not affected by a couple of hundred microseconds. So I think the, and plus there's, there's this inherent acceptance of this network latency for storage and stuff. And if you bring the storage local, then this actually wins out over anything that's there.
This is exactly what I wanted to mention. Yeah, I see PlanetScale right now, they came out with Postgres support, but no Vitess. And I'm very curious how much it will take for them to bring it and compete with you. It's another interesting question. But from past week, I see my main impression is like, main stake is on local storage.
And this is great because local storage for Postgres, we use it in some places where we struggle with EBS volumes and so on, but it's considered like not standard, not safe, blah blah blah. There are companies who use it, I know. I know myself, right? And it's great. Like today, for example, Patroni and since Postgres 12 we don't need to restart nodes when we have failover. So if we lose node, forget about node, we just failover and so on. And with local storage, not a big deal.
But now I expect with your plans to bring local storage, it will become more, like I expect it will be more and more popular, and that's great. So you shave off latency there and keep Go, which brings 200.
Good compromise. Effectively, it's a win because 1 network hop completely eliminates language level overheads.
Sounds good. Maybe your goal will improve additionally.
I wanted to go back, Sugu, you mentioned not wanting to compromise on feeling Postgres native. That feels to me like a really big statement, coming from Vitess being very MySQL specific, saying you want to be Postgres native feels like it adds a lot of work to the project. It feels like a lot to me. What does it mean? Is that about compatibility with the protocol? What does it mean to be Postgres native?
There's 2 answers. 1 is, why do we still think we can bring Vitess if it was built for MySQL? And how do you make it Postgres native? That's because of Vitess's history. For the longest time, Vitess was built not to be tied to MySQL. It was built to be a generic SQL 92 compliant database. That was actually our restriction for a very long time until the MySQL community said, you need to support all these MySQL features, otherwise we won't
move forward. CTEs, right? Common Table Expressions, with, right? It's I guess SQL 99 feature, not 92.
Yeah, I think the first parser I built was SQL 92, which is the most popular 1 that I know of. So that's answer 1. Answer 2 is more with the behavior of Postgres. What we want is completely mimic the Postgres behavior right from the beginning. Basically, in other words, we plan to actually copy or translate what we can from the Postgres engine itself, where that behavior is very specific to Postgres.
And the goal is not compatibility just at the communication layer, but even internally, possibly even, you know, recreating bugs at the risk of recreating bugs.
In this case, it's like, so there is Citus in the hands of Microsoft got everything open sourced. So before, resharding was only in paid version, now it's in free version, open source, so it's fully open source. And they put Postgres in between, so they don't need to mimic it, they can use it, right? And latency overhead is surprisingly low, we checked it. Well, it's C, but it's whole database in between. But it's sub millisecond, so it's acceptable as well.
I think it's half a millisecond or so in our experiments with simple select 1 or something, SELECT, just like. So don't you think it's like quite, like in comparison, it's quite a challenging point when you say I'm going to mimic a lot of stuff, but they just use Postgres in between?
Yeah, yeah, I think there's a difference in architecture between our approach between Multigres versus Citus. I think the main difference is it's a single coordinator for Citus. And there is some bottleneck issue with that. If you scale to extremely large workloads, like that goes into millions of QPS, hundreds of terabytes. So having that single bottleneck, I think would be a problem in the future. Whereas,
yeah. I understand you can put multiple, like you can have multiple of nodes there and also you can put a PgBouncer to mitigate the connection issues. So it can scale as well.
That's good. That's not, that's something I haven't known before. So, yeah, it's possible then that they may also have something that can viably scale for OLTP.
Yeah. So we're still exploring this and more benchmarks are needed. Actually, I'm surprised how few and not comprehensive benchmarks there are published for this.
Yeah, what I know of Citus is probably what you told me when we met. So I was about 5 years old.
Yeah, yeah. Another big difference and this is typically Nikolay's question is on the license front I think you've picked about as open a license as you could possibly pick which is not the case I think for many of the other projects. So that feels to me like a very Supabase thing to do and also in line with what Vitess did and that seems to me like a major advantage in terms of collaborating with others, other providers also adopting this or working with you to make it better.
Is like, what's your philosophy on that side of things?
My philosophy is, my metric for success is adoption. Yeah. And the only way to have a project be adopted is to have a good license, a license that people are confident to use. That has been the case from day 1 of Vitess. We actually first launched with BSD license, which is even more permissive than Apache. And then when we moved the project to CNCF, they said, oh, no, no, we need Apache license, which is why we converted it to that 1. So adoption has always been, yeah.
Why do they say it? Do you know?
Why CNCF wants Apache? I think Apache is a pretty good license. They just made it a policy. I mean, had we asked to keep the BSD license, they would have allowed us, but we didn't feel like it was a problem to move to Apache.
Yeah, and I remember you described like, when you did it, like in YouTube, you thought about external users. You need external users for this project to grow. I guess at Google, AGPL is not popular at all, we know.
Oh, banned.
Yeah, yeah, banned. And Citus is AGPL,
which is interesting.
Also, compared to Citus, I think you have chances to be compatible with RDS and other managed Postgres, to work on top of them, unlike Citus, which requires extension and so on, right?
Correct. Correct. Yes. This was actually something that we learned very early on, wanting to work. We made a 5 line change on MySQL just to make Vitess work initially. And it was such a nightmare to keep that binary up, to keep that build running. Fork. Yeah, to keep that fork alive. So we decided, no, it's like, we are going to make this work without a single line of code of change in MySQL. And that actually is what helped Vitess move forward.
Because people would come in with all kinds of configurations and say, make it work for this. So in this case, actually, we'll probably talk about the consensus part. That is 1 part that we think it is worth making a patch for Postgres, and we're going to work hard at getting that patch accepted. But I think what we will do is we will also make Multigres work for unpatched Postgres for those who want it that way. Except they will lose all the cool things about what consensus can give you.
I'm smiling because we have so many variations of Postgres these days. I would expect people coming not only with different configurations of Postgres, but also with various flavors like Aurora. We have a client who just migrated from regular Postgres to AlloyDB. Hello from Google Cloud. And they already sharded on application side, but imagine they could come to you and say, let's support AlloyDB now. It's almost Postgres, right? So yeah, so these things might happen as well.
Don't they claim full compatibility with Postgres? I thought you Not full,
but but most of it. It's, they did interesting stuff in memory like, like column storage and memory for tables. It's row storage on disk, but column storage in memory. But it looks like kind of Postgres, and we actually even had some questions answered from my team unexpectedly because we don't normally work with AlloyDB, but it looks like Postgres. So I could imagine the request, let's support AlloyDB as well.
Anyway, but my question, I remember featuring the test that we work with RDS and managed MySQL. Did those features, like has this feature survived?
No, actually later we decided that at least, we call it actually managed versus unmanaged. Managed meaning that means that Vitess manages its databases. And unmanaged means that the database is managed by somebody else, Vitess just acts as a proxy to serve queries. At some point of time, we realized that supporting both is diluting our efforts. And that's when we decided, okay, you know, it's not worth it to try and make this work with every available version that exists out there in the world.
And we said, okay, we will do only managed, which means that we will manage it ourselves. And if you want, we'll build the tools to migrate out of wherever you are, and we'll make it safe, we'll make it completely transparent. In other words, you deploy Vitess on both and then we'll migrate your data out without you having to change your application. But then Vitess can be more intentional about its features, more opinionated about how clusters are to be managed.
And we were able to commit to that because at that point, Vitess had become mature enough. People were completely trusting it. They actually preferred it over previous other managed solutions. So it wasn't a problem at that time.
Yeah, it's not a surprise. That's why I asked. Because you talk about local disks, backups, HA, a lot of stuff, right? And basically-
Yeah, 5 nines is like what Vitess shoots for. And like most big companies that run Vitess do operate at that level of availability.
So what's the plan for Multigres going to support? So only managed version, right?
Only, So it would be, yes, it would be only managed versions because I believe that the cluster management section of Vitess will port directly over to Postgres, which means that you will, once it goes live, it will be coming with batteries included on cluster management, which should hopefully be equal to or better than what is already out there. So I don't see a reason why we should try to make it work with everything that exists today.
So it means there is no, like, this is the same like with Citus, it doesn't work with RDS on 1 hand, but on another hand, we, like, I don't see it's only a sharding solution. It's everything, which is great. I mean, it's interesting, super interesting. So a lot of problems will be solved. And I expect even more managed services will be created. I don't know how it will continue, like in terms of Supabase, because of the very open license and so on.
But also I expect that many people will think, reconsider their opinion about managed. We had episode about this. This is my usual opinion about managed services because they hide superuser from you, they don't provide you access, it's hard to troubleshoot problems. In this case, if problems are solved in this and this gives you a new way to run Postgres, so if many problems solved, it's great. If you want
to do so,
for example.
Yeah, if you may not know, the initial focus of Vitess is actually solving these problems first. Sharding was actually came much later, like protecting the database, making sure that they survive abusive queries. Basically, that's what we built Vitess for initially. And the counterpart of taking away power from the user, like you said, is 1 is, well, we now know exactly how to make sure that the cluster doesn't go down. And 2, we countered that by building really, really good metrics.
So when there is an outage, you can very quickly 0 in on a query. If a query was responsible, Vitess will have it on top of it, like on the top of the line, saying that this is a query that's killing your database. So we built some really, really good metrics, and which should become available in Postgres, probably from day 1.
That's interesting. I didn't see, maybe I missed, I didn't see in readme you were writing right now in the project.
There's a last section called observability. I missed that. I need to revisit.
We're actually building something there as well for regular Postgres. I'm very curious, I will definitely revisit this, interesting, okay. So yeah, great. Yeah.
Also, I feel like this is quite a big difference on the, at least with Citus in terms of the philosophy or at least the origin story.
I feel like that started much more with OLAP-focused features in terms of distributed queries and parallelised across multiple shards and aggregations and columnar, and loads of things that really benefit OLAP workloads, whereas this has come from a philosophy of let's not worry about optimizing for those cross shard queries, this is much more let's optimize for the single shard very very short quick OLTP queries and let's make sure we protect it against abusive queries.
So it feels like it's coming, architecturally, it's coming from a very different place of what to optimize for first.
And historically, that was YouTube's problem. Surviving the onslaught of a huge number of QPS and making sure that 1 single QPS doesn't take the rest of the site down.
Yeah. Yeah. Perfect. Makes loads of sense. So actually, before we move on too much from that, what, where do you see sharding as becoming necessary? Like, is it just a case of a total number of QPS or like writes per second type thing. We've talked about sharding in the past and talked about kind of a max that you can scale up to perhaps in terms of writes, in terms of WAL. WAL per second I think was the metric we ended up discussing.
Are there other reasons or kind of bottlenecks that you see people getting to that sharding then kind of makes sense as it's now time or you should be considering at this point?
Well there is a physical limiting factor which is the single, if you max out your single machine, that is your Postgres server, then that's the end of your scale. There is nothing more to do beyond that. And there are a lot of people already hitting those limits from what I hear. And the sad part of it is they probably don't realize it.
As soon as that limit is hit, in order to protect the database, they actually push back on engineering features indirectly, saying that, you know, this data, can you make it smaller? Can you somehow lower the QPS? Or could you put it elsewhere?
Let's stop showing this number on front page.
Yeah, yeah. And it affects the entire organization. It's a very small, it's a very subtle change. But the entire organization slows down. Like we experienced that at YouTube when we were at our limits. We like the default answer from a DBA was always no. We used to even kid, no? The answer is no. What's your question? And when we started sharding, it took us a while to change our answer to say that, you know, bring your data I like it we can
scale as much as you want. Believe it or not, we went from 16 shards to 256 in no time. And the number of features in YouTube exploded during that time because there was just no restriction on how much data you wanted to put. And coming back here, the upper, like reaching the limit of a machine is actually something you should never do. It's very unhealthy for a large number of reasons, like even if there is a crash, like how long is it going to take to recover?
Like the thing that we found out is once you can shard, it actually makes sense to keep your instances way, way small. So we used to run like 20 to 50 instances of MySQLs per machine. And that was a lot healthier than running big ones. For a couple of reasons. One is, if you try to run so many threads within a process, that itself is a huge overhead for the machine. And it doesn't do that very efficiently, whereas it does it better if you run it as smaller instances, I think.
It's more of a feeling, but I don't know if there is proof or whatever. But like Go, for example, wouldn't do well. Go, I think, beyond a certain memory size or beyond a certain number of goroutines would start to slow down, would not be as efficient as it was before. Mainly because the data structures to keep track of those threads and stuff, they are growing bigger. But more importantly, on an outage, a smaller number of users are affected.
If you have 256 shards and 1 shard goes down, it is 1 256th of the outage, right? And so the site looks a lot healthier, behaves a lot healthier. There's less panic if a shard goes down. So people are, you know, a lot less stressed managing such instances.
Right, I wanted to mention that this discussion was with Lev Kokotov, PgDog, which is a competitor as well, a new sharding tool written in Rust. And we discussed that there is a big limitation when Postgres... So replication, physical replication has limitation because it's single threaded process on standby. If we reach like somewhat like 150, 200, 250 megabytes per second, depending on core and also number of, not number, structure of tuples and so on.
We hit 1 single CPU, 100%, 1 process, and it becomes bottleneck and replica standbys, they start lagging. It's a big nightmare because you usually by that time, but that's like at high scale you have multiple replicas and you will float a lot of read only queries there. And then you don't want, don't know what to do except as you described let's remove this feature and slow down development and this is not not fun at all.
So what I'm trying to do here is trying to move us to discussion of replication not physical but logical. I noticed your plans involve heavily logical replication in Postgres. But we know it has, it's improving every year. So like when we started the discussion 5, 6 years ago, it was much worse. Right now it's much better. Many things are solved, improved, but many things still are not solved. For example, schema changes are not replicated, right?
And sequences, there is work in progress, but if it's committed, it will be only in Postgres 19, not in 18. So it means like long wait for many people. So what are your plans here? Are you ready to deal with problems like this? Pure Postgres problems, you know?
Yeah, yeah. If you ask me, I think the Postgres problems are less than what we faced with MySQL.
I wanted to involve physical as well, because this great talk by Kukushkin, which describes very bad anomalies when data loss happens and so on. Let's talk about this.
Yeah, we should talk about both. I think overall the Postgres design is cleaner, is what I would say. Like you can feel that from things. Like the design somewhat supersedes performance, which I think in my case is a good trade-off, especially for sharded solutions, because some of these design decisions affect you only if you are running at, you know,
if
you're pushing it really, really hard, then these design decisions affect you. But if your instances are small to medium size you won't even know and then you benefit from the fact that these designs are good. So I actually like the approaches that Postgres has taken with respect to the WAL as well as logical replication. And by the way, I think logical replication theoretically can do better things than what it does now, and we should push those limits.
But yes, I think the issue about schema not being as part of logical replication, it feels like that is also a theoretically solvable problem except that people haven't gotten to it. I think there are issues about the transactionality of DDLs which doesn't even exist in MySQL. So at least in Postgres, it exists in most cases. There are only a few cases where it is not.
And for such things like a proxy layer, like Multigres or Vitess, it's a no problem for them because you should say, oh, OK, this particular construct is not transactionally safe, well, then we'll even prevent you from doing it transactionally because we don't want you to get the wrong impression. We'll let you do it non-transactionally, and we know that it's non-transactional, and therefore, we can do something about it, right?
Those abilities don't exist previously, But eventually, if it becomes transactional, then we can actually include it in a transaction.
Yeah, just for those who are curious, because there is a concept, all DDL in Postgres is transactional. Here we talk about things like creating this concurrently because we had discussion offline about this before recording. So yeah, creating this concurrently can be an issue, but you obviously have a solution for it. That's great.
The way I would say it is we have dealt with much worse at with MySQL. So this is much better than what was there then. Sounds good. Yeah.
Okay, good. And let's talk about physical replication because I saw you are going to use it. So each shard is going to have a standby. And yeah, so with quorum commit, right? So like we don't lose data because data is on local disks by default, as I understand, like ephemeral in cloud. So we want to be sure that data is written to at least 2 places, for example, or 3, right? Configurable, of course.
Here, this interesting talk by Kukushkin, he presented it recently at an online conference by Microsoft, describing that synchronous replication in Postgres is not what you think.
Correct. So, right. Correct.
What are you going to do about this?
Well, I was just chatting with someone and essentially synchronous replication is theoretically impure when it comes to consensus. I think it's provable that if you use synchronous replication then you will hit corner cases that you can't handle. And the most egregious situation is that it can lead to some level of definitely split brain, but in some cases it can even lead to downstream issues.
Because it's a leaky abstraction, it's a leaky implementation, there are situations where you can see a transaction and think that it is committed. And later, the system may fail. And in the recovery, may choose not to propagate that transaction, or may not be able to, and it's going to discard that transaction and move forward.
But this is the same as with asynchronous replication, it's the same, we're just losing some data, right?
Exactly, yeah, it is the same as asynchronous replication, Yes.
It's not split-brain, it's just data loss.
It's data loss, correct. It's data loss, but for example, if you are running like a logical replication off of 1 of those, then that logical replication may actually propagate it into an external system and now you have corrupted downstream systems that don't match the source. So those risks exist and at Vitascale people see this all the time, for example, and they have to build defenses against this and it's very, very painful.
It's not impossible, but it's very hard to reason about failures when a system is behaving like this. So that is the problem with synchronous replication. And this is the reason why I feel like it may be worth patching Postgres. Because there is no existing primitive in Postgres on which you can build a clean consensus system. I feel like that primitive should be in Postgres.
I now remember from Kukushkin's talk, there is another case when a primary transaction looks like not committed because we wait a replica, but the replica somehow is like lost connection or something. And then we suddenly, and client thinks it's not committed because commit was not returned. But then it suddenly looks committed. It's like not data loss, it's data and loss
somehow. Boom.
Like suddenly, and this is not all right as well. And when you think about consensus, I think you are very good describing these things like concept and distributed systems. It feels like if you have 2 places to write, definitely there will be corner cases where something will go off if you don't use two-phase commit, right?
Correct.
And here we have this. But when you say you're going to bring something with consensus, it immediately triggers my memory how difficult it is and how many attempts it was made to bring pure HA to Postgres, just to have auto-failover. All of them failed. All of them. And they say, let's be outside of Postgres. So here maybe it will be similar complexity to bring these 2 inside Postgres. Is it possible to build it outside this thing?
It is not possible to build it outside. Because if it was, that is what I would have proposed. The reason is because building it outside is like putting band-aid over the problem. It will not solve the core problem. The core problem is you've committed data in 1 place, and if that data can be lost, and there is a gap when the data can be read by someone, causes is the root cause of the problem. That is unsolvable. Even if you later, later raft may choose to honor that transaction or not.
And that becomes ambiguous, but we don't want ambiguity, right?
What if we created something extension to commit, like make extendable to talk to some external stuff to understand that can be finalized or something? I don't know, consensus. Correct. Sugu: Correct. So if you, essentially, if you reason through about this, your answer will become a two-phase system. Yeah.
Without a two-phase system.
Which scares me. But as I told you, a two-phase commit in the Postgres OTP world is considered really, really slow and the rule is let's just avoid it. I see your enthusiasm and I think, I couldn't find good benchmarks, 0, published.
This is not two-phase commit, by the way. This is two-phase synchronization.
I understand. It's not, in two-phase commit, it's like more communication happens. I understand this.
So two-phase synchronization, the network overhead is exactly the same as full sync, because the transaction completes on the first sync. Later it sends an acknowledgement saying that yes, I'm happy, you can commit it, but the transaction completes on the first sync, so it will be no worse than full sync.
Yeah, compared to current situation when primary commit happens, but there is a lock which is being held until- Correct,
it is the same cost.
Yeah, it is the same cost. We wait until standby, And for user it looks like when lock is released, it thinks, okay, commit happens. But the problem with this design, if, for example, standby restarts, lock is automatically released and commit is here and it's unexpected. This is data on loss, right? So you are saying we can redesign this, network cost will be the same, but it will be pure. Yeah, that's great, I like this. I'm just thinking, will it be acceptable?
Because bringing autofailover is not acceptable. There was another attempt last year from someone and with great enthusiasm, let's bring autofailover inside Postgres. Actually, maybe you know this guy, it was Konstantin Osipov who built a Tarantool database system. It's like memory. He was X MySQL in performance after Zaitsev. Zaitsev was X MySQL then Osipov was MySQL. And so Konstantin came to Postgres saying let's build this.
Great enthusiasm, but it's extremely hard to convince such big thing to be in core. So if you say it's not big thing, this already...
So I can, it's, I'll probably have to explain it in a bigger blog, but essentially, now that I've studied the problem well enough, the reason why it's hard to implement consensus in Postgres with the write-ahead log is because they are trying to make Raft work with the write-ahead log. And there are limitations about how the Raft, how commits work within Postgres that mismatch with how Raft wants commits to be processed. And that mismatch, so far, I have not found a way to work around that.
But, the variation of Raft can be made to work.
Interesting.
The way the, I don't know if you know about my blog series that I wrote when I was at PlanetScale. It's an 8 part blog series about generalized consensus. People think that Raft is the only way to do consensus, But it is 1 of a thousand ways to do consensus. So that blog series explains the rules you must follow if you have to build a consensus system. And if you follow those rules, you will get all the properties that are required by a consensus system.
So This 1 that I have, the design that I have in mind, follows those rules, and I am able to prove to myself that it will work, but it's not Raft. It's going to be similar to Raft. I think we can make Raft also work, but that may require changes to the write-ahead log, which I don't want to do. So this system I want to implement without changes to the write-ahead log as possibly a plugin.
Yeah. Well, now I understand why you could, like another reason you cannot take Patroni not only because it's Python versus Postgres But also because you need another version of consensus algorithm
Correct, correct.
And among those hundred Thousand millions of ways.
By the way Patroni can take this and use it because it's very close to how FullSync works.
Good. Okay.
I was just thinking, watching Alexander Kukushkin's talk, he said a couple of things that were interesting. 1 is that he was surprised that this hasn't happened upstream. So you definitely have an ally in Kukushkin in terms of trying to get this upstreamed, but also that he thinks every cloud provider has had to patch Postgres to, in order to offer their own high availability products with Postgres, each 1 has had to patch it.
And they are having to, you mentioned earlier today how painful it is to maintain even a small patch on something.
I don't think it's every, I think it's Microsoft for sure, knowing where Kukushkin works at. Yeah. But maybe more, not every. Yeah,
all I mean is that there are a growing number of committers working for hyperscale and hosting providers. So I suspect you might have more more optimism for consensus or at least a few allies in terms of getting something committed upstream so I personally think there might be growing chance of this happening even if it hasn't in the past for some reason.
Yeah, I feel like also being new to the Postgres community, I am feeling a little you know, shy about proposing this upfront. So what I am thinking of doing is at least show it working, show it working at scale, have people gain confidence that this is actually efficient and performant and safe.
So I also plan to, I don't know if you've heard of FlexPaxos, which is actually, in my opinion, a better way to handle durability, because today's cloud environments are a lot more complex, and a simple majority-based quorum is actually very hard to configure if your needs are different, which actually FlexPaxos does handle. It's actually something I'm a co-inventor of, of some sort. And this blog post...
I only heard the name. That's it. So I like... Yeah. Can you explain a little bit? Not super difficult.
Oh sure. Yeah. So the... Actually, let me explain what is the reason why. So FlexPaxos was published a few years ago, about 7 years ago or so. And you'll see my name mentioned there, which I feel very proud of. And the block series that I wrote is actually a refinement of FlexPaxos. And that actually explains better why these things are important. The reason why it's important is because people think of consensus as either a bunch of nodes agreeing on a value, right?
That's what you commonly hear. Or you think of like reaching majority, reaching quorum is important. But the true reason for consensus is just durability. When you ask for a commit and the system says, yes, I have it, you don't want the system to lose it. So instead of defining quorum and all those things, define the problem as it is and solve it the way it was asked for is, how do you solve the problem of durability in a transactional system?
And the simple answer to that is, make sure your data is elsewhere.
Yeah, I love how simple you make it.
Yeah, if you make sure your data is elsewhere, if there is a failure, your challenge is to find out where the data is and continue from where it went. And that is all that consensus is about. And then all you have to do is have rules to make sure that these properties are preserved. And Raft is only just 1 way to do this. So if you look at this problem, if you approach this problem this way, you could ask for something like, I just want my data to go across availability zones.
As long as it's in a different availability zone, I'm happy. Or you can say, I want the data to be across regions, or I want at least 2 other nodes to have it, right? So that's your Durability requirement. But you could say, I want 2 other nodes to have it, but I want to run 7 nodes in the system, or 20 nodes. It sounds outrageous but it is actually very practical.
In YouTube we had 70 replicas but only 1 1 node the data have to be in 1 other node for it to be durable and we were able to run this at scale. The trade-off is that when you do a failover you have a wild goose chase looking for the Transaction that went elsewhere but you find it and then you continue. And so that is basically the principle of this consensus system. And that's what I want to bring in Multigres.
While making sure that the people that want simpler majority based quorums to also work using the same primitives.
Just quickly to clarify, when you say the wild goose chase, is that because it was 1 of 70, but different transactions could have gone to different of the 70 or it's always the same 1, but you have to know which 1 that is
No, it could be anyone the way we we ran it the way we ran it It is 1 it could not be at any given point of time There's only 1 Primary which means that there is only 1 Transaction that you have to chase down.
The latest 1.
The latest 1, yes.
Yeah, makes sense.
Yeah. There was a time when we found that Transaction in a different country. So we had to bring it back home and then continue. It was once it happened in whatever the 10 years that we ran.
It's interesting that talking about sharding, we need to discuss these things, which are not sharding per se, right? So it's about HA inside each shard, right?
It's actually like what I would call healthy database principles, which is, I think, somewhat more important than sharding.
Yeah. Yeah. Yeah.
It is true that it is to do with it being a distributed system, right? And that is because it's sharded, no?
I think they are orthogonal. Okay. Yeah, I think sharding, like you can do sharding on anything, Right? Like you can do sharding on RDS. Somebody asked me, like, what about Neon? I said, you can do sharding on Neon too. It's like you put a proxy in front and then it does the sharding. But the problem with sharding is it is not just a proxy. That's what people think of it when they first think of the problem because they haven't looked ahead. Once you have sharded, you have to evolve.
You start with 4 shards, then you have to go to 8 shards. And the evolution is not linear this way. Actually, it's an exponential growth because 4, 8, 16. But at some point of time, it changes because your sharding scheme itself will not scale. Like if you, for example, are in a multi-tenant workload and you say shard by tenant, at some point of time, a single tenant is going to be so big that they won't fit in an instance. And that we have seen.
And at that time, we had to change the sharding scheme. So how do you change the sharding scheme? Slack had to go through this, where they were a tenant-based sharding scheme, and a single tenant just became too big. They couldn't even fit 1 tenant in 1 shard. So they had to change their sharding scheme to be user-based. They actually talk about it in 1 of their presentations.
And Vitess has the tools to do these changes without actually you incurring any kind of downtime, which again, Multigres will have. I keep talking about Vitess, but these are all things that Multigres will have, which means that you are future-proofed when it comes to. And these are extremely difficult problems to solve. Because when you're talking about changing the sharding scheme, you are basically looking at a full crisscross replication of data. And across data centers.
Yeah, and also, like I know, it has version 3, right? It was when you changed, basically created a new planner, right, to deal with arbitrary query and understand how to route it properly and where to execute it, which is it a single shard or it's global or it's like different shards and so on. Like what's, are you going to do the same with Postgres? I think yes. Right.
So that's the part that I'm still on the fence. By the way, the v3 now has become Gen 4. It's actually much better than what it was when I built it. The problem with v3 is that it is still not a full query. It doesn't support the full query set yet. It controls supports like 90% of it, I would say, but not everything. On the temptation side, there's the Postgres engine that supports everything. So I'm still debating how do we bring the 2 together?
If it was possible to do a simple git merge, I would do it. But obviously this 1 is in C, this was in Go. And the part that I'm trying to figure out is how much of the sharding bias exists in the current engine in Vitess. If we brought the Postgres engine as is, without that sharding bias, would this engine work well for a sharded system?
So this looks like CytoSort, if you bring up the whole Postgres. There's a library, libpg_query, by Lukas Fittl, which basically takes the parser part of Postgres and brings it to... And there is a Go version of it as well.
Oh, libpg_query you said?
Yeah, yeah, yeah, I will send it to you. So many, many, many, many systems use it when we need to parse. Yeah, 1 day I told it to SPQR guys about this, and eventually I think they ended up using it. I think Lev Kupotov uses it, Pijadok also uses it, the Rust version.
Is it like 100% Postgres compatible?
Well, it's based on Postgres source code. So parser is truly broad, but it's not whole Postgres. So maybe you should consider this. Right? If you're thinking about parsing, I mean, queries and so on, but I'm very curious, I also noticed you mentioned routing, like read-only queries routed to replicas automatically. And this concerns me a lot because many Postgres developers, I mean, who use it, users, they use PL/pgSQL functions, all PL.. Python functions, anything, which are writing data.
And the standard way to call function is select, select function name. So understanding that this function is actually writing data is not trivial. Right, and we know in pgpool, which I, all my life I just avoid. I touched it a few times, decided not to use at all because it tries to do a lot of stuff at once and always considered like, no, I'm not going to use this tool. So pgpool solves it like saying okay like let's build a list of functions which are actually writing or something like this.
So it's like patch approach you know workaround approach. So this is going to be a huge challenge, I think, if you...
Yeah, yeah.
For automatic routing, it's a huge challenge.
Yeah, I think this is the reason why I think it is important to have the full Postgres functional engine in Multigres, because then these things will work as intended is my hope. What we will have to do is add our own sharded understanding to these functions and figure out what does it mean to call this function, right? If this function is going to call out to a different shard, then that interpretation has to happen at the higher level.
But if that function is going to be accessing something within shard, then push the whole thing down and just let the push the whole SELECT along with the function down and let the individual Postgres instance do it.
Yeah, but how to understand? Function can contain another function and so on. It can be so complex in some cases. It's also funny that there is still there is actually Google Cloud, CloudSQL supports it, kind of language, it's not language, called PL/Proxy, which is sharding. For those who have workload only in functions, this can route to proper shard. It was created at Skype. It still exists, but not super popular these days. But there is a big requirement to write everything in functions.
In your case, if you continue, like, I would expect in some case you would say, okay, don't use functions. But I'm afraid it's not possible. Like, I love functions. Actually, Supabase loves functions because they use Postgres, right? Postgres, like, it provokes you to use functions.
Oh, really? Oh, yeah, yeah. Actually, I saw that. Yeah. So in Vitess, I feel like this was a mistake that we made, which is if we felt that anything, any functionality that you used didn't make sense. Like if I were you, I wouldn't do this, right? Because it's not, it won't scale. It's a bad idea. You know, it's like those we didn't support. We didn't want to support. You said, no, we will never do this for you because we'll not give you a rope long enough to hang yourself.
Basically, that was our philosophy. But in Multigres, we want to move away from that, which means that if you want to call a function that writes, have at it.
Just put a comment, it's going to write something.
Yeah. If you want a function that calls a function that writes have at it. If we cannot, like the worst case scenario for us is we don't know how to optimize this. And what we will do is we'll execute the whole thing on the VT gate side.
There's another, I remember there is interesting solution in, I think in AWS RDS proxy, which when they, as I know it, maybe I'm wrong, when they needed to create a global, it's called I think Aurora Global Database maybe or something like this. So there is a secondary cluster living in a different region and it's purely read-only, but it accepts writes. And when write comes, this proxy routes it to original primary, waits until this write is propagated back to replica and responds.
Oh, wow. Yeah. I don't think that feature, I don't know how even that feature can be supported.
No, no, it's just some exotic, interesting solution I just wanted to share. Maybe, you know, if we, for example, if you originally route a write to a replica, then somehow in Postgres you understand, oh, it's actually a write.
Okay, Yeah, so maybe 100% is theoretically impossible to support.
Yes, it's okay. It's super exotic. Okay.
But I think if people are doing things like that, it means that they are trying to solve a problem that doesn't have a good existing solution.
Exactly.
So if we can find a good existing solution, I think they'll be very happy to adopt that instead of whatever they were trying to do.
Well, this is just multi-region setup. I saw not 1 CTO which wanted it for like dealing with Postgres like say we are still single region we need to be to be present in multiple regions in case if 1 AWS region is down right it's also okay yeah so availability and business characteristics so yeah anyway okay yeah it's it's exotic but but interesting still yeah So
you've got a lot of work ahead of you, Sugu.
I feel like we barely covered like 1 of so many topics.
Let's touch something else. Maybe it's a very long episode, but it's worth it, I think. It's super interesting. What else? What else?
I think the other interesting 1 would be 2PC and isolation.
Isolation from what?
Like the 1 issue with the sharded solution is that, again, this is a philosophy for the longest time in Vitess we didn't allow 2PC. You said you shard it in such a way that you do not have distributed transactions. And many people lived with that. And some people actually did
not adopt Vitess. Let me interrupt you here, because this is like the most, the best feature I liked about Vitess, it's this materialized feature when data is brought.
Oh yeah, materialized is another topic. That's actually a better topic than 2PC.
Well, yeah, because this is your strength, right? So this is like, I love this idea, basically distributed materialized view, which is incrementally updated.
Yes, yes, yes.
That's great. We need it in Postgres ecosystem. Just maybe as a separate project, even, you know, like we lack it everywhere. So yeah, this is how you avoid distributed transactions basically, right?
No, this is 1 way to avoid it. 1 way, yeah. Like there are 2 use cases where materialized views are super awesome. You know a table that has multiple foreign keys, that has foreign keys to 2 different tables is the classic use case, where the example I gave was a user that's producing music and listeners that are listening to it, which means that the row where I listen to this music has 2 foreign keys, 1 to the creator and 1 to the listener. And where should this row live?
Should this row live with the creator or should this row live with the listener is a classic problem. And there is no perfect solution for it. It depends on your traffic pattern. But what if the traffic pattern is 1 way in 1 case and another way in another case? There is no perfect solution.
So this is where in Multigres what you could do is you say okay in most cases this row should live with the creator let's assume that right so then you say this row lives with the creator and we shard it this way, which means that if you join the creator table with this event row, it'll be all local joins. But if you join the listeners table with this event row, it's a huge cross shard while this chases.
So in this case, you can say materialize this table using a different foreign key, which is the listeners foreign key into the same sharded database as a different table name. And now you can do a local join with the listener and this event table. And this materialized view is near real-time, basically the time it takes to read the WAL and apply it. And this can go on forever. And this is actually also the secret behind resharding, changing the sharding key.
This is essentially a table that has real-time present with 2 sharding keys. If you say, oh, at some point of time, this is more authoritative, All you have to do is swap this out. Make 1 the source, the other is a target. You've changed your sharding key. Actually, the change sharding key works exactly like this for a table.
Distributed denormalization technique. This is what
it is.
Yeah, yeah, yeah, exactly. And the other use case is when you reshard, you leave behind smaller tables, reference tables, we call them. And they have to live in a different database because they are too small and even if you shard them, they won't shard well. Like if you have a billion rows in 1 table and a thousand rows in a smaller table, you don't want to shard your thousand row table. And there's no benefit to sharding that either. So it's better that that table lives in a separate database.
But if you want to join between these 2, how do you do it, right? The only way you join is join at the application level, read 1 and then read the other. And so at high QPS, it's not efficient. So what we can do is actually materialize this table on all the shards as reference, and then all joins become local.
Yeah. And you definitely need logical replication for all of this. So this is where we started, like challenges with logical replication.
Yeah, yeah. You do have the, so the reason why 2PC is still important, because there are trade-offs to this solution, which is, there's a lag. So it takes time for the things to go through the WAL and come to the other side. Whereas 2PC is essentially, basically the transaction system itself trying to complete a transaction, which means that it will handle cases where there are race conditions, right?
If somebody else tries to change that row elsewhere while this row is being changed, 2PC will block that from happening, whereas in the other case, you cannot do that.
Yeah, if it's just user views of some video like on YouTube, we can say, okay, there will be some lag, probably some small mistake, it's fine. But if it's financial data, it should be 2PC, but latency of write will be high, throughput will be low, right? This is...
I actually want to... I read the design of... Which is again, by the way, very elegant API, and I assume... I can see the implementation on the API and I Don't think we will see performance problems with 2PC.
We need to benchmark it
We will be we will benchmark it, but I will be very surprised. I think there are some isolation issues that we may not have time to go through today because it's a long topic. Like the way 2PC is currently supported in Postgres, I think it'll perform really well.
Isolation issues when we sit in read committed and use 2PC. You mean this, right? Not in repeatable read. In default.
Yeah, read committed I think will be, there will be some tradeoffs on read committed, but not the kind that will affect most applications. MVCC will be the bigger challenge. But from what I hear is most people don't use, like the most common use case is read committed.
Of course, as default. Yeah, it's faster.
It's a default. Yeah. So people won't even, yeah, I don't, I think this.
They're already on some, they're already in bad state. It won't be worse.
It won't be worse. Yes.
Yes. Yeah. So to PC, of course depends on the distance between nodes, right? A lot, like if they are far, we need to talk like client is somewhere, 2 nodes are somewhere, and if it's different availability zones, it depends, right? So this distance is a big contributor to latency, right? Network. Because there are 4 communication messages that are needed. So.
Correct, correct. Actually, you can, I have actually the mathematics for it? But you're probably right. It's about double the number of round trips.
Yeah, if we put everything in 1 AZ, client and both primaries, we are fine. But in reality, they will be in different places. And if it's different regions, it's nightmare, of course. But at
least it's- Yeah, the 2PC is not done by the client, by the way. The 2PC would be done by the VTgate, which would be- It should have the nodes nearby.
Ah, should have, In 1 availability
zone? Unless you did some crazy configuration, they should be pretty close to each other.
Pretty close means still different availability zones in general case, right?
No, no. The availability zone is only for durability.
Okay.
For replica level. But a 2PC, you're coordinating between 2 primaries, which may actually be on the same machine for all you care. It's not... Well,
Imagine the real practical case. We have sharded schema, we have... Every shard has primary and a couple of standbys, right?
Correct. Correct.
So are you saying that we need to keep primaries all in the same availability zone?
That's usually how things are.
Ah, interesting. I didn't know about this. By the way, I wanted to rattle a little bit about PlanetScale benchmarks they published last week. They compared to everyone. They compared to everyone. I wanted just, it's not like, I'm sorry, I will take a little bit of time. They compared to everyone, and they just published like PlanetScale versus something. And this very topic, they on charts, we have PlanetScale in single AZ, everything client and server in the same AZ.
And line, which is like normal case, client is different AZ. And line with same AZ is active, line is normal, not active. And others like Neon, Supabase, everyone, It's different. And of course, PlanetScale looks really well, because by default, they presented numbers for the same availability zone. And below the chart, everything is explained, but who reads it, right? So people just see the graphs. And you can unselect, select proper PlanetScale numbers and see that they are similar.
But by default, the same as that number is chosen. And this is like benchmarking, you know, like.
Well, I think if you look at the architecture, like, even fair comparison, PlanetScale should come out ahead, like, like the performance of a local disk, of course, should.
But this was SELECT 1 disks.
Disks, SELECT 1 is not a benchmark.
Well, it was part of benchmark. It's just checking query path, but it fully depends on where client and server are located. So what's the point showing better numbers just putting client closer? I don't like that part of that benchmark.
Okay, yeah. I saw the publications, But I didn't go into the details because I thought, well, it has to be faster because it's on local disks. So why?
Yeah, for data which is not fully in cache, of course. Yeah, local disks are amazing versus ABS volumes.
You're right. Yeah, if the data is in cache, then there is, yeah, then all performance, the performance of everything would be the same.
Yeah, well, I wanted to share this. I was annoyed about this, but I fully support the idea of local disks. It's great. I think we need to use them more in more systems.
I think, I wouldn't be surprised if you reached out to PlanetScale. They may be willing to, like if you want to run your benchmarks, they may be willing to give you the...
Yeah, there is source code published and in general benchmarks look great, the idea is great. And actually, with local disks, the only concern is usually the limit, hard limit. We cannot have more space. But if you have sharded solution, there is no such thing.
Correct. But speaking about the hard limit, today's SSDs, you can buy 100 plus terabytes size SSD, single SSD, and you can probably stack them up on
1 next to the other. Yeah, yeah. But in cloud there are limitations, in cloud there are limitations, for instance.
Okay, yeah. I saw AWS SSD over 100 terabytes.
In Google Cloud, 72 terabytes is hard limit for Z3 metal. And I didn't see more. So 72 terabytes, it's a lot, but sometimes it's already notable.
Yeah, at that limit, your storage is not the limit. You will not be able to run a database of that size on a single machine. Why not? You will max out your CPU.
We have cases, CPU. Well, again, the problem will be replication. If we talk about single node, we can
have- Or replication.
360 cores in AWS, almost 1,000 cores already for Xeon scalable generation 5 or something. So hundreds of cores. The problem is Postgres design. If replication, physical replication was multi-threaded, we could scale more.
By the way, replication is not the only problem. Backup recovery. If your machine goes down, you're down for hours. Recovery is something of that size.
Yeah. Not many hours. Someone in my team recently saw 17 or 19 terabytes per hour for recovery with pgBackRest or WAL-G. In AWS. I was like, my jaw dropped. On our podcast I usually say 1 terabyte.
Can you repeat that? 17 or 19
terabytes per hour? 17 terabytes per hour. With local disks. So this is important. With EBS, it's impossible.
It's good to know.
Yeah. With Michael, I was always saying 1 terabyte is like what you should achieve. If it's below, it's bad. Now I'm thinking 1 terabyte is already... Yeah, yeah. So with EBS volumes, we managed to achieve, I think, 7 terabytes per hour to restore with WAL-G. And that's great. But there's
a danger there. You could become a noisy neighbor. So we actually built throttling in our restore just to prevent noisy neighbors.
With local disks, you lose ability to use BS snapshots, cloud disk snapshots.
Correct, correct, yeah.
That's what you lose, unfortunately. And they're great, and people enjoy them more and more. Yeah, so I agree, and as I remember, for 17 terabytes it was 128 threads of 4g or pgBackRest I don't remember local disks I need to update
my technology is changing too fast.
Exactly, yeah. And hundreds of cores, terabytes of RAM already, right? Like,
yeah. Yeah, yeah. But it does go straight to your point of the smaller they are, the faster you can recover still.
You don't hit some of these limits. These systems were not designed with these types of limits in mind. Some weird data structure, suddenly the limit of this is only 100 items. And you hit those limits and then you're stuck. Recently, Metronome had an issue. They had that, the routed. MultiXact. The MultiXact thing, which nobody has ever run before, but they hit that problem.
Yeah, we saw many problems also when you're on the edge. And it pushes forward Postgres actually sometimes, but if you want to be on the safe side, but I really like the, like, it's kind of resilience characteristics when, even if it's down, it's only a small part of your system is down. That's great.
Correct, Nikolay: yeah. That's mature architecture already. That actually makes it easier to achieve 5 nines uptime. Because that's the way you calculate. Like if only 1 node is down, you divide it by the number of users
being affected. Budget.
Downtime budget. Yeah, budget. Yeah.
That's good. Cool. I think it's maybe 1 of the longest episodes we had. Enjoyed it. Oh my God. I enjoyed it. I hope we will continue a discussion of issues with logical, for example, and so on, and maybe if things will be improved and so on. Looking forward to test POC once you have it. Thank you so much for coming. Sugu: I am so excited. Yeah, thank you.
Yeah, is there any last things you wanted to add or anything you wanted help from people on?
I would say it feels like nothing is happening on the repository except me pushing, you know, a few pushes, a few things, changes, but a huge amount of work is happening in the background. Like some of these design work about consensus are all like almost ready to go and there's also hiring going on there are people coming on board very soon so you will see this snowball It's a very tiny snowball right now, but it's going to get very big as momentum builds up. So pretty excited about that.
We may still have 1 or 2 spots open to add to the team, but it's filling up fast. So If any of you are very familiar, this is a very high bar to contribute to a multi-base. You have to understand consensus, you have to understand query processing. But if there are people who want to contribute, we are still looking for maybe 1 or 2 people and also on the orchestration side and the Kubernetes side of things. Yeah, so that's...
Do you mind a small joke in the end? Just not to finish on...
I do not mind at all. Let's hear it.
Yeah, so I know what you're doing. You're writing a lot of markdown right now and then you will feed it to AI.
I wish! Oh my god. I almost hope that day never comes but It is so fun working on this project, creating it. Why do I want to give it to an AI to do it, you know?
Okay. Good. Thank you. I enjoyed it a lot.
Yeah. Yeah. Thank you so much for joining us. It's great to have you as part of the Postgres community now and I'm excited to see what you get up to.
And me too.
Wonderful, thanks so much.
