PgDog

Nikolay

00:00

Hello, hello. This is Postgres.FM. My name is Nikolay, Postgres.AI, and as usual, co-host is Michael, pgMustard. Hi, Michael.

Michael

00:09

Hello, Nikolay.

Nikolay

00:11

And we have a super interesting guest today, Lev Kokotov. Hi, thank you for coming.

Lev

00:17

Thanks. Hi, Nikolay. Glad to be here.

Nikolay

00:19

Yeah, and the reason we invited you is when I saw PgCat, it was interesting. At that time, I had a spike of interest to sharding solutions, and I saw PgCat was started as a connection pooler alternative to PgBouncer. Actually, at the same time, a few other teams started some new projects. But Then Sharding was, oh, okay, Sharding was there. It's quite straightforward way, just comments. It was interesting, but I'm not a cat guy. I'm a dog guy.

00:54

And once I saw PgDog started, we focused on sharding. It obviously attracted much more attention from me. So that's why we invited you, actually. Awesome. Yeah. Let's talk about, I don't know, like maybe sharding itself and probably let's start with the idea that not everyone needs it Because on a single Postgres cluster, you can grow and reach multibillion evaluation, go to IPO, I have examples. But obviously, sometimes it's really tough, right? Tough job to maintain a huge monolith.

01:32

So what's your take in general about the idea that you don't need sharding?

Lev

01:39

Oh yeah, well that's a good 1. I agree in principle. So you know, Postgres can be pushed quite far and we pushed it very, very far at Instacart, pre-IPO I should add. But we IPO it as a massively sharded database. And we absolutely had to because a lot of things started to run quite slow. We had a lot of writes. Instacart was, you think most apps are 90% read, 10% write. I don't know if that was the case for us.

02:09

I think we're a little bit more like maybe like 80-20, maybe 70-30, but we were doing a lot of writes.

Nikolay

02:15

Hold on. 80-20 means 80 writes or reads, as usual reads, right?

Lev

02:22

Yeah, yeah, reads. Yeah, yeah. I'm thinking like 90% typical workloads, 90% reads, 10% writes.

Nikolay

02:27

Or even more, or even more sometimes. Social media is definitely more reads, right? But here's the thing.

Lev

02:34

Yeah, here's what it was a little bit different, not by much, but even that that small percentage, like 10%, 15% writes was enough to push our r5.24xlarge, I don't know, like 192 cores, almost a terabyte of RAM over the edge.

Nikolay

02:48

Sounds like RDS. Actually, I'm super curious. It's off topic, but I'm super curious about so-called Instacart zero-downtime upgrade approach because we believe, at Postgres.AI, we believe it leads to corruption. But I hope we will discuss at a different time. Let's hold on to this.

Lev

03:09

Yeah, happy to talk about it on another podcast. That was 1 of the guys who worked on it.

Nikolay

03:14

Consider you invited. zero-downtime upgrades made on managed Postgres, where, which doesn't allow to change recovery target LSN.

Lev

03:26

Yep. I did it myself. We had zero-downtime.

Nikolay

03:28

Yeah. It's super interesting topic. Maybe in a few weeks, I'm really looking forward. So yeah, I understand that writes, we can scale reads easily because it's just a replica until some point, right? But writes, we have only the primary. And you need to go either to services, microservices, or sharding or combination of them. This is several routes, actually 2 major routes here.

Lev

03:57

Yeah, Instacart actually sharded what we call functionally sharded. I don't know if that's a real term in the industry, but we just took tables out of our main database and put it into a different database, you know, functional sharding air quotes.

Nikolay

04:10

Vertical split. Vertical position. Exactly.

Lev

04:15

And that happened even before I joined the company. So that was way before, you know, IPO talks and all that stuff. So you can run out of capacity on a single machine quite quickly if you use it a lot. Sorry, that's a little bit of a tautology, but that's just the case. Like, it depends on your workload. We did a lot of machine learning. So we wrote a lot of bulk data, a lot, like daily. We would have like hundreds of gigabytes of data that were completely new data every day, basically.

04:41

So a lot of writes were happening and sharding was the only way, basically. We were running out of... Our vacuum was always behind, and when vacuum is behind, everyone gets a little bit scared. Performance is worse as well. We were getting a lot of lock contention on the WAL. That happens a lot. The WAL is single-threaded to this day, and that's totally fine, but you end up, you know, when you write like hundreds of thousands transactions per second.

Nikolay

05:06

Well, it's single-threaded if you use WAL writer, but backends can write to WAL themselves if you turn it off. Like this setting, I forgot.

Lev

05:16

Yeah, No, you're right. You're absolutely right. But there's still a log. There's still a log. Yeah. You have to grab a log.

Nikolay

05:22

We see it in performance insights. If it's on RDS, like queries are spending time on LWLog.

Lev

05:31

That's the guy. And that's the guy who takes your website offline on Sunday, every morning, like a clockwork.

Nikolay

05:37

On the one-night-defensive foreclose. And I would say traditional optimization would be, let's write less. But I assume only in Postgres 13, WAL-related metrics were added to register statements and explain. It was very difficult to understand which queries, well, with performance insights you can have based on weight event analysis to identify queries which are like WAL write intensive, right? And maybe optimize them, but it's still like limited approach, right?

Lev

06:11

Well it's limited in its success in a company because when you tell somebody like, hey, can you write less to your database? People are like, I'm sorry, but is that your job? You can write less,

Nikolay

06:23

but we have, well, WAL writes are problem for a query, consider a query which writes to a table and 2 situations, This table has a couple of indexes versus 20 indexes and when we insert something, every index gets some insert and it amplifies the amount of WAL written for the same insert.

Lev

06:46

It's really important to know that every single index has an important business use case. And that's why it was put there in the first place. So like all of these are invariants. Like when you say like I have 20 indexes on a table and it's right up, like that's by design.

07:00

And when you want to write more data to that table, because your business is growing, like us as a database engineering team, we're like, all right, we're going to make it happen, because we can't go to like 25 individual teams and tell them like, hey, can you fix your query, they're going to be like, I don't have the time, I'm trying to sell groceries here. You know, like I have other concerns.

Nikolay

07:19

But anyway, I'm trying to say I agree with you that without sharding, we are limited in, in like, we need to squeeze, squeeze, squeeze, but it's very limited. And at some point, you cannot say we can grow to X anymore, right?

Lev

07:34

That's right. And that 2X is actually on the low bound. What most engineering leaders expect is like 3, 5, 6X. 10. Yeah, 10 to have that runway. If you don't have that runway in your system, it's a red flag for most.

Nikolay

07:50

And it goes to CTO level, basically. The decision is like, should we migrate out of Postgres to, I don't know where, right?

Lev

07:58

Exactly.

Nikolay

07:59

I know this so many times, including sitting on CTO position myself, like 15 years ago. Like I know this pain of sharding. Yeah.

Lev

08:08

It's really funny. You're just kind of sitting there and be like, Hey guys, I need 10X capacity. And your database engineer is like, well, it's the WAL. There's nothing I can do. It's the WAL. It's like, it's just like a disconnect between these, like, you know, these 2 people, you need to like, you just need to make this thing scale. Like that's the end. And a lot of these are out of control. So,

Nikolay

08:24

yeah. Let's like, we can also enable compression for full page writes. We can also partition so writes become more local and some, I don't know, like you're skeptical, right?

Lev

08:35

I am, because all of that, partitioning is great. I would never speak bad against partitioning, but...

Nikolay

08:42

Not every partitioning, sorry for interrupting, not every partitioning. If, for example, people say, okay, we are going to spread all customers evenly distributed among partitions. This is not going to help us in terms of WAL writes. But if you say, we will have hot partition, like receiving most writes, and kind of archive partitions where the vacuum and everything was going like stabilized. In this case, it's super beneficial for WAL writes and write intensive workload optimization, right?

Lev

09:12

Absolutely, but then you're going to end up having to read those cold partitions anyway. It's a temporary band-aid, which we did. Again, absolutely, we partitioned a lot of tables and that helped us with a lot of things. But at the end of the day, you just need more compute. That's the bitter lesson of database. I don't know if you're familiar with the bitter lesson in AI, where you just need more machines to solve the problem. It's the same thing in databases.

09:36

You just need more compute to just be able to write, read more stuff and write more stuff.

Nikolay

09:40

Yeah, I'm still pulling us back intentionally because we are having conversations about like scaling problems. And interesting insight I've got yesterday talking to a very experienced Postgres expert, that partitioning not only helps with data locality, write locality, and full-page writes, and volume amount as well, but also with backups, surprisingly, if we do incremental backups with WAL-G, pgBackRest, with new incremental API, I still need to learn about this more.

10:16

Or snapshots, cloud snapshots of EBS volumes and RDS also relies on it, as I understand. Imagine writes are spread out everywhere, and you constantly change a lot of blocks, versus you change only specific blocks, and some blocks are not changed. And it's less pressure on backups as well. It's interesting and it affects DR and RPO, RTO.

10:42

But I just wanted to ask, like my understanding, If we go to sharding blindly and miss some optimizations, I remember I was dealing with Mongo and each node of Mongo, it was maybe 10 plus years ago, I saw each node of Mongo can handle much less workload than Postgres, 1 node of Postgres. Maybe like jumping to sharding too early, it's premature optimization and you will be not in good shape in terms of how much money you spend on compute nodes because you missed a lot of optimization steps.

11:16

What do you think?

Lev

11:17

Alexey Nosovsky Absolutely. If your first step is like, my database is broken, I need to shard, you missed a lot. Of course, you should look at all the possible optimizations. And that's really important to keep you on a single machine. I think the benchmark for me is when you need to start thinking about charting is when you try like 3 or 4 or 5 different things and then you have like a person or 2 working on it full-time and at the end of the day they're like I ran out of ideas.

Nikolay

11:42

Do you have some numbers like for modern hardware like how much WAL to be written per day, for example, like terabyte, 5 terabytes, where it's already an edge, right?

Lev

11:55

Yeah, for sure. I can give you the exact number. We benchmarked Postgres on in the best possible scenario, actually, It was on EC2, it had NVMe drives, it had an ext4, a RAID 0. So we did not, we did not want durability. We wanted performance. That thing could write 4 or 5 gigabytes per second using Bonnie++. Now Postgres was able to write about 300 megabytes per second of just like whatever. I think we're using copy, like the fastest possible way to dump data into PG.

12:24

And it's just, it's the nature of the beast, right? You have to like process it, you have to put it into a table. I'm sure there's a lot of like checks and balances go in between and again, lock contention and all of that stuff. So you can't necessarily squeeze out everything out of your machine if you have the capacity. So ultimately you need to split the processes themselves between multiple machines once you reach that number.

12:48

And then, you know, for everything else, if you have, again, if you have write amplification, things are getting a little bit more tricky. And then, so I think that's, that's usually the number. Once you're writing like a, you know, 200 megabytes per second of just the WAL. And you can see that in RDS there's a graph for that just just WAL writes. I would even start early. 200 megabytes

Nikolay

13:08

of WAL per second each WAL is 16 megabytes but on RDS it's 64. They changed it. So how many WAL files like it's a lot right?

Lev

13:21

It's a lot and that's And that's the scale you're looking at.

Nikolay

13:23

10 plus is already too much. Yeah, I agree.

Lev

13:26

Yeah. So that's the red line, right? If you're at the red line and you're thinking about sharding, we're gonna have a bad time. And that's okay. It's okay to have a bad time. These lessons are learned when you're having a bad time. I had a really bad time.

Nikolay

13:41

Yeah, and I pulled BC calculator just to check 200 megs per second, it gives you 17 terabytes per day. Yeah,

Lev

13:52

that's mild, honestly, for what we did at Instacart. We had like petabytes of WAL in S3, like after like a week. And That's just, again, it's the nature of the beast.

Nikolay

14:02

Multiple cluster.

Lev

14:03

Single cluster. Yeah, it's just, it is. It's called selling groceries. What can I tell you? We had like, you know, multi, hundreds of gigabytes of machine learning data that would come in every day because the model was retrained on a nightly basis and all the embeddings have changed, right? Again, the use cases are almost like, I mean, they're interesting, but they're kind of off topic, I guess, because they're just like, you just write a lot of data. That's what you do for a living, right?

14:29

That's what Postgres is for. That's what databases are for. So the numbers are like, well, this is the number, and that's when you should split up.

Nikolay

14:37

Yeah, so let's probably rest for more sane situations. Let's say 10 terabytes per day or something of WAL data. It's already too much. If you approach that like not soon, it's already a WAL you will be hitting, several WALs inside, obviously you will be hitting. Including light. Yeah, yeah, several WALs. Yeah, I was trying to say lightweight lock, WAL-related lightweight locks, WAL-write, right? There are several of them. Yeah, but it's, Yeah, this is the scale to be scary.

Lev

15:20

It's hard to put a number on it because you can push that further, and who knows what's going to happen to your app. Maybe that's as high as you'll go.

Nikolay

15:26

And not everyone has local NVMEs because in cloud environments, they are ephemeral. It's kind of exotic and risky.

Lev

15:35

It is. It's definitely risky. I love those because when you click reboot on the machine, it wipes the encryption key and all your data is gone. So you better be careful which button you press. There's like a hot restart, which is safe, and there's the actual restart, which will delete all your data. So stay on RDS if you can.

Nikolay

15:53

But no, I don't agree here because you just you should just use Patroni and multiple replicas at least at least 3 nodes and you'll be fine to lose 1 node.

Lev

16:04

You lose 1, something ran a bad, anyway, that's totally off topic.

Nikolay

16:11

Okay, thank you. For me, it's enough to understand that there are cases definitely when you do need to split vertically, and splitting vertically usually is limited as well. It's hard sometimes, and at some point we need sharding. Yeah, so it's enough to explain that definitely there are cases where you need to scale Postgres beyond 1 primary, it can be vertical or horizontal. The result is sharding, right?

Lev

16:39

That's right.

Nikolay

16:40

So, good. Yeah. This is great explanation. Sharding is definitely needed, but not for everyone maybe, but for many.

Lev

16:48

I mean, yeah, I hope not for everyone. That would be, I mean, you know, we solve the problem once and for all.

Nikolay

16:54

Actually. Anyone has access to

Lev

16:56

it, yeah.

Nikolay

16:56

Last week, Michael and I had discussion about snapshots and I said that CloudSQL has hard limit, 64 terabytes, because this is a limit for GCP persistent disks, including persistent PD, SSD. And I said, RDS obviously can allow you to grow further, but beyond 64, but it's not. After we recorded, I checked, and the same limitation, 64 terabytes. It's another reason to have multiple primaries and scale beyond 1 cluster.

Lev

17:33

Well, you'll hit that sooner on Postgres because of the 32 terabyte table limit. And actually I did hit that limit once and that was a scary error. The only thing that it says is like, there's no more bytes to be allocated or something like that in some kind of database file. I'm like, what? You know, and then you're done. Like, there's nothing you can do. You have to move data out into that table. You have to think database offline, obviously.

Nikolay

17:59

It's a single table. Okay. Okay.

Lev

18:01

Yeah. A single table. Yeah. Which is actually not uncommon.

Nikolay

18:06

Well, it's still not super common as well. Yeah. I see only cases like kind of 10 terabytes for table. And it's already when any DBA should scream, where is partitioning?

Lev

18:20

Well, DBAs like to scream, but application engineers be like, I can't, aren't you the DBA? Like, what am I doing? Do something, right?

Nikolay

18:29

But partitioning unfortunately requires application code changes.

Lev

18:33

Yeah, precisely.

Nikolay

18:34

Yeah, so sharding is needed for big projects, obviously. Yeah. Agreed. What's next?

Michael

18:45

Can we go? I reckon it's time to go back to the origin story of PgDog and obviously then PgCat. It would be great to hear like a little bit of that story, Lev.

Lev

18:54

Yeah, absolutely. I'm happy to. Do you guys care about the name of the project? I don't know if, or do you want to know what the inspiration for that came from?

Nikolay

19:01

Sure, it's fun, yeah.

Lev

19:03

Okay, well, when I started PgCat, we just got a new cat and I really loved that cat. And I was working on Postgres, so the 2 things came together naturally. You could probably guess the origin story of PgDog now, I got a dog, and then I'm like, look, I love this dog. What can I call my next project? Obviously PgDog. Yeah. So the naming issue solved. PgCat came from the idea that it was really simple. It was sharding was not even in scope back then. It was just like, we ran PgBouncer.

19:33

PgBouncer could only talk to 1 Database at a time. It makes sense, you know, you were pooling like just the Postgres instance. We had a bunch of replicas. We needed to load balance. And we needed a load balancing algorithm that was smart. When a replica went offline because of, again, hardware issues, scaling issues, whatever, we needed to remove it from the rotation without affecting the app. So we would regularly lose the replica.

19:55

And then most of the site would go offline because we had a Ruby gem that would randomize access to those replicas, and when 1 of them broke, it's just, you know, it worked okay. But doing this in the application code is really hard. Especially in Ruby, there's a way to inject exceptions into Ruby, like, sideline, and basically that breaks your state. So we had like multiple gems working against each other and we just needed to do that in a place where it made more sense.

20:22

Like a load balancer is typically outside the application because you have multiple applications so I can do it in any way. So I just built a load balancer basically. It was after actually I left Instacart. I was just doing it as a side project just to keep my mind going. So I built it. It was really simple. Used banding logic, which was kind of novel at the time for Postgres. If you receive 1 single error from the database, it's removed from the rotation.

20:45

It's very aggressive, but Postgres never throws errors, like network-related errors, unless there's a serious problem. So that actually worked pretty well. I talked to a friend of mine on Instacart. I was like, hey, look, I built this on the side. That looks fun, right? Like, we thought about this. And he's like, yeah, all right. Didn't you quit? I'm like, yeah, but no, I have some free time to work on this. Right. And he's like, okay.

21:07

And then he took the code, added a bunch of features that I didn't add. Cause obviously I didn't have a use case anymore for it. And he's like, oh, great. We're going to deploy it. Right. And we tried it and they use it and they're, I mean, they put so much work into it. They wrote a blog post about it. You probably know about this. So that went pretty well, you know, so it's working in production. It's great. And then I'm like, all right, well, you know, sharding is the next 1.

21:28

They have a bunch of sharded databases that we sharded and adding sharding routing to that would be great because again, it was done in the application layer and application layer routing, I think we'll all agree is a little bit iffy, especially if you have more than 1 app written in more than 1 language, like you have to repeat the same logic across all apps. So I added just a common system.

21:49

I knew like there's 2 sharding schemes at Instacart. 1 uses, you know, the actually the hashing function from partitions in Postgres. I love that because you can actually split data both at the client and in the server.

22:00

So you have multiple ways to move your data around and the other 1 is just a custom 1 like we use like SHA1 and take the last few bytes and then mod that that's just you know, it's random, but it was available in multiple systems as well The data for that came from Snowflake so we could actually Shard the data in Snowflake and then ingest it into the instances directly. And then on the routing layer in Ruby, same hashing function, you know, the sharding key is always available. So that was good.

22:26

So I added both of them and they're like, great, that's great. And they tried, I think, the SHA1 function and I think it's working pretty well for them. So that was fun. Then I started another company that had nothing to do with any of anything. PostgresML, you might've heard about it. That came from, you know, the idea that we shouldn't ingest, you know, hundreds of gigabytes of machine learning data into Postgres, why we should just ingest a 3 megabyte model and run inference online.

22:50

You know, it was okay. Stayed a couple of, 2 and a half years there. Didn't work out. There's a lot of startups too, left. And then I had some free time and I'm like, well, what do I like to do in my free time, guys? Writing Postgres poolers. This is what I do. This is what I do to, you know, rest on vacation. I write Postgres poolers. I'm like, well, let's do sharding for real this time. Let's actually take what we built at Instacart, make it into software because That's what we do, right?

23:16

We come up with an idea, well, we find a problem, we find a solution, we write it in code, and we don't have to solve it again every single time manually for like, you know, hopefully hundreds, hopefully thousands of use cases for this. You know, we'll see. I'm still doing my research. But yeah, So that's PgDog. Sharding is, you know, sales and number 1. Everything else is, you know, there's obviously, it's obviously a pooler, it's obviously a load balancer.

23:38

It has all the features that PgCat has. Almost all of them, I'm adding them as I go. It's a rewrite, it's brand new. I like new code. That's what everyone loves to hear. Hey, you rewrote it from scratch. Great. You know, that code that we battle tested in production and serving like half a million transactions per second. Well, that's obsolete now, I guess.

Nikolay

23:57

You're just going to

Lev

23:58

take this brand new code base and check it out.

Nikolay

24:01

Yeah. And it's written in Rust.

Lev

24:04

Yeah, absolutely.

Nikolay

24:05

Yeah, it's in my kind of dreams to find some spare time to learn and try it because, yeah, it looks like many folks move to Rust. So what do you think?

Lev

24:20

Well, it took me about a decade to get good at it. So the sooner you start, the sooner you'll get it. You know, it takes about 10 years to be good at it. So at least that's what it took me. That's okay. Again, if you start today, eventually you'll get good at it and that's okay. It's a journey, you don't have to learn it immediately. It's just, it's such a paradigm shift in how things work.

24:45

The compiler is very aggressive about checking for things, especially concurrency errors, which is for multithreaded asynchronous poolers, very important. I don't have concurrency bugs in PgDog or PgCat because I'm using Rust. I don't have data races. And that's really important. And the number of bugs that I ship is considerably lower because the compiler is like, hey, listen, this variable, you're not using it. And I'm like, oh yeah, crap, I'm actually importing the wrong variable here.

25:13

I'm using the wrong variable. Good catch, because that was going to be a bug. A lot of things that just in other languages are not available Rust makes really nice. So it's really worth, again, you ask anyone who writes Rust, but like, that's the best thing that ever happened to me since sliced bread. I'm like, that's true. I haven't thought about sliced bread in a while, but Rust is great. Cool.

Nikolay

25:35

Another side question is license. I saw feedback. I joined that feedback. So, PgCat was, I think, on Apache or MIT. I don't remember exactly, but kind of permissive. And for PgDog, you chose AGPL, right? Can you elaborate a little bit why?

Lev

25:57

Yeah, yeah, of course. Yeah, happy to. The answer is really simple. AGPL is actually pretty misunderstood. You can use that code and the application anywhere you want and never have to share anything back. As long as you use it internally and you don't provide, like, you don't use PgDog like publicly as like a service for running PgDog. And even if you do that in that case, all you have to do is just tell us like, what did you change and send us patches, like what did you change?

26:23

So like, it's pretty minimal. But it's a red flag for everyone and that's okay. Yeah, I'm building a company around it. Building a company around MIT code base is, I think, probably possible. I've never done it successfully. Building a company around AGPL, I think, has been done before. And I think it's probably fine. But if it becomes a hurdle, I'm not like married to it. I just thought AGPL looks cool. I like the ideas behind it. I like free and open source code.

26:53

I don't think MIT is necessarily the original idea behind, you know, free and open source code. I like MIT because I don't have to think about it.

Nikolay

27:01

I'm checking PostgresML. PostgresML is MIT, right?

Lev

27:07

Yeah.

Nikolay

27:08

Codebase. Rasta and MIT. So it's interesting how you decided to change it. I agree it's misunderstood. But it's already so. The majority of people misunderstood it and we cannot change it with single project. So it's reality. Yeah, but you don't care if

Lev

27:29

it's open.

Nikolay

27:30

Okay, yeah, like you do it.

Lev

27:32

Yeah, yeah, because like if somebody tells me like look I would love to use your code, but AGPL is a deal-breaker I'll be like well, we'll work something out. You know that's that's not a big deal. You know But I think that's a good. It's a good thing to have a good license. It's important.

Michael

27:48

You mentioned starting a company around it. It strikes me it's going to be tough. Like, obviously, New codebase, the main use cases are at scale. But normally startups, like, the easiest way of getting started is serving smaller companies, right? Like it's harder, like going straight to the enterprise with something that's not yet, like what's the, you've got a plan though, it'd be great to hear like what's the plan?

Lev

28:11

What's the plan? It's okay, don't freak out, it's gonna be okay. Yes, it's not actually that uncommon to have enterprise startup products. If the problem is interesting enough, there's always going to be somebody who's going to be like, oh, great. Somebody's working on it full time. That'll be amazing. At this early stage, how this works usually is I'm looking for design partners. So it's companies like Instacart who are like, hey, this is a great idea. We're gonna try it out.

28:36

We're gonna develop it together. And at the end of the day, it's gonna be in production because we built it together. And that's actually good because you wanna build it with the users. Like you don't wanna build it by yourself like for several years and then show up and be like, hey, does anyone need Postgres sharding? And Nikolay's like, well, I don't know, maybe, maybe not, depends. So what I'm actively looking for right now, like codebase is okay.

28:59

I'm sure there's bugs in it, performance issues, and that's totally fine. I'm just looking for people who'd be like, this is an interesting idea. I like the idea of Postgres sharding. I like the way it's done at the pooler. It's not done as an extension. It's not done as some kind of other thing that I can't even think of. I like the way it's done. So I'd like to try it out and help you finish the job.

29:18

You know, by deploying it in production, by benchmarking it, by finding bugs, by reporting bugs, by even fixing bugs would be great, but not required. My job.

Nikolay

29:27

I can confirm you're very quick reacting to requests. I remember looking at PgCat, I had the idea that mirroring, let's have mirroring to have like a kind of A-B testing, A-B performance testing, right in production. And you implemented it, it was great. I think you were, like you were, or no?

Lev

29:47

Actually, to be perfectly correct, it was actually Mostafa and Instacart who implemented it, because he had a use case for it. He was my design partner.

Nikolay

29:56

It's a very common request, and the only problem, like, you need to have this pool already in production to use it. This is the trickiest part.

Lev

30:08

It's 0 to 1. It's always tricky. New stuff, especially in the hot path, always going to be hard. But if the problem is there, and If the problem is big enough, I'll find my champion. Yeah, exactly.

Nikolay

30:21

So yeah, and you mentioned PgDog. Maybe let's move back to technical discussions. A little bit out of business and license and so on. So first of all, it's not like PgCat, it's not explicit sharding when you command with SQL comments how to route. This PgDog has automated routing, right? And second thing, there's no Postgres middleware for this. So it's just a pooler with routing. Can you explain architectural decisions here and what do you use and what kind of components?

31:04

You got a parser from Postgres to understand queries. I'm very curious how you are going to automatically route selects of functions which are writing, for example, right? Or select for update, which you cannot route to a physical standby, a replica, right? Or I don't know, something else, like how you are going to, or You obviously will have some limitations, already have some limitations, right? So can you talk about this a little bit?

Lev

31:40

Yeah, of course. Select for update is actually really simple. That's a clear intent to write something. So that's an easy 1, straight to the primary. No problem. That's an easy 1, which I actually should implement. Now that I'm thinking about it, I'm routing it to the replica right now. Bug issue incoming. Thank you very much.

Nikolay

31:58

You're welcome.

Lev

32:01

The other 1 is the functions. Obviously It's impossible to know if a function is writing or not by just looking at it. Even if you look at the code, you're not going to know. So like static analysis, I don't think is necessarily possible. So for that 1, I think it should be pretty easy. You put it in the config, you have a list of functions that actually write.

Nikolay

32:18

This is what pgpool does, right?

Lev

32:20

I think so, yeah. I'm not sure.

Nikolay

32:24

pgpool does everything, so I'm sure they do it as well.

Lev

32:28

Exactly. You have to be careful. You can't do everything. People don't believe. I know that approach is not perfect because if you add a new function, you have to update the config and you're always going to forget that you're always going to have issues. So for that 1, I don't have a solution. My theory so far is that that is. Not as common as I'd like, but I will probably prove it wrong and then we'll figure something out probably.

32:52

Some kind of migration process that says, if you want to add a function that writes, send it through the, like you should be writing migrations and sending them through the, through the pooler, not some kind of side channel. And you can probably mark that function as like, hey, this function writes, like put in a comment or something. And then PgDog is gonna be like, great, good to know. You're gonna need persistent storage for that kind of stuff, which you could probably implement.

Nikolay

33:11

Yeah, I agree. Some dogs behave like cats sometimes.

Lev

33:15

Exactly. Yeah, exactly. People will forget to put that comment in and there's going to be issues. But that's just software, you know. When there's people involved, there's always going to be... The more manual stuff you have to do, the more problems there are going to be. But if you're writing a function by hand in the first place, you know, That's just got to be part of your review process.

Michael

33:35

I was reading Hacker News comments and somebody asked about aggregates as well. I think there's a limitation around those at the moment. What's the story there?

Lev

33:44

Well When I posted the thing, I didn't have support for aggregates at all, but then I'm like, hey, you know what, let's add some. I've been thinking about it for months. I might as well just do a couple of simple ones. So I added, yeah, I added like count, simple 1, you just sum the counts across all shards. Maxmin, again, super simple. I added sum as well. Sum is really just sum everything.

Nikolay

34:03

Yeah. Yeah. Some of the- Let's, let's, small comment here. It's MapReduce basically, like an analogy for it. It's, it's cool, right? So.

Lev

34:13

It is cool, right? This is cool. Like, you know, it's scale

Nikolay

34:17

to billions, trillions and so on.

Lev

34:20

Precisely. Yeah, this is like a MapReduce for Postgres is phenomenal. I'm actually, okay, so I'm going to release a blog post in a couple of days. I'm actually sharding and doing MapReduce for pgvector. I don't know if you've heard about this 1. Yeah, yeah. That one's really fun. pgvector is like a-

Nikolay

34:35

No, no, pgvector we know, but what do you do with it?

Lev

34:40

Well, you're going to have to wait till the blog post comes out, but it's really fun. I'm doing both MapReduce and like a machine learning algorithm to route queries in the cluster. Because like scaling, like searching vectors is a completely different problem than searching B-trees. So I don't know how many people will need that solution.

Nikolay

34:57

Well, HNSW is terrible if you go beyond 1 million records. It's a big problem still, So.

Lev

35:04

That's the number I had in my blog post as well. I don't know why 1 million just feels like the right number, but that's exactly what I said. I'll answer you, over a million, you probably need to shard your pgvector index.

Nikolay

35:13

Or use a different approach, yeah. Exactly. That's great. And MapReduce, someone told in the past that Pale Proxy by Skype, very old tech, also looked like MapReduce, but it required you to use only functions, which is a huge limitation, especially if you have ORM or GraphQL, it's a big showstopper. And also it was Postgres in the middle, right? For routing and for this MapReduce. But in your case, this is more lightweight software in the middle, PgDog, right?

35:49

And it does some simple, like, arithmetic operations. Yeah. And do you plan to define some interface for more advanced operations that the user could define, like, beyond simple sum or count or other aggregates?

Lev

36:10

I haven't worked much with custom data types, UDFs and all that stuff, So that's going to be a learning curve for me, I'm sure. I'm sure it's not that hard, but like once you add custom functions, you need to add custom logic. I think that should be pretty straightforward to implement if there's a synchronization between, you know, the this is working.

Nikolay

36:28

This would give full-fledged MapReduce capabilities to this, right?

Lev

36:33

Yeah, absolutely. More open

Nikolay

36:35

and interesting perspectives, I suppose.

Lev

36:38

Absolutely, yeah. If I find someone who thinks this is cool as well, we could definitely build it together.

Nikolay

36:44

Not only thinks it's cool, but has some production to try, right? Because it's just- Absolutely. Exactly.

Lev

36:50

Yes, absolutely, yeah. I think that would be pretty terrific. It will work pretty well, but yeah. So again, there's a lot of interesting things about aggregates that, you know, for example, like percentiles, like notoriously difficult, basically impossible to solve, I think, at the sharding level, because you need to look at the whole dataset to compute it. You can approximate it.

37:14

Approximation functions should be like a feature that we add to like PgDog that says like, you know what, I don't care about the exact number, like average, the simplest 1, you could estimate it. I think

Nikolay

37:25

for percentile, like we could define some custom data type, right, to remember how many members were analyzed and so on, like, I don't know, to bring not just 1 number from each shard, but a couple of numbers and then it would be possible to understand percentiles, maybe, should be. Maybe. Not super difficult.

Michael

37:46

Feels like a similar, like HyperLogLog had some similar, like, I don't know how you do that cross-shard, but it feels like there might be some secret sauce in what they've done already that could be applied cross-shard.

Lev

38:01

Yeah, HyperLogLog is like a counter, basically, approximates how many members are in a set. Actually, there's an extension for it in Postgres, which you can use. Yeah, it's pretty fun. But yeah, it's, I need a statistician, like a data scientist

Nikolay

38:16

to come

Lev

38:16

and be like, all right, this is how you approximate percentiles. And I'd be like, great. Do you know Rust? Good stuff. Yeah.

Nikolay

38:24

What won't do with Rust.

Lev

38:26

Exactly, yeah.

Michael

38:28

Well, so what is, like, what's next? What are you looking, I guess it depends on what people want, but what does tomorrow look like or the next week?

Lev

38:35

Well You're gonna be shocked to hear this, but I'm an engineer who does sales now I literally just like yeah as I said, I'm building a company So I'm literally just sending as many like LinkedIn and emails to whoever I can think of to find design partners for people to be like, hey, I want to help. This problem exists. First of all, that's the first feedback I need to get. Be like, I would like sharding, I would like it solved, and I would like to be solved at your way. That would be great.

39:02

Or, you know, just tell me how you'd like it to be solved. And if like there's an interlap and I think there should be overlap, solve it together. So that's what I'm doing mostly, but you know I'm an engineer, so I need, I need like a safe space from, from all the social activities. So I still code. So that's why the pgvector sharding is coming out because I needed to code a little bit and I thought the idea would be cool. And I'm gonna keep doing that.

39:24

I'm gonna keep adding these kind of features, keep adding tests, benchmarks, fixing bugs, finding more use cases, like SELECT for UPDATE that I forgot. But yeah, that's that's the plan.

Nikolay

39:37

I have specific question about rebalancing without downtime. Right. If 1 shard is too huge, others are smaller. We need to rebalance. What do you think? Will this feature first of all be inside open source core offering? Because we remember in Citus it was not until Microsoft decision, as I understand, Microsoft decision to, or like a Citus team decision to make everything open source. And second question actually, triggering by my this question, like, how do you compare PgDog to Citus?

Lev

40:14

Yeah, the open source, I would like to stay, I don't foresee myself writing closed source code. Maybe things around deployments and orchestrating the stuff in companies. My idea, we'll see if it's realistic, is to sell managed deployments of PgDog to on-prem deployments to companies. So probably the the actual code that orchestrates that stuff will probably be proprietary, you know, mostly because I'm embarrassed how much bash I use. Nobody really wants to know how that sausage is made.

Nikolay

40:45

Do you follow Google Bash code style?

Lev

40:49

Yeah, somebody taught me to use curly braces for my bash variables, and ever since then I've been doing that religiously. So I learn new things every day. But I think the core will stay open source forever. I don't see... Even that data migration part, like that's a, that's a known, there's a known solution for it. There's no point of building a closed source solution that does the same thing. Like it's already been solved. So might as well just...

Nikolay

41:12

It doesn't exist yet, right? This, this rebalancing feature.

Lev

41:16

Well the rebalancing feature is basically, again, it depends on your sharding key. It depends how you store data on the shard. Like Instagram wrote that blog post a long time ago where they pre-partition everything into smaller shards. And that's how Citus does it underneath. Like Instead of like, you see, if you say you want 3 shards, it's not going to build you 3 tables. It's going to build 128 tables and move them between the shards. And then when 1 shard gets- How? How?

41:39

Well, logical replication became a thing in 10. So, Asidus uses that to move things around. I think logical replication makes sense up to a point. You really have to catch the tables at the right time. Once they get a little bit too big, logical propagation can't catch up anymore. So that's going to be an orchestration problem. I've seen logical kind of... You can

Nikolay

41:59

partition it virtually, like what peerDB did. They implemented virtual partitioning, splitting. If your primary key or partition key, partition key in this case, allows to define some ranges, you can use multiple streams to copy initially and then even to have CDC. So it's kind of interesting. That's right. So yeah. Well,

Lev

42:27

it's funny that you mentioned that because what is it, Postgres 16 allows us to now create logical application for

Nikolay

42:33

other tests. Binary equals true. Postgres 17. 17? Or 16. Maybe 16. 1 of those. Maybe you are right. I just looked at the computation a few days ago and I already forget. Yeah, but still binary is good, but it's whole table. Maybe we don't need it. If we rebalance, we don't. Well, in case if you split already partitions, it's fine. But if not, yeah, I'm very interested to understand design decisions here. It's going to be interesting.

43:02

Because I think for sharding at large scale, this is 1 of the key features to understand this.

Lev

43:09

Yeah. Well, so when you started from like a 1 big database to shard it into like 12, what we did at Instacart was we created a replication slot, we snapshot at 12 databases, we restored them 12 different times, deleted the data that's not part of the shard, synchronized it with logical replication and launched. So deleting data is faster than writing it from-

Nikolay

43:31

Copy it as physical replica first, right? Or copy it logically, like provisioning logical replica, basically dump restore before binary.

Lev

43:42

No dump restore. So RDS, we were on RDS, so EBS.

Nikolay

43:46

This leads us to this discussion about upgrades, because recovery targetless doesn't exist. How, like, let's follow up on this. Like, this is good bridge to 0 downtime upgrades on RDS.

Lev

43:57

Yeah, those are fine. Oh yeah, so the Second question, yeah, Citus. So the difference is mostly philosophical in the architecture. Citus runs inside the database, which limits it to 2 things. First of all, the database host has to allow you to run Citus, Maybe because of AGPL, maybe because they just don't want you competing with their internal products. Again, you know, it's business, it's all fair game. And then the second 1 is performance.

44:26

When you run something inside Postgres that needs to be massively parallel, you know, you're limited by the number of processes you can spawn and by number of connections you can serve. So PgDog is asynchronous, Tokio, Rust, lightweight. It's not even threaded. I mean, it's multi-threaded, but it's mostly like asynchronous, like task-based runtime.

44:41

So you can connect, I mean, I'm gonna pull up a big number here just for, you know, for effect that you could have like a million connections going to PgDog from a single machine and that technically should work because it's using Ebola underneath. But for Postgres, you could probably do like, you know.

Nikolay

44:56

You need PgBouncer or something in front of

Lev

44:59

it. You need PgBouncer and you need most of those connections to be idle because concurrency-wise Postgres can only do maybe like, you know, 2 per core. That's the myth at least. So again, like Citus, single machine, they have some kind of support for multiple coordinators, but I think the readme just says like, please contact us.

Nikolay

45:21

Any case you're in the hands of Microsoft, you need to go to Azure or you need to do self host everything. You cannot use it on RDS because extensions required. In your case, no extensions required. This is the key. You can run it on RDS because the extensions are not needed. Exactly. So you can run it on any Postgres. I see a lot of guys are trying to develop Extension ecosystem.

45:48

I'm over time became big Like opponent of extensions idea because we have a lot of managed services and if you develop extension it takes a lot of time to bring that extension to managed provider. It takes years. So if you can do something without extensions, it might be better in some cases. And sharding maybe is such a case.

Lev

46:16

No, absolutely. Yeah. Because if you develop an extension that gets installed by RDS, like, I don't know if they're gonna pay

Nikolay

46:22

you for that. Approve it and support.

Lev

46:24

Approve it or support it or all that stuff. RDS notoriously upgrades like once a year.

Nikolay

46:29

We need to wrap up soon, but like a few words about encryption. Does PgDog support encryption now already? Because it's super important for... And it can be a bottleneck. I know... Odyssey connection pool was created because PgBouncer needed 2 layers of PgBouncers to handle a lot of... Yeah, this is what several companies did in the past, two layers of the bouncers because of a handshake. So, okay. Tell us handshake. What, what, what's in this area encryption?

Lev

47:02

Sure. Yeah. It supports TLS. It's using a library, Tokyo, it's like one of the rest libraries that implements TLS. It's completely fine, you can use TLS. And my personal favorite, I finally find a library that implements the SCRAM SHA-256 authentication. So now that's finally supported.

Nikolay

47:20

Yeah, I saw MD5 is going to be duplicated in the next few years in Postgres, so...

Lev

47:26

It still comes up. I mean, it's been duplicated for 10 years, and people still use it just because it's really simple to implement. And SCRAM is really hard.

Nikolay

47:35

And one more small question about... Yes, about prepared statements. I know PgCat supported them, right? In transaction pool mode, right?

Lev

47:46

Yeah, the implementation wasn't great, but it does support it. Yeah, PgDog supports them too. Much better implementation this time.

Nikolay

47:52

Okay, cool. Well, I'm not out of questions. I'm out of time. Yes! But it was absolutely interesting. Thank you so much. I'm definitely rooting and going to follow the project. Best of luck to you and your new company. Maybe Michael has some questions additionally. I took the microphone for too long this time. Apologies.

Michael

48:18

Well, Lev, is there anything we should have asked that we didn't?

Lev

48:21

Oh, no you guys are, I think you covered it.

Michael

48:25

Well, really nice to meet you, thanks so much for joining us and yeah catch next week Nikolay.

Nikolay

48:30

Thank you so much, Bye bye. Thank you.

Transcript source: Provided by creator in RSS feed: download file

Episode description

Transcript