Multi-tenant options

Michael

00:00

Hello and welcome to Postgres.FM, a weekly show about all things PostgreSQL. I am Michael, founder of pgMustard, and I'm joined as usual by my co-host Nikolay, founder of Postgres.AI. Hey Nikolay.

Nikolay

00:09

Hi Michael, hi.

Michael

00:11

And today we are joined by special guest Gwen Shapira, co-founder and chief product officer at Nile, to talk all things multi-tenancy. Hello Gwen, thank you for joining us.

Gwen

00:21

Thank you for having me, it's very exciting for me to be in your show.

Michael

00:24

Oh, we're excited to have you, thank you. So to start perhaps you could give us a little bit of background and what got you interested in this topic of multi-tenancy in general?

Gwen

00:34

As many things it started with an incident but this time not actually one of mine, so when my co-founder and I started Nile we actually started with a very different idea. After about 9 months, we were like, this idea is not working very well. We developed some things, we're not finding the market we hoped to find. We were sitting in the Hacker Space over in Mountain View, and we're like, what did we learn? We talked to 200 companies all doing SaaS. What have we found out as a result?

01:13

And the things that called to us is that very early on, basically when the first lines of code are getting written, you have to choose a multi-tenancy model. And then about 2, 3 or 4 years later, depending on how fast you're growing, you have to change it. And we heard a lot of stories on what caused people to change it and whether they regret earlier choices or they're like, we didn't know better and how things went for them.

01:50

And then we started looking in different blogs and we found so many by very famous companies with either incidents where something that was done to a single tenant caused a whole chain of events that took down their entire system, sometimes for days. And also a lot of slightly better stories, how we sharded our highly multi-tenant database.

02:20

And we found story after story after story on how people had to re-architect their entire database, which is, as you guys know, extremely painful to do after you're a successful company 3 years in. And we're like, this is a good problem. So many people have it. It is so common. My past was in databases, not as much Postgres, more Oracle and MySQL. But I've seen this problem again and again in all kinds of companies. My co-founders seen this problem again and again in all kinds of companies.

02:57

This is such a good problem. Everyone has it and nobody's working on it. Why is nobody working on it? And that's how we got into it.

Michael

03:07

Yeah, nice. Should we go back then in terms of how do you like to describe the different models or the different options that people have at the in the early stages?

Gwen

03:18

Yeah, so everyone basically starts with 1 out of 2. And I'm using AWS terminology, even though there is other terminology that people apply to it. AWS calls it the pooled model versus isolated model. And in a pooled model, you basically create your tables as normal, and then add a tenant ID Column to each and every 1 of your tables, pretty much.

03:50

Some, maybe not everyone, some have shared data, but most of your tables are going to end up with a tenant ID Column that tells you which tenant this Row belongs to. Very easy when you start out and all you have to do is sprinkle some work clauses every now and then and you're pretty much good to go. How hard can it possibly be? The places where it gets you is that you have no solution. Everyone is in 1 big pool.

04:21

And if you have a problem tenant, if someone grows really, really large and suddenly you need to Query start getting slow for them, If you need to do an upgrade and 1 Customer absolutely refuses to accept changes or needs their own time window. There is a lot of different ways you may discover that by putting all your Customers in 1 big pool, you save a lot of effort, you save a lot of money. This is by far the cheapest option you're going to have. It's shared resources in 1 Database.

04:56

But you are not allowing yourself to do anything specific for any 1 Customer should they need it. The other approach is basically the reverse. It gives every Customer its own Database or sometimes its own schema. This still counts as isolated, even though It's not all that isolated. You share quite a lot in that scenario. But the schema is separate. And it's quite a bit easier to move them out if needed, if the schema is separated.

05:28

And In this scenario, first of all, there is a nice benefit that you get help from a lot of popular frameworks. I think Ruby has an, I think it's called apartment plugin for having this kind of multi-tenancy model. Django has something. So a lot of very popular frameworks have something that helps you in that model. But if you accidentally grow to a large number of databases, it starts being very painful. Obviously, a database with a large number of objects is no longer as easy to work with.

06:02

Suddenly, you start learning how much space and memory the catalog can really take when you have connections. Suddenly you start learning that pg_dump can take a very long time. If you actually have a database for each tenant, then doing any kind of maintenance on 100 databases is already not fun. If you end up going into a thousand databases, it's really not fun. And if you think about it, a lot of SaaS have customers in the hundreds of thousands, not just thousand.

06:36

So it becomes very painful exactly when you grow.

Nikolay

06:43

Yeah. For example, upgrade includes dump restore schema. And we've had cases with 250,000 tables, 2 million indexes. Well, indexes are there not dumped, but tables.

Gwen

06:55

Exactly. This is worse than no dump, right? If only we could dump indexes.

Nikolay

07:00

Yeah, and then you need to update statistics after upgrade for all of those tables. It's a nightmare, honestly.

Gwen

07:07

Yeah. On the other hand, at least you get to do it to a customer at a time. Imagine that everyone is in 1 really big database and now you have to upgrade all of them together.

Nikolay

07:16

Yeah. Yeah. It's painful, all this.

Gwen

07:21

Yeah. And usually, and then there is the mixed model where I think pretty much everyone ends up with where you basically have, you start with the pool model, and as you grow, you shard it, but you're actually pretty smart, and you realize that not all customers are of the same size, and you can have some dedicated shards for your biggest or most sensitive or most demanding customers.

07:51

And this is, I think, if you look 5 years into the life of a company, I would say that this is the dominant model. Some variation of we have shared databases with pool and then some dedicated databases with specific customers.

Nikolay

08:15

What is the problem we're trying to solve? Is it security or performance or both? And if we go back to some customers, they share this pool, it affects security goal, right? So we don't achieve it.

Gwen

08:28

Absolutely. So this is 1, First of all, this is 1 driver that people actually start with the isolated model. They know that they're going into a sensitive area. They're focusing on SaaS for healthcare, SaaS for finance. Those companies definitely start with isolated model and try to figure out how to manage large number of databases. A lot of times those companies don't become huge.

08:57

There is that many hospitals in the United States, but they still have to build all the tooling to manage a large number of isolated databases. For other companies, it's more complicated. I would say maybe 70% of the time, the reason for eventually moving customers out and sharding and isolating would be performance. It's amazing how many performance problems can be solved by just having less data in each database.

09:34

On the other side, there are the story where 2 years in, suddenly a very sensitive customer shows up or you want to sell into, like you thought you were building a normal CRM or some kind of a rug database, but then a healthcare company shows up, a bank shows up, or even worse, a government shows up, and they show up with a list of demands and since they usually have good amounts of money to back those demands there is a lot of incentive to figure out a solution for them.

Michael

10:17

Nice. We have 1 kind of ugly duckling in the Postgres world that I'm not sure quite fits either of these models. I wonder if it's worth discussing row level security briefly, because if I was to bucket it based on those definitions, it's kind of the pooled model in a way because all of the data is together but there is some isolation

Gwen

10:39

between tenants

Michael

10:40

absolutely yeah

Gwen

10:42

yeah I have a love-hate relationship with RLS I think a lot of people do. Because you're right, on 1 hand, it's absolutely a lifesaver in the pooled model. Developers make mistakes as joins get, and conditions get more complicated. It's very easy to misplace a work clause and actually leak data that you don't want to. So RLS will prevent you from doing it if you do it right. It turns out that a lot of times the rules could get complicated and then it leads to bugs.

11:19

It also turns out that a lot of times the rules get complicated and it leads to terrible performance. And 1 thing that developers really don't realize, I'd say almost no developer realizes it until they run into it. The work conditions that RLS introduces are not optimized like the work conditions that you introduce. Because Postgres, thankfully, is very good about security, It treats the work conditions in RLS differently.

11:52

I think they have, they call it security conditions or something like that. And they are very, very conservative on how they optimize it and how they plan for it. This has benefits. You get very strong security guarantees, very few bugs as a result. This is fantastic. On the other hand, the plan would be sometimes significantly worse than what you would come up with if you were to look at it really hard and do it yourself. And with RLS, there is basically no way to force the plan you want.

12:28

You cannot set, enable, disable different rules because, again, the main overriding rule is that we're very conservative on how we optimize those RLS conditions. So Some people call RLS a performance killer. I wouldn't necessarily go this far, but you can definitely run into gotchas and you need to be aware that it's not a normal wear conditions that you're looking at.

Michael

13:01

Yeah. Nice.

Nikolay

13:02

So what does Nile offer today and what is the ideal solution to all this in Postgres context of course?

Gwen

13:13

Yeah, so basically we wanted to do maybe 3 things. First of all, give isolation while not degrading the developer experience. So for example, we partitioned data by tenant out of the box for you, completely transparently, because we know that a bit later on, you're going to want it, and it's going to be a pain in the ass to edit. We shard it Transparently. Basically, your database may be spread across multiple different charts.

13:50

We will route the queries for you and make sure that they are working as you should. So, in a way, you get the model you will have anyway in 4 to 5 years, but you're getting it from the get-go and without doing a lot of the work because we are doing a lot of the management for you. The other thing is that we have done some work to basically bypass RLS and still give you isolation. So the queries, you kind of do the same RLS set tenant ID equal. We use that to actually direct queries.

14:33

We rewrite the queries immediately to the partitions that we know has the data for that tenant. So we have a small extension that kind of replaces table names with partition names in the query itself. And this vastly improves performance in the majority of cases. I mean, we've seen it in a bunch of cases, especially if you have slightly weird indexes that RLS may conservatively not use. The improvement is quite stark, depending, obviously, on table sizes.

15:10

You can get a benchmark that proves anything, so I don't want to throw numbers out there. But obviously, if you break down a table with a million rows into thousand tenants with a thousand rows each, then you can show quite a, you can see where I'm going with that. The other thing we did, and that was probably the most work, and this is still work in progress, is allow moving tenants around. Because 1 of the biggest problems is that the tenant gets large or noisy.

15:44

And you want to give it its own machine, moving it is usually a long downtime. If you catch it after the tenant is already large. By doing the compute storage separation, we can basically make it transparent. It's a latency spike while we're holding off some queries, while we're moving things like setting up sequence ideas, moving, pointing into a different compute into the same part of the storage. But it's essentially a no downtime operation.

16:21

So we think it's a huge deal because again, it's just a problem that we keep seeing again and again.

Nikolay

16:30

Yeah, I'm curious, is it all open source what you build or only parts of it?

Gwen

16:35

Right now it's mostly hidden. We have started registering parts of it under a Apache license veto. So yeah, the goal is to open source it and we already publicly declared that it's going to be open source. We have not a date, but a point of completion where we plan to open it.

Nikolay

17:01

Yeah. So I've heard about this, several interesting things here. So 1 is extension for this, I guess it's called, in your documentation, it's called RLS virtualization or how's it called?

Gwen

17:15

We call it tenant virtualization. The extension itself, I think we called it Karnak. We call everything after stuff in Egypt and Karnak is a famous temple.

Nikolay

17:28

So data is stored in separate tables and in the same database, but extension rewrites queries to basically route query to proper table, right? Is this based on?

Gwen

17:42

It's in separate partitions And we rewrite queries to go to the correct partition. We basically bypass RLS, we bypass the planner trying to make those calls. We found out that with a large number of partitions, this is significantly more efficient.

Nikolay

18:03

I see. So it's Postgres partitioning. I see. And you mentioned also another thing you mentioned here is sharding, right?

Gwen

18:12

Yeah. So we use foreign data wrappers to allow... So we have 2 things in the architecture. First of all, we have a proxy, it's a routing proxy. So it keeps track of every connection, which tenant is the current tenant and it routes it to the shard that has the correct tenant in it. And then we also have some cases where some developers want to write queries that touch multiple tenants. Those are not going to be as fast, but we do allow them by use of foreign data wrappers.

18:52

And mixing partitions with our own partitioning rules with foreign data wrappers and still keeping things efficient, we didn't want the planner on any machine to be aware of all the partitions in all the other shards, because it just explodes the planning time in ways that we saw as unacceptable. So what we did is represent each shard with a table and then put hierarchical table inheritance on top of it.

19:30

And the end result is basically a union all between the table with the table inheritance that points to all those other shards, and the table with the partitions. Now this gives us basically predicate pushdown because the planner will push the plan to all those different charts. The other charts know that they have partitions, which the source planner didn't know. They will plan correctly with all the partitions, but they only know about a subset of partitions.

20:05

So we see it's a bit hacky and it's a bit weird to explain. And we think we can do better with some modifications to Postgres, which we have not done yet, but this does give us predicate pushdown, fairly fast planning, and the ability to do queries that cross tenants in situations where this is required, essentially.

Nikolay

20:28

Yeah, if you don't involve 2PC, just rely on front-end wrappers, no two-phase commit. I'm curious what kind of anomalies can happen there.

Gwen

20:40

Yes, and we prevent a lot of things that could cause anomalies. So we do have a transaction coordinator, but in order to not overload and also not overcomplicate our architecture, we limit some things. So, DDL has to be done on a single tenant, and you cannot mix cross-tenant queries. Sorry, DML, INSERTs and UPDATEs have to be done on a single tenant, and you cannot mix in a Transaction cross, so the moment you start a Transaction, you have to know what tenant you're working on.

21:14

And then we route it to the correct shard, which has the correct table, and everything has the absolutely correct guarantees. If you need to do something cross-tenant, you do not involve it with any kind of UPDATE. You could still be exposed to some anomalies, I agree, because it could be ongoing Transactions from other people in other places. So you get the basic Read Committed guarantees that Postgres gives you, I believe, but not anything more than that.

21:47

But again, we believe that cross tenant Queries are rare and mostly done in analytical cases, where you do reporting where it's slightly less critical to have those.

Nikolay

22:01

So you forbid the writing to 2 shards in 1 Transaction?

Gwen

22:05

If you want to write to a shard, it's fantastic. You tell us what tenant you're writing data into, and we will direct you to the correct shard.

Nikolay

22:13

I mean, if there is a Transaction which needs to write to 2 different shards, this is a big problem, because without PC.

Gwen

22:24

Yes, exactly. And we don't let you do that, essentially, in order to avoid anomalies.

Nikolay

22:29

I see. Another question here, have you considered the approach used in Vitesse? As I understand it, maybe I'm wrong. Where mostly for analytical Queries maybe, to avoid distributed Transactions, data is brought synchronously from 1 shard to another. And we have it locally, like basically kind of Materialized U on top of logical replication, for example, or something. And it has eventual consistency approach, of course, but you can just join it in 1 Postgres, in 1 shard, right?

23:00

Have you considered this approach?

Gwen

23:01

Y.K. We have considered it. I think maybe CitusDB has something similar, if I remember correctly. I'm not 100% sure. But yeah, it's something that we were like, yeah, this is a good idea that we may examine in the future. It's definitely, we are trying to build something useful gradually, and we understand that early on, it's almost safer to have a bunch of limitations that over time will resolve, rather than allow people to do something unsafe and also build a kitchen sink.

Nikolay

23:37

It sounds to me like Postgres versus MySQL approaches because MySQL approach, you remember MyISAM? Maybe you don't remember, but it was like quite bad. You need to run a repair table all the time because it's not ACID and so on. It allowed too much.

Gwen

23:58

Exactly, yeah. And Marathon had a lot of issues. I mean, there is a reason why InnoDB became extremely popular.

Nikolay

24:06

Right, right.

Michael

24:07

But also, with a multi-tenancy use case, I think you're quite right that, well, we, I mean, you'll find out soon enough, right, if lots of people want these cross-tenant or cross-shard queries which are by definition cross-tenant queries and if they don't, if you don't need to worry about it, you save a bunch of effort having to even implement that. So yeah, I like that a lot.

Nikolay

24:30

Yeah, last comment here is I'm excited to see that finally Postgres ecosystem receives some tension in the area of sharding. I guess it's just time has come and more and more databases became too large to be handled.

Gwen

24:45

I'm almost surprised. I'm honestly surprised it took that long. I mean, again, if you look at MySQL.

Nikolay

24:51

We just, it's just unfortunate. First time I touched this topic, it was 2006 immediately when we started working with Postgres, honestly. And There was a PL/Proxy from Skype at that time already, but it required you to write everything in functions.

Gwen

25:08

Partitioning didn't exist back then.

Nikolay

25:10

Existed. It was based on inheritance. It required much more manual. It was fun, actually. You understood it better, you know? But yeah, but it was not convenient, 100%. Not super convenient. It's just, I see that it's just unfortunate how it turned out in Postgres ecosystem. And now definitely there is huge pressure. Many companies need partitioning... Need sharding. And it seems

Gwen

25:38

like How would history work? Imagine that YouTube picked Postgres and not MySQL as their first database.

Nikolay

25:46

Yeah Google or Facebook they both chose MySQL somehow Yeah.

Michael

25:52

The test would be for Postgres first, right?

Gwen

25:54

Exactly. It could have turned out so differently.

Michael

25:57

Yeah. Funny, isn't it? Changing topics slightly or LLMs are quite top of mind at the moment, are you seeing people's initial choices change as a result of asking for advice earlier from our robot friends?

Gwen

26:15

Oh my god, we're seeing so many weird things, it's just unbelievable how much things are changing. First of all, we're having people show up on our Discord and say things like, I'm using Nile because my LLM thought it's a good idea. And I don't really know Postgres, so I need some help, but my LLM assured me that this is still a good idea.

26:44

Like a lot of people are just like people who are very much beginners, like maybe I can say when I started developing, the first time I had to use a database, my company sent me to a 3 weeks database class. I think it was Oracle about 20 years back. And I came back a lot more confident that I know how to use Oracle and not to leave transactions open for too long because people will yell at me. These days people don't do the three-week class before they try using a database.

27:17

So you see a lot more, people are using a database earlier on and they do expect more hand-holding from the vendors. Like the LLMs give them advice up to a certain point, but eventually, if things are slow, they will come to you and say, hey, why is my query slow? I'm sure you guys have seen your share of that. I'm also seeing people use Postgres for their LLMs in different ways. And this is really exciting to me.

27:52

People use Postgres via MCPs, people use Postgres with vectors, people building AI applications on top of Postgres. We're seeing a lot of that. And, I mean, personally, I'm really excited that people can program with LLMs, not knowing a lot about Postgres, not knowing even a lot about software engineering at all, and still get reasonable security guarantees. You don't need to know to ask your LLM about RLS or about, Are you sure this query actually properly isolates tenants?

28:36

And it's also interesting how much the results differ when people use different LLMs. Like I would say syncing models do fairly well iterating in order to get good code and checking their own results, again, given via MCP access to a database. I would say that if you use HRGPT 4.0, you will get a lot of random hallucinatory stuff still in your code.

Nikolay

29:08

Yeah, it's funny. I already told Michael, we had the cases doing consulting, But like maybe already almost a year ago. I started noticing that people send us Like some we are building this part of database. Can you review it? We are reviewing we use different lamps supporting this review and then we have a call and I'm curious, code looks great, I mean schema looks great but something is off, right?

29:42

And then we have a call and I see they open tabs and charge a PT, Claude there as well so I realize they used LLM to create schema and then send us for review and we use a lens to review it and then There's like a 4 party process, you know It's it leads to good place, but someone needs to jump with proper expertise and say, this is not a good approach.

Gwen

30:07

This is hilarious. Do you think at some point you and your customer can just step out and let the LLMs figure it out between themselves?

Nikolay

30:15

Well, there's a problem here because like it's great, but there is like it does 80% of drop in 1% not not even 20% of effort very quickly, but there is 20% of problems, which again, like, I don't know, maybe in next few years, it will change. I think it will change. But right now I feel my internal LLM trained much better than than Charger PT.

Gwen

30:39

You trained your own LLM, right?

Nikolay

30:41

Yeah, no, I mean my own.

Gwen

30:43

Oh yeah, no, but I think you actually trained your own.

Nikolay

30:46

Yeah, we experiment. We do some stuff, we experiment. We have some things like we have, we start with fine tuning, moving to our own LLM, but not yet there still. So yeah, there is a lot of stuff can be done there to properly. It's hard to compete with Cloud and they have very high pace. With Cloud 4 release, I see, wow, it's really great. But it's still missing many things you learn from practice, which were not discussed. That's why they don't bring it. Many problems were not discussed yet.

31:24

And you explore them if you have a lot of data and heavy workloads.

Gwen

31:30

So you're saying that even training on the Postgres mailing list doesn't have all the information in it, essentially?

Nikolay

31:39

Well, yes. For example, random problem, we recently touched it. There is a buffer pool in Postgres. And there are 128 basically partitions. So you can have 128 locks. And if you have a huge buffer pool, this becomes bottleneck. And some people say, let's maybe make it tunable, configurable. But when you start researching this topic, you end up finding some recent, well, recent last August and September last year, conversation in hackers mailing list, which is open-ended.

32:17

It's not complete, because somebody needs to run benchmarks and prove that this is worth having new setting. And that's it. There's a patch proposed, but that's it. No experiments yet.

Gwen

32:31

This is, by the way, 1 of the reasons I was so excited about your LLM approach, because if you think about what is the bottleneck for doing a lot of the database improvement things, and I am feeling it Very personally, running benchmarks is hard. Properly planning a benchmark is hard. LLMs can help with that, but again, they sometimes just go off the rails.

32:57

And even if they help plan that, I don't know LLMs that actually run benchmarks to the point where they provision the machines in AWS and know that you have to provision a separate machine as a database and a separate machine, maybe a few of them, to drive the workload and they both need to have appropriate resources. Like all those and then don't get me started on analyzing the results, which is kind of 99.9% of the work.

33:28

So the fact that you actually kind of started your LLM for I have an LLM that can actually do benchmarks, is just, I think this will be the biggest breakthrough in both people tuning their own Postgres, and also Postgres as a community being able to advance the state of the art?

Nikolay

33:49

Let me share sad news. Like, I'm not giving up, but it's a roller coaster. I spend more than we spend, like the team of maybe 5 engineers spent more than 1 year trying to achieve that. We achieved many things, but first of all, we chose Gemini because they gave us credits. And I think it was a huge mistake. Gemini has a lot of problems. Suddenly you have 500 errors, which are so many problems. It's just not mature product, Gemini, and it has hallucinations all the time.

34:27

It's good, for example, for JSON, working with JSON, because when you need to run experiment, we decided to choose JSON as config format. It writes much better than GPT-4, 4.0 and so on. But many things, it just hallucinates. It invents all the time some things. It just makes up results all the time. And we have a system to control it but it bypasses all the time. Like it's silly hard. So then yeah, we experimented with additional DeepSeek, Lama and we've fine-tuned a lot.

35:02

All versions of GPT, all modern fresh versions, we also bring them all the time. And Claude, Claude is much better, we just added it to this system we have. But After 1 year, I decided, you know what, I'm like, benchmarks is extremely hard topic. And we cannot trust it anymore. I mean, we cannot trust LLM to create precise configuration and process results fully. So we decided that LLM is just more like connection thing. When you engineer benchmark, expert needs to engineer.

35:41

I don't trust any LLM for now, because any experiment, we plan to publish maybe 15 to 20 experiments in our blog last year. And if you open our blog, you see just 1 experiment. And even there we screwed up and someone on Twitter said this is not right and we quickly corrected, which is good. And we had achievements like interesting things, bottlenecks popped up here and there. It's really fun to iterate with LLM.

36:11

But once you allow to think, to design experiment and to treat results, in 99% of time we have wrong, wrong results, wrong conclusions and so on. So for now we are thinking, okay, this is just, accelerator of performing experiments, but design and understanding results should be in human brain for now. 100%.

Gwen

36:34

I love the fact that Benchmarks is also hard for robots, to be honest. It's so hard for humans, right?

Nikolay

36:41

Yeah, and we collect so many artifacts, But somehow it's super hard still. So you always think where is the bottleneck? And simple question, right? But for now, it's extremely hard to let LLM find bottleneck and draw proper conclusions.

Gwen

37:02

And honestly, if you just had an LM that always said it's a network, it would be correct about 80% of the time.

Nikolay

37:10

What if it's local? If it's, there's no network, everything through Unix sockets and we had these cases as well. So I agree with you in production. Like in production, yes, but in experiments when we learn Postgres behavior on single machine running pgbench locally, we don't care sometimes. So there's no network there sometimes. So it's hard. So I had many moments of frustration. But it's so good, like I still believe that iterations are great.

37:42

So if you say this is great benchmark, just check it on new version. Just changing 1 thing. This is good. We have automation, we have interface, and it repeats the process of analysis again. This is where LLM helps a lot. Because without LLM, you could just, you have some form, and oh, we don't have this parameter programmed. It's not exposed in the interface. It's bad. With LLM, you have freedom to change things and iterate based on existing good benchmarks.

38:16

So, yeah, I cannot say we are there yet. This project, like, right now is we are, like, thinking about next level of it, where I think we will let LLM, we will give it less freedom, you know, that's the key, and control more by human brain.

Gwen

38:37

Human in the loop kind of thing.

Nikolay

38:39

Yeah, yeah, exactly. Exactly. So design and first analysis, only human should be there. But once you have confidence that you're moving in the right direction and you just need to iterate and expand, for example, to different versions, platforms, everything, this is where you can relax. You already verified results. You can say just repeat, but on different something. This is where LLM already can bring you. It can just speed up everything.

39:06

You can throw it to this benchmarking process and have like 10 experiments running in 10 different versions or something.

Gwen

39:13

And I think this is also kind of almost the general directions that agents are taking shape. I mean, people started to 2025 was supposed to be the year of the agent. I think it's almost becoming the year of the human in the loop with the agent. Yeah. Like all the successful products I see are you tell the agent to do some stuff. You ask it, please plan something. You give it feedback. You then say, okay, now that we have a good plan, go and execute on it.

39:43

You come back an hour later, you had your coffee. Okay, let's see what you've got. Here's some feedback. Go fix some stuff. I think it's always every successful product is a bit like this.

Nikolay

39:57

Right. But sometimes humans start using different LLM when reviewing things, right? Being lazy, right?

Gwen

40:06

Yes.

Nikolay

40:08

That's interesting. So I'm not sure how successful it will be, but my gut tells me that we need to move, move, move in this direction anyway and have some, I don't know, like more experiments and so on. I hope we will have more soon to publish, I mean, and start iterating. But yeah, it was a rollercoaster last year, So now we are rebuilding stuff. We will see how it works. And for example, 1 of the experiments we must do, I think, is to conduct various benchmarks for RLS.

40:44

Because it's obviously... That would

Gwen

40:46

be a fantastic example. Yes. And I mean, this is something that humans with experience are pretty good at finding cases where you're like, is RLS going to actually be an issue and have the stories that then the LLM can go implement and test.

Nikolay

41:06

Yeah, we had several cases and also Supabase has public materials, blog posts about this. Obviously, there is already quite known case when you have like current_setting function inside RLS expression. And this is, yeah, if you select count 1000000 rows, it's terrible. And it's quite easy to fix actually. But yeah, so these kinds of experiments to collect them and see how. Actually, my goal with this experiment I think will be to prove that they are not a problem if you do it right.

Gwen

41:42

Interesting. I would contribute, I think there was a recent post on the bug tracker where basically the optimizer, the planner refused to use, I think it was a GiST index or a GiN index due to the belief that the RLS optimization is incorrect, is unsafe. I can look it up and send it to you. Yeah, it's interesting. But yeah, that can also be interesting. Like if you have a fix for Postgres, then you can, it will obviously be nice to showcase.

Nikolay

42:18

Right. I think also if you, so you don't use RLS, you said bypass it, right?

Gwen

42:25

We bypass it, yeah.

Nikolay

42:27

Yeah, that's interesting. I'm curious if you, in this mixed schema when we have partitions or shards, and RLS doesn't make sense at all to involve RLS additionally locally if some of shards have a mixed pool of customers.

Gwen

42:46

Okay, 1 of the questions that I have in mind, and this is something that we're trying to help figure out for our users, often on top of the tenant, you still have permissions for specific users in the tenant. Like you have an admin that can do anything, and then you may have someone who is not allowed to see some rows at all because they're too sensitive, all this kind of stuff. And our users ask us, they can do it with RLS inside those partitions, or they can do it in their application.

43:19

There is a lot of application level tools or like middleware kind of tools that will do it for them. Is it better to do it in the app layer or in Postgres with RLS is a good question that I don't have an immediate answer for.

Nikolay

43:35

Right. So there are several layers of multi-tenancy basically. Yeah. Not multi-tenancy, but different layers. And If your customer being a tenant for you, they might have tenants for them, but also inside them, they might have additional clusters or segments of users inside each tenant. So it's apartments and rooms inside apartments.

Gwen

44:05

Nice. I like that. Yeah. This is 1 of the things that make multi-tenancy confusing, right? Because it's almost like Matryoshka kind of scenario.

Nikolay

44:17

Yes. This term, by the way, used in Postgres ecosystem multiple times, starting with, you mentioned GiST original paper by Hellerstein mentions that Matryoshka.., RD-tree, Russian doll tree, so basically Matryoshka tree.

Gwen

44:32

That's true. And also in embeddings, they have the Matryoshka type embeddings where you can make them of any size.

Nikolay

44:42

Yeah, I read about this as well. Yeah, it's funny. So yeah, great. And what are your plans for AI? I saw MCP servers already with some integration.

Gwen

44:55

We basically have 3 directions. 1 is MCP server and making it public, giving it authentication. Right now, it's open-source, you can run it on your own, but we're not hosting it, so we should start hosting it at some point. So things that just make people who use LLM, make it easier for them to use it. Other thing we've done that we think is very useful for LLMs, this we already have, just make it 0 time to create new databases. Because LLMs just 0 time and 0 cost.

45:28

Because they love creating a lot of them. Every time something goes wrong, okay, let's try from scratch with a new database kind of situation. So we're making it fast and cheap. The other thing that we're still working on is really to make our documentation more LLM friendly. LLMs.txt is absolutely not enough. It is actually, it creates a very large file. The LLMs tend to get lost in it. We need to figure out how to make it better.

45:59

And then also, everyone is kind of thinking, can we have our own agents? Can we do something around that? We're kind of thinking about that. Something we already have that is useful is just that with the multi-tenant model, the embedding, the vector indexes are much smaller. And this is a huge deal for people building those agents and LLMs.

Nikolay

46:26

Yeah. And actually, you mentioned, so this zero startup goal, you mentioned separation of compute and data. How is that achieved? Using what approach? Can you elaborate?

Gwen

46:39

Oh, yeah. Sorry. This is more than a two-minute answer, but the short one is that you kind of, you patch Postgres, and then you find a better way to do your storage and basically wrap every Postgres storage function with an equivalent with your storage. And then you also need to apply the WAL continuously to the storage layer.

Nikolay

47:05

Great job doing this like in actually 20 seconds. I understood very well. Yeah. So, and do you have plans to make this open source as well?

Gwen

47:16

Yes, absolutely. I mean, this is a first of all, as you know, WAL readers have to be registered, ours already is, and with the open-source license. And then, yeah, we are planning to open

Nikolay

47:31

source our

Gwen

47:31

storage layer, our extension. I think at this point it's about seven different patches we've made on Postgres. I don't think we'll want to open source it as a Postgres fork, because the number of patches is quite small and we are maintaining it. I think everything from Postgres 12 to 18 at this point, or maybe 13 to 18, something along those lines.

Nikolay

47:53

All supported versions, I guess. Yeah. That's great. Yeah. Well, looking forward to checking it out. And actually, one more question for me, maybe last one. Are you open to some benchmarks we probably will do? We plan to do some benchmarks with various platforms. It all started after acquisition of Neon and there's a discussion on LinkedIn. So I thought about trying some benchmarks for different platforms. How do you think, like what are your thoughts about it?

Gwen

48:24

Oh my God, this is very scary. We are benchmarking ourselves all the time. So I'm keenly aware of exactly like what benchmarks make us look good and what benchmark make us look bad. And I think that's also how we react to benchmarks in social media. Unless it's a benchmark that shows something in Postgres itself and like clearly attempting to educate and help people. You can design a benchmark to make anyone in the world look bad.

Nikolay

48:52

Benchmarketing, it's called.

Michael

48:53

You can design

Nikolay

48:53

a benchmark to

Gwen

48:53

make everyone look good. Exactly. So, I will, at any given time, I can publish my benchmarks that makes Nile look fantastic, and you can run benchmarks that you decided how to build them, and may be very realistic and may even expose a problem that we have that I didn't know about. But, yeah, in general, I love benchmarks. I just have opinions on how benchmarks are used by marketing people.

Michael

49:25

Very good answer. For anybody interested in a bit more about Niall's architecture, Gwen gave a really good talk at PGConf.dev recently and the video just went up on YouTube so I will put that in the show notes for anybody that wants a deeper dive there.

49:40

There was also a good talk I saw at a PgDay Paris event by Pierre Ducroquet, I'm not sure if I'm pronouncing that anywhere near correctly, all about multi-tenant database design, especially focusing on something we didn't focus much on today, which was the downsides of schema per tenant design, including things like observability and monitoring, which I thought was really fascinating.

50:04

So anybody considering going down that route definitely check out that video and I'll put that in the show notes as well. So yeah thank you so much Gwen I think we're out of time it's been a real pleasure.

Gwen

50:14

It's been a pleasure Thank you for having me on.

Transcript source: Provided by creator in RSS feed: download file

Episode description

Transcript