pg_duckdb

Michael

00:00

Hello, and welcome to Postgres.FM, a weekly show about all things PostgreSQL. I am Michael, founder of pgMustard, and I'm joined, as usual, by Nikolay, founder of Postgres.AI. Hey, Nikolay.

Nikolay

00:08

Hi, Michael. How are you?

Michael

00:10

No, not falling for it. Today, we are delighted to be joined by 2 guests. Firstly, Joe from Hydra. Welcome, Joe.

Joe

00:18

Thank you.

Michael

00:19

And also Jelte from MotherDuck. Welcome, Jelte.

Jelte

00:23

Hi. Hi all.

Michael

00:24

So thanks so much both for joining us. We are going to be discussing a new-ish project called pg_duckdb, which very interestingly is a cross organization collaboration. So Jelte, I saw the recording of your excellent lightning talk introducing pg_duckdb to PGConf EU, a nice job. Perhaps you could start us off with how you describe pg_duckdb to a Postgres savvy audience who may or may not know what DuckDB is.

Jelte

00:52

All right, so it's an extension for Postgres and it embeds DuckDB. And DuckDB is an in-process analytics database, like SQL database. So it's a bit like SQLite, but then more for analytics, like big queries with lots of joins and stuff like that. And lots of joins and less like SQLite or like Postgres in general for like smaller datasets, like getting a hundred rows or something like that. It's index scans, stuff like that. Indexes are not something that's commonly used in the DuckDB land.

01:25

It's like, oh, we just scan the whole table, and we compute stuff very fast on it. Yeah, and that is then embedded into Postgres so that you can use that for some of your more hazard queries that don't rely on indexes.

Michael

01:39

Nice, and I'm sure we're going to come back to indexes in a bit once we dive into the details. But Joe, I think the origin story of this project is super interesting. I'd be fascinated. I was reading the MotherDuck blog post and it mentioned that Hydra kicked off the effort. So I'd love for you to share a little bit more about that if you can.

Joe

01:59

Sure, Sure. So our team had been working on a crazy weekend project that we initially were calling pg_quack. It was called pg_quack because well, ducks go quack, but also because we didn't necessarily expect this to work. We kind of were writing it off as a quacky project. So it was a little bit of an internal joke. But the performance of DuckDB was looking super great when it came to all kinds of analytics queries and aggregates like Jelte mentioned.

02:30

And we had this aha moment that we could actually take pg_quack pretty seriously and do it for real in collaboration with folks that know a whole lot about DuckDB. You know, we're Postgres guys by background mostly. So not to get too far ahead, but we decided to build pg_duckdb for real as a production-grade extension together with the folks over at DuckDB Labs, the creators of DuckDB, which is Hannes and Mark, as well as the folks at MotherDuck, like Jelte.

03:08

So we've been working together with them on it, oh, for at least last 6, 7 months now.

Michael

03:16

Yeah, super cool. And I saw a few other organizations mentioned as well. How do they all fit in?

Joe

03:23

Yeah, so this is open source, it's an MIT licensed extension. So for folks that want to participate, or can see pg_duckdb being really helpful in their use cases, they can either tell us directly what they'd like to add or open a PR or open issues. I mean, certainly there are a good number of those as the project is currently in beta. So be gentle with it, but also tell us the things that could be improved on.

03:49

One of the things that was surprising to me was the immediate reaction that we saw to how DuckDB could be useful to accelerating analytics queries in Postgres. And there are a lot of organizations that went. Oh, yes, okay. We want to support this we want to get involved. So that's probably some of the names you were seeing

Michael

04:11

Yeah, I think I saw was it Neon and Microsoft maybe some others as well.

Joe

04:15

Yeah, there's a whole range of folks that are interested in us moving this forward, and they're providing either direct or indirect support for the project.

Michael

04:25

Yes. Jota, from your side, what are you... I should mention both of your thoughts on this, but what are you seeing early users most excited about? Like what features are they most keen to see added or improved?

Jelte

04:40

I think one of the big things is that you can sort of read Parquet files or read CSV files from blob storage, like from S3 or from Azure or from all the big clouds, basically in Cloudflare as well. And I think that's a big feature that people are very interested in. And I mean, there's so many open issues of people wanting a specific thing that it's hard to pick 1 that's the thing people want. But 1 that has come up a few times is support for spatial stuff, like PostGIS things.

05:20

And that works slightly different in DuckDB so we haven't started on that we're first sort of building a bit more the base but that's definitely 1 of the things that 1 of the features that comes up like people wanting to do spatial and let's do spatial data basically. Yeah. But that's not supported right now.

Nikolay

05:38

Yeah. So, 1 of the biggest features is that data can be on, I mean, it's not a feature, it's like a requirement here. Data is on object storage like S3, right? In Parquet format, file format, for example, and it's already column store. But like, I mean, it's unlike Postgres, right? But this sounds like, I remember some attempts to have access to Parquet files on object storage as extension before pg_duckdb, right? And it's also like It could be done in FDW way, foreign data wrapper way.

06:19

I think some SQL interface over those files and that's it. But DuckDB is like bringing something else. I've never tried myself, but I've heard many opinions about very good performance. Right?

Jelte

06:38

Yeah, yeah. So, that's 1 of the big differences between pg_duckdb and some of the other foreign data wrapper based ones. Is that we sort of flip it around a bit, instead of saying like, oh, this table is stored somewhere else. We'll read that using DuckDB or something, or something else. Like, instead We're saying like, okay, we push the whole query, we put it in DuckDB, and then let DuckDB read the Postgres data. So the Postgres sort of becomes a foreign data wrapper, kind of inside DuckDB.

07:12

Does that make sense? So it's more like a query stealing approach, is what Mario from Hydra, 1 of the developers there, called it. It's like, instead of letting the Postgres execution engine do a lot of things, it's like, we don't want the Postgres execution engine to do a lot of things because the DuckDB execution engine is much better for these analytic stuff, like the sorting and the grouping and stuff like that that are joining. We'd wanna do as much as possible inside DuckDB.

Nikolay

07:39

Yeah, that's interesting.

Joe

07:41

Yeah, and also, you know, kind of back to Michael's question a little bit too, A really common pattern for a lot of application developers is to have Postgres, but also to have an S3 bucket. There's a lot of event data that can get generated that really should not live on disk in Postgres.

07:59

We're talking to a company yesterday that had generated, you know, has 2.6 terabytes of event data that they really shouldn't be storing necessarily on Postgres, but they could offload into Parquet format in S3. But that data still needs to be accessible by Postgres.

08:19

Because 1 of the important things is that pg_duckdb can perform the join between heap table and Postgres together with the Parquet files in object storage so that there's not this siloing or needing for ETL processes between Postgres and external analytics databases and S3. So how do you smooth over the differences between columnar format and row-based format?

08:47

And I think pg_duckdb goes a long way to do that because ultimately when you're developing applications, thinking about table format is pretty low level. So if you can avoid that as much as possible, that's probably just easier.

Nikolay

09:02

Yeah, but here I'm confused a little bit, because this thought is like, let's have tiering of storage. And, for example, I remember several cases, big companies where this was needed and the decision was made for scaling purposes of main clusters. We need to offload archive data somewhere else, maybe to different cluster or somewhere else. Maybe sometimes we just delete it.

09:29

It could be, for example, e-commerce and old orders, order history, exceeding for example a couple of years, we don't need that. But offloading this data to S3, this sounds great and I remember there's also extension pgt from Tembo, again I haven't tried it yet, It's in my to-do list. But offloading that, I'm thinking, if I also receive column storage for this kind of data, is it right? Because I still need only 1 row in these cases, right?

09:56

In some cases, I do need like aggregates, like, I don't know, time series, I would love to upload to S3 in column store format, but not in these cases when I just need 1 row.

Joe

10:09

Yeah, I mean, I think it's different needs for different users. If you are a, you know, application developer just trying to build something quickly, then if we can tell you, hey, dump a lot of your event data into Parquet format in S3, or archive it out of Postgres on some increment every 7 days, every 30 days or something, then that's easy enough. Then maybe the thinking about columnar versus row should subside.

10:37

But yeah, I mean, data tiering and data lake is not as different as, you know, they're really not that different. It's just a function of what the user interacts with.

Nikolay

10:49

Right, but my concern is if I need the row store, like I need to find all columns for 1 row to present it. For example, again, orders, right? I need 1 order. If we pull it from archive, I expect not good performance. But if it's column store, I will be concerned about performance to collect all the rows versus row storage. Let's formulate the question differently.

11:12

Let's talk about the use cases where this particular pattern when we need to upload a lot of data to S3 and benefit from column store, analytical queries like aggregates, what kind of data it could be in existing Postgres database? It can be SAS or something, I don't know like what kind of data is beneficial for us to offload to this?

Jelte

11:36

A very common case is like time series data. I think that's 1 of the, that's by far 1 of the most common cases where it's like, okay, at least sort of the old data we store in, I mean, you're not going to look up a single row in some series of events. Or if you do, if it's like a week old, it's not so terrible that you don't look that up too often. So it's like, okay, it's fine. If that takes a little bit longer, make sure fetching a bit too much more data that you actually need.

12:03

If that means that you can do these aggregates much quicker. I think that's 1 of the most common cases, I would say.

Nikolay

12:09

Yeah, that makes sense. And that means that, for example, Timescale Cloud, they have it bottomless timescale of loading some data transparently, so we don't see it, but it goes, some archive data goes to S3, but this is not open source. PjWB is open source, which I like. You mentioned MIT license, that's great, but I see some kind of, not competition, but alternative here. I can choose between going to Timescale Cloud if I need this, or just use it if I can.

12:41

So, I mean, if I can install extension, which is not available in managed service yet. But yeah, it makes sense to me. Indeed, especially if data is old, and I expect it to go to object storage, I definitely not in the usually I'm not interested in single row, I'm interested in aggregates. Right? Yeah, this makes sense for me. Cool.

Michael

13:04

That's a good question though. Are there any, I'm sure you're talking to lots, but are there any Postgres managed services that already offer it in a preview or developer mode?

Jelte

13:14

I mean, Hydra is offering it, but other than that, not yet. But yeah, we're definitely in talks with a bunch of them. Yeah, it's a project that many people are interested in.

Michael

13:24

Yeah, nice.

Joe

13:26

Yeah, so Hydra currently has a closed beta of a fully managed Postgres service that has pg_duckdb extension installed.

Nikolay

13:35

Nice. Let's talk about some numbers, if you can pull them from top of your head. For example, if I need to aggregate over many million rows which are sparsely distributed. We know like in the worst case, for example, if I touch, I don't know, 1000000 rows, which each 1 is stored in separate buffers in the worst case, for example, right? Very roughly, I need to touch, And in the worst case, it's not hits, it's reads. I need to get from disk 1 million buffers, I mean, pages, right?

14:11

And in the worst case, it can be absolutely worst. I can imagine it may be like 1 millisecond for each if latency is bad and we do it in not without parallelization it's 1000000 seconds it's terrible right and this is what I can see in Postgres in some cases what should I expect here if I need for example to calculate sum of of rows like 1 million rows stored in this format? What like is it seconds, minutes?

Jelte

14:40

You mean if it's sort of blob storage? Like how many rows?

Nikolay

14:44

1000000 rows and some simple aggregates like count sum or something. What do you see?

Jelte

14:51

I would say it's generally within a second, I would say, or within a few seconds at least.

Nikolay

14:57

Including data transfer from S3, everything, right?

Jelte

15:01

Yeah, yes. So because it's a per cave format, you only get the columns that you actually requested. So it's already, even if it's a lot of data, you generally just because you have a bunch of columns. So you only grab the data from S3 that you actually need. And that is then sort of streamed through the WB engine to get these, to get these aggregates.

Nikolay

15:22

Yeah, that's, that's super cool.

Jelte

15:25

But I mean, it's like with all these questions, it always depends a bit on what, on what you're doing exactly. But for instance, we, we, we, we like do, there's click bench results. ClickHouse has a benchmark and their pg_duckdb is quite... From the Postgres compatible ones, it's... I don't think it's at the top. The one with MotherDuck support is almost at the top, I think, but the one with Parquet is a bit lower, but still much higher than tuned Postgres, I think.

15:56

I don't know by heart, but I think it's at least 10 times faster than the tuned Postgres version, but maybe even more.

Nikolay

16:04

Yeah, that's great. And I can imagine this opens possibilities to stop saying don't do analytics in Postgres. So this can be a transaction which does aggregates in Postgres if it's within seconds. We don't have negative consequences of long-running transactions anymore, right? It's not only about user experience. It's also about how Postgres performs vacuuming, still, and so on. This xmin horizon problem and so on, right?

Jelte

16:35

Yeah, yeah.

Joe

16:36

Yeah, I mean, pg_duckdb was really kicked off to try and solve the known limitations of analytics in Postgres. You know, not even at huge scales, Postgres can be fairly inefficient for doing analytical processing. So, you know, DuckDB, that's what it excels at. We're trying to get the project to the point where the user doesn't necessarily need to know very much about DuckDB at all. Since these are Postgres users, the service should look and feel exactly like Postgres for the most part.

17:13

And then we make those differences clear if you were to run an EXPLAIN plan, it will show you exactly where and when DuckDB is taking over for that execution if you want to look a little bit deeper.

Michael

17:28

Would that be like a custom scan node or how would it show?

Jelte

17:31

Yeah, that's a custom scan node. And basically, it's always like it either takes completely over or it doesn't take over at all. There's no, I mean, that's sort of one of the things we'll probably want to add at some point, like to say like, oh, this part is actually, needs to be executed in Postgres because DuckDB doesn't know how to do it. Right now, it's either everything or nothing.

17:51

If you want to do something that DuckDB doesn't support, but still needs DuckDB execution, then just look at this query. It's not supported at the moment. But yeah, It's one big custom scan node, basically, that does the DuckDB execution.

Nikolay

18:05

And do you see situations when you need to tune something? There are some tuning capabilities inside DuckDB, right? But if we see only one node in the plan, we cannot do it from there. What's the workflow to understand how efficient it is?

Jelte

18:24

So, I mean, at the moment, the main things you can tune in pg_duckdb are like just that amount of threads that DuckDB uses and the amount of memory that it's allowed to use. Because it's, I mean, because you're running it inside... normally DuckDB really likes to take all the resources of the system and go as fast as it can. So that's, I mean, your laptop tries to fly away and overheats. That's what it wants to do.

18:50

But, I mean, if you were inside Postgres, that's probably not what you want, because you also want to run some other queries. So that's why we have some limits. So I think by default, it will try to use as many threads as are available, just so that it completes quickly, but at least the memory is trying to limit. But you can lower both or increase both.

Nikolay

19:12

And we can run this queries on Postgres standby nodes, right? On replicas?

Jelte

19:18

Yes, you should be able to. I don't think I've personally tried yet, but I think that should be fine. And if it doesn't work, please open a BR or like an issue.

Nikolay

19:28

I'm already thinking about a common approach, like I would say a traditional approach right now is to have OLTP database Postgres for everything but for analytics we offload data to some another system sometimes. Or like with Hydrate it's a different table with column storage as I remember it's also like additional effort to maintain data synchronized.

19:54

Here if some data, okay, again, maybe also here, maybe it makes sense in some cases to have original table stored in Postgres and raw storage, but this table, maybe with maybe projections, some Columns only, go to this additional table which is stored in this format in S3, Parquet, DuckDB processed, right? And this is also an interesting question. Do you see it? But I wanted to ask a different question. So if we do this, we can run analytical queries right inside.

20:26

And I already see that it makes sense if it just selects to do it on replicas, not to, like, primaries. We have only 1 single primary in a single cluster, so it's not good to take memory for this, and so on, and affect all TP workload processing. In this case, it's great that if it's fast, because host-and-buy feedback is on, like, spin-horizon is almost not affected, great.

20:52

But there, I probably would like to give all memory, all CPU power for this processing if it's maybe it's dedicated replica. Since I'm new to this, sorry if my question is not right, but what about compression? So do I have any capabilities to choose compression method and ratio and so on in case of Parquet?

Joe

21:17

Well, if you were to copy the heap table to Parquet format using pg_duckdb, then, okay, you know, I think you'd see what, 5 times compression.

Nikolay

21:33

Depends on data, right?

Joe

21:34

At worst, depending on the data, 5 times, sometimes you see 10 times compression. But this can also be the case for the other columnar formats as well. pg_duckdb supports the Iceberg reader as well as the Delta table reader. You know, you could use Delta tables or Iceberg files in object storage as well. All of those will have better compression than, you know, than just standard Postgres heap tables.

Nikolay

22:04

Yeah, this is what I expected. And I'm just... As I understand, there are no control, you just get it, that's it. For now, you can...

Jelte

22:14

I'm not entirely sure. It might be that the copy... I mean, we sort of expose all the copy options that DuckDB exposes. So I think they might expose like a compression option that you can say like which compression type you want to use to write the Parquet file?

Nikolay

22:33

Yeah, because usually it's a trade-off between CPU utilization and the size.

Jelte

22:39

Yeah, yeah.

Nikolay

22:40

Yeah, this discussion of CPU triggers this question in my head, and I know Column storage is beating Postgres completely in terms of compression. I remember a funny case when I said 1 billion rows is 1 terabyte, but VictoriaMetrics CTO said, how come it's just 1 gigabyte? And we could not understand each other, right? Because in Postgres it's indeed 1 terabyte. But yeah, 5X, you said it's the lowest, that's great. And I think it also cost saving.

23:15

And it's not only disk saving in terms of if you're offloaded to S3, but also memory, right? If we have 1 primary multiple standby nodes, we like, in many cases, the shared buffer pool and page cache state is mirrored between standby nodes. And if this is data, which we don't need often, it's not good in terms of cache efficiencies right if you're if loaded if you're floated to this you like less memory to process your workload and this is probably a huge cost saving as well, right?

Jelte

23:48

If you offload the computer to standby now, if you mean,

Nikolay

23:52

or are you offloading? In general, we take some data, we offload it, like if it's archived data, we don't need it often, and it also have some patterns where column storage makes sense. We offloaded to S3 using this approach. And if it's processed by a DuckDB, like, well, it still needs memory, right? It still needs memory if we often execute some aggregates, so we consume memory on Postgres standby anyway, right?

Jelte

24:19

Yes. Yeah, but even then the memory can still be less than it would be on Postgres tables, because also in memory sometimes it's compressed. Like if there's many values that are the same, sometimes it will just distort the amount of values and the actual value. It's just like repeat end times, basically. So that also makes the compute... That's 1 of the tricks that this column storage engine generally uses to improve this compute, and also the memory usage.

Nikolay

24:50

Right, and again, if you do it in row storage, we pull all those blocks with all columns and very inefficient in terms of memory consumption.

Jelte

24:57

Yeah, and they all have the exact same value, basically.

Nikolay

25:00

Yeah, yeah, yeah. So this makes sense completely to me.

Joe

25:05

You know, the common pattern for so many, so many SaaS companies is that really only the last 30 days of data or so is being updated very frequently. But then after that, the data maybe shouldn't be deleted, but it should be accessed. And it can live in object storage, which is 2 cents per gigabyte per month, as opposed to on disk and Postgres, it can be 10 cents, 15 cents per gigabyte per month for some services.

25:39

So not only is it cheaper from a cost perspective of the raw gigabyte per cost, but also the compression as you mentioned just earlier is quite good. So you can substantially archive data, still have it accessible to the application. And pg_duckdb also does support caching.

25:59

So you would be able to cache the parquet files from S3 to where Postgres is, so that you can avoid running over the network to connecting to S3, because that would be really 1 of the major costs, is running over from Postgres to S3 to retrieve those files.

Nikolay

26:17

Are you talking about caching in memory or on disk, like a BUS volume or something?

Jelte

26:22

On disk caching. Basically, it downloads the file and then it's like, oh, this 1, we were going to use it a lot of times, it doesn't change.

Nikolay

26:32

Yeah, that's interesting. And also, I noticed that, of course, rights are supported, right? So we can update this data, right? Insert update and so on. And in this case, I'm very curious about the performance of writes as well. Is it like, how to compare it? What's the sense of it?

Jelte

26:57

I think writes are, I mean, you can do, like you can store like a Parquet file, but I don't think we currently support updating values in a Parquet file. We do support the real-life copy. You put a query and you put all the results in 1 big Parquet file. Or, yeah, that's the main writing to Blob Search to support. What we do support is also like writing to MotherDoc which is like a sort of managed DocDB. That's I mean that's the company I work for.

27:28

It's like we have like DocDB in the cloud basically. So there we do support inserting into tables that live there and updating them, like sort of more similar to how you would insert it to Postgres table. Does that make sense?

Nikolay

27:45

Yeah, it makes sense. And additional question, if I insert data, can I do it inside a complex transaction, which probably inserts data to a regular table in Postgres?

Jelte

27:58

So, right now you cannot write to both a Postgres table and a MotherDev table in the same transaction, specifically because when a crash happens, I mean, it kind of works. It's just that when a crash happens, maybe you only wrote to 1 of the places. So that's something we want to protect against.

Nikolay

28:18

Because 2PC would protect here.

Jelte

28:20

Yes, 2PC would protect here, but 2PC is not trivial to implement.

Nikolay

28:25

And it's super slow as well. It's not fun to use as well.

Jelte

28:29

Yes, So there's definitely something you're considering for like a future version. But for now, it's like, okay, you can write to MotorDoc and Postgres. You can write to MotorDoc in a transaction, you can do multiple writes, but you cannot write to Postgres in that same transaction.

Nikolay

28:47

Will it just produce an error or just, you mean, it's not good?

Jelte

28:52

Yeah, no, right now it will produce an error. Like it will say, like, oh, it's a protection against, like, maybe we'll lift, like, we'll have a setting to say, okay, you know what you're doing, or you don't care about slight inconsistencies in case of crash, because it will still roll back, like the happy rollback path, or query fails or something, it will still roll back.

29:17

That still works, but actual crash cases where Postgres really really goes away without sort of doing its normal cleanup yeah you have some race conditions where then you write only to MotherDuck and not to Postgres.

Michael

29:35

1 core feature we haven't talked about yet is creating temporary tables within Postgres using the DuckDB columnar format. When would it make sense to do that versus caching a parquet file locally? When are you seeing people use that?

Jelte

29:59

I don't know if I've seen many people use that. The main reason... 1 of the main reasons it's implemented is because we wanted to add MotherDuck support. And this was really easy to also... I mean, it was like, we sort of get it for free. So that was... It wasn't necessarily like, oh, this is a feature we super need. But it was like, oh, this is like that lines of code extra, and now we have temporary table support. And it also makes the testing a bit easier, because then you don't have to...

30:32

Like, for the MotherDuck testing, you need an account, and you need everything, so you can have this... These DuckDB tables, you can have them locally. Cool. Yeah, but 1 of the things that could be useful for us, like for like ETL kind of jobs, where you have like a, like you create some temporary tables, you do 1 query, you do 1 query, create a temporary table to do some local transforms, you create another temporary table to do some transforms, and then you store somewhere else.

31:01

So it's also not completely useless or something. It's just that it wasn't our main, like, oh, we need this. Yeah, so that's it.

Michael

31:10

I think that's what I find. Like, I found this project quite confusing to get, and not just this 1, like all DuckDB Postgres related projects, like I find it hard to get my head around, and partly because it's like a, feels like a bag of features, like firstly I can copy to Parquet format and I can offload queries, like I can query through Postgres to Parquet format, but I can also execute a query that's completely local to Postgres using the DuckDB engine.

31:40

That's a totally different feature, but it's also the same extension. It took me a while to get my head around that these are different use cases, but all provided by the same extension, if that makes any sense.

Jelte

31:54

Yeah, that's definitely... I totally understand that it's a bit confusing. 1 reason, I guess, is because DuckDB can do so many things. Yeah. So it's like, okay, and you kind of want to be able to combine all of these things. So it's like, oh yeah, you want to join a Parquet file with, you want to read the Parquet file and join it with some Postgres table to, for instance, sort of hydrate the user IDs with the recently updated user information.

32:23

So that the username is actually what it is now and not what it was like 2 months ago when the record was created in the Parquet file. And that sort of means that DuckDB really needs to be able to read Postgres tables. But then, I mean, if you want, you could also just make it read Postgres tables also fine. So it's, I think both, I mean, both Hydra and MotherDuck, we had some sort of, we had some idea about what features we think are important.

32:53

And I mean, you start a bit from there and like, oh, these are the use cases we want to support. And then how can we glue all those use cases together in a way that works nice. And then you also end up with some other use cases you didn't even think about.

Joe

33:08

Yeah, it's a really good point. I've seen Postgres itself described by many people, even myself, as a multi-tool type of database. With adding DuckDB into Postgres makes it probably even more of a multi-tool database where the average user is not going to use every feature, nor should they try to. But if there's something there that's useful to them that just kind of comes along with DuckDB, certainly there's no reason for us to try and remove it or disable it.

33:42

You kind of get all of these DuckDB cool features along with now Postgres, just as a complete kind of all-in-one multi-tool type database. In fact, I've kind of started to use the term like multi-store a little bit. I think single store and unistore was taken. So, you know, but really what's what's happening is you have 2 different databases kind of working working together and the data is also can be stored in multiple formats both in, you know, heap as well as Parquet.

34:15

So, I mean, that could be another way to think about it.

Michael

34:19

Yeah. Yes. Because of time, I'm starting to think about wrapping up. Do you guys want to talk a little bit about what you're working on at the moment or what your plans are for the future?

Jelte

34:28

Sure. So I think there's a, I mean, we released like the latest version like a week ago or something. And the things we're working on now, we're working basically on 2 big things. 1 is redoing the whole way we read from Postgres, because before we were sort of doing low-level reads, sort of like the actual bytes, sort of re-implementing the heap reader in a sense. But that means you don't get index support automatically, then you also need to implement the index reader and stuff like that.

35:04

So, so now we're trying to more sort of build a query again to do, to do that, like to build a Postgres query again, and then give it to the Postgres planner and then make Postgres figure out, and that's, that's working very well because then, then you get foreign data wrapper support, so that you can call foreign data wrappers from Postgres from within DuckDB again. So it's definitely all the way down. So that's 1 big feature.

35:33

And the other is, 1 thing people have been running into, sort of we've seen people run into in the first 2 versions is that to be able to read from Parquet, like you use the DuckDB functions to read Parquet files and reads Iceberg files, Iceberg catalogs, you need to know what the columns are. And right now you need to specify those columns and the types in your query, which is a bit of a hassle. So I'm working on a way to make it so that you don't have to do that anymore.

36:04

Because Postgres wants to be very strict about these things. And the DuckDB, like, there's 1 of the big differences between DuckDB and Postgres. Postgres is very strict about certain things. And DuckDB is like, oh, you typed this, you probably meant to do this thing. This kind of makes sense. I can understand what you're doing, what you're trying to do here. Let me give you the sensible result. And Postgres is much more like, oh, this is not exactly how I expected it.

36:28

So, that's sort of where there's a bit of a problem. So I'm trying to make the UX there a lot nicer. And I have a way now that's working quite well, but it needs a bit more polish. So yeah, probably after Christmas or after New Year's, I'll also create a VR to get that easier to work with. So you could just do SELECT * from read CSV and .f2, or read parquet and .f2 to specify the callouts. That's sort of the user experience we want.

36:58

But right now you need to do SELECT * from read parquet and then specify all the callouts that you actually want to read from the parquet file, which kind of defeats the purpose of the star.

Michael

37:09

Makes perfect sense. And how about from your side, Joe?

Joe

37:12

Yeah, I think Jelte covered it well, as far as the way that Postgres scan tables. That should be merged in, I think, relatively soon, and then we'll let it bake before the next you know official release. As Jelte mentioned, we released a past version just a week ago so we have a pretty good clip at this point of shipping new releases. If there's something that folks want to see, certainly head on over to the repo and open an issue or let us know.

37:42

And then as far as Hydra goes, as I mentioned, we have a managed cloud Postgres service that's in beta that has integrated pg_duckdb. So we're looking to roll that up out into public beta, so people can begin to use pg_duckdb without having to self-manage it. Today, you can obviously take pg_duckdb and put it on EC2 box with Postgres or whatever you want to do. So yeah, looking forward to shipping that out.

Nikolay

38:14

Yeah, that's great. I just thought, like, 20 years ago, MySQL was dominating in open source database field, and they have several engines. I remember you needed to pick up, 1 was not crashing, and the other was had the full tech search. I mean, my son and then the DB, it was, yeah, trade off. And now like 20 years later, we talk about various engines, basically, cloud friendly, because it's, you can upload it to object storage.

38:53

It's amazing that we can combine these in Postgres right now, and it looks like it's only starting, right? A lot of stuff can be added, developed, and so on. It's only the beginning as I see it.

Joe

39:08

Yeah, it's pretty interesting to see Data Lake come to Postgres. It goes to further your point, which is many of the layers of the modern databases are being pulled apart. There's multiple engines, there's object storage, there's compute with Postgres. So I think that trend is continuing to play out.

39:30

As far as the average app developer is concerned, like I was indicating before, I don't think they necessarily need to know or care so much about the differences between a database versus a data lake. So hopefully, The interface being Postgres helps bridge the gap of using something really cool like DuckDB, but together with what you're used to using in Postgres.

Nikolay

39:54

Yeah, that's cool.

Michael

39:56

Nice 1. Yeah. Well, thank you so much both for joining. Thank you, Nikolaj, as always, and catch you soon.

Nikolay

40:03

Yeah. Thank you guys.

Jelte

40:05

Thank you.

Transcript source: Provided by creator in RSS feed: download file

Episode description

Transcript