pg_squeeze

Michael

00:00

Hello and welcome to Postgres.FM, a weekly show about all things PostgreSQL. I am Michael, founder of pgMustard, and as usual, I'm joined by Nikolay, founder of Postgres.AI. Hey, Nikolay.

Nikolay

00:08

Hi, Michael.

Michael

00:09

And today we have a special guest, Tony from Cybertec, a Postgres developer there, who is responsible for, among other things, the pg_squeeze extension. So, welcome, Tony. Antonín: Hello, thanks for the invitation. It is our pleasure. So, today we are going to be talking all things pg_squeeze. Perhaps Tony could give us an overview of what it is.

00:34

Antonín: So pg_squeeze is a Postgres extension that does similar things like the CLUSTER command or the VACUUM with a FULL option, except that it does not require the exclusive locking of the table. In fact, it needs the exclusive lock only for a very short time at the end of the processing, but it's actually a tool that lets users get rid of table bloat. Yeah, it's hugely powerful, isn't it? So online bloat reduction of tables. Antonín: Yes, basically that.

01:11

Yeah, big, big problem and it's a really powerful tool and I think a lot of our listeners will be familiar with pg_repack, which I think originated from a... Nikolay, you can probably tell me what it was originally called, a reorg or something like that?

Nikolay

01:25

Yeah, something like that. It already doesn't matter, it's an ancient name.

Michael

01:31

Yeah, but pg_squeeze, you also have like a bunch of other features and it's done in a more modern way in terms of how... Maybe you could tell us a little bit about how it's implemented and how long you've been working on it. Antonín: Yes, the motivation was actually to use... Well, it was an assignment, it was a task that I got from my employer and the idea was to adjust pg_repack, the existing pg_repack extension so it can be scheduled, so it can run according to schedule automatically.

02:01

But when I investigated the pg_repack code, I realized that it's implemented in 2 parts. Part of the logic is on the server and the other part is on a client. And I realized that the client part is not really simple. It does some tricky things with connections. It even terminates connections at some point. So I realized that if it should be executed from background worker, it would be quite strange because the background worker would have to use the client library to connect to server.

02:37

So the server would have to connect to itself over client, over the libpq library. So The fact that the client is not trivial made it almost impossible to implement the functionality in the background worker. So that was 1 reason I thought it's better to start from scratch. And once I was considering doing everything from scratch, I got opportunities to do some things differently.

Nikolay

03:08

Yeah, just instead of replacing files, you decided to use logical replication, right? Because By that time it already became quite mature, unlike the moment when pg_rework and pg_repack was created originally. There was no logical replication yet at all. Is this right? Antonín: Yes, pg_repack uses triggers. Maybe I should tell shortly how both extensions work. The common part is that they copy the useful data from the table into a new table and then swap the files.

03:50

Each table is, the table contents is stored in a file, so the extensions copy the useful data and then switch the links to the files. All Right, also accumulating changes in some Delta table and applying them this can be a headache itself. Antonín: This is where pg_repack uses triggers And around the time I was considering writing such an extension from scratch, there was a new feature in Postgres and it was the logical decoding.

04:25

I think the logical replication was not yet implemented at the moment, but I thought it could be useful for this purpose, so I spent quite some time studying how it works, and then I decided to use the logical decoding.

04:42

I considered it more, I will say compact or I didn't like the idea to construct well the pg_repack somehow has to create the triggers so it plays with the SQL with the DDL statements it has to construct the DDL statements to create the temporary table as well as the triggers and I thought this is quite complex and error prone So that's why I preferred the logical decoding

Michael

05:16

I think logical decoding should be lower impact as well in terms of overhead, so it feels like a lighter weight way of doing it as well, to me. Is that fair? Antonín: Yes, I think the fact that it works at a lower level in the stack should even be probably more efficient.

Nikolay

05:36

Yeah, because in case of PjRepaq, any write during initialization of during this massive copy using, I think it uses create table as select. When we do it, all the writes to this table also go to this delta table using triggers. And basically we write it twice and it goes twice to wall while if we take logical recording, we just extract it from wall and it's written once, right? This is already a big difference.

06:05

Antonín: Well, I think with pg_squeeze the insertions into the new table are actually locked too. I think there's no way to... Well, You could avoid the logging if the table was unlocked, but then it's hard to switch. I didn't go that far in this direction, but... I mean, in our case, in pg_squeeze, we just accumulate changes in the slot, and We don't write twice to 2 tables. This is the big difference in terms of how this delta is accumulated. Antonín: I see, I see. Yes, that's right.

06:43

This is obviously a difference. I didn't realize that the original idea was to be able to schedule it and scheduling pg_repack means you do it on client using, I don't know, like regular Kronos or something. In this case, you brought it to, you brought all the interface to SQL and we just can schedule it using pg_cron or something, right? This is the idea, I didn't realize it. Or you can schedule it right with pg_squeeze, I don't know this. Okay, Michael is shaking his head, I understand, okay.

Michael

07:13

No, I'm saying you can, I'm saying you can schedule it within pg_squeeze, it has, like, a pretty much, like, it uses the same syntax as crontab is that yeah?

Nikolay

07:23

I didn't know this I'm not like I use it only a couple of times manually somewhere very long ago so yeah that's great and in general be able to run everything at SQL level, it's good. I mean, not CLI level, but SQL level. It gives freedom sometimes, right? You can do it in application or anywhere, in PSQL or something. Antonín: Yes, and it's good for development and troubleshooting. Yeah, yeah, yeah, that's great, that's great. Good.

Michael

07:55

Tony, I was reading, I think you have a super fan, previous Cybertec employee Kaarel, he wrote a really good introduction post. I couldn't believe it was 2016. I mean also it had lots of these features at the beginning. How's it developed over time and how much have you had to keep working on it? Antonín: Well, I think there was not much functionality added since the initial version.

08:25

I did some optimizations to avoid the restriction it puts on vacuum, because 1 problem is that the xmin horizon, you probably know what it means, must be prevented from advancing for some time. When the data is being copied, then this horizon must stay constant. It must not advance and the other problem is that...

Nikolay

08:53

So, sorry, once again, so what exactly you did, like, if we create a slot, it already freezes our state in terms of horizon until we start consuming changes from the slot, right?

09:06

Antonín: Yes, and I think the initial version did block this horizon for the entire processing, but at some point I realized that it's only necessary for the initial load for the copying of the data then it can be Then it can be allowed to advance Yeah When when it's done Antonín: After the load after the copying is done you only need to process the changes that were applied concurrently, so you no longer need to block the xmin horizon for this stage. This is a huge problem for huge tables.

09:48

If we have a 1TB table, it means that this horizon is blocked for long and it affects the whole cluster because autovacuum cannot delete freshly dead tuples, as we discussed many times on this podcast. In this area, have you considered the idea of speeding up initialization?

10:08

If, for example, we have regular int8 or, I don't know, like some numeric primary key, we could split the table into some batches or like areas, and then we could use multiple INSERT processes which would work in parallel, speeding up the initial load. Did you consider this? Antonín: No, technically I didn't have this idea. Do you mean processing the table in multiple iterations? No, I didn't get that far.

10:44

So you say that with pg_squeeze the xmin horizon is a problem with the processing or in general? It's in general when working with logical replication decoding, it's a problem. Initialization is a huge problem. Sometimes we're dealing with huge databases, we avoid initialization completely. For example, if we want all tables, sometimes we have a couple of recipes how to convert physical replica to logical replica, just because initialization is a headache. It can take sometimes days.

11:18

During this time, xmin horizon is not progressing, and not a vacuum is blocked. So I know some tools. I don't remember. Maybe PeerDB was doing this, basically, virtual partitioning of the table. Not real partitioning, but just kind of partitioning, splitting it to pieces and then loading. If we have very fast disks and a lot of CPUs, but of course we can speed this process up. And this means that xmin horizon is blocked for a shorter period.

11:55

I think, yeah, we could consider in this product, we could consider this, I think, this approach. And pg_repack doesn't do it. It just creates a table as select, so it's a single process and it can be slow, while here we have an opportunity to speed it up. Antonín: This sounds like 1 of the cons we may discuss later, but from this perspective, the pg_squeeze behaves like a logical decoding.

12:28

Yeah, and If we consider this approach in the future, if we think how to improve this, it's worth remembering that we need to synchronize all these sessions, which perform INSERTs of individual partitions, virtual partitions. We need to synchronize them using a single snapshot, right? And so they deal with the same data. How it's called, I don't remember. When you start a transaction, you can specify a snapshot. And there is a function which exports the snapshot name. So we can...

13:01

It's like pg_dump. pg_dump does this as well. Antonín: Yes, pg_squeeze also uses the kind of snapshot that is called historic snapshot.

13:11

It takes snapshot of the database as it was at some point in time, but I'm not sure this, you need for the snapshot to work you need to make sure that the xmin horizon does not advance because yeah because otherwise you must not lose you should copy even the DELETEd tuples that are still visible to the snapshot, because without that you will not be able to apply UPDATEs. If the tuple gets updated, you need to copy even the old version.

13:42

Right, but if the slot is created already, it should guarantee that the snapshot is not lost, right? I think so. Antonín: Yes, this does guarantee it, but I'm just explaining why we cannot advance it more than what is correct.

Michael

13:59

Also, Nikolay, I don't think this is as big a problem because I think you're talking about logical rep, like the slow initialization of a huge database, like the entire database, whereas it will, yeah, but just 1 table And we can be talking about partitions here though, right? If we're talking about, you're

Nikolay

14:21

talking about

Michael

14:21

a terabyte table, if we could get it down into 100 gigabyte partitions, pg_squeeze can work on a table by table basis and only blocking it by partition by partition.

Nikolay

14:32

Well, right, but if it's on partition table, this can be a problem, right? But on the other side, if we're dealing with huge bloat, usually a terabyte table means a hundred gigabytes table. Because while we're copying it, it already squeezes, right? That's what's happening. Maybe this explains why this is not a big concern. And you're right, people should partition huge tables anyway.

Michael

15:00

Anything you've had to do in particular around partitioning, Tony? Or is, I guess it just works like any normal tables? Antonín: No, I think it's just normal. I only checked at the beginning the kind of table must be checked. I don't think there's any special treatment.

Nikolay

15:20

Yeah, unless it's TimescaleDB. I remember there's an issue with logical. There are hyper tables which are kind of like partition tables. Yeah, they're cool and sharp. Right, and with regular partitioning, we just replicate, logical decoding works at partition level, that's it. Not that's it, but basics is this. And I remember there's some issue with TimescaleDB, so I wonder if TimescaleDB is compatible with pg_squeeze. Maybe there are issues there, actually.

Michael

15:55

Do you know, Tony? I don't know. Sorry. Antonín: No, it's OK.

Nikolay

16:01

So I just recently learned there are issues with logical... We had a database with TimescaleDB, we needed to move it using logical, and also on the fly removing bloat, right? But it didn't work easy. So yeah.

Michael

16:17

Good to know.

Nikolay

16:18

I have another question in general. So this is extension, right? This is an extension like pg_repack. Why extension? It could be not extension, right? It could be just some bunch of... I don't know. Is there anything that requires this project to be a Postgres extension? Because if it's not an extension, you can use it anywhere. There are managed Postgres systems which don't have even pg_repack. They don't have it. And people suffer from load.

16:51

I mean, not people, databases suffer from load there, and I see it. And if they don't have pg_repack, they don't have pg_squeeze, we cannot use this approach or we need to implement something similar ourselves. We're using logical decoding, right? Antonín: Well, extension is, I consider extension just a means of packaging, but what is important is that this code is a shared library because it's C code running on server.

17:19

So it must be shared library and the easiest way to get shared library to the user is to make it an extension. Does it answer your question? Yeah, it answers. Yeah, I just think if for example, there is a managed service provider, what should they add? pg_repack, pg_squeeze, or just pg_squeeze? What's our recommendation to them?

17:46

Because I know some managed Postgres service providers which become quite popular, and they lack this functionality, and they also don't tune into vacuum by default, which means growing projects accumulate a lot of load and they need some tools. So what like what do you think they should just consider pg_squeeze as an extension because it's convenient because it has SQL interface, right? Antonín: Yes, I mean pg_squeeze versus pg_repack would Well, yeah, yeah.

18:22

Yeah. So so to me a SQL level interface is an ability to schedule, it's an obvious advantage, right? Yes. Yeah. So, I just said, maybe a rhetorical question, so no need to answer. Antonín: If you ask me, I would recommend pgsql. If someone likes pg_repack, then let him use it. Okay, I have another question. We touched a little bit this initialization and you mentioned some pg_repack constructs some DDL and so on.

19:02

When pg_squeeze performs this initial load, is it possible, like theoretically, not right now, I understand like maybe right now it's not possible, but theoretically, is it possible for example to, because it's logical decoding, Postgres should obviously support this, Is it possible to change some Column type, for example? Imagine we had the Primary key int4 and we want to migrate to int8.

19:29

It's already not a use case of bloat removal, but it's kind of like mutation of Schema, not possible, right? Antonín: No, it cannot, but I remember I was suggested this feature. It was suggested to me, but I think it would belong to a different extension. And 1 thing that makes the pg_squeeze a little bit simpler than it might be is that it assumes no catalog changes.

19:59

So it checks whether anything changed in the catalog and if it did, then the Transaction is aborted and the processing is not effective. So if anything changes in the catalog during the processing, then the processing ends up with error and everything is a Rollback. So if it should change the catalog in addition to copying the data, it will be more complex. So the main reason is simplicity, right? For the sake of simplicity.

20:35

Antonín: Yes, and I think actually it's different features, so I would put it into another extension. Actually, I wrote something similar, it's called pg_rewrite, and it tries to use this, the same logic or similar logic like pg_squeeze to partition table, to turn non-partition table into partitions. I'm not sure how much it is used, but it's another use case. Yeah, I didn't know about this.

Michael

21:06

No, I didn't know about that either. I was just going to give you some context. This came up in a discussion we were having in an episode that we called Column Tetris, because that's what we've heard it called by lots of people. The idea of not just changing a data type but keeping the existing data, like keeping all the existing columns but changing their order for alignment purposes to minimize dead space in between columns.

21:31

So putting all of the 16 byte columns together, then the 8, then the 4, then the 1. So I'm guessing it's the same answer, but you could use the exact same mechanism. Antonín: Yes, the mechanism is another use case for this mechanism but not implemented in the pg_squeeze. Yeah.

Nikolay

21:56

Yeah. Is it implemented, like, can we use pg_rewrite for this or it's only for to make the table partitioned? Antonín: No, that pg_rewrite extension so far only handles the partitioning. But potentially it could be used for this, like table rewrite for some reorganization schema. Antonín: Yes. Interesting. And also, pg_rewrite, while it's doing this, it eliminates bloat, right? Antonín: Well, it basically creates new tables, so yes. So bloat is removed.

22:32

Doesn't this mean that pg_rewrite performs the same job as pg_squeeze? Antonín: Well, but it always needs 1 non-partitioned table and it always creates partitions. So not all users that need to get rid of both also need partitioning. Yeah, I see, I see. I'm just noticing that this project also does the job of that project. Doesn't that mean that maybe it could be a single project? This is my idea. I don't know. I just see the need in this. Sometimes we need to convert primary key data type.

23:13

Not only int4 to int8. I have tables right now which were, for example, people decided, well people, sometimes it was myself, decided to use data type text to store UUID values And this is not a good decision, also because of disk space. And it would be great to convert it to UUID data type easily without headache, taking care of all the things that are required to achieve real 0 downtime.

23:48

For example, sometimes I see people create quite smart migrations, but then they forget about autovacuum running in transaction ID wrap around prevention mode, for example, right? And it blocks deployment, for example. There are many things to remember, right? In such project, to be able to rewrite or to reorganize the order of columns and get rid of bloat, maybe sometimes clustering as well. I know pg_repack supports clustering. I'm not sure about pg_squeeze.

24:23

To reorganize, also supports, right? Antonín: Yes, you can specify index. So I mean, there are many things, but they are united by the single idea that tables should be basically rewritten, right? And partitioning is another use case. Maybe it should be a single, I'm just like thinking. Antonín: Yes, the problem is that this would end up in many extensions and someone must maintain it. I think the ideal case is that this finds its way into the core. Oh yes, let's discuss this. Why not? Why not?

25:05

Let's discuss this. For example, Pidgeot's quiz. Why we cannot propose something like as an alternative to VACUUM FULL and CLUSTER commands, Maybe just an option to them like online, right?

Michael

25:19

CONCURRENTLY?

Nikolay

25:20

CONCURRENTLY. It doesn't matter which words. And everyone already receives this. Like, what are the obstacles on this path? Antonín: Well, the obstacle is a review. I already submitted a patch which modifies the CLUSTER command and VACUUM with the FULL option. Let's see what happens. So the patch is already in the commit fest. Can you tell more about this patch? Antonín: So far it does not contain the scheduling feature.

25:51

That's something rather easy to add, but otherwise it's almost identical to... The functionality is basically identical to the functionality of pg_squeeze, except that it does not change the visibility of the data. If you know the pg_squeeze, when it writes the data into the new table, it uses new transaction ID. So, the data kind of moves into the future. Well, this is similar to what happens with Alter Table commands. If Alter Table rewrites the table, it also uses new transactions.

26:31

So this is a documented anomaly of MVCC. I'm looking at this commitfest, which is called VACUUM FULL / CLUSTER CONCURRENTLY And this is this is great. Actually, This shows how poorly I'm prepared for this recording. I'm sorry, I apologize for this. But this is huge. I promise to look at this and advertise it on my Twitter and LinkedIn. This is great. This is great. Like I think so many people would be happy to have this functionality in core. This is huge work.

27:05

So basically, yeah, yeah, pg_squeeze in core, that's great. Antonín: I would be happy to because I would not have to maintain the code anymore. As I say, if it was incorporated into the core, then this visibility problem would also be gone. The extension is kind of limited. Some things happen lower in the stack. So the extension needs to, it must use the new transaction to insert the data. But if it were incorporated in the core, then it would act just like CLUSTER and VACUUM FULL.

27:52

These commands do not change visibility. And what is the current status? It waits for review, I see, right? So version 5, version 6 already, right? So there's already some progress, some reviews happened and it waits for another round of reviews, right? Antonín: Yes, yes, that seems so, Yes. Yeah, that's great, that's great. So well, I'm excited to see this. I don't know. Antonín: Okay, thanks. That's great.

Michael

28:28

Yeah, for sure. I mean, I did a couple of hours of research and didn't find that Nikolay, sorry don't don't beat yourself up.

28:35

Tony though I've got bad news for you, I think you probably are still going to need to maintain pg_squeeze for a while because the scheduling stuff looks really useful and I imagine I'm looking at the list of features and you might have got the wrong impression listening to this that you value simplicity that it might not be that powerful and might not have that many features.

28:55

But there's so many features to pg_squeeze things that I wasn't expecting to see like it does it Nicola be happy about this it does Analyze on the table afterwards by default you can skip the Analyze, I saw that was a feature but by default it runs Analyze on it and you can even switch the tablespace, so you can when rebuilding I know not many people use tablespaces these days, but that's a feature and I learned something new I didn't

29:21

realize the heap of a table could be in a different tablespace to the indexes that blew my mind But in so I learned quite a lot reading your very thorough readme, which was great. Antonín: Well, I'm not sure how many people use these features, but when I was writing it, I tried to do it in a general way. And especially the analysis is necessary because it is needed to evaluate if there is the bloat, if the bloat exists. So that's basically why it analyzes the table. I think it's analyze after.

30:07

Antonín: Okay, so sorry, I was wrong. I don't remember why. You don't remember why you added such a good feature. I like it.

Nikolay

30:16

Yeah, Speaking of table spaces, I think this maybe not huge but there is some potential that this, this is old stuff, like we used table spaces when, like before cloud basically, because Some disks were fast and expensive, some disks were slow and cheap. So we tried to organize the storage. What I'm thinking right now in the cloud era, it looks like table spaces, use of them almost disappeared. But if we think about tiered storage, which might get popularity over time quite soon.

30:56

For example, if we store a huge heap, a huge table on object storage on S3, just because it's much cheaper, we might decide, OK, we want indexes on, or at least some indexes on a regular disk. I mean, EBS volume, for example, right? I don't know, like it's just my fantasy, but it might have some chances to happen, to be popular in some cases, I don't know.

Michael

31:29

Another different extension. Tony, there's 1 more cool feature of pg_squeeze that I liked. There's various monitoring tables, there's a log of the jobs that have run and an errors table, which I thought was extremely good. And it does retry it can do retries as well. I think it doesn't by default but you can set up several retries.

31:49

That all seems to me really powerful in terms of, you know, as you say, if things if if you're unlucky and a DDL command happens to be 1 on the table just as it was kicking off 1 of these squeeze jobs and it gets canceled, you might want it to be able to retry again quite soon. So these features all seem cool and the idea of an error log as well in terms of having that as a table and having a SQL interface to it all seemed really cool.

32:19

I think that was all from the beginning, so do you remember much about implementing those and what you were thinking at the time? Antonín: The monitoring for me, it was even important for the development and testing. I think 1 reason was to to make sure that those concurrent changes, I mean, that do actually happen during the test. So when the data is being copied to the news file, some applications may do changes. I call these concurrent changes.

33:02

And I wanted to make sure that during testing these changes really happen. So that was 1 reason I did the monitoring. Of course it should be useful for the users if for example it takes too long time it's good to know how many rows were already copied and what was the progress. Yeah. I loved it.

33:26

The only thing I was surprised that you didn't include, I don't think people necessarily need it, but it shows off how much work you're doing is it didn't show me a before and after in terms of like table size I thought there's an opportunity there to show off and show how much success you're bringing but yeah It shows the number of tuples on the initialization, or maybe number of rows, and then number of concurrent inserts, updates and deletes maybe? I think there's like 4 columns here.

33:58

Cool. What are your, like what's the future of pg_squeeze? I guess the core stuff makes sense. Antonín: I don't really think much. I don't really think that far. But then it will be maintained for some time and the next step then might be to submit a patch for the scheduling. But I'm not sure if the scheduling is appropriate for the Postgres core. I think that if this was merged into the core, then I think people would bring up many new ideas. So I didn't really think that far.

34:41

I think there is not many features to be added. I don't have many ideas what else should be added and at this point I'm not too eager to add new features because it works and as I hear sometimes that some important customer is using it then I tend not to touch it anymore. I did the coding the best way I could do but with the positive feedback I tend to not add new features. I'm fine if I'm only supposed to maintain it. Yeah, like for new major versions, I'm guessing there's some work.

35:27

Antonín: Yes, for major versions. Nice. Nikolay, anything else you wanted to ask?

Nikolay

35:34

No, no, I think we're good. Thank you so much for coming and explaining how it was created and how it's going. And I'm super excited, like, with this surprise that there is a proposal to have VACUUM FULL CONCURRENTLY, it's really great. Going to look at it closer. Antonín: Okay, thanks for this feedback. This is like, if it succeeded, it's a logical step for this project I think, right, which can influence every Postgres instance.

Michael

36:10

For sure. Yeah. Well best of luck, thank you so much for joining us and I hope you have a good week. Antonín: Yes, thanks for the invitation and have a nice day.

Nikolay

36:22

Thank you, bye bye.

Transcript source: Provided by creator in RSS feed: download file

Episode description

Transcript