Column Tetris

Nikolay

00:00

Hello hello, this is Postgres.FM, episode number again, I don't remember 126, great, more than 2 years and my name is Nikolay Postgres.AI as usual My co-host is Michael pgMustard. Hi Michael

Michael

00:17

Hello, Nikolay.

Nikolay

00:18

How are you doing today?

Michael

00:20

Yeah, good. How are you?

Nikolay

00:21

Great as well So let's play some Tetris column Tetris. I mean talk about this game

Michael

00:29

It's not a very fun game. Is it I like Tetris, but Column Tetris not so much.

Nikolay

00:34

Yeah well I think you can know 0 about it then you usually get excited about it and then you kind of think okay it exists but at some point you stop playing too much. Let's unwrap it. What is it? So columns can have different sizes, Data types have different sizes.

Michael

01:02

Yes.

Nikolay

01:03

And most folks say, for example, integer, big int, small int, but I prefer saying int4, int8, int2, because you see exactly how many bytes it occupies. Timestamps, regardless of with time zone, without time zone, they all occupy 8 bytes. UUID, I think, 16, right? It's huge. Or 8 as well. I think 16. I always forget.

Michael

01:32

I actually can't remember.

Nikolay

01:33

Yeah. Boolean, which should take only 1 bit, takes whole 8 bits, so 1 byte, right? It's a very wasteful way to store your flags, like true-false values. What else?

Michael

01:49

There are the big ones, aren't there? I guess you've got varying length, all the numeric ones. In fact the documentation is great listing what they all are but the other varying length ones I think of the other big category

Nikolay

02:02

yeah Marlena like text and watch our and dress on a race just some be of course all these guys they are variable length and it's different story. They are not fixed size data types right but these data types which have fixed size it's interesting to see that sometimes just reordering columns you can save disk space and get better performance

Michael

02:31

yeah I would say most people including myself when they first learn this find it very unexpected

Nikolay

02:37

Yeah, I agree Same it was long ago, but yeah, same

Michael

02:44

And mine was probably more recent. Yeah But yeah, so I don't know if you want to get into any of the details as to why I don't fully understand the reasons why but my understanding was reading

Nikolay

02:58

let's just say it's something about efficiency of how CPU works and if it's 8 bytes worth, it's better for performance to read and write this number of bytes, 8 bytes in modern systems. But it was a surprise to me to realize, you know, we talked before, the recording, the alignment happens not only to 8 bytes, but also sometimes to 4 bytes and to 2 bytes. So, yeah, but let's have an example.

03:32

For example, if you have, very typical example, you have a column, 4 byte column, which is primary key, say ID, integer.

Michael

03:43

Which we don't recommend.

Nikolay

03:45

Yeah, we don't recommend, and I always, like, you know, it's interesting, like you see something, you learn something usually by doing some mistakes and some going through some pain in production and so on and then and then you start recommending something but still people use integer for primary keys for example Supabase and OrioleDB they just had great release it was on on top of Hacker News but by the way congratulations it looks

04:14

great benchmarks look great beta 7 right yeah and I admire Alexander Korotkov a lot like he helped me personally many times. And for example, recently we had a discussion about track planning and we didn't mention that there is a good thread and discussion of sampling to mitigate overhead from pg_stat_statements. I discussed it with Korotkov some time ago. I dropped a line and he immediately commented on that thread. So it's cool. It's cool. I mean, great.

04:49

But he used integer for primary key in examples of benchmarks. But you know, when table is named like, I don't like users, can we have 2 billion, 2.1 billion insert attempts to such table? Maybe yes. I don't know. It depends on the size of the project. But anyway, we don't recommend using integer for primary keys as you mentioned I I stay at this point again, but I'm less aggressive judging people when they use integer 4 primary keys.

Michael

05:21

So a lot of the argument for int4 primary key versus int8 would be saving space. But due to this alignment padding issue, if you're going to follow it with an 8 by 8 type.

Nikolay

05:34

Create a timestamp, for example.

Michael

05:36

Yes, exactly. Super common. If that's the order when you create your table, you're not saving space even, other than, well, and we'll get to this, other than maybe in some indexes, you're not saving space by using the smaller data type. So I think it's really fascinating. And as I said, for me, it was unexpected.

05:55

Not because I knew better, but because I think in like, like when you create things in any other, in most other formats, like you see their length grow as you add things together and you don't

Nikolay

06:06

you

Michael

06:07

don't consider that the lower level structure is going to be like grouped into little pay I guess I guess the 1 the 1 exception is when you're creating maybe maybe some PDFs or something like let's say you wanted to print out the documentation into PDFs, you probably would want to start new chapters on new pages, for example. There's some formatting when you do books and things. You don't want the chapter to be at the end of a page. So it's that kind of thing, right?

06:35

It's kind of shifting data to start at a new point to make things easier for the thing.

Nikolay

06:42

Yeah, when you use Google Docs, for example, you insert a page break. Yeah. Yeah. Similar. A good analogy, I agree. Back to example again. If it's int4 primary key and then create it at timestamp, and then for example, I don't know, like org ID or something, group ID, also int4, or int8, doesn't matter. At storage level, what happens between Id and created, between the first and second columns, Postgres will just fill it up with 4 zeros.

07:16

And we can see it using pg_hexedit on Linux if you have Ubuntu or something. You can have this graphical interface which works on top of Page Inspect extension. You can see zeros with your own eyes, right? It's from Peter Geoghegan.

Michael

07:34

Is that the Peter Geoghegan? Yeah. Yeah. Yeah. I've seen him using it.

Nikolay

07:39

Yeah, I used it a few times, but I remember it required Ubuntu. I'm not sure maybe it's possible to run on macOS right now, so I use it in virtual machines like it would be overhead to run.

Michael

07:50

But these zeros are that padding yes?

Nikolay

07:53

Yes since the second column is 8 bytes for efficiency Postgres adds 4 zeros so the first column also basically takes 8 bytes instead of 4. So 2 first columns, they both take 16 bytes instead of 12. And it means if you used int8 or bigint, primary key, it would be the same in terms of storage here, as you said.

08:27

And we do recommend using integer 4 because who knows, maybe you will achieve 2.1 billion for your sequence which is generating the numbers and once you achieve that, sequence itself is 8 bytes always, it's okay. But if the column where insert is 4 bytes, with zeros always, you will end up having problem because you cannot insert into it anymore. Because larger values, larger than 2.1 billion won't be possible to insert, because the capacity of 4 bytes is 4.2 billion and we have signed int.

09:09

It's a bad situation because a big surgery will be needed on this table. It's really not an easy topic. And we touched it a few times in the past. But what I didn't realize until very recently is that padding can happen also to 4 bytes and to 2 bytes. So If you have, for example, int2, int2, int4, 3 columns, they take 8 bytes. But if you have int2, int4, int2, moving 1 of the int2s to the end, the first int2 is going to be padded to 4 bytes, then 4 bytes, then 2 bytes.

09:51

So overall it will be 10 bytes. This was a surprise to me because I thought they will take only 8 bytes. It's not so. Padding happens at different levels. 8 bytes, 4 bytes, and also 2 bytes, as you said. Same thing if you use Boolean, and you can see similarly that it's going to be padded either to 2 or 4 or 8 bytes depending on the subsequent column's

Michael

10:21

data type. Exactly, always the next 1.

Nikolay

10:24

And also, to make the picture complete, if we have, for example, just... The First example we used. Int4, primary key, createdAt, timestamp taking 8 bytes, and then, for example, I don't know, like orgID or something, also 4 bytes. So padding for first column is going to be up to 8 bytes, then we have 8 bytes. And in the end, it also will be padding, but not for the tuple. It will be padding for before next tuple. So overall, we will end up having 3 8-byte wards, so it's 24 bytes, right?

11:07

And just moving the third column to the second position, we'll go down from 24 bytes to 16 bytes. Also there are 24 bytes for tuple header, which is actually 23, but 1 byte is not used, as I remember. But it's also padded always to 24 bytes. So overall, we will have 48 bytes per tuple instead of 40. And this is amount of saving per tuple we can achieve if we just move, we just back 2 int4 data types together into 1 8 byte position.

11:48

And this will contradict with our recommendation to use int8 primary keys, but again the reason for int8 primary keys is to avoid the risks of surgery for the table. If you do know you will never achieve for your sequence value, you will never achieve 2.1 billion value. Never ever. It's okay. I could even use 2 byte integer, small int. If I know I want to achieve 32, 000, right? I don't know. Something like this.

Michael

12:23

I'm not good at factors.

Nikolay

12:25

It requires careful planning, so you need to think about the future and all possibilities. My rule of thumb is just using for int8 and that's it but back to this additional example this is kind of significant saving right from from 48 to 40 bytes It's kind of 20 percent, right?

Michael

12:47

Yeah as a percentage, but then I think I think where this becomes meaningful is when people have huge tables so we're talking about only a small number like bytes and this day and age even RAM's not that expensive like it's expensive at scale but we can get quite powerful machines for not very much money these days and loads of storage for not very much money so I think this program

Nikolay

13:13

because the storage is not only about storage it's spamming RAM not with zeros

Michael

13:18

I am a huge fan of doing things efficiently and I but I think at scale it's where it starts the investment in this kind of thing starts to pay off and I remember a blog that 1 of my favorite blog posts on this topic is by the team at Braintree and they mentioned by going back, like once they discovered this phenomenon, by going back through their largest tables they calculated it was about a 10% saving on disk space from where they were

13:46

before to to a more optimal setup so I believe that I don't see any reason why that that wouldn't be true for a lot of organizations that hadn't done this deliberately in the past or people hadn't been conscious about this but it's a 10% a lot in the by some benchmark.

Nikolay

14:04

Noticeable. It's noticeable.

Michael

14:07

Yeah, but it's also not like sometimes when we talk about performance, we often talk about orders of magnitude and when I don't think we're talking about that in most cases.

Nikolay

14:14

Still, you know, like again, like there are stages of impression here and minor counterpression. Well, it's important problem, but I don't know, like it's It's worth thinking about it when you first design a table, which will obviously be large in the future. I know, for example, GitLab, they have public documentation and there is a Column Tetris page there. I know it's a rule to check it. If it's a new table, to always check it. Maybe there is also some automation, I'm not sure.

14:45

But we have bloat as well in different dimensions. This dimension is kind of like programmed bloat in columns level, but we have in rows level we have bloat there. And 10% is probably… for example, if I see extra 10% of load in both indexes and tables, it's not triggering action from me. Exactly. Right? But if you design a table from scratch, of course, it's worth thinking about it. Yes.

Michael

15:16

Mostly because it's so cheap to think about. And I guess we'll talk about this in a moment. It's not a difficult equation to work out how you should order things.

Nikolay

15:27

And it can be automated. We discussed also before recording and I have report, I have analysis script in my postgres_dba old toolkit, which will suggest to you. It's easy. Just start from the largest or from smallest, doesn't matter. Maybe it matters a little bit. Actually start from 1 byte columns, then put 2 bytes columns, and so on. Or vice versa, start from the biggest size bytes and then smallest.

15:56

It will provide you, it will reorganize columns so you have less cases when you see zeros in physical storage. But don't you think it makes Column Order weird? For example, ID can be not at the first position and it's so strange to deal with such tables.

Michael

16:18

Yeah, I haven't had to do this optimization myself.

Nikolay

16:22

I did it a lot.

Michael

16:24

I think I would start large. I think I would want to start at least, maybe not largest to smallest, but at least the eights first, or at least a group of eights first, because I'd want my ID in the first place. If you're doing SELECT * [from a] table, you're just doing some testing. Just having that ID in those first few useful columns first is beneficial for me as a user.

16:48

But also, I was reading the Braintree post, and they've done some optimizations around, they've suggested putting the ID first because it's going to be read the most so in terms of unpacking the row Apparently there's some optimization around that as well. Okay. So that, like, so they also, they also recommended some other super minor optimizations because they've built tooling around this.

17:11

Like for example, within your, let's say you've got a bunch of 8 byte columns, within those you're probably better off putting your not null columns first because they're more likely to have data in them. And then putting your default columns a bit after that because they will almost always have data in them.

17:30

So like there's some of these, like probably right at the edge in terms of optimization, but now we're not talking about storage They don't we're talking about Like we're talking about read performance So it's and and actually I owes a good point because if your data is taken up less space some of your queries are going to also require less IO and that's faster cheaper all the all good things right well it's

Nikolay

17:55

this is I didn't see that did they show some benchmarks to prove that there's a reason I mean, foundation for these decisions to reorder columns. It's interesting, I need to check that. I didn't know, I didn't see it. But yeah, I think probably you're right and I need to adjust my recommendation logic and put 8 bytes before 16 byte columns because ID should be there, and the created add should be there, and so on. Although if it's UUID type, Again, it's going to be 16 bytes. I don't know.

18:36

Anyway, this is the topic. I think it's super easy not to know about it, obviously, but it's also super easy to overestimate the benefit of paying too much attention to it. But interestingly, what happened recently with Postgres versions 17.1, 16.4, 5, a couple of weeks ago they were released and caused, for example, TimescaleDB to not work with them.

19:04

What happened in some structures, there was additional Boolean type which was added, and physically, like physical layout changed in minor version of structures and causing some extensions like TimescaleDB to stop working. And yeah, it was not a good situation and I think lack of testing, but more interesting like next week the deploy new releases so 17.2, 16.6 and others, they fixed this problem. And actually, originally, Boolean value was added to fix some problem.

19:46

It was a fix, because in minor versions of Postgres, unlike some other systems, Postgres doesn't release new features. It's only about fixes. But this fix caused incompatibility in terms of physical layout. It's bad. And how next releases fix it? This Boolean was moved to different location in terms of order, where some padding happened, where some zeros were present, right? Feeling 1 of those zeros. So Column Tetris, as I've tweeted, Column Tetris well played, right?

20:27

So, I mean, Postgres hackers themselves played Column Tetris a few weeks ago, fixing some problem, fixing a bug which was another fix of another bug. That was interesting to observe.

Michael

20:43

I think it's worth discussing, We've discussed a bit why you would do this, like save on disk space, less IO, better caching, like better use of RAM. But I think when is also useful to discuss. Obviously, we can do this when we're creating a new table. We can think about column order. That's normally when we're adding the most number of columns. But I think also sometimes when you're adding new feature or you're doing refactoring, you might be adding a few columns at the same time.

21:18

And I think it's worth pointing out that that's another time where you can make use of this. But equally, I think we're moving to a world where there's more and more emphasis on incremental improvements. And I know for my own projects, I tend to add 1 column at a time. Like, I tend to be doing a migration.

Nikolay

21:39

Oh, yeah, like 5 releases per day.

Michael

21:42

Yeah. I'm not doing 5 per day, but, you know, like, they keep things small, and they tend to not be multicolored.

Nikolay

21:49

Well, in software engineering, this makes sense completely, because if you mix many things in 1 big change, making them depending on each other, it's bad, right? So we try to split it to small steps and release small steps sooner, because if we have a problem with 1 of the steps, it doesn't block others from deployment, right? This means that we probably indeed add columns 1 by 1 or sometimes in pairs.

Michael

22:17

Not normally, or maybe other people have different experience. But yeah, so even in pairs, I guess the order does.

Nikolay

22:24

But in this context, in previous context, what do you think? First of all, when Postgres creates, When we run create a table with many, many columns, what? Why Postgres doesn't do it?

Michael

22:35

You're going to ask why doesn't it do it by default? Itself, yeah. Then it's unexpected in a different way. Like we were saying, I want my ID to come first.

Nikolay

22:45

No, No, no, no, no. Imagine we had a logical order and physical order. Instead of at num, column in pg_attribute, we would have

Michael

22:56

to num.

Nikolay

22:57

And we know what to present to user, like logical. And we know physical because at creation time Postgres would Have some algorithm to reorganize it right and I think it's possible I don't I'm not sure if it was discussed like it should should be discussed a few times But obviously it's not happened.

23:15

It didn't happen yet but father let's let's think like let's dream a little bit additionally and Based on this case with new releases, minor releases, which played Column Tetris, when you have an existing table, imagine we have an existing table and we have some padding happening there and we add a boolean column and table is huge we could add it and Postgres could add it probably to some different position not increasing size at all

Michael

23:47

right well I don't know what needs to then happen to what happens then to the data files yeah

Nikolay

23:56

well existing was is different if we if we need fully right we do it right but since post this 11 if you if you define default for a column, it will be virtual, so without physical rewrite. It's just, it will say, okay, until this moment, I think in xmin or something, until this transaction ID, all old rows, They virtually have this value even we don't write it physically But it was great optimization and actually what I discuss here.

24:28

It's also in the same area of possible optimization of Postgres could have Right, So only future rows will have it. They had zeros already, padding.

Michael

24:40

Would the padding look different?

Nikolay

24:42

Well, there is a specific additional place in Postgres, I don't remember, I forgot, but in Postgres 11 it was created for defaults, right? Saying all rows kind of have this value for this column which is new. All future values will be read, All future rows will have it in normal way. Same here. We have padding zeros, right? In future, like future this column, say it like, fired, right? Bullying. True or false. Or null, 3 value logic.

25:19

We can have it not in, like, in some position where we had zeros for old rows. All new rows will have not 0. That's it. It's good, right?

Michael

25:30

Yeah, I mean, it would be beautiful, it would be lovely. You know, in real Tetris, if it's getting quite fast and you're starting to get some of these gaps, every now and again you get a shape that looks exactly like a gap you've got right down near the bottom and you want to just pick it up and pop it in. It would feel like that, wouldn't it?

Nikolay

25:50

Similar, yeah. Another good analogy, right? So I think there is some sense in this optimization. I don't know, maybe I should discuss it with Andrey and Kirk. We're probably slowly returning to our normal hiking sessions on Postgres TV. We have actually already quite a long list of ideas.

Michael

26:12

Yeah. There's 2 other things I wanted to discuss with you on this. 1 was doing it with existing data. So we find out that this is a phenomenon or we've inherited a system that wasn't set up in optimal or it's just evolved over time we've been adding columns as we've been developing new features as a system evolved And we decide it is worth it for us to, or like, we want to look into how would we reorganize the table into a different column ordering. How would you go about doing that?

26:43

Or how do you go about doing that?

Nikolay

26:45

It's the same complexity for surgery as for int4 to int8 primary key conversion. I wish pg_repack would do it, but we had a discussion and I think I had some very quick and dirty prototype to organize cold motor when you are repacking table using pg_repack.

27:06

It will be a good moment because it rewrites everything, but there are some doubts and fears and I think we just had lack of attention from very experienced hackers there, so it was not... I think somebody in that pull request, let's check it, and have it shown on somebody, try that again. Because the idea is simple. pg_repack is rewriting the whole table. If we need to physically reorganize the order of columns, we also need to rewrite the table.

27:36

Let's just change the column order at this moment. But this is again, this is something which is not ready. You cannot use it in production. I don't recommend it. For this, although there is a cooking dirty prototype for this. How I would do it, I would just apply this new table approach, same as in 8, But it requires some additional effort. It's similar to pg_repack You need some kind of delta table, you need trigger, and you need to write changes to that delta table. Then you create a copy.

28:15

While you create a copy, this trigger should write all changes in this delta. Then copy is created. There are interesting tricks there to minimize the duration of transactions, as I remember. It was long ago, last time I used this and developed, I think, maybe 5 years past. We had a very serious workflow developed for a big company originally, and we used it several times, this approach, under very heavy loads and very mission-critical systems where Postgres was in the center.

28:46

And then you just have some transaction to switch. You should be very careful. Foreign keys will be the biggest problem actually in terms of

Michael

28:57

switch. Well, I had an alternative.

Nikolay

28:59

Logical replication.

Michael

29:01

Yeah. And I think if you've only just discovered this and you have several large tables that

Nikolay

29:08

you want

Michael

29:08

to do it on all at the same time like I realize it's heavy-handed if it's not your largest table or if there's You've got tons of data that is already pretty optimal. But if you're doing 1 big retroactive...

Nikolay

29:22

Maybe.

Michael

29:23

Well, what's the downside?

Nikolay

29:25

Of logical? As always, downsides of using logical Under heavy loads, it's tricky sometimes. In this case, we need publication for 1 table, so we need to carefully work with retries and so on, because some locks will be needed. They won't block other sessions, but you can just fail. For example, if you have a long-running, as usual, long-running autovacuum process, which is processing your table to prevent transaction ID wraparound, you won't be able to get a lock for long.

29:58

I mean, well, it's running, so you'll fail it. You need to... There are some nuances here, right? But I think it's obviously a valid approach. And combining all these thoughts, I think PjSqueeze from CYBERTEC, which is an alternative to PjRepack, probably is a tool where this should be maybe... Maybe it's already supported there.

Michael

30:18

I checked. I couldn't see it in the documentation

Nikolay

30:20

But it makes sense getting rid of bloat let's also reorganize table a little bit and get rid of padding which can be considered also kind of bloat right

Michael

30:30

Yeah I think so. I think depending on your definition...

Nikolay

30:33

Vertical bloat.

Michael

30:35

Like, if you rebuilt your table now, how much smaller would it be? Is kind of how I think of bloat, the delta between. How big is it now? How big would it be if

Nikolay

30:49

you think you could... Well, you're talking about dump restore, you can check it like that, or vacuum full, but it won't remove padding zeros.

Michael

30:56

No, so therefore maybe it doesn't count in that strict definition, but if you allow for column reordering in between, then it suddenly, it counts.

Nikolay

31:05

This would be a good feature for pg_squeeze and unlike pg_repack I think I think it should people should not have fears because if you use logical and you organize your table like cold mother. I don't see problems with same. Yeah, because it is a pocket works with substituting real file nodes and it's like basically substituting files, it's kind of a hack, right? And it sounds scary, while pg_squeeze is using official API basically, logical decoding, right?

31:38

Which is good because sometimes pg_repack is not available. If it's a managed Postgres offering and they forgot to add pg_repack, I don't know why they could do it, but actually I think Supabase doesn't have pg_repack.

Michael

31:50

Interesting.

Nikolay

31:51

Yeah, yeah, I think so. I think I checked last week, we had a new client come, they are on Supabase, and I think I checked and didn't see pg_repack among supported extensions. I wonder how Supabase clients deal with bloat, or maybe they don't care yet. But by the way, again, congratulations with OrioleDB. It's great.

32:15

OrioleDB is a super innovative thing, which has high chances to be widely used because it's, I think it's an Apache license and the plan is to have it as an extension for regular Postgres, which is great. But it's not yet there because some changes need to be done in Postgres core. But there are chances it will be done.

Michael

32:40

And it's not recommended for production used yet?

Nikolay

32:43

Well yeah, of course, because basically it requires patching the original Postgres.

Michael

32:50

I think it's more than that. I think there's still others.

Nikolay

32:52

Yeah, it requires many, many years to build a reliable database. It can be considered a kind of new database because it's heavily modified Postgres, right? Storage layer is heavily modified there. I can't wait to see, like you know, we discussed it with Korotkov that branching can be native in Postgres. This could be huge.

Michael

33:15

Yeah, there's a lot of promise there I think, yeah.

Nikolay

33:18

But anyway, back to pg_squeeze, I think it's a good idea if it's supported. You have many ideas today, but back to practical.

Michael

33:27

These are more episodes, these are separate episodes.

Nikolay

33:31

Yeah.

Michael

33:31

Yeah, back to practical stuff. There's 1 more thing, So yeah, so I think those are 2 like ways of doing it. And I also wondered at what kind of volume are you generally seeing? Like what kind of size tables are we talking about before this starts to make any sense normally?

Nikolay

33:48

You know, it's hard to say, because again, like I told you, even if I have a 1TB table, which is already above the threshold where partitioning should be used, and I see additional 10% of bloat, I don't care. So if I see additional 10% of padding, I don't care. 10% because of engineering time is very expensive, right? Yeah. Yeah, I know it's not only about storage.

34:13

Like people say it's a storage source, but it's I'm mostly concerned not about storage, although storage also matters because it affects backup sizes and replication size and so on. What matters more for me is state of memory. And this is like additional spam in the buffer pool and page cache. It's not good. So 10% I wouldn't be bothered, but it's 20% I already kind of in warning mode, right? It's 30, 40, it's already, it's worth doing it, right?

34:45

And by the way, it's interesting, you know, like I always tell people these bloat, let's connect topics in very weird way. So, bloat, regular bloat, when we have dead tuples and they were cleaned by autovacuum or vacuum, and we have gaps in terms of tuple, like slots for tuples are not filled and page has only few tuples and many empty spaces. This is called blow, right?

35:14

So over time Postgres will probably put new tuples there, but maybe no, because new tuples are coming in different pace and we end up having too much space used than it could be. And dump restore would help, or repack, or pg_squeeze, or VACUUM FULL, which is not good in production, and so on. So usually how do we understand the bloat? How do we understand the bloat level?

35:41

We use some scripts, some queries or tools use some queries or monitoring systems use some queries but these queries they are like they sometimes are heavy but not so heavy as full table scan would be right still they are light but they are estimate estimated yes And you see a bloat level 40, but it's not 40. You dump restore and you still see 40. How come?

36:15

And Interesting, if you take example, 1 of those we discussed, for example, Boolean int8, Boolean int8, Boolean int8, repeated like 10 times, and rinse, and check bloat level using Estimated bloat using 1 of those queries which Actually all originated from 1 Work, so you will see some bloat. I can demonstrate bloat like 50% It's insane right and you dump restore you are come for you Pj repack and this bloat cannot be eliminated because it's not bloat.

36:48

I think it's very related to mistakes the errors these scripts like error related to padding

Michael

36:58

You could that could be fixed right that

Nikolay

37:00

could be fixed I tried I didn't have enough time a few years ago, and the complexity was high. I think, first I need to make sure my analysis script in postgres_dba works really right. I need to check these 4 by 2 byte padding cases. And if it's right, I think jumping from there to those estimate, bloat estimate queries, I think it's possible, should be, right?

Michael

37:28

You need to know the column order, but that's...

Nikolay

37:30

I know it from practice. This is how I can easily demonstrate that you should not trust blindly those estimated bloat queries. Yeah.

Michael

37:42

Cool. Last thing, I think Posts on this normally don't mention indexes, but 1 good 1 recently did by Renato Massaro. And I don't think it's as important as heap for perhaps obvious reasons, But it did somewhat surprise me that There are some somewhat surprising Alignment padding issues related to indexes as well, and I thought that was fascinating

Nikolay

38:11

Yeah, but we usually we cannot change Most cases Changing column order and yeses just because of this is not a good idea, because column order matters in terms of query performance.

Michael

38:26

Yes, so I completely agree. I think there are some rare cases where, if you're only doing equality type, there's some rare cases where you can you do have some flexibility on the on at least a couple of the columns if you're always including them in all queries like that kind of thing but the reason I the reason I brought it up was I've seen twice now, including the person who wrote this article, twice now people have spent time on something that then turned out to be not worth it.

38:58

So they had spent time reducing, like they didn't need a timestamp, they really only needed the date, for example. So they thought, I'm going to, I've got this index on, they only had a single column index on this timestamp, and I'm going to create a B-tree, I'm going to create the index again, but only on the date, like the truncated part of it. And they spent development time thinking about this, doing this.

Nikolay

39:22

And then they stopped. Well, date is maybe, how much does it take? 4 bytes?

Michael

39:28

It's padded to 8, actually, in BG.

Nikolay

39:30

Yeah, you say, like, having two-column index, 4 and 4 bytes, and just single-column 4 bytes is kind of same? Or what?

Michael

39:40

Single column 4 bytes is the same as single column 8 bytes. And that's really surprising.

Nikolay

39:45

Yeah, cool.

Michael

39:47

So that time was wasted for them, even though it wasn't a multi-column index. The padding meant that they didn't gain anything.

Nikolay

39:54

So guys have more two-column indexes if it's about 4 byte columns.

Michael

40:01

Yeah, maybe you would, yeah.

Nikolay

40:02

Or maybe covering indexes as well. Oh, covering indexes may be different. You know about including? If it's a single column 4 byte index and we have another like passenger, right? Passenger.

Michael

40:16

Yeah, but it only gets stored in the leaf pages.

Nikolay

40:18

Yeah, so it's different. Okay, yeah, that's interesting. And I didn't go there ever. This is good that this article raises this topic. So

Michael

40:28

Yeah, I think it's a good thing to be aware of. Probably not something you should be. You shouldn't, don't, reorder your indexes based on performance reasons not based on the storage size it would be my recommendation

Nikolay

40:42

I agree yeah

Michael

40:43

oh yeah but yeah a really good point that I hadn't realized until I read this article.

Nikolay

40:49

Okay, good stuff. Thank you for choosing this topic.

Michael

40:54

Oh, you're welcome. We finally got to it. We've brought it up a few times. And yeah.

Nikolay

41:00

Okay.

Michael

41:01

Have a good week. Take care. Bye.

Nikolay

41:02

Have a great week.

Transcript source: Provided by creator in RSS feed: download file

Episode description

Transcript