Return of the BUFFERS - podcast episode cover

Return of the BUFFERS

Feb 07, 202552 minEp. 133
--:--
--:--
Download Metacast podcast app
Listen to this episode in Metacast mobile app
Don't just listen to podcasts. Learn from them with transcripts, summaries, and chapters for every episode. Skim, search, and bookmark insights. Learn more

Episode description

Nikolay and Michael return to the topic of BUFFERS for the third (and final?) time! They discuss the news that it'll be on by default with EXPLAIN ANALYZE in Postgres 18, and what effect that might have. 
 
Here are some links to things they mentioned:


~~~

What did you like or not like? What should we discuss next time? Let us know via a YouTube comment, on social media, or by commenting on our Google doc!


~~~

Postgres FM is produced by:


With special thanks to:

  • Jessie Draws for the elephant artwork 

Transcript

Nikolay

Hello, hello, this is PostgresFM. I'm Nikolay from Postgres.AI. And as usual, my co-host is Michael from pgMustard. Hi, Michael. Hello, Nikolay. So I chose a very boring topic. And we talked about it a lot I hope this is the last time we were going to talk about it maybe right BUFFERS or pages or blocks or how how to name it depending on the perspective you have, right but BUFFERS episode 3, which title are you going to choose?

Michael

Oh, it's like the third in a trilogy, isn't it? Maybe it's like Return of the BUFFERS or something, I don't know.

Nikolay

Right, but we have a few reasons to revisit this topic today. Of course, this topic has been very heartwarming for me. For a couple of years, we discussed how important it is to consider BUFFERS when optimizing queries at micro-level, when we talk about EXPLAIN ANALYZE, or at macro-level when we talk about pg_stat_statements and other extensions, and aggregate everything and see parts of workload or the whole workload.

And recently, finally, as we already mentioned a few times, BUFFERS were added to EXPLAIN ANALYZE by default after a few unsuccessful attempts from previous years. I think it was David Rowley, right?

Michael

Yes, and Guillaume Lelarge played a big part as well.

Nikolay

Right, and all previous hackers also who made even unsuccessful attempts made some shift in this direction. Of course, it's really a small technical change. So now you don't need to write the word BUFFERS. You just write EXPLAIN ANALYZE. And as a reminder, EXPLAIN just provides a plan for the query, which planner is going to choose, optimizer is going to choose. And then EXPLAIN ANALYZE actually provides both the plan and actually execution. It actually executes the query, so you see more data.

And by default, buffer information, buffer hits, reads, BUFFERS, dirtied and written, we are not present in the plan. And that was a bummer for us who dealt with queries at scale, right? So because this is important information, as we discussed, and we will discuss again, I think. But now in Postgres 18, you don't need to write BUFFERS explicitly. This information will be included by default. And that's great, that's great.

I think it's a super important small change which will affect lives of many engineers. Because...

Michael

Yeah, I think just worth, I know you've alluded to this already, is an upcoming version of Postgres 18. It's been committed, and things that get committed have a very good chance of making it to the final version. But things do get reverted.

So there's a small chance that we'll still have to talk about this more in future but yeah we already had the ability to request BUFFERS we could already add EXPLAIN (ANALYZE, BUFFERS) and other parameters but a lot of the time when people are asking for help online or when people set up auto_explain although I see BUFFERS more in auto_explain I don't know what your experience is but yeah when people are asking for help online or when people set

up auto_explain, although I see BUFFERS more in auto_explain, I don't know what your experience is, but yeah when people are asking for help or people are writing blog posts or people are just getting familiar with Postgres, they'll come across EXPLAIN ANALYZE and then not know that BUFFERS exists or not use it when it would actually have shown the issue and so there's a bunch of cases where by having it on by default we'll be able to help

more people online when they first have an issue without having to ask them to go back and get another query plan or it will just be more likely to be in, well it will be in those query plans by default if they're on Postgres 18 or newer.

Nikolay

Yeah, yeah. And I know you have also opinion and I agree with you that some other things should be included if we provide something like all or verbose. There is verbose but it doesn't include everything and this is slightly strange and unexpected if you don't know details. But let's focus only on BUFFERS. It's really hard for me to imagine it will be reverted. It might be, but I think, yeah.

Michael

I could just imagine this aging really, you know, aging like milk, we say sometimes, just in case. I just wanted to make people aware that might have happened.

Nikolay

Yeah, so we talk about Postgres 18, which is going to be released this fall in many months and it will hit production only in a couple of years, on average, maybe, or maybe even more. So it's still in far future, in distant future. But I wanted to just think in advance, since we are both building some tools and methodologies and everything, I wanted to think, is it going to have some positive, I think, positive effect on how we are going to optimize queries in the future.

For example, defaults matter a lot at scale, and I learned it building 3 social networks in the past. The simple lesson I remember some spammer was sending on behalf of administration was sending the message, send text message to this number without any reasoning. And if you send it to 1000000 people, there will be some people who will follow. Because, you know, like, why not? Of course, if you add some reasoning, it will work even more. So defaults matter a lot.

And we can put everywhere words like don't forget BUFFERS, don't forget BUFFERS, but if it's still like if default behavior remains without BUFFERS, there will be a lot of people who will not use it, right? And this is what we see If you go to explain.depesz.com, it has history. And, of course, actually, even explain.depesz.com also advocates for BUFFERS since not long ago, which is great. But if you just check history, you see a lot of plans provided without BUFFERS, right? Because defaults.

Now we have good reason to expect that after a couple of years we will see majority of plans with BUFFERS. Do you agree?

Michael

Yes, maybe not in the full, but maybe in like definitely in 5 years time, maybe in 234 years it will become the majority.

Nikolay

Well let's say if it will be 18 plus version, this plan like again 99% or even almost 100% that will include BUFFERS because only few people will bother turning it off.

Michael

Oh, I meant actually people still running very old versions. Yeah, that was my understanding.

Nikolay

But imagine we're already thinking about people with new versions and I think we will have BUFFERS. What does it mean for analysis? Are we going to highlight this information or use this information even more intensively? What do you think?

Michael

Well, good question. I think people will ask... People already have lots of questions when looking to EXPLAIN ANALYZE output for the first time and people are like what does this statistic mean what does that statistic mean and I think BUFFERS are another kind of 1 that at first people are a little bit confused by what does it mean? Like what is that number? Is that a lot? Is it not a lot? What unit is it in? All these kind of questions.

So I suspect it will immediately in people's heads have a bit of extra confusion. But once they're familiar, I can't see how it isn't helpful. It's just going to be so helpful, especially once people start, like, people and tools start converting that into data volume sizes. I know it's an imperfect, but let's accept the imperfection of timesing it by 8 kilobytes, so each buffer being a page, and multiplying it by 8 kilobytes, you get an idea of the data volume being read.

When people see a query and notice it's reading a gigabyte of data, even if it's from shared memory, when it's not using an index or when it's using a specific index, and then it's only using a few kilobytes when it's using a different index, suddenly they realize the index is helping speed things up because it's more efficient because Postgres is only having to read far fewer blocks, like much less in order to serve that query.

I think it becomes a lot more intuitive that indexes help because they're a more efficient way of running certain queries, not their magic. And I think this is currently not amongst Postgres folks, but definitely in the developer community as a whole, there's this perception that indexes are magic. And I think BUFFERS being present help is like a good step towards no they're not magic they're just a they're very clever idea that makes things really simple

Nikolay

yeah yeah that's a good point I was listening to you and thought I'm contradicting to myself when I say buffer, BUFFERS, hit and read and written and dirted. It's incorrect, it should be operations, BUFFERS, buffer reads, hits, writes, and I don't know how to deal with dirt.

Michael

Dirties? I don't know.

Nikolay

Yeah, yeah, yeah. Pieces of dirt, right? No, no, no, like actions of dirt. But anyway, like I'm saying, these are operations And if you sum up some numbers, it might be the same buffer which you hit multiple times. Not sure about other operations, but hit, we definitely have in the nested loop, we can have hits for the same buffer many, many, many times. But when you say we need to convert, I totally agree. I did it for a long time.

I noticed that very long ago, once you converted to bytes, megabytes, gigabytes, sometimes terabytes actually, tebibytes, right? Gibibytes. Once you convert it, engineers have a ha moment. Always. I mean, those who see it first time. Because 100 BUFFERS, buffer hits, or reads, 100 buffer reads, there's nothing to them, actually. It's what it is. Hard. But once you say, okay, we here we read 800 kibibytes to return a number of like a 8 byte number. You can feel it already how much... It's just...

To return 1 byte, you are reading 100 or 800... How many? KB. Each KB is 1024 bytes. Wow, like it's not very efficient, right? How about having I don't like hash hash or hash table or something to find this number and then Oh, okay, this is why like, because it's a sequential scan. And this sequential scan is just scanning whole table. If we remember big O notation, sequential scan has terrible, terrible performance, because so many operations.

And it also depends on the size of each tuple, how many tuples are present in 1 8-kilobyte block, 8-kilobyte buffer, right? So if only a few tuples there, we need to read a lot of data from memory or from disk. So once you switch to a B-tree index, it becomes only a few buffer operations. Even if you have a billion rows, it will be like 7. I don't remember exactly, but

Michael

it's

Nikolay

like 7 buffer hits. That's it. And even, okay, they are still like 8 kilobytes, it's a lot, but it's already much better than if it would be sequential scan, which is insanely slow in this case. And you think, okay, that's why it's slow, because we deal with huge volumes of data. And when we apply index, we suddenly deal with very low volumes of data. Right? And this is number 1 reason why database becomes fast. This is how indexing works. Like, we just reduce the number of I/O Operations.

And when I say I/0 Operations, I often include operations with memory hits. Because, well, we know operations with memory versus operations with disk, it's like 1, 000 times difference. It's, of course, a lot, but still, like, we can compare it, and 1 read is roughly like 1000 hits to memory. Unless this read is also from memory, from the page cache in this case. So they're very similar.

So what I'm trying to say, we talked about this reasoning and I agree with you, we need to convert to bytes and present it. At the same time, we need somehow like have some remark that these buffer hits might be to the same buffer. It's not only... It's a volume... It's some abstract volume of data. We can have a kilobyte or megabyte of data heating the same 8-kB buffer, right? So it's okay. But comparing volumes of data, it's so good.

Because first of all, you start thinking about data volumes, which is the number 1 reason for slowness, especially if we forget about concurrency. Because concurrency is a different topic. If we take just 1 query, there are cases when it can be a very CPU-intensive workload, but in the majority of cases it will be I/O Intensive workload for 1 query, for databases, right?

And our goal is to help Postgres with indexes, with maybe some redesign, to help Index to deal with as few operations with memory and disk as possible, it means we need to focus on BUFFERS.

Michael

Yeah, are there any... I think there might be a few other exceptions, but I think they're more... Yeah, I don't think that... Well, maybe it's only 1 I think maybe a CPU as well actually I was thinking JIT compilation what that's

Nikolay

probably recently we had a recent discussion about our RLS and there we had inefficient memory work CPU work like check some volatile function or stable function also, right? So if it's in loop for all rows, it's CPU intensive work.

Michael

Yeah, there have been, I was just trying to think, there have been a couple of other times I've really not been able to spot issues by looking at BUFFERS. But they're not actually, it's not because Postgres isn't reading data, it's just it's not reporting it in EXPLAIN ANALYZE yet. So actually, maybe getting it on by default will encourage some more reporting of BUFFERS as well. So, for example, I don't think triggers report the BUFFERS read.

Like, if you have, you know, the famous case of not indexing a foreign key and then having on cascade delete. Like that trigger might tell you it's taking many, many seconds because it's having to do a sequential scan of the reference table but no BUFFERS reported as part of that. Same with in memory operations, like sorts or hashes.

They could be really slow, even though they're done maybe you've got quite large work mem, and maybe it's the dominant part of the query plan, but no BUFFERS reported as part of that. Because it's in memory. It will report the BUFFERS if it spills to disk. For reads of tables and indexes, we get reports even if it's from memory. But for operations like sorts and hashes, we don't get. So there are a few things that are still doing work that it would be good to get those.

But it doesn't go against what you're saying in principle. It just means I can't use what Postgres is actually giving me in EXPLAIN (ANALYZE, BUFFERS) to actually diagnose the issue there.

Nikolay

Yeah, I remember also touched the topic that it would be really great to see details for BUFFERS. For example, how many BUFFERS from heap, from index, maybe distinguish in each index, some details to see. Because if we go back to the original approach for dealing with EXPLAIN ANALYZE, and I see it like even many hackers, blog posts, provide plans, ANALYZE plans, but they don't use BUFFERS still. I hope this will change, of course, over time.

So usually, okay, we have time, we try to guess, okay, this time is lost here, why? Okay, because sequential scan. But in terms of data volumes, what do we have? Only rows, right? Expected and actual rows. Logical rows. And sometimes we think, okay, we fetch 1 row here.

Michael

Oh, we get width as well. Like an average estimated width. Which is... Times by rows gives you some idea, but it's only the width of what's being returned not the width of what's being read

Nikolay

Exactly. This is like, okay. Yeah, we can multiply with by rows get big basically number of bytes But it will be like logical level and an underlying level might be very different, very different, drastically different. For example, if there is a lot of dead tuples, which Postgres checked to return this very row, right? Maybe it checked a thousand dead tuples. And it's hidden if we look only at rows and timing. And we have no idea why timing is so bad, right?

Okay, we returned 1 row in the index scan, but why so bad? Here's why. We go to BUFFERS and we see that a lot of BUFFERS were hit or read. I don't know, it doesn't matter. That's why timing is better. And then, my point is that it's already very useful, But imagine if we saw, oh, actually, from this particular index, from this particular table, we got those operations with BUFFERS. It means that to return this row, so many BUFFERS, even with index scan, it's bad.

Oh, that's why, because it's bloat. This row had many versions not cleaned, and the index scan needs to check heap table to get version information. So this would be useful to like detail BUFFERS, to have detail BUFFERS maybe, right? And I

Michael

had this case recently, I was in January, I was doing some office hours calls just to get an idea of what kind of issues people were facing and I had some really good conversations thank you to everybody that jumped in on those and but 1 of them they were they were using our tool pgMustard and it was 1 of the tips we show people is, we call it read efficiency, to look for exactly that. When are you reading lots of data?

And we actually, for long-term users, used to call it table bloat potential, or like bloat potential. So it could be index bloat, could be table bloat. But we renamed it read efficiency a few years ago because there are other possible causes of it as well. So it's a tricky 1.

And because we didn't, so this office that I was called, they were hitting quite a few read efficiency issues but they couldn't diagnose exactly where it was like was it index level was it table level they looked at both it didn't they didn't seem bloated they reindexed they they tried a lot of the things and it couldn't reduce it so it was a it was a really interesting call but that yeah that was my first port of call is in this used to be true

more in older versions of Postgres but indexes can get especially can get extremely bloated so can tables in some cases but yeah it's it's amazing that you can suddenly see that by turning on BUFFERS. And that is something I wasn't expecting to be true. Because I see bloat as kind of like more of a system-wide issue, But to be able to spot it in query licenses was really cool.

Nikolay

Oh, bloat can be very tricky. Sometimes we have, with our consulting clients also, we have like bloat is low, but some query suffers. And it turns out to that, like, we call it usually local bloat. So for particular IDs, the situation is super bad. It can also be not a bloat, but some, you know, like sparsely stored records. They are distributed among many BUFFERS. And you expect, okay, I have very narrow table, only like 4 or 5 columns. I expect a lot of tuples to be fit inside 1 page.

But somehow reading 1, 000 records, rows, I deal with thousands of buffer operations, what's happening here. And if we just check CTID, we can understand that each row is stored in each particular page, so we have a lack of data locality here. And so clustering with pg_repack without downtime could help, or partitioning helps usually in such situations and so on. Yeah, yeah, yeah. So, and the buffer is exactly the way to fill these, all these problems better.

But I agree, I like your idea to talk about read efficiency and maybe write efficiency as well. Because for end user, it's obvious. Okay, I have, I return this data, I return only 25 rows on my page. So I expect not gigabytes of data to be fetched from buffer pool, or even worse, from disk, right? Or from page cache, which is between disk and the buffer pool. So I expect maybe only like some kilobytes, right? Maybe tens of kilobytes. Even not megabytes, if it's quite a narrow table.

Postgres should not do a lot of work to show 25 rows. If it does a lot of work, something is not right and we need to optimize this query. And without BUFFERS, we only can guess. With BUFFERS, we see it, right? With buffer details, it would be even better, right? Someday maybe we will have detailed BUFFERS per each database object. Some statistics would be interesting, I think, to observe. And there is criticism of this approach.

I recently had an interesting discussion on Twitter on X, And somebody told me that in Oracle, there is ability to flush caches. Right. And in Postgres, it's tricky. Usually it means you, we need to restart Postgres to flush the buffer pool and also to echo 3 blah blah blah to flush Linux caches if you want to go hardcore and very cold state, to check very cold state. And this is interesting.

Michael

I think you can restart.

Nikolay

Restart what?

Michael

The Database. Like if you're testing locally, for example, or on a clone, if you restart the Database, doesn't that reset caches?

Nikolay

Restart of Postgres will make empty only the buffer pool. But we also have, if it's Linux, We have also page cache, right? We can flush page cache with simple command like we of course if you have sudo

Michael

Yes,

Nikolay

so it's it's easy. I always Google it echo 3 to some path and and then sink and that's it

Michael

If you're a managed service, what's the best you can do?

Nikolay

Well, managed service, you cannot do this, of course.

Michael

Restart is like, I

Nikolay

think the best

Michael

you can

Nikolay

do. Reboot. Reboot, yeah. Yeah, yeah, reboot. Well, this is hardcore. So let's talk about this. I also see a constant desire to check the cold state in the Database Lab we built. Some users use it for query optimization with some bot we have, drawbot, very old stuff, but it works really well for query optimization and also like experiments, right? So you can check on thinkloan, on branch, you can check various ideas And there people say, okay, I see the hot state. Data is already cached.

I want to check the cold state. And I think it's natural desire, but yeah. Well, every time we say there is no such easy way. And interesting that, I didn't know, but interesting that in PostgreSQL 17, in pg_buffercache, there is a new function, pg_buffercache_evict, and you can try to use it. I think we are going to try it at some point when Postgres 17 will be more present on production systems. We probably will consider adding this to make the buffer pool empty, right?

But unfortunately, this function, I suspect it won't work really well. It's definitely not for production, first of all, right? It's for like lab environments we build. And unfortunately, per documentation, it won't evict BUFFERS which are pinned by, for example, autovacuum, vacuum process and so on. So in some cases it will fail, we will need to think how to deal with it. Of course, restart is definitely working in this case.

So if we think about why do people need cold case at all, What's your opinion on this?

Michael

Well, I think it's coming from a good engineering rationale. I want to make sure the worst case isn't too bad. It's definitely, you want good engineers to be thinking about edge cases, to be thinking about the worst possible case. But I also think that in practice, the kinds of optimizations we're doing here, the whole point of the discussion here is we're saying try and reduce the amount of data being read to as minimal as possible.

If it's an important query, try and get a very, very good access path for that query. That will involve getting the buffer numbers down as much as possible, whether they're cached or not. Maybe you want to explain to people in the PR, or you need to explain to your team, how much faster has this made it? So I can understand wanting to compare cold cache state to cold cache state, and then having these kind of relative differences. But I would encourage going the opposite route.

It's much easier to work with warm cache, especially when you're doing query optimization, you're naturally going to warm up the cache if you're running the same query over and over again. So I would encourage more going the opposite direction and say compare... Yes, it definitely applies for OLTP, but even for OLAP or OLAP queries, I don't see the downside of saying to your team, these, I mean, it's still EXPLAIN ANALYZE, right?

If you're showing query execution plans before and after, which is often what people are doing, kind of show this is what it was, this is what it will be, it's already imperfect in terms of reporting numbers. It's not exactly the same as what the client's seeing. So If we're already imperfect, I don't see the downside of comparing warm cache to warm cache, so I tend to run things a few times, get the...

Nikolay

Before optimization and after optimization, we compare 2 cases, both should be warm, and we focus on not... Even if we have reads still, for example, buffer pool may be smaller in lab environment. We usually have much smaller buffer pool because we run multiple Postgres instances on the same machine. It's a shared environment. So if you deal with large volumes of data, it might not fit the buffer pool, so you are going to see reads.

But the idea is let's just summarize reads and hits, maybe even writes and dirties, how to name it, right? So all 4 buffer operations, if we just summarize it and use it like as universal metric, this is now our number of... Or just maybe reads and hits, depends, right? So but overall, we just see this is overall volume of data. And this is what you return, 25 rows.

Michael

I like that a lot that's exactly what we do but I get the sense that what people really want is to compare timings so they want to say this query that used to take Because sometimes it's coming from a complaint, right, or something. This query that used to take 20 seconds now is 500 milliseconds. Or this query that used to be 200 milliseconds is now less than a millisecond. So they want this kind of difference in timing, even though behind the scenes it's a difference in BUFFERS.

So for that, if you want kind of apples to apples comparison, instead of trying to get cold and be fair on the cold side, I'd just say it's easier to be fair and do it on the warmer side. I know it's not perfect, and it might be that your real users are hitting a cold cache state, but if that's what you want, that's the direction I would go.

Nikolay

Yeah, well, we have slightly different methodologies here. I usually say, okay, we start from the statement, the query is slow, it takes like 30 seconds or 1 minute. It's super slow, unacceptable, we want it below 100 milliseconds or 10 milliseconds. And then I say, forget about timing for now, completely. Focus only on BUFFERS. If we check BUFFERS and see a low volume, 100 BUFFERS, 100 buffer hits and reads, to return 25 rows, we know our micro-optimization is over.

We need to go and see the whole picture of what's happening with concurrency. Probably we'll have some... Maybe this query is waiting on lock acquisition of all those seconds. That's it, right? So it's already this. But if we confirm large volumes of data, we forget about time. We focus only on obtaining sane numbers for BUFFERS. Sane means, okay, hundreds, thousands, not more. Even thousands to return 25 rows, it's already quite a lot. And we must do it. Sometimes, of course, it's not easy.

Sometimes we need to do denormalization, some complex surgery on our database involving long-lasting operations with backfilling and so on, but once we achieve this, in most cases we can have same number of BUFFERS and then we say, okay, now we can compare timing. And Comparing timing in lab environment, it still might differ from production.

Michael

Yeah, of course.

Nikolay

But of course, I know, I know, I use it as well. I say, we improve timing hundred, like 10000 times. It happens, right? And this is good for final comment you know in the end but only when we did work on BUFFERS fully right this is me

Michael

yeah given it's you you you could with your reputation and your teachings you could leave that last line as we improved buffer read or read efficiency for example. Read efficiency and it would be probably almost the exact same number.

Nikolay

Not necessarily, but yeah, maybe.

Michael

You'd be surprised how often it's close.

Nikolay

Okay, good, good, good. Yeah, that's good. Maybe I should think about it and also present this information during consulting activities and in tools as well. Yeah, I think it's good. So yeah, read efficiency. And you take number of rows and width and just multiply, get bytes from there and then...

Michael

Yeah, it's a tricky 1 because like different scan types you expect different amounts of efficiency so yes it's slightly

Nikolay

more involved also interesting right

Michael

yeah but I think I think the main differences are sequential scan like sequential scans very different from well sequential Scan and Bitmap Scan have some similarities. Index Scan and Index Only Scan have some similarities, but then like looped index scans are somewhat different to like range index scans in terms of how many tuples, like in a loop for example, you can't get fewer than the number of loops, like it has to do at least that many.

So yeah, there's some subtlety, but yeah, that's it essentially.

Nikolay

Yeah, But back to the cold state, I also agree. I can imagine you would need it. And unfortunately, it's possible, as I see, only in lab environment where you are alone. Because this is a global operation to flash operational system page cache, for example.

Michael

And when you control the hardware, right? Or like the, you know, the VMs.

Nikolay

Well, if it's a managed situation, again, restart, as you said, also an option, but again, this only works if you're alone, and this is expensive. We aim to make experiments super, super, super, extremely fast and cost-efficient. So it means a managed environment. Many people work on the same machine, so if 1 decides to flush everything, others will notice, right? And With this function, pg_buffercache_evict, I think if we manage to make it work, I see it's useful.

Even imagine if we have page cache, or in the case of DBLab, it's ZFS-Arc, which is common. If 1 block is cached, it's cached for all clones, all branches, it's very good in terms of efficiency. But sometimes, indeed, engineers want to check inefficient situation.

Okay, in this case, I think what we will do, we try to okay, if user requests for their particular database, particular Postgres instance running on this machine, we can ask to evict maybe everything from buffer pool or for particular tables and indexes, say everything. And if we succeeded, there are chances that these buffers are cached in underlying cache, right? Page cache or arc. In this case, query timing is not going to be very different, but plans will be different.

We will see reads instead of hits. Observing buffer numbers, we will see the difference. Okay, reads now, not hits. And what I'm thinking, In this case, we could consider it relatively cold, this situation.

Michael

Lukewarm.

Nikolay

Yeah, yeah, yeah. And we could just say, okay, we know that reading from disk is roughly a thousand times slower than from memory. So we could do some estimation of real call state.

Michael

Well, or there's that... How often do you see clients having track_io_timing on?

Nikolay

Oh, very often.

Michael

Great. Okay, that's really positive. But that because that gives us an idea of how long those reads rather than hits took. And you could do some division of reads by I/O timing to get an idea of where they're coming from. I think there might also be some work. I think I saw something either in a commit for a century or a hackers thread to try and get

Nikolay

the details. It's like, there's just that cache for micro level, right? Yeah. Well, it's possible if you have access to everything, it's possible from proc, process number, I/O, you can get it from there, I think.

Michael

But it would be good to have it in Postgres core so that we could even get it from managed services, for example.

Nikolay

Yeah, yeah. Well, there is something that can be done here and improved observability of part and work with particular query if you have control. Yeah. But I'm trying to say, like, even if you with this evict, we can get some interesting information, even if we cannot allow ourselves to reset the whole, everything, including page cache. So it already can be useful in non-production environments to study the query behavior if it's a clone. Yeah, yeah.

So I feel there's a lot of things that can be improved in this area to work with queries. And especially it's important to continue this work Because I suspect we will use some tools to work on this at massive scale. Like for example, if we fetched from... We already know during consulting, we did it with a few clients. With auto_explain, we fetched a lot of queries, hundreds usually, sometimes thousands. It's insane. Nice. Examples with plans.

And then, well, actually, you know, because we partnered, in some cases we use pgMustard to provide additional insights, which is great. And then we have a lot of stuff. Unfortunately, you know, unfortunately, I still don't see how to avoid human here.

Well, not like, I like to involve human there, But I would like to involve less, you know because usually Right now we have a lot of cases of query analysis like that and then we need to draw some common picture from it right and this This like going above all those plans and say, okay, we have a pattern here. For example, bloat is obvious or something like this, like, or index health is not good. And to draw this picture, human is involved heavily right now.

I'm thinking like over time, probably we will build some additional macro-level analysis for micro-level cases of hundreds or thousands of them. This is going to be great. And then looking at BUFFERS, and actually we have BUFFERS there as well. You were right in how to explain timing is not present. People are afraid of overhead BUFFERS also have overhead, but kind of smaller. Right. And yeah, we had articles, you had articles about this as well.

Yeah. So yeah, and so with cold cache it's tricky, and there is this function, that's it. So Postgres 17 plus. What else? There is macro level, right? And at macro level we can talk about pg_stat_statements, blocks, read, hit. By the way, there it's, yeah, also naming, we talked about it. It's also naming like it's data volumes, but it's okay. And so read, hit, written, dirtied. For shared buffer blocks. In pg_stat_statements, also there is pg_stat_kcache if we want real disk I-O to be observed.

And again, I advertise to include pg_stat_kcache in any setup. It's a great extension. And there we can talk about some macro level, like, okay, this query ID has this volume per second, or this volume per query, or it's responsible for 90% of all hits to the buffer pool happening in database. Maybe it doesn't look super slow, but it's so intensive with shared buffer pool.

Michael

That's a good point. I almost always encourage people to look at their top queries by total time. Unfortunately, with pg_stat_statements, at least, there's no kind of total blocks. Because it's split into these many, many steps, you could order by shared hit plus shared read and get a good sense of 1 type or by dirtied for a different, but yeah, you'd have to sum in most cases I'm thinking of that would give a good system-wide aggregation, you need to sum 2 columns, I think, to get a good...

Nikolay

Reads and hits, right?

Michael

Ordering. Yeah.

Nikolay

Yes, yes. So we usually... It's underestimated approach with pg_stat_statements to order by sum of reads and hits.

Michael

Yeah.

Nikolay

It's super important. This is how you identify I/O-intensive queries. And now they can be mostly hits, but if database grows, they can be converted more and more into reads. Or vice versa, you're going to double your memory size and the buffer pool size, so reads will be gone. But just sum them and consider them together to understand how IO intensive the query is. And think, oh, this query probably needs optimization, especially if it returns only 1 row or 0 rows.

So, yeah, in this case, we can also consider buffer-focused optimization, which is important. I think it's underestimated.

But also, it's worth mentioning this interesting blog post from Andrei Lepikhov, who gave, I would say, a hacker researcher perspective on this, saying, actually, I didn't mention that this is my point for sure for lab environments we build it's important to say timing is super volatile it's unpredictable you can have timing 1 in 1 today another tomorrow 1 on production another on clone of production Very different timing.

And it will be a big question to track all reasons why timing is unpredictable. And there is a concept, there is a simple methodology, if we deal with a complex problem, let's split it to pieces and analyze pieces separately. So when we take a clone and bring a single query and optimize it, we already get rid of concurrency issues, heavy locks, lightweight locks, everything, like forgetting about them. And we deal with a single query alone in 1 clone and it's already super good.

But we still need to have something reliable in terms of metrics to optimize. And timing is not reliable because it's unpredictable. It depends on the states of caches. If it's a shared environment, maybe CPU and disk are super busy, and there's noise from neighbors when we optimize in this case, right? But once you focus on BUFFERS inside optimization, you have invariant metric. Okay, it can be reads versus hits, but some of it will be constant if you repeat the query.

For this particular clone, for this particular query, these parameters, everything is the same. Plan is the same, right? So you have the same thing. It's repeatable, it's reproducible. This gives you a foundation for optimization because without it you have very shaky foundation, like not solid foundation, right? Timing. So, and then you optimize and you... I like your idea about optimization, like, actually I used it. I used it. I said, okay, 10,000 times timing.

And I remember I mentioned BUFFERS as well, but I didn't expect people will value this comment a lot. Maybe with BUFFERS by default, we should push on this methodology more and more. But what Andrei is talking about in this article is that for research purposes and so on, instead of timing, maybe we should focus only on reads. And reads is something that could be a target itself for optimization and decision if the system works efficiently or not efficiently.

Because we are talking about databases and I always the number 1 thing in terms of what takes time. And I like this idea. Yes. So like this is very close to what I see for many years building lab environments, non-production environments, which can help people scale their teams and databases and processes and so on. Yeah, so what's your opinion about this article?

Michael

Yeah, I thought it was good. I think all of, I think this blog's great, but it's very, I think it's very hacker focused. And I think that's good, right? Like This is a good topic for hackers as well. There's a lot of blog posts I see from people often explaining performance issues and not including BUFFERS, and it seems like a missed opportunity to educate on that front.

So this is a great argument for the people doing the educating, I think, or even doing the implementations of why it can be helpful to see this. I think he makes some good points as well about hardware changing over time and his initial paragraph is even about I can't reproduce this paper.

Nikolay

I can't reproduce timing, yes.

Michael

Yeah, because all this and it's not just that it's also there was and this is a different point slightly but there's insufficient setup information which is really common in benchmarks it's really common to not include which exact machines were being used or which exact. So without that information, BUFFERS can still be used as like, oh, this is interesting.

Sure, there might be optimizations over time that Postgres can use fewer BUFFERS for certain operations or certain indexes become more efficient. There's been a bunch of work to optimize B-trees in Postgres over the years, and I would expect that to lead to slightly fewer buffer reads in a bunch of cases. But they would be easier to see, and easier to see the differences in than timings.

And I think he's also got a really good graph here that shows that there's not 0 variance in buffer numbers, but it's much, much lower and it's way more stable than timings. Nearly completely stable with some exceptions, which is my experience as well.

Nikolay

Yeah, yeah, yeah. I agree, especially for example, if you run the query first time real cache is not there, it will give you,

Michael

yeah, planning

Nikolay

time, it will give you huge overhead and huge buffer numbers, but second execution will fully shave it off.

Michael

Yeah, planning BUFFERS.

Nikolay

Planning BUFFERS is a separate topic. It's a very important topic. I like to have it.

Michael

Me too, but I don't fully understand them yet. Like why do we get so many more in the first execution and then why do we sometimes get none reported at all?

Nikolay

Ah, okay. You mean, yeah, yeah, yeah. I think we discussed it maybe, not sure if you, I saw some very weird behavior in planning BUFFERS. Yeah. Maybe something is not tracked as well, as you've mentioned.

Michael

I think it

Nikolay

must be. Foreign keys, yeah, yeah, yeah. So maybe we should dig into this topic and raise it, the discussion about that. Yeah. So back to Andrei's blog post.

Michael

Yes.

Nikolay

I agree. And like I cannot reproduce and we built lab environments sometimes very different from production just for the sake of cost optimization. Like if you have 360 core Epic, 5th generation Epic on production, and almost terabyte or more of memory, it doesn't mean you cannot optimize your queries on Raspberry Pi. You can on magnetic disks, very cheap. And that query fully cached might be slow for you on production, might be like a second.

You go and optimize, you understand that you have shitty hardware. In this case, you just see, okay, it takes minutes or hours. But BUFFERS are the same, and plan is the same.

You can make decisions on super cheap hardware with smaller caches, everything is smaller, very slow disk, you can still make cautious decisions and move forward with optimization, find solution, see that you optimize buffer numbers like a thousand times and expect similar effect, maybe the same effect on production, affecting time accordingly. Right. And this is the, this is the power of lab environments. I think everyone should have them super cheap, shared.

So many people and CI pipelines work together and focusing on BUFFERS is magic. Right? Yeah, I

Michael

think it's easy. I think you're completely right. I think it's easy for people to make the mistake of then also not using sufficient data volumes. I know you don't. I know that's not what you're advocating for. But yeah, it's the data volumes that matter way more than the power of the disks or the specific CPU.

Nikolay

Yeah, usually people say, okay, we are going to have weaker hardware for non-production, and we are going to focus on relative optimization. So if it was a second in production, 10 seconds in this non-production, which is weak, we are going to optimize from there. But it's still weak because of the status of cache and so on. Just focus on BUFFERS and then you can understand it. Again, there are exceptions, I agree, such as we discussed with RLS and so on.

It's possible that your particular case is CPU-intensive, but these are exclusions from the main rule. Absolute majority of cases, you will be dealing with BUFFERS during optimization, right? So congrats with Postgres 18 change. And I also want to thank Andrei Lepikhov for this blog post, which is very welcome. I think we need more posts about the importance of I/O Metrics at various levels, like 1 query level, micro level, or macro level like pg_stat_statements, pg_stat_kcache.

So yeah, I hope this topic will grow and be more popular, this methodology.

Michael

Yeah, agreed. Another thank you to the people involved in getting this committed to Postgres 18 and big thanks oh and I think this might be our first episode since we both got Postgres Contributor status Nikolay so congratulations to you and thank you to whoever nominated us or whatever the process is there. Thank you to everyone involved.

Nikolay

Good, see you next time.

Michael

See you soon, bye.

Transcript source: Provided by creator in RSS feed: download file
For the best experience, listen in Metacast app for iOS or Android