Hello and welcome to Postgres.FM, a weekly show about all things PostgreSQL. I am Michael, founder of pgMustard, and this is Nikolay, founder of Postgres.AI. Hey Nikolay, how's it going?
Hello, Michael. Going great. How are you?
Yeah, good. Thank you. So what are we talking about this week?
Let's talk about why we can have writes on replicas.
Yeah, or the thing I've seen people get confused by is why they can get writes even on the primary for SELECT queries, like for read queries. But yeah, I like your framing.
Yeah, it can happen on primary as well. Yeah. I mean, this surprise can happen on the primary as well.
And specifically where people might notice this is seeing if they get buffers in query plans like shared
dirty. Exactly.
But also, I think the place I've seen people spot it, at least when they mention it in blog posts, is pg_stat_statements or monitoring tools that are looking at pg_stat_statements and seeing that some of the top offending queries for buffers dirtied could even be SELECT queries, and that really surprised them. So it's kind of one of those surprising moments where you think, ah, is something badly wrong? And then you look into it and realize maybe not.
Yeah, actually, to be fully clear, writes can happen during SELECTs because of temporary files as well, right? And if you think about writes to disk. But this is pretty straightforward. We see, again, in pg_stat_statements, we can see it as well as temporary bytes or blocks written and read and...
Yeah, is it local or is it temp? I think it's local.
No, local is different. I always forget what local
is. No, you're right. You're right. Local is for temporary tables, that's why I get confused. Local is for temporary tables and temporary objects. But temp is for like, if for example, sort or hash spills to disk because it's too big for work_mem or work_mem times hash mem multiplier then you get temp blocks read pretend blocks written But I think what people are spotting is the dirtied and being like, wait, that should... Dirtied means, you know, an Update or Delete, normally.
It's so confusing, right? I'm checking pg_stat_statements documentation and for local blocks, read, dirtied, written, and local block read time, write time. It only says total number of local blocks read by statement. Total number of local blocks, like everyone understands what local block is. Well, local block, okay, I think you're right. It's a block. I also saw like the more and more I deal with it in Postgres, almost how many years, like 20 years, right?
The more I dislike the fact that we call operations blocks. These are operations, not blocks. Because if we say 10 blocks written, it's actually 10 block writes. Maybe it's the same block which we've written 10 times, right? Or rare. So it should be block reads, block writes, block... Dirty, I don't know the proper word how to say. Dirtying... Dirtyings, I don't know. But anyway, here, of course I think it would be great to have understanding what local block is.
Local block, I think, it's, again, operation. Local block operation, it's what is caused when we work with temporary tables, which I always try to avoid because of different reasons. Maybe we should talk 1 day about temporary tables. But temporary tables, temporary blocks, temporary blocks is not about temporary tables, but rather about temporary files which can happen during execution of a Query dealing with normal tables anytime because work memory is not enough, right? So again, this is easy.
This part is easy. If we see some SELECT or Update or anything is producing a lot of temporary blocks, read or written. We need to just consider raising work memory or optimize Queries so it deals with lower volumes, smaller volumes of data, right? For example, I don't know, like, just increase selectivity of our, I don't know, it's a separate topic anyway. Let's just point that writes can happen because of temporary files.
Of course, SELECTs can produce temporary files because work_mem is not enough. That's it. But then sometimes we have huge work memory, definitely, which is enough, and SELECT is still writing to disk. Or, well, technically dirty, blocks dirtied. These operations, I will keep calling these operations, not data. These events, these operations can happen during SELECTs.
And if it's just dirtied, it not necessarily goes to disk immediately, because this is the work for Checkpointer and background writer. But it's already like, it's not good. Like SELECT is producing some writes to the buffer pool, to shared_buffers area. And of course, if it's written, it's even worse, because I'm pretty sure this makes SELECT also slow similarly as it happens when it creates temporary file and writes to it right
yeah good point although I think in the 2 cases that we're going to be talking about, actually, no, no, there could be a lot of them. I was thinking, I was thinking it's generally only in a lot of the cases I've seen, it's only 1 block at a time. But actually, you could easily have a scan doing a lot of these. So yeah, could be a lot of data.
Yeah, and actually to be even like absolutely precise, a block written not necessarily goes to disk because this is written to the page cache. And then like pdflush or something, this is already about Linux, it should write to disk at some point. Written doesn't necessarily mean that it goes to disk. And I think to understand that, we should use different extension. Because pg_stat_statements is great, it's available everywhere.
Because it's like, I think this is the most popular extension, should be at least among observers. And there is opinion that it should go to core at some point. It should stop being an extension, especially after query ID propagated to pg_stat_activity to auto_explain, right? Definitely, pg_stat_statements should be in core.
But what I'm saying is that there is a KCache which extends the predecessor statements to allow users to see metrics related to actual physical resources such as CPU, system CPU, and also disk I/O. Real disk writes, real disk reads how many bytes are written, how many bytes are read for each query ID, which is great. I think We talked that finally buffers are committed to EXPLAIN analyze. By default, you run EXPLAIN ANALYZE in Postgres 18 unless it's reverted. I hope it won't be reverted.
In Postgres 18, EXPLAIN ANALYZE will always include BUFFERS in output, which is great victory. We talked about it a couple of years on this podcast, I think almost since the very beginning. And I was thinking what topic to choose next to complain about occasionally, regularly.
Or maybe campaign rather than complaint.
Campaign, okay.
More positive, yeah.
Right, so what kind of campaign to launch after this? Like, I was happy, but I was like, oh, I will be missing us to talk about buffers and explain the lies and so on. I chose this topic. I think we should focus on all managed Postgres platform builders and advertise to include pg_wait_sampling and pg_stat_kcache to extensions. Because this is super important to have them both to be able to do proper query analysis in different aspects. pg_stat_statements is great. It has a lot of metrics.
It keeps growing in terms of number of metrics, but it doesn't have what pg_stat_kcache and pg_wait_sampling provide. So we need those extensions on each serious database, especially heavily loaded, and pg_stat_kcache is great here. And this would allow us to see, oh, actually this query being SELECT caused actual physical disk writes.
Yeah, I'm going to have a think. If we only get 1 campaign, I'm not sure that's mine, but maybe I like it. It's a good idea.
Okay, let's please consider this. I'm definitely launching campaign at least on my Twitter and LinkedIn and actually I already launched it and I saw a reaction from So actually I mentioned that these extensions are present on all serious setups, which we helped to maintain for quite some time. And pg_stat_kcache has been used like in serious databases I'm dealing with for a very long time, many years. And I know some platforms use it, but in different countries, right?
And the only case I see is pg_wait_sampling is available on Google Cloud SQL. pg_stat_kcache is not available there, but pg_wait_sampling is, which is great.
Are you sure?
I'm sure. pg_wait_sampling is available, so you can have... Imagine RDS, they have performance insights, which is great. But this information is not available through SQL interface. I'm not sure about CloudWatch, maybe it's available there, but it's a different API. I want to see details right in SQL because I have tools to analyze pg_stat_statements and I want extension to ActiveSession history analysis and also to physical metrics, which pg_stat_kcache provides.
So Cloud SQL has pg_wait_sampling, but it doesn't have pg_stat_kcache. Others don't have any of these 2. And the reaction was from one of Cloud SQL engineers I saw who promised to think about it. I mean, it would be great and Cloud SQL could be much better if they have pg_stat_kcache. I think they're already much better in terms of observability, better than RDS because pg_wait_sampling is better than performance insights.
Usually we combine multiple tools for observability and you go to for active session history for RDS or Aurora you go to one place, to their own place, for everything else, you go to another place. It's not convenient. So in the case of Google, they have a code, I think, query insights or something like this. They have it exposed in their own monitoring interface, but they also expose it for anyone through SQL interface, which is great, right? Through pg_wait_sampling.
If they have pg_stat_kcache, I will be happy, but I also I'm going to ask others to consider these two extensions. For example, Crunchy Bridge, Supabase, who else? Everyone. I actually think RDS should consider pg_wait_sampling as well, and pg_stat_kcache as well. Why not? So, yeah, it should be just an option. Users could decide. So back to our question.
I think if you have pg_stat_kcache, you can see actual writes happening from queries and you can see SELECTs and you know work memory is enough, so it's not temporary files. You can confirm it through pg_stat_statements or also through individual query execution via EXPLAIN ANALYZE, BUFFERS, and that's it, you're thinking what's happening, right? Why SELECT is writing to disk? Or at least, is it writing to the buffer pool, which is already quite a surprise, right?
Yeah. And well, until recently, I knew this was a phenomenon. And I'd read about it a couple of times seen on the main list and in a post or two. But I only knew of one of the potential reasons and the reason I think this came up recently is there was a good blog post by Alex Jesipow. I'm not sure how to pronounce that. Sorry. Who mentioned a second reason as well. So the first reason is related to setting of hint bits and maybe we can go into detail on that one first.
Cause that's the one I've seen most often and it comes up the most. Do you want to discuss that one first?
Yeah, let's do it
So why do why do we have hint bits and what are hint bits?
By the way, maybe it's Alex Jesipow but Alex is from Germany I see and works at Luminovo, which is interesting because We we work together last year. I mean with his colleagues. It's interesting. I just realized. It's cool. Yeah. Yeah, I'm going to send him this episode for sure. So we have 2 cases, right? 2 cases. And I'm not sure which 1 is the easiest. Let's start with easiest.
I think Hint Bits is easiest to understand, and it's most common. So my understanding of this is when new data is written to a page, at that moment, Postgres doesn't yet know if that could be part of a much larger, longer transaction, and we don't yet know if that transaction is going to commit or get aborted and rolled back. So we at that moment cannot set, we cannot say on the data page that this transaction has been committed and therefore should be visible to new reads of this page.
So that information is in a separate log once the transaction does commit. And that means when you go, if a, if somebody's reading the page, they need to check which of these row versions has already been committed for sure. And if there's any ambiguous, any ones where we don't know yet, because, but based on these hint bits, based on these 4 different flags, we can tell what status it is. Is it unknown? Is it definitely committed? Is it already out of date? Like is it already being replaced?
So if in the case of these reads causing writes, it's a subsequent read coming along, seeing a new row version that hasn't yet, that we don't know yet whether it's been committed, checking the commit log and then setting, like having read it once, it sets that and dirties the page, writes the page again to wow, so that any future reads now don't have to check.
Right, yeah, so yeah, I think this can be like invisible, dead, frozen, right, for tuple and the thing is that the most interesting part of it that this this write can happen on the replica which is kind of surprise for people, right? How come?
There's a really good, oh, sorry, just to go back to hint bits, there's a really good page on the Postgres wiki that describes it really succinctly, and there are 4 hint bits. xmin committed, xmin aborted, xmax committed, xmax aborted.
Yeah. So, yeah, that's it actually. So we just know about this phenomenon and that's it. It should not be a surprise that it happens. And subsequently...
And it's healthy, right? It's not a sign of anything having gone wrong in healthy like it will happen Quite often.
Yeah, and in case it is subsequent. So it already is not causing this, right? So so it's it's because it's already updated and it's kind of you can consider it a kind of like dealing with warming up caches, but it's vice versa because it's writes. So it's only, this overhead is only present on the first very call with the same parameters. Of course, different parameters can cause such writes in different pages.
And it's a necessary result of us wanting to be able to serve things concurrently. Because of MVCC, we need versions of rows. And we need to then know which 1 should be visible to which transactions. So in order to do that, this is a necessary part of that.
Right. So yeah, let's talk about the second case.
Yeah, this. Yeah. The second was more interesting to me. Like I hadn't come across this before.
Yeah. Let's talk about it.
So in Alex's post, I think you're probably right on the surname pronunciation, but I'm not going to try it again. He describes these as page pruning he calls it which I have not heard it called that before and when I think of pruning I think my mind naturally goes to for example the mechanism for removing an empty page at the end of the heap, for example. Is that also called pruning? But this is different.
I think it's truncation. If you talk about removing last page when vacuum does it, it's truncation.
That makes sense. So this is in almost like a, I think you just, before the call you mentioned it was described as a kind of in-page vacuum. Yeah. So it's like.
And it can happen during SELECTs on the fly, like it's interesting, right? Which is like also strange. But if we, if we recall how HOT updates are organized, which is great, a feature, unfortunately, not available always because it requires 2 special conditions to be met. First is we are updating only, we are changing values of the columns which are not indexed. And second is there is enough empty space in the same page where our old tuple is stored.
In this case, so-called hot chains are created, and it can be multiple versions in the same page of the same tuple. In this case, what happens when, if we have index scan, we only know the page and offset for the first. Indexes are not updated in this case because it's hot update. This is optimization to fight index write amplification, which is terrible Postgres MVCC behavior many projects are suffering from. When updating 1 row, having many indexes on the table. We deal with...
We need to update all of indexes, producing a lot of WAL writes and just making updates slow, heavy, and so on. In case of hot updates, those 2 conditions I mentioned are met. Postgres writes only to the same page where tuple is already stored because there is enough space and it doesn't update indexes at all because we are changing the value which is not indexed.
And it produces a new version tuple, new raw version inside the same page, creating chain, it can be new, new, new, and when we have index scan, index scan points to the first version in the page, and then it's quick to jump between versions and find the actual 1 inside the same page.
1 crucial thing that I think becomes important later is that that chain information is stored in the header of the page, whereas the row, like the data from the row version is stored at the end of the page.
Right, right. This is true. Yes, it's so. And if we already have old versions, at some point they need to be deleted. It can happen during vacuuming or it can happen earlier if during dealing with this page Postgres says, okay, we can clean up old versions right now, why not? And can happen during SELECT as well, which is very strange. Yeah.
Well, it's so cool. But I think, again, there's like a condition where it will only do this if there is not that much space left on the page. So I think the number is 10% of the page is left in terms of free space. And there's some subtlety around if you've changed fill factor.
And so which means that this is an effort to maintain hotness of updates further, because if without this, we would need to go to another page. And this will definitely lead to updating all indexes, right?
Yes. But I think the optimization of not doing it unless the page is quite full means we don't have to do it that often. So for example, if we have maybe like say a really quick 10-15 updates of the same row, and they all fit on the same page, and then we're having reads of that row in between, we're not cleaning up each version each time we do a read, we're waiting until we get full and then doing it in 1 go. So I think it's quite a pretty smart optimization.
It's a kind of trade-off, balance between 2 worst situations and so on. So We don't do it too often, but also we try not to allow this tuple to drift to another page, which would cause the need to update indexes because indexes point to the old page. This is quite interesting. Again this can happen during SELECT and yeah but it can happen only on the primary right yes or no it's my guess I don't know honestly why because because can happen on replica
yeah well I actually I don't know well let us know in the comments.
Yeah, this is an interesting question, because if it's happening on replica, it means we have different, very different content of pages on both replica and primary. vacuum happening on a replica, please no, because this changes the content of page. I cannot understand how page content should be synchronized in terms of
tuples. It's physical replica, yeah.
Yeah, so it should happen only on the primary because...
Okay, yeah.
And the replica should just get this content. Now, with Hint Bits it's different. Hint Bits gets additional information and it doesn't change how tuples are stored inside the page. This thing is changing. It's cleaning up the space for new tuples. So it should happen on the primary, I think.
That makes sense. Yeah. Logically on a physical replica.
Right. Okay. Well, good.
There's this 1 more open question. Yeah. I saw when this blog, this latest blog post was shared to Reddit there was a really interesting comment that didn't that no 1 has replied to and there's almost no activity on it but somebody posed the question I think it's a good 1 whether there might be a third possible cause of this. And that's, so it's more of a question to the listeners because I'm not actually sure, I don't know myself and haven't had, you know, to confirm for sure.
But there's a PD_ALL_VISIBLE flag on pages. And that might be a third way if if SELECTs can flip that based on visibility of all the rows in the or the tuples in the page then that might be a third case but I don't I don't know if SELECTs can so yeah again if you know I'll be really interested to hear
yeah yeah so I guess that's it right yeah so yeah
so yeah great great blog post there's also an old 1 from Nikolay Sivko on the Okmeter blog that I'll share So there's a few articles to read more about this.
Correction, Nikolay currently has a new startup which is called Coroot. Yes. Yeah, so then Okmeter is his old startup. And Peter Zaitsev from, like, founder of Percona joined Nikolay recently, not very long ago, and I'm like, Coroot is a great project. But this, I guess this blog post was created while Nikolay was working on the previous project and just started observing the writes happening from SELECTs and was very surprised and decided to blog about it.
Yeah, well it was a monitoring project as well, right?
Yeah, it had very good Postgres related features, which I guess in some practices were inherited by Coroot. But Coroot is amazing, just a couple of words about it. If you, if you like flame graphs, Coroot is, can show you dynamically, like imagine you have a dashboard, you choose time for Postgres and you see flame graphs but hanging down so they are like 180 degrees rotated or mirrored, right? Horizontal mirrored.
You can see details like If you have debug symbols installed, I think everyone should have debug symbols always installed. You see deep details what's happening, where time is spent. Dynamic diagnostics like this, it's crazy. I think probably I should add this to recommendations for Postgres platform builders, managed Postgres platform builders, because to consider Coroot is also like a good thing for observability.
It's open source, right?
It has open source. It has, I think, it's like an open core model. I'm not sure. But the main thing is definitely open source. Again, I'm not sure.
Thanks so much, Nikolay. Thanks everyone for listening. We'll catch you next week.
Thank you, Michael. See you soon.
