Hello and welcome to Postgres.FM, a weekly show about all things PostgreSQL. I am Michael, founder of pgMustard, and this is my co-host Nikolay, founder of Postgres.AI. Hey Nikolay, what are we talking about today?
Hi Michael, let's talk about performance cliffs and 1 case particularly, track_planning. The reason why I brought this to our attention is that I observed recently several, actually over the last few years, I observed several strong cases in production systems, various production systems, where lack of planning phase tracking caused huge amount, huge effort invested to troubleshooting.
And you know, our best example is, we talked about this many times, during planning Postgres locks all indexes with access share lock. All tables, all indexes involved, even those which will not be used, it still locks them, right? And we talked a lot about lightweight lock, LockManager contention issues, right? And it happens during planning. And if we don't have track_planning, pg_stat_statements.track_planning enabled, it's not visible to our top-down query analysis.
We cannot see which queries spend a lot of time in planning and a lot of buffers, maybe, right, and so on.
Yes, and crucially, not only is pg_stat_statements not on by default although a lot of people have it turned on but pg_stat_statements track timing is off by default even when you enable pg_stat_statements so most folks that enable pg_stat_statements don't enable track_planning including most cloud providers.
So let's zoom out, actually. This is a good point. By default, Postgres presents only poor metrics for query analysis. If you have I/O timing enabled, in the pg_stat_database you see I/O timing. You have a number of transactions, you can understand how many transactions per second not queries per second. It's like Postgres doesn't track it and that's it so you don't understand like? You don't understand throughput. You don't understand the average latency in general. It's super hard.
And even at the highest level, like my whole database, how is it doing? How many TPS? How many TPS is fine? How many QPS? Queries per second. And what's about latencies? But if you want to go down and understand which groups of queries, parts of workload are responsible for, like, where do we spend time, For example, or do some work with the buffer pool. To do that, you don't have it by default. You need to install the pg_stat_statements extension.
And this is actually what most people I know do. This lesson is learned and pg_stat_statements is maybe the most popular extension because of that, right? Definitely. And we did have a couple of episodes about pg_stat_statements. It's hard to overstate that this extension is not good. It's really valuable. And there are opinions that it should be already...
It's time for it to go to the core engine, because it sucks that we install this extension and Postgres needs to parse, analyze query once again, just for metrics. So it should be part of core. But it's already a non-default situation when you have a good query analysis. And it has problems, as we discussed many times, it doesn't track failing queries and so on. It's another story. But next step, we install pg_stat_statements, we see all queries, well, top 5,000 by default, right?
It has pg_stat_statements.max parameter, which controls how many normalized, aggregated, or official languages, normalized queries, queries without parameters, are tracked. But then, at some point, and actually I think many folks don't understand it. Honestly, I didn't feel it so deeply as I do it now, like last couple of years maybe. We track only part of our, of Postgres work. We track only execution. We don't track planning.
And planning can be a huge headache if workload is heavy and machine is huge and so on. So in heavily loaded projects, not tracking planning means like you don't... You can, like, very roughly, You can be in trouble in 50% of production cases when you need to investigate what's happening, what's causing high CPU consumption, for example. But pg_stat_statements doesn't know, it doesn't see it. It doesn't see the planning phase at all. Only execution phase. So it feels like very weird.
It's like we have, okay, we install pg_stat_statements, we tell all people who have LTP install it, we know overhead is quite low, and we will discuss it in depth in a moment. But then we say, okay, actually, the pg_stat_statements extension we actually installed, it's like half of it. Half of the extension, because it's only about execution. But planning is a super important phase and we don't have it. So all folks who have pg_stat_statements, most of them, most of them.
All folks who haven't changed settings of pg_stat_statements, they have only part of solution.
Yeah, I would probably make the case that it's a bit more than half. Like I think planning in general is like a much lower proportion of performance issues are planning related than execution related, in my experience.
But I do take your point that it's not it's not 0 percent planning related so even if it's 80 20 or 90 10 it's still there's still a whole category of issues that we're not spotting by having this off and I think there's a couple of extra things like that I don't think it's just heavily loaded systems I've seen planning dominated queries that are analytical in nature where it's just somebody trying to diagnose why a single query is slow and that
has hidden as well if you're just looking at things like...
This is just-in-time comment actually.
Well, there's just-in-time compilation as well. But, yeah, good. No, but sticking to planning, I think there's a few confusing things.
And I see this mistake made by all sorts of quite expert people as well when they're they might be looking at a query a single query plan and they say this query plan only takes 100 milliseconds but there's 10 milliseconds of planning as well like they're only looking at the execution time they don't realize that the planning is in addition to the execution time, which they should be saying is 110 or how they should be summing those 2.
And the same is true in pg_stat_statements, we should be summing the 2. And we can't sum the 2 if it's off or we can, It's just it mistakenly tells us that there's 0 planning time where it's actually just not being tracked
and Just a set once had only metrics Call like total time to mean time and so on before Postgres 13 and since 13 it was renamed it was renamed right it never tracked planning right
there was no planning it was only execution but the naming made it clearer. And on 13...
And 13 it was renamed to total_exec_time, min_exec_time and so on and there is another group of metrics total_plan_time and so on, min_plan_time and so on. And it means that, okay, we have now the setting, but it's not on by default. And It means we have a second lesson to learn. Should we enable it? We say install pg_stat_statements. Don't go without it. It's the number 1 recommendation in terms of query analysis. Always have pg_stat_statements settings.
I remember when it appeared, some rumors said, it adds 7% of overhead, but it's worth it. Well, as we can see now, it's actually very small overhead under normal circumstances, but we had an episode showing, remember, 4 million TPS, showing that on edge case, overhead can be drastical and drop performance. It's like performance cliff. But I doubt normal projects reach that point. Everyone should be aware of that overhead, but again, it's a cliff. It doesn't come immediately.
It comes only when you reach a lot, a lot of queries per second for a particular query ID, right?
Yes, and I think this suffers. Tell me if you think I'm wrong I think the problem here is that the easiest way to stress test Postgres or the easiest way to do a quick benchmark is using pgbench yes which has this exact
it's not a lot of secrets immediately let's okay let's talk about like we are like problem like it just says Edmonds is number number 1 recommendation for query analysis. Install it. Everyone agrees? Good. There is some overhead. We don't see it. Like, it's not huge. Okay. Well, we saw it can be huge, but with some pathological workloads, right? And we will discuss it. Now, question of this episode, basically. Should everyone start saying, enable track_planning?
Should we, like, first default situation when Postgres doesn't have anything and then we install extension, okay, everyone learned this lesson. It's kind of solved. Should we consider the second lesson similarly and tell everyone enable track_planning? Because this gives you full power of producer statements now
Yeah my current opinion And I'll be happy to adjust this as the episode goes on is I would enable this very early in a project these days and Doing research for this. I did come across some interesting things. I didn't realize including why and when the default was made off. But yeah, I would say when you've not got high volume, high load, turn it on while you've not got high load and then reassess if you end up with a pathological
workload. Which I
don't think you will.
I understand this approach And usually it's good. For example, if we had some overhead which would grow monotonically with... Maybe linearly or somehow with our workload. I would understand this approach. Because take these weights and go to the gym with additional weights all the time. So you got used to it. But in this case, it's very different. We don't have any noticeable overhead, as we saw from benchmarks. We don't have it and it happens only at very, very, very, very extreme situations.
Let's talk about history. I just learned about what you found right before our recording so I was super surprised. Tell us more about it, very interesting.
Yeah so 1 thing, like an easy thing I do in preparation for most episodes is just check when was the feature introduced, what was the commit, like what was the discussion around it And I noticed it was in version 13 that it was introduced and I went to the version 13 release notes to see what was said, kind of as a high-level summary. And the release notes now have these awesome little links to the commits. Even for old versions? Well, 13 had it.
I knew they did it for the latest version and I was surprised to see them in 13 but very pleasantly surprised so thank you to whoever did that.
a portal, that's great. Yeah. So every item in RezNodes we can quickly trace discussion, commits and so on, right? That's great.
Yes, exactly. Or you can go to the commits and from the commits you can go to the discussions. In most cases. Yeah, in most cases. But often new features especially simpler ones like this like small like a new parameter they'll only have 1 commit this 1 had 2 which which piqued my interest immediately so I opened them both and the first 1 was pretty pretty normal it made sense that this was added.
And then the second 1, it was when I realized, oh it was during the beta, it was during the beta phase for 13, somebody had reported a performance issue with having track_planning on and turning it off made the performance go back to exactly how it was in 12.
So basically they had a regression according to a synthetic benchmark and then asked could we turn it off and there was it was pretty unanimous I think it got a few replies and all the replies were in favor of turning it off for the version 13 release and as far as I can tell it's not been revisited since.
Yeah, so I now understand much better what's happening here. Let me like unwrap it here. So what happened? It was obviously, it would be obviously good if it was enabled for all by default, right? But then it was learned that there is performance degradation 45%. Okay, it doesn't matter, actually, it can be 90% in some cases, it depends on your machine and so on. And it was related to spin lock contention in the results.
And this is exactly what we have recently observed in our benchmarks using our AI workflow which we discussed also several times. I wanted to say how I see our benchmark workflow for synthetic benchmarks, pgbench and so on. We built a lot, and we collect 80 artifacts for each run, very comprehensive configuration, everything is stored, we can iterate. And I see LLMs are just like kind of oil so eventually we should have well oiled machine right engine and but the Main part is not in LLM.
It's just it just makes it easier to iterate So we wanted to check overhead and we checked it So pgbench which was also used in like in this Research mentioned in the mailing list pgbench by default is not limiting TPS But in normal circumstances under normal circumstances in production, of course we don't, so this is not just load test, it's a kind of load test which is called stress test. So we are on the edge, We are checking what's happening on the edge.
And in production we don't have it normally. You don't want to work at 100% CPU usually in OLTP because you will experience various interesting phenomena like this basically, right? So spin lock. And we checked it and like we took very big machines, I think 192 cores with 5 generation, fifth generation Google Cloud, still spending Google Cloud credits, right? Fifth generation of Xeon scalable Intel and almost 200 cores, a lot of memory.
It doesn't matter because we also took small scale, scale 100, that means 10 million entries in pgbench accounts only. And we started, first we started not select only, no difference. Like if you use read-write workload, you don't see difference. But once you switch to select only workload, you quickly observe. So what we did, and like in our show notes, we will have links to detailed details, all details with all numbers and reproducible with all pictures and so on.
So we start with just 1 client and 10, 20 and so on, 100 and so on until 200 clients and we expect that we should basically grow in terms of TPS, right? Because more clients, more TPS, and Postgres these days scales to many cores quite well. But we quickly saw that we've enabled on this machine, quite powerful machine, right? We've enabled pg_stat_statements.track_planning Peak is reached after 30, between 30 and 40 clients. Very early. And then TPS goes down.
And without track_planning, similar picture actually, but it's later, until like 70-80, like 2 times to the right, 2 times more clients and TPS also higher. And once the peak is reached and we go down, so first the lines are together, but once the peak is reached and we go down, we see like kind of 2 times difference. Peak is reached 2 times sooner And TPS is 2 times slower. Interesting, right?
But looking at this, I was thinking, okay, why do we have peak without track_planning reached much sooner than 196 cores, I think, vCPUs? Because normally it should be there. How many cores we have? This is our maximum, most optimal point of load. There are nuances because we run pgbench on the same machine, so we limit number of threads by 25% of cores, so we couldn't have all CPU consumed by clients, basically.
So anyway, Then I recalled our February tests and we had a podcast episode about it. pg_stat_statements overhead, 4 million TPS. We needed to remove pg_stat_statements to reach maximum workload. I remember when we removed it, peak shifted to normal position, closer to number of vCPUs we have. So what does it tell me? Okay, I'm thinking, oh, it looks like pg_stat_statements has some significant overhead, but when we enable track_planning, this overhead doubles.
And then, since our automation collects flame graphs, it collects pg_wait_sampling analysis and so on, so we quickly identify there are spin locks indeed, but the same spin locks are present when you just use previous assessments. This is the most interesting part, right? So on flame graphs we see, and like, I think I should publish a report. If I have it already, please let's have it in show notes, link to the report.
So in Flame Graphs we see that without track_planning enabled, for this particular workload, we have spin lock contention, and there is a very wide s_lock function inside pgss store. pgss store is a function which saves metrics. So what's happening?
To pg_stat_statements, that's the pass, right?
Yeah, yeah. And like I would expect, like, I remember in February I expected, like, I wanted to quickly reach 1 million TPS and go further. And I know it was achieved by Alexander Korotkov in 2016, like 8 years ago. So I was very surprised I couldn't reach it easily. And only when I removed pg_stat_statements, I reached it.
So what's happening here is with pg_stat_statements used, not track_planning, just pg_stat_statements, if you have such a weird workload, like it's just a single query ID and it runs hundreds of thousands of TPS for this, QPS in this case, queries per second, For this particular machine, it can be lower. Actually, we reproduce problems on 8 core machines. This is super interesting point as well. Yeah, yeah. So this performance cliff can come to you on small machines as well.
But wait, wait, how many concurrent connections? I guess it comes sooner?
Yeah, well, it comes sooner. I don't remember from the top of my head, but I know this problem can happen sooner. So the idea is, if we have limited resources and transactions, these queries are so fast, like I think it was like 15 microseconds or so, it's super fast for primary key lookup. But they all fight to update metrics in a single pg_stat_statements record. And spin lock is required for that. That's why we see spin lock contention. So it's observer effect as it is. Pure observer effect.
And pg_stat_statements can have observer effect, but you need to reach a lot of queries per second.
Good point.
Because of limited resources, right? So, yeah, and this means that in reality, very unlikely you will see it. Maybe, maybe, but unlikely. It should be super fast index-only scan, for example, and the frequency is so high.
But When you look at flame graphs with track_planning enabled you see exactly 2 areas similar width Both are as log inside pgss store and 1 pgss store is in execution phase and other pgss store in planning phase So metrics are saved separately and if you enable they are saved 2 times. That fully explains what's happening here. If we enable it we just move performance cliff 2 times closer to us. Well if we sit in 0, right? If we shift it already, it's... Yeah, so...
Well it explains why in your synthetic benchmarks you got saturation twice as fast.
Yes, yes. All pieces of the puzzle are like... This took quite a lot. We had 4 sessions of various kinds of research starting in February with pg_stat_statements and idea, let's squeeze a lot of TPS from our big machines. But this is interesting, right? Let's think about it. So we have this pgss store. Obviously, I think it should be possible to find a way to store it just once. I mean, to save it once.
Well, I remember you suggested that when we were discussing privately, and since then, having read the discussions, it's a deliberate design decision to save them separately. And I didn't know about that. So you mentioned a while back, we don't even track queries that fail for example. Well now in a way we are like because the number of times a query was planned doesn't have to equal the number of times it was executed anymore in pg_stat_statements.
Once you've got track_planning on you can see the number of times it was planned versus the number of times it was executed.
Are you saying that if we have a failed quiz but it was planned, planned time would be saved?
Yes. That's my understanding from reading the... I haven't tested this, but that's my understanding from reading the discussions around the design.
Still I think it should be possible to optimize. Yeah, I understand the design decision, but if we think about...
Either the design would need to change or the... Yeah exactly but it's a trade-off and that's like I found that interesting.
I remember cases, terrible cases which were very hard, extremely hard to diagnose because pg_stat_statements didn't show like for example merge joins which like merge join itself like not merge join somehow like maybe hash join was used, but considering merge join, planar spent many seconds, like 5 to 10 seconds, just considering it. Because it needs sometimes to check the table actual data and see like min or max values for some columns, which is unexpected, but planner sometimes does it.
And if it takes too long, planning phase, even if merge join is not chosen eventually. In the end, we don't see it, right? Log manager contention, right? Sometimes with JSONB values, we have something like planning is long and consuming a lot of CPU, but we should see it. We should see it, right? And pg_stat_kcache, which tracks physical metrics and those people who are, I don't know, like on self-managed Postgres can use it and it's great.
It also has track_planning actually the same, same parameter, right?
Oh no, off by default.
It's all by default. Yes. So now we understand the design decision and we understand that some benchmark, which checked very, very like edge case, I would say maybe even corner case because you have unlimited TPS and you have select only single query ID it's edge it's corner case not just edge we have 2 edges coming here it's a corner case so likelihood of having it in production extremely low and decision in my opinion decision was wrong what do you think
yeah I think I agree I can see why it was made I think if you see it's quite a startling number if you if you look at 45% drop that was the number reported and shown. I can see why people were scared of that, especially when it was in the beta phase. But in hindsight, especially with what I know now based on what you've said, it seems like we'd be much better off having it on for everybody.
45% is roughly 50, right? It's like 2x.
It sounds scary though, right?
It's reasonable. If you are sitting on the edge, you are suffering from this penalty pg_stat_statements already gives you. You just double it, because pgss store is called twice. That's why it's roughly 50%. But the conclusion here is not, let's not enable by default. Let's put a note in documentation that pg_stat_statements is sensitive to the cases when it's a single query ID and a lot of high frequent queries happening.
I think there is that note, yeah.
If there is, it's good. Then track_planning just doubles this penalty and that's it, but it's not happening under normal circumstances.
It's a really good point that it's not track_planning. track_planning isn't the thing causing this. The track_planning makes it twice as bad. But twice as bad is quite a small multiplication factor when you're talking about such an extreme case.
Yeah. It would be like
if we imagine if we had twice as you know that the transaction ID wraparound is a couple of billion? Or a couple of billion in the positive. If it was 4000000000 in the positive, that would make a bit of difference, but not a huge, huge difference. In these extreme cases, twice isn't that bad.
So I have 2 separate questions here. Can we have a single pgss store instead of 2? And second question, should we recommend enabling it to everyone but understanding that doubles the problem of pg_stat_statements which already is present, it's there already, but you just don't feel it. You don't feel it, for example, on that machine, almost 200 vCPUs. You don't feel it unless a single query ID has like 200, 000 calls per second. 200, 000. It's insane. This is the edge. It's super far.
And without track_planning, okay, it's 400, 000. So not enabling it, you're just saying, okay, instead of 200, 000, we will have 400, 000. But looking at your statistics, you see, okay, maximum calls per second is what? 1, 000? 5, 000? 10, 000? It's already a lot, right? And usually we have different kinds of problems with such queries which are exactly present during planning time because locking, right? It's already a problem.
Yeah, such a good point.
And I wonder, if we have a high frequent query, a single query ID like 10, 000, primary key lookup in many cases, maybe in most cases. We say, okay, log manager can come after you, right? Let's cache it. I don't know, like prepare statements, or maybe plp.js.calc function, indirect caching and so on. Let's avoid planning. Planning will be okay, right?
I haven't checked, but that would be an interesting test. Actually, no. Yeah, you could do the same test again, but with prepared statements. It would be interesting to see if track_planning is updated at all. Presumably it's not doing planning so it shouldn't do that.
That's a great point I'm going to check it because like as I said with our almost well oiled machine I will just ask to repeat experiment and with hyphen capital M prepared we will see and basically it should have effect like we disable it but we keep it enabled right
with a pool and maybe like with some way of keeping the sessions Remember last time it was doing the first...
Yeah, I remember.
Would it need to be that or not? I actually can't remember. Maybe that would be fine in this case.
In this case it will be fine because the problem is like it's vice versa. pgbench maintains connection here. What we discussed is that like in reality, if connection is not maintained, we connect again and then we don't have cache. So the connection, okay, I will double check this area. It's good point. And additional research to make the picture complete.
But my gut tells me it should be enabled in many cases, but we should be aware of spin-lock contention, which can happen if we have high frequency. Honestly, I'm thinking to find some rule maybe to in monitoring, for example, we know this machine has this number of vCPUs, and we can quickly check how many calls per second, even if track_planning is not enabled, we can check it how many calls per second we have for most frequent,
Very high frequency. Order by calls, descending.
Order by calls, exactly. It's not hard. Yeah. So top 10 by calls, and how big is that? And okay, this number of cores, this number of QPS, this level of QPS, just roughly estimate how far we are from the performance cliff. Maybe it's possible to do this, right? For modern... Of course, it depends on many things. It doesn't depend on the plan, I think. Because saving happens only once. It depends on duration, but not on duration, on frequency.
And of course, on the resources, type of processor you're using. So I think it's possible to have some kind of prediction, how far we are from this. And if we are really far, let's enable track_planning. This is my point. Meanwhile, let's think maybe it's possible to save it just once and reconsider decision-making. I'm really thankful you made this research. I liked it in my puzzle. Like I still collecting pieces of it. Cool. So enable it, but it will double overhead and be aware of overhead.
This is the bottom line from pg_stat_statements.
Yeah. If you ran a managed Postgres service, would you enable it for everybody by default at the beginning?
Well, yeah, this is another piece of great info you found. We almost forgot to discuss it. You found the... So you found that many... Well, it's not surprise to me. You sent me the list of managed service providers and what's possible what's not there it's not surprise for me that most have it off by default surprise for me it was that Google Cloud even doesn't allow to change it
not just Google Cloud quite a few of the providers...
Crunchy Bridge right?
So yeah yeah I is it's definitely not a it's definitely not a full list, and I'm guilty of testing more of the ones that make it easy to spin up new instances.
But let's drop some names here. RDS or by default?
Should we go the other way around? Is the other direction is the ones that do make it configurable. RDS is I think still by far the most popular. Off by default but configurable. Bear in mind pg_stat_statements in RDS is on by default. So they do change some of
the defaults
around this, for example. Most of these provide, in fact, I can't remember the last time I checked a provider that didn't have pg_stat_statements on by default. You mean
on by default, you mean it's in shared preload libraries, or it's created in like template database?
It's, yeah, it's, it starts tracking queries without you doing anything. You don't have to create extension or whatever the normal thing is to do.
I don't remember should we create extension.
Almost all providers, because they rely on it for their monitor, like the monitoring they present you runs off pg_stat_statements for most of these. So it's on by default, but track_planning isn't. That's true for almost all of the ones I tested on. But I think Timescale...
Timescale is champion here.
Yeah. Well, it's configurable but it's also on by default which was, I think, the only one that I found that had it on by default. There probably are others I didn't check them all. It's quite time consuming.
That's great. I love this decision.
Yeah, it's great.
But a
few of the others made it configurable, which is nice. I just was surprised that at least one of the major ones doesn't even.
Yeah, Google CloudSQL even doesn't allow to change it and we... You gave up trying to check Azure.
Yeah, I had limited time and the UI and I don't... I need to talk to somebody there to teach me how to use it or something because I was really struggling.
Yeah, I understand that. So, yeah, so who can like, our recommendation is to for those who are listening from those teams like CloudSQL, Crunchy and Supabase guys like make it configurable. It's not normal that it's not. I would say consider making it on by default, but at least bare minimum should be let users to decide.
I think so, and I think normally when these providers change a default they'll change it for new instances only where I think there's just so such low risk so I like that approach of you know if people create new clusters or new instances, have it on by default. It will help people when they come across planning time related issues, diagnose those issues much quicker, much easier, whether that's on their own or once they pull people in.
Yeah yeah yeah that's that's that's for sure yeah so good I hope we brought some food for thoughts to guys who are listening to us. So yeah, thank you.
Nice one Nikolay.
Yeah, thank you.
Take care, have a good week, bye.
