What to log

Michael

00:00

Hello and welcome to Postgres.FM, a weekly show about all things PostgreSQL. I am Michael, founder of pgMustard, and this is Nik, founder of Postgres.AI. Hello Nik, how's it going?

Nikolay

00:09

Hello Michael, everything's all right. Let's discuss logs. Everyone loves to deal with logs, right?

Michael

00:18

What do you mean?

Nikolay

00:20

Well, I mean, if you have a huge server, a lot of logs, so good to deal with them.

Michael

00:29

Well, I guess we're blessed in Postgres with a few ways that we don't always have to check logs, right? There are some system views that we can look at various information.

Nikolay

00:38

For example corruption, right? There's no such view.

Michael

00:42

Yeah, yeah there was a but there's quite a few things that there's no such view. Well there's quite a lot of information that's put in the logs and isn't available elsewhere. So it is important. And when I say is put in the logs, actually quite often it isn't even put in the logs unless you change some settings.

00:58

So I thought it was, you had this good idea to discuss all of the things that most people should change in terms of what to log so that we can then check the logs for issues and when we're doing performance analysis and that kind of thing.

Nikolay

01:14

Yeah, So I think system views, ideally we should for everything we should have both system views and detailed logging.

Michael

01:22

Oh both?

Nikolay

01:24

Yeah, because like think about special statements, we have normalized queries only, it's not possible to have there every query with all parameters because cardinality will be insane.

Michael

01:39

But equally, I think it's quite often not possible to have every query logged to the logs, right? Like there's also limitations there.

Nikolay

01:45

Exactly. So pros and cons of both approaches. Cumulative metrics, they are good. You can visualize, you can see spikes, everything, but they lack details. And same for example, temporary files, right? We can see at high level, we can see from the pg_stat_database that, okay, we generate a lot of temporary files. Queries generate a lot of temporary files, some queries. We can go down to pg_stat_statements and see which query groups or normalized queries produce those temporary files.

02:18

It's already super helpful. But obviously, the same normalized query, if you take 1 set of parameters, it can produce gigabytes of temporary files. Another set of parameters, it can produce 0 temporary files. Depending on parameters, it can be very different. So we need examples, and this is exactly when we need to look at the logs. But at the same time, if we log everything, observer effect is not a joke.

02:46

And I must confess, several big clients suffered from observer effect when I decided to log too much. And it caused downtime and it was not pleasant experience. So we always need to think about observer effect. Those who say log everything, log all queries, they just don't deal with really heavy loaded systems, unfortunately. Yeah. Just try to do it in heavy loaded systems, which has like 100, 000 TPS. And that's it.

03:19

You will put down everything very quickly, you will learn it's not the way to go. So we need to limit the amount of logging, but we cannot avoid logging because we need examples. So my approach is to combine both things.

Michael

03:36

Yeah, it's a balance, isn't it? We need to record enough information to be able to diagnose issues when they occur, but not so much that there's excessive overhead. So it's a balancing act.

Nikolay

03:48

Right. And you need to think, like it's better to think in advance what to log, because if some incident happened, you somehow mitigated it. You want to do a root cause analysis, but logs lack this information. So you can only guess and you can guess quite well if you have a lot of experience, but still it's better to have logging tuned properly in advance. And I see Postgres in the latest few years, like improving defaults, but still defaults are not enough usually.

04:20

We tune some defaults for our clients always. I mean, we tune some log related settings to have more logging.

Michael

04:29

Yeah, in my research for this I found 3 different parameters that were all either introduced or had the default changed from off to some version of on all in Postgres 15. So I was kind of expecting a gradual improvement in this over time but it seems almost, I don't know if it's a coincidence or not, but 1 major version where quite a few things are improved and I haven't seen a whole lot of movement in other areas unfortunately.

Nikolay

04:58

Yeah you're probably right, Defaults changed in 15, you mean log_checkpoints and autovacuum related to 10 minutes, which is enormous still, like it's huge value. And what was the third?

Michael

05:11

And there's a new log startup progress interval.

Nikolay

05:14

But it's new, it's not a default change. But decision was made to not to introduce it being turned off by default decision, which is a great decision. I agree. And it's quite like 10 seconds, right? So every 10. Yeah, let's discuss everything, everything we have, right? Let's let's, Let's start from something, move to different topic. But I agree, so 15 was revolutionary in terms of bringing good defaults in logging.

Michael

05:44

In some areas, yeah.

Nikolay

05:45

Yeah, And also worth mentioning Postgres log, it's always a single log. So you can imagine if we had ability to separate streams of logging, for example, slow queries could go in 1 place, maybe to faster disks, or maybe to... There is a like this idea, very old idea, that some logging should go not to local disk but straight to some storage over network UDP and that's it like send it So we could log a lot of more stuff.

06:20

Although when you ask guys who support, I don't know, like Elastic or something, which is receiving a lot of logging from different systems, they also would like to limit, you know? Or Splunk, for example. Splunk is an old commercial enterprise system for logs. They can deal with huge volumes, but always still cautious about how much you are going to send to them.

Michael

06:44

Yeah. It's a good point, though. Other databases do have... I'm pretty sure MySQL has a separate slow query log, because I've seen people familiar with MySQL asking about where's Postgres' slow query log?

Nikolay

06:58

Yeah, it's not a big change, it could be like, I think it could be, I mean it's easy to implement, but I imagine how much of Infrastructure changes needed to be done in all the systems if slow query log would be separated What do you think

Michael

07:13

what what I'm missing something though? What's the benefit of having it separated out?

Nikolay

07:18

You can put it to different disks, faster disks for example, analysis wise also it's easier

Michael

07:25

Because the volume is likely to be so much higher than the other

Nikolay

07:29

Yeah, maybe Some rotation policies could be different. And analysis wise, the nature of like, I see a bunch of logs, which are related to query optimization. And I also see a bunch of log messages, which are related to like startup, autovacuum, checkpointing, like more like not about query processing, but everything else, right? And separation of them is, to me, it sounds reasonable. So I Imagine it would be possible if like, what do you think?

08:03

Imagine there is a new setting which says everything related to query processing, including log_min_duration_statement based, Everything that can contain a query text, actually. Everything. Like, for example, slow queries exceeding log_min_duration_statement. Queries generated temporary files. Queries which are slow because they are waiting for other queries and logged after deadlock_timeout 1 second by default, right? Imagine all of them go in 1 place, everything else stays in the old place.

08:38

But it's optional. You can decide to choose

Michael

08:42

it or not. Yeah, I haven't thought long and hard about it, but I do quite like the simplicity of everything being in 1 place. But it can also be helpful, right? I understand how if you're not worried about slow queries and you're looking for other things, they can be really noisy and get in the way. And in a heavy-loading environment, it would just be impossible without tools to then get anything useful out of that.

09:05

But like an autovacuum could be the cause of some slow, like some issue that you're facing with query performance. Or checkpoint could be the cause. And maybe in a less heavily loaded environment where you're reviewing it manually seeing that in in order might help you debugging like what's actually happening here

Nikolay

09:26

yeah makes sense makes sense and also some errors Should we like, what will happen with queries which are erroring out? Yeah. But, but still like, it could be a choice, right? Because people who are interested in errors, corruption related errors, I hope we will discuss them as well. And maybe a checkpointer behavior and so on, and less interested in query processing itself.

09:57

They often say, you know, like your huge query is being logged just because they are exceeding 1 second, they make our log really hard to read. Yeah. Well,

Michael

10:13

hard to read manually, right? With tools you can filter to the bits.

Nikolay

10:17

Or with DuckDB you can have CSV and talk to it through DuckDB queries, which I think is a great idea. Quite a new and great idea.

Michael

10:26

But I like this almost as a way of splitting our conversation. You mentioned the performance related stuff and everything else. Where do you want to start with 1 of those?

Nikolay

10:39

Yeah, we already talked about log_checkpoints. Let's maybe talk about this. So in 15 it was changed to on and that's it, that's great. I mean, it's easy. Like I think everyone needs it. And finally in 15, how long ago it was? 3 years ago.

Michael

10:56

Yeah, nearly.

Nikolay

10:57

It was realized that everyone should have it, everyone.

Michael

11:01

So our advice to people would be, if you're on a version older than that, turn it on.

Nikolay

11:07

100% so.

Michael

11:08

If it's not already.

Nikolay

11:09

Yeah, and another thing happened in 2015, it's related to autovacuum logging. So log_autovacuum_min_duration was always, I think before that it was minus 1. So it means the vacuum is not logged at all. And it was changed to 10 minutes. In my opinion, it's not enough. 10 minutes is way too long. It should be 10 seconds. In some places we set it to 0 to see everything.

11:38

Of course, I can imagine, especially if you have many workers, you also tuned the autovacuum_naptime to very low value so they sleep less between work. And this, or workers sleep less, right? In this case, I imagine amount of logging from a vacuum workers might be substantial. But at least like 10 seconds or a few seconds threshold, it's good. Because if you have a lot of tables, 10 minutes, you will miss a lot of details, interesting details.

Michael

12:13

And it's not that frequent. The thing if we go back to that, you know, recording enough information to diagnose issues without excessive overhead, the overhead of logging all of your autovacuums, unless you have, as you say, really, like, yeah, very low nap time and maybe lots and lots of tape, like, maybe you've got thousands of tapes, thousands of schemas, like if they've multiplied out, I think I heard somebody that was talking about having millions of objects in their Database.

Nikolay

12:41

All right, do you remember if you have, for example, naptime 1 second, You have 10 workers, for example, and you have longer running Transaction holding extreme horizon, so the vacuum cannot Delete some dead Tuples. How often those workers will visit this Table, which needs processing? Maybe they will do it in loop, right?

Michael

13:00

Yeah, good point. I don't know.

Nikolay

13:02

And if Table is very small, then quickly process it, sleep 1 second, again process it, sleep 1 second, so it will be a lot of spam, right? Unless there is additional logic in autovacuum, I'm not 100% remembering. It would be good to double check.

Michael

13:19

Worth checking, but also to once per second, when we're talking about logging, when we're talking about potentially, it's also still not that bad.

Nikolay

13:28

But 10 workers, Thousands tables. I can imagine something like this. But also, nobody... Like, default nap time is 1 minute. And default number of workers is 3. Yeah. So I... Anyway, like, I definitely... We always recommend to our clients to see more of the vacuum work in logs. It's really helpful, especially during some incidents, after some incidents, we have this useful information to understand what happened in terms of data that cannot be deleted. It also reports xmin horizon, by the way.

14:10

It reports it like, it reports the delta between current xmin and xmin horizon, so it's super useful. Yeah.

Michael

14:20

Nice, so just to be clear, anything, if you're on older than 15, make sure you have this set to something so that you're actually logging autovacuum at all. And even if you're on 15 or above, consider lowering it substantially from the default.

Nikolay

14:35

Yeah, 1 more thing here is that if we don't have activity at all, in this case, it will be producing some log entries which are probably not super useful.

Michael

14:49

If we don't have any autovacuum activity or any any kind of updates, delete.

Nikolay

14:56

Yeah, or maybe SELECT only during some long time.

Michael

15:00

But that was the argument for not having checkpoint logging on, if there's a silent database.

Nikolay

15:08

There was a battle about it, yeah.

Michael

15:11

Yeah, but we're talking about very, very few log entries in the grand scheme of things still?

Nikolay

15:17

I just mean if you, for example, if you build some service, if you have lots of databases or you build a platform for other people.

Michael

15:26

Sure, sure.

Nikolay

15:27

And you can imagine like many of them will have databases which are not super active, but logging has costs. If you have, say you have like 10,000 databases and 9,000 of them are not super active and you will be paying for this disk space for logs, right? This will make you think differently and try to save. But this is exactly like, this is different situation compared to, okay, I have 1 database, it's hot startup, AI startup, it's growing.

16:04

It's a single like monolith or few monolith or shard, it doesn't matter. And we need to understand our workload really well. In this case, we need to log everything, right?

Michael

16:15

Yeah, I get what you mean. But going back to the, let's say the platform that has 10, 000 of which 9, 000 are quiet, in that case why not rotate the, like, just delete logs more often?

Nikolay

16:29

Good point, So yeah, if you, for example, understand that you're going to log something that's probably in not loaded system is not super useful, but you know the maximum amount of that already, how many lines it will be, and you know it will be rotated so there is just constant price for each node that you're going to accept. And logs also can be compressed really well. Yeah I agree, I agree with this point as well. I just wanted us not to forget about different perspectives people might have.

Michael

17:06

Yeah, great.

Nikolay

17:08

1 big pet versus huge cattle you need to take care of. So yeah. Good. What about autovacuum? That's it, right? So we need to know details, especially about larger tables, especially when autovacuum is blocked. And if you're, for example, a backend developer, I think it's interesting for you as well, because it may sound autovacuum is just some, okay, it's a garbage collector in Postgres, right?

17:36

But if you think, like, if you have chances to deal with performance of system you develop, garbage collection part is very important because every time you deleted or updated something or had INSERT with ROLLBACK, you produce dead tuples. And basically, if you deleted something, COMMIT, return to your application code something, but it's only half of work done. autovacuum will do the remaining half of work, so to speak. And you definitely want to know when it happened, in which volumes.

18:11

At some point you will need probably to process a lot of rows, like delete millions of rows. We had an episode about this. And people keep asking, I saw comments yesterday, comments like, okay, deletes, what about updates? Well, similar, similar updates also, like they produce new tuple, they mark old tuple as dead, but the process of what exactly will clean it. And we need to understand its behavior.

18:42

And if we want to move really fast, We need to take care of cleaning, otherwise we will see a lot of huge gaps in storage. Space won't be reclaimed, it won't be reused very quickly, and this is called bloat, right? So we don't want this. So we need to understand garbage collection and Postgres, which is called autovacuum. Yeah, this is like, that's why you want autovacuum logging, even if it's, even if you are not DBA, but just backend developer. This was my point.

Michael

19:15

Yeah. And, And it's just like you mentioned at the beginning, there are system views where we can look at when was the last time autovacuum ran on each table, which is useful. But you can't see the couple of times before that, or...

Nikolay

19:29

You don't see details. How many... Yes, exactly. Was it in the horizon a problem, actually? Yeah. I think, by the way, is there some work in progress to extend it?

Michael

19:41

I don't know.

Nikolay

19:43

Because it's worth extending I think. pg_stat_all_tables or pg_stat_user_tables. It's useful, but it's just a counter and last timestamp. Not enough. 2 counters, 2 last timestamps. Or manual and automatic vacuuming. Okay, good. In the same release, just to close with Postgres 15, it was this log startup process interval. I don't think it's for backend engineers. It's more for DBAs because if they tune checkpointer behavior and distance between checkpoints is huge, significant.

20:15

In this case, startup can, might take long time and it's great that now we have this setting and as you said Default is quite good. So we can not to think about it at all Yeah, what's next?

Michael

20:30

What about log_statement? We talked about that in a recent episode as well. The default there is none, right?

Nikolay

20:38

Yeah, exactly. And I like to recommend people to set DDL, to put DDL there just to trace all schema changes. When it was done, who did it. Of course, actor may overwrite it. Unless, yeah, so, but it's good to have some default there. And I think there's a there's an ability actually to forbid overriding, right, some settings.

Michael

21:08

So I think there's 2, there's almost 2 categories here, right? You're talking about almost the security element of it. But I think even if we, even if we assume it's general case, it's still extremely useful to see when DDL changes were happening in the context of everything else.

Nikolay

21:24

The timestamp, exact, exact query. For example, it may save you time in future if you, for example, start dealing with logical replication, doing some experiments, and it's blocked by DDL. But you can trace when exactly it happened and which DDL it was, which is great.

Michael

21:42

Or even look back historically over the past 30 days and be like, which days do people run DDL on which ones don't they you know that you could plan an upgrade around when it's unlikely to have DDL happening.

Nikolay

21:54

It's super helpful in various incident analysis when you have deployment and something went wrong. We discussed it many times. DDL may be a reason if it cooked not properly, it may be a reason for downtime or some issues. And if you have log_statement equal DDL, you can trace it. Yes. Yeah. At least after the incident happened. So, definitely DDL is a good idea.

Michael

22:26

And the cost is very low, right? How much DDL are you running on your system?

Nikolay

22:31

It's not thousand per second, definitely.

Michael

22:34

No. I hope. You

Nikolay

22:36

know, I see systems which have DDL as a part of their normal workload coming from users. It happens. Well, for example, if it's a multi-tenant system, which creates tables for each customer and if they extend functionality, they also create something. Also temporary tables, which I don't like at all. Like I avoid temporary tables at all costs, but people use them and temporary table creation is also DDL.

Michael

23:07

Yeah, good point.

Nikolay

23:08

So I can imagine that there are cases when it can happen, like we have a storm of DDL coming to logs, but for me it's kind of edge case rather than normal.

Michael

23:23

So, great. Yeah. On the security side, I wondered about connections and disconnections are both off by default.

Nikolay

23:32

I don't think it's what this this is sometimes I see people turn it on. It depends on the particular system. In many cases, it's off. It's all by off by default. And this information is usually most interesting. It looks most interesting to DBAs, DBREs, not Backend developers. Let's move on closer to workloads and query optimization maybe.

Michael

23:57

Yeah, and I think we did have an episode either about pgaudit or about, did we do it specifically about pgaudit or about security or something like that? So I'll link up that episode because I think that covers probably the more interesting parts of the DBA side of things.

Nikolay

24:13

Sounds good. Yeah, maybe Let's talk about log_lock_waits before we focus on workload analysis. So it's also part of workload analysis, just specific part related to logs. And we also discussed it recently, I remember. And this is definitely what we recommend turning on because it's off by default. Yeah, because it's helpful. Also during incidents, that's helpful.

24:37

If, if you're developing your app, didn't think about multi-session concurrency control properly and blocking situations, you might have incidents, spike of active sessions, or maybe even downtime caused by heavy lock contention. And having logs related to this is really helpful. Really helpful, because without it, you can only diagnose something in runtime. Okay, you see there's a spike of exclusive locks.

25:14

So, With log_lock_waits, you will see that some session tried to run some query, but it was blocked by another session. And it was waiting, our query, was waiting longer than deadlock_timeout, which is by default 1 second. So if you, for example, try to update 1 row, but there is no transaction which is not committed yet, it may be already updated this row, but it's not committed, and we need to wait until commit or Rollback for the transaction.

25:45

And our update lasts, like, waiting period lasts 1 second. In some systems it may be longer or less longer if deadlock_timeout was adjusted. And this gets logged and you see, okay, our session process ID with executing query this was waiting for long and it was waiting for those process IDs. Unfortunately we don't see details about those sessions. This is disadvantage. But it's already half of the coin visible.

Michael

26:18

Yeah well imagine trying to debug without this, it just seems to me, reproducing ordering for locking issues just seems to me to be so difficult without this kind of information.

Nikolay

26:30

Yeah, also it's helpful if you have deadlock issues, and you try to understand your system better, this kind of logging is helpful. I cannot say it will answer all your questions, unfortunately. I had cases when deadlock troubleshooting was extremely challenging, and we had logging, we had everything, but it was still challenging. Sometimes It's really challenging. You need to go to application code understand what exactly It's doing but it's already something and it's better to have it always

Michael

27:00

Yeah, and this to me seems like a no-brainer. The trickier 1, balance-wise, seems to be the time, whether to reduce it from 1 second or increase it. And that seems quite loaded because it's doing 2 things, right? It's determining at what time lock waits get logged, but also at what time deadlocks get picked.

Nikolay

27:25

Deadlock detection happens.

Michael

27:27

Yeah, so it's doing 2 things.

Nikolay

27:29

I can, I understand the decision from engineering perspective? It was like, where should we log it? Oh, this is where we're checking it. It's good time to log it exactly at that time. So I can imagine this decision, why it happened. But I agree with you, there is some lack of flexibility a little bit. But also it's protecting us from huge volumes. Like let's talk a little bit about observer effect. Here I don't see it actually.

27:55

So at each session, like there is this like 1 second delay, basically by default, 1 second delay before it gets logged. If we have a lot of sessions okay how many? 100? 100 active sessions.

Michael

28:09

Max connections.

Nikolay

28:10

Yeah we might have 100 of these messages per second. Well it's significant volume But I think on this server with 100 active sessions and I hope with many more cores, I think disks are already quite powerful so they can handle 100 entries per second for this type of logging. But it might contribute in some cases, I think. Again, in some edge or corner cases, it might contribute a lot. So it's worth understanding. But personally, I've never seen the case when we decided to turn it off.

28:46

And I think it should be on by default, but it's off by default. So advice from us, turn it on.

Michael

28:53

Yeah, good 1.

Nikolay

28:54

What's next?

Michael

28:55

Should we move on to performance things?

Nikolay

28:57

Yeah, maybe temporary files first.

Michael

28:59

Yeah, good 1. Kind of bridge the gap a little bit.

Nikolay

29:02

Yeah and by default it's disabled minus 1.

Michael

29:07

Yes so this is when in fact actually I looked into it this this is logging when the temporary file gets deleted interestingly But what's happening is when a query is using some temporary storage, for example, if a sort spills to disk or doesn't have enough working memory available to do a hash join or any operation that uses temporary data.

29:36

So I had this under performance but I guess it is also just general system-wide tuning as well around config parameters like work_mem and hash_mem multiplier more recently as well.

Nikolay

29:49

Yeah, yeah. Yeah, I can imagine many queries will start producing a lot of logs if they need temporary files. It's so multi-sided question to me, because if you produce, so if you try to save on log, logging of temporary file generation cases, deletion cases, as you said, you already have issues with dealing with disk. The query processing itself was dealing with disk. It's better to capture that info and try to optimize it as soon as possible. But will I put 0 there?

30:28

0 means let's log all the cases even if it's just 8 kilobytes. Maybe no, but what I would do, what we usually recommend, I'm sharing 3 pieces of advice today. We say minus 1 is definitely not good. And of course, in some cases, if you don't see temporary file generated right now, looking at pg_stat_activity or pg_stat_statements, because they also have temporary file written and read a couple of columns, so you can see it there as well.

31:01

If you don't see a big problem right now, maybe okay, maybe not, let's not log it. But why not to be prepared for future, right?

Michael

31:08

Yeah, I would say the opposite. I'd say if you're not seeing them right now, you could even afford to set it to 0 because you're just not going to get much volume.

Nikolay

31:16

What if tomorrow memory is not enough and we start logging very tiny amount, tiny numbers and it will be a lot of cases?

Michael

31:26

Well, depends if you've if you've tuned work_mem, even if work_mem is the default 4MB, it would only be logging most likely queries that are using more than 4MB of memory and sorts and things, so how many of those per second are you realistically running? It seems unlikely to me that it's going to be super high volume.

Nikolay

31:47

First of all, work_mem should be increased. Yes. Right. But if you keep it very low and suddenly you, you changed like data volumes become bigger and bigger and for like for hash operations ordering we need more and more at some point we slightly exceed 4 gigabytes 4 megabytes and many of processes do it for example boom you you need to generate temporary files and you you like log_temp_files equals 0 add some text on it additional text, right? My approach is different.

32:26

I If we don't see it right now, but we grow This is scary moment for me. I would prefer seeing very heavily loaded systems already having some temporary files. And we understand them. And we go from quite significant value, like 10 megabytes, And we go down in stages. Restart is not needed, by the way, for all settings we discussed today. So far, restart is not needed, which is perfect for going in iterations gradually, right?

32:59

So we gradually descend to 0, but maybe not to 0, maybe 200 kilobytes or something.

Michael

33:05

But what's the difference? Like it's on. Are you going to get temporary files that are 100 kilobytes?

Nikolay

33:11

Well, if we put 100 kilobytes to log_temp_files, we will see only the cases which exceeded.

Michael

33:17

Yeah, but what I mean is if work_mem is 4MB and hash_mem_multiplier is 2 by default, we're only realistically going to have temp files that are more roughly, very roughly 4MB to 8MB.

Nikolay

33:29

I understood you. Well, obviously we think about it differently and I'm not sure who is right. You think when memory is not enough, everything goes to temporary file. I think when memory is not enough, only Delta goes to temporary file, which is a

Michael

33:44

good, Good point. I think it's actually different in 2 different cases. So, for example, like sorts, I think it's all or nothing. But I think, for example, with bitmap scans that have non exact

Nikolay

34:03

Michael, we have homework to do. Yeah.

Michael

34:08

It's a good point, though. You're right. There probably are some operations. I think including bit scans

Nikolay

34:15

In this Interesting question. Yeah, I Talk to some hackers. It might sound very simple actually.

Michael

34:25

I'm pretty sure with sorts it's all or nothing. But it's a good point that going down in stages makes a load of sense. And I was just trying to say I feel like you can skip it if you've currently got it.

Nikolay

34:35

SERGEY VASILYEVICH Yeah. Why go down or nothing? I understand this logic, yeah.

Michael

34:39

But for me, low is almost as good as 0. Like, there's no difference realistically. It's 100 kilobytes, great. If that makes things much safer in extremely heavily loaded systems with lots of potential max connections, great, low is still good.

Nikolay

34:55

Yeah. But again, this is a useful parameter because if you have cases when, we discussed it in the beginning already, when you have cases you can see some normalized query but you need examples to understand how to reproduce it and then decide what to do with it. Raise work_mem globally, locally, this will be a question.

Michael

35:15

1 user, for like an analytics user, that kind of thing.

Nikolay

35:18

Right, right, right, yeah. Yeah, so okay, good. And memory management might be quite challenging, like it's sometimes not trivial. Good, So let's talk about the main part of query analysis. It's pg_stat_statements is not enough, right? So you need to either have quite low log_min_duration_statement, at least 1 second, preferably lower, like half a second, maybe 100 milliseconds, to capture more cases of slow queries and then optimize them. Right?

Michael

36:00

Yeah, absolutely. And I like your point about them being examples. So it's not necessarily that the average of that query is going to be running in 100 milliseconds. We want to catch the outliers. Like, what are the parameters for that query that mean it does tend to run over 100 milliseconds? So we get those examples. It's perfect.

Nikolay

36:19

And

Michael

36:19

again, 1 that we can lower in stages?

Nikolay

36:23

Yeah, definitely. So I will start if never, like if it's big unknown, like from 5 seconds and go down. Doesn't make any sense to start with really large value because I assume we talk about OLTP systems where, for example, 10 seconds is already an absolutely slow query. I usually see 1 or half a second as a value being used these days. And if you already performed a lot of optimization efforts and you want to improve even further, you can go down. But maybe, maybe it's worth it instead of this.

37:05

Oh, before we go there, there's actually an idea you can go down to 0 but switch to sampling. There is statement-level sampling and transaction-level sampling, which is supported since a few years ago.

Michael

37:21

Yeah, you can have 2 levels, can't you? You can set log_min_duration_statement to, let's say, 100 milliseconds, and that will log everything above 100. Then you can have like a second threshold that is log_min_duration_sample that's like, you could say 10 milliseconds and then you separately set a log_statement_sample_rate to, you could say like 0.1 and that would do 10% or 0.01, it would do 1%. And that would log, let's say, 1% of the statements that were between 10 and 100 milliseconds.

37:55

So, yeah, it would be a way of, like, capturing some examples.

Nikolay

38:00

Yeah, like 1% of everything, and everything, like, 100% if it's above the old log_min_duration_statement threshold. And this was added in Postgres 13, I think it was adjusted later. 14, so we can say it's already almost all versions.

Michael

38:22

All supported versions, yeah.

Nikolay

38:23

All supported versions, yeah. Actually, 13 is the oldest already, wow. Yeah, time flies. Good, good. So we talked about this, but there's also idea instead of logging queries, let's log them with plans. And just jump straight to using auto_explain and just not to use what we just discussed. Log_min_duration_statement and log_min_duration_sample, blah, blah, blah. So just auto_explain brings plans, which is great. And...

Michael

39:00

Yeah, and it also does log the statement, so you do still get the query. But you could, in theory, have a third setting, you could have a third value, so you might want to log the queries above 100 milliseconds, and you have a separate threshold for logging auto_explain plans at a different threshold if you wanted to.

Nikolay

39:22

It was you who surprised me a lot a couple of years ago when you said auto_explain support sampling for ages. Yeah. I haven't noticed somehow. Yeah, but it's great. You can log every plan, execution with plan. You log actual execution with plan, but only 1% samples of it, right?

Michael

39:44

Yeah, so this is slightly different though, I think, because with the sampling on the log_min_duration_statement, I get the impression it's well, I'm not sure maybe maybe this is a question I get impression it's tracking them all but only logging a percentage of them. So you're then saving yourself from an overhead around logging. Because

Nikolay

40:11

it's... SLAVOJ Yes, calculate metrics and then log to disk.

Michael

40:17

SIMON Exactly. So there's the observer effect of observing, of tracking timings, and then there's the second observer effect of actually logging. Whereas with auto_explain, the sampling, you don't even measure a certain proportion of the, so you, it is, it's subtly different, I think.

Nikolay

40:37

Yeah. What would you recommend? Like we can have so many options here and worth saying we can, it's quite flexible. We tune it all. And then you think, okay, this is a big complex system. We have so much logging. We log everything. Not everything, sometimes same, sometimes not. But then you think, actually, you know what? I still need to do such statements because I want reliable, higher level metrics to understand. Because I identify 10 queries I need to work on, but where to start?

41:11

What is the most influential in various terms? Like I need to produce a Statement Stone to realize that. And I connect on by query ID and query ID in auto_explain is supported only since version 16, I think. So this was a big missing piece before 16.

41:32

But once you're on 16 or 17 or maybe already 18 beta 1, I don't know, then you connect by query ID and you see, okay, based on log analysis, I see these plans need my work, need attention or these queries, But I'm choosing this one because I see it's like 25% of all time spent based on pg_stat_statements. And this is a great way. So this brings us to the idea that having logs is not enough.

Michael

42:04

Yeah, well, they're kind of serving slightly different purposes, aren't they? Because if a query is taking 25% of total CPU, like if it's just taking 25% of our resources, It would probably be relatively easy to reproduce that. If we grab the query, grab an example that's, you know, relatively slow, and run EXPLAIN ANALYZE with all the parameters on that, we'll probably get a slow query plan.

42:34

But where auto-explain comes into its own is if queries are only every, like, if they're slow sometimes but not always. If there's very specific. Exactly. It's cases like that that might be difficult to diagnose. Maybe they happen at certain times or with, you know, in cases that you're struggling to put your finger on. Getting the plan at that moment is extremely helpful. And it's not always possible.

Nikolay

43:07

Do you think auto_explain should be in a different log?

Michael

43:12

I mean I still...

Nikolay

43:14

Let me... I maybe already mentioned this recipe in the past, I used it. So you have a tmpfs disk, RAM disk. Yeah. Like gigabyte, we have memory, so 1 gigabyte. And log goes there. And you quickly rotate it to ensure it's not filled up. And send it somewhere to storage. And it would be great to have separate log for such very bloaty things like auto_explain, right?

Michael

43:48

Well, and that's the big difference between auto_explain and just logging the query. Because even queries can get long, but query plans can be ridiculously big. Like the largest one I think I've seen was getting close to 100 megabytes of

Nikolay

44:04

text. Just one.

Michael

44:05

Just one query plan. Okay, a lot of that was like, there was a lot of like whitespace and a lot of repeated

Nikolay

44:12

identifiers. It should be compressed, immediately compressed and sent over network. But compressed by whom? Logging collector or? Yeah. That's an interesting question.

Michael

44:22

So it is the big downside in terms of volume.

Nikolay

44:25

So we see so much power we have right now for observability and query processing, query optimization, but still it feels like in heavily loaded systems, it's hard to have, like, it's not super hard, but there are opportunities to improve. 100% so.

Michael

44:42

Yeah. And if you've got a bit of, if you've not got issues right now it's great to set a bunch of these and you can afford to set things like auto_explain up if you've got overhead if you've got headroom you can set these things up in advance and then change them if that if you start to get into you know the 90% of CPU being used you can look into things you can do to reduce those.

45:07

But I really like a quote, I think somebody was at an Oracle conference and somebody asked, what's all the overhead of all of the performance analysis Oracle's doing. Oh yeah, of

Nikolay

45:18

course, Stanel Potter.

Michael

45:20

Ah, nice.

Nikolay

45:22

Yeah, he said something like, I can pay like 10% or something overhead, but save like 90% thanks to my optimization based on those information.

Michael

45:33

Yes, I think it's along those lines. I think if you have this information.

Nikolay

45:38

It's very exposition, you know, like, if you have this, but you're not going to use it, you're losing, right, additionally. But if you are definitely like, you are going to allocate your resources to optimize, ready to pay this penalty like or tax, but because I know I will optimize and I will be on a better spot after it.

Michael

46:03

Yeah, exactly, it's not a liability, it's an asset at that point.

Nikolay

46:07

And those 10% they will shrink, thanks to my optimization. Compared to initial latencies, they may become smaller, right?

Michael

46:20

Maybe, if they're a percentage, I've got a feeling some of them are fixed.

Nikolay

46:24

But if we are not sure that we will be able to optimize, I can understand that it can sound risky to pay this additional price without a reliable outcome. Who knows how successful we will be. But again, all of these things, except putting auto_explain to shared_preload_libraries, you can share them on fly. So it's so great.

46:49

You can temporarily lower all thresholds during a week or 2 while you're optimizing, for example, and then get back to your normal and do it like every quarter, for example, like routine optimization. So cycles of optimization. Also if it's non-production or something, you can be more aggressive in going down with thresholds and having more, to see more, yeah.

Michael

47:18

It feels like 1 last thing on auto_explain. The thing in the docs and quite a few experts will recommend is turning off the timing parameter. So you collect the EXPLAIN ANALYZE information but without per node operation, like per node timings. It does reduce the overhead, of course. But it also reduces the, like, fidelity of those plans and it is quite useful.

Nikolay

47:47

Well, yeah, I don't agree with you. For me, for me, most valuable is structure and buffers, as usual.

Michael

47:57

Of course, I get that they're more useful, But I think quite often in a lot of environments collecting the timing information is also worth it.

Nikolay

48:07

Well, yeah. You had a post long ago about this, but we also had a whole episode about auto_explain. Do you remember?

Michael

48:16

Yeah. I'll link up the relevant episodes. It's a

Nikolay

48:19

big topic. AutoExpand has many settings to tune, so it's a big topic. Good, I think maybe that's it, right?

Michael

48:27

Yes, definitely. Thanks so much, Nikolay. Catch you next week.

Nikolay

48:31

Yeah, Bye-bye.

Transcript source: Provided by creator in RSS feed: download file

Episode description

Transcript