Extended Statistics

Nikolay

00:00

Hello, hello, this is Postgres.FM. My name is Nikolay, founder of Postgres.AI and as usual my co-host is Michael, pgMustard. Hi Michael.

Michael

00:09

Hello Nikolay.

Nikolay

00:10

How are you doing? How is your service doing? I saw a newsletter.

Michael

00:17

Oh yes, all good. Ticking along.

Nikolay

00:20

Improvements.

Michael

00:21

How about you? Yeah, also like bug fixes, you know. Yeah.

Nikolay

00:26

And improvements as well. We are preparing some big update probably next month. So I wanted to, actually you wanted to discuss this. It's your topic.

Michael

00:37

Yeah, very much my topic, it sounds like.

Nikolay

00:39

Yeah, and I think it's a useful topic, but I also want to be honest and, as usual, I want to be honest and discuss how often can we use that feature of Postgres and what's wrong. Not wrong, maybe, but let's just go like extended statistics is the topic. So it's what you can achieve running SQL command, CREATE STATISTICS, giving it a name. It's very similar to index creation, right? Similar, right? You create something

Michael

01:15

with, it's

Nikolay

01:16

named for 1 table only. You cannot do it for multiple tables. You choose columns, and there are several parameters you can specify. And it's very similar to index creation, but instead of building the whole structure additionally to speed up search of values, of entries, we just tell Postgres that it needs to maintain more statistics than usual, because by default it maintains statistics for every Column individually. Right?

01:50

And also, not only every Column individually, but also multi-column statistics, it does maintain and also expression, not multi-column, expression statistics. It maintains for expression indexes, like Function indexes. For example, a very simple common case is when we have a Column named email and we use text, not citext. Nobody uses citext. This Extension is not used at all. I don't know why it exists. It exists for ages. citext is an Extension.

02:22

It's actually a contrib module, so it's always available in any Postgres setup. And it's case-insensitive text. Because by default, If you create an Index on the Column named unique Index, it won't guarantee uniqueness in case-insensitive manner. So uppercase email, lowercase email, they will be considered different values. This is not what we want, because in reality emails, they are cancel sensitive. Case doesn't matter. So you need to use citext contrib module, but nobody does it.

03:03

I never see it in reality. Well, it's so exotic. I used it last time 10 years ago. So we usually end up having an Index lower email. And what I wanted to say is that by default implicitly Postgres will maintain statistics for this value, lower email. Right? Yeah. So expression. It's automatically. But CREATE STATISTICS allows you to maintain additional statistics like additional expressions or multi-column statistics without creation of Index?

Michael

03:40

Yes, although I've only ever thought until reading a lot about it earlier today I'd only ever really thought of the multi-column use cases. The individual column cases is interesting if you don't if you if you want to be able to use the statistics on an expression but don't want the overhead of maintaining an Index on it.

Nikolay

04:05

Which

Michael

04:06

is interesting but I've not seen that case before.

Nikolay

04:09

So for example we have some expression on a Column, a single Column, we want statistics but we don't have it, we don't want an Index. Can you imagine such case?

Michael

04:21

I haven't worked out 1 yet, but but it made it into the documentation. So I imagine somebody came came with that case 1

Nikolay

04:28

example Yeah, I want an example

Michael

04:32

But I think I do think it's worth us going back to basics a little bit, like why do we want or need these statistics? I think it's really important. And I was going to just say...

Nikolay

04:42

Actually this is what you should see working with individual query plans in pgMustard, you obviously have cases and as we discussed before recording, you mentioned that you have some logic in pgMustard developed to diagnose cases when something is off and probably you need to consider statistic creation.

Michael

05:11

Well yeah, this is the tricky thing. It's not super intelligent though. We look out for bad raw estimates and that they can be a real problem for query plans because the beauty of the Postgres optimizer is that it's choosing between various plans and it's doing that work for you.

05:30

You don't have to specify how it should get the data, you just ask your query and it works out, hopefully the fastest way to do that, but it's making a bunch of decisions about join order, and maybe that's a bad example for this topic, but a scan type, different join algorithms, all of these decisions are based largely on the number of rows being returned at each stage and therefore the selectivity of various things and calculating those selectivities

06:02

is tricky, like it's a hard problem and it's the reason why Postgres maintains statistics in the first place. It does that on a per column basis. The reason it doesn't do it on a combination of column bases by default is because just the number of those explodes so quickly like as soon as you've got a few columns in in 1 table the number of combinations of like is it like n choose 2 like they just explodes

Nikolay

06:30

right

Michael

06:30

and So it would be super expensive to do it for everything. But in certain extreme cases, which are actually quite common in real world applications, some columns are heavily correlated to other columns and popular ones and important ones often get queried together.

06:50

So an example I see really often in like software service applications are things like organization ID and team ID or something like that you know a team ID is always going to be associated with an organization ID always always always

Nikolay

07:06

or country and language another example

Michael

07:09

country and language is often is a common example as like car and manufacturer and things like that I cut and but the 1 reason I mentioned org ID and team ID is because often the data is then partitioned based on that or the primary key is some combination of those things. So they're really often always being queried together. And Progress's default, which is a really sensible default, is to assume that different conditions are independent of 1 another.

07:37

And in a where condition, that means it's really going to underestimate how many rows are going to come back, which can lead to bad things like nested loops being

Nikolay

07:46

used. It just multiplies estimates.

Michael

07:51

But it also has the opposite problem in group by if you group by multiple things and they're actually very related you're going to get a massive and over estimate instead of under so it can have both it can have issues in

Nikolay

08:05

both directions.

Michael

08:07

Yeah so once you know you're like as the maker of that software you're probably very understanding of which of your columns are going to be very... I think the technical language is functionally dependent on others.

Nikolay

08:22

How?

Michael

08:23

Well, like orgid and teamid, you know that.

Nikolay

08:28

I don't like this answer, You know that. I want the answer which could be written for humans and maybe not only humans. Some answer, like some step-by-step how to recipe algorithm or something, right? Look here, check this, and then decide. This is what I think is missing here because as with indexes, great statistics is something that is not working by default. It requires an effort from engineer.

09:07

And I think, as I told you before recording, I never used it for real in production to solve any problems, like 0 cases I have. It appeared, it was added to Postgres 10, 7 years ago. During these 7 years, several times I considered it. And you know, like when something, like we see, okay, the row estimate is completely off, like thousand times, million times, completely off. What to do? What to do? Of course, analyze, analyze, recollect the stats, see if it helps.

09:41

If it helps, no, it doesn't help. What to do? Then, old school approach, okay, like probably 100 buckets is not enough. default_statistics_target is a global setting, but you can control it on an individual column level, yeah, and you say, okay, I want 1, 000 buckets here.

10:02

And actually, I remember Tomáš Vondra was a guest on Postgres TV we discussed this and Tomáš's opinion was hundred buckets is quite a lot so it's not a good idea to always fix it increasing this number but I see sometimes sometimes

Michael

10:21

yeah I think that might also sometimes help because it not only increases the number of buckets but also increases the length of the MCV the most common values

Nikolay

10:31

but if you don't use if we don't use standard statistics

Michael

10:36

with regular yeah with regular statistics target on a column you also get a bump in the number of MCVs tracked.

Nikolay

10:43

Ah, okay, this I understand, yeah, yeah, yeah.

Michael

10:46

Which is, I think, might also be sometimes helpful for... It might sometimes solve the problem. So I can understand why throwing more sampling at the problem could be helpful there.

Nikolay

10:58

And I'm okay to spend more disk space and memory for this. The problem usually if you raise default_statistics_target to 1000, for example, is that analysis takes longer. Which is acceptable in many cases. I wanted to mention very big news which occurred last week, like you know it, you name it.

Michael

11:24

Well, the commit Postgres 18 that unfortunately won't include what we're talking about today, extended statistics, but it's still huge and really positive news. Although I've seen that there are some performance issues.

Nikolay

11:39

So in Postgres 18 it is expected that during the upgrade

Michael

11:44

yes

Nikolay

11:45

We all know that analyze is on user shoulders, basically, on DBA or DBRE shoulders. pg_upgrade doesn't run analyze. If it's in place, brief upgrade, very fast upgrade, taking only a couple of minutes. It's pg_upgrade --link or -k. So hard links are used, it's very fast, and it also dumps, restores the schema, because new version has different features, so this schema needs to be dumped explicitly and restored. This is automated.

12:23

But then it says, okay, now it's time to run analyze, and there is analyzer in stages, which I don't like. We talked about it a few times. But it's basically, it's outside of the scope of pg_upgrade to run analyze. We also, I criticized managed Postgres providers like RDS, Cloud SQL, Azure because they don't automate it as well and live on shoulders of DBR. And people keep forgetting it, and after upgrade, they don't have statistics at all.

12:56

So, big news is in Postgres 18, it was just committed and hopefully won't be reverted. Statistics will be dumped and restored.

Michael

13:06

Yes, and that's even better than, in my opinion, that's even better than analyze being automated.

Nikolay

13:13

Of course, it's better. This is the answer I've got when I raised this on Twitter on X many months ago saying that's not all right that managed service providers don't do it. Somebody said like let's wait and it was the right idea So if now statistics is dumped and restored, we have old statistics and it's good.

Michael

13:37

Yeah, and my understanding is this is a pg_dump feature, therefore it should work... Let's say it does end up in 18, it should work for the next upgrade you do, no matter what version you're coming from.

Nikolay

13:54

Yeah, I also thought about it. It's interesting. I'm not 100% sure, but of course, definitely you can run, you can use new pg_dump against some old Postgres server, and hopefully this feature will work. It means that we will be able to upgrade old servers to new without need to run ANALYZE explicitly. Yeah, and

Michael

14:21

it means less downtime. I consider that time you're running ANALYZE to still be there, could effectively be down, even if you try to restore connections, just because queries could be much, much slower than they should be. So, yeah, I think for me this could drastically reduce effective downtime for a lot of projects during upgrades.

Nikolay

14:42

Yeah, I agree, because in my opinion, ANALYZE in stages it's a wrong feature, completely wrong. And I prefer keeping an ANALYZE inside maintenance window. And of course, if you have defaults, this is a target elevated to some high numbers, thousands. In this case, it takes longer, especially if you have like some people have a lot of Database objects, a lot.

15:12

I just recently had the pleasure to deal a little bit with a case with 200,000 tables and more than 1 million indexes and if you decided to raise default_statistics_target there, ANALYZE will take ages and even if it's parallelized like yeah so keeping statistics is good.

Michael

15:33

Let's back to the topic though, you might still want to run ANALYZE after a major version upgrade. There are a couple of caveats, even in the 18 commit, as to things that won't get pulled across. And one of them is extended statistics. So if you are using extended statistics, you will, and they're important for your query performance, you will still need to run ANALYZE to actually get back up and running.

Nikolay

15:56

I always like running ANALYZE. For example, you know, if you have partitioned table, the vacuum doesn't maintain statistics on the parent table. Somehow if you run ANALYZE explicitly you have your statistics so I just like running ANALYZE.

Michael

16:13

I think there's even an improvement, maybe it was in 17 that you can run ANALYZE only on the parents so you don't even have to analyze all of the partitions, which is quite cool.

Nikolay

16:23

Yeah, so partition statistics on partitioned tables probably it's another topic and it's yes itself but back to extended statistics. So my point is since this is a feature which requires you to run some things explicitly and make decisions, there should be recipes. For indexes, we have a recipe. We have sequential scan, we see some filter or order by, okay, we know this index should help. We verify it, ideally on full-size clones. We see it helps, we deploy it to production.

16:56

People who don't have fast and cheap clones, they check it right on production and see it works. Okay. So that's it. As for statistics, which extended statistics, which was added as we discussed in Postgres 10, 7 years ago. During these 7 years, several times I said, okay, Maybe we need to create statistics here. Let's try. I remember 0 cases when this was solution.

17:24

We always ended up having different solution like creation of index or maybe redesigning schema sometimes to the question of previous discussion with Frank, right? I mean, redesigning schema, denormalization and so on. So I don't remember conclusion. Okay, in this case, CREATE STATISTICS is our solution to the problem and that's Saves me it Honestly, it builds some Weird feeling I'm missing something. Everyone is using something. I'm not using

Michael

18:03

Well, interestingly the reason 1 of the main reasons I wanted to bring this up is I think extended statistics might be underused globally I think a lot of people don't know that it exists.

Nikolay

18:11

Because there are no recipes that's this is my point we need recipes when Very concrete recipes how to decide it's worth using it right here.

Michael

18:20

Yeah, interesting. But I do have something as a chance.

Nikolay

18:26

Give me a schema only dump. I will give some LLM, good 1, this dump and ask to find just based on column names, to find some ideas what can have correlation, then we build statistics and hope it will help someday. But it's a weak solution.

Michael

18:46

I also think it's flawed, because I think if it's not actually causing query performance issues, then you're paying that penalty for Analyze being slower. Well, again... For no benefit. And actually, that's a problem.

Nikolay

19:00

I didn't finish my thought. I'm okay to pay some penalty to Analyze more, keep more statistics. It's not a problem. Unless we have hundreds of thousands or millions of objects. In this case, I will be very careful.

19:16

But if it's only like a few thousand of objects I mean indexes and tables it's it takes 1 minute or 3 minutes to run Analyze on this database it can be many many many terabytes 1 or 3 minutes I don't care and in terms of storage it's nothing and memory it's not a lot maybe I'm wrong but I think so.

Michael

19:37

I think you're right and I think that's important for people to realize that it's not dependent on your data size because we're sampling at this point like because it's a sampling thing it doesn't scale linearly with how much data you have.

Nikolay

19:49

Also important disclaimer, I have set of mind targeted startups heavily because I'm building my fourth startup and our clients, consulting clients and clients for products mostly are startups and that means rapid development, rapid, really rapid, sometimes a couple of deployments per day or more.

20:11

It means that maybe we, statistics we decided to keep, maybe it's not needed today, but maybe it will be needed tomorrow because we make some new features and so on right and it's maybe it's better to have it because we know there is correlation in between these columns so let's have extended statistics over all these columns but we lack recipes this is the point So like like recipes leads to under use of this feature. I

Michael

20:37

Think they might also I think historically you've also mentioned having a bias towards OLTP systems in production and I think there's also a chance that because often you're hyper-optimizing queries and that's a good thing going for index only scans and you know very very limited data retrieval very small number of rows a very very precise lookups you're probably not hitting these bad plans in a lot of those cases that you're skipping straight to a really good

Nikolay

21:10

that's interesting for

Michael

21:11

that's really hard for the planners to get wrong because you're giving it the ideal index to use for the important access patterns and it's possible that this comes up a little bit more when people are having to serve more user-defined queries, maybe analytical ones that have any combination of parameters.

Nikolay

21:30

Right, like OLAP queries and so

Michael

21:31

on. Yeah, exactly and I think in those cases maybe sometimes just not knowing what's going to come you can't create indexes for every case you do and then this is slow

Nikolay

21:42

down writes unlike statistics so this is a super big difference.

Michael

21:47

Yes so I think there's a chance that bias towards OLTP and also being able to often add the indexes you need for a smaller number of access patterns or that like a lot of queries per second might be that this just isn't as important in those cases. But in terms of recipes, I think the big 1 is bad row estimates on important and slow queries. So if a Query is important and a lot slower than you need it to be, and there's a really bad row estimate in an important part of the plan.

22:21

So like, not in a part of the plan that is fast, but in a part of the plan that is slow. The really famous case is a nested loop that thinks that the planner estimates it's gonna return very, very few rows per loop, so maybe like 0 or 1. So therefore, it's thinking, well, I only need to do this a relatively small number of times, therefore, it's quickest and easiest to get started with a nested loop and I'll be done really quickly.

22:47

But in reality we'll create maybe a thousand rows or more per iteration. Actually in those cases a hash join or merge join would have been it would so much faster. So those cases I think can benefit. But even actually there's a blog post I was thinking of that blogged about a time where extended statistics helped them a lot I put the plan through our tool and noticed that yeah a bad row estimate was as highly scored as it could be, but so was an index efficiency issue.

23:21

And I think, looking back, they could have added a better index and solved their performance issue just as well. So they've solved it with great statistics, but they could have solved it with better index and probably the query plan they got in the end could still be improved further by a better index.

Nikolay

23:37

So I have an idea for myself and who is listening is welcome to join. Next month I will have a challenge for myself. So every time I deal with some query optimization, I will put consideration of extended statistics at first places. I will think, can it be helpful here? And so on. Just prioritize it and think about it more. And after 1 month of this, 30 days, or month, February is shorter month, right? So 30 days.

24:18

After it, I will make my conclusion, should I keep this as 1 of primary tools for query optimization or it's just secondary tool? How does it sound?

Michael

24:32

I would love to hear that, and from anybody else that does try it. 1 thing I'd encourage doing is also benchmarking the performance hit on analyze of that tape. Like if you do add it to a table, how long does analyze take before and how long does analyze take after and I think it would depend on which there's like a few types of extended statistics that you can add and you can add all of them for

Nikolay

24:58

expressions or you mean

Michael

25:00

they could it there's like 3 There's like 3 parameters you can give it. So, nDistinct, dependencies, and mcv. And I think mcv will be by far the most expensive in terms of analyzed performance. Oh, by the way, another thing that the docs mentioned can be penalized is query planning time and maybe query planning buffers for that matter.

Nikolay

25:20

If you have a lot of extended statistics created by us, a planner can take longer?

Michael

25:27

Yeah, well it makes sense, right? It's considering more information up front.

Nikolay

25:31

Right, makes sense, yeah.

Michael

25:33

I don't know how, but I'd be interested in the benchmarks of how much longer, how much more, like.

Nikolay

25:38

Yeah, but in this case 2 things. First is, well, both Benchmarks sound to me like single session benchmarks. So it's not like we run a lot of like, we can take 1 session and check just speed maybe multiple times for good statistics, right? So First thing sounds to me like checking analyze speed. And I think we need to remember that if we talk about a single table, it's OK. I mean, no problem. But if we talk about many tables, it's worth remembering that it can be parallelized.

26:17

And its recommendation is to use vacuumdb with option hyphen, hyphen analyze and specifying number of jobs, number of processes, backends basically, right? Which will execute this. And if we are alone on the server, we can take, for example, the number of vCPUs we have on the virtual machine, for example. This is 1 thing. But if it's a single table, it cannot be parallelized. It will be a single thread at any rate.

Michael

26:48

No. And there's actually 1 more thing that I think worth trying just before we wrap up is if it is a huge penalty but looks valuable to query performance, which I'm not sure you're going to come across, but we'll see. I've wondered about reducing statistics target for those columns so you could reduce it and then add multi-column. It just is an interesting idea of getting back some of that analyzed time.

Nikolay

27:13

I know a very good SQL optimization expert who recommended reducing default_statistics_target, and I'm going to communicate with him very soon. Yeah, great. But this battle in my mind was clear like 10 years ago already. Like reducing default_statistics_target or per column parameter was considered not good idea in my mind, compared to increasing it somehow.

27:44

I don't know, like this is from various cases, But I'm going to use this as well as a part of my challenge just to think about this as well, because I will talk to him soon.

Michael

27:55

So. I've got an idea for you.

Nikolay

27:57

Let me comment on the second. I already start forgetting what you said about... What was the second 1? First is less benchmark, you said benchmark. I don't... I'm very careful. Analyze?

Michael

28:10

Yeah. And also query planning time or query planning buffers.

Nikolay

28:15

Query Planning time, this is good. Yeah, buffers as well, that's great. So increasing default_statistics_target globally definitely should affect this as well, right? Globally,

Michael

28:26

yeah.

Nikolay

28:27

Here, if somebody is going to do it. Don't make a mistake my team and I did a few months ago when we published an article. So when you just connect to Postgres and you do something, planning time can be much bigger because of lack of rel cache and it's loaded, right? Second time you run this, it's very different. So planning time, first planning time and second planning time, they are very different. If you have connection pooler, most people do.

29:01

It means that you don't, like it's not a good idea to make conclusions from observations of the first round Yeah, yeah, so so

Michael

29:09

right point.

Nikolay

29:10

Yeah, so if even if statistics Extended statistics affects planning time. I think should be so Question is is it only for the first round or the second round as well? Who knows, right? Because if only first round it means like rel cache all... Well, it's not rel cache here, right? Well, it's an interesting point, just to check.

Michael

29:33

Yeah, based on the notes in the documentation, I got the impression it would be for every

Nikolay

29:38

planet query. Because it's statistics, it's not real cache. So anyway, I would check multiple times.

Michael

29:44

1 more idea for you, especially talking to that person you mentioned I wonder if you could come up with a recipe for which columns to reduce it on I reckon there's a whole bunch of really boring columns like unique IDs you know primary key columns type thing Do we really need high statistics targets on those or could we get away with much much lower? Do you see what I mean like ones that we know are unique in it if we're keeping statistical on the most common values?

Nikolay

30:09

Well, I agree with you, but what's the point to optimize in this area? What what are our savings?

Michael

30:16

In my reducing the time analyze takes

Nikolay

30:21

As I said, I don't I don't really I Don't really bother by a longer time. Okay, it's 3 minutes against 1 So what

Michael

30:31

That is quite a big deal, isn't it, in terms of downtime?

Nikolay

30:34

I don't know. We have a recipe for 0 downtime. We can afford a couple of additional minutes being spent, and nobody notices it, because It's a target cluster running on logical replication. We can do very long analyze there. So even 1 hour, nobody notices. It doesn't affect downtime at all. So I don't know. I don't know. Like I can imagine this will save something. We can find many places like that, right? For example, in some cases we can start tuning autovacuum at table level.

31:22

And so many times I saw teams, various teams, went this path. And then we pulled them back, saying, let's stick to defaults. Because table level settings often cause more harm than good Because they add complexity, a lot of complexity It's hard to maintain. Then you go microservice, you have a lot of clusters and this table level Well, it's so hard to maintain. So and you talk about column level settings.

Michael

31:54

All I meant is, if you can find a really simple rule, yeah exactly, if it's a unique column, reduce the statistics target to 1.

Nikolay

32:04

This is good if it goes to hackers and the core itself, and implementation which doesn't require effort from user. This I agree with. But if it's... Again, My main concern, my main comment about extended statistics is it requires effort, it's not clear, there are no clear recipes how to apply it, so it ends up not being actively used unfortunately. Maybe in the next 30 days we will have some recipes, right?

Michael

32:35

Sounds... I'm looking forward to it.

Nikolay

32:37

Sounds good. Okay. Thank you so much.

Michael

32:39

Thanks so much Nikolay.

Transcript source: Provided by creator in RSS feed: download file

Episode description

Transcript