Monitoring from Postgres AI

Michael

00:00

Hello and welcome to Postgres.FM, a weekly show about all things PostgreSQL. I am Michael, founder of pgMustard and I'm joined as usual by Nikolay, founder of Postgres AI. Hey Nik, how you doing?

Nikolay

00:09

Hi, Michael. I'm doing great. How are you?

Michael

00:12

Good. Thank you. And this week is launch week or was maybe if people are listening in the future.

Nikolay

00:21

Yeah, I can tell you, I can assure you people are listening to us in the future because we are not live.

Michael

00:26

Yeah, of course. So yes, so launch week for your company, Postgres AI.

Nikolay

00:31

Yeah, for the first time. I'm excited. Yeah. So we decided enough observing how others do it. First of all, Supabase, they just had launch week before this. I think number 15 already. So many. And since our team has grown and we had stuff accumulated to be released, so instead of releasing it earlier, I said let's just accumulate a little bit and release in 1 week. It's like sprint 5 days in a row.

01:02

And we actually, we are releasing more, we, I hope we released by the time people are listening, I hope we released more than 5 things. So we have also concept of extra bonus releases. And we started on Monday, we started with DBLab version 4, which actually took 2 years to develop.

Michael

01:20

Yeah, that's a long time.

Nikolay

01:22

Yeah, finally we polished it and many users already installed it. I mean release candidates. We needed to have 5 release candidates.

Michael

01:33

That's more than Postgres.

Nikolay

01:36

Yeah, yeah. Well, because a lot of new problems appeared because we changed a lot there. Yeah, but it's another story. Today we talk about our second big release, which is called Postgres AI Monitoring.

Michael

01:47

Yeah, so this is 1 you're doing, so we're recording on Monday. This is 1 you're doing tomorrow. So what is it? What are you up to?

Nikolay

01:54

Yes, we are doing this tomorrow. Everything is ready already. So I'm pretty confident we are doing this. Yeah. And only a few people know that we are releasing this kind of monitoring. And it's called Postgres AI monitoring, but right now there is a little AI there. And this is intentional. Let me explain this first, because of course we had a short chat before recording and you noticed this discrepancy. We use AI a lot.

02:27

Every day, for example, in my team, it's prohibited to do anything without AI-first. It's prohibited to write code. It's already so. Well, you can write code, but only if your AI tools failed. And first people were resistant to this. Some people said, like, what the heck, I don't want it. It's bad, it's poor quality. But in my opinion, the Anthropic release of Cloud 4 a couple of months ago, it was a game changer. Quality already very acceptable in many cases.

03:09

Not always, sometimes, especially if you are expert in some narrow topic, you understand problems which were never discussed publicly. In this case, of course, you will see AI will be wrong 100%. We have these cases. For example, a perfect example is our guests last week. They experienced some problem which was not covered by public discussions. Obviously, in such cases, if you just apply AI in a straightforward way, it will hallucinate some bullshit.

03:49

But if you apply AI to brainstorm some ideas and ground it, pre-ground, post-ground with materials, reliable materials like source code, documentation, information from reliable sources and experiments, post-grounding, in this case it's different. You can explore, behave, reproduce it, and this is the path we chose. Not just this problem, take AI and create, I don't know, like change requests somewhere. This won't work. Ungrounded approach won't work. So this is 1 point.

04:26

Another point is that 6, 7 years I keep talking, I avoid building monitoring system. For example, when I met with Lukas Fittl in San Francisco 1 day a few years ago, I told him this, like, you're building great monitoring system, pganalyze. We build some checkup because we need for our customers to find problems and then to start exploring them, experimenting. Our key strength of Postgres AI team is experiments.

04:58

That's why we built DBLab to simplify specific type of experiments and benchmarks and so on. And I mentioned checkups and somehow pganalyze has checkup section since then, but it's a different story. And I said I'm not going to build monitoring system, but it was inevitably like I was pushed somehow to this point where it's inevitable that we need to do it because no other system, and not only open source system, not other system is providing abilities we need.

05:33

We need several specific abilities with our clients. We always show it. And some of them is for troubleshooting. So like high, very high level analysis of Postgres components. And then you can understand where you can dive deeper. And actually, I remember great talks from Percona, from Zaitsev, and so on. And I saw some bits of good stuff here and there, like Percona Monitoring and Management, or how is it called?

Michael

06:02

PMM.

Nikolay

06:04

Yeah. Yeah, yeah, yeah. And things like use and 4 golden signals approaches like methodologies. Some great pieces there. But then you see something is really missing. Like For example, proper query analysis. We talked about it so many times. Comprehensive analysis using pg_stat_statements, pg_stat_kcache, pg_wait_sampling, top-down, single query with query texts with proper characteristics. And We didn't see any system like this.

06:36

A few years ago, we started maintaining so-called pgwatch2 Postgres AI edition, which was a fork of pgwatch2 with our metrics and dashboards. But then Pavlo Golub decided to rewrite pgwatch. So we started lagging in terms of this vehicle we used as pgwatch itself. So time has come and we realized, okay, and we experimented a lot. We actually started with InfluxDB. It was insane. We eliminated it. We used TimescaleDB. And then we had a big thinking process.

07:15

Should we stay with PureSQL and TimescaleDB, or we should use PromQL? We decided to use PromQL, but with some nuances. We still have Postgres inside. So we have both Prometheus and Postgres. So the idea was we need to build a new monitoring system, which will be part of our process. And in consulting, we had maybe 30 clients over the last 1 year and a half. We had a few dozens of clients, right?

07:47

And to all of them, we installed always our pgwatch2 Postgres AI edition, and also used postgres-checkup tool, old tool we always used, to generate reports in static form, like markdown or HTML PDF. So this time we said, okay, we finally do monitoring, but the main part of it will be open source. So monitoring itself is open source, but it's a part of bigger, bigger picture where we perform health checks and help clients proactively using AI, of course.

Michael

08:24

So I'm going to ask your question always. When you say open source, what do you mean?

Nikolay

08:30

License is Apache 2.0. As I remember, pgwatch is a BSD or Postgres license? 1 of these 2, but our dashboards are Apache 2.0, so anyone can use it. Majority of that is... So we couldn't use a so-called traditional Postgres exporter for Prometheus because it's so much stuff and it's really slow to put pieces there. So basically we have our own exporter inside, I mean, metrics and dashboards. And it's Prometheus. Why Prometheus? Because we just see that in many cases, well, several reasons.

09:15

1 of the key reasons is that we had already 2, actually 3 companies where we, in consulting mode, we contributed a lot to implementing our approach to their own monitoring system. So this is why we have confidence this monitoring system is great. And usually, bigger companies, they have some stack. Some people use Datadog, but you know, if you have 100 terabytes of data, or say, 1000 clusters, like microservices, Datadog is super expensive. Some people still keep paying.

09:53

But it's not only expensive, it has blind spots in terms of Postgres monitoring. A lot of them, for example, pg_stat_statements, they have some not all metrics. And for essential metrics, they, for example, can give you per call, but not per second, or per second, but not per call. There is no completeness there. If you check carefully, so they differentiate in 1 dimension, but not another. For another metric, it's vice versa. So it lacks many things. And also how dashboards are organized.

10:30

I don't understand why people put Tuple stats on the front page. It's not the main metric to observe. Not the main metric. It's not the number 1 thing you want to see when you have an incident and need to troubleshoot. And people don't put queries per second because Postgres doesn't have queries per second. But we can extract it from previous assignments with some asterisk that it's approximate because some things can be lost.

11:00

Sometimes we see queries per second are lower than transactions per second, but it cannot be so. Right? Yeah. In fact, because we don't see full picture, but in most cases it's vice versa. So we see vice versa. So this monitoring is open source fully, and it can be used in pieces if your organization already has Prometheus-based something. But lacks a systematic approach for dashboards. This is where we can, I mean, you can take this and improve, take pieces, maybe contribute back?

11:41

Also good idea to contribute back always. And I still think, although I personally invested like more than 5 years already to this. First few years I wanted to say, no, no, I'm not going to do it. But then I was forced. Well, well, life sometimes force you to do something right. And now this is my like, I resurrected to this idea, like, okay, I'm doing this.

12:08

And yeah, and not only I, we have really super experienced members in our team, who also spent a lot of time thinking and contributing to this. It's not only my own effort. So we have several very good team members who contributed a lot. So we decided to use PromQL also because of this, because it's more compatible with enterprise systems.

Michael

12:34

Yeah, you mentioned using Prometheus, but also Postgres.

Nikolay

12:38

But also Postgres, because in Prometheus there is a problem with query texts. In query analysis, you do want, like you have to just have statements, But in Prometheus, it's not normal to get query and store it. It's very inefficient. So you need secondary storage, and we chose Postgres, obviously. Basically we have 2 data stores. 1 is the main 1, Prometheus. And the second 1 is... Actually we are debating right now. We probably will... Prometheus, it will be Mimir.

13:12

We also thought about Victoria Metrics. So it's still... We will switch in the next few releases to something else. And for query texts, it's Postgres, but it's a small 1. Because you know, by default, pg_stat_statements for each node, 5000 statements. And do we track all of them? I need to check. Maybe no. So 1 of the things we noticed, we had cases with our previous monitoring, that sometimes people have millions of objects in the database. And this is really requires optimization.

13:47

And we made this optimization for really large cases. I remember our checkup actually also had cases with 100,000 indexes. In case of monitoring pgwatch2, it was previous version, basically, right? Our Postgres AI edition, we had cases with more than 1 million indexes and hundreds of thousands of tables. Bloat analysis was not working. But like, it's and this is interesting and a separate question, how to perform index maintenance in such a situation, for example. But

Michael

14:25

it's also a weakness of pg_stat_statements, right? Because often, I don't know if this is the case always, but often when you see that number of objects, it's because somebody's doing per tenant, but by schema. So a lot of the tables are the same dozen or several dozen tables and indexes. And then pg_stat_statements kind of falls flat on its face because it sees objects as unique and it tracks them all individually. So you can't look at. It's flooded. Yeah, and 5000 is not enough.

Nikolay

15:01

It's not aggregating properly.

Michael

15:02

Exactly, yeah.

Nikolay

15:03

Yeah, we discussed this recently also, there is a problem with temporary tables, for example,

Michael

15:09

sometimes

Nikolay

15:10

it's like the same schema, same table, but somehow you end up having different queries, 2 different queries and pages are set in the same. I would personally prefer some control how aggregation is done. There is a compute_query_id mechanism. So basically it would be great to define a rule, for example, ignore namespace, ignore schema name, right?

Michael

15:42

Yeah, wow. Yeah, interesting idea.

Nikolay

15:45

Yeah, maybe.

Michael

15:46

Because the object, the other table names will probably, for most people, the table names would stay the same, it's the schema name that would change.

Nikolay

15:53

When analyzing logs, I still prefer, I know since Postgres 16 we have query ID in the logs. We can't have, we need to enable it for auto-explain, but I still often prefer Lukas Fittl's, already mentioned him, his second time, his great library libpgquery and Qt fingerprints. I know it works purely based on syntax, no OIDs from pg_class or something, so I know how it will aggregate. And it's mostly often enough.

Michael

16:36

Do you know if it can handle that? The different schemas, same object name?

Nikolay

16:41

Well no, I think it distinguishes them. If query has fully qualified names involving schema name, I think it will produce different fingerprints. But I guess it's easier to tweak it than to try to change.

Michael

16:56

Maybe.

Nikolay

16:56

pg_stat_statements approach and so on. I don't know. I just feel, yeah, it would be great to have some more flexible, some default path, but some exotic paths to tune it, how we aggregate. Because otherwise you end up with 5000 slots in pg_stat_statements by default, And then maybe 5 different queries, each 1000 times, right? But just with some differences. You know, the in problem, I don't know, was it fixed already?

17:27

I know, for example, the lipoge query, regardless of the members of in list, it will produce the same fingerprint.

Michael

17:35

Nice. Okay. I don't know if the feature set statements got fixed, but yeah, you're right. If there's like the difference in having 1 in list versus 2 versus 3, they were all considered unique.

Nikolay

17:49

Some people implemented some regexps to convert it before calculating something. Usually you want, if you have in-listed, regardless of number of members in the list, it should be considered the same. Semantically it's the same. 2 members, 3 members, it doesn't matter. But pg_stat_statements, that was until, like, I thought I saw some discussion. I like my memory tricks me sometimes. Maybe it

Michael

18:18

was definitely been discussions. I just don't know if there's any commit. I think gut feeling is it might be in a very recent version, but I'm not I'm not 100% sure without checking the but the like, for example, any if you if you if you transform it into a query with any, you're going to get the same entry in pg_stat_statements every time. So why should it be different with in lists? Like it's, it doesn't make sense.

18:41

Yeah. Although they could have different plans, like at a certain point it could flip the plan, but that's true of any query. That's true of any querying.

Nikolay

18:49

How work, how they work with not in and nulls, right. Remember this danger, right. So yeah, let's discuss. So yeah, open source, Apache 2.0, open source components, PG version 3 as like engine for this. So we just,

Michael

19:07

pgwatch

Nikolay

19:08

pgwatch version 3, which was written by Pavlo Golub, CYBERTEC, great work, a lot of optimizations, a lot of things like how we control, they call it sinks, right? Like how data streams are controlled and so on. And what else? Very fresh Grafana, obviously, right? And there we have several interesting tricks there. So where does this dashboards shine? If we talk about only dashboards, as I said, it's a part of bigger vision, but just talking about dashboards and manual use of them.

19:46

First dashboard is the troubleshooting dashboard. You have just 1 minute and you need to understand where to dig further, right? We use 4 golden signals there. So we have throughput, latencies, errors, what else? Saturation, right?

Michael

20:04

I didn't see that 1. Dashboard, okay, dashboards.

Nikolay

20:08

Dashboard number 1.

Michael

20:10

Active session history?

Nikolay

20:12

Yeah. No, active session history, This is where it starts. But it's a single node performance overview, number 1. Active session history, in our opinion, we're convinced after many years observing how RDS does it, others are also convinced. Wait event analysis is the king in monitoring. If you want 1 single graph or how to say chart, this is to understand like quickly performance and what's wrong. This is what you need to look at.

20:43

That's why we put it on the very first place on the very first dashboard. But it's only very high level, like 30,000 feet view. So only wait event types, not wait events. No query IDs, not particular wait events, only wait event types. So it's very, like, is it like CPU or I or something? And CPU, we mark with asterisk because it's not only CPU we know sometimes some parts of the code still are not covered by by wait events. So technically, it can be CPU or unknown wait event. Right?

21:15

Yeah. But and CPU is green is as everywhere is in performance insights, PASH Viewer, anywhere. So then we have very high level pieces to understand workload and various Postgres components. And we follow 4 golden signals there. So latency, traffic, situation, traffic throughput, right? Situation errors. Latencies, we extract from pg_stat_statements. I think it's a great idea actually to have it in the pg_stat_database. Probably there are reasons why it's still not there.

21:50

You can extract it if you have PgBouncer, sometimes I just check PgBouncer logs manually. To double check that I understand latency properly. But from pg_stat_statements, good enough. Of course, again, if you have a situation like we just discussed, and you see only the tip of the iceberg in those pg_stat_statements.max entries of pg_stat_statements, 5000 by default, then it will be a wrong value. So I think, yeah, we, we, we, we understand this, right? So it's like best efforts. It's kind

Michael

22:23

of an edge case as well, right? Yeah. It's not many people that have hundreds of thousands of objects.

Nikolay

22:29

Yeah. Yeah. And throughput, we like in various ways, like calls per second, transactions per second. So these are key ones. And we didn't put tuples per second, as you can find in Datadog and some other places. Maybe we should, but I'm not convinced this is super low level throughput metric should be on the very front. Maybe actually, because I'm contradicting with myself. I'm always discussing logical replication.

22:56

I say on modern Intel and AMD processors, single slot, Single publication subscription, it can saturate subscriber side single threaded process at levels like 1000, 2000, 3000 tuples per second, depending on various factors. So maybe tuples per second for writes, maybe we should put it, actually. Good idea, maybe. But this is very low level throughput. So in database you can send throughput in various levels. Transactions, queries, well, tuples next. Is tuples next? Or maybe rows?

23:37

Rows, tuples are different, right? Well, buffers, yeah, yeah. Actually, this is what we do have. We have hits and reads and writes, dirtied buffers. All those buffers we put it right on the front page, but only high level, we don't distinguish by query ID, this is the job for different dashboard. So you quickly can understand what's happening here. And of course, we translate it to bytes, right? So how many bytes per second of buffer reads your buffer pool is having right now?

24:09

And did we have spikes, right? So yeah. And actually, yeah, and errors. Yeah. So if we are errors, we will see it in transaction analysis, number of rollbacks. We currently don't work with logs, but it's in the roadmap, of course. So we extract errors from logs, aggregate, and so on. I still have hope that a tiny but very useful extension called logerrors. 1 day we'll make it into engine and we will have some pg_stat_errors or something.

24:44

It would be great to have some error code and how many times it occurred, right? Basic accumulative metric and we could start observing them in normal way, exporting and then alerting. Speaking of alerts, it's also in roadmap, definitely. We have very careful approach because talking to different customers, we see very different expectations, very like opposite sometimes. So yeah, I will discuss it. Yeah. In the past.

Michael

25:14

Well, I think it's interesting. I think alerting is fascinating and like balancing risks of false positives versus risks of false negatives. Like I think it's a really hard problem.

Nikolay

25:27

It's a hard problem.

Michael

25:29

And configurationally, People always just end up making everything configurable and it's... And picking defaults is hard or like... Anyway, it's not envious of people that make monitoring tools. They quite like making

Nikolay

25:43

it simple. Yeah, I think we will avoid a single default set, set of defaults, and we will just give some, basically, a questionnaire. What matters for you more? More like this or that? And based on that we adjust the setting right away.

Michael

26:01

Like when people first install it, there's like the old school wizards that you go through and answer a bunch

Nikolay

26:08

of questions

Michael

26:09

and then...

Nikolay

26:11

Old school Typeform, you know? I'm joking, I'm joking. Typeform is already old school. Because I see dental clinics use Typeform already. It's not cool already, it's old, ancient. I like actually, I love Typeform. I love the fact that you can use only keyboard, no mouse involved, navigate all the questions, answer them. So anyway, yeah, alerts is a separate topic, it's in the roadmap, not right now. We also paid very careful attention to metrics we display.

26:47

We had many debates about this, really. For example, we just talked about, for example, buffer reads per second. We translate them to bytes, so megabytes, mebibytes, gibibytes. We use lowercase I in the middle to emphasize it's gibibytes, mebibytes. So if we display it on X's Y, is it megabytes or mebibytes per second? And I don't see consistency in various monitoring systems. So I managed to convince we should display maybe bytes per second.

27:24

And to convince, I just said, okay, actually, you know what? We have all these metrics, it's differentiated by time. So if you display a raw metric, it will be constantly growing. It's a cumulative metric, it's just a number which always grows until you reset it. So since you differentiate it by time, it becomes already spiky or some kind of normal behavior. So it should be maybe by second.

27:51

But also, I said, notice we develop additional differentiating on the right side, almost always you can find a chart which is displaying the same metric per call, average per call.

Michael

28:06

So flipping instead of per second, per call, per query.

Nikolay

28:11

Exactly. So if it's, for example, execution time per call, it's average execution part of the latency.

Michael

28:28

That 1 you can't do.

Nikolay

28:30

Total exact time per second you can do. This is the key metric, this is the central metric in our approach. It's measured in seconds per second. We discussed this, yes. I'm glad we're repeating this topic. This is key metric.

Michael

28:48

Yeah, me too. I've clearly forgotten it.

Nikolay

28:49

Seconds per second is a key metric. My favorite metric. It shows you how many seconds database spends to execute your queries per each second.

Michael

29:01

And then do you have to divide it by cause to like understand you to like say shut up? No, no, no,

Nikolay

29:05

no, no, no, no, no. Wait, wait, wait, wait. It's 2 different things. We have some metric, for example, number of buffers, which is also always growing. Across all query ideas, here it's top level. We don't, we just summarize all of them, all 5000, right? Or we have total exact time, right? These are 2 raw metrics. They are always growing, growing, growing. You can let Grafana, basically PromQL using irate, You can let it differentiate by time.

29:36

We don't use mean exact time because we can have it from different place. Or you can divide it by calls. If you divide by calls, you have basically the same value as the mean exact time, but not Postgres calculated it, but PromQL calculated it. But if you divide it by time, it's differentiating by time. You will see how it changes over time. Every second we spend, for example, 10 seconds, give us at least 10 cores for this workload.

30:05

It's very roughly, it doesn't fare because this time can include waiting on lock acquisition, or I.O. It's not true CPU, user or system time. But roughly, it gives you a good understanding, what kind of, like the scale of your workload. If it's 10, don't run it on 2 core machines.

Michael

30:29

Yeah. Well, more to the point, is it growing over time or are we reaching some, you know, when are the spikes, that kind of thing. Yeah. Interesting.

Nikolay

30:39

And if you divide it by cost, you have average. So that's why we say metric slash S per second. And if it's seconds per second, we transparently say S slash S. Seconds per second. Why not?

Michael

31:00

Well, I just think it hurts my head even thinking about it. I think there's all sometimes a case in products for being accurate and sometimes the case of being simple or like easy to understand at first glance, and it depends who you're building for, right? Like accuracy is extremely helpful, especially for deep experts. They want to know precisely what happened all the time, but anything that could be distracting for folks who don't necessarily know what they're doing.

Nikolay

31:30

That's a great point.

Michael

31:32

Yeah. So, or it was the pre you've answered pretty much all the questions I had already, except for who is your, I know yourselves are your ideal user, but like, who is your ideal user for this

Nikolay

31:43

product? 99% ourselves. We do it for our own work, for consulting, but we see it beneficial to publish it as open source, so first of all, our clients could verify what we are collecting. Even Datadog publishes exporters, so it's fair. But for us, these dashboards, it's not final product. By the time we publish this release, it's already obvious. We already included this to workflow, where meta collection is step number 1 of big step number 1. It's just first half of the step number 1.

32:25

Dashboards is step number 1. And then API is exporting dashboards in form of static tables. Actually we didn't get there, but our dashboard with query analysis includes not only graph view charts, but also table view like our postgres-checkup has had. And table view is super beneficial because you can compare for each query ID, you can see many different metrics in 1 row. You can understand much better looking at very compact form of representation. It's for experts, it's not for non-experts.

32:58

Regular Backend engineer needs some time to understand how to work with this. It's possible. We did it, but it really needs some time. But since I said this is big step number 1, these dashboards and charts and tables and everything. Second step is analysis by experts, And experts can be humans or LLM. For efficient analysis, you need to transform raw observations, we call it, we transform observations to conclusions. You have some transformations where it's quite like we did with buffers.

33:37

Okay, 1000 buffers. Even for experts, it takes some time to understand is it big or not. But then we multiply it by 8 KB, and we have 8000 kB, roughly 8 MB. Not exactly, but roughly. We say, oh, okay, this I can feel already. I know how long does it take to get it from disk, how much of memory it takes. I already feel it. My point, my hypothesis in what we are doing is that both experts and LLMs work better with this transformed raw data.

34:14

If you present it better, This is what we do on second stage. Second stage, we've transformed observations very raw to conclusions, but we don't add thinking there. Maybe a little bit, you know, a little bit. Not debatable type of thinking, quite like straightforward logic, you know. Like We can calculate percentage, for example. This query ID in buffer hits takes like 25%. It's very noticeable. 1 fourth of whole hits traffic. 1 fourth of the whole hits traffic.

Michael

34:49

So, yeah, like, it sounds like the kind of arithmetic that's difficult to do on the fly accurately when you're in the middle of an outage or something.

Nikolay

35:00

It includes everything, like, arithmetic, like formulas. It also can include some representation, transformations, like put these numbers closer to each other. Just by putting this, you highlight that they have some correlation or something, you know? Yeah. Also, take specific charts. This is already like methodology we develop over many years, how we approach analysis of various problems. But still, we don't put conclusions there because we just observe.

35:32

We just slightly transform what we see to better form for consumption of bioexperts still and LLMs. And final thing is recommendations and understanding problems. Not only recommendations, first understanding problems. So We see problems and we already start thinking. This is already sometimes if you take 2 DBAs, they have 2 different opinions. What they see. Here is already a tricky part. But our idea is, okay, we have a brainstorm phase, a collection of hypotheses.

36:05

And then, since we build very great lab environments, we have thin clones, we have thick clones, we can have different tests, and We can verify ideas on thin or thick clones. Or maybe in synthetic, sometimes it's just before you're sending bug report or raise some discussion on hackers, you need to have minimal viable representation of your problem in synthetic form, not to share personal data, right? From production. It should be minimal.

36:39

It's also a special type of art in database operations to extract from production some minimal stuff and inject it. And here is where LLM can be very helpful, to iterate much faster.

Michael

36:55

I find this difficult for blog posts as well, like coming up with an example that you can share. Anyway, it's an art and very impressive and takes way longer than it looks. Well, at least it does for me.

Nikolay

37:08

It should take less. And I hope AI... I don't hope AI. AI is already improving this, but I think we are on the edge of massive improvements in this area.

Michael

37:20

I really hope so. I've tried it for a few, I've tried using even Claude 4 with limited success.

Nikolay

37:29

Yeah, I know it. Yeah. I know, I know some problems like you just give up. We spent whole last year experimenting so much. We still have it. We still have credits from Google Cloud, Gemini. Gemini excels in some parts of work, but not in all. And of course, Claude is not also not the final answer to everything. But I see many, like, There is a substantial improvement with Claude 4. It gives me a much more level of excitement of how we can move forward. This is our whole system.

38:16

Right now, almost everywhere is human, because we don't trust. It will take a few years to start trusting more. Maybe LLMs should be... Well, they don't stop. They are improving. Right? And knowledge base is growing also. Our knowledge base is growing, which we use when we work with LLMs to improve quality of answers. This is how I see it as a component in the beginning.

38:45

It can be used directly if you're an expert, or you can spend time understanding some documents like to dive deeper to details, and then not becoming an expert, but already to start understanding how things work. And then we transform it to form which is easier to move faster basically, not to jump between various pieces we already. For specific analysis we compile multiple reports to small report which is for this specific analysis. For example our checkup has a report called Memory Analysis.

39:18

Memory Analysis, if you want to tune memory, not only you need to tune shared_buffers, maintenance work_mem, work_mem, max_connections, but also check what's happening with your number of autovacuum workers. And also remember that in Postgres 17, before Postgres 17, even if you raised maintenance work mem beyond 1 gigabyte and put it to like 8 gigabytes, thinking, okay, I'm going to create or recreate indexes only in 1 session.

39:51

But autovacuum workers, you have them, you also raise number of autovacuum max workers, you raise it to 10, for example. And you think, okay, I know that, well, people don't know mostly, but I also didn't know until recently. But before 17, If autovacuum work mem is minus 1 It means that inheritance from maintenance work mem but maintenance work mem if you rest beyond 1 gigabyte It's still autovacuum cannot use more than 1 gigabyte.

40:22

But this was changed in Postgres 17, it can use more than 1 gigabyte right now. So if you tuned it before and you then upgrade, you can run out of memory. Suddenly, if you perform the vacuum tuning in more, like we always say, raise the number of workers, give more memory to workers, to backends which build indexes. Now you upgrade to Postgres 17, mathematics changes. So we have a report which collects all the pieces and involves the version and knowledge about these changes.

40:56

It requires some expert work. And we should not live in a world where this work is done manually. So it should be done in an automated fashion, collecting pieces, analyzing things, not to miss some important parts of the knowledge. So LLMs there, definitely.

Michael

41:19

Well not definitely. You can get versioned. I'm intrigued as to how much you'd end up using LLMs on this journey, because things like the version information, you can, like, you can get directly from Postgres, right? You can run a query and find out what version it is. You don't have to work that out or it's not something where like a language model is going to be beneficial.

41:45

And there's some, like, I don't do monitoring tools, but doing performance advice, it's like something learn to like, try to get good at. And I thought we would end up wanting to write bespoke advice in lots of different cases, but it's amazing how you can word things so that they cover different cases or conditionally show different pieces of the advice in different scenarios and piece together the full response without risk of hallucinations.

Nikolay

42:18

I agree with you. That's why on the first 2 steps we don't almost don't involve LLM because we don't want them to, to, to like, III cannot let LLM to summarize the numbers. I better use a regular calculator, right?

Michael

42:35

Yes. There were some things previous generations of computing were fantastic at.

Nikolay

42:45

That's why jumping from observations to conclusions, we basically don't involve LLM. We involve our methodologies we developed, how we analyze things, and we just transform. We don't need LLM there. But when we have problems, First of all, we need to understand which problems look most critical. This, I believe, can be done at least preliminarily. LLM can highlight what's the most critical.

43:12

Then most important, start hypothesizing about trying to understand root cause, trying to prove it, build experiment to prove it, and then how to solve it, build experiment how to solve it, prove that it's like drive this very hard work of root cause analysis and finding mitigation.

43:35

This can be done by LLMs because it can search similar results on our knowledge base and internet, find ideas, and using thin and thick clones and frameworks we build, it can start experimenting to check and to prove, oh, this is what's happening here. Of course, the problem we discussed last week with metronome guys, It will take time for us to reach a point where our system will perform full-fledged root cause analysis and experimentation part. But I would say it's like 1 or 2 years only.

44:12

We already had a lot of pieces built on this road.

Michael

44:16

IM. Well, I look forward to discussing it again then. Yeah. Yeah. Nice. Nice 1, Nikolai. Looking forward to it. Good luck with the rest of your launch week as well.

Nikolay

44:27

Thank you. Thank you.

Michael

44:28

In the past.

Nikolay

44:29

In the past. Back to the future. Or how to say. Good. Thank you so much.

Michael

44:36

Nice 1, take care.

Transcript source: Provided by creator in RSS feed: download file

Episode description

Transcript