Top ten dangerous issues

Nikolay

00:00

Hello, hello, this is Postgres.FM. I'm Nik, Postgres.AI, and as usual, my co-host is Michael, pgMustard. Hi, Michael, how are you doing?

Michael

00:10

I'm good, how are you?

Nikolay

00:12

Great, everything's all right. A lot of bugs to fix and incidents to troubleshoot, to perform root cause analysis as we say, RCA.

Michael

00:24

It sounds related to our topic today maybe.

Nikolay

00:27

Oh yeah, maybe yes. So The topic I chose is, I'm still not 100% sure how to name it properly, so let's decide together. But the situation is simple for me, relatively. So we help a lot of startups, and at some point I decided to focus only on startups. Being like raised a few startups and helped many startups. I know how it feels to choose technology and grow, grow, grow until some problems start hitting you. This is exactly like usually people choose RDS or CloudSQL or Supabase, anything.

01:11

And they don't need to hire DBAs, DBREs, and they grow quite well until a few terabytes of data or 10,000 TPS, that kind of scale. And then problems pop up here and there, and sometimes they come in batches, you know, like not just 1 problem, but several. And here usually, for us it's good, they come to us. I mean, Postgres.AI, we have still consulting wing, quite strong and growing. And we helped more than 20 startups over the last year, which I'm very proud of.

01:49

And I collected a lot of case studies, so to speak. And I decided to have some classification of problems that feel not good at very high level, for example, CTO level or even CEO, when you think they might start thinking, is Postgres the right choice or it's giving us too much headache? And it's not about like, oh, out of disk space suddenly, or major upgrades requiring some maintenance window. Although this also can cause some headache.

02:24

But it's more about problems like where you don't know what to do, or you see it requires a lot of effort to solve it properly.

Michael

02:34

Yeah, I've had a sneak peek of your list, so I like how you've described it. I also like the thought process of whether it hits the CTO or the CEO, and I was thinking, let's say you have a non-technical CEO, if they start hearing the word Postgres too often it's probably a bad sign. Ideally you might mention it once every few years when you do a major version upgrade, but then nothing bad happens and they don't hear it again for a few years.

03:03

But if they're hearing, you know, if it's getting to the CEO that Postgres is causing problems over and over again, the natural question is going to be, is there an alternative? What could we do instead? Or, you know, is this a big problem? So I guess it's these kind of dangers, not just to the startup, but also to Postgres' continued use at that startup.

Nikolay

03:26

Yeah, I like the word dangerous here because when you deal with some of these problems, it might feel dangerous to have Postgres for them. It's bad. Like I would like if things were better. So I have a list of 10 items And we can discuss and the list is unordered and I'm going to post it to my social networks so folks can discuss. And I'm like sincerely think that this list is useful.

04:00

If you're a startup, it's great to just to use this checklist to see how your cluster is doing or clusters are doing and are you ready. So Postgres growth readiness checklist. And interesting that I didn't include vertical and horizontal scaling there. I did it indirectly, we will touch it. But obviously, like this is the most discussed topic, the biggest danger, how Postgres scales, like cluster, single primary, and multiple standbys, how far we can go.

04:41

We know we can go very far, very, very far on a single cluster. At some point, microservices, or maybe sharding, it's great. But we had a great episode with Lev Kokotov, a PgDog and it resonates 1 of the items I have today it resonates with what he said during our episode. So anyway, let's exclude vertical and horizontal scaling and talk about stuff which kind of sounds boring. My first item is heavy lock contention. This is very popular. Maybe 50% of companies that come to us have this issue.

05:23

Somehow. So at some point I decided to start saying everyone, if you have queue-like workloads, or additionally, and or, If you don't know how dangerous it is to change schema in Postgres, just adding column can be a problem, right? We discussed it, I think, many times. You are not ready to grow, and at some point, sooner or later, it will hit you. And it will hit you as a spike of active sessions.

05:56

And we know some managed Postgres platforms provoke you to have huge number of huge max connections.

Michael

06:04

Max connections,

Nikolay

06:04

yeah. RDS like 5, 000, 2, 500. Why do they do this? Easier for them. But it's dangerous because it creates kind of performance cliff additionally.

Michael

06:19

Yeah, it's another version of these cliffs isn't it?

Nikolay

06:21

Michael

06:21

had another good episode recently.

Nikolay

06:23

Yeah, I plan to research this a little bit, probably we will publish something in this area to prove that it's not good. It's still not good even if you have Postgres 14 plus which has great optimizations for a large number of idle connections it's still not good.

Michael

06:42

And there have been some improvements like I know a very good engineer who took Postgres down by adding a column with a default, I think it was. But it was many years, there's some improvements in recent years of some DDL changes that are less dangerous than they were.

Nikolay

06:58

Yeah, there are several levels.

Michael

07:01

Yes, yeah, of course. And if they get stuck in if they don't have a lock_timeout for example in fact yeah we're probably going to be pointing to episodes on every single 1 of these bullet points but we have we had 1 on 0 downtime migrations I think it's probably the best for that and we had a separate 1 on queues actually, didn't we?

Nikolay

07:17

So yeah. Yeah. So definitely there are solutions here and you just need to proactively deploy. It's interesting that I see some companies grow quite far not noticing this problem, for example, with DDL. It becomes, it's like going to casino, like you can win, you can win. Sometimes, boom, you lose. Because if you deploy some DDL and you get blocked, you can block others and it can be a disaster. We discussed it several times.

07:50

And if you had a hundred deployments successfully, it doesn't mean you will keep winning, right? So it's better to have. And it concerns me, I have a feeling we should implement this in Postgres. Like alter table concurrently or something like this. It should itself perform these retries with low lock_timeout.

Michael

08:13

Yeah, it's tricky, isn't it? But I agree, but then people still need to know that it existed to actually use it because I think the main issue here is people not realizing that it can be a problem. And the fact it probably hits users. Let's say you've got a statement_timeout. When are you actually going to notice that users have been waiting for it? Are you going to notice that spike on your on your monitoring?

08:38

I'm not sure like it depends how many users actually got stuck waiting behind it and had slow queries. So and it's going to be hard to reproduce that you might not know why it was that. So

Nikolay

08:49

log_lock_waits is off so you don't see who blocked you. And you might be auto vacuum running in this aggressive mode or it can be another session long running transaction which holds access share lock to a table and you cannot alter it. And boom, you block others. So this is like a reaction chain. And yeah, it's not good. And queue-like workloads same, like at some smaller scale, you don't see problems at all. Then you occasionally experience them.

09:26

But if you grow very fast, you will start hitting these problems very badly. And they look like spikes of heavy lock contention or just heavy lock and lock in Postgres terminology is the same so just lock contention and yeah it doesn't look good so and suggestion is so simple like I It's funny that we talk a lot and people that come to us actually they mention they watch podcast and I say like, okay, do you like workload?

09:59

Just take care of indexes, take care of bloat, like maybe partitioning, but most importantly, skip locked. That's it. This is a solution, but we spent hours to discuss details. Because when you go to reality, it's not easy to learn this, like there are objections sometimes, but this is what we do, like we work with those objections and help to implement, right? So yeah, but we, yeah, for everything we had episode. There are episodes for everything. So this was number 1, heavy load contention.

10:34

And I chose the most popular reasons. Of course there are other reasons. But in my view, DDL and queue-like workloads, not to number biggest, the biggest ones. Okay, next it's boring, super boring. Bloat control and index management. We had episodes about it, maybe several actually. But Since again, managed Postgres platforms don't give you tools. For example, RDS, they did great job in auto-vacuum tuning, but only half of it.

11:15

They made it very aggressive in terms of how much resources, like throttling, they gave a lot of resources. But they don't adjust scale factors. So it visits, autovacuum visits your tables Not often, not often enough for LTP. So bloat can be accumulated and so on, and they don't give you resources to understand the reasons of bloat.

11:44

I'm thinking about it and I think it's tricky and it's also a problem of Postgres documentation because it lacks clarity how we troubleshoot reasons of the bloat because we always say long-running transaction But not every transaction is harmful. For example, in default Transaction Isolation level Read Committed, transaction is not that harmful if it consists of many small queries. If it's a single query, it holds a snapshot, it's harmful.

12:17

So I guess with observability we should shift from long-running transactional language to xmin horizon language fully and discuss that. Anyway, like I can easily imagine and I observe how people think, oh, like MongoDB doesn't have this stuff. Or some other Database system, they don't have the problem with bloat. Or indexes, indexes, oh. Actually with indexes, my true belief is that degradation of index health is happening in other systems as well. We also discussed it.

12:49

So they need to be rebuilt.

Michael

12:51

I was listening to a SQL Server podcast just for fun the other day and they had the exact same problem. But in the episode where we talked about index maintenance, I think it came up that even if you're really on top of autovacuum, even if you have it configured really nicely, there can still be occasions where you get some bloat.

13:12

If you have like a spike or if you have a large deletion or you have like a there's a few cases where you can end up with sparsely populated indexes that can't self-heal like if for example you've got like an or like a even UUIDv7 index and then you have a section that maybe deletes some old data and it's not partitioned, then you've got a gap in your index. So there's a bunch of reasons why they can get bloated anyway, even if you're on top of autovacuum.

13:41

So I think this is 1 of those ones that, yes, autovacuum fixes most of the problems, but you probably still want to have some plan for index maintenance anyway.

Nikolay

13:50

Yeah, so there are certain things that are not automated by Postgres itself or by Kubernetes operators or by, Well, some of them automated some things, but not everyone, not everything. Or managed service providers, even upgrades. Also like lack of automation there. We can mention this lack of automation of analyze, but fortunately future Postgres versions will be definitely fine because dump, restore of statistics is implemented finally and goes to Postgres 18, which is super great news.

14:26

Anyway, lack of automation might feel like, oh, this is a constant headache, but it's solvable. It's solvable. Fortunately, it requires some effort, but it's solvable. Okay. Next thing is, next thing is, let's talk about lightweight lock contention. So we talked about heavy lock contention or just lock contention. Lightweight lock contention is also, this feels like, like pain and of various kinds. So lightweight locks can be called latches, it's in memory.

14:57

So when some operations with buffer pool happen, for example, there are the lightweight locks, Postgres needs to establish or working with WAL or various data structures. Also can mention LockManager. So things like LWLock:LockManager or buffer mapping or sub-trans SLRU, multi-exec SLRU.

15:22

When you hear this, for me, like these terms, imagine like this font, like bloody, you know, like red blood, blood drops, drops of blood because, because III know so like many projects like suffered big pain, like big incidents. So for me, these, these terms are like bloody terms, you know, because, because yeah, because it's, it's, it was a lot of pain sometimes. For example, you know I'm a big fan of sub-transactions, right? Just my natural advice is just to eliminate them all.

16:05

Well, over time, I'm softer. I say, okay, you just need to understand them and use very carefully. But LockManager, couple of years, remember Jeremy Schneider posted like-

Michael

16:16

Yeah, great post.

Nikolay

16:17

Horror stories, and we discussed it as well. So this kind of contention might hit you and it feels like performance cliff usually so all good all good boom

Michael

16:30

right it's it what is or was is it changing in 18 but it what it was a hard-coded limit, right?

Nikolay

16:37

2016, you mean for fast path? Also SLRU sizes are now configurable, I think in 2017 already. Well, nice, good, but not always enough. Because okay, you can buy some time, but still there is a cliff and if you're not far from it, again, boom. Or this, I recently saw it, like, remember we discussed 4 million transactions per second. And we discussed that we first we found pg_stat_kcache was an issue, it was fixed, and then pg_stat_statements.

17:11

Okay, pg_stat_statements, if the transactions are super fast, it's bringing an observer effect. And we see it in newer Postgres versions as LWLock:pg_stat_statements because finally code is covered by proper props, right? Not props, like is wrapped and it's visible in the wait event analysis observing just that activity. So I saw it recently at 1 customer, I know like some layer of lightweight lock pages and statements, So we need to discuss what's happening.

17:47

It happens only when you have a lot of very fast queries, but it can be a problem as well. But yeah, and performance cliffs, it requires some practice to understand where they are. It's hard because you need to understand what kind of, like how to measure usage, how to understand like situation risks. This requires some practice.

Michael

18:11

I think this is 1 of the hardest ones. I think this is 1 of the hardest ones to see coming.

Nikolay

18:17

After all our stories with LW LockManager, every time I see some query exceeds 1000 QPS, queries per second, I'm already thinking, okay, this patient is developing some chronic disease you know.

Michael

18:35

Okay that's another that's 1 I haven't heard we've done several rules of thumb before but that's that's another good 1 so a thousand queries per second for a single query check.

Nikolay

18:45

It's very relative also how many vCPUs we have. If we have less, it can hit faster. Although we couldn't reproduce exactly the same nature as we see on huge machines like 196 cores, we couldn't reproduce that nature on 8 core machines at all. So yeah, it's for big boys only, you know. This is, yeah, this is, like, or maybe for adults. So young projects don't experience these problems.

Michael

19:22

That's a good point actually the startups that have hit this that you've written about and things have tended to be further along in their journeys huge huge but yeah but still growing quickly and it's even a bigger problem at that point but yeah good point should we move on?

Nikolay

19:37

Yeah so the next 1 is our usual suspect right it's a wraparound of 8 byte transaction ID and multi-exact ID. So many words already said about this. It just bothers me that monitoring doesn't cover, for example, usually doesn't cover multi-exact IDs. And people still don't have alerts and so on. So it's sad. Yeah, it's easy to create these days.

Michael

20:10

I get the impression though, I mean, there were a few high profile incidents that got blocked about. I think, yes, yeah, exactly. And I feel like I haven't seen 1 in a long while and I know there are a lot of projects that are having to you know, I think Agen have spoken about If they weren't on top of this it would be it would only be a matter of hours before they'd hit wrap around, you know it's that kind of volume. So they're really really having to monitor and stay on top of it all the time.

20:41

But I haven't heard of anybody actually hitting this for quite a while. Do you think, I wondered if for example there were some changes to I think it was autovacuum being like I think it kicks in to do an anti-wraparound vacuum differently or it might be lighter a lighter type of vacuum that it runs now I think I remember Peter Geoghegan posting about it, something like that. Do you remember a change in

Nikolay

21:05

that area? I don't remember, honestly. I just know this is still a problem. Again, at CTO level, it feels like, how come Postgres still has 4 byte transaction IDs and what kind of risks I need to take into account. But you are right, managed Postgres providers do quite a good job here. They take care.

21:30

I had a guest at Postgres TV, Hannu Krosing, who talked about how to escape from it in a non-traditional and in his opinion, and actually my opinion as well, in a better way, without single user mode. And since he is a part of CloudSQL team, so it also shows how much of effort managed Postgres providers do in this area, realizing this is a huge risk.

Michael

21:58

Yeah, and it's not even, even if it's a small risk, the impact when it happens is not small. So it's 1 of those ones where- Absolutely

Nikolay

22:07

good correction. It's low risk, high impact, exactly.

Michael

22:10

Yes, yes. So I think the cases that were blogged about were hours and hours possibly even getting, was it even a day or 2 of downtime for those organizations? And that was, that is then, I mean, you're talking about dangers, right?

Nikolay

22:27

That's- Global downtime, whole Database is down.

Michael

22:29

Exactly. People, you're gonna lose some customers over that, right?

Nikolay

22:34

Yeah, unlike the next item, a 4-byte integer primary key is still a thing, you know. I was surprised to have recently this case, which was overlooked by our tooling.

Michael

22:48

Oh, really?

Nikolay

22:49

I couldn't like, how come? Yeah, because it was a non-traditional way to have this.

Michael

22:55

Go on.

Nikolay

22:56

Well, it was first of all, the sequence, which was used by multiple tables. Yeah, 1 for all of them. And somehow it was defined, so our report in postgres-checkup didn't see it so when it came I was like how come this like this is old friend or old enemy, not friend, enemy.

Michael

23:21

Old enemy.

Nikolay

23:22

I haven't seen you for like so many years. And you look differently, you know, because multiple tables, but still, like, it's not fun. And this causes partial downtime because some part of workload cannot work anymore. You cannot INSERT.

Michael

23:40

Yeah.

Nikolay

23:41

Yeah. So, by the way, I also learned that if you just do in place ALTER TABLE for huge table, it not so dumb as I thought. I checked source code, I was impressed. And this code is from 9 point something, maybe even before. So if you ALTER TABLE, ALTER COLUMN to change from int4 to int8, it actually performs a job like similar to VACUUM FULL. Recreating indexes. And you don't have bloat. I expected like 50% bloat, you know.

Michael

24:16

Oh, why?

Nikolay

24:17

Because I thought it will be, it will rewrite the whole table. I was mistaken. It's quite smart. Yeah, it's of course it's a blocking operation, it causes downtime to perform it, but you end up having quite clean state of table and indexes. Not quite, clean state, it's fresh.

Michael

24:35

Yeah, so that is a table rewrite, no?

Nikolay

24:39

Yes, well, table, yes, well, you are right. I was thinking about table rewrite as a very dumb thing, like create more tuples and DELETE other tuples.

Michael

24:48

Got it, got it, got it.

Nikolay

24:49

But there is a mechanism of table rewrite in the code. Now I saw it finally, I'm still learning, you know, sorry.

Michael

24:56

You might end up with some padding issues if you had it optimized well before, but yeah.

Nikolay

25:01

Yeah, it also feels like Postgres could implement some reshape like eventually because there are building blocks in the code already, I see them like to first like offline style to change COLUMN order and then if you want it and then fully online style if pg_squeeze goes to core to core right yeah yeah it would be great yeah I'm just like connecting paths here and can be very powerful in like 3 to 5 years maybe But it's a lot of work additionally.

25:33

So all those who are involved in moving huge building blocks, I have huge respect. So okay. And I

Michael

25:42

think this is, if you know what you're doing, this one's easier to recover from. I assume like with the sequence, for example, you can handle it multiple ways, but you can set the sequence to like negative 2 billion and normally you've got a good start.

Nikolay

25:58

Everyone thinks they're smart and and this is first thing I this is the first thing I hear always when we discuss this. This was in the past, like let's use negative values. Of course, if you can use negative values, do it. Because we know Postgres integers are signed integers. So we use only half of capacity of, 4 byte capacity, half of it is 2.1 billion, roughly. So you have 2.1 billion more, but not always it's possible to use.

26:28

But this is old, old, old story still making some people nervous and I think it's good to check in advance.

Michael

26:39

So much better, so much better to have alerts when you're getting...

Nikolay

26:42

I sell several companies, big ones, from this, just raising this. And I know in some companies, it was like 1 year or a few years work to fix it.

Michael

26:56

So what was the problem before? Was it looking at columns instead of looking at sequences? Or what was the-

Nikolay

27:01

No, sequences are always 8 bytes. It was always so, like, if they are 8 bytes. Problem with report, I don't remember honestly. There was some problem with report. It was not standard way, not just create table and you have default with sequence and you see it. Something else, some function, I don't remember exactly. Okay. But usually our report catches such things. Or you can just check yourself if you have primary keys for byte, it's time to move to 8 bytes or to UUID version 7, right? Maybe.

27:39

Okay, that's it about this. Then let's heat up the situation, replication limits. So in the beginning I mentioned vertical and horizontal scaling and usually people say there's not enough CPU or something and disk I.O. And we need to scale and you can scale read only workloads having more read only standbys. But it's hard to scale writes, and this is true. But also true that at some point, and Lev mentioned it, In our PgDog episode, he mentioned that in Instacart they had it, right?

28:19

So like at 200 megabytes per second WAL generation, it's already over 100. I don't remember exactly what he mentioned, but somewhere in that area.

Michael

28:29

It was a lot.

Nikolay

28:31

Well, just 10 WALs per second, for example, it gives you 160. Well, RDS sets less because they have 64 megabyte WALs. They raised it 4 times. Anyway, 100 or 200, 300 megabytes per second, you start hitting the problems with single-threaded processes. Actually, it wonders me why we don't monitor, like any Postgres monitoring should have, with low level access. RDS doesn't have it, but low-level access.

29:04

We should see how much CPU, we should see CPU usage for every important single-threaded Postgres process. Right? WAL sender, WAL receiver, Logical Replication Worker, Checkpointer maybe as well. That would

Michael

29:24

be helpful.

Nikolay

29:24

Of course, because it's dangerous to grow at scale when you hit 100% of a single vCPU. And then you need to either vertically or horizontally scale or start saving on WAL generation. Fortunately, in pg_stat_statements, we have WAL metrics, 3 columns since Postgres 13. But unfortunately, this is query level. What we need, we need also table level to understand which tables are responsible for a lot of WAL. And pg_stat_activity lacks it. I think it's a good idea to implement.

30:03

If someone wants hacking, this is a great idea. Add 3 more columns, WAL-based, WAL-related metrics to pg_stat_all_tables, pg_stat_sys_tables, and user tables. It would be great. Also, maybe pg_stat_databases, like a global view of things, how much WAL.

Michael

30:23

Yeah, I've got a vague memory there were a couple of WAL-related new views, new system views

Nikolay

30:29

introduced. Oh yes, but it's about, Yeah, it's about... Are you talking about the pg_stat_io? No. WAL-related in Postgres 13, it went to EXPLAIN and it went to pg_stat_statements. This is what happened. Anyway, this is not an easy problem to solve. It's easy to check if you have access. Unfortunately, if you're on managed Postgres, you don't have access. They need to check it. What's happening, especially on standbys.

31:02

And also, it makes sense to tune compression properly because compression can eat your CPU. Remember we discussed WAL compression. I always said, let's turn it on. Now I think let's turn it on unless you have this problem. In this case, WAL sender. Yeah, you need to check how much of that is WAL compression. And also we have new compression algorithms implemented in fresh Postgres versions. So, yeah. And that is CD. So, so, so we can choose better 1.

31:44

For example, as it 4 should be providing similar as remember some I saw some benchmarks. I didn't do it myself yet. So it should be similar to to be gels at default compression in terms of compression ratio, but it takes much less CPU, like 2 to 3 times less CPU. So it's worth choosing that.

Michael

32:06

I looked at it briefly just for our own use for planned storage. And it was, we even got better performance as well. So It won on both for us.

Nikolay

32:16

Well, you probably saw what I saw, Small Datum blog, right? No? OK, maybe not. Let's move on, because there are more things to discuss.

Michael

32:27

Nice.

Nikolay

32:27

Design limits. So some people already think what they will do when their table reaches 32 terabytes.

Michael

32:38

Yeah, I guess this and the last 1 both feel like adult problems again, right? Like, There aren't too many small startups hitting the Instacart level of WAL generation or 32 terabytes of data.

Nikolay

32:52

So yeah, it's a really big clusters. But we can start thinking about them earlier and be better prepared. Maybe sometimes not spending too much time, because you know like if you spend too much time thinking about how you will bring statistics to new cluster after measure of red, but it's already implemented in 18, so just a couple of years more and everyone forgets about this problem.

Michael

33:21

It's the kind of thing that's useful to have like a, if you've got like a to-do list you're going to be using in a couple of years time or a calendar that you know you're going to get an alert for, Just put a reminder in for like a year or two's time, just check.

Nikolay

33:35

We also mentioned a few times during last episodes, latest episodes, this bothers me a lot. First I learned CloudSQL has this limit, then I learned RDS also has this limit. 64 terabytes per whole database what's happening here it's already not huge database

Michael

33:52

but again in a couple of years who knows they might have increased that

Nikolay

33:56

yeah well I think it's solvable of course I guess it's this is limit of single EBS volume or disk on Google Cloud, PD, SSD, or how they call it. So yeah, it's solvable, I think, right? With some tricks. But these days, it doesn't feel huge. 64, it feels big database, right? But when we say

Michael

34:24

to most of us,

Nikolay

34:26

but we have stories, 100 plus databases, all of them are are self managed. I think everyone has

Michael

34:34

all of them were sharded. All of those were sharded.

Nikolay

34:37

Yeah, 100%, yeah, we had a great episode, 100 terabytes, where it was Adyen, who else?

Michael

34:47

We had Notion and we had Figma.

Nikolay

34:49

Figma, right. And Notion and Figma, they are on RDS, but it's a shard, a single cluster, it's impossible on RDS, right? And I think Adyen has 100 plus terabytes. No, they have 100 plus terabytes, but it's self-managed. Yes. Because on RDS it's impossible, not supported.

Michael

35:07

Well, and they shard it also.

Nikolay

35:10

Yeah, yeah, yeah. But yeah, large companies like that, they're always like some parts are sharded, some are not. So anyway, when you have 10 to 20 terabytes, it's time to think if you are on RDS or CloudSQL, is it like how you will grow 5X? Because if you're 20 terabytes to grow 5X, 100, it's already not possible with a single cluster, right? Another reason to think about splitting somehow. So, okay. Then a few more items, like, so data loss. Data loss is a big deal.

35:45

If you poorly designed backups or HA solutions, Yeah, it can be. Let's join this with poor HA choice leading to failures like split brains. So data loss, split brain. Actually I thought we had a discussion. There is ongoing discussion in the project called CloudNativePG where I raised the topic of split-brain and demonstrated how to do it a couple of weeks ago.

36:15

And good news as I see, they decided to implement something to move in a direction similar to Patroni because when network partition happens and the primary is basically alone, it's bad because it remains active. And as I demonstrated, some parts of application might still talk into it. This is classical split-brain. And I saw, based on discussion, I saw it triggered, I never thought deeply, actually. But is split-brain just a variant of data loss?

Michael

36:56

Well, I guess you technically might not have lost the data. It's still there, you just have 2 versions. Which 1

Nikolay

37:03

is correct?

Michael

37:03

It's worse

Nikolay

37:04

than data loss. It's worse than data loss. It's worse. Because now you have 2 versions of reality. And it's bad. With data loss, you can apologize and ask to like, Bring your data back again, please. And in some cases we allowed some data loss. Of course, data loss is really a sad thing to have. But sometimes we have it like... Officially, some data loss might happen. The risk is very low. And maximum this number of bytes, for example, but it might happen. With split-brain it's worse.

37:38

You need a lot, spend a lot of effort to merge realities into 1.

Michael

37:46

Most cases of data loss I've seen have tended to be at least 2 things gone wrong like user error or some some way that the nodes gone down but quite often it's user error accidental delete without a where clause or dropping a table in 1 environment.

Nikolay

38:05

Well, this is like higher level data loss.

Michael

38:07

Well, but that, but that can cause the low level data loss if you then also don't have tested backups and it turns out you didn't have a good, so it's the combination of the 2 things often.

Nikolay

38:21

For me it's still, yeah, well, yes, yeah, if backups are missing, it's bad, yeah, you cannot recover, but also like data loss for me, classical, like lower level, it's database said, commit successful, and then my data has gone.

Michael

38:37

So yeah. So that's scary and dangerous as like a CTO.

Nikolay

38:42

Undermines trust into Postgres again, right? Yeah, yeah. If procedures are leading to data loss and also split brains.

Michael

38:51

Is it actually happening often or is it more the CTO thinking, I don't want to choose this technology because it could happen?

Nikolay

39:00

Depends on the project and the level of control. I'm pretty confident in many, many, many web and mobile app, OLTP style projects, if they are, unless they are like financial or something like social network, social media, like maybe even e-commerce and so on, data loss happens sometimes unnoticed.

39:22

Yeah. If you have asynchronous replicas and failover happened and the process of failover, like Patroni by default with asynchronous replicas allows up to 1 megabyte, it's written in config, Up to 1 megabyte of data loss, officially. Maybe byte. Right, so 1 megabyte is possible to lose. And who will complain if it's a social media comments or something, we store like comments. We lost some comments, nobody noticed maybe. But if it's a serious project, it's better not to do it. Or split brain.

39:54

Yeah, anyway, this is more not... Postgres doesn't drop quite well here. It's a question mostly to everything around Postgres infrastructure and if it's managed Postgres to their infrastructure, how they guarantee there is no data loss. And last time we discussed the problem with synchronous_commit and we discussed in detail how right now Postgres doesn't do a good job of not revealing proper LSNs on standbys, right?

40:26

So even Patroni can have a data loss in case of synchronous_commit with remote write. We discussed it. Okay, anyway, this feels like something for improvement definitely here. Good corruption. My general feeling, people don't realize how many types of corruption might happen. And it remains unnoticed in so many cases. When you start talking, people's reaction sometimes is, wait, what? So yeah, corruption at various levels.

Michael

41:10

So actually, maybe this, so in terms of the list where we started, kind of the point of the topic, there was kind of dangers, right? Is this 1 of those ones that if it silently happened for a while, it suddenly becomes a complete loss of trust of the underlying system or?

Nikolay

41:26

Yeah, as usual, I can rattle about a little about how Postgres defaults are outdated. We know data checksums only recently enabled by default. Yeah, great change. Like a month or 2 ago, right? It will be only in 18. It should be done 5 years ago. We saw

Michael

41:47

evidence. Some,

Nikolay

41:48

yeah. Yeah, many managed Postgres providers did it, like RDS. That's great.

Michael

41:52

Which is great. And that kind of is then the default for a lot of people.

Nikolay

41:56

Yeah, but it also doesn't guarantee that you don't have corruption. You need to read all the pages from time to time, right? And do they offer something, some anti-corruption tooling? They don't, nobody. Okay, they enabled, so what? This is just a small piece of the whole puzzle.

Michael

42:16

Yeah, and amcheck improving as well. I think, is it in 18?

Nikolay

42:20

Yeah, that's great.

Michael

42:21

In indexes as well.

Nikolay

42:22

Yeah, big.

Michael

42:24

In fact, I think we, did we have an episode on amcheck? I feel like it came up at least once or twice, maybe index maintenance.

Nikolay

42:30

I cannot remember all our episodes, so many.

Michael

42:33

Yeah, me neither.

Nikolay

42:34

But it's great to see the progress on the area of amcheck, but again, it's a small piece of the puzzle and it has settings, like options to choose from. So it's not trivial to choose among options. And also again, like a couple of episodes ago, we discussed support and how difficult it is to sometimes understand what's happening. Right. Yeah. And I, I right now have a case where very big platform doesn't help to investigate corruption.

43:07

And nobody has like ability to investigate it but them, and but they are not helping. So it was 1 of reasons I was provoked to talk about that in that episode. So it's bad. It looks really bad. Like, you guys are responsible for that corruption case, and you don't do a great job. And I think it's a problem of industry, and we discussed it already, so let's not repeat.

43:34

But in general, in general, I think if you are CTO or like leader who decides priorities, my big advice is to take this list and check this, like evaluate situation for yourself. Better let us do it, of course, but you can do it yourself. And then plan some proactive measures because corruption testing can be done even on RDS proactively.

44:06

If it happens, you need support, of course, because sometimes it's like it's low level, you don't have access, but at least you will feel control over it, right? So, over corruption. So, anti-corruption tooling is needed. This is what I feel. That's it. That's all my list. I'm sure it's lacking something like security, more security related stuff for example as usual I tend to like it what do you think like was it was it good

Michael

44:40

yeah you see a lot more of these things than I do obviously but yeah I think it's a really good list and check with checklists right it's not of course if you could go on forever with with things to be scared of but this this feels like if you ticked all of these off you'd be in such a good position versus most and obviously things can still go wrong but these are some of the most at least even if they're not the most common some of the things that could cause the biggest issues the things that

45:10

are most likely to get on the the CEO's desk or in the inbox so yeah this this feels like if you're on top of all of these things, you're going to go a long, long way before you hit issues.

Nikolay

45:22

I have a question for you. Guess among these 10 items, which item I never had in my production life?

Michael

45:30

Oh, maybe... Wait, give me a second.

Nikolay

45:34

It's tricky, right?

Michael

45:35

Transaction ID wraparound?

Nikolay

45:37

Exactly, how, yeah, I never had it. I only-

Michael

45:40

It's rare.

Nikolay

45:41

Well, it's rare, yeah. Let's cross it off, no problem. Yeah, I only found a way to emulate it, right, which we did multiple times, but never had it in reality in production. So yeah, everything else I had. Yeah. Not once.

Michael

46:01

I nearly guessed too quickly that it was going to be Split Brain, but then I was like, no wait, read the whole list. But I'm guessing you had Split Brains like a couple of times, maybe maximum?

Nikolay

46:11

Yeah, Replication Manager, Split Brain as a service, yes.

Michael

46:16

Okay, yeah, pre-Patroni days. Yeah, makes sense. All right, nice 1. Thanks so much, Nikolay.

Nikolay

46:24

Thank you.

Transcript source: Provided by creator in RSS feed: download file

Episode description

Transcript