synchronous_commit

Michael

00:00

Hello and welcome to Postgres.FM, a weekly show about all things PostgreSQL. I am Michael, founder of pgMustard and this is Nikolay, founder of Postgres.AI. Hey Nikolay, how's it going?

Nikolay

00:09

Everything is alright, how are you?

Michael

00:11

Yeah, good thank you. It's been a nice sunny week here in the UK.

Nikolay

00:16

Oh, this is who stole our sun, okay.

Michael

00:19

So this week we got a listener request, in fact we get quite a lot of listener requests, thank you everybody. This 1 was from Shayon Mukherjee, apologies for pronunciation, and they asked us about synchronous_commit and whether or when it might make sense to ever turn it off or use different settings and I thought it's an interesting topic full stop so I wanted to get your thoughts as well on this.

Nikolay

00:47

I have many thoughts on this. We need to start somewhere. So

Michael

00:54

I wondered about, let's, starting probably not in the most obvious place but on a single node setup. I know this becomes more interesting, multi-node, but on single node, pure Postgres, there are still some interesting discussion points here, right?

Nikolay

01:13

Yeah, and even before that, there is single... So the meaning of synchronous_commit is overloaded. I know like technically it defines behavior of commit, right? And there is local behavior and if we have synchronous standbys, there is remote behavior. But originally it was only about local. Yeah. And then decision was, it was only on and off 2 options. This is how I remember it for a decade or more, more than a decade of being Postgres user. And then there is overloaded meaning.

01:58

Remote apply, remote write, remote apply. 2 options. So I think my first thought is it's a huge mistake to mix things here. It causes pain in users like me. All the time I like, I read it, I read it, I read it, like, I cannot remember this. To remember this, you need to deal only with this.

Michael

02:28

Yeah, or look up, well you can't remember it, you just look it up. I just look

Nikolay

02:33

it up. Every time you look it up and try to understand the logic. It's like, we have many places in GUC system, Postgres system like this. Like for example, you know, we still do a lot of consulting. We help our customers, many of them are startup companies who grow rapidly, very smart people. They understand things very quickly, but you need to explain what's happening. And 1 of the things like we, we deal with bloat all the time. Bloat is 1 of the major concerns.

03:04

And we explain, okay, we need autovacuum tuning. And there is autovacuum_work_mem, which is minus 1. It means you need to look at maintenance_work_mem. Why was this done? This is exactly the same thing. Like why was this done to my brain? Like somebody made some decision to cause huge pain for ages. It's a UX issue, like a user experience issue, because instead of clarifying and separating things, we develop some implicit dependencies and you need to be expert.

03:41

And even if you are expert, it's like, it's really hard to memorize all these things and the problem is like at some point we will be use more and more AI and so on but to explain these things to AI it's also difficult because there are implicit dependencies So synchronous_commit can be local but it also can be not local. And local is on and off and there is also word local, right? Yeah. So meaning of on was rewritten. Thank you guys. Thank you. Thank you who did this. Thank you.

04:16

Like to explain, like if you try to be consultant and explain things to others very smart guys You will see the pain It's really hard and why is this hard? This is my first thought why is this so hard?

Michael

04:34

As you say it's because combining 2 things right and and the behavior is I can see why they combined it though like the behavior

Nikolay

04:41

I also can see no no questions here

Michael

04:45

And to their credit the documentation is really nicely written so it can be pointed.

Nikolay

04:51

It's nicely written but it doesn't explain all the things so I hope we will touch additional topics not covered by documentation today.

Michael

04:59

Nice. So let's talk about the single node case on off on by default right So like why would we want to turn it off? What's happening here?

Nikolay

05:11

Yeah, if we forget about multi-node setups, we have just 1 node, right? And if we have write heavy workloads and we think write latency is not good and eventually right throughput is not good, because if latency is not good, throughput won't be good as well. There is some dependency there, which is not straightforward. And it means that, well, we know at commit time, Postgres needs to make sure information is written to WAL and it's present, like synchronized, like it's on disk. Right.

05:49

So we need, before that, Postgres cannot respond to the user that commit was successful. And when we say commit, it can be explicit commit, or it can be implicit commit if we have a single statement transaction, which is super popular approach in web and mobile apps. So when we just have single statement transactions all the time, just separate and so separate, separate update and so on And all of them have implicit commit as well. So commit can be successful only when WAL is already on disk.

06:28

And it means We need to wait for disk. And if disk is not a RAMFS, which is in 99.99% of cases, right? If it's regular disk, it can be quite slow. And in this case, we deal with, like if our transactions are super fast, like extremely fast, which is actually rare, usually they are not so fast. Because of various cases, there's things like including, for example, index write amplification. I have moved to criticize Postgres today, you know, like, Yeah, you know, you know me.

Michael

07:13

Well, yeah, so I feel like so far you've just yeah, you've you've so far described the kind of synchronous_commit equals on case it's so we have to wait yes which is default and we're waiting for the flush to happen so the rights to disk before confirming back to the client that that commit has been successful but turning it off we don't wait for the the extra step of flushing to disk before confirming to the client, which sounds risky.

07:43

Well, the risk is if there happens to be a crash on the node, in the time between the write-ahead log being written and the flush, so the commit being confirmed and the flush to disk, we lose any of that data. So the risk here is some data loss on crash.

Nikolay

08:06

Yeah, the short tail of WAL is lost. Postgres will still reach consistency, all good, but we lose some part of data. It's similar to any multi-node setup with asynchronous node when failover happens and if standbys are asynchronous, nobody guarantees that at failover everything will be present there, right? Because it's asynchronous.

08:36

There can be lags, but there's consistency, all good, and we can lose, for example, by default in Patroni settings, as I remember, there's a setting, I always forget the name, but a setting like maximum lag allowed it to fail over, as I remember by default it's 1 megabyte, so up to 1 megabyte is okay to lose. Here it's similar, there is no threshold, like unlimited, but technically it cannot be very, cannot be huge.

Michael

09:06

I looked into it and there is the default setting is a write-ahead log writer delay of 200 milliseconds by default, And so I thought maybe only 200 milliseconds worth of writes could be lost, but in the documentation it said actually... Yes, good point. Actually, it's for implementation detail reasons, it says it's 3 times that. So by default, it's 600 milliseconds that could be lost. But you could reduce that.

09:40

This setting only makes sense to change if you're setting synchronous_commit for any of your transactions. But you can reduce it if you want. So you could, if you do decide to go for some use of synchronous_commit equals off, then this is a setting you could reduce to minimize data loss. But at that point, you're still losing some data. You might still be losing some data, so it's, I'm not sure if it matters that much.

Nikolay

10:05

Right, right. So maximum is 3 times of that wal_writer_delay, right. And the question is, like, should we consider this? My answer is if writes are super fast, maybe yes. There is a trade-off here, obviously.

Michael

10:26

Yeah, I was thinking about it in our last episode about time series data. But you made a really good point about if you're if you are writing kind of lots and lots of little writes often in time series workloads people suggest actually batching inserts

Nikolay

10:45

Michael

10:45

by that point you're yeah so batching makes a load of sense for optimization.

Nikolay

10:50

It's a good point. So choosing between synchronous_commit off and batching, you just choose batching.

Michael

10:55

Yeah, so then, but then at that point, your, your transactions are not that fast because you're batching maybe a thousand or more of them together and then the benefits of synchronous_commit equals off of much smaller yeah exactly so it's an in I don't think you can get the benefit twice if that makes sense you

Nikolay

11:15

don't get

Michael

11:15

the benefit of doing both.

Nikolay

11:17

Single row inserts versus 1 insert of 1, 000 rows. 1, 000 inserts versus 1, 000 row single insert. Of course, you'll have just 1 commit. It's great. And it will be faster. But of course, or updates for example, of course you should avoid huge batches, because they have also downsides, like longer lasting locks, Or in case if such batch is crushed, you lose more. You need to retry more, right? Instead of losing just 1 insert, you're losing all 1,000 inserts.

11:57

So as usual, batching, my rule of thumb is just to choose batch size so it's roughly like 1 or 2 seconds maximum to avoid long lasting locks and risks to repeat too much. In this case, it's great. Like in this case, just 1 commit per second or maybe 10 commits per second, it's a very low number. So in this case, the overhead of the need to wait for a WAL to be synced to disk is super low, which is like roughly you can think about it like not more than 1 millisecond.

12:35

In this case, like just forget about it. synchronous_commit off is not a good idea, that's it. Yeah. But usually things are much more complicated because we just consider 1 type of workload. Any project has many types of workload, right?

Michael

12:54

Yeah, and that's a good point that Shayon actually made in a follow-up post that I think sometimes gets missed in this topic, which is that it can be set on a per transaction basis or per session basis per user basis, so if you're doing a one-off if you're doing like a one-off migration or bulk load of data and you can't for some reason not batch yeah then maybe it's worth considering turning it off just for that user or for that session or per transaction.

Nikolay

13:28

What kind of situation could it be in if you cannot buy like bulk but not batch?

Michael

13:33

I was struggling I was struggling to think of it, maybe some streaming, like streaming case, but it felt tenuous.

Nikolay

13:41

If technically it comes from many, many sources, instead of saying, okay, we need Kafka before it comes to Postgres, and then we will batch. It's like, it's a heavy solution. Maybe indeed, this is exactly the case when we, it's worth considering off, right? To

Michael

14:04

use off. The other thing I was thinking about is whether this comes up as a thought for people because of tools like pgbench that by default, you get a lot of small transactions, a lot of small very fast transactions. So because, I wondered if it becomes interesting to people because they just think let's quickly do a test with pgbench and oh wow synchronous_commit, what's that?

Nikolay

14:35

Have you tried pgbench and turn off synchronous_commit? I tried many times because it's obvious idea and I don't remember I saw a huge advantage. Because pgbench write transactions by default, they are not so simple. It's, as I remember, it's update plus 2 inserts, maybe a delete as well. It's multi-statement transaction, which is definitely not super lightweight. Yeah, fair enough. I didn't try it.

15:03

Yeah. Well, I can imagine we can have situations when the effect is quite noticeable, like dozens of percent, dozens of percent. But in general, from my practice, I stopped looking in this direction. Yeah, fair enough. To think about often. And of course, data loss is not a good idea. So, I don't know, like write heavy workloads, maybe yes, but again, batching and all proper tricks to reduce a number of actual commits happening and that's it.

Michael

15:38

Yeah and I think people when people talk about data loss they quite often immediately jump to things like banking apps you know things that where it's like absolutely critical not to, you know, you'd much rather have some downtime than have some data loss. But I also think user experience wise, like even for just regular CRUD apps, losing data can be really confusing to people, even if it's only a not that important thing.

16:04

The cases I was thinking of are much more like the time series ones, where if we've got 1 sensor reporting every minute, if we lose 1 minute, if we lose a couple of those sensor reports, it's not that important. We can still do a bunch of aggregates on that data. So I guess for monitoring and logging type things, maybe it's less important. But there can be really important logs, too, right? So I personally, I'm not I don't see too many use cases where this makes a lot of sense.

16:33

But there is a whole page on the Postgres docs about, it's called asynchronous commit. So it must have been somewhat of interest to quite a lot of people.

Nikolay

16:44

Right. Yeah, yeah, yeah.

Michael

16:47

Should we switch to the topic? It feels like this gets more interesting when we start talking

Nikolay

16:51

about synchronous. Let's do it. Let's move on and demultiply. First of all, own is overloaded when you have synchronous_standby_names non-empty. synchronous_standby_names non-empty, right? We can have purely synchronous replication or we can have so-called quorum commit and I just had a chat with 1 of our team members and I agree.

17:21

I heard criticism as well, like quorum commit is a very misleading term here because usually it's used like this, like we have multiple standbys, They are synchronous originally by nature, but then we say synchronous_commit, for example, remote write. And then we configure Postgres to allow synchronous_standby_names to have any, right? We say any. And for example, we have, say, 5 replicas, 5 standbys. And we say any 1, right, any 1 of them.

17:54

And it means that commit happens on the primary and on any 1 additionally. So on 2 nodes before a user receives success of commit. It means 2 out of 6. 5 standbys, primary, 6 nodes. So 2 out of 6 is not quorum, because quorum, definition of quorum means more than 50.

Michael

18:19

So that would be 4 out of 6?

Nikolay

18:21

Usually, usually. I think we can like say this in this case it's the criticism like it's conditional, right? We may say okay in our case quorum definition is kind of rewritten and we need 2 out of 6. But it sounds strange a little bit. So maybe, how should we call this situation when we not just 1 synchronous replica, and every commit must happen on both but we have multiple replicas and we say 1 more or 2 more so how do we call it semi-synchronous I've heard this term as well semi-synchronous

Michael

19:02

yeah yeah I actually yeah I like I kind of like it. It does imply.

Nikolay

19:07

Yeah, but official boxes.

Michael

19:08

It's confusing Semi-synchronous is confusing enough. That means I have to look it up, which is probably helpful.

Nikolay

19:14

I don't

Michael

19:14

assume I know what it means,

Nikolay

19:16

right?

Michael

19:17

But yeah, it doesn't fit into either category. Like it's not synchronous and it's not asynchronous. Like it's, yeah, I don't like it though. I don't feel comfortable with it because what's it saving us? Like what's the benefit of, like is the idea that if both crash at the same time...

Nikolay

19:36

Yeah, let's, before we talk about it, let's finish with terminology. Maybe, maybe, I'm looking up right now the word, the meaning of word quorum, maybe it's not that bad because in some cases we can say, okay, 25% is our quorum, right? We can define these rules. So, like, expectation that it should be more than 50, maybe it's like false expectation. Yeah, anyway, but you cannot say percentage. You say like, like any 1 or 2, right?

20:07

Configuring Postgres, synchronous_standby_names, you can say any word and then say, You can have some kind of, it's an interesting syntax, you saw it, right?

Michael

20:21

So- Yeah, I also saw you could do, is it like priority?

Nikolay

20:25

Yeah, yeah, yeah. Yeah, so yeah, there are some interesting things there. But anyway, idea is they work together. synchronous_standby_names and synchronous_commit in this case. And if synchronous_standby_names is not empty, so there is this magic happening. If it's just a Host name, purely synchronous replication, right? But also purely synchronous, there are flavors we will discuss based on this synchronous_commit setting.

20:56

If there is like expression is there, like any or like first, blah, blah, blah, then it's already more complicated. But it's interesting. And official documentation mentions, it uses the term quorum commit, right?

Michael

21:13

I saw it in a Crunchy Data blog post. I didn't actually see it in the documentation.

Nikolay

21:20

Okay, doesn't matter. Let's talk about flavors. Let's consider we have only for simplicity just 1 host name mentioned in synchronous_standby_names, No expressions, just 1. And if we don't change synchronous_commit setting, default is on, here the meaning changes. Right? Which like this makes me... I started with this, right? Like overloaded. The meaning changes and what does mean it means that we need to to wait until what like let's let's see

Michael

22:04

so yeah I think it's worth thinking about them in terms, I like to think of them like progressively. I think the order goes, remote apply is the strongest. So remote apply will wait until it's not only been written to the write-ahead log and

Nikolay

22:23

flush to disk but also

Michael

22:25

on the but but available to to queries so read queries on the on the standby will be able to...

Nikolay

22:35

And here I think, okay, apply here but why replay there? In pg_stat_replication it's called replay LSN, right? Apply or replay? Apply or replay, we consider the same thing, right? Yeah. Right. So

Michael

22:51

that's the strongest. I think that's the highest level we can set, but it has additional latency, obviously. I think on is the next strongest. I think we wait for flush but not for, what did you want to call it? Replay.

Nikolay

23:12

Replay or apply.

Michael

23:16

And then there's 1 level below on which is the right

Nikolay

23:21

there is no remote receive there is replay right

Michael

23:26

yeah which but I think receive is this they're saying it's the same as right

Nikolay

23:31

mmm WAL data received over transmitted over network received, but not yet flushed to file, but received, right? Yeah. Write and receive are the same. There is confusion because in different parts of Postgres here we have different words, different verbs used, right? But I agree, yeah. So there is a write, there is flush, and there is apply phases. And there are 3 steps here.

24:03

And there is no remote flush, which I think, okay, overloading happens, but at least it would be good to have remote flush and say, okay, on is acting like flush, but there is no remote flush. There is only remote write and remote apply and instead of flush we have on saving on number of words right supported

Michael

24:33

number of settings yeah but they did add a local which I think according to the documentation, they're saying is only for completeness. So it's interesting that they did add 1.

Nikolay

24:41

It's not only for completeness. And we will go there in a few minutes. So 1 means remote flush, medium setting, right? WAL is written. Yes. Like it's happening on the primary. In primary, at commit time, it's flushed to disk. We know that in the case of crash, it's already, this whole data is already on disk, but what will happen with data pages during recovery, they will need to be adjusted, right? So, like, it's not applied yet. Right. Good, good, good. So, how do you feel about this?

25:23

It's like synchronous replication, Which means that if we lose the primary, the new primary will have everything. Which is great, right? It will have everything and it will be needed to be applied, so recovery. But we don't lose anything. It's great. Agreed?

Michael

25:50

Yeah, with a little bit of latency cost on each, right?

Nikolay

25:54

Little bit.

Michael

25:55

Or a lot.

Nikolay

25:56

Yeah, this latency depends on the network as well, a lot.

Michael

26:00

Nikolay

26:00

course. And usually we prefer to move standbys to different availability zones. And this increases latency, of course, and this means that if we keep this as is, like synchronous_commit on and synchronous_standby_names have... This setting has 1 host name, It means that we will have a significant latency overhead This is why it's hard to use in heavily loaded projects, OLTP-like projects. But it's quite strict mode, right?

Michael

26:37

Yeah, and you can have an HA setup that's within the same availability zone, or at least the same region, I'm guessing, would massively reduce that.

Nikolay

26:46

Well, I wouldn't put. So for me, 1 availability zone can disappear. If you think about HA, we need to have a replica in a different availability zone. Which, of course, there is trade-off here. We need to wait. Wait, like, what is here, what is here, and then what's worse, what's best for us. And we move, for me, like HA means replica should be in different AZ. It's in different AZ, means like commit in this strict mode, remote flush or on, there's no remote flush.

27:23

It means we will need to wait both synchronization with disk on the primary and there. And network will add overhead here. And I expect we won't be able to scale these writes very well. So if it's like thousands of writes per second, it will be noticeable issue. That's why in most cases I see people go to... Well, by the way, remote apply, interesting. It's even worse, like we need to apply this. It's even worse and It kills performance even more, right? What do you think about this mode?

28:04

It's like extreme.

Michael

28:07

I actually lost you. I think our network connection is not great.

Nikolay

28:10

Okay, when I discussed network issues, we had network issues. So I moved us to consideration of remote apply.

Michael

28:20

Well, yeah, so actually question here then. This is no longer, I don't think, about risk of data loss. Because if we flushed, then we're good. So this is about, so is this for read replicas? This only makes sense if we're sending read traffic. Okay, great.

Nikolay

28:37

So this is good in terms of how data is visible there. We know by default on, it's acting like remove flush, which again doesn't exist. Changes are not applied, so any reading session which can be possible, the only possible option we can have on the SELECTs, like reads on standby. In this case it won't see this data yet. And the-

Michael

29:07

Unless we're using remote apply, in which case it would always be there.

Nikolay

29:11

It will be there right after commit. It's visible. It's great. So the delay between it's written and visible becomes 0. Once it's committed, it's immediately visible. Great. Yeah. And when do we need this? Question is, when do we need, I don't know. I don't work with such products. So I'm

Michael

29:35

right because by the point you need read replicas, the point of having read replicas is because you have so much read traffic that your primary can't serve it all, right? So we're talking about extremely high, at least extremely high reads. Maybe cases where we have a ton more reads than writes, and write latency is not that, maybe social media again.

Nikolay

30:01

Yeah, In this case, this replica can be in the same availability zone because it's not for HA purposes, right? It's just a redistributor needs maybe. I honestly don't know. We usually, we already have mechanisms everywhere implemented to deal with lags of replication. To stick to the primary write after writes and in the same session of user for some time or like these kinds of things right so

Michael

30:34

yeah you mentioned sticky reads before yeah

Nikolay

30:37

in this case to slow down commits for the sake of this like reduction of this delay of this leg I don't know I would prefer to deal with small lags, but don't slow down commits. Yeah. So

Michael

30:54

that's, I guess that's remote apply and the default is on. Do you ever, Do you see use cases for remote write?

Nikolay

31:02

Remote write is the most popular.

Michael

31:05

Oh really? Interesting.

Nikolay

31:06

Of course, most setups where synchronous binaries is non-empty, I see they use remote apply most of it. Remote write, remote write.

Michael

31:17

Right, that makes more sense.

Nikolay

31:18

So the list like complete, right? So, and here we have any issues because this is super popular option. And, oh, by the way, I wanted to mention that yesterday on Hacker News, it was the article from Aphyr, right? Jepson. Yeah, Jepson. It was discussing RDS multi-AZ clusters, which interesting, like it's closer to our topic today, but to understand the article, you need time. So we, great, we have remote write as being very popular. What does it mean?

32:02

The standby node received all data, but hasn't flushed it yet to the disk. It means that if all nodes suddenly shut down, Only primary will have this data. After commit, we know primary, if commit happens, primary flushed it, right? We don't have settings which say, which let us control the behavior of commit on standby nodes, but on the primary, it's not flushed.

32:39

Like if we have remote write on or remote apply, it means in any of these cases, disk on the primary has all data from this transaction, right? So it's like it's already, it's not off at all already. It's not off. And imagine we all know this appeared on the primary has this data because a standby since it's remote right is not guaranteed this might miss this data and this yeah this transaction might be lost And this is okay. Okay. But in case we lose the primary completely, we lose data. Right?

33:27

We lose data here. Yeah. But we lose data only if all nodes are shut down. Usually when we use the primary, standby nodes are not shut down. Unless something like... Yeah, yeah, yeah. So Let's say the loss of primary is our like corner case, not corner, edge case. But loss of primary and standby nodes are not lost but shut down, it's like kind of very very corner case, so several edges like, So several problems simultaneously. So usually they're still running.

34:05

That's why remote write is so popular because it says, okay, we lost the primary. We need to perform auto failover or manual failover, usually auto failover. And standby, 1 or 2, or how many you configure, of those standby nodes, pair the setting, pair standby commit equals remote, right? They have it in memory. And they're still running, so they will flush it. So the risk is good here. I mean, risk is understood. Like loss of primary and all, everyone is shut down suddenly.

34:41

But it's so low, so we can accept it. And say it still feels like synchronous_commit to me because I doubt when we lose primary standby node which received but not yet flushed this whole data, it will be suddenly off. It will be running, right? And we don't lose this data. This is great. This is why it's so popular. But it's great. Except it's not.

35:11

Because if auto failover needs to happen, it's very rarely, like sometimes people allocate like CrunchyBridge or RDS, not multi-AZ cluster, but multi-AZ standby. Now there are 2 options. And multi-AZ standby means there is synchronous standby only serving HA purposes. You cannot send queries to it. Crunchy Bridge has the same, like, not the same, not the same. They use Postgres replication to solve this. Exactly, they configure synchronous_standby_names.

35:50

And they set synchronous_commit to very weird value, we will discuss it. But the idea is HA replica doesn't receive traffic, it serves only for HA purposes. Great. But in many other cases, people use like, they say this quorum commit, 1 or 2 additional nodes must say data is received. Remote, right? And in this case, if we lose the primary, auto-failover will happen to where? To 1 of these nodes which received, right?

36:27

But the problem is, For example, if it's Patroni, Patroni cannot choose this node, because there is no such capability to find which... Like we have, again, we have 5 replicas, for example, 5 standbys. We know 1 of them received latest settings. How to find this?

Michael

36:49

At least, yeah.

Nikolay

36:51

There is no such capability, doesn't exist. Because this pg_last_receive_lsn, But it has the wrong name because it shows pg... it shows flush LSN. It's a bug. Oh, a bug or feature, I don't know. So this is a problem. And this is like if Patroni uses that I think it uses that it chooses the node which received like which flushed LSN with maximum value and this might be different no Yeah

Michael

37:32

So but so with that in mind I guess it's then an acceptable risk, or that's why it has some data loss. But why not use on in that case?

Nikolay

37:42

Because it will slow down commits.

Michael

37:45

OK, that's, yes, the trade-off.

Nikolay

37:47

We distribute our nodes among various, multiple AZs, availability zones, right? And there is network overhead.

Michael

38:00

Yeah, well, Interestingly, I checked before coming on with our own Google Cloud SQL setup, and I wondered what our setting was, and it was on. So either they haven't... I'm pretty sure... But yeah, it's interesting that they could have chosen to set it to remote.

Nikolay

38:20

And synchronous_standby_names is non-empty?

Michael

38:23

I didn't check. It's a good question.

Nikolay

38:26

If it's empty, you have pure asynchronous application. This is it. Like, this is it. Like, I Honestly, it concerns me a lot how overloaded values and names here. On means different things depending on different settings. Why? Maybe We are trying to save on the number of configuration settings, GUC names, not to expand it too much. Okay, 300, but let's not to expand it to 500, right? But this causes pain in brains, like people are using this. So if it's on, okay. Is it?

Michael

39:09

No, you're right.

Nikolay

39:10

It's asynchronous.

Michael

39:12

No synchronous_standby_names

Nikolay

39:14

Yeah, yeah. So it waits for flush only on the primary.

Michael

39:18

Yeah, so that's basically just local.

Nikolay

39:22

Yeah, yeah, yeah. Good. Yeah, so there's this problem, it's unsolved. It should be solved in Postgres. It should be solved in Postgres and I think it's not super difficult to solve so we need a function which will show write LSN or receive LSN. By the way, receive and write may be also a little bit different. You receive but okay, it doesn't matter Because while the receiver is a receiver, right? Anyway, so a couple of more things, deeper, I wanted to mention.

40:03

It's deeper things, but 1 of the things is even if we have a remote write, there is issue in the code that causes flushing periodically. So this WAL, like we have remote write, but sometimes we wait until it's flushed. And this is also like this, I guess this issue is known. It's revealed itself only under very heavy loads. And yeah, it also could be improved maybe, right? Because if, so remote, right? It's not purely remote, right? It's remote, right, right, right.

40:44

But at some point it's flush and all Transactions are currently committing, being committed, they wait additional flush. And then it's again, pure, again, write, write, write, and then again, flush. And it can be seen in the code that there are things there, F data sync, right?

Michael

41:06

Roughly, yeah. I mean, is this also the wal_writer_delay delay or a different kind

Nikolay

41:13

of delay? No, no, no, It's not documented.

Michael

41:17

Okay, interesting.

Nikolay

41:20

So, I mean, if you use remote, right, there is like behavior which is not purely remote, right? So it still involves disk sometimes, like periodically. I guess some buffer is full and needs to be written. But this causes delays of commits.

Michael

41:43

So. Makes sense.

Nikolay

41:44

Yeah, And another thing is probably off topic, but CrunchyBridge, let's discuss what we observed. We just created a fresh cluster and saw a few things. 1 of them is synchronous_standby_names is filled with, I created HA cluster. Yeah, so HA cluster was all good. And synchronous_standby_names has this host, which is HA replica. Like it's not visible to us. We cannot send queries to it. And synchronous_commit is set to local.

Michael

42:20

Yeah, so the first thing you mentioned, it being non-empty means they're using Postgres replication, right? What's the significance of that being non-empty?

Nikolay

42:30

If it's not empty, it's supposed to be... If a synchronous_commit was default on, it would mean we have a synchronous replication. Flush, as you said, like on means remote flush in this... Non-existing remote flash, it means, right? But synchronous_commit is set to local, which means we don't see the state of what's happening on standby. So like commit only depends, the commit behavior depends only the primary.

Michael

43:05

Like my setup, right? Like on Google CloudSQL for me, even though it's on, it's kind of like local because they're not using.

Nikolay

43:12

Right, but they still have the replica.

Michael

43:15

Yeah, Yeah, yeah.

Nikolay

43:16

Strange, right?

Michael

43:17

But local, local's confusing because I don't, yeah, I would have, as you said, if, I would have expected it to be on, or as you say, the more popular one.

Nikolay

43:26

If it would be on, if it was on, this replica would be synchronous. If it's local,

Michael

43:34

I guess it's unsynchronous? Yeah. Or asynchronous, yeah.

Nikolay

43:39

Asynchronous, right. Although it's mentioned in synchronous_standby_names.

Michael

43:45

Yeah, good point. What would you prefer it to be remote right, I guess, rather than on?

Nikolay

43:52

No, no, no. I think they are not, like, I'm not sure it's a mistake.

Michael

43:59

Well, it's not the default, right so it's definitely deliberate.

Nikolay

44:05

There are some thoughts about this I think and I'm trying to guess what because obviously making it synchronous would cause slow commits right yeah performance would be not great okay we go to local but why

Michael

44:23

do people are doing

Nikolay

44:24

yeah we keep it local, so this chair replica becomes asynchronous. Huh. Do you

Michael

44:32

think it's to look good in benchmarking?

Nikolay

44:34

Why cannot we configure asynchronous replica in a normal way? Why should we put it to synchronous_standby_names, right? Because it could be regular asynchronous replica, like present in slot, but not available for read-only transactions, right? But it's there.

44:54

And the only idea, it's not my idea, we discussed this situation in the team and idea came up into one great mind was probably they want WAL to be written by backends, not by WAL writer, because there's like this effect when you use synchronous binaries, it's not empty, backends are writing. And this can increase throughput. It's a guess, hypothesis. If Crunchy is listening to us, and we know they're listening because they changed the random page cost to 1.1, right?

Michael

45:34

Oh yeah.

Nikolay

45:34

Actually, it's great. I also wanted to thank them because since then I stopped saying 1 and I say 1.1 because Crunchy benchmarked it and decided to use 1.1 for random page cost. Great. Thank you so much. But if they, if you guys are listening, please let us know what's happening with settings. It's really interesting. Maybe it should be documented because it's interesting.

45:59

And so overall impression about synchronous_commit, all those options and synchronous replication, I honestly see more asynchronous setups purely. But I know serious guys and I see clusters with synchronous like quorum commit configured. My impression still there are things to be fixed and improved. This is what I observe. And 1 more of... Let's name it. When we created this cluster with HA, with HA checkbox, HA replica was also created. We see it in synchronous_standby_names.

46:42

And then we decided using our super user, which Crunchy has, and it's great. I think it's the only managed service, 1 of maybe 2 actually, which provides super user access, which is great. So we decided to change shared preload libraries and restart. And there was an idea, like a question, is this change propagated to a standby node or not? And what happened, Postgres restart, not cluster restart, Postgres restart, cost fell over. This is not all right. This is not all right.

47:18

Freshly created cluster and we did it 3 times, 2 out of 3 times, failover happened. And another time it didn't happen. So I guess it's like some race condition is present there, right? And we saw that change was not propagated to a channel. So it's actually data loss. If I say, if I have super user, it's my right, right? And I have alter system, it writes to, we can see it in Pidge settings and it's not yet applied, like painting restart is true, right?

47:54

And then we have a lower and we don't see it anymore. What is it? It's data loss.

Michael

48:04

Or it's unsupported behavior.

Nikolay

48:08

No, no, no, it's data loss. I wrote something. Using alter system, I wrote something and I lost it.

Michael

48:15

So you can't have it both ways that like III see where you're coming from 100% But on 1 hand you want super user access and on the other hand You're gonna call it data loss if you change anything that doesn't get propagated

Nikolay

48:28

Let me contradict any with myself like I I'm too harsh. I know like alter system. It's actually local, we can change settings on each standby separately, independently, so it's not guaranteed, it's not propagated by replication. I just expected maybe some automation exists which would propagate it. If this replica is a HA replica, maybe we should match settings. Because this setting required restart, but what happens to settings that don't require restart? If I change them, and then what?

49:05

Like if I change them using alter system, and do I lose them? This is not good, right?

Michael

49:13

Yeah. But equally, expectations of automations around this stuff is probably 1 that could be a nice excuse for other providers not to add CPUs to Access. So it's yeah. No, no, no.

Nikolay

49:25

Well, I think it's no, no, no, no, no, no, no, no, no, no.

Michael

49:31

I'm just saying be careful what you wish for.

Nikolay

49:33

Well, it could happen, like I change something using UI and...

Michael

49:40

No, because if you use the UI, they change it somewhere else and it does get propagated.

Nikolay

49:44

Maybe. But I'm rooting for Crunch actually here. I think it can be solved, this problem. I don't want them to close super user access. It's a great feature, super great feature. I think everyone should be. Based on our last, previous week's topic, My position, RDS and others, they are taking too much from us. It stops feeling open source.

Michael

50:13

We don't need to make this point again.

Nikolay

50:15

Yeah, okay. Well, I'm going to repeat it many times, right? So it's not all right. So and Crunch are doing a great job here to providing, we are providing super user access and I think it can be solved, right? And 1 more thing, if you're listening, when I restart Postgres, I see restart cluster, right?

Michael

50:38

I feel like I need to chop this bit up and send it to them as a support thing

Nikolay

50:42

good yeah it was good

Michael

50:44

nice 1

Nikolay

50:45

thank you

Michael

50:45

yeah thanks Thanks so much Nikolay and thanks for the request and yeah catch you next week.

Nikolay

50:51

See you. Bye.

Transcript source: Provided by creator in RSS feed: download file

Episode description

Transcript