synchronous_commit - podcast episode cover

synchronous_commit

May 02, 202551 minEp. 143
--:--
--:--
Download Metacast podcast app
Listen to this episode in Metacast mobile app
Don't just listen to podcasts. Learn from them with transcripts, summaries, and chapters for every episode. Skim, search, and bookmark insights. Learn more

Episode description

Nikolay and Michael discuss synchronous_commit — what it means on single node setups, for synchronous replication setups, and the pros and cons of the different options for each.
 
Here are some links to things they mentioned:


~~~

What did you like or not like? What should we discuss next time? Let us know via a YouTube comment, on social media, or by commenting on our Google doc!


~~~

Postgres FM is produced by:


With credit to:

  • Jessie Draws for the elephant artwork

Transcript

Michael

Hello and welcome to Postgres.FM, a weekly show about all things PostgreSQL. I am Michael, founder of pgMustard and this is Nikolay, founder of Postgres.AI. Hey Nikolay, how's it going?

Nikolay

Everything is alright, how are you?

Michael

Yeah, good thank you. It's been a nice sunny week here in the UK.

Nikolay

Oh, this is who stole our sun, okay.

Michael

So this week we got a listener request, in fact we get quite a lot of listener requests, thank you everybody. This 1 was from Shayon Mukherjee, apologies for pronunciation, and they asked us about synchronous_commit and whether or when it might make sense to ever turn it off or use different settings and I thought it's an interesting topic full stop so I wanted to get your thoughts as well on this.

Nikolay

I have many thoughts on this. We need to start somewhere. So

Michael

I wondered about, let's, starting probably not in the most obvious place but on a single node setup. I know this becomes more interesting, multi-node, but on single node, pure Postgres, there are still some interesting discussion points here, right?

Nikolay

Yeah, and even before that, there is single... So the meaning of synchronous_commit is overloaded. I know like technically it defines behavior of commit, right? And there is local behavior and if we have synchronous standbys, there is remote behavior. But originally it was only about local. Yeah. And then decision was, it was only on and off 2 options. This is how I remember it for a decade or more, more than a decade of being Postgres user. And then there is overloaded meaning.

Remote apply, remote write, remote apply. 2 options. So I think my first thought is it's a huge mistake to mix things here. It causes pain in users like me. All the time I like, I read it, I read it, I read it, like, I cannot remember this. To remember this, you need to deal only with this.

Michael

Yeah, or look up, well you can't remember it, you just look it up. I just look

Nikolay

it up. Every time you look it up and try to understand the logic. It's like, we have many places in GUC system, Postgres system like this. Like for example, you know, we still do a lot of consulting. We help our customers, many of them are startup companies who grow rapidly, very smart people. They understand things very quickly, but you need to explain what's happening. And 1 of the things like we, we deal with bloat all the time. Bloat is 1 of the major concerns.

And we explain, okay, we need autovacuum tuning. And there is autovacuum_work_mem, which is minus 1. It means you need to look at maintenance_work_mem. Why was this done? This is exactly the same thing. Like why was this done to my brain? Like somebody made some decision to cause huge pain for ages. It's a UX issue, like a user experience issue, because instead of clarifying and separating things, we develop some implicit dependencies and you need to be expert.

And even if you are expert, it's like, it's really hard to memorize all these things and the problem is like at some point we will be use more and more AI and so on but to explain these things to AI it's also difficult because there are implicit dependencies So synchronous_commit can be local but it also can be not local. And local is on and off and there is also word local, right? Yeah. So meaning of on was rewritten. Thank you guys. Thank you. Thank you who did this. Thank you.

Like to explain, like if you try to be consultant and explain things to others very smart guys You will see the pain It's really hard and why is this hard? This is my first thought why is this so hard?

Michael

As you say it's because combining 2 things right and and the behavior is I can see why they combined it though like the behavior

Nikolay

I also can see no no questions here

Michael

And to their credit the documentation is really nicely written so it can be pointed.

Nikolay

It's nicely written but it doesn't explain all the things so I hope we will touch additional topics not covered by documentation today.

Michael

Nice. So let's talk about the single node case on off on by default right So like why would we want to turn it off? What's happening here?

Nikolay

Yeah, if we forget about multi-node setups, we have just 1 node, right? And if we have write heavy workloads and we think write latency is not good and eventually right throughput is not good, because if latency is not good, throughput won't be good as well. There is some dependency there, which is not straightforward. And it means that, well, we know at commit time, Postgres needs to make sure information is written to WAL and it's present, like synchronized, like it's on disk. Right.

So we need, before that, Postgres cannot respond to the user that commit was successful. And when we say commit, it can be explicit commit, or it can be implicit commit if we have a single statement transaction, which is super popular approach in web and mobile apps. So when we just have single statement transactions all the time, just separate and so separate, separate update and so on And all of them have implicit commit as well. So commit can be successful only when WAL is already on disk.

And it means We need to wait for disk. And if disk is not a RAMFS, which is in 99.99% of cases, right? If it's regular disk, it can be quite slow. And in this case, we deal with, like if our transactions are super fast, like extremely fast, which is actually rare, usually they are not so fast. Because of various cases, there's things like including, for example, index write amplification. I have moved to criticize Postgres today, you know, like, Yeah, you know, you know me.

Michael

Well, yeah, so I feel like so far you've just yeah, you've you've so far described the kind of synchronous_commit equals on case it's so we have to wait yes which is default and we're waiting for the flush to happen so the rights to disk before confirming back to the client that that commit has been successful but turning it off we don't wait for the the extra step of flushing to disk before confirming to the client, which sounds risky.

Well, the risk is if there happens to be a crash on the node, in the time between the write-ahead log being written and the flush, so the commit being confirmed and the flush to disk, we lose any of that data. So the risk here is some data loss on crash.

Nikolay

Yeah, the short tail of WAL is lost. Postgres will still reach consistency, all good, but we lose some part of data. It's similar to any multi-node setup with asynchronous node when failover happens and if standbys are asynchronous, nobody guarantees that at failover everything will be present there, right? Because it's asynchronous.

There can be lags, but there's consistency, all good, and we can lose, for example, by default in Patroni settings, as I remember, there's a setting, I always forget the name, but a setting like maximum lag allowed it to fail over, as I remember by default it's 1 megabyte, so up to 1 megabyte is okay to lose. Here it's similar, there is no threshold, like unlimited, but technically it cannot be very, cannot be huge.

Michael

I looked into it and there is the default setting is a write-ahead log writer delay of 200 milliseconds by default, And so I thought maybe only 200 milliseconds worth of writes could be lost, but in the documentation it said actually... Yes, good point. Actually, it's for implementation detail reasons, it says it's 3 times that. So by default, it's 600 milliseconds that could be lost. But you could reduce that.

This setting only makes sense to change if you're setting synchronous_commit for any of your transactions. But you can reduce it if you want. So you could, if you do decide to go for some use of synchronous_commit equals off, then this is a setting you could reduce to minimize data loss. But at that point, you're still losing some data. You might still be losing some data, so it's, I'm not sure if it matters that much.

Nikolay

Right, right. So maximum is 3 times of that wal_writer_delay, right. And the question is, like, should we consider this? My answer is if writes are super fast, maybe yes. There is a trade-off here, obviously.

Michael

Yeah, I was thinking about it in our last episode about time series data. But you made a really good point about if you're if you are writing kind of lots and lots of little writes often in time series workloads people suggest actually batching inserts

Nikolay

so

Michael

by that point you're yeah so batching makes a load of sense for optimization.

Nikolay

It's a good point. So choosing between synchronous_commit off and batching, you just choose batching.

Michael

Yeah, so then, but then at that point, your, your transactions are not that fast because you're batching maybe a thousand or more of them together and then the benefits of synchronous_commit equals off of much smaller yeah exactly so it's an in I don't think you can get the benefit twice if that makes sense you

Nikolay

don't get

Michael

the benefit of doing both.

Nikolay

Single row inserts versus 1 insert of 1, 000 rows. 1, 000 inserts versus 1, 000 row single insert. Of course, you'll have just 1 commit. It's great. And it will be faster. But of course, or updates for example, of course you should avoid huge batches, because they have also downsides, like longer lasting locks, Or in case if such batch is crushed, you lose more. You need to retry more, right? Instead of losing just 1 insert, you're losing all 1,000 inserts.

So as usual, batching, my rule of thumb is just to choose batch size so it's roughly like 1 or 2 seconds maximum to avoid long lasting locks and risks to repeat too much. In this case, it's great. Like in this case, just 1 commit per second or maybe 10 commits per second, it's a very low number. So in this case, the overhead of the need to wait for a WAL to be synced to disk is super low, which is like roughly you can think about it like not more than 1 millisecond.

In this case, like just forget about it. synchronous_commit off is not a good idea, that's it. Yeah. But usually things are much more complicated because we just consider 1 type of workload. Any project has many types of workload, right?

Michael

Yeah, and that's a good point that Shayon actually made in a follow-up post that I think sometimes gets missed in this topic, which is that it can be set on a per transaction basis or per session basis per user basis, so if you're doing a one-off if you're doing like a one-off migration or bulk load of data and you can't for some reason not batch yeah then maybe it's worth considering turning it off just for that user or for that session or per transaction.

Nikolay

What kind of situation could it be in if you cannot buy like bulk but not batch?

Michael

I was struggling I was struggling to think of it, maybe some streaming, like streaming case, but it felt tenuous.

Nikolay

If technically it comes from many, many sources, instead of saying, okay, we need Kafka before it comes to Postgres, and then we will batch. It's like, it's a heavy solution. Maybe indeed, this is exactly the case when we, it's worth considering off, right? To

Michael

use off. The other thing I was thinking about is whether this comes up as a thought for people because of tools like pgbench that by default, you get a lot of small transactions, a lot of small very fast transactions. So because, I wondered if it becomes interesting to people because they just think let's quickly do a test with pgbench and oh wow synchronous_commit, what's that?

Nikolay

Have you tried pgbench and turn off synchronous_commit? I tried many times because it's obvious idea and I don't remember I saw a huge advantage. Because pgbench write transactions by default, they are not so simple. It's, as I remember, it's update plus 2 inserts, maybe a delete as well. It's multi-statement transaction, which is definitely not super lightweight. Yeah, fair enough. I didn't try it.

Yeah. Well, I can imagine we can have situations when the effect is quite noticeable, like dozens of percent, dozens of percent. But in general, from my practice, I stopped looking in this direction. Yeah, fair enough. To think about often. And of course, data loss is not a good idea. So, I don't know, like write heavy workloads, maybe yes, but again, batching and all proper tricks to reduce a number of actual commits happening and that's it.

Michael

Yeah and I think people when people talk about data loss they quite often immediately jump to things like banking apps you know things that where it's like absolutely critical not to, you know, you'd much rather have some downtime than have some data loss. But I also think user experience wise, like even for just regular CRUD apps, losing data can be really confusing to people, even if it's only a not that important thing.

The cases I was thinking of are much more like the time series ones, where if we've got 1 sensor reporting every minute, if we lose 1 minute, if we lose a couple of those sensor reports, it's not that important. We can still do a bunch of aggregates on that data. So I guess for monitoring and logging type things, maybe it's less important. But there can be really important logs, too, right? So I personally, I'm not I don't see too many use cases where this makes a lot of sense.

But there is a whole page on the Postgres docs about, it's called asynchronous commit. So it must have been somewhat of interest to quite a lot of people.

Nikolay

Right. Yeah, yeah, yeah.

Michael

Should we switch to the topic? It feels like this gets more interesting when we start talking

Nikolay

about synchronous. Let's do it. Let's move on and demultiply. First of all, own is overloaded when you have synchronous_standby_names non-empty. synchronous_standby_names non-empty, right? We can have purely synchronous replication or we can have so-called quorum commit and I just had a chat with 1 of our team members and I agree.

I heard criticism as well, like quorum commit is a very misleading term here because usually it's used like this, like we have multiple standbys, They are synchronous originally by nature, but then we say synchronous_commit, for example, remote write. And then we configure Postgres to allow synchronous_standby_names to have any, right? We say any. And for example, we have, say, 5 replicas, 5 standbys. And we say any 1, right, any 1 of them.

And it means that commit happens on the primary and on any 1 additionally. So on 2 nodes before a user receives success of commit. It means 2 out of 6. 5 standbys, primary, 6 nodes. So 2 out of 6 is not quorum, because quorum, definition of quorum means more than 50.

Michael

So that would be 4 out of 6?

Nikolay

Usually, usually. I think we can like say this in this case it's the criticism like it's conditional, right? We may say okay in our case quorum definition is kind of rewritten and we need 2 out of 6. But it sounds strange a little bit. So maybe, how should we call this situation when we not just 1 synchronous replica, and every commit must happen on both but we have multiple replicas and we say 1 more or 2 more so how do we call it semi-synchronous I've heard this term as well semi-synchronous

Michael

yeah yeah I actually yeah I like I kind of like it. It does imply.

Nikolay

Yeah, but official boxes.

Michael

It's confusing Semi-synchronous is confusing enough. That means I have to look it up, which is probably helpful.

Nikolay

I don't

Michael

assume I know what it means,

Nikolay

right?

Michael

But yeah, it doesn't fit into either category. Like it's not synchronous and it's not asynchronous. Like it's, yeah, I don't like it though. I don't feel comfortable with it because what's it saving us? Like what's the benefit of, like is the idea that if both crash at the same time...

Nikolay

Yeah, let's, before we talk about it, let's finish with terminology. Maybe, maybe, I'm looking up right now the word, the meaning of word quorum, maybe it's not that bad because in some cases we can say, okay, 25% is our quorum, right? We can define these rules. So, like, expectation that it should be more than 50, maybe it's like false expectation. Yeah, anyway, but you cannot say percentage. You say like, like any 1 or 2, right?

Configuring Postgres, synchronous_standby_names, you can say any word and then say, You can have some kind of, it's an interesting syntax, you saw it, right?

Michael

So- Yeah, I also saw you could do, is it like priority?

Nikolay

Yeah, yeah, yeah. Yeah, so yeah, there are some interesting things there. But anyway, idea is they work together. synchronous_standby_names and synchronous_commit in this case. And if synchronous_standby_names is not empty, so there is this magic happening. If it's just a Host name, purely synchronous replication, right? But also purely synchronous, there are flavors we will discuss based on this synchronous_commit setting.

If there is like expression is there, like any or like first, blah, blah, blah, then it's already more complicated. But it's interesting. And official documentation mentions, it uses the term quorum commit, right?

Michael

I saw it in a Crunchy Data blog post. I didn't actually see it in the documentation.

Nikolay

Okay, doesn't matter. Let's talk about flavors. Let's consider we have only for simplicity just 1 host name mentioned in synchronous_standby_names, No expressions, just 1. And if we don't change synchronous_commit setting, default is on, here the meaning changes. Right? Which like this makes me... I started with this, right? Like overloaded. The meaning changes and what does mean it means that we need to to wait until what like let's let's see

Michael

so yeah I think it's worth thinking about them in terms, I like to think of them like progressively. I think the order goes, remote apply is the strongest. So remote apply will wait until it's not only been written to the write-ahead log and

Nikolay

flush to disk but also

Michael

on the but but available to to queries so read queries on the on the standby will be able to...

Nikolay

And here I think, okay, apply here but why replay there? In pg_stat_replication it's called replay LSN, right? Apply or replay? Apply or replay, we consider the same thing, right? Yeah. Right. So

Michael

that's the strongest. I think that's the highest level we can set, but it has additional latency, obviously. I think on is the next strongest. I think we wait for flush but not for, what did you want to call it? Replay.

Nikolay

Replay or apply.

Michael

And then there's 1 level below on which is the right

Nikolay

there is no remote receive there is replay right

Michael

yeah which but I think receive is this they're saying it's the same as right

Nikolay

mmm WAL data received over transmitted over network received, but not yet flushed to file, but received, right? Yeah. Write and receive are the same. There is confusion because in different parts of Postgres here we have different words, different verbs used, right? But I agree, yeah. So there is a write, there is flush, and there is apply phases. And there are 3 steps here.

And there is no remote flush, which I think, okay, overloading happens, but at least it would be good to have remote flush and say, okay, on is acting like flush, but there is no remote flush. There is only remote write and remote apply and instead of flush we have on saving on number of words right supported

Michael

number of settings yeah but they did add a local which I think according to the documentation, they're saying is only for completeness. So it's interesting that they did add 1.

Nikolay

It's not only for completeness. And we will go there in a few minutes. So 1 means remote flush, medium setting, right? WAL is written. Yes. Like it's happening on the primary. In primary, at commit time, it's flushed to disk. We know that in the case of crash, it's already, this whole data is already on disk, but what will happen with data pages during recovery, they will need to be adjusted, right? So, like, it's not applied yet. Right. Good, good, good. So, how do you feel about this?

It's like synchronous replication, Which means that if we lose the primary, the new primary will have everything. Which is great, right? It will have everything and it will be needed to be applied, so recovery. But we don't lose anything. It's great. Agreed?

Michael

Yeah, with a little bit of latency cost on each, right?

Nikolay

Little bit.

Michael

Or a lot.

Nikolay

Yeah, this latency depends on the network as well, a lot.

Michael

Of

Nikolay

course. And usually we prefer to move standbys to different availability zones. And this increases latency, of course, and this means that if we keep this as is, like synchronous_commit on and synchronous_standby_names have... This setting has 1 host name, It means that we will have a significant latency overhead This is why it's hard to use in heavily loaded projects, OLTP-like projects. But it's quite strict mode, right?

Michael

Yeah, and you can have an HA setup that's within the same availability zone, or at least the same region, I'm guessing, would massively reduce that.

Nikolay

Well, I wouldn't put. So for me, 1 availability zone can disappear. If you think about HA, we need to have a replica in a different availability zone. Which, of course, there is trade-off here. We need to wait. Wait, like, what is here, what is here, and then what's worse, what's best for us. And we move, for me, like HA means replica should be in different AZ. It's in different AZ, means like commit in this strict mode, remote flush or on, there's no remote flush.

It means we will need to wait both synchronization with disk on the primary and there. And network will add overhead here. And I expect we won't be able to scale these writes very well. So if it's like thousands of writes per second, it will be noticeable issue. That's why in most cases I see people go to... Well, by the way, remote apply, interesting. It's even worse, like we need to apply this. It's even worse and It kills performance even more, right? What do you think about this mode?

It's like extreme.

Michael

I actually lost you. I think our network connection is not great.

Nikolay

Okay, when I discussed network issues, we had network issues. So I moved us to consideration of remote apply.

Michael

Well, yeah, so actually question here then. This is no longer, I don't think, about risk of data loss. Because if we flushed, then we're good. So this is about, so is this for read replicas? This only makes sense if we're sending read traffic. Okay, great.

Nikolay

So this is good in terms of how data is visible there. We know by default on, it's acting like remove flush, which again doesn't exist. Changes are not applied, so any reading session which can be possible, the only possible option we can have on the SELECTs, like reads on standby. In this case it won't see this data yet. And the-

Michael

Unless we're using remote apply, in which case it would always be there.

Nikolay

It will be there right after commit. It's visible. It's great. So the delay between it's written and visible becomes 0. Once it's committed, it's immediately visible. Great. Yeah. And when do we need this? Question is, when do we need, I don't know. I don't work with such products. So I'm

Michael

right because by the point you need read replicas, the point of having read replicas is because you have so much read traffic that your primary can't serve it all, right? So we're talking about extremely high, at least extremely high reads. Maybe cases where we have a ton more reads than writes, and write latency is not that, maybe social media again.

Nikolay

Yeah, In this case, this replica can be in the same availability zone because it's not for HA purposes, right? It's just a redistributor needs maybe. I honestly don't know. We usually, we already have mechanisms everywhere implemented to deal with lags of replication. To stick to the primary write after writes and in the same session of user for some time or like these kinds of things right so

Michael

yeah you mentioned sticky reads before yeah

Nikolay

in this case to slow down commits for the sake of this like reduction of this delay of this leg I don't know I would prefer to deal with small lags, but don't slow down commits. Yeah. So

Michael

that's, I guess that's remote apply and the default is on. Do you ever, Do you see use cases for remote write?

Nikolay

Remote write is the most popular.

Michael

Oh really? Interesting.

Nikolay

Of course, most setups where synchronous binaries is non-empty, I see they use remote apply most of it. Remote write, remote write.

Michael

Right, that makes more sense.

Nikolay

So the list like complete, right? So, and here we have any issues because this is super popular option. And, oh, by the way, I wanted to mention that yesterday on Hacker News, it was the article from Aphyr, right? Jepson. Yeah, Jepson. It was discussing RDS multi-AZ clusters, which interesting, like it's closer to our topic today, but to understand the article, you need time. So we, great, we have remote write as being very popular. What does it mean?

The standby node received all data, but hasn't flushed it yet to the disk. It means that if all nodes suddenly shut down, Only primary will have this data. After commit, we know primary, if commit happens, primary flushed it, right? We don't have settings which say, which let us control the behavior of commit on standby nodes, but on the primary, it's not flushed.

Like if we have remote write on or remote apply, it means in any of these cases, disk on the primary has all data from this transaction, right? So it's like it's already, it's not off at all already. It's not off. And imagine we all know this appeared on the primary has this data because a standby since it's remote right is not guaranteed this might miss this data and this yeah this transaction might be lost And this is okay. Okay. But in case we lose the primary completely, we lose data. Right?

We lose data here. Yeah. But we lose data only if all nodes are shut down. Usually when we use the primary, standby nodes are not shut down. Unless something like... Yeah, yeah, yeah. So Let's say the loss of primary is our like corner case, not corner, edge case. But loss of primary and standby nodes are not lost but shut down, it's like kind of very very corner case, so several edges like, So several problems simultaneously. So usually they're still running.

That's why remote write is so popular because it says, okay, we lost the primary. We need to perform auto failover or manual failover, usually auto failover. And standby, 1 or 2, or how many you configure, of those standby nodes, pair the setting, pair standby commit equals remote, right? They have it in memory. And they're still running, so they will flush it. So the risk is good here. I mean, risk is understood. Like loss of primary and all, everyone is shut down suddenly.

But it's so low, so we can accept it. And say it still feels like synchronous_commit to me because I doubt when we lose primary standby node which received but not yet flushed this whole data, it will be suddenly off. It will be running, right? And we don't lose this data. This is great. This is why it's so popular. But it's great. Except it's not.

Because if auto failover needs to happen, it's very rarely, like sometimes people allocate like CrunchyBridge or RDS, not multi-AZ cluster, but multi-AZ standby. Now there are 2 options. And multi-AZ standby means there is synchronous standby only serving HA purposes. You cannot send queries to it. Crunchy Bridge has the same, like, not the same, not the same. They use Postgres replication to solve this. Exactly, they configure synchronous_standby_names.

And they set synchronous_commit to very weird value, we will discuss it. But the idea is HA replica doesn't receive traffic, it serves only for HA purposes. Great. But in many other cases, people use like, they say this quorum commit, 1 or 2 additional nodes must say data is received. Remote, right? And in this case, if we lose the primary, auto-failover will happen to where? To 1 of these nodes which received, right?

But the problem is, For example, if it's Patroni, Patroni cannot choose this node, because there is no such capability to find which... Like we have, again, we have 5 replicas, for example, 5 standbys. We know 1 of them received latest settings. How to find this?

Michael

At least, yeah.

Nikolay

There is no such capability, doesn't exist. Because this pg_last_receive_lsn, But it has the wrong name because it shows pg... it shows flush LSN. It's a bug. Oh, a bug or feature, I don't know. So this is a problem. And this is like if Patroni uses that I think it uses that it chooses the node which received like which flushed LSN with maximum value and this might be different no Yeah

Michael

So but so with that in mind I guess it's then an acceptable risk, or that's why it has some data loss. But why not use on in that case?

Nikolay

Because it will slow down commits.

Michael

OK, that's, yes, the trade-off.

Nikolay

We distribute our nodes among various, multiple AZs, availability zones, right? And there is network overhead.

Michael

Yeah, well, Interestingly, I checked before coming on with our own Google Cloud SQL setup, and I wondered what our setting was, and it was on. So either they haven't... I'm pretty sure... But yeah, it's interesting that they could have chosen to set it to remote.

Nikolay

And synchronous_standby_names is non-empty?

Michael

I didn't check. It's a good question.

Nikolay

If it's empty, you have pure asynchronous application. This is it. Like, this is it. Like, I Honestly, it concerns me a lot how overloaded values and names here. On means different things depending on different settings. Why? Maybe We are trying to save on the number of configuration settings, GUC names, not to expand it too much. Okay, 300, but let's not to expand it to 500, right? But this causes pain in brains, like people are using this. So if it's on, okay. Is it?

Michael

No, you're right.

Nikolay

It's asynchronous.

Michael

No synchronous_standby_names

Nikolay

Yeah, yeah. So it waits for flush only on the primary.

Michael

Yeah, so that's basically just local.

Nikolay

Yeah, yeah, yeah. Good. Yeah, so there's this problem, it's unsolved. It should be solved in Postgres. It should be solved in Postgres and I think it's not super difficult to solve so we need a function which will show write LSN or receive LSN. By the way, receive and write may be also a little bit different. You receive but okay, it doesn't matter Because while the receiver is a receiver, right? Anyway, so a couple of more things, deeper, I wanted to mention.

It's deeper things, but 1 of the things is even if we have a remote write, there is issue in the code that causes flushing periodically. So this WAL, like we have remote write, but sometimes we wait until it's flushed. And this is also like this, I guess this issue is known. It's revealed itself only under very heavy loads. And yeah, it also could be improved maybe, right? Because if, so remote, right? It's not purely remote, right? It's remote, right, right, right.

But at some point it's flush and all Transactions are currently committing, being committed, they wait additional flush. And then it's again, pure, again, write, write, write, and then again, flush. And it can be seen in the code that there are things there, F data sync, right?

Michael

Roughly, yeah. I mean, is this also the wal_writer_delay delay or a different kind

Nikolay

of delay? No, no, no, It's not documented.

Michael

Okay, interesting.

Nikolay

So, I mean, if you use remote, right, there is like behavior which is not purely remote, right? So it still involves disk sometimes, like periodically. I guess some buffer is full and needs to be written. But this causes delays of commits.

Michael

So. Makes sense.

Nikolay

Yeah, And another thing is probably off topic, but CrunchyBridge, let's discuss what we observed. We just created a fresh cluster and saw a few things. 1 of them is synchronous_standby_names is filled with, I created HA cluster. Yeah, so HA cluster was all good. And synchronous_standby_names has this host, which is HA replica. Like it's not visible to us. We cannot send queries to it. And synchronous_commit is set to local.

Michael

Yeah, so the first thing you mentioned, it being non-empty means they're using Postgres replication, right? What's the significance of that being non-empty?

Nikolay

If it's not empty, it's supposed to be... If a synchronous_commit was default on, it would mean we have a synchronous replication. Flush, as you said, like on means remote flush in this... Non-existing remote flash, it means, right? But synchronous_commit is set to local, which means we don't see the state of what's happening on standby. So like commit only depends, the commit behavior depends only the primary.

Michael

Like my setup, right? Like on Google CloudSQL for me, even though it's on, it's kind of like local because they're not using.

Nikolay

Right, but they still have the replica.

Michael

Yeah, Yeah, yeah.

Nikolay

Strange, right?

Michael

But local, local's confusing because I don't, yeah, I would have, as you said, if, I would have expected it to be on, or as you say, the more popular one.

Nikolay

If it would be on, if it was on, this replica would be synchronous. If it's local,

Michael

I guess it's unsynchronous? Yeah. Or asynchronous, yeah.

Nikolay

Asynchronous, right. Although it's mentioned in synchronous_standby_names.

Michael

Yeah, good point. What would you prefer it to be remote right, I guess, rather than on?

Nikolay

No, no, no. I think they are not, like, I'm not sure it's a mistake.

Michael

Well, it's not the default, right so it's definitely deliberate.

Nikolay

There are some thoughts about this I think and I'm trying to guess what because obviously making it synchronous would cause slow commits right yeah performance would be not great okay we go to local but why

Michael

do people are doing

Nikolay

yeah we keep it local, so this chair replica becomes asynchronous. Huh. Do you

Michael

think it's to look good in benchmarking?

Nikolay

Why cannot we configure asynchronous replica in a normal way? Why should we put it to synchronous_standby_names, right? Because it could be regular asynchronous replica, like present in slot, but not available for read-only transactions, right? But it's there.

And the only idea, it's not my idea, we discussed this situation in the team and idea came up into one great mind was probably they want WAL to be written by backends, not by WAL writer, because there's like this effect when you use synchronous binaries, it's not empty, backends are writing. And this can increase throughput. It's a guess, hypothesis. If Crunchy is listening to us, and we know they're listening because they changed the random page cost to 1.1, right?

Michael

Oh yeah.

Nikolay

Actually, it's great. I also wanted to thank them because since then I stopped saying 1 and I say 1.1 because Crunchy benchmarked it and decided to use 1.1 for random page cost. Great. Thank you so much. But if they, if you guys are listening, please let us know what's happening with settings. It's really interesting. Maybe it should be documented because it's interesting.

And so overall impression about synchronous_commit, all those options and synchronous replication, I honestly see more asynchronous setups purely. But I know serious guys and I see clusters with synchronous like quorum commit configured. My impression still there are things to be fixed and improved. This is what I observe. And 1 more of... Let's name it. When we created this cluster with HA, with HA checkbox, HA replica was also created. We see it in synchronous_standby_names.

And then we decided using our super user, which Crunchy has, and it's great. I think it's the only managed service, 1 of maybe 2 actually, which provides super user access, which is great. So we decided to change shared preload libraries and restart. And there was an idea, like a question, is this change propagated to a standby node or not? And what happened, Postgres restart, not cluster restart, Postgres restart, cost fell over. This is not all right. This is not all right.

Freshly created cluster and we did it 3 times, 2 out of 3 times, failover happened. And another time it didn't happen. So I guess it's like some race condition is present there, right? And we saw that change was not propagated to a channel. So it's actually data loss. If I say, if I have super user, it's my right, right? And I have alter system, it writes to, we can see it in Pidge settings and it's not yet applied, like painting restart is true, right?

And then we have a lower and we don't see it anymore. What is it? It's data loss.

Michael

Or it's unsupported behavior.

Nikolay

No, no, no, it's data loss. I wrote something. Using alter system, I wrote something and I lost it.

Michael

So you can't have it both ways that like III see where you're coming from 100% But on 1 hand you want super user access and on the other hand You're gonna call it data loss if you change anything that doesn't get propagated

Nikolay

Let me contradict any with myself like I I'm too harsh. I know like alter system. It's actually local, we can change settings on each standby separately, independently, so it's not guaranteed, it's not propagated by replication. I just expected maybe some automation exists which would propagate it. If this replica is a HA replica, maybe we should match settings. Because this setting required restart, but what happens to settings that don't require restart? If I change them, and then what?

Like if I change them using alter system, and do I lose them? This is not good, right?

Michael

Yeah. But equally, expectations of automations around this stuff is probably 1 that could be a nice excuse for other providers not to add CPUs to Access. So it's yeah. No, no, no.

Nikolay

Well, I think it's no, no, no, no, no, no, no, no, no, no.

Michael

I'm just saying be careful what you wish for.

Nikolay

Well, it could happen, like I change something using UI and...

Michael

No, because if you use the UI, they change it somewhere else and it does get propagated.

Nikolay

Maybe. But I'm rooting for Crunch actually here. I think it can be solved, this problem. I don't want them to close super user access. It's a great feature, super great feature. I think everyone should be. Based on our last, previous week's topic, My position, RDS and others, they are taking too much from us. It stops feeling open source.

Michael

We don't need to make this point again.

Nikolay

Yeah, okay. Well, I'm going to repeat it many times, right? So it's not all right. So and Crunch are doing a great job here to providing, we are providing super user access and I think it can be solved, right? And 1 more thing, if you're listening, when I restart Postgres, I see restart cluster, right?

Michael

I feel like I need to chop this bit up and send it to them as a support thing

Nikolay

good yeah it was good

Michael

nice 1

Nikolay

thank you

Michael

yeah thanks Thanks so much Nikolay and thanks for the request and yeah catch you next week.

Nikolay

See you. Bye.

Transcript source: Provided by creator in RSS feed: download file
For the best experience, listen in Metacast app for iOS or Android