Hello, hello, this is Postgres.FM. This is Nik from Postgres.AI. And as usual, co-host Michael. Hi, Michael.
Hi, Nik.
Yeah. And we invited a couple of guests today, Nate and Andrew. And we invited because the company Metronome experienced outages in I think in May, right, or April, in May, right? In May of 2025, which were extremely hard to diagnose, but these folks did a very good job and discussed all the details in excellent blog post. And this blog post looked so great, so we couldn't miss it. So thank you for coming. Let's dive into some very rare Postgres problems.
Yeah, thanks for having us. And hopefully sharing this out will keep making it rare.
Yeah, yes. Thank you for having us. It's a pleasure to be here.
Great. So, Michael, where should we start?
How about we start with, could you guys give us a little bit of an overview of what Metronome does, why you're using Postgres, and what you were up to that was hitting the limits this hard? Yeah, so I can start with
the overview, and then Andrew, I can pair on the exact way that we kind of blew our foot off here.
Absolutely.
Yeah, so Metronome is a usage-based billing platform. And so, you know, when it was founded, I think it's coming up on 6 years ago now, it was really, you know, At that point in time, this is just before COVID and lockdowns began, the target was really, there's a lot of infrastructure and service companies, most of them are building on some kind of heartbeat, you know, per minute or sometimes per second model. That gets really tricky to do at scale.
In the simplest cases, yeah, you're just incrementing a counter, But as things ramp up and your account executives go out and cook up all different ways to finesse how they want to charge that client, it gets really complicated. You add in things like committed spend, credits, you want to automate it, you want to cut people off and they're not paying. The complexity really compounds immediately.
And the 1 of the founders got worked on the billing at Dropbox and had also encountered issues where when you build this internally, which is where everyone kind of starts, it gets very easy to build the brittle system by accident. Builds your local Maxima, fits your current pricing and packaging. delayed. Like they can't launch their product or they can't charge for it for weeks or something, months.
So you get led into the spot where you're delaying launches or just like burning compute because you can't actually charge for it, which feels really silly. And then there's also a frequent, you just accept a lot of slippage because building these systems to be perfectly accurate gets really tricky. And thankfully there's a lot of foundational pieces such as Kafka exactly once and strong transactional databases to make this possible to build and have confidence in.
But it's very difficult to do this internally and prioritize that degree of rigor. And also if you're building infrastructure company, you know, let's say you're a database startup, your best engineers are there because they want to work on building a database, they're not actually there because they want to work on building. And so we also think that, you know, kind of consolidating talent to work in this problem yields a better result.
And I think the marriage of the business logic domain, the exact correctness needs, and the overall data scale volumes, we're right place, right time for this AI boom, introduces a lot of very interesting challenges. And well, we're about to talk about one of them.
And you chose Postgres, obviously, right?
Yeah, the Postgres decision preceded both of us by a ways, but I think we're broadly quite happy with it.
Okay, good, good. So do you know why Postgres was chosen in the past?
Not precisely. I mean, like, probably, yeah, we immediately went to using RDS. I don't know if it was immediately Aurora. Interesting. Yeah. I think it would be an interesting decision as a small startup who anticipates having a lot of data scale to want to deal with the schema change pain associated with MySQL.
Yeah, I wanted to add a joke that maybe somebody googled some open source relational database with good money data type. But yeah, you know, money data type is what people shouldn't use in Postgres. This is...
Yeah, no, I think we're mostly in numeric, but at this point we're kind of cursed. We didn't lock down the precision more tightly because Postgres will give you far too much precision. And kind of having to interoperate with any other system becomes difficult. And so we're starting to figure out how to unwind some of those decisions.
Yeah, yeah. That is an interesting tangent. Especially, I guess, with people billing by fractions of a cent and things. So yeah, anyway, probably too much of a tangent, or maybe we'll come back to it later.
Yeah, I mean, in short, yeah, it gets very painful translating between the different systems that do or don't have that precision. Sometimes even your billing provider, understandably can't bill in fractions of a cent, which means that how all the different sub-products you paid for and have those round together, might not end up with the same total.
Yeah. Yeah.
Can you tell us a little bit about the scale? I remember 30 terabytes on Aurora from the blog post. Anything like in this area you could share to get us the feeling of the scale?
So we have 2 primary workloads that are running on Postgres. The 1 we'll mostly be talking about is our main DB, which is a common start-up pattern where we've got our kitchen sink database that has nearly every dataset within it. And yeah, that's ballooned out to around 30 terabytes now and growing terabyte plus per month at this point. And yeah, this is, wasn't an issue for a while.
And then depending on how our clients leverage our platform, you know, part of billing is generating invoices and invoices are usually characterized, you know, with like, you know, a top level total and whatnot, but then you have a line item per product usage. And, you know, some of these companies have a lot of different products you can pay for. You know, you look at the list of different LLM model engines that are available.
Sometimes you can use batch, async versus real time, or like infrastructure vendors who are deployed to dozens of different regions. And it's easy for the cardinality here to explode, especially if you want to group it by a logical application on their platform. And we may, but we now regret, decision to basically persist those line items in a separate row and a ginormous table in our Postgres database. How big is
that table?
It's about half the cluster and growing.
Like 15 terabytes unpartitioned, right? Yep. Yeah. Yeah, well, I know this. It's a pain,
Yeah. There's some scaling problems you get really ahead of and others that you defer because I mean, a lot of the technology in the Postgres space is really fantastic now. Like Aurora does probably just handle it, copy out, making sure that you're dealing with vacuuming and whatnot and all the other kind of implicit thresholds.
Actually, I forgot to mention that those who watch Postgres FM regularly should watch already by this point previous episode with Sugu, where we discussed Multigres and sharding. And in that very episode, he mentioned your case, right? So yeah, I wanted to answer this is what we go next to do next week, but it was a secret at that time. So, and before that we also had PgDog, Lev Kokotov from PgDog And obviously sharding currently is hot topic again and with Citus in the past, there is SPQR.
Before that we had a PL/Proxy from Skype very long ago, like previous era. So is it right to say that the problem we are going to discuss today, you wouldn't meet it if you're on sharding schema? If your database was sharded, this wouldn't happen probably.
I think we've seen the app suggest that it would be easier to avoid it. As part of the migration that we're about to mention, we're moving another table that's generated data to a partition scheme. And we've already seen a lot of positive effects there, and that you can run a lot of autovacuum workers in parallel. And so it allows you to much more easily keep up with the rate of change.
This is partitioning, not sharding. And with unpartitioned table, the problem is vacuum is single threaded, right? Yeah. So with partitioned schema, you can raise a number of workers for max workers you can raise it and benefit from it if you have a lot of CPUs and disk IO capacity. But what I'm trying to say problems like this they like can be avoided completely if we have sharding and every shard is smaller. Every single shard
is smaller. Yeah, and that's what I was trying to get at. By partitioning it, we can kind of see the effect of, you know, if these were all in logical chunks, that were maybe at max 100 gigabytes, it's much easier for the garbage collection system to keep up.
Yeah, yeah, yeah, yeah. That's interesting.
It would be even easier if we were across many machines. But at the very least, I think we can infer most of the effect, even if it's all on 1.
Yeah. Okay. Maybe it's already time to dive into the problem itself, right?
Yeah. Or maybe briefly, like, why we create the problem.
Yeah. Before you go into this space, I was curious, do you have any issues with multixacts before that at all? Like maybe other kinds of issues like observing in performance insights, lightweight locks, contention, lightweight multixacts, or like a couple of them. You never had them before, right?
Not in particular. No, I mean, we had run into emergency out of the vacuums for sure.
Yeah, yeah, that's very interesting because your path is like definitely is different from what we observed with other projects. Because usually, different problems with multixacts came, and the contention happened, and so on. Your path is quite unique in this case, I think. Yeah. So let's dive in. Let's dive in.
Yeah, I really want to describe how we create the problem for ourselves. And I really want Andrew to explain how we diagnose it. Cause like, that's the most interesting part here by far. Yeah. As we keep seeing these tables grow, you know, we kind of realize also do some access patterns that were introduced. It wasn't really tenable to keep just appending data month after month for every customer in terms of the line item table.
And yeah, this was also painful for the invoice table, which, you know, there's a lot of right amplification between the 2. The average invoice has, you know, dozens and sometimes hundreds of line items, depending on how a client's using our platform. And so invoices have to be queried a lot of different ways. We didn't really see a way of moving that outside of a relational store and like being able to use foreign keys and constraints with it in the rest of our data.
It was like providing too much value to really doing anything different there. We did wind up partitioning it just to like enable the autovacuum process to keep up much more easily. But we realized we could constrain the access pattern for the line items sufficiently that it didn't actually need to be in Postgres.
Because I guess within Metronome, you know, over the course of a month, you have the notion of an invoice that's in a draft state where we'll calculate on demand for you, or we're not actually persisting data associated with it. Then only after you go through a finalization, which is typically the first of the month, because most billing periods are going to be monthly, that we do all the math, do the calculations, persist that data. Which, you can then infer our workload sounds pretty funky.
So we've got a steady trickle of writes for configuration changes. And then once a month, we have to write out several billion rows which you know amounts to like 1 to 2 terabytes of data. Yeah
all on the same day.
As close to it as we can get.
Yeah very different from social media. Very different.
Otherwise it's you know a more typical you know we have 99 plus percent reads versus writes. It's a lot of compute on demand. So it's a little funky, I think that because the high spikes in write volume are so periodic is why we haven't seen multixact problems in the past. I think we'd likely narrowly been avoiding them because it's so hard to observe whether or not you're approaching them.
I feel like this is 1 of the meta-stable states of Postgres, if you're familiar with that, hey, from the definition of stable, where often things are working, working, working, and they just fall off a cliff or entirely lock up.
But the basic trigger here being we identified we wanted to make this architecture change, and then we were running the very long migration process of pushing these line items into S3, which is what our back end store was going to be, so their access pattern was so simple. But also rewriting and moving the invoices from 1 table to another. And fortunately, we're doing that broadly 1 by 1, which we'll get to why that's relevant.
So you mentioned performance cliff. This is a very good term when we talk about multixact set, when we had different cases in the past where, as I mentioned, contention happened. And interesting enough, like we always, like 20, 15, 10 years ago, we often saw opinions that foreign keys are dangerous, like they have huge overhead, don't use them because like they slow down everything and so on.
But interesting that now we can agree with it, but in very different way, in an unexpected way, because they, yes, they slow down writes, but we can afford it. And your table, which grown to 15 terabytes, is a good example, right? You survived with foreign keys and you paid that price to have good data consistency, right? But then performance cliffs, right?
Yeah, and for that table in particular, the reason why we were able to move that was it only had 1 foreign key, which is back to the invoice row. It was much simpler. That table is effectively append-only. It was a workload that just really didn't have to be on Postgres. I don't think we were taking advantage of the qualities to justify the cost.
So let's discuss multixacts in general, like why we should think about them. Like, you know, my own story is usually, like with multixacts, I take some monitoring, this is my hobby. Take some Postgres monitoring and we check which metrics they have. And usually they have transaction wraparound risk analysis, right? Like to control freezing process and to understand how far we are from transaction ID wraparound. My question number 1, where is multixact ID wraparound analysis?
Most monitoring systems lack it. But they lack it because it never happens. Usually, like, transaction ID wraparound happens. We know several blog posts and horror stories from the past. But multixact ID wraparound, nobody experienced it. So OK, Why should we think about multixacts? Tell us more technical details what happened with your case.
Well, briefly I'll respond to the point about foreign keys. The community is right that you don't always need them. But I think what can be tricky is that when you're a very young company making lots of changes to your product, trying to find that product market fit and get to the point where scaling becomes the biggest concern. Foreign keys are really useful to like prevent teams from stepping on each other and breaking your data.
And only when those product areas and like the platform you're building on top of become mature enough, can you maybe afford to put in the, you know, like pull request level, you know, integrity checks that make sure you're not changing some access pattern or maybe build the asynchronous, you know, like offline data consistency check to make sure you're not violating it. And this is a case where we didn't think critically about have we made that leap from 1 side to the other.
And in retrospect, we have. And it's 1 of the mitigations we took. Yeah. Where I guess I would defend having a really strict schema early on, and then as you scale the platform, just becomes necessary to progress some of those guardrails.
Yeah, to maintain good quality of data, that's absolutely great advice. But let's discuss how multixacts are created, and so on.
Sure. Yes. So I suppose we should start with what a multixact even is and when it's used. So basically, in Postgres, when you have 2 or more transactions trying to lock the same row at the same time, for instance, when you're using SELECT FOR SHARE or SELECT FOR UPDATE or in fact when you're inserting into a table that has a foreign key to another table, Postgres uses a system called multixacts or multi transactions.
And a multixact is basically a logical object that represents a set of transactions locking the same row. Each participating transaction is called a multixact member. And those are a little distinct, but they are related. And the way that a multixact is stored is that it's represented by a 32-bit integer ID and that's where your wraparound issue can happen with multixact IDs. But in this scenario, that was actually not the problem.
For every multixact, Postgres keeps an offset into a globally shared, append-only, multixact membership space. These are stored in the pg_multixact_members file on disk. You can think of it as logically like a simple last recently used cache that is contiguous, so you have to append to it every single time.
And so when you have to store these members, you will have to look up an offset into this theoretically long array, and each 1 of your transaction IDs is stored in 1 slot in this array, and this array is indexed by a 32-bit integer. This membership array is what it was exhausted in our case.
So due to the way that multixacts are created and subsequently reaped, it is easy to exhaust this membership space when you have a lot of parallel transactions attempting to lock the same row, whether it's through foreign keys or whether you're doing SELECT FOR SHARE or SELECT FOR UPDATE. And the reason for that is that since these multixacts are immutable, suppose that you have 3 transactions trying to lock the same row.
You would first create the first multixact between the first 2 transactions, and then the third transaction would come in and you would have to create yet another multixact with the old transaction IDs and then the new 1. But the membership space is not relinquished until a vacuum occurs, which can happen later, or perhaps never in some cases, if you are constantly adding new transactions and you're super busy doing all these updates. And what's interesting is this.
The multi-transaction membership space cannot be reclaimed in chunks. It has to be reclaimed contiguously. So suppose you have a very long-running transaction that is part of a multixact that is very old. It can hold up the entire vacuuming of subsequent multixacts that are finished but are short-lived. And it is not until that the oldest multi-transaction is finished that the vacuum can actually reclaim all the membership space from the old multixact to the most recent one.
And this scenario creates the ability to exhaust the multixact membership space. And that's what we experienced ourselves.
Yeah, I wanted to add one thing. It might sound strange that SELECT FOR UPDATE also might, may produce. You also explored this, right? It happens only when we have sub-transactions, I think, right? So if it's a single transaction, select for update, there is no place for multixacts. But if we start doing savepoints there, basically we have different transactions inside one transaction, and then multixact mechanism starts working again. And I just remember, like, how many years ago?
Four years ago, I explored this when I explored another big performance cliff related to sub-transactions. So there I noticed this as well. Yeah.
That is true.
Good introduction. Thank you so much. So yeah, it's not it, right? There's also some problems like quadratic behavior, right?
Yes, I hinted at that before. So essentially, if you have, let's say, five transactions that are attempting to lock the same row, you have to create the multixacts one at a time by essentially cloning it with the previous transaction IDs and adding that additional transaction ID one at a time. So now you have a multixact with two, then three, then four, then five.
And you add that all up, and that becomes a quadratically growing use case of multixact membership space, at least until the vacuum can actually clean up those old multixacts. And so that's where you will find quadratic consumption of the multixact space. And this can happen very quickly if you have a lot of concurrent transactions attempting to, say, insert into a table with a foreign key that links to another table that is a low cardinality table.
So they're all trying to reference that exact same row.
But it's also applicable when we have also additional long-running transaction, right? Because otherwise, cleanup will happen, right?
Correct, so a long running transaction will cause the vacuum to essentially stop reclaiming multixact membership space until it is finished, because it can only do it sequentially.
Is it this transaction, long-running transaction, should it also participate in multixact mechanism or it can be any, even SELECT, simple SELECT?
A simple SELECT probably won't do it. It would have to be in a multi-transaction context. So if you, let's say, have a multixact and you have 2 transactions in that and the first 1 finishes, that multixact is not cleaned up until that second transaction. Part of that multixact is also finished. So all transactions in the multixact have to finish before the vacuum can reclaim it.
Yeah, so this highlights pieces of advice like partition your large tables, avoid long-running transactions if imposed with OLTP, right? Like just yet another reason to avoid long-running transactions. And if it's, yeah, if, so long-running transactions on replica reported by, hosted by feedback, they are harmless here. So, right, so like again, yeah, yeah.
I'm trying, we are debating right now how to properly monitor long running transactions because there are different effects, negative effects, and we need to distinguish them. In this case, yeah, we should monitor probably long-run transactions which participate. And by long, how much long was it in your case? Like 1 minute is long or 30 minutes is long in this case?
I'm actually not sure. I don't actually have the metrics on that. Nate, do you know?
I think 99.9%, probably 9% of our queries are sub-second. And I guess I previously worked on a Database and for a company, or for a team at Stripe, and had some of these lessons, just like really deeply, or have all the scar tissue in short. Running a platform at their scale is difficult. And so, I've been pushing us in that direction and kind of moving anything that we expect to be longer off of the online workload into our lake house.
And so it's very, you know, occasionally we have queries that go 5, 10 seconds, but I believe we have a like, you know, global time of about 30 seconds. So I guess like part of my misunderstanding here is I thought Part of these multixacts is that the multi-transaction can be daisy-chained as you have many concurrent overlapping transactions occur. I think that's more what you're running into than a long running 1. Was that we just had this steady stream of overlapping transactions.
And until you break that chain, you're in this like, host position where you keep, you know, very quickly at an increasing rate, consuming space.
Yes, I was only merely stating the cleanup scenario and the challenges there, but you're absolutely correct. When you have this daisy-chaining, you know, constantly creating new multixacts by adding new transaction IDs to it, you will very quickly exhaust your membership space.
You say very quickly and very easily, but I mean, we are still talking about 2000000000, right? Like, I think that's a, there's been a debate in the Postgres world about 32-bit versus 64-bit for a long time, but we are still talking about 2000000000. So You say very quickly and very easily at your scale, right?
Like this is also due to how many things you're doing and maybe the pattern as well, like invoices that have lots of line items or, you know, like it feels like there are specifics here that make this more likely. I'm not saying it shouldn't be a problem, I'm looking forward to hearing how you diagnose this and what we can do better. But I feel like we should say it's not necessarily very easy to hit this limit.
I'm not actually, I think I found 1 other Stack Overflow post or somebody else who hit it I don't know if you came across that while you were while you were in the middle of this But I couldn't find any other examples on the mailing list, but it's super interesting I think we're seeing more and more Companies using Postgres at this scale Like Nikolay mentioned a few of them that we talked to, but we also had an episode with folks from Notion and Figma and Adyen as well on that same episode.
So there are a lot of high volume companies now on Postgres. So I think this stuff is super interesting, but 2 billion still seems like, even with the daisy chaining, it still seems like a lot to me.
Let me disagree with you. There is no, like debate is not like 64 or 32. Postgres desperately needs 64 transactions. And for example, OrioleDB already implemented it. But OrioleDB is like alternative storage engine and it's right now in hands of Supabase in terms of development. And I think right now there is launch week and they published some benchmarks again, like it's great, great project. So I think many projects dream about the day when we will have 64, not 32.
Right now only several like forks implemented only. There are commercial forks which implement 64 transaction IDs and multixact IDs. Right?
I didn't know that. Sadly not Aurora it sounds like.
Yeah. So 2 billion is not a lot already. It's not a lot.
Yeah. It's pretty easy to hit it if you're going at scale and you're hitting it with quadratic velocity. You get to 2 billion quickly. It does sound like a big number,
but it's not that big. Michael, sorry for interruption. Michael, you can take chess board and start putting 1 grain on first cell. And so. Michael: Oh, yeah. So wait, maybe I misunderstood. If there are 5 row changes we're trying to do, do we create 2 to the 5 rows in this table? Or is it like 1, then 2, then 3? Is that only 5 rows? So in the member space?
Yeah, good question. So basically, it's more of a sequencing thing. So essentially, you have 2 transactions. You start with 2. You get 1 multixact taking up 2 members. The third transaction comes in, you create a new multixact with 3 members, but that old 1 with 2 members has not been cleaned up yet, so you're only taking up 5. The third 1 comes in, I'm sorry, excuse me, the fourth transaction comes in, you create a multixact with 4 and 3 and 2.
So now you've added an additional 4 on top of that. And now the fifth 1 comes in, and now you have a multixact with 5432. And that all add together, it takes up membership space until the vacuum can come up and reap those old multixacts.
Yeah, chessboard example is good here, I guess. We can try to understand volume when we approach 64 cell, number 64, how big it will be.
Andrew, you've done such a good job describing how this could happen, but when you're in the middle of this, I'm guessing this is relatively new knowledge, or like when you're in the middle of this, how did you work out what on earth was going on?
Yes, that's quite the journey. So I'm actually really new to Metronome. I just joined in April. So when this incident was occurring, it was only my fifth week or so. And this was occurring on a Saturday right before a vacation to Miami. So, wow.
Well, let's rewind a little bit, because I guess the most embarrassing part of this is like the first, what we now know was the first occurrence, was the full week prior.
Yes.
Where our migration workload had basically implicitly triggered this. We hadn't really known what to do. And unfortunately we're not able to get to a full RCA. And so, yeah, we got to our best explanation, which is like, oh, there's something in transaction ID wraparounds that we're not grokking correctly, or Amazon's hiding something from us. And then we proceeded to turn it back on with a slightly lower concurrency, you know, 5 days later. And so like, yeah, so we didn't get there immediately.
But then, you know, that was a P1 incident, you know, internally, tried to be cautious, but didn't follow through sufficiently. And so this is like kind of in the second chapter where Andrew comes in.
Yes, that's right. I was not a participant in the first 3 incidences. It was only the fourth 1, which I think was on a Saturday in which the call went out from Cosmo to see if additional engineering resources could come and jump on, so I decided to. But having been so new, I actually didn't have any tactical knowledge I could use to address all the specific fires that were being started by this. So I decided to actually figure out what exactly was happening.
And as Nate had said, we weren't really sure, but we knew it was related to multixact in some way. We just weren't really sure exactly why. And we were struggling to really fully understand because in our initial evaluation, we saw that the number of multixact IDs that was being created was well below the threshold of like 400 million. And we weren't really sure why. So what I did is I took that knowledge and I just went ahead
and looked into the Postgres source code and looked at the multixact.c file, which is this enormous file that contains pretty much all the logic that you'll need to know about multixacts. And in there, we were working with the hypothesis that somehow this threshold of 400 million, but we were only seeing 200 million before vacuums started occurring, was related in some way.
So I found the function called multixact member freeze threshold which calculates the threshold that you should start basically an autovacuum based on the number of multixacts that are in use and newly the number of multixact members that are in use, and that's what caught my eye.
So from there, I mathematically proved that the amount of members should be extremely high given the code path that we are obviously hitting, because we're seeing vacuums occur at a specific number of multixacts being used. And from there I was able to find log messages related to multixact membership exhaustion and correlate that with Datadog logs that we had found ourselves in the recent week.
And there we could conclude that multixact members were being, the membership space was being exhausted. And we were able to put together what I would consider kind of a hacky estimate of the number of multixact members that are currently in use by looking at the files that are written for each member and estimating by multiplying by a constant how many members are currently in use at the moment. And that's how we came across the solution, or the cause, I should say.
I think this is worth diving into a little bit because I definitely didn't realize that there were at least 2 exhaustion possibilities with multixact so we've got the the total well yeah so what why are there these 2 numbers like why do we have 2 there's 2,000,000,000 that you were monitoring at 400 million, which should be super safe, right? Because we have the possibility to go up to 2,000,000,000, but you want it to be cleaning up much sooner than that. So you monitored at 400 million or so.
Why is there another limit?
Yeah, so we have 2 limits here. We have an ID that we give to multixacts. That's a 32-bit integer. So, of course, you cannot assign more than 2 billion to these multixacts because we don't assign negative IDs. And then you have the multixact membership space. Now that is represented as a logically a simple LRU that is append only. And so we can only address up to 32 bits of membership space. Each membership slot is addressed by a 32-bit integer.
And thus now you have 2 different values that you have to worry about. You have to worry about how many multixacts you're creating and how much membership space you're consuming with those multixacts. And that's where those 2 numbers come from.
And 1 of the great things in your blog post you mention is how do you even monitor that? Have you worked out a way since? Is there anything you can do there?
Yes, I think I might have just offhand mentioned it, but we use an estimate where we read the membership files because they are written to disk, And then we know that the files have to be a certain size and we multiply by that constant, whatever the size of these members are, and we get an estimate for the number of members. Now, this is not the best solution, which is why I submitted a Postgres patch to surface this telemetry directly.
Nice, Thank you.
Yeah, that's what I wanted to mention. We have several things going on in mailing lists after this blog post. By the way, before this, why did you decide to post in so great detail? Like, what was your intention? To help community or like?
I guess a couple of factors. 1, we do want to help the greater Postgres community. I think there are lots of other situations where we've benefited greatly by blog posts put out, in particular by companies like Percona, as well as others who have maybe led us away from paths that would have wound up being painful. But another part is that internally at Metronome, we view ourselves as a infrastructure provider for our clients.
And while I think it's okay to say that we're not fully recognized as that yet, and we're still on our journey to solidifying that narrative, it is what we want to, or how we want to act to be perceived that way. And we are in a critical path for a lot of our clients and how they operate. And so for us to have multiple multi-hour downtime in the space of less than 10 days is pretty bad and honestly not what you want for your infrastructure provider if you choose to be in that way.
And so having a really crisp explanation of, hey, we are a little embarrassed by this, but here's what we learned so you can learn from it, as well as a little bit of vetting of, we didn't run into the easiest thing. It is novel. There is a reason why it took a while for us to sort it out. And so I think a lot of our partners felt a lot better. Better could be a strong term.
But when you have an honest RCA, and Cloudflare might be leading the industry in how they operate in this vein, you feel a lot better that your provider is actually learning from mistakes and is going to do a better job of avoiding them than you might internally if you're evaluating, should I just build this in-house instead?
So great transparency and so on. It pays off over time. Yeah, that's great. So I saw several mailing list discussions, but before that I also like, I feel like with multixacts, we can have different kinds of problems. 1 is wraparound, right? Okay, as I said, I never saw it. And that's why many monitoring systems even don't monitor it. And we have several cases of SLRU limits reached, right? And 1 of them was explored in the past. We can provide maybe blog posts, links to description.
And this is your case. And your case obviously bumped into lack of observability bits, like how much of capacity we reach, like how far from the performance cliff we are, right? But also besides that, I saw different topics were discussed in mailing list. For example, there is a setting vacuum_multixact_failsafe_age, which is about wraparound, actually. Peter Geoghegan brought to attention that this setting doesn't take into account multixact membership exhaustion.
So it's also an interesting side effect from your blog post. Maybe fully unexpected, right? Then I saw, Andrew, I think your proposal to have multixact monitoring, like system view, so any monitoring could have some graphs and also alerts probably, right? And this proposal currently at what stage? I saw it's in commitfest, but it requires review, right?
Correct. Yes. The patch is in the commitfest and, it definitely still requires review. We've, well, I've engaged with another contributor named Naga, who's, we've actually submitted competing proposals. His, actually directly reads from the, the memory space while mine uses the pg_stats infrastructure to surface this telemetry. But yes, it is still just in review. It has not been accepted yet, as far as I can tell.
Yeah, so but it's great that it's registered in commitfest and I think if some people who listen to us and maybe some people who watch also our sessions of hacking we do with Andrey and Kirk sometimes almost every week. We missed today because of some issues, but we are going to be back next week. So we show how easy it is actually to test. You take Cursor, you like tell it, just compile Postgres, apply patch, and let's start using it.
And just you, instead of diving too much into C code, you think about use cases and how to, for example, emulate problems, right? And so on. This is actually fun. And I encourage our audience to maybe to start with their first review and test this patch and this can increase chances that it will be included to Postgres 19, right?
Yes, what's also kind of interesting is I know you earlier mentioned that there was a debate about moving to 64-bit integers and in fact part of this commitfest does include a proposal to change those offsets to 64-bits. So there's a lot of changes that are coming in 19, it seems, if they're accepted, of course, which would help alleviate this multixact membership and multixact issue writ large.
That's great, but it's only next year.
I suppose we can't go too fast.
Well, and even longer if it's Aurora, right? They have a generally a bit of an additional lag to getting major versions because it's a fork that they have to do a lot of work on. On that note though, you mentioned trust in an infrastructure provider and how issues, big issues can rock that trust and you need to see some maybe root cause analysis or maybe some action.
Has this at any point rocked your confidence in Postgres or less so Aurora, I guess, in terms of is it the right platform as you scale? Maybe initially and then since you've worked out exactly what the issue was, what are your thoughts there?
Yeah, I don't think it actually has really rocked our confidence in Postgres. I haven't seen anyone run for the hills yet.
But you know, it definitely surfaces the fact that there is quite a bit to learn and that this is a very complex system But to that end something I've learned at least Participating with the Postgres dev community and being on the mailing list is that there's a lot of very smart individuals who are part of this community and who are actively contributing and are passionately debating what should and shouldn't be part of it and how things should be designed.
So, in many ways, it actually may have built my confidence to see so many people paying attention and caring about it from all across the world, from Fujitsu, Amazon, Google, ourselves, even just regular folks doing it in their hobby spare time.
So in many ways perhaps it's a testament to how far open source can go and how much confidence we can have in these systems because like you said this is a pretty rare instance and many of the core cases that most people need to use are covered or at least will be.
That's really good to hear and of course when people hit issues with proprietary systems like SQL Server and oracle I'm sure when they blog about it those conversations do happen internally at Microsoft or oracle but you can't see them you can't see the the actual conversations and part of the way you diagnosed this issue was diving into the source code so there are some really cool factors that wouldn't have been possible if this was an open source.
Yeah, that's not just open source, but open discussions as well, open mailing lists. So yeah, that's really good to hear, but it would have been very understandable if... People have definitely left Postgres for less good reasons. So not naming names, Uber.
Yeah, I will say, as we said in our blog post, 30 terabytes is not a small amount of data to move. And so if we decided to make a large decision like that, I believe we would have to reckon with the transition itself. And so in many ways, you know, we probably want to try to work with what we got. And thankfully, I think it's served us well for the most part.
Yeah, there's something to be said for the devil you know. And, you know, there's lots of other systems Postgres compatible that, you know, promise you the world and something gives typically whether it's the price, the actual performance on either the read or the writes or how maintainable it is to actually run it.
And yeah, I think it's clear that we're going to have to undergo the investment to figure out what's next, because we can only go single writer for so long, especially with the demands that we put on the database on the first of the month every single month. A great extreme.
Yeah. And behind the scenes, we also chatted with Nate about the formula. So Michael, for the first time, we should include some formula in the show notes. There's a formula that is saying that roughly 63,000 transactions overlapping is enough to reach a 2 billion limit.
Yeah, and it's a really dangerous thing of some math on the fly that can be checked by anyone after the fact. But yeah, as I worked it out, yeah, only 63,000 overlapping transactions, which is not that many. Relatively easy to do depending on how you, you know, with a highly concurrent workload operating 1 row at a time with a potential overlap between them. You can consume that really quickly.
It might not also be transactions. If you take a checkpoint, that's almost seen as like a sub-transaction and will also create a new multixact as well. So if you're doing a lot of checkpointing, you can easily hit that as well.
Are you using sub-transactions?
I'm not actually sure myself, but it is a possibility.
I don't believe we are, at least not intentionally.
I think right now in the latest versions of Postgres there's ability to change SRU sizes to avoid SRU overflow cases, But my opinion about sub-transactions in projects like this, at this scale, remains just avoid them. Yeah, Because they can bring other problems as well, like sub-transfer can be overflown and so on. And they consume transaction ID. So yeah. Good. That's I think quite good deep dive, unless we forgot anything.
Any last things you wanted to add?
I think we can talk about the mitigations, just so that if someone's unfortunate enough to run into this, that they have a few breadcrumbs for how to work around it. Because it was rather expensive for us to figure out what those were, including Andrew and others digging through the source to really understand it. But a lot of what we inferred after the fact was, so we had this migration which was doing a bunch of row by row writes, not ideal if you're doing a highly concurrent process here.
And so we did make those batches. We also did a lot of deep thinking on which of the foreign keys we actually needed on this new table.
Batches, How did you find the ideal for you batch size?
I don't know that we're at an ideal, but if we climbed the hill far enough, that it's not a problem.
I see.
And so it was kind of a trade-off between what throughput we could obtain and making sure that we weren't having the kind of like overtime growth in our heuristic for the member space utilization, which is mainly based on how many of the offset files existed. Yeah,
I usually say choosing batches, choose the biggest possible, but avoid locking rows for too long to affect users, because users, we know perception is 200 milliseconds, so if you lock like 1 second should be maximum, 2 seconds should be absolute maximum for batch size choice.
Yeah, it's less often. So we're even batches around like 200,
which
is enough to mitigate this concern and kind of keep all of the operations up, you know, short, since there's a very online workload hitting this as well.
This is not users in terms of batch transactions, but users read results, right? Or no? Nobody's using
it now. So sorry, this is in the context of our greater migration to move line items, line item construct out of our database, as well as partitioning the invoice table, which has also grown to be quite large. And so, you know, kind of we would read the old version, all the line items in the invoice row, write those to a new format in S3, which is ultimately keyed by the invoice ID.
And then a lot of logic, like we can't just do a deterministic mapping because then you get hot partitions in S3, so you have to do a little bit of extra work and leave yourself a link in the actual database row. And so then we were writing those out to the new table, which is partitioned 1 by 1. And our diagnosis, and it's a little hard to confirm this, because again, there is no observability bit here, is that that was the big mistake, as well as the single inserts, like very quick.
It was the issue that they all reference a handful of other tables for foreign keys, some of which were very low cardinality. So, you know, we have like some enum tables and we, you know, for better or worse, in this case worse, had chosen to use a foreign key reference to them. And so if every invoice is corresponding to the USD credit type, for example, you're going to have a lot of overlapping transactions all accessing that row to verify the integrity of the data.
And so we kind of walk through it, and we realize a lot of these low cardinality cases, we don't really change those code paths. We're quite confident they're going to be stable. It's only the higher cardinality cases. An invoice also references its customer that owns that invoice. But that's a much higher cardinality connection. We're much less likely to have overlapping transactions referring to the same customer in a migration like this.
Without foreign keys, did you think about periodical checks of referential integrity, like asynchronously?
Yeah, so we're going to be adding those to run on our Lakehouse. So yeah, we have a syncing process that brings all the files over into Parquet. Eventually, we're going to finish this way on getting it to be CDC, so it's more incremental, probably using Iceberg. And so on those, we'll run every, I mean, it's much cheaper to run the checks there, for a lot of reasons.
More and more reasons to move data out of Postgres. I understand it for analytical reasons, but yesterday S3, yesterday S3 released vector index. If you haven't heard it, it's interesting. So I'm just curious where the place for Postgres will be in our systems. But I totally understand, For analytical workloads, definitely.
Yeah. I mean, everything that's OLTP-shaped is going to be staying in Postgres for some time. Can't see a reason to move it anywhere else.
Well, yeah.
Great. In terms of mitigations, did you tune any of the settings around multixact parameters? Yeah. What did you do? Which ones did you increase or decrease?
I don't know if there was actually any specific like Postgres like flags that we changed actually. I don't think we changed any constant specifically. But what we did do is we did introspect into the code base and determine the danger levels for membership spaces. And so we used that as a threshold for alerting ourselves with Grafana.
And so if our membership space consumption breaches a certain level, what we will do is we will take actions to reduce the number of concurrency, the amount of concurrent Transactions that are occurring until Postgres can recover that membership space. So it's a little bit of a manual intervention at the moment, but I don't think there's any specific knobs that we can tune because it is fundamentally limited by a 32-bit integer.
And the telemetry with respect to multixact membership numbers is not used anywhere else except in the detection of when autovacuum should occur. And that is sort of automatic, so, and kind of hard-coded.
Yeah, and the current utilization, you use function pg_get_multixact_members, right? Or how do you?
No, I don't believe that is it. I think it's some specialized function where we actually have to get all the files and read the file sizes themselves. So it's not something from Postgres, I don't believe.
But it's Aurora. You don't have access to files.
There is some function. I can't remember, Mnid.
It's JLS or something, right? So it lists files. You can still SQL. You can list for, oh, OK, interesting.
The Query originated from Amazon blog posts. I'll try to pull it up. You can see if you want to include it in the show notes.
Let's make sure. I don't remember this recipe was shared in your blog post, how to monitor this.
It was not, we did not post it there. We actually, intend to have a followup with a more engineering based audience where we would detail these steps kind of in the same way that we detailed it here, but in a more concrete format for engineers and not for executives and company leaders.
Yeah, and we were a bit more timeline constrained for initial RCA. And we are optimistic that we can link to Andrew's patch for the second follow-up blog post.
Exactly. That's awesome. Because people like, probably we should add it to monitoring like practically everywhere in large systems, right? This would be main outcome of this discussion probably, right? For those who listen and have big systems.
Absolutely.
Before your patch Andrew got accepted and everyone upgraded to Postgres 19, which will take some time. Right, so...
Yes, until then we definitely have some guidance for monitoring systems. We admit though that it is not the most ideal, but it is something that people can use to guide themselves away from disaster.
Well and I think work well there's the thread you mentioned from Peter Geoghegan he's somebody that has worked in the past on mitigation factors. So not only to be able to monitor a transaction ID wraparound, but he did some things to vacuum so that it's just less likely to get into that state in the first place.
And I think making that parameter he mentioned aware of the member space or have some different parameter that kicks in and does a, like an anti-wraparound vacuum or whatever the equivalent would need to be earlier would make this less likely to happen. And of course, monitoring is a good idea anyway, but prevention seems even better. So yeah.
Optimization inside Postgres could be implemented here
because it's
quadratic behavior, at least it could, as like in other cases in Postgres, at least it could be split to some pieces, like 128 pieces, you know, something like this. I don't know. I haven't seen discussion of actual optimization to get more capacity. Have you seen it?
No, but I was looking through what parameters we even have that mention multixact, and 1 of them is in version 17, we got multixact member buffers, which is quite low by default, only 256 kilobytes or 32 pages, and we got multixact offset buffers, also low, 128 kilobytes by default.
So I'm wondering whether increasing those would help with the member space maybe that's the offset buffers 1 and then yeah we also have multixact failsafe age which you mentioned version 14 that's the 1 for Peter Geoghegan and 2 much older ones I did wonder about Freeze Min Age, that's 5 million by default, I wondered if lowering that would just get you at, make sure that the vacuum ran quicker when it did need to happen, you know that kind of
thing, and then there's the freeze table age is quite is 150 million, so I again I'm not sure if lowering that would mean you just get vacuums more often but I guess when we're talking about such a huge table it's still going to be a problem with the vacuum does so until you get partitioning in place anyway thank you so much both of you it's been amazing learning about this scary but great and thank you for paying the pain and the cost so that so many fewer people don't have to in future.
And special thanks for transparency and admitting like your own mistakes in the past. Like I especially appreciate this like how you talk freely about your own mistakes. This is very valuable and a good example for others as well.
Absolutely. Thanks for having us. It was a pleasure to be here.
Thank you.
Yeah. It's really fun chat.
