Passing Messages

Feb 14, 2025, 59 min

Episode description

Ben and Matt wade into the deep waters of messaging systems, get utterly lost in time synchronization rabbit holes, and discover their new podcast tagline: "We make mistakes so you don't have to." Matt celebrates by getting his car stuck where cars shouldn't go.

Transcript

Matt Godbolt

Hey, Ben.

Ben Rady

Hey Matt, dancing to the theme music.

Matt Godbolt

How are you doing, friend? We both are, I know. Again, we've said this before, but the new recording setup means that we get to play the intro music, if only just to get the timings roughly right.

Ben Rady

Mm hmm.

Matt Godbolt

And so we are bopping away. And actually, thinking of that, in the background, not that our listener can see, because we never record the video for this, but there's a huge box of nonsense behind me, if you can see it over here.

Ben Rady

Mm hmm.

Matt Godbolt

So I'm traveling at the moment. I'm at my parents' house in England, which makes this an international recording. And I've been mining all of the stuff that I left behind as a kid to send home, which was actually kind of a miracle thing. I had two large boxes of cool stuff of mine that I'd left here because, you know, I wasn't planning on being in the States for more than a couple of years, but 14 years on I'm like, well, maybe I should take this home now.

Ben Rady

Yep. Right.

Matt Godbolt

I just put, I stuck a sticker on the side of these boxes and I took them to a local shop.

Ben Rady

Yeah.

Matt Godbolt

Magic happened, and then my wife texted me a picture of them on my front porch in America. It's like, I know that's how that works, but I was still so excited to see my things get there in like a day and a half. It was brilliant. But now I'm sending all of this stuff off to the person who composed our theme music, Inverse Phase, Brendan Becker, for his museum.

Ben Rady

Nice.

Matt Godbolt

In fact, he has got great news. He has just raised enough money to buy his own property, and he's moving his museum of cool games and weird and wonderful computer tech paraphernalia over the years into Pittsburgh. So at some point next year, I will definitely be going to Pittsburgh to see him and the amazing things. Anyway, weirdest thing. What are we talking about today, Ben?

Ben Rady

Today, in addition to cool museums, we are talking about messaging systems. There are various systems in the world that you design so that some service somewhere is going to send you a reasonably small message, and you're going to process that message, and then you're going to send one or maybe more or maybe zero messages out to another thing. And the whole architecture of the system is designed with this message passing in mind. And oftentimes when you have systems like this, you have distributed computing problems.

Matt Godbolt

Mm hmm. Yep.

Ben Rady

You have sort of reproducibility concerns that you need to think about. And so I thought it would be a good idea to talk about some of the things in our experience, having built some systems like this.

Matt Godbolt

Right. Yep.

Ben Rady

And we can talk about maybe what some of those systems were, just for context. But in our experience building systems like this, what are some things that you should do? And what are some things that you should definitely not do?

Matt Godbolt

Interesting, yes. So you mentioned small messages there, so we're not talking about a bulk data thing here. What would be a canonical example of this kind of system?

Ben Rady

Yeah. Well, I think, starting right off with some of the things that you should not do: I don't think that you should put gigabytes of data into something and call it a message.

Matt Godbolt

Okay.

Ben Rady

That is something that I would be skeptical of, if someone was like, well, can't we just take this, you know, three gigabyte file and stick it in there?

Matt Godbolt

I mean, strictly speaking, you can, but you mean in these kinds of systems, you wouldn't want to...

Ben Rady

It's like, maybe you shouldn't do that.

Matt Godbolt

So, and again, message passing, so we're talking things like, broadly, something which could be using something like Kafka as a sort of mental model of like, hey, you're going to just put a sequence of messages somewhere. Or it could be some other system, but I'm just putting Kafka in my head for now. It's something that probably most of the audience might have heard of.

Ben Rady

Yeah.

Matt Godbolt

And then obviously that's a great example of something you shouldn't do: putting a massive, massive message into a message queue system. They're usually not good at larger pieces of data.

Ben Rady

Mm hmm.

Matt Godbolt

Sometimes your recipients will want to discard some messages, and if you force them to download a three gigabyte file just to discover they don't want it, that's not what you wanted. That's not good behavior. So, you know, the typical solution I can think of for that is that you normally put bulk data somewhere else, be it, say, an S3 bucket, or some shared file system, or some other system.

Ben Rady

Mm hmm.

Matt Godbolt

And then you send a message that says, Hey, there exists some big data over somewhere else that you can get hold of. Is that the kind of thing?

Ben Rady

Yeah, yes, that's what I've done as well: you have basically a pointer to some other large piece of data, whether it's a file in object storage or maybe even, you know, one thing I've seen is embedding a SQL query that's bitemporal. So when you run it, you always get the same results. You can put that in the message and be like, oh, there's some data available here if you want to query it, right?

Matt Godbolt

Oh, that's neat.

Ben Rady

But the core idea here is: don't put a bunch of data into a messaging system, whether that's just a system that's passing messages or a queue, like Kafka or some other type of queue. Instead, put in something that allows the consumers of that stream to fetch that data if and when they want, based on maybe some metadata that you include in the message.
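
The pointer-in-a-message idea described here can be sketched roughly as follows. This is a minimal illustration, not anything from the systems discussed: `BULK_STORE` is a hypothetical in-memory stand-in for an S3 bucket or shared file system, and `Notification`, `publish`, and `consume` are made-up names.

```python
from dataclasses import dataclass

# Hypothetical stand-in for bulk storage (an S3 bucket, a shared file system, ...).
BULK_STORE = {}

@dataclass(frozen=True)
class Notification:
    """Small message: a pointer to the data plus metadata to filter on."""
    key: str         # where the payload lives in BULK_STORE
    kind: str        # metadata the consumer can inspect cheaply
    size_bytes: int  # lets a consumer decide before fetching

def publish(queue, key, kind, payload):
    """Store the large payload out of band and enqueue only a small pointer."""
    BULK_STORE[key] = payload
    queue.append(Notification(key, kind, len(payload)))

def consume(queue, wanted_kind):
    """Fetch payloads only for the messages this consumer cares about."""
    fetched = []
    for note in queue:
        if note.kind != wanted_kind:
            continue  # discarded without downloading anything
        fetched.append(BULK_STORE[note.key])
    return fetched

queue = []
publish(queue, "reports/a", "trades", b"x" * 1000)
publish(queue, "reports/b", "logs", b"y" * 5000)
payloads = consume(queue, "trades")
```

The consumer that only wants `trades` never touches the 5000-byte `logs` payload; it only reads the small `Notification` for it.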

Matt Godbolt

Yep. Got it. Now, obviously, by doing that you have added another system to an otherwise straightforward system. If I was testing this, I would need to mock out the retrieval system separately from the message queue system. There's an allure to saying, hey, let's just throw it in the one system, and then everything's a message and we don't need anything outside.

So I can see why. But there is a blurry line. Like, you know, we throw out three gig of data as, like, that's too much. But maybe, you know, 300K? I don't know, maybe.

Ben Rady

Yeah, right. It's like, that maybe is fine. So all of that, I think, starts to become a lot more context sensitive. And maybe it's worthwhile talking about some of the systems that we have built, to paint a little picture of some of this context and be able to talk about the trade-offs in those contexts.

Matt Godbolt

Yeah. Okay. So what kind of systems do you want to start with? What would you be happy talking about first?

Ben Rady

Well, I mean, the three main systems that I think of that I built that are like this: there is a sort of infrastructure and monitoring system that I built at a trading firm. And then at that same trading firm, I actually worked on... yes, that is, like, pantomiming the logo of the system that we built. And at that same firm, I actually also built a trading system for event trading.

So this is like discrete events that are happening in the world. And we would name news as an example of that. And we would trade those events.

Matt Godbolt

Right. So election results come in kind of thing and you're like, Hey, if, if this person wins and the market moves this way, or, you know, if some, if a drug gets [approved]...

Ben Rady

Yeah, tweets, we would trade tweets. I mean, things like that, you know, press releases, those kinds of things.

Matt Godbolt

Right.

Ben Rady

And that was extremely latency sensitive, right? Like, with that trade you're basically racing the speed of light. And so that had its own special constraints.

Matt Godbolt

Right, because you and everybody else know that if the Fed puts the interest rate up, then the market will react in a particular way, and you want to either take advantage of it or, you know, protect your own position, whatever. But yeah, interesting.

Ben Rady

Exactly. Yeah, yeah, yeah. So, you know, in that example, a queue is just right out.

Matt Godbolt

Yeah.

Ben Rady

You can't queue anything, right? That's not going to work. And then probably the third one was the system that we collectively built at Coinbase.

Matt Godbolt

Ha, I was thinking about that one.

Ben Rady

Which was an exchange, right? Like, Coinbase hired Matt and I and a few other people to build a replacement for their cloud-based exchange. And what happened with that is a big long story, which is maybe another podcast, but nonetheless...

Matt Godbolt

Or not...

Ben Rady

Yeah, right, or not, honestly. You can read about it on the internet if you want. How about that? I think that's the best way to do that.

Matt Godbolt

Yes.

Ben Rady

But nonetheless, we built an exchange. And that is very much a system like this, where you're passing messages around. So those are the three that sort of spring to mind for me.

Matt Godbolt

Right. And just concretely for those: an exchange in this instance is a service where many people are sending messages into the system to buy and sell a commodity, in this instance various cryptocurrency coins and things. And yeah, we had to process those, and we had to process them fairly, and we had to process them at the lowest latency that was reasonable, and very, very, very reliably. And yeah, we used a very interesting design of a messaging system at the very core, the very guts of how it all fitted together, to give us certain properties that we wanted to be able to tell our clients that we had, you know, like fairness and guarantees over certain things, which was very interesting. Yeah, no, those are cool. Where do you want to start? Do you want to start with the monitoring system or...

Ben Rady

Well, those are mine. Are there any others that you can kind of throw into the mix here?

Matt Godbolt

I mean, I think, in general, receiving market data itself, that is, the information that is the exhaust from an exchange. So the publicly visible information, for some definition of public, about what's going on in any particular market is disseminated as a set of discrete messages that is ordained to you: you get a PDF from the exchange, and they say, this is how we're going to do it. But you have to be able to keep up and read and process that. So there is a message processing system there, and that's the thing I have the most experience with, but I don't get the choice of designing it. I've just got to make sure I hit the spec of what's going on there. So I don't think of that in the same way as the other things. So let's just stick with yours, and I'll see if anything rings a bell with something that I have done.

Ben Rady

Okay. But yeah, so examples of things to do and not do. So in the sort of latency-constrained world that I was living in with that event trade, and I would imagine in other places where you have latency constraints, you need to be very careful about messages at rest, right? So a more dysfunctional form of this, I think, is you're building a messaging system, but in the middle of your messaging system, you put a database.

So you write data into the database, and then you have some other thing that is pulling data out of the database. And it's maybe got like a cursor or something where it's like, you know, I'm at row 1000.

Matt Godbolt

Right. Right. You're tailing a log, effectively. That log just lives in a database, and you're just doing an insert on one side and a "select the next thing" on the other.

Ben Rady

Yeah, yeah, yeah. And the terrible thing about those designs is that they kind of sort of mostly work a little bit, right? So it's easy to trick yourself into thinking that you have something that will scale, and you're like, oh yeah, you know, this database scales, I don't know, whatever, it's some cloud database and it scales infinitely, right?

Matt Godbolt

Right.

Ben Rady

Or I've got, you know, some cluster of these things and I can just scale it out horizontally. But, you know, there's not really any magic there. If you've just got one table, and you're writing things into the one table, and you have lots of things reading from the one table, you need to really understand what that database is intended to do and what it's capable of doing, and maybe ask the question in that case: do we need something more like Kafka?

Matt Godbolt

Right, right, right.

Ben Rady

Do we need something that is more of a traditional queue?

Matt Godbolt

Right, because, I mean, not to throw anything in your way, but a good friend of mine once suggested that using a sequence of numbered files is a perfectly reasonable way of sending messages between systems. And that's true as well. So I don't think you're saying that a database is not a solution to some problems, but certainly when latency is important, you've got too much non-determinism and there are too many moving parts.

So what do you do if you have a latency-sensitive application that needs to be able to react as fast as you possibly can, and you still want it to be a message passing system?

Ben Rady

Yeah, yeah. Right, yes. Okay, I mean, again, we're calling on some of our prior experiences here. Not storing the messages, right? Like having the sender and the receiver directly sending messages to each other, either over, you know, TCP or some sort of reliable multicast protocol, which you can Google various options for and see what you like.

Matt Godbolt

I was going to say, that's a whole episode.

Ben Rady

Yeah, right. It's a great way to sort of reduce that latency. It does put constraints on the consumers, depending on exactly how you do it, to either not create back pressure or to deal with that back pressure in some way.

Matt Godbolt

Yep.

Ben Rady

Like, you know, the fundamental question to ask is: if the consumer doesn't consume the data, what happens? Right?

Where does it live? Does it get dropped? Does it get stuffed somewhere else that it reads later? And how would it ever possibly catch up? So there are all sorts of concerns to think about there. But fundamentally, if you've got something where you've got some latency constraint, I think, attacking that problem as "I'm going to write my messages into some sort of storage thingy and then read them back out again"...

Matt Godbolt

Yep.

Ben Rady

You just need to be really careful about what kind of latency that's going to introduce and maybe just going directly is better.

Matt Godbolt

Right. And I suppose in the limit, if you can do this, which obviously we've kind of glossed over already, being on the same physical computer means that you can use shared memory transport type things and a queue that lives only in memory.

So there is a queue, but only because you have to have somewhere to put it, you know, so a double buffer, or even, in the limit: I'm writing to this thing in process A, and process B is just waiting for the okay to read from it as soon as it's finished being written to.

Ben Rady

Mm hmm. Mm hmm.

Matt Godbolt

But, you know, all the things that I've been thinking about so far have involved some network traffic between parts of a more distributed system, rather than something that can be literally co-located. Because, of course, in an even more limiting case, they're in the same thread and they just literally have memory mapped, and, you know, a global variable is being set or whatever. A shared variable, I should say.

Ben Rady

Mm hmm.

Matt Godbolt

Yeah, so storing the data is sort of orthogonal to... or sorry, durability of the data. You don't always need durability. Something like Kafka will always give you durability. And as you say, that's the thing that stores it kind of first, and then everybody gets a copy of it from the brokers that have already stored it.

There's a quorum-based thing here: we know that if a message has been sent, before anyone sees it, some configurable amount of durability has taken place such that you know that that message has not been lost.

Ben Rady

Mm hmm.

Matt Godbolt

And it'll definitely be there again if you have to go back and get it. And then there's something on the back end as well, where you can say, I know that this message definitely got processed by at least one of the people that were supposed to do anything with this message. And so that's really, really good when you're talking about things like financial transactions and other things where you're like, it absolutely needs to happen.

We need to have a journal of record. And that journal is more important than the latency hit we have.

Ben Rady

Mm-hmm. Yeah.

Matt Godbolt

In the case of your event trade, presumably, if you dropped a message, or again, with back-pressure related things here, maybe dropping the message is okay. Because it's better to not hold up the fast people for that one slower consumer, and have that message be missed by that consumer, than it is to cause them to potentially fire an order too late, or some other issue there, right?
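
The "drop rather than hold up the fast consumers" trade-off can be sketched as a fan-out with a small bounded buffer per consumer. This is only an illustration under assumptions: `DroppingFanout` and its methods are hypothetical names, the buffers live in one process, and the policy shown drops the newest message when a buffer is full (other policies, such as dropping the oldest, are equally possible).

```python
from collections import deque

class DroppingFanout:
    """Fan out messages to consumers; a full consumer buffer drops, never blocks."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.buffers = {}  # consumer name -> deque of pending messages
        self.dropped = {}  # consumer name -> count of messages dropped

    def subscribe(self, name):
        self.buffers[name] = deque()
        self.dropped[name] = 0

    def publish(self, message):
        for name, buf in self.buffers.items():
            if len(buf) >= self.capacity:
                self.dropped[name] += 1  # slow consumer misses this message
            else:
                buf.append(message)

    def poll(self, name):
        """Return the next pending message for a consumer, or None if empty."""
        buf = self.buffers[name]
        return buf.popleft() if buf else None

fanout = DroppingFanout(capacity=2)
fanout.subscribe("fast")
fanout.subscribe("slow")
fast_seen = []
for i in range(5):
    fanout.publish(i)
    fast_seen.append(fanout.poll("fast"))  # the fast consumer keeps up
```

Here the `slow` consumer never polls, so it keeps its first two messages and loses the other three, while `fast` sees every message with no delay caused by `slow`.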

Ben Rady

Mm-hmm. Yeah. Another actually interesting dimension of that particular system, which I think is worth talking about, is that the messages were not sequenced. We had lots of different messages coming in from different data centers.

Matt Godbolt

Interesting.

Ben Rady

That were all hitting the same system. And it didn't really matter what sequence they arrived in, right? The system could deal with that in different ways.

Matt Godbolt

Oh, that is interesting. Yeah.

Ben Rady

But oftentimes, it is very useful to be able to sequence a stream of messages, because that allows you to do things like create a state machine.

And then any consumer of that stream should be able to reproduce the same state of the state machine from the sequence of events. And a classic example of this in finance is building a book. But there are lots of situations in which you want to have a sequenced stream of events that you can use to reproduce state in any consumer that sees that stream.
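
As a rough sketch of reproducing state from a sequenced stream: the toy "book" below is just a map from price to total quantity, and the `("add" | "remove", price, quantity)` event shape is an assumption made for illustration, not any exchange's format.

```python
def apply(book, event):
    """Apply one event to a price -> total quantity book (mutates and returns it)."""
    kind, price, qty = event
    if kind == "add":
        book[price] = book.get(price, 0) + qty
    elif kind == "remove":
        remaining = book.get(price, 0) - qty
        if remaining > 0:
            book[price] = remaining
        else:
            book.pop(price, None)  # level is empty (or was never there)
    return book

def replay(events):
    """Start from an empty book and apply every event in order."""
    book = {}
    for event in events:
        apply(book, event)
    return book

events = [("add", 100, 5), ("add", 101, 2), ("remove", 100, 5)]
consumer_a = replay(events)
consumer_b = replay(events)

# The same events applied in a different order give a different book.
reordered = [("remove", 100, 5), ("add", 100, 5), ("add", 101, 2)]
consumer_c = replay(reordered)
```

Two consumers that see the same sequence end up with the same book; a consumer that sees the same events in a different order does not, which is why the sequencing matters.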

Matt Godbolt

Yes. Right, this is like log-structured journals of information, like databases and things: you just need to be able to process them in strict sequence. So you mentioned building a book in our world, which is taking this multicast data that flows from the exchange and applying it as the set of modifications to an empty state, to bring your world up to date with whatever orders are flying around and are currently active.

Ben Rady

Mm-hmm.

Matt Godbolt

And you absolutely have to apply them in the right sequence or else things go horribly wrong. But in that instance, there is a single producer; at least for any one book, there is exactly one producer that can give you a sequence number.

And therefore you can see if the messages arrive in order. And so that's an easier proposition. And for those folks who are thinking of TCP: if you've got a single TCP connection from one end to the other, then the messages that are being sent aren't going to be reordered anyway; that's a property of the transport. But in general, for the kind of UDP messages that we talk about in finance, that's not true. And you need to be able to see if you have either received messages out of order, or you have in fact missed one and need to go and get it from some other place.
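
Detecting reordering and gaps from a single producer's sequence numbers can be sketched like this. `SequenceTracker` and its return labels are hypothetical; how a real feed handler recovers the missing messages (for example from a retransmission service) is outside this sketch.

```python
class SequenceTracker:
    """Classify each arriving sequence number from a single producer."""

    def __init__(self, first_expected=1):
        self.next_expected = first_expected
        self.missing = set()  # sequence numbers skipped over and not yet seen

    def on_message(self, seq):
        if seq == self.next_expected:
            self.next_expected += 1
            return "in_order"
        if seq > self.next_expected:
            # Everything between the expected number and this one is a gap.
            self.missing.update(range(self.next_expected, seq))
            self.next_expected = seq + 1
            return "gap"
        if seq in self.missing:
            self.missing.discard(seq)
            return "late"  # arrived out of order and fills a gap
        return "duplicate"

tracker = SequenceTracker()
results = [tracker.on_message(s) for s in [1, 2, 5, 3, 3, 6]]
```

After this run, `tracker.missing` still contains 4: that is the message the consumer would have to go and request from somewhere else.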

So that's an interesting property, again, of messages. So we've already talked about durability as one sort of dimension. Another dimension is: what are the constraints on reproducibility and sequencing, which kind of go hand in hand?

So just to take another point here: with something like Kafka, by putting it through a broker, somebody who's responsible at least for a single stream in Kafka, you have also, as well as the durability guarantees, got a single place of record where the ordering is kind of set in stone.

Ben Rady

Right.

Matt Godbolt

And so a subsequent read of that will give you back the things in the same order that everyone saw them in. And that's a useful property in some cases. But going back to your event trade, you are saying that a lack of ordering is something that you could actually tolerate. And in fact, you didn't want to take the hit for receiving from multiple systems, right?

Ben Rady

Yeah, the sequencing process would just slow that down, so we couldn't do it, right? You have to just design the system to be tolerant of that. But I think something that's really important to understand, and this is true of Kafka, and this might be just a general CAP theorem thing: if you're going to get a sequenced stream of events, then it can be very difficult to build a system that can scale horizontally with that constraint. Because something has to be, you know, the sequencer.

Matt Godbolt

Right, the arbiter of what time things happen, which came first. Right, yeah.

Ben Rady

Which came first? Yeah. And in the particular case of Kafka, I forget topic versus stream and exactly how that is. But the thing that gives you that ordering guarantee cannot scale horizontally.

Matt Godbolt

That is, yes, the stream within a topic [Kafka's own term for this is a partition]. So topics can have multiple streams, and those streams are kind of a unit by which they are given to individual members of the Kafka cluster. And of course you can have multiple processes and threads and whatever. So essentially, by sending to a single stream, you're sending to a single

Ben Rady

Right. Yeah.

Matt Godbolt

...single destination, and that's the thing that gets to decide, but there's only one of them. If you need to go faster, you need two of them, and now suddenly you no longer have this nice guarantee of a total ordering. And that's what we're talking about here, a total ordering.
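
A toy model of why adding a second destination loses the total ordering. This does not use Kafka's API or its real partitioner: `route` and its hash are made up for illustration, and each inner list stands for one independently ordered stream.

```python
def route(messages, partition_count):
    """Assign each (key, value) message to a partition by key."""
    partitions = [[] for _ in range(partition_count)]
    for key, value in messages:
        index = sum(key.encode()) % partition_count  # toy deterministic hash
        partitions[index].append((key, value))
    return partitions

sent = [("a", 1), ("b", 2), ("a", 3), ("b", 4)]
one = route(sent, 1)  # a single stream: the original order is preserved
two = route(sent, 2)  # two streams: order is only preserved within each
```

With one stream, a reader gets back exactly the order that was sent. With two, each stream is still ordered internally, but nothing records whether `("b", 2)` came before or after `("a", 3)`, so different readers can interleave them differently.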

Ben Rady

Right. Yeah, yeah, yeah. So there are some important trade-offs to consider there.

Matt Godbolt

So why not just use the time as the total ordering?

Ben Rady

[Laughs] Well, how much time do you have? Because, pun intended.

Matt Godbolt

Uh, well, you said you, you said you had an hour, so, uh, I'm taking you at your word.

Ben Rady

Well, so to start with: what precision? Because whatever precision you choose, you're going to get some amount of collision, right? These two events happened at the same nanosecond. Which comes first? I don't know, right?

Matt Godbolt

All of the precision. I mean, yeah, no, exactly.

Ben Rady

Right, like that's not a deterministic sort order, right?

Matt Godbolt

And, yeah, you think that never happens, and then, you know, the birthday paradox kind of thing means that it happens a little bit more often than you would otherwise naively think. But yeah, I'm going to admit here: we did use nanoseconds since 1970 as a global key for packets arriving in one of the products I worked on a number of companies ago.

And the solution there was a post-process that arbitrarily picked one of them, if it found two that had the same timestamp, and just added one nanosecond until it didn't match anymore, right?
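
The "add one nanosecond until it doesn't match" post-process can be sketched as below. This is a reconstruction from the description only; `make_unique` is a hypothetical name, and the choice that the later arrival is the one that gets bumped is an assumption.

```python
def make_unique(timestamps_ns):
    """Return timestamps with collisions resolved by adding 1 ns until unique."""
    seen = set()
    unique = []
    for ts in timestamps_ns:
        while ts in seen:
            ts += 1  # arbitrary tie-break: the later arrival gets bumped
        seen.add(ts)
        unique.append(ts)
    return unique

keys = make_unique([1000, 1000, 1001, 1000, 2000])
```

The keys come out unique, but note the side effect: the packet that genuinely arrived at 1001 ends up keyed as 1002, because an earlier collision had already taken 1001. The keys work as identifiers, not as trustworthy times.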

Ben Rady

Mmhm. Yeah, right, right, right, right, right.

Matt Godbolt

It's like, pragmatically, it mostly never happens. But when it does, it really blows your system up. So yeah. And then, so, how much precision?

Ben Rady

Yes, it's.

Matt Godbolt

Great question. And you know, you and I have been fortunate enough to work in the finance industry, where we already like to have accurate time. So getting a time accurate to within low digits of nanoseconds is feasible for us, but for most people that isn't an option. You can get milliseconds at best, and NTP will get you within plus or minus fifteen, maybe twenty milliseconds. You know, better than two people synchronizing their watches in an old spy movie, but not that much better.

Ben Rady

Right, right. And I do think it's sort of that false precision problem that leads you into this trap, where you're just like, well, this nanosecond-precision timestamp, what are the odds? Like, they can't even physically arrive at the same time.

Like, the photons don't move like that. It's like, okay, but then what happens when your clocks are just off, right? They're just not that precise. And so you get two things that have the same timestamp because your clocks just aren't that precise.

Matt Godbolt

Right. And you know, with one cable the photons don't move that way, but as soon as you have more than one cable, you can have two parallel streams of photons that do arrive at exactly the same time. And so it can and does happen. So yeah, you can't just use time. And anyway, whose time are we talking about? Because, you know...

Ben Rady

Right, right. Now we're getting into the whole problem. This is a whole other category of this, which is clock domains, right? Synchronizing time between multiple computers is hard. It requires thought and oftentimes specialized equipment. And if you just take it for granted that all clocks everywhere are the same, you're setting yourself up for a lot of hurt. Like, the hurt is coming for you.

Matt Godbolt

Right.

Ben Rady

So anytime that you're going to be comparing time, you need to be thinking about what the source of those clocks is, and how precise they are, and how accurate they are, and how you're going to deal with the differences between them, and what those differences are.

What can they be? So it can go all the way to, you know, we've got a GPS antenna that's sitting on the top of the building. And we know the precise geographic coordinates of that antenna. And we know how long the cable is from that antenna to all of the various servers that are using that antenna to synchronize their time. And from the length of those cables, we can compute the drift between the signal received at the antenna and each of the individual computers, right?

And unless you're taking that level of precaution or something kind of like it, I would not trust any nanosecond timestamp to be greater or less than anything else, right?
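
The cable-length calculation mentioned here is simple arithmetic: the delay a cable adds is its length divided by the signal speed in that cable. The sketch below assumes illustrative values only (30 m and 50 m runs, and a velocity factor of 0.66, which is typical of some coaxial cable but must be taken from the actual cable's datasheet).

```python
SPEED_OF_LIGHT_M_PER_S = 299_792_458  # exact, by definition of the metre

def cable_delay_ns(length_m, velocity_factor):
    """One-way propagation delay through a cable, in nanoseconds."""
    return length_m / (velocity_factor * SPEED_OF_LIGHT_M_PER_S) * 1e9

# Assumed values: 30 m and 50 m runs of cable with a velocity factor of 0.66.
delay_short = cable_delay_ns(30.0, 0.66)
delay_long = cable_delay_ns(50.0, 0.66)
skew_ns = delay_long - delay_short
```

With these assumed numbers the two servers would see the same pulse roughly 100 ns apart, which is far larger than the nanosecond precision of the timestamps being compared.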

Matt Godbolt

You've missed out even some bits there. You know, when we were doing stuff at previous companies, there would be a rubidium-based oscillator with a very high... you know, there's an oven that's got rubidium at some temperature, and that's the thing that you synchronize with the GPS, and everything synchronizes to that with some complicated protocol and...

Ben Rady

Yeah. Yep.

Matt Godbolt

Yeah, well, no, I say complicated. This is my favorite protocol. I remember one of our network engineers saying to me, yeah, we use PPS to synchronize the master clock with the individual clocks on each of the machines. I'm like, PPS, wow, what's that? Because I've heard of NTP, and I've heard of PTP, but PPS? And he's like, it stands for pulse per second. It literally goes to five volts once a second, on the second. And I'm like, oh, right, that's the protocol.

Ben Rady

Yeah, yeah, yeah.

Matt Godbolt

Just on and off, got it.

Ben Rady

This is good. It's a simple protocol.

Matt Godbolt

It's a simple protocol. But yeah, again, you talk about the leads: the cables were very carefully measured and very carefully designed so that the delays they brought in were understood. So yeah, it's complicated.

Ben Rady

Yeah, yeah, yeah. Right, right.

Matt Godbolt

And reasonable people could disagree, because yeah, you can have a data center full of things that uses your discipline for clock synchronization, which you're maybe happy with.

Ben Rady

Oh, yeah.

Matt Godbolt

But if you take a message from, say, an exchange, and the exchange says, hey, this happened at this point in time, you have to trust their ability to manage that if you want to say, well, why don't we use their clocks? Whatever we're doing on our side, forget it. Let's just use the clocks from the remote people. We have been through this process. You're like, well, that makes sense. You know, they surely have done something sane. And then of course, what if they haven't? I mean, who would ever throw aspersions at our friends, who have a difficult job maintaining these systems, but like...

Ben Rady

Yeah.

Matt Godbolt

Things have gone wrong before, and then suddenly you're thrown into a world of hurt because time went backwards by tens of nanoseconds, and you're like, no, I always expect time to go forwards, because, you know, that's one of the few truths, along with taxes and death: time goes forwards.

Ben Rady

Nope, you think it does. But I mean, I think that raises a really good point, which is that one way you can get around this time synchronization difficulty is to never use the system time of the computers that are in the messaging system, and embed time in the messages, right? And then the ultimate source of the messages is the thing that has to have a reasonably accurate time.

But the sense of time for all of the downstream systems just comes from that. And that is really important if you want to do what we were kind of talking about earlier, where you have a sequence of messages and you're trying to reconstitute state based on that sequence of messages. If there's any sort of time processing that has to happen, then embedding the time in the messages allows you to reconstitute that state retroactively, right?

So you can go back and you can replay the messages from three months ago

Matt Godbolt

Yep.

Ben Rady

and reconstitute whatever state you have, even if it depends on time, because it doesn't depend on the clock of the computer that's running the simulation or the reproduction. It's extracting that time from the messages themselves. So you will always get exactly the same result.
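
A sketch of time-dependent state that takes its sense of "now" only from the messages. The `time_ns` field, `RollingCounter`, and the one-microsecond window are assumptions for illustration; the point is that nothing in the code reads the local machine's clock.

```python
class RollingCounter:
    """Counts events in the last `window_ns`, using only message timestamps."""

    def __init__(self, window_ns):
        self.window_ns = window_ns
        self.times = []

    def on_message(self, message):
        now = message["time_ns"]  # "now" comes from the message itself
        self.times.append(now)
        cutoff = now - self.window_ns
        self.times = [t for t in self.times if t > cutoff]
        return len(self.times)

def replay(messages):
    """Run a fresh counter over the messages and record each output."""
    counter = RollingCounter(window_ns=1_000)
    return [counter.on_message(m) for m in messages]

recorded = [{"time_ns": t} for t in (0, 400, 900, 1_500, 5_000)]
live_result = replay(recorded)
replayed_later = replay(recorded)
```

Because the window is evaluated against message time, replaying the same recording next week or three months later gives the same sequence of counts as the original run.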

Matt Godbolt

Right.

Yeah, just to take a temporary diversion here, this is one of the things that, in the code base that I was working on, we used different types for the different types of time. So they were literally not comparable or convertible between each other without an explicit thing I could search for in the code saying, we're doing this, we're crossing clock domains right now. I am trying to look at the current time as measured by whatever process has given me the time on my computer, and I'm comparing it to the message time that was embedded in the message through some mechanism, and I have to know that that comes with this huge bag of caveats. It's sometimes useful to do it, because one thing you might want to do is measure the skew between the two just to graph it somewhere, or just to keep track of it, or just to alert if it gets more than a few hundred milliseconds or something out. So you do want to be able to do it, but you definitely don't want to be able to do it just by saying `time t = clock.now - message.time`.
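A hedged sketch of the separate-types idea in Python. The class and function names (`LocalTime`, `MessageTime`, `cross_domain_skew_ns`) are invented for illustration and are not from Matt's code base. Python is dynamically typed, so the mixed subtraction is rejected at run time with a `TypeError` rather than at compile time.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LocalTime:
    """Nanoseconds as read from this machine's clock."""
    ns: int

    def __sub__(self, other: object) -> int:
        if not isinstance(other, LocalTime):
            return NotImplemented  # refuses LocalTime - MessageTime
        return self.ns - other.ns


@dataclass(frozen=True)
class MessageTime:
    """Nanoseconds as embedded in a message by its sender."""
    ns: int

    def __sub__(self, other: object) -> int:
        if not isinstance(other, MessageTime):
            return NotImplemented
        return self.ns - other.ns


def cross_domain_skew_ns(local: LocalTime, message: MessageTime) -> int:
    """The one searchable place where the two clock domains are compared."""
    return local.ns - message.ns


now = LocalTime(1_000_000_500)
stamped = MessageTime(1_000_000_000)

try:
    now - stamped          # mixing domains by accident
    mixed_allowed = True
except TypeError:
    mixed_allowed = False

skew = cross_domain_skew_ns(now, stamped)   # mixing domains on purpose
```

Searching the code for `cross_domain_skew_ns` then finds every place where the caveats apply.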

Ben Rady

Yeah, yeah.

Matt Godbolt

It should be, no, that's a syntax error, right? The thing is going to fail to compile there. You have to do some work here. And, you know, that's always been a worthwhile thing to do, I've found. And even within a computer, you know, there are different clocks. You've got monotonic clocks that are guaranteed to not go backwards. You've got clocks that try to adjust because of NTP drift as they're readjusting themselves.

You've got like the CPU cycle counter, which is measured in its own domain. So this is something that's useful to have more generally. Gosh, this is really going off topic, isn't it? This is great. But no, it's it's a really important thing to to know about. I think it's worth saying as well just because it's cool that it is possible to get networking hardware to add a timestamp onto the end of packets that flow through it.

Ben Rady

Mmhmm. Mmhmm.

Matt Godbolt

So there are certain switches that you can configure.

You can plug them into this PPS and get them to synchronize with your very accurate timestamp. And then every message that flows through that switch gets a payload tacked on the back of each packet, after the end of what would normally be the UDP packet or the TCP packet or whatever. You need to use exotic mechanisms to go and actually pull those bytes out, but they are there. Then you can have a source of truth, maybe at the edge of your network: as things come in from the outside world you say, well, this is where we're going to timestamp it. And that's useful for reconstituting the sequence in which they arrived at the edge, which is not necessarily the order that they arrived at you, because cables can vary within the system and routes within your system can vary, but it gives you something to measure things by. And in particular, when you're doing some of those more latency-sensitive things that we were talking about, it gives you a sort of ground truth comparison: you can look at that timestamp for the thing that came in, and look at the timestamp of your message that went out of

Ben Rady

Mmhmm.

Matt Godbolt

the system, and you've got, like, that's literally how long it took, warts and all, every network hop. Anyway, that's one of the many sources of clock domains. And we were talking about clock domains in the context of ordering. So yeah, go ahead.
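The two uses Matt mentions for edge timestamps can be sketched in a few lines of Python. The `Packet` and `Response` shapes are hypothetical, and the sketch assumes the hardware timestamp has already been extracted from the packet trailer, which is the "exotic mechanisms" part and is not shown.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Packet:
    edge_ns: int      # hypothetical: timestamp appended by the switch at the edge
    payload: str


@dataclass(frozen=True)
class Response:
    edge_ns: int      # edge timestamp of the packet that triggered this response
    egress_ns: int    # edge timestamp taken as the response left the network


def edge_order(received: list[Packet]) -> list[Packet]:
    """Reorder packets by when they hit the edge, not when they reached us."""
    return sorted(received, key=lambda p: p.edge_ns)


def wire_to_wire_ns(resp: Response) -> int:
    """Elapsed time between the two edge timestamps, every hop included."""
    return resp.egress_ns - resp.edge_ns


received = [Packet(2_000, "B"), Packet(1_500, "A"), Packet(2_700, "C")]
ordered = [p.payload for p in edge_order(received)]
latency = wire_to_wire_ns(Response(edge_ns=1_500, egress_ns=9_300))
```

Both timestamps come from the same clock domain (the edge hardware), so subtracting them is legitimate.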

Ben Rady

Yeah, well, and that actually brings up another topic, which is that timestamping is an example of something else that is a really good practice, which is tracing. As the message flows through your system and as it's being processed at each stage, it is quite often useful to be able to embed in the message, or maybe as a wrapper around the message, depending on how you do it,

Matt Godbolt

Yes.

Ben Rady

information about the tracing. And that can be useful for performance. It can be useful for, you know, error debugging, just general observability, figuring out, like, hey, this message failed to process...

Matt Godbolt

Yep. Well, debugging. Yeah.

Ben Rady

Why? like Where did it stop? What problems did it run into? Or it was really slow to process. Why? What was the bottleneck? right What was the slow part? um And, you know, sometimes you'll do things like creating some sort of identifier at the point of ingestion or message creation.

And then you can have an external system that refers to the message as it flows through using that identifier. Or sometimes you're literally just adding information into the message object as it's flowing through.

Matt Godbolt

Right. That, incidentally, is what we used ah the nanosecond timestamp for, because obviously the the the hardware on the outside would put this nanosecond timestamp on every packet. We're like, well, that's a unique identifier, except when it isn't.

Ben Rady

Yeah, yeah, yeah, except it's not.

Matt Godbolt

But most of the time it is. And then it gives you this sort of unique ID, this sort of trace ID, which carries information in its own right, because it's the time that it arrived as well.

Yeah, not always, unfortunately, not always unique. I've variously seen this called "provenance" or "tracing" or "causality". And I know about the OpenTelemetry project; I keep being pointed at it, and I'm going to start looking at that soon.

I keep meaning to. They seem to have a whole bunch of stuff around the telemetry of systems more generally, but I wonder if they have something that can also be used to correlate.

That's another one, "correlation IDs" and things like that: one event, and the causality as it traces through your system, and you see all the different events. I mean, even on just a website, seeing that someone clicked a button and caused an error, and being able to say, well, that backend error was caused by this click over here, is useful. Anyway, sorry, again, really off base here, but yeah.
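One way to picture the "wrapper around the message" approach is an envelope that carries a trace ID plus a record of each stage the message passed through. This is an illustrative Python sketch, not OpenTelemetry's API; `Envelope`, `Hop`, and the stage names are made up. It uses a random UUID for the ID rather than the arrival timestamp, since, as noted above, the timestamp is not always unique.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class Hop:
    stage: str
    timestamp_ns: int
    note: str = ""


@dataclass
class Envelope:
    payload: dict
    trace_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    hops: list[Hop] = field(default_factory=list)

    def record(self, stage: str, timestamp_ns: int, note: str = "") -> None:
        self.hops.append(Hop(stage, timestamp_ns, note))

    def slowest_stage(self) -> str:
        """Stage with the largest gap since the previous hop."""
        gaps = [
            (later.timestamp_ns - earlier.timestamp_ns, later.stage)
            for earlier, later in zip(self.hops, self.hops[1:])
        ]
        return max(gaps)[1]


env = Envelope(payload={"order": 42})
env.record("ingest", 1_000)
env.record("validate", 1_200)
env.record("enrich", 9_000)
env.record("publish", 9_100)
bottleneck = env.slowest_stage()
```

With the hops recorded, "where did it stop?" is the last hop and "what was the slow part?" is `slowest_stage()`.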

Ben Rady

No, I mean, I think these are all these are all dimensions of this problem that you need to be thinking about if you're going to build systems like this, right?

Matt Godbolt

We've talked about various dimensions of messages so far. We talked about durability, we talked about sequencing, and now we've talked about tracing, which sort of touched on determinism. We opened with, you know, don't put giant blocks of data into your messages. And we said, be very careful about which clocks you use. What other considerations are there?

Ben Rady

Yeah, yeah.

Matt Godbolt

I mean, so how about your monitoring system? Let's just think a little bit about the monitoring system.

Ben Rady

Yeah, yeah, yeah.

Matt Godbolt

So that had a very, very high set of inputs. Like, essentially, it was a centralized monitoring system for the whole company's services. All the services could send all the stats they wanted to it.

Ben Rady

yeah

Matt Godbolt

And you had to deal with it.

Ben Rady

Yeah, I'll tell you one mistake that we made, and this is, you know, good judgment comes from experience and experience comes from bad judgment.

Matt Godbolt

[laughs]

Ben Rady

And so listeners, I hope that you get to benefit from all of the bad judgment of the of the people on this podcast and the hard-won experience. And so when I say like you need to be careful about clock domains and you need to think about like where your source of time is, one of the great mistakes that we made very early on in that project, and it's something that just haunted us forever,

is we allowed people who were sending messages to the system... So the idea behind the system is that you'd have, you know, external clients that could send telemetry data or, I mean, basically anything, like prices, internal application metrics, whatever they wanted. They could send data to the system. It worked a little bit like StatsD, if you've ever used StatsD, but it had sort of more, yeah, yeah.

Matt Godbolt

Yeah, sort of Prometheus-y type things, but it was designed for more real-time stuff rather than once-a-minute, once-a-second kind of stuff. It was very much like

Ben Rady

Yes, yes, yes. The idea behind the system was like, you know, it's cool and that Grafana has a chart that updates once a minute, but we need something that can update many times per second because it's monitoring trading systems. And if something happens, we need to know about it right now. So like human time. But one of the great mistakes that we made with the system was allowing people to put their own timestamps on those messages. That was a terrible idea. An absolutely terrible idea.

Matt Godbolt

It's so easy to do. I can see why you'd want to be able to do this. You know, like I find this quite often with things like um the, ah like our Prometheus setup, because, you know, like, Hey, I've got a build.

Ben Rady

Yes.

Matt Godbolt

I want to measure my build time and I want to post it. And then sometimes I want to go, actually, I want to go back in time and run the last hundred builds, one day apart from each other. And I want to populate some data in the database so that I don't just have "now data". I have historic data once I've thought I want it.

Ben Rady

Mm hmm.

Matt Godbolt

Right. And so how bad would it be to let me post stuff that's in the past to you so that I can write my data?

Ben Rady

Yeah, yeah, yeah.

Matt Godbolt

Like, you know, it's a reasonable thing to want to do. So what was the drawback? What was the, what, what made you rue that decision?

Ben Rady

Right. Well, because inevitably people want to be able to say, like, oh, and also give me the list of all the messages that were delivered on this day. And now that's just wrong, because your timestamp and my timestamp don't line up for whatever reason, right? It could be that you pre- or post-dated your thing, but you did the calculation wrong.

Matt Godbolt

Right.

Ben Rady

It could be that what you actually want, when you ask for the data that was delivered on that day, is the data that was actually delivered on that day, and not whatever timestamp it had, because that came out of your log file or whatever.

Matt Godbolt

Well, this is, this comes back to almost like the bitemporality thing. It's like, you know, there's the time that I got it. And that's the kind of knowledge time. When did I know that you said that you wanted this thing?

Ben Rady

Bitemporality, yes.

Matt Godbolt

That's one timestamp. And then the other timestamp is what time did you say that you wanted this thing to be known as of or related to, sorry.

Ben Rady

yeah Yes. Yes.

Matt Godbolt

And in almost all situations, those two times are coincident, or so close that nobody cares, but not always. And I think that's one of the harder things. I don't know if we've ever talked about bitemporality. Maybe we have. I don't know.

Ben Rady

Mm-hmm, mm-hmm. I don't know.

Matt Godbolt

We must have done in passing. But that's a whole interesting world as well. You want to say, on this day, what messages did you send me?

Ben Rady

Yeah. Mm hmm.

Matt Godbolt

And then you want to say, on this day, what samples fall in this window? Which is different from when did you tell me about those samples?

Ben Rady

Right.

Matt Godbolt

Right. That's a very... I mean, again, they're mostly the same.

Ben Rady

Right, right, right.

Matt Godbolt

But yeah, that's OK.

Ben Rady

Yeah, yeah, if I had it to do over again, what I would have said is no, you cannot specify the timestamp, but you can, and this was true already, you can put whatever data you want in your message and you can query based on any of that data. So if you want to have your own log timestamp or ingestion timestamp or whatever, you can add that as a field to your message.

Matt Godbolt

Mm hmm.

Ben Rady

My system will be blissfully ignorant of it other than it's another field that you can do stuff with and you can do whatever you want with that timestamp.

Matt Godbolt

Yeah. Yeah, that is your piece of data to do with as you wish, but we know when it arrived with us, and that's all we're going to keep as the sort of primary thing that we can. Yeah.
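A small Python sketch of the design Ben says he would choose with hindsight: the ingesting system assigns the only "official" timestamp on arrival, and any client-supplied time is just another queryable field. The `Store` class, its method names, and the `built_ns` field are assumptions for illustration, not the real system's interface.

```python
from dataclasses import dataclass

NS_PER_DAY = 86_400 * 1_000_000_000


@dataclass(frozen=True)
class Stored:
    received_ns: int  # assigned by the ingesting system; clients cannot set it
    fields: dict      # opaque client data, including any timestamps of their own


class Store:
    def __init__(self) -> None:
        self.rows: list[Stored] = []

    def ingest(self, fields: dict, arrival_ns: int) -> None:
        # arrival_ns comes from the ingesting side's clock, never from the client
        self.rows.append(Stored(arrival_ns, dict(fields)))

    def delivered_on(self, day: int) -> list[dict]:
        """Knowledge time: what did we receive on this (UTC) day number?"""
        return [r.fields for r in self.rows if r.received_ns // NS_PER_DAY == day]

    def where(self, name: str, lo: int, hi: int) -> list[dict]:
        """Client time: query on a client-supplied field like any other data."""
        return [r.fields for r in self.rows if lo <= r.fields.get(name, lo - 1) < hi]


store = Store()
# A build result backfilled on day 20 that describes a build from day 3.
store.ingest({"build_secs": 91, "built_ns": 3 * NS_PER_DAY + 5}, 20 * NS_PER_DAY + 7)
store.ingest({"build_secs": 88, "built_ns": 20 * NS_PER_DAY + 1}, 20 * NS_PER_DAY + 9)
```

"What was delivered on day 20?" and "which samples describe day 3?" are then two different queries over two different timestamps, which is the bitemporal split Matt describes.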

Ben Rady

Yeah. Yes. Also, speaking of timestamps, Please, please, please do not put localized timestamps in your messages.

Matt Godbolt

Oh.

Ben Rady

It's a long. It can be nano precision, it can be millisecond precision, it can be second precision. I don't even care, but it's a number. Please just put a number in there. Don't put some parsed string with a time zone offset. No.

Matt Godbolt

Yeah. No, and store it in UTC for this kind of thing, or some well-defined, never-changing thing. I don't know to what extent it's an open secret or not, but a very large web search company, to this day, to the best of my understanding, still logs everything in West Coast time, which means that

Ben Rady

Yes.

Matt Godbolt

its logs and the graphs that go with them have, twice a year, either a big gap or a weird doubling back on themselves type of thing.

Ben Rady

Mm hmm.

Matt Godbolt

um And it's just the cost of changing it is so high that it hasn't been done.

Ben Rady

Mm hmm.

Matt Godbolt

But yeah, there's a time and a place for a localized time. And it is in application-level things. Like, if you're trying to talk about what time a trade happened on a particular exchange, it is useful to specify it in the local time of that exchange, say, because you know that our exchange opens at 8:30 local time on that day and closes at 3:30 local time on that day.

Ben Rady

Yes.

Matt Godbolt

But if you have to sit and try to work out or do anything other than compare with straightforward arithmetic operations on a 64-bit number, then you're doing something wrong. If you have to kind of work out what day that was, and then, was it daylight savings or not on that... wait a second, that was in Europe, wasn't it, and they don't do daylight saving.

Ben Rady

Absolutely, absolutely. Like religion and politics, time localization should only be discussed in the home. The international standard is a 64-bit number. And only when you're displaying it or viewing it or making a report do you ever take that 64-bit number and turn it into some localized time that is localized for the person who is viewing it, right?
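To make the "number in the message, localization only at display" rule concrete, here is a short Python sketch. The function name `to_display` is an assumption, and the fixed UTC-6 offset is used purely to keep the example self-contained; real display code would more likely use the IANA time zone database (for example via Python's `zoneinfo` module) so that daylight saving is handled at the display layer.

```python
from datetime import datetime, timedelta, timezone

NS_PER_SEC = 1_000_000_000


def to_display(epoch_ns: int, tz: timezone) -> str:
    """Format a stored epoch-nanosecond value for a viewer in time zone tz."""
    seconds, _nanos = divmod(epoch_ns, NS_PER_SEC)
    return datetime.fromtimestamp(seconds, tz).strftime("%Y-%m-%d %H:%M:%S %z")


stored_ns = 1_739_491_200 * NS_PER_SEC   # what lives in the message

utc_view = to_display(stored_ns, timezone.utc)
chicago_winter = timezone(timedelta(hours=-6))   # fixed offset, for illustration
local_view = to_display(stored_ns, chicago_winter)
```

The stored value never changes; `utc_view` and `local_view` are two renderings of the same instant.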

Matt Godbolt

Yes, or the whatever it is.

Ben Rady

Or the system perhaps that is viewing it. But yes, yes, right.

Matt Godbolt

Yes, no, that makes sense. Yeah, I think that is that. And then, yeah, nanoseconds since 1970 is not a bad thing to fit into 64 bits. That'll get you to, I can't remember when, but it's far enough in the future that, at least right now, I don't have to worry about it before I retire, although, you know, I'm an old man. So maybe the younger folk will have to worry about it.
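The "I can't remember when" can be worked out directly. Assuming a signed 64-bit integer counting nanoseconds from the 1970 Unix epoch (the signedness is an assumption; the episode only says "64 bits"), a quick Python calculation gives the last representable instant:

```python
from datetime import datetime, timedelta, timezone

MAX_SIGNED_NS = 2**63 - 1            # largest value of a signed 64-bit integer
NS_PER_SEC = 1_000_000_000

epoch = datetime(1970, 1, 1, tzinfo=timezone.utc)
last_whole_second = epoch + timedelta(seconds=MAX_SIGNED_NS // NS_PER_SEC)
years_of_range = MAX_SIGNED_NS / NS_PER_SEC / (365.25 * 86_400)
```

That is a range of a little over 292 years, which runs out in April 2262.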

Ben Rady

Mmhm

Matt Godbolt

But there are any number of ways of storing time better than that. Or, you know, you can pick your own epoch, right? It doesn't have to be 1970. It's convenient if it is, because then you could use

Ben Rady

Yeah. Right, right.

Matt Godbolt

the Unix date command to kind of move back and forth. In fact, one of the first things I do... I've checked in all my dot files. Sorry, this is another sidetrack, but one of the fish functions of the shell that I use is to convert numbers from an epoch time to a displayable time and backwards, right? So I could do epoch and then just type a number in, and then, based on however many digits it's got, it guesses whether it's millis, micros or nanos, and then it prints it out in my current time zone. And it is the single most useful thing. I know people go to epoch-converter.com, which drives me bonkers to see. You know, why would you go to a website with all these flashing ads and things on it just to convert some numbers, when it's something that the command line can do? But on the other hand, it's a pain to do.
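Matt's actual fish function is not shown in the episode, so the following is only a guess at the same digit-counting heuristic, written in Python. The thresholds are assumptions based on present-day values (about 10 digits for seconds, 13 for milliseconds, 16 for microseconds, 19 for nanoseconds), and this version prints UTC rather than the local time zone.

```python
from datetime import datetime, timezone

# Hypothetical thresholds: present-day epoch values have about 10 digits in
# seconds, 13 in milliseconds, 16 in microseconds and 19 in nanoseconds.
UNITS = [(11, "s", 1), (14, "ms", 10**3), (17, "us", 10**6)]


def guess_unit(value: int) -> tuple[str, int]:
    """Guess the unit of an epoch value from how many digits it has."""
    digits = len(str(abs(value)))
    for max_digits, name, per_second in UNITS:
        if digits <= max_digits:
            return name, per_second
    return "ns", 10**9


def epoch_to_utc(value: int) -> datetime:
    _name, per_second = guess_unit(value)
    return datetime.fromtimestamp(value // per_second, timezone.utc)


same_instant = {
    epoch_to_utc(1_739_491_200),
    epoch_to_utc(1_739_491_200_000),
    epoch_to_utc(1_739_491_200_000_000),
    epoch_to_utc(1_739_491_200_000_000_000),
}
```

All four inputs are the same instant expressed in different units, so the set collapses to a single `datetime`.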

Ben Rady

Yeah. Or you can just open up a JavaScript console in your favorite browser and paste the timestamp into `new Date()`.

Matt Godbolt

Yeah, that's true.

Ben Rady

And that'll, that'll also give it to you. um

Matt Godbolt

That's a great one.

Ben Rady

yeah

Matt Godbolt

I'm remembering that one.

Ben Rady

Yeah, it's super, super convenient most of the time.

Matt Godbolt

That one's even more portable than mine, yeah.

Ben Rady

Another thing to think about here, and this is kind of getting back to, you know, I was saying, don't put a database in the middle of your messaging system, right? Generally. Sometimes it's fine. And, you know, as you said before, sometimes it's just a file. But, okay, if I can't do that, then how am I supposed to bridge the gap? Because there will almost certainly be a gap between the world of stream processing systems and batch processing systems.

Matt Godbolt

Right.

Ben Rady

Right? At some point, someone's going to want to run a database query or something on your data.

Matt Godbolt

Right.

Ben Rady

And how do you handle that? And also, this kind of ties into a durability thing, where if you don't have a system like Kafka or some other sort of durable queue in the middle of your system to keep track of the history, you just have UDP packets or you have something else. What should be responsible for keeping the historical record of everything that has ever happened, right?

Matt Godbolt

Right. Right.

Ben Rady

So I...

Matt Godbolt

Which obviously some people don't need, and that's fine. If you're a video game server and you've got the player positions that are being updated, then maybe you don't need a log for all time.

Ben Rady

Right. Yes.

Matt Godbolt

But you know, if you're working in finance, it's generally a good idea to keep everything forever for all time in case somebody comes and asks you a very awkward question about what happened.

Ben Rady

Yeah, yeah. And this ties in also to another thing that we were talking about, reproducing state for state machines. So, you know, the cool idea is, all right, I'm going to take my messages. I'm going to pass them into some system that processes them. There will be no other information that goes into the state machine other than the messages themselves. And therefore, I can completely reproduce the state from the sequence of messages. It's like, yes, that's cool. But what happens when you have seven years' worth of messages and you have to start at the beginning?

Yes, yes.

Matt Godbolt

right right right

Ben Rady

That seems bad. So one of the things that you typically do is you have something that is consuming the stream of messages whose purpose is to store them and also potentially snapshot them.

Right, so you have something that is consuming the messages, and it's writing them into some persistent store. Maybe it's even transforming them into something that can fit into a database table or some other format that is nice for bulk processing. And another thing that it might be doing is running this sort of state machine and taking a snapshot at some regular interval, and then putting that into the storage as well. So that when you need to reproduce the state for some particular point in time,

rather than having to play all seven years' worth of messages through your system, you can jump to a prior but recent snapshot, load that state into your system, and then only replay the messages forward from there. And that will be much faster and much more efficient.
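The snapshot-then-replay pattern can be sketched in Python as follows. Everything here (`Event`, `Archiver`, the per-key running total used as the "state machine", the in-memory lists standing in for persistent storage) is a simplified assumption for illustration; a real archiver would seek into the stored log rather than scan it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    seq: int        # position in the stream
    key: str
    delta: int


def apply(state: dict[str, int], ev: Event) -> None:
    """The state machine: its only input is the message itself."""
    state[ev.key] = state.get(ev.key, 0) + ev.delta


class Archiver:
    """Consumes the stream, stores every event, snapshots every `every` events."""

    def __init__(self, every: int) -> None:
        self.every = every
        self.events: list[Event] = []
        self.snapshots: dict[int, dict[str, int]] = {}  # seq -> state after seq
        self.state: dict[str, int] = {}

    def consume(self, ev: Event) -> None:
        self.events.append(ev)
        apply(self.state, ev)
        if ev.seq % self.every == 0:
            self.snapshots[ev.seq] = dict(self.state)

    def state_at(self, seq: int) -> tuple[dict[str, int], int]:
        """Rebuild state as of `seq`; also report how many events were replayed."""
        usable = [s for s in self.snapshots if s <= seq]
        start = max(usable, default=0)
        state = dict(self.snapshots.get(start, {}))
        tail = [e for e in self.events if start < e.seq <= seq]
        for ev in tail:
            apply(state, ev)
        return state, len(tail)


archiver = Archiver(every=100)
for i in range(1, 1001):
    archiver.consume(Event(seq=i, key="ab"[i % 2], delta=i))

rebuilt, replayed = archiver.state_at(950)
```

Rebuilding the state as of event 950 starts from the snapshot at 900 and replays 50 events instead of 950.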

Matt Godbolt

Right. Right, right, right. Provided there exists a sensible snapshot format, which is an interesting question. So I think this has now sort of moved into what I think of as a log-structured journal. You know, you have some database or in-memory representation of the world that you update through seeing these events.

Ben Rady

Right. Right.

Matt Godbolt

For some things, so for example, to build the set of live orders on an exchange, that is, the prices of, like, Google and all the people that are trying to buy and sell Google, you can unambiguously snapshot that state and go, okay, at this point in time, at nine in the morning, these are everyone's orders. And now if you just load up this 9 AM snapshot, you can carry on. You don't have to have loaded up, you know, the 7 AM ones, or the whole day. That's fine, right? But as soon as you start getting to things that have state that is non-trivial,

now it becomes a function of the processor of that state. So let me give you an example. What if you were keeping some kind of exponential moving average of some of the factors of that? That depends on how long the window of your exponential...

Ben Rady

Uh-huh.

Matt Godbolt

is, and some other properties of that. What do you count? Which kinds of information go into that or don't? And now you've got a complicated piece of state that is arguably different for every client. You know, maybe some people care about a 10-day look-back, and then other people want a five-minute look-back. And so that gets kind of tricky. I don't know where I'm going with this now, but it's not as straightforward for application domains if they have any kind of state that requires some history in order to get to that point, other than the pure, individual add/remove of, say, a book. Unambiguous stuff, yeah.
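A tiny Python sketch of why this kind of derived state is consumer-specific. It assumes the simplest per-sample exponential moving average with a smoothing factor `alpha` (the episode does not specify the exact formula), and shows that a snapshot taken by one consumer is only valid for a consumer with the same parameter.

```python
from typing import Optional


def ema_update(previous: Optional[float], sample: float, alpha: float) -> float:
    """One step of an exponential moving average with smoothing factor alpha."""
    if previous is None:
        return sample
    return alpha * sample + (1.0 - alpha) * previous


def run(samples: list[float], alpha: float, start: Optional[float] = None) -> float:
    value = start
    for sample in samples:
        value = ema_update(value, sample, alpha)
    return value


prices = [100.0, 102.0, 101.0, 105.0, 104.0]

fast = run(prices, alpha=0.5)    # short look-back
slow = run(prices, alpha=0.1)    # long look-back

# Resuming from a snapshot only works with the snapshot made for that alpha.
fast_snapshot = run(prices[:3], alpha=0.5)
resumed_right = run(prices[3:], alpha=0.5, start=fast_snapshot)
resumed_wrong = run(prices[3:], alpha=0.1, start=fast_snapshot)
```

So a shared archiver can snapshot the order book once for everyone, but an average like this needs one snapshot per configuration, or each consumer has to snapshot its own state.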

Ben Rady

Yeah, and that state can get quite large because of these constraints. And I think this is something that is really important to think about, because this kind of snapshotting becomes very important when you think about error recovery, right? And there are two dimensions of error recovery that I think we can talk about here. One is you've got some consumer of the stream and it's crashed. And now you want to restart it, right?

Matt Godbolt

Right? Yep.

Ben Rady

What state do you need to let it sort of rejoin the stream, right? Again, do you have to go back to the beginning of time and process seven years' worth of messages for your system to restart? That's gonna be bad, right?

Matt Godbolt

Oh, we'll fix it next year when, yeah, it only gets worse, yeah.

Ben Rady

So if you, yeah, right. Yes, we've rebooted it and the website will be back online in 2038. So you have to think about the state if you want to be able to recover, and you need to think about how you can reasonably snapshot that state if you want to be able to spin something back up and have it sort of rejoin this stream, right?

And so I think you have to consider that from the very beginning. How big is the state? How often can we snapshot it? What is our acceptable amount of downtime here for these various things? You know, is it an hour? Is it a minute? Is it a month? And how are we going to be able to rejoin this processing? Otherwise, we can never turn this software off, right?

Matt Godbolt

Right, which is an option. Just don't write any bugs. Don't have any hardware faults... we'll be golden!

Ben Rady

Yeah, right. Yes. Another dimension to think about with fault tolerance in these systems is poisoned messages.

Matt Godbolt

Yeah.

Ben Rady

Right, so that is a very common situation, where there's a bug in your system, or a bug in a producer system, perhaps, and you receive a message that you can't process. And redundancy here will not save you. You can have 10 redundant systems that are all consuming the stream and processing the messages, so that if one runs out of memory or whatever, you know the other nine are there.

But if they all have the same bug and they all get the same message, the whole point of the distributed state machine is that they are all going to do the same thing, which is not process your message.

Matt Godbolt

Or all crash.

Ben Rady

That means they might crash. You know, all manner of problems can happen here. So one common approach to dealing with these things is creating what's called a dead letter queue. So you have a message that comes in and your system cannot process it, but it's able to detect that it can't process it. Maybe it raises an error, maybe there's some validation step, whatever it is, and it's like, I can't process this message.

So what I'm going to do is I'm going to take it and put it into another queue, another stream of messages, called the dead letter queue.

Matt Godbolt

They all crash.

Ben Rady

And it's going to sit there until somebody does something with it. Now, the first thing that you want to do with it is send some kind of notification or alert or something to tell everybody, yes, like you know somebody's getting paged.

Matt Godbolt

Stream of messages, yeah, yeah. Someone's phone should go off.

Ben Rady

It's like, ah, we just got a message we don't know how to process, right? But if you do that, then depending on the state machine that you're trying to reproduce, if you have one, or just the message processing that you're doing, it can sometimes be OK to say, OK, I'm going to take this message. I'm going to put it in the dead letter queue. And then I'm just going to keep going. I'm going to pretend like I never even got this message, because it's malformed or there's some other problematic thing with it, and I'm just going to keep going.

You can obviously run into situations where there's just a bug in your code and this is a message that you need to process and you didn't process it correctly and now your state is wrong.

Matt Godbolt

And now you're doomed. Yeah, yeah.

Ben Rady

But there are also situations in which you have one of these messages and it is truly something that is malformed and can be ignored, was never supposed to be created in the first place, and now you can just continue on having this in the dead letter queue. A common pattern that I have used with great effect is being able to basically re-drive those dead letter queue messages back into the main queue if sequencing doesn't matter.

Matt Godbolt

Oh, interesting.

Ben Rady

If sequencing matters, then you can't do this, right? But if you have a system where there's no sequencer, or there is a sequencer but it doesn't really matter all that much, then you can take these messages and be like, all right, we got this message we don't know how to handle.

It went into the dead letter queue. We're now going to change the code so that it can handle this message in some way, redeploy that, and then re-drive the message back into the queue so that it can be correctly processed and flow all the way through.

Matt Godbolt

Right.

Ben Rady

Right? And that is a really nice way to handle it if you can.

Matt Godbolt

If you're able to do that, then yeah, that's really nice, particularly if, for example, this was some, you know, holiday booking stream of information, like a centralized holiday booking thing. Then someone comes in and they've just booked some suite, and

the price is higher than you've ever hit before and some internal issue happens, and you're like, oh, damn, you know, we can't book this for them because it's legitimately a $100,000-a-night thing.

Ben Rady

Yes.

Matt Godbolt

And that just overflows something and we're done. But you're like, this is really valuable business. Ben, could you hotfix that very, very quickly? Write a test, fix the test, deploy the thing, and then we're gonna put it back in again, and then the booking goes through, albeit 30 minutes, an hour late.

Ben Rady

hu hu Yeah,

Matt Godbolt

At least it gets done, and you capture the revenue, and everyone's happy. Yeah, that seems like a really nice way to heal the system in that instance.
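The dead letter queue and re-drive flow, using the booking example, might look something like this in Python. The `Processor` class, the `max_price` limit standing in for the overflow bug, and the in-memory queues are all assumptions for illustration, not a real queueing system's API.

```python
from collections import deque


class Processor:
    def __init__(self, max_price: int) -> None:
        self.max_price = max_price          # stands in for a bug or limit
        self.booked: list[dict] = []
        self.dead_letters: deque = deque()
        self.alerts: list[str] = []

    def handle(self, msg: dict) -> None:
        if msg["price"] > self.max_price:
            raise ValueError(f"price {msg['price']} exceeds {self.max_price}")
        self.booked.append(msg)

    def consume(self, msg: dict) -> None:
        try:
            self.handle(msg)
        except Exception as exc:
            # Park the message, page a human, and keep the stream moving.
            self.dead_letters.append(msg)
            self.alerts.append(f"dead letter: {exc}")

    def redrive(self) -> None:
        """Feed parked messages back in; only safe when ordering is irrelevant."""
        parked, self.dead_letters = self.dead_letters, deque()
        for msg in parked:
            self.consume(msg)


p = Processor(max_price=10_000)
for m in [{"id": 1, "price": 300}, {"id": 2, "price": 100_000}, {"id": 3, "price": 450}]:
    p.consume(m)
stuck = [m["id"] for m in p.dead_letters]

p.max_price = 1_000_000   # the "hotfix", then redeploy
p.redrive()
booked_ids = [m["id"] for m in p.booked]
```

Note that booking 2 ends up processed after booking 3, which is exactly why this only works when sequencing does not matter.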

Ben Rady

Yeah, yeah. Yeah.

Matt Godbolt

But obviously, sometimes it can be a legitimate bug or a malformed message or something something like that.

Ben Rady

Yeah.

Matt Godbolt

Yeah, and you have to be able to deal with it. Yeah, because as you say, fault tolerance was a dimension that you talked about. So another dimension for message processing systems is that things go wrong, computers go wrong, and it's entirely reasonable to have more than one person, more than one system, listening to this stream of messages and independently processing them and updating them. And then if the machine breaks,

Ben Rady

Right.

Matt Godbolt

well, you've got two more of them, and that's OK. And then you have to have a system behind that system that determines what the actual outcome of any particular update was. But you've got fault tolerance by scaling through a messaging system. And that's a really interesting solution. And it was part of the solution that we put together at the aforementioned cryptocurrency trading place, which was a really interesting solution for a number of things that we were doing, wasn't it? It allowed us to do rolling updates of the code, because we could have a quorum of five machines doing the same processing and then take two of them out of the system, upgrade them, put them back in again, run them in silent mode, and check that everyone still agreed on everything that was happening.

And then only when we were confident that we hadn't introduced a new bug, we could add them back into the pool and then start rolling over the other three. And there you go. Now you can do rolling upgrades and you're never down. Hooray.

It let us do things like have different configurations of those computers, be it through different JVM settings or different hardware or whatever, such that if one of them processed the message faster than the other, or one of them had to GC, say, or one of them was doing some JIT work or whatever, we could make sure that as long as two or three came up with a good answer that we were happy with, the other two could be slower and that's fine. And that meant that we could hide some of our tail latencies

Ben Rady

Mmhmm, mmhmm, mmhmm. Yeah, yes.

Matt Godbolt

in the quorum. So we got all these wonderful benefits, and obviously, yeah, if we had an equipment failure, then, you know, two or three of those machines could die and the site would stay up and we'd be able to process transactions and everything. And that was super cool. And it was definitely eye-opening to me, working there, in terms of, hey, you get a lot of benefits from doing it this way. That's great.

Ben Rady

Yeah, I had that that same experience. and And we had done some things like that at ah my my previous company when we were basically intentionally creating races between systems because we were trying to get them to run as fast as possible. And it created a an opportunity to to make the system more fault tolerant, where you'd have you know multiple parallel things that are all processing the same stuff. And the first one to finish wins.

And so if there's some variation in the latency because of some, you know, operating-system-level thing, or a garbage collection, because some of this was Java, or something else had happened, or one of them was just offline and was losing every race because it just wasn't processing anything, it was all fine.
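A "first one to finish wins" race can be sketched like this. It is only an illustration under stated assumptions: the class name `FirstWins`, the replica names, and the payloads are hypothetical, and it assumes each replica emits its outputs in order with consecutive sequence numbers, so the first copy of each sequence number is accepted and later copies are dropped.

```python
class FirstWins:
    """Merge duplicate output streams, keeping the first copy of each message."""

    def __init__(self):
        self.next_seq = 0    # next sequence number not yet delivered
        self.delivered = []  # payloads passed downstream, in order

    def offer(self, seq, payload):
        """Return True if this copy won the race for `seq`, False if it lost."""
        if seq < self.next_seq:
            return False  # a faster replica already delivered this one
        if seq > self.next_seq:
            raise ValueError(f"gap: expected {self.next_seq}, got {seq}")
        self.delivered.append(payload)
        self.next_seq += 1
        return True


merged = FirstWins()
# Hypothetical interleaving: replica B briefly overtakes A (say A paused for GC).
offers = [("A", 0, "x"), ("B", 0, "x"), ("B", 1, "y"),
          ("A", 1, "y"), ("A", 2, "z"), ("B", 2, "z")]
winners = [replica for replica, seq, payload in offers if merged.offer(seq, payload)]
print(winners, merged.delivered)
```

A replica that is offline simply never calls `offer`, so it loses every race without affecting what is delivered downstream.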

Matt Godbolt

Yeah.

Ben Rady

Right. I think one of the more interesting things from that is, if you want to be tolerant of certain types of failures, you know, like gamma-ray-burst type stuff where bits just get flipped, then the number of systems that you need to do this is three. The number of counting is three, because you need to have two of them agree. Not one, not two.

Matt Godbolt

Not one. Not two. Five is right out.

Ben Rady

And five is actually kind of fine in this case, but you need at least three, right? Because if you have two and one says the answer is A and the other says the answer is B, you don't know which is right. You need three so that you can compare: okay, two of them say it's A and one of them says it's B, so B is suspect.

Matt Godbolt

[laughs]

Ben Rady

And if you have five and four of them say it's A and one of them says it's B, then that's even better, right? But you need at least three.
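Ben's argument for needing at least three replicas can be sketched as a simple majority vote. This is an illustrative sketch only: the function name `vote` and the replica names are hypothetical, and it assumes that a strict majority of identical answers is taken as correct while any replica that disagrees is flagged as suspect.

```python
from collections import Counter


def vote(answers):
    """Return (winning_answer, suspect_replicas) for one message.

    `answers` maps replica name -> answer. Raises ValueError when no
    answer has a strict majority, e.g. two replicas that disagree.
    """
    counts = Counter(answers.values())
    winner, votes = counts.most_common(1)[0]
    if votes * 2 <= len(answers):
        raise ValueError("no strict majority, cannot tell which answer is right")
    suspects = sorted(r for r, a in answers.items() if a != winner)
    return winner, suspects


three = vote({"r1": "A", "r2": "A", "r3": "B"})
five = vote({"r1": "A", "r2": "A", "r3": "A", "r4": "A", "r5": "B"})
print(three, five)
```

With only two replicas that disagree, `vote` raises, which is exactly the "you don't know which is right" case.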

Matt Godbolt

Time to, yeah.

Ben Rady

Yeah.

Matt Godbolt

So if they're all running the same version of the code, it's time to, yes, start looking through your radiation hardening protocol for what on earth happened, or check `dmesg` for any kind of uncorrectable memory errors and things of that nature. But yeah, I think I've just looked at the time and we've been gabbling, you know, given that we hadn't really got a plan, which, as our regular listener will know, is how we do this.

Ben Rady

Yes. Right.

Matt Godbolt

We've covered quite a lot of ground, although I don't know that we covered our intended topic exactly as I would have done if we'd written something out beforehand, because we went on so many tangents, but in a good way. Like, we talked about time, we talked about durability, we talked about scalability, and all these things come out of a message-based system, or can come out of a message-based system.

Especially if you have this sort of journal-based thing where you say the sequence of messages is the only input into my state machine, and I can trivially start from the beginning of time and get to exactly the same state.

Ben Rady

Mm hmm.

Matt Godbolt

Or we can snapshot, if we know what internal state is important at different points along that time, and have the best of all worlds, which is super cool.
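The journal-plus-snapshot idea can be sketched as below. This is a toy illustration under stated assumptions, not the system from the episode: the message format, the account names, and the functions `apply` and `replay` are all hypothetical. The key property is that the transition function depends only on the current state and the next message, so replaying from the start and resuming from a snapshot reach the same state.

```python
def apply(state, message):
    """Pure transition: the next state depends only on state and message."""
    kind, account, amount = message
    delta = amount if kind == "deposit" else -amount
    new_state = dict(state)
    new_state[account] = new_state.get(account, 0) + delta
    return new_state


def replay(journal, state=None, start=0):
    """Apply journal[start:] on top of `state` (empty state by default)."""
    state = dict(state or {})
    for message in journal[start:]:
        state = apply(state, message)
    return state


# Hypothetical journal: the sequence of messages is the only input.
journal = [("deposit", "alice", 100), ("withdraw", "alice", 30),
           ("deposit", "bob", 50), ("withdraw", "bob", 20)]

full = replay(journal)                      # from the beginning of time
snapshot_at = 2
snapshot = replay(journal[:snapshot_at])    # state saved after two messages
resumed = replay(journal, state=snapshot, start=snapshot_at)
print(full == resumed)
```

A snapshot only needs to record the state and the journal position it corresponds to; everything after that position is replayed as usual.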

Ben Rady

Yeah.

Matt Godbolt

Yeah. So I think by way of saying maybe we should stop here is why I bring all that up. So this has been super cool. And definitely some deep memories from previous companies coming up there.

Ben Rady

Oh, yeah. Bringing back the hard lessons of systems past.

Matt Godbolt

"We make mistakes so you don't have to."

Ben Rady

A new tagline on this podcast.

Matt Godbolt

That is our new tagline, okay. I've definitely made plenty of mistakes, as well you know. I shared on social media a picture of me driving a car through a place where cars shouldn't go, and I got the car wedged. [ https://bsky.app/profile/matt.godbolt.org/post/3ldh76pqffc2z ]

Ben Rady

Uh huh.

Matt Godbolt

Yeah, you have to look at me on

Ben Rady

Three gigabytes' worth of car in your messaging system, and it didn't work.

Matt Godbolt

In my 2.9 gigabyte hard drive, yeah, it didn't work very well anyway.

Ben Rady

Yeah, yeah, yeah, yeah.

Matt Godbolt

Well, I think we should leave it there, my friend. Thank you as ever for joining me in this endeavor of trying to... I don't know what we're doing. Trying to what?

Ben Rady

Yeah, yeah, this was a good one.

Matt Godbolt

Be entertaining and enjoy ourselves and hopefully be interesting and useful to other people too. All right, friend, until next time.

Transcript source: Provided by creator in RSS feed