The Almost Correct System

00:00

Quick note, this episode isn't sponsored. I'm building a new kind of IDE called Rex because existing ones make it hard to work across multiple projects in parallel. I'm sharing it to get feedback from listeners. I'd really love to hear your thoughts. The link is in the description. And now let's move on with today's super interesting episode. Welcome back to the Deep Dive. Today we are tackling a subject that I think keeps a lot of engineers, you know up at night.

00:27

And I want to start with a bit of a provocation. A counterintuitive premise if you will. I love a good provocation. Let's hear it. OK, in modern cloud architectures, perfect code. And I'm talking structurally sound, bug free, beautiful logic, the kind of stuff you'd like. Frame on a wall can still create catastrophic failure. Oh, absolutely. That is not hyperbole. I mean that is a Tuesday in distributed systems. Right. And that's terrifying.

00:54

We all grow up, as, you know, junior engineers thinking if I write the function correctly, the software works. It's a very binary sort of true, false view of the world. It is. But you're telling me that in the systems we're discussing today, production outages rarely happen because of a syntax error or a logic bug inside a single function? Rarely. I mean sure, bugs happen, people make typos they have off by 1 errors they, you know, forget to close a bracket.

01:18

But the really nasty ones. The ones that paid you at 3:00 AM. Exactly the ones that take down an entire platform for hours and make the CTO sweat through their shirt. Those usually happen in the empty space between the services. The space between. It's not about what the code is doing, it's about what the code assumes the rest of the world is doing. And that is exactly our mission.

01:37

For this deep dive. We are taking a whole stack of research architecture guides from AWS, some academic papers on things like metastable failures and Bayesian frameworks, and some really deep analysis on microservice pitfalls. Specifically looking at a paper titled The Hidden Danger of Assumption Drift in Distributed Systems and we are going to use it to well to upgrade your

02:00

brain. We're moving from junior thinking, which is, you know, is my function correct, to senior thinking, which is what assumptions is my system making and when will they inevitably be wrong? Exactly. And honestly, if you are preparing for a high level system design interview or a coding interview for a senior role, this is the secret sauce. This is the stuff that separates the pass from the strong hire.

02:23

I want you to walk out of this deep dive ready to stand at a whiteboard and just blow an interviewer's mind. It's the difference between knowing how to code and knowing how to engineer, and there's a massive, massive gap between those two things. So to guide us through this, we're going to anchor everything on that concept I mentioned a second ago, Assumption drift. Assumption drift.

02:44

It's a great term. It basically describes what happens when different parts of your distributed system invent their own version of reality. I love that phrasing. Inventing their own reality. It sounds like a sci-fi plot, but it's really just bad architecture. It sounds philosophical, but it's completely mechanical. Service A assumes a request takes 2 seconds, Service B assumes it can take 30. Their clocks are different, they drift apart. And in that gap?

03:09

In that gap, failure happens. And not just failure, silent failure. The kind that's almost impossible to debug because every individual part thinks it's doing the right thing. So here is the road map for today. We're going to build this up gradually, piece by piece. We'll start with the foundation layer, Why distributed systems are just fundamentally weird compared to the code running on your laptop.

03:31

Then we'll dive deep into this assumption drift, how it creates these solid killers in your architecture. And finally, we'll arm you with the defensive toolkit we need to talk about patterns like jitter, circuit Breakers, and deadline propagation. Not just what they are, but how to explain them in an interview to show you actually know how the sausage is made. Let's do it. OK, let's start with the weirdness.

03:52

You brought a classic analogy to the table from some of the AWS source material, Pac-Man. Yes, the Pac-Man analogy. It's perfect for explaining the leap from local to distributed execution. It really crystallizes the problem. Walk us through it. Why is Pac-Man relevant to, you know, modern cloud architecture? OK, so imagine you're writing the code for the original Pac-Man game. It's the 1980s. It runs on a single arcade machine.

04:19

You have a line of code, something like board dot find Pac-Man. Simple enough. Find the yellow guy on the grid. Right, and what happens? The CPU looks at the memory, finds the coordinates and returns them. It's nearly instantaneous nanoseconds, but more importantly, the code shares fate with the machine. Fate sharing That sounds dramatic. It is dramatic. It's a critical concept. It means if the power cord gets

04:42

pulled, the whole thing dies. The board, the ghosts, Pac-Man, the code looking for Pac-Man, it all goes dark together. So there's no situation where just one part fails. Exactly. There's no scenario where the fine function keeps running, but the board itself has, you know, vaporized. They live together, they die together. It is a single, cohesive universe. One brain, one body. OK, that makes perfect sense. It's all one machine. Nicely. Now let's teleport to today.

05:09

You're building Cloud Pac-Man. The board is on a server in a data center in Virginia. The user is on a phone in London. The game logic is running on a server in Tokyo. Now you run that same line of code board dot find Pac-Man. Syntactically it looks exactly the same in my editor. It looks the same in your IDE, but under the herd it has just exploded in complexity. That single function call isn't a simple memory look up anymore,

05:36

it's a network request. And network requests don't just work or fail, they go on a journey. A perilous journey, it sounds like. Very. In the AWS Architecture guide they break this down into what I like to call the 8 failure modes of the apocalypse. And for an interview you need to know these. Not just memorize them, but feel them in your bones. OK, let's run through these because I think people, myself included, really underestimate how many physical steps are actually involved in just

06:02

sending a message. OK, so let's trace the packet. Step one POST request. You, the client, try to put the message onto the network. And that can fail right out of the gate. Immediately your network card, your Nic could be fried. The local router could be down, the operating system kernel could be out of memory for network buffers. You haven't even left your own machine yet, and you've already failed. OK, so failure mode one is we trip over our own shoelaces

06:28

before leaving the. House, you got it. Step 2 deliver request. So the message is on the wire. It's traveling under the ocean in a fiber optic cable, but it has to actually get to the server. And things can happen on the way. A lot of things. A backhoe could cut that fiber optic cable, which happens surprisingly often. Yeah, a cosmic ray could flip a bit in transit.

06:47

Or, and this is a subtle 1, the server could crash literally milliseconds after receiving the packet but before the operating system hands it to your application. So the message arrives, but the person who is supposed to sign for it is dead on arrival. Perfect analogy. OK, Step 3. Validate request. The server application wakes up, it gets the message and it looks at it and says. I have no idea what this is. A version mismatch or something? Exactly.

07:11

Maybe you upgraded the client to send Jason, but the server is still expecting XML. That's a valid message, just not one it understands rejected. OK, so we're three failure modes deep and we haven't even started to do any actual work yet. Not a thing. Yeah, now Step 4, update server state. This is the first time the server tries to actually do the thing you asked. Move Pac-Man. But maybe the database is locked, Maybe the disk is full. The application logic itself

07:37

fails. OK, but let's say it works. Pac-Man moves. We're good, right? No, you're only halfway there. Now the server has to tell you that it worked. This whole thing has to happen in reverse. Oh man, OK. Step 5. Post reply The server's network card might die just as it's about to send the success message you're. Kidding me? That's just cruel. It happens. Step 6. Deliver reply. The network goes down on the return trip. A shark bites the undersea cable on the way back.

08:04

Your reply is lost at sea. This is brutal. It's a miracle anything works at all. It really is step 7. Validate reply. The client gets the success message back, but it's garbled. Or maybe the client app crashed and rebooted and doesn't even remember asking the question in the first place. So it gets an answer to a question it forgot it asked. Right. And finally step 8, update client state. The client gets the success message but then fails to actually process it and update

08:31

the screen. The logic is done, but the user never sees it. So that one innocent looking line of code board dot find Pac-Man is actually a gauntlet of eight discrete death. Traps. Precisely, And here is the crucial insight for the interview. This is the thing that trips up almost everyone who hasn't really lived this pain. Lay it on us. The unknown state. The unknown state. This is different from true or false. Fundamentally different in local code. A function returns true, false

08:58

or it throws an error 3 states. In distributed systems there is a fourth, and it is the most important one, unknown. Explain that in practical terms. Give me an example. Let's say you're building a banking app. You make a request to transfer $100 from your checking to your savings. You send the request, you wait, and nothing. The connection just times out. OK, so it failed. I'll try again. Did it. Well, I didn't get a confirmation, I didn't get a success message, so I assume it

09:24

failed. And that assumption is one of the most dangerous assumptions in software engineering. Look at our list of failure modes. Maybe the request got there, the bank moved the money, but the reply got lost. That was failure mode 6. So the money is gone from my checking account. Right. Or maybe the request never got there at all. That was failure mode too. In that case, the money is still there. You as a client are sitting there holding a time out exception. You do not know.

09:52

You cannot know if the money was transferred or not. That is incredibly uncomfortable. Humans hate it. We are wired for binary outcomes. We want to know, did it happen? And the system is just shrugging its shoulders and saying maybe. So if I'm in an interview and the interviewer asks what happens if this request to the payment service times out, the wrong answer is we assume it failed and retry. That is a fatal error in the interview.

10:19

If you assume it failed and you retry, you might transfer the money twice. If you assume it succeeded and you don't retry, you might never transfer the money. You have to design for the unknown. And this all comes back to that concept you mentioned earlier, independent failure, yes. Unlike the local Pac-Man machine where the CPU and the memory share fate, in a distributed system, the network can be perfectly fine while the server dies. The server can be perfectly fine

10:44

while the network dies. They fail independently of each other. So we've established that the environment is hostile, the network is unreliable, and we can't trust timeouts to tell us the actual truth of what happened. Correct, that's the foundation. Now let's talk about how we as engineers make this whole situation so much worse for ourselves. Let's talk about assumption drift.

11:04

We have this source, The Hidden Danger of assumption drift, and it introduces the idea of the almost correct system. The almost correct system. It's actually far more dangerous than a blatantly broken system. How so? That seems backwards. Well, if a system is broken, like it crashes on startup, you fix it. It fails loud. Failing loud is a feature. It's obvious there's a problem. Right, you get an alert, you find the bug, you push a fix. An almost correct system is

11:34

different. It passes all the unit tests. It works perfectly in staging, it works when you demo it to your boss. But it has these hidden latent assumptions that only drift apart and cause a catastrophe under a very specific kind of load. And the classic example of this which the source walks through is a time out chain disaster. This is a distributed systems Horror Story, and it's a true story that happens all the time.

11:57

Let's imagine a simple chain. You have a user, we'll call them the client, a load balancer in the middle and a back end service doing the work. OK, a standard 3 tier setup. Now let's look at their time out configurations. These might have been set by different teams at different times and they all look reasonable individually. The client, say a mobile app, wants a snappy experience, so it's developer sets a time out of two seconds. It makes sense. I don't want to wait forever for

12:23

a page to load. If it takes longer than two seconds, I'm probably going to refresh or just leave the app. Exactly. The load balancer, which is a piece of infrastructure, needs to be a bit more patient to manage traffic, so it has a timeout of five seconds. OK, still seems fine. And the back end service which does the heavy lifting. Maybe it's a complex database query has a time out of 30 seconds to make sure it always finishes its jobs, even the big ones.

12:47

So we have two seconds on the client, 5 on the load balancer and 30 on the back end. Individually these numbers seem perfectly fine. The back end needs time to crunch numbers. The client needs to be fast. But watch what happens when you put them together. The client sends a request, it goes through the load balancer, hits the back end, and the back end starts processing. But let's say the back end is having a slow day. Maybe there's a called cache, Maybe the database is under

13:11

load. It's going to take 10 seconds to do the work. OK. So it's going to be slow. Now at the two second mark, what does the client do? The client times out, it gives up. In its world, the operation has failed. Right, the client says this didn't work. And typically, what does a well behaved client application do when something fails? It retries. It retries. It sends a second request for the exact same thing. But here is the critical moment of drift.

13:37

Does the back end know the client has given up and left? No, of course not. The back end is still working on the first request. It has a 32nd timeout. It thinks it has 28 seconds left to finish the job. Exactly. So the back end is chugging along processing request one. Suddenly request 2 arise from the same user. Now the back end is processing both requests. But the client is only listening for the answer to the second one.

14:03

The first one is an orphan. It's an orphan, it's a ghost, it's doing work for nobody, and if the back end is slow because it's overloaded, we just doubled the load on it. And if that second request also takes more than two seconds. The client retries again. Request 3 arrives. Now the back end is doing triple the work. The system isn't just failing, it's actively tearing itself apart in a feedback loop. And the crazy part is that the code in the back end is technically correct.

14:29

It's dutifully processing every request it received. It is. It's just processing a request that nobody wants anymore. This is assumption drift in its purest form. The client assumes if I timed out, the request is dead and gone. The back end assumes if I'm processing a request, someone is patiently waiting for the answer. Both are totally wrong. Their realities have drifted apart. The client lives in a 2 second reality. The back end lives in a 32nd

14:55

reality. They're not in the same universe anymore. And in that void between universes, your system dies. So how do we fix this? What's the senior engineer move here at the whiteboard? The big shift is moving from these implicit contracts, I guess it will finish quickly to explicit ones. The first step is what you could call the alignment rule, which is generally you want your timeouts to increase as you go down the stack. The client should be the most

15:19

patient. If the client is willing to wait 30 seconds, then the load balancer should be willing to wait say 29 seconds and the back end 28. The system gives up from the back, not from the. Front so you shed load at the source, not at the client. Exactly. Yeah, but the gold standard, the thing that really shows you know what you're talking about, is deadline propagation. Deadline propagation? That sounds fancy. It's incredibly elegant.

15:43

Instead of every service having its own static timeout number like 30 seconds, the first service in the chain, the client or the front end said I have a total budget of 2 seconds for this entire operation. Then passes that deadline down the chain, usually in a request header. So if the load balancer takes 5 seconds to do its part, it looks at the budget and tells the back end you have 15 seconds left go.

16:05

Recisely and if the back end gets a request and sees that it only has .1 seconds left in the budget, but it knows the job takes at least a second, it doesn't even start, it fails fast. It immediately returns an error saying I can't possibly do this in time. That saves so much wasted work. The ghost work. It saves the system, it aligns all the assumptions. Everyone is looking at the exact same clock. It stops the ghost work, the work being done for clients who

16:31

have already left the building. OK, that covers the timing assumption, but you mentioned that the client retries and it seems like retries are a huge double edged sword. Oh, retries are selfish. That's the phrasing from the Amazon Builders Library, and I absolutely love it. Selfish. How is code selfish? Think about it from the server's perspective. When you retry, you're demanding more resources from a server that is likely already struggling.

16:56

You're essentially saying I know you're busy or slow or failing, but stop what you're doing and process my request again right now. It's like shouting at a waiter in a slammed restaurant who is already dropping plates. It doesn't help. It's the exact same dynamic, and this leads to these catastrophic events called retry storms. If you have a deep architecture service, A calls B which calls C which calls the database, and they all have a default retry 3 times policy. You can do the math.

17:25

Oh boy, I think I see where this is going. If the database at the bottom slows down, service C retries its call to the database three times. OK. But from service B's perspective, it's call to service C is just slow, so it retries its call to service C3 times, and each of those calls will trigger three database retries. That's nine database queries now. And service A sees that service B is slow, so it retries its call to service B three times. So you have 3 * 3 * 3.

17:51

That's 27 queries to the database for what was originally a single user action. If you have 5 layers of services all retrying 3 times, the load on the database increases by a factor of 243. You turn a small hiccup into a self-inflicted distributed denial of service attack. A complete meltdown. You've amplified a tiny problem into a catastrophic failure. So do we just not retry? Is that the answer? That seems wrong too. No, we have to retry. Networks are flaky.

18:17

We all know that from the 8 failure modes. We need to handle transient failures. A dropped packet shouldn't kill the whole operation, but we need to do it defensively. We need to do it with back off and jitter. OK, let's unpack these. Back off is pretty standard. I think most people have heard of that. Exponential back off. It's the most common type. Instead of retrying immediately, you wait one second. If that fails, you wait 2

18:39

seconds. If that fails, you wait 4 seconds, then 8, and so on. You give the server breathing room. You're being polite instead of selfish. But the AWS source makes a really big deal about jitter. Why isn't exponential back off enough on its own? Because of synchronization, this is a subtle but critical failure mode. Imagine a popular mobile app and the server it talks to reboots. So you have thousands of clients all at the exact same time getting a connection failure at 12.000000.

19:09

OK, they all fail simultaneously. They all have the same back off logic so they all back off for one second. What happens at 12.000001? They all hit the server again at the exact same time. Whim. They all fail again. They all back off for 2 seconds. What happens at 12.00 point 033? You get these perfectly synchronized spikes of traffic. It looks like a metronome. Tick slam, Tick slam.

19:32

And the server can never recover because it just gets hammered in these synchronized waves it can't get its feet under. It never jitter is the solution. Jitter adds randomness. Instead of waiting exactly one second, you wait one second, plus or minus a few 100 milliseconds. Or a better strategy is to pick a random number between zero and your current exponential back off cap. So instead of a metronome. You want rain.

19:53

You want the request to patter against the server randomly, spreading the load out over time instead of concentrating into these sharp spikes. I love that visual metronome versus rain. That's a great way to explain it in an interview. It is, and if you are designing a system in an interview and you mentioned retries without also mentioning jitter, you are setting yourself up for failure.

20:13

The interviewer will look at you and ask what about the thundering herd problem and you'll have to backtrack. It shows a gap in your knowledge. Noted. OK, there's another huge piece to this puzzle. If we are retrying requests, we run the risk of doing things twice, like that bank transfer we talked about at the beginning. This is where we have to talk about idem potency. It's a fun word to say idem potency.

20:34

And it basically just means doing something more than once has the same effect as doing it just once. Correct. The mathematical definition is FS, FX, inquest FX. In a distributed system. If you have retries, which we've established you absolutely must, then your operations must be idempotent. There's no way around it. But not all operations are naturally idempotent, right? Like adding an item to a shopping cart. Is. No, it's not, So we have to categorize them.

21:01

First, there's natural idempotency. This is the easy stuff. An operation like set user status to active. You can call that 10 times in a row, the status is still active. That's safe to retry. Then there's what you might call business idempotency. This relies on a natural key in your business domain. The classic example is create customer e-mail. The e-mail address is the unique key.

21:22

If I try to create a customer for bob@example.com twice, the second one should either fail with a user already exists error or just return the existing record. The system state doesn't change incorrectly. OK, that makes sense. You rely on the uniqueness of the data itself. But the hard one, and the one that always comes up in system design interviews, is operations that are inherently not idemitant. The classic one is charge this credit card $50.

21:46

Right. The credit card company is more than happy to let me charge a card $50 twice. That's a perfectly valid sequence of operations. Exactly. So you can't rely on the data itself for these cases. You need to introduce an artificial key, a unique ID or an idem potency key. How does that work? When the client decides to make the payment, it first generates a unique ID, usually UID. Then it sends the request saying charge $50 and use this

22:09

transaction ID 123ABC456. And the server has to remember that it's seen 123 ABC 456 before. Yes, the server stores that idem potency key, maybe in a dedicated table or right alongside the transaction record for some period of time. If it receives a request with that same ID again, it doesn't try to charge the card again. It just looks up the result of the first attempt and says oh I already did this, here is the receipt from the first time ah.

22:36

So that solves the unknown failure mode perfectly. If I get a time out on my payment request, I just retry with the exact same ID. Exactly. If the first one failed before anything happened, the second one goes through. If the first one actually succeeded but the reply was lost, the second one doesn't double charge, it just returns the safe success message. It handles the ambiguity perfectly.

22:56

So the So what for the interview is if you design a payment system or an ordering system or anything transactional and you don't explicitly mention idem potency keys alongside your retry strategy with back off and jitter, you're essentially designing a system that will either lose or steal money. You fail the interview, plain and simple. You cannot build reliable distributed systems without these fundamental patterns. You're building a casino, not a bank. OK, we're getting into the

23:24

really deep water now. We've covered the basics of defensive engineering. Now I want to talk about something even scarier. Metastability. Metastability. This is one of my favorite topics because it's so non intuitive. It comes from a fantastic paper called Metastable Failures in Distributed Systems and it describes a specific type of nightmare scenario. The nightmare that doesn't end when you wake up. Pretty much, yeah. We often think of systems as being in one of two states, up

23:54

or down. Metastability introduces a terrifying third state that's, well, kind of both in neither. And the definitions from the paper here are stable, vulnerable and metastable. Correct. Stable is the happy state. Normal operation, traffic is flowing, latency is low, everything is fine. OK. Business as usual. Vulnerable is the interesting one. The system looks fine from the outside. The dashboards are green. It's serving traffic, but it has

24:20

lost its safety margin. It's like an airplane that has lost one of its engines but is still flying. It's walking on the edge of a Cliff, but it hasn't fallen yet. Then comes the trigger. It's often something small, a momentary network blip, a minor traffic spike, a single server rebooting, and that little push sends the system over the edge into them in a stable state. And this is where it gets really

24:44

weird. In a normal failure, you remove the trigger, you fix the bug, or the traffic spike ends and the system recovers and goes back to normal. But in amid a stable failure, removing the trigger does not fix the system. The system stays down. It gets stuck. Why? What's holding it down? It's held down by the sustaining effect. There's some kind of feedback loop in the system's own recovery or error handling logic that keeps it pinned in the failed state.

25:10

The analogy I liked from the source was getting stuck in a hole. Yes, that's the perfect way to think about it. Tripping on a small rock is the trigger. It causes you to fall into a deep hole, but once you're at the bottom of the hole, the rock doesn't matter anymore. You can remove the rock, but you're still in the hole. The depth of the hole is the sustaining effect. Let's make this concrete. The look aside cache failure is the absolute classic example of

25:32

this. This has happened to basically every large tech company at some point. You have a standard setup, a web app, a cache like Redis and a database. Very standard. In the stable state, the cache is warm and has a 90% hit rate. The app is handling say 3000 requests per second or QPS with a 90% ten hit rate. 2700 of those requests are served instantly by the fast cache. Only 300 QPS actually hit the slow database. And let's say the database can handle 500 QPS. It's happy.

26:01

It's not even breaking a sweat. OK, the system is stable, but it's hitting capacity limit is 500 QPS at the database layer. It's vulnerable. Exactly it's vulnerable. Now the trigger, the cache server cluster has to be rebooted, maybe for a security patch or an upgrade. For a few moments the cache is completely empty, so. The cache hit rate drops from 90%. Then what happens to the 3000 QPS from the app? They all go straight to the database.

26:25

All 3000 QPS hit the database. The database was built to handle 500. It's now getting 6 times its capacity. It immediately overloads. CPU goes to 100%, connections are refused, queries start timing out. The database melts. Now here is the metastable trap. The cache servers are back online. Now they're up and running, ready to be filled. How do you refill a Lookaside cache? You you have to read the data from the database and then you write it into the cache.

26:53

But you can't read from the database. It's overloaded, it's on fire. Every request to it is timing out. Oh no. So the cache stays empty. The cache stays empty. Because the cache is empty, all 3000 QPS keep hitting the database. Because the traffic keeps hitting the database, the database stays overloaded and down. Because the database is down, you can't refill the cache. It's a perfect vicious. Loop. It's a death spiral. Yeah, even though the original trigger, the cache reboot is

27:20

long gone, the system is stuck. The sustaining effect is the application's own logic for refilling the cache, which now acts as a weapon that keeps the database down. And the system was vulnerable the whole time because it's secretly relied on that 90 cash hit rate to survive. It had hidden capacity. You thought your system could handle 3000 Q, but in reality your database, the hard limit, could only handle 500. The cache was just masking that

27:46

vulnerability. That is genuinely terrifying. How do you even get out of that state? It's very painful. You usually have to manually intervene and stop all traffic at the edge at the load balancer. Like turn off the website for everyone and let the database recover. Then very slowly run scripts to warm up the cache by hand, then cautiously let a small percentage of users back in.

28:06

It's a major, major outage. Another aspect of this sustaining effect that the paper mentions is work amplification. Yes, this is another one of those super counterintuitive things. We often assume that handling an error is cheap. You know you have a tried out catch block a try to catch return error. Right, it seems like the catch block does less work, it's just creating an error message. How hard can that be? But in many complex systems, the error path is actually more

28:31

expensive than the success path. How is that possible? Think about everything that might happen when you catch an important exception. You might capture a full stack trace. That's CPU intensive. You might write a huge detailed log entry to a file on disk. That's IO. You might do Adns look up to log the client's host name for debugging. That's a blocking network call. You might send an alert to a

28:54

monitoring system. Oh wow, so if the system is overloaded and it starts throwing a lot of errors. The very act of handling those errors consumes more CPU, more IO, and more network resources than the success path would have, which makes the overload worse, which causes more errors. It's another one of these feedback loops. Another loop, to be a senior engineer you have to look at your air handling code and ask is this cheap, is this lean?

29:19

Or am I accidentally building a mechanism that will DDoS myself with stack traces when things go wrong? This is incredibly heavy stuff. It really really shifts the perspective from does my code work to how does my system fail. How does it fail and can it recover on its own? That is the only question that matters. OK, let's talk about the advanced patterns. We've identified these horrible problems, retry storms, metastable loops. What are the heavy duty tools we

29:47

use to fight back? The first one in the toolkit, and it's a big one, is the circuit breaker. Wikipedia defines this pretty clearly. It's a direct analogy to the electrical fuse or circuit breaker in your house. It's the exact same principle. If you plug too many things into an outlet, the wiring gets too hot and the breaker trips to cut the power and save the house from burning down. In software.

30:09

If a downstream service is failing too often, we trip the circuit breaker to protect it. What does that look like in code? What are the mechanics? It's a simple state machine. There are three states. The normal state is closed. This means the circuit is complete electricity or in our case network traffic is flowing. OK, that's the happy path. Then if the breaker detects too many failures, say more than 50% of requests are failing in the last 10 seconds, it transitions to the open state.

30:36

This means the wire is cut when a new request comes in. The circuit breaker doesn't even try to call the downstream service. It fails it immediately with an error. And this directly prevents the retry storm we talked about. It gives the struggling downstream service a break some breathing room to recover. Exactly. It stops the bleeding, but you can't stay open forever. You need a way to know when it's safe to close the circuit again. That's the third state half open. The scout.

31:01

The scout. After a configured timeout, say 30 seconds, the breaker let's one single request through a probe. If that single request succeeds, the breaker assumes the downstream service has recovered and it moves back to the closed state. Everything is back to normal. And if that scout request fails. It goes right back to the open state and starts the time out timer again. It's a really elegant little state machine, but it directly solves that sustaining effect of

31:26

the database overload. It does, but here is the nuance, the senior interview detail that separates the candidates. It's the mini circuit breaker problem, or the problem of granularity. What's that? Imagine you have a service that calls a database that is sharded into say, 100 shards. Shards A through Y are perfectly healthy, but Shard Z is on a faulty machine and is melting down. OK, a partial failure.

31:51

If you have one big global circuit breaker for the entire database dependency and Shard Z starts failing, what happens? You trip the main breaker. Now nobody can talk to the healthy shards A through Y either. You've taken down the whole system because of one small partial failure. So you need to have a circuit breaker per Shard. Or partition. Or per instance you're talking to. The granularity of your circuit breaker matters immensely.

32:14

If you make it too broad, you create unnecessary outages. If you make it too narrow, say a breaker for every single customer ID, you might use too much memory tracking the state of all those Breakers. It's always a trade off. Always, yeah. Discussing that trade off is what gets you the job. Now let's go even deeper down the rabbit hole. We have a paper here from Google Research on D2, TCP and deadline awareness. This feels like the absolute cutting edge.

32:40

This is for when you're designing extremely high performance, low latency systems. Think Google search, high frequency trading or ad bidding platforms. And the specific problem they describe in the paper is the old DI problem, which stands for Online Data Intensive Applications. Right, and the classic example is Google search. You type a query. A query hits a root server. That root server then fans out your query to maybe the 1000

33:06

leaf servers. Each of those servers is responsible for searching its own little chunk of the Internet. And they all find their results and reply. And they all reply at roughly the same time. Which creates a traffic jam. The paper calls this in cast congestion, a fan in burst. Exactly. You have 1000 responses hitting the same network switch all at once. The switch's memory buffers fill up instantly, packets start

33:27

getting dropped. And standard TCP, the protocol that the entire Internet has run on for decades. How does it handle this packet loss? Poorly for this use case. TCP is designed for one thing above all else, fair share. It tries to be fair to everyone using the pipe. When it detects packet loss, its algorithm tells everyone to slow down their sending rate, usually by cutting it in half. But in this scenario, not everyone is equal. Some responses might be more

33:53

important than others. It's not about importance, it's about time. In an ODI app you have a hard deadline. We must return search results to the user in 200 milliseconds. If a packet containing a piece of a search result is going to arrive at 250 meters, it is useless. The root server has already given up and set back the results it has.

34:12

That late packet is garbage. So standard TCP might slow down a packet that only has 10 milliseconds left on its clock, causing it to miss the deadline while giving that bandwidth to a packet that has plenty of time to spare. Yes. TCP has no concept of time. It only knows about bytes and fairness. It will happily make everyone late in the name of being fair. So what does D2 TCP do differently?

34:33

How does it fix this? It adds deadline awareness directly into the congestion control algorithm at the kernel level. It uses a mathematical trick they call gamma correction. Gamma correction sounds like something from video editing. It's a math term, but the intuition is actually pretty simple. It modifies TCP's back off behavior based on how close a flow is to its deadline. How does it decide? How does it know the deadline? The application tells it.

35:00

The application says this socket connection is for an operation that must complete by this time stamp. DGTCP then knows the time budget for every packet. And what does it do with that knowledge? If a flow is very near its deadline, it's in danger of being late. DDTCPS Gamma correction tells it do not back off. Be aggressive, push through. You need to finish now. And if a flow has a far off deadline, plenty of time. It tells that flow to back off

35:27

very aggressively. You have time, Get out of the way and let the urgent traffic pass. It's prioritizing traffic based on time remaining. Exactly. It moves the entire network, stacks philosophy from fairness to urgency. That is a profound shift in thinking, and for a system design interview, mentioning that standard TCP itself might be the bottleneck because it ignores deadlines, that's a serious differentiator. It shows you understand the stack all the way down to the transport layer.

35:55

Most engineers stop thinking at the load balancer. This shows you think deeper. OK, we're rounding the final corner here. We've talked about how to design these complex systems with circuit Breakers and deadline propagation. The last part is about verification. How do we know it all actually works? This brings us to a fascinating paper called the Zebra Conf. The big headline here is that most complex failures aren't caused by code bugs anymore, they're caused by configuration errors.

36:23

Heterogeneous configurations. Right, In a massive distributed system, not every server is running the exact same configuration. You might be in the middle of rolling out a change. You might have different hardware types that need different tuning parameters. And the paper points out this insidious problem where parameter A masks a fault in parameter B. Yes, it's a combinatorial nightmare.

36:43

Maybe you have a retry limit of three on one group of servers and five on another, and that specific interaction accidentally hides a bug in your timeout logic. System works until the day you decide to make all the retry limits 5 and suddenly the whole system falls over because the bug in the timeout logic is now exposed. So for the interview, the point is not just to design the code, but to design the configuration management process itself.

37:08

But don't just say we put configs in a YAML file. That's not an answer. How do you test config changes? How do you verify that a change doesn't violate your system score assumptions? How do you prevent config drift from killing you silently over months? And they mentioned a technique called Bayesian risk refinement. That's a very fancy mouthful, but the concept behind it is we can't possibly test every combination of configurations.

37:31

The search space is infinite, so we use probability Bayesian logic to guide our testing. We build a model of our system and use it to find the riskiest configuration changes. We focus our limited testing resources on the areas where we have the least confidence or where an error would be most catastrophic. It's intelligent risk based testing rather than just brute force exhaustive testing. Exactly, and it's how modern large scale systems are actually

37:55

managed. All right, we have covered a massive amount of ground today from Pac-Man to gamma corrected TCP. Let's bring it all together. Let's create the Senior Engineers interview checklist. OK, if you were walking into that interview room or logging onto that Zoom call for a system design round, here is your mental CHEAT SHEET. Let's hear it, Item 1. Assume failure. Don't design a happy path.

38:18

First start with the failures. Ask yourself what happens if the network dies after the request is processed but before the reply is received. If you can answer that, you're on the right track. Always handle the unknown state. Item 2. Check assumptions. Explicitly state your assumptions about time load and data. Are your timeouts aligned? Remember, upstream timeouts should generally be longer than downstream. Or better yet, use deadline propagation. Don't let your system's reality

38:46

drift apart. Item 3. Control flow. Don't just let request fly blindly into the system. You are the engineer, you control the flow. Use exponential, back off. Use jitter to turn a metronome into rain. Use circuit Breakers to give services a chance to recover. Control the storm before it starts. Item 4. Consistency. If you retry, and you will, your operations must be idem potent.

39:09

State how you will achieve this. Use natural keys, business keys, or for the hardest problems, explicitly pass a unique idem potency key. Do not design a system that can charge a credit card twice. And finally item 5. Meta stability. This is the advanced one. Ask the big question. Does my recovery mechanism act as a denial of service attack on myself? Do my retries create a retry storm? Does my cash warming logic keep my database down? Look for those deadly feedback loops.

39:35

That is a ridiculously powerful list. It's the difference between saying I hope this works and I know how this will break and I've designed it to recover gracefully. I want to leave our listeners with one final thought. We started this whole deep dive with the idea that perfect code can fail. And I think we've seen exactly why that's true. There's a quote I love that I think sums this all up. Great software isn't built by eliminating bugs, it's built by

40:01

eliminating surprises. I love that. That's perfect. The bugs will always happen. The cosmic rays will flip the bits. The backhoe will cut the fiber optic cable. Those are just facts of life in this field. They are not surprises. But the surprise The system collapsing into a heat because of a simple retry storm or getting stuck in a metastoble loop for hours. That is what we as engineers can and must eliminate through good

40:26

design. The code inside the brackets, the if statements, the for loops. In many ways, that's the easy part. The void outside the brackets, the empty space where the messages travel and the assumptions live. That is where the senior engineer lives. And that is where you need to live to ace that interview. Good luck, you've got this. Thanks for listening to the deep dive. We'll see you next time.

Transcript source: Provided by creator in RSS feed: download file

Episode description

Transcript