Manual Testing and Observability

Matt Godbolt

00:19

Hey Ben, the last few episodes we've been talking about testing and it occurs to me that we're leaving a lot undiscussed because I think as we've said before, my intro to testing was all about handing off a video game, partially written, to some poor person who was going to be sitting and playing it for four hours while videotaping in case it went wrong, which is not ideal, but it's all I had for the first sort of decade. I know that you have some opinions about this.

Ben Rady

00:47

Well, so there's nothing immediately or inherently wrong with that. I think that the key thing - and we've talked about this a few times - is finding a way to get some confidence, right? Like confidence that your code works. Confidence means the feeling that you're ready to move forward with whatever the next step is. Right? And so there's lots of different ways to get there. And I actually think one of the most essential ways is being able to do manual testing. If you can't put yourself in the seat, in the shoes of the user who's going to be using your software, and use it exactly as they do, you're never really going to be completely sure, what I would maybe refer to as Portland sure, and I'll explain what that means in a minute, that the software actually works. You can write all the unit tests you want, you can run all the integration tests you want, and those will create confidence, but they won't create surety. There's always the chance that you missed a test, that there's something that you didn't understand that doesn't work the way you think it works.

Matt Godbolt

01:51

That's often one of the criticisms of writing tests is that, well, it's never going to catch everything. So why even try?

Ben Rady

01:59

"Why even try" is the place where that falls down.

Matt Godbolt

02:01

Right. I know, but yeah. It's like, you know, why wear a seatbelt in your car, when if you crash then you're probably, yeah.... I don't really subscribe to that viewpoint.

Ben Rady

02:13

If I'm going to get hit by a semi-truck, what good is the, uh, seatbelt gonna do? I guess, I don't know.

Matt Godbolt

02:18

I suppose that's more it.

Ben Rady

02:18

Yeah. Yeah. But yeah, there's definitely that line of reasoning out there, but it's not an either-or, right? You can do all of these things. You don't have to necessarily choose to only do one or only do the other. Just like I would never put myself in a situation where I was only writing unit tests, because that's just not going to get you to the point where you can be confident in all situations. And it's not going to get you the level of certainty that you need to do things where there's millions of dollars on the line or lives on the line or things like that, where you have to be very, very sure that things are gonna work. As opposed to, well, you know, this is a web app, we're going to deploy it, and if there's a bug, okay, I'll just deploy it again. That's fine. Right.

Matt Godbolt

02:58

There's a cost with there being a mistake. And sometimes that's something that you can sustain as sort of part of your process. It's like, well, we're 95% sure. And that's certainly good enough for me to push an update to one of my hobby websites. But it's certainly not enough for me to turn on a new trading system that is going to lose me millions of dollars potentially, you know, 95% is not good enough for my day job.

Ben Rady

03:22

Exactly, exactly. And so building these systems that create confidence, whether it's unit tests or a testing environment or manual tests or all these other things, it depends on all these factors. You know, Alistair Cockburn was a guy that actually tried to quantify this with his Crystal system. He had all these different dimensions of cost and scale, and yeah, he was way into this. Um, and I don't know that you need to necessarily take it that far. I mean, you know, it's an interesting exercise anyway, and certainly Alistair's done a lot of work on that front. But we've talked a few times about how confidence is this feeling that you need to move forward. Right. And that depends on a lot of things. And one of the things that it absolutely depends on is the cost of failure, right? So one way that you can make it easier to move forward is not by writing more tests, but by reducing the cost of failure, right? Like structuring things such that if something breaks, it's fine, no one will notice.

Matt Godbolt

04:17

So things like red/green deploys, or blue/green deploy type things, where you can start rolling it out and, you know, I'll notice because I have good monitoring. So I know if things are going awry, like a sustained 0.5% error rate, before I decide, yeah, I'm going to roll this back and we'll lick our wounds and see what we did wrong. As long as that's acceptable to your business or your use case or whatever.
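As an illustration of the "sustained error rate" check mentioned here, the following is a minimal sketch, not anything described on the show. The function name, the idea of fixed monitoring windows, and the choice of three consecutive windows are all assumptions; only the 0.5% figure comes from the conversation.

```python
from typing import Sequence, Tuple


def sustained_error_rate(
    windows: Sequence[Tuple[int, int]],
    threshold: float = 0.005,   # the 0.5% figure from the conversation
    sustained_windows: int = 3,  # assumption: how many windows count as "sustained"
) -> bool:
    """Return True when each of the last `sustained_windows` windows exceeds `threshold`.

    Each window is (error_count, request_count) for one monitoring interval.
    """
    if len(windows) < sustained_windows:
        return False  # not enough evidence yet to call it sustained
    recent = windows[-sustained_windows:]
    return all(
        requests > 0 and errors / requests > threshold
        for errors, requests in recent
    )
```

A single bad window does not trigger a rollback here; only a run of bad windows does, which is one way to read "sustained".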

Ben Rady

04:39

Right, exactly. Or building systems that fail fast, right. Where it's like doing nothing is fine. Right. The failure mode of like the system does nothing is completely fine. The thing that's unacceptable is doing the wrong thing. And so if we ever do anything that even smells like the wrong thing, everything just shuts down and turns off and we'll figure out what happened after that. You know, it just depends on the context and the kind of systems that you're building, but it's very difficult to get to a state where you are completely sure that something works, where you are "Portland-sure". What does Portland-sure mean?
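A fail-fast system of the kind described here could be sketched as a guard that latches into a halted state the first time it sees anything suspicious. This is only a sketch: the class names, the order-size check, and the limit are hypothetical stand-ins for whatever "smells like the wrong thing" in a real system.

```python
class SystemHalted(RuntimeError):
    """Raised for every action attempted once the guard has tripped."""


class FailFastGuard:
    """Hypothetical guard: doing nothing is fine, doing the wrong thing is not."""

    def __init__(self, max_order_size: int) -> None:
        self.max_order_size = max_order_size
        self.halted_reason = None  # stays None until something looks wrong

    def check_order(self, size: int) -> int:
        # Once halted, refuse everything until a human has investigated.
        if self.halted_reason is not None:
            raise SystemHalted(self.halted_reason)
        # Anything that looks wrong trips the guard permanently.
        if size <= 0 or size > self.max_order_size:
            self.halted_reason = f"suspicious order size {size}"
            raise SystemHalted(self.halted_reason)
        return size
```

The design choice is that the guard never recovers on its own: after one suspicious action, even valid actions are refused.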

Matt Godbolt

05:14

What does Portland-sure mean?

Ben Rady

05:14

"What does Portland-sure mean", he asked rhetorically. So I've never been to the city of Portland. It's a wonderful place from what I hear, I actually have a few friends that live there. Um, but I've never been there. So I don't actually know for sure that Portland, Oregon exists. I'm very confident that it exists. It shows up on maps. Like I said, I have friends there. I talked to these friends, I say that they say they're there. And they're, these are reputable people. Most of them are reputable people,

Matt Godbolt

05:47

[laughs] I won't ask you to name names...

Ben Rady

05:50

But you know, maybe they're being deceived. Maybe they have some mental illness that I'm unaware of. Maybe, uh, they actually don't live in the city of Portland. They live just outside of it, and they've been told that the nearby city is Portland. So there's all these like, you know, one in a bajillion possibilities that actually, Portland's not really real, but probably it's real. Right?

Matt Godbolt

06:14

For all practical purposes, you treat it as if it exists.

Ben Rady

06:18

Exactly, exactly. And so you can get, you can go way down this path and you get into sort of these deep philosophical, like what is real and like, you know, all of these like age of enlightenment theories on like...

Matt Godbolt

06:29

That sounds like a very late night conversation, after a couple of beers, level of discussion rather than...

Ben Rady

06:35

Exactly. Like, is the physical world a real place? Are we living in a simulation? But here's the thing: all of those kinds of questions are not useful for engineering. Right, right. Like that's not a useful engineering question to ask. Um, but you do have to sort of have this level of, like, you can never really be a hundred percent sure about anything, but you can be so sure as to be, you know, assuming that the world around me is real, and assuming that the thing that I observe actually is happening and I'm not suffering from some sort of hallucination, then I'm sure. And so that's what Portland sure is to me. It's

Matt Godbolt

07:15

You're as sure of it as you are that the state, sorry, the city of Portland exists in Oregon, even though you've never been there. There's a lot of supporting evidence, but no actual firsthand experience. It's all hearsay. Right. So that's Portland sure. So how do you get to Portland sure?

Ben Rady

07:34

Well, so Portland surety is this thing of like, there is a certain level of trust in there. It's sort of like, programming languages are these amazing things and computers are these amazing things that have a level of reliability that is hard to match in other areas. Right. Not impossible. And there are certainly other things that can achieve that sort of level of reliability.

Matt Godbolt

07:57

You're talking, like, if you open a file and the file handle comes back non-zero or whatever, then you have a file that works and the operating system works. It's very rare that you have to say, what if the operating system isn't working, or what if the CPU has a bug, or what if the RAM is corrupt?

Ben Rady

08:12

Right. Exactly. It's not that those things can't happen. You know, there's gamma rays, there's other things. But generally if you ask a modern CPU to add two integers, and you're not giving it an invalid instruction that's going to cause like an overflow or something, it's going to correctly give you the sum of those two integers. And questioning whether or not that is actually going to happen is about as useful as questioning whether or not the city of Portland exists. Right? Like from an engineering perspective, you know, maybe that's an exercise that you want to do at some point. And it is not totally impossible that it could go wrong.

Matt Godbolt

08:41

We'll have war stories, right. Where we've ended up finding, Oh, and it was a bug in the kernel. Right. But those are few and far between.

Ben Rady

08:48

Yes. And if you spent all of your time getting that level of confidence, where you were, you know, checking every single one of these things, and there's like millions of them, right? Like very few people, you are one of the few people that I know that actually dive so deeply down into the inner workings of how computers actually work at like the level of the silicon to be even able to answer these questions, let alone be able to verify that it really works the way that you think it works. Right. Um, and you know, there's only so many things in the world that you can do that with, right? Like you can't do that for everything. And so understanding how everything works is just not practical from an engineering standpoint. So my point here is, when I say I'm Portland sure about something, I'm as sure about this as I am about the city of Portland. It means I've dug down to the necessary levels of abstraction. That's that inner sense of mine that has seen those kernel bugs and has seen, you know, all those sorts of weird one-off errors that happen sometimes.

Matt Godbolt

09:50

Let's assume it isn't a broken operating system right now, it's more likely to be the threading code we just added.

Ben Rady

09:56

Right. Exactly. So like developing that is really important. And so like, one of the ways you can do that is with automated tests.

Matt Godbolt

10:04

And this is when you say developing that you mean developing, uh, the faith in the system, that it's correct.

Ben Rady

10:05

The matching of the faith with the, with the risk that, that sort of cost of failure that we were just talking about, like has my level of confidence risen to the point where this is now safe enough to move forward, you know, can I, can I drive through that intersection with the green light, with enough confidence that I'm not going to get hit by a truck, right. The cost of failure, there is really high. So your confidence needs to be high, right. If it's, you know, walking out onto the sidewalk, it's like, well, you know, if a bicycle is coming along and they hit me, that's not going to be the end of the world. So I'm just going to keep my AirPods in my ears and keep walking. Right? Like those are different levels of, of failure cost.

Matt Godbolt

10:41

A trade-off between, uh, the certainty that you're right and the cost of being wrong. There's a sort of, yeah. So you're developing your, sorry, I interrupted. You were talking about how unit tests are just one part of developing an appropriate level of confidence that your code is correct, but what other things can there be? I mean, obviously we started with manual testing. That is an obvious thing that I would do. If I have just made a change to a piece of code, then I'm going to run it. Maybe I'm going to step through it in the debugger. Maybe just, you know, go through and see line by line: is it doing broadly what I would expect it to do under the test circumstances that I have created for it? If it's a web app, I'll load it up in the browser and I'll look at the JavaScript error console and I'll click on a few things that I know are problematic and just develop a little bit of a sense of, is it okay?

11:31

The thing about that is it's hard to communicate to other people. Like I work on a hobby project, which is web based. I know the things that I randomly click on that have gone wrong in the past, and I've written down a few of them, but I don't have that nice safety valve of an automated version of it. I've tried to create one of those, and it was difficult to make and hard to keep up to date. And ultimately, I don't think it gave me the security that I was expecting compared to the pain of keeping it up to date, but it was intersubjective. I could say to other people who were working on the same code base, Hey, deploy to the, you know, the staging environment, run these tests against the staging environment. And then you're pretty sure that the staging environment is going to work when we promote it to production. But ultimately those have atrophied. And I think really it comes back to your original thought: the cost of me getting it wrong is egg on my face. Not lost business, not lost revenue, not trust really going, so I can afford to make the odd mistake in my particular case. But maybe if you are, you know, if you're, um, a government website, you do need to be up all the time, or if you're

Ben Rady

12:43

Have you seen many government websites, they're not really all that great.

Matt Godbolt

12:47

Yeah. Okay. I was trying to think of something for which, you know, the certainty of it being up was important, uh, and the government sprang to mind, but your point is valid.

Ben Rady

12:55

Yeah. Well, you know, trading systems are a good one. I mean, there's like embedded devices are another one. And, you know, maybe we'll talk about that at some point where it's like, you have to be certain because you're not going to get a chance to change it. Right. Like you're gonna, you're gonna upload this firmware on the devices that are not going to be connected to the internet. Cause, you know, do you really want your pacemaker connected to the internet?

Matt Godbolt

13:15

Right. And I mean, medical things in general. So I mean, if you want your Portland surety indicator to be as high as it's going to be, it's definitely in the firmware for the defibrillator that's like on the walls.

Ben Rady

13:27

Absolutely. And like aerospace, there's definitely these kinds of situations in aerospace. I mean, there's lots of domains where it's like, there's either lives at stake or there's significant amounts of money at stake. And so it's really important to, to get things right. And I mean, you know, to your example here about like, I have these manual steps that I go through and I tell people to do that. I mean, I think we would all recognize that the best way to do that is to try to find the ways to automate that in a way that is scalable, right. Where you're not writing really slow running integration tests and having like hundreds of them that are kind of brittle, but at the same time, like you don't want to have the readme with the manual set of steps. It goes, here's what you do to check this.

14:08

But I will say, I do think the ability to test things manually is incredibly important. And I personally, as much as I'm like the testing guy, the automated testing guy, I don't ever, well, maybe not ever, ever is a strong word, but I very, very, very rarely, and I have the one counter example to this actually, I very, very rarely make a change to a piece of software where I haven't gone through and used that software as the user would. Right. So if I'm trying to add a feature to a system, usually even beforehand, I'm like trying it out and trying to reproduce like, Oh, I can't do this. Then I go try to do that for myself. And I say, Oh, well, that is kind of painful. Maybe we need to add some functionality here. Right. And then I will drive that behavior out with tests and ensure that my tests, you know, have all the nice attributes that we've talked about, where they're fast, they're reliable, they're informative, to help guide me toward a design that is testable.

15:03

And therefore, you know, maybe a little bit more decoupled and all these nice properties, but then once I'm done writing those tests, I go back and I use the software manually. And I put myself in the shoes of the user, or users, that I'm building this thing for. And I try to use it just as they would. And if I find that difficult to do, because for example, I don't have access to the production data that they have, or I don't have an environment that's realistic, or I have a device that's different, I solve that problem. I go and I get the data, or I change the software so that I can connect to a production environment in a safe way and use the data. I mean, like read-only access to a production database.

15:42

Something like a mirror of your prod database is a great technique for this. There's lots of other techniques for this, right? But being able to use the software as your users are using it, that's how you find the missing tests, right? Those unit tests that you didn't realize you needed to write. That's how you find them, but you should only ever do that once. The purpose of that exercise is to give yourself that sort of Portland surety, so that when you go and you tell some user, whether it's directly face-to-face or with an email marketing broadcast, Hey, check out our cool new features, you are really sure that that stuff works, cause you've seen it work. Right. But you only ever want to do that once. And then you take that knowledge that you learned by doing that and figuring out, Oh, actually this doesn't work in this case.

16:25

And you go back and encode that into the tests so that not every person that comes after you has to follow those manual steps again. You've taken that confidence and you've put it intersubjectively into the code. So now everyone can share your confidence because you've kind of put it into the tests. Right, right. You've seen it work for yourself and you've recorded that in the tests. So the one situation where I will usually not do any sort of manual testing, when I'm fixing a bug specifically, is when I have a stack trace that shows exactly what the bug was. And the stack trace has some unique elements in it, right? Like it's hitting a piece of code that is not often traveled, or, you know, is pretty deep and can show a particular path. And if I can write a unit test that completely reproduces that stack trace, to where it's almost or exactly identical, right, that usually gives me enough confidence to then fix the bug and make the test pass and then just commit and deploy, and not actually have to reproduce the bug manually first and then fix it and then go try again and confirm that I've fixed the bug. Because usually with those stack traces, um, depending on exactly what they are and what path they're taking through the code, if you can reproduce it, it's a pretty good indication.
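The stack-trace technique described here might look something like the following sketch. Everything in it is hypothetical: the three functions stand in for application code, `PRODUCTION_PATH` stands in for the call path copied out of a real production stack trace, and the bug (a missing age arriving as `None`) is invented for illustration. It shows only the reproduction step, before the fix.

```python
import traceback

# Hypothetical call path copied from a production stack trace, outermost first.
PRODUCTION_PATH = ["handle_request", "load_profile", "parse_age"]
PRODUCTION_ERROR = "TypeError"


def parse_age(raw):
    return int(raw)  # the bug: a missing age arrives here as None


def load_profile(record):
    return {"name": record["name"], "age": parse_age(record.get("age"))}


def handle_request(record):
    return load_profile(record)


def call_path(exc):
    """Function names from the exception's traceback, outermost first."""
    return [frame.name for frame in traceback.extract_tb(exc.__traceback__)]


def reproduce():
    """Trigger the suspected bug and report the error type and call path."""
    try:
        handle_request({"name": "Ada"})  # no "age" key, as in the bug report
    except Exception as exc:
        return type(exc).__name__, call_path(exc)
    return None, []


def reproduces_production_trace():
    error, path = reproduce()
    # Compare only the tail: the test's own frames sit above the app frames.
    return error == PRODUCTION_ERROR and path[-len(PRODUCTION_PATH):] == PRODUCTION_PATH
```

Once `reproduces_production_trace()` holds, the same input can be turned into a regular test that asserts the corrected behavior after the fix.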

Matt Godbolt

17:45

It does sound slightly pipe dreamy for some of the things that I'm involved with. Um, just because of the number of moving parts.

Ben Rady

17:53

So the pipe dreamy thing is interesting, right? So I think you have to address those things as they come up. And I think that, you know, part of the skill of writing these kinds of tests is starting with the assumption that given enough effort, this is almost always possible. And then sort of backing off from there and finding the, sort of the right level of effort to put into it where you can maintain. Because the key thing is, is that you want to be able to maintain the sense of confidence among yourself and your team, that if you have a, whatever your process is, whether it's run a CI environment with a whole bunch of unit tests and then do some limited amount of integration or manual testing or whatever it is. But if you follow the process that you will achieve a result that is good enough to move forward.

18:51

So that's not complete certainty of no failures, right. But it's, given our environment and given our risk tolerance and given our failure costs, if you follow the process, you will achieve the right level of risk. Right. And if you're finding that after you're done following the process, you're like, ah, maybe I'll check a few more things, right? You need to listen to that and say, okay, for this immediate thing that I need to do, maybe check a few more things, but then right after that, I need to make some improvements to our process, whether it's writing more tests or, you know, one way that you can address this is by adding observability to your systems, right. Maybe I can't write the unit test to tell me, for this huge production environment with thousands of servers, where if I were to replicate it I'd double my AWS costs, which I don't really want to deal with right now. I'm sure you can relate to this.

Matt Godbolt

19:49

That's exactly where I am. All of these things are coming from that sort of sense. Yeah.

Ben Rady

19:53

So if you can't get that level of confidence just from unit tests, because of the nature and the cost of your environment, another way to get that level of confidence is through observability. So maybe you deploy this new thing and you have some, and we can talk about lots of different ways to get observability, but like one would be adding in structured logging, right? So you have a special structured log that all of your applications write to, that lets you gather certain metrics about what the software is doing, how it's behaving, you know, maybe it's error rates. Maybe it's like, Oh, I know that there's a queue over here that takes these incoming messages, and I just potentially changed the number of incoming messages. Uh, so what I really want is an ability to see the size of that queue as I roll this thing out. And as it starts propagating to all these different services, is that queue starting to grow? And if it is, I'm gonna roll this back. So that implies a whole bunch of things.
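One possible shape for the structured log described here is one JSON object per line, written through Python's standard `logging` module. This is a sketch under assumptions: the field names (`ts`, `level`, `service`, `event`, `queue_size`) and the convention of passing numbers through `extra={"metrics": ...}` are invented for illustration, not something specified in the conversation.

```python
import io
import json
import logging


class JsonLineFormatter(logging.Formatter):
    """Render each log record as a single JSON object on one line."""

    def format(self, record):
        payload = {
            "ts": record.created,
            "level": record.levelname,
            "service": record.name,
            "event": record.getMessage(),
        }
        # Assumed convention: callers pass numbers via extra={"metrics": {...}}.
        payload.update(getattr(record, "metrics", {}))
        return json.dumps(payload, sort_keys=True)


def make_logger(service, stream):
    """Build a logger for one service that writes JSON lines to `stream`."""
    logger = logging.getLogger(service)
    logger.setLevel(logging.INFO)
    logger.handlers.clear()   # avoid duplicate handlers if called twice
    logger.propagate = False  # keep output out of the root logger
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonLineFormatter())
    logger.addHandler(handler)
    return logger


def queue_sizes(log_text):
    """Read queue_size from every line that has one; no regex required."""
    events = (json.loads(line) for line in log_text.splitlines() if line)
    return [event["queue_size"] for event in events if "queue_size" in event]


# Usage: log a queue sample, then read the metric back out of the log text.
buffer = io.StringIO()
log = make_logger("ingest", buffer)
log.info("queue_sampled", extra={"metrics": {"queue_size": 120}})
```

Because every line parses as JSON, watching the queue during a rollout becomes a matter of reading a field rather than inferring it from message text.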

Matt Godbolt

20:48

The ability to roll back. Yes. I was going to say that was the first thing that came to me because like, oftentimes by the time you've hit the big red button, maybe it's not so easy to unhit the big red button. Right, right, right, right.

Ben Rady

20:58

So this gets back to this whole thing of like, in order to move forward, you have to have confidence. You have to reach that level of safety. One way to get that is writing tests, and another way to get that is you make it really easy to roll back. So you've put in the effort to build those systems that are easy to roll back so that, okay, well, our unit test coverage in this area isn't great. I'm going to add a little bit of observability here. I'm going to mix that in with a little bit of, you know, rollback magic. And so I can get that confidence to deploy. I just, yep, push, deploy. We're good, because I can see very quickly if this doesn't work, I can undo it. And I have confidence that it's going to undo properly.

Matt Godbolt

21:31

That's kind of a third dimension to the sort of confidence versus cost of getting it wrong. Maybe it's sort of related to the cost of getting it wrong. And that is how long it can be wrong for before the severity kicks in. Now, like if you're doing a database migration, like a huge new change to the way things are stored, maybe it's very, very expensive to go back because you've now created a ton of records with the new format that you can't undo. So it's hard to roll back. So there you perhaps have to account for it by having extra testing, even more confidence in the system before you roll it out. But if you're moving a widget around on the UI, right, the cost of rolling back is pfffft. The cost of getting it wrong is also that same noise. It's de minimis. I didn't think I could do it again.

22:20

So I wasn't going to try. It's not such an expensive thing. Um, so for me, when I'm doing my staging rollouts of my funny little hobby, um, that's mostly because I can't 100% trust the rollout process. And so in case I've broken something by renaming a directory somewhere on some AWS thing, again, because the cost of keeping everything up and running in two parallel systems is high, I push to a staging thing. And then I guess what I'm looking for is, does it start responding to requests and does the page open up? Okay, cool. That probably means that I can take that exact version and push it to production without any incidents. So in a sort of funny way, maybe that's observability. Will this deploy succeed in at least one instance that looks very, very close to the real production? Yes, then.

23:05

Okay. Now it's good to go. And then I can roll back because it's a symlink change, right, to go back to an older version. So I'm lucky that I have that in my case and, again, the cost is very, very low if I get it wrong. But in, as we say, an expensive database system or a financial system where, even if you can turn it off within 10 seconds, having observed it doing something wrong, you could still easily have lost millions of dollars, then you do need a different approach. But observability is a useful trait in and of itself, inasmuch as, you know, having metrics and dashboards and counters so that while you are rolling out your software, you get the nice warm glow of seeing the queue length either increase because, you know, you're now putting more things in the queue and that's good.

23:47

And that's what you wanted, or decrease because you sped up the calculation or whatever. Those things are good for you, as a human, to sit and watch and kind of enjoy the expected results of your change becoming visible to you, both in a pretty graph format, but also in terms of something you can look at and debug later, if it turns out to not be the case that you wanted. But observability is useful for a number of reasons, like debugging later on: what else went wrong? Building observability into your application has always been a good thing for me. Like you mentioned structured logging, what kind of things are you thinking about when you say a structured log?

Ben Rady

24:22

A lot of times we get into these situations where we just use a, whatever logging framework is available to us when we're we're, you know, building stuff. And we write out these sort of human readable, it's like timestamp log level, subsystem name, and then a message

Matt Godbolt

24:37

We've all been there and done the grep to find the metric that you didn't actually push to Prometheus or whatever. It's like, well, okay, we can infer it with this thing and this regular expression and, right, right.

Ben Rady

24:49

And that's half of my bash-fu, it comes from just, you know, being forced into situations where I have to do a grep, sort, uniq, blah, blah, blah, blah, blah, to figure out what things are doing. And so structured logging is an approach that says, maybe we should try to do these things on purpose instead of by accident. Um, because we know this is going to happen. And I mean, this is something that happens a lot. And I think one of the problems is, we have these discussions in the break room and we have these meetings and we have these podcasts where we talk about, wouldn't it be great if we had this? And everyone's like, Oh yeah, that's great. But if I'm doing that, then I'm not doing something else. Right. And that's a valid concern. But what inevitably happens is that we wind up needing these things.

25:28

And it's like the accidental observability moments that we end up relying on. How many times have you and I cracked open Wireshark to see what two services who were talking to each other were doing, and that was an absolutely essential life saving move? Well, maybe not life saving, but money saving move. And how terrible would it have been if we hadn't been able to do that? And the reason that we were using Wireshark instead of something else is that's all we had. Only by virtue of communicating over the network were we provided this tunnel into what our systems are doing. And we never

Matt Godbolt

26:02

Of having expensive recording machines for other purposes that happen to be capturing all the packets anyway, often. Well, Oh, that's lucky that we have this sort of trace knocking around. But I mean also, how often have you run strace or one of the other applications because you're like, well, I don't have the observability that I need to be able to understand what's going on in a situation. And thankfully I can reproduce it enough. And the best thing I can do is strace the process and hope to heck that the problem happens and we can see whatever file descriptor it's hanging on, or what's going on in that respect.

Ben Rady

26:36

Exactly. Exactly. So those are the moments of observability, the ability to do this, that we sort of stumble into just by the fact of the environment that we're running in. And so structured logging, I think, is one example where you could say like, no, no, no. What if we actually did this intentionally, right? What if we built the system with these needs in mind that we know we have? We know we're going to need this stuff, right. It's just a question of what tools do we have at our disposal to get it. Um, and I think one of the things that can happen if you do that, that is more difficult if you don't do it intentionally, is that that observability matures and morphs into something that lets you now take automated action based on it.

27:19

So it's one thing to grep through a log and see something that's happening. It's another thing to send, uh, a stat to, you know, StatsD or Prometheus or some other thing to see a pretty chart. But only when you get to a point of maturity where you can be confident that the system should behave in certain ways and confident that it shouldn't behave in other ways, can you start doing things like, Oh, I know that I don't even need to trigger a rollback. If there's a problem, I can just push this out. And I have enough experience with the tools that I built for observability, and I have enough history now to be able to say with confidence, if this queue size exceeds this, there's something real wrong and it just should roll back automatically. Right. But to get there, you have to progress through the stages of adding the observability in the first place, having it be in a format that is easily consumable by all the environments that you need and in all the ways that you need, and then establishing that sort of pattern of what normal looks like for your application and understanding what the failure modes look like.

28:20

And a lot of this comes from not only adding that observability, but also doing Chaos Monkey things to simulate failure and understand what your failure modes are, and having the recording in place so that when those sort of, you know, gamma ray moments happen and things break in strange ways, you have recorded it and you can go back and be like, oh, well, this was really interesting because this failed like this. So you have to have that sort of hard-fought history to be able to get to that point. But once you do, and you know what normal looks like, and you know what abnormal looks like, then you can start automating some of these things. I think actually one of the problems that people run into when they start hearing about this, like observability, oh yeah, I'm gonna handle the structured logging and all these stats, is they try to jump right to the automation, right?

Matt Godbolt

29:02

They're like, oh yeah, I'm going straight through without passing Go, without going around a couple of times and saying, oh, I think I see how this is going to fit together now.

Ben Rady

29:09

Yeah. They start making assumptions about how they think the system should behave instead of observing how it actually does behave. And then what happens is that you get a whole bunch of other failures on top of it, like a ton of alert spam. It's like, oh, the queue size exceeded 10,000. It's like, yeah, actually it does that all the time.

Matt Godbolt

29:25

Because every Monday morning some other exogenous event happens, and you had to let it run a couple of weeks just to notice that that's normal.

Ben Rady

29:31

Exactly. Exactly. Exactly. So you have to sort of progress through that whole process of, you know, get those tools for observability in place, observe them, figure out what normal looks like, figure out what real is. And then you can get to the point of actually taking automated action on it. But when you get to that point, now you've created this wonderful safety net where you can go real fast, because there's all these different ways that we just can't break things. Because if we make certain mistakes, the system will recover in a safe way.
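
The progression described here (observe first, learn what normal looks like, and only then automate) can be sketched in Python. This is a minimal illustration, not anything from the systems discussed: the function names, the sample numbers, and the "mean plus three standard deviations" rule are all assumptions chosen for the example.

```python
from statistics import mean, stdev

def learn_baseline(history, k=3.0):
    """Derive a threshold from observed queue sizes instead of guessing one.

    Assumption: "abnormal" means more than k standard deviations above the mean.
    """
    return mean(history) + k * stdev(history)

def should_roll_back(recent, threshold, sustained=3):
    """Trigger only when the last `sustained` samples all exceed the threshold."""
    if len(recent) < sustained:
        return False
    return all(sample > threshold for sample in recent[-sustained:])

# Hypothetical history that includes the routine spikes past 10,000.
history = [4000, 4200, 3900, 12000, 4100, 4050, 11800, 4300]
threshold = learn_baseline(history)

routine_spike = should_roll_back([12000, 11800, 12000], threshold)
real_problem = should_roll_back([50000, 60000, 70000], threshold)
```

Because the threshold comes from recorded history, the routine spike does not trigger a rollback, while a sustained excursion far outside anything seen before does. A hard-coded limit of 10,000 would have alerted on the routine case.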

Matt Godbolt

30:05

This can't apply to everything. You know, we talked about embedded systems earlier, and I think we're pretty sure that that's not what we're talking about here. Our sort of traditional server model certainly falls into this category, where you can say, hey, I got a request, I did a whole bunch of things to it, and as a result of it, here's the response I posted, and I can measure queue sizes or how long each function took or whatever is a useful piece of information throughout the processing of a request. And then probably aggregating that over many instances in today's sort of modern server infrastructure, and then having an alerting and monitoring system that sits a level above that and is configured to look at the queue size in aggregate, or the average queue size, or the minimum to maximum queue size, or things like that.
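
As a small sketch of the aggregation step mentioned here, the following Python collapses per-instance queue sizes into the minimum, maximum, and average that a monitoring layer might look at. The instance names and values are made up for illustration.

```python
def aggregate_queue_sizes(per_instance):
    """Summarize queue sizes reported by many instances of one service."""
    sizes = list(per_instance.values())
    return {
        "min": min(sizes),
        "max": max(sizes),
        "avg": sum(sizes) / len(sizes),
    }

# Hypothetical snapshot: one instance is clearly backed up.
fleet = {"web-1": 120, "web-2": 95, "web-3": 4000}
summary = aggregate_queue_sizes(fleet)
```

Looking at the spread between minimum and maximum, rather than the average alone, is what makes a single struggling instance visible.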

30:54

But observability can give you more than that kind of alerting and monitoring. I mean, I talked about it a little bit with the idea of using it as a debugging tool. You know, we talked about strace and stuff, but I've had some experiences with something similar, which was not aggregated at a server-to-server level. It wasn't alerted on server to server, but it was recorded and kept, because I was working on a trading system where, quite reasonably, five or six years later we might get a call from a regulator saying, hey, this trade you did in 2015, why did you send this specific trade? What was special about this trade? And we have to be able to answer the question of, like, why did we buy a hundred Google shares then?

31:41

And one of the ways that we developed a system to answer those kinds of questions was essentially the kind of thing you've been talking about: observability. We had a trace from every single piece of the software as an event flowed through it, everything annotated, like a single message with, hey, I made this call because of these things, and they're all referenced and cross-referenced with numbers and timestamps and things. And then we would write that out to disk. And it was written specifically for answering these kinds of questions for regulators, but it was the most useful thing in the world for ourselves, just to pick over the corpse of a problem trade or a crash that we'd had. We had every piece of information we needed. Oh, we just processed a batch that had 300 things in it. That's higher than we've ever processed before.

32:29

Maybe that's what leads to the crash. And we were able to do that in real time, as in, you know, microsecond-level trading systems. So there's kind of no excuse. Well, that's not true. There's always an excuse for not expending engineering effort, but it can be done. You can make it a high enough priority to keep track of the decisions you're making and gather observability, even when you're worrying about microseconds in terms of latency. So that for me was super useful. And now I kind of look for that level of observability in almost everything that I come to. And very few things have that, but, you know, very few things also have this sort of very straight shape where one piece of information comes in, calculations happen, and one outcome comes out the other side. We're not always in that position. But yeah, observability is fascinating. I hadn't really thought of it in terms of taking it to the alert level before, where the very fact that you can develop surety about what your system does can give you faith, not in your system, but in the deployment of the system: you will quickly know whether you got it right or not, and then you can roll it back. And those are all kind of interlinked in terms of the confidence of being able to move forward.
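
A decision trace of the kind described here could look roughly like the Python sketch below: each component appends one annotated record per decision, keyed by an event ID, with a sequence number and timestamp for cross-referencing. The class, field names, and sample values are hypothetical, and JSON lines are used only for readability. A latency-sensitive system would presumably use a much cheaper encoding.

```python
import io
import itertools
import json
import time

class DecisionTrace:
    """Append-only record of why each component made each decision."""

    def __init__(self, sink):
        self._sink = sink                  # any object with a write() method
        self._seq = itertools.count(1)     # monotonically increasing record number

    def record(self, event_id, component, decision, reason, **details):
        entry = {
            "seq": next(self._seq),
            "ts_ns": time.time_ns(),
            "event_id": event_id,
            "component": component,
            "decision": decision,
            "reason": reason,
            "details": details,
        }
        self._sink.write(json.dumps(entry) + "\n")
        return entry

def explain(lines, event_id):
    """Return every recorded decision for one event, in the order written."""
    return [e for e in map(json.loads, lines) if e["event_id"] == event_id]

# Hypothetical usage: an in-memory sink stands in for a file on disk.
sink = io.StringIO()
trace = DecisionTrace(sink)
trace.record("evt-42", "batcher", "process_batch", "batch ready", batch_size=300)
trace.record("evt-42", "order_router", "send_order", "signal above limit",
             symbol="GOOG", quantity=100)
trace.record("evt-43", "batcher", "process_batch", "batch ready", batch_size=12)

story = explain(sink.getvalue().splitlines(), "evt-42")
```

Answering "why did we do this?" then becomes a filter over the recorded lines, whether the person asking is a regulator years later or an engineer picking over a crash.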

Ben Rady

33:42

Right, right, right. There's some deep relationships there for sure. And I mean, you kind of were in the interesting position of you were forced to build that observability into your system for a regulatory reason, but once you got it, it was like, you can never take this away from me. Right. Like it was such a wonderful thing.

Matt Godbolt

33:58

It's like almost anyone who knows how to use strace beyond, like, "man strace" and looking at the first thing. Now, suddenly, you have a new thing in your arsenal for debugging almost everything. Like, the first thing you do is, well, this is weird, I'll just run strace on it. And having that level of observability and that level of information, I guess, to your point earlier, similarly, Wireshark, right? Once you've worked out Wireshark, it's amazing how many things I solve with Wireshark these days. Oh, why can't I see this thing on my home network? Oh, I know, I'll just run Wireshark. And it's like, that's crazy, why are you using this tool to do that? Well, because it answers the question. No one gave me this information, but they're telling me, just not explicitly. And I can find that information if I use the right tool. How much better could the tool be if it was built into the applications that actually have the information that you want, and it was exposed?

Ben Rady

34:49

Yeah. Not digging on Wireshark. I mean, sometimes one of the ways that you can get this observability is like, well, we're going to send this data over the network. Right. And then we'll be able to see it in the captures and then we'll know what it was. Right. And we'll build a parser for it. Like sometimes you do that on purpose.

Matt Godbolt

35:04

Yeah. Right. I mean, we've had situations before where we've had just a ping running and used that as, well, this means that there's connectivity between these machines. We can see it in the capture. That kind of gives us a yardstick to measure stuff by, or a timestamp of the last known good connection. So yeah, there is definitely the Wireshark hammer. There's the strace hammer. There's even the SystemTap hammer, which I've used twice to amazing effect and found kernel bugs with, to your point earlier about whether the kernel can be trusted or not. But they do give the wielder of that hammer maybe a little bit too much confidence that it's the right hammer to use for all problems. We should all acknowledge that right now. But they are useful.

Ben Rady

35:51

Yeah. But I mean, it's a different thing, I feel like, when you build these kinds of capabilities intentionally, rather than being forced into doing them by a regulator or sort of falling accidentally into them by the nature of you sending data over a network or whatever it may be. I can't prove this. I'm not Portland sure of this, but I get the feeling, based on my experience, that you will get to that sort of magical state of having full understanding of the normality of your system sooner, where you can start automating these things, automating rollbacks, automating deployments, automating alerting in a way that isn't spammy, if you do it intentionally, and especially if you do it from the start when you can. And then the other thing that I kind of wonder about on this topic is, you know, I was saying before, we want to put ourselves in the seat of a user that's using our software.

36:45

Right. And you know, we've been talking for the last few minutes now about observability. I do kind of wonder if there's maybe a little bit of an overlap between these two things, right? Just as, you know, we've said a few times, and I've certainly said many, many times in my life, there's this sort of benefit that you get from test-driven development, where it naturally makes your design better. I wonder if there is sort of this natural thing where adding observability to a system makes it easier for you to put yourself in the seat of a user, because a lot of times, if you want to reproduce what a user has done, you kinda need to rearrange the matrix, so to speak, right? Like you need to put this data over here, and you need to have this thing be in this state, and you need to be able to sort of manipulate the environment that you're in to reproduce what the user was doing at the time.

37:40

And I have to wonder if that ability to sort of reach into the different parts of your system and at least see what's going on, if not control what's going on, is very related to the ability to observe it. And so, a dumb example of this would be: I can clone the production database to my local workstation using a read-only account and a read-only password. Maybe it's even from a mirror of the production database that has yesterday's data in it. And I can clone that onto my local workstation, and we have whatever procedures are in place, and everybody's comfortable with the idea that I can do this. There's no sensitive data or whatever. Yeah, it's been scrubbed, if that's what needs to happen, you know, the minimum amount of scrubbing necessary, but we're making sure that there's no personally identifying information, or whatever the constraints are for your problem.

38:32

I can do it, right? I can then go in and I can monkey with that data to experiment and figure out what I think happened to this one user, one time, that caused this stack trace to happen that is now sitting in my JIRA ticket or whatever it is, saying, like, hey, we've got this bug. I can then mess around with that data until I can use it to reproduce the stack trace. And if that stack trace matches exactly, I am very confident that I have now reproduced this bug. And that means that I know what data was in the database at the time that caused it. And it means that when I fix it, I can go back and follow that same series of steps. And if I don't see the stack trace, if it doesn't error, I can be very confident that I fixed it. Yep. The ability to do that, I feel like, is not all that different from the kind of things that you would normally build into most systems to make them more observable. Right? Because the whole, like, what about PII? We can't take all this data and just shove it into this whole system where anybody can see it, because then all the personally identifiable information

Matt Godbolt

39:28

Yeah. Using acronyms. Thank you for clarifying, yes. Yeah.

Ben Rady

39:32

Personally identifiable information will leak out, and you can't have that, from a regulatory perspective or from a legal one. Okay. Well, we're going to have to solve that problem anyway, because we needed it to be observable and we needed it to be manually testable. We need our developers to be able to put themselves in the shoes of a person who's using the system. So again, this is something that I'm less sure about. I'm very sure that there's this overlap between testing and good design. I'm less sure about this, but I'm starting to wonder whether, once you start adding in these hooks, and once you start thinking in this way, where systems are decoupled enough and the connections between them have this observability property to them, this sort of capturable, saveable, storable property, you don't wind up with a system that is naturally more observable, because the engineers have to be able to reach into various parts and tweak things.
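
To make the scrubbing idea concrete, here is a minimal Python sketch that replaces personally identifiable fields with stable pseudonyms before data reaches a developer workstation or a log. The field list, salt, and row shape are assumptions for illustration. Salted hashing is a simple form of pseudonymization, not a complete anonymization strategy, so whatever regulatory or legal constraints apply still have to be checked separately.

```python
import hashlib

# Hypothetical list of columns that count as PII for this example.
PII_FIELDS = {"name", "email"}

def scrub(row, salt="local-dev"):
    """Return a copy of `row` with PII fields replaced by stable pseudonyms.

    The same input always maps to the same pseudonym, so relationships
    between rows survive and a bug can still be reproduced against the data.
    """
    cleaned = {}
    for key, value in row.items():
        if key in PII_FIELDS:
            digest = hashlib.sha256((salt + str(value)).encode()).hexdigest()[:12]
            cleaned[key] = f"{key}-{digest}"
        else:
            cleaned[key] = value
    return cleaned

# Hypothetical row from a cloned production table.
row = {"id": 7, "name": "Ada Example", "email": "ada@example.com", "queue_size": 3}
safe_row = scrub(row)
```

Keeping the non-sensitive fields untouched is the "minimum amount of scrubbing necessary" part: the data stays useful for reproducing a stack trace while the identifying values never leave production.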

Matt Godbolt

40:29

That's an interesting point and probably a good point for us to stop here because we've, you know, I'd want to think about that some more.

Ben Rady

40:36

That's a good one to think about for sure.

Transcript source: Provided by creator in RSS feed
