Observable Metrics

Matt Godbolt

00:18

Hey, Ben.

Ben Rady

00:20

Hey, Matt.

Matt Godbolt

00:20

I am not convinced my levels are right here, so I apologize if the audio is awful on this. Something isn't going well here. I think it's too quiet. I mean, how does it sound to you?

Ben Rady

00:33

You think it's too quiet? It sounds good to me.

Matt Godbolt

00:35

But I've turned the gain up here and it looks really... Okay, well, we'll go with that. Anyway, hi!

Ben Rady

00:39

Yeah, hello.

Matt Godbolt

00:40

How are you doing, friend? This is a catastrophe for editing Matt. During the opening jingle, we were taking the mickey out of editing Matt because he's a jerk. But that's fine because he's not here.

Ben Rady

00:55

Yeah. I usually find that yesterday Ben is a jerk and tomorrow Ben is the most responsible person on the face of the earth.

Matt Godbolt

01:02

That's... Yeah.

Ben Rady

01:03

He's going to take care of everything because I'm not doing anything today.

Matt Godbolt

01:06

Yeah, you know that thing? Yeah, that's 100% how I see the world too. So we had started this, as we often do, by chatting in a Google Meet before this.

01:18

And then we were like, let's just record. Let's do it. And then we had to switch out. And then unfortunately, because we're now using a separate recording system, that involves fiddling around with web browser settings and plugging in different microphones and everything. And now I can't even remember what it was we were talking about.

Ben Rady

01:36

So when I cut over to this, I was of course in the middle of writing a test because, you know, what else would I be doing?

Matt Godbolt

01:42

Of course.

Ben Rady

01:43

And I was saying how one of the things that I love to do is test my metrics. And you had made a very interesting point about performance-sensitive code and this technique.

Matt Godbolt

01:55

That's right.

Ben Rady

01:56

So do you recall what you said?

Matt Godbolt

01:57

I do now. Thank you for being the responsible adult and remembering from one minute to the next what on earth is going on.

Ben Rady

02:03

Tomorrow Ben has manifested today.

Matt Godbolt

02:05

Yeah, it seems so. It seems so. So yeah, you had said that you were writing tests for the metrics in your code. And that's a fantastic thing to do because if you're relying on those metrics, you probably want to make sure they're right.

Ben Rady

02:19

Mm-hmm. Mm-hmm.

Matt Godbolt

02:20

And then I'd said that for some areas of performant code, certainly in C++ land, sometimes there isn't a seam for you to put an interface or some kind of measuring point in your regular code to write a test around. So the classic example I can think of is if you've got a high-performance cache

02:44

of information (a software cache, just to be clear), then you want to be able to test whether or not you hit the cache. But the cache is there to be transparent, right? It's either there or it's not there, or it fetches a value, or maybe it's more like a memoization cache.

Ben Rady

03:00

Right, right, right.

Matt Godbolt

03:00

So it either computes a value or it returns the value that it computed before.

Ben Rady

03:05

Mm-hmm.

Matt Godbolt

03:05

And the caller doesn't need to care about that. And you don't want to have to break your interface just to write a test. And you don't want to have to pass in a listener class that says, hey, "on cache result", because that's not good for the design of your system. It's maybe not good for the performance.

Ben Rady

03:13

Right. Uh-huh.

Matt Godbolt

03:22

But you probably do want metrics about how often your cache is being hit. And if you're writing performant code, you've probably written a relatively performant metrics system.

Ben Rady

03:34

Right, right. Yeah.

Matt Godbolt

03:34

And then it becomes a natural way of checking that your code is doing the things you want, by looking at the metrics. Am I getting a cache hit? Am I getting a cache miss?
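A minimal Python sketch of the idea Matt describes, not code from the episode: the class names `Metrics` and `MemoCache` and the counter names `cache.hit` and `cache.miss` are made up for illustration. The cache's `get` returns only the value, and the test learns about hits and misses through the counters.

```python
from collections import Counter
from typing import Callable, Dict


class Metrics:
    """Tiny in-process metrics registry (hypothetical): named counters only."""

    def __init__(self) -> None:
        self.counters: Counter = Counter()

    def increment(self, name: str, amount: int = 1) -> None:
        self.counters[name] += amount


class MemoCache:
    """Memoizing cache. Callers only ever see the value, never hit/miss status."""

    def __init__(self, compute: Callable[[int], int], metrics: Metrics) -> None:
        self._compute = compute
        self._metrics = metrics
        self._values: Dict[int, int] = {}

    def get(self, key: int) -> int:
        if key in self._values:
            self._metrics.increment("cache.hit")
            return self._values[key]
        self._metrics.increment("cache.miss")
        value = self._values[key] = self._compute(key)
        return value


def run_cache_example() -> Counter:
    """Look the same key up twice and return the resulting counters."""
    metrics = Metrics()
    cache = MemoCache(lambda n: n * n, metrics)
    cache.get(4)  # first lookup has to compute the value
    cache.get(4)  # second lookup is served from the cache
    return metrics.counters
```

A test can then assert on `run_cache_example()` without the cache's public interface ever mentioning hits or misses.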

Ben Rady

03:44

Yes.

Matt Godbolt

03:45

And so that was where we were when we said, let's record this. And now that's the end of the podcast.

Ben Rady

03:50

Yeah, yeah, yeah.

Matt Godbolt

03:50

Thank you for listening, everybody.

Ben Rady

03:51

Thanks, everybody. Outro music plays. No, I mean, I think that this is a great specific example of a general thing, which I think we've talked about on the podcast before: what does it mean to have testable code, right? What does it mean to have code that is testable?

Matt Godbolt

04:09

Yes.

Ben Rady

04:10

And there is a sort of premise that is baked into all of this, and it's woven into test-driven development and a bunch of other things, which is: if you build software that has a nice interface, for some definition of nice, it will be easy to test, and writing the tests helps you create that nice interface. And this is a specific example of that, because the thing that's nice about this is the observability. We want to have code that is observable.

Matt Godbolt

04:35

Yes.

Ben Rady

04:36

We want to have code where we can know what it's doing. And the test in this case is giving you a very specific thing: I need to know if I hit the cache. You need to know that for the test and you need to know that when you're running your software. And that is the same problem. It is the exact same problem.

Matt Godbolt

04:55

Yeah, but I suppose specifically in this instance, a not totally unreasonable API to whatever this cached thing is, is that you return a tuple of the thing that you got out of the cache and some status object that says: did I get it from the cache? Was it a hit? Was it a miss?

Ben Rady

05:16

And so the tests are there to do that. Sure, yeah.

Matt Godbolt

05:17

You know, that's a reasonable interface, in which case now...

Ben Rady

05:19

You can also do it that way, yeah.

Matt Godbolt

05:20

Well, that's my point, right? That is a reasonable way to write this system. But very specifically, oftentimes, if you're in a very high-performance piece of code, then by writing the interface that way, you have pessimized the case where you don't care if it was in the cache or not, which is the very common case.

Ben Rady

05:41

Mm-hmm.

Matt Godbolt

05:41

And you're forcing everything through this. But the metrics are a sort of side channel that you still care about and that is performant.

Ben Rady

05:50

Mm-hmm.

Matt Godbolt

05:50

And you're now using them to test the inner workings of something that should be transparent. And you, in fact, want it to be transparent. And maybe you're saying those are the same things. I've just found it an interesting way of saying: there are some internal workings of a class that I would like to be able to test, but I don't really want to expose them to the outside world.

Ben Rady

06:12

Mm-hmm. Mm-hmm. Mm-hmm.

Matt Godbolt

06:13

And I can't expose it to the outside world directly, because the performance characteristics would be different. And so the metrics represent an indirect way of me accessing interesting things that happened in my class.

Ben Rady

06:31

Yeah, I kind of get what you're saying there, but I think it all hinges on the definition of outside world, right?

Matt Godbolt

06:38

Right. Okay.

Ben Rady

06:38

Like the caller of the code is... let's live in a multiverse for a second here. The caller of the code is one world.

Matt Godbolt

06:47

Sure.

Ben Rady

06:48

But another world is sort of you as an operator of the software. And the tests can stand in for both of those things.

Matt Godbolt

06:55

Of course.

Ben Rady

06:55

You don't have to do them all as one thing, right? The caller of the code can be like, yeah, I asked for this value. I got this value back. I'm not getting a tuple that indicates whether it was a cache hit or a cache miss or... some sort of other metadata that rides along with it because I don't actually care about that.

Matt Godbolt

07:05

Because I don't care. Yeah.

Ben Rady

07:10

Right. And I shouldn't have to care about that just for the purposes of testing to make sure that my caching system works. But you as an operator of that software, you as somebody who is going to be watching it run and making sure that the performance is good and making sure that the changes that you've made have taken effect as you would have expected, need another dimension into the multiverse of ways to see what is going on. And the tests can stand in for that too.

07:38

There is another aspect of this. I was just ranting to you about this earlier this week, actually, when we were at lunch. There is another aspect of this that I think holds here, and that is logging. Usually what people do is they have a logging system and they just dump things into the logs.

Matt Godbolt

07:50

Yeah. Right.

Ben Rady

07:58

You know, it's like, oh, I've got this variable here. I'll log this out or whatever it is. Right. And you know, it's the really unfortunate case when it's like, okay, yeah, we have log statements in the code for when the terrible thing happens.

08:11

And then you go and you look in the logs because the terrible thing has happened. And what you see is like: the magic value is "%s". Because whatever logging thing you set up didn't actually capture the value that you wanted, because you thought it was templated and then it wasn't, or whatever it is that happened.
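As an illustration of the failure Ben describes, here is a small sketch using Python's standard `logging` module. The logger name `example.magic`, the `ListHandler` class, and the two `report_*` functions are hypothetical. With `logging`, a format string that is given no arguments is emitted unformatted, so the literal `%s` ends up in the log, and a test that captures the message catches that before the one-in-a-million moment arrives.

```python
import logging
from typing import Callable, List


class ListHandler(logging.Handler):
    """Collects rendered log messages in memory so a test can inspect them."""

    def __init__(self) -> None:
        super().__init__()
        self.messages: List[str] = []

    def emit(self, record: logging.LogRecord) -> None:
        self.messages.append(record.getMessage())


def report_bad_value_buggy(log: logging.Logger, value: int) -> None:
    # Bug: the argument is never passed, so "%s" is logged literally.
    log.error("the magic value is %s")


def report_bad_value_fixed(log: logging.Logger, value: int) -> None:
    log.error("the magic value is %s", value)


def captured_messages(report: Callable[[logging.Logger, int], None], value: int) -> List[str]:
    """Run one reporting function and return the messages it logged."""
    log = logging.getLogger("example.magic")  # hypothetical logger name
    log.setLevel(logging.DEBUG)
    log.propagate = False  # keep the example's output out of other handlers
    handler = ListHandler()
    log.addHandler(handler)
    try:
        report(log, value)
    finally:
        log.removeHandler(handler)
    return handler.messages
```

Asserting on the full captured message, rather than just checking that something was logged, is what exposes the missing argument.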

Matt Godbolt

08:28

So sad.

Ben Rady

08:28

And the one-in-a-million moment has come and gone and you're never going to see it again, right?

Matt Godbolt

08:33

Yeah.

Ben Rady

08:33

And so

Matt Godbolt

08:34

Are you about to write tests for logging is what you're about to say, isn't it?

Ben Rady

08:37

That is exactly what I'm saying. I think that one of the benefits of structured logging is that you can approach it in the exact same way that we are talking about approaching these metrics, right?

Matt Godbolt

08:51

Right. They are similar sounding things in this instance. It's just a different way of structured logging. And one is a counter and the other one is maybe a sequence of events that you've logged.

Ben Rady

09:02

So the... Exactly, exactly right.

Matt Godbolt

09:04

Yeah.

Ben Rady

09:04

But it is exactly this thing of the tests not just standing in for the caller of the code, as they usually do,

Matt Godbolt

09:11

Right.

Ben Rady

09:11

but they are standing in for the sort of tomorrow you, who's a very responsible person and wants to know what their metrics are and what their logs are and wants to make sure that they're correct. And then you can also use both of those dimensions of observability to understand what your code is doing and verify that it is correct, right? The tests can operate on both of those dimensions at the same time.

Matt Godbolt

09:33

Yeah. Right.

Ben Rady

09:33

Mm-hmm.

Matt Godbolt

09:33

I mean, who among us hasn't written that warning statement like "this is weird"? And then, you know, your test coverage says, hey, you never hit the "this is weird" log line. And you're like, oh, I should write a test for it. But realistically speaking, what am I going to do? All it does is log "this is weird".

Ben Rady

09:49

Right, right.

Matt Godbolt

09:49

And I'm sure you've done this before. With most logging systems, or certainly the ones that I've interacted with, you have a test fixture that can capture the log. So you can write the test, but your assertion is something weak, like assert "this is weird" in captured.log.

Ben Rady

10:05

Yeah. Right, right.

Matt Godbolt

10:06

And that's better than a kick in the teeth, but it is not ideal. And certainly in terms of the textual matching, you know, it makes your test quite brittle.

Ben Rady

10:18

Right.

Matt Godbolt

10:18

But if you can have a structured log, so like I think we have talked about structured logging before, but do you want to just give us a quick recap of what you think of or what right now, in the middle of it all, what you think of as structured logging?

Ben Rady

10:28

Yeah. I mean, I grant that people have differing takes on this and I think you can do it in different ways, but I think that if I were to try to summarize all of the different approaches that I've seen that have been called structured logging, it is kind of what you alluded to earlier.

10:42

It is treating your logs as a stream of events, right? Sometimes multiple streams of events. You can think of the info logs as one stream, the error logs as a separate stream, and the warning logs as another stream. Or you can mush them all together and have a heterogeneous thing. But the basic idea is that you are not going to think of your logs as "I'm just puking some text out to standard error or standard out".

11:06

It is: no, there's a stream of events that is coming out of my system. And I can turn those into human-readable logs if I want, but I can turn them into whatever I want, because I'm a wizard and I have programming skills and I can transform a stream of events into anything. And so it solves a number of problems. And one of them is this case of making sure that you are actually capturing the information in your logs that you think you are.
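One way to picture logs as a stream of events is the following Python sketch. `EventLog`, `check_price`, and the `price_rejected` event are invented for this example and are not from any particular logging library. Events are kept as dictionaries, rendering to text is a separate step, and the test asserts on fields instead of matching substrings.

```python
from typing import Any, Dict, List


class EventLog:
    """Keeps log entries as structured events; text rendering is optional."""

    def __init__(self) -> None:
        self.events: List[Dict[str, Any]] = []

    def emit(self, level: str, event: str, **fields: Any) -> None:
        self.events.append({"level": level, "event": event, **fields})

    def render(self) -> List[str]:
        """Turn the event stream into human-readable lines."""
        lines = []
        for e in self.events:
            extras = " ".join(
                f"{k}={v!r}" for k, v in e.items() if k not in ("level", "event")
            )
            lines.append(f"{e['level'].upper()} {e['event']} {extras}".rstrip())
        return lines


def check_price(log: EventLog, symbol: str, price: float) -> bool:
    """Hypothetical business rule: reject non-positive prices and say why."""
    if price <= 0:
        log.emit("warning", "price_rejected", symbol=symbol, price=price)
        return False
    return True
```

Because the test compares whole events, rewording the human-readable rendering does not break it, which is the brittleness the substring assertion suffers from.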

Matt Godbolt

11:35

Right.

Ben Rady

11:35

Another one is this sort of case of: well, how do I make sure that we are responding correctly to a situation in which I want to do nothing? And in fact, the thing that kicked off this whole conversation 10 minutes ago was me writing a test for a situation where I was skipping a trade that I wanted to ignore intentionally because it was being replayed, right? It was like, oh, I want to make sure this is idempotent.

Matt Godbolt

11:57

Right.

Ben Rady

11:58

We've seen this trade already. I don't want to publish it again. So the correct action is to do nothing, right? Now, in that case, I was making an assertion about a metric, but you could easily imagine that that could also be a log statement. And testing that nothing has happened is a very important thing to be able to do, right?

Matt Godbolt

12:16

Yes. And more importantly, discriminating between "nothing has happened because I processed the event correctly and determined that nothing should happen" and "you didn't call the process event function at all in your test".

Ben Rady

12:29

Right. Yep, exactly.

Matt Godbolt

12:29

Therefore, nothing happened. Right. So you can distinguish them when the nothing that happened is actually something that did happen. The something was: I bumped a metric saying ignored_events++, or I logged "warning: this event was skipped because it's a replay", or whatever it is that you've done. Yeah, that makes a lot of sense there. It certainly lets you sleep at night a bit more comfortably because, you know, again,

12:56

How many times have we written tests where you realize this test is passing and then you're scratching your head like, wait, it's not being run, is it? I've misspelled test as "tset".
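The replayed-trade case can be sketched in Python like this. `TradePublisher` and the `published_events` counter are hypothetical names, and `ignored_events` borrows the counter name Matt uses. Doing nothing for a replay still bumps a counter, so a test can tell "correctly skipped" apart from "never ran".

```python
from collections import Counter
from typing import List, Set


class TradePublisher:
    """Publishes each trade once; replays are skipped but still counted."""

    def __init__(self) -> None:
        self.metrics: Counter = Counter()
        self.published: List[str] = []
        self._seen: Set[str] = set()

    def process(self, trade_id: str) -> None:
        if trade_id in self._seen:
            # The correct action is to do nothing, and to record that we did.
            self.metrics["ignored_events"] += 1
            return
        self._seen.add(trade_id)
        self.published.append(trade_id)
        self.metrics["published_events"] += 1
```

A test that replays a trade asserts both that nothing new was published and that `ignored_events` went up by one. If the test forgot to call `process` at all, the first assertion would still pass but the second would fail.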

Ben Rady

13:04

Yeah, right. Right. Yes.

Matt Godbolt

13:05

And now my system that looks for only the word test is not actually running any of these files at all. Right.

Ben Rady

13:11

Right, right. The test where you can comment out all of the code that you thought you were testing and the test still passes because there's no assertion in it.

Matt Godbolt

13:18

Yeah.

Ben Rady

13:18

Right. It's just like run some code and hope an exception doesn't happen. Right. Like those are, those are very unfulfilling tests.

Matt Godbolt

13:24

Right. This gives you a way of measuring some of those types of events, or quantifying them, and gathering confidence that you are actually doing the thing that you thought you were doing.

Ben Rady

13:25

Yeah, yeah, yeah. Another thing that you can do with structured logging, which has another flavor of this, is: you have these moments sometimes where you're trying to test something and you're like, part of me just wants to reach into the center of this class and pull out this state. But I don't really want to do that because that's going to break the encapsulation of the class, right? Like,

14:00

you know, I want to be able to refactor this code. I want to be able to change things, certain things about this code without having to change the tests, because that's what refactoring is.

Matt Godbolt

14:09

Thank you.

Ben Rady

14:09

And I don't want to reach into the guts of this class, because that'll make my test less valuable and make it so that I can't refactor. But I really want to know what this value is. And so one of the things that you can do with structured logging, which I think is really interesting, is that it gives you a conduit to more carefully and selectively pull pieces of information out of the internals of a class in a way that doesn't expose all of the guts. It just exposes the one little piece of information that you want.

14:38

And the example of this is like, you're gonna have a log statement that says like the queue size is five, right? Well, it's like, I don't wanna reach into the guts of the class and check to see what the queue size is. But like in the instances where it's important to log what the queue size is, I can use that as a way to confirm my suspicions about what it should be, right?
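A sketch of the queue-size example in Python, with made-up names (`Worker`, `job_queued`, `queue_size`). The class hands structured events to whatever sink it is given, and the test reads the queue size from the event instead of reaching into the private `_queue`.

```python
from collections import deque
from typing import Any, Callable, Deque, Dict

Event = Dict[str, Any]


class Worker:
    """Queues jobs internally and reports the queue size as a structured event."""

    def __init__(self, emit: Callable[[Event], None]) -> None:
        self._queue: Deque[str] = deque()  # private: tests should not touch this
        self._emit = emit

    def submit(self, job: str) -> None:
        self._queue.append(job)
        self._emit({"event": "job_queued", "job": job, "queue_size": len(self._queue)})
```

In a test the sink can simply be `events.append` on a list. In production the same events could go to a real logging backend, so the hook is not test-only.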

14:57

And you can go another level deep with this if you want to. And I have, and I don't know if it's generally a good idea, but I think it's an interesting thing to talk about, which is: when you have structured logs and you can find a way to do object serialization in those structured logs in a way that's not totally insane, or sometimes just mildly insane, you can have complete objects that come out of there and go into your logging system and can be reconstituted later.

Matt Godbolt

15:27

Right. Mm-hmm. Mm-hmm.

Ben Rady

15:28

And the one place where I think I have seen this done the least insanely is with exceptions, right? You have part of your logging system where, if an exception occurs,

15:38

You have a reasonably high confidence serialization system that allows you to capture that exception, maybe with some special cases in there and make sure it's not too big or contains like a reference to like a, you know, ephemeral resource or some other thing like that, but you have some confidence where you can turn it into something.

15:57

And then when you're troubleshooting that error later, you can reconstitute it. And I think that is a more obvious way to do this kind of thing. But I could also see situations in which that that structured logging allows you to sort of, in a in a less brittle way, in a less encapsulation violating way, check to make sure that the internal state of things is what you expect it to be without creating direct dependencies from the tests into the internal parts of the code.
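A rough Python sketch of capturing an exception as a structured, size-limited log event and reading it back later. The function names and field names are hypothetical, and what gets reconstituted here is a description of the exception (type, message, traceback text), not the live object, which is a deliberately conservative choice.

```python
import json
import traceback
from typing import Any, Dict


def exception_to_event(exc: BaseException, max_len: int = 2000) -> Dict[str, Any]:
    """Describe an exception using only JSON-safe values, capped in size."""
    tb_text = "".join(traceback.format_exception(type(exc), exc, exc.__traceback__))
    return {
        "event": "exception",
        "type": type(exc).__name__,
        "message": str(exc)[:max_len],
        "traceback": tb_text[-max_len:],  # keep the tail: the raise site is last
    }


def to_log_line(event: Dict[str, Any]) -> str:
    return json.dumps(event, sort_keys=True)


def from_log_line(line: str) -> Dict[str, Any]:
    return json.loads(line)


def demo_round_trip() -> Dict[str, Any]:
    """Raise, capture, serialize, and reconstitute one exception."""
    try:
        raise ValueError("queue overflow")
    except ValueError as exc:
        line = to_log_line(exception_to_event(exc))
    return from_log_line(line)
```

Sticking to strings keeps references to sockets, file handles, or other ephemeral resources out of the log, at the cost of not getting the original object back.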

Matt Godbolt

16:22

And I think that's a special case of what I was talking about right at the beginning, which is to say, you know, again, the the internal state in this instance is whether the cache was hit or not.

Ben Rady

16:30

Mm-hmm.

Matt Godbolt

16:30

And it's just a way of exposing that internal state without either putting it in the face of the caller or having to add a whole metrics subsystem specifically to that cache that says "was the last thing a hit?" and all those kinds of things.

Ben Rady

16:42

Yeah. Yeah.

Matt Godbolt

16:42

So it's a really nice way of, like, side-channel attacking the internal state of your class, and slightly better than the other sort of... I guess, is it an anti-pattern?

Ben Rady

16:55

Yeah. Yeah.

Matt Godbolt

16:55

Let me see what you think. How many times have you written something like getCacheForTesting, a function called that? You look at it and you say, this uses the same functionality as the real cache function, but it does return that tuple with all this extra information about it.

Ben Rady

17:06

Yeah, yeah. Yeah, yeah.

Matt Godbolt

17:17

And you look at it and you say, I hope that I don't have a bug in the untested cache get function that isn't in my for-testing cache function. And you kind of look at it and you go, it's three lines, I think it's fine. Or sometimes you can implement one in terms of the other and hope, fingers crossed, that the optimizer throws away the fact that in your non-test version you always discard that side channel, and therefore it all nets out. That's a nice way of doing it, but...
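The "implement one in terms of the other" option might look like this in Python. `StatusCache` and `get_with_status` are illustrative names. Python has no optimizer to discard the unused status, so this sketch shows only the shape of the design, not the performance argument Matt makes for C++.

```python
from typing import Callable, Dict, Tuple


class StatusCache:
    """Cache with one real code path and two views onto it."""

    def __init__(self, compute: Callable[[int], int]) -> None:
        self._compute = compute
        self._values: Dict[int, int] = {}

    def get_with_status(self, key: int) -> Tuple[int, bool]:
        """Return (value, was_hit). The 'for testing' view."""
        if key in self._values:
            return self._values[key], True
        value = self._values[key] = self._compute(key)
        return value, False

    def get(self, key: int) -> int:
        """The everyday view: same code path, status discarded."""
        value, _was_hit = self.get_with_status(key)
        return value
```

Since `get` is a one-line wrapper, a test of `get_with_status` exercises the same logic that production callers hit through `get`.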

Ben Rady

17:49

Yeah, yeah. Yeah.

Matt Godbolt

17:50

Yeah, so, I mean, I certainly think that anytime I write something that has "xxForTesting" in it, I do die inside a little bit, but sometimes it's a necessary evil if I haven't got this.

Ben Rady

18:00

Yeah, it's it's not great, but if I have to choose between adding a little bit of extra complexity to my code and not being confident that it works, I'm going to go with a little complexity is worth knowing that it actually works.

Matt Godbolt

18:11

Right.

Ben Rady

18:12

But if there's a way to do both of those things at the same time, or do it in a way where that sort of surface area of the "for testing" is not only smaller, but also useful for other things, then I think that's a better way to do it.

Matt Godbolt

18:25

Right. In which case it loses the "for testing" at that point. Right. It just becomes, hey, this is a...

Ben Rady

18:30

Yeah.

Matt Godbolt

18:30

A window into this class that is useful.

Ben Rady

18:32

Yeah, exactly.

Matt Godbolt

18:32

Yeah. And that is exemplified by metrics and structured logs.

Ben Rady

18:36

Mm-hmm.

Matt Godbolt

18:36

Yeah. Yeah.

Ben Rady

18:37

Mm-hmm. Mm-hmm.

Matt Godbolt

18:37

No, that's cool.

Ben Rady

18:38

Yeah.

Matt Godbolt

18:39

Well, that's kind of all we had. I mean, I was going to say, that's what we had planned, but we had no plans. We were just talking and then we were like, we should probably record this.

Ben Rady

18:48

Yeah, we had zero plan.

Matt Godbolt

18:49

So here we are.

Ben Rady

18:52

Yeah. We could talk about metrics some more. I have lots of ideas on metrics and good ways to use metrics.

Matt Godbolt

18:56

Well, let's do that. Let's do that then. Yeah, I didn't want it to like peter out awkwardly here as it was.

Ben Rady

19:01

No, so one thing that I debate a lot is, I would say, the difference between push and pull metrics.

Matt Godbolt

19:09

Yeah.

Ben Rady

19:10

So let's contrast two systems in particular as examples here.

Matt Godbolt

19:13

Hmm.

Ben Rady

19:14

So one of them that is kind of top of mind for me recently, actually, is a system like StatsD, right? The way StatsD works is you have a centralized metrics collection service.

19:30

And there are clients that do this for you, but just describing how the protocol works: when you have a metric, like a counter that you want to increment, or maybe a gauge that's like, yeah, the disk is at 96%, then you create a very small human-readable text snippet, which is, I think, the metric name and then a pipe and then a value and then a pipe and then the type, whether it's a gauge or a counter or something like that. I think that's roughly the StatsD thing.

Matt Godbolt

19:30

Hmm.

Ben Rady

20:00

And then you put that in a datagram and you send that datagram off to your central collection server and you have no idea whether it got there, but

Matt Godbolt

20:08

Right. And you mean literally a network packet, a single fire-and-forget network packet: UDP.

Ben Rady

20:15

Correct. Yep. Yes. A UDP datagram just goes, whoop. And the idea is that this is really useful for metrics where you don't want to block the sender, right? You don't want the sender to be like, I'm waiting to send this metric somewhere.
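For reference, the StatsD line format separates the name from the value with a colon and the value from the type with a pipe, for example `disk.used_pct:96|g` for a gauge or `trades.skipped:1|c` for a counter. Here is a minimal Python sketch of the fire-and-forget send using only the standard `socket` module. The metric names and helper functions are made up, and `demo_round_trip` uses a throwaway local receiver rather than a real StatsD server.

```python
import socket
from typing import Tuple


def statsd_line(name: str, value: int, metric_type: str) -> bytes:
    """Format one metric as a StatsD text line, e.g. b"trades.skipped:1|c"."""
    return f"{name}:{value}|{metric_type}".encode("ascii")


def send_metric(sock: socket.socket, address: Tuple[str, int],
                name: str, value: int, metric_type: str) -> None:
    # Fire and forget: no connection, no acknowledgement, no retry.
    sock.sendto(statsd_line(name, value, metric_type), address)


def demo_round_trip() -> bytes:
    """Send one counter to a local stand-in receiver and return what arrived."""
    receiver = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    receiver.bind(("127.0.0.1", 0))  # port 0 lets the OS pick a free port
    receiver.settimeout(2.0)
    sender = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        send_metric(sender, receiver.getsockname(), "trades.skipped", 1, "c")
        data, _addr = receiver.recvfrom(1024)
        return data
    finally:
        sender.close()
        receiver.close()
```

The sender never learns whether anything was listening, which is exactly the trade-off discussed here: cheap for the sender, with no delivery guarantee.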

Matt Godbolt

20:29

Right.

Ben Rady

20:29

But if it doesn't get to where it's going, it's maybe not the end of the world, right?

Matt Godbolt

20:35

Right.

Ben Rady

20:36

So that is sort of one style. And there are other ways to maybe make that a little bit more reliable.

Matt Godbolt

20:42

Yep. Yep.

Ben Rady

20:43

And certainly if you use gauges and things like that more frequently than counters, you can get pretty reliable results out of that. But one of the great advantages of that is that the receiver doesn't need to know that the senders exist. You can have a situation where a new system comes up and it starts publishing its metrics and the receiver is just like, oh, I guess I have a new thing that I need to worry about.

Matt Godbolt

21:04

Yep.

Ben Rady

21:04

Cool.

Matt Godbolt

21:04

Right, it just receives a datagram from someone else and goes, new client, fantastic, right.

Ben Rady

21:08

Right.

Matt Godbolt

21:08

And then there's exactly one piece of configuration, which is, in all of the clients, where the aggregator is, where the one receiver is. Got it.

Ben Rady

21:18

Yep. Yes. Yes. Yes.

Matt Godbolt

21:18

Okay, so presumably that's the push case. You're pushing out.

Ben Rady

21:22

Yeah. And then you have systems like Prometheus, where the way Prometheus works is you've got an endpoint. I think it's usually an HTTP endpoint. I think it has to be an HTTP endpoint, actually. Could be wrong about that. But you've got some endpoint in your program that is being monitored, right, that is being observed. And the Prometheus scraper reaches out to you on some periodic basis and says, give me your metrics, right?

Matt Godbolt

21:46

Mm-hmm.

Ben Rady

21:46

And so internally, you can have a thing where it's not blocking the hot loop of any part of your execution. It's just stashing the metrics in memory, to be available the next time the scraper comes around. But it's just taking this sort of periodic snapshot of what is going on with the metrics, right?
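The pull model can be sketched with Python's standard library alone. This is an illustration, not the official Prometheus client: `COUNTERS`, `render_metrics`, `MetricsHandler`, and `scrape_once` are invented names, and the output only imitates the simplest part of the Prometheus text format (a `# TYPE` line followed by `name value`). The application updates an in-memory counter, and an HTTP handler renders a snapshot whenever someone asks for `/metrics`.

```python
import threading
import urllib.request
from collections import Counter
from http.server import BaseHTTPRequestHandler, HTTPServer

COUNTERS: Counter = Counter()  # the hot path only touches this in-memory dict


def render_metrics() -> str:
    """Render a snapshot of all counters, one TYPE line and one sample each."""
    lines = []
    for name, value in sorted(COUNTERS.items()):
        lines.append(f"# TYPE {name} counter")
        lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self) -> None:
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics().encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format: str, *args) -> None:
        pass  # keep the example quiet


def scrape_once() -> str:
    """Start a local endpoint, pull from it once like a scraper, then stop it."""
    server = HTTPServer(("127.0.0.1", 0), MetricsHandler)
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    try:
        port = server.server_address[1]
        url = f"http://127.0.0.1:{port}/metrics"
        with urllib.request.urlopen(url, timeout=2) as response:
            return response.read().decode("utf-8")
    finally:
        server.shutdown()
        server.server_close()
```

Incrementing `COUNTERS["cache_hits_total"]` costs a dictionary update, and the rendering work happens only on the scraper's schedule.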

Matt Godbolt

22:04

Right. Right.

Ben Rady

22:05

Now, I'm not even talking about like the actual metric collection internally, because there's like a billion different ways to do that.

Matt Godbolt

22:12

Right.

Ben Rady

22:12

I'm kind of just talking about like, okay, assume you have a program that's got application level metrics. How does it get to somewhere else other than that machine?

Matt Godbolt

22:19

Right.

Ben Rady

22:20

And I think these are the two basic ways that I've seen people do it.

Matt Godbolt

22:25

Right. Absolutely. Push and pull. I mean, we've talked, I think, about various UDP-based systems before. We had one several companies ago that I know you worked on, which was a metric collection system that was more of the UDP datagram-based thing. Obviously, StatsD is an example of that. It has a lot of benefits. You mentioned the configuration is straightforward. It's non-blocking, for some definition of non-blocking, in the publisher.

Ben Rady

22:51

Yeah, yeah, right.

Matt Godbolt

22:51

I mean, sending a UDP datagram is kind of a heavyweight activity in some worlds, but it's straightforward, relatively speaking, and certainly the StatsD format is very straightforward, so you blast it off. Obviously the drawbacks are it might not get there, which reminds me of a joke, which I'd tell you, but, you know, it's about UDP. I don't think you'd get it.

Ben Rady

22:55

Yeah. Mm-hmm. Mm-hmm.

Matt Godbolt

23:21

It might not get there. If the collector is down or misconfigured, you'll never know. You're just sending it out into the ether, literally.

Ben Rady

23:31

Right.

Matt Godbolt

23:31

And there could be a bottleneck if you're generating a ton of statistics back to back.

Ben Rady

23:39

Yeah.

Matt Godbolt

23:39

If you try to update your counter on every single update, then you're sending a blast of relatively heavyweight packets at a machine, and that machine has to be able to deal with all of that data. And in fact, you might back up trying to send it. So those are the drawbacks, but it's very, very appealing because also, if you're a very short-lived application, like a command line client, you might not live long enough to be scraped by a different system.

Ben Rady

24:05

Right, right, right, right, right.

Matt Godbolt

24:07

Right, that's that. Then let's talk about the pull-based systems. And let me just read that back to you. So in this instance, somehow some centralized system has to know about all of the places that have metrics.

Ben Rady

24:21

Yeah.

Matt Godbolt

24:22

And then it is responsible for connecting to them in turn or however, and saying, give me a snapshot of your metrics, please, over HTTP or TCP or something like that.

Ben Rady

24:32

Mm-hmm.

Matt Godbolt

24:33

So obviously the pro points there are that the collection system is responsible for the period upon which it is collecting these statistics. So it could be like, well, I can do it once a second or once a minute or once an hour. It doesn't matter, as long as I can configure that in one place. And you're not being swamped by millions of intermediate values, because you only care about it on the cadence that you care about. The drawback is: how do you find all your clients?

Ben Rady

25:05

Right.

Matt Godbolt

25:05

That sounds relatively complex. Now I've got another problem. So yeah, okay. I've just read those back to you, but obviously you brought this subject up because you probably have opinions, and I'd be interested in your opinions on those things.

Ben Rady

25:15

I do have opinions. I do want to make the point though, by the way, about sending the datagrams: you don't have to do that inline. Just as with Prometheus you're going to store your metrics in memory and then they get scraped, you can also store your metrics in memory and then send them out with some cadence over UDP, right? You can do them inline.

Matt Godbolt

25:31

That makes sense, yeah.

Ben Rady

25:32

You don't have to, right?

Matt Godbolt

25:34

Yeah, that makes sense. Yeah.

Ben Rady

25:35

Yeah. But I am a huge fan. One of the scary bedtime stories that finance dads tell their kids is the story of Knight Capital and how a trading firm lost hundreds of millions of dollars in 45 minutes, something like that.

Matt Godbolt

26:01

Oh. Yes.

Ben Rady

26:02

And it's a terrible story. And it's funny, because you used to work, and I still work, with somebody who is actually very familiar with this, who was directly involved with some of the companies that cleaned up afterwards. Anyway.

Matt Godbolt

26:16

Very familiar.

Ben Rady

26:18

And it's funny how much of this has turned into sort of lore, you know,

Matt Godbolt

26:25

Folklore, yeah.

Ben Rady

26:26

the game of telephone has been played many times. But it is nonetheless true that one of the problems that happened there is that they had software running that they did not realize was running, right? They didn't realize that it was doing what it was doing.

Matt Godbolt

26:44

Yeah.

Ben Rady

26:45

And I generally feel like I sleep better at night knowing that there's a central server, and everything that is running is at least trying to publish to that central server. And if something comes up unexpectedly, there's at least a chance, probably a very good chance, that those messages will suddenly appear on that central server, and it will at least have the ability to detect that something is running that should not be running. Right.

Matt Godbolt

27:16

Mm-hmm.

Ben Rady

27:17

You can kind of do a little hybrid of both of these things. If you want, you can have the central server reach back out to the sending clients. It can even give them an aggregated ACK, where it's like, yeah, I've received 300 messages from you in the last minute or something, just so you know I'm actually receiving your messages. You can do things like that.

27:40

But the thing that really makes me sleep well at night with a lot of these systems is having a way so that if someone were to start a piece of software on their desktop or on some test server or somewhere else, it would at least try to tell someone about it, as opposed to, well, it hasn't been added to the central configuration, so there's no way we could ever know.

Matt Godbolt

28:04

Got it. Yeah. I mean, there are different ways of solving that problem. Obviously one way, because, you know, again, if you try and reach out to a server, but it doesn't come back to you, you still have this problem, right?

Ben Rady

28:16

There are. Mm-hmm.

Matt Godbolt

28:16

You know, and in the finance world that we're talking about, we have very strict network segregation, which means that you might not be able to send the ping to the central servers to say, hey, I'm a production machine.

Ben Rady

28:29

Mm-hmm.

Matt Godbolt

28:29

So there are issues of that nature.

Ben Rady

28:31

Yep.

Matt Godbolt

28:31

And so I feel that there's always an incomplete part to this. There's always a bit of a blind spot here. But in general, a service discovery mechanism that's robust to these is useful whether you're pushing information to a centralized server or being scraped by some centralized server.

Ben Rady

28:37

Yep. Yeah.

Matt Godbolt

28:55

And that seems to me the bigger thing here. You're saying that if you're sending these periodic metric pings to some system, you could notice that something was alive and doing something unexpected. That's kind of begging the question: why are you using your metrics system to determine the liveness of software? Why don't we have a software liveness indicator? Maybe you are talking about that as well here, but that's, you know,

Ben Rady

29:17

Yeah, I mean, I'm talking about this in the context where everything is already broken, right? Both of these systems work great when everything works great. The question is, when they break, what are some of the different ways in which they break? Oh, sure.

Matt Godbolt

29:31

Right.

Ben Rady

29:31

And you're absolutely right that network partitioning is one way in which the push-based model, the StatsD model, doesn't save you, because you have a test server that's configured and running in prod and it can't reach the test network.

Matt Godbolt

29:44

But then there's another potential solution here, which is: if we don't use the fire-and-forget, single-UDP-datagram thing and you instead have a TCP connection, then obviously you get positive confirmation that you are talking to the central server.

Ben Rady

30:02

Mm hmm. Yeah. Yeah.

Matt Godbolt

30:02

You get your ticket from it that says, yes, you're okay to run or whatever, those kinds of things.

Ben Rady

30:07

Yeah, yeah,

Matt Godbolt

30:07

But then you are sort of solving a similar problem to, and excuse the dog, a similar problem to... and now I can't even remember what the thing's called. What is it we used for service discovery at the old company?

Ben Rady

30:22

Consul.

Matt Godbolt

30:23

Consul. Yeah, which is, you know,

Ben Rady

30:25

Yeah.

Matt Godbolt

30:26

Chubby, in Google terms, I think is the equivalent. It's a centralized lock manager, but it's also sort of a small amount of shared state between things. And so people can go in and... now, obviously that's still opt-in: you still have to be part of the Consul cluster, or your system has to be registered with the Consul cluster, in order for it to be noticed.

Ben Rady

30:44

Mm-hmm.

Matt Godbolt

30:44

But that's what it's supposed to be. That's one of the things it's meant to be there for: to say, hey, find me all the things that say they are metrics producers, or everything that says I'm a web server, or that kind of thing. And so that feels like a good solution. But just like my network partition example, you can still break it, because if you're not in the Consul cluster, then you're in a partitioned world of your own, right?

Ben Rady

30:52

Mm-hmm.

Matt Godbolt

31:08

And so, yeah, there's not an easy solution to any of these things.

Ben Rady

31:11

Yeah, yeah.

Matt Godbolt

31:11

But I do wonder if conflating metrics gathering with this is a good thing, whether or not... you know, you just mentioned in passing that this is a useful thing to be able to do. It certainly is a surprise if you get a...

Ben Rady

31:25

Yeah, this is one of those things where this is not a real solution to the problem that you're talking about. It's like, we have A and B. We're trying to choose between A and B. And I'm like, I think I like A better than B. And it's like, why?

Matt Godbolt

31:33

Yeah. Yeah.

Ben Rady

31:38

Well, it's like, well, because in certain situations, it'll solve this problem. It's like, well, but in other situations, it won't. It's like, yeah, but that's not why we're talking about A and B. I'm just trying to pick between two options.

Matt Godbolt

31:46

Yeah.

Ben Rady

31:46

Right.

Matt Godbolt

31:46

No, that's really interesting. Yeah, yeah. and

Ben Rady

31:48

So it's just it.

Matt Godbolt

31:49

Yeah, no, no, I got it. And ultimately, it's almost like, what if, for metrics gathering, you were to do the hybrid solution where, you know, instead of proactively

32:00

being scraped, you just connect into the central server and then it asks you. So it's still push and pull: you connected into it, so it knows that you exist, and therefore service discovery is: if you telnet to port 8000 of the central machine, then we care about what information you have. But you get scraped by it saying, okay, give me what you've got.
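
A toy sketch of that hybrid, assuming a made-up line protocol (`SCRAPE` as the request, space-separated `name=value` pairs as the reply). Both function names are hypothetical; the point is only that the client dials in (discovery by connecting) while the server decides when to ask (pull).

```python
import socket
import threading


def serve_one_scrape(listener: socket.socket, results: list) -> None:
    """Central side: accept one client, ask for its metrics, store the reply."""
    conn, _peer = listener.accept()
    with conn, conn.makefile("r") as reader:
        conn.sendall(b"SCRAPE\n")
        results.append(reader.readline().strip())


def connect_and_answer(host: str, port: int, metrics: dict) -> None:
    """Client side: dial in, wait to be asked, then answer with a snapshot."""
    with socket.create_connection((host, port), timeout=5) as conn:
        with conn.makefile("r") as reader:
            if reader.readline().strip() == "SCRAPE":
                line = " ".join(f"{k}={v}" for k, v in sorted(metrics.items()))
                conn.sendall(line.encode() + b"\n")
```

A real version would keep the connection open and answer repeated requests; a dropped connection then doubles as a liveness signal for both ends.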

Ben Rady

32:00

Mm-hmm. Mm-hmm. Yeah. Yeah. yeah Mm-hmm.

Matt Godbolt

32:25

But obviously that doesn't work over HTTP, which obviously has its conveniences.

Ben Rady

32:29

Mm-hmm.

Matt Godbolt

32:29

Certainly when I'm a developer, it's useful to be able to hit my own web server. And in fact, some of the tests I wrote involved scraping back over the HTTP port to check that I was actually exposing the metrics that I thought I was exposing when I was writing my own Prometheus endpoint.
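
A small sketch of that kind of test, using only the Python standard library rather than any Prometheus client. The `METRICS` dictionary, handler and helper names are hypothetical, and the output is simplified to bare `name value` lines in the style of the Prometheus text format.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

METRICS = {"requests_total": 3}  # hypothetical in-memory metric values


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self) -> None:
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = "".join(f"{k} {v}\n" for k, v in sorted(METRICS.items())).encode()
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args) -> None:
        pass  # keep test output quiet


def start_server() -> HTTPServer:
    server = HTTPServer(("127.0.0.1", 0), MetricsHandler)  # port 0 picks a free port
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def scrape(port: int) -> dict:
    """Fetch /metrics over real HTTP and parse 'name value' lines."""
    conn = http.client.HTTPConnection("127.0.0.1", port, timeout=5)
    try:
        conn.request("GET", "/metrics")
        text = conn.getresponse().read().decode()
    finally:
        conn.close()
    return {name: float(value) for name, value in (ln.split() for ln in text.splitlines())}
```

A test then starts the server, calls `scrape(server.server_address[1])`, and asserts on the returned dictionary, which exercises the same path a real scraper would use.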

Ben Rady

32:44

Yeah.

Matt Godbolt

32:45

So I think, yeah.

Ben Rady

32:46

Mm-hmm.

Matt Godbolt

32:47

Yeah, and to say that the Knight Capital legend was purely that, and not that you did, would miss that there were so many other aspects to it. It was very much the Swiss cheese: eventually all the holes lined up and one of the things got through.

Ben Rady

33:07

Mm-hmm.

Matt Godbolt

33:08

But yes, metrics. Very much like metrics.

Ben Rady

33:10

Yeah. Well, the real thing here, bringing this back to observability in general a little bit, is, and I do this in the systems that I have: what you probably want to do in a system that has discovered that it is no longer observable is to stop.

Matt Godbolt

33:30

Yeah.

Ben Rady

33:31

Because it's sort of like the last gasp of like, someone pay attention to me.

Matt Godbolt

33:36

Yeah. Yep.

Ben Rady

33:36

Right? And so you want to do that in multiple situations. You probably want to have something like that at startup, like registering with some sort of central discovery service or sending out some sort of message saying, hey, I'm starting up.

Matt Godbolt

33:51

Yep.

Ben Rady

33:51

And if you don't get an acknowledgement that someone heard you, be like, okay, well, then I guess I'll stop. Having some mechanism to do that is a great sort of safety mechanism.
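
One way this startup check might look, as a sketch under assumptions: the `HELLO`/`ACK` line protocol, the function names and the two-second timeout are all invented for illustration, not taken from any real discovery service.

```python
import socket


def announce_startup(host: str, port: int, name: str, timeout: float = 2.0) -> bool:
    """Send 'HELLO <name>' over TCP and report whether an 'ACK' came back."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as conn:
            conn.sendall(f"HELLO {name}\n".encode())
            return conn.recv(16).startswith(b"ACK")
    except OSError:
        # Refused, unreachable or timed out: nobody heard us.
        return False


def start_or_stop(host: str, port: int, name: str) -> None:
    """Refuse to run if nobody acknowledges that we are starting up."""
    if not announce_startup(host, port, name):
        raise SystemExit(f"{name}: startup was not acknowledged, refusing to run")
```

Calling `start_or_stop` first thing in `main` means a copy launched on a desktop or a partitioned test box exits instead of running unobserved.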

Matt Godbolt

34:00

Mm-hmm. Along with heartbeating, to make sure that everyone on both ends is still actually there.

Ben Rady

34:05

Heartbeating.

Matt Godbolt

34:05

Like, are you still there? And I don't just mean TCP-level stuff.

Ben Rady

34:08

Yep.

Matt Godbolt

34:08

I mean, actual application level.

Ben Rady

34:10

Yeah. Application level heartbeats.

Matt Godbolt

34:10

Like, are you there?

Ben Rady

34:11

Yes.

Matt Godbolt

34:11

Yes. Okay. I'm back. Yes. And just that kind of stuff. That's always a good thing.

Ben Rady

34:15

Yeah, yeah. And then one of the more interesting ones, and I've had some debates with people about this, but I still think this is the way that I do it, is if you have a system that encounters a fault. So going back to our structured logging: I've logged an error or an exception, and I try to send that somewhere to notify somebody, right?

Matt Godbolt

34:36

Yeah. Yep.

Ben Rady

34:36

What happens when /that/ fails?

Matt Godbolt

34:39

yeah

Ben Rady

34:40

I think the right thing to do is a certain amount of retries. Keep retrying, but if you've retried for some period of time, eventually you probably just want the system to stop. Now, that's not universally true for every single system. There are things where it's like, no, this just needs to keep trucking, even if it's having failures. But all other things being equal, my base argument is: if you have a system that has an error, fine.

Matt Godbolt

35:06

yeah

Ben Rady

35:06

Errors happen. If you have a system that has an error and tries to report its error and it can't, OK, it should keep retrying. But at a certain point, it should just exit.
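
A minimal sketch of that retry-then-exit policy. The `send` callable, the attempt count and the delay are assumptions chosen for illustration; `exit_fn` is injectable only so the behaviour can be tested without killing the test process.

```python
import sys
import time
from typing import Callable


def report_or_exit(
    send: Callable[[str], None],
    message: str,
    attempts: int = 5,
    delay: float = 1.0,
    exit_fn: Callable[[int], None] = sys.exit,
) -> None:
    """Try to report an error; if every attempt fails, stop the process."""
    for attempt in range(1, attempts + 1):
        try:
            send(message)
            return  # someone heard us, carry on
        except OSError as exc:
            print(f"report attempt {attempt}/{attempts} failed: {exc}", file=sys.stderr)
            if attempt < attempts:
                time.sleep(delay)
    # Nobody can be told that something is wrong, so stop doing anything further.
    exit_fn(1)
```

Whether the final step should be a clean exit or something harsher depends on the system; as noted in the conversation, some things need to keep running even when they cannot report.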

Matt Godbolt

35:16

I would not disagree with you on that. I mean, just to remind the listener, though, that you and I come from a world of finance where there's a lot of regulatory stuff around.

Ben Rady

35:27

Yeah.

Matt Godbolt

35:27

If we can't log what we're doing, you know, again, Knight Capital type stuff, if we can't tell somebody that something is up, then the best course of action is to stop doing anything further.

Ben Rady

35:38

Mm-hmm.

Matt Godbolt

35:38

Log everything you can to disk and then kill the process and be done with it, and hope that gets someone's attention. Right: why are we not trading anymore? Oh, it turns out the process self-destructed. Why is that? Well, there's been a network split and it can't tell us that the position's out, you know, those kinds of things. And those are more defensible. But yeah, if my pacemaker can't log an error, then maybe I don't want it to stop. But, you know, yeah, obviously there are

Ben Rady

35:40

Mm-hmm. Right. Yeah. I don't want my home wifi router to turn off because it can't send logs to some place that I don't care about the logs for. Right.

Matt Godbolt

36:11

Yeah, exactly. So there are ways. But I think, even within the finance industry, you know, I've worked on desks where it is okay to not be up and running.

Ben Rady

36:25

Yeah.

Matt Godbolt

36:26

Like, it's not great.

Ben Rady

36:26

Yeah.

Matt Godbolt

36:27

You know, there are going to be some very long meetings where you have to explain yourself, but if you're not on and trading, the only thing is an opportunity cost. You weren't able to make money or whatever, and there are manual ways of trading out of positions and those kinds of things.

Ben Rady

36:43

yeah yeah

Matt Godbolt

36:43

But if you have obligations to an exchange or downstream clients, then maybe you have to limp on and say, look, it's better for us to continue to be able to provide this service, albeit disrupted. But I've never worked on a situation like that. So I'm always down with yes. Literally my C++ exception handling stuff is: log everything you can to disk and then kill -9 myself.

Ben Rady

37:09

yeah so

Matt Godbolt

37:09

There's no way we can carry on after this point, right? We are done and dusted. I don't care if the destructors don't run properly. Just kill the process at this point.

Ben Rady

37:19

huh

Matt Godbolt

37:20

And that's always okay. yeah

Ben Rady

37:21

Yeah. I tell you though, ah just to tie this back to testing, because why not? That's where we started.

Matt Godbolt

37:26

Why not?

Ben Rady

37:26

The one piece of code I've never really come up with a great way to test is the code that kills the program.

Matt Godbolt

37:33

So, at least in a C++ framework I'm familiar with, there is a death test. And it works by forking the process and then communicating between the two processes to make sure that this actually kills the process.

Ben Rady

37:47

Oh.

Matt Godbolt

37:47

Now, unfortunately, Unix being as complicated as it is, there's signal handling and there's like child-parent relations and you can still not always get it right.

Ben Rady

37:55

That's clever.

Matt Godbolt

37:55

But it's not a bad way of saying this should abort the process, right? Literally kill the process and be done with it.

Ben Rady

38:01

Mm-hmm.

Matt Godbolt

38:02

And you go, well, okay, I'll fork myself here. No snickering in the back. And the child process will do that. And then the parent process monitors to make sure that that's what happens, through some, you know, Unix domain thing.
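
The same fork-and-watch idea can be sketched in Python on Unix. This is not the C++ framework's implementation: `die_hard` and `killed_by` are hypothetical names, and the parent here simply inspects the child's wait status from `os.waitpid` rather than talking to it over a socket or pipe.

```python
import os
import signal


def die_hard() -> None:
    """Hypothetical last-resort handler: kill -9 ourselves, skipping all cleanup."""
    os.kill(os.getpid(), signal.SIGKILL)


def killed_by(fn, expected_signal: int) -> bool:
    """Run fn in a forked child; report whether that signal terminated the child."""
    pid = os.fork()
    if pid == 0:
        # Child: run the code under test. If it returns or raises instead of
        # dying, exit normally so the parent sees "not killed by a signal".
        try:
            fn()
        finally:
            os._exit(0)
    # Parent: wait for the child and inspect how it ended.
    _, status = os.waitpid(pid, 0)
    return os.WIFSIGNALED(status) and os.WTERMSIG(status) == expected_signal
```

A test can then assert `killed_by(die_hard, signal.SIGKILL)` without the test runner itself being killed. Forking a process that has other threads or open resources brings the complications mentioned above, so this is best kept to small, isolated tests.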

Ben Rady

38:16

Interesting.

Matt Godbolt

38:16

So you can write tests for these things. There's never an excuse not to write a test for something, he says, very well aware that I've just spent the last two weeks writing very limitedly tested code. But that's a whole other story for another time.

Ben Rady

38:22

Yeah. Yeah. Yeah. yeah yeah

Matt Godbolt

38:33

All right, friend.

Ben Rady

38:34

Yeah, that's probably a good place to call it.

Matt Godbolt

38:35

I think we should call it.

Ben Rady

38:36

Right.

Matt Godbolt

38:36

Yeah, this expanded from a, I have an idea, to 40 minutes worth of conversation, which is how it should be.

Ben Rady

38:43

Yeah.

Matt Godbolt

38:44

And I've enjoyed it.

Ben Rady

38:45

Right.

Matt Godbolt

38:45

But metrics are more useful than you might think. And you should keep them. And structured logging is always a choice too. So...

Ben Rady

38:53

Yeah, it's a choice. That's for sure.

Matt Godbolt

38:59

All right, friend.

Ben Rady

39:00

Cool.

Matt Godbolt

39:00

Until next time.

Ben Rady

39:06

Until next time.

Transcript source: Provided by creator in RSS feed
