Observability Engineering with Charity Majors

00:01

How'd you like to listen to .NET Rocks with no ads? Easy! Become a patron for just five dollars a month. You get access to a private RSS feed where all the shows have no ads. Twenty dollars a month will get you that and a special .NET Rocks patron mug. Sign up now at patreon.dotnetrocks.com. Hi, it's all right. Holy crap, this is a great room. It's got a lot of echo. Yes, we've certainly filled this space. He determined level of absolutely the blast radius.

01:03

Right, we are back in Portugal. I love Porto, I love Portugal. I was in the middle of town and I came across a restaurant, and the sign said churrascaria, which means grill, right? And then "Infante." I didn't know you guys ate babies over here. So no, no, no, not funny? Too soon? Infante. I guess Infante is the name of the area. Yeah, but it looked like "we eat our babies." Well done, nice on a skewer. I think it was W. C. Fields who said it was all

01:44

about the sauce. Yes, "I love children." Exactly. You got it. Oh, we're gonna have some fun today. We are going to have some fun. But first we have this little thing called Better Know a Framework. Roll that crazy music. Well, I've been waiting for this one because it happened a while ago. But if you don't know, I have a consultancy called App vNext, and we are the shepherds of an open source project called Polly. Anybody use Polly? How about a clap of hands?

02:27

Use Polly a lot? Well, Reverend Billy just walked in the room. Well, Polly just came out with a major update, version eight. And, believe it or not, it started with the .NET team, because the .NET team was basically looking at the source code and said, hey, we think we can improve the performance and the resource usage of Polly, but it's going to require some new interfaces and, you know, almost a complete rewrite. And so the rest of us said, yeah,

03:00

I'm sorry, the .NET team called you to say they want to make your project...? Yeah. I mean, what are you going to say? Nah? So it took a lot of meetings and a lot of understanding, and basically what they were able to do is, without you having to change any of your code that uses Polly, you will get the benefits of the new performance and resource allocation that's under the hood. But if you want to use the new models and the paradigms and the interfaces, you can do that.

03:34

So for greenfield, you can go forward with the new style. But if you just use Polly in place, I'm not sure if it's completely compatible or whether you have to change some class to some other class, but it's pretty much a simple fix. And I have been emailing with Joel Hulen to schedule a Polly show. Yes, we definitely will. Cool. So that's

04:00

what I got. Awesome. Who's talking to us, Richard? Grabbed a comment off of show 1861, which we did with Jeremy Miller back in the summer of '23, talking about minimal architecture, because Jeremy likes to cause trouble. Goodness knows. And this comment comes from Trevor, who says: "I love this discussion, enjoyed the comments on microservices versus monoliths," which is actually a reference to an earlier show we did. We've been following this trend of people

04:25

sort of pushing back on microservices. Yes. "I got pushed heavily into microservices approaches with a product that we'd built and re-architected into microservices, and it was the worst mistake ever. Things just became more complex, it was harder to maintain, it added a bunch of latency and security issues, and the complexity was just not worth it. And so I came up with a new acronym for appropriately sized service, or ASS." Nice. I can relate. This is a good

04:55

one. "I one hundred percent believe in services, separation of concerns, and clean architectures, but the approach must be appropriate to the complexity of the solution and the size of the team. It makes no sense to have one hundred separate services for a team of ten people. But then it also makes no sense to have a massive single deployment with one codebase and a team of two hundred and fifty people. The services need to work with the cognitive load

05:15

and be appropriate to the organization and team structures. And I was loving the discussion on all of this except for that one point: stop making the CTO out to be the bad guy. Love from Trevor, CTO." That's great. Yep, that's fair. I definitely think we need a whole show on ASS. I think so. Yeah. So, Trevor, thank you so much for your comment, and a copy of Music to Code By is on its

05:43

way to you. And if you'd like a copy of Music to Code By, write a comment on the website at dotnetrocks.com or on the Facebooks. We publish every show there, and if you comment there and we read it on the show, we'll send you a copy of Music to Code By. And you can also follow us on Twitter if you want to. But the real cool kids are over on Mastodon. I'm @carlfranklin@techhub.social

06:00

and I'm @richcampbell@mastodon.social. Send us a toot. We'll get around to reading it all. And with that, let us introduce Charity Majors to the show. Charity is an ops engineer and CTO at honeycomb.io. Before that, she worked at Parse, Facebook and Linden Lab on operations and developer tools, and always seemed to wind up running the databases. That's because it's where all the problems were. Yeah, if you stand next to the database, you're going

06:30

to be running it. Also co-author of O'Reilly's Database Reliability Engineering and the newly released Observability Engineering. Charity loves free speech, free software, and single malt Scotch. Let's have a round of applause for Charity Majors. Okay, I guess we should start at the beginning, right at the beginning. What the heck is observability engineering? That's a great question. It's like

06:58

the engineering of windows? Yeah, kind of. I mean, observability comes from control theory, right, and it's like, how well can you understand what's going on inside your systems just by observing the outputs? And yeah, exactly. And, you know, for years I was really religious about trying to define it in a very specific way, and I should have won, but I lost. So, I mean, it's come to just kind of

07:33

be a generic synonym for telemetry, which is what it is. But when we were trying to figure out how to talk about what I think of as just kind of the next generation of telemetry, it's kind of distinguished from the last generation of telemetry, obviously, which was very much focused around the metric, right, which is just a number with tags appended. It doesn't handle high cardinality, doesn't handle dimensionality. It's super fast. Is that powerful?

08:01

Now, you dropped some OLAP terms in there: cardinality, flexibility. It's funny for a database person to drop OLAP, but you're talking about just any way that you can really observe the state, the internal state, not necessarily what it's doing on the outside. It's about observing the internal state and being able to explore it, right? Not having to decide in advance: here's the data I'm going to collect, because here are the questions I'm going to need to

08:26

answer, here's my dashboard. You know, it's about being able to combine your questions, because anything that you're trying to understand these days is going to have a very complicated answer. It's like, okay, these errors are spiking, but only for users that are running this version of Android, with a particular firmware, in this region, with this language pack. Each of those is a high-cardinality dimension.
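
To make that kind of combined question concrete, here is a minimal sketch, assuming each request is stored as one wide event (a plain dictionary). The field names (`android_version`, `firmware`, `region`, `language_pack`) and the sample data are hypothetical, made up for illustration. The only point is that the combination can be asked after the fact because every event kept its context.

```python
from collections import Counter

# Hypothetical wide events: one dict per request, with all context preserved.
events = [
    {"status": 500, "android_version": "13", "firmware": "fw-7.2", "region": "eu-west", "language_pack": "pt-PT"},
    {"status": 500, "android_version": "13", "firmware": "fw-7.2", "region": "eu-west", "language_pack": "pt-PT"},
    {"status": 200, "android_version": "13", "firmware": "fw-7.1", "region": "eu-west", "language_pack": "pt-PT"},
    {"status": 200, "android_version": "12", "firmware": "fw-7.2", "region": "us-east", "language_pack": "en-US"},
    {"status": 500, "android_version": "13", "firmware": "fw-7.2", "region": "eu-west", "language_pack": "pt-PT"},
    {"status": 200, "android_version": "14", "firmware": "fw-8.0", "region": "us-east", "language_pack": "en-US"},
]


def error_counts_by(events, dimensions):
    """Count failed requests (status >= 500) per combination of the given dimensions."""
    counts = Counter()
    for event in events:
        if event["status"] >= 500:
            key = tuple(event[d] for d in dimensions)
            counts[key] += 1
    return counts


dims = ["android_version", "firmware", "region", "language_pack"]
by_combo = error_counts_by(events, dims)

# Print the combinations with the most errors first.
for combo, n in by_combo.most_common():
    print(dict(zip(dims, combo)), n)
```

A pre-aggregated error counter could not answer this, because the dimensions would already have been thrown away at write time.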

08:50

And if you don't capture the data in a way that preserves all that context, you can't ask those questions. Do you have some examples of how observability has improved a project in particular? Sure. I mean, I think of it as, it's really kind of where development meets operations, right? Like,

09:11

I feel like, big picture, you know, in the beginning there were engineers who wrote code and they owned it in production, right? Right. And then everything got super complicated and we were like, ah, there's too much. So some of us are going to write code and some of us are going to understand it. And that was not a great idea, and so we're kind of reunifying the streams now. I think every engineer should be writing their code and owning it in

09:35

production. Everyone who's a specialist in operations should also be opening the door and looking under the hood and understanding the code. Right. Specialization is great, but ultimately, you know, our systems have gotten so complex that you have

09:50

to write it and understand it. I feel like you've got to dig into that "own it in production," because it's not like they're also going to be sysadmins exactly. You're responsible for your systems, right? You wrote it, you own it. You unleashed this upon the world. I mean, I feel like there are these feedback loops at the heart of engineering. Some of them are like code review, right? Some of them

10:13

are like deploys. But if you don't hook up the feedback loop, if you aren't being exposed to the consequences of what you're doing, then you don't actually know if your code is good or not.

10:24

Well, I think there's a great point there as a developer, then: if my telemetry just tells me how many times my code was hit, that doesn't necessarily give me anything to do. And this is where I feel like operations folks have had a harder time embracing observability in some ways than software engineers have, because with ops people, it's like, we learned how to debug,

10:43

but it looked like this: I've got a dashboard. Something's wrong. So I'm going to start paging through dashboards and looking for similar spikes, just pattern matching with my eyeballs. Like, right, oh, it looks like it's Redis, you know, and you get it. It's great, because you get this hero journey where you just jump to the answer and you understand what's going on, because you're in this shit all day, every day. Nobody else does. They're like, whoa, how did you do that? You reset the Redis

11:07

service, and the problems went away, right? But that's not debugging. Pattern matching with your eyeballs is not debugging. Debugging looks like this: you take a step, you ask a question, you look at the answer. Based on the answer, you take another step. It's like following a trail of breadcrumbs. You don't know what the answer looks like until you get there. Can we talk about some of the new modern observability tools that we might think about using

11:31

to replace the tools that we're currently using? Yeah. I mean, I think, big picture, it can't just be based on the metric, because remember, you've discarded all that context you're looking for. Exactly. It has to be based on arbitrarily wide structured data blobs, which now look like spans, right? Those are just wide structured events, which you can trace because there's been a number appended to them. That's what you need in

11:58

order to understand your telemetry in production. Because I can imagine at peak load, like, we think about a metric that shows, you know, this is when we're posting the most transactions. You're now really interested in the state: are we queuing up? Yes, what's happening? Like, metrics are great, but they're limited, right? They're a snapshot. What you want is to be like, you know, okay, when this happened, what else happened? Right? What else is connected to it?

12:24

You know? And the old generation of tools are ones where you're capturing this data another time for every single tool. You're like, okay, here are my dashboards and metrics, here are my logs, here are my traces. So every time you're like, I've got a spike, I want to find the logs, there's nothing that connects them. You're just eyeballing timestamps and hoping that they happen to match up. And if you're finding the logs,

12:46

and you want to jump to a trace, like, that's not actually good enough. You can derive all of those data formats from these arbitrarily wide spans. You can't go in the other direction. When you say spans, what exactly are you talking about? A span is one hop of the trace.
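
The claim that metrics, logs and traces can all be derived from wide spans, but not the reverse, can be sketched as follows. This is an illustrative sketch only: the span fields (`trace_id`, `service`, `endpoint`, `duration_ms`, `status`) and the three helper functions are hypothetical names, not any vendor's format or API.

```python
from collections import defaultdict

# Hypothetical span events: one wide record per hop of a request.
spans = [
    {"trace_id": "t1", "service": "api", "endpoint": "/login", "duration_ms": 42, "status": 200},
    {"trace_id": "t1", "service": "db", "endpoint": "query", "duration_ms": 30, "status": 200},
    {"trace_id": "t2", "service": "api", "endpoint": "/login", "duration_ms": 910, "status": 500},
    {"trace_id": "t2", "service": "db", "endpoint": "query", "duration_ms": 880, "status": 500},
]


def to_metrics(spans):
    """Aggregate spans into per-service request and error counters (a metric)."""
    metrics = defaultdict(lambda: {"requests": 0, "errors": 0})
    for s in spans:
        metrics[s["service"]]["requests"] += 1
        if s["status"] >= 500:
            metrics[s["service"]]["errors"] += 1
    return dict(metrics)


def to_log_lines(spans):
    """Flatten each span into a plain-text log line."""
    return [
        f'{s["service"]} {s["endpoint"]} status={s["status"]} duration_ms={s["duration_ms"]}'
        for s in spans
    ]


def to_traces(spans):
    """Group spans by trace_id, keeping hop order, to reconstruct each trace."""
    traces = defaultdict(list)
    for s in spans:
        traces[s["trace_id"]].append(s["service"])
    return dict(traces)


metrics = to_metrics(spans)
log_lines = to_log_lines(spans)
traces = to_traces(spans)
print(metrics)
print(log_lines[0])
print(traces)
```

Going the other way fails: the counters in `metrics` say that one `api` request errored, but nothing in them says it was trace `t2` or which database call it was connected to.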

13:01

Okay, across all of them. So should we be gathering spans? All of the telemetry you gather should be one event per request, per service. All of the data should be aggregated into that one arbitrarily wide event in production, so you have all that context. A really mature instrumented service will have like two hundred, three hundred dimensions per hop, and that's magic, because you're passing along all of the parameters, you're passing along all of the IDs,

13:31

you're passing along all of that context, which lets you, after the fact, come back and say, oh, this thing in this service that happened was connected to that thing in that service that happened. Now, this is not necessarily at a per-transaction level, like you're not just chasing a transaction. What is it, basically? Is it one span from a time...? Well, typically, well, this is

13:56

complicated. There are lots of ways that you can define a span, but typically I like to think about it as: you want to have a span around something that's interesting. So anytime you're crossing the network, you want a span. Anytime you're making a database request, you want a span, because that's historically where problems happen. It's wherever you're crossing, right?
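
As a rough illustration of "a span around something interesting," here is a minimal hand-rolled sketch that wraps one unit of work, such as a database request, and emits a single wide event when it finishes. The `span` helper, the `emitted` list, `fake_db_query` and all field names are hypothetical; a real project would more likely use an instrumentation library such as OpenTelemetry rather than this.

```python
import time
import uuid
from contextlib import contextmanager

emitted = []  # stand-in for sending events to a telemetry backend


@contextmanager
def span(name, trace_id, **fields):
    """Wrap one unit of work and emit a single wide event when it finishes."""
    event = {"name": name, "trace_id": trace_id, "span_id": uuid.uuid4().hex, **fields}
    start = time.perf_counter()
    try:
        yield event  # the caller can attach more context while the work runs
    except Exception as exc:
        event["error"] = repr(exc)
        raise
    finally:
        # Runs on success and on failure, so every hop produces exactly one event.
        event["duration_ms"] = (time.perf_counter() - start) * 1000.0
        emitted.append(event)


def fake_db_query(sql):
    """Placeholder for a real database call."""
    return [("row",)]


trace_id = uuid.uuid4().hex
with span("db.query", trace_id, user_id="u-123", app_id="app-42") as ev:
    rows = fake_db_query("SELECT 1")
    ev["row_count"] = len(rows)

print(emitted[0])
```

Passing the same `trace_id` into the span on the next service is what lets the events be stitched back together into a trace later.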

14:18

So you start it at a user interface interaction and go from there, and then, you know, likewise for back-end services that go on over time. Like, you can have spans and tracing in monoliths too, and it's super, super useful there as well. But it becomes indispensable once you have services. Yeah, because if you think about it, in a monolith, at least you

14:41

have all that context and it persists throughout the request. When you jump across the network from service to service, you're deciding what state's going to come with you. And so how do you do all this without bringing the server CPU to its knees? Typically, the way we do it now is it's attached to background threads and that kind of stuff, and, you know, if those threads hang, you can lose data. There are lots of ways to do it. Obviously, I think that

15:07

my service does it best. Your Honeycomb... Well, you've got to tell us about Honeycomb. Well, sure. You know, I'm not really great at pitching, but I will say that, you know, the idea of how observability should happen is how we built our service, down to the data store. Because, well, back... How many of you ever built apps on Parse, the mobile backend as a service? No? Wow. I loved it so much.

15:39

It was like Firebase, but better and earlier. Any Facebook people here? Okay, cool. I will have a grudge against Mark Zuckerberg forever for what he did to Parse. Okay, they shut it down. It's like, we got acquired. Anytime you get acquired, make sure that you have an executive-level sponsor who believes in you. Right. We did not, so we got shut down even though we were still growing

16:11

like gangbusters. Anyway, Parse had one hundred million apps by the time I left, and we had built our service originally on Ruby on Rails, which was not a terrible decision, because most startups fail, and it's usually not because of that. And Rails had the strength that you can move fast. Yeah, okay, say whatever you want about Ruby on Rails. The downside is it doesn't have threads, right? It has a fixed pool of workers, right? And so that was fine.

16:44

We had one hundred thousand apps, but we got bigger and bigger, and instead of having one database and back end, we now had thirty, forty, fifty. And when you've got that many, something's slow at any given time, which means that the fixed pool is filling up constantly with threads that are waiting on that one back-end service. Oh, and like, as a

17:03

reliability engineer, this was personally humiliating. We were going down every day. Like, an app would hit the top ten on iTunes, and down goes Parse, again and again. And I tried everything to try and figure this out. And what finally helped us was, number one, we did a rewrite to Golang. We actually

17:25

considered using .NET, and it got outvoted. And I learned later that the blog post that I wrote about why it got outvoted made a lot of people at Microsoft very angry and changed a lot of their decisions, which is great. Yeah, inspiring anger is really like... but, I mean, I made a career of it. I have a tough time disagreeing with you on picking Golang, too. When you think about a back-end service at velocity like that, the language is very well suited for that.
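
The fixed-pool failure mode that motivated the rewrite can be sketched with back-of-the-envelope arithmetic. The sketch below uses Little's law (average workers in use = arrival rate × average time per request). Every number in it is a made-up assumption for illustration, not Parse's real traffic, and it assumes requests are spread evenly across back ends.

```python
def busy_workers(requests_per_sec, backends, slow_backends, fast_s, slow_s):
    """Average number of workers in use, by Little's law.

    Assumes traffic is spread evenly across back ends, so the share of
    requests that hit a slow back end is slow_backends / backends.
    """
    slow_fraction = slow_backends / backends
    mean_latency_s = slow_fraction * slow_s + (1 - slow_fraction) * fast_s
    return requests_per_sec * mean_latency_s


POOL_SIZE = 100  # hypothetical fixed pool of workers

# All back ends healthy: 1000 req/s at 20 ms each needs roughly 20 workers.
healthy = busy_workers(1000, backends=50, slow_backends=0, fast_s=0.02, slow_s=5.0)

# One of fifty back ends takes 5 s per request: roughly 120 workers are needed.
one_slow = busy_workers(1000, backends=50, slow_backends=1, fast_s=0.02, slow_s=5.0)

print(f"healthy: {healthy:.1f} workers, one slow back end: {one_slow:.1f} workers")
print("pool exhausted" if one_slow > POOL_SIZE else "pool ok")
```

Under these assumed numbers, a single slow back end out of fifty is enough to demand more workers than the pool has, so requests to the forty-nine healthy back ends start queuing as well.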

17:53

It was great, but it was half of the answer, because we also had to understand what was going on: not just write the code, but understand it. This is where observability came into play, because Facebook also had this service called Scuba, which was, don't get me wrong, butt ugly, aggressively hostile to users, but it did one thing really well, which is let you slice and dice in near real time on dimensions of high cardinality with wide events.
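
Cardinality here means the number of unique values a field takes across a set of records. A minimal sketch, with hypothetical field names and sample records, that ranks fields from highest to lowest cardinality:

```python
# Hypothetical user records; the field names are for illustration only.
events = [
    {"user_id": "u-001", "first_name": "Ana", "species": "human"},
    {"user_id": "u-002", "first_name": "Ben", "species": "human"},
    {"user_id": "u-003", "first_name": "Ana", "species": "human"},
    {"user_id": "u-004", "first_name": "Caio", "species": "human"},
]


def cardinality(events, field):
    """Number of distinct values the field takes across all events."""
    return len({event[field] for event in events})


# Rank the fields from highest to lowest cardinality.
ranked = sorted(events[0].keys(), key=lambda f: cardinality(events, f), reverse=True)

for field in ranked:
    print(field, cardinality(events, field))
```

A unique identifier has as many values as there are records, a name has duplicates, and a constant field has exactly one value.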

18:21

Right. High cardinality, for those who don't know: it's the number of unique items in the set. So if you've got a collection of one hundred million users, any unique ID, like social security number for the US folks, would be the highest possible cardinality. Something like species equals human would be the lowest, because there's only one, right? First name, last name: high cardinality, but there are some dupes, so

18:42

it's not as high as the social security number. So everything around metrics is oriented around low-cardinality dimensions, but everything you want to use for debugging requires high cardinality. I see some people have run into this before. So Scuba let us slice on some of these high-cardinality dimensions, and instead of having to, like, you know, either I already have a dashboard for it or it's going to be hours to just dive through

19:10

the logs and figure it out and everything, instead it would be like, okay, we're getting a spike in errors, let's break down by app. One in ten million app IDs: break down by that. Okay, now break down by reads and writes. Now break down by normalized database query. Now, are you making cube gestures? Column after column. It was just like, step by step, it would take me to the answer. It's like

19:40

it isn't even engineering anymore. It's like support. Right? These problems went from being intractable, I mean, from where it would take us a day to figure out and then it would never happen again, to just being, you know, thirty seconds, every single time. And that was what... like, when I was leaving Facebook, you know, I've never been one of those kids who's like, I want to

20:03

start a company, because I kind of hate those people. But when I was thinking about having to live without this tooling, I was like, I can't. I actually can't conceive of it. Like, it's become so core to how I perceive the world as an engineer. I just can't imagine going back. And so that's why I made Honeycomb. So how did the rest of the people in the organization react to this new culture of

20:29

observability and spans? There's a learning curve, right? We've all spent our careers fitting our brains into asking questions in the metrics-and-dashboards type of way. But, like, you know how at every job I've ever had, the person who's best at debugging is always the person who's been there the longest. Always.

20:48

That's no longer true when we have different tools, because instead of relying so much on what's in your head to reason about the system, it's right in front of you and you're just asking questions. And it's more like, the more curious you are, the more debugging you do, the better you get. You don't have to have the whole system in your head. You can find the answer more quickly that way, and it's kind of

21:12

a beautiful thing. Yeah, a process of discovery too, right? Find exceptions. That's why, like, observability is not just about... Yes, you're going to have to have a columnar store. Yes, you're going to have to have all these things in the back end and make it fast. Because the other thing about logging tools is, like, if you want to ask something interesting, you enter the query, you know, and then you're like, okay, I'm going to take thirty minutes and go out for coffee, because it's going to... Like, it

21:34

has to be fast, it has to be interactive. It has to be, like, under a second, because you're taking steps and you have to stay in the zone, right? Yeah, exactly. It has to be explorable, it has to be interactive, and it has to let you, I think most importantly, draw on the brains of the people around

21:51

you. So something we built into Honeycomb is history. You know how you're debugging and it's like, oh, I've lost the thread? So you can just scroll back up: that's where I knew I had it, right? And you branch out and you try something else. But

22:06

then also you have access to the history of everyone on your team. So if it's like, last Thanksgiving we had this terrible MySQL outage, you know, and Ben and Emily were on call, say, and then I'm on call in March and I'm like, this feels a lot like what was happening last November. I'm going to go and look at, like, well, what were Ben and Emily doing and what did they say? What helped them find out? What traces? So journaling, yeah, actions, systems.

22:36

History doesn't repeat, but it rhymes, right? So much of the wisdom of your... like, these are sociotechnical systems. It's not just production. An example I often use is this: you've got the New York Times and the Washington Post. They're both big newspapers, right? But if you took their teams and swapped them, you couldn't actually do that, because so much of the system lives in the heads of the people who write it. So, like, being able to draw on that wisdom and use it,

23:07

like, it makes you a better engineer. All of the shit that I learned about being an engineer was from looking over the shoulder of amazing engineers that

23:15

I got. But it sounds like the journaling approach you're talking about allows us to look over the shoulder of the best. Exactly. And you don't have to remember how they did it, because it's recorded there, and so, you know, you can even learn their approach and how they attacked it. I feel like, you know, especially nowadays, when we're doing so much distributed working, like remote working, I worry a lot about how we are going to bring

23:40

up the next generation of engineers, you know, and I hope that we're all starting to think about making this more a part of our tooling. Just, how can we learn from... You know, it's kind of embarrassing, but when I was in college, I learned so much from just going around and reading the Bash histories of all the people I knew, and then trying the commands. You know, it's fucking fascinating. Right? Oh, that's how you learn sed and awk. Right, what does this do? I

24:11

think we need a lot more of that in our tools. Yeah, and I worry that we're making it even harder to make that jump from junior to intermediate. I mean, we've always had a problem with intermediates anyway, but a lot of the automation tools are eliminating the beginner stuff. Yeah, like this whole generative AI stuff: it's great for senior engineers. You're now so much more productive, you can output so much faster. But the way that you get to that

24:40

point is scars. How are we going to force ourselves? I mean, I believe that the solutions will emerge. I hope. It looks pretty bad. And I also think the younger generation will find them too, because they are not... You know, we've done this show where we were talking about, is all this scar tissue actually holding us back? Right? That we have some of it... Yeah, some of it's valuable, I think. You know, you have to internalize the damage and say, like, what does this

25:10

really look like, generally speaking? And when someone says, I will never use X product or X technique, it's like, you have not internalized your scars well. I love that. Every team has that, I think. But also, can they learn to speak to it: these are the concerns I have when you think about the broader approaches to things that might have created that problem back in

25:30

the past. I read this book called The Trauma of Everyday Life, which is written by this guy who's a psychiatrist and a Zen Buddhist, and he's talking about how trauma isn't necessarily something to be avoided, because it's literally what shapes you. Think about a bonsai: that's just a normal tree, but it was put in this very specific pot where its roots couldn't grow, right? And so it's not like telling people trauma is great, but it's

25:57

also like, the scar tissue is just going to be different. And again, how you react to it and how you work with it, you can make beautiful things. So what are some of the other pitfalls that people will encounter when sort of moving to this observability? The big one is the cognitive... just the model that we have in our brain. I feel like our industry has avoided this for a long time. I feel like there's a bit of a reckoning. You know, OpenTelemetry, by the way, I've got to

26:29

put in a quick plug for OpenTelemetry here. It's amazing. I know all of us hate redoing our code, but the promise of OpenTelemetry is you reinstrument your code once and then vendors have to compete for your dollars based on being awesome instead of having you locked in. It's the number two project after Kubernetes in the, what do you call it... thank you, CNCF. It's super active, a lot of contributors. I was pretty skeptical about this, but it's the way. I wish we had had this ten years

27:03

ago. You know, it was just a political problem, not a technical one. Totally. We're there now, but here we are. These are the tools that we have. OpenTelemetry is worth putting on your roadmap for the next year or two, because there's

27:18

also this reckoning that's happening with costs right now. Most of these vendors are billing just ungodly amounts of dollars that do not correspond to the value that you get out of them, because they can, because they've got you locked in. Yeah, yeah. And so I feel like we need to take our power back. Well, and it's a great pitch for a feature that's not necessarily a new feature, to say, hey, I can reduce our costs by moving us off

27:41

this tool and onto OpenTelemetry. You know, these are sidebar rants here, but I feel like one artifact of the zero-interest-rate period was that engineers forgot how to talk about our work in terms of dollars, you know, because dollars are

27:56

the universal denominator. Maybe something like euros, I don't know, but money is the universal denominator, and if we can't learn to talk about the value of the shit that we provide to people like finance people... I feel like many, many VPs of engineering and CTOs have this phenomenon where they feel like the junior partner at the table. They aren't really invited to all the critical

28:18

meetings and stuff. And I believe that that's because we haven't learned to talk about the value that we bring, and cost, in the same language as every other team. Because if we did... we generate a lot of value. We have all the power we should need to have. All we've got to do is get a handle on it. And Charity, we need to break for one moment for this very important message. And we're back. It's .NET Rocks. I'm Richard Campbell. That's Carl Franklin. Hey. Talking to our

28:48

friend Charity Majors about observability engineering and watching the sausage being made. And so I want to follow up on... you brought up generative AI and things for programmers. It's great for senior programmers, who can be more productive with stuff they might have forgotten how to write or don't really care to figure out, and just let ChatGPT do it for you. But what do you think the future of observability is, especially in light of AI and where it's going?

29:18

Do you think that we'll have AI bots sort of watching our telemetry and giving us English prompts, you know, sending us text messages? Vendors are going to sell CTOs and VPs, like, tens of billions of dollars' worth of bullshit that says that they can do that. Yeah. Something that blew my mind when I became CTO... So are you saying we don't need this? We have everything that we need right there in front of us? Something that blew my mind

29:45

when I became CTO, and it was wild to internalize, is that most executives have more trust and confidence in their vendor relationships than in their employees, because employees come and go, but vendors last forever, as long as you keep paying them. It blows my mind. But what they're selling when they come in... this is why my dander got raised so much by the whole AIOps thing, because they are all just

30:10

like, you don't need to understand your systems. Pay us all this money, we'll understand them for you. But the false positives are ridiculous and off the charts, and all of the data is junk. You would be better off just turning off all that data. There are just so many problems with it. I believe that we should be looking at computers to do what computers do best, and people to do what people do best, and computers

30:33

crunch numbers, people attach meaning to things. Sure. Like, your graphs are spiking all day long, and most of them you don't care about, because our computers are now resilient to a whole lot of failure. Sure. It really takes a person coming along and going, that matters. That matters often because it mattered to another person, and you're the person who's connecting those dots. And once you've decided it matters, you need to understand why. And I think

31:00

there are all kinds of ways for computers to help us do that. We do this really cool thing called BubbleUp in Honeycomb, where on any graph, any heat map that you've constructed, you draw a little bubble around something, you're like, I care about this, and then we compute the baseline for all the hundreds of dimensions and the dimensions that are inside the thing you care about, and then we diff them and sort them, so it's like, okay,

31:21

this thing you care about, here are the five to ten ways that it is different from everything that you don't care about. Computers are great at that, but they can't tell you what to care about, and they shouldn't try, because it's a fucking mess. Maybe ten years from now I will be eating my words, but for the foreseeable future, I really think that we're all best served if we focus on helping people understand what has meaning and letting computers take care of the rest. Yeah. I mean, I can see

31:45

the machine tools helping to point us too unusually. Sure, Yeah, but you still have to interpret them. Yeah. Still you want them to create that graph for you. You want them to intelligently sample often, you want them to do you know, but you don't don't want them in the business of telling you what that No, they don't know, you don't know, and more and more stately, like they're not even qualified to make that s

32:08

that's been in any way. That being said, like, I can tell you we're talking a lot of OLAP terms here, like, a lot of data analytics terms around all this, and machine learning models evolved from a lot of that technology. So you can see a shape of this, the shape of history. You can't see a shape, but I don't believe that it is... it is one that... So here's the thing. At the bottom line, forget technology. We are held legally accountable as engineers. We are legally and

32:38

ethically and morally accountable for the code we put out into the world. Right, we can't point at an algorithm when it comes to that, even if it's a machine learning... I think, I don't know if you've read a EULA lately, but boy, oh boy, they work really hard to make sure we're not legally accountable for anything. I believe in the near infinite possibility employers. That's true. I want to make sure that I understand. I also, I mean, I like that we're also going to the moral and ethical aspect, because I

33:09

think we need. I think that legal aspect is holding us back, that we can't own the value of what we make as long as we're abdicating our responsibilities. And then really, you know, the EULA was invented to allow us to not hold liability for the impact of our software, and so we're kind of in a trap right

33:34

as an industry. If we were responsible for the damage we did, we would, our employers would insist on higher standards, because they're getting caught up in that as well. But because we've avoided the responsibility so thoroughly... I see what you're saying. That being said, like, now we get into a pretty deep philosophical side of this thing. Like, let's face it, good telemetry... In the end, we're trying to understand why is the software behavior

33:58

and its behavior? Why are our customers unhappy? I mean, those are the things that actually matter, I think. The more often that we as technologists speak in the terms of the customers, I think: why are our customers unhappy? You know? And this is something I've been really grappling with lately. I don't know if I'm alone in this or not, but I have, like, an almost knee-jerk, almost disgusted reaction towards, like, customer

34:22

and value and things. And I've been trying to... Because we've been battered with it, because we've been beaten up with those words. Yeah, I don't know, just the business aspect, like, I think there's some vestige of me that's still like, ew, we're better than that. And I hate myself as I'm saying that. You know, and you're also open with, the dollars matter. They do. They kind of come from the customers. Oh, you should have seen me ten years ago, because this is a chill version.

34:47

I get that. Okay. No, but you're absolutely right. We do this for the customer. We do this for our users. That's the reason we exist, and we have a responsibility to them. Sure. And I don't think I'll ever feel comfortable saying, well, the machine told me it was fine. That's a cop-out every time. Because the machine didn't tell you anything, you interpreted it and chose to vocal. You know, in the end, everything we've talked about program it's fine, getting very philosophical,

35:14

but also none of this has described an action we should take. All we're doing is observing what's going on. We still have to decide on the action. How would you change the code, given what you've seen in the telemetry? And you know what else? Like, I feel like this loops back really nicely

35:28

into just, like, what is a meaningful life? Right? Because, like, that book that what's-his-face wrote about work and what makes us happy, it's not like having twenty hours a day, whatever, but it's like autonomy, mastery, and meaning, purpose. Yeah, Daniel Pink. Daniel Pink, thank you. And, like, the meaning, the purpose, that comes into play for us

35:52

when it impacts other people. Well, and you hit on the key thing, which is when we crack this. Like, every time you chase a problem down like that and it turns into a code change you can make, and then later testing shows that problem's cured, boy, that's a good day. Like, you talk about purpose, there is nothing better than figuring that complicated problem out. And then, literally, like, you live in a very hypothesis-based world. It's like, well, I've seen this telemetry, I've seen

36:21

this output. I believe it's this code problem. Now I'm going to make a modification. I'm going to put it into the stream and I'm going to go back and test again. And if I don't see it, then I can, you know, hypothesize, really, because I might be wrong. We may not have recreated conditions perfectly. That we're on it, that we're pushing the right thing, and nobody knows just how deep that went. No, I also wonder, you know, how many times have you been fighting a problem

36:46

like that and you start changing code just to see if you can change behavior at all? Like, am I even assistant? They have emergent properties. They're no longer... Like, I feel like part of moving from, like, the old version to the new is accepting that TDD is not enough. Interesting. Like, the tests tell you, will this logically execute, but that reality ends at the border of your laptop. Yes, and the universe is weirder. The weird intera is so

37:14

much weirder than that. I feel like our jobs are not done until we've instrumented that code, deployed it, and watched it in production and asked ourselves, is it doing what I expected it to do? And does anything else look weird? I know that on the show before, you know, I did a lot of load testing. It's like, I have never invented a load test as weird as customers on a Saturday. Nothing actually even comes close.

37:37

So customers are evil, they do things you can't. They really opened six windows and hit refresh all at the same time? Did they really? Really? Okay. May I seek some practical advice on behalf of the listeners? So let's say you're listening, you've been surfing, you went to honeycomb dot io and you checked it out, and you're thinking this might be good. How do you go back... How do these people in the audience go back

38:06

to their teams and introduce this concept without getting flogged? You know? How do you approach that? I mean, that's a great question. My approach is always to look for something that's really painful, like, you know, things that are going down that you don't understand, a problem you can't crack, and especially with the siloed approach to telemetry that we're doing anyhow,

38:29

things that are waking people up in the middle of the night. And, you know, we've seen this a lot, where people have tried to bring it in, whatever, but then there's an intractable problem and they put Honeycomb on it, and it's just like... We've even had, multiple times, our sales engineers doing demos on people's production systems and going, you're about to have an outage here because this thing's happening, and they're like, what? And then, like, ten

38:51

minutes later they get paged. Because it is that... Like, I know, I'm a founder, believe nothing I say, but it is that much easier when you have the right tool, when you have the right visibility, just to be able to see what's going on. Yeah, looking at something like that. Or, somewhat counterintuitively, on the other side, another place we've seen a lot of success is people instrumenting their CI/CD pipelines, right? Because if you instrument your CI/CD pipeline

39:14

as a trace, you can see where all that time is going. Yeah. Yeah, that's kind of another approach to this, the model of, what is the hard work here? What's actually hurting us? The struggle is only getting in the front door. We have, like, zero churn: if the company doesn't go out of business, they keep buying us. But it's difficult to get in the front door. But once we get inside, like... No, but I think you've made the most compelling argument, and that's going to be tough for

39:39

anyone in the room who's thinking about this. It's like, you have to go pick the largest dragon in the room and say, I think I could take that one on if I had this lance. If I can get this lance and give it a go, I'll go for the big guy. Yeah, and that's the kind of bet you need to make. But you know, the underlying part of this, because a lot of the software is already set up for the right telemetry, but it's the custom stuff we're building, it is not,

40:00

is how you give visibility into that. Yep, orienting it around... You know, a lot of people also come to us when they start their OpenTelemetry journey, because we have many of the world's best experts in OTel, so we can actually help consult: what do I need to push onto the OpenTelemetry stack that's going to help me, that's going to let these tools understand? Do any of you in the room have a question for Charity?

40:24

Raise your hand. It's all right. Right here, Phil Haack has a question. So why don't you repeat the question? The question is, that's a lot of data, and how does that cost? How does the cost not get out of control? Again, this is why... So on Twitter I was joking the other week, and it kind of got out of control, that you should never write a database. No, really, never write a database. And I thought it was a very fun self-own, because we wrote a database.

40:49

People didn't understand that. So yeah, it's a ton of data. Like, we've got, like, seven hundred customers and we run the combined production loads of all of them. It's like two billion events per second or something like that, and we give everyone sixty days of storage basically for free. And the way that we do this... So it's a columnar store. So indexes are verboten for observability, because indexes are a way of picking: I want this to run fast

41:17

and nothing else to run fast. You want to be able to query on any of these dimensions. So it's a columnar store. And you're right, like, two years in we ran into this: we're never going to be profitable, because there are all of these SSDs, all this RAM. And that's when one of my... a brilliant engineer I've been working with, he was my first manager.

41:37

Name is Ian. He's nowhere on the internet and he's amazing. He started looking into the cost models and did some tests, and so now, actually, data comes in, hits the API, gets dropped into Kafka, and then gets read off onto, you know, a pair of nodes, which are, as you would think, lots of CPU, lots of RAM, but then after, like, thirty-six minutes it gets tailed out to S3. The query planner actually runs

42:06

the Lambda jobs. So the query planner comes in, forks out, fans out, and, like, we thought it was going to be so much slower, doing processing, you know, from all these S3 buckets. It wasn't. It was different performance characteristics, but most queries still return in under a second. And S3 is cheap, so that's where most of the data is, and the Lambda jobs are pretty expensive. That's a big line item on our bills.
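
As a rough illustration of the architecture described here (recent data on a couple of local nodes, older data in S3, no indexes, and a query planner that fans work out to many short-lived workers), here is a minimal Python sketch. It is not Honeycomb's implementation: `make_segment`, `scan_segment`, and `run_query` are hypothetical names, in-memory dicts stand in for both the hot nodes and the S3 objects, and a thread pool stands in for the Lambda jobs.

```python
from concurrent.futures import ThreadPoolExecutor


def make_segment(rows):
    """Turn a list of event dicts into a columnar segment: one list per field."""
    columns = {}
    for i, row in enumerate(rows):
        for key, value in row.items():
            # A field seen for the first time is back-filled with None.
            columns.setdefault(key, [None] * i).append(value)
        for values in columns.values():
            # Fields missing from this row get a None placeholder.
            if len(values) <= i:
                values.append(None)
    return columns


def scan_segment(segment, column, predicate):
    """Count matching rows by reading only the one column the query needs."""
    return sum(1 for v in segment.get(column, []) if v is not None and predicate(v))


def run_query(hot_segments, cold_segments, column, predicate, workers=4):
    """Fan the scan out over every segment (hot or cold) and merge the counts."""
    segments = list(hot_segments) + list(cold_segments)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = pool.map(lambda s: scan_segment(s, column, predicate), segments)
        return sum(partials)
```

Because each segment keeps one list per field, a query on any dimension scans just that field's list, which is why this layout needs no index chosen in advance.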

42:31

So we've actually done some really great talks and written some great pieces about how we use Honeycomb to optimize our Lambda jobs so that the planner, yeah, it's all... Yeah. Our Honeycomb blog, by the way, is dope. Like, we don't do a lot of selling there. We just talk about a lot of engineering and it's pretty great. There was another question back here. First, somebody back there had their hand up. Okay, it wasn't you, but go ahead. So, repeat the question.

43:00

I'm sorry. So the question, I think, was something about SLOs and metrics, and it's too expensive to store all the traces for all events. Okay. And there's a few different answers for this. I would probably want to ask you some more questions, so feel free to find me afterwards. But

43:21

you're absolutely right. It can be absolutely cost prohibitive to store the trace for every request if you have a lot of traffic, because if you think about it, you might find yourself storing five to thirty times as much telemetry data as production traffic. Obviously that's not tenable, right? The first solution that we usually steer people towards is intelligent sampling, which does not mean just, like, dumb

43:45

dumb sampling, where, like, one out of every ten you drop them. It means, like, we have a thing called Refinery, where there's a difference between head sampling and tail sampling, meaning sampling before you know what is coming and after you know what's coming. So some of these things you sample after you know what's coming. Be sure and grab all of the slow events, right? Some of it is head sampling, where you're just like, okay,

44:06

for example, requests that are health checks, that are two hundreds: this is junk. I don't need to store all of these. It's going to be, like, a quarter of your traffic sometimes, so, like, sample them heavily. Two hundred OKs to the main page, sample them medium. Keep every request that's an error, or every request that is to slash payments or to billing or, you know... Like, there's a lot of that that is trash that you can discard if you kind of go in there with a fine-tooth comb. The

44:35

part that was about SLOs. We derive SLOs from events, and it's actually really important that they're not derived from metrics. We're actually the only product out there that does SLOs the way they're supposed to be done per the Google SRE book, because other companies don't actually capture their data in a way that

44:59

lets them do that. It's actually pretty dope. Like, you have your SLOs, it tells you how quickly you're burning down the budget, and then it tells you what is different about the requests that are erroring, that are burning down the budget, the other, blah blah blah. We also have the metrics, but I'm not sure if you're talking about our metrics product or the events. Like, the number one answer to the events being too expensive is you use

45:25

smart sampling. And the number one answer to the SLOs is you want those badly. You can absolutely do sampling. So one of the things: in every event that gets sent to us, there is a sample rate embedded in it, so every one will say, like, one slash five, and that means count this as five like this, so the numbers all work

45:45

out to look like they weren't sampled. We had questions. We've got to move on. Question right in the front. Hang on, can you repeat that with the mic? The question was, obviously, you can go wrong with logging, because you can get way too many log events. You can go wrong with metrics, because you can have, famously, like, the thirty-thousand-dollar metric that had

46:04

high cardinality in it and, oops, your budget. And the question is, I think, how can you go wrong with observability as distinct from those? You know, and I want to say, even if you aren't doing traces, the most important thing to take away when it comes to telemetry is the magic

46:22

of the one wide structured event per request per service. I actually found out years into this that this is how Amazon has done their telemetry all along. They had, like, a flat file at the root domain of every node where they

46:37

keep one of these, like, wide... It's magic. It makes everything... Because a lot of the logs that you are encountering are because, like, when a request is executing through a service, it's just like, oh, all these strings, right? But if you just, like, collapse them into one wide event with all of those keys and values, then you have that context, right? It's magic. So the number one thing that I think people

47:04

get wrong with observability is not understanding that that's the heart of everything. It isn't which tool you're using. It isn't whether you're tracing or not. It's that. That is the number one thing that everyone should be caring about. The number two thing, I think, comes up when dealing with spans, a slightly higher-order problem, and that's because I feel like, as an industry, we don't really have a set of, like, good conventions.
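
To make the "one wide structured event per request per service" idea concrete, together with the sample rate embedded in each event that came up in the sampling answer, here is a minimal Python sketch. Everything in it is assumed for illustration: the function names, the field names, the `/healthz` route, and the specific rates (1, 10, 100) are hypothetical, not Honeycomb's or Refinery's actual rules or API.

```python
import json
import random
import time
from contextlib import contextmanager


def choose_sample_rate(fields):
    """Hypothetical rules: return N, meaning keep roughly 1 in N such events."""
    if fields.get("error") or fields.get("status", 200) >= 500:
        return 1      # keep every failing request
    if fields.get("route", "").startswith(("/payments", "/billing")):
        return 1      # keep everything on business-critical routes
    if fields.get("route") == "/healthz":
        return 100    # health checks are mostly noise, sample heavily
    return 10         # ordinary successful traffic, sample moderately


@contextmanager
def wide_event(service, sink, rng=random):
    """Collect all context for one request in one dict and emit it once at the end."""
    fields = {"service": service}
    start = time.monotonic()
    try:
        yield fields  # the handler adds keys as it learns things
    except Exception as exc:
        fields["error"] = type(exc).__name__
        raise
    finally:
        fields["duration_ms"] = round((time.monotonic() - start) * 1000, 3)
        rate = choose_sample_rate(fields)
        if rng.random() < 1.0 / rate:
            fields["sample_rate"] = rate  # stored so counts can be re-weighted
            sink.append(json.dumps(fields, sort_keys=True))


def estimated_count(lines):
    """Each stored event stands in for `sample_rate` real requests."""
    return sum(json.loads(line)["sample_rate"] for line in lines)
```

A handler would wrap its work in `with wide_event("checkout", sink) as ev:` and set keys such as `ev["route"]` or `ev["user_id"]` as it goes, so one JSON line replaces the scattered log strings for that request, and `estimated_count` recovers approximate totals after sampling.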

47:31

You were asking me, like, when should you have this span? Man, like, I hope five years from now everybody's like, well, obviously you should have this span, blah blah blah. But we aren't there yet, right? And so it's really easy to either generate too many spans, and they get lost in the noise, kind of like with logs, or too few spans, and then not have the detail that you need when you need it. The question is where to start with OpenTelemetry. And there are only two good

47:57

answers. One is my favorite: with the biggest pain. And if you have, like, a really resistant culture,

48:08

then start with the least pain. But I actually think that the best way to roll anything that has to do with telemetry out is to kind of think of your attention like a headlamp, and if you're on call for something that's breaking, have an instrument-first mentality. Like, you instrument to figure out what's wrong, not if you've around with your instrument, have to tell you the answer, and then it's there for the next time you get paged

48:31

again. Instrument to find the problem, and it's there. And as your headlamp kind of moves around the stack, you know, within a couple of months most of the stuff that really matters will be instrumented, and then you can put it on the backlog to do the rest and finish up and get rid of your old vendors. All right. Well, I think that's it, so let's give Charity Majors a big round of applause. I will see

48:53

you next time. Dot Net Rocks is brought to you by Franklins dot Net and produced by PWOP Studios, a full service audio, video and post production facility located physically in New London, Connecticut, and of course in the cloud online at pwop dot com. Visit our website at D-O-T-N-E-T-R-O-C-K-S dot com for RSS feeds, downloads, mobile apps, comments, and access to the full archives going back to show number one, recorded in September two thousand and two. And make sure you

49:50

check out our sponsors. They keep us in business. Now, go write some code. See you next time.

Transcript source: Provided by creator in RSS feed