The Observability Tipping Point with Steve Gordon and Martin Thwaites

00:01

How'd you like to listen to .NET Rocks with no ads? Easy. Become a patron. For just five dollars a month you get access to a private RSS feed where all the shows have no ads. Twenty dollars a month will get you that and a special .NET Rocks patron mug. Sign up now at patreon.dotnetrocks.com. Hey, guess what, it's .NET Rocks. I'm Carl Franklin. And I'm Richard Campbell, and we're representing four time zones today. Nice. We have two guests. Everybody in their own time zone.

00:48

We're in our little corner of the world and they're in their little corner of the world. However, electrons do not observe time zones, so we can all be together on this call. I think they observe it a little bit, maybe, you know, maybe a little bit. You know, Apple has some software. There's an old reference for you. That was a great story, Carl. Apple's got some software coming up. All right, speed of light, hard to beat. Yeah, it's kind of the law. Funny how

01:19

that works. Well, uh, let's get right into it with Better Know a Framework. Awesome. All right. Well, I found this series on Microsoft Learn. Oh, that is really cool. And you know what's great about this is that it's fundamental stuff: leveraging branches with GitHub and Visual Studio, an intermediate series. It doesn't talk over your head and it doesn't talk down to you. But you know, there are those guys on the team that are afraid of GitHub and afraid of Git. You know, I used to be one

01:59

of them. Sure. It's so heavily integrated into Studio now, like, yeah, it just makes your life pretty easy. Yeah, so this, you know, this is required learning. And so even if you know this stuff, or you think you know it, it's good to brush up on it, especially if you're in charge of, you know, teaching other people on your team all about it. But anyway, I put a link to it.

02:23

Since this is episode eighteen eighty seven, you can go to eighteen eighty seven dot pop, dot m E and that'll get you there. It's had, what, twenty five hundred views since January fifteenth. Yeah, it was just put up a little while ago. Yeah, it's very good. It's cool, and a lot of thumbs up, and I watched it and it's

02:45

good. No learning love it. Who's talking to us today, Richard? I grabbed a comment off of show eighteen sixty nine, which is from last fall, October twenty twenty three, when we were at NDC Porto: an on stage interview with one Charity Majors from a company called Honeycomb. Not that we put terrible pressure on Martin or anything, but she was awesome. I don't know if Martin is gonna have, and we're gonna have, as many F

03:09

bombs, however. No, but that's not about it either. And just to reinforce the point, G Houseworth had this comment, in which he said: I never miss an episode of .NET Rocks because they absolutely rock, and this episode with Charity Majors was one of, if not the best. Yeah, no pressure. The insight and energy was awesome, and we did have the distinct advantage of being on stage with her too. We were at NDC Porto at the

03:35

time. But you know, we're talking about observability more and more, and I'm really looking forward to getting into this one because we've got a couple of different viewpoints coming at it. But if you want to get started on this subject, that show we did with Charity, no fool, and it was great. It was great. G, thanks so much for your comment, and a copy of Music to Code By is on its way to you. If you'd like a copy of Music to Code By, write a comment on the website at

03:57

dotnetrocks.com or on the Facebooks. We publish every show there, and if you comment there and I read it on the show, we'll send you a copy of Music to Code By. And, uh, you can follow us on Twitter if you want. But the cool kids are at Mastodon, and you know, I put all of my social media links at carlfranklin.com. Cool, so we don't have to just rattle them off. I got a bunch of them up

04:15

there. I think I've got a Richard dot Campbell dot me, you know, one of the about me pages that has to catch that stuff up there too. If I actually looked at it recently, I don't know. I think I did it when we were doing the Mastodon account thing. Yeah, but yeah, you know, it is. But you're at Rich Campbell on Twitter, and what's your Mastodon? Rich Campbell at mastodon.social. You know, yeah, yeah. Oh no, it's rcampbell.me.

04:43

There you go. rcampbell.me. All right, got them all. Well, I'm going to just start talking about that. carlfranklin.com. Just go there. You want to do that now? Okay. Like, yeah, okay, shorter. Yeah, it is shorter, and, uh, you know, then everybody has a lot more to read that they didn't want to. So, too long, didn't read? Yeah, go away. So let's

05:10

bring on our guests, Steve Gordon and Martin Thwaites. Steve is a Pluralsight author, Microsoft MVP and engineer at Elastic maintaining their .NET libraries. Steve enjoys sharing his knowledge by presenting talks at user groups and conferences, writing on his blog at stevejgordon.co.uk, and creating videos. You can find him on most social media platforms under the username @stevejgordon. Martin Thwaites is an observability evangelist, .NET developer, Microsoft MVP.

05:43

He works for Honeycomb.io, an observability vendor, as a developer advocate. He travels the world spreading the good word about OpenTelemetry and observability engineering. He'll talk to anyone about observability, even the likes of us Jamos that's not very nice. So welcome, guys. Hey, thanks for having us. Yeah, thanks for having me. So this culminated in the fishbowl at NDC London, right? Yeah, we were all hanging out, you know. I was taking it easy because a co-host wasn't there, so

06:23

so at some point somebody just declared it Richard's office. There was a few other folks. I think Miss Manners came in and recorded some shows. Like, there's a few folks came and went and used it. I did some RunAs as well. I was there. Yeah, you could. Mostly we sat around. I do believe a couple of bottles of whiskey may have just disappeared into that room and never were seen again, as they do, as they do. Well, Steve is in the UK and it's afternoon there, or evening,

06:48

right, so you've got your whiskey there, Steve. That's right. Yeah, I'm declaring it post six pm. Fine time to drink, right? I am going tonight to Texas de Brazil with Robert Ramsey. Nice. Richard? Yeah, and we will, I'm sure, have plenty of meat and whiskey. So anyway, what do you guys think of following up on Charity's talk here? Are you scared or are you up to the challenge? Martin said that very well, don't you get it? Very well? Answered that question.

07:24

I mean, F bombs is kind of our company policy. No kidding. You know, I'm not saying it's part of the interview process, but I'm not not saying that either. That is so cool. I mean, you know, let's think about it for a second. I mean, who are you going to offend? Right? You go to any movie, even if it's PG, and there's F bombs everywhere, right, everywhere. But I mean, we do get some email from folks upset when they show up. We do try to bleep them too. Yes we do. But I appreciate

07:51

the sentiment either way. It's like there's so much of observability that falls into the what the heck is going on in this app, or in this case, what the F is going on. And it's, you know, trying to be Canadian here. Yeah, we have a, you've heard of MTTR, the mean time to recovery, and so we have a thing called MTTWTF, which is, when you're debugging systems, how long does it take you to go: what the F was that? No, that thing doesn't call that thing,

08:24

that doesn't happen. That's not right at all. So I haven't used this Honeycomb, but what I remember distinctly about that conversation was it's architecturally different from a lot of other, you know, well known telemetry kind of things that you hook on, or add on as sidecars, to get any kind of insight and observability into what's going on in your application in places where you just can't

08:52

put a breakpoint. Yeah, I mean, I think this is the whole tracing debate versus logs and APM type stuff, where tracing is the thing that's different. It's the evolution, if you like, of that idea of just log everything, throw it in a big bucket, try and run some Logstash things on the back of it, or Fluent stuff, to build it all and try and get a picture. It's really the tracing stuff that is the sort of next evolution, and it is not new, and that's the thing

09:22

that really gets me. I was going to say, we've had tracing for a long time, so what's new about Honeycomb tracing? So, well, Honeycomb is just a back end. We just take spans and allow you to query them in weird and wonderful ways, in rapid fire. You know the idea of an outage going on and you're waiting five minutes for telemetry data to get through for you to ask a question, to say, did I fix it? Is it now fixed? That idea of just rapid fire, ask questions,

09:52

ask questions, because that's really what you want to do. You're in that conversation with your production system that's like, you know, what's happening? Who hurt you? Yeah. And that's really what we're trying to... Show me where the bad man... Yeah. I couldn't help but notice your pain. But that's what we try to do, is really try to allow people to just

10:13

ask weird and wonderful questions. The example we use is, you know, what happens when the French user is currently in Norway using the French language pack in the Norway region on iOS fourteen point one, and that's the error that's happening. It's only on those users, because that's where we are with systems now. There's no longer this system that you use inside of your organization where your users are only your organizational users. Everything's global. You know,

10:45

people are working from home, working from anywhere. So even if you've got just an internal system, these people are from everywhere now, so you know, we're starting to get into a world where things have changed. It's not about, I've got a six month release cycle, I can wait months, I can do a QA cycle. And my whole thing at the moment is if we don't change the tools, and I'm not talking about the vendor back ends.

11:09

I'm talking about like OpenTelemetry and tracing and profiles that are coming up and various other signals that are coming up. If we don't give people better tools, but we're saying you need to do things better than you did before, like, it's not okay. Yeah, yeah, the reliability strategy of be more careful next time. Yeah, don't do that. Yeah, don't touch that. Yeah. The real point here is the tools should lead us to the path of success. They should

11:39

resist us making mistakes, and so we get to still be human and have reliable software. What a concept. Yeah. And I mean, I think from my experience that was one of the problems with just sort of relying on logging as your observability. Logging and a few metrics was us doing observability. But logging can be quite good if you spend a heck of a lot of time, and I use heck there, heck of a lot of time getting the logging in that you need. It tends to be quite retrospective.

12:09

You figure out there's a problem in a particular code path, you put a new log in, you ship to production, you wait for someone else to hit the error, and it logs something else that gives you another clue of where you want to log next, right? And so it's very

12:22

retrospective and looking back. Whereas with traces, if you design them well and you put tracing around your key points in your application code, you get the full, you know, waterfall effect of what's happened, and you've added hopefully decent attributes on there so that you can, as Martin said, go back and answer any weird and wonderful questions and combinations of events with particular attributes on

12:46

them. Awesome. And Steve, I got to ask you, like, I normally resist two guest shows because you end up with a lot of voices, but we're talking about observability, and Martin works for the company that arguably makes one of the most popular observability tools out there. Does Elastic use Honeycomb, or is it just, you know, you're trying to understand

13:05

what your own product does? What's the story there? So we provide an application performance monitoring solution as well, so we compete in a nice friendly way. So we have a different back end. Our back end's built on top of Elasticsearch, which is kind of Elastic's main product, and our APM server uses that as the data store, so you can ship your telemetry data to us. So you can use our APM agent to do that, or

13:31

today you can also use any OpenTelemetry capable agent as well. So the sort of built in, what we call the vanilla OpenTelemetry SDK, can be used to ship over the OpenTelemetry protocol directly into Elastic APM server, and then we'll store that for you and then give you, again, a different form of UI to search over those traces and those logs. I'm just staggered at the idea

13:56

that we're talking about Elastic for something other than search. What? I know, I thought they made other products, but we don't ever discuss that. So okay, so Elastic has their APM product, which is another aspect of them. But both you guys are in observability in a big way. That's right. Yeah, I think that the really, I mean, one of the reasons why I thought it was really good that we're both

14:16

here is because different back ends provide you with different querying capabilities. Right. You know, the idea that you just buy one tool off the shelf and everything just works is just not going to work in today's world. You really need, crazy, you know, you really need to think about what's going to work for your organization. You know, what are you trying to achieve with observability, and then choose the back ends that work. And that's where

14:43

OpenTelemetry is really awesome, because you can send it to two providers. You can send that same exact data to two separate providers and say, well, this one's really good at metrics. It's really good at showing me some dashboards about pre aggregated data. Awesome, that's a really good tool for that thing. It does infrastructure metrics. There's one that does, say, costing, and takes your metrics and can project costs based on previous metrics. Those are really cool tools.

15:11

There's other ones that focus on, say, serverless. There's other ones that focus on, you know, everything's just, they've got their own niches. You know, Elastic has their own niche. We have our own niche. So OpenTelemetry just makes it easy: as your needs change, you can change provider. There's no, I'm going to change provider, well, install the new agent on fifteen hundred machines. You know, well, nobody's going to do that. Or

15:37

now it's, I'm going to move vendor, I'll change three lines of config. Right. Now, speaking of that, I mean, you bring up pain, right? I mean, let's say I've got an app in production and I've got, you know, thousands and maybe hundreds of thousands of lines of code. How does one just retrofit this kind of tracing into an app like that? Obviously there isn't any kind of wizard where you can just snap your fingers and it's instant insights, right? Ten lines of code. Ten? Ten lines of code in

16:15

not for every object in every page. And so, I mean, the thing is with OpenTelemetry, we've got three core signals: we've got tracing, we've got metrics, and we've got logs. Tracing is predominantly back end. It's focused on this idea of we've got a context. So a trace is a context, a thing that's been done. So it really works best in the back end because you've got an inbuilt context of a web request. It starts and it finishes and things happen. It doesn't work well currently in the front

16:44

end, so it doesn't work well on the client side, what people call RUM, or real user monitoring, so it doesn't work well in that way. OpenTelemetry is trying to build a signal for that. So in the back end it's actually really easy. You just add, say, ten lines of code to your startup and say, I would like to see a span for every HTTP request that I get into my site. I would like to see a span for every

17:11

HTTP request I make out of my site. And when you say span, there's several definitions of that word, but in this context, what does it mean? So we're on a .NET podcast, so let's use .NET terms. You can create an activity, so, Activity, and this is where we say we've had this for ages, because Activity has existed since, I think, .NET Core two point one, I think, was when they first started bringing in Activity.
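As an aside for readers, the idea of a span (called an Activity in .NET) can be sketched in a few lines. This is a toy model only, written in Python with the standard library rather than the real .NET or OpenTelemetry APIs; the names Span, start_span, waterfall and handle_request are all made up for illustration. It assumes a span is just a name, a start time, an end time, some attributes, and a list of child spans.

```python
import time
from contextlib import contextmanager
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Span:
    """Toy span: a named unit of work with timing and attributes."""
    name: str
    start: float
    end: Optional[float] = None
    attributes: dict = field(default_factory=dict)
    children: list = field(default_factory=list)

    @property
    def duration(self) -> float:
        # Only meaningful once the span has ended.
        return self.end - self.start


_stack = []          # spans currently open, innermost last
finished_roots = []  # completed top-level spans (one per "request")


@contextmanager
def start_span(name, **attributes):
    """Open a span, nest it under the current one, close it on exit."""
    span = Span(name=name, start=time.perf_counter(), attributes=dict(attributes))
    if _stack:
        _stack[-1].children.append(span)
    _stack.append(span)
    try:
        yield span
    except Exception as exc:
        span.attributes["error"] = repr(exc)
        raise
    finally:
        span.end = time.perf_counter()
        _stack.pop()
        if not _stack:
            finished_roots.append(span)


def waterfall(span, depth=0):
    """Render a span tree as indented lines, like a trace waterfall."""
    lines = [f"{'  ' * depth}{span.name} {span.attributes}"]
    for child in span.children:
        lines.extend(waterfall(child, depth + 1))
    return lines


def handle_request():
    """Hypothetical request: one inbound span, a DB call, an outbound call."""
    with start_span("GET /orders", **{"http.method": "GET"}) as root:
        with start_span("SELECT orders", **{"db.system": "postgresql"}):
            pass
        with start_span("GET https://payments.example/status"):
            pass
    return root
```

Calling waterfall(handle_request()) gives one line for the inbound request with the database call and the outbound call indented beneath it. In a real application the instrumented libraries create these spans for you; the sketch only shows the shape of the data.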

17:40

It's been there for a while, and Activity kind of represents a sort of unit of work, if you like. You create an activity: you create one for your HTTP inbound request, one for your outbound request, one for when you hit a database. And these are all built into the libraries already. ASP.NET Core has it, SqlClient has it, the MySQL client has it, Npgsql, Entity Framework, you name it. All of those have been moved up. Even the Azure client SDKs have now been changed

18:14

over to using activity sources and activities to create these spans. So a span being this: I've got a start time, and I've got an end time, and I've got some attributes, and it allows you to build these really rich waterfalls that allow you to see how things transition through your system. But you can do that with about ten lines of code in the startup of your

18:33

application. And with Aspire, they're actually promoting this idea of these shared projects where all of that setup happens once, so it's not even ten lines of code per application, it's ten lines of code in a common startup. So you can get essentially all this automatic data just by adding ten lines of code and then putting in a little bit of config to say where you're going to

18:56

send it. You're going to put in the Elastic APM endpoints, you're going to put in the Honeycomb endpoints, you're going to put in any of the endpoints, and if you want to go further, you can

19:03

you can do zero lines of code as well. So there's also this kind of concept of auto instrumentation, where by flicking a few environment variables to configure the profiling hooks that the .NET runtime has built into it, you can point it at this auto instrumenting profiler, and that can go in and wire up, whether it be the OpenTelemetry SDK or the Elastic APM agent, wire that up

19:26

for you in your application with no code changes at all. Just deploy the environment variables and restart the app, which can be a really good way to get started with. That's amazing actually, Yeah, with basic instrumentation of what's happening. Guys, you're talking about the easy part collecting too much, Like

19:44

that's really easy. Yeah, you know, I'm a guy who spent time in the SQL world, where we used to do full traces of every transaction going through a SQL Server, because it was really great at filling up hard drives a bit. The real question is, after you gather all this stuff, and I emphasize stuff, I can think of several other words that would apply that begin with S, yeah, how do you mash this into

20:11

submission? And, you know, the line I would use when working with the team was: create actionable items, things we can do to make the system better. That is the rub, really, you know, and that I think is what has been tracing's downfall for probably the last sort of five years, really, because people see tracing as: I have a trace ID and now I can see a trace waterfall for that thing, but I need something else in order to

20:41

tell me what trace ID to use. And what that means is you end up with say ninety nine point nine nine nine recurring percent of your data not being used and just sitting on hard drives and costing you money. And that I think has been one of the big downfalls that people have said, well, tracing is not really useful because I need to know what trace it was so I'm just going to use logs because I have to keep them anyway,

21:04

And I think that's been one of the big downfalls. But I think there's there's a movement now around tracing, around things like sampling and doing various different mechanisms for how we reduce down the amount of data that we store and only

21:18

store the interesting stuff. Yeah, assessing what's interesting is not a trivial problem either, right? Yeah, I mean, interesting is subjective at best, you know. But the general rule is, you know, you keep every one of your slow traces and keep every one that has an error in it, because those are at least going to be the interesting ones. But you do that, and then you can start to think about how do I query my trace data, not how do I query my logs and my

21:48

metrics, and then get back to tracing. How do I start with that tracing data, because it's really rich. And I think that's where we're seeing this big change now with OpenTelemetry, because the back ends, all the back ends now are starting to look at this trace analysis type stuff, whether it's ours or Elastic or a lot of the open source ones that are normally based on ClickHouse as a data store. They're starting to do this idea of, well, you know, how do I get and ask questions of my trace

22:14

data? Because it's really rich, it's really wide, it's got loads of attributes, all that interesting stuff, and that really is the rub. Yeah, so always a question of what's actually interesting, because you're always afraid of what you shave off because that might have been the interesting data. But yeah, we've had so many people who come to us with, yeah, I don't really like tracing because you know, every time I wanted to ask a

22:37

question, the customer got an error. It was sampled out, or discarded, because somebody took a one percent random sample of the data. And the error happens, you know, once in a million times, so they never get the trace

22:49

that they want. And that's where we're doing a lot of evolution right now in the OpenTelemetry Collector, and some of the stuff that even me and Steve were talking about at NDC London, about how we roll up spans and write tail sampling processors that are more bespoke. And you know, this is the sort of conversations we're having now, because it's vendor agnostic, and so you've got so many more

23:11

people. You know, me and Steve, we don't work for the same company, but we're working towards a common goal of reducing the amount of data, making it more actionable, and making it easier for people to implement. So you've got, you know, hundreds of developers all working on that same

23:26

code base. And that's the major advantage of OpenTelemetry as a goal, isn't it, for the industry: that rather than every vendor focusing on their way of doing something and having their own little pieces that they can use in sales and marketing, it's actually what's the right solution across the industry for how this should be done in the best way for the organizations that need this information, to be able to store this stuff in a sensible, cost effective kind of way,

23:51

to do this level of sort of advanced sampling, sort of intelligent sampling, if you will, on the tail of things coming through. And I think we're starting to see that shift with vendors now kind of all starting to adopt, finally, I think, OpenTelemetry as that solution. Yeah, it's always a question of meshing this data together, like, how do you make sense of it? And what you see mostly, it's the same stuff over and

24:15

over again. It is what you already know. I'm tapping on the book I've never freaking finished and still hope to one day, with the history of .NET, where the SQL team had a tool where anytime a unique execution path happened inside SQL Server, it would stop. It would hit a breakpoint. So as long as it was executing the normal path, or normal paths over time, it would continue running. And so it was, it was

24:41

like a tool designed to find the one in one hundred million case. So you're iterating rapidly, and initially it's breaking on each unique path, but after you say, okay, that's a fine path, keep going, then eventually it's running for days before it's finally like, hey, how about this? You're like, now how the heck did you get there? And the reason it's in the story, in the context of .NET,

25:04

is because that's what made .NET two point zero. Making .NET run inside of SQL Server made them test .NET in a way it had never been tested before, and they found rare and unusual bugs because they were running in the context of SQL Server. Here endeth the history lesson. I love the idea of, like, the system just going, hold on a minute, can you just get your supervise it this road before boys, let's have a little chat. And speaking of that, my friends, let's take a brief break for this very

25:41

important message. And we're back. It's .NET Rocks. I'm Richard Campbell, with Martin Thwaites and Steve Gordon and my buddy Carl. Hey. And we're talking a little bit about more than one way to get observability here, and the fact that

25:56

you know, now you're really creating trouble for me. It's not like I wasn't getting too much telemetry out of my Honeycomb stack, but now I'm gonna go get Elastic APM as well, feed the same data to it twice because I just own too many drives, like, let's try and kill them all. Like, doing analysis between those data sets could be interesting too. I mean,

26:17

we do that a lot, you know, the bake-off type stuff. You know, once you've got an OpenTelemetry Collector in place, you just sort of send it to five vendors, see which ones best fit what you want to hear. You know, in accounting they call it reconciliation, right? Like, you got to add up the numbers more than one way.
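For readers who want to picture the bake-off and the reconciliation idea, here is a minimal sketch in Python using only the standard library. It is not how the OpenTelemetry Collector or any vendor works internally; InMemoryBackend, fan_out and reconcile are hypothetical names, and a span is assumed to be a plain dict with a duration_ms field. The same spans go to two stand-in back ends, and a simple check compares what each one reports.

```python
from statistics import mean


class InMemoryBackend:
    """Hypothetical stand-in for a vendor back end that stores spans."""

    def __init__(self, name):
        self.name = name
        self.spans = []

    def export(self, span):
        # Store a copy so back ends cannot affect each other's data.
        self.spans.append(dict(span))

    def summary(self):
        durations = [s["duration_ms"] for s in self.spans]
        return {
            "count": len(durations),
            "mean_ms": mean(durations) if durations else 0.0,
        }


def fan_out(spans, backends):
    """Send the exact same spans to every configured back end."""
    for span in spans:
        for backend in backends:
            backend.export(span)


def reconcile(a, b, tolerance_ms=0.001):
    """Return a list of disagreements between two back ends' summaries."""
    sa, sb = a.summary(), b.summary()
    problems = []
    if sa["count"] != sb["count"]:
        problems.append(
            f"span count differs: {a.name}={sa['count']} {b.name}={sb['count']}"
        )
    if abs(sa["mean_ms"] - sb["mean_ms"]) > tolerance_ms:
        problems.append(
            f"mean duration differs: {a.name}={sa['mean_ms']} {b.name}={sb['mean_ms']}"
        )
    return problems
```

If both back ends received everything, reconcile returns an empty list. If one of them dropped or sampled data, the count check reports the mismatch, which is the accounting-style cross-check described above.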

26:37

I wonder, we're not there yet in our industry. Like, if I really wanted to be a pain in the butt about a performance contract, the fact that I could ask for the telemetry to be sent to two different sources, and we'll reconcile the performance metrics against it from two different places and see if they actually agree. When they don't, wonder why. Like, yeah, you really press against that contract. Yeah, I haven't been a CTO in too long. You've given me new ideas. On behalf of myself,

27:12

and I am sorry to the industry. Yeah, I mean, it's all, yeah. The other side of this is, like, now we're doing multiple analysis on the same sets of data. Hopefully, obviously, different tools, like you said, have better insights in different areas. But yeah, I'd love to have a standard set of comparisons that says, oh yeah, no, we reconcile the same way, we agree on stuff, so you get more confidence. There's also different

27:41

users as well. You know, we've got the, you know, just in the sort of reliability space, you've got developers who are using this information. You've got SREs that are using this information to optimize the infrastructure. You've got platform engineers that are using it for the same things, to build tools. You know, that's just three users just in engineering. You know, you start to

28:02

think, well, what else could we use this information for? I know of a company that's running at the moment with the idea of taking metrics to do cost projections. Right, so take the metrics of your Kubernetes pods to project how much you might need to scale in the future. Take that metrics data. Don't allow people to query it and do dashboards, but take it and do

28:21

projections. You know, this is now into the FinOps world. Well, this is my mid life, right? Like, all too often we'd roll out a new feature and it would tip the system over a week later because we weren't looking at the additional resource impacts. And as we, you know, as e-commerce sites started getting really big, we were dark launching features and running

28:42

them on one server just to see what the load differences were. So I knew how much hardware to order, which just speaks to how old I am, right, because back then it was like, see, we love that feature, but we literally need thirty more servers to roll it out. Yeah, and then you end up with one server running on a previous version and then you bankrupt the company, right yeah, well yeah, we don't scare

29:03

me Martin like dinged. But I have used that line in the boardroom where it's like, I think we're about a week away from bankruptcy unless we turn this thing off, because we're gonna we're gonna crush the whole thing, Like we have to think about it. Yeah, it's uh, it is it it? And it's that whole part about the telemetry part was the it's very

29:25

hard to estimate the impact of a feature. Not only do you not really understand the resource consumption of a feature anyway, but you certainly don't know what the user utilization is going to be, like, you're just making this stuff up, right? So it is only telemetry that really tells us how much is it being used, what's the actual impact on the systems. At least now we're living in a cloud where you can just flip a knob and the CFO finds out about it in about thirty days, right? Like, not your

29:51

problem. You kept the system up. That's my metric. I'm up, like, I'm good. Well, what is this about a bill? And the other flip side of that is what we call the nines don't matter if users aren't happy. It's like, yeah, ninety nine point nine nine percent uptime, awesome. Twenty second response times, though, that's a good, yeah. But yeah, this is the game we're playing, right, is looking through

30:14

all of these things. And really, we came into this conversation talking about what's the app up to. Like, you know, here we're talking about important business cases, but mostly it's just trying to diagnose weird behavior, like, I don't know what's going on. And I have a customer that can totally use this right now. After upgrading to Blazor .NET 8 from 6, all sorts of weird things are happening, and we're still using the same services

30:45

and stuff, just the newer versions. So it's probably just some versioning thing, but weird unhandled exceptions that are evading the error handlers and closing the whole thing and not giving us an opportunity to find out where they are. Like, this is going to be a game changer for them. So what I should mention at that point is it doesn't work well in Blazor. I'm really sorry. Is it the stack, or is there something inherent to Blazor? So it's more to do

31:21

with the front end side. So if you're using Blazor WebAssembly, you have a single thread, and telemetry should inherently be done in the background. So Blazor Server has some problems with the gRPC connections and various other bits that we're working on. I'm currently talking to somebody who runs one of the Blazor University things who's trying all this stuff out and trying to work out

31:49

what that should look like for Blazor. But there's a lot of things that are just, they're new, they're not in the tracing paradigm right now. Durable Functions is another one that has some really big, interesting nuances around how do we propagate a context. So how do we say that this function execution was related to this other function execution, and how do we get that data between the two, because it's not as easy as just pass some HTTP headers.
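For readers unfamiliar with what "propagating a context" involves in the easy HTTP case, here is a minimal sketch in Python using only the standard library. It follows the header layout of the W3C Trace Context traceparent header (version, trace ID, parent span ID, flags), but the helper names new_context, inject, extract and child_of are made up for illustration and are not the OpenTelemetry API. The carrier is a plain dict, which could stand in for HTTP headers or, in the harder cases mentioned above, for message or function metadata.

```python
import re
import secrets

# traceparent layout: version "00", 32 hex trace id, 16 hex span id, 2 hex flags
_TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def new_context():
    """Start a brand new trace with a random trace id and span id."""
    return {
        "trace_id": secrets.token_hex(16),
        "span_id": secrets.token_hex(8),
        "sampled": True,
    }


def inject(ctx, carrier):
    """Write the current context into an outgoing carrier (e.g. headers)."""
    flags = "01" if ctx["sampled"] else "00"
    carrier["traceparent"] = f"00-{ctx['trace_id']}-{ctx['span_id']}-{flags}"
    return carrier


def extract(carrier):
    """Read the caller's context from an incoming carrier, or None if absent/invalid."""
    match = _TRACEPARENT.match(carrier.get("traceparent", ""))
    if not match:
        return None
    trace_id, parent_id, flags = match.groups()
    # All-zero ids are defined as invalid.
    if set(trace_id) == {"0"} or set(parent_id) == {"0"}:
        return None
    return {
        "trace_id": trace_id,
        "parent_span_id": parent_id,
        "sampled": bool(int(flags, 16) & 1),
    }


def child_of(parent):
    """Continue the caller's trace: same trace id, fresh span id."""
    return {
        "trace_id": parent["trace_id"],
        "span_id": secrets.token_hex(8),
        "parent_span_id": parent["parent_span_id"],
        "sampled": parent["sampled"],
    }
```

The sending side calls inject before making the request, and the receiving side calls extract and then child_of so both spans share one trace ID. The difficulty described for things like Durable Functions or message buses is finding somewhere to put that carrier when there is no HTTP request to attach headers to.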

32:16

So you know, Azure Service Bus has similar nuances. But we're seeing right now a lot of open source libraries that are going, I need to add tracing because people want this. They want that internal information. They want to know how the cache server is working. They want to know the messaging systems, and how long it takes me to post a message to my messaging system, not just how long it's going to take to receive

32:44

and process it afterwards. So they're adding it. RabbitMQ, in the new version, is going to have tracing as a first class citizen in there, and then it's going to have metrics as well. FusionCache was another one that I worked on with some people, and we've just seen loads more people going, yeah, tracing, I need the tracing now. So it's really good to see, right? Yeah, we're at that kind of

33:06

tipping point of adoption, which is, yeah, very encouraging. I think as people get benefit out of the box, when all of these libraries they're using are already giving them useful traces that tell them a lot of stuff even without them instrumenting their own code, that's to the benefit of everyone. I mean, we recently instrumented the client library, the language client for Elasticsearch, so that's a separate package you pull in if you want to do, you know,

33:30

.NET code to talk to Elasticsearch. So we've added some instrumentation down in the transport layer so you can kind of see what's going on there. We added our own useful attributes, so out of the box, if you're using that library and you're using an OpenTelemetry-capable, you know, collection process, then you immediately get that information and you can diagnose what's going on at that level. Yeah, and Aspire, you know, you had David

33:53

on talking about Aspire. I think that is going to be the real big game changer in adoption, because one of the big blockers is people like logs: well, I can see them in my console when I'm running, why would I use traces, because I can't see them in my console? But now with Aspire, it's like you've got a nice, convenient little dashboard that's just running there that takes your metrics and your logs and your traces.

34:15

And I'm seeing loads of people use that. They look at the logs thing and then look at the traces thing and say, well, the traces are way richer than the logs, so why would I use the logs? And I'm like, I've been saying this for five years. Thank you, you're right.

34:29

You know, the one thing I came away with, well, I came away with many things from David's show, but it was this: we're trying to make all the right defaults for you when you're building a cloud native app, and one of them is you use OpenTelemetry, like, you use that level of instrumentation. So, you know, it's exactly that: you have to fight against it now to not have all that stuff in place. Yeah, it's the pit of success, innit? You know, yeah, it's right.

34:53

But you know, it's an interesting point because there's also folks that are like, I don't know why we need this, and it's like, yeah, you know, because it's too many choices. You know, if you're perfect, if you can get everything right every time, you know, if you never need to be more careful because you're always careful, then fine, you don't need it. But I'm just not that careful. It's like unit tests.

35:15

Spend several days trying to analyze logs in your head, and, you know, correlate everything across even a good logging system, and you'll soon realize that actually a trace might have saved you a few sort of head-against-wall moments. Yeah, just like unit tests. I don't need unit tests; unit tests are for people who don't trust that they can write code right the first time. Yeah, that's true. I know I can't write code

35:37

right the first time. A certain quote from a certain Billy Hollis comes to mind. Yeah, you might be addicted to code. You might be. Yeah, and you must suck as a coder if you need two people writing ten lines of code for every line of code you write. Yeah, I mean, even speaking to an extreme case, but it's like, look, people make mistakes, they get tired, they work too late, they don't

36:05

quite understand the problem. Like, all these things are real, and you look at all the work we're doing lately around shortening the inner loop cycle, the impacts of GitHub Copilot, they're all speaking to the same thing: more correct code. And Aspire, to me, looks like this tool that says, okay, so you're going to be cloud native and you haven't done it for thirty years for some strange reason. So now we're going to give you a tool set, you know, a framework that leads you to that path. It's

36:34

a distributed debugger. Yes, you know, for the cloud. That's, you know, I joked in some of my talks over the years about how you can't attach a debugger to production. You know, nobody attaches their Visual Studio instance to production. Well, sorry, nobody does it more than once. But, you know, nobody does that because, well, for a start, it blocks all the threads,

36:59

and that's, you know, probably a bad thing. But also, now you have to attach a debugger to fifteen different services, because you've got five different services running three replicas and you don't know which one it's going to hit. And, you know, we're in this world now where we need that debugging experience. Nothing's changed. You know, we still need to debug our systems, but we need to debug them in an environment where they're replicated and

37:22

they're distributed. Some of them are using the event driven systems, and you're like, I can't, I can't correlate these logs across an hour's worth of time window. And yeah, it's just it's hard. And that's where I think people are getting to now that they're starting to add tools and tools and tools on top of the old logging systems that they had to try and meet

37:45

the demands of these distributed systems. And then they realize that actually, no, there's something that's more native that they can use instead, and it requires a paradigm shift. And I think that's where, as an industry, we're bad at acknowledging that telemetry and logging aren't the same thing. Because they're not, weirdly enough. But, you know, I've been saying for a while now that if you can answer all the questions that you need from the logs that you've got,

38:13

you don't need traces. Yes, you're still fine. Yeah, but the question is, can you? And I don't think you can, or you're not asking good questions. Yeah, I mean, that's when the number of users rises above one. That's an old scalability problem. I found an app where, as long as one person was using it, it was fine, but as soon as two did, not so good. That was a

38:44

long time ago, my goodness. I mean, I've hit that problem so many times when people use singletons instead of transients, and, you know, they store state in a singleton, and especially in multi-tenant systems, and it's like, yeah, when we've got one tenant, it's fine, but as soon as we get two tenants, everybody's got everybody else's data, and

39:02

that's a bad thing, I think. Holy man. Yeah, easy mistakes to make too, right? Like, you realize we don't think about that anymore, that the framework is just doing that for us now most of the time, as long as you follow the rules. Like, we're definitely working on a different tier of problems now. We're used to distributing across multiple machines, and now it's, you know, in different clouds, with different

39:25

platforms. We definitely build way more complicated software. There's no question that we need to change the way you measure them. The debugging story is separate though, right? Like, you just sort of casually mentioned the debugging on top of it. But telemetry is not necessarily for debugging, is it? So it can be. Telemetry isn't a debugging tool, but you can use

39:51

it for debugging. You know, one of the big things we talk about at the moment is the difference between three pillars of observability and just telemetry signals. So the three telemetry signals being logs, metrics, and traces, they allow you to do lots of things, and they allow you to build dashboards. They allow you to do monitoring, build alerts. You need one or more of these. You can use one, you can

40:15

use three. You don't need all of the signals. But when you're starting to do the debugging side, you need to work out which signals you need to do it, and you can correlate them sometimes. One of the key ones is the user request is taking too long, and it turns out it's actually just one pod, and that one pod has its CPU spinning out, and you look at infrastructure metrics for those things. So debugging, or do you call it

40:43

debugging, working out what the problem is in production, the root cause analysis? Yeah, debugging to me speaks to: I am fixing code, often because I'm looking in an executing environment for what's going on within that code. I could see having the telemetry sitting beside me because we're seeing a behavior in production, we're not quite sure what it is, and so now I set up a debugging environment

41:06

to try and capture that. And I suppose so. I've been playing with this narrative, and let's try this, which is: I think tracing is kind of like conditional breakpoints that you can use with a time machine. Now, when I discovered conditional breakpoints in .NET, it was like a revelation. When you're running a SPA app on the front end and it's going through this line of code, and you want to know when it hits this line of code.

41:32

But when it comes from this endpoint is when I want to be able to hit it, because it hits sixteen times and I don't want to have to hit play every time, and I've got multiple threads going on.
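The "conditional breakpoint with a time machine" idea can be sketched as a filter over spans that were already recorded. This is a toy model under stated assumptions, not any vendor's query API: the `Span` class, the `matching_spans` helper, the attribute keys, and the sample data are all illustrative.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta


@dataclass
class Span:
    """A simplified recorded span: a name, a start time, a duration, and attributes."""
    name: str
    start: datetime
    duration_ms: float
    attributes: dict = field(default_factory=dict)


def matching_spans(spans, name, since, until, conditions):
    """Return spans with this name, inside the time window, whose attributes meet every condition."""
    return [
        span
        for span in spans
        if span.name == name
        and since <= span.start < until
        and all(span.attributes.get(key) == value for key, value in conditions.items())
    ]


# Illustrative history: the same database call reached from different endpoints and users.
now = datetime(2024, 3, 2, 12, 0)
recorded = [
    Span("db.query", now - timedelta(hours=20), 840.0, {"http.route": "/orders", "user.id": "u-17"}),
    Span("db.query", now - timedelta(hours=20), 12.0, {"http.route": "/health", "user.id": "u-17"}),
    Span("db.query", now - timedelta(hours=50), 15.0, {"http.route": "/orders", "user.id": "u-17"}),
    Span("db.query", now - timedelta(hours=19), 9.0, {"http.route": "/orders", "user.id": "u-99"}),
]

# "What happened in the last day when this person hit this endpoint for this database call?"
hits = matching_spans(
    recorded,
    "db.query",
    since=now - timedelta(days=1),
    until=now,
    conditions={"http.route": "/orders", "user.id": "u-17"},
)
```

The condition plays the role of the breakpoint predicate, and the time window is the time machine: the question is asked after the fact, against data production already emitted.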

41:42

If you think about that, but being able to do it in production, and then say, what happened yesterday when this person hit this endpoint for this particular method or this particular database call? That, I think, is this revelation of, now you're using that telemetry from production to do debugging. Yeah, to construct an executing environment to recreate the problem in production. Like, it gives you the cues to say, we need to be in

42:12

this state to be able to see this bug. Yeah, to be able to see those constraints. Because you still can't set a breakpoint in production. Well, you can, you just won't have a job after it. Sorry, Steve, I stepped on you there. Yeah, no, I was just, yeah,

42:30

I'm totally on board with, you know, that point there. And I think it's the ability to apply constraints to the data that you have that would be very difficult with just pure logs, where you want to set up this series of criteria around, I want to look at this, coming from this, in this region, you know, deployed on this particular piece of infrastructure, and then see a waterfall visually of, okay, what did that flow through in

42:54

the code? Where are those, you know, lines on this waterfall wider, taking longer? And then you narrow down: okay, well, that's this block of code or this library. That's where I need to go and look to try and find out why that's taking so long, that piece of the puzzle, with these given sets of criteria applied to it. And that's something I think only well-defined tracing can really give you. You could sort of get there with logs, but you would take a lot longer to

43:27

get there, I think. Yeah, there's always a question of, do I understand, with any given log set, it's like, these six logs, where is this particular transaction across these logs? Like, we've always had a tough time joining that stuff up, and telemetry does a better job with that. Yeah, and then relating that back to infrastructure, because, you know, we don't run our code in isolation. Infrastructure is a thing that we use.

43:52

So yes, it ran slower. But was it running slow because that Kubernetes node was overloaded? Was it running slow because we just switched slots in service? You know, that infrastructure stuff is just as important. Yeah, it's, you know, again, I've spent a long time on the ops side of this, because, why is this slow? And it's like, it's your hardware. No, it's your crappy software. Like, you go back and forth on that a fair

44:15

bit. I have had a situation where we had a 100BASE-T hub in the network we'd forgotten about, and it literally was throttling all the network traffic. So, you know, it looked like the database was slow, and it's like, literally it took a long time for the data to come and go from that because this network mistake was there. Like, those things happen, and the trick is to not be mean about it. Like, everybody sit on the same side of the table in the boardroom with the projector on

44:45

going, what is this data telling us? And that's correlation. Yeah, you know, that's the idea of, you know, the ops people are looking at their infrastructure dashboards and, you know, there's a thing here, and then the developers are looking at their log aggregation dashboard that shows them some stuff, and they've got some graphs. It's like, whoa, these two graphs, they go up at the same time. Ah, maybe there is something here.

45:08

Is there a knob here that we can wiggle, like, you know, just kick it out? Can we change that number for anything? Like, when we do this, does it make a difference? So at least you have a sense that, yeah, there's a constraint here and it's related to this value, and if we modify it, we can improve it in some way. Convincing people that we have to buy new hardware is challenging.

45:31

And again, cloud took a lot of this away, because it costs you nothing to turn that up for a test and then turn it back down again, like, a buck. You solve a lot of problems just by doing those tests. Yeah, I just ran a workshop on Kubernetes for some attendees, and I spun up a Kubernetes cluster for every one of the attendees for two days. And it cost me, I think, four pounds per attendee to run a

46:01

three-node Kubernetes cluster in AKS for two days. I mean, can you imagine if you were having to do that on premise and ship in all that hardware, three three hundreds just for each attendee, and a rack whirring in the background, but with all the blinky lights? So that'd be a thing. Yeah, yeah, we'd have blinky lights. But, you know, you can get a bunch of RGB LEDs, put them in a box, just call it the computer, right? Anytime the wire goes back into the cloud.

46:30

Might do that for the next workshop. It's like, yeah, you're all running those blinking lights in the corner. That one, it's off. That's Carl's. Carl! Oh man. What have we not talked about here? Because there's a lot to doing a good job here. What are the learning ingredients? Like, how do people really get, you know, it's quick to install, ten lines of code and you're up and running, but there are some things you need to know. I think start with Aspire.

47:05

You know, start with a local dashboard. Start by turning some things on locally in your local environment, development environment, and see what some of these traces look like and see whether they're useful before you start thinking about where I'm going to put them, what observability backend, what scale do I need? Am I going to use open source and host it myself? Am I going to go with a SaaS vendor? Am I going to go with one built into my cloud? Use Aspire. That's what the real

47:31

cool thing about it is: it's available as a container now that you can just run up locally. You don't need to use the DCP and all the other bits that go with it. You can just run it in a container and just experiment with it. What data do you see? Is it going to help you? You know, the work-to-value ratio should be really small anyway, it's like ten lines of code. Great, but start with something like that locally, because you don't need to choose a vendor. I

48:00

equate this to, you remember when they brought in ILogger? And the great thing about ILogger is it saved six months on every project, because the first six months of any project was deciding between NLog and log4net, and the arguments, the arguments that people had. I mean, that was the true

48:20

flame wars, you know. But they brought in ILogger, and the problem was that you needed to make that decision really far up front, because if you didn't, then you'd have to retrofit it across all of your code and tell everybody, use this one. They brought in ILogger and they were like, great, yeah, everybody use ILogger. You'll just send it through ILogger and then you'll choose your vendor. And I think OpenTelemetry is the same thing

48:44

is we've got a gimme now. It should be in everything, whether you're going to wire it up or not. It costs you nothing to have it there, and when you do wire it up, it'll help. Yeah, and that's the place to start. Well, and I like this tipping point angle that Steve brought up, which is that more and more vendors see it that way. So the information you're getting out of OpenTelemetry just keeps getting better. Yeah, this whole thing works best as more and more vendors

49:07

get on board. So that's the library authors, it's the, you know, the various different language runtimes, it's the OpenTelemetry team. It's people within your organization, engineers within your organization, buying into the fact that, okay, we've got this. I mean, in .NET it's super easy because we do have Activity built in. It's there. If you're using .NET, modern .NET, you've got the libraries, you've got the APIs. You

49:30

just need to turn this stuff on and start instrumenting code. But as all of these people shift towards that general direction, I think that's why OpenTelemetry now is really starting to look like a good solution for everyone. What's the App Insights story? Should we just rip it out? So Application Insights is currently redeveloping all of their SDKs to use ActivitySource and Activities. So it's not... Wow, I'm getting those now, saying you're going to have to

49:57

change this soon. So it shows that, you know, Microsoft's listening to the OpenTelemetry story. I mean, they had the experimental flags in the Azure SDK last year, I think, and I think those are being removed now. So all the Azure SDKs, anything that you use from Cosmos to Service Bus to Event Grid, all of those things will be instrumented with Activities, which means they're compatible with OpenTelemetry to send it. And Azure Monitor, they've got

50:22

their own exporter now. They don't accept OTLP yet. Whether they will, I don't know. I don't work on that team, but they do have a distro so that you can add the Azure Monitor exporter, and then you can just use OpenTelemetry and decide later where you're going to send it, or just add the one line of code that says add Azure Monitor and it will appear in App Insights. Everybody's doing it. And to go back to this Blazor thing, do you recall hearing any kind of timeline as to when

50:53

things will be more copacetic with tracing and everything else that's going on? So I had a long conversation with the other Steve at NDC London about exactly that, the more famous Steve. So they're looking at it, they are aware of the issues that are coming up, and they are, you know, wanting to solve them. So I would imagine that, yes, they will be solved soon, and you can do it. It's just not as easy as ten lines of code. You know, you've got to go through a few more hoops,

51:32

especially with the whole render mode thing. If you're using dynamic render modes, you've got some other problems that have to get solved first, state problems and things. It's something I'm currently working on. But I'm hopeful. I'm hopeful for a future where this is the default. You know, this is, well, I just add those ten lines of code, or even just add one line of code when we've worked on the SDKs a little bit more, and just add the one thing that says send my stuff, and, you know, it's

52:02

a, what is it, services dot LetMeDebugMyThings. You know, that kind of simplicity just becomes the new cargo cult, if you like. I don't just cargo cult things. Well, I do cargo cult things, but it's the new thing I cargo cult. Well, guys, great conversation. Thanks so much for coming on. Absolutely. And what's next? What's in your inbox, Martin? So I'm doing a load of conferences. I'm doing some

52:34

meetups around the UK. I'll be at KubeCon, the Kubernetes cloud native conference, in March. So, yeah, I am flying around the world spreading the good word about this sort of stuff. How about you? Yeah, conferences coming up, so I'll be in Sweden next week for Swetugg, which will be in the past by the time this one goes out. And then, yeah, day to day we're going to be focused on where Elastic

53:01

and the engineers can contribute to OpenTelemetry. Really, I'm starting to poke around in the OpenTelemetry SDK repos and think about how we can make that experience good for everyone. So, yeah, I think that will keep you busy for a while yet. Well, this is all great news. Thanks again, and I'll catch up with you at some conference somewhere, I'm sure. Sure. All right, we'll see you next time

53:27

on .NET Rocks. .NET Rocks is brought to you by Franklins.Net and produced by PWOP Studios, a full service audio, video and post production facility located physically in New London, Connecticut, and of course in the cloud online at pwop.com. Visit our website at dotnetrocks.com for RSS feeds, downloads, mobile apps, comments, and access to the full archives, going back to show number one, recorded in September 2002. And make sure you check

54:20

out our sponsors. They keep us in business. Now go write some code. See you next time.

Transcript source: Provided by creator in RSS feed