OpenTelemetry with Laïla Bougriâ

00:01

How'd you like to listen to .NET Rocks with no ads? Easy. Become a patron for just five dollars a month. You get access to a private RSS feed where all the shows have no ads. For twenty dollars a month, we'll get you that and a special .NET Rocks patron mug. Sign up now at patreon.dotnetrocks.com. Hey there, this is Jeff Fritz, the purple blazer guy from Microsoft, letting you in on

00:28

a little secret about my friend Carl Franklin. You know, the guy who started .NET Rocks, the first podcast about .NET, in two thousand and two? The guy who's been teaching Blazor on YouTube since twenty twenty? Yeah, that Carl Franklin. Well, Carl's joined up with the folks from Code in a Castle to teach a week-long, hands-on Blazor class at... are you ready to get this? At a castle slash villa in Tuscany. It's sort of a luxury vacation with Blazor learning built in. Carl's calling it the

01:03

Blazor Master Class. You'll learn Blazor from the ground up, finishing the week with the ability to build and deploy Blazor applications. Since the training happens for only four hours in the morning over six days, you can bring your significant other, your partner, with you, and you should, right? This part of Italy is absolutely beautiful. There's so much to see and do, and Larry and Marco from Code in a Castle are organizing daily activities both at the castle and in

01:34

the area. The castle is in the Maremma, a less touristed region of Tuscany, offering both classic Tuscan hill country as well as easy access to the Etruscan Riviera, with sublime local food, wine and olive oil around every corner. Breakfast is included every day. There will be two communal dinners at the castle bookending the experience, and most other meals and all activities are included. And did I mention you'll learn Blazor in person from Carl Franklin? Listen.

02:07

Space is limited, and for very good reason. This is quality training in a beautiful setting. Go to codeinacastle.com/blazor2023, that's B-L-A-Z-O-R two zero two three, to take advantage of this amazing opportunity to join Carl in Tuscany for an unforgettable week of la dolce vita while advancing your programming skills in this important new technology. After building software for a while, you know it's only a matter of time before you see an HTTP timeout

02:43

or a database deadlock. In software, it's not a case of if things fail, but a case of when. One mishap like this and valuable data is lost forever. And these failures occur all the time, but it doesn't have to be this way. Introducing NServiceBus, the ultimate tool to build robust and reliable systems that can handle failures gracefully, maintain high availability, and

03:07

scale to meet growing demand. For more than fifteen years, NServiceBus has been trusted to run mission-critical systems that must not go down or lose any data, ever. And now you can try it for yourself. NServiceBus integrates seamlessly with your .NET applications and can be hosted on premises or in the cloud. Say goodbye to lost data and system failures and say hello

03:30

to a better, more reliable way of building distributed systems. Try NServiceBus today by heading over to go.particular.net/dotnetrocks and start building better systems with asynchronous messaging using NServiceBus. Hey, Antwerp! It's .NET Rocks! Holy crap, there must be fifty thousand people here. Yeah, who knew we were in a stadium? I know, right? We are in a sta... we're actually in a movie theater, which is cool. It is cool. The last time we did a .NET Rocks

04:18

in a movie theater, I think, was in Sofia, Bulgaria. Oh yeah, do you remember that? Yeah, a few years ago. That was DevReach. And the funny thing was the Bulgarians thought it would be funny if Richard and I announced the names of the winners, the Bulgarian winners of the swag, at the end of the show. And that was funny for them. For them. Bulgarian names need to buy a vowel. Okay, it lacks where snudge and they were just laughing their butts off. That's you. Oh, you were there?

04:55

Okay, all right, well, we're glad to be here, obviously. Yeah, last day of the show. Fun to do a live show. Yeah, it's a live show. We have some good stuff coming up here. Laïla Bougriâ is here and we'll be talking to her in a minute. But first we have to do this little thing called Better Know a Framework. Roll the crazy music. All right, buddy, what do you got? Well, I saw this come across Twitter. It's a tweet, and while

05:28

I'm linking to it, Boston Dynamics. You know who those crazy people are, the guys with the robots. They make the robots that dance, they do backflips. Now, yeah, a little parkour with robots and stuff. Yeah, they used to have these little uncanny tethered robots, robots that were tethered, and then they got gas-powered robots. Now they're battery driven, and they put out videos every once in a while of things that look like animals,

05:51

like cougars and dogs. Well, anyway, they've put ChatGPT into a robot and now you can talk to it and it will talk back to you. And so there's a tweet about that. It was from April, but I thought it was so cool and a little bit scary. They're asking it questions like, you know, it's like Data: are you functioning within normal parameters? You know? And it would say, yes, you know, I have this, blah blah blah. What my levels

06:20

are, you know, these kinds of things. My battery level. What did it say? How many interactions in your last mission? And it will tell you about its last mission and where it went. Just that they used the word mission... I don't even know if it was mission, but... An extermination mission? It could have been just wandering. It's pretty scary, but it's cool. That's awesome. So that's what I got. It's a tweet with a video. Okay, yeah, Boston Dynamics videos are

06:48

always amazing, fun. Good one. Who's talking to us, Richard? I grabbed a comment off of show 1753. That's the one we did with Mika about Visual Studio 2022 productivity back in August last year, and Mark Wansel had this great comment. Mark has a lot of great comments as well. And he says, one of the things that Mika noted a few times was telemetry. I think that this would be an interesting show topic. There's a popular

07:14

GitHub project called OpenTelemetry that might be a good starting place. There seems to be an art to telemetry: what to collect, how much, performance considerations and privacy considerations. Check out opentelemetry.io. What do you think? Nah, that's not a good idea. Don't want to do that. No, we won't do that show. Yeah, sorry, Mark, I wish

07:32

we could help you, but we can't help you. Yeah, but I will send you a copy of Music to Code By. And if you'd like a copy of Music to Code By, write a comment on the website at dotnetrocks.com or on the Facebooks. We publish every show there, and if you comment there and we read it on the show, we'll send you a copy of Music to Code By. And you should follow us on Twitter. But the real fun happens on Mastodon. On Mastodon, I'm @carlfranklin at techhub.social,

07:54

and I'm @richcampbell at mastodon.social. Send us a toot, let us know you're listening. And that brings us to our show on OpenTelemetry. What? Yeah. How? Sorry, that's the topic. Laïla Bougriâ is here. She's a software engineer with over fifteen years of experience in the .NET space and currently works at Particular Software, where they build NServiceBus. Maybe you've used it, maybe heard of it, certainly done the show a few times.

08:18

Yeah. She's a Microsoft MVP and frequent speaker at conferences, and in her spare time she loves to knit and crochet. Welcome. Thank you. How about that, Laïla, huh? We were lying when we said we weren't going to do a show about OpenTelemetry. That's kind of why we read that. Yeah, yeah, I figured if I've got a comment where literally somebody asked for OpenTelemetry, now is the time to read it. So, Laïla, what's

08:43

the elevator pitch for OpenTelemetry? Well, the elevator pitch. So we've had multiple telemetry signals in software for years, right? The oldest one being logs. I think we all know logs. We've also mostly used metrics. I guess distributed tracing is the newest signal. But I would say that the elevator pitch for OpenTelemetry is really correlating them all together. And that's what makes it really interesting, for me at least, because, you know,

09:18

each signal has its own value. And we've also been logging for years, and even though we can see, like, you know, tracing might be the better option today, we still have all of those logs, and we don't want to go rewrite entire applications. Though we do have log providers, so you can plug in however you want to do your logging. But that's taking the same source and putting it in different places, isn't it?

09:41

You're talking about ingesting different sources into one place. Exactly, yeah, and then being able to connect it. So basically, imagine, you know, that you get alerted and there's a metric that looks out of whack, and it's like, okay, something is up. What is up? I don't know, just the metric is out of whack. So you have to go figure that out, and that's usually a challenge, because you then have to go figure out, okay, how do I connect

10:13

that to those other signals that I have? But what if you could look at the metric and say, okay, I can see that it's correlated to these traces, and then go look at that, and then the traces would also be connected to the logs, and you would be able to basically, yeah, paint that entire picture of what's going on, and you wouldn't be losing all of that time doing that thing manually. Sounds like a Wikipedia rabbit hole, right? One thing leads you to another to another. So I'm trying

10:41

to distinguish between all these different things. I mean, logging means a particular product like SQL Server spitting out logs about how it's functioning, possibly, versus metrics being more things like the state of a server, like the hard drive's pinned or it's running low on memory. And then when I think about traces, I think about tools that specifically are about following a button click from the client to

11:09

the server to the database and back again. Right. Yeah, I like to think of that like following a business transaction, right, right, like the workflow. Right. So it's true, Yeah, that's definitely true. But at the same time, we've also been logging in our applications a lot,

11:26

and that still then connects together. This is as a developer writing code to push messages onto a log? Right. So I'm a developer, right? So I'm always looking at things from the application perspective: how do I make this application observable? And there are multiple signals, and you would choose each individual signal based on whatever your scenario is. It could even be that for a specific scenario you might come to the conclusion that using multiple signals

11:58

might be useful. So for example, let's say that a failure occurs, right? You want to keep track of that and have a trace that reflects that failure. But you might also want to have that reflected in metrics so you can do the alerting and all of that. Yeah. So that's a way to bubble it up through the system, to say that a failure occurred, as well as what shows up in the log. Right. Now,

12:22

there are a lot of third-party products out there that do telemetry. Does OpenTelemetry allow you to use those as sources and then pull them together, or does OpenTelemetry have its own things that you can plug in, or both? It's both, and that's the good part, because I think if we look at our applications and we want to make them observable, there's a lot

12:46

that we can do. But I think with the OpenTelemetry project, and also all of the effort that the entire community has basically put into this, they've made it easy for us, right? Because you could basically say, what stack am I using? Oh, I'm using ASP.NET Core and I'm using the Azure SDK, whatever it is, right? And you could just

13:07

use those instrumentation libraries that are available for those frameworks. It could be built into the framework or it could be a dedicated package, and you can turn them on, and just by doing that, you're already collecting a bunch of information. And specifically in a distributed system, it's usually going to give you insight into that interservice communication, where we have the blind spots. So it already gives you

13:33

a lot of information. So the cool thing is that you can then intercept, basically, those traces that are being generated by those libraries, and in .NET that would be called an activity, right? And you could create your own activities, which is, by the way,

13:50

the sort of same thing as a span. But yeah, basically what they did is, the Activity API already existed, and instead of creating a completely new API to match the naming of the OpenTelemetry specification, they just implemented the specification inside the Activity API. So that's why the name is a

14:09

little bit different. But basically, what you could do is take the current activity, which could be emitted by an instrumentation library, and say, I want to add some information to this that is specific to my application, to the workflow I'm running in, so that I can get even more insight. What's the voodoo that allows us to go between the different tiers in an app and

14:31

say these are all part of the same transaction? Right, that's basically a propagation mechanism, really, because if you think of a trace, it's basically a bunch of spans that are connected to each other. Now, what happens is, at the beginning of the trace, we basically get a trace ID assigned, and that's going to be carried across all of the spans, and then each span has a unique ID. Now, in order for that information to propagate across multiple

15:01

services, we need a propagation mechanism. And there are multiple protocols that are basically supported by the OpenTelemetry project. The most well-known one is W3C Trace Context for HTTP headers. I always feel like I need a breath. Yeah. And then there's another one for gRPC. So it depends on what you're doing there. So I have, for example, an Azure web app, right? I have Application Insights turned on, I've got all the switches lit up. Do I need OpenTelemetry at that point?
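
As a rough illustration of the propagation mechanism described here, the following Python sketch shows how a trace ID can travel between two services in a W3C Trace Context `traceparent` header while each span gets its own ID. It does not use the OpenTelemetry SDK; `new_span`, `inject`, and `extract` are hypothetical helper names made up for this example, and the two "services" are simulated in one process.

```python
import re
import secrets

# traceparent layout per W3C Trace Context: version-traceid-parentid-flags
TRACEPARENT_RE = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


def new_span(trace_id=None, parent_span_id=None):
    """Create a span record; reuse the trace ID if one was propagated."""
    return {
        "trace_id": trace_id or secrets.token_hex(16),  # 32 hex characters
        "span_id": secrets.token_hex(8),                # 16 hex characters
        "parent_span_id": parent_span_id,
    }


def inject(span, headers):
    """Write the span's context into outgoing HTTP headers."""
    headers["traceparent"] = f"00-{span['trace_id']}-{span['span_id']}-01"


def extract(headers):
    """Read (trace_id, parent_span_id) from incoming headers, if valid."""
    match = TRACEPARENT_RE.match(headers.get("traceparent", ""))
    if match is None:
        return None
    _version, trace_id, parent_span_id, _flags = match.groups()
    return trace_id, parent_span_id


# Service A starts the trace and calls service B.
root = new_span()
outgoing_headers = {}
inject(root, outgoing_headers)

# Service B continues the same trace with its own span.
context = extract(outgoing_headers)
child = new_span(*context) if context else new_span()
```

Both spans end up with the same `trace_id`, and the child's `parent_span_id` points at the root span, which is what lets a back end stitch the two services into one trace. In practice the instrumentation libraries do this injection and extraction for you.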

15:35

What's it kind of give me over what Azure Insights already has? So the way that I look at that is, what Application Insights provides you is also sort of known as black box instrumentation, so it's basically independent of your specific

15:52

application code. Yeah. But obviously the things that we are doing in our code are usually the interesting bits, and sometimes we need a little bit more insight to understand, you know, what pieces of code are we executing there, what are we doing, what is the cause of latency or whatever

16:11

it is. What piece is slow? Application Insights doesn't provide that? Well, I think it's quite different to compare, because at least to my understanding, it's more of an overall view that you get, and you can still use the Application Insights SDK directly and still emit application-specific telemetry. But then you're tied, right? You're tied to the vendor. Right, and the thing is... Right, right, exactly. And if you then want

16:41

to change to another vendor, you have that vendor lock-in. That's the point, right? Yeah. So if you use OpenTelemetry, then you could just wire up a different exporter. I mean, App Insights has good features for .NET specifically. Absolutely. Yeah, but if you've got some other code written in other things that aren't .NET related, yeah, App Insights is only going to do so much for you, and it's certainly not going to work if

17:03

you're in a container on AWS, is it? Yeah. So, I mean, it certainly opens the door to working with more platforms, even more places. And open. Hey, hey, there's a concept for you. Yeah, and it being available cross-platform and cross-runtime as well. Like, there are implementations of instrumentation libraries in many languages, for many frameworks. So that's really cool, especially

17:27

if you think of multi-stack applications. Right. Sure, if I've got a group of Python developers going to build a data importer for me, the fact that I can instrument it the same way as everything I've got built in .NET, that's pretty compelling, right? And bring everything under one roof and measured the same way. I mean, that's the real problem, is that often we're chasing problems that transition between different systems, and because they're measured

17:51

differently, it's very hard to associate stuff together. I'm just thinking in terms of how much code I need to write as a developer to take advantage of all this, and how much of it comes out of the box. Well, that's why I mentioned the instrumentation libraries, right? And even, you know, things like event counters and stuff like that, even that has dedicated libraries available already, so you can basically just turn them on, and like I

18:15

said, you could plug into that and add to that information. So usually what I say is, look at what it already gives you. Turn on the instrumentation library. Yeah, exactly, don't reinvent the library. So look at what's out there. Turn it on and see what it emits, and take a look at what type of insight that already gives you, and it

18:37

probably is picked up by OpenTelemetry and you're just fine. Yeah, yeah. Well, for example, let's say that you have ASP.NET Core instrumentation enabled, right? So you're going to see the request, but you don't get a lot of insight into the request. But I mean, it has hooks where you could then plug in additional information and expose whatever is, you know, interesting to you in that scenario, so that you could understand the sort of business

19:03

context of what's going on. And that's what makes it really powerful. I spent enough time on the firefighting side of being a sysadmin, where there's stuff being spewed out in the logs that we're looking at. We just don't know what it means, right? Right, like, is anything there? Yeah, it's clear. It's like your own little Internet. Everything you need to know is there, you just can't find it. Yeah. So, you know, how do

19:25

you add the additional information that helps someone see, this is where we're having trouble? Right. That's where the vendors really come in, right? Because they are then going to offer capabilities that allow you to query, to analyze that information, and to basically get to actionable insights, because that's the whole point of telemetry. Right. That's a bunch of data, but then what do I do

19:47

with it? Right. So you want to basically have tools that help you get pointers on how do I fix this problem, or how do I improve this latency issue that I'm seeing, or maybe even see which feature gets used a lot, and things like that. This is almost more of a profiling thing, right? Like, what functions are called the most often? And we know perfectly well why the system is slow. It's the database's fault. We just blame the DBA. Then we're done. Life is

20:17

good, all right. You know, another one that I always think of is, because I'm, you know, an observability enthusiast, I would say, right? And the reason why I'm so enthusiastic about it is because, well, my sort of core focus has been message-based systems for years. Those are hard to troubleshoot. Yeah, it's just messages split across queues, and then you get out-of-order messages, and it's like, what's going on?

20:49

Why am I seeing this fail? And especially in a message-based system, usually the problem is happening far upstream, right? Right. So how do you get to that? Right? And how do you connect that even outside of the messages that are being sent? Because it has to maybe connect back to, like you said, a click somewhere on a user interface. So being able to have that full visibility across all of the

21:15

subsystems is really, really powerful. Yeah, and again, I'm still worrying this is a lot of code for me, right? But you're telling me that when you're using a library that has protocol understanding, it's going to insert a lot of that information automatically for us, so it's all sewn together. Yes. Yeah. And because of the sort of nature of how distributed tracing works, that information is going to be connected together through that same trace ID that's basically being propagated. Does it

21:41

have substantial overhead? Is there any reason to only turn it on when you have a problem, or can you leave it on all the time? Okay, that's going to be a long answer. Or you can go for the "it depends." Definitely. So yeah, it definitely depends.

22:00

But usually what I tend to say is, make sure that what you're collecting is useful, right? Start there. Because if you just turn, I don't know, every instrumentation library on the planet on, log all the things and capture all of the traces and all of that, you're going to have to sift through all of that information to be able to understand what's

22:23

going on. Right. So it's also not a thing of, oh, look at all of these instrumentation libraries, and then turning everything on, because you're going to be incredibly overwhelmed, to the point that as a developer you might feel like, okay, this is not useful. Yeah, let's just turn it all back off. I mean, I've also had the problem where I've said, okay, I'm not going to measure this thing, and then I never get data for that thing. Like, it turns out I'm looking for the wrong,

22:47

in the wrong place. Like that's not a number that moves. So some of these telemetry products that are out there have ways that they can work on a background thread or they can attach as a sidecar, you know. So do you have those kinds of things where you can sort of stay out of the way so if there is something that takes up some more time, it

23:11

can happen on a background thread. Yeah. So that's where the OpenTelemetry project is also really interesting, because let's say that you look at the basic samples that are out there for .NET, right? What you're going to see there is that you can basically, in a service, enable OpenTelemetry and add an exporter, which means that you're collecting, let's say, for the sake of the

23:30

example, traces, and sending them directly to an observability back end. That could be Azure Application Insights or Jaeger or Honeycomb, whatever it is. But the thing is that there are many problems with that. First of all, like you said, there is overhead for that service, because it has to collect all of that information. There might even be some processing happening behind the scenes as well. We had a service bus... Well, well,

24:00

that's... we'll get to that. We'll get to that. Yeah. And then you have to export it. But then imagine that the observability back end is not available for a few seconds, because, you know, it's the network.

24:12

It's the network. Yeah, right. So, well, no, usually that's handled by the libraries themselves, but it is adding that pressure to the service's telemetry, right? Yeah, you don't want to have that disconnected information and then be looking at half of the story, right? But there are ways to solve this, and that's where the OpenTelemetry Collector comes in. And then

24:38

you have multiple deployment options on how to run that. The first one is, like you said, a sidecar. So basically you're going to have a sidecar for each service that you're instrumenting, and then immediately you're offloading all of the telemetry. Well, you're collecting it and sending it through to the sidecar. But then there's any processing that needs to be done, like redacting information, because, remember, sensitive information, you don't want that in all of that telemetry

25:04

you're collecting. So that's just down the road. We can be fined very easily. I call that digital white-out. So then you have all of that processing, and then you could export it to the observability back end, and you'd be able to handle all those communication issues and all of that in the

25:22

sidecar, and your service is not affected. Now that's one option, but you could also set up the OpenTelemetry Collector as a gateway, so it's a central component, and all of the services can basically send their telemetry information to that central component, which then will take care of processing all of that information and sending it to the back end. Yeah, and it could batch that information, and it's a pretty powerful thing. And you could even, like, go crazy, or,

25:52

if you need it, right, you could have a sort of hybrid model in which you have a sidecar per service, which is then sending its information to the central collector. Is there anything special in the storage on that? Are these just blobs or text files, or are they actually using a database on that? Well, I think it depends on the signal, sure, because usually metrics go to time series databases, and then logs... honestly, I don't know. It's

26:22

a good question. It depends. Yeah. But, well, actually, about those time series databases, that's a sort of interesting topic on its own, because at the beginning I was talking about telemetry correlation, right? And basically adding the trace ID to the metric, that's called an exemplar in OpenTelemetry naming, so that you would be able to connect that together, so you'd see a spike in the metric and see, oh, that's caused by those traces. Right. Now, the thing is that you have to be

26:55

aware of what's known as cardinality explosion. So I'll try to... Is that when a bomb goes off in the Vatican? Dude. Did I say that? I'm sorry. Those are cardinal explosions, not the same. Cardinality explosion. Cardinality explosion. I'll try to explain that, but usually I do this visually, so

27:17

okay, give me a bit to try. But think of the exemplar as basically a label, right, that you're adding to a metric, and that's going to give you some insight into the context in which that metric is being collected. So let's say that I have a metric called failure rate, because I want insight into that, and to have a little bit more background information, I want to know in which environment that happened: development, test,

27:44

production, whatever it is. And I also want to know which HTTP status code came back. Now that HTTP status code in a production environment is going to have, for the sake of the example, thirty possible values. And we have three different environments, so that's three possible values for that environment label. Now, the cardinality is basically all of the possible combinations of those values of every label that you add. That is the cardinality. So it's a multiplier. Exactly.
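
To make the multiplier concrete, here is a small Python sketch of the arithmetic. It only counts the worst-case number of label combinations; it is not how any particular time series database works internally. `series_cardinality` is a hypothetical helper name, the status-code values are placeholders standing in for thirty distinct codes, and the twenty thousand customer IDs are an assumed example of a high-cardinality label.

```python
from math import prod


def series_cardinality(label_values):
    """Upper bound on distinct series: the product of each label's value count."""
    return prod(len(values) for values in label_values.values())


labels = {
    "environment": ["development", "test", "production"],
    # Placeholder for thirty distinct HTTP status codes.
    "http_status_code": list(range(30)),
}
low = series_cardinality(labels)  # 3 * 30 combinations

# Adding one high-cardinality label multiplies everything that came before.
labels["customer_id"] = range(20_000)  # assumed: 20,000 distinct customers
high = series_cardinality(labels)  # 3 * 30 * 20,000 combinations

print(low, high)
```

With two small labels the metric has at most 90 series; one extra label with twenty thousand values pushes the bound to 1,800,000, which is the jump the conversation calls a cardinality explosion.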

28:18

So we're fine with the environment and then having the HTTP status code, and then I add customer ID. Oh boy. So why'd you do that? We're at like a hundred combinations, and then you threw in twenty thousand customers and ruined everything. Yeah. A hundred thousand customers, a million customers. Somebody

28:36

would never do such a thing. And that's how we then basically get cardinality explosion, because what happens in the time series database is that every time you have a sort of unique combination, a new series is created, so your cost goes up, right? And it becomes really hard to query that information. So it's also important to... One field, it's just one field. If you just don't do that, there won't be this problem. Yeah, well, it seems like a good idea

29:04

when you do it, right? Until cardinality... Exactly. So it's also important to be aware of, you know, what observability back end are you using and how does that work, because there are some tools out there that do support high cardinality. So it's just something that you have to be aware of. Yeah, and so you can tolerate it if you really come to the resolution you have

29:25

to do that one way or the other. I mean, customer ID doesn't seem that crazy, because it is useful if you've got a customer on the phone to say, hey, I could pull all the transactions, all of those streams for that customer, and sort of look at where they were having problems. Yep. That's like full production debuggability, right? Yeah, without a doubt. And with that, I've got to interrupt for one moment for this very important message. And we're back. It's .NET Rocks. I'm Richard

29:57

Campbell, that's Carl Franklin. Hey. Hey. Talking to our friend Laïla about OpenTelemetry. Hey, hey. And we've kind of gotten to that place now, right? Like, how do we visualize this? Because you're probably getting a lot of information. Like, what is the tooling that comes with it? Or do I have to write my own? What are the dashboards?

30:18

Well, that's hopefully where you choose a vendor, right? And then, you know, with the abilities of OpenTelemetry, by standardizing all of that information, I usually just say, like, try a bunch out. See what your requirements are. I'd imagine you can just hook these up to standard graph controls and things like that from your various vendors, if you want a dashboard. Yeah, yeah, yeah, there are just so many options,

30:45

and I feel like each of them has its own strengths. But yeah, for example, if you're running in the Azure stack and you're already using Application Insights, then, you know, I think it makes sense that you're using that as well. Yeah. But then, for example, I've played around with Honeycomb, and I think, especially, the collaboration that they've built into the tool is really cool. Is that an AWS product? No, no, it's its own company, Honeycomb. Okay. Big,

31:17

big taste in a big, big bite. Nice. That's the breakfast cereal. But okay, I was a kid of the... You know what I'm thinking? I'm thinking that there is a Honey-something product in AWS, a no-code product. Yeah, okay, different one. But yeah, Honeycomb is an instrumentation library. Yeah. And they're very invested in OpenTelemetry as well, and I think it's one of those tools that supports high cardinality. They talk about it a lot as well, right? Yeah, but

31:44

yeah, what stood out to me was really the sort of collaborative features that they had, because usually if you're looking at a huge problem, you're not doing that by yourself, right? Yeah. You know, and if you have a twenty-four seven on-call shift, you want to be able to hand over where you left

32:01

off, things like that. Yeah, I'm wondering about... So who is this? Typically the sysadmins are going to get these packages in the first place, that's where the errors first show up, where the problems appear, and then they might be passing it to development saying, hey, we're looking at

32:16

this and we think it's this kind of problem. Well, that's where I'm also expecting some evolution, because yes, usually now it would be a different team, and they would be looking at something that looks funny, yeah, right, and then get some understanding, hopefully some actionable insights, right, and then be able to bring that back to the development team. But it's really interesting to me in the sense that we're the developers. We are the ones going

32:49

to be writing that application-specific telemetry. So it's also really important to get organization-wide alignment as well on the type of telemetry that you're going to be collecting. So this seems like there's an infinite number of decisions to make here, right? I mean, just by using OpenTelemetry, that's just one step of many. Yeah. What are we going to be looking at, what's the granularity of it, how are we going to query it,

33:19

how are we going to look at it visually? Like, these are all things that aren't just in the box. You have to think them through. Yeah. Well, usually what I try to advise as well is to sort of document guidelines, like what are you looking for with your telemetry? What are the problems that you're trying to solve? And, like, come up with a specific set of questions that you could, as a developer, when you write a feature, ask yourself so that you could add

33:51

the telemetry that is going to answer those questions. Right, I mean, what if you don't know what you don't know? What if you don't know what you want to look for as a guidance? Are right to say? Yeah? Some of this sounds like business related decisions, like we sell widgets, and I want to know when a sale fails because of technology rather than the customer didn't want to buy it, right, yeah, definitely, or simple things like this page took too long to render or whatever. Yeah.

34:15

Yeah. And then from the sort of failure perspective, I usually try to look at the code and think to myself, if something were failing here, what would I be looking at if I were debugging this? Like, what state would be interesting to me? What variables would I be looking at? Have I captured the input enough? Have I captured what's going out to be able to understand and be able to then, you know,

34:42

go back and try to understand what happened from the outside. Right, there's an easy solution to all of this: you just put all your code in a try with an empty catch. No problem. Is that On Error Resume Next, sort of? Yeah, it's more like slash D to turn off all the debugs. I don't want to know. Just keep going, don't tell me about these guys. I mean, a lot of this we've talked about very proactively, like we're going to detect the errors before the customer does or before the customer complains.

35:09

Yeah. I think there's another dynamic where the customer is complaining and we're getting a ticket that's like, what error? I think it'd be very challenging to say, you've got this ticket, it's about this customer, it was roughly at this time, and now you want to go dig through the logs to say, can we see what this person complained about? What I'd want to do is to have the customer on the phone and enable this sort of thing, but just for them, just for their customer ID, and then just watch it as

35:37

they're going through the process where it fails, and now you've got something. But is that possible? Well, I think so, but it sort of depends

35:45

on what type of sampling strategies you're applying. And also, like, you know, the type of observability data that you're collecting, because let's say that you want full insight, so basically any error that occurs to any user, right, be able to say, oh, you know, since we're in Antwerp, that was Shalts and it was three pm on a Friday, and basically be able to find that request and look at what they were doing, because the thing is that users, when they open tickets or whatever it is, the

36:16

thing is that they weren't paying attention, they were doing their usual thing, right? They didn't sit down and say, let's cause an error, and I'm going to think about every step that I did so I can explain it to you later. What I usually get is "it doesn't work." Can you be more explicit? "It doesn't work." Right. So then being able to connect that information back and say, oh yeah, that was Shalts, right, and then find the, you know, traces, logs, metrics, whatever was connected to that

36:46

information. It's like being able to debug in production, really. Yeah. And then the thing you just described is like turning up a lot of data there too. Like, that's also, you know, we were also warning not to do that because you're getting buried in minutiae. Yeah, yep. That's where the sampling strategies

37:05

come in. Sure, and there are different ways to go about that. So if you think about traces specifically, you basically can choose between head and tail sampling. With head sampling, you're basically going to decide whether to sample the trace at the beginning. So let's say when the business transaction starts, right, you're going to immediately make the decision of I'm keeping this or I just don't care about it. Right. Usually that's the most unbiased type of sampling as

37:34

well. Quick order. Right. The thing is, what if something fails? You don't know that up front. What if this request was super slow? It's not something that you can know up front, so you could be losing a lot of insightful information. Sure, and that's where you get, you know, the tail-based approach, where you're basically going to collect everything, so the entire trace across all of the services that it goes through, and at the

38:05

end make the decision of, is this an interesting trace? Do I want to keep it? So, for example, does it carry some specific attributes that I care about? Or was it slow? Yeah. And that begs the question: can you turn these things on and off in production without restarting? Right.
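Editor's sketch of the head versus tail sampling distinction described above. This is a minimal illustration in plain Python, not the OpenTelemetry SDK or Collector API; the names `Span`, `head_sample`, `tail_sample`, the `slow_ms` threshold and the `interesting_keys` set are all assumptions made for the example.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class Span:
    """Hypothetical, simplified span record (not an OpenTelemetry type)."""
    name: str
    duration_ms: float
    error: bool = False
    attributes: dict = field(default_factory=dict)


def head_sample(trace_id: str, ratio: float) -> bool:
    """Head sampling: decide when the trace starts, knowing only its ID.

    The ID is hashed into [0, 1) so every service that sees the same
    trace ID reaches the same keep/drop decision.
    """
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < ratio


def tail_sample(spans, slow_ms=1000.0, interesting_keys=frozenset()):
    """Tail sampling: decide after the whole trace has been collected.

    Keep the trace if any span failed, if any span was slow, or if any
    span carries an attribute we said we care about.
    """
    if not spans:
        return False
    if any(s.error for s in spans):
        return True
    if max(s.duration_ms for s in spans) >= slow_ms:
        return True
    return any(key in s.attributes for s in spans for key in interesting_keys)


if __name__ == "__main__":
    boring = [Span("GET /cart", 40.0), Span("db.query", 12.0)]
    failed = [Span("POST /checkout", 90.0, error=True)]
    print(head_sample("trace-123", 0.10))  # decided before anything ran
    print(tail_sample(boring))             # False: nothing special, drop it
    print(tail_sample(failed))             # True: it generated an error
```

The trade-off matches the conversation: the head sampler is cheap but cannot know about the failure in `failed`, while the tail sampler needs every span collected before it can decide.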

38:19

So that's where it becomes important again how you deployed this, right? Because if you have a sort of direct export, then it's really tricky, but if you had it deployed as a sidecar, then it could be a thing of changing the configuration of the sidecar. Like, tail-based says, I'm going to assess the finished transactions, right? There's nothing special about this, throw it out. Right. Oh, this one had an unusual value, it took too long, it generated this error, so forth. I'm going

38:45

to keep this one. And so that way you're sort of culling as you complete. Exactly. That's pretty cool. Yeah, there's a cost that comes with that, sure, because basically you're collecting everything, so you've got that overhead on the workload. Although hopefully asynchronous to some degree. Actually, that asynchronicity brings up an interesting point: you wanted to have the customer on the phone while you

39:05

walk them through it, like that one scenario. How much latency do we have when we're doing all that processing separately? Like, how soon can we see data from when the transaction happens? Right, that's again another concern that you have to basically be focused on, right? Like, what is

39:21

what type of latency are you willing to accept? And then it becomes a thing of being very mindful of what type of processing you're doing there, because obviously all of that telemetry is going to have to go through that pipeline of processors, so you have to be very mindful of that. And then you have to get it out as soon as you can so that you can see it as quickly as you can and you can query as quickly as you can. But we're still talking seconds, aren't we, really? Yeah, I

39:49

hope even less. Milliseconds. That's been my usual experience with asynchronous telemetry. It's only slightly behind, but it's not holding the transaction while it finishes, you know. To me, the big sin here is: don't delay the customer. Get the working transaction done. All telemetry can happen later, even though that later is a few milliseconds, right? That's

40:12

why you offload that to another component again. Yeah, safer to work that way anyway. What you don't want is a transaction failing because you were measuring it. Exactly, exactly. That's dumb, that's a no-no. Yeah, so yeah, definitely quantum. I just got it, just like that, boom, it all just makes sense, even though I still don't understand quantum. I'll leave that to you and sip, okay, we'll work

40:36

We'll keep working on that problem. Yeah, that's definitely something to keep in mind: you know, instrumentation is, at the end of the day, nice to have. It's not a mission-critical component, so you have to be very mindful of how you do that. So what's one of the biggest mistakes that customers or people using OpenTelemetry make when they first start? The biggest mistake? Well, I turned everything on. Of course

41:00

you did. I do that too. I like all the knobs, turn on all the knobs. And then, yeah, another thing was just figuring out what do I really need? Right, so it's really understanding what your requirement is, because if you're saying I want to be able to debug any user request that happens in the system and have full visibility into that, as opposed to, for example, I walked into a project, it's massive, there's zero documentation, and all of the people that worked on it, they're gone, right, and

41:37

there's no instrumentation. Yeah, at that point, what you would like to have observability wise, it's just insight into how does this thing work? Yeah? Just follow a transaction, right, you know, have one end to end trace once and you'll have made a lot of progress. Yes, Now, the amount of telemetry that you need for those two things, it's completely the opposite, right, it's like one percent versus one hundred. So it's

42:02

definitely still for me as well a learning experience. I mean, this is still pretty new and we're basically just adjusting to see still how the project is evolving. Yeah, because I mean many parts of the specification are stable by now, but a lot of things are still evolving. Yeah. Another thing that keeps coming up is sort of the three pillars of observability being, you know, traces, metrics, and logs. And I wonder, what if a fourth one comes along? That might just happen. I can't think of a

42:32

fourth one, right, neither find us pretty well. Yeah, when when the only thing we had was logs, could we think of metrics and traces? I guess that's true. Yeah, we started dreaming about them. It's like I'm trying to I've had that experience of I have the log from this machine, the log from this machine, now try and line those entries up.

42:51

Would quantum computing introduce a fourth pillar? I think it would introduce two to the sixteenth pillars, or two to the thirty-second in parallel there. I also like this idea of knowing it failed before the customer does. You know, what I'm hoping to get to, this tail log

43:06

thing of, I see these numbers as insufficient in some way, it kicks up into a system where someone can look at it and perhaps even call a customer and say, hey, we noticed, can we help you? You know, it's not just that you're waiting for the people to complain or it has to be on fire. I have never gotten a message or an email like that from a company, and if that happened

43:28

to me, I would be, like, really impressed. So I'm on a website, for example, and it screws up, and then I immediately get an email that says, hey, we noticed you had this problem, it didn't work, here's a solution. Wow, that would be amazing. The only time I ever see that is, like, with a credit card when I'm traveling, yeah, where every so often I use the card and a minute or so later the phone rings and it's, hey, are you in Belgium? Right, yep, yep, I'm in Belgium. Okay, that's good then, thanks. And then you're like,

43:57

that's pretty cool. It turns out that was Bob from New Jersey. Just want to know. That happens to me all the time. I asked you about how service bus, or you know, service buses and NServiceBus play into this. Obviously, this is what you do for your job. How does using a messaging system figure into OpenTelemetry? Well, like I said earlier, right, building a message-based system is really nice. Well, you know, if you take into account the entire problem space.

44:30

But one of the things you can't get around, basically, is how hard it becomes to troubleshoot things in that type of system. Now, with the platform that we're building at Particular, we don't only have NServiceBus as a middleware framework, but we also have a bunch of tools. Now

44:45

ServiceInsight specifically is basically already that sort of black-box instrumentation. It's like you don't need to do anything, you just need to configure that you want basically those platform tools to be enabled, and it already gives you insight into all of the messages that are being sent around in the system, and you can see that in a production environment and understand which exact flow and where that came from and which message led to which message. So we've really been doing

45:20

observability for years. Yeah, that's the thing. Well, because of the asynchrony and out-of-orderness, like, you can't count on timestamps, you really need to have some kind of attribute flag to be able to know they're related. Yeah, those are things that we capture in the message headers, so we can basically know which message led to which other messages. So that's really

45:40

cool. But the thing that we could never solve, and that's the sort of gap that OpenTelemetry closes, is having that visibility system-wide, so connecting back to, you know, a front end or even a database. A web server that restarted in the middle of something, like those kinds of things, that's where you want the metrics to show, hey, this machine got into crisis and the supervisor killed the process and it recovered and continued, but it

46:08

kicked off all this weirdness, right? Like, you're looking at it, you're looking at the trace and going, what the heck happened here? Right? Is the program broken? Is it just a bug? And it's like, no, this is what recovery looks like. Yeah,

46:20

yeah, exactly is one of these machines. Some of that is also built into the platform, because we have insight into what the failures are and how many failures are happening. So we also have messaging-specific metrics already in there, and yeah, now we're working to basically make sure that it also connects and feeds into the OpenTelemetry signals so that if people are using it, that

46:46

they get all of that information in there as well. Wow, that's pretty substantially cool, especially when you start throwing cloud in here, where it's entirely possible the cloud vendor might move you and it might have impact on your software, like stuff you literally don't have control over. Like, hopefully your telemetry can surface that in a way where you're like, oh, this wasn't us. It's the vendor, they changed something for whatever reason, and we should absorb it.

47:12

But we don't know how yet because we've never had this happen before, and right now we have to look at it and say, what would we do differently? Yeah, yeah, because we have sort of recoverability built in, so retries and all of that happen out of the box. You don't even have to configure them. But yeah, then it's like, okay, why does it take ten tries for this message to be processed every time a message of that type comes in? Right, so you don't want to hide that

47:38

away. It could be that maybe there's a database suffering underneath, right, it retries too quickly and it takes that long for that thing to get up to speed or yeah, yeah, along one of the side effects. Yeah, that's also why we have delayed retries. So we'll basically also have this sort of back off mechanism. We'll retry immediately. But if we see exactly that's too yep, yeah, it's the same. It's the same concept. Yeah. Yeah. So what's next for you? What are you doing next?
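Editor's sketch of the retry behavior discussed above: retry immediately a few times, then fall back to delayed retries with a growing back-off, and report how many attempts it took so that cost is not hidden. This is plain Python, not the NServiceBus recoverability API; `process_with_retries` and all of its parameters and defaults are hypothetical.

```python
import time


def process_with_retries(handler, message, immediate_retries=3,
                         delayed_retries=2, base_delay_s=1.0,
                         sleep=time.sleep):
    """Run handler(message); return (result, attempts) on success.

    Phase 1: the first attempt plus `immediate_retries` with no waiting.
    Phase 2: up to `delayed_retries` more attempts, sleeping
             base_delay_s, 2 * base_delay_s, 4 * base_delay_s, ... first.
    The attempt count is returned so it can be recorded as telemetry.
    """
    attempts = 0

    # Phase 1: immediate retries, useful for very short-lived glitches.
    for _ in range(1 + immediate_retries):
        attempts += 1
        try:
            return handler(message), attempts
        except Exception:
            continue

    # Phase 2: delayed retries, giving a struggling dependency
    # (for example a database) time to get back up to speed.
    for round_number in range(1, delayed_retries + 1):
        sleep(base_delay_s * 2 ** (round_number - 1))
        attempts += 1
        try:
            return handler(message), attempts
        except Exception:
            continue

    raise RuntimeError(f"message failed after {attempts} attempts")


if __name__ == "__main__":
    calls = {"count": 0}

    def flaky_handler(message):
        # Fails four times, then succeeds (simulated slow dependency).
        calls["count"] += 1
        if calls["count"] <= 4:
            raise ConnectionError("database not ready")
        return f"processed {message}"

    result, attempts = process_with_retries(flaky_handler, "order-1",
                                            base_delay_s=0.01)
    print(result, "after", attempts, "attempts")
```

Returning the attempt count is the design choice that connects to the point made in the conversation: the retries succeed quietly, but a message type that always needs many attempts is something the telemetry should surface.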

48:07

What's in your inbox? What's in my inbox? Well, we're at Techorama today. Then I'm taking a few days with the family. It's a long weekend, is it? It's a long weekend, it's a holiday tomorrow, and then they're basically bridging to the weekend, and then on Sunday I'm leaving for Oslo for NDC. Very great. I will not be joining you there this year. So I heard. Yeah, not happy to hear that. Sorry, just schedules. Yeah. We love Oslo, we love NDC.

48:37

We're there all the time. Usually, I'm gonna have to miss it this year. Yeah, I'm really excited to go. It's my first time at Oslo's top sure to get into. Yeah, you know, and I always loved the way they did the show on the floor in the but I kind of tell you this Techorama's pretty close to the Yeah, and I have it homies. Yeah, yeah, all right, it's really cool. Everybody give it up for Laïla Bougriâ, and we'll see you next time on dot net

49:09

rocks. Dot net Rocks is brought to you by Franklins.Net and produced by PWOP Studios, a full service audio, video and post production facility located physically in New London, Connecticut, and of course in the cloud online at pwop dot com. Visit our website at d o t n e t r o c k s dot com for RSS feeds, downloads, mobile apps, comments, and access to the full archives going back to show number one, recorded in September two thousand and two.

50:02

And make sure you check out our sponsors. They keep us in business. Now go write some code. See you next time. You got a dead middle band

Transcript source: Provided by creator in RSS feed: download file