#88 - Observability Engineering - Liz Fong-Jones | Tech Lead Journal podcast

00:00

Hey, a quick message for those of you who are listening to this episode on Spotify. I have a small favor to ask. Spotify now allows mobile users to rate podcasts. I would really appreciate it if you could take a quick pause to go to the Tech Lead Journal podcast page and leave your favorite show your best rating on Spotify. It will help me a lot to get this podcast to reach more people on the platform. Thanks a lot. Observability is a technique for ensuring that you can

00:29

understand novel problems in your system. That's why it was a necessary transition for me to go from working on SRE to working on observability. So, when we think about defining observability, it's basically this question: can you understand what's happening in your system, and why, without having to push new code? And to do so very quickly, by slicing and dicing existing data that you already have, in terms of telemetry signals that are coming out of your system.

01:00

Hey everyone. My name is Henry Suryawirawan, and you're listening to the Tech Lead Journal podcast, the show where I'll be bringing you the greatest technical leaders, practitioners, and thought leaders in the industry to discuss their journey, ideas, and practices that we all can learn and apply to build a highly performing technical team and to make an impact in our personal work. So let's dive into our journal. Hello to all of you, my friends and listeners.

01:35

Welcome to the Tech Lead Journal podcast, the show where you can learn about technical leadership and excellence from my conversations with great thought leaders. I'm very happy today to present the 88th episode of the podcast. Thank you for tuning in and listening to this episode.

01:51

If this is your first time listening to Tech Lead Journal, subscribe and follow the show on your podcast app and on social media: LinkedIn, Twitter, and Instagram. If you are a regular listener and enjoy listening to the episodes, support me by subscribing as a patron at techleadjournal.dev/patron. My guest for today's episode is Liz Fong-Jones. Liz is the Principal Developer Advocate for SRE and observability at Honeycomb. And recently,

02:19

she just published a book titled Observability Engineering, which she co-authored with her colleagues Charity Majors and George Miranda. In this episode, Liz shared in depth about the concept of observability and why it is becoming an important practice in the industry nowadays.

02:36

She started by explaining the fundamentals of observability, how it differs from traditional monitoring, and how observability can help us run more reliable and stable production systems, including its relation to DevOps and SRE practices. She explained some interesting concepts, such as the core analysis loop, cardinality and dimensionality, and doing debugging from first

03:00

principles. In the later part of the conversation, Liz shared her view of the current state of observability, including the proliferation of vendor and open source tools, and how we engineers can improve our systems' observability by doing observability-driven development and improving our practices based on the proposed observability maturity model

03:22

found in the book. I really enjoyed my conversation with Liz, diving deep into the topic and understanding the different nuances of observability concepts. If you are interested in this topic, I would also highly recommend reading the Observability Engineering book, which Honeycomb has kindly provided as a free e-book on their website. Find the link provided in the show notes to get your free copy.

03:48

And if you enjoyed this episode and find it useful, share it with your friends and colleagues who may also benefit from listening to it. Leave a rating and review on your podcast app, and share your comments or feedback about this episode on social media. It is my ultimate mission to make this podcast available to more people, and I need your help to support me towards fulfilling my mission. Before we continue to the conversation, let's hear some words from our

04:14

sponsor. Today's episode is proudly sponsored by Skills Matter, the global community and events platform with more than 100,000 software professionals. Here, members can organize their learning experiences around the technology topics they care about most. You get on-demand access to their latest content and thought leadership insights, as well as the exciting schedule of tech events

04:38

running across all time zones. So whether DevOps or data science is your buzz, or you're a fan of functional programming or all things cloud, you can make real connections with people who share your interests. Head on over to skillsmatter.com to become part of the tech community that matters most to you. It's free to join, and you will find it easy to keep up with the latest tech trends. Hello everyone, welcome back to another new episode of the Tech Lead Journal podcast.

05:08

Today, I'm very excited to finally meet someone who I have been following for quite a while. Liz Fong-Jones is here with us. So today, we'll be talking a lot about SRE and observability, which are topics and trends that are up and coming these days. Liz is the Principal Developer Advocate at Honeycomb. She has been working in SRE

05:28

and observability for maybe 16 years, by her bio. Liz, it really is a pleasure to meet you today, and I'm looking forward to learning everything about SRE and observability today. We'll never really get to everything, but we can certainly do a good chunk of it. Hi. Thanks for having me on the show. So Liz, for people who may not know you yet, maybe you can help introduce yourself by telling us more about your highlights or turning points in

05:51

your career. Yeah, so I started working as a systems engineer in 2004. I've been doing all kinds of work on reliability and making systems better and easier to operate. That includes spending a number of years working at a game studio and a number of years working at Google. That's where I spent most of my career, as a site reliability engineer at Google. And then my career took a turn towards thinking about: how do I teach not just my team to work on better systems, but

06:22

how do all teams across the software industry do better? That's why I wanted to become a developer advocate, and I switched over to working at Honeycomb. So, yeah, I've been following you and all your SRE content. I started probably with the SRE content, mostly, and then it took a turn since you joined Honeycomb. And now we are talking a lot more about observability, and you have an upcoming book which is titled Observability

06:44

Engineering. So, this is something that you wrote with Charity Majors and one of your colleagues at Honeycomb. Yes, George Miranda. Observability is definitely one of the hottest topics these days, especially in the technology world. Maybe you can start by helping us to understand what observability actually is, because this term is commonly used these days. Let's contextualize this in terms of

07:08

both SRE and observability. Not all of your listeners will necessarily have heard of SRE either. So when we start with thinking about what SREs do, the goal of a site reliability engineer is to try to make systems easier to operate for the people that are working on them, whether it's a team of SREs that is on call, or whether it's the software engineering team that is kind of building, operating, and

07:32

running the system. One of those key considerations in terms of ensuring the system is highly available is thinking about: what are your service level objectives? What are your reliability goals for the system? On the flip side, how do we think about trying to achieve those goals? What happens if there is an incident? Does it take 5 minutes to fix, or does it take three hours to fix? To me,

07:56

that's where observability comes in, because observability is a technique for ensuring that you can understand novel problems in your system. That's why it was a necessary transition for me to go from working on SRE to working on observability. So, when we think about defining observability, it's basically this question: can you understand what's happening in your system, and why, without having to push new code?

08:21

And to do so very quickly, by slicing and dicing existing data that you already have, in terms of telemetry signals that are coming out of your system. So you mentioned a couple of key points when you described observability here. The first is without pushing new code. I imagine, last time when I used to be a software developer, whenever I wanted to troubleshoot or debug something, sometimes I introduced new code.

08:42

Yes, adding printf debug statements. All of us are guilty of it. So, tell us more about this "without pushing new code". How can we actually do this? Yeah, so actually the way that we think about observability, and kind of that first step of instrumentation, is that as you're developing your application, you should add instrumentation to help you understand what's happening inside of your code, so that you don't get caught out later. That doesn't mean you have to predict every single failure

09:09

that's going to happen in advance, but you kind of have to leave yourself the breadcrumbs that you would need to debug in the future. So this starts, in my view at least, with kind of having some form of tracing to know when each request started and stopped in each service, where that request came from, which other service called you in turn, and maybe some other properties, like what user is it. The more you enrich your traces with information that may or may not

09:37

seem relevant at the time, the better. If you can freely add that detail of which feature flags are on at that site, which user ID is it, which browser are they using, which language are they using, then it means that down the road you don't have to have pre-aggregated by any of those dimensions. You can kind of see whether any of those dimensions are relevant. So instrumentation, tracing, all these are being mentioned these

10:01

days. And people also normally categorize it with these three pillars of observability, called logs, metrics, and tracing. So tell us more: are these three the most important things that we need to implement for observability? I think that having high quality signals does matter. You're not going to get great observability unless the data is there. However, you can't just throw a bunch of data into a data

10:25

lake and say that you're done. So I think that's kind of why I push back against the three pillars narrative, because one or two really high quality signals are enough. You don't necessarily need kind of the voluminous debug logs if you have tracing that is representative of what your requests went through. Likewise, metrics can be useful in high-volume situations, but the problem with metrics is that they're

10:49

pre-aggregated. You can't re-slice the data once it's been pre-aggregated at the source, right? So I think each signal has some benefits and drawbacks that you need to evaluate, and you don't need to collect all three of them. In fact, there are new and emerging signals, like continuous profiling. So there are not really even three pillars, or lenses, or whatever you want to call them. There are many different telemetry types that we can think about utilizing.

11:13

As we kind of build this capability, it's really not a technical capability, but instead a sociotechnical capability. What can your engineers accomplish with the system? Not what are you measuring on a system, but can you actually analyze it? Can you actually get that result of "I can understand and figure out what's happening"? So one thing that is also interesting for me when I read and study about observability,

11:36

right? It mentions that the goal of observability is actually to provide a level of introspection or detail that helps you understand the internal state of the system. So the keyword here is internal state. Tell us more about this. How do you differentiate this from something that monitors external state? Yeah. So I think that when you're measuring external state, you are measuring things like: what's the CPU utilization?

12:03

What's the memory utilization? Those are things that are potentially useful, but they don't give you an idea of what the application was doing at the time that this given request executed. So I think that kind of detail into what the code was actually doing, and not what are the side

12:20

effects of the code, I think that's kind of what differentiates observing it from the inside out rather than from the outside in. The other thing is that people in the industry traditionally call it monitoring. So when we talked about system administrators, last time we used to talk about monitoring, and in the past, I don't know how many years, we started to shift from monitoring

12:42

to observability. So is there any significant difference between monitoring and observability, or is it just another term being coined to brand the same thing with a new name? To me, observability is a superset of the capabilities that monitoring would provide. So they are not at all synonymous, because with monitoring you're just trying to figure out when something has broken, but it won't necessarily answer why, which is why people

13:09

had considered for a long time: I use my monitors to tell me something has gone wrong, and I use my logs to figure out what happened. Guess what? In a complex distributed system, logging is not going to save you anymore. It takes forever to search through, and it doesn't give you causality of what causes what. So when we think about the relationship of monitoring and observability, I think this is why I mentioned service level objectives earlier.

13:32

To me, service level objectives are kind of monitoring 2.0. They're the answer to the "tell me when something has gone wrong" role that monitoring played before, whereas observability kind of maps more closely, to me, to replacing logging than it does to replacing the metrics you used for monitoring in the past. So what do I mean? Well, when you have your service level objective, you define what an acceptable level of success is. So for instance, at Honeycomb, we aim for our ingest pipeline to

14:01

be 99.99 percent reliable. That means no more than 1 in 10,000 packets of incoming telemetry are dropped. But if we start seeing an elevated error rate, if, say, we're allowed to drop 1 million packets per month and we start dropping 100,000 per hour, we're going to exhaust that error budget for the month in 10 hours. To us, that's an emergency. We'll go and fix it right away. But suppose we have like a thousand bad requests that are just being dropped per hour.

14:33

That would take a thousand hours to exhaust the budget. That's not an emergency. So it kind of differentiates between expected levels of errors and unexpected levels of errors, and is not susceptible to the same problem that people have had with monitoring in the past of: oh my God, it's 2 a.m., 1 out of 1 requests failed, that's a 100 percent error rate, wake up everyone. No one likes to be paged at 2 a.m. for something that flaps

14:56

and goes right away. So like peanut butter and jelly, they go better together, and together they tend to provide a superset of the kind of monitoring and logging that people used to do in the past. So one thing, when I read your book in preparation for this conversation, is that you mentioned traditional monitoring is more reactive. You mentioned getting alerted, and that you only see side effects of certain things happening, while observability is more investigative.
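As an editorial aside, the error budget arithmetic from the conversation can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names and the 24-hour paging threshold are hypothetical and are not described as Honeycomb's actual alerting logic. The budget and drop rates are the figures quoted above.

```python
MONTHLY_BUDGET = 1_000_000  # dropped events allowed per month (figure from the conversation)


def hours_to_exhaustion(budget_events: int, dropped_per_hour: float) -> float:
    """Hours until the budget is used up if drops continue at a steady rate."""
    if dropped_per_hour <= 0:
        return float("inf")  # nothing is being dropped, so the budget never runs out
    return budget_events / dropped_per_hour


def should_page(hours_left: float, threshold_hours: float = 24.0) -> bool:
    """Hypothetical policy: page a human only if the budget would run out soon."""
    return hours_left <= threshold_hours


fast_burn = hours_to_exhaustion(MONTHLY_BUDGET, 100_000)  # 10.0 hours
slow_burn = hours_to_exhaustion(MONTHLY_BUDGET, 1_000)    # 1000.0 hours

print(fast_burn, should_page(fast_burn))
print(slow_burn, should_page(slow_burn))
```

The point of alerting on time-to-exhaustion rather than on a raw error percentage is the one made above: a single failed request at 2 a.m. barely moves the budget, while a sustained high drop rate does.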

15:21

So tell us more about this investigative manner. When an incident happens, what does observability actually give you? Yeah, so observability really helps you form and test hypotheses really quickly, because when I am trying to debug an incident, my error budget alert has gone off. The number one thing I start asking is: what's special about the requests that are failing? Instead of looking at this wall of graphs to figure out which lines wiggled at the same time,

15:51

what observability gives me is this power to have causality: the power to see, for the requests that failed, which properties were most associated with them. Maybe it's one customer, or maybe

16:02

it's one build, right? Or maybe it's a combination of those two things. And the really neat thing is the way we've implemented this: it's a feature called BubbleUp, where you can draw a box around the anomaly and we'll tell you what's special about that anomaly. You're telling us what the anomaly is; we're not trying to guess based off of two sigma or whatever. We're just telling you, here's the difference between your control and experiment.

16:24

The cool thing is that from there, you can basically go in and refine. You can filter only to this population of events and then repeat the process, or group by this field, or group by a combination of fields, to kind of confirm that hypothesis. So instead of saying, oh my God, how do I set up this query? How do I write it in this obscure query language? Oh God, I hope I got it right, because it's going to take two minutes to run over all these logs.

16:48

Instead, you get that feedback instantly. So going back to our SLOs, 99.5% of Honeycomb queries complete within 10 seconds. There is not a cost to messing up. You can feel free to just try an experiment to analyze and slice your data in a particular way. You'll get results within 10 seconds. If it doesn't work out, try another query. If it does work out, you now have the clue

17:09

you need to run your next query. And then ultimately, at the end of the day, you can visualize this either as aggregated metrics, like what's the distribution of latencies, or you can go and look at the raw data and see it kind of in a tabular, log-like format, with each of the fields and the relevant values attached, or you can look at it as a trace waterfall. I think that kind of allows you to slice and visualize the data any way that you need in order

17:34

to understand where the slowness or failure is coming from. So it starts with the forming of a hypothesis, narrowing down the data, and then looking at the data to confirm your hypothesis. Yeah. So if I understand correctly, the one that you described just now is actually what you call the core analysis loop. So when an incident happens, this is the sequence of events that is going to happen. So you look at a bunch of data that you load initially, then you figure out,

17:56

okay, this looks interesting. Maybe you use BubbleUp in Honeycomb, right? Where you maybe narrow down the search results. And then from there you correlate, and then you start again, and then you iterate until you actually find the root cause. Do I understand correctly? Yeah, maybe not necessarily the root cause, but like a proximate set of triggers that tipped the system over the edge. There is usually no one root cause in a complex system. The other thing I want to point

18:20

out is that the thinking about what a core analysis loop is, that's work from long before I came to Honeycomb. So when I was working at Google, I started to formulate this hypothesis about the core analysis loop and making it faster, and then I realized that Honeycomb seemed to be the kind of manifestation of: how do we bring this technology not just to Google employees, but to the rest of the world? Nice, nice.
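As an editorial aside, one step of the core analysis loop described above (select the anomalous requests, then ask which attributes are over-represented among them) can be sketched in Python. This is a simplified illustration only: it is not Honeycomb's BubbleUp implementation, and the function name, the event fields, and the sample data are all hypothetical.

```python
from collections import Counter


def compare_populations(events, is_anomalous, ignore=()):
    """Rank (key, value) pairs by how differently they appear in the
    selected (anomalous) events versus the remaining baseline events."""
    selected = [e for e in events if is_anomalous(e)]
    baseline = [e for e in events if not is_anomalous(e)]

    def share(group):
        # Fraction of events in the group that carry each (key, value) pair.
        counts = Counter(
            (k, v) for e in group for k, v in e.items() if k not in ignore
        )
        return {kv: n / len(group) for kv, n in counts.items()} if group else {}

    sel, base = share(selected), share(baseline)
    diffs = {kv: sel.get(kv, 0.0) - base.get(kv, 0.0) for kv in set(sel) | set(base)}
    return sorted(diffs.items(), key=lambda item: abs(item[1]), reverse=True)


# Hypothetical request events; every slow request comes from build v42.
events = [
    {"customer": "acme", "build": "v41", "duration_ms": 90},
    {"customer": "acme", "build": "v42", "duration_ms": 2100},
    {"customer": "globex", "build": "v42", "duration_ms": 1900},
    {"customer": "globex", "build": "v41", "duration_ms": 110},
    {"customer": "initech", "build": "v41", "duration_ms": 95},
    {"customer": "initech", "build": "v42", "duration_ms": 2300},
]

ranking = compare_populations(
    events,
    is_anomalous=lambda e: e["duration_ms"] > 1000,  # the "box" drawn around the anomaly
    ignore=("duration_ms",),  # skip the field used to make the selection
)
for (key, value), diff in ranking[:2]:
    print(key, value, diff)
```

In this made-up data set the build field separates the two populations completely while the customer field does not, so the next iteration of the loop would filter or group by build to confirm the hypothesis.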

18:42

So one thing that you mentioned when we do this core analysis loop is actually slicing and dicing data, right? So I think one of the commonly mentioned requirements of observability is high cardinality and dimensionality. Without this, it is very difficult for you to slice and dice the data, because there's just nothing interesting that you can narrow down. So tell us more: what do you mean by cardinality and also dimensionality, and why do they have to be high enough? Yeah.

19:10

So earlier I referred to this idea that when you're instrumenting your code, you should feel free to add as many attributes, add as many key-value pairs as you like, to really explain as you go along what's going on in your code, to kind of leave these breadcrumbs. It's almost like adding tests or writing comments, right? I think it's just a sensible way of leaving your future self this annotation that says: what is this code doing? What was

19:34

this code thinking? The reason why people have hesitated to do that in the past, at least with kind of traditional monitoring systems, is that monitoring systems bill you by the number of distinct key-value pairs that occur on each

19:47

metric that you're reporting. So that's caused people to either emit these key-value pairs to logs, where they get lost, or to not record them at all. So when we think about high dimensionality, what we're encouraging you to do is to sprinkle these annotations throughout your code and to kind of encode as many keys as you like. But there's a problem. The problem

20:08

is that even if you reduce the number of keys in a metrics-based system, it turns out that if you have a lot of distinct values, like user ID, right? Let's suppose you have millions of users. It turns out that your metrics system has to create a time series for each distinct user and track it forever, even if that user has only appeared once. So there is this amortization that a metrics system expects: that your keys and values will get reused

20:36

often. And therefore there is a high upfront cost to having a new key and value. Therefore metrics systems penalize you for having a high-cardinality dimension. So to sum this up, dimensionality is about the number of distinct keys, and cardinality is about the number of distinct values per key. The ideal system that supports observability should allow basically unlimited cardinality and a very, very high

21:00

amount of dimensionality. So, for people who are trying to understand this concept of cardinality and dimensionality, let me just repeat what Liz just said. So cardinality refers to the number of unique values that you're storing in your metrics system, while dimensionality refers to the number of unique keys that you're sending to your metrics system. Another thing that I learned about observability is something that you mentioned in the book.
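As an editorial aside, the two definitions above (dimensionality as the number of distinct keys, cardinality as the number of distinct values per key) and the time-series cost Liz describes can be illustrated with a short Python sketch. The event fields and helper names are hypothetical, and the series count is a simplified model of how a tag-based metrics system grows, not a description of any particular product.

```python
from collections import defaultdict


def dimensionality(events):
    """Number of distinct keys seen across all events."""
    return len({key for event in events for key in event})


def cardinality(events):
    """Number of distinct values seen for each key."""
    values = defaultdict(set)
    for event in events:
        for key, value in event.items():
            values[key].add(value)
    return {key: len(seen) for key, seen in values.items()}


def series_count(events, tag_keys):
    """Simplified model: a metrics system keeps one time series per distinct
    combination of tag values, so this counts those combinations."""
    return len({tuple(event.get(k) for k in tag_keys) for event in events})


# Hypothetical wide events: 1,000 users spread over 2 regions.
events = [
    {"user_id": f"u{i}", "region": ("us", "eu")[i % 2], "status": 200 if i % 10 else 500}
    for i in range(1000)
]

print(dimensionality(events))                       # 3 keys
print(cardinality(events))                          # user_id is the high-cardinality key
print(series_count(events, ["region"]))             # 2 series
print(series_count(events, ["region", "user_id"]))  # 1000 series
```

Tagging by region alone needs two series, but adding user_id as a tag multiplies that to one series per user, which is the upfront cost that pushes people to drop such fields in metrics-based systems.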

21:24

I mentioned in the beginning that I used to go to the server and make code changes to do debugging. But in your book, you mention that if you use observability, you're actually doing debugging from first principles. Tell us more about this concept, because I'm still trying to understand this part. Yes. So we talked earlier about the core analysis loop, which kind of sets us up well to think

21:44

about this. When we have a really well-functioning core analysis loop, where you're able to rapidly form hypotheses and test them, you no longer need to make these kind of leaps of intuition to land magically on the right hypothesis. Instead, you can test a lot of different hypotheses and kind of narrow down your search space and eliminate red herrings and dead ends. Whereas in the previous ways of debugging,

22:14

it used to be that you would kind of have this one person who knew the system really well in their head, who could immediately jump to knowing what the right answer was: oh, I saw this two months ago, it's this, right? That's why I think first-principles debugging is much better, because it enables anyone on your team who has learned how to do first-principles debugging to step into any unfamiliar situation and to figure it out in a predictable amount of time. Sure, your expert who

22:43

has seen this system, who's been in the company for 15 years, might be able to solve the issue in two minutes. But my goal is to make it so that the worst case scenario, of someone who's inexperienced but knows first-principles debugging, should take 10 minutes or 30 minutes at most, rather than three hours, five hours, or 24 hours. Or worse, what happens if that person who's been at the company for 15 years leaves the company or retires? That happens these days.

23:09

So we talk a lot at Honeycomb about this idea of bringing everyone on your team to the level of the best debugger, so you don't need these magical flashes of intuition, and instead everyone has kind of that baseline ability. This is really personal to me, because I started at Google in 2008 and I left Google in 2018. In those 11 years, I had been on 12 different teams at Google. I was changing teams on average slightly faster than once per

23:37

year. I'd stay two years on one team, but I'd also stay just six months on another team. I never got that amount of time on any one team to really be the deep expert on a system, but I was still a really great and valued team member, because I had gotten really good at first-principles debugging, at understanding what

23:57

software is available in the Google tool stack to understand any unfamiliar kind of service. Because Google had this kind of standardization, every service at Google used the same tracing, the same metrics system, the same logging system, and therefore it was possible for me to walk into a completely unfamiliar situation and figure it out in half an hour or two. That was kind of surprising to people, because if you spent like two or three years on a system, but not 10 years,

24:24

it might take you several hours to figure out what's going on. You don't quite have that expert-level knowledge of the system, but you also haven't grown that muscle yet, developed that muscle of how to walk into any unfamiliar situation. Whereas for me, I was always somewhere between day one on a new team and day 180 on a new team. I would have to figure things out on the fly for myself.

24:46

That was my survival. So when you mention first principles, for those who are not familiar with this term, what do you mean by first principles? That's the first thing. And tell us more about the skills or maybe techniques that you categorize as first-

25:00

principles debugging techniques. Yeah. So in the field of engineering as a whole, when we talk about going back to first principles, what we mean is: throw out everything you know. Let's start from understanding the system from the laws of physics or the laws of mathematics, instead of trying to read the manual to see what this machine says it does. The machine is ultimately made up of a bunch of levers and wires and so forth. So you can kind of trace

25:27

everything back to understand: okay, what does this lever do? Not what does the manual say this lever does, or what does this person remember that the lever does. You go and look at the machinery that's connected to the back of this lever to understand what it does. I think that's the definition of first principles to me: throw out the book and use the laws of the system,

25:47

the laws of reality, to understand it. When this comes to computer systems, what this fundamentally meant to me, initially during my time at Google and then during my time at Honeycomb, is that I start by looking at the flow of execution of one request. I try to understand: what does a normal request look like, and what does an abnormal request look like? Where do they differ? What services are they passing through? Where are they spending their time? This is kind of where I find this

26:14

idea of exemplars very useful, meaning example traces that exemplify both the slow path and the fast path, and then just bringing them up on my screen and comparing them to try to understand why one is fast and one is slow. That will bring to mind hypotheses, my ideas about why these two things might be different, and then I can start testing them. So that's what I mean when I talk about first-principles debugging. It's not knowing magically which graph

26:39

or which dashboard to look at, but instead spending a little bit of time puzzling out what's the difference between these two things, based off of what I can observe about them from the outside. Thanks for that explanation. So we've been talking a lot about the techniques, the internals, what observability is. Tell us more about why observability has now become very hot. Is it because it's just a trend, with so many tools available, or is it actually solving a real problem?
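As an editorial aside, the exemplar comparison Liz describes (put a fast trace and a slow trace side by side and see where they differ) can be sketched in Python. This is a deliberately minimal illustration: real traces are trees of spans, while here each trace is flattened to a hypothetical mapping from span name to duration in milliseconds, and the span names and numbers are invented.

```python
def compare_exemplars(fast, slow):
    """Return (span name, extra milliseconds in the slow trace), largest first.
    Spans missing from one trace are treated as taking 0 ms there."""
    names = set(fast) | set(slow)
    deltas = {name: slow.get(name, 0) - fast.get(name, 0) for name in names}
    return sorted(deltas.items(), key=lambda item: item[1], reverse=True)


# Hypothetical exemplar traces for the same endpoint.
fast_trace = {"gateway": 5, "auth": 12, "db.query": 30, "render": 8}
slow_trace = {"gateway": 5, "auth": 13, "db.query": 940, "render": 9, "cache.miss": 40}

differences = compare_exemplars(fast_trace, slow_trace)
for name, extra_ms in differences[:2]:
    print(name, extra_ms)
```

The output does not explain why the database span is slow; it only points to where the two requests diverge, which is the starting hypothesis to test in the next query.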

27:05

I think the answer, unfortunately, is yes to both, and it is a non-overlapping set. So in terms of the definition of observability I gave earlier, of kind of understanding complex systems from first principles and being able to debug novel problems, the reason why we need this today is because we are building cloud-native microservices, where you can no longer use a traditional logging system that collects logs from one host where your entire query runs, or

27:36

where you can run an APM tool on the entire host to understand what the slow paths are on just this one host. That doesn't work anymore. If you have microservices, a request needs to flow through services maintained by more than one team. At that point, you kind of now have these squishy boundaries, where no one person holds that information about what's going on in their head.

27:56

So I think the motivation is that the complexity of systems has meant that we can no longer debug systems based off of the known unknowns, the metrics that we thought to create in advance, or kind of these magical flashes of insight by the expert, in that there is no longer an expert on every aspect of your system. You might be an expert on only like one or two aspects of your system, not how all the pieces fit together. That's the motivation.

28:20

That's why observability is crucial and needed. The problem is that with observability, as you mentioned with the three pillars earlier, people who are already selling you solutions to one or two or maybe even all three of these so-called pillars are trying to persuade you that what they do satisfies this requirement of being able to understand your complex systems. So basically, it's "three pillars of observability: you can have

28:45

observability if you just buy our logging, tracing, and metrics solutions." I think that's why we've seen so much marketing buzz and noise about it: everyone in the industry has started calling what they do observability, even if it isn't. It's hilarious. You see all these companies doing, you know, data observability or, like, source code observability, and wait, does this word mean anything anymore?

29:09

So I think, basically, going through the history, Honeycomb first started using the word observability in 2017, and it actually predated us by a little bit. I think the Twitter observability

29:18

team predates us by one or two years in terms of using the systems control word "observability" to spread this idea of understanding systems. And you'll see this explosion of people starting to call it observability from 2019 onwards, really, such that there's kind of this plethora of companies that say, "oh, we do observability too," and it's like, really? Okay.

29:37

Well, let's see. Can you actually understand your systems altogether, or is this just a rebranding of your existing monitoring and logging solutions? Yeah, sometimes this is also my confusion. So when you see some brands or some products call themselves observability, most of them are like, maybe, white labeling, so to speak, right? You can actually see logs, metrics, and traces. Sometimes they are not integrated, in fact, right?

30:00

So you'll see three different features and three different things, and you have to correlate using human intuition. Yeah, and what's worse is they're making you pay to store that data three to four times, right? Like, they're selling you three different SKUs. This should be one set of data, so that there aren't any mismatches, you can jump freely between one and another, and it doesn't cost you an arm and a leg. Yeah, the cost, I agree, because it can cost

30:22

you a lot when you send a lot of data. I don't know about your personal experience when working with Honeycomb or when you're meeting developers out there, but in my experience, even though we have microservices and cloud native, there are still people who, believe it or not, do not implement any tracing, do not implement any metrics, or have very few logs implemented in their systems. Tell us more: why is this still happening?

30:46

Even though we see these trends about observability and all that? And the reason is that it works until it doesn't. System complexity is a very sneaky thing. You think you can understand everything in your head, and that you're that magical person who's able to debug everything within 5 minutes, until you get the thing that stumps you, that takes 3 hours, or until you go on vacation and someone is calling you because they can't figure out the system that you

31:12

built. So I think the reason why people may not necessarily be investing in observability when they should be is that the cost of inadequate observability sneaks up on you. It's a form of technical debt to have a lack of observability, and we know how good the industry is about paying down technical debt. So I think that's what it is. I think the other unfortunate thing is, like, it feels really good to be the person who

31:37

magically debugged the problem. It feels like job security; it feels like accomplishing something. If you're the most senior engineer on the team, you know everything about the system, and you're the one making decisions about the system, you may not necessarily care quite as much as the engineer who's new to the team and is struggling to understand how the system fits together. Now, if people who have listened to this episode want to start implementing observability, how can they do it?

32:02

Is it, like, go and buy or subscribe to the solutions that are available out there, or install open source? Maybe tell us more about how you should start. I think there's kind of the foundational understanding of what observability is and isn't, and you mentioned our book, Observability Engineering. We can drop a link in the show notes to a way to download a free copy of the book.

32:25

So I would start by reading that book, or at least reading the first couple of chapters, just to make sure that you have the language to explain to your team why you want them to change their practices. You know, you can have the best tools in the world, right? This is what I discovered at Google. We've had really good tracing at Google since 2000, maybe. No one used it. No one used it because it was hard to use and no one could see why they needed to use it.

32:50

There was no easy linkage to people's existing debugging workflows. So adding a new tool doesn't necessarily solve things unless people understand the motivation and understand how they are going to use it. Once you have that buy-in, the next step is to add the OpenTelemetry SDK to your application. So what is OpenTelemetry? OpenTelemetry is a vendor-neutral, open source SDK that allows you to generate this telemetry data, whether it's metrics formatted, log formatted, or trace

33:21

formatted. The idea is that it is a common language for being able to produce this data, to transmit it around, and to propagate that state and context between different microservices. So when you add OpenTelemetry to your application, you're not locking yourself into any particular solution. You're making an investment in your future, in the same way that adding a new test framework to your application is an investment in your future.
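
To make the idea of propagating context between services concrete, here is a minimal sketch in plain Python. It is not the OpenTelemetry API: the function names (`new_trace_context`, `inject`, `extract`, `start_child_span`) are hypothetical and chosen only for illustration. The one real-world detail it follows is the W3C Trace Context `traceparent` header layout (version, trace id, parent span id, flags), which is the kind of context that gets passed from one service to the next.

```python
import secrets


def new_trace_context():
    """Start a new trace: a 16-byte trace id and an 8-byte span id, hex-encoded."""
    return {"trace_id": secrets.token_hex(16), "span_id": secrets.token_hex(8)}


def inject(context, headers):
    """Write the caller's context into outgoing headers in traceparent form."""
    headers["traceparent"] = f"00-{context['trace_id']}-{context['span_id']}-01"
    return headers


def extract(headers):
    """Read the caller's context back out of incoming headers."""
    _version, trace_id, parent_span_id, _flags = headers["traceparent"].split("-")
    return {"trace_id": trace_id, "parent_span_id": parent_span_id}


def start_child_span(incoming):
    """Continue the caller's trace: same trace id, fresh span id."""
    return {
        "trace_id": incoming["trace_id"],
        "parent_span_id": incoming["parent_span_id"],
        "span_id": secrets.token_hex(8),
    }


# Service A starts a trace and calls service B with the header attached.
caller = new_trace_context()
outgoing_headers = inject(caller, {})

# Service B picks the context up and records its own span under the same trace.
child = start_child_span(extract(outgoing_headers))
```

Because both services share one trace id and the child records its parent's span id, a backend can stitch the two spans into a single trace. In practice an SDK does this for you; the sketch only shows what is being carried.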

33:46

You still might need to add some tests, or you might need to add some manual instrumentation, but OpenTelemetry in general does a good job of handling that automatic generation of trace data in your application, based on the frameworks that you're using, so you can have this rich data about what requests are going in and out, along with their relationships. Then you'll have to pick a place to send that data. And there are a wide number of

34:10

options. You can certainly use open source solutions like Jaeger. The trouble with the open source solutions is that they are great for visualizing individual traces, but they're not necessarily going to provide a comprehensive replacement for your existing monitoring workflows, because people often need to understand what the system as a whole looks like on average, and then zoom in to a trace and dig further into the details of that trace.

34:36

At that point, you might want to look at vendor solutions, and, you know, certainly Honeycomb is out there. I also think very highly of Lightstep. But basically any backend that supports OpenTelemetry is going to be a place that you can send that data to. And the bonus with OpenTelemetry is that it's vendor-neutral and it supports teeing the data, so you can send it to more than one provider at the same time and see which one you like, which I think is really great. I think competition is better

35:01

for the market. It's great for everyone, and it really incentivizes us vendors to do the right thing. I think that's why I recommend OpenTelemetry, because it kind of handles a lot of what would otherwise be rote work: creating trace spans, propagating them, and passing them around. It doesn't lock you in; it gives you that freedom of choice. So thanks for the tip of opting for OpenTelemetry. OpenTelemetry sounds like the new standard.

35:26

Previously, it was called OpenTracing. It's the merger of OpenCensus and OpenTracing. Yeah, we had two standards, and what we've actually done is we've managed to deprecate both OpenCensus and OpenTracing. We're not doing that XKCD comic thing of "now there are 23 standards." All right. If you choose vendor solutions, normally they would charge you by the amount of data being sent to their systems. So tell us a bit more about the practical choices here. How do you assess solutions, first

35:53

of all, if, let's say, we want to go with vendor SaaS-based products? Yeah, I think that you have to think about the cost-benefit analysis: how much is it going to cost you, and what are the benefits that you get? You may not necessarily want to go with the lowest-cost vendor, because the lowest-cost vendor might provide only a very primitive ability to analyze the data.

36:11

I think a lot of the return on investment from observability comes from saving your engineers' time, improving your customer outcomes, and decreasing customer churn. That is far more important than the cost of the solution. For instance, Jaeger is free, but how much time is it going to save you in debugging, really? So I think there's kind of this continuum, and it's important to focus on what your evaluation criteria are and to specify them up front.

36:37

We want to be able to understand issues within half an hour, or we want to be able to measure our service level indicators in the first place, to even understand where we're going wrong. So I think that's kind of one dimension, and I think what goes along with that is: how intuitive is it? Is everyone on my team adopting it? So after we choose all of this, OpenTelemetry and which vendor solution, I think at the end of the day the developers themselves need to instrument the code.

37:02

That's correct. The auto-instrumentation can only get you so far. Auto-instrumentation is never going to capture things from POST bodies, because that's how you get leaks, right? So you have to pick and choose which attributes you want to send along, and the answer is anything non-sensitive, basically. But the auto-instrumentation is not going to be able to do that automatically. Yeah. So auto-instrumentation can do that automatically.

37:25

Up to a certain level, but I think at the end of the day developers need to make a conscious decision to instrument the code. Right, like, everybody would laugh if someone said, "I can automatically create your tests for you," or "I can automatically comment your code for you." It's like, no, you can't. Yeah, which brings us to the technique that you mention in the book. We have heard this term a lot of times: test-driven development, right?
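
The point about picking and choosing attributes can be sketched in a few lines of plain Python. Everything here is an assumption made for illustration: the `span` helper, the `FINISHED_SPANS` list standing in for an exporter, the `checkout` function, and the attribute names are all hypothetical, not part of OpenTelemetry or any other library. The sketch shows the manual step that automatic instrumentation cannot take on its own: deciding which non-sensitive fields from a request are worth attaching.

```python
import time
from contextlib import contextmanager

FINISHED_SPANS = []  # hypothetical in-memory sink standing in for an exporter


@contextmanager
def span(name, attributes=None):
    """Record one unit of work with its key-value attributes and duration."""
    record = {"name": name, "attributes": dict(attributes or {})}
    start = time.monotonic()
    try:
        yield record["attributes"]
    except Exception as exc:
        # Keep the failure visible in the telemetry, then let it propagate.
        record["attributes"]["error"] = type(exc).__name__
        raise
    finally:
        record["duration_ms"] = (time.monotonic() - start) * 1000
        FINISHED_SPANS.append(record)


def checkout(request_body):
    # The route is the kind of thing automatic instrumentation could fill in.
    with span("checkout", {"http.route": "/checkout"}) as attrs:
        # Manual step: attach only the non-sensitive fields we chose.
        attrs["user.id"] = request_body["user_id"]
        attrs["cart.item_count"] = len(request_body["items"])
        total = sum(item["price_cents"] for item in request_body["items"])
        attrs["cart.total_cents"] = total
        # The card number stays out of the telemetry on purpose.
        return total


checkout({
    "user_id": "u-42",
    "card_number": "4111111111111111",
    "items": [{"price_cents": 1200}, {"price_cents": 350}],
})
```

The recorded span carries the user id, item count, and total, which is enough to slice by later, while the raw body never leaves the function.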

37:46

So you coined this observability-driven development, and shifting left for observability. So tell us more about this technique, this concept. Why is it important? Yeah, I think it's super important to understand what this code is going to look like in production. So the best way is to add that telemetry, to add those key-value pairs, into your code as you're developing it. That way, you can test it inside your test suites, inside of your dev box.

38:13

You can emit that telemetry data to a tracing tool and see what those traces look like. Hey, you might discover why your tests are failing not by running a debugger, but instead by looking at this telemetry data. So the earlier you adopt observability in your development lifecycle, the more your developers will use it at all stages of development, not just in production, and the sooner you'll catch bugs. So the argument is that it's really synergistic with

38:41

test-driven development, because you're also exercising your observability code during your tests. Yeah. So if I imagine being a developer again, as and when I try to develop a feature, I have to think about what kind of things I want to be able to observe, or maybe look at in my observability tools, so that when an issue happens I actually have that data, and then I implement or instrument that in my code.
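
As a rough illustration of exercising observability code during tests, the sketch below asserts on the telemetry a function emits as well as on its return value. It assumes a hypothetical `RecordingTracer` that keeps spans in a list (a stand-in for whatever in-memory exporter a real SDK provides) and a made-up `apply_discount` feature; neither comes from the conversation or from any library.

```python
class RecordingTracer:
    """Hypothetical stand-in for an in-memory exporter: keeps spans in a list."""

    def __init__(self):
        self.spans = []

    def record(self, name, **attributes):
        self.spans.append({"name": name, **attributes})


KNOWN_CODES = {"WELCOME10": 10}  # discount code -> percent off (made-up data)


def apply_discount(tracer, price_cents, code):
    """Apply a discount code and emit the facts we would want when debugging."""
    percent = KNOWN_CODES.get(code, 0)
    final_cents = price_cents * (100 - percent) // 100
    tracer.record(
        "apply_discount",
        code=code,
        code_valid=code in KNOWN_CODES,
        percent=percent,
        final_cents=final_cents,
    )
    return final_cents


def test_unknown_code_is_visible_in_telemetry():
    tracer = RecordingTracer()
    # Behavior: an unknown code leaves the price unchanged.
    assert apply_discount(tracer, 2000, "BOGUS") == 2000
    # Observability: the span explains why no discount was applied.
    (span,) = tracer.spans
    assert span["name"] == "apply_discount"
    assert span["code_valid"] is False
    assert span["final_cents"] == 2000


test_unknown_code_is_visible_in_telemetry()
```

If the `code_valid` attribute were ever dropped during a refactor, this test would fail on a developer's machine rather than during an incident, which is the shift-left idea in miniature.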

39:05

I think that's really key. Another thing that you mentioned, if you want developers to make a more conscious decision about adding instrumentation, is putting them on call, hands-on with their own system. So tell us why this is actually very important. Yeah. So I think that it is important for developers to have some stake in the production running of their application.

39:24

However, that does not necessarily have to take the form of on-call for every engineer. But I think at the team level it's important for teams to build and run their own software. In practice, a lot of teams struggle with that, because you just say, you know, here's the pager, congratulations, as opposed to: we're going to support you with high-quality observability, with SLOs, with these methodologies for how you run a service. Now that we've trained you,

39:48

here's the pager. I think that if a lot of developers understood the pain that ops go through, and the amount of hard-won lessons that we formed from years of being subjected to [unclear] and kind of almost the masochism that requires, they might have a little bit more empathy, but it would take time. You shouldn't throw someone who's unprepared directly into that mess. Clean it up a little bit before you say, hey, here's the pager, by the way.

40:11

So I think that kind of handoff process can be really helpful for encouraging ownership, but ownership and responsibility need to be accompanied by giving people the tools that they need to be successful. You wouldn't assign a brand new developer to design the architecture of your future system. So why are you asking people who have never operated a system before to go swim in the deep end?

40:33

I think you kind of have to offer this graceful path, and I think that observability really is this methodology for kind of bridging the ops and dev worlds. The reason why people did monitoring before was that monitoring is outside-in observation, or outside-in measurement, because often operators didn't really necessarily have the ability to make code changes to the system, so they were limited only to: what can my APM agent measure, or what can my monitoring tool

40:59

see? So I think that when you have this shared responsibility of developing the telemetry and observing and looking at the telemetry, that bridges the two worlds together, because you're having the instrumentation being added in this virtuous cycle along with looking at the results of the instrumentation. So you mentioned all this build-and-run-it; it comes back, normally, to the DevOps culture and SRE culture, right? So does observability also

41:25

support the implementation of that particular DevOps and SRE culture? Yes, I think so. When you originally saw some of my videos on SRE and DevOps, I highlighted that it is a key responsibility of SREs to make the system debuggable, to make the system have SLOs, and to monitor those SLOs. At Google, we certainly didn't use the word observability until 2019, but what we were doing was probably closer to observability than monitoring, because

41:54

we were doing this practice of [unclear] analysis and being able to understand the systems. So for people who have been implementing this, how do they know they have done it right, or where they are lacking? I see in your book you mention this term called the observability maturity model. Is there a way to gauge where people are at this point in time, and what else they can aspire to achieve?

42:15

Yeah. I think that there are these kind of key five areas in that observability maturity model of where observability is or isn't helping you. We talked about observability-driven development and kind of getting the code right the first time and exercising that instrumentation earlier. That's kind of one area of maturity. The other area of maturity that we actually didn't touch on yet is software delivery. For me, observability has a "you must be this tall to ride" bar, that is:

42:40

You can ship code every couple of weeks at least. Because if not, does it matter that you decrease your time to resolve issues from three hours to five minutes, if you can only ship code every three months, or if you can only ship new instrumentation every three months? No, it doesn't make sense. Focus on making your delivery cycle faster. So, kind of, if you are struggling with your code and being able to ship it, start by observing your delivery pipeline, not your code.

43:04

So that's kind of another maturity area: do we understand why our builds are suddenly starting to take longer? In our case, 15 minutes is where we're like, oh my God, this is too slow. So we try to keep our builds under 10 minutes so we can deploy to production once an hour or faster. So keeping that software delivery cycle snappy, or getting it to be snappy, is kind of another key area. And then third, we talked a lot about the break-fix and

43:27

production resilience workflow. But I think it's also really important to think about user analytics: to what extent are people making use of your product? If no one is using it, does it really matter that you shipped that feature? And then finally, I think, to tie it all together, managing technical debt is really important, and observability can surface, hey, you have this single point of failure, or hey, you have this circular dependency.

43:49

So I think that those are some of the key responsibilities that you have when you're kind of thinking about how to add observability to your software delivery practices. Thanks for explaining this maturity model. I think you explain it in much more depth in the book as well, for people who want to understand which areas they need to invest time in. So make sure to check the book; we'll put it in the show notes later.

44:10

So Liz, I find myself having a crash course about observability. Thank you so much for explaining all these concepts and all this implementation. Really, thank you for that. But unfortunately, due to time, we need to wrap up soon. Before I let you go, I normally ask one last question of every guest that I have on the show, which is to share three technical leadership wisdoms for people to learn from your experience or from your journey.

44:33

So maybe you can share with us, what is your wisdom? Sure. I'd like to keep it brief and just limit it to one piece of wisdom, which is that it is much, much more important to make the social piece work than the technical piece. It's really important to get people talking to each other. Software delivery is much more about people communicating, being on the same page, and kind of having a shared working model than it is about what tools you use. So don't think first about

45:02

introducing new tools. Think first about aligning people on the outcome and making sure everyone is agreed on the outcome, and then you can worry about what tools you're going to use to implement it. Wow, that's really wonderful. So social first, rather than the technology aspect, and try to align people. Because, yeah, I can see people implementing different technologies, even when you talk about microservices, right?

45:24

It's like everyone just wants to have their own services managed by themselves, and they don't really care whether it integrates well and serves the customers well. So, thank you so much for this, and for your time. For people who want to follow you or learn more about observability or Honeycomb, where can they reach you online? Yeah, I am lizthegrey pretty much everywhere. So that's on Twitter, that's on GitHub. That's how to look me up. And we'll also drop those links

45:49

into the show notes. Why is it "the grey", if I may ask? Teenage fangirling about Gandalf from Lord of the Rings. All right, so thank you, Liz. Hope you have a great day today. Thank you. Cheers. Thank you for listening to this episode and for staying right until the end. If you enjoyed it, I would appreciate it if you shared it with your friends and colleagues who you think would also benefit from

46:14

listening to this episode. And if you are new to the podcast, make sure to subscribe and leave me your valuable review and feedback. It helps me a lot to grow this podcast. You can also find the full show notes of this conversation on the episode page at the techleadjournal.dev website, including the full transcript, interesting quotes, and links to the resources mentioned in the

46:36

conversation. And lastly, make sure to subscribe to the show's mailing list on techleadjournal.dev to get notified of any future episodes. Stay tuned for the next Tech Lead Journal episode. And until then, goodbye.

Transcript source: Provided by creator in RSS feed