#96 - Practical Guide to Implementing SRE and SLOs - Alex Hidalgo | Tech Lead Journal podcast

00:00

DevTernity is the top international software development conference, with an emphasis on coding, architecture, and tech leadership skills. The lineup for this year is truly stellar and features many legends in software development, names such as Robert "Uncle Bob" Martin, Kent Beck, Scott Hanselman, Venkat Subramaniam, Kevlin Henney, Allen Holub, Mary Poppendieck, and many other prominent names, including some of those who have also appeared on this podcast before.

00:29

The conference takes place online, so you can enjoy it from the comfort of your couch. We spoke to the DevTernity organizers, and I'm happy to share that Tech Lead Journal has got a 10% discount code for you. Enter the promo code AWSM_TLJ when you purchase the ticket on devternity.com. Here's the promo code one more time: AWSM_TLJ. Depending on when you purchase a ticket, the early price may still be available. See you there!

01:01

And that's a totally reasonable way to get started. It's all about just, again, embracing those service truths that we talked about at first, right? Reliability is the most important thing. Your users define your reliability, not you, so make sure you're measuring the right thing. And 100% is out of the question, so pick the right target. You can embrace those truths without real-time monitoring and advanced statistics and all the stuff that comes along with it.

01:24

Just get started, even if it's in a spreadsheet, even if it's only just once a month. Hey everyone, my name is Henry Suryawan, and you're listening to the Tech Lead Journal podcast, the show where I'll be bringing you the greatest technical leaders, practitioners, and thought leaders in the industry to discuss their journeys, ideas, and practices that we all can learn and apply to build a high-performing technical team and to make an impact in your personal work.

01:58

So let's dive into our journey. Hello to all of you, my friends and my listeners. Welcome to the Tech Lead Journal podcast, the show where you can learn about technical leadership and excellence from my conversations with great thought leaders out there. Today is episode number 96. Thank you for tuning in and

02:21

listening to this episode. If this is your first time listening to Tech Lead Journal, make sure to subscribe and follow the show on your podcast app and on social media: LinkedIn, Twitter, and Instagram. And for those of you who enjoy this podcast and want to contribute to the creation of future episodes, support me by subscribing as a patron at techleadjournal.dev/patron. Implementing SRE concepts and best practices can be

02:47

daunting. Although Google has released a few SRE books, including the famous Site Reliability Engineering book, many of us still have some gaps in terms of really understanding the essence of the concepts and practices, such as service level indicators or SLIs, service level objectives or SLOs, and error budgets.

03:07

And on top of that, how can we start building a good SRE culture and avoid some common pitfalls, especially when communicating the benefits of this set of practices, for example, to the business or stakeholders? Also, do tools matter in implementing SRE? How reliable should our service be, and how should we measure it? These are some of the common questions that I know

03:29

people usually ask when introduced to the SRE concept. And if you have the same questions and thoughts in your mind, then today's episode is definitely for you. My guest for today's episode is Alex Hidalgo. Alex is the Principal Reliability Advocate at Nobl9 and the author of the book Implementing Service Level Objectives. Alex previously worked at Google as a site reliability engineer and also a customer reliability engineer, and also contributed multiple chapters to The Site Reliability

04:01

Workbook. In this episode, we discuss a practical guide on how to implement SRE practices and service level objectives, or SLOs. Alex started by explaining the basic concepts of service and reliability, and the three service truths. He then explained the concept of the reliability stack, which includes the famous SRE concepts: SLIs, SLOs, and error budgets. Alex then shared his insights on how we can define a service reliability target,

04:29

why a higher reliability target is expensive, and the risk of a service being too reliable. Towards the end, Alex shared his tips on how we can start building an SRE culture and how we can use the error budget as a communication tool within the organization. I very much enjoyed my conversation with Alex, and even though I have been learning the SRE concepts for quite some time now, there are still a number of insights that I learned from

04:55

Alex in this episode. And if you also find this episode useful, please share it with your friends and colleagues who can also benefit from listening to this episode. Leave a rating and review on your podcast app and share your comments or feedback about this episode on social media. It is my ultimate mission to make this podcast available to more people, and I need your help to support me towards fulfilling my mission. Before we continue to the conversation,

05:21

let's hear some words from our sponsor. Today's episode is proudly sponsored by Skills Matter, the global community and events platform with more than 1000 software professionals. Here, members can organize their learning experiences around the technology topics they care about most. You get on-demand access to their latest content and thought leadership insights, as well as the exciting schedule of tech events running across all time zones.

05:49

So whether DevOps or data science is your buzz, or you're a fan of functional programming or all things cloud, you can make real connections with people who share your interests. Head on over to skillsmatter.com to become part of the tech community that matters most to you. It's free to join, and you will find it easy to keep up with the latest tech trends. Are you looking for a new cool

06:12

swag? Tech Lead Journal now offers you some swag that you can purchase online. These swags are printed on demand based on your preference and will be delivered safely to you all over the world where shipping is available. Check out all the cool swags available by visiting techleadjournal.dev/shop. Oh, and don't forget to brag about it once you receive any of those swags.

06:38

Hi everybody. Welcome back to the Tech Lead Journal podcast. Today I have a guest with me named Alex Hidalgo. He's the Principal Reliability Advocate at Nobl9, and he's the author of a book titled Implementing Service Level Objectives, and he contributed to another book, which is part of the SRE books from Google, titled The Site Reliability Workbook. So as you can tell, today we are going to talk a lot about SRE

07:04

and SLOs and things like that. So if you are wondering about these practices, SLOs, how to set them right, and things like that, we're going to cover it today. So Alex, really, thank you so much for your time today. Looking forward to this conversation. Thanks so much for having me. So Alex, I like to actually start by asking my guest to tell his or her story, any career turning points, any highlights. So maybe you can share yours. Sure. My route to where I am today has been an interesting one.

07:33

I grew up as a computer nerd. My dad helped teach me how to program when I was eight or nine years old. I was on the internet before the web existed, back when it was all text. So through most of my education, I always assumed I wanted to work with computers. So I didn't go to college after high school. I went and got a job in the tech industry, but that first job was doing network security work for the government of the US, and I hated it.

08:02

I was pretty good at it. I actually got a promotion within a year, but it made me miserable. And so I thought I didn't want to work with computers. I thought computers were just a hobby for me. So I quit my job and realized it was still time for me to go back to school. So I started college a few years after most people. I ended up studying philosophy and history, and I took a bunch of creative

08:26

writing courses. All the while, computers were still a hobby, but I had kind of decided computers weren't for me as a career. Then I decided to move to New York at the height of the 2008 recession. So right when the global economy, and especially here in the US, things are just absolutely tanking, suddenly I'm in one of the most expensive cities in the world and I can't find a job.

08:48

My money's running out, whatever I had of savings at the time, because in my 20s I worked in the service industry. I was a server, I was a cook, I worked in a warehouse. My money was starting to run out. I kind of figured I can probably still do this computer thing, even if it's just for now, even if it's just to make some money and get back on my feet. The very first tech job I applied for, which was an IT desktop support kind of position, I was hired right away. I actually had so much fun.

09:18

It was a small company, only about 10 people. We were the IT department for companies that didn't have their own. So every day I'd travel all over New York City, every borough, all over the place, just helping people with whatever they needed help with, and I really loved it. I loved that human part, interacting with other people and helping them learn how to use their computers. Fast

09:39

forward a few years: I'm working for this company called Admeld, and Admeld is purchased by Google, and suddenly I'm at Google as a site reliability engineer. I didn't even know what that meant at first. But as I learned more and more about it, it spoke to me so much. It was really what I had always been meant to do. Every bit of my personality, the things I truly love, it was all kind of reflected in it: the human aspects, thinking about users, putting humans

10:09

first, blameless culture for incident response and incident retrospectives, just everything about it. I absolutely loved it. So I fell into the role really well. Fast forward a few more years: my last role at Google before I left was on the Customer Reliability Engineering team, or CRE. The CRE team was a group of fairly veteran SREs. We were tasked with teaching Google's largest cloud customers how to SRE: how can people make their services more reliable?

10:40

We realized that the biggest thing that we needed, because the customers we were engaging with were not other tech companies, they were not subdivisions of Google, right? They were retailers, and they were industrial manufacturing companies, and they were all over the place. So we realized we needed a common vernacular, we needed a

10:59

shared language. And what we decided was that SLOs, service level objectives, would be that shared language. So I spent a good year or so traveling all over and teaching all sorts of people what SLOs are, why they're so great, and how to use them, because those were really the building blocks that the CRE team wanted in order for us to engage with people in the best possible way. I had some fun doing that, but it was my time to move on from Google. I'd been there for a long time.

11:26

Everyone needs a change at some point, so I moved over to Squarespace. When I started at Squarespace, I was asked, "Hey, we want to do SLOs. You know how to do SLOs. Can you help us?" And I was like, sure, no problem. I set this up with my manager: I was going to spend like 60% of my time on team work and about 40% of my time teaching the entire organization, including my own team, how to do SLOs. I didn't realize how much work

11:54

that really was. When you're getting started from scratch, when you are starting from just the absolute bottom, people barely even understand what all the different terms mean. It's a lot of work. You need to build the right tooling. Chances are your monitoring systems don't even do SLO math in the first place. You need to do education and workshops. You need to build document repositories. It was a lot of work. We didn't define our first SLO at Squarespace until about six months

12:24

after I started. After about a year and a half of that, I was running a workshop every Friday afternoon for four hours, from noon until 4 p.m. We had a break, but still, it was a four-hour workshop every single Friday. And people really liked them. They were popular, but I was just tired of saying the same thing over and over again. So at some point, I was complaining to a friend of mine

12:49

there, like a good co-worker. And I was just like, I wish there was a book about this, an entire book, not just a chapter in the SRE workbook, but a whole book, because that way I could point people at it instead of just doing workshops over and over again. My friend said, "Well, you should write it." And I said, "No, an expert should write it." And he said, "You are the expert." And I cursed. Like, I straight up just said curse words, because I knew he was right.

13:15

I just didn't realize it until he had said it. I suddenly knew I was writing a book, and I'd always heard how difficult that is. A few months later, I was working on a book for O'Reilly, Implementing Service Level Objectives. That directly led me to where I am today. I'm now at Nobl9, which is a startup based entirely around how to do service level objectives, how to measure them.

13:37

We do all the tooling for you, et cetera, et cetera. So I came to fall in love with SLOs at Google because I realized they made my life better and they made my users' lives better. They made humans happier, and that's always been the most important thing to me. I've been very lucky to have been able to focus primarily on SLOs for the last six, almost seven years of my career now. It's a great place to be. Thank you so much for sharing your story.

14:02

It is a very beautiful story, a lot of ups and downs, including how you got to find your passion. And hopefully today you are not tired of speaking about SLOs one more time. So at least today, we're going to cover some of the basics. Thank you so much for sharing this story. So Alex, you wrote this book, Implementing SLOs. There are a number of books, not many, but a number of books about SRE, SLOs, and all that. But I still find that people find it difficult to grasp the

14:28

concept. What do you think are some of the challenges that people face in understanding these practices? So I think some of the biggest problems with understanding both what SRE is, site reliability engineering, as well as what SLOs, service level objectives, are, is that no one's really ever fully defined it. There is, of course, the first Google SRE book. But you know what? It's like 30-something chapters, and not a single team at Google actually does all those things.

14:57

Anyway, they're best practices. And sure, there's the original definition from Ben Treynor Sloss, basically the inventor of site reliability engineering. He said SRE is what happens when you ask software engineers to solve operational problems. That's great. That's also very vague. And then, what is the difference between SRE and DevOps? Is a DevOps engineer even a thing, or is DevOps an approach? And then marketing teams got a hold of it, right?

15:28

Suddenly the SRE book was selling really well. So now suddenly, oh, we're going to have SRE companies, we're going to have tooling that is SRE tooling. There isn't one definition, and I would actually say it doesn't really matter, as long as our end goals are kind of the same. And I see the same with SLOs. I have my own definitions.

15:46

I do think there are true kinds of definitions that everyone can agree upon, but you know, in the first and second Google SRE books, they don't even define what an SLI is in the same way. A service level indicator, which we can talk more about, of course, is a very important part of how to do SLOs. The two different books don't even

16:05

agree. So I think part of the confusion is just that there often aren't single resources to point people at. We don't have degrees for this. If you're purely a software developer, you can get a degree in computer science. Certain algorithms have certain names and very strict definitions. If you're writing code in a certain language, you have to adhere to the syntax of that language, for certain languages very strictly. Now we can very

16:31

clearly say this is a Java program versus this is a Python program. But when it comes to more philosophical things, like site reliability engineering, like service level objectives, I think one of the reasons people sometimes struggle is because it can be more difficult to start from scratch.

16:48

Like my story about Squarespace, where I realized, even though I knew how to do these things, I had this 1,200-person organization that I had to teach from the ground up, because there aren't these strictly defined resources that people can just learn from. I think what you said speaks truth to my experience as well. It's really hard to read those SRE books. By the way, they're very dense. Sometimes they could be dry, sometimes they could be Google

17:11

related. And there are not many experts available out there that can be, let's say, certified SREs to actually tell you this is how the better practice should be. And many tool vendors just came out and probably, like what you said, somehow polluted the term. They maybe defined their own definitions and things like that. So I think that's one of the challenges. Today, let's try to also discuss these concepts from the basics, since you have a lot of experience.

17:36

You wrote this book, Implementing SLOs. But first of all, what is the definition of service and reliability? Because some people actually use these terms interchangeably, with so many different variations. Maybe let's start from there. Yeah, absolutely. Service is actually not difficult to define, because you probably already know it: it's what the word service just means.

17:58

One of my favorite examples is that so much of my experience, and so much of why I love SLOs, is because I used to be a server, like a server in a restaurant. I provided a service for people, which was to take their orders and bring them food. A computer server is not very much different. It is a thing that takes your requests and responds to them correctly. It's very similar in

18:24

concept. That's what a computer service is. A computer service is something that listens for a request from something and responds to it appropriately. That, I think, is the best way to think about it, instead of trying to define exactly what it means at a technological level, because I don't think that's important. To some teams,

18:44

it is important to think of their service as a pod running in Kubernetes, or a series of pods, or a Docker container somewhere, or a binary running on a virtual machine. Or a piece of hardware, even networking gear: those are services as well, and those don't fit any of those previous definitions. For some people, the service they care about is the retail website you go to, a place to buy a pair of socks.

19:13

Some people at that company just care about that entire website, even though it might be composed of individual tiny microservices. A service can really be defined as anything that does something for someone else. I don't think we have to say, okay, that's a service because it is a set of pods deployed via a single deployment on Kubernetes, or whatever, and this isn't a service because it's a user journey. No, no, no. A service, you can think of it holistically, you can think of it philosophically.

19:43

It's something that does something for someone else, and that leads us right into what I think the correct definition of reliability is, because people often conflate it with availability, but they're very different things, because a service can be available and not be reliable. Reliability is an old term. Reliability engineering goes back to the 1940s. It's not a concept unique to SRE or Google or tech or computers. What reliability really means is: is the system performing how it was defined to perform?

20:14

For computer services, that basically means: is it doing what it needs to be doing? If we are cool with the concept of a service being a thing that does something for someone else,

20:25

then reliability is: is that thing doing the thing it's supposed to be doing? The reason I always, always try to steer people away from just thinking about it as being all about availability is because you can be very available and still be doing a bad job. Take that example of the retail website where you just need to buy a pair of socks.

20:42

Well, maybe you can log into the website and it's very quick, and you can search and you get 10,000 results for socks, and they have every color and every size and everything you ever wanted, but then when you go to check out, you can't. That's not being reliable, even if that service is being available to you at the time. So a service is anything that's doing something for something else, and reliability means: are you doing that well enough? Thanks for explaining it.

21:07

This is very, very simple. I think everyone here could understand that, really, as part of the foundation, right? When you understand what a service is and what reliability is, the next few things will become easier. But before we go into more SRE, reliability, and all that, I saw one section in your book where you mention service truths. I think this is an interesting concept for me. Would you be able to share what you mean by these service truths, and what are

21:32

those? Sure. So I personally believe that there are three things that are true about any service. One is that reliability is its most important feature. If your service is not being reliable, it's not doing much. As we just said, if we're defining reliability as "are you doing what you're supposed to be doing?", well, then it kind of follows: if you're not being reliable, you're not doing what you're supposed to be doing.

21:56

So you should always be thinking about reliability first. The second truth is that you don't get to decide what your reliability is. I don't care what your measurements say, I don't care what your logs say or what your metrics say, I don't care if you have a million healthy pods all reporting up. If your customers, if your users, anyone that depends on you, thinks you're being unreliable, you are. If you're not meeting the needs of your users, you're

22:22

not being reliable. So you need to take their perspective into account. And then the third service truth is that nothing is ever 100%, so don't aim for it. This is just a truth of the world: outside of pure mathematical constructs, nothing's ever 100%. Things fail. Failures occur. It turns out that people are actually fine with failure, as long as failure doesn't happen too often. The third truth is, just don't aim for 100 percent, because that's a fool's errand.

22:50

So instead, pick a more reasonable target. Understand what reliable means. Understand that someone else determines what reliable means. Understand that reliability is the most important part of your service. And then make sure you're only trying to be reliable enough, like an achievable amount, something that works for both you and the people that depend on you. As you can tell, these are the fundamental understandings.

23:13

So let me try to reiterate. The first is that reliability is the most important attribute or characteristic of your system. No matter how many features or functional requirements your system has, if it doesn't perform reliably, that's probably not a good service. The other one is that you don't define your reliability; your users do it on your behalf. So when they use your system, they will tell you if your system is reliable or not. And the last one: nothing is 100%.

23:39

Don't ever try to achieve that, because people are okay with failures as long as it's not failing frequently. Let's go to the more deep-dive definitions of SLIs, SLOs, and all that. So you have this concept of the reliability stack. Maybe let's start from there. What is in the stack? How do you define them, and how do they work with each other? Sure. So you have three primary components of what I call the

24:03

reliability stack. This is what people really often refer to when they say SLOs. When they use the term SLOs, they really mean a few different things. First is SLIs, or service level indicators. Service level indicators are measurements. They are bits of telemetry about your system that tell you: is it doing what it's supposed to be doing?

24:24

And again, this should be from your users' perspective, as close as you can get at least. Next you have SLOs, or service level objectives. Service level objectives are just targets for how often you want your SLI to be true. So if your SLI is able to tell you, yes, we're currently meeting the expectations of our users, or is able to tell you, oh, we're currently not meeting

24:45

the expectations of our users, you're now able to form a ratio: good events over total events equals a ratio, equals some kind of percentage. Your SLO is just a target for what you want that percentage to be. So you can pick something more reasonable. We just talked about how a hundred percent is impossible. No one ever hits that anyway. So you can say something like, well, going back to the sock-buying scenario, we measure the user journey, right?

25:10

If you can't check out when you try to buy socks, that's not good, but it's okay if that only happens one in 100 times. Because if you try to check out and it doesn't work, what are you going to do? Probably you're going to click again, and as long as it works the next time, fine. So maybe you only have to make sure that sock checkout works 99% of the time. It lets you pick a target that's realistic, that accommodates failure.
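Editor's sketch (not from the episode): the "good events over total events" ratio described here can be written out in a few lines of Python. The `CheckoutEvent` type, the function names, and the sample numbers are all hypothetical; only the 99% target comes from the sock-checkout example.

```python
from dataclasses import dataclass


@dataclass
class CheckoutEvent:
    """One checkout attempt as the user experienced it (hypothetical shape)."""
    succeeded: bool


def sli_ratio(events: list[CheckoutEvent]) -> float:
    """Return good events / total events; an empty window counts as fully good."""
    if not events:
        return 1.0
    good = sum(1 for event in events if event.succeeded)
    return good / len(events)


def meets_slo(events: list[CheckoutEvent], target: float = 0.99) -> bool:
    """The SLO is just a target for the SLI ratio."""
    return sli_ratio(events) >= target


# Made-up sample: 1,000 checkout attempts, the first 8 of which failed.
events = [CheckoutEvent(succeeded=(i >= 8)) for i in range(1000)]
print(f"SLI: {sli_ratio(events):.3%}")
print("SLO met:", meets_slo(events))
```

In practice the events would come from telemetry measured as close to the user as possible, not from a hard-coded list.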

25:34

That makes sure you're not spending too much money trying to aim for something you can't achieve. And then finally, at the top of the reliability stack, you have error budgets. Error budgets are just a way of thinking about how your SLO has performed over a period of time. An error budget often takes into account a time window that's often fairly large, generally anywhere from a week to 28 days to 30 days, sometimes even a quarter. The idea is your error budget is that other side of the

26:03

percentage. So if you say sock checkout should work 99 percent of the time, what you're also saying is that 1% of the time sock checkout is allowed to fail. Your error budget is that 1%. Your error budget measures: are we failing more often, or the right amount? And so your error budget lets you think about things over periods of time. Over the last 30 days, what percentage of the time have we failed? And is that helping us meet our SLO targets?

26:32

Are we exceeding it? Your error budget is worded that way because a budget is something you can spend, and an error budget is exactly that. We're allowing ourselves this one percent, or whatever you've defined for your own service, of course. But in our continuing example, you have this one percent of checkouts that are allowed to not work. What amount of that have you spent over time? And that then helps you make better decisions about where you need to focus your attention.
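Editor's sketch (not from the episode): one way to track how much of that 1% has been spent over a window. The `BudgetStatus` type, the `error_budget` function, and the traffic numbers are assumptions for illustration; a real setup would count events over a rolling window such as the 28 or 30 days mentioned above.

```python
from dataclasses import dataclass


@dataclass
class BudgetStatus:
    allowed_failures: float  # failures the SLO tolerates in this window
    consumed: float          # fraction of the budget already spent (1.0 = all of it)
    exhausted: bool          # True once failures exceed what the SLO allows


def error_budget(total_events: int, failed_events: int, slo_target: float) -> BudgetStatus:
    """Compare observed failures with the failures the SLO target permits."""
    if not 0.0 < slo_target < 1.0:
        # A 100% target leaves no budget at all, so it is rejected here.
        raise ValueError("slo_target must be strictly between 0 and 1")
    allowed = total_events * (1.0 - slo_target)
    consumed = failed_events / allowed if allowed > 0 else 0.0
    return BudgetStatus(allowed, consumed, failed_events > allowed)


# Made-up window: 200,000 checkouts in the last 30 days, 1,500 of them failed.
status = error_budget(total_events=200_000, failed_events=1_500, slo_target=0.99)
print(f"Allowed failures: {status.allowed_failures:.0f}")
print(f"Budget consumed:  {status.consumed:.0%}")
print(f"Budget exhausted: {status.exhausted}")
```

With these sample numbers about three quarters of the budget is spent, which is the kind of signal that can inform where to focus attention next.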

26:59

Thanks for mentioning this budget term, because it is consciously defined. It's not something that someone just takes, but it's actually something that can be spent. Errors are expected sometimes, but they also can be spent. So you mentioned defining reliability. Many people get stuck even here: just how reliable should your service be? You said it's not 100%. Some business people say it should be 100%. How can we actually define the reliability of our service?

27:26

How should we go about it? What's the approach? Unfortunately, this is one of those times where I invoke my very-senior-engineer-who's-been-doing-this-too-long card and I say: it depends. It's unique. It depends on your service. The important thing is to be meaningful and to be thoughtful. The important thing is to think about: what is my service? Who are my users? Real quick tangent: I say user a lot, but I don't just mean customers. I mean anyone that relies on your service. It might be

27:58

another team down the hallway. It might be internal people at your company. It might be another service. Your service might be six layers removed from an actual human, but that other service is still a user of your service. So that's why I say user a lot, because it can mean any of those things. There isn't a single answer. Be thoughtful about it, take your time. Think: what does my service do? What is it supposed to do? What do the users expect? Can I go talk to those users?

28:24

If I can, let's go talk to them. Let's ask them directly. Is there a product management team? Do they have a user journeys document? Maybe we should go look at what the defined user journeys are. Why was this software written in the first place? Why was it spun up at this company? What does it do? It's not the most satisfying answer, because I don't have, like, an answer. I don't have a special formula

28:47

people can use to instantly understand what it is that they need to be thinking about when they pick how reliable they need to be. But what I can say is, in every single case, if you're careful and you're thoughtful, that will lead you to the best answers. One thing that I really love about these SRE concepts and SLOs and all that is that they actually always put users in the first place. It's not some random technical decision: okay,

29:12

this is how the reliability should be. It actually always comes from the users' perspective. Like you mentioned in the beginning, 100% is not possible. Even things that rely on the internet, by default,

29:23

they are not reliable, because your packets could be lost or you need to retry. Another thing that's very important to understand about reliability, maybe you can help to explain, is that as you go higher, if you aspire to go higher, it becomes more difficult, complex, and also expensive. Tell us more why this is the case. Sure. I mean, if you want to ensure that you are being more reliable, you need to ensure you're having fewer failures. Individual components fail more often.

29:54

So if you add more and more, you're just going to have more and more failures, right? You have something that works 99% of the time, and you have just that one thing. But if you need to be redundant, if you want to make sure that if this thing fails, thing A, then you also now need to have thing B in place, because thing B needs to take over in case thing A fails. That's all well and good, but thing B might also only be 99 percent reliable.

30:18

So now you need something to determine whether or not thing A or thing B is currently being reliable, and where to send that traffic. So now you have thing C. We can go on and on, but, you know, you just end up having to build more and more complex systems to ensure that you are, in fact, being more and more reliable over time. This is sometimes totally reasonable. For some systems and some services, you do need to hit very high targets. This means you need to be distributed.

30:45

You need to ensure you're built for high availability, quick failovers, and redundancy.
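Editor's sketch (not from the episode): a back-of-the-envelope model of the thing A / thing B / thing C example. It assumes component failures are independent, which real systems rarely satisfy, and the `series` and `parallel` helpers are hypothetical names, not part of any library.

```python
from math import prod


def series(*availabilities: float) -> float:
    """Every component must work: the availabilities multiply."""
    return prod(availabilities)


def parallel(*availabilities: float) -> float:
    """Redundant copies: the group fails only if every copy fails."""
    return 1.0 - prod(1.0 - a for a in availabilities)


thing_a = thing_b = thing_c = 0.99

redundant_pair = parallel(thing_a, thing_b)      # A backed up by B
with_failover = series(thing_c, redundant_pair)  # C decides where traffic goes

print(f"A alone:           {thing_a:.4%}")
print(f"A with B:          {redundant_pair:.4%}")
print(f"A with B behind C: {with_failover:.4%}")

# A request that has to pass through 100 components, each 99.99% reliable.
print(f"100 in series:     {series(*[0.9999] * 100):.4%}")
```

Under these assumptions the redundant pair alone looks excellent, but a single 99% failover component in front of it pulls the whole path back below 99%, which is why thing C ends up needing redundancy of its own.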

30:53

There are all sorts of engineering-for-reliability concepts that we could spend hours and hours talking about. But once you introduce so many things, you're also spending a lot more money, because no matter if this is running on your own hardware or in your own data centers, or everything is in the cloud, you're now running more things, so it costs more money, both just in a per-month billing situation as well as the fact that now you need more engineers to take care of it.

31:19

If you have 100 components instead of one, you need more engineers to take care of those other components. So now we have to hire a whole bunch of engineers, and now we're trying to hit a really high reliability target. Well, what a lot of people don't always do the math about is what that really means in terms of time. If you want four nines reliability, that gives you only seconds per month to respond; you can barely have any downtime at all, ever. So now, what do the on-call

31:45

rotations look like? So, you now have like a hundred different components of your service, because you have to have incredible redundancy and all sorts of extra availability stuff. And now you have a whole bunch of engineering teams to take care of all those components. Now they have to be on call. But they also have to be on call and respond immediately. That means those teams can no longer just be singly homed; you need a follow-the-sun rotation. So you and I are exactly 12

32:09

hours apart. So we could have a team in Singapore and a team in New York. Great. But now you need offices in Singapore and New York. Now you don't just have X

32:17

number of teams. You have 2X number of teams, because every team now needs teams in Singapore and teams in New York to ensure that someone's only being on call during the day, because no one can always wake up at 3 a.m. in time to try and defend a target as stringent as 99.99%. This goes on and on and on. The closer you try to get to 100%, the more and more that grows. It's exponential. I think we can all agree that, just logically, you cannot ever hit 100%.

32:46

Anyway, it turns into a limit; the limit approaches infinity forever. You'll never actually hit 100% anyway, and trying to get there will just cost you more and more money in terms of how many services you need and how much you're paying your providers. Maybe you can't even be on one cloud. Maybe you've got to go multi-cloud, because what if AWS goes down? We'd better be running in GCP as well.

33:07

Do you know how difficult it is to write stuff that lives on top of both of those things and ensures that it can route to GCP or AWS at the right moments in time? How do you even detect whether or not your AWS or GCP instances are running correctly at that point in time? It gets so complex to try to hit these high targets that you're just going to kill yourself trying, spending too much money. It turns out you don't have to hit those

33:27

targets anyway. Yeah, I was about to say, in the end, after all these complexities and effort, your users actually don't need that kind of reliability. So I think the best way is, again, to check with your users what their expectations are. And yes, sure, some systems will require this kind of high availability. But hopefully people who listen to Alex's explanation can understand that you need to really define your

33:50

reliability. It's not just some nines that you see, okay, four nines, so let's just put it that way. Yeah, so often leadership just says, we're going to hit this target. They pick a whole bunch of nines all in a row because they think that's reasonable, or because they know that some of Google's services hit that. You know what? You're probably not Google. Google has all those things. Google does have teams all over the world. Google does have [unintelligible], right?

34:18

And you probably don't. As you said, just be reasonable, be thoughtful. Think about the targets you're picking and make sure that they make sense for you. So let's start with the fundamentals of the reliability stack, which is the SLI. So an SLI is like a metric that defines what your users experience in terms of reliability. If we relate it to SRE best practice, it commonly advocates for the four golden signals for monitoring to define the metrics of your

34:44

system. Is this the right place to start? How do you define your SLI? I have mixed feelings about the golden signals. I'm on record; this is not going to blow the mind of anyone who has known me for a while. I kind of wish we hadn't, in the first SRE book, written about them the way we did. Because I feel that too many people think that, by having labeled them the golden signals, they're the only things that matter. And I see a lot of people both start and stop there. And there's nothing wrong with

35:16

measuring availability. There's nothing wrong with measuring latency. There's nothing wrong with measuring throughput. That's not my issue. My issue is that people often start and stop there. They are, in fact, good starting points. Absolutely they are. So, if you're asking me, are they good starting points? Yes. If that's where you need to start, start there. Almost everyone can measure those things.

35:38

It's a great place to start, but those things rarely tell the whole story. They rarely tell you what your users are actually experiencing. As we said earlier, reliability is what your users need from you. And we can just tell a quick little story, right? So, is your service available? Yes, it's responding to requests. Is its latency low? Sure, it's responding to requests very quickly. Is it experiencing many errors? No, it's available, and it's responding to things very quickly.

36:04

And every response is HTTP 200; everything's great. But if you're sending them the wrong data, if the data you're sending them is not what they're asking for, you're not being reliable at all, and that's not covered by the golden signals at all. So I think people can graduate. They can upgrade to a more user-journey-based focus, which is, again: is this doing what users need it to do? That includes many levels beyond just what the golden signals tell you. So yeah, I have some mixed

36:33

feelings. I think they were accidentally taken too seriously, or as too much of an end goal, because of the way they were written about. But I do think they are good starting points, because almost everyone has that data already. If you don't, it can be, generally speaking, pretty easy to instrument. It can be much more difficult to measure: did we send the data that the client asked for? That can be a lot more. It took a long time to get there, but yeah, this is my kind of

36:59

feelings there. A really good SLI is ever-changing; it's ever adapting, because the world is going to change, your user expectations are going to change, and what your service is even defined as doing is going to change over time. So make sure that you're looking at your SLIs. You're looking at those measurements and saying: what we're measuring, is it still telling us enough? Is it still telling us what our users are experiencing? Is it still looking at things

37:23

from their perspective? So start with errors, start with latency, start with availability, but move on from there. Think about what else your users actually need. You mentioned checking the correctness of data, right? So this is sometimes where I am confused as well. Personally, if you are always checking whether you send the correct data, it gets like an infinite loop. How do you check in the first place that the data is correct? What about the testing phase of

37:47

your product? How do you actually cover this correctness thing? Because do we really need to measure the correctness of all requests? Or is it only partially, or maybe only some service types that need to have this correctness attribute measured? So maybe tell us a little bit more about that. I think it's all the things that you just mentioned.

38:05

It depends, again, on your service. A favorite of mine is using synthetic checks. A good example: I was once responsible for a very large-scale log system, hundreds of thousands of incoming logs every minute, or maybe even per second. Very, very high volume. We wanted to make sure that we

38:24

were doing things correctly. What we realized is that if we could craft a special log with a special tag on it, and we inserted it, and then we waited to find out how long it took for it to go through the entire pipeline, get indexed, and for us to be able to retrieve it on the other side and verify that the signature we put in was there, that gave us availability, latency, ensuring there weren't errors, and data correctness, and even data freshness.

38:52

The thing is, it told a pretty complete story. This is what the users of this log service actually needed to happen. They needed to insert logs, and then they needed those logs to be indexed, and then they needed to be able to retrieve those logs. So we just wrote a job that would just continually do this, and luckily it could happen a few times per minute. Even at a few times per minute, that gave us more than enough data to set an SLO on top of.
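A synthetic check of this kind can be sketched in a few lines. This is not the job described above, only an illustration of the idea: `run_probe`, `ProbeResult`, the `insert` and `retrieve` callables, and the in-memory `store` are all hypothetical stand-ins, and a real probe would call the actual ingestion and query endpoints of the pipeline.

```python
import time
import uuid
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class ProbeResult:
    good: bool        # inserted, indexed, retrieved, and marker verified in time
    latency_s: float  # seconds from insert until verification (or until giving up)


def run_probe(
    insert: Callable[[dict], None],
    retrieve: Callable[[str], Optional[dict]],
    timeout_s: float = 30.0,
    poll_s: float = 0.5,
) -> ProbeResult:
    """Insert one specially tagged log and wait for it to come out the other side."""
    marker = uuid.uuid4().hex  # unique signature to verify on retrieval
    start = time.monotonic()
    try:
        insert({"tag": "synthetic-probe", "marker": marker})
        while time.monotonic() - start < timeout_s:
            record = retrieve(marker)
            if record is not None:
                # Correctness: the record we got back must carry our signature.
                correct = record.get("marker") == marker
                return ProbeResult(correct, time.monotonic() - start)
            time.sleep(poll_s)
    except Exception:
        pass  # any error along the way counts as a bad probe
    return ProbeResult(False, time.monotonic() - start)


# Stand-in pipeline for demonstration only: a dict instead of a real log system.
store: dict = {}


def fake_insert(log: dict) -> None:
    store[log["marker"]] = log


def fake_retrieve(marker: str) -> Optional[dict]:
    return store.get(marker)


results = [run_probe(fake_insert, fake_retrieve, timeout_s=1.0, poll_s=0.01) for _ in range(5)]
sli = sum(r.good for r in results) / len(results)  # fraction of good probes
print(f"SLI over {len(results)} probes: {sli:.2%}")
```

One probe result covers availability (it came back), latency (how long it took), errors (exceptions count as bad), and correctness (the marker matched). The ratio of good probes to total probes is then the SLI that a target can be set on.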

39:17

So we're no longer dealing with the 100,000 requests per second that the service actually dealt with. It was only a few a minute now, but we were reasonably sure, or at least close enough to being sure, that if any step along the way broke for most people, we would find out, because our special log, which was specially crafted but otherwise went through the same workflow, the same pipelines

39:40

as everything else, gives a pretty good signal. And that one measurement told us four or five different things, because it covered five different services: the log-producing services, the Kafka system that sat in between, the Logstash service that pulled off the Kafka topic, the Elasticsearch cluster that Logstash then inserted the logs into, which were then indexed by Elasticsearch. And then the check talked to Kibana, which sat in front of Elasticsearch, to actually

40:06

retrieve the log. So all those services, and the availability of all of them, and the latency of all of them, and the error rates of all of them were all being covered by this one synthetic that just ran over and over and over again. So that's one thing. You don't always need as much resolution as you think you do. Because even if we only had a few of these per minute, as opposed to hundreds of thousands per second, we were covering so many components of the

40:30

system. It was just as good data. But you can also set SLOs off of actual 100% user data.

40:38

Now, this requires, again, tracing solutions. This requires a lot of work. We do not really have time to get into all the technology involved for you to be able to trace from your human end-user client or browser, across the entire internet, through your provider and CDN and load balancers, into your app, all the way back to the databases and whatever resources they need to talk to, and all the way back up, like render time at the client or

41:05

whatever. But the fact is, I've seen that, and it is absolutely possible, and that is a lot of work. But you could, in fact, also set your SLI off of actual human traffic; your actual user requests could, in fact, be used for that as well, or anything in between. If there's anything I want to drive home, it's: be thoughtful, be meaningful. Do what works for you. Don't overspend resources trying to do the real user journey tracing if that's not reasonable.

41:32

You very well may not have the resources or the knowledge or the time to do that, so maybe set up a synthetic instead. Or you know what? Let's take another step back. Maybe for you right now, just latency and error rate, just the golden signals, maybe that's just fine. Thank you for clarifying that. Again, it's very insightful for me. Thanks for explaining that. So you mentioned user journey a couple of times. Maybe some people are also confused about this term.

42:00

So what do you mean by user journey? Is it like the entire experience: I want to buy socks? Or is it when you check out, or is it when you load a page? There are so many different ways to define this. Maybe you can also help define what a user journey is, or what some people call a critical user journey. Sure. So a critical user journey is generally something that's defined by the product aspect of your organization telling you what needs to happen with your product

42:26

for you to be a successful company, generally in this case meaning making money. That's the most general product management, product owner definition of what a user journey is, right? It is the expected way that a feature will work, and a feature very rarely connects directly to a single service, in terms of a microservice, in terms of a single team owning it. So user journeys

42:51

generally span multiple different components of your system. That generally means that there are many different teams, not just one, responsible for all the components that the user journey travels over. That's kind of a high-level, product manager explanation of the user journey. But I think they're just a good analog for what a good SLI is. A good service level indicator basically is a user journey, and a user journey basically is a KPI, or key performance indicator, which is even one

43:20

more step up, right? This is now the business side. What does the business say? Your business operations, not your computer operations team, what do they say we need to be measuring? What do they say is important for our revenue, for our bottom line? What does the chief revenue officer or the CFO care about? But they're all similar in the sense that they're all measurements, likely having to do with your customers or users.

43:43

And none of them are ever going to be 100%. So you can set targets on all of them. I kind of like to say that an SLI is a user journey is a KPI. They're all kind of the same thing; different business units just have slightly different ways of talking about them. I think the most important thing, again, is that it comes from the product side, not randomly from some

44:00

engineers saying, okay, this is our critical user journey. You mentioned that a user journey could typically involve a number of services; it could be components, databases, load balancers, and all that. It could also be multiple microservices that span across multiple teams. One part of the confusion here: how do you define the SLIs? Is it per team? Is it for the whole user journey itself? How do you advise people to think about this?

44:26

I think you need both. I think individual teams need SLIs set on their own services. They need SLOs set on their own services. They need to understand how their services are operating for the things that depend on them, again, even if those things are only other services as well. But then perhaps the director of your organization needs to be setting SLIs. That's not to say the director has to be necessarily implementing them, but they need to own SLIs and an

44:55

SLO for the kind of user journey stories that go across many different services. And perhaps the VP of engineering needs to be owning the concept of, like, can you check out. So, the checkout microservice team has their own SLO that measures how often their microservice is not throwing an error.

45:15

The commerce department, the director of that department, owns an SLO that says: does the checkout workflow work? The VP of engineering owns the SLO that says something like: are users able to use your website and send us money? Something along those lines. But yeah, I think these things build on top of each other. I think every step of that way needs its own SLO, and you can use those individual SLOs to inform an overarching SLO. The SLI for an

45:43

SLO could be the SLO status of different SLOs, you know. There are no super hard and fast rules here. It's just: are you measuring things? Are you taking your users into account? And are you trying to make sure you're not trying to be 100%? You mentioned different departments owning SLOs, right? Now let's move to the SLO part. So you have defined your SLIs, so you have maybe latency, error rate, availability, and all that, and then you set your SLOs.
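The idea from the answer above, that the SLI for an overarching SLO can be the status of other SLOs, can be sketched roughly as follows. The `SLO` class, the service names, and all counts are made up for illustration; this is one possible reading of "SLO status as an SLI", not a prescribed method.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SLO:
    name: str
    target: float  # e.g. 0.999
    good: int      # good events in the window
    total: int     # all events in the window

    @property
    def achieved(self) -> float:
        return self.good / self.total if self.total else 1.0

    @property
    def met(self) -> bool:
        return self.achieved >= self.target


def composite_sli(children: List[SLO]) -> float:
    """Fraction of child SLOs that are currently meeting their targets."""
    return sum(c.met for c in children) / len(children)


# Hypothetical team-level SLOs feeding a director-level "checkout workflow" view.
children = [
    SLO("checkout-service", target=0.999, good=99_950, total=100_000),
    SLO("cart-service", target=0.99, good=9_890, total=10_000),
    SLO("payments-service", target=0.995, good=4_990, total=5_000),
]

for c in children:
    print(f"{c.name}: achieved {c.achieved:.3%}, target {c.target:.1%}, met={c.met}")
print(f"composite SLI: {composite_sli(children):.2%}")
```

Here two of the three hypothetical team SLOs are met, so the composite indicator is about 67%. Whether that is acceptable is a separate decision for whoever owns the overarching objective.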

46:08

First of all, how many SLOs should a service have? Is it in the hundreds, or how many is good enough? So I think there's some art here. I think it's mentioned in the SRE book as well, and in your book as well. What's the art? How do you define the number of SLOs for your service? I hate to be so boring to the listeners, because it's just: it depends. Again, right? It really is, though. It solely depends on your service.

46:30

Make the right decisions. Make sure it is enough that you are covering all important aspects of your service. So don't choose too few, because if it's too few, you can't understand what's going on. And also make sure it's not too many, because if you have too many, you run into the multiple comparisons problem, or you have too much data and you don't know what to look at, and you don't know what's telling you what, and you no longer have a good idea of what these signals are anymore.

46:55

I think the Google SRE book said five or six per service. That seems reasonable to me, it really does. Yeah, sure, why not. But again, like so many things, I don't like hard and fast rules. I like approaches. I like philosophies. So don't feel bad if you don't have five; don't feel bad if you have many more than five. Just make sure it's the right

47:17

amount for you. There's also another thing that is covered in the book: when you set an SLO target, make sure you are not going too much beyond what users expect, because it could cause a little bit of complication. Maybe you can explain, first of all, how do we know that we are exceeding users' expectations? Maybe we ask the users, like you mentioned. And secondly, what should we do then? Do we lower the definition, the target itself?

47:42

Maybe you can explain a bit the problem of being too reliable here. Yeah, so the main problem you run into by being too reliable is that people will end up expecting you to continue to run that way. So if your users were previously okay with you being only reliable 99% of the time, but then you proceed to spend, like, a year being 99.9% reliable, their expectations may have now changed, and now they're going to hold you to

48:07

that 99.9%, even though they were totally happy with only 99% before. It's a lot of nines, but I think everyone hopefully was able to follow what I was saying there. So you paint yourself into a corner

48:19

if you are too reliable too often, because user expectations will change. What you want to do is make sure that you aren't being so reliable. You may accidentally do this; you may just accidentally have a few months where everything is, like, super lucky, or maybe your whole team went on vacation and no one touched anything, so nothing broke for a while. We can come up with a bunch of different examples, but, you know, then you run into the problem that people are now going

48:47

to expect this moving forward. My favorite example of this is the Chubby team at Google. This is written about in the first Google SRE book. You can read the whole story there; here's the quick version. Chubby is a global lock service, so it holds tiny bits of data that are useful for various highly distributed services to be able to read and understand at certain points in their operation. Chubby just generally ran pretty well.

49:12

When I tell the story and I say that, my old Chubby SRE friends often give me a dirty look, because apparently their lives were not always that easy. But from a user perspective, global Chubby, which was the version of this that was globally available, ran very well. I believe they had four nines, so a 99.99% target every quarter. Generally speaking, Chubby exceeded that, because it ran pretty well. So what they would do at the end of every quarter is they would just shut Chubby off.

49:40

They would just burn whatever budget they had remaining. This would be communicated; team emails would be sent saying: this Thursday afternoon at 3:00 p.m., we're going to be shutting Chubby off for exactly 2 minutes and 17 seconds, because that's how much error budget we have left. Even though people were told about this, other services would always crash, because someone would have a dependency on Chubby that they didn't know about. Chubby, being good citizens, being run by an excellent SRE

50:04

team, would say: we're going to make sure you find out, because we're only promising you four nines. If you're expecting anything more than four nines, you're going to have trouble. So we're going to ensure you find out if you're going to have trouble, because we're giving you exactly four nines per quarter. That's kind of a humorous story, and it's kind of out there, and there's not a lot of teams that

50:23

will ever get to that point. There's not a lot of organizations that will get to that point. But it's a really good example of ensuring that your users aren't getting too used to you being too reliable, because once you paint yourself into that corner, now you might be stuck with being held to a level of reliability that's otherwise too expensive and too difficult. Another assumption that people probably think of when a service is too reliable is, like, Google

50:48

search. When the internet is down, the first thing that they will test is whether Google is down as well. So I think Google is deemed that reliable, and so I think people expect that Google is like a benchmark for the internet, even for the reliability of the internet. So you mentioned error budgets. This Chubby story is really interesting, how they use the error budget to actually make sure that the service doesn't perform too reliably. In the error budget

51:12

sections of the book, one thing I find really interesting: you mentioned an error budget is actually not a technical term, right? It's a communications framework, and this is probably between engineers and the business, and maybe other departments. So tell us more about this communication aspect of error budgets. Yeah. So really, what error budgets let you do is they let you tell others: here's how reliable we have been. So your SLO target tells others:

51:36

here's how reliable we want to be, and your error budget lets you tell other people: here's how reliable we've actually been. And the reason that's such a good communication tool is because it helps other people figure out what their own SLOs should be. And it also helps other people just basically understand what all of this actually boils down to, because we've said the number nine, and nines, like, probably a hundred times already, but what

51:58

does 99.9% even mean? Well, when you translate it to meaning 40 minutes per month, well, that's something that humans understand. So when you're able to go to someone and say, all right, we were unreliable for 17 minutes last quarter, but that means we still had a budget left of 24 minutes. Although unreliability periods are not always all in a row; they're not always about downtime, they're not always about outages. But the point is, it gives you a more human-friendly way of

52:28

communicating these kinds of historical reports. What did Q1 look like? What did 2021 look like from our users' perspective, from this hypothetical all-seeing user who never stopped watching us?
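The translation from a percentage target into time is simple arithmetic, and a short sketch makes it concrete. The function names and the 30-day window here are assumptions for illustration; the "40 minutes" and "24 minutes left" figures in the conversation are rounded, spoken numbers, so they will not match this output exactly.

```python
from datetime import timedelta


def error_budget(target: float, window: timedelta) -> timedelta:
    """Total time the service may be unreliable in the window at this target."""
    return timedelta(seconds=(1.0 - target) * window.total_seconds())


def remaining_budget(target: float, window: timedelta, bad_time: timedelta) -> timedelta:
    """Budget left after subtracting observed unreliable time.

    A negative result means the budget was exceeded.
    """
    return error_budget(target, window) - bad_time


month = timedelta(days=30)  # assumed window length

for target in (0.99, 0.999, 0.9999):
    print(f"{target:.2%} over 30 days allows {error_budget(target, month)} of unreliability")

left = remaining_budget(0.999, month, bad_time=timedelta(minutes=17))
print(f"after 17 bad minutes at 99.9%: {left} left")
```

At 99.9% over 30 days the budget works out to 43.2 minutes, which is the "roughly 40 minutes" a person can reason about, and at 99.99% it shrinks to a little over four minutes.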

52:42

What did it look like in time, for example? It's just an eventual output from an SLO-based approach that helps you then go have conversations with people, whether it's via the time-based error budget definitions, which again make it easier for some human heads to wrap themselves around, or just by being able to say, like: look, we exceeded our error budget every quarter last year. We believe we're aiming for the right target, but we can't.

53:07

So we need more resources, or we need more headcount. Or it could be the opposite. It could be like: hey, we've been exceeding our target a ton; maybe we move some of the staff over to this other project that's having some problems, or maybe we should be moving quicker. Maybe we should be shipping features more often. Because you know what? We're really awesome. We're almost 100% all the time. Let's spend that budget. Let's ship more features. Let's experiment.

53:32

Let's try things. Let's do chaos engineering. There are so many different things that error budgets let you do, but they almost all involve communicating to other people, saying: hey, this data has told us this thing over time. What can we do with that data? I really like all these concepts. They kind of build on top of each other, like you mentioned, the reliability stack. So, the error budget, specifically, if you use it

53:53

right, and if the people in the organization are aligned that the error budget is an important concept for them, you can use this for communicating priorities as well. Like you mentioned: should we ship more features? Should we hire more people? Should we even fix the reliability, or should we do experiments? And even like Chubby, right? Should we just spend it because we are doing well? I think that's really interesting. So I think many people are interested in this SRE,

54:16

SLOs, SLIs, and all that, but implementing it, like you mentioned from your Squarespace experience, is hard. Maybe give us a few more tips: how should people start building this as a culture, or even figure out how to get buy-in from the people within the company? The best advice I can give, or at least my favorite advice, because it's not always possible for everyone:

54:37

I want to be very upfront that I understand that not everyone is in a situation where they can do what I'm about to say. But if possible, just get started. Just do it. Just pick a service and pick an SLO and measure it. And maybe it's the totally wrong target, and maybe your SLI is a bad one. That's fine. Pick a new one. They're not agreements. They're objectives. Just get started with it and start gathering the data, and see what you can start doing with the data. And maybe at first

55:03

it's just you, and then maybe it's your team, or maybe you can get your whole team on board right away. And then you can start showing the teams that you work closely with: hey, look at this cool data we're getting. Look at the great decision-making we've been able to do, and how we've been able to more effectively plan our sprints because of our error budget data. And then these other teams say, oh, that seems kind of cool. Maybe we should try that.

55:24

I really think SLOs are a from-the-bottom-up thing, really. Honestly, I think they're very often organically grown. I think that people can be sold on them philosophically, but they don't understand why they need to spend time implementing them until someone's kind of given them the hard data. So if possible, just start. Just go. Just start doing these things. See what it gives you.

55:47

If it doesn't give you what you want, maybe pick different targets, maybe pick different measurements, or maybe you're not quite ready for them. That's also totally possible. But every industry out there already, in some way, understands that failure happens. It's not just computer services. Embracing failure is always a good thing, while ensuring you're not failing too often. This is always going to lead eventually to happier engineers and a happier business and happier users and happier customers.

56:15

So that's my favorite advice. I don't know if it's the best advice because, like I said, I know some people are not in a situation where they can just go do it. I'm sympathetic to that, but that's generally what I tell people: if possible, just give it a shot and see what happens. Yeah, sometimes people get stuck on the tooling. You know, can we get the data? I know that, but I think we can always start simple, right?

56:38

Yeah, I've seen plenty of teams get started with SLOs by manually calculating them at the end of each quarter. Like, no joke, they didn't have real-time alerting on their SLOs or real-time error budget status. They didn't even have error budgets. They got started by, at the end of every quarter, when they were getting ready to plan what their priorities for next quarter should be, they would go run some queries from their monitoring system, do math against it, put it into a spreadsheet, and

57:04

then calculate: okay, last quarter we were X percent reliable. What does that mean for our next quarter? And that's a totally reasonable way to get started. It's all about just, again, embracing those three truths that we talked about at first, right? Reliability is the most important thing. Your users define your reliability, not you. So make sure you're measuring the right thing. And 100% is out of the question. So pick the right target.
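The end-of-quarter math described here fits in a handful of lines, or equally in a spreadsheet. In this sketch the monthly counts and the 99.9% target are invented numbers standing in for whatever the monitoring queries return; `quarterly_report` is a hypothetical helper, not part of any tool.

```python
def quarterly_report(good: int, total: int, target: float) -> dict:
    """Summarize one quarter from raw good/total event counts."""
    if not 0.0 < target < 1.0:
        raise ValueError("target must be strictly between 0 and 1 (100% is out of the question)")
    achieved = good / total
    allowed_bad = (1.0 - target) * total  # error budget expressed in events
    actual_bad = total - good
    return {
        "achieved": achieved,
        "target_met": achieved >= target,
        "budget_used": actual_bad / allowed_bad,  # above 1.0 means budget exceeded
    }


# Made-up per-month (good, total) request counts pulled from a monitoring system.
months = {
    "Jan": (999_300, 1_000_000),
    "Feb": (998_900, 1_000_000),
    "Mar": (999_400, 1_000_000),
}

good = sum(g for g, _ in months.values())
total = sum(t for _, t in months.values())
report = quarterly_report(good, total, target=0.999)

print(f"Last quarter we were {report['achieved']:.3%} reliable")
print(f"Target met: {report['target_met']}, budget used: {report['budget_used']:.0%}")
```

With these invented counts the quarter lands at 99.92% against a 99.9% target, using about 80% of the budget, even though February on its own missed the target. That is the kind of number that can feed a planning conversation without any real-time tooling.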

57:26

You can embrace those truths without real-time monitoring and advanced statistics and all the stuff that comes along with it. Just get started, even if it's in a spreadsheet, even if it's only just once a month. I like that you mentioned this, that you can start even though you don't have the tooling, because some people think, after reading the book, again, it's philosophical, right? We are not Google, we don't have all the tools, so we are stuck. So I think that's the key

57:48

message here: just start. And I think these three truths that we discussed in the beginning are really important. Once you get them right, you will find ways to actually measure your user happiness. So Alex, thank you so much for spending your time. This is like a crash course on SRE and SLOs, definitely. But unfortunately, we need to wrap up pretty soon. Before I let you go, I normally ask one last question for all my guests, which is to share your three technical leadership wisdom.

58:12

So this is maybe some kind of advice for you to give us. It may also be based on your career journey experience, or maybe hard lessons sometimes. Sure, there's three things I have to share. Be kind. It goes a very long way. You're dealing with other humans. No matter what you think your job is, no matter what your computer service is, it exists at some level for other humans, whether they are your customers or your end users or your co-workers or whatever. But be kind; just be nice to each other.

58:42

Don't be pompous. Try to always remember that every decision you make impacts other people; initiate those decisions with kindness. Be thoughtful. I've said that a lot tonight, but it really is, I think, maybe my most meaningful mantra at this stage in my career: think things over. Sometimes you need to react; sometimes you're in an

59:02

emergency. Sometimes you're dealing with an incident, or immense business pressures, or your company is under immense financial strain. I've been in all those places. But you always have time to be thoughtful; you always have time to take at least a few seconds and be like: okay, is this the right thing I'm about to say? Is this the right thing I'm about to do? Is this the correct action

59:24

I'm about to take? And then finally, adopt blamelessness. Make sure that your organization is building an appropriate culture, where we understand that humans don't make mistakes on purpose, right? Humans all make mistakes. Of course we do.

59:41

Of course every single one of us does, every single day. But generally speaking, unless you're a bad actor, unless you're literally trying to bring down the company from the inside, unless that's the case, people aren't doing it on purpose, and always remember that. So adopt blamelessness. Combine that with the thoughtfulness, combine that with the kindness, and be better to each other. I love all the wisdom, because it all touches the human aspect, nothing about technology or SLOs or SRE.

01:00:07

So thank you so much for this beautiful message. So Alex, for people who want to follow you, or maybe look for your product, Nobl9, where can they find you online? So you can find me on Twitter primarily, at ahidalgosre. That's a-h-i-d-a-l-g-o-s-r-e on Twitter. Also my website, alex-hidalgo.com. And definitely go check out Nobl9. I believe anyone can get started with SLOs; we exist to help you do that. We exist to help you measure SLOs and calculate your error

01:00:36

budgets in the best possible way, no matter where your data lives. So come check us out at nobl9.com; that's n-o-b-l-9 dot com. Thank you so much again. I really enjoyed this conversation, so it was a pleasure. Thanks, Alex. Thanks, Henry. I had a blast. Thanks so much for having me. Thank you for listening to this episode and

01:00:57

for staying right until the end. If you highly enjoyed it, I would appreciate it if you share it with your friends and colleagues who you think would also benefit from listening to this episode. And if you are new to the podcast, make sure to subscribe and leave me your valuable review and feedback. It helps me a lot in order to

01:01:14

grow this podcast better. You can also find the full show notes of this conversation on the episode page at the techleadjournal.dev website, including the full transcript, interesting quotes, and links to the resources mentioned in the conversation. And lastly, make sure to subscribe to the show's mailing list on techleadjournal.dev to get notified of any future episodes. Stay tuned for the next Tech Lead Journal episode. And until then, goodbye.

Transcript source: Provided by creator in RSS feed