#42 - Chaos Engineering - Mikołaj Pawlikowski | Tech Lead Journal podcast

00:00

Chaos engineering is basically this discipline of experimenting on the system, and the system can be anything. It doesn't have to be massive or Netflix scale, in order to increase your confidence that the system will survive difficult conditions. So the experiment's real goal is to either confirm that your assumptions about a system are correct, or you find a problem, or you find a place where your assumptions and the reality don't necessarily add up.

00:35

Hey everyone. My name is Henry Suryawirawan, and you're listening to the Tech Lead Journal, the show where I'll be bringing you the greatest technical leaders, practitioners, and thought leaders in the industry to discuss their journeys, ideas, and practices that we all can learn and apply to build a highly performing technical team and to make an impact in our personal work. So let's dive into our journal.

01:09

Hello everyone. I'm so happy to be back here again with another new episode of the Tech Lead Journal podcast. Thanks for tuning in and spending your time with me today listening to this episode. If you haven't, please subscribe to Tech Lead Journal on your favorite podcast apps.

01:23

And also follow Tech Lead Journal on our social media channels on LinkedIn, Twitter, and Instagram. You can also make some contribution to the show and support the creation of this podcast by subscribing as a patron at techleadjournal.dev/patron, and help me towards producing great content every week. For today's episode, I am happy to share my conversation with Mikołaj Pawlikowski. Mikołaj is an engineering lead at Bloomberg.

01:49

And the author of "Chaos Engineering: Site reliability through controlled disruption". I'm sure that many of you would have heard about chaos engineering before, popularized by Netflix along with its tools such as Chaos Monkey and the Simian Army. I'm curious about chaos engineering and how to implement and execute it properly in order to continuously improve a system's reliability in the midst of disruption and disaster. In this

02:17

episode, Miko shared in depth about what chaos engineering is and, importantly, what chaos engineering is not, by clarifying some of the common misconceptions surrounding it. Miko explained the prerequisites and steps required in order for us to start doing chaos engineering, and also mentioned some of the chaos engineering tools that we can use at

02:38

different layers of our system, the skill set required of a chaos engineer, and how we should explain the rationale and motivation behind chaos engineering to get management buy-in, using a fun analogy involving a hamburger and a shark. Towards the end, Miko also shared about chaos engineering for people, an interesting excerpt taken from his book, and his ultimate mission over the last few years to make chaos engineering

03:05

boring. I hope you will enjoy this episode, and if you like it, consider helping the show by leaving it a rating, review, or comment on your podcast app or social media channels. Those reviews and comments are one of the best ways to help me get this podcast to reach more listeners, and hopefully they can also benefit from all the content in this podcast. So let's get this episode started right after our sponsor message. Are you looking for a new cool swag?

03:33

Tech Lead Journal now offers you some swags that you can purchase online. These swags are printed on demand based on your preference and will be delivered safely to you all over the world where shipping is available. Check out all the cool swags available by visiting techleadjournal.dev/shop, and don't forget to brag about it once you receive any of those swags. Hey everyone, welcome back to another new show of the Tech Lead Journal.

04:00

Today I have with me a guy called Mikołaj Pawlikowski, or in short, let's call him Miko. So Miko is the author of a recent book titled "Chaos Engineering". So today I'm sure we're going to be talking a lot about chaos engineering, how to implement it correctly, and maybe some of the gotchas or misconceptions about chaos engineering. So Miko, it's good to have you here in the show. Thanks for being here. Very glad to be here. Thanks for inviting me. Yeah.

04:26

So before we start, maybe introduce yourself to the audience here. Can you tell us a little bit more about your career, maybe some highlights or turning points? Sure. So I don't know how far back you would like me to go. Obviously right now I do a lot of chaos engineering. I run a small SRE team that manages Kubernetes, and chaos engineering kind of evolved as one of the very important tools that we use to basically make our system more

04:54

reliable. And it plays nicely into this entire site reliability engineering mindset. I've been at Bloomberg for a while now. Before that, I attempted a couple of startups. So a pretty common tech background, I think. I've been hacking away at coding since I was a kid, so I think probably a lot of people can relate to that. Okay, cool. I mean, let's maybe dive deeper as we talk along about your career. You have this book not so long ago, right?

05:22

Chaos Engineering. I think the first time I heard about this term was when Netflix popularized this idea of chaos engineering, and they've been running it in their production, killing their servers in order to increase reliability. But maybe for all the audience here, can you explain what exactly chaos engineering is? Sure thing, and that's probably a good idea, because there are a lot of misconceptions and the

05:46

name itself doesn't help. So chaos engineering is basically this discipline of experimenting on the system, and the system can be anything. It doesn't have to be massive or Netflix scale, in order to increase your confidence that the system will survive difficult conditions. So a unit of chaos engineering, the way that I see it, is this chaos experiment, when you basically have a bunch

06:08

of assumptions about the system. Obviously we all design the systems and we want them to behave in certain ways. But in practice, it turns out that there's a lot of behavior that we didn't account for, or the system is doing what we told it instead of what we intended. So the experiment's real goal is to either confirm that your assumptions about a system are correct.

06:33

Basically, your hypothesis works. Or you find a problem, or you find a place where your assumptions and the reality don't necessarily add up. So these chaos engineering experiments typically have four steps. One is defining observability, because this is like a prerequisite for all of that to happen. If you can't observe the system, any kind of variable, reliably, then you can't really conduct a scientific experiment. Then you go for what

07:03

we typically call steady state, which is just a fancy way of saying this is the normal kind of behavior, this is the normal range. So let's say that for observability we're looking at a variable like throughput, or the number of requests per second that some kind of server can handle. The normal range, the steady state, might be this number of thousands of requests per second. And then we go and we do the fun stuff.

07:26

So we try to turn our expectations of the system into a hypothesis, and we say, okay, we designed the system to be redundant, which means that if we take away one of the servers from the pool, it should keep working within these parameters. And you go to number four: you implement that and you verify what happened. The nice thing about that is that everybody wins, because if your hypothesis was wrong, then you discover

07:51

something you can fix, and you can save yourself trouble and fix it before your users notice. If your hypothesis was correct, that means that your system is pretty good, so you increase your confidence. So, you know, that's the nice part of doing chaos engineering. So this is really it; it doesn't need to be more complicated. And then obviously, it came out of Netflix and it made for some really good headlines, because they were already doing this in

08:17

production. Breaking things in production is typically one of the internet memes, so if someone comes out and says that with a straight face, it's a bit of a controversy. So that's really what I see as chaos engineering. I think it's simple enough that pretty much everybody can at least kick the tires and see what value they get out of it. From your explanation just now, there's nothing mentioning chaos, so to speak. It's more about a scientific experiment. You know what

08:45

you're going to test. You have a hypothesis, you have the steady state you want to test against, and then you introduce some kind of tests and experiments. Not necessarily all that chaotic stuff, like just killing things or doing something like a data center takedown. It's not necessarily about the chaos itself, although a lot of people actually have this misconception: okay, let's just introduce this chaos software and then everything

09:09

just goes haywire, and let's see how the system goes here. So chaos is a little bit of a double-edged sword, because on one hand it catches attention in headlines; on the other hand, it requires a little bit of explanation. So there are at least two things to touch upon here. One is that the chaos that we mention here is not about increasing the amount of chaos in your system. It's about decreasing the amount of chaos. In my definition of

09:34

chaos engineering, I try to make you picture a lab coat, protective gloves, eyewear, and stuff, because if we can't control as many of these variables as possible, we can't reliably confirm or deny our hypothesis. But the other reason for the chaos in the name is that there's an entire spectrum of things that you can do with chaos engineering. Chaos Monkey, the kind of thing that this entire discipline really started with, was very chaotic in the sense of

10:04

the randomness sense of the word. So if you don't know much about your system, or you're looking for the emergent properties, randomness can really give you a lot of value kind of out of the box, and it requires very little setup. So you take the system, you release the Chaos Monkey on it, you let it run, and you probably find some things. You can parametrize it; you can make sure that it breaks only a certain percentage or whatnot.

10:28

And this already gives you value, and I see this side of the spectrum as similar to the discipline of fuzzing in testing. You produce a lot of valid inputs that you would probably not think of if you were writing unit tests manually, you just run them through whatever you're testing, and you might discover things that you didn't think of. You just cram in all these inputs; it's like a brute-force approach, right?
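The fuzzing comparison can be sketched in a few lines of Python. This is only an illustration, not something from the episode or the book: `parse_percentage` is a hypothetical function under test, and the alphabet, run count, and seed are arbitrary assumptions. The idea is to throw many machine-generated inputs at the code and record anything that fails in an undocumented way.

```python
import random
import string


def parse_percentage(text: str) -> float:
    """Hypothetical function under test: turn '42%' into 0.42."""
    if not text.endswith("%"):
        raise ValueError("missing % sign")
    return float(text[:-1]) / 100.0


def fuzz(func, runs: int = 1000, seed: int = 0) -> list:
    """Feed random short strings to func and collect every input that
    raises something other than the documented ValueError."""
    rng = random.Random(seed)  # fixed seed so a finding can be reproduced
    alphabet = string.digits + "%.-+e "
    surprises = []
    for _ in range(runs):
        length = rng.randint(0, 6)
        text = "".join(rng.choice(alphabet) for _ in range(length))
        try:
            func(text)
        except ValueError:
            pass  # documented, expected failure mode
        except Exception as exc:  # anything else is a discovery
            surprises.append((text, exc))
    return surprises


if __name__ == "__main__":
    print(fuzz(parse_percentage))
```

The random end of the chaos engineering spectrum works on the same principle, except that the generated "inputs" are failures such as a terminated instance rather than strings.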

10:55

And so, if you start with a system, it's a really nice way to get into chaos engineering, because it doesn't require much of a setup. You need to have only a vague understanding of the system, you can release it, and you can find things. And the other thing I mentioned

11:09

is the emergent properties. It's also a fancy way of saying it, but to illustrate that for those of your audience who have not heard about it, think about neurons in your brain. A single neuron doesn't have the property of human consciousness or the property of thinking per se, but when you put them all together and they interact, from within these interactions you have the emergent property of human

11:36

consciousness. Same thing for another popular example, the cells in your heart. None of them has the property of pumping blood and oxygenating your body, but put together they create a system that actually has this property. These examples give you very nice properties, right? But in any complex enough system, you're going to have interactions that you just didn't predict, and sometimes they're going to be pretty bad for you. So this kind of fuzzing,

12:05

randomness side of the chaos engineering spectrum is great, and it's useful, and it's part of the discipline. Over the last few years, we've been trying to push for the other side of the spectrum, where you go into the system, you already know the system well, you're looking at particular properties, and you want to make sure that particular failure scenarios are covered. From the SRE point of view,

12:30

if you had an outage that you didn't predict because the system worked differently than you expected, you would want to make sure that doesn't happen again. One of the best kinds of regression tests that you can come up with is to simulate that failure scenario and make sure that the system actually still survives it. So the other side of the spectrum is the sophisticated,

12:52

very deliberate practice, while the randomness is also available to you. Right now, as a practitioner of chaos engineering, you can pick wherever is best on that spectrum for you, and I think that's definitely a good thing. So, as a techie, everyone understands at least the concept. But when you introduce this to the management or the business, for example, "we are going to introduce Chaos Monkey or chaos engineering in our system", how do you actually explain to

13:20

them what the motivation and rationale behind this is? Because I'm sure any business and any executives all want stability. They don't want some kind of randomness or attack on a system that is working fine. So how do you explain the rationale and motivation behind this to sell it to them? Yeah, that's probably the number one question, and that goes back to the name being a little bit misleading and a little bit of a double-edged sword.

13:45

So it depends who you talk to, and in general there tend to be two big groups. Either it's your colleagues, people who have had the experience of actually being paged in the middle of the night, and I think the only real argument you need to get them on board is to tell them: listen, if we do it right, there is a huge potential for you to

14:08

be called less at night. So basically the worst-case scenario is that you break something. If you do it well and you do it with common sense and you apply all the same best principles for deploying the code, because the chaos experiments are code like any other, really, you're going to only affect a certain blast radius anyway. But the worst-case scenario is that, okay, so we did chaos engineering and we

14:32

broke something. So what happens then? One of the things that is worth mentioning is that you're doing it on purpose. So if it breaks, you're typically in the office, you're typically ready to jump in to fix that. You don't have the context switch. Waking up in the middle of the night when you're being paged to figure out what's going on is nothing pleasant. I'm sure a lot of your audience will have had that experience.

14:56

You have to wake up, you need to make the coffee, read through the alerts that may or may not be easy to understand, deal with people who are panicking because they just got woken up and they don't know what's going on. You context switch between all these things, you log in, it takes time. It's typically not a very pleasant experience, let's just be honest.

15:17

So if you can minimize the number of times that this happens, and instead try to do it more purposefully, that's a big step up. So if you are on rota for supporting your system, this is probably all you need to know. The other group is people who do the decision-making on this kind of stuff, your managers. And for this kind of person, from my experience, the best argument to start with is to do some back-of-the-napkin math. I even have a phrase for that.

15:53

I typically call it the hamburger versus shark problem. It's about the perception of risk. So like I mentioned just a second ago, if you do it right and you are careful about the blast radius, it's pretty much like releasing any other change to production: you test things through the stages and you apply all the same principles you do for all other bits of code. But if you ask a person in the street how afraid they should be of sharks,

16:20

they have already been primed by Hollywood movies, Jaws, The Meg, and all of that, to be very afraid of sharks. But if they actually look at the statistics of how many people die of shark attacks, it's a very minuscule number. There are more people dying of coconuts landing on their heads every year than there are of sharks. Whereas the things that are really statistically likely to kill them are things like heart

16:46

attacks or heart disease in general, which take more than half a million Americans every single year, compared to probably a single-digit number for shark attacks in any given year. You see that this is actually statistically much more dangerous. So next time you see that hamburger, with all this grease and all of that, think about it: this is actually more likely to kill you than the shark. What was the point I'm trying to make here?

17:10

The point I'm trying to make is about chaos engineering when you first hear about it and about introducing failure on purpose. It's a lot like end-to-end testing of the unhappy paths, the way I see it. It's like the evolution. You do the unit tests, where you test a small little subset. Then you do maybe component testing, some kind of integration testing. Then at some point you get to end-to-end testing, where you typically test the happy path or some popular paths and stuff like that.

17:38

And then chaos engineering is a lot like end-to-end testing of the system, when you take it as a whole, during unhappy events: when things break, when machines go down, when the network gets slow, and this kind of thing. So for someone who is hearing, "okay, we're going to add failure to our systems, and we would like to also eventually do it in production", it sounds scary, but that's like the shark. If you think about it, there are ways to manage the blast radius and there are

18:06

ways to manage that risk, and the goal of the entire exercise is to decrease the amount of chaos and decrease the amount of risk that you have in your system, not increase it. So just take a napkin, run through a few numbers, and you can typically explain to your engineers that they really shouldn't pay too much attention to this scary-sounding name, because there's a lot of return on investment to harvest.

18:29

So, if I hear you correctly, chaos engineering doesn't necessarily mean that you have to run this experiment in production. You can also run it in maybe a pre-prod or some kind of environment where you can actually simulate and do all these tests. Is that correct? Although, if you don't try it in production, you're excluded from the chaos community. Okay. So, you know, this is one of the things, because of all the blog

18:50

posts. It's shiny. Doing stuff like this in production is unorthodox, and it makes for a great presentation, but that's the holy grail, right? You want to be so comfortable with the system, having done this for so long in the other stages, that it is now part of your routine and you do it in production. And if you can do that, that's obviously great, because if you think about it, you're never fully testing things until they get into production.

19:17

The data patterns will be different, the usage patterns will be different. So by definition, you can never fully test things until the actual proverbial rubber hits the proverbial road. But sticking the chaos engineering sticker on your laptop doesn't necessarily give you a moral absolution to drop common sense. Behind the scenes, the less shiny bit is that you are progressing it the same way that you progress all other software.

19:45

You write this code, you run it in your test environments, you might increase it. You're probably going to do the same thing that you do when you release any other software: limit the percentage of traffic that goes through the systems that have it, and then increase it. If anything goes wrong, you roll back. So the same best practices really apply. Once you get to the holy grail and you're running in production, that's great.
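The "limit the percentage, then increase it, and roll back if anything goes wrong" routine can be sketched as a small Python loop. This is a minimal illustration under stated assumptions: `apply_fault`, `is_healthy`, and `rollback` are hypothetical callables standing in for whatever fault-injection tool and health check you actually use, and the stage percentages are made up.

```python
from typing import Callable, Sequence


def staged_rollout(
    stages: Sequence[int],
    apply_fault: Callable[[int], None],
    is_healthy: Callable[[], bool],
    rollback: Callable[[], None],
) -> int:
    """Widen the blast radius one stage at a time.

    Rolls back at the first unhealthy check and returns the last
    percentage that stayed healthy (0 if even the first stage failed).
    """
    last_good = 0
    for percent in stages:
        apply_fault(percent)      # expose `percent` of traffic to the fault
        if not is_healthy():      # steady state violated: stop widening
            rollback()
            return last_good
        last_good = percent
    return last_good


if __name__ == "__main__":
    # Toy system that stays healthy while at most 25% of traffic is affected.
    state = {"percent": 0}
    reached = staged_rollout(
        stages=[1, 5, 25, 50],
        apply_fault=lambda p: state.update(percent=p),
        is_healthy=lambda: state["percent"] <= 25,
        rollback=lambda: state.update(percent=0),
    )
    print(reached, state)
```

Keeping the rollback inside the same loop that widens the experiment means the blast radius can never grow past a failed health check.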

20:11

You can write a blog post and go to a conference and talk about it, as some of us do, but it's not just about that. And also, it's probably worth mentioning that there are use cases where it's probably never going to be okay. If your choice is to potentially introduce this failure, and you're not confident whether it's going to work, and someone might die because of that, that's probably not the right moral choice. Or if you have very

20:39

heavy contractual obligations, from the legal point of view you might not be able to do that. But it doesn't mean that you can't get 80% of the value by harvesting the lower-hanging fruit in the process. So, yeah, that's a myth.

20:54

Thanks for clarifying that. So obviously, for us people who are not experienced in it, there are probably many more misconceptions besides the one that excludes you from the chaos engineer group. So maybe you can tell us some of the other common misconceptions that people have. Sure. So there is an entire suite of these, and I typically, on a regular

21:15

basis, ask my LinkedIn network what the biggest blocker is, and it still seems to be getting the buy-in, in big part because of these misconceptions. So the production one is a big one, and that's typically something that people really think. The other thing is randomness. So a lot of people basically

21:35

disregard chaos engineering because they feel like it's a gimmick: Chaos Monkey, randomly smashing things. While for bigger systems, like Netflix again, that works nicely, because doing something as simple as taking down the VMs is already giving you value, it might not work for a smaller system. So explaining what we just talked about, that it's not just about the randomness, it's about an entire spectrum from random to very deliberate, is also important.

22:07

Another thing I keep hearing is, "Oh, we already have chaos enough. We don't need more." We've touched upon this already too, but I really would like to drive the point home that if you still think that chaos engineering is about adding chaos to your systems, then you still haven't gotten the memo. We add the failure so that the amount of incertitude, the amount of chaos in your system, actually decreases. So net-net, we do all of that for

22:36

a decreased amount of chaos. So obviously, under that umbrella, a lot of people say that because they don't feel that confident in their systems. All of our systems break. And if you are just hearing about this and you had a massive outage last week, you might feel, "Okay, fine, I need more maturity" or something like that. I think that's a false premise too, because regardless of where you are on your maturity spectrum, you can get some value out of doing that.

23:06

Typically these chaos experiments, unless you go really deep into the weeds, are simple to set up. They don't take that long to implement, and especially as the tooling around that gets better and better by the day, it's easier and easier to do pretty sophisticated things with little effort. Worst-case scenario, you don't discover anything and you feel a bit better because you know that this particular scenario is not going to take your system down,

23:32

and best-case scenario, you discover something. So it's really like small bets that can pay off and that are pretty cheap to make. There's also the benefit of the kind of mindset that goes with that. If your engineers and your team think about the fact that this is going to be tested, that this is actually going to be put in practice, and they have to design the systems with this in mind, it just comes more naturally to bake this reliability in from day one rather than doing it as it

24:04

breaks. So the mindset is also something that we could probably talk about much more. But thinking "oh, we're not that mature, we still have outages" and stuff like that is really not helping, because at any stage in your maturity, even if you're still getting outages on a regular basis, you can get value out of that. So, no, it's not adding more chaos, and it's definitely not "we're not

24:29

mature enough". I think one more is about the scale. People look at Netflix and they look at Google and they say, "Okay, yeah, these are massive companies, and they're bleeding edge, and they have all the scale, and that's great." But it's also worth noting that chaos engineering doesn't require you to have all these fancy and massive distributed systems.

24:55

It can really work. It's more of a methodology that can really work on any kind of system, even if you're working with a single-process legacy server and you would like to make sure that you understand how it breaks, and that the kinds of things you expect to happen don't break its retries or whatever recovery logic it has. You can do that; nothing is stopping you. So really, don't be shy, kick the tires, and you'll see that with a

25:25

little bit of investment, you can get a lot of value out of that, and you don't need to be like, "Oh no, we're too small for that." So I think those are my top myths that I just keep hearing over and over. It's probably too late, though, to change the name, so we're probably going to keep hearing them. Thanks for clarifying all these misconceptions and myths about chaos engineering. So in the beginning, I think you mentioned this as well. For people who have heard about this,

25:50

they want to give it a try, right? Can you summarize again the four steps that they need in order to start introducing this chaos engineering? Sure thing. So first is the observability, and I like the word because it sounds very technical, but what it means is being able to measure something reliably. What I mean by reliably is that in computers, you know, we're kind of spoiled, because we can measure some things reasonably

26:15

well. If we were quantum physicists, it would be a bigger problem, because the measurement could affect what we try to measure, and then we'd have the uncertainty principle to work with. But what I'm getting at here is that it's important to be good at observing. It can be very simple. This variable can be anything. It could be whether the server is up or down, that's something you can observe, or throughput or number of requests

26:41

per second, or anything, really. And then, once you have that, the steady state I mentioned, the normal range of that, is how you verify what's going on. So normally the server is up, or normally I get this many requests per second. And then we hypothesize. We say, okay, so we are disk-write heavy. Let's check that we understand what happens if someone steals a little bit of the disk. So we might introduce a hypothesis: if the disk becomes 50% slower,

27:15

I still hit, let's say, 50% of my steady state in terms of requests per second. And then the last step is to implement it: take one of the many tools that are available right now to do that, make sure that it doesn't affect your observability, go run it, and see what happens. More often than not, you're going to discover things that you might not have predicted. Thanks for summarizing that again. So just to recap: first, you need to have good observability.
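The four steps just described (observe a variable, record its steady state, state a hypothesis, then inject the failure and verify) can be sketched end to end in Python. This is a toy model, not a real tool: `SimulatedServer`, its linear throughput formula, and the 2000 requests-per-second baseline are all assumptions made up for illustration; only the "disk 50% slower, still at least 50% of steady state" hypothesis comes from the conversation.

```python
from dataclasses import dataclass


@dataclass
class SimulatedServer:
    """Toy stand-in for a disk-write-heavy service (not a real system)."""
    base_rps: float = 2000.0
    disk_speed: float = 1.0  # 1.0 = normal, 0.5 = 50% slower

    def requests_per_second(self) -> float:
        # Assumption of this toy model: throughput scales linearly with disk speed.
        return self.base_rps * self.disk_speed


def run_experiment(server: SimulatedServer, slowdown: float, min_fraction: float) -> bool:
    """Return True if the hypothesis held under the injected slowdown."""
    # Steps 1 and 2: observe the variable and record its steady state.
    steady_state = server.requests_per_second()
    # Step 3: hypothesis, throughput stays at or above this threshold.
    threshold = min_fraction * steady_state
    # Step 4: inject the failure, observe, and always restore the system.
    original_speed = server.disk_speed
    server.disk_speed = original_speed * (1.0 - slowdown)
    try:
        observed = server.requests_per_second()
    finally:
        server.disk_speed = original_speed
    return observed >= threshold


if __name__ == "__main__":
    server = SimulatedServer()
    print(run_experiment(server, slowdown=0.5, min_fraction=0.5))
    print(run_experiment(server, slowdown=0.8, min_fraction=0.5))
```

In a real experiment the measurement would come from your monitoring and the slowdown from a fault-injection tool; the `try`/`finally` mirrors the point that the experiment should clean up after itself whatever the outcome.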

27:43

So I think this is a must, right? You cannot introduce something chaotic without actually knowing and observing how the system behaves, because that kind of defeats the purpose, I guess. Then the second one is you need to know your steady state, so what exactly your system looks like in the normal state before you introduce this chaos. And the third, you start to hypothesize and make experiments, I guess, like what kind of things you want to test against the

28:06

steady state. And then the fourth, the last one, will be to implement that. So I hope I summarized it correctly. You mentioned as well that there's this chaos engineering community and things like that. Are there specific skills actually required from a chaos engineer compared to, I don't know, a normal engineer or an SRE? Maybe you can clarify on that part, any particular skills for a chaos engineer.

28:28

So this is one of the things that I really like about chaos engineering: it really cuts through different stacks and different technologies and different ways of designing systems. It's really more of a meta-skill, the way that I see it. That also means that it's not necessarily a job description. I mean, some people have this job description, but it's more of a mindset and skill that a lot of different people can have.

28:54

So from what I do daily, you know, as a kind of an SRE-type person, I care a lot about my system not blowing up and my system working well. And as the scale increases and everything, you realize that failure is a "when" rather than an "if", and you start seeing more and more of it as it grows. So the primary driver for me is to have another tool that helps me sleep better at night and

29:22

get paged less, or called less. So this is something that works well for the SREs, but if you're an application team, you can also leverage the same thing. You design your system, your application that you're working on, in a way that it's the most reliable possible. So you apply the same skills, and there's nothing stopping you from running your own chaos

29:43

experiments. The SRE kind of person might be running a platform that runs a lot of client code that they potentially know little about, and on the application side, you run fewer applications, but you have much more intimate knowledge about them. So, to give you a more tangible example, let's say that chances are you're using a database. One of the things that you do as you write your test cases, your unit tests, is verify what happens when the

30:13

database becomes unavailable. You verify that the right errors are returned, for example, or that there is a retry, something like that. And that's great; that's hopefully what everybody is doing. But now the question is, do you know what happens when the database is still available, but it becomes a bit slower?
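The "still available, but slower" case can be explored with a very small Python sketch. Everything here is an assumption for illustration: `slow_query` is a hypothetical stand-in for a database call, and the deadline and the `"fallback"` value are arbitrary. The point is only to show a caller that bounds how long it waits instead of hanging on a slow dependency.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout


def slow_query(delay_s: float) -> str:
    """Hypothetical database call that is up, but takes delay_s to answer."""
    time.sleep(delay_s)
    return "row"


def query_with_deadline(delay_s: float, deadline_s: float) -> str:
    """Run the query in a worker thread and give up after deadline_s."""
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(slow_query, delay_s)
    try:
        return future.result(timeout=deadline_s)
    except FutureTimeout:
        # The dependency is alive but too slow: degrade instead of hanging.
        return "fallback"
    finally:
        # Do not block the caller on the still-running worker thread.
        pool.shutdown(wait=False)


if __name__ == "__main__":
    print(query_with_deadline(delay_s=0.0, deadline_s=1.0))
    print(query_with_deadline(delay_s=0.3, deadline_s=0.05))
```

A chaos experiment for this scenario would add latency between the application and the real database and check whether the application behaves like `query_with_deadline` or simply waits.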

30:32

Do you know what happens if all the connections to your database are now being throttled because there's some busy neighbor, or just networking is overloaded and you get much slower? Do you understand how the compounding effects of all of that work? And do you have things in place to deal with that? Or is your application just going to hang forever, accept enough connections that it empties the pool, and you get

30:59

stuck forever? So at every level of the stack, wherever you're working, you can use the same thing. I really try to drive this point home in my book, which is much more applied and technical than the other books that were

31:15

available at the time: wherever you are on the stack, on the platform, whether you're working with the kernel and you want to verify what happens when syscalls are blocked or delayed or something like that, to the platform level, networking, maybe Kubernetes level, containers level, all the way up to the browser. Things like: okay, so JavaScript is everywhere, but do you really test what happens when the front-end code is having trouble connecting,

31:46

or is actually getting slow responses? Do you still display coherent data, rather than loading a little bit here, a little bit there, and giving people a false impression? So it's really applicable to all the levels of the stack. This is something that's very exciting for me because you get to look at all these things and you know, just contains to this and go box.

32:10

So when talking about these layers, I'm very interested, because you can do this at any layer, not just the platform level, which is what we normally hear about from Netflix. You can also run it at the application level, database level, even network level, or, like you said, in the browser itself. I haven't heard of that, to be honest. Can you give us some examples?

32:29

Like the names of the tools and what exactly they are testing at these layers; some examples would be great for the listeners. Oh, sure. So let's say that we've all had this moment when someone gave us some legacy piece of software that was compiled 10 years ago, and it runs, but the documentation is a little bit iffy at best. It's more tribal knowledge, and you don't necessarily want to go and read the Fortran code or whatever C code it's implemented in.

33:02

And so if you work in a startup that was started last year, maybe you're going to be able to skip that part of your life, but it's like a rite of passage. So here is one thing that you could do about it. Let's say that it's some kind of server, and that's an example from my book. If you want to actually go and play with that, there's a VM that lets you start it and play.

33:23

Not a lot of people know that you can use strace not only to trace system calls, but also to inject changes, so you can make system calls return errors. Also, if you have a modern enough version of strace, you can actually implement patterns where you fail, for example, every other call. So one of the things that you could do at this very, very low level: the server is doing a write to send the response to you, so you can go and verify that.
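As a hedged illustration of the strace technique described here, the Python sketch below only builds a command line; it does not run anything. It assumes an strace version recent enough to support the `-e inject=` option, and `./legacy_server` is a hypothetical binary name, not the one from the book. The `when=2+2` expression asks strace to tamper with the second matching call and every second call after that, which is one way to get "fail every other write".

```python
def strace_inject_write_errors(command, errno_name="EIO", first=2, step=2):
    """Build an strace argv that makes some write() calls fail.

    command: argv of the program under test (hypothetical in this sketch).
    errno_name: error the failed write() calls should return.
    first, step: fail the first-th matching call, then every step-th one.
    """
    inject = f"inject=write:error={errno_name}:when={first}+{step}"
    return [
        "strace",
        "-f",                 # also follow child processes
        "-e", "trace=write",  # only show write() calls in the output
        "-e", inject,         # inject the chosen error into some of them
        *command,
    ]


if __name__ == "__main__":
    argv = strace_inject_write_errors(["./legacy_server"])
    print(" ".join(argv))
```

The resulting list could be handed to `subprocess.run` on a test machine; as noted in the conversation, the tracing overhead makes this a poor fit for production.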

33:56

For example, if it can't do the write, if it gets an error, is there some kind of retry logic? So if you fail a portion of those calls, you're going to see whether the retry logic kicks in and you get the right response, or whether it breaks completely. So even if you don't know much, and you can't even see the source code, and all you have to go by is the tribal knowledge that, oh yeah, it's supposed to recover, it always recovers, you can actually go and get some mileage out of the most basic

34:26

thing that it does, because every program on your system is going to have to go and do a write. So a lot of people use strace to see the low-level things a program is doing, but not everybody knows that it actually lets you implement chaos experiments like that. strace is also a great example of the importance of observability, because the penalty of running a program while it's being straced is

34:52

pretty high. So the measurement actually affects the program that you're running; it slows things down very significantly. There are other ways that you can observe these things. There are newer technologies that let you do that, like eBPF, the extended Berkeley Packet Filter, which are becoming increasingly common and powerful, because you can do a lot of similar measurements from the observability point of view without actually affecting the program.

35:20

But in a lot of scenarios, it's going to be workable, because if all you're interested in is the success rate, it might be okay for you to accept the penalty while doing your test. It's something you probably wouldn't do on a production system, because obviously it slows things down and affects the users.

35:38

So this is the lowest level that I could think of, and that's why I went all the way there in the book, to show you that even in the worst-case scenario, a single process, you can still apply the same principles. And then I mentioned Kubernetes. When you go a level up, there is a lot of stuff that Kubernetes solves for you, but it also introduces its own complexity, and if you don't have a good understanding and just expect Kubernetes to always do the right thing,

36:07

you might be in for a surprise. If you are working with Kubernetes as a product, if you're running Kubernetes for someone else, it's also very important to understand how Kubernetes itself fails and what the fragile points are. And this is something that's very easy to do with chaos engineering, and every one of those experiments helps. Then there are things like building chaos engineering into your

36:31

application directly. Obviously, this means that this code is part of the application, and it's prone to introducing new bugs, but you can do things like activate the chaos engineering code only behind some kind of flag, and activate that for some percentage of your traffic. Make sure that there is no runtime penalty when the code is not actually hit. So you can break things for the instances where it's

36:57

activated. As for the browser, like I described, this is something that people initially react to with a smile, as in, "Okay, chaos engineering for my browser software?" But it doesn't take much to show that if you display some data from the previous request and some data from the new request, you might actually trick the user into doing something silly, because they have stale or inconsistent data. So doing things like this is

37:25

also important. Yeah, those are three examples that hopefully give you the entire spectrum, going all the way up from the bottom. So I think you mentioned a lot of these examples in your book. For those of you who are interested to know more, or even to play around with some of these tools, I think Miko explains clearly in the book, for each layer and each tool, how to do that. So make sure to read the book if you are interested.
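The idea mentioned a little earlier, chaos code built into the application but active only behind a flag and only for a percentage of traffic, could look roughly like the Python sketch below. All names here (`ChaosFlag`, `InjectedFailure`, `handle_request`) are hypothetical and not taken from the book; this is one possible shape under those assumptions, not a reference implementation.

```python
import random


class InjectedFailure(Exception):
    """Raised only when the chaos experiment decides to break a request."""


class ChaosFlag:
    """Hypothetical flag: off by default, affects a fraction of requests."""

    def __init__(self, enabled=False, fraction=0.0, rng=None):
        self.enabled = enabled
        self.fraction = fraction          # 0.0 = never fail, 1.0 = always fail
        self.rng = rng or random.Random()

    def should_fail(self) -> bool:
        # Check the flag first so the disabled path stays as cheap as possible.
        return self.enabled and self.rng.random() < self.fraction


def handle_request(payload: str, chaos: ChaosFlag) -> str:
    """Toy request handler with an optional, flag-gated failure."""
    if chaos.should_fail():
        raise InjectedFailure("chaos experiment: simulated failure")
    return f"ok:{payload}"


if __name__ == "__main__":
    chaos = ChaosFlag(enabled=True, fraction=0.1, rng=random.Random(42))
    outcomes = []
    for i in range(20):
        try:
            outcomes.append(handle_request(str(i), chaos))
        except InjectedFailure:
            outcomes.append("injected-failure")
    print(outcomes)
```

Keeping the flag check first is the design choice that matters: when the experiment is off, the only cost on the request path is a single boolean test.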

37:48

So, I think there's another thing in the book that I find very interesting, which is what you called chaos engineering for people. Why did you write a specific section for that? Actually, it's a little bit underappreciated, but if you think about your people systems, also known as teams, a lot of the same rules that apply when you're building a reliable computer system also apply to building reliable human systems. Probably the easiest way to illustrate that

38:18

is that every team is always going to have some bottlenecks, and the bottlenecks might be in the form of a throughput bottleneck, or in the form of a knowledge bottleneck. If you have only one person who's capable of debugging a particular system and that person is on holiday,

38:40

you're going to have a problem. So a lot of the job of having a performant, good team is continuously trying to find these bottlenecks and resolve them. A lot of this stuff happens naturally, and if the people on the team are thinking in this kind of chaos engineering way, it's going to be good for your team long term. So the last chapter in the book actually talks about this in detail, and it goes into the human aspect

39:10

of getting the buy-in. So it touches upon some of the bottlenecks and some of the misconceptions that we just described, and it goes into the chaos engineering mindset. Basically, if you think of your system with the expectation of failures happening, rather than the possibility of failures happening,

39:29

it's going to help you build more reliable systems from the ground up. It's kind of like having a good CI system that always runs your unit tests: every time you push a change upstream, you automatically get the feedback from the unit tests. It becomes part of the culture. You get this immediate feedback, a quick turnaround: it works, it doesn't work, it detected the problem.

39:55

It's similar with the chaos engineering mindset: you build this into your automatic thinking about things, and failure is not something that you might address later; it's something that you expect to happen. It goes back again to the back-of-the-envelope, napkin math, if you think about it. It's not a question of if, it's a question of when, and it's fairly easy to calculate, depending on your scale, how often you're going to see

40:20

these kinds of failures. And then the final bit of that was inspired by Dave Rensin's presentation that he gave at one of the conferences, where he basically described teams, and I love the description, I still remember it, as a set of bio-robots executing a distributed algorithm to produce some work. He actually went into a lot of detail about the games that they came up with to

40:47

surface and detect the problems. I basically followed his lead to include that in the book and make sure that it spreads through the community. Some examples of that: I talked about the bottleneck in terms of knowledge. So one of the things that you can do is just tell somebody on a particular day that they're not allowed to give anybody else any help on a particular subject, and that will tell you whether the rest of the team can basically do without

41:17

them. And if they can't, you've detected the bottleneck. He goes into a lot more advanced games, all the way up to basically telling people a fake outage is happening and seeing what happens and what they do to address it. This obviously is a little bit more tricky, because you need to get the buy-in from the higher-ups, and people might get confused. They might not know what's actually happening when they try to debug something

41:43

that's not really happening. Or you might go all the way and actually break something on purpose without telling your team, to debug your reliability in terms of responding to that. So Dave really created some very interesting insights in that presentation, and I just couldn't not include that in the book. Hopefully, he still enjoys it there. Thanks for sharing that. I think it's a very interesting concept, teams as a distributed

42:12

system, kind of a mindset. So Miko, I know that you are very active in this community, right? For me, all the cool stories I know about chaos engineering are about Netflix, Chaos Monkey, the Simian Army, and all that. Are there any other cool examples that you have heard people showcase, maybe at a conference, or things that are publicly available? So it's interesting that you're asking that, because what

42:36

I've been trying to do over the last few years is actually to make chaos engineering very boring. Let me explain why I think boring is good. When you think about the adoption curve of different new technologies, it always follows the same bell-shaped curve. Initially, it's a novelty. You have a very small population of early innovators that's happy to put up with all the shortcomings. Then you have the early majority.

43:03

Then you have the late majority, and the people who lag a little bit. And so for a technology or methodology to reach a wide audience, it has to become mainstream enough, basically boring enough, that it can be adopted by a lot of people, because not everybody is happy to work around the rough edges. So the example I usually bring up is the SpaceX rockets. Not that long ago, seeing these rockets go all the way to orbit,

43:35

the boosters go all the way to orbit and then automatically land, looked like something from science fiction. They were the first ones to pull it off, and it was amazing. I remember staying up late, because of London time, and watching the nine minutes or whatever of the flight, and then the landing, like something from sci-fi. But then over time, as they got better at it, it stopped being exciting, because they stopped blowing up. They just go up, go down,

44:05

land on the drone ship, Of Course I Still Love You. It's just becoming so mainstream that I no longer find myself staying up for that. So now maybe Starship is something that I'm going to want to look at. But if they start landing every test and doing it routinely, it becomes boring, and so boring is good. In the same way, a smartphone was something outrageous not that long ago, and now everybody has one. You can get even the cheap ones, and they're outrageously good and quick.

44:38

And you have access to so many different apps. We have a small supercomputer in our pocket all the time, and we don't even notice that. So I would really like chaos engineering to stop being about the exciting stuff, that you can break things in production and not get fired for that, and instead become this routine thing that we do just because it creates a lot of value. There are benefits to doing that. So I'm going to go in the opposite

45:05

direction and say that it's probably more about making it boring. And when you think about it, most of the low-hanging fruit is boring. It's like cybersecurity: in the movies, we see the hackers just randomly punching the keyboard, streams of data going by, then "I'm in," and probably some nice graphic. But in reality, most of the low-hanging fruit is so boring, because you need to check all the routine things.

45:31

You need to stay up to date, you need to patch the things that are already known, and you need to make sure that your S3 bucket is not set to public. So that's the boring stuff. It doesn't make it into the movies, but that's where most of the work is done. So, yeah, if you want the exciting stuff, there are definitely things on the internet, but that's not really where most of the value is coming from. Most of the value is coming from the boring stuff. Thanks for that; valid points.

45:58

Hopefully one day we'll see that chaos engineers are not unicorns, found only in the cool companies that are able to do that, but that every engineering team is able to introduce some kind of chaos experiment in order to test the reliability of their system. So Miko, thanks for spending your time today. We have come to the end of this conversation. But before I let you go, I normally ask all my guests one question, which is about three technical

46:23

leadership wisdoms. So, can you share some wisdom from your career, or maybe from your chaos engineering experiments, so that the audience can learn and benefit from you? Wow, wisdom, that's a big word. Okay. I think one of the things I learned that probably gave me a lot of mileage is that a lot of what we do is about removing the BS from the equation. Engineers, apart from

46:52

the big egos and everything, are really very finely tuned to detect BS. And so a lot of tech leadership, and leadership in tech in general, in my opinion, is about just making sure that there is as little BS as possible. So if someone asks you a question, you have a few options. You can say

47:12

yes or no if you know the answer for sure; you can say "I don't know" if you don't know the answer; or you can try to BS your way through it and come up with something on the spot. I got a lot of mileage just by removing that last option. If you just tell people honestly, yes, no, or I don't know, it builds this relationship that's valuable to everyone, where they also feel like they

47:38

can also say, "I don't know." I'm supposed to be an expert in the field, paid a lot of money for that, senior and everything, but there is stuff that I don't know, and that's completely fine. You work with a lot of people who are smarter than you, who have more experience than you, who by definition are going to be much better at parts of your job. If you are humble enough to just say, "I don't know; what do you think?", it gets

48:04

you a lot of mileage. And I think that really helped me stay out of trouble. And as kind of a corollary to that, another piece of advice is that regardless of what you think about yourself, there's always something that you can learn from everybody. We work a lot in this industry on increasing the diversity of thought. That means coming from a mindset that, okay, if that person disagrees with me, there is probably something that they think about or know about that I don't know.

48:36

So if I just try to convince them of my way of doing things, I'm not going to come out of that conversation any wiser. But if I at least by default give it a shot, maybe they're wrong, maybe there's something I know that they don't, but if I give it a shot, I'm going to get more value out of that. And if you keep getting value from every conversation, you're really going to accumulate that

48:58

over time. So yeah, I think that being humble, knowing that you can learn things from everybody, and not BSing is probably what really helped me get where I am right now. Because you asked for three, I'm going to attempt a third one. I think you probably hear this a lot on your show, but the learning mindset, lifelong

49:22

learning, is very important. This is something that some people have built in; they start with it, and they're just super excited about learning. By learning, I mean a lot of things. There might be a point where you know everything about your niche and there isn't much more to learn. Instead of stopping learning,

49:38

you should probably start exploring another niche, or maybe do something completely out of your comfort zone: go learn a language, or do some art or sport, and see how that affects your brain. In recent years, I've been reading a lot about the brain and how it works. I found myself being able to influence a lot of weird things in my life just by picking up new skills that seemed to be completely unrelated. It's easy these days: you can get

50:05

an app, you go and do Duolingo, and you can pick up a language, which would normally cost a lot of money and be impractical, but you can take the first few steps for free or very cheaply. Obviously the pandemic made it a bit more difficult to pick up a new sport. Anyway, the point is that learning things might have unexpected benefits for you, and this is something that I really recommend doing. Yeah, I agree with that. Thanks again for the reminder about this important learning mindset.

50:34

So Miko, for people interested in learning more about you, or maybe your recent book, where can they find you online? Sure. Probably the easiest way to reach out to me is on LinkedIn, if you want to interact. Otherwise, I do have a mailing list for the book. If you go to chaosengineering.news, you can sign up and get updates. If you have any particular feedback about the book, or I messed something up, do reach out.

51:01

There is a GitHub repo for the book where you can download the VM, and that's where you can file issues. I'm pretty sure I messed some things up, so I'm looking forward to that. Hopefully we can meet at chaos conferences when they are in person again; I'm kind of looking forward to them. The way I think of it, it's like a chaos experiment in our lives, where this pandemic suddenly threw people into a different kind of living and

51:27

mindset and routines. Hopefully, this pandemic will end soon enough so that we can all go back to our previous normal lives. So, thanks again, Miko. I hope that with your chaos engineering mindset, and you being the champion of it, more people will be able to implement it in their teams, their systems, and their companies, so that we can make it more boring, so to speak, like you said. I wish you good luck in your career as well. Thank you.

51:52

It was really fun being on the podcast. Thank you for listening to this episode and for staying right till the end. If you enjoyed it, please share it with your friends and colleagues who you think would also benefit from listening to this episode. And if you're new to the podcast, make sure to subscribe and leave me your valuable review and feedback. It really helps me a lot to make this podcast

52:19

better. You can also find the full show notes of this conversation on the episode page at the techleadjournal.dev website, including the full transcript, interesting quotes, and links to the resources and mentions from the conversation. And lastly, make sure to subscribe to the show's mailing list on techleadjournal.dev to get notified of any future episodes. Stay tuned for the next Tech Lead Journal episode. And until then, goodbye.

Transcript source: Provided by creator in RSS feed
