Gaining Stability with Rule Based Policies with Shimon Tolts - DevOps 226 | Adventures in DevOps podcast

Speaker 1

00:14

Hey everybody, and welcome back to another episode of Adventures in DevOps. This week on our panel we have Will Button. What's going on, everyone? We have Jeffrey Groman. Hey there. I'm Charles Max Wood from Devchat.tv. And this week we have a special guest, and it is Shimon Tolts.

Speaker 2

00:29

Hi everyone, it's a pleasure to be here. Thank you very much for hosting me.

Speaker 3

00:32

So fun. Yeah, absolutely. I mean, you brought all the energy. It's funny. We were talking beforehand and you're just excited, and I love it. Do you want to just introduce yourself real quick, let people know who you are, what you do at... is it Datree?

Speaker 2

00:45

Datree, yeah. Sure, yep. So yeah, my name is Shimon. I've been in the infrastructure R&D space for more than ten years, and I've worked at large companies like Intel, and I worked at startups from like thirty

01:00

employees until we were a thousand employees. And my previous role, I was an engineering manager for a media company and I grew together with the organization from thirty employees to one thousand, and I really saw how the struggle is real when you have four hundred engineers and you're trying to make this work while breaking things and moving fast.

01:24

So this is actually what brought us here, and what brought me and my co-founder, almost four years ago, to open Datree, which helps prevent misconfigurations from ever reaching production environments, especially around Kubernetes.

Speaker 3

01:38

All right, so the code that crashes my stuff, that's my fault. The misconfigurations, that's somebody else's fault. I'm just being clear. I hope my boss hears this. Anyway, so we were talking before the episode and you said that you experienced this outage, and this is where you learned a lot of the lessons that led to Datree. Do you want to just talk about that? Kind of give us the background and the story so that we know what you screwed up? I mean, how that went?

Speaker 2

02:07

Sure, sure thing. So let me describe the scenery first, you know, like in a book. So imagine a company and you have one thousand employees, four hundred of them are developers. The company is born in the cloud, paying one point five million dollars for AWS every month, a

02:26

lot of stuff running. No one actually knows what everything is, but it kind of works, you know, and moving really, really fast, working in microservices, a lot of different programming languages, and the philosophy of the company is, you know, like small speed boats, like Amazon calls it,

02:45

two-pizza teams. You know, you have your team, you have a problem, you find the best solution, you go, you do it, you're responsible for it. Sounds great, removes bottlenecks, makes you move fast, and really gives great energy to people because they don't feel like they're a small screw in a big organization. And my role there was the general manager of the infrastructure division. So my work was to find things that are relevant to all the other

03:13

teams and build it as an infrastructure. So for example, we built a data collection pipeline that ingested more than two hundred billion, with a B, events every month from thirteen regions in AWS. And this is what we would do. And every time there's a new technology or something that is cross-company, we would be responsible for it. My team was sort of like special ops in a way. And one day we had a production outage. Now, this happens. You know, we're all people. I make mistakes.

03:45

Everyone makes mistakes, besides Charles. It happens, and like every company we said, okay, let's post-mortem the problem, understand what happened, let's find the root cause. And we did it. A developer made a misconfiguration in one manifest file, and we said, okay, we totally understand, people make mistakes, and we believe in the philosophy of, you know, run fast and break things, but we don't believe in the philosophy of let's make the

04:13

same mistake five times. You know, there is a limit to that as well. So we said, okay, so how do we make sure that this does not happen again? So first of all, you send the post-mortem in an email to everyone in the company. So we tried that. Nice, but it doesn't really work. No one reads it, no one remembers it. And I've got to tell you, from the other side, as a developer getting emails every day telling me, like, use this package, use this configuration, check this thing.

04:41

It's not scalable. Like, how am I supposed to remember everything? It's just not feasible. And we said, okay, we did, you know, internal educational systems. And we did an internal meetup and explained to everyone, and everyone agreed, and everyone understood, but it didn't really work well, because I think, and this is what we thought, and this is what drove us to actually open the company, is that it has to happen in an automated way within

05:08

the development flow of the developer. Because every inch, every small thing that you do in order to change the workflow of the developer, it's crazy. It's almost never going to happen, and if it's going to happen, it's going to be very, very painful for the developer, for the manager, and for the company. So we said, how can we do something that will be seamless in the flow? Because when we spoke with people, they said, I want to

05:33

know when I'm doing something wrong. I don't want to be the person that submits secret keys into our public GitHub repository. I don't want to be the person that takes production down. But sometimes I just don't know. And this is what drove us to actually opening Datree and building a solution that hooks directly into the development workflow. So it's a CLI utility. You can run it on your laptop, Linux and Mac. Just run datree test on your Kubernetes manifest file or Helm file, and then

06:05

we provide out-of-the-box predefined policies. So I'll give you simple examples that seem, you know, really trivial, but people don't do it. It's like memory limit, CPU limit, a liveness probe, readiness probe, pulling containers from a centralized registry, not using the latest Docker tag. Because then every time you build it, it's like going to

06:28

the casino. You don't know which version you're going to get or what you're going to have in production. And next, after you have it on your computer, you install it in your CI/CD. And at this point this is one of the most powerful things, because you get a centralized policy management solution. So I, as the DevOps engineer, can identify a problem, think of a policy that I want to apply to all of my hundred or fifty or five thousand engineers, and with the click of a button,

06:59

I can enable a policy. And now all the projects that go through this CI/CD pipeline will actually comply with this policy, and otherwise it will fail. And the idea is that once it fails, it does not notify the DevOps person. It explains to the developer what they have to do, and shows them and links them to the wiki and to our docs and tells them, hey, mister developer, hey, missus developer,

07:26

this is how you can fix it. So we are very very proud of it because I really believe that this is how I would want my organization to communicate those policies and practices to me as a developer, as an engineer.
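
The out-of-the-box rules Shimon lists here (memory limit, CPU limit, liveness probe, readiness probe, central registry, no latest tag) can be sketched as a small checker. This is only an illustration under stated assumptions: it is not Datree's implementation, the function name check_container and the registry registry.example.com are made up for the example, and the container is represented as a plain dictionary shaped like a Kubernetes container spec.

```python
# Hypothetical policy checker; an illustration, not Datree's actual code.
def check_container(container, allowed_registry):
    """Return a list of human-readable violations for one container spec."""
    violations = []
    image = container.get("image", "")
    limits = container.get("resources", {}).get("limits", {})

    if "memory" not in limits:
        violations.append("missing memory limit")
    if "cpu" not in limits:
        violations.append("missing CPU limit")
    if "livenessProbe" not in container:
        violations.append("missing liveness probe")
    if "readinessProbe" not in container:
        violations.append("missing readiness probe")
    if not image.startswith(allowed_registry + "/"):
        violations.append("image not pulled from " + allowed_registry)

    # An image with no tag, or with ':latest', resolves to whatever was pushed last.
    last_part = image.rsplit("/", 1)[-1]
    tag = last_part.split(":", 1)[1] if ":" in last_part else ""
    if "@" not in image and tag in ("", "latest"):
        violations.append("image uses the 'latest' tag or no tag")
    return violations


good = {
    "image": "registry.example.com/shop/api:1.4.2",
    "resources": {"limits": {"memory": "256Mi", "cpu": "500m"}},
    "livenessProbe": {"httpGet": {"path": "/healthz", "port": 8080}},
    "readinessProbe": {"httpGet": {"path": "/ready", "port": 8080}},
}
bad = {"image": "nginx:latest"}

print(check_container(good, "registry.example.com"))
for violation in check_container(bad, "registry.example.com"):
    print("violation:", violation)
```

Returning messages rather than a bare pass or fail mirrors the point made above: the failure should tell the developer what to fix, not page the DevOps person.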

Speaker 1

07:39

Quit telling me what to do. No, it makes sense, and to be perfectly honest, you know, I'm a web developer for a fairly large financial firm. And what's nice is a lot of this stuff does kind of get pushed into our CI/CD. But the other nice part of it is that generally when these kinds of policy changes come down, and I don't think they're using Datree, I think they're just saying, we're making this policy change and we're configuring Jenkins

08:10

to do it. But they generally are pretty good about going in and making the initial move, right? So they move it to match the policy, and then from there, when it lines up with whatever we're doing, that's when we get to the point where it's like, okay, so then if we change something that messes it up, right, then it's on us. Okay, we can roll this back. But yeah, they're usually the ones that initially make it comply.

08:35

And I just wanted to add that because I think there's some level of responsibility that goes both ways, and so that's what I like about this. But if you're the one that's making the initial change that's going to cause it to fail in CI, then you probably also ought to be the person that's either working with somebody or doing the work yourself to make it comply in the first place.

Speaker 2

08:54

I totally agree, and this is why when we designed our policy engine, we designed it to have several points of granularity. So first of all, you can see how am I doing. Now let's scan my GitHub repositories and see, do I have any violations now or not? Secondly, you can enable a rule in a way that we call gradual rollout. So now every time that they make a change that is not compliant, we tell them, listen,

09:24

on August first, this change will not be compliant. Now it is passing, and that's okay, totally fine, but just so you know, we're going to have a policy in place on August first, and this is the policy, and here you have time to actually prepare for it. And then once August first hits, it fails as a warning and not as enforcement, and then you have a grace period for adoption of the policy, and only then, at the end of the end of the end, it

09:54

actually goes to full enforcement. And you're totally right. This is the feedback that we got from our customers, and this is how they designed it, and we built it because this is how they wanted.
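
The gradual rollout described here has three stages: a notice before the policy date, a warning during the grace period, and full enforcement afterwards. A minimal sketch of that staging, assuming hypothetical function names and made-up dates (the transcript gives only "August first", with no year and no grace-period length):

```python
from datetime import date

# Hypothetical staging logic; names and dates are assumptions for illustration.
def rollout_mode(today, warn_from, enforce_from):
    """Map a date to one of the three stages of a gradual policy rollout."""
    if today < warn_from:
        return "notice"   # check passes; the developer is told the policy is coming
    if today < enforce_from:
        return "warning"  # grace period: the violation is reported, the build passes
    return "enforce"      # the violation fails the build


def build_passes(has_violation, mode):
    """Only the enforcement stage turns a violation into a failed build."""
    return not (has_violation and mode == "enforce")


warn_from = date(2021, 8, 1)      # assumed year
enforce_from = date(2021, 9, 1)   # assumed one-month grace period

for day in (date(2021, 7, 15), date(2021, 8, 10), date(2021, 9, 2)):
    mode = rollout_mode(day, warn_from, enforce_from)
    print(day, mode, "passes" if build_passes(True, mode) else "fails")
```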

Speaker 4

10:04

Now that's super cool.

Speaker 5

10:05

I really like the approach. You mentioned the path of how you got here through the emails and the meetings and the workshops and stuff, but really all that is only relevant at the time. And doing it this way, I think one of the key things there is that you're meeting the developers where they are, because the right time to introduce the solution or the information is when it's relevant to them. Otherwise it's just out of context.

Speaker 2

10:32

I totally agree. You have to get the warning and the data in line. I call it in line. And this is why we have a Helm plugin, we're working on a kubectl plugin, a VS Code plugin, everything. And this is very important, because if it's not convenient, and if it's not in the developer's workflow, then, I'll give you a story. Okay, I met with a big enterprise company. They talked to me about a certain policy that we have that says, like,

11:03

pull containers from the centralized registry of the company. So it's like, um, docker at company acme.com, right? And he's like, it's a good policy. I want to use your solution instead of ours. And I'm like, what's your solution? What do you do today? Oh, we just blocked Docker Hub in our firewall and no one can access it. And I'm like, what?

Speaker 5

11:27

Problem solved?

Speaker 2

11:29

For real. Really, this is what he told me. And he's like, yeah, well, we just block it at the DNS and firewall level and that's it, and they can't pull it from there. And I think that this is absolutely not the way to do it. As we go forward, developers want a shift-left, nice-to-play-with solution.

Speaker 4

11:53

Yeah, that works until your developers get smart enough to figure out how to use a VPN or other ways of Yeah. Yeah, I mean this.

Speaker 2

12:01

Is security by obscurity.

Speaker 1

12:03

That's, yeah, it's the wrong way. I was going to say, Jurassic Park: nature, or developers, will find a way.

Speaker 4

12:13

Absolutely. You know, just to pile on, I totally love this idea too, and it really speaks, I think, to the whole DevOps mentality of flow, and pull requests versus push requests, because I think, you know, the way you were describing it earlier was that it's basically just, you know, pushing stuff out, which never really works all that well. But if you, as Will said, get the timing right so that now developers can pull

12:37

that information as they need it. The timing is right, the method is right, the information is there for them to pull it.

Speaker 2

12:45

Just you know.

Speaker 4

12:46

Again, I know I'm just piling on, but I just feel like it actually fits really nicely and elegantly in the whole DevOps, you know, sort of methodology.

Speaker 2

12:54

But you know what, I've got to say, of course I gave a very radical example. But when we meet with companies that are at this crossroads, they say, okay, listen, we scaled. We had thirty developers, it was okay, forty, fifty, and now we have like seventy developers. It's COVID. We're all working from home. You can't just come to a room and ask, hey, how do we do this and that? And it's like, we need to put something in place. And then I see

13:22

companies choose two different paths. It's like two opposites that you can go to. And of course I think that the best solution is the middle ground. But some companies go the old way, the whole way, to: okay, so DevOps is responsible for the cluster, which is true in many organizations. They're responsible for the operational excellence of the cluster and for the day-to-day operations. But then the developers write the application. Then what they say

13:52

is, okay, so now every change that the developer makes to a Kubernetes manifest or Helm or anything that touches the infrastructure has to go through the ops team. Now what happens at this point is that there's a huge bottleneck. Eventually, it frustrates both sides, because the developers, they have the R&D backlog and the product sitting on them with timelines, they need to release stuff, and they're

14:16

waiting for the ops team to approve it. The ops team, they don't want to babysit developers and tell them, listen, you forgot, you're pulling the latest image, put a pinned-down version. No, because it's not interesting. They want to do cost reduction, they want to optimize the performance, they want to upgrade it, they want to bring the newest, best versions. They want to, you know, do crazy POCs. And then all sides are basically frustrated because they

14:44

babysit developers. The developers don't get autonomy, and at the end of the day, it just bottlenecks DevOps. Not to mention the fact that SRE and DevOps teams are usually like one to ten developers, so you might have like ten DevOps people to one hundred or two hundred developers. So that's one side.

Speaker 3

15:03

I can definitely identify with this.

Speaker 1

15:05

I mean, I'm working on a project right now that's on several timelines, right? And yeah, when things won't deploy, when it doesn't play nicely with the cluster, things like that, we get frustrated, right? And then my boss gets frustrated and it's like, why isn't this out there?

Speaker 4

15:20

Right?

Speaker 1

15:20

And then, you know, well, DevOps, right? And so then they go to DevOps and it's the same thing, right? And then the DevOps guys, sometimes it's okay, they'll go figure out what it is and it's something that they can fix. And sometimes they're coming back to us and saying, well, there's this problem, and they don't want to come back to us and manage us, and nobody else is happy because whoever the powers that be are for the business needs, they just want it out, right?

Speaker 3

15:49

And so, yeah, what you're talking about. We've run into this more than once over the last year.

Speaker 2

15:55

And then it's like, what am I to do? I heard it from several companies, multiple times. And just to give you another example, the most common thing is that they come to DevOps and tell them, I have a deadline, and then they go like, yeah, but the CFO told me to do cost reduction on AWS. So what do I do? Do I listen to the CFO, or do I listen to the VP R&D? Like, what's more important? And who knows? I don't know. It's hard. Now let's talk about

16:23

the other side. The other side is actually when this happens, but DevOps does not assume responsibility, and they go the path of educating the developers, and they say, no, we're not going to lock everything. We're not going to lock anything, but we're going to put additional effort into educating the developers in order to make the right decisions, which is nice. The thing is, while this happens, you're really not sleeping at night, both because you're afraid and because you're getting

16:54

paged for things that are happening. And secondly, the developers, I find them terrified. They go like, I'm going to do something that is going to change production now. And they go like, I'm a Java billing engineer, I don't know Docker, I don't know Kubernetes. I'm an expert in Java billing, not in, you know, Docker. And then you find them almost crippled, because they're afraid. They say, I don't know, I'm not an expert, and I'm afraid to break it

17:27

and I don't want to do it. And then the teams, they try to educate them and so on. But I think what works best is the middle ground, and it's solutions like Datree, or you can take Open Policy Agent with Conftest and Gatekeeper and write your own policies. And what I've heard from developers about the middle ground is, they call it, I feel like I have guardrails. Like, I'm riding the freeway, but I have guardrails so

17:55

I can do it by myself. I'm not bottlenecked by DevOps, but if I do something horribly wrong, the system will stop me. And then it's like a nice middle ground between the two, which I think can greatly help both sides of the equation.

Speaker 5

18:13

I think that's one of the approaches I try to take, specifically in post-mortems, you know, because in post-mortems a lot of the focus is on root cause and what went wrong.

Speaker 3

18:22

But I try to.

Speaker 5

18:23

Take it a little bit further than that and say, you know, the failure was not that this code did whatever. The failure is that the system didn't warn somebody that this was going to happen.

Speaker 4

18:35

Right.

Speaker 5

18:36

We built an environment where a developer or an engineer was able to make a change that they shouldn't have been allowed to make. And I think that's what you're describing there, is they're free to do whatever they want, but there's the guardrails in place to keep them from doing something that they didn't intentionally want.

Speaker 2

18:53

To do. Like in AWS, you go delete the resource and it's like, there are fifteen resources attached to this security group. I guess you don't want to delete the security group.

Speaker 4

19:07

So one thing I'm curious about is, when you talk about your clients, I'm curious to hear what are the common sort of misconfigurations. You know, if I were to ask you here, what are the top five or top ten, I'm really curious to hear what you see commonly.

Speaker 2

19:27

Yeah, great question. It's an absolutely great question. And by the way, in our docs at hub.datree.io, we list all of the policies that we have and you can view everything. So let's go over some categories and talk about them. So one day, one company, a very big messaging company, told me, I want an infra safety score. I want this to run and for me to feel safe. Not safe

20:02

in regards to security. It also has security aspects, but I want to know that my safety score is high. So it starts from resource management. So in Kubernetes, for example, you can have a CPU request and a CPU limit, a memory request and a memory limit. This is very, very common, and it especially happens because the developer, she codes the app and then she sends it to the cluster, and she doesn't know

20:33

what she is going to be paired with. Now, the DevOps engineer works on workload management and optimizes the workloads on the different nodes so it will be cost effective. And the problem starts when you don't have memory and CPU limits, and then you have a memory leak in one of your containers, and then it starts affecting the others. It's like a noisy neighbor, but very, very noisy. And then Kubernetes starts to, it depends on how you configured it and so on, but it starts to kill services, starts

21:07

to run out of memory. There are different behaviors that none of them is good and none of them is as expected. So I'd say this is the like no brainer one that you should do.

Speaker 4

21:20

Now.

Speaker 2

21:20

What we see companies usually do also is that they start and they apply a cluster-wide memory and CPU limit, because you can do it on the runtime level. The problem starts when, you know, you have different departments. One of them needs four gigabytes of memory and it's great, but then you have the AI engineers and they're like, we need forty gigabytes. And then if you don't configure it on the shift-left side, then you go like, okay, so I need to increase everyone's limit to forty gigabytes,

21:49

and then it's like nothing, it doesn't matter. So it's really important to set it on the resources side. So that's one. The second one is, I would say, around workload management, in terms of making sure that you have a liveness probe, a readiness probe, that your Docker container has a health check. It sounds so trivial, it sounds so simple, but so many times people go and create

22:16

the workload and don't set those things. They just, oh, just HTTP to it, ah, HTTP 200, it works, great, and then they don't configure it on the workload level.

22:29

Not to mention, you know, deeper things, where you have a service and you want the health check to include maybe a connection to a database or to a cache. And I would really, really advise, in order to increase your safety and stability, to really put an effort into your health checks, readiness, liveness, because if you do it right and correctly, once things fail, and

22:52

things always fail, it will be easy for you to find the root cause, and it will be easy for you to protect yourself, and for Kubernetes to kill this workload and to get another workload running.

Speaker 4

23:03

So let me just stop you there, because I like to ask the dumb questions. So I think I understand what you're saying. But for other people, you know, our listeners, who may not have followed that whole train of thought, right, because there's a lot there. You just said in two minutes a lot that I feel like we could unpack, right? So I'm going to say it, and then you need to tell me

23:23

how wrong I am or if I missed something. But so let's say we send out a new package, we spin it up, and I'm the developer. So what I do is I just check to say, hey, can I hit it with my browser? And I get

23:39

a two hundred response back saying we're good. So that's only a piece of it, because maybe I'm only hitting let's say the load balancer, and so the load balancer is saying I'm here, right, I'm answering you, but the application behind it is dead, or maybe the application is alive,

23:57

but the database behind it is dead. So unless I'm doing health checks that are sort of going through those steps, we may have just deployed something that broke everything and I don't even realize it, because all I'm doing is pinging the load balancer and getting a two hundred response, and everything looks good to me because I didn't check what's going on behind it, you know, sort of peeling back the layers of the onion. Did I get that right? Or am I missing something?

Speaker 2

24:23

Absolutely right. This is one of the most common mistakes developers make: they just check the simple front end in a web browser and they don't do the entire process. And then when you do have a problem, it's so hard to debug it, because everything returns a great health check, and you don't understand what the problem actually is.
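
The load balancer, application, and database layers described in this exchange can be made concrete with a deep health check that exercises each dependency instead of only answering 200. A minimal sketch, assuming hypothetical names (deep_health, database_down) and stand-in check functions rather than real connections:

```python
# Hypothetical deep health check; dependency checks are stand-ins, not real clients.
def deep_health(checks):
    """Run each named dependency check and report a status per dependency.

    `checks` maps a dependency name to a zero-argument callable that
    raises an exception when that dependency is unhealthy.
    """
    results = {}
    for name, check in checks.items():
        try:
            check()
            results[name] = "ok"
        except Exception as exc:
            results[name] = "failed: " + str(exc)
    healthy = all(status == "ok" for status in results.values())
    # 200 only when every layer answered; 503 otherwise.
    return (200 if healthy else 503), results


def database_down():
    raise ConnectionError("database unreachable")


# The app itself responds, but the database behind it does not.
status, detail = deep_health({"app": lambda: None, "database": database_down})
print(status, detail)
```

Reporting a status per dependency is what makes the root cause easy to find: the response names the layer that failed instead of a single green light.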

Speaker 4

24:46

So is this something that, you know, Datree does for me? Or is this something I've got to figure out? Like, how does that...

Speaker 2

24:54

You know?

Speaker 4

24:54

How do you build that into a health check? Because that sounds like there are a lot of steps, and it really depends on the architecture of your application.

Speaker 2

25:01

Yeah, so we can talk about it from an engineering standpoint. In terms of Datree, Datree is a tool, and it's a tool that you can use in order to say, listen, from now on, all of our Kubernetes workloads are going to have a liveness probe and a readiness probe. Now, how you configure this liveness probe and readiness probe is up to you. Same thing, you're going to put a

25:24

memory limit. If you put a memory limit of sixty four megabytes and your server can't even it's some Java huge jar, I don't know, it can't even load up, it's your problem. But what we will do is we will make sure that a policy exists and that it is configured on the resource. The next layer is another thing that is like what is the most common you know, policies. It's actually labels. Again, it sounds so simple, it sounds so trivial to put the label, and there are so

25:59

many reasons why to put a label. So I'll start with the one that we're talking about now. So first of all, you can use labels in order to say what type of workload it is, in order to determine which type of policy, in terms of resource management, for example, it should use. So then you could say, this is of type AI and they use those types of limits, and those are of type back end, front end, I don't know. Different teams call it by different names, and you can use it in order to understand what

26:30

are the relevant policies you should use. This is number one. Number two, cost management. This is also a very, very common use case that DevOps people have to deal with, which is constantly knowing how to assign the cost center, because they run the shared resources, and at the end of the day they pay the check to AWS or Azure or wherever you run it, and then the internal company goes like, okay, but how much do we need to bill each business unit inside the organization? And then

27:01

they go like, I don't know. We had like five thousand servers and then they go like, okay, now it's mandatory everyone should say which department this server belongs to, because otherwise we're not going to know how much to allocate to it, because then you don't know what is the cost of goods of your business, and then the CFO doesn't know if the business is profitable or not profitable, or can we hire people? Can we not hire people?

27:26

And it's crazy, because it's like board of directors' decisions that go down to the CFO, that go down, down, down, down, down to the simple label that you need to put on your Kubernetes workload in order to know how much it costs.
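
Both uses of labels mentioned here, picking a resource policy by workload type and allocating cost by department, can be sketched in a few lines. The label keys workload-type and department, the function names, and the cost figures are assumptions for illustration; only the 4 GB and 40 GB ceilings echo the numbers mentioned earlier in the conversation.

```python
from collections import defaultdict

# Hypothetical per-type memory ceilings in GiB (4 and 40 echo the example above).
MAX_MEMORY_GIB = {"backend": 4, "ai": 40}


def memory_ceiling(labels):
    """Pick the memory ceiling from the workload's (hypothetical) type label."""
    workload_type = labels.get("workload-type")
    if workload_type not in MAX_MEMORY_GIB:
        raise ValueError("missing or unknown 'workload-type' label")
    return MAX_MEMORY_GIB[workload_type]


def cost_by_department(workloads):
    """Sum monthly cost per department label; unlabeled spend stays visible."""
    totals = defaultdict(float)
    for workload in workloads:
        department = workload["labels"].get("department", "UNLABELED")
        totals[department] += workload["monthly_cost"]
    return dict(totals)


workloads = [
    {"labels": {"department": "ads", "workload-type": "backend"}, "monthly_cost": 100.0},
    {"labels": {"department": "ads", "workload-type": "ai"}, "monthly_cost": 50.0},
    {"labels": {}, "monthly_cost": 25.0},  # nobody can be billed for this one
]

print(memory_ceiling(workloads[1]["labels"]))
print(cost_by_department(workloads))
```

The UNLABELED bucket is the situation described above: spend that no business unit can be billed for, which is why a policy that makes the label mandatory is useful.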

Speaker 5

27:39

Can you define for us the difference between a liveness probe and a readiness probe?

Speaker 2

27:44

That is a great question. So a liveness probe works on... I'd start with readiness. A readiness probe is when I'm ready to serve traffic. So let's say I'm initializing myself. I need to start. I need to go create a cache, make sure that I can put things there, and so on. And then a liveness probe is when I'm running: am I running correctly? Can I continue communicating, for example, with my existing cache or whatever it is? I like the

28:13

I think it's too much. I like the simple health check, you know, that goes end to end and does the check. In addition, by the way, I also suggest for companies, it has nothing to do with Datree and so on, but to configure outside health checks that actually go and do a user activity on your services for real, because the worst thing is a customer calling and saying the service is down. You want to be the first to know. So I think that those are the main things I would focus on.
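
The distinction drawn in this answer can be shown with a toy service: readiness answers whether initialization has finished and traffic can be accepted, while liveness answers whether the running process is still functioning. The class and attribute names below are hypothetical, and the cache is simulated with two flags rather than a real connection.

```python
# Toy model of readiness versus liveness; all names are illustrative assumptions.
class Service:
    def __init__(self):
        self.cache_warmed = False     # set once start-up work has finished
        self.cache_reachable = True   # flips to False if the cache connection breaks

    def warm_cache(self):
        """Start-up work that must finish before traffic is accepted."""
        self.cache_warmed = True

    def ready(self):
        # Readiness: initialized and able to serve traffic right now.
        return self.cache_warmed and self.cache_reachable

    def alive(self):
        # Liveness: the running process can still talk to what it depends on.
        return self.cache_reachable


service = Service()
print("starting up:", service.ready(), service.alive())
service.warm_cache()
print("serving:", service.ready(), service.alive())
service.cache_reachable = False
print("cache lost:", service.ready(), service.alive())
```

In Kubernetes, a failing readiness probe takes the pod out of the service's traffic, while a failing liveness probe restarts the container, so the two checks are worth keeping separate.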

Speaker 5

28:45

That's a really good point. I've been in a few outages where everything was working internally but nothing was working externally.

Speaker 2

28:53

Yep, I'll tell you one of my most severe outages. It was so hard to debug. It was at my previous company. It was not an outage, it was even worse. What's worse than an outage? Everything slows down and works really, really badly, and it doesn't break, so you don't really know. And it was a data pipeline that collected two hundred billion events every month, and it had geolocation-based routing, so every time someone would click an ad, it would route to the closest AWS region and send

29:28

the event there. And then we had thirteen regions that would send everything to a centralized Kinesis, and then we would have workers that would process it. Now, in order to do the deduplication and add some attributes, we had a Redis cache, and all the workers would access the Redis cache in order to put in IDs,

29:48

select IDs and so on. And at some point, the amount of messages increased, you know, slowly, slowly, slowly, slowly, and then at some point the memory of the Redis got filled. So what did it do? It switched to swap. The problem with swap is it's slow. And then all the requests started returning really, really slowly. And then

30:08

you don't understand. You think, okay, there's a problem, so you put on more servers, and then they bombard the Redis even more, and then you put on more workers, and you're just trying to debug everything, until finally we're like, opening it, and I was like, oh my god, the Redis is running on swap. And then we had to increase the Redis memory and

30:29

then it fixed it. And if we had a liveness check that said, I'm going to put events into Redis and I expect it to take two to four milliseconds, I'm just making this up, and if at some point this is more than four milliseconds, there's a problem, we would have immediately known where the root cause of this issue was. But we didn't. That's the truth.
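
The check Shimon wishes they had, timing a write to the cache and flagging it when it exceeds a budget, can be sketched as below. The four-millisecond budget is his admittedly made-up figure, timed_check is a hypothetical helper name, and the cache write is simulated with a sleep instead of a real Redis call.

```python
import time

# Hypothetical latency check; the operation is simulated, not a real Redis write.
def timed_check(operation, max_seconds):
    """Run `operation` once and report whether it finished within the budget."""
    start = time.perf_counter()
    operation()
    elapsed = time.perf_counter() - start
    return elapsed <= max_seconds, elapsed


BUDGET_SECONDS = 0.004  # the "four milliseconds" from the story, itself an invented number


def healthy_write():
    pass  # stands in for a fast in-memory cache write


def swapping_write():
    time.sleep(0.02)  # stands in for a write to a cache that has spilled to swap


for name, operation in (("healthy", healthy_write), ("swapping", swapping_write)):
    within_budget, elapsed = timed_check(operation, BUDGET_SECONDS)
    print(name, "ok" if within_budget else "too slow", round(elapsed * 1000, 2), "ms")
```

A check like this turns "everything is slow and nothing is broken" into a specific failing signal that points at the cache.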

Speaker 4

30:58

Yeah, another good reason why post-mortems are so important. I'm curious to know, because here's my anecdotal, I guess, experience: I find that very few organizations do post-mortems well, and if they're doing them, I don't think that they do them in a very effective way. I think they do them in more of a finger-pointing, root cause analysis, who caused the problem and who should

31:23

we fire, right? And I just feel like that's unfortunate. Listen, I'm on the security side, so a lot of the post-mortems I'm involved with are security incidents, so those might be handled a little bit differently than, you know, a typical outage or that sort of situation.

Speaker 3

31:43

But yeah, we will did it.

Speaker 4

31:45

Right, Yeah, I mean so, I guess I should say I feel like most of the time the post mortems just don't happen. I feel like the times that they do, it almost becomes a witch hunt. And those are very rare. But when they do happen, again, that's just my experience, they just just get nasty. So I'm curious. I want to hear a better story because I feel like my experience is not good.

Speaker 2

32:15

I've never experienced anything like it. Thankfully, the organizations that I've worked with, my company, thank god, we did not have a security incident that someone stole all of our records or something, because then I think you're like obligated to take action. And maybe most of the like your cases were those type of severe cases where you know, it's it's just it's like borderline, like federal, it's like

32:45

it's really a problem. And what I'm referring to is more about engineers, and like the Redis story I just told you, who are you gonna fire? No one. It's just gonna make everyone better, you know, and you tell them and then think of how we could have fixed it. And then we implemented the check. Believe me, every time there was a problem, the first thing everyone checked was the Redis. Everyone went to see that Redis is okay. It was like

33:12

a small baby that everyone takes care of. But I don't believe in the witch hunts. I really believe in the culture where people come and they say I made a mistake and people help them understand. And again, as long as there was no negligence, I don't know, you know, something criminal or something like that, people make mistakes. Another story, there was an employee. It was her first day. The company was still using SVN and not Git, and on her first day on the job, she deleted the entire

33:43

SVN tree. Nothing happened to her. Yeah, so I think they restored the backup and it's okay. But I think this is the main difference. I don't know what is your experience personally.

Speaker 4

34:00

I've seen both ways, you know.

Speaker 5

34:01

I remember in years past post mortems were the lynch mob with the pitchforks and the torches, trying to find out who we were going to grab. But I think, in my experience, that's gone away over the last few years, with people being more willing to accept that mistakes happen. But it almost feels like a pendulum, where now it's overly trying too hard,

34:32

I guess to make sure that someone doesn't feel attacked in the post mortem, that you never get to the root cause either, you know, And so I think you got to struggle to find the happy medium there. And I mean, ultimately, you know, in a lot of these situations, someone did do something incorrect and you've got to point that out in order to identify it. And when you point it out, you know, you're not like calling that person out or attacking their skills. It was just a mistake.

35:05

It happened, but it's important to fully understand what that mistake was so that you can build in the systems to prevent it from happening again.

Speaker 2

35:13

Yeah.

Speaker 1

35:14

Yeah, I've been in the situation where and not because of a post mortem, but just because of other things.

Speaker 3

35:21

You know.

Speaker 1

35:22

I had a boss come in once on one of the teams, I was team lead, and he basically walked in the room and said, somebody's getting fired today, right, And you don't want people to feel that, right, because I took him outside and I said, I said, if you're going to pull this, they're all keeping their jobs. I'm just going to quit, right, And it's because nobody should live in that kind of fear. Right, We're all

35:46

trying to work on the same thing. But the flip side is is, yeah, I mean, if somebody is routinely reckless, right, it's always Jim.

Speaker 3

35:54

Right.

Speaker 1

35:54

It's gone down four times this month, and Jim has been the one to mess it up every time, and this is all stuff that we've done training on, and so Jim should know better, you know. The first time, Hey, Jim's a human. Second time, Jim's still a human. Third time, Okay, Jim's a human. But Jim is starting to cause some problems. You can have the conversation about whether or not Jim

36:14

needs to keep his job. But if people feel like they're going to be punished for making a mistake every once in a great while, then you're going to slow the whole system way down. And the whole point, as Shimon keeps pointing out, is we want to keep moving fast. We want to move fast, we.

Speaker 3

36:32

Want to get stuff out, we want to solve problems for our customers as quickly as possible, and at the same time maintain some level of stability.

Speaker 4

36:40

Yeah, really agree. I think the last point I would make is that I think the whole idea of root cause analysis, even if it is one person's you know, at the end of the day, even if you can tie it back to one person's typo or mistake or whatever, I personally feel like the root cause analysis is generally

36:56

flawed in that it's rarely one person, right? It might be one person, you know, again, who typed it in wrong or did whatever, but there's a process breakdown as well, and there was an authority breakdown, or like what Shimon was talking about before, the guardrails didn't exist. You just can't point it at one person. Like, the system broke down. Yes, it resulted in somebody's mistake

37:22

in a manifest file or something like that. But if you take it back, you look at it and you say, well, wait a second, guys, our process isn't all that great. He was trying to do the best he could, he didn't know, or whatever it was. You can't be an expert in everything. He made a mistake, but it's because the entire process broke down,

37:38

not just because one person made a mistake. And I feel like that's the piece that you know, you're trying to do the cause analysis, that's the piece that people just don't think about.

Speaker 2

37:47

I totally agree. Just to finish on this point, the best root cause analysis process that I have ever seen in my life is GitLab. They went down and they opened a live doc that everyone could see, all the customers, everyone, and they had sessions

38:06

that are like open, a Google Hangout or Zoom. I remember what they did, and anyone could join, and it was a totally transparent process of them debugging the outage that they had, and of course afterwards they published everything, including, like, logs, crazy stuff, and like, here's what happened. Here's for transparency, and here's for you to learn how not to make our mistakes. And I really admired it.

Speaker 5

38:31

Yeah, I think there's something to be said for gaining credibility with your customers whenever they find out that there's an outage from you, instead of them telling you that there's an outage, and then you provide real time or near real time updates to them up until the issue's resolved.

Speaker 2

38:48

Definitely.

Speaker 5

38:49

So I think we've all seen scenarios where AWS has had an incident and you find out about it either personally or on Reddit three or four hours before the AWS status page updates.

Speaker 2

39:00

That is, if it did not affect the status page, because that happened as well, you know.

Speaker 1

39:10

Yeah, well, and that's interesting to me too, right, is that sometimes it's, hey, we screwed this stuff up and so therefore our app didn't run. And then, yeah, we see these big companies that use a lot of the AWS or other infrastructure out there on the cloud, and what winds up happening is, yeah, what we're kind of talking about, except they take down the entire us-east-1 region, right, and everybody goes, why is the

39:37

Internet not working? And yeah, it turns out that the Internet relied on that region for a whole bunch of stuff and it's gone. And so those kinds of externalities too, where it goes beyond even your code, your company, your infrastructure, your cloud setup. That's fascinating too, and in those cases, you know, as Will's pointing out, we all kind of want to know, right, because it's affecting everybody.

Speaker 2

40:03

This is why, you know, it was very interesting when Jeffrey said that, like, witch hunt and so on. And I think this is like the defining line between security and infrastructure, where it's like the culture in infrastructure is like, yeah, we all like seventeen out the jazz and no problem. And then when it crosses this line, specifically, you know, privacy, security, you know, personally identifiable information, then

40:32

it's like, okay, something different is going to happen here. And it's interesting because in organizations, like government organizations, there are special ways to investigate what happened. Let's say in the military, when there was an operation, they want to learn from it. So there are two paths of investigation. One path is like the regular path, they investigate and, like, they can put someone in jail and so on. And then there's what's called the professional combat review, where

41:00

everyone can say whatever. They can say, I killed someone and they will not be eligible for anything, like they can't do anything to them, and they have one hundred percent immunity in this process. And this is done in order to make sure that we learn and that everyone say what really really happened, and like everything you say there is classified, it cannot be used against you and so on. So I think it's also an interesting thing to think about in our field.

Speaker 4

41:26

I totally agree. I feel like, yeah, like I said, I think that witch hunt, you know, mentality is a terrible one regardless of what happened. I mean, unless you are talking about, like we said before, malfeasance or negligence or something like that, or, you know, beyond negligence, but, you know, really criminal negligence, which rarely happens. Right, generally speaking, you know, it's a breakdown in process, and just fix it. I mean, just

41:54

work together and fix it. Nobody wants to. I mean, I've just been involved in so many companies post breach, and everybody just wants things to go back to normal. It's like COVID, right, everyone just wants things to go back to normal. Let's just get past this, move on, you know, do what we have to do, but let's stop reliving it on a daily basis.

Speaker 1

42:14

Yeah, all right, Well, I think we're kind of getting towards a place where we can start to wrap up. Are there any other kind of big pieces of advice that we need to put out there before we go to our picks.

Speaker 2

42:27

I want to point out one thing which I really believe in, which is, it's a big word called GitOps. But in general, make sure that all of your configuration and all of your assets, everything is in code and in Git. And if you leave with one thing from this podcast, it's make sure that everything is infrastructure as code and in Git, because then you will be able to at least see what happened and what was the configuration and how did we configure it. So this is my final small remark here.

Speaker 3

42:58

That's good advice.

Speaker 1

42:59

All right, well, let's roll into picks then. Jeffrey, do you want to start us off with the picks?

Speaker 4

43:03

All right. So it's something I was just thinking about. I was actually thinking about it as we're talking, you know, just having our conversation here. So my pick isn't a specific thing. It's more of just an approach. So I get asked all the time, like, how do you, you know, continue to continuously learn and, you know, learn new technologies, sort of stay on top of current threats. Technology in general is just that constantly changing space. But I mean, honestly, I think

43:35

that applies beyond technology. Our world is just constantly changing, and how do you stay on top of that? And how do you do that without spending eight hours a day just trying to read or learn or watch or whatever. And so a couple of things that I have learned. So I think these are more just ideas than actual, like, go and buy a product or something like that. It's, you know, the way that we learn. I think that, you know, different people

44:03

do learn differently. But what I've seen is that, you know, there's so much out there now like on YouTube, for instance, I mean, there's so much content out there, but it takes a long time to go through, especially now that

44:15

everything has ads in it. So now every video takes much longer to get through, right? But if what you're trying to learn is very specific, it's sometimes harder to figure out how to learn it because there are so many blog posts that are too generic or just repeating what everybody else has already said on the topic, and everyone just wants to put it into their blog to try and, you know, get whatever it is, SEO, or get, you know, traction out of it,

44:41

traffic that sort of thing, or you can try and you know, pick it out of like a video, but you know, you could be going through a sixty minute video and trying to figure out where where it is. So I think part of it is and there's no real answer here, but part of it is just figuring out what's the best medium for learning what I'm trying to learn? Am I just trying to get an overview of it of that subject, then maybe a video is good if I'm trying to learn something very specific, maybe

45:04

going to, like, Stack Overflow. I think building that skill set in yourself of figuring out what is it that I'm trying to learn and what's the best way for me to get there is something that we all have to just sort of develop. And I think a lot of us who've been doing this for years, you're probably thinking, yeah, I've been there, I've done that. I think I'm there already.

45:23

But I think for some of the people earlier on in their career, this might be something that you should really be thinking about, is just how to be most efficient learning something new. And obviously it also goes back to figuring out what the best sources are, because, like I said, there's a lot of content out there, and it's just regurgitating what's already out there and sort of dumbing it down sometimes,

45:46

like pulling out some of the details. So those sources, you know, you want to toss and you want to just sort of go to, you know, figure out what are the right sources that that you know give you their information. So that's one piece. The other thing I was going to say is I think sometimes a lot of times we are we have this sort of natural tendency to look for, you know, when we do have to buy something, we think about, what's the cheapest product

46:08

out there? Right, what's and I think so many so often the cheapest product actually takes you more time, more energy, and you end up having to do things over you know, over again or whatever, and it's not the cheapest product, and I think, you know it's it's again, you know, as you go through that learning and figuring out what what is it that I need, don't fall into the trap of just buying the cheapest product. Sometimes it's buying the more expensive product. I mean, sometimes it is the

46:33

cheapest product. Generally if it's a use once type of thing, or, you know, I'm rarely going to use it, great. But if it's something that you are going to continually use, spend some time figuring out does it make sense to invest in something a little bit, you know, better quality. So anyway, those are my two picks, methodologies, whatever, thoughts for the day.

Speaker 3

46:52

Nice Will, what are your picks?

Speaker 4

46:55

All right?

Speaker 5

46:55

So I have been working my way through this book, The Manual from Epictetus. So he was a Stoic philosopher, and I've actually tried to read Marcus Aurelius's Meditations in the past and I'm not really sure how much I actually

47:09

retained from that. So I came across this book and I really like it because it's very short, like each page just has one particular quote or saying from Epictetus, and it's been really helpful to just kind of come to an understanding with the whole Stoic philosophy. And that, in combination with daily emails, the email list from the Daily Stoic dot com, I start each day by reading those, and it's a really good way to kind of level set your mind before you get started in

47:43

a day and put things in perspective, because I think that's helpful, especially with the amount of information, and if you can't avoid the news that's going on every day, it kind of helps you temper that message and keep things in more of a longer range perspective. So The Manual from Epictetus and the Daily Stoic dot com are my picks for today.

Speaker 3

48:04

Nice, what do I have for picks?

Speaker 1

48:05

So Father's Day, I've got a couple of picks for stuff that I did or got for Father's Day. The first pick that I have is my wife's like, Hey, you get to control the TV, which never happens at my house, both because I don't watch a ton of TV and because my kids just are on video games all day during the summer, so you know, I'll go down there and I'll just kind of see what's going on. But yeah, So on Sunday afternoon, I watched Willow, which is one of my favorite old timey movies.

Speaker 3

48:37

So I'm gonna I'm gonna pick that because I enjoyed it. I really enjoyed it.

Speaker 1

48:41

Of course, all my kids the second we turn it on there they sat there for ten to fifteen minutes and then just cleared out of the room.

Speaker 3

48:47

And I'm just like, my guys, is a good movie. Whatever. Whatever.

Speaker 1

48:52

Anyway, the other pick that I have, so my wife, I've been having issues. My grill has been falling apart for a few years, and I like cooking me some meat. So my wife got me a Traeger smoker for Father's Day.

Speaker 4

49:07

Oh nice.

Speaker 1

49:08

And it's got a couple of meat probes in it and stuff like that, which is super nice because a lot of time it's not. It doesn't have bluetooth or anything in it. I know some of the more expensive models do, but it's nice just because you can kind of cook at the temperature and then you know you're ready to pull it out right. And so anyway, made a brisket on it for Father's Day so good.

Speaker 3

49:29

Oh my gosh.

Speaker 1

49:32

You know, I've got some baby back ribs in the fridge that I need to throw on there sooner rather than later. But it's just it's so nice and all of the stuff that you kind of cook on the slow cook end of things, they just come out so so so good.

Speaker 4

49:47

Right.

Speaker 1

49:48

So the other forms of that, I guess, are like the crockpot or the sous vide. But yeah, the smoker's nice too because it gets all this flavor in there. And anyway, yeah, I am loving having it. So I'm gonna pick that. Shimon,

Speaker 3

50:02

What are your picks?

Speaker 2

50:02

So I'm gonna have a barbecue now, fifteen of my friends are coming, and I have a Napoleon grill and I really love grilling, and also I always measure the temperature of the meat and I really, really love it. In terms of my picks, so I found daily dot dev. It's something cool that you can daily dot too. Sorry

50:25

that, no, I'm mistaking several things here. It's called daily dot dev and it's a Chrome homepage extension, so when you open up a new tab, it actually shows you, like, stuff from news and stuff like that, but, you know, targeted at dev. So it's really, really nice because it just gives you, like, a thumbnail and a title and it shows you what's going on. So I thought it's something nice because it's really targeted

50:54

towards our target audience, so it's nice. So that's my small tip besides the GitOps tip that I gave at the beginning.

Speaker 1

51:03

Awesome if people want to connect with you online, where do they find you?

Speaker 2

51:06

Yeah, so I'm at Shimon Tolts on Twitter, and you can always go to datree dot io and see our website there. You can try to message me on LinkedIn, but it's gonna be, you know, we can do a whole session about, like, what has LinkedIn become in that regard. But yeah, so Shimon Tolts on Twitter, that's the best place to reach out. And I look forward to hearing from you and listening to feedback from users because this is what we love

51:37

the most. When people come in, run our CLI, get some stuff, and then they write to us, this is great, but we hate this thing, and why can't I do this and that? And then we talk to them and we hear their feedback and this is how we prioritize our roadmap, so I encourage you to give us feedback about our product at datree dot io, d-a-t-r-e-e dot io. Awesome.

Speaker 3

51:58

All right, well, we'll go ahead and wrap up here. Thanks for coming. This was a lot of fun.

Speaker 2

52:01

Thank you very much for having me. It was really, really fun being here and geeking out about DevOps with you. I feel at home, so thank you very much for having me.

Speaker 3

52:10

All right, well, until next time, folks, Max out.

Transcript source: Provided by creator in RSS feed
