Welcome to the StaffEng podcast, where we interview software engineers who have progressed beyond the career level into staff levels and beyond. We're interested in the areas of work that set staff-plus-level engineers apart from other individual contributors: things like setting technical direction, mentorship and sponsorship, providing engineering perspective to the org, etc.
My name is David Noel-Romas, and I'm joined by my co-host, Alex Kessinger. We're both staff-plus engineers who have been working in software for over a decade. Alex, please tell us a bit about today's guest. Yeah. Lorin Hochstein is a senior software engineer at Netflix, where he works on the Managed Delivery team. And you might also know him from Twitter as norootcause.
I'm excited to share this episode with you all because we touch on the topics of resilience and reliability, which I really think underscore probably most staff engineering roles. So let's dive in. Lorin, thank you so much for taking the time to join us today. I'm looking forward to chatting with you. Could you please start by just sort of telling us who you are and what you do? Sure. So I'm Lorin Hochstein.
I work at Netflix. I feel a little bit like a fraud here because I am not a staff engineer. I'm a senior software engineer at Netflix, because Netflix doesn't really have levels like that. Pretty much everyone, almost all the ICs, are seniors. So I work in the delivery engineering part of Netflix these days, on a project called Managed Delivery. So we're working on a sort of declarative delivery system that is easier to reason about than traditional pipeline stuff.
So there's a lot of interesting problems around automation and a system that's automatically doing things that I found really interesting. And so I kind of convinced that team to take me on. I like to jump around a lot. It's like my third team at Netflix now since I've been here for about six years.
Cool. Yeah. And we totally understand. I think that the title is, it's not as important as like, you know, the work that you do. So really don't worry about it. I think we're going to talk about some really interesting things. So I'm curious, regardless of sort of titles, what do senior software engineers do at Netflix? Is there a typical set of expectations or does everyone sort of have their own spin on the role?
Yeah. So one of the really interesting things about Netflix is that historically they've only hired seniors. So there is not like a mix of juniors and seniors on the teams. So everyone on the team is a senior. And also, you know, we do "you build it, you run it," so everyone on the team who's doing software development is also doing operations. So in a sense, kind of like everyone is expected
to be a leader and do leadership stuff. And so, you know, in addition to doing like everyone does development, right? So I do, you know, definitely a lot of coding. Everyone does design work. Everyone does, you know, coordination with other teams. Right. To do cross-functional kind of cat herdy stuff. Right. Everyone has to do a little bit of cat herding. And so you can sort of choose how much of that you want to do.
And it's sort of up to you based on your particular interest on where you're going to spend your time. Are you going to spend your time thinking about what is the long-term design stuff with this project that we're working on so that we don't...
hit a whole lot of pain in a couple months because it's too hard to change? Maybe you're interested in, okay, I want to coordinate with this other team. You know, we say, what is it? "Loosely coupled, highly aligned," right? That's what we talk about, although, you know, often it's loosely coupled, loosely aligned, right? When you're loosely coupled, you're sort of optimized for moving quickly individually, but not necessarily for alignment. That's much harder to do.
And so there are a lot of engineers now who work more around cat-herding kind of stuff, trying to get the teams to move in the same direction, trying to coordinate. Like all these different teams are doing different efforts, and you want to make sure things are
coherent. I work under the platform engineering org, so all our customers are internal Netflix engineers. And historically, it's been a very sort of disjointed experience for them. Like all the tools are totally different. And so there's a bigger push now to make that more cohesive. But that means better coordination. And since there are no, you know, architects or anything like that, it's sort of tricky to move things forward and get big things done that way.
I actually personally don't do as much of the cat-herding kind of stuff. The stuff that I personally do that crosses teams is like me crossing teams, right? So the way I think about it is, I like to spread knowledge around by moving around the org.
One thing that we have, I think, not been as good at as other companies at Netflix is, like, historically people have not moved around as much. It's gotten a lot better recently. But when I started, there was no internal mechanism to let you move around. It was almost taboo a little bit. And now there's an internal job site, and it's more of a thing. But that was definitely something that was quite different when I first got there.
Interesting. So it sounds like you talked a little bit about how you sort of approach
your job, you know, maybe a little bit differently than necessarily the broad expectations. You know, is there anything else that you feel like you do that's special to the way that you practice being a senior software engineer? Yeah. So, I mean, I'm personally kind of interested in grassrootsy kind of stuff, like bottom-up things, working with the engineers, not necessarily on sort of larger initiatives, but sort of improving things.
So I, for example, run systems reading, which is a paper reading group inside of Netflix where, you know, people get together and talk about interesting papers. It's funny, when I got there, I saw there was a group that had existed at one time. There was like a Google group, but it had lain fallow. Like everyone involved was gone and it had stopped. And so we started that up again.
I did something called, I tried to do something called Oops, where I get people to talk about sort of near misses that have happened inside of Netflix. Not just, you know, the big incidents, but the stuff that didn't necessarily have customer impact but where there are interesting things to learn from.
So that's another example. I brought in Hillel Wayne. He teaches on TLA+. So one of the first workshops he did was at Netflix. I brought him in there. There's interest in that. So these are things that are not like, you know, we're going to move a big rock up a hill, right, to accomplish this. But it's like trying to kind of upskill the engineers inside the organization.
That's a really interesting aspect of, you know, what I would sort of think of as staff-level work, and obviously at Netflix that label doesn't exist, but this idea of knowledge sharing. One of the challenges that I think a lot of organizations face is that they fail to incentivize that properly or fail to reward it properly. First of all, would you agree with that? And do you think that Netflix does things better in that regard? Is there sort of, you know, beyond thinking it's the right thing to do, are there any incentives nudging you in the direction of helping the folks around you level up?
Yeah, so this is another area where I think Netflix has gotten better in the past few years. Like when I first started, the expectation was: we are going to hire seniors, and we're just sort of going to assume that you're going to be at that level, and we're not really going to invest in, you know, upskilling you. Like, we're hiring you to be high skilled, right? And now there is a developer education org, right? So it's changed over time, and now there's more investment in, you know,
I would say like improving the education, the skills of the people inside the org. A lot of that is around like, you know, sort of like classes and training kind of stuff. But the stuff that I'm more interested in is like learning from other people inside the organization, right? I find like I have always personally learned best by like, you know, looking over someone else's shoulder, working next to someone who's really, really good.
You know, if you're not deliberate about that... So, I mean, that happens organically on teams. But if you're not deliberate about sharing that, it doesn't happen. And I don't think there's a huge organizational push for that. That's just sort of something I'm trying to push from the bottom. I mean, one of the challenges is that, you know, there's not enough time to do anything, right? Like everyone everywhere has more work than they have capacity, right? So it's always hard to make space for stuff that does not obviously have, you know, a near-term impact, right? And so spending the resources to do stuff like that is hard to justify. And so I find you sort of
kind of have to do it on the side a little bit, right? Like one of my motivations for doing the Oops work, for getting people to do, you know, write-ups of near misses, is because I want them to teach other people how to deal with operational stuff. So like one thing I do with my team... So before I was on this team, so actually I started at Netflix on what's called the chaos team, right? Like I applied from the website. I was like, oh, I've heard of Chaos Monkey, that's really cool.
And so I got on the team, and I thought that was really interesting. And what happened, though, when I was on that team, you know, building tools that were intentionally causing failures in production, is that I found that it was actually more interesting.
Like the real failures are more interesting than those sort of like synthetic ones we were injecting. And I just sort of got like sucked into that world of incidents. And I'd always look over at the incident management team, which was our sister team. And I was like, oh, that's really cool. And I ended up moving on to that team because I just wanted to spend all my time studying incidents. And those folks are super good at operations. That's all they do, right?
And then I did that for about a year. I was on the incident management team, and I'm like, okay, I want to be a regular software engineer again. And then I came on to this team. And this team did not have as much operational expertise, right? And so because I had been on another team for a year, what I did on my team is I run a meeting called This Week in Managed Delivery Operations, where we talk about all the things that have happened this week that are interesting operations-wise.
And the goal of it is to have people talk through: okay, what did you see at this time? Okay, what were you thinking? Where did you look? And to try to teach people from the experiences of other people, to understand how they were debugging in the moment, which is not typically the way people think in terms of
talking about what happened after an incident. So you have to kind of be deliberate about that. You have to have that as a goal, that you want people to walk through, to see through other people's eyes, to learn from their experience. It's hard to scale something like that. I'd say it's working pretty well on my team. People have gotten much better. It's a lot of fun. But the trick is to sort of infect
the rest of the org so that people start doing that, and sort of spread it that way. But that's not quite a process thing; it's sort of like a habit that has to be developed, right? And so building better habits across an org is, I would say, sort of the kind of interesting, you know, staffish-level work that I'm trying to do in some small way. Yeah. I think that idea
of really trying to change culture... So first of all, the only way you can change culture is by influence, right? You can't mandate a change to culture, right? And so one of the things that Alex and I talk about a lot, in the podcast and otherwise, is that the interesting distinction between staff engineers and more traditional types of leaders within organizations is that, you know, we're explicitly expected to influence folks, but we're not handed the authority to do it, right? And that might seem like a handicap. But I think when you're thinking about changing culture, it's sort of the only way that you can do it.
And so it's fascinating that you sort of intentionally set out this area where the business obviously would get a lot of value if you were able to change the culture toward one that approached operations differently. And actually, I think we'll circle back to what that culture would look like, because I have a lot of questions there. But assuming that such a culture exists, you're trying to shift the culture in that direction. And the only option toward doing that is to influence other folks. Is this something that, like...
Is this like an explicit strategy that you've outlined to management and they bought into it? Or is it more sort of like, oh, Lorin's off doing his thing and we sort of trust him? Like, how does that situation work? Yeah. So on my current team, it's more the latter, more like, okay, Lorin's off doing this thing. On my first team it was like that too. My first team, it was like, okay, Lorin's off doing these weird things, studying, you know, incidents, even though that's not what he does. When I was on the second team, when I was on what's called the core team, that was more explicit. That was more, okay,
you know, I'm going to be doing some resilience-type stuff. Like this is sort of my scope. I had to go on call, like all the engineers on that team at the time. When I was on there, the only way I could really get on that team was to do incident response as well as the analysis. I didn't want to do the response, but I'm like, all right, I'll do it. The other option was to become a TPM. I didn't want to do that. But there it was quite explicit. And we ended up hiring a couple of new resilience engineers onto that team. And, you know, I worked on creating job descriptions for that. And I was involved in hiring those folks. So there it was a little more deliberate.
And then I honestly sort of got burnt out on that. After a year of doing that, I'm like, wow, this is really hard. And, you know, it was too much, I think, to do both the on-call kind of work and the analysis, to move back and forth. So one of the challenges I found at an organization where you have to work at different levels like we do, like one day you're coding and debugging and the next day you're, you know, doing sort of larger-scope project stuff, is I have a hard time moving up and down those levels. And on that team I had a hard time switching back and forth between doing, you know, incident response and then doing the sort of broader analysis, and then, okay, what do we do? How do we look across a whole bunch of incidents and find themes? And what do we do with this?
The problem was, I got what I wanted. And what I found, for the second time in my career, was that I went after something that was hard, and I got it, and I was like, wow, actually, day to day, this is not really what I want to do. And so I went back to a more traditional role, but I still am very interested in that stuff. I sort of have to do it out of the corner of my eye, I think. I feel like I have to do the higher-impact stuff on the side rather than having it be the primary focus, because otherwise it's just too much for me. That's a really interesting insight.
When you felt like you recognized that the role that you had got wasn't what you wanted, how did you go about having that conversation with your manager or your organization to transition to a different role? Yeah, so my manager at the time was really, really great. So he was the one I would say was mostly responsible for me being able to move over to that team. And at the time, when I moved over to that team, the core team,
my manager was then an IC on that team, and he sort of sponsored me to come over. And then, you know, he became manager. And then he created space for these other roles and the more human factors stuff. But he was super easy to talk to. And he knew that I was getting stressed out. And I told him at one point that I couldn't do both. I told him, like, I just can't do both. And, you know, he took me off call for a while. And I just told him that I did not find myself being happy with that. But he's great. You know, we still get along really, really well. And he was just a very, very approachable person to talk to. And he's like, okay, you know, you want to switch teams, we'll make it happen. So I was very fortunate.
Yeah. I think there are a lot of people who listen to this who are staff or senior, and they may not be exactly where they want to be.
And having those kinds of conversations could probably be incredibly stressful, because you have to sort of acknowledge, like, maybe I don't want to do the thing that I'm doing. But in my experience, and I know it's not universal, when you actually just bring these things up and talk to your managers, they're usually very compassionate about these kinds of things. So it's good to hear more examples of that.
Yeah. I mean, I would say that the hardest part of that is actually saying it out loud to someone. Like when you are at the point where you're like, actually, you know, this isn't really what I want to do, you might think that and feel that, but saying it out loud to someone is extremely cathartic. And especially saying it out loud to your manager is a huge thing.
Yeah, I would say I see a lot of people at Netflix switch back and forth, from IC to manager and then back, right? Because they think they want to try that path, and they go there and they're like, well, actually, no, this isn't a good fit for me. And, you know, many of them just oscillate back to IC again.
Nice. I wanted to talk a little bit about sort of like the work that you're doing around resiliency. I thought a really interesting example that you brought up was the oops group or the oops talking. I'm curious about that because in a lot of places I've worked at, if an incident didn't happen, people would have been like, great, we did our job.
And we didn't have an incident. And so do you think you could explain a little bit why an oops, or a close call, is almost as important to learn from as an incident might be? Yeah. So, I mean, it depends on what you're trying to get out of it, right? So to me, you know, I think of incidents as a way of understanding how the system actually works, right? So one of the challenges is, we all work in these sort of huge machines, and we all only see these little tiny parts of it, like our own part, right? And when something unexpected happens, when an operational surprise happens, something happened in the system that somebody didn't expect, right? There was something we didn't know about the system. And that's usually really interesting. And it's very often an interaction between two parts, right? We all have our own parts, and these things sort of fit together, and we don't realize that something weird is going to happen.
You know, even if there's no customer impact, you can still learn just as much about, you know, this thing about your system you didn't know from those sort of close calls. And the other thing is that, you know, I am interested in things like, okay, is there something confusing about a control interface, like an operator interface? You can still learn about that from those, and you can still, you know, deal with problems like that. Or just watching it. I mean, my favorite is watching experts in action, right? And with the close calls, typically there's an expert that caught something early, right? And so I want to be able to learn from their experience. And so if I can get them to capture that experience and I can read over it, then I can learn from that.
Like there was one guy on a team, and this always blows my mind. So there's a service at Netflix, and it's Java-based, and you could actually run a REPL on it. It basically runs a lot of JARs that people... And he connected to it, ran a REPL, and was querying the internal state of it to see that it had gotten into a bad state. And that just sort of blew my mind, that you could do that. That's usually not allowed. You can't run a REPL in production. That's nuts.
But of course, like the Rails people do that all the time, right? So yeah, so like if you're interested in learning in particular about like how experts do things, like I think that like close calls are great or maybe even better. The challenge once again is like...
making space for that, right? It takes time to do that. I mean, we get very few people doing it. And even me, like I try to do them, you know, when they happen. We have operational surprises on my team, and sometimes I get halfway through and I'm like, oh, I'm too busy, I'm not going to finish this up. And I have many, you know, half-finished oopses that I just never ended up publishing, which I feel bad about.
And then the real irony, the scary thing, is that I'll hear something, and I'll go talk to someone and say, hey, you know, I saw this surprise happen. Can you write it up? And the person's like, no, I'm totally underwater. I can't. And I'm like, well, actually, that's really dangerous, right? Like the oopsies that don't get written up are on the teams that are running too close to the margin. And so the places where we have the least signal are the ones where the most danger is. And that's kind of scary. And so, you know, one thing that I've always been really, really interested in is, how do we collect those kinds of signals that we don't usually see about teams that are running into trouble, so that we can act early on them?
Yeah. The thing that I'm struck by is the value of a near miss
is that it's easier to talk about, because you didn't cause an incident, right? And so people are, I think, more open to the idea of talking about it, which is always nice. Do you feel like these things that you have done are influencing the organization that you work in in a positive manner? Yeah. So I think it's very small scale. So I sort of am able to infect people in different parts of the organization, right? I think if you step back, you probably won't see that much. The impact is hard to see. And honestly, sometimes I don't even really know. But I think you can find little clusters, right? And it sort of starts to spread around that way, like putting a drop of ink in the water. And I have found that tends to be the most effective way to make these sort of changes.
And this is well known, right? You need a champion. The only way you really get a change to happen is to have a champion who's pushing it. And so if you can build champions, then you can, you know, sort of orchestrate change that way. And I guess I'm trying to get people excited and interested in this sort of thing. Like the people who write up the oopsies are the people who start to get really into it, right? They're self-motivated. They're like, oh, this is really cool. I like reading about these. You know, I want to write them up myself. And I have an oopsies channel, and I Slack about, like, hey, look at this cool thing that happened here. And so, I mean, I don't know.
Maybe there's very little impact. I mean, it's very easy to say, like, look, you know, I don't really see anything. But I'm hoping that, like, as I sort of infect people, that it sort of spreads that way. Nice. Do you feel like you could name the cultural value that you're hoping to spread throughout the company? Yeah, so...
I don't know if I'd phrase it as a value. Like I'm trying to think how to articulate it. There's definitely a notion... Value, you know, whatever it is, I'm not so worried about the specific verbiage. Sure. So I'm very interested in distributing operational expertise, right? I mean, operational in particular, that's my personal interest, but expertise. So basically at every organization, there are people who are really, really good at stuff, right? And I'm sure you both can name people in your orgs you've worked with that are really good. And my question is always, how do we leverage those people in a way to bring everyone else up, right?
And so that is sort of the value that I push the hardest on, that I'm most interested in: how do you take, you know, people that are good and make them better by leveraging the people in the org who are better, and spreading their skill around, right? We're good as a society, I would say, at training people up from sort of novice to intermediate. But going beyond that, it's not like training, right? The learning is different, and it's more experiential. And so how do you sort of scale up people's experience, scale up their expertise? That, to me, is the grand challenge of improving engineering in an organization.
Yeah. Do you feel like this sort of blocker to going from intermediate to expert
is that complexity is growing at such a rate? Our ability to build capacity is probably what moves us into the expert level, and building capacity into being an expert is such a mysterious thing at this point because the complexity is so high. Like, do you think that's maybe one of the big blockers to sort of leveling up expertise, especially when we work in tech and we work in distributed systems and that kind of stuff?
So interestingly, I don't think so. So, you know, complexity is definitely an issue, and we all face that all the time, right? We are overwhelmed with the amount of complexity. But that's always a problem, and the systems are always too complex for us to really get a true handle on. I think that the primary obstacle to upskilling is carving out time for reflection, right? The way you get better from your experiences, the way you leverage experiences, either your own or someone else's, is by reflecting on them, by spending that time to look back. And when you're stretched, when you don't have time to think about it, then you don't have an opportunity to actually make the most of those experiences and get better. And so that to me is the hardest part. So, you know, capacity in that sense is carving out the time to look back and understand
what happened. So, you know, as an example, once an organization reaches a certain size, migrations are going to be happening all the time, right? At some point it's not, you know, are you doing a migration? It's how many migrations are happening, right? And every organization, once it reaches a certain level, has to get good at that. Doing migrations well has to become a core competency. And I don't know about you, but in my experience, many times the migrations are very painful. But I've found it extremely rare for people to reflect and say, okay, those migrations, what happened? What did we think was going to happen? How did it actually go? What did we learn from them? Usually it's like, okay, it's done. Let's forget about it and move on.
And my pet theory is that one of the reasons we don't really get better over time, even though we think we will (okay, last time that was terrible, but this time, you know, it'll be better), is that we don't spend that effort to learn as much as we can from the previous migration, so that in the future we can design our systems to make the next one easier.
And like, I just see this happening again and again. And like, you know, I have on my list of things to do. I would love to go back at Netflix and like treat as case studies the various migrations that we've done to understand, like, what can we learn from them?
But it hasn't happened. Like, I just, I haven't carved that time out. And that would be an interesting role. But like, and I don't know, I mean, I don't know, you know, if you two have had experience with that, like looking back at migrations. But, you know, I have to say I haven't really seen it happen very much.
Yeah, I think that's a good point. I think, broadly speaking, sort of retroactively analyzing anything is hard to do in our organizations, right? They're trying to move forward so quickly. I know that I kind of harped on this already a little bit earlier, but I'm tempted to go back to it because now that I sort of understand a bit more about the changes that you're trying to drive, for myself and I think probably for a lot of people listening, like...
you're kind of preaching to the converted, right? It's like, yes, let's make more time for this stuff. And I think the sort of refrain, or the hesitation, that I certainly feel, and I think a lot of other people feel, is like,
Sure, but like, how do I justify that to management, right? And so going back to that question of like, what's the story that you tell, right? To a certain extent, you can just kind of do stuff, right? I've been there, done that. Don't ask for permission. Schedule the retro meeting, whatever it takes, right? But like...
You know, it sounds like this has become a pretty big part of your job. And after a point, someone's going to ask, all right, Lorin, what was your performance evaluation for the half, or whatever, right? And it's like, what goes in there? Yeah. So we don't have performance evaluations.
Oh, awesome. Right? Which is kind of wild. Which is actually one of the things I like about the org. But, of course, you have to get resources, right? Like, it's one thing to, you know, on my own do things on the side. But it's another thing to say, okay, now I want to spin up a team to do this. And then it's going to be like, well, you know, are we going to get an ROI on this? Is it worth it? And honestly, I have not been super successful at that, to be honest with you.
So here's my sort of general thoughts on that. And it's funny, because Netflix, at least in my org, platform, has not been as, I don't know, explicit about thinking in terms of, okay, how much progress have we made on certain things? And now we're doing more OKR-ish kind of stuff. So I would say in the future, it's going to be even harder to justify.
You sort of have to... Like, I was fortunate that I convinced, you know, my manager that this stuff was important. They bought into it. And my manager's skip level at the time was also into it. And so you had sort of champions throughout the hierarchy. And this is one of the things with resilience: you have to be able to justify doing things even if you can't show a metric for it, right? That this is the right thing to do. That's one of the worst things, right? Because the metric is basically "bad things don't happen," and the action that you're trying to take is cultural change, so it's a very slow change where the feedback loop is going to be that nothing happened. It's really difficult to measure, right? Yeah. I can't give you a count of the number of incidents that didn't happen, right? That's the metric that I would like, but
I can't. Right. And so you kind of have to, like, infect management. And so the question is, how do you do that? Right. And so there's one thing that I was doing right before I switched teams, and unfortunately didn't finish because of COVID and stuff. Like, it's one thing to look at individual incidents and go into a lot of detail, but I was doing some work with some peers, including Ryan Kitchens, who's still on the team, looking at, okay, let's look across
the incidents that happened this year, and not, like, metrics-wise and in buckets, but what are the themes that we can see? Because we did more, like, qualitative analysis on the incidents. Can we look at patterns? Like, okay, here's something that somebody didn't know, right? So one huge problem that you're going to see again and again and again is that there's some missing bit of shared context. Like, you know, this person didn't know X, this person didn't know Y. And now, in an organization,
like, this is also the hardest problem to solve: getting the information into the heads of the people who need it, the right information. Right. And Netflix is, like, the opposite of Apple. It's super open in terms of information. But that means that you could spend
full time just reading docs and do nothing else, and you still wouldn't get all the information, and you would get no work done, right? And so it's not just an access thing. It is, like, how do you figure out what the important bits are? And that is really, really hard, but it's a critical factor that comes up again and again and again. And so,
you know, and here, one of the other challenges is that this sort of approach is good at finding problems but not necessarily solutions. You're going to sort of try different things. But if you can provide insights to management about stuff that they wouldn't see otherwise,
I think that is how you show that there's value. Like, look at this thing. I saw that this team is starting to go underwater, right? And if we don't do something, then, you know, three people are going to leave and they're going to get burnt out. And if you can provide those insights and you say, like, this is how I know this, and it's sort of, like, qualitative kind of analysis,
then I think you can make an argument for more resources to do that. So you've got to provide the insights. And there's a famous quote by, I think, Danny Kahneman, the psychology researcher, who says that no one ever made a decision based on a number. They need a story. Right. And like what we do, like the results, it's all stories. Right. And so like, if you can tell a good story about why this stuff is valuable, then you can, you know, I hope then you can, you can argue for it. But I mean.
To be honest, very few orgs are able to justify this, and it's hard. And I would not say I've cracked this nut yet. And, you know, management can change, and that's it, the whole thing changes and you lose it. And so it's very, I would say, fragile and precarious and very contingent on the particular details of your work. You can kind of do what you can to foster this sort of...
I don't know, the qualitative analysis of what's going on, but it's easy to lose. So, you know, maybe going back five years or so, every engineer that you asked would agree that developer productivity is important, and your ability to deploy changes quickly to production is important, and your ability to have automated test coverage is important. All these things, engineers probably agreed, and managers who...
came up as engineers probably agreed as well, but they didn't have a way of quantifying it. And then the main change that I think happened in that arena is when Accelerate was published with Gene Kim and Jez Humble and Nicole Forsgren.
And, you know, they sort of coalesced around these four key metrics, and they tried to support that delivery lead time, deployment frequency, mean time to restore, and change fail percentage are, like, sort of the gold standard by which all developer productivity can be judged.
And I don't think it actually made a difference on the ground. Engineers always knew that stuff was important and they continue to know that stuff was important. But I think it made a difference to management because now people could point and say, hey, like, here's the rationale, right? These are now our metrics for the org. And, you know, you guys have to judge us based on...
that, basically. Do you think there's sort of an analogous thing that's possible for resilience engineering? And do you think that's coming? Yeah. So I think the real challenge for resilience engineering is to tell management that you cannot get away
with relying on a small number of metrics to do these sorts of things, right? That's the key thing, and it's really hard, right? So the appeal of metrics, and I think you're absolutely right, like, the findings that Dr. Forsgren published about and wrote up in
Accelerate, right? Any of those things, if you talk to engineers, they would say, yeah, these are important, right? We knew this, right? No one says, oh no, I don't care how long it takes to deploy, I don't mind waiting an extra two hours or a day, right? This was known. But, like,
it is very tempting for leadership, which is trying to oversee an organization that they can't see much of, right? Like, no one knows. I don't know about you, but my manager doesn't know what I do during the day, right? They have no visibility. And it's very, very hard to manage something where you just can't see what's going on, right? And so metrics give them visibility, right? You can say, okay, how are we doing? What's our MTTR look like? How's the trend?
What's the time between when you commit and when it actually goes out to production? Right. But resilience is about the fact that, like, the interesting stuff is, well, I don't know if it's about the fact, but a big part of it, I would say, at least from my perspective, is the stuff that you can't see that way. It's the stuff that is not visible through the metrics. It is, like, the workarounds that people are doing to get those
metrics up, but they're actually taking additional risks because of that, right? Like, what are we sacrificing to improve those metrics? Right. So there's all these signals. And, like, you could say, okay, do a huge number of metrics, but that's not practical for leadership, right? Because if you give them, you know, a thousand metrics, what are they going to do with that? Right. And so the challenge is, how do leaders get signals about what's
important, what's dangerous? Right. So what I worry about is when the metrics are fine but there's a danger, right? So the thing with the metrics: if the metrics are bad, that usually means there's a problem. If the metrics are fine, there can be a problem, but you don't see it. And that's what I worry about the most: the metrics are fine. They're stable.
But there's a problem and there's a risk, and we don't see it happening because people are, you know, putting off this, you know, tech debt or whatever, or sacrificing operational stuff. Right. And so the challenge for leadership is, like, okay, how does leadership get better at collecting those kinds of signals from the organization in ways that are not easily visible?
And that is really, really hard. And that is a very tough sell, because leaders are already completely squeezed, the same way line managers are, the same way we are, directors. Everyone in the chain is stretched to capacity, right? And to tell them, okay, I'm going to make your life harder, right? You're going to have to work harder to figure out new ways to collect information you didn't see before. I'm going to write qualitative reports that are, like, you know,
50 pages or something, or 30 pages, which I've done, rather than give you, like, a graph that shows you our, you know, our products. Like, it's like, forget it. You know, that is a really, really tough sell. And, you know, I'm not a manager. It's a very difficult thing, I think. I'm very comfortable as an IC. But I think that is the pitch we have to make. And that is a very, very difficult pitch.
And if you look historically at, I would say, trends around management, they are usually, like, here is a process that will make this tractable, whereas we are saying, look, you just have to become an expert, and you have to build these muscles and figure out how to talk to people and listen and get information from different sources. And it's just a much tougher sell.
And I don't exactly know how to sell them. We sort of have to, I'm hoping actually, so you mentioned stuff comes up from engineering management. I'm hoping a new generation of ICs that, you know, learn about resilience engineering kind of stuff, when they become managers, they will have these perspectives. But you're talking about generational change. This is, like, a progress-one-funeral-at-a-time kind of thing. It's, you know, maybe multi-generational.
One thing I'm struck by is I often have the experience I think that you're talking about, which is like, I want to protect you from negative outcomes. People are like, great, do that. At the end of the day, let's say you do that, it's hard to prove that you have protected people from X number of negative outcomes. But I think...
It seems like a lot of the folks who are focusing on resilience are starting to understand that the same things that contribute to your resilience contribute to your capacity to do more work, right? Because you talk a lot about Rasmussen and that sort of, like, the boundaries around work. There's, like, a financial boundary, there's sort of other things, but work is constantly pushing us towards an error boundary. And so if you aren't
doing the work to increase your capacity, you're going to hit the error boundary and that's where incidents happen. But the same thing that sort of like pushes the error boundary away from us as we do more and more and bigger and bigger and more complex work is also increasing our capacity to do work.
Do you feel like there's like a story that we could tell that's like more of the positive? Like we are increasing your team's ability to do more things over time. And like, do you feel like that could be a more interesting or a more compelling story to tell?
management than "we've protected you from X number of negative events"? Yeah, I totally think you can. To make a pitch about improving expertise: on my team, improving operational expertise meant that we were, you know, more quickly able to diagnose problems. We spent a lot less time, you know,
debugging certain issues because we've got visibility because of metrics. And so less time spent troubleshooting is more time spent developing and delivering value. Right. And it also, like, you know, the engineers just perform at a higher level, right? We do become more efficient in that sense, right? And so I think the learning aspect, right, the sort of upskilling, is compelling because it's saying, look, we're going to sort of reduce the overhead of the, you
know, firefighting kind of stuff, right? The kind of stuff that drags on us, in a way that, like, it's not just, okay, we're spending a whole bunch of time paying down tech debt, right? That's one way to improve productivity, but that's also a chunk of time. So I think you can definitely make the arguments
around improving expertise, that there's clearly ROI there. There's clearly, like, we are going to get better as an organization, right? Like, everyone knows experts are more valuable. That's why we pay, you know, seniors higher salaries than juniors, right? Everyone is aware of that. And so I think that's an easier pitch to make, about the learning as a mechanism for improving the...
Yeah. Capacity is a good term. The challenge of increasing capacity is then you're just asked to do more, right? Like, so, capacity, and then they throw more work. Okay, you can move faster, then we're going to throw more work at you. And so you just, like, you move the boundary out and you move closer to the boundary. The harder part is,
I mean, it's always an eternal struggle to carve out the additional... Like, the thing about capacity is that you have to keep some of it, right? Capacity means you have some extra juice that you can use when you need it, right? And you're not sort of... and
you need some, like, social, like, organizational capital to justify not running at full capacity. I mean, this is a challenge with the centralized incident management team. Like, these folks just sit around waiting for incidents to happen. You could have them, you know, be software engineers and building stuff, right? But they're extra capacity that's around. One of the things that I think is interesting about this is, like, it sounds like what
we're sort of saying is, like, the work that Dr. Forsgren has done in Accelerate, it's valuable, but it doesn't paint the whole picture, right? What we're saying is there's always going to be this, like, squishy space. The company or the culture that you work in has to value exploring the space constantly, because that's where you're going to find the things that you can't measure, in that sort of squishy space. But, like, maybe there
could be, like, what about things like psychological safety and other things? Like, if you know that a team has psychological safety, maybe they're better at exploring the squishy space. Right. So maybe there's ways in which you can measure or evaluate a team where it's not, like, are you following these 10 metrics? But it's, like, do you have the right environment to create the ability to discover the unknowable at the moment? Yeah, I think...
I feel like psychological safety has really sort of caught on. Like, I think everyone at least, I don't know, pays lip service to it. Netflix is pretty good. I would say people are pretty, because once again, they're all seniors, everyone sort of has strong opinions. Generally, I mean, a lot of people have imposter syndrome coming in, but then people are comfortable disagreeing
with each other and are okay with that. I have tried, I would say, so, like, one of the things, and I blogged about this, I try to make it okay: when I have done incident write-ups, you know, I name everyone's names. I put them there explicitly, because that's okay, right? Like, it's not the... like, if
you're here, then we trust that you are good, right? And so the assumption is that if we want to learn as much as possible, we should assume that everyone who was involved was doing things that made sense to them at the time. And by putting the names in, we're signaling that there's nothing to be ashamed of here.
And I do this myself. But of course, when you push to production and something breaks, you feel terrible. We're humans. We feel bad when we're involved in breaking things. To me, the psychological safety thing, I'm very lucky to work in an organization where I feel it's there. And so it's hard for me to like, I can say these things, but like, I don't work in places where like.
I have read horror stories about, like, government contractor stuff during the healthcare.gov where, like, someone basically got fired, you know, sort of got fired on the spot kind of thing or they accidentally dropped a database, right? Like, there are environments that are like that. But, like, those are...
I don't know what to do about that. Like, I'm fortunate I don't work in one of those. I would just leave. Like, I can choose my environment. I'm very, very privileged about that. I think if you don't have psychological safety, you have a huge problem, right? And it's much harder to do these things unless you're at a place
where you feel like, where I can go and say, like, half-baked things to my team, right? Like, here, I'm sketching out a doc and it's probably all wrong, but we're just going to, you know, talk about it. So one of the things, you know, I've been recently reading a book about
engineers. It's called Designing Engineers, and it's about how engineers actually do design. And the guy who wrote it is a professor of engineering at MIT, and he did some case studies. He went out to various companies and sort of observed what was going on. And what he found was that a lot of the design work happens
in the meetings, in the interactions between people, where different people have sort of incomplete, you know, views of what's going on, and then they talk and they sort of negotiate what's happening. And it's in those interactions between people where the design actually happens. And I think one of the things that I would like to try to push is to think of the team or the org as the unit.
Right. Like, it's not like I'm designing it or I'm operating the system because I'm on call, but we are collectively doing this. And each of us only has a partial view. And, like, it's the emergent result of what we do that is the thing. Right. It's not like, you know, I did this and you did this, but we are doing this together.
And like, you should not expect to, you know, you don't have the whole picture. You only have one perspective. And it's like the interaction of us together that is the thing that is, you know, developing and operating these services.
We talk a little bit about that, but it's a big perspective shift. And even I'm still wrapping my head around that, that this is a joint cognitive system, is what the resilience people would say. This is what we have. It's not just us. It's the system that we care about.
I think it's interesting, though, that they're using psychological safety as a reference because there, too, I would argue that there's sort of like an analog to the Accelerate book, which was Google's Project Aristotle, which was sort of like the seminal thing that translated psychological safety.
into like a concept that all managers could get behind because now there's like a research paper that validates it. And there, again, they actually have metrics that you can use to measure psychological safety in a team. Whereas like to us as engineers, I don't think we would have tried to go about and do that.
It's just sort of a yes or no thing. And I'm still sort of left thinking that like, you know, either there needs to be a sea change in management, which maybe, you know, goes back to what you were alluding to, to like one funeral at a time. But I feel like even then. We might still be looking for something that can translate sort of the resilience culture that you're describing into...
cliff notes for managers that they can measure. I don't know if that's ever going to happen or if it's even sort of realistic to talk about it that way. I do want to hear your thoughts there. I think what we need to do is figure out a way to provide management with a tool for aggregating the sort of massive information that they have access to that is not simply metrics, right? We need to get them an alternative. And I think we don't have a good story around that today, right?
You know, you mentioned just now, like, you know, metrics around psychological safety or whatever. And once again, that's a way of aggregating data, right? And they need that. They only have a certain amount of bandwidth. And we need to figure out a way to provide them or upskill them with a way for them to aggregate the signals
without relying on metrics. And I think we just haven't figured that out yet. No one's written, like, well, I guess there's been a Resilient Management book, but we need more in that direction.
Interesting. So we have a few minutes left and there's two questions that we ask everybody. One of them is just sort of, and it sounds like you've got lots, so I'm excited to hear you answer this question. The question is, what sort of resources have influenced the way that you work? And that can be, like, books, of course, and research papers and conference talks, but it could also be just people that you follow, etc. Sure. Yeah. So, I mean, I got sucked into this two ways.
One is reading a book by Sidney Dekker called Drift into Failure, which really was sort of my entree into this resilience world. And, like, so I have an academic background, and that's sort of one of those more academic books, and it just completely, I don't know, like, I loved it and I strongly recommend it. Another one is John Allspaw,
who has been banging this drum for a very, very long time. And at one point, I was like, okay, fine. Let me start looking into this stuff. And John works with David Woods, who was Sidney Dekker's PhD advisor. So the connection is there. And so, you know, after John constantly evangelizing about this material, I started to read about it. And then I just, like,
I just got completely sucked in and started reading tons and tons of papers. And if you go to resiliencepapers.club, you can see my list of papers that I've collected. I haven't even read all of them, but I've read many of them. And there's just a ton there. Nice. Sidney Dekker was my entryway as well. I love the in the tunnel, out of the tunnel perspective. That was the first thing that really resonated with me in terms of, like, oh, we're looking at this all wrong. So highly recommended.
So our last question is how much of your time do you spend coding nowadays? Quite a bit. I would say roughly half my time is spent coding. So I'm really, like, a traditional, you know, software engineer. It varies. You know, some days it's more docs and meetings, but, you know, I do spend a good chunk of my time still coding. Nice. Awesome, awesome. Well, Lauren, thanks so much for joining us on the show today. It was really lots of fun. Yeah, I enjoyed it.
That's it. Thanks so much for listening to Staff Edge. If you enjoyed today's show, please consider adding a review on iTunes, Spotify, or your podcatcher of choice. It helps others find the show and is a really useful signal to us that folks are finding value in this so that we keep doing it. You can find the notes from today's episode at our website, podcast.staffenge.com. The website also has our contact info. Please don't be shy.