#117 - How to Establish SRE Foundations From Scratch - Vladyslav Ukis | Tech Lead Journal podcast

00:00

I think the strength of SRE is exactly in the ability to bring about that alignment on operational concerns between product management, product development, and product operations, whereas the other methodologies and frameworks that we talked about before have got their strengths in other areas. With ITIL and COBIT, the strength is in designing the IT function of the enterprise. With DevOps, the strength is in the philosophy of product delivery,

00:29

whereas with SRE, I think the strength of the methodology is in the alignment on operational concerns across the

00:37

entire organization. Hey, everyone, my name is Henry Suryawan, and you're listening to the Tech Lead Journal podcast, the show where I'll be bringing you the greatest technical leaders, practitioners, and thought leaders in the industry to discuss their journeys, ideas, and practices that we all can learn and apply to build a highly performing technical team and to make an impact in your personal work. So let's dive into our journey.

01:17

Hello again, my friends and my listeners. Welcome to the Tech Lead Journal podcast, the show where you can learn about technical leadership and excellence from my conversations with great thought leaders in the tech industry. If this is your first time listening to Tech Lead Journal, please subscribe and follow the show on your podcast app and on LinkedIn, Twitter, and Instagram. And to support my journey creating this podcast, subscribe as a patron at techleadjournal.

01:42

dev/patron. My guest for today's episode is Dr. Vladyslav Ukis. Dr. Vlad is the Head of R&D at Siemens Healthineers and the author of "Establishing SRE Foundations". In this episode, Dr. Vlad shared insights on how to establish SRE foundations from scratch, based on his first-hand experience at Siemens Healthineers and the concepts described in his book.

02:09

We started by discussing the basic SRE concept and how it differs from other related concepts such as ITIL, COBIT, and DevOps. Dr. Vlad then explained in depth how an SRE implementation can help to create alignment between the product management, product development, and product operations teams. He also shared the importance of having internal SRE coaches to facilitate this transformation.

02:35

He also explained when an organization can start to realize the benefits of implementing SRE. In the latter half of our conversation, Dr. Vlad walked us through how we can begin our SRE journey, make further progress in the journey, and measure the success of our SRE implementation. Also, do not miss his sharing on how an SRE implementation can help to improve reliability in a stringent industry such as healthcare. I really enjoyed my conversation with Dr.

03:05

Vlad. Though there are quite a number of resources explaining the SRE concept, not many provide a practical,

03:12

step-by-step guide on how to implement it, especially for traditional organizations that need to evolve from their legacy operations and practices. I especially admire Dr. Vlad for writing an SRE book despite not having the experience of working at Google, the place where the SRE concept and practices were created. Dr. Vlad successfully embarked on the journey to implement SRE in his organization and shared his first-hand experience in his book and in this episode. So, if you

03:43

liked this episode, make sure to check out his book as well. And as always, if you find this episode useful, please help share it with your friends and colleagues so they can also benefit from listening to it. I always appreciate your support in sharing and spreading this podcast and the knowledge to more people. And don't forget to

04:02

give this podcast a rating if you are listening on Apple Podcasts or Spotify. Let's continue the conversation with Dr. Vlad after hearing some words from our sponsors. Today's episode is proudly sponsored by Skills Matter, the global community and events platform with more than 100,000 software professionals. Here, members can organize their learning experiences around the technology topics

04:27

they care about most. You get on-demand access to their latest content and thought leadership insights, as well as an exciting schedule of tech events running across all time zones. So whether DevOps or data science is your thing, or you are into functional programming or all things cloud, you can make real connections with people who share your interests. Head on over to skillsmatter.com to become part of the tech community that matters most to you.

04:55

It's free to join, and you will find it easy to keep up with the latest tech trends. Hello everyone. Welcome back to another new episode of the Tech Lead Journal podcast. Today, I'm very excited to meet someone who wrote a book about SRE, which is titled "Establishing SRE Foundations". But he is actually not someone coming from Google. As most of you probably know, SRE is a concept that was introduced and widely popularized by Google.

05:21

But today we'll meet someone who actually didn't come from Google but wrote a good book about SRE. His name is Dr. Vladyslav Ukis. He is currently the Head of R&D at Siemens Healthineers. So I'm looking forward to this conversation, Dr. Vlad, and hopefully we have a great conversation about SRE in general. Yep, definitely. Thank you very much for having me on the show.

05:43

I'm really happy that it worked out and that we can talk about SRE on this really wonderful podcast, which I wasn't aware of before. And now that I've seen all the great speakers from previous episodes, I'm really honored to be able to talk to you and be on the show. Thank you for your kind words. So Dr. Vlad, I always love to ask my guests to share a little bit more about themselves.

06:09

Maybe sharing more about your career highlights or any turning points that you think are interesting for us to learn from. Yeah. So I have spent my career so far in the healthcare domain with Siemens Healthineers. Siemens Healthineers is a Germany-based company that produces all sorts of medical imaging devices. So things like, for instance, computed tomography devices, or magnetic resonance devices, or ultrasound devices, and so on. There is lots of software that is associated with

06:40

those devices. So there is a lot of on-premise software. First of all, there is the software that's driving the devices themselves so that you can do scans on those machines. Then there is post-processing on-premise software, which is software that is used by the physicians inside the hospital departments but not necessarily directly at the

06:59

scanners. Then there is also cloud software. That software usually manages the fleets of those scanners and helps personnel in the hospitals to run the machines more efficiently. So I worked on the on-premise software at Siemens Healthineers. And then, at some point, I moved into the cloud software. That was also the very beginning of the cloud journey for Siemens Healthineers.

07:29

Basically, over the years, we built the first cloud-based platform for medical applications for the entire organization, and now every business unit that wants to build a cloud application builds it on top of the teamplay digital health platform, which we built. And that was also the product where we started exploring how to actually operate such a platform at scale. This is where SRE came in, and this is where the whole story started that led to the book.

08:00

Thanks for sharing your journey. So one thing that I picked up is that you have been in the healthcare industry for a long time, and you actually implemented the SRE concept in it. I think most of us are educated about implementing SRE for cloud-based systems. So hopefully today we'll be discussing a little bit more from your journey of implementing it in the healthcare industry, which is probably quite stringent and kind of mission-critical, right? So you can't allow it to be a

08:23

failure. First of all, how did you come across this SRE concept? Because you didn't come from Google, and obviously you come from the healthcare industry, which is traditional. Maybe tell us a little bit more about this journey. Yep. So as I mentioned before, the build-up of the teamplay digital health platform was new for the company.

08:45

We didn't have cloud operations before, and we also didn't have the skills to be able to operate such a platform because, so far, the majority of the software produced by the company was on-premise. And as a development team, you don't operate on-premise software. That is done by professional services, which is entirely another organization. That was a total paradigm shift for us when we understood what it would actually take to make it work.

09:14

We need the development organization to do operations. So then the next step on that journey was to see, okay, how could we do this? We tried to operate the platform just using the means that came to our minds. So we started setting up alerts. We started promoting the idea that the development teams actually need to be involved in the operational aspects too, and not just in developing features. And we kind of tried doing this for

09:42

several years. But we saw that this is a difficult idea to grow in a development organization that has never done operations before. Because in the minds of the developers who have never touched operations, their job is to develop something. In the minds of the product managers, their job is to tell the developers what to develop.

10:03

Basically, everybody is thinking that there is this operations department that will do operations, and the operations department cannot do this because they don't have the context and they don't have the internal knowledge necessary to operate the services. At some point, I kind of understood that we needed to do something different in order to really get a grasp of the operational aspects of the platform. So I went to the QCon conference in London. There was a whole track

10:32

on SRE. So I stayed nearly the entire day on that track because I knew that was the pain point that the organization had at the time. There I learned a lot about SRE. But of course, that was just conference knowledge. After the conference, I kind of understood

10:48

that this is something real, that there are lots of companies in the software industry at large going that way, trying out SRE, and seeming to have some success with it, however you define it. So basically I thought, okay, that might be worth trying for us as well. Then I came back and sort of started learning more and more about it, reading the Google SRE books, of course. But the trouble was that the Google SRE books, while

11:14

great, I kind of saw, were describing the high end of what you do when you really operate services at Google scale. Then I also read a couple of other books, but what they were describing was already at a level that was far away from where we were. So then I understood:

11:32

Okay, fine, this is great, but we need to start small and really see whether SRE would be applicable in an environment like ours, which obviously has a different culture than Google, a different context than Google, and basically different everything compared to Google. So then we kind of started adopting SRE, slowly but surely, making small steps: implementing a little bit of infrastructure and then moving on to the entire organization.

12:01

Then we were operating the entire platform using SRE. This is how the journey took shape. Really exciting, right? So you learned about it at a conference, you went through like one whole day, probably, and you took it back from the conference, started implementing it, and became an author on it. So I think that really speaks of a tremendous journey.

12:21

But I think many organizations would be in your position as well, right? From a traditional company, maybe from healthcare, maybe from companies that were not born in the internet era. In the first few chapters of the book, you actually compare SRE with the other ways of doing IT operations, like ITIL and COBIT, and also with DevOps in general as a culture.

12:42

So maybe you can give some context and a summary: what is the difference between SRE and all these other frameworks? Yeah. So when I was writing the introduction to the book, I understood that there is not just SRE, which was my focus before, because I was focused for many years on just doing SRE. But then, when writing the book, of course, I needed to zoom out and see what else is out there in order to introduce SRE

13:07

well in the context of the existing methodologies. So then I saw that especially prevalent today is the ITIL implementation. This is basically a framework for how you set up the operations, the IT function of an organization. And then there are also a couple of other frameworks like this, like COBIT and

13:30

others. What they do well is to basically describe how an IT function of a big organization can be established and can be run. But then there are other related things like, for instance, DevOps, right? So DevOps is a very, very hot topic in the industry.

13:51

This is kind of a philosophy of how you run product delivery, especially with short release cycles and feedback along the way from different environments and from different people, including the customer, and so on. But then, generally, DevOps stays at the philosophical level. So basically it's a philosophy of fast, quick delivery and fast feedback loops, especially in the area of operations but also in other areas. It doesn't prescribe enough for you to get started, so to speak.

14:23

So basically, you can think of DevOps as an overarching philosophy of running product delivery, and you can think of other frameworks like ITIL and COBIT as frameworks for running the IT function of the enterprise. So now, where does SRE stand in that context? Actually, SRE really is an opinionated implementation of

14:46

the Ops part of DevOps. So you can also take a very narrow view of DevOps and say that, just based on the name, DevOps tries to bring together development and operations. Although I think it's broader, you can take that assumption and say DevOps, based on the name, tries to unite development and operations so that they work well together. But then again, this is staying at the philosophical level. Development and operations,

15:13

they should be working together. But how should they do that? Should they do pair programming? Should they do pair operations, or whatever? How should they work together? So actually, SRE, as a concrete implementation of the DevOps philosophy, is this opinionated framework to implement the Ops part of DevOps, so to speak. It's got concrete prescriptions of what should be put in place, and also roughly by whom, in order to bring development and operations together.

15:43

This is also coming from the origins of SRE, which is software engineers trying to come up with a framework for operating services. So this is a sort of software-engineer-led approach to operations. This is what SRE is. Then, if you take ITIL and COBIT as frameworks for organizing the IT function of the enterprise, then DevOps as the overarching philosophy of product delivery, and then SRE as a concrete implementation of the Ops part of DevOps, then you can

16:15

say, okay, so actually SRE can live very nicely in parallel with ITIL and COBIT, because with ITIL and COBIT you just set up the IT function of the enterprise, and then DevOps sets the overarching philosophy for product delivery. And then SRE is a piece that enables the Ops part of your product delivery organization. Wow, thanks for this explanation. So we can see clearly where it fits into the overall picture, along with the other IT and operations frameworks.

16:47

So, thanks for sharing that. You mentioned a phrase by which SRE is very popularly described: SRE is what happens when you give operations to software engineers, right? I think it's a quote from Ben Treynor Sloss. I'm also interested to hear from you: what is your definition of SRE, coming from outside of Google? How do you describe SRE? Is there anything that you see differently in terms of perspectives or opinions? I think you can definitely see

17:13

that it's coming from the software engineering world. I would agree that this is what happens when you task software engineers with doing operations. And very interestingly, this kind of thinking, where you task software engineers with doing something, can also be applied to other areas. So for example, being specifically in healthcare, we've got a lot of regulatory burden. So we've got a lot of regulatory compliance regulations that we have to fulfill. Traditionally,

17:42

these regulations are not fulfilled from the software engineer's point of view. They are basically fulfilled from the regulatory person's point of view, and therefore there are lots of documents that you have to write in order to demonstrate your compliance with the regulations. Those documents have to be handled in a certain way in order for you to be able to provide evidence

18:01

during audits that you actually comply with the process, and so on. But if you task a software engineer with designing the regulatory function, it will be done completely differently. So I think you can apply that kind of thinking, tasking a software engineer with designing X, and that leads to a totally different implementation than if it comes from a place that is not inspired by software engineering.

18:25

So I think there's generally great potential to apply that kind of thinking in other areas as well, and we have actually started changing our regulatory process with that mindset. It definitely makes the whole process much leaner, makes the whole process much easier to apply, and then leads to a great acceleration of product delivery, because you are less constrained by the regulatory burden.

18:55

So overall, I definitely would still agree with the original definition from Google, posed by Ben Treynor Sloss, that this is what happens when you task a software engineer to design the operations function. And I definitely would be happy to apply that kind of thinking in other areas as well. Thanks for the interesting perspective. So if you want to improve something, you can also apply software engineering to X problem,

19:21

right? So maybe one day we'll see regulatory compliance engineering, when you apply software engineering to some area. You mentioned that in the past, the Ops team would be its own silo, the development team its own silo, and the product managers their own silo. So, in the book, you explain how SRE is actually able to unify these three different teams. So tell us more about how SRE actually helps bring alignment to these three different areas.

19:47

So, I think the strength of SRE is exactly in the ability to bring about that alignment on operational concerns between product management, product development, and product operations, whereas the other methodologies and frameworks that we talked about before have got their strengths in other areas. With ITIL and COBIT, the strength is in designing the IT function of the enterprise. With DevOps, the strength is in the philosophy of product delivery, whereas with SRE,

20:17

I really think the strength of the methodology is in the alignment on operational concerns across the entire organization. The cool thing about SRE, which also benefited us a lot in the teamplay digital health platform, is that it prescribes that you need to have service level objectives for your services, which are then based on service level indicators.
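To make the relationship between the two terms concrete, here is a minimal sketch, assuming an availability indicator computed from request records. The record shape, the rule that an HTTP 5xx response counts as a failure, the service name `upload`, and the 99.5% target are all assumptions chosen for illustration, not details from the conversation.

```python
from dataclasses import dataclass


@dataclass
class RequestRecord:
    """One served request, e.g. parsed from a log line (hypothetical shape)."""
    service: str
    status_code: int


def availability_sli(records: list[RequestRecord]) -> float:
    """Fraction of requests that did not fail with a server error (5xx)."""
    if not records:
        return 1.0  # assumption: no traffic means nothing was unavailable
    good = sum(1 for r in records if r.status_code < 500)
    return good / len(records)


def meets_slo(sli: float, slo_target: float) -> bool:
    """An SLO is a target value that the measured SLI is expected to reach."""
    return sli >= slo_target


# 197 successful requests and 3 server errors for a hypothetical service.
records = [RequestRecord("upload", 200)] * 197 + [RequestRecord("upload", 503)] * 3
sli = availability_sli(records)
print(f"availability SLI: {sli:.3f}")
print("SLO met:", meets_slo(sli, slo_target=0.995))
```

The indicator is what gets measured; the objective is the agreed threshold it is compared against. Other indicators, such as latency, would follow the same pattern with a different definition of a "good" request.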

20:40

Then, once you've got service level indicators and service level objectives, you automatically get so-called error budgets, which are given budgets for errors. And then, if you exceed your given budget for errors, you can do data-driven decision-making about whether you invest now in reliability or whether you continue investing in features. So basically, you've got these kinds of primitives that you can work with and that you can

21:07

use to guide your thinking. So now, how does SRE then bring about the alignment of different parties on operational concerns? First of all, under the SRE framework, the developers need to go on call for their services, to the extent agreed in the organization. So that means that the developers need to experience

21:29

firsthand what it's like to run their services in production. That can be just a little bit, if the agreement to operate the services is such that development is just a little bit involved in operations. And it may be that the development teams are fully on call for their services. So it can range between, say, 10 and 100 percent of on-call being done by the developers

21:51

themselves. Product management typically is not involved in product operations, but under the SRE framework, they need to be involved in product operations in order to actually make those decisions about when we invest in reliability versus when we invest in product features. But these decisions are then made based on sound data from production, and that's based on the error budget

22:14

consumption. So, if there is no error budget left anymore, what do we do? Product management needs to be involved in the definition of the so-called error budget policies, which state, service by service and team by team, what we do if there is no error budget left anymore. Do we invest in reliability? And if we invest in reliability, then how do we do this? Do we take one engineer to work on reliability, or do we reprioritize the backlog,

22:38

and so on? As you can see, product management becomes totally part of this entire conversation about how to guide investment in reliability. Product operations are usually not involved in enabling others to do operations, but under the

22:55

SRE framework, that's exactly what they need to do. So basically, they don't necessarily need to run services in production, but they need to provide SRE infrastructure to enable the developers to go on call efficiently. So basically, SRE turns the responsibilities of a traditional organization on their head, simply because the developers never went on call, but now they need to go on call, and the operations people never enabled others to do operations, but now they need to

23:23

enable developers. And product management was never involved in operations, but now they are part of the conversation about how we guide the investment in reliability. As you can see, it gives each party a certain set of tasks and aligns them using those primitives like SLOs, error budgets, error budget policies, and so on, so that the whole organization is aligned on those operational concerns, ideally before hitting production.
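The way an error budget falls out of an SLO, and how an error budget policy might turn its consumption into a decision, can be sketched roughly as follows. This is an illustration under stated assumptions: the class and function names, the 25% threshold, and the wording of the three outcomes are hypothetical, and a real policy would be agreed service by service between development, operations, and product management.

```python
from dataclasses import dataclass


@dataclass
class ErrorBudget:
    slo_target: float      # e.g. 0.99 means 99% of requests should succeed
    total_requests: int    # requests seen in the SLO time window
    failed_requests: int   # requests that violated the SLI in that window

    @property
    def allowed_failures(self) -> float:
        """The budget: how many failures the SLO tolerates in the window."""
        return (1.0 - self.slo_target) * self.total_requests

    @property
    def remaining_fraction(self) -> float:
        """Share of the budget still unspent; negative once it is exceeded."""
        if self.allowed_failures == 0:
            return 0.0 if self.failed_requests == 0 else float("-inf")
        return 1.0 - self.failed_requests / self.allowed_failures


def policy_decision(budget: ErrorBudget) -> str:
    """Hypothetical error budget policy; thresholds are illustrative only."""
    if budget.remaining_fraction <= 0:
        return "prioritize reliability work over new features"
    if budget.remaining_fraction < 0.25:
        return "dedicate one engineer to reliability"
    return "continue investing in features"


# A 99% SLO over 100,000 requests tolerates about 1,000 failures.
budget = ErrorBudget(slo_target=0.99, total_requests=100_000, failed_requests=1_200)
print(policy_decision(budget))
```

With 1,200 failures against a budget of roughly 1,000, the budget is exceeded, so this sketch returns the reliability-first outcome. The point is that the decision is driven by production data rather than by opinion.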

23:50

So I think you refer to this as collective ownership in your book. So from three different perspectives: from development, from operations, and from product management. And I like it when you said that now the three different groups are actually aligned on what kind of reliability they would want to offer for their services.

24:06

And if there's something wrong with reliability, for example, your error budget is exceeded, then you make certain decisions: should you invest more in reliability, or should you continue investing in features, knowing that your error budget is exceeded? So, I think this misalignment is pretty common in many organizations, right? So maybe from your journey, tell us, how did you solve this? Because when you introduce SRE, it's a totally radical

24:30

concept, I assume, for traditional companies, and some people would even think, "This works at Google, not for us." So how did you overcome that misalignment and make sure the three different groups could actually align on the same implementation? Yeah, specifically at Siemens Healthineers, first of all, we placed SRE into the portfolio of things that we do in the

24:52

platform. We were in agreement that this is an important topic, that it's important for us to improve the operations of the platform, and that therefore we'd try SRE, because everything that we had tried before hadn't had success. So we'd give it a try, and we put it into the portfolio of things that we'd work on. With that, we were able to allocate resources across the board. We were able to start the development of the SRE infrastructure in the operations

25:22

teams. We were able to start talking to the development teams, literally going team by team, introducing the concept there, and taking each team on its journey towards operations improvements using SRE. That kind of team-based coaching was also key to establishing SRE in the organization, because every team is different. Also, every team is using the SRE infrastructure slightly

25:52

differently. So basically, taking the people who are implementing the infrastructure and going team by team into the product teams, including the product owner, was key to actually being able to establish SRE broadly in the organization. It's very interesting because, yeah, like you said, every team is different. Some may be more advanced in terms of understanding.

26:17

So for those people who haven't really understood the SRE concept, I do have an episode, episode 96, which covers the basic ideas, principles, and concepts in SRE. So today we will not be covering that, because you can refer to that episode. Do check it out if you want to learn about what we are talking about: SLO, error budget, SLI, right? So, Dr. Vlad, you actually tried to implement this. I guess some people may be puzzled about SRE.

26:42

So how do you actually educate people? You mentioned the role of an SRE coach. Tell us more about this role and its significance. Like, how do you actually recruit them? Are they coming from outside, or are they internal? So maybe tell us more about this important role of the SRE coach. Yeah, so if you want to embark on the SRE transformation, if you want to establish SRE in an organization,

27:04

then you need somebody who would be driving the transformation, and this is the role of the SRE coach. That needs to be somebody who's really driven to improve operations and who can understand the organization well enough to know what you need to do to establish SRE there. That will be different organization by organization, and especially it will be different culture by culture. Therefore, I think it needs to

27:30

be somebody internal. I wouldn't think that somebody external would be able to do this, because you need to know too much about the inner workings of an organization in order to make it happen. Also, I think there might just not be enough trust between the teams and an external coach to make that change. And another thing: typically, something like this is a multi-year journey, and usually somebody external is not there to spend years with the organization. So I think it needs to be

28:02

somebody internal, somebody driven to improve operations, and somebody convinced that SRE might be a good approach. That person needs to be a mediator between the operations team implementing the infrastructure, the development teams adopting SRE at different paces, and the management team endorsing the transformation and also providing the

28:24

budget for this. So, I think this is definitely a key role, and it must be there for quite a long time, until SRE becomes the usual way of working for every team, and also to the point where, when a new team gets started, they don't think about how to do operations but just use the existing SRE infrastructure and approach everything from the SRE point of view from the beginning when they set up new services. And you mentioned this could be

28:53

a multi-year journey, right? So maybe from your experience, how many years did it take you to actually start seeing the effects and benefits across maybe one or two teams, if not the whole company? So I think you can see the benefits pretty quickly, because if you are coming from a traditional development organization that has never done operations before, then there will be basic things that you can improve immediately. Just to talk about the basics:

29:20

For example, you need uniform logging across all the services that you've got. Typically, in a traditional organization, you will not have uniform logging everywhere. It will always be, you know, one team does it one way, another team does it some other way, and there is this new service where logging has not been enabled yet, and so on. So basically, you can immediately elevate the maturity a little bit.
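One way to picture uniform logging is a small shared helper that every service uses, so that all log lines carry the same fields. The sketch below uses Python's standard `logging` module; the helper names, the chosen JSON fields, and the service name `image-upload` are assumptions for illustration and do not describe the setup used at Siemens Healthineers.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render every log record with the same JSON fields."""

    def __init__(self, service: str) -> None:
        super().__init__()
        self.service = service

    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "timestamp": self.formatTime(record),
            "service": self.service,
            "level": record.levelname,
            "message": record.getMessage(),
        })


def get_service_logger(service: str) -> logging.Logger:
    """Return a logger that every team configures in the same way."""
    logger = logging.getLogger(service)
    if not logger.handlers:  # avoid adding duplicate handlers on repeated calls
        handler = logging.StreamHandler()
        handler.setFormatter(JsonFormatter(service))
        logger.addHandler(handler)
        logger.setLevel(logging.INFO)
    return logger


# Hypothetical service name; each team would pass its own.
log = get_service_logger("image-upload")
log.info("request served in %d ms", 42)
```

Because every service emits the same structure, later steps such as computing an availability indicator from logs do not need per-team parsing rules.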

29:44

Then the next question is, okay, when do you really start seeing improvements at the management level, where the management team members will say, "Yes, actually, before we had this, and before the teams were doing operations, the services weren't as reliable as now"? Something like this will take longer, of course, but not necessarily that long, I would say, because typically, in order for this perception to develop, you need to see some external

30:12

change. And usually, the external change is measured based on customer complaints, right? So usually what the management team notices is that, okay, somebody's complaining because something is not working. So if the amount of external complaints gets reduced, this is what people will start noticing: actually, this is already kind of bearing some fruit. Although from the SRE coach's point of view, they of course will know:

30:39

Okay, we're actually just at the beginning of the journey. But already, if the development team starts taking production seriously, if they really start looking into how the service is running, even if it's not the best SRE yet, then that usually already brings a huge effect in terms of reliability. Yeah, I can totally relate to

31:00

what you said. So previously, if you haven't done any kind of SRE, there are typically many kinds of implementations, especially across teams in the same company, right? So logging is one thing. Observability, whatever that is, you will have so many tools, and they are not uniform either. It's pretty difficult to trace from one service to another. And I think with the SRE concept, you start maybe standardizing all this,

31:21

having conventions. And I think, like you said, at the end of the day, it impacts the customers, right? User happiness. So from the management point of view, I think once you start seeing a reduction in customer complaints or customer-related issues, maybe that is the sign that you have actually implemented SRE successfully. So, say you have explained the SRE concept to everyone, and everyone seems to have understood the concept. So how do we start laying the foundations?

31:45

If I understand SRE correctly, it always revolves around service level objectives. When we discuss collective ownership, I think from product development, product operations, and product management, we also agree on the same service level objectives, right? So we have the same definition. We have the same understanding of the consequences of not meeting that objective, and things like that.

32:05

So tell us more: what should be the foundational steps for people who are trying to implement SRE in their company or team? I think a couple of things need to be done. One important thing is for the initiative to be taken seriously. That requires SRE to be put onto the list of organizational priorities that the organization wants to explore.

32:30

Every organization has got some list of big topics that they're working on, roughly ten topics that they are working on right now, and SRE needs to be on that list because this is such a profound change. The development team suddenly starts going on call for their services, and the operations team suddenly starts to implement some frameworks in order to enable the development team to do operations. So basically the development team starts doing some operations work

32:58

and the operations team starts doing some development work, and this is a kind of transformation for both parties. Then the product owner is suddenly in the middle of the discussions about reliability, which they are still not used to. So I don't think you can really bring this about if the organization is not serious

33:17

enough about the change. So I think the first thing is to put this on the list of initiatives you are undertaking. Then the second thing: find a possibility in the operations team to implement a little bit of SRE infrastructure, and really the very, very minimum. I would say just start with one SLI,

33:35

say availability, because that's the most common and the most fundamental one. Then you just implement some infrastructure that can take some logs, calculate the availability of a service based on those logs, and then alert on, say, certain thresholds. That thin layer is already enough. Then pick, say, one development team that is responsible for some services and that would be open to exploring a new way to do operations. From then onwards, kind of

34:07

iterate very frequently between what the development team actually needs and what the SRE infrastructure needs to provide. Basically bring together that one team and the operations team implementing the infrastructure, and bring it to a point where that one team would say: this actually makes sense, it's an improvement compared to before, it's better now, and we want an additional ten features from this SRE infrastructure. I think that's a good starting point.
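The minimal infrastructure described here (take some logs, calculate availability, alert on a threshold) can be sketched in a few lines. This is only an illustration under stated assumptions: the `LogEntry` shape, the rule that status codes below 500 count as successful, the service name, and the threshold values are all hypothetical and not taken from the conversation or the book.

```python
from dataclasses import dataclass


@dataclass
class LogEntry:
    service: str
    status: int  # assumed HTTP-style status code in the request log


def availability(entries, service):
    """Fraction of successful requests for one service, or None without traffic."""
    relevant = [e for e in entries if e.service == service]
    if not relevant:
        return None
    # Assumption: anything below 500 counts as a successful request.
    good = sum(1 for e in relevant if e.status < 500)
    return good / len(relevant)


def should_alert(sli, threshold):
    """Alert when the measured availability drops below the threshold."""
    return sli is not None and sli < threshold


# Hypothetical log sample: 97 successful requests and 3 server errors.
logs = [LogEntry("scheduling", 200)] * 97 + [LogEntry("scheduling", 503)] * 3
sli = availability(logs, "scheduling")
print(f"availability={sli:.2%}, alert={should_alert(sli, 0.99)}")
```

A first team could start with something this thin and then ask for the additional features it actually needs.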

34:36

So that's solid already. You've got an owner of the SRE infrastructure, and you've got infrastructure used by a real team. And if they want more, this is actually a good point to say: okay, now we know we are on to something useful, let's go find a second team. Then the second team can already make use of the SRE infrastructure that was built for the first team.

34:58

They will, of course, have a somewhat different context, and this is where you can then decide which additional features to implement in the SRE infrastructure. From then onwards, it's about going team by team, talking to them, and discussing with them whether SRE would be something for them. And the more teams you onboard, the easier it gets, because the infrastructure

35:19

is already more mature every time you take on a new team, because you've got more features there and you know that they are useful. Also, there is a kind of social proof that this new thing really helps with service operations. I would say this is roughly the journey that you would need to take. And this is also where the SRE coaches are essential, especially at the beginning. So I think that totally makes

35:44

sense: you do it gradually, building the infrastructure as well. Don't forget about that, because you can't just tell the teams, 'Hey, you go implement it,' without infrastructure or a platform guided by the SRE best practices, because otherwise people will

35:56

implement it differently. I can totally relate as well, because I did try something similar, leaving the teams to implement it, but I guess people's understanding varies. You start seeing some teams that implement it properly, some teams that probably just follow the request or the order, so to speak, and some that just finish and then forget about it. Which brings us to a key concept of SRE: when your SLO is actually being breached, what do you do

36:22

about it? I think that is maybe the tipping point for how the culture changes, because you may have an error budget that gets breached, but nobody cares about it or takes any action. So from your experience, how do you handle this such that it becomes a tipping point where people understand: okay, this we have to take seriously? Yeah, so there are several steps. Step one is to involve the product owner in the SRE coaching sessions from the beginning.

36:47

The product owner is a full member of these discussions. Number two: once the team is mature enough, come up with an error budget policy. This is something that's really helpful, because it forces the team to think about exactly this question: what will we do when there is no error budget left for this service or for that service?

37:13

The team does this ahead of time, before the error budget is exhausted, and that provokes really good discussions that also touch upon other things the team is doing. For example, the team might be planning to do a workshop in order to plan the next increment. What happens if your error budget gets exhausted exactly on that day? Do you then call off the workshop, or what do you do with your planning? That is something that provokes a big discussion, and you basically need to put it on paper.

37:43

An error budget policy says: if this happens, that's what we do; if that happens, we do that other thing. That's definitely already the next level of maturity. But I think it's important not to start with the whole error budget policy before the team is actually ready for it, because that's a pretty advanced concept. That's also how I put it in the book, where I talk about the basic SRE foundations and the advanced

38:08

SRE foundations. The basic SRE foundations are SLIs, SLOs, and error budgets. The advanced SRE foundations are error budget policies and error-budget-based decision making. So here we are already in the advanced space, and we need to be aware of this.
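To make the relationship between an SLO, its error budget, and an error budget policy concrete, here is a minimal sketch. The calculation follows the usual definition (the error budget is the share of requests allowed to fail under the SLO). The policy entries and their wording are purely hypothetical placeholders; in the approach described above, each team writes down its own "if this happens, that's what we do" rules.

```python
def error_budget_remaining(slo, total_requests, bad_requests):
    """Share of the error budget still left: 1.0 is untouched, <= 0 is exhausted."""
    allowed_bad = (1.0 - slo) * total_requests
    if allowed_bad == 0:
        return 0.0 if bad_requests else 1.0
    return 1.0 - bad_requests / allowed_bad


# Hypothetical policy, ordered from most to least severe:
# (remaining budget at or below this limit, agreed action).
POLICY = [
    (0.0, "stop feature work and prioritize reliability work"),
    (0.25, "review recent incidents in the next planning session"),
]


def policy_action(remaining):
    """Return the first agreed action whose limit has been reached."""
    for limit, action in POLICY:
        if remaining <= limit:
            return action
    return "no action; continue as planned"


# Example: a 99% availability SLO over 10,000 requests allows about 100 failures.
for bad in (50, 80, 120):
    remaining = error_budget_remaining(0.99, 10_000, bad)
    print(bad, round(remaining, 2), policy_action(remaining))
```

Writing the rules as data makes the point from the conversation visible: the decisions are agreed ahead of time, before the budget is exhausted.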

38:25

It doesn't make sense for the SRE coaches to force the discussion about the error budget policy, because either people will just follow along and forget about it the next day, or they will not be able to talk about it because they're just too far away from it. For you to talk about the error budget policy, you already need everybody's understanding of the SLIs, everybody's understanding of the SLOs, and everybody's understanding of

38:48

the error budget, and the SLOs need to make sense. You need to have already gone through several iterations of the whole thing, and then you need to reach a point where the team says: okay, now we've exhausted the error budget, but we haven't yet talked about what we'll do now. This is where you bring in the discussion, not before. Once the error budget policy is in place, that's good enough to start with, but then the real

39:10

test is: will that error budget policy be executed once it really reaches the point where there is no error budget left for the service? That's important. Does the team really follow what they have written down? That's the next step. And then the step after that is error-budget-based decision making. I think that's the coolest step, because it's at the top of the SRE concepts pyramid. In the book, I put all those concepts in a pyramid, and that's kind of

39:36

the tip of the pyramid. Basically, what you do is start tracking your error budget consumption, or your SLO adherence, over time. You divide the time into error budget periods; for example, one error budget period could be just one month. Then you plot all your services and their SLO adherence by error budget period. So you can have a graph where you've got all your

40:00

services on the y-axis, and on the x-axis you could have, say, half a year's timeline, which would be six error budget periods. Then you see where you broke your SLOs and where you didn't break them, and so on. If you broke the SLO for an error budget period, it's actually useful to see whether you broke it by just a couple of percent or whether you really broke it significantly.
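The graph described here, with services on one axis and error budget periods on the other, can be approximated as a plain text grid. The following is a sketch only: the service names, SLO targets, and measured values are invented for illustration, and the real dashboard presumably looks different. Each cell shows how much of the error budget was consumed in that period, so a value above 100% means the SLO was broken, and the size of the number shows by how much.

```python
def budget_consumed(slo, sli):
    """Error budget consumed in one period as a fraction (1.0 = exactly exhausted).

    Assumes slo < 1.0, since a 100% SLO leaves no error budget at all.
    """
    return (1.0 - sli) / (1.0 - slo)


def render_grid(slos, measured, periods):
    """Render services (rows) against error budget periods (columns)."""
    lines = ["service".ljust(12) + "".join(p.rjust(8) for p in periods)]
    for service, slis in measured.items():
        cells = []
        for sli in slis:
            pct = round(100 * budget_consumed(slos[service], sli))
            mark = "!" if pct > 100 else " "  # "!" marks a broken SLO
            cells.append(f"{pct}%{mark}".rjust(8))
        lines.append(service.ljust(12) + "".join(cells))
    return "\n".join(lines)


# Hypothetical data: three monthly error budget periods for two services.
periods = ["Jan", "Feb", "Mar"]
slos = {"scheduling": 0.99, "reporting": 0.95}
measured = {
    "scheduling": [0.995, 0.985, 0.992],
    "reporting": [0.97, 0.96, 0.90],
}
grid = render_grid(slos, measured, periods)
print(grid)
```

Showing the consumed percentage rather than a plain pass or fail keeps the detail mentioned above: a period that barely passed and one that passed comfortably look different.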

40:23

The same also applies if you are fulfilling your SLO: you can fulfill it with a lot of error budget left at the end of the error budget period, or you can nearly exhaust your error budget and therefore nearly break the SLO. Those details are useful as well. But the cool thing about that kind of high-level dashboard

40:42

is that especially the product managers are enabled to work with it, because then you can say, based on experience over time: we actually thought that we would have this certain service level, but we are not fulfilling it, and therefore we need to prioritize some significant work in order to build reliability into this particular service. The cool thing is that you then make those decisions based on data and not just on

41:07

the opinion of the architect or the opinion of the developers, which, depending on the team culture, may or may not weigh a lot with the product manager. So I can totally understand. Once you have this cool dashboard, you can see the different services, how they're performing against the SLO, their error budget, whether they are breached or healthy. I aspire to go that

41:28

way one day. Like you said, it's a data-driven approach to knowing where to improve and when to improve across different teams. And don't forget, the SLO here is a representation of users' happiness. It's not some internal metric that you just want to capture; it's actually a representation of the

41:43

users' happiness. So when you see it being breached, basically the users should be unhappy as well. Which brings me to my next question: I always assumed that the healthcare industry must be very reliable. All these devices that you mentioned should function perfectly, but we know in SRE that one hundred percent reliability is impossible. Maybe you can tell us more

42:05

from your experience implementing it in healthcare. Is healthcare always like 99.9% reliable, or is there any kind of explanation that you can give here? There are different devices and associated software systems in healthcare. There are hardware scanners; there, of course, the availability and reliability requirements are the highest.

42:28

There are very strict medical device regulations that oblige you to perform certain tests and so on in order to reach that high level of reliability. Then there are so-called post-processing workstations. These are still on-premise software systems that of course also need to work reliably, because they run in a hospital. If such a system doesn't work, then that can block the entire workflow. But at least it doesn't harm human life if it doesn't

42:56

work. Then there is the cloud software, which are services that enable you to run your fleets of scanners more efficiently. If that doesn't work, then at least you don't immediately block the work in the hospital department, but you would block, for instance, the operations person of the hospital, who is trying to optimize the assignment of patients to the scanners in order to be more efficient. So there are different kinds of

43:25

levels of reliability that you need to have there, and all that is governed by the medical device regulations. They are also a bit different depending on the criticality of the system to human lives. I'm very much interested in the hardware scanners, which you mentioned have to be the most reliable. What happens when the error budget is about to be breached, or maybe there's an alert? Are people scrambling because it could affect people's

43:50

lives, right? So tell us, when this situation happens, what actually happens? Well, first of all, we have so far only applied SRE to software, but I think it's a good idea to also think about how it could apply to the hardware. The scanners themselves have got lots of checks and balances in order not to harm the patient. For example, think of a computed tomography (CT) scanner.

44:15

That's a scanner that emits some acceptable dose of radiation into the patient in order to acquire images of the patient. Of course, if the scanner noticed that the patient would be harmed because they are about to be exposed to too much radiation, then the machine would stop the scan. The staff would be notified immediately and would then, for example, remove the patient from the room where the scanner is installed.

44:42

And so on. So there are definitely lots of precautionary measures there. Additionally, there is what the cloud software, where we apply SRE, can do. We have got, for example, dose management software. Dose management is a cloud-based application. It cannot monitor in real time what's going on with the scans being done right now, but it can

45:06

monitor things like the following. Imagine you are a patient who requires several scans. That means that your accumulated radiation dose over time can actually exceed a certain threshold that is also set by government regulations. So you need to keep track of the radiation not just during a particular scan, but also overall, over a period of

45:32

time. If you've got some serious disease, then you might be prescribed certain scans within, say, a year or so. What that software can do is keep track of your accumulated dose and then also issue warnings to the hospital staff: this patient has already received this much, so be careful, and so on. As you can see, this is a very interesting industry with different levels of required reliability. That can range from really

46:02

life-critical right now to: okay, it's not life-critical right now, but you still need to have some monitoring of the patient in order not to expose the patient to uncontrolled treatment. Thanks for sharing all this. The reason I ask is because some people might have the perception that this may only work for certain cases, maybe internet-based software, right? But actually you can apply it to almost anything, especially in technology.
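The cumulative tracking mentioned earlier in this answer (sum the dose over a period of time and warn the staff when a threshold is reached) boils down to a rolling-window sum. The sketch below only illustrates that idea. The data shape, the window length, the threshold, and the unitless dose values are hypothetical placeholders and do not reflect any real regulation or the actual dose management product.

```python
from datetime import date, timedelta


def accumulated_dose(scans, today, window_days):
    """Sum the dose of all scans within the trailing window (both ends inclusive)."""
    start = today - timedelta(days=window_days)
    return sum(dose for day, dose in scans if start <= day <= today)


def dose_warning(scans, today, window_days, threshold):
    """Return a warning text for the staff, or None if below the threshold."""
    total = accumulated_dose(scans, today, window_days)
    if total >= threshold:
        return f"warning: accumulated dose {total} reached threshold {threshold}"
    return None


# Hypothetical scan history: (scan date, dose in arbitrary units).
scans = [
    (date(2020, 1, 1), 9.0),   # outside the one-year window below
    (date(2022, 1, 10), 5.0),
    (date(2022, 6, 1), 4.0),
]
today = date(2022, 7, 1)
print(accumulated_dose(scans, today, 365))
print(dose_warning(scans, today, 365, threshold=8.0))
```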

46:25

So let's go back to the discussion about implementing SRE in your organization. Let's say you have done it and the teams have already implemented it. From your view, how do you actually measure that this SRE transformation is successful? Are there any indicators that you check periodically to say: okay, this SRE transformation is actually progressing really well? Maybe some tips here. I know that you have some tips

46:48

in the book as well. Maybe you can give a summary of what kind of measurements you do in order to find out that your transformation is successful. So, especially at the beginning, it's a good idea to just see how many services the teams are onboarding onto the infrastructure. This is not outcome-based, of course, because you still don't know whether it improves reliability. But at least these output-based indicators are certainly good in the beginning.

47:14

Because if teams are not willing to put their services onto the SRE infrastructure, then there's no use in the infrastructure. That's one thing. Once you see that the number of services on the infrastructure is growing, that's good; it means teams are finding the infrastructure useful. Then the next thing is: do the teams define SLOs? Is the number of SLOs actually growing? Then the next thing: now that the SLOs are there, are they being fulfilled? If

47:43

there are SLOs but the majority of them are broken, that means the teams just forget about them, or they basically don't make sense. So then you see: what is the percentage of SLOs that have been fulfilled? Then the next step is coming back to those customer escalations that we were talking about earlier. It can easily happen that all

48:01

your SLOs are perfectly green, so that's all cool, but the customers keep calling. These are the measures that you can take in order to monitor the progress of the SRE introduction. I like the way you describe it: your error budget can be all green, but the customers keep complaining. I think that is a big sign that you might have missed the most important SLOs of all, right?
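The progression of indicators described above (services onboarded, SLOs defined, share of SLOs fulfilled, and customer escalations as the final cross-check) can be collected in one small report. This is a sketch under assumptions: the `ServiceStatus` fields, the key names, and the sample numbers are all made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class ServiceStatus:
    name: str
    onboarded: bool       # is the service on the SRE infrastructure?
    slos_defined: int
    slos_fulfilled: int   # SLOs met in the last error budget period


def adoption_indicators(services, customer_escalations):
    """Compute the output-based and outcome-based indicators in order."""
    onboarded = [s for s in services if s.onboarded]
    defined = sum(s.slos_defined for s in onboarded)
    fulfilled = sum(s.slos_fulfilled for s in onboarded)
    all_green = defined > 0 and fulfilled == defined
    return {
        "services_onboarded": len(onboarded),
        "slos_defined": defined,
        "slos_fulfilled_pct": 100.0 * fulfilled / defined if defined else None,
        "customer_escalations": customer_escalations,
        # All SLOs green while customers still escalate hints at missing SLOs.
        "slos_may_miss_customer_pain": all_green and customer_escalations > 0,
    }


# Hypothetical snapshot of three services.
services = [
    ServiceStatus("scheduling", True, 4, 4),
    ServiceStatus("reporting", True, 2, 2),
    ServiceStatus("billing", False, 0, 0),
]
report = adoption_indicators(services, customer_escalations=3)
print(report)
```

Tracked period by period, the first two numbers show adoption, the third shows whether the SLOs are taken seriously, and the last flag captures the "all green but customers keep calling" situation.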

48:23

Because the true measure of implementing SRE correctly is that your customers are happy, and your error budget is a representation of that. In the book, you also mention a couple of other things, for example the perception of the partners that you work with, not just the users. I think that also matters, because sometimes, if you are a provider, right, some partners connect to you, and it also matters whether they complain or not. So thanks for all these

48:44

measurements. Dr. Vlad, it's been a great conversation so far, but unfortunately, due to time, we have to wrap up. Before I let you go, I have one last question that I always love to ask my guests, which is to share your version of three technical leadership wisdom. Is there something that you want to share with us here, Dr. Vlad? Let me think of something that would be in the context of this SRE transformation but still applicable

49:06

to general technical work. One thing that's important is acting in context. I think that's important because there are very many technologies, methodologies, and so on, but it's the context that actually decides whether this particular thing that you have heard about, read about, or thought about would actually fit. So I think this acting in context is an important, I'd

49:30

say, valid principle to follow. Another important aspect, I would say, is to practice servant leadership. That means, on the one hand, trying to show the way to the organization and to the teams, but on the other hand also being very well aware of their context and therefore inviting them to go on a certain journey instead of coercing them to adopt a particular thing. So on the one hand, lead, but on the other hand, do this in a servant way. I think that's important. The third thing would be,

50:07

I'd say, all these things that we've talked about are not easy. For the developers, it's not easy to just jump on call, right? It's a totally different game if you've never done this. For the operations people to suddenly start acting as real software developers, providing frameworks to the other teams and so on, is a totally different world. And for the product managers to suddenly be involved in operations,

50:33

it's also, you know, outside of their world so far. So I think acknowledging that this is not easy, empathizing with the people, and specifically looking for ways to motivate people is the third point. I think this is really important to make any change stick. Especially on this SRE transformation journey, there are so many ways where you can point out very small wins. I think really pointing out those small wins, that now, as a team, we've done this, really motivates

51:04

the people. So taking this seriously from the SRE coaches' point of view, and really seeking things that people can be thanked for, and thereby driving the motivation. I think that's really important. Well, I love the last one. So first you have to acknowledge people's situation, empathize with them, find motivations, and celebrate the small wins. I think that's always important for any kind of leader. Dr.

51:29

Vlad, if people want to learn from your journey, or maybe they want to establish the same SRE transformation in their organization and want to learn from you, is there a place where they can find you online, connect with you, or ask questions? Yeah, definitely. I can easily be found on LinkedIn. On LinkedIn, there are several conversation threads about the book. I think the best way would be to just join those threads or start new ones where

51:52

we can discuss those things. And also, thanks very much for the conversation and the questions. I think they brought out really good aspects, and it was really fun to talk about. While we were doing this, I was thinking that there is another myriad of things that we could talk about, because that topic is vast. Yeah, when I saw your book, it is very thick, with lots of good information there. I haven't even finished reading it, but people, do check

52:15

it out. I think it's really transformation-based learning, right? It provides you the rationale for how you implement SRE, the step-by-step journey, and at the end measuring the transformation, making sure that you are successful in your journey. So thanks, Dr. Vlad, for this conversation. I really loved it. Good luck with the future improvements of your SRE transformation in your company. Thank you very much.

52:38

Thanks a lot for having me, and thanks to everyone who took the time to listen to this episode. Thank you very much, everyone; let's improve operations together. Thank you for listening to this episode and for staying right until the end. If you enjoyed it, I would appreciate it if you shared it with your friends and colleagues who you think would also benefit from listening to this episode. And if you're new to the podcast, make sure to subscribe and leave me your valuable

53:05

review and feedback. It helps me a lot in growing this podcast. You can also find the full show notes of this conversation on the episode page at the techleadjournal.dev website, including the full transcript, interesting quotes, and links to the resources mentioned in the conversation. And lastly, make sure to subscribe to the show's mailing list on techleadjournal.dev to get notified of any future episodes. Stay tuned for the next Tech Lead Journal episode.

53:34

And until then goodbye.

Transcript source: Provided by creator in RSS feed