The Evolution of Disaster Recovery Strategies in Modern Cloud Environments - DevOps 186 | Adventures in DevOps podcast

00:14

What's going on, everybody? I'm your host today, Will Button, for Adventures in DevOps. And before we get started, I do want to remind everyone, or tell you for the first time maybe, that we are now doing these shows live in addition to the podcast, so if you do want to catch it live,

00:32

we're recording on Tuesdays at nine thirty Central Time. That's GMT minus six, and actually it's like nine thirty-ish, because we start at nine thirty, but then usually have a little bit of a pre-chat to get our guests up to speed on how the process works, and then I click the go-live button. So shortly after nine thirty Central Time, you can catch

00:54

us live on Facebook, LinkedIn, and YouTube. And speaking of our guests, today I have Sagi Brody, Chief Technology Officer of Opti9, consultant, and former software developer who, according to his own words, still has code in production that probably shouldn't be, and I can definitely relate to that. Today we're going to be talking about resilience and disaster recovery in the cloud, how it's relevant, why you still need it, and then dig into that. Sagi,

01:26

welcome to the show. Thank you, great to be here with you, Will. I'm excited to speak to a technical audience, which is not always the case. So, you know, we'll see if I get called out on anything. But it's great to be able to get as deep as I want to without having to pull myself back. Right. That's always my biggest fear in doing talks and live shows like this. It's the part of the show that I call stump the chump, where you say something wrong and somebody just calls you

01:57

out on it. That pressure can be very useful if harnessed the right way. I used to force myself to volunteer to speak on highly technical topics at conventions that I knew nothing about, and I had like maybe two months. And so these topics that I was putting on the bottom of my list, that I knew I needed to learn but was just dragging my feet on, you know, now I have a time and day where I'm the expert and I'm going to potentially get stumped, and so,

02:27

you know, fear of embarrassment is a very great motivator. Oh, for sure, I like the approach. That's bold, but that's definitely going to be effective. I like it. I may steal that from you. Yeah, cool. So tell our viewers a little bit about your background and how

02:45

you got to be the CTO of Opti9. Sure. Yeah, so, like probably many who are listening, I got into this industry as a teenager, just, you know, screwing around with computers and the Internet and having fun, and that sort of curiosity somehow turned into a job, which is great. So it was the late nineties, and I co-founded a company called Webair.

03:15

We were kind of in the right place at the right time, and we just started hosting websites for our friends, so you can kind of think of it as a hosting company. So we were working with technologies like FreeBSD

03:25

and Apache and, you know, the typical sort of web stack. And this was great because it was before Google and before things like PHP, and before things like customer service or support, and it was kind of like sink or swim, and so we just built everything ourselves and just sort of scaled up with our customers, grew that business, sort of pivoted towards enterprise about maybe ten years after that, and started to focus on management of private cloud deployments.

03:54

Management of public clouds, orchestration, and sort of owning the glue in between these hybrid cloud environments. So a lot of networking, which is always fun, and then we got into BCDR, so disaster recovery as a service, backups. Networking was always our secret sauce, which is fun, you know, saying things like: it's great that you're copying your data somewhere. How are you going to consume it? You know, what does consumption

04:20

look like? You know, these are networking problems. So I've always been a big network guy. Eventually we sold that business to private equity. I stayed on as the CTO and we rebranded to Opti9 after we bought or merged with two other companies. And I'm still here, mostly in more of a, you know, Chief Technology and Product Officer role, which is starting to become a thing now, you know, because CTO is so vague, there's different personas. So really my role is

04:50

sort of product focused, company, you know, sort of customer focused. What are we building for customers, how are we helping them? How are we working within the bounds of the third-party technologies that we use? From an integration perspective, how do we push down bloat and the multi doing some

05:06

consulting on the side and just trying to stay busy. Right on. I think that's a cool perspective, or a cool journey, because for a lot of us, we end up spending a few years at a company and then jump to another company, and so we end up going from company to company. I've done it myself, just to, well, let's be honest, I've done it for salary increases, but also because of the opportunity to work on different technologies that I

05:36

wanted to go deeper into. But I think that's a cool path and an unusual one, in the fact that you've been with the same company even though there have been mergers and acquisitions along the way. So you've really built your skill set in driving the company as it matures versus driving your skill set as you mature. Yeah, you know, I would say we were

06:06

a service provider, and a service provider is interesting. It's very different from an enterprise environment, and a lot of people don't realize the nuances and differences. It's funny, whenever a vendor used to call us and try to sell us something, we'd be like, all right, cool, do you have multi-tenant capabilities? No? I'm like, okay, do you know what a service provider is? And then we'd say, well, listen, if you can make a service provider

06:30

happy, you can probably make anybody happy. So you know, we'll tell you what you need to do. But but you're absolutely right. What was great for us was that our customers were, you know, sort of in multiple industries and verticals, trying to run different applications and solve different problems.

06:47

I'd say the market is more sort of segmented now and matured now. But what's great about being a service provider is your customers are coming to you with the problems that need to be solved, with the use cases. As the industry changes and grows, and as there's new shiny objects, they're coming back to you and pushing you and saying, hey, we heard about this really cool thing. We want to use it. Can we

07:11

use it? And it's like, uh, you know, it's like, yes, of course, and then we'll figure it out on the back end. Or no? It's like, do you want to lose a customer? And listen, if there's value in what they're saying, and you've heard it more than once in the last four weeks, then you listen to it. And if you think someone else can benefit from it, then you listen to it.

07:30

So our customers have been pushing us always, and that's sort of driven innovation, and so we never had to. Okay, maybe you're not, like, going way outside of your target zone, but you're constantly pivoting. You're constantly trying to keep a leg up on your competitors, because if your customers are asking you for that, then competitors are hearing the same thing.

07:50

And so I think where we've done well is, you know, we've never owned our own IP, but we've owned our own glue, and that empowered us to be able to mix and match best of breed and just innovate and be at the forefront. So I agree with you.

08:07

I think service providers are a great place to be and even for you know, listen, I was a founder, so maybe it's a little different, but within our environment, over the years, you know, a new problem to be solved would come in and maybe one employee would sort of just jump on it and just be like, hey, you know that's cool, I can do that. And as a smaller company we'd be like all right,

08:28

you know. Like, I'd say the silly example is when, you know, NoSQL platforms got big and people wanted us to manage Mongo, and we had one gentleman who was like, yeah, I'd love to do that. It's like, all right, you know, great. Next week, you know, that guy is the Mongo expert. Everything, any issue, goes to him. And so there's just a lot of opportunity for self growth there if you can recognize and take it.

08:58

Yeah, for sure. And I think that's the key to longevity in this space: a desire to continually grow and learn new skills. Yeah, but there's also some old skills that we can't let go, one of those being disaster recovery and backups. And you mentioned it before we started recording. It's one of those that seems to have been pushed onto the back burner over the last ten years or so. But doing so has some definite impacts to your business. So talk to us a little bit

09:37

about resilience and DR in the cloud. Yeah, I'd love to. So, you know, we have traditionally provided a disaster recovery as a service offering for, you know, I don't want to say legacy, but sort of non-cloud-native applications, so things that are not necessarily running on AWS or Azure. And so over the years we would look at maybe a deployment running on VMs, running on VMware or Hyper-V

10:09

or KVM. We would basically provide an entire ecosystem needed so that your applications would continue to operate despite some sort of outage at the production site, or a cybersecurity event, or somebody fat-fingering a database. And so, you know, what that looks like is obviously replicating the data, but more important than that is sort of understanding what consumption looks like. How are your users going

10:41

to consume the application from the DR site as they did in production? And that's a big sort of networking task or challenge to deal with, and then also dealing with dependencies, like what about all these shared services that they're relying upon: authentication or networking, DHCP, IPAM, stuff like that. And so we take ownership of authoring the run books, not only for failover and failback, but what if it's just one application that you want to

11:09

fail over, and what do you do with shared resources? You know, if you have a legacy database server, which is a weird thing to say, and it's hosting databases for ten different applications, and you want to fail over one, do you bring the database server with you?
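The shared-resource question raised here is essentially a dependency-mapping exercise. As a rough illustration only, the following Python sketch splits one application's dependencies into those it owns exclusively and those other applications also rely on. The inventory, the application names, and the `failover_impact` helper are all hypothetical and are not taken from the conversation.

```python
from collections import defaultdict

# Hypothetical inventory: application -> shared services it depends on.
DEPENDENCIES = {
    "billing": {"legacy-db-01", "auth"},
    "crm": {"legacy-db-01", "auth"},
    "intranet": {"auth"},
    "reports": {"reports-db"},
}


def failover_impact(app, dependencies):
    """Return (exclusive, shared) dependencies for a single-app failover.

    exclusive: set of dependencies only this app uses (safe to move with it).
    shared: dict mapping each shared dependency to the sorted list of other
            apps that would be affected if it moved to the DR site too.
    """
    # Invert the map: dependency -> set of applications that use it.
    users = defaultdict(set)
    for name, deps in dependencies.items():
        for dep in deps:
            users[dep].add(name)

    exclusive, shared = set(), {}
    for dep in dependencies[app]:
        others = users[dep] - {app}
        if others:
            shared[dep] = sorted(others)
        else:
            exclusive.add(dep)
    return exclusive, shared


exclusive, shared = failover_impact("billing", DEPENDENCIES)
print("moves with the app:", exclusive)
print("needs a run book decision:", shared)
```

Anything that lands in the shared bucket is where a run book has to spell out the decision: move the shared server and take the other applications with it, or leave it in place and reach it from the DR site.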

11:24

And so, always interesting situations and challenges. And then, you know, when public clouds started getting popular, I had a pretty pessimistic outlook on disaster recovery in general, and I think the entire sort of industry was excited about the fact that, like, we won't have to deal with that anymore. We have the ability now to just build applications that are inherently resilient, you know, from the bottom up, and you know,

11:52

we'll deploy them on the cloud. They'll be self-healing, and then we won't have to deal with this, you know. And I think what's happened is that people have tried that, and people have tried to build these applications that will run, let's say, in multiple AWS regions, and they realize the complexity involved in building the applications from the start with that thought in mind is just far beyond the bounds of what they

12:24

want to deal with. And we see that even when you invest time and resources into that, it doesn't necessarily mean that it's going to work. You know, every time AWS East has an outage or goes down, how many very large popular sites, you know, household-name sites, go down that are technology companies that we know are deploying to multiple regions? So

12:46

why are they down? Because it's almost like this impossible thing to build, and it's not always their fault, right? Like, the interdependence between their applications and even third-party, you know, SaaS or third-party PaaS means that, can they actually test this thing, can they actually test their resilience plan without, you

13:05

know, without actually affecting production? So what I've seen is sort of the industry going towards the middle ground, where, and some people don't even realize you can do this, but you can basically deploy an application in a single region, not have to build this whole resilience concept into your application from day one, and then employ traditional disaster recovery strategies towards, you know,

13:35

sort of gaining, you know, resilience of your app. So deploy in one region, and maybe now we can use replication tools that are more cloud-native focused, and then we can still take all of those things that we learned over the years from traditional disaster recovery, things like dependency mapping, building run books to deal with different situations, building sort of network strategies so that

13:58

I can test at the DR site without poisoning my production data. You know, if your production app is connected with the Salesforce API and you bring up your app in DR and you start playing with records, like, oops, we're modifying production data. So, you know, all of these sort of core disaster recovery strategies. Give me a modern data mover that knows how to replicate or rewrite resources, let's say using

14:28

Terraform or CloudFormation. Give me a modern data mover and then apply everything from traditional DR, and you can actually achieve resilience without having to go crazy from the development side. Yeah, one thing that you mentioned there a couple of times that I think is really key is testing that. And it reminds me,

14:50

every time I think about that, it reminds me of way back early in my career, decades ago, my boss asking our team, hey, are you guys ready for a disaster? And we're like, oh yeah, we're all set, and he's like, okay, great, everybody show up on Saturday. And so we showed up on Saturday, and he'd rented a conference room in a hotel, had some servers sitting there, and he had our backup tapes. He's like, great, restore everything, you know.

15:22

And we didn't even make it five minutes into the process before we realized, oh wait, we don't have the floppy disk to update our BIOS, or we don't have the boot disk to reinstall the operating system. And it was a really, really long and painful day. But the lessons have stuck for a

15:41

couple decades now. Yeah, you know, that's a... it's great when people are sort of overly focused on some sort of data replication when it comes to disaster recovery, or even where some people just think that their backup strategy is also their sort of resilience or disaster recovery strategy. And I won't get too much into that, but, you know, you have two separate goals with two separate strategies to be employed. So yeah, you really need

16:10

to sort of pre-author the run book. And I think what's interesting now, too, that we're seeing is that if you look at an event like a ransomware attack or a cybersecurity event, you know, the incident response plan and/or the disaster recovery run book or something like that, it's not something that a single team would be

16:30

dealing with, right? Like, a DevOps team is responsible for sort of the uptime and resilience of an application, and presumably they own all this orchestration for production to DR, multi-region, and failover. That's great. But now if you bring in this sort of security aspect, that this need to fail over was in relation to a security event, now you have a completely new team. Maybe it's an

17:00

internal SOC or security team, or an external MSSP. And now you have these two teams that, unfortunately, in many organizations don't speak that much, and now they need to be in lockstep as part of the incident. And you know, if you think about a CTO or CIO at a higher level, they kind of become the quarterback between these teams during an incident. And it's not something that I think they even realized that they were ever going to have to deal

17:26

with. And so the incident response plans and the disaster recovery run books need to be inclusive of who owns what during, you know, a security incident. You know, can you even bring up the application at the DR site? Do you want to? So how does a team that maybe recognizes that their DR, the resilience plan, isn't where

17:56

it should be... what are the first steps? Because to get this done you need to devote time and resources, and it has to be prioritized, and sometimes you have to prioritize it above, like, day-to-day operations. And I think specifically it comes down to what are you going to say no to, so that you do have the bandwidth to say yes to this. So what are some good early steps for people once they

18:26

recognize that they're not where they should be? Yeah, so I think it's a good question. I think the first thing that they need to do, and I think that the market has matured a bit here and this is fairly obvious now, but, you know, the teams need to kind of sit down and figure out what they have an

18:48

appetite to take ownership and responsibility for in this realm. And so if you look at a traditional, you know, DevOps sort of... how this goes in general for, like, a DevOps conversation is, you know, are we application developers? Are we SREs? You know, who is responsible for ongoing management, you know, sort of metrics collection, efficiency? And obviously

19:15

there's no right or wrong with any of these things. And a lot of it has to do with sort of the DNA of the company and what they kind of want to be when they grow up, and do they want their, you know, certain IT teams adding value to the business or managing infrastructure. And so, you know, I'll see smaller organizations that are like,

19:34

you know, we're a small team. We own everything, so we're going to just internalize it. And I'll also see very large organizations that have, you know, an abundance of resources, and they basically make the decision that we don't want to be in the business of managing disaster recovery. We don't want to be responsible for it. We'd rather outsource it. And an interesting thing to think about here is, you know, the

20:03

complexity of all of our applications and our deployments, they're not getting simpler. They're getting more complex. In fact, I think you can argue that part of the goal of DevOps these days, one of the things they should be striving for, and maybe even a key metric to focus on, is: to what extent am I making the deployment that I'm managing, to what extent am I making it

20:30

simpler and less complex? And obviously, the more complex, the harder it is to manage, to monitor, to scale, to secure, and to make resilient. So I think people need to acknowledge that. And when you have that conversation, you know, one of the answers that comes out of that conversation could be, hey, we want to make it simpler,

20:51

how do we make it simple? Well, why don't we outsource certain layers and certain responsibilities? And disaster recovery and resilience is an easy one to outsource. It's low-hanging fruit. You know, typically it does not affect your production too much. If you can use sort of that middle-ground strategy that I mentioned at the beginning, you don't have to modify your application, you know, much at all in order to be able to achieve resilience.

21:19

So that would be my answer. The first thing I'd do is sit down and figure out, you know, what is your appetite to manage and own that internally? Yeah, for sure. Yeah, and I think that's a huge selling point. If you have a strategy where you don't have to modify your existing infrastructure or application a whole lot, that's always going to be a big selling point. Let's do this. Take a step back and help me understand why moving to the cloud, or using cloud providers like AWS,

21:49

is not a DR strategy in itself. Yeah. Well, you know, as everybody knows, right, it's like going to Home Depot: they're giving you the right tools, and you've got to, you know, make what you want. So you have to look at it on a per-platform sort of

22:12

environment. So if you look at something like S3, which obviously is being stored in multiple availability zones within a region, or even has the ability to have its own inherent built-in cross-region replication, you're probably good there. If you wanted to build a disaster recovery strategy between, let's say, East and West, from an S3 perspective, it is, it

22:41

is fairly straightforward. You can kind of put a check next to that layer, as far as your data being available at the DR site. You know, recovering from a cyber attack or sort of a manipulation of the data, that's another story. But, you know, if the entire region goes down and you want your application back up and running within a set RTO, you can kind of put a check there. For other, you know, other platforms, it's

23:15

not always the case that that is done. Typically it's not, you know, and so there are snapshot capabilities that exist, but then there's this entire orchestration task that sits on top of all that. So you have all of your configurations and resources, maybe at another site, but now your applications are not necessarily written to be able to reference those resource IDs at the DR site. And so now we're...

23:41

so it's really a replication and orchestration strategy, right? And so what we'll do is we'll look at your various applications, and then we'll look at AWS, and we're doing this mostly for AWS today, in addition to the legacy environments which I mentioned before, but for public cloud. We'll look at the various platforms that your application is using, and we will employ underlying AWS technologies to ensure that data is up to date at the DR

24:11

site. And so maybe that is cross-region snapshots, or maybe that is AWS DRS, which works very well for certain platforms but can be expensive. So now we get into the application criticality question of, you know, how critical is it for each application to be up and running, and we sort of match the right replication technologies to the cost and to the application criticality.
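The matching of replication technology to criticality described here can be pictured as a simple tiering rule. The sketch below is only an illustration under stated assumptions: the `App` record, the `pick_replication` helper, and the four-hour threshold are hypothetical, and the two options are just the ones named in the conversation, not a complete catalogue or anyone's actual decision process.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class App:
    name: str
    rto_hours: float  # hypothetical target: how quickly the app must be back


def pick_replication(app, continuous_threshold_hours=4.0):
    """Pick a replication approach from an app's recovery time objective.

    The threshold is an illustrative assumption: apps that must be back
    quickly get the costlier continuous option, the rest get snapshots.
    """
    if app.rto_hours <= continuous_threshold_hours:
        return "continuous replication (for example, AWS DRS)"
    return "cross-region snapshots"


for app in (App("checkout", 1), App("internal-wiki", 48)):
    print(f"{app.name}: {pick_replication(app)}")
```

In practice the inputs would include more than a single recovery target, such as cost per platform and data change rate, but the shape of the decision is the same: classify each application first, then choose the replication mechanism.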

24:37

Beyond that, you know, we're using orchestration tools, and one of them that we'll use is called Arpio, which will orchestrate some of this back and forth. And Arpio might be something that's great for a team that wants to internalize all this and just say, we've got a tool, let's use it. Where Opti9 comes in is, it's not just about the tool, it's, you know, do you want to take ownership

25:00

of the failover process and the failback process? Do you want the ownership of the testing, building the network integration strategy, building the automations into, let's say, you know, DNS, maybe SD-WAN policies, so on and so forth? So we kind of sit on top and own the entire process, you know, soup to nuts, so that DevOps teams and IT teams can just wash their hands of it and focus on building applications. Right on. Yeah, I'm actually an Arpio customer, and it's a great tool.

25:34

It's just one of the few tools I've seen that just does what it says it's going to do at an exceptional level. But just like you mentioned, you know, that's only part of it, that handles the infrastructure. There's still the whole human aspect of it, of verifying what you've replicated and doing a failover to it and testing it and making sure it works. And

25:56

that's another full-time job in itself. It is. And the funny thing for us is, again, being a company that has been providing disaster recovery as a service for, you know, VMware platforms, physical servers, IBM iSeries, you know, Xen, KVM-based applications, the funny thing is, you know, we're not... you can say we're a

26:22

technology company, but it's really that glue that we're owning. But we have brought in best-of-breed data movers and sort of replication tools to, you know, focus on specific platforms. And so when we brought in Arpio, it's like, hey, here's the best-of-breed tool for

26:41

cloud-native AWS apps. But everything else that we're doing, all the value we're providing, and all the wrappers around the replication tool, they're all the same as we were doing five, ten years ago, which is actually pretty cool. It's like, if you can stay up with the tech, and you can build a platform that can support multiple integrations in a modular way, you can stay relevant through all of these crazy changes, for sure.

27:07

We had Doug from Arpio on the podcast a few weeks ago. I should do another episode with both you and him and just go into a deep dive on this. Yeah, so him and I, I've known him for a while and I really am super bullish on their platform. I think it's amazing. Him and I are doing a webinar tomorrow, actually, about all this in detail. Oh, right on. I will get that from you and make sure that that's in our show notes when this episode goes live. That will be a

27:37

cool talk. When it comes to DR in the cloud, you mentioned that providers like AWS have a lot of the tools built in. You just have to look at them on a case-by-case basis, see what those tools are, and make sure that they're enabled and that they're working properly for you.

27:55

How often do you see the need for, or do you recommend, cross-provider DR strategies, like backing up our AWS or replicating our AWS environment in Azure or GCP? Because that brings with it a whole, like, an exponential increase in overhead as well as costs. Yeah, that's a great question. You know, I think that you kind of have to look at three buckets here

28:21

in general. You know, you have the ability to achieve high availability, right? I think in order to build high availability for your application cross-region or cross-cloud, you're really not going to be able to get away from building your application with that intent from day one, and having to apply so much more complexity to your application, to your CI/CD process,

28:51

and really the level of expertise that you need from your developers. I think it's on another level, right? And so if you're just starting the process of building an application now and that is your goal, you can't go back. You can't go back later and just be like, oh, we'll just do that later. No, it has to be in the DNA of your application. This is also an interesting point when you start to think about integrations with third parties.

29:18

You start to think about all of the third party providers that you're going to

29:22

utilize, from an API perspective or from a data perspective. You know, if you have this mandate to have resilience and high availability as part of your application, or security, and you build a framework or a requirement around that, you need to have those conversations with those third parties, you know, before you start using them and not after, because if they're the weakest link in the chain from that perspective,

29:49

if they don't have great resiliency to provide you with the options you need, then you're stuck. I think too many companies go and, you know, they have the SaaS sprawl, or they just start using them, and then, you know, you might spend, I don't know, years building an HA application that works across clouds, but one of your vendors, you know, is

30:07

not locked up, and boom, you know, you achieve nothing. So understand the differences between HA, backups, and sort of traditional DR applied here, and really sort of figure out, I would say, where do you want your sort of vendor lock-in to be, right? If it's data, if you're okay with vendor lock-in with

30:30

one cloud, that's fine. I don't think there's anything wrong with that, especially, again, if you're building your application with forethought into that. And maybe, you know, we see people, I'm sure you've seen it many times, people that are like, I'm going to use AWS multiple regions, but I'm purposely not going to use any platform services. I'm going to

30:47

run my own SQL instances and kind of go backwards in that way. Fine. You know, if you're using Arpio, as far as I'm aware, today it does not have any cross-cloud replication capabilities, but let's say it did. Great, now your vendor lock-in is on that level. So I would say a lot of this is sort of, you know, risk aversion, risk mitigation. I think the likelihood of all of AWS going down and having a need for sort of cross-cloud

31:19

is, you know, hopefully very little to none. But I think a single-region outage, as we've seen, is definitely in the realm of possibility and happens. But I do think what you're saying makes sense from a backup perspective, right? Maybe we don't need, you know, an RTO of being able to fail over from AWS to Azure within four

31:44

hours or twenty-four hours. But if we're keeping a copy of our data there, and we understand what the path to bringing it back up looks like, I think that you're in better shape than most are today. Yeah, I agree with that one hundred percent. As a consultant, I've had multiple companies come to me over the years and say, hey, we need to implement DR, so, we can't trust AWS, or we don't want to trust AWS,

32:13

so we want to use multiple cloud providers. And my approach with them has always been, you know, I don't think AWS is and not picking on AWS here, but I don't think that's the weak point. And then we go through and look at their stack, and it always comes down to the

32:28

fact that, you know, that hasn't been the weak point. You know, they've chosen to use a managed database provider, and so all of their data is not even in their AWS environment, or they have all of these external dependencies like Salesforce or different things like that. And it's like, okay, you can replicate all your infrastructure over to another provider, but this third-party dependency is still a single point of failure, and much more painful if that goes down.

33:01

Which makes me think, along those lines, since you work with a lot of companies in this: how willing are third-party vendors to talk about what their own internal DR and high availability strategy is? They all have their boilerplate, off-the-cuff answer that they have to provide, you know, and it's always going to be pretty vague, and you're probably going to

33:29

have to go back two or three more times. And sometimes they'll just refer you to the SLA, and obviously their SLA credit mechanisms, like most, are going to be just a joke, right? And so it is a risk. I mean, I will say on the compute side, I do think that, you know, Kubernetes has democratized sort of the compute layer and has made it very easy to deploy, you know, your code where you want, when you want to.

33:58

But you're right, it is the database layer and the rest of the shared services layers, and that is kind of a hard pill to swallow because, again, if you want to manage and run and operate your own databases, that's fine. It'll be less expensive that way, you'll save money, and you'll have more control, and you will be able to make good on this cross-cloud resilience if you

34:24

want to. But now the operational overhead has increased, and so, you know, part of what we've done and what I've been dabbling in with some consulting is just doing that dependency mapping,

34:37

application mapping, and figuring out what we want to do. And by the way, just because you're using PaaS in production doesn't mean that you can't have a single database deployment in DR with some sort of snapshot or replication mechanism in

34:53

place as a backup. And look, if it takes you two days to get all the tweaks worked out post event, most people will say that's not the end of the world, and they will accept that as a solution because, to be honest, a lot of folks are unfortunately looking to check the box on a DR strategy, or having one in place for compliance, and having the DR strategy does not necessarily mean that you have a run book or

35:21

you have super low RTOs or RPOs. It just means that you have sat down and written what you would do during an event, even if it hasn't been fully tested. And so if that's what you're after, that's what your goal is. Because maybe you are not a fully technology-based platform as a business, as a revenue generator, oftentimes that isn't a... Yeah. I think having the conversation about RTO, the recovery time objective, is really important, because all my

35:52

entire career, you know, I've never worked for companies like Google. Well, there's been one exception, where at one of my employers we were doing health care for trauma patients, so we had to move quickly there. But for most businesses, having that RTO conversation is very helpful, because while ideally you would like to say, oh, yeah, we can,

36:21

we can fail over in two hours. That's cool, but it comes with a set of costs. And acknowledging the fact that, you know, it would be embarrassing to tell your customers we'll be up in two days, maybe that is the right strategy based on your business. Yeah, you've got to start somewhere, right? When I've worked with companies to build a disaster recovery strategy and actually roll it out, the first thing we'll ask is

36:50

what are the business goals you're trying to achieve? And some of the questions might be, you know, are you only looking to protect against a full failure at the production site, where all the applications need to be failed over concurrently, or are you looking to protect against situations where you might need to fail over individual applications?

37:12

And then there might be other questions, like do you want to fail over if there's one server that is ransomwared? And you know, of

37:19

course everybody says yes to everything: yes, we want all that. The problem is, the more situations, the more complexity, and ironically enough, the full failover event, where everything needs to come over at the same time, is actually much easier to build for and to achieve than all the others, because typically you have interdependent applications sitting maybe behind the same firewall, in the same VPC, on the same network,

37:47

and so if you can keep them all on the same IP addresses and keep references intact, then it is much easier. And so typically we'll employ a phased approach: let's be able to achieve that, prove that, show that it works, and then we'll peel back the rest of the layers of the onion and strive for more. Yeah. It reminds me of an analogy from drag racing: speed costs money.

38:14

How fast can you afford to go? Yeah, yeah. And I really think it's interesting when you think about these things and you think about the burdens. If you're looking for complete HA, multi-region or even multi-cloud, the extra burden that you're putting on your DevOps or app dev teams, you know, what does that translate into in terms of business impact? How much longer are your

38:40

development cycles because of that? And what features are you not able to work on because of all this extra time put into the forethought of this high availability? That's why I like the middle ground approach, where we have our developers focus on developing an application that runs on a single, let's say, AWS region and, you know, head down, hands to keyboard, focus on building applications, which they probably have,

39:09

probably have a lot of experience doing. You know, this whole multi-region thing is typically fairly new to someone, and they're going to go off on a tangent. So app devs, you focus on building an application that is resilient within a region. AWS makes it fairly straightforward to do

39:27

that. And then maybe a separate team, or an SRE team, or a company like Opti9 kind of comes in over the top and says, we are going to apply disaster recovery as a service to that single-region deployment and achieve resilience using tools like RPO and using proven strategies, and that way the app devs can just stay highly focused. I think that's such a win-win, and honestly, I don't even know that there's a ton of developers out there that can even

39:51

achieve HA with a high degree of success. Yeah. I think one of the other benefits of that approach is discovering tribal knowledge, because in a lot of the scenarios I've been involved with, we do things and we take certain steps or actions because of this tribal knowledge that we happen to know. And in many cases we don't even know that we're making decisions based on tribal knowledge.

40:16

But when you bring in a third party like Opti9, then you're coming at it from a fresh perspective without the tribal knowledge, and it works really well to expose that. It's like, oh, okay, now we have this piece of information that has to be documented and formalized. Absolutely. And like I said at the beginning, when I hear tribal knowledge, I hear complexity, and again, I think there's this whole idea of managing complexity, managing complexity sprawl,

40:51

you know, fighting to reduce complexity. It is not being pushed enough from an industry perspective. In fact, I think we have the opposite problem. I think we have a lot of folks out there, and at different times in my career I've definitely been guilty of this, you know, we have shiny object syndrome, and we want to be exposed to all the latest and greatest tools. I think we're all curious people in the

41:17

industry and we like playing with new things. I think part of it also is just maybe a little bit of fear, and ensuring that we have the latest and greatest acronyms on our resumes. Sure, but shiny object syndrome is, I think, the complete opposite of: I want to keep my environment simple so that it's manageable, so that I can reduce the need for tribal knowledge. And this kind of goes into other soft skills, right? Like, if I want the person at

41:45

four a.m. to be able to fix what I built, to what extent am I a good technical documenter? And to what extent do I take pride in that as a standalone skill that I'm good at, as a developer or an SRE or a DevOps person? Yeah, agreed. And just speaking from personal experience, I'm not good at documenting. I'll write something that seems to be as clear as it can be, and then usually me six months later looks at it and

42:16

is like, who's the moron that wrote this? Oh wait, never mind. Yeah, I think we've all been there, right? I mean, I've managed a lot of technical teams, and it's always that last ten percent when you see the documentation: how are we monitoring it? How do we know if it goes down? How are you backing it up? I mean, we want to build cool things, right, and then we just want to pass it off. But I do think that we, as DevOps engineers, need to start taking pride in

42:45

skills that are outside of the hands-to-keyboard work: technical documentation. Taking pride in being able to walk away, go on vacation, and people knowing what's going on by reading my documentation without calling me. You know, I think also being a good troubleshooter, and this kind of goes back into the complexity and disaster recovery conversation. But to what extent,

43:07

you know, is my troubleshooting skill set high? And I think unfortunately a lot of the soft skills don't have great KPIs or metrics that you can throw on a resume to show how well you do with those things. But I love honing the troubleshooting skill and being brought into a problem that I know nothing about and, you know, figuring it out quickly compared to maybe folks that wrote it or have been dealing with

43:31

it. You know, that's fun. It's a great little challenge. Yeah, that's one thing I've advocated for for years now: my role as a DevOps engineer is to work myself out of a job, you know, to set everything up so that it runs, and when it doesn't, it's clearly documented what steps to take, and someone new can come on board and get their app to production without having to rely on me, and do so in a

44:00

way that makes sure that they honor the constraints of the business. And if I can do that, then there's no reason for me to be at that company anymore. And I think that's my own personal metric for job success. Yeah. I think, actually, you know, not to pull more shiny objects into the conversation, but I think GenAI, I

44:22

think, has huge potential to help in this area. In fact, I'm talking to some startups that are already starting to do this, where you plug them into all of your internal documentation and they basically just give you a chatbot where you can ask questions. And so, having service provider experience, this is really interesting, because if we're

44:45

managing multiple customer deployments, you know, part of what Opti9 does is managed cloud ops, managing AWS deployments on behalf of our customers. And, you know, we don't want every customer to be their own science project, but there is always going to be this balance of standardization and customization. And so we have very detailed documentation on each

45:10

customer's deployment, and diagrams and all that. But it's very hard to scale that, especially for the person at four a.m. that gets the phone call that something is down and has to sit and read through all that documentation and catch up. You know, it's an impossible task when you need to spend hours catching up before you can even begin to troubleshoot. And this,

45:30

I think, which is really cool, is just where GenAI can help. If you have this LLM that's constantly looking at this data, you can have a bot where you say, hey, where's this customer's stuff deployed? When was the last time something was deployed? When was there a change? And you can just quickly get those answers. To me, as someone who's managed twenty four seven teams, that's just super exciting, and that really helps us scale the NOC and the SOC

45:53

organization. Yeah, for sure, because context switching is huge, and that's where it really becomes visible how painful and expensive context switching is. And I think you probably are very familiar with it from your experience at Opti9, when you switch not only from project to project, but customer to customer, and so you are working on one customer's environment that's built this way, and then, you know, the pager goes off and you have to switch

46:22

to a completely different environment. And so how do you minimize that amount of time where you're just sitting there with a blank stare trying to figure out where to begin in this environment that could have an infinite number of combinations? Yeah, and I'd say, you know, given the complexity, it's almost impossible, to be honest, it really is. And having the tribal knowledge and the experience working on a specific

46:52

customer's environment, you know, helps greatly. So what we do is we obviously try to have as many standardized tools as we can, and standardized monitoring. I like looking at different monitoring strategies where we build monitoring, again, into the CI/CD workflows as far as what we're going to monitor. But what I'd like to do is to really have macro-level alerts go

47:20

off at the same time as micro-level alerts go off. So if my application is down, if we're monitoring a specific query and we want to see that it's returning greater than twenty five results from the customer perspective, if that goes down, I would like to see four or five different monitors that are monitoring specific layers of the back end or specific API endpoints also going down at the same time.
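
As a rough illustration of the macro/micro alerting idea described above, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the `CheckResult` type, the `correlate` function, the monitor names, and the five-minute window are hypothetical and do not come from any specific monitoring product mentioned in the episode.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class CheckResult:
    name: str              # hypothetical monitor name
    level: str             # "macro" (customer-facing) or "micro" (one layer or endpoint)
    healthy: bool
    observed_at: datetime


def correlate(results, window=timedelta(minutes=5)):
    """Map each failing macro check to the micro checks that failed near the same time."""
    failing = [r for r in results if not r.healthy]
    macros = [r for r in failing if r.level == "macro"]
    micros = [r for r in failing if r.level == "micro"]
    report = {}
    for macro in macros:
        # Keep only micro failures observed within `window` of the macro failure.
        related = [
            m.name for m in micros
            if abs(m.observed_at - macro.observed_at) <= window
        ]
        report[macro.name] = sorted(related)
    return report
```

The point of the grouping is that the person on call receives one customer-facing alert together with the lower-level failures that happened around the same time, rather than a flat list of unrelated pages.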

47:43

So the poor technician at four a.m., we're kind of spoon-feeding them: hey, there's a serious problem. But at the same time, hey, we also noticed these four things that are out of whack, and so instead of having to start from scratch, they can kind of work backwards from the lowest-hanging fruit. Yeah, just giving them a series of breadcrumbs to follow. Yeah, and again, I mean, I think that's

48:06

a strategy. I mean, to me, is that a technical skill, or is that a quasi non-technical strategy that you need to employ with this resilience or SRE or HA, you know? For sure, it's sort of DevOps, right there in the

48:24

middle of DevOps, I think. Yeah. One of the things I like to do in all of my alerts is include, like, hey, here's the alert, obviously, here's why it went off, and then here is a link to the application dashboard and the run book for that, just to leave those breadcrumbs and help minimize that context switching time. Absolutely. Absolutely. Documentation: you've mentioned that multiple times, and it's a pet peeve of mine because I don't like Confluence, I don't like Notion,

49:00

I don't like Readme. Pretty much, I don't like any of the documentation tools. But you mentioned standardizing on tools. Do you have a preferred documentation tool? I don't. I've used sort of all of the above. I would say, I don't think that there's one tool that's better than the other, right? And this is a cliche, but it's more about the use. It's like talking about the best diet, right? It's

49:28

the one that you can do consistently over time. Yeah. And I think as long as it's simple and you can build it into your workflow fairly easily, that is the best tool. I'll tell you one win related to that that I experienced years ago. It happened to be with Confluence, but I know the same sort of capability is available in almost all documentation tools now. Years ago we used

49:57

to use Visio to create diagrams, and then we'd upload them into the documentation tool. And that whole process of bringing the work or the output from one tool into the other, people don't want to do that. They'll end up just keeping the diagram, let's say they're using Lucidchart or Gliffy or something

50:22

like that, they'll end up just keeping it in that account. So a big win for me was when Confluence started adding in these plugins where you can actually create the diagram without having to go out of the documentation system and have the diagram embedded right into the document, instead of having to build it in a separate tool and import and copy and all that. And so, you know, I think that was great, because now, I mean,

50:49

I'm authoring a document, I want to show a visual representation, I'm a big visual person, and I can just create the diagram right there without having to leave the page. It's saved, and now the actual IP of that diagram is embedded into the document. It can never be pulled apart. Nobody can ever tell me, oh, yeah, I never

51:07

uploaded the latest version of the diagram into the document. So there's that whole concept of working the updating of documentation and diagrams into your workflow. I think it's a really good example of how you can do that. Obviously with dev, I think with Jira and GitHub,

51:25

you can do that. But I don't think that capability exists enough from more of an infrastructure operations, an SRE perspective. Right. When it comes to making sure things are up to date, whether that's documentation or run books or your failover strategy, what's the minimum frequency you would recommend someone review that? Well, I'd say twice a year is probably the minimum. But then you also need to add

52:04

hooks into your change control, right? Anytime you deploy a new service, that should be a hook to whoever is responsible for resilience, maybe an outside vendor like Opti9. So if I'm sitting in the customer's seat, I'm going to add as many hooks as possible, and I'm going to say to my vendor, hey, we just changed this, we just changed

52:25

that, make sure our resilience still works. Now, on the flip side, if I'm in Opti9's seat, I might say, yeah, no problem, we've updated it, which we'll do in earnest, and we have to, but hey, you know, we did what we had to do, but we've got to retest now, right? So you do have to find that balance. That doesn't mean you can't update these things on an ongoing basis and then kind of have a list of what you want to ensure functions during the next test.
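
The change-control hook and the "list of what to verify during the next test" described above could be sketched roughly as follows in Python. This is only an illustrative sketch under assumptions: the `ResilienceTracker` class, its method names, and the example service names are hypothetical and not a description of how any vendor actually tracks changes.

```python
from dataclasses import dataclass, field


@dataclass
class ResilienceTracker:
    # Each entry is (service, summary) for a change made since the last DR test.
    pending: list = field(default_factory=list)

    def record_change(self, service: str, summary: str) -> None:
        """Hook to call from change control whenever something is deployed or modified."""
        self.pending.append((service, summary))

    def retest_checklist(self) -> list:
        """Services to verify during the next DR test, in first-changed order, no repeats."""
        seen = []
        for service, _ in self.pending:
            if service not in seen:
                seen.append(service)
        return seen

    def complete_test(self) -> int:
        """Clear the backlog after a test and return how many changes it covered."""
        covered = len(self.pending)
        self.pending.clear()
        return covered
```

A change pipeline would call `record_change` on every deployment, and whoever owns resilience would pull `retest_checklist()` when planning the next scheduled test, so updates can happen continuously while the retest itself stays periodic.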

52:52

I will say, though, one of the important things with testing is you don't want to just have the IT teams doing the application testing. You really need to have users testing. Maybe it's QA, maybe you have internal staff that are using the system. You need people that can smell out a problem with the application, can smell out the fact that it's maybe a little bit more sluggish, or that certain functionality doesn't work as well. And this

53:15

is a big miss. A lot of our teams try to internalize it because they want to just move past it. For sure. Yeah, coming from an IT background, my overall objective is to avoid as many conversations with other humans as possible. But this is one of those areas where you just kind of can't do that. And I'm guilty of doing it too, of performing a failover, looking: yeah, all the health checks pass, no alarms, that must be good, and then moving on. Yeah, and I like

53:45

the idea of almost making product managers responsible for some of this. You know, if resilience and high availability is a feature, a component of the outward product, then I do think that they can be the liaison between the developers, third parties, or whoever

54:08

is owning the resilience. You do need a quarterback there. And if there is a product management function, I think this is a great aspect for them to ensure continuity long term. Yeah, agreed. Like, a seasoned product manager is just worth their weight in gold, because they understand all of these different layers of complexity and interactions between the teams, and just by job definition, they're really good at orchestrating and pulling in the right resources at

54:37

the time that they're needed. Yeah, and with third parties, right, if they go with Salesforce, you know, what they're going to want to do is potentially pull in whatever positive capabilities are being pulled through, maybe they're pulling it into a product feature set. They also need to better understand what it means for the outward messaging on resilience, or if they can still make good on that

55:02

promise. All right, well, we are coming up on an hour here. Is there anything else that you feel like we should be covering when it comes to resilience, DR, and managing complexity? So I'd say the most important thing, and this might be a little cliche these days, but it's just: make no assumptions. Make no assumptions on any of the platforms

55:29

that you're using in regards to what built-in resilience or redundancy exists. And also keep in mind that high availability and resilience do not always equate to your ability to recover from specific types of events. You know, if you're hit with a cyber attack and your data is corrupted in production systems, having a replica or having high availability, even with multiple regions, does not mean you can recover from that. There are other strategies that you need to

56:01

employ. Obviously, you know, how far back is your snapshot history, your journal? And you'll need to have separate run books for that type of situation than for the high availability type of situations. So just understand that those are completely separate, and again, make no assumptions. Yeah, it's almost like this would make a really good board game. Mm, that would make a good board game.

56:30

Yeah, we should do, like, a Jump to Conclusions. Yeah, well done on the Office Space reference. Nice to see that. Well played. Cool. So if our listeners want to talk more about this or reach out to you directly with additional questions, what's the best way for them to do it? Find me on LinkedIn. That's probably the platform that I'm most active on, you know, or write to me on there, or at opti9tech.com, you'll find me.

57:10

My name is kind of unique, so I have no doubt that anyone who's listening to the show will be able to find me. So your name is unique. Is that short for something? Sagi is a Hebrew name. Like other Hebrew names, it can kind of get butchered, you know, in these parts. But there are much worse Hebrew names, you know, so I don't

57:36

have it that bad. But I mean, it is nice, because when I get cold calls, I immediately know that this person never spoke to me before. Yeah, the built-in screening feature. Yeah, yeah, that's good. Awesome. Well, thank you so much for joining me today. This has been a cool conversation, and I think it's one that we need to spend more time talking about, because it often gets overlooked or assumed. Like you said, make no assumptions. Absolutely. Cool. Well, and thank

58:08

you, thank you. I mean, this has been fun. You're a great presenter, and it's nice to talk to someone who's also lived it and been through it as well. Right, us old guys have to group together and tell war stories once in a while. All right, thanks again, Sagi. Thank you for listening, and we will see y'all next week.

Transcript source: Provided by creator in RSS feed: download file
