Episode 91: Azure Chaos Studio

00:00

Welcome to the Azure Security Podcast, where we discuss topics relating to security, privacy, reliability and compliance on the Microsoft Cloud Platform. Hey everybody, welcome to Episode 91. This week is just myself, Michael, and our guest this week is Rigel Carlson, who's here to talk to us about Chaos Studio. Before we get to our guest, let me talk briefly about a couple of news items.

00:30

One, we have just issued a document from the Microsoft Threat Intelligence team called Midnight Blizzard Guidance for Responders on Nation-State Attack. Very much worth reading. This is basically some more attacks coming out from a threat actor we often refer to as Nobelium. So please do go take a look at that. I will put a link in the show notes. The other one is an interesting one.

00:56

Over the last few weeks or so, I've been doing quite a bit of development on Always Encrypted, which is a technology in Azure SQL Database and SQL Server. I wrote some sample code and was experimenting and playing around, what have you. The first query took 15 seconds. I'm like, oh, that's not good. So I did a bit of digging around, finding why it's taking 15 seconds. Actually, I thought the problem was with Always Encrypted. It turns out it's not.

01:20

The problem was actually with the way I acquired the Azure credentials to use Key Vault for the key storage. Basically, the way the code works inside of the Microsoft.Data.SqlClient library is it tries to hold off doing all the work it needs to do until it actually has to do it. So it's quite lazy in that regard.

01:43

And so when I go to do the first execute, like execute the first query, it basically does everything there, including going to SQL Server or SQL Database, pulling down a stored procedure to find out what columns are encrypted, then going to Key Vault. All that stuff happens. And so that's why it takes a long time. And the other one is that it calls DefaultAzureCredential.

02:05

And that actually goes through a whole bunch of different credential providers to find out which one to use. And that can take a lot of time. So I put a couple of little tweets out there about my findings. I'm not saying don't use DefaultAzureCredential, but just be aware of the implications of using it. All right. So with that, let's now turn our attention to our guest. This week, as I mentioned, we have Rigel Carlson, who's here to talk to us about Chaos Studio.
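The first-query delay described above comes from trying credential providers one after another until one works. The following is a minimal sketch of that effect, not the real Azure Identity library: the provider names, the delays, and the helper functions are all hypothetical and exist only to show why a long chain costs more than a single known-good provider.

```python
import time

class CredentialUnavailable(Exception):
    """Raised by a provider that cannot supply a token in this environment."""

def make_provider(name, delay_s, works):
    """Build a fake credential provider that takes delay_s seconds to answer."""
    def get_token():
        time.sleep(delay_s)  # stands in for a network probe or process launch
        if not works:
            raise CredentialUnavailable(name)
        return f"token-from-{name}"
    get_token.provider_name = name
    return get_token

def first_working_token(providers):
    """Try each provider in order; return (token, names tried, elapsed seconds)."""
    tried = []
    start = time.perf_counter()
    for provider in providers:
        tried.append(provider.provider_name)
        try:
            token = provider()
            return token, tried, time.perf_counter() - start
        except CredentialUnavailable:
            continue
    raise CredentialUnavailable("no provider succeeded")

# Hypothetical chain: three providers fail slowly before the fourth succeeds.
chain = [
    make_provider("environment", 0.05, works=False),
    make_provider("managed-identity", 0.05, works=False),
    make_provider("ide", 0.05, works=False),
    make_provider("cli", 0.05, works=True),
]

token, tried, chained_s = first_working_token(chain)
_, direct_tried, direct_s = first_working_token(chain[-1:])
print(f"chain tried {tried} in {chained_s:.2f}s; direct took {direct_s:.2f}s")
```

In real code the equivalent choice is usually to construct the one credential type you know applies to your environment instead of relying on the default chain, which is the trade-off Michael is pointing at.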

02:31

Rigel, hey, welcome to the podcast. Would you like to spend a moment and just introduce yourself to our listeners? Thanks for having me, Michael. I really appreciate it. So I'm Rigel. I am a product manager here at Microsoft working on Azure Chaos Studio. I've been at Microsoft about four years. I worked as well on the Windows deployment and update stack. So lots of fun stuff, both in the deployment stack and in Azure Chaos Studio.

03:01

Really excited to be here talking with you today about Chaos Engineering. Well, first of all, congratulations on getting it out the door. It's obviously a huge milestone, even though it's taken you guys a little bit of a while to get it out of preview. How long has it been in preview? It's been a while, right? It was several years. So we, Chaos Studio has been around since about 2019. Chaos Engineering as a whole has been around for a much longer time.

03:33

I believe it was popularized in the software world in 2011 when Netflix introduced their kind of internal Chaos Monkey tool. So when you say Chaos Engineering, that's often what a lot of folks will think of is this Chaos Monkey tool that Netflix built that basically just, it was pretty simple at first and it went off and killed instances of virtual machines running in production.

04:08

This was, I believe, around when they were migrating to the cloud for the first time and they were hoping to test their resilience a little more effectively and started saying, hey, what if we just go and kill some instances of VMs in production and watch the results?

04:27

It's become a much more common practice since then with cloud providers like us, like Azure, offering Chaos as a service and to a much greater extent than just kind of killing individual VMs or compute instances and startups entering the space as well. Now obviously from a security standpoint, I mean, this is a security podcast with a major focus on our cloud platforms.

04:56

But if you look at Chaos Studio through a security lens, my guess, and correct me if I'm wrong here, but primarily we're focusing on availability and reliability and uptime and resilience and so on. So if you look at it from a security standpoint, you've got the classic CIA trifecta, confidentiality, integrity and availability. It sounds to me like Chaos Studio is really on the availability side.

05:24

And if you're building, say, threat models or designing systems using STRIDE, which is spoofing, tampering, repudiation, information disclosure, denial of service and elevation of privilege, it's the D, denial of service. Is that a fair comment? You're really focusing on the availability and reliability and mitigating denial of service issues?

05:44

I think that's a good comparison and a good assessment that we're focusing on those issues where systems might not be available or they may be behaving in strange ways. I come from a systems engineering background before coming into the software world. And we think about how systems, these complex systems that the world is made up of, as systems get more and more complex, there's not one way that you can describe them.

06:28

There's tons of relationships and feedback loops that make up these complex systems, whether it's societal or security systems or cloud reliability and cloud infrastructure. And they exhibit emergent behavior, which is the things you can't necessarily plan for, the behavior you can't plan for. So chaos testing, chaos engineering helps with some of those scenarios, whether it's in a security context or just sort of a cloud resilience context.

07:07

I think focusing on availability, focusing on those denial of service scenarios is a good place to draw the parallel. And I think within chaos engineering as a whole, there are scenarios we focus on that Chaos Studio can help with, like, okay, what happens if I'm experiencing a whole lot of resource pressure on my virtual machines? Or if a network connection is knocked out to certain IPs or certain ports, do I know what's going to happen to the rest of my system if that happens to occur?

08:02

Also, in thinking about this podcast episode, I was looking a little into the security chaos engineering discipline. I think there are a lot of parallels between the non-security chaos engineering and security chaos engineering. There's an O'Reilly book on security chaos engineering by Kelly Shortridge and Aaron Rinehart that I was looking at a little bit. And one thing that stuck out to me was a quote about how cybersecurity must embrace the reality that failure will happen.

08:46

And it kind of goes on to talk about how people are going to click on the wrong things and security mitigations will be accidentally disabled, things will break and are breaking all the time. And that definitely aligns with how we here, working on Chaos Studio, think about the world and recommend that folks test. I can also go a little into how our service works. So I mentioned we started kind of back in 2019-ish and we were in public preview for a few years.

09:29

And just recently at Microsoft Ignite in November, we brought Chaos Studio into general availability. But we've had quite a few customers using us in the public preview phase. So Azure Chaos Studio is a managed Azure service that works to measure and understand and build customers' resilience to different real-world outages. So like I talked about with chaos engineering as a whole, kind of being a way to test resilience by breaking things with fault injection.

10:14

Chaos Studio lets you do that for Azure services in a more integrated way by providing those connections to virtual machines, to Azure Kubernetes Service, to Key Vault, and providing various faults that can mess with those services or mess with kind of your configuration of those services. We have a couple of different ways that that can happen. So we have faults that are pretty straightforward and just talking to another service, making some API calls. Let's take virtual machines as an example.

10:58

If you're running a whole bunch of compute in Azure using virtual machines, virtual machine scale sets, you may not have tested how your application and your system as a whole behaves when some subset of those virtual machines goes down for some reason. Chaos Studio can help you do that by giving you the tools to set up an experiment, select and onboard all of those virtual machines that you might want to test and abruptly shut them down.

11:37

Now you may be thinking, you know, okay, just shutting down VMs. I can go into Azure portal and do that myself. The value of Chaos Studio comes into play by kind of orchestrating that scenario and that fault with other faults. So you may want to do that in sequence or in parallel with other actions happening. Maybe I want to know what's happening when all of the virtual machines in a certain zone are taken out and they're no longer available.

12:11

And I also, you know, in parallel I see a whole bunch of resource pressure on, you know, CPU or memory pressure on some other subset of my compute. And maybe also my, you know, Cosmos DB account is failing over between regions. So it's building up those more complex failure scenarios that is where Chaos Engineering kind of comes to the forefront and where Chaos Studio can really help. So I talked a little about those service direct faults where we're talking directly to other Azure services.
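The "in sequence or in parallel" orchestration described above can be sketched as a small data structure plus a runner. This is an illustration only, not the Chaos Studio API or its experiment schema: the step, branch, and action labels are made-up placeholders, and `fake_inject` merely records what would have been done. It assumes a model where steps run one after another, branches within a step run side by side, and actions within a branch run in order.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical experiment: steps run one after another; the branches inside a
# step run in parallel; the actions inside a branch run in order.
experiment = [
    {"step": "zone outage with pressure", "branches": [
        {"branch": "compute", "actions": ["shut down VMs in zone 1"]},
        {"branch": "pressure", "actions": ["apply CPU pressure", "apply memory pressure"]},
        {"branch": "data", "actions": ["fail over Cosmos DB account"]},
    ]},
    {"step": "recovery check", "branches": [
        {"branch": "verify", "actions": ["wait and observe health probes"]},
    ]},
]

def run_branch(step_name, branch, inject):
    """Run one branch's actions in order and return what was executed."""
    return [inject(step_name, branch["branch"], action) for action in branch["actions"]]

def run_experiment(steps, inject):
    """Run steps sequentially, fanning out each step's branches in parallel."""
    log = []
    for step in steps:
        with ThreadPoolExecutor(max_workers=len(step["branches"])) as pool:
            futures = [pool.submit(run_branch, step["step"], b, inject)
                       for b in step["branches"]]
            # Collect results in branch order once every branch in the step finishes.
            for future in futures:
                log.extend(future.result())
    return log

def fake_inject(step_name, branch_name, action):
    """Placeholder for a real fault; only records what would have happened."""
    return (step_name, branch_name, action)

for entry in run_experiment(experiment, fake_inject):
    print(entry)
```

The point of the structure is that the combined scenario (zone outage plus resource pressure plus database failover) is declared once and replayed the same way each time, which is hard to do by clicking through the portal.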

12:56

We have a Chaos Agent, which is, you know, a small piece of software that you can onboard to virtual machines and cause, you know, other issues within the virtual machine like that resource pressure or network disruption, network latency. Those can all be really important for just resilience scenarios or even security scenarios. I know you mentioned Key Vault earlier. You know, you were having some issues testing out some encryption with Key Vault infrastructure.

13:36

We've had a lot of customers use some Key Vault faults that basically deny access for a certain period of time to Key Vault or, you know, see what happens when you go ahead and update certificates or lose access to certain Key Vault instances. So definitely something that Chaos Studio can help with. And then internally, we also have a couple other methods of fault injection. We do perform chaos testing internally on some of the Azure infrastructure.

14:19

So we have some teams within Azure that, you know, work with us on and use our tooling to test, you know, what happens if this, you know, Azure infrastructure starts experiencing issues and are we able to deal with that from a resilience point of view. I know I went off on a bit of a tangent there, but I wanted to get a few of our fault types covered. You said about basically playing around with certificates, like rotating a certificate out. Can you do that?

14:52

So the Key Vault faults that we support are, yeah, so we have Key Vault access denial. So basically blocking all of the network access to a certain Key Vault for a period of time. There's disabling a certificate for a specified duration and then re-enabling it, incrementing a certificate version or just generally updating a certificate policy. That's what we support for Key Vault and that may be, you know, may be useful for security scenarios.

15:25

Yeah, yeah, because, you know, certificates can get rolled underneath you. So a couple of questions. First of all, if someone were to use Chaos Studio, obviously it's going to start causing all sorts of havoc in their environment. Does that mean that Azure needs to be aware, like someone within Microsoft or the Azure infrastructure or personnel needs to know that you're using this when all of a sudden, you know, alerts start going off and things start failing?

15:52

Or do you not need to do anything special if you're going to start using this? That's a great question. Yeah, so nothing special is needed.

16:00

The nice thing about Chaos Studio and, you know, one of the principles that we built it on was providing customers with the tools to do this controlled chaos. Especially from a customer perspective, we're not necessarily taking that approach that I mentioned with, you know, Netflix's initial foray into chaos engineering where they were just going off and shutting off random instances in production.

16:36

We take a little more controlled approach in that customers need to, you know, come to Azure, come to Chaos Studio, and they need to explicitly onboard the resources that they want to affect. So whether that's virtual machines or a Cosmos DB account or their, you know, Key Vault resource, a customer does need to explicitly onboard all of those resources into Chaos Studio. They also need to have the permissions to perform certain actions against those resources.

17:16

We are built around Azure Resource Manager, the role-based access control model that, you know, folks are familiar with within Azure, and everything goes through that RBAC model. That means we're not doing this random chaos, so you do need to be a little more intentional about it. But we, you know, we see that as a good thing that customers need to be, you know, intentional and planning out the scenarios that they want to cover.

17:53

That's an interesting point about planning the scenarios. I imagine in many organizations, people are not necessarily experts at chaos engineering. So if I was given a scenario, I don't know, some environment, let's just make it up. You know, it's a browser talking to, you know, an Azure app of some kind, say, an Azure Function that then talks to Azure SQL Database and Redis Cache and Key Vault. I mean, if I'm given an environment, I mean, I'm not necessarily going to know what things to do.

18:26

Does the tool help, like come up with experiments? We have a new feature, called templates, that just recently released around our GA timeframe as part of our general availability. Rather than being dropped into just a blank chaos experiment with no faults or actions pre-populated, we're giving a little, you know, a little quick start for customers to jump into certain common scenarios.

19:09

The two that we have within the templates interface right now are an Azure Active Directory outage for virtual machines and virtual machine scale sets and availability zone down where we abruptly shut down VM scale sets within a certain availability zone. So that helps a little bit.

19:32

We definitely are looking for, you know, now that we're GA, we will be ramping up, you know, the amount of samples that we provide for various, you know, various use cases and configurations, building out that template library and of course, you know, adding more faults in general to our library.

19:52

Of course, you know, over the long term, we will look into additional ways to, you know, provide more intelligent recommendations on what sort of scenarios to run, what sort of experiments to run and as well as, you know, other integrations across Azure. Can you, I mean, when you said there's an outage, can you like blip something so it just blips for a second or like goes offline for a split second and then comes back or is it really a lengthy bit of downtime?

20:23

It really depends on the fault and kind of the, you know, the scenario to cover. You can perform shorter duration network, like network faults, whether that's kind of disconnecting certain traffic or introducing packet loss and latency. We're looking into sort of other possible blips and pauses, but I think the network latency, disconnect, packet loss, that's probably all the, you know, the closest we can get.

21:00

One common scenario that we also recommend to customers is using our network security group rules fault to affect a broader range of services than, you know, than we have explicit faults for. So some people, you know, they'll come to Chaos Studio and see, you know, okay, you don't have any faults listed for say entirely disconnecting my like Cosmos DB instance, or you don't have any faults listed for SQL at this time.

21:35

And so what we can recommend to them is we have a fault that can create some network security group rules for a short time or for a specified time and do things like, okay, I want to block all of the traffic to a certain Azure service.

22:01

And it supports Azure service tags. So you can use those service tags to say, and I don't know if I'll remember the tag correctly, but you could say, you know, AzureCosmosDB.EastUS and, you know, handily all of the IPs associated with Cosmos DB in East US are covered by that service tag. You can pretty easily block all that traffic. So that's another method that we often recommend to customers.
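To make the service tag idea concrete, here is a hedged sketch of what a deny rule targeting a service tag looks like as data. The property names follow the general shape of an Azure network security group security rule; the rule name, the helper function, and the tag value (a tag along the lines of the one recalled in the conversation, where the speaker was unsure of the exact spelling) are assumptions for illustration. It does not show the Chaos Studio fault's own parameters and does not call any Azure API.

```python
def deny_service_tag_rule(name, service_tag, priority):
    """Build an outbound deny rule in the shape of an NSG security rule."""
    if not 100 <= priority <= 4096:
        raise ValueError("NSG rule priorities range from 100 to 4096")
    return {
        "name": name,
        "properties": {
            "access": "Deny",
            "direction": "Outbound",
            "protocol": "*",
            "priority": priority,
            "sourceAddressPrefix": "*",
            "sourcePortRange": "*",
            # A service tag stands in for every IP range Azure publishes for it.
            "destinationAddressPrefix": service_tag,
            "destinationPortRange": "*",
        },
    }

# Hypothetical rule name and tag value, used only for illustration.
rule = deny_service_tag_rule("chaos-block-cosmos", "AzureCosmosDB.EastUS", 100)
print(rule["properties"]["destinationAddressPrefix"])
```

Because the tag expands to the published address ranges for that service and region, one rule can approximate an outage of a service that has no dedicated fault, which is the workaround Rigel describes.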

22:36

Do you have any details about, like, what the most common little things people do are? Like, are there a top couple of things that everyone does as an experiment or part of an experiment? Yeah, that's a good question. I would say our most common scenarios that we see are virtual machine based. So whether it's shutting down virtual machines and virtual machine scale sets or using those agent faults that I mentioned on virtual machines, those are really common.

23:11

And then the other scenario that is quite popular is using our integration with AKS Chaos Mesh. So Chaos Mesh is an open source framework for Kubernetes chaos engineering. And it provides faults like network, you know, network disruption, pod kill and pod disruption, various stress faults, HTTP, you know, all the good stuff. And rather than reinvent the wheel, we built a way that customers can, you know, start Chaos Mesh faults from Chaos Studio.

23:56

And we have some tutorials on how to do this that, you know, I can share in the show notes. But that's another popular scenario. Kubernetes is obviously a very common part of many applications' infrastructure. So having some integration there with Chaos Mesh has been important for many customers. You know, it's interesting, I had a customer some years ago and they had a big web presence, retail presence. And by the way, I'm going somewhere with this story.

24:28

And they found that their usage for Azure Key Vault, their bill, was actually pretty high and they couldn't work out why. Well, the reason was every time they made a connection or someone made a connection to their website, they would go and hit up Key Vault to pull some data down. The problem with that, of course, is not only, you know, first of all, Key Vault is not really a transactional service at all.

24:48

You know, you're not supposed to be hitting it thousands of times a second, which is what they were doing. And then you end up getting timeouts, which Key Vault does by default. And so what they ended up doing was caching information for 30 minutes, so basically they're hitting Key Vault every 30 minutes asynchronously. And so not only did their Key Vault cost go down, their performance went up, but also their reliability went up, right?

25:12

Because they weren't so dependent on Key Vault being there thousands of times a second. This wasn't found through anything other than just someone saying, why do you do this? But I'm sure you see things like that, right? Where people are just, you know, they're on Chaos Studio and they say, wow, you know, why does our application go down because of that one thing? You know, that scenario happened and then they end up changing their design.
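The caching fix in that story can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the customer's code: `CachedSecret` and `fetch_from_store` are hypothetical names, the fetch function is a stand-in for a call to the secret store, and the refresh here happens lazily on access instead of on an asynchronous timer as described in the conversation.

```python
import time

class CachedSecret:
    """Cache a fetched value and refetch only after ttl_seconds have passed."""

    def __init__(self, fetch, ttl_seconds=30 * 60, clock=time.monotonic):
        self._fetch = fetch          # callable that hits the secret store
        self._ttl = ttl_seconds
        self._clock = clock          # injectable so tests can control time
        self._value = None
        self._expires_at = None

    def get(self):
        now = self._clock()
        if self._expires_at is None or now >= self._expires_at:
            try:
                self._value = self._fetch()
                self._expires_at = now + self._ttl
            except Exception:
                # If the store is unreachable, keep serving the last good value.
                if self._expires_at is None:
                    raise
        return self._value

calls = []
def fetch_from_store():
    """Hypothetical stand-in for a call to the secret store."""
    calls.append(1)
    return "connection-string"

fake_now = [0.0]
secret = CachedSecret(fetch_from_store, clock=lambda: fake_now[0])
for _ in range(1000):          # a burst of requests inside the TTL
    secret.get()
fake_now[0] = 30 * 60 + 1      # jump past the 30-minute window
secret.get()
print(len(calls))              # 2 fetches for 1001 reads
```

Serving the last good value when a refresh fails is the part that improves reliability: a brief outage of the store, or a fault injected against it, no longer takes every request down with it.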

25:38

So this is the question coming out of all this. So, I mean, what sort of changes do you see people make to their designs to make them more robust in the face of Chaos Studio or, you know, intermittent outages? So I think actually Key Vault is an example that I would mention here, too. We had some internal teams, one of the case studies listed actually on our product page on azure.microsoft.com.

26:09

We have a video that talks through some case studies, and one of those case studies is from an internal team who has done testing with some of these Key Vault faults. They also made an appearance alongside us at some conferences last year. So I can share some of the links to that in the notes as well. I don't recall exact details on kind of the, you know, the changes that they made to their infrastructure.

26:44

But they found issues relating to, you know, Key Vault and how they were treating certain failure scenarios and were able to get those remedied. The other scenario I believe we see is not handling virtual machine outages, you know, as expected. Basically, you know, the availability zone scenario has been quite important: testing what happens when all of the virtual machines and virtual machine scale sets in an availability zone are out and abruptly shut down.

27:28

We've had some internal teams go through that testing and see issues with their infrastructure, you know, handling that sort of case, and they have since been able to fix them. I think those are the main examples that I would draw on there. But, you know, it really varies since it's quite dependent on infrastructure and how your workload is set up and all that good stuff. So you mentioned earlier about essentially authorization policies.

28:00

You know, you don't want every Tom, Dick and Harry just running amok inside of people's subscriptions using Chaos Studio. So sort of at what level do you restrict who can do what using Chaos Studio? Yeah, that's a great question. So two aspects to this. One is, you know, permissions to use Chaos Studio. So we have, you know, role-based access control policies for being able to create chaos experiments, onboard resources as targets, start chaos experiments.

28:39

So, you know, you can control, do I want these people or these identities in my organization being able to actually even work with Azure Chaos Studio? The other component then is actually executing faults against resources. And so we have an identity, either a system-assigned managed identity or a user-assigned managed identity, that is attached to the chaos experiment. And that identity needs to have the proper permissions for each individual resource that we're targeting.

29:22

So if it's a virtual machine, you know, Virtual Machine Contributor access, and those are all listed out on kind of our product page and can be automatically assigned to the identity, assuming that you have the permissions and if you so desire. So that's kind of our role-based access control model, you know, restricting both sides: who can use Chaos Studio, but also how do you restrict what resources can actually be affected.

30:00

So on the flip side of that, does Chaos Studio work well in kind of isolated environments with, say, private endpoints, VNet injection, that kind of stuff? Yeah, so we have had some recent feature additions in this area. You can use some of our fault capabilities with private networking. So, for example, you can perform those Chaos Mesh faults against an AKS cluster that is private.

30:35

And we have a tutorial on how to do that in our documentation. And then we just recently added Private Link support for agent scenarios as well. So, you know, allowing our chaos agent to talk back to the experiment and the orchestration infrastructure while still, you know, staying secure within Private Link. All right. I think it's probably time to bring this episode to an end.

31:06

So, Rigel, one question we always ask our guests is if you had just one final thought to leave our listeners with, what would it be? My one final thought, I think, is think about how your systems fail, these complex systems that, you know, that we work with in the cloud. Do you know what emergent behavior you might see when unexpected outages happen? And then how can you go and test it? And test it using Chaos Studio, right?

31:38

Ideally, but, you know, we're all for chaos engineering as a discipline. We would love it if you came to use us. Yeah, we'll add some links to the show notes about chaos engineering just in general, because I think that's a really interesting area. So, yeah, we'll definitely go ahead and do that. Hey, look, Rigel, thank you so much for joining us this week. I always learn something on these podcast episodes, and this is absolutely no exception.

32:01

And I'm a huge fan of Chaos Studio specifically and chaos engineering just in general, because, you know, sometimes things don't go the way you expect. And it's always good to at least have a better idea of what might happen if things do go awry. And to all our listeners out there, thank you so much for joining us this week. We hope you found this episode of use. Stay safe and we'll see you next time. Thanks for listening to the Azure Security Podcast.

32:26

You can find show notes and other resources at our website, azsecuritypodcast.net. If you have any questions, please find us on Twitter at @AzureSecPod. Background music is from ccmixter.com and licensed under the Creative Commons license.

Transcript source: Provided by creator in RSS feed.
