SOC Metrics: Measuring Success and Preventing Burnout

Mar 30, 2021 | 50 min | Season 2, Ep. 11

Episode description


Looking for a new way to approach the difficult problem of measuring and improving your SOC? Check out this episode to hear how to use methods pioneered in the manufacturing and reliability industries to help you wrap your head around, and solve, this complex issue. You don't want to miss this episode with Jon Hencinski, Director of Operations at Expel, who covers all of this and more.

Our guest - Jon Hencinski
Jon Hencinski is the Director of Global Operations at Expel. In this role, he’s responsible for the day-to-day operations of Expel’s security operations center (SOC) and detection and response engineering. He oversees how Expel recruits, trains, and develops security analysts. Jon has over a decade of experience in the areas of SOC operations, threat detection, and incident response. Prior to Expel, Jon worked at FireEye, BAE Systems, and was an adjunct professor at The George Washington University.

Follow Jon
Twitter: @jhencinski
LinkedIn: /in/jonathanhencinski
Web: https://hencinski.medium.com

Support for the Blueprint podcast comes from the SANS Institute.

Since the debut of SEC450, we've always had students interested in a matching course covering the management and leadership aspects of running a SOC. If you like the topics in this podcast and would like to learn more about Blue Team leadership and management, check out the new MGT551: Building and Leading Security Operations Centers. This new course is designed for security team leaders looking to build, grow, and operate a security operations center with peak efficiency. It's a hands-on technical leadership course that takes you through everything from scoping threat groups to use case creation, threat hunting, planning, SOC maturity, and detection assessment, and much, much more.

Check out the course syllabus, labs and a free demo at sansurl.com/551

Follow SANS Cyber Defense: Twitter | LinkedIn | YouTube
Follow John Hubbard: Twitter | LinkedIn



Transcript

John Hubbard  00:00

Support for the Blueprint podcast comes from the SANS Institute. Since the debut of SEC450, we've always had students interested in a matching course covering the management and leadership aspects of running a SOC. If you like the topics in this podcast and would like to learn more about Blue Team leadership and management, check out the new MGT551: Building and Leading Security Operations Centers. This new course is designed for security team leaders looking to build, grow, and operate a security operations center with peak efficiency. It's a hands-on technical leadership course that takes you through everything from scoping threat groups to use case creation and organization, threat hunting, planning, SOC maturity, and detection assessment, and much, much more. Check out the course syllabus, labs, and a free demo at sansurl.com/551, and I hope to see you in class.

 

00:45

This is the Blueprint podcast, bringing you the latest in cyber defense and security operations from top Blue Team leaders. Blueprint is brought to you by the SANS Institute and is hosted by SANS certified instructor John Hubbard. And now, here's your host, John Hubbard.

 

John Hubbard  01:04

In this episode of Blueprint we have guest Jon Hencinski, Director of Operations at Expel. This episode, I think, is a real gem and points to the future of where security operations is going, both in terms of tools and in process. Jon's incredibly clear and well-researched thinking on defining what a SOC should be doing and how to measure it is both a breath of fresh air, in a job that sometimes seems impossible to measure and improve, and a guiding way forward for many of us. Stay tuned for an awesome conversation about measuring and improving your SOC using methods pioneered in the manufacturing and reliability engineering industries, how Jon uses data science to measure and improve security operations, and much, much more, today on the Blueprint podcast. Welcome to the podcast, everyone. Today we have Jon Hencinski, Director of Operations at Expel. I ran into Jon in some mutual discussions on Twitter, and I've been seeing his blog posts and other work throughout the years. You can tell he's a person that's definitely lived the real Blue Team life, has those battle scars, has kind of seen it all, and also has a really clear and interesting way of breaking down the process and the measurements for a SOC. I really like how he approaches this stuff, so I knew I wanted to have him on the podcast to talk about some of it. So welcome to the podcast, Jon.

 

Jon Hencinski  02:16

Thanks, John, great to be here.

 

John Hubbard  02:18

So to start off, give us a little bit of information about your background, your current role, and what your day-to-day work is like.

 

Jon Hencinski  02:24

Yeah, sure. So in my current role I'm Director of Operations at Expel. At Expel, we are an MDR, or SOC-as-a-service. What makes us just a little bit different is we don't actually ship an agent or deploy technology; we integrate with the tech that our customers have. So basically we connect up to that technology, consume the data, try to make sense of it, and really our value proposition is that we provide answers, not alerts. You've got all these vendor alerts you don't know what to do with, and with a mixture of technology, people, and process we try to make sense of it all and provide our customers answers. So that's Expel in a nutshell. My background: I've been in security operations for over a decade, so the easiest way to think about it is I know what a really good SOC looks like, and maybe a SOC that's, you know, not so great, if you will. One of the things that's near and dear to my heart is this notion of SOC analyst burnout and alert fatigue. One of my strong beliefs, and we'll probably chat about this, John, is that a SOC can be a great place to work, but you've got to do it right. If you do it wrong, it's not going to be a very great place to work, and there's going to be a lot of burnout as well. So I've worked in many SOCs, spent some time at FireEye doing managed defense, MDR, there as well. Most of my background is doing SOC work in both the public and private sectors.

 

John Hubbard  03:39

Awesome, yeah. I definitely want to touch on the burnout thing. Before we get into that, and leading into it, one of the things I was really interested in from what I've seen: you take a very forward-thinking, new-tools-and-technology, maybe also SRE-style thinking approach to security operations, which I think is really unique. I don't see a whole lot of people thinking about the problem like that. So I wanted to start off high level and just ask you: how do you conceptualize something as big as the SOC and say, this is what we're doing, these are the things that matter, and what do you think about measuring, just to start to wrap your head around that sort of problem?

 

Jon Hencinski  04:18

Yeah, that's a really good question. In terms of my approach, I start with strategy, and I'll use my day job as an example because that's the most relevant. I think about what we want to accomplish. We're integrating with all this technology, we're consuming at this point hundreds of millions of vendor alerts a day, and when I think about a really good security operations experience, either as a vendor or an internal security program, what does that mean? Well, you want to move quickly. There's a lot of latency sensitivity within security operations. What I mean by that is, hey, if you have a high severity alert for a web shell being accessed and it sits there for a couple of days, that's probably not where you want to be. And if you have an alert that sits in the queue for a long time and it takes you 12 to 24 hours to find it and figure out how to remediate, that's probably also not where you want to be. The way I also think about security operations is that it's a system, a living system. You've got work that enters the system, you've got latency sensitivity, but there's also this concept of throughput: how quickly can work move across the system? So when we think about SOC measurements, start with strategy and actually write it down. What I mean by that is, define what good looks like. When I think about SOC strategies, the number one thing I would recommend is to have a firm handle on capacity. What's capacity? Well, how many alerts are going to show up today, and how many analysts do you have to look at those things? Fun fact: if you don't know what your capacity is, you're likely oversubscribed and you're likely going to experience burnout. So the first thing I did when I changed from an IC, individual contributor, role to manager: we all work in Excel as investigators, timelining, but I put that down and said, okay, instead of using Excel to investigate activity, I'm going to write down my capacity model and figure out how much capacity I have available. The other thing that was part of my strategy was that I want to improve service level objectives. What's a service level objective? How long are alerts going to wait before we pick them up for first action? And the other thing I said, when we think about a world-class SOC or MDR, I have this line, it's kind of cheesy but kind of not: I want to respond faster than delivery pizza, 30 minutes or less to go from alert to fix. There are some other parts of that strategy, but we take that strategy, we say, hey team, this is where we're headed, and then we deploy a set of metrics and measurements to inform where we are in that journey. So we started with capacity. Get your capacity model in place, and you're going to be wrong, you're going to have no idea what you're doing at first, and that's totally okay; I'm still figuring it out as well. When you think about SLO times, ask yourself: when alerts hit our SIEM, or whatever platform we're using, how long do they wait on average? And when you think about your response times: how long does it take you as an organization or as a vendor to go from alert to fix? Start deploying those measurements that will inform where you are, and then when you review those metrics and measurements, you start to make adjustments to get closer to where you want to be. So that's super high level, but that was the approach: start with strategy, rally the team around that strategy, deploy a set of metrics and measurements to inform you where you are in the journey, and then be super transparent about that. It's totally fine if our alert-to-fix times aren't great right now; the starting point is figuring out where you are, and then talking about, as a team and organization, what it's going to take to actually get there. So that's super high level, John.

 

John Hubbard  07:38

Wonderful answer. There are a million places I'd love to go from there. The first thing that caught my eye, and chronologically makes sense, is capacity planning, right? There's a fixed pool of people that are working, and then there are these dials you can turn: how sensitive do I want my alerts to be, how many alerts are we going to turn on, and all of that. So how do you take all of those variables and say, do we have enough people? How do you factor all of that in and approach that problem?

 

Jon Hencinski  08:05

It's a fantastic question. So the first step in that capacity model: let's assume you're running a 24-by-7 SOC, and the first thing you're going to ask yourself is, what's the minimum number of people required to run a 24-by-7 thing? It's likely 12; it could be closer to eight if you really want to be lean and mean. So you've got 12 analysts, probably broken out by shifts, so let's assume for the sake of simplicity you have two to three people per shift. And fun fact, when you hire an analyst you don't actually get eight hours, even if they work an eight-hour day. I assume 70% loading for every particular analyst, and that means for every analyst I'm not going to get eight hours, I'm going to get like five, maybe four, because people take breaks and things like that; you're not going to sit right in front of the console every minute of every day. The other thing you can start to do is understand how many hours you have available today, and then ask yourself: how many alerts are we going to have over the next 12-to-24-hour period? And when those alerts hit the queue, how long is it going to take us to work those classes of alerts? This is where it gets interesting, and maybe I'm jumping ahead, but when we built this system at Expel, the way we built it was: an alert hits the queue, and we've all been there, we're in our SIEM, we're in our alert console, and what we like to do is optimize for triage. Basically, when an alert hits the queue, I actually don't want my analysts to have to jump into the SIEM, query logs, and rabbit-hole on this thing; I want to optimize for triage. But then there are other classes of alerts you maybe haven't optimized yet, and you're going to move those to an investigation and do a whole bunch of work to figure out, was this good or was it not. And then a percentage of those will actually make it to a fully declared incident, where you've got some remediation, and so on and so forth. So, bottom line, to get a starting point: understand how much capacity you have, and then start to break up the work that you do on a daily basis into alerts. What were the triage decisions we made, were there any alerts that required investigative effort, what did those look like, and what did you have to do? And finally, on a typical day, how many incidents do we see? It's probably one, maybe two; on most days there probably aren't security incidents if you're an internal SOC. Then start to build those into your capacity model, because then you can say: on a typical day, we've got, make up a number, 36 hours of analyst time; we'll look at 200 alerts, 70% will be triage, and those take about five minutes; we'll do about 10 investigations, which take us about an hour; and maybe one to two security incidents. And then you can start to project and think about where you're going as an organization as well. So I know that's super high level without having a Jupyter notebook or a spreadsheet up, but I'm doing the best I can here to lay it out.
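For readers who want to play with the capacity math Jon walks through here, a minimal sketch is below. The shift counts, loading factor, and per-class work estimates are the illustrative numbers from the conversation, not Expel's actual model.

```python
# Back-of-the-napkin SOC capacity model along the lines Jon describes.
# All numbers are illustrative placeholders, not Expel's real figures.

ANALYSTS_ON_SHIFT_TODAY = 9   # e.g. 3 analysts across 3 shifts
LOADING_FACTOR = 0.70         # you never get a full 8 hours per analyst

# Expected work for the next 24 hours, broken out by class of work:
#   class -> (expected count, minutes each)
expected_work = {
    "triage_alert":  (200, 5),
    "investigation": (10, 60),
    "incident":      (2, 120),
}

available_minutes = ANALYSTS_ON_SHIFT_TODAY * 8 * 60 * LOADING_FACTOR
demand_minutes = sum(count * minutes for count, minutes in expected_work.values())

utilization = demand_minutes / available_minutes
print(f"Available analyst-minutes: {available_minutes:.0f}")
print(f"Expected demand (minutes): {demand_minutes:.0f}")
print(f"Projected utilization:     {utilization:.0%}")
if utilization > 1:
    print("Oversubscribed: expect growing queues, missed SLOs, and burnout.")
```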

 

John Hubbard  10:42

Yeah, certainly. One of the things that came to mind in that answer: you said you assume about a 70% loading capability per person. With the whole site reliability mindset and all that sort of thing, I know there's what there should be and then what there is, right? In your mind, is there a goal, like we should spend X percent of time on reactive work versus improvements? And how do you balance that kind of thing?

 

Jon Hencinski  11:09

Oh, that's a great question. So, probably being too candid, but in my role as Director of Operations, sure, I'm measured based upon the quality of service, but my tasking is to not scale the organization literally with people. What I mean by that is, my job is to increase the throughput of the system. So let's think about how you actually measure that. I break up alerts into different classes: you've got commodity malware, you've got suspicious logins, you've got insert class of attack. And then what I'll do is say, hey, over the past month a typical suspicious login investigation took us like 31 minutes; what would it take to make that 20 minutes? I'm talking about using technology, not telling the SOC to just work faster, so I'm clear to the listeners. What we do is actually inspect alerts and investigations and say, okay, what were the steps that we took? What are the right steps we can take? And how do we hand that off to the platform? We call them robots; we've got this robot named Ruxie, and she does all the investigation for us. Over time what happens is you increase your throughput, and what we can do then is add more customers to the platform without actually having to increase the number of analysts on the team. So really, in a nutshell, I'm looking at throughput, balanced with latency: next month when we added a customer, did alerts wait longer, yes or no? Hopefully that's no. And when we think about throughput, how long does it take us to perform certain classes of investigations? Can I make those faster? Again, that's not "move faster"; that's, how do we hand off more of the rote work to the platform and do more automation to make it easier for folks?
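A rough sketch of the throughput view Jon describes: rank investigation classes by the analyst time they consume to find the best automation targets. The field names and records below are hypothetical, not Expel's schema.

```python
# Sketch: which classes of investigation are the best automation targets?
# Rank by total analyst time spent, since a long cycle time on a frequent
# class is where automation buys back the most throughput.
from collections import defaultdict
from statistics import median

investigations = [
    {"alert_class": "suspicious_login", "cycle_minutes": 31},
    {"alert_class": "suspicious_login", "cycle_minutes": 28},
    {"alert_class": "commodity_malware", "cycle_minutes": 12},
    {"alert_class": "bec", "cycle_minutes": 45},
]

by_class = defaultdict(list)
for inv in investigations:
    by_class[inv["alert_class"]].append(inv["cycle_minutes"])

for cls, times in sorted(by_class.items(), key=lambda kv: -sum(kv[1])):
    print(f"{cls:20s} n={len(times):3d} "
          f"median={median(times):5.1f} min  total={sum(times)} min")
```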

 

John Hubbard  12:46

So yeah, that's one of the things I'm always trying to figure out how to factor into capacity planning: grouping different types of alerts. It sounds like that's what you're doing, right? Saying these types of alerts on average take this much time, these other ones take X and Y, and whatever else, and then mashing that all together to come up with a best estimate for all of that.

 

Jon Hencinski  13:05

Yeah, and one of the other things that we do: if you're working for a vendor, one recommendation is to actually talk with folks in sales and ask, well, how many customers are we going to add? How big are they going to be? And based upon those projections, how much new work do you think is going to show up? One of my cheat codes is going to the sales organization and asking, what are our sales goals and targets, and what would that mean in terms of net new customers? Then in my capacity model it's not just, here's what's happening in March 2021; I go out six to nine, twelve months based upon projections of the business and say, here's how much work is likely to show up, and here's the level of automation and technology that likely needs to be in place for us to deliver a high quality service. And if you're an internal security program: hey, are we thinking about any acquisitions? Are we thinking about expansion? If we're planning to deploy additional agents or additional network appliances, more work will likely show up in your SIEM, and you should be able to account for those things as well. So I like to project out into the future, because basically, if I show up to work today and suddenly we're over capacity, it's already too late. It's already too late.
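A toy projection along the lines Jon describes, turning business growth assumptions into future alert volume. All growth numbers here are made up for illustration.

```python
# Sketch: project future alert volume from business growth assumptions so you
# know how much automation or hiring is needed *before* you're over capacity.

current_alerts_per_day = 200
alerts_per_new_customer_per_day = 15     # rough average, an assumption
projected_new_customers_per_month = 4    # from talking to sales, an assumption

for month in range(1, 13):
    projected = (current_alerts_per_day
                 + month * projected_new_customers_per_month
                 * alerts_per_new_customer_per_day)
    print(f"Month +{month:2d}: ~{projected} alerts/day expected")
```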

 

John Hubbard  14:08

Yep. So when it comes to capacity planning and SLOs and trying to respond as quickly as is necessary for a given organization, how do you approach knowing what fast enough is? If a customer comes to you and asks, what's the right amount of time we should take to address an alert; like you said, 30 minutes was the goal, but how did you come to that? Is that true for everyone? Or what might someone do to figure that out for themselves?

 

Jon Hencinski  14:35

I think it's a great question. I looked at that and said, okay, let's start: 30 minutes is a good marker. We have this expression, Kentucky windage; it felt right. And I said, you know, if I go to a customer and say, anytime we identify an incident our goal is to go from alert to fix in under 30 minutes, I think a customer would say that sounds good. The other thing is, I don't want to say it puts pressure on the team, but you have to aspire to hit that. And what about five minutes, how do you do that? Well, when you think about a typical organization, I don't handle every incident the same. What I mean by that is, I respond differently to targeted attacks than to commodity malware on one particular system, or business email compromise involved in one attack. If I had to prioritize, I'm going to fix the internet-facing Exchange server with the web shell, where someone's running a post-exploitation program, before the "let me reset that password" case. That's real talk. So hopefully I want to get to a world where a critical alert, where it's a targeted, hands-on-keyboard style attack, enters the system, we've got automation to pick it up and auto-contain, and then we can make a call as to whether or not to release it from containment. I want to push our organization that way as well. My personal opinion is, listen, I don't want to knock critical systems offline for the sake of doing it, but I think sometimes, based upon my personal experiences, we're so worried about being really, really right before we take mitigating action. When we think about security, it's managing and maintaining an acceptable level of risk; sometimes the right action is to contain that box, and then we can learn more about it. So bottom line, to answer your question: 30 minutes was what felt right. I think for most organizations the right question to ask isn't where do we want to get to; the first question I'd ask is, where are we right now? And the answer, which informs your next steps, might be: we don't even know. And that's okay. The real value I bring to my day job is asking the questions, just saying, hey, what would it take for us to do alert to fix in five minutes? What would that look like? And sometimes the best thing you can do as a security leader is say, okay, before we ask those questions, how do things work today? What are our SLOs? How long does it take us to go from alert to fix? Or even more basic than that: hey, when's the last time we spotted commodity malware? Do we have a business email compromise problem? What's out there about us on the public internet? Then you can start to use those questions to inform next steps, which will develop a strategy and measurements, and now you're in the playbook of what I do every day.

 

John Hubbard  17:01

Yeah, yeah, that's awesome. In terms of breaking down these measurements for SLOs and everything, you just said, do you know where you are now? Because at least knowing where you are tells you the direction you want to go in, right? Shorter is generally always going to be better. But the way I like to think about the whole process is to decompose it into as many bits as possible and try to measure the individual bits. We're talking about alert triage time, generally, but I've seen in some of your posts in the past, well, someone's post from Expel, I don't remember if it was yours or someone else's, mention time to acknowledge the alert, time to take a remediation action, and all of those things. What are the key steps within that whole process that you're looking at the times for?

 

Jon Hencinski  17:42

The first part is alert latency. The way our system works is we integrate with all these different devices, EDR, network, SIEM, we've got GuardDuty, all these cloud events, GCP, Azure. We consume those events, normalize them, run them through a detection engine, do some enrichment, and make a call as to whether or not we need human expertise. So the first measurement is: when an alert hits our analyst console, how long does it wait? We've got four severities today: critical, high, medium, and low. When a critical alert lands in the queue, the SLO is that I want that first action to be under five minutes, which is pretty aggressive, and from there it gets less sensitive: five minutes, 15 minutes, two hours, six hours. So I'll look at SLOs first. This is where it gets a little bit interesting: in our system, when you're looking at an alert, my personal opinion is there aren't two answers, is this bad or not; there are actually three. Yes, this is evil; no, it's not; and the third is, I don't know. That's when we move the alert to an investigation in our world. What happens there is you're taken out of triage and put into this new UI, and in that UI you record your investigative actions: okay, I've got an EDR alert for a suspicious process, I'm going to look for other process activity on that same system or query enterprise-wide, or maybe I'm going to jump into my SIEM and do this and do that. All of those things are recorded within the investigation. So what I'm looking at within investigations is cycle times. What's cycle time? When we moved that EDR alert to an investigation, how long did it take us to arrive at the conclusion of yes, this is a thing, or no, I know enough about this to make a call. So I've got SLOs, how long did it wait, and then cycle times, how long does it take us to perform these classes of investigation. Now, if that investigation becomes an incident, we toggle the investigation to an incident, and now I'm measuring how long it takes us to scope it: is it just on this system or other systems, is the attacker logging into one mailbox or several? And I'm measuring incident response cycle times as well.

So basically, the easiest way to think about it: we've got latency sensitivity, how long does work wait; cycle times, when we start the work, how long does it take us to do it; and that's a proxy to get throughput, how quickly we can process work across the system. And again, just for listeners, this is not me saying, hey, we need to move faster when we jump into the SIEM, we need to do this and that. No. When we're looking at the steps we're taking, I'm asking myself, hey, what can we automate? Because you know what computers are really good at? Taking a list of steps and executing them the way you intended. And then what happens over time, John, is that my investigation cycle times, that line, instead of going up and to the right, goes down, because you're becoming more efficient and more effective. I'm really passionate about this, because the byproduct of becoming more efficient and effective is that you're decreasing cognitive loading on your SOC analysts; they're focused on making the decision rather than, oh god, I've got to go do this and do that just to have an informed answer. Now you're optimizing the team for two things: making really good decisions, and interacting with either your internal stakeholders, if you're an internal SOC, or your customers, if you're a vendor. So apologies for the long-winded answer; this is something I'm deeply passionate about. On the back end we do a lot with data science. I think one of the smart things I did when I stepped into a management or leadership position was embracing operations management, learning things like time series analysis and change point detection. These aren't fancy things; they're just what's in your tool bag. When you're an investigator, you've got EnCase and FTK and maybe some PCAP, but as a manager I've got data science, I've got Jupyter notebooks, and I've got all these techniques to understand what's happening in my system. You get to apply that investigative mindset, but instead of looking for bad guys, I'm looking at ways to optimize the system, and I love it. So, you know, sorry to go off on a tangent there, it's just something I care about.
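A minimal sketch of the latency and cycle-time measurements Jon lays out, assuming hypothetical timestamp fields; the per-severity SLO targets mirror the ones he mentions, not any particular platform's defaults.

```python
# Sketch: alert wait time versus per-severity SLO, plus cycle time.
# Field names are assumptions about what an alert record might contain.
from datetime import datetime, timedelta

SLO_BY_SEVERITY = {   # time-to-first-action targets from the conversation
    "critical": timedelta(minutes=5),
    "high": timedelta(minutes=15),
    "medium": timedelta(hours=2),
    "low": timedelta(hours=6),
}

alerts = [
    {"severity": "critical",
     "created": datetime(2021, 3, 1, 10, 0),
     "first_action": datetime(2021, 3, 1, 10, 3),
     "closed": datetime(2021, 3, 1, 10, 40)},
]

for a in alerts:
    wait = a["first_action"] - a["created"]   # latency: how long did it sit?
    cycle = a["closed"] - a["first_action"]   # cycle time: how long to work it?
    met = wait <= SLO_BY_SEVERITY[a["severity"]]
    print(f"{a['severity']:8s} wait={wait} cycle={cycle} SLO met={met}")
```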

 

John Hubbard  21:58

Yeah, it's amazing. I'm a geek about that stuff too, so I love it. In terms of change point detection, for example, let's pull that thread. What is it, and how can SOCs look to use it to make things better and detect changes that are important, that kind of stuff?

 

Jon Hencinski  22:12

Yeah, so I've got this saying I want to put on a t-shirt: when you want to uplevel a junior security analyst, what do you do? You sit them next to a senior analyst, right? Well, if you want to uplevel your management team, sit your security operations manager next to a data scientist. One of the benefits of my day job is that I sit next to a data scientist, and I think the business did it intentionally; they literally sat us next to each other. I guarantee they did it, or maybe I said I wanted to sit there because this person knows a lot more than I do in this area, but I think it was intentional. One of the things I said to the data scientist, Elizabeth Weber, was: hey Elizabeth, I'm looking at general alert trends in our SOC, and I'm looking at a time series, just a visualization of how many alerts we're sending to the team over time, summarized day over day. And she said, hey Jon, have you heard of this thing called change point detection, or change point analysis? I'm like, no, I have no idea what that is. At a high level, it allows you to determine changes in the mean number of alerts you're sending to the team over time. Basically, it's a fancy way of saying: hey, when I look back seven or 14 days, did we experience a statistically significant shift in the number of alerts we're sending to the team? It might go up, it might go down. So change point detection allows me to look at: okay, did the daily mean number of Expel alerts, excuse me, alerts we're sending to the team for review, go up or down? If it goes up, why? What happened there? Did we onboard a new customer? Did we integrate with new technology? If it goes down, we want to be able to answer that too. So change point analysis, in the context of alerts, allows you to answer: did the daily mean number of alerts you're sending to your team go up or down? And taking it a step further, when you think about moving alerts to investigations, we do the same thing with investigations: did the mean number of investigations we performed go up or down over a given period of time? So that's a quick, easy way to use change point detection: determining whether the daily mean number of alerts you're sending to the team changed.
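A small example of change point detection on a daily alert-count series, using the open-source ruptures library as one possible tool (the episode doesn't say which implementation Expel uses); the data below is synthetic.

```python
# Sketch: detect a shift in the daily mean alert count with ruptures.
import numpy as np
import ruptures as rpt

rng = np.random.default_rng(7)
# ~200 alerts/day for 30 days, then a shift to ~280/day (e.g. a new customer
# onboarded or a runaway signature)
daily_alerts = np.concatenate([
    rng.normal(200, 15, 30),
    rng.normal(280, 15, 14),
])

algo = rpt.Pelt(model="l2", min_size=5).fit(daily_alerts)
breakpoints = algo.predict(pen=2000)   # penalty controls sensitivity

print("Detected change points at day indexes:", breakpoints[:-1])
for start, end in zip([0] + breakpoints[:-1], breakpoints):
    segment_mean = daily_alerts[start:end].mean()
    print(f"Days {start:2d}-{end - 1:2d}: mean ~ {segment_mean:.0f} alerts/day")
```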

 

John Hubbard  24:19

Say it goes up, right? There's a ton of variables that could have caused that: more attacks just happening that week, a change in signatures, something else changing. How do you narrow it down, and is there an automatic or easy way to identify what caused that change?

 

Jon Hencinski  24:32

The thing is, with my management team it's, hey, we just detected a change in the daily mean number of alerts we're sending to the team; we have to get after it and answer where that variance came from. Most often it's a runaway or bad signature or update: a vendor releases a new rule, we've all been there, things run away from us a bit; that's going to happen. The other thing, working for a vendor, is: did we onboard any new customers, are we consuming a lot more data, and what will that mean? And then of course, things like the weeks we've had these past couple of weeks, where things have been busy and there are a lot of exploits out there, so there's an explainable cause. But typically it's: did we have a bad set of signatures or a bad signature, are we onboarding more customers, or are things just in a really bad state right now because there are a lot of exploits out there, so on and so forth. And then corrective action would be, okay, do we suppress or filter or dial in those runaway signatures? When we look at new customers, or new organizations or satellite offices we've onboarded, can we dial them in? How do we tune them in a way where we're looking at the things that matter and not just being flooded with false positives overnight? So, great question, because it's not just spotting that there is a change point; the real value is in what you do about it, right?

 

John Hubbard  25:48

Yeah, yep. One of the things I locked onto there: a bad signature. How do you define a bad signature? And the bigger question, I guess, is: looking at a list of thousands of signatures, when you look week over week at all the signatures that did fire, which ones are you focusing on to tune? Is it the percentage of times they're wrong, the count of times they're wrong? There's a whole bunch of ways you can approach that, so how do you take on that problem?

 

Jon Hencinski  26:11

I think we're all trying to solve this, and we all come at it in very different ways, which is totally fine. Right now I focus on, probably doesn't surprise you, the basics. We use a third-party technology called Datadog that does some platform monitoring, think SRE-type things, and what I'll do is fire a notification to the team anytime a given rule or signature exceeds an acceptable threshold. What does that look like in practice? Oh, it's topical: yesterday, I think the Qualys agent started scanning for some post-exploitation things on Microsoft Exchange, and CrowdStrike Falcon exploded, meaning tons of alerts across multiple customers. So we'll fire a notification, someone within our detection and response engineering function will spot it and say, we've got to dial this in really, really quickly. That's more of the day-to-day tactical operations. Week over week, what we look at is probably what you'd expect: give me my top talkers by vendor, by signature, and then we'll do some things to dial those in. But one of the things we built into our platform, which is really helpful, is this concept we call a tuning queue. Imagine you're integrating with all this technology; another question to be asking is, how do you account for when a vendor releases new signatures? What do you do with those? Do you have to write new rules? We actually have a piece of technology that allows us to say: anytime an alert fires in a customer environment for the first time, send it to a SOC analyst for review to answer, one, is this evil or not, and two, do we want to put this into production. So we keep looking at those. We made a good call a number of years ago with this tuning, first-seen concept: anytime a new vendor alert fires in an organization for the first time, we send it up as an alert to say, okay, we have to look at this. Is it evil or not? And if it's not, let's make a call as to whether or not we should put it into a production severity, meaning we'd be putting it in front of the SOC team again and again.
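A bare-bones version of the runaway-signature threshold idea, written in plain Python rather than a monitoring product like Datadog; the threshold and record fields are assumptions for illustration.

```python
# Sketch: a simple "runaway signature" monitor. Count alerts per
# (vendor, signature) over a window and flag anything past a threshold.
from collections import Counter

RUNAWAY_THRESHOLD = 500   # alerts per signature per day, an assumption

alerts_today = [
    {"vendor": "crowdstrike", "signature": "exchange_post_exploitation"},
    {"vendor": "palo_alto", "signature": "new_exploit_sig"},
    # ... hundreds more in real life
]

counts = Counter((a["vendor"], a["signature"]) for a in alerts_today)

print("Top talkers:")
for (vendor, sig), n in counts.most_common(10):
    flag = "  <-- RUNAWAY, tune or suppress" if n > RUNAWAY_THRESHOLD else ""
    print(f"{vendor:12s} {sig:35s} {n:6d}{flag}")
```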

 

John Hubbard  28:11

The tuning queue thing, that's a really interesting concept. Could you go a little further into what else goes in there, other than things that are first-time seen and all that, which is obviously an interesting factor? What else would get something put in that list and prioritized?

 

Jon Hencinski  28:23

That's mainly what goes in there for now, but you can imagine there's a good amount of alerts that fire on a typical day for that. On a typical day we'll get anywhere from 25 to 50 tuning alerts, because Carbon Black Response brought in a new threat feed, or Signal Sciences released a new signature, or Palo Alto Networks just released those exploit vulnerability signatures for the latest MS Exchange activity; they'll all fire there. So it's a catch-all: if we don't already have logic built into our detection engine, it's a catch-all way to make sure that we don't miss. The easiest way to think about it is that it's a fail-open concept, which gives us a lot of flexibility. So if you're working with a customer and they ask, how do you know you're not missing things? Well, we've got this fail-open concept built into our platform.
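A sketch of the first-seen, fail-open routing Jon describes; the storage, field names, and queue names are placeholders, not Expel's implementation.

```python
# Sketch: any vendor signature firing in a given environment for the first
# time gets routed to the tuning queue for an analyst to decide (1) is it
# evil and (2) should it become a production severity.
seen = set()   # in real life this would be persisted, not held in memory

def route(alert: dict) -> str:
    """Return the queue this alert should land in."""
    key = (alert["customer"], alert["vendor"], alert["signature"])
    if key not in seen:
        seen.add(key)
        return "tuning_queue"   # fail open: never silently drop the unknown
    # Already reviewed once: only promoted signatures reach the SOC again.
    return "production_queue" if alert.get("promoted") else "suppressed"

print(route({"customer": "acme", "vendor": "carbonblack",
             "signature": "new_threat_feed_hit"}))          # -> tuning_queue
print(route({"customer": "acme", "vendor": "carbonblack",
             "signature": "new_threat_feed_hit"}))          # -> suppressed
```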

 

John Hubbard  29:05

One of the other things you mentioned was Jupyter notebooks; I wanted to touch on those as well. That's one of those technologies, I guess would be the word, that I've been eyeing and saying, this is obviously a big thing coming, but I don't know where exactly it fits into security, although I hope it turns into use case management and threat hunting. It sounds like you're already doing that, from blog posts I've seen in the past. I'd be curious to know how, and how effective, you've found using Jupyter notebooks to manage that kind of data. And I guess we should probably start with: how are you doing it, and what's going on there?

 

Jon Hencinski  29:37

Yeah, one of my other strong beliefs is that management today is living in the world of Excel and spreadsheets. I also believe that five to ten years from now, most managers will be living in Jupyter notebooks, or whatever notebook tool. We use Jupyter notebooks in many different ways, but I'll talk about the security operations management perspective first; there are also some threat hunting aspects as well. In our platform we said early on that we've got to instrument really good APIs, because that's what allows us to read and consume data. So every week, on Thursday afternoon, I'll hop into my Jupyter notebook and run my set of operational metrics. That's how I actually calculate alert wait times, measure SLO times, and calculate investigation cycle times. Basically, we use a Jupyter notebook that reads a set of APIs from our analyst platform, and that's where we create a lot of our operational metrics. It does the calculation, does the presentation, and we use it for reporting and to drive strategic change. So a lot of our operations management is built around Jupyter notebooks today.

 

John Hubbard  30:43

Very cool. With the ones that you've used for threat hunting, how have you found managing those? One of the things I was thinking about: I'm hoping the future goes in that direction too, because I've seen the power of Jupyter in other use cases, but I don't see a ton of it in security except on the leading edge of where people are going. Python, right, is not, at least historically, a primary skill for a lot of SOC analysts; people have been more focused on, I need to know how to read PCAPs and log files and whatever. If you know Python, that's awesome, but not everyone knows it. Has that been any kind of hindrance? Are you teaching people Python, or I guess you could use anything, but I'm guessing you're using Python. How has that played out for you?

 

Jon Hencinski  31:24

We actually have an analytics engineer that reports directly to me, so I can work with that person and say, hey, I've got an interesting concept of a thing we've got to measure, can you go put it together? And on the team we have a lot of folks that know object-oriented programming, so I've got a backup. But my strategy there was, let me hire someone specifically who can own this particular capability, and if you're a SOC manager out there listening, I think if you can make that investment, it's well worth it, because then you can say, hey, I've got an idea for a thing I'd like to measure, what would that look like? How it works today is I get with our analytics engineer and say, hey, I've got an interesting idea; if we measured this, it may lead to some new learnings about this thing. Then they go out and build it for me, and then we look at the data. So the strategy I've used is to hire someone specifically who's got that background and experience.

 

John Hubbard  32:09

Gotcha. One of the other things I wanted to touch on was the interplay of investigation quality and speed, and how they're kind of opposing forces. You already touched on wanting to speed up the things we can speed up responsibly, but for the stuff that people are doing manually, the work that's still a human-driven analysis task, is there a way that you've found to measure quality, and how do you do it?

 

Jon Hencinski  32:36

Measuring quality in a SOC: I'll give you some background in terms of how we got to where we are now. Early on in our journey at Expel we said, okay, we've got to make sure that as we scale, quality comes along with us; in fact, as we scale, we want quality to improve. My boss at the time, Matt Peters, now the Chief Product Officer, said, Jon, you're going to build a quality program in the SOC, and I'm going to give you some guiding principles. Let me make sure I get them right. He said: Jon, when you measure quality in the SOC, we're going to sample, and the sample has to be representative of the population; that's fancy speak for saying that whatever we sample must be representative of the actual thing we're measuring. The measurements of the sample need to be accurate and precise, and the metrics we produce need to be digestible. He's like, Jon, get after it. At first I said, I have no idea what he's talking about. Okay, thanks man, I appreciate the help. What I had to do was actually learn what quality control was, so I was looking at ISO standards from the manufacturing industry. Here's how I think about it today: you've got quality, and you've got two buckets. You've got quality assurance, which is all the things you do as part of your process to make sure that when the thing leaves the door, whether you're handing the deliverable to your executive or interacting with the customer, you're taking steps to check that the thing is of high quality, accurate and precise, so on and so forth. And then you've got quality control, which was new to me. Quality control is basically: what are we doing to randomly inspect all the things we've already made decisions on, the alerts we've already triaged, the investigations we've already completed, the incidents we've already closed, to ensure that we did the thing we thought we were going to do? I said, okay, we've got a lot of quality assurance already built into our process today, meaning before anything goes out to the customer we do some peer review, I've got a bunch of monitors and alerts that go off when certain events happen, we talked about those threshold monitors, blah blah blah. But then I said, oh man, we're actually not doing anything quality control wise, because we're not going back in time to inspect the work we've already done. I got to that point, did a whole bunch of research, and landed on these ISO standards in the manufacturing industry called acceptable quality limits. Basically, imagine you're a manufacturing organization and you ship widgets, let's say 500 a day. AQL, or that particular ISO, is going to say: okay, if you're shipping 500 widgets, you'd better look at like 25 to make sure the quality is good. And when you look at the quality of those things, you have a check sheet, and you're just checking items off. The analogy is when you go get your state inspection: they pull up a checklist and ask, does this have seatbelts, do the brake lights work, so on and so forth. I said, that's an interesting concept; what would that look like within the context of security operations?

So we built a Jupyter notebook, surprise, and we said, okay, we're going to use acceptable quality limits from the manufacturing industry and, to go full circle on you, we're going to use change point detection to determine how many Expel alerts we look at on a typical day and how many investigations we work. Those are the inputs I put into my AQL table, which tells me how many things we should be inspecting. Then when I look at my alerts, I take them through a check sheet, the investigations the same thing, and I count the number of defects. The number of defects, trended over time, is the metric we send back to the organization. So then I've got a nice time series of the number of defects in my SOC over time. If you're wondering about any interesting observations, it's probably what you expect, John: when I have new analysts join the team, my defect count ticks up, but once they've been on the team for a little while, quality gets back under control. That's how we think about quality today. Where I want to get to, and this goes back to investigation quality: take a suspicious login investigation, like we talked about. When you investigate a suspicious login, I bet it's comprised of, I looked at the user's previous login activity, I looked at their role, and do they use a VPN? I want to be able to say: okay, Jupyter, go out and grab me investigations for suspicious logins that don't include certain actions, because now you're not meeting my quality mark, you're skipping steps. I can programmatically say: hey, Jupyter, we did 30 investigations yesterday, give me seven, and of those, give me the suspicious logins where we didn't do these sets of things, because now you've got the makings of something that could be poor in quality. The other bit of intelligence we'll put into the notebook is that I also want to grab alerts or investigations from folks that are newer on the team, because they just don't have the experience. Again, that's not to discredit what they're doing. If you go get your car worked on, you probably want the person working on it to have been there for longer than a year, not the person whose first day it is working on your brakes, so on and so forth. So we'll factor analyst tenure into our inspection, not because we're looking to discredit anyone, but because, hey, you're new here, we want to make sure we're here to help. That's the story of quality control in our SOC today. It's been a journey.
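A sketch of what that quality-control sampling could look like in a notebook, assuming an illustrative sample size, check-sheet steps, and record fields; a real program would read the sample size off an AQL table for the day's lot size.

```python
# Sketch: sample yesterday's completed investigations for QC, biasing the
# pull toward newer analysts and records that skipped expected steps, then
# score each against a check sheet and count defects.
import random

EXPECTED_STEPS = {"prior_login_history", "user_role_check", "vpn_check"}

completed = [
    {"id": 1, "type": "suspicious_login", "analyst_tenure_days": 20,
     "steps": {"prior_login_history"}},
    {"id": 2, "type": "suspicious_login", "analyst_tenure_days": 400,
     "steps": {"prior_login_history", "user_role_check", "vpn_check"}},
    # ... the rest of yesterday's investigations
]

SAMPLE_SIZE = 2   # in practice, read this from an AQL table for your lot size

# Prioritize likely-risky records: new analysts or missing steps.
risky = [c for c in completed
         if c["analyst_tenure_days"] < 90 or not EXPECTED_STEPS <= c["steps"]]
rest = [c for c in completed if c not in risky]
sample = (risky + random.sample(rest, len(rest)))[:SAMPLE_SIZE]

defects = 0
for item in sample:
    missing = EXPECTED_STEPS - item["steps"]
    if missing:
        defects += 1
        print(f"Investigation {item['id']}: defect, missing {sorted(missing)}")

print(f"Defect rate in sample: {defects}/{len(sample)}")
```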

 

John Hubbard  37:43

Very cool. I love that kind of cross-pollination, or whatever you want to call it, a cross-discipline approach of asking, who else has this problem and how do they approach it, and then taking that manufacturing knowledge. I have engineering in my background, so that speaks to me as well; it's a really cool approach. One of the things you mentioned about defects is looking for the lack of an action having been taken. Is that to say that when you're measuring defects, you're looking for a skipped playbook step or something like that? And if so, is that something where you could just say, well, you have to do these steps, otherwise you can't close the alert? Also, what other types of defects would you be looking for? Is there anything else?

 

Jon Hencinski  38:24

Yeah, I think the first obvious one is, are we skipping steps? When we see that, we'll count it as a defect. But again, we're not counting defects for the sake of counting them; we're asking ourselves, does it make sense to skip this step? Can we use the defect count to inform an evolution in how we investigate this activity? So that's important. The other things we look for are simple things like remediation actions. When we think about incidents, did we provide the right remediation actions to the organization? That's super important: I've got a system infected with commodity malware, and I told you to contain the host, but maybe I didn't tell you to block the command and control IP address and the domains. Now we're not doing all the things we need to do to truly mitigate risk. It really comes down to investigative steps and investigative processes, are we following process, and then the quality of what we're doing: do the outputs make sense? Meaning, when we identified the root cause of that incident, was it the actual reason, or did we just populate anything to check that box? Was it an infected USB drive, was it a macro-enabled workbook? Those details really matter, so we're inspecting for that as well. But we're also inspecting for signs that someone just isn't getting it, and not because we're trying to call anyone out; we're looking for training opportunities, if that makes sense. Like, hey, I was looking at these investigative actions, and it really doesn't seem like we've got a good grasp on this in the SOC; can we launch a training on AWS GuardDuty, for example? So we're always also looking for opportunities to uplevel the team through those quality checks.

 

John Hubbard  40:02

One of the other topics you mentioned early on that I wanted to make sure we hit before we run out of time was burnout. I teach classes to students from SOCs all over the world, and I hear a whole variety of experiences, from "I love it, it's the greatest" to "my job's terrible." Burnout is a very real thing for everyone working in a SOC; we've all seen that kind of thing and the potential for it. In your experience, what are some of the drivers of burnout, and how does your way of thinking here, looking at quality and automation and all of that, address some of those drivers and hopefully eliminate them as much as possible?

 

Jon Hencinski  40:35

Yeah. When I think about the folks we're looking to bring onto our team, first off, at the more entry level, I value traits way more than skills. Traits are who you are; skills are what you know how to do. For the folks we hire, the traits I look for are curiosity, candor, passion for learning, and notice I didn't say passion for security, a passion for learning, and the capacity for knowledge. What that gets us is folks that can come in and learn things very fast; they're curious about all those things, so on and so forth. To prevent analyst burnout within the team that I run, it's really funny, I believe our service quality is commensurate with the quality of the SOC analyst role. Meaning, if you're showing up to work and it's, this is legit, I look at alerts that matter, and the things I'm looking at actually change because we've solved that and now we're going to go focus on this other thing, I think that's where you want to be. If I had to categorize burnout, I'd say: you show up every day, you're doing the same thing, management doesn't know how bad it is, we don't embrace technology, it's just the same thing over and over again, and nobody hears me. As a result, the burnout is: I'm not learning, I'm not growing, and I'm looking around, ready to go somewhere else where I can learn. At least that's been my experience. So the way I designed the SOC we have today: we think about the business objectives, I need to scale this thing, we need to improve investigation cycle times, and to do that, the job is always changing a little bit. Today you may be handling commodity malware, but we're going to automate that, so next your emphasis is going to be on targeted attacks; we've got to mitigate things early on with ransomware, then we've automated that and taken care of it. As a result, you look back and reflect and say, wow, I'm growing and I'm learning and I like it here. I think burnout is just doing the same thing over and over and over again, and you're like, hey, I got into security because I love to learn, and I don't feel like I'm learning. That's where I think a lot of the burnout I've heard about, or even experienced earlier in my career, came down to: I'm not learning and growing. If we're doing our jobs as security leaders, we're evolving how we operate again and again and again. I even look back at how we thought about the security analyst role at Expel in, like, 2016, and it's very different. Alerts would hit the queue and it'd be, okay, what do we do? Now it's literally: alert hits the queue, robot picks it up, do I need to do anything, okay, blah blah blah, and then how do I write some additional workflows to program said robots? I think if we're constantly improving every single day, we're doing our jobs, and when we do that, the security analysts are going to be really happy, if that makes sense.

 

John Hubbard  43:16

Yeah, absolutely. I think I would have answered that question almost identically. I've got some presentations out there from the past couple of years where I talk about variety; that's the big word I always try to ingrain in people's heads. When I explain my experience and my way of eliminating burnout, it's: if you're doing the same thing every day and the things you're doing are automatable, you probably hate your job; if not, then you probably come into work excited about doing something new every day, which I assume most people like.

 

Jon Hencinski  43:42

Yeah, and listen, I think there's real value in learning the manual steps. Oh, this alert fired, we're going to go grab that data; I'll do it manually once, maybe twice, but if I'm taking those steps every time, let's automate it. And again, that principle of asking the question: what would it take for us not to have to involve a human in this? Let's talk about that. Now we're freeing up mental capacity to focus on the next problem, and that's where people get really excited; at least that's what gets me excited as well.

 

John Hubbard  44:13

Yeah, as they say, if you've done it twice you should automate it. It's a tongue-in-cheek thing, but there's some truth to it: as soon as you find yourself thinking, I could have automated this, you start to hate that task, and it grows into this kind of resentment of doing that thing, and your job, and everything, and it just kind of poisons stuff. So yeah, automation is obviously a huge, huge play in there. To wrap it up, final question: is there any particular topic, new technology, or question you're struggling with right now that's the next thing in security and the future of where this discipline is going?

 

Jon Hencinski  44:51

I don't want to say I'm struggling, but the next big challenge I've got to break through, and it may be specific to the organization I'm working in, is complexity, and managing it, because we integrate with a lot of different technology. I'm trying to make sure we're reducing complexity, and when we reduce complexity, we reduce cognitive loading, so on and so forth. It's managing some of those subtle differences between all the security technology on the market, and I really want to abstract that away. You know, I love that Endgame does a little thing differently than CrowdStrike Falcon, cool, love it, but how do I abstract that away so the way we respond to EDR alerts is just one way, rather than having to account for these nuances for this thing and that thing? What's interesting is that when you think about those products competing, they want to have these little features that are a little bit different. So one of the things I'm really focused on this year is abstracting that away to reduce complexity in our particular platform. But that may be a problem specific to where we are in the journey at Expel.

 

John Hubbard  45:51

Yeah, I mean, I think that's a pretty generalizable problem. No one likes writing one-off solutions to deal with individual products, and standardization is a big thing in infosec: MITRE ATT&CK is huge because we can use the same language to talk about stuff. If we could deal with our tools in an automated, standardized way, I agree that the whole game changes, so hopefully we get there someday. To wrap it up, a two-part question here, because we've talked about a lot of cross-discipline stuff. First, are there any resources that you have read that have informed your view on this stuff that we could point listeners to?

 

Jon Hencinski  46:24

Yeah, absolutely. There's a book called Statistical Process Control for Managers; I believe it's in its second edition. What that's going to teach you is what questions you have to ask to be able to answer, taking alert management as a process: is it in a state of control, or is it complete chaos? You'll learn fun things like what a Shewhart control chart is. Read it; it may be a little dry for some of your audience, that's cool, but I promise you it's worth the time. The other book I'd highly recommend is The Goal, by Dr. Eli Goldratt. It teaches you the Theory of Constraints, and that's system-level thinking, because if you work in security operations, fun fact, you're managing a system. What that teaches you is how to spot bottlenecks, like we talked about with some of the security metrics: where's the friction today, and how do I exploit that to optimize the system? What really good managers are able to do is predict where the next bottleneck is going to be: oh, I know when we fix this, this other thing is going to be messed up, and then we're going to have to go do that, and when we do that, it's going to be this other thing. That's really the value I bring to my day job: saying, this is our problem today, we're going to do that, and this is going to be your next problem, but we're already thinking about it ahead of time. So those are two really, really good books. And time series analysis, the basics: what's a trend, what's a residual, what's seasonality. We've got some blogs on the Expel site if you're interested. And just the basics of operations management, capacity modeling, those things matter. I love it when we promote senior security analysts into management, and you've got to learn management, but fun fact, you're also in operations management now, so here are some new tools to learn as well. Those are probably my best recommendations right now.
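For anyone curious what the SPC idea looks like in practice, here is a minimal Shewhart-style control chart check on a synthetic daily alert-count series, using the textbook three-sigma limits; the data and limits are illustrative only.

```python
# Sketch: is the alert-management process "in a state of control"?
# Flag days whose alert counts fall outside mean +/- 3 sigma.
import numpy as np

rng = np.random.default_rng(1)
daily_alerts = rng.normal(200, 15, 60)
daily_alerts[45] = 320   # inject one out-of-control day for the demo

mean = daily_alerts.mean()
sigma = daily_alerts.std(ddof=1)
ucl, lcl = mean + 3 * sigma, mean - 3 * sigma

for day, value in enumerate(daily_alerts):
    if value > ucl or value < lcl:
        print(f"Day {day}: {value:.0f} alerts is outside control limits "
              f"({lcl:.0f}-{ucl:.0f}); go find the special cause")
```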

 

John Hubbard  48:05

Very cool. Thank you. And then also, where can we find your stuff and continue to follow you online?

 

Jon Hencinski  48:10

Yeah, that's great. So I'm on Twitter, @jhencinski, and there are also a number of blogs I've written on the Expel website, expel.io/blog. Check us out.

 

John Hubbard  48:20

All right, thank you very much, Jon Hencinski, for joining us on the Blueprint podcast. A ton of awesome, actionable information here and some great takeaways for listeners, so I think people are really going to enjoy it. Thanks for coming on.

 

Jon Hencinski  48:31

John, this was great. Thank you so much, had a lot of fun.

 

John Hubbard  48:35

Hey, Blue Teamers. Hope you enjoyed today's episode of Blueprint. If you've got a second and want to help support the podcast, please subscribe and leave us a review on Apple Podcasts. It would be really, really meaningful to us, and if you have any ideas or suggestions, I would love to hear them. Your reviews are going to be one of the best ways to help others find this podcast, so anything you could do would be a big help. As always, thank you for listening. You can connect with me on social at @SecHubb on Twitter, or on LinkedIn. So until next time, thank you for listening to the Blueprint podcast.
