Educating Your Guesses: How To Quantify Risk And Uncertainty by Sara Anstey

BSides Leeds · 2023 · 28:54 · 67 views · Published 2023-07
About this talk
Sara Anstey explains how to move beyond subjective risk matrices by applying statistical methods to quantify cybersecurity risk and uncertainty. Using Monte Carlo simulations and calibration techniques, the talk demonstrates how to translate expert guesses into defensible quantitative models—with practical examples in Excel showing ROI calculations for security investments like phishing detection software.
Transcript

Okay, I'm going to go ahead and get started. I want to first say thanks for sticking around till the afternoon. I know it's been a long day of talks, so thanks for actually showing up. If I'd known that my talk was going to be in the afternoon, I'd have brought beers as props or something, but they didn't tell me far enough in advance. But yeah, I'm Sara. I probably have a bit of a different background than a lot of people at the conference, in that I got into cyber security because of the company that I work for, but I actually have a data analytics, statistical, more

mathematical background, and a lot of what I've been working on since I got into cyber security about six years ago has been bringing different types of statistical models and ways to analyze numbers and uncertainty, and applying them to cyber security concepts. That's what I'm going to talk about here. I'm going to keep it pretty light; there will be slight audience participation, but it won't be bad. So I'm just going to hop in. Other than that, I'm the director of data analytics for a company called Novacoast, which I'm sure none of you have ever heard of, if for no other reason than that we're headquartered in

the U.S., if you can't tell by my accent. But yeah, we do cyber security and consulting, and like I said, that's how I ended up in this field that I never really had any particular interest in, I guess, but now that I'm here it's a pretty cool field. So like I said, this talk is going to be a lot about risk and uncertainty, and we're going to get into statistical modeling techniques for it, but I think to start we have to define what risk even is if we're going to be talking about it. How I usually define it, at a

really high level, is that risk in cyber security is kind of the same as risk in life, in that it's any time you're doing something, or something could happen, where you don't know what the outcome is going to be. Any time you're taking a risk in life, it's because you're doing something you've never done before or you don't know what the outcome is going to be. If it was all known and you could predict exactly what was going to happen, it wouldn't be a risk so much as a decision. And obviously cyber security has a lot of risks, right? We don't know if we're going to be breached in the next

month or year, and we don't know how many attacks we're going to be susceptible to. So it also raises this question: if risk, by definition, is the unknown, how can we measure it or quantify it? Is that something we can even do if it's unknown? It almost seems like a contradiction. So I want to start by talking about the current methods that are typically used (sorry for the white background) in cyber security to understand risk. A lot of you are probably familiar with this; this is a typical risk matrix, right? So rating

things low, medium, or high, maybe on a one-through-five or one-through-three scale. I see this all the time for vulnerabilities: is a vulnerability low, medium, high, or severe? It can be used for a lot of things, and these matrices are very common in cyber security. So let's talk a little bit about them: while they're easy to understand, why are they maybe not the best thing to use to understand our risk? Let's put it into play with an example and look at risk A and risk B. Risk A, we're going to say, has a likelihood of 50 percent and an impact of nine

million dollars; risk B has a likelihood of sixty percent and an impact of two million dollars. If we want to get an expected loss really easily, we multiply likelihood by impact, which is a very easy way to talk about risk. We can see risk A has an expected loss of 4.5 million and risk B 1.2 million. Now, this is not a crazy, weird risk matrix, and I haven't done anything fancy here, but you can see that risk A is categorized as a medium and risk B is categorized as a high, even though, if we use the math and the expected loss, risk A is clearly way worse than risk B. So how does that happen on this risk matrix?
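The arithmetic behind the risk A and risk B comparison can be written out as a short Python sketch. The likelihoods and impacts are the ones quoted in the talk; the dictionary layout and the `expected_loss` helper are illustrative names, not anything shown on the slides.

```python
# Expected loss = likelihood x impact, using the two example risks from the talk.
risks = {
    "A": {"likelihood": 0.50, "impact": 9_000_000},
    "B": {"likelihood": 0.60, "impact": 2_000_000},
}


def expected_loss(likelihood: float, impact: float) -> float:
    """Return the simple expected loss in dollars."""
    return likelihood * impact


expected = {
    name: expected_loss(r["likelihood"], r["impact"]) for name, r in risks.items()
}

for name, loss in expected.items():
    print(f"Risk {name}: expected loss ${loss:,.0f}")
```

Ranking by this number puts risk A well ahead of risk B, which is the opposite of the medium/high labels the matrix assigns.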

So I want to put up another scenario. Let's say I'm trying to understand my risk of getting breached in the next year. I'm an analyst or something at a company, and I say, all right, I'm going to go ask my CTO and my CISO, because you want to get two people's opinions, and I'm going to ask them: on a scale of one to five, what do you think our risk of getting breached in the next year is, one being very low and five being certainty? So I go to the CTO and I ask, and the CTO is like, well, you know, the CISO is an idiot, we're not doing anything right, and

this and that, and they're thinking in their head there's maybe an 18 percent chance we get breached in the next year, so they rate that a one on a scale of one to five and they say to me, one. Okay, great. So I go and I talk to the CISO, and the CISO says, yeah, we're doing awesome, we have all these patches up to date and whatever, and they think we're probably in a really good position, there's probably a three percent chance we get breached in the next year, so they rate that a one on a scale of one to five. And so both our CTO and our

CISO are in agreement that our risk of getting breached is a one on a scale of one to five. But all of us who are cyber security practitioners know that there's a really big difference between a three percent and an 18 percent chance that you get breached in a year, right? That's actually pretty significant, but they've both rated it a one. And that's actually what's happening on this risk matrix here. It's a statistical thing called range compression, where you're compressing a quantitative, continuous scale into a handful of buckets. And you might say, well, you're talking about quantifying risk, and a one-to-five scale is quantitative.
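To make the range compression point concrete, here is a small Python sketch. The equal-width, 20-point bins in `to_ordinal` are an assumption for illustration only; the talk does not say which cut-offs the CTO or CISO had in mind.

```python
# Hypothetical 1-5 scale with equal-width 20-point bins; the talk does not
# specify the actual cut-offs, so these thresholds are an assumption.
def to_ordinal(probability: float) -> int:
    """Map a probability in [0, 1] to a 1-5 rating."""
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    return min(int(probability * 5) + 1, 5)


cto_estimate = 0.18   # the CTO's private estimate from the example
ciso_estimate = 0.03  # the CISO's private estimate from the example

cto_rating = to_ordinal(cto_estimate)
ciso_rating = to_ordinal(ciso_estimate)

print(f"CTO rating: {cto_rating}, CISO rating: {ciso_rating}")
print(f"ratio of underlying estimates: {cto_estimate / ciso_estimate:.0f}x")
```

Both estimates land in the same bucket even though one is six times the other, so the rating throws away exactly the information a decision would need.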

A one-to-five scale is actually what we call a numerical ordinal scale. An ordinal scale means that there's an implied order, so low, medium, high is an ordinal scale; even a regular user versus an admin user is ordinal. One through five just means the labels are numbers, but we could replace one with improbable, two with seldom, three with occasional, right, and all of a sudden it isn't quantitative anymore. So that's what happens when we have range compression, and at a higher level this whole phenomenon is actually called analysis placebo. Analysis placebo is just a broad concept for any time that the way

that you're measuring something, so the mathematical or statistical formula or algorithm that you're using to understand data, either gives you no measurable improvement in understanding or, as in the case of risk A and risk B, actually gives you a worse understanding of your data, just because of the way that you're analyzing it, which is what you can see happen here. And that's a huge problem with the way that we're doing a lot of things in cyber security right now when it comes to risk. On top of that, what really starts to go wrong, and what's going wrong in this risk matrix, is that you've got a one-through-five scale and people start applying mathematical operations to it.

So then they say, we've got a one and a two, so our average is 1.5. But remember that this is an ordinal scale: does it make sense to say, well, we've got a regular user and an admin user, so the average is a regular-plus user? It's saying the same thing. You can't actually apply mathematical operations to ordinal scales, but people do, and that's how we get analysis placebo. So now that I've, quote unquote, debunked the way that we're doing things right now, I want to get into how we fix it, and not only how we fix it, but whether there's a way

to do it that's as easy as a risk matrix, because that's a really low barrier to entry, right, rating things one to five. Before I get into that, I want to go back to the beginning, when I asked whether we can even quantify risk, because it's unknown. We can, but if we get a little philosophical, if we're going to quantify risk, isn't it always going to be a guess of some sort? There's no way we're ever going to know when or if a breach is going to happen, so it is all going to be some type of guess, right? And I'm going to show a method

that actually does involve some guessing when we quantify risk. Before we do that, I want to pose a question: are you a good guesser? And is there such a thing as a good guesser? Can a person be a good guesser, and can you learn to be a better one? So, just so that nobody falls asleep on me, we're going to do a little experiment. One thing about me, and I don't know if this applies to the average cyber security professional, is that I love reality TV, the trashier the better. I love The Bachelor. This is a great show, and if you've never seen The Bachelor, I challenge you: watch it.

It's on Monday nights, it's two hours, sit down on the couch. There are better ones, but The Bachelor is the staple; that's your entry to reality TV right there. It's two hours on Monday nights: get a bottle of wine, drink the whole thing yourself, watch it, and by the end you will feel so much better about your life. So, great show, right? I'm posing this question here: how many seasons of The Bachelor have there been? Don't say it out loud and don't Google it. I want you to think in your head what you think the answer to this question is, knowing that you probably don't know it for sure but you have some

information. So think about the answer in your head, but don't think of just one number. I want you to think of a range that represents your 90 percent confidence interval for the answer to this question. I know I'm bringing you back to college statistics when I say confidence intervals, but it basically just means a lower bound and an upper bound such that you're 90 percent sure the right answer is in that interval. So there's about a 10 percent chance you're wrong, but you're 90 percent confident it's between those two numbers. So now, like I said... hold on, I might have messed up the order of my slides here. Okay, if you guys can

go to menti.com and use this code: it won't make you download an app, it won't make you sign in, it won't make you pay, I promise. It's a really good audience polling app. If you go to Menti and put in this code real quick, we're going to have a little bit of fun. First, though, before you vote, sorry, I think I messed up the order just slightly; I'll go back to that slide in a minute. Think about your confidence interval, and now let's say I'm going to play a game. I want to preface this: in this game you can win a thousand dollars.

BSides did not give me the budget to do that, unfortunately, so this is going to be a fictitious thousand dollars you can win. But say I give you two options for how to win this thousand dollars. Option one: if the correct answer to that question is within your confidence interval, you win a thousand dollars; if not, you don't. Option two: you spin this wheel; if it lands on green, you win a thousand dollars, and if it doesn't, you don't. Gut reaction, don't say it out loud, but I want you to think about which you would prefer, which option you would pick. It's not a trick question; it's

what's your gut instinct, right? Would you rather spin the wheel, or would you rather go with your confidence interval, to win this thousand dollars? Now we're going to go to Menti, and you'll see there should be a question up there, and I want you guys to say which option you would choose. What's your gut instinct? Hold on, I might have to actually go start it, one second. The code should still be up here at the very top. So which option would you choose: would you go with your confidence interval, would you spin the wheel, or do you have no preference?

It's at the top, yep.

I will just say, too, quick plug: if you guys are ever doing presentations, Menti's free, it's awesome audience polling, it's really easy. I don't work for them. Okay, so it looks like most people would spin the wheel, a few would go with their confidence interval, and maybe one or two people had no preference. But most of you had some type of gut reaction, and most people would spin the wheel. So that's kind of interesting; let's think about that. All right, I'm going to go back to the slides. I'm not going to leave you guys hanging: the answer is 27 seasons. And I didn't even preface it with this, but this is only The Bachelor, so it

does not include The Bachelorette, Bachelor Pad, Bachelor in Paradise, or Bachelor Winter Games, and I could name more. But 27 seasons, right? So let's go back to Menti one more time here. All right, whoops, how do I play... hold on, sorry, technical difficulties. There we go. Was the correct answer within your original confidence interval? No judgment. Yes or no, be honest.

So I'll say, I've given this talk before, and that's usually about where it ends up. All right, we'll say maybe 15 percent had it in and about 80-ish percent didn't. But I want you to now think about the original question I asked. I said, give your 90 percent confidence interval. So even without knowing the right answer, if this audience was what we call perfectly calibrated, meaning that you all actually gave your 90 percent confidence interval, then 90 percent of you should have had the correct answer within your interval and 10 percent should not have, which means this audience is something like 70 percent overconfident in general. And it's not just because we work in cyber security; that's

actually human nature, and there have been a lot of research studies on it. The average human is really overconfident when estimating things they don't know. It's the reason that when your boss asks you how long a project's going to take, you say five weeks and it takes six months. It's the same idea: we're inherently really bad estimators and guessers. But there are also a lot of psychological tricks that teach us how to be better guessers. One of them, like I showed here, is called the equivalent bets method, and there's actually been research showing that if you attach some type of monetary loss to your guesses, you can teach yourself to be a better guesser. And the

way that it works is that if your gut reaction was to choose your confidence interval, you should shrink the interval until you have no preference between the two, and if your gut reaction was to go with the wheel, you should widen the bounds of your confidence interval until you have no preference between the two, with that monetary loss in mind, because both should represent a 90 percent chance of winning a thousand dollars. And that's how you know whether you gave a true 90 percent interval: for example, if you wanted to spin the wheel, you didn't actually give your personal 90 percent confidence interval; maybe you gave your 70 or your 60 percent interval, right? So there's a whole bunch of other really

interesting methods; that's just one of the ways that we can learn to be better guessers. That was a tangent, but we will apply it later to our methodology. So, getting back to our risk quantification: for the rest of the talk I'm basically going to show a one-to-one substitution for that risk matrix that we originally showed. I call it one-to-one for two reasons. One, no additional software or technology is needed; the example I'm going to show is actually going to be in Excel, though you could do it in Python or kind of anything else. And two, we don't need any additional input data for the model that we wouldn't need for a

risk matrix, because like I said, at the end of the day we can do this all with just estimations and guesses if we need to. So, this is one more time I'm going to apologize for bringing you all back to the PTSD of college stats, but what I'm proposing is a Monte Carlo simulation. A lot of you are probably familiar with it; there's a wide range of applications for it. Like I said, it takes the same inputs as a risk matrix. What's good about a Monte Carlo simulation when looking at risk and uncertainty is that it accounts for having limited input data, which basically just means that a lot of the time in cyber security we don't have a lot of

good data to put into models, and again, we don't know what we're estimating; it's all a risk estimation. There are two ways that it does that. The first is that the inputs to this model, instead of being static numbers or variables, are confidence intervals; it takes intervals of values as the inputs to the equation. Then it does thousands and thousands of replications on those input confidence intervals to come out with averages. And any time you're doing replications on a range of values, you have to have an underlying distribution that you're pulling from. So if you guys look at the red, probably

all of you are familiar with a normal distribution, right, centered at zero with a standard deviation of one. Even if you were to use the RAND function in Excel, it goes off a normal distribution, or actually, that's the uniform one, but a lot of things follow a normal distribution. What we use a lot of the time when doing these Monte Carlo simulations is called a log-normal distribution, which is shown in blue, and there are two important reasons. The first is that it can never go below zero, which basically just means that, as much as we might like it, there's never a negative chance of us getting breached, right? So it simulates the real world a little bit

better. And then you can also see that a lot of the density of the curve is toward the y-axis, and then there's a long tail, which accounts for extreme outliers. If we translate that into real risk and uncertainty, a lot of the time when you do get breached, it's not that massive five-million-dollar Target data breach that's all over the news. That can happen, so we want to account for it, but a lot of the time it might be that someone got phished and you have to wipe a couple of laptops, or a little bit of remediation needs to be done. It depends, but we're just saying the majority

of the data breaches in the world are not those five- or ten-million-dollar catastrophic losses. We want to account for them but understand that they're not the norm. So, what we're going to do: first things first, you have to start by defining the risk that you want to quantify, the same way you would in a heat map. Maybe it's a particular vulnerability and our exposure to that vulnerability; maybe it's something broader, like I've been saying, around the risk of us as an organization getting breached. But we want to be really clear and precise in our definitions. Then we also need to define a time range for that risk to occur, because it

actually doesn't make sense to say just the risk that we get breached; you have to say the risk that we get breached in the next month or in the next year. We always want to define a time range with it. Then we're going to come up with all our input variables and assign values to them with our confidence intervals. So this is where we're saying maybe the risk we're looking at, like in the example I'll show in a minute, is something around phishing. What different input variables might affect that? Maybe we do phishing simulations and we know our average click rate in general; maybe we

know the volume of emails that come in each day, or historically what percent of all emails are phishing attacks, or something like that. So we do have some input data. And again, if we don't have the hard-and-fast numbers and we're doing subject-matter-expert estimations, like we did earlier with The Bachelor, we want to repeat with multiple experts if possible and take averages, although it's not required. Then we're just going to run our simulation and take the average. Basically, we're simulating the next year 10,000 times, and then we're asking, of all those 10,000 fictitious next years that could happen, on average

how much money did we lose? I'm going to show a quick example of this in Excel now, just so you guys can see an approachable way to do this without having to have special software. There are different software products that do it now; the market's starting to get a little bit saturated, and I don't love a lot of them, but if you guys want to talk later about some of those, I'm happy to give my opinions. But it can be done in Excel. In this example that I made, we're going to look at a fictitious company in the technology industry with about a thousand users, and we are doing a phishing simulation

basically to see: they don't own a phishing detection or email software product, so should they invest in one? So, a Proofpoint or something: is it worth the money to invest in? We're going to define all of our input variables just like I said, looking at things like the likelihood that a user would require remediation or response if they were to click on a phishing email, and the business impact of a data breach, which we can pull from things like the Verizon data breach report, the IBM cost of a breach report, and a bunch of other industry sources, or we can estimate it, or maybe we have historical

information for our company. We can look at the hourly rate of the person doing the remediation: how much is that going to cost for the help desk person who has to wipe the laptops, or whatever it is? And then maybe we run phishing simulations, like I said, and we know our average click rate. So, just to show how easy this is, I'm going to do one last Mentimeter thing: we're going to fill in the number of users that click on the phish in the 12-month period. You can see that's empty right now. So if we go back to Mentimeter here, I want you guys to give that 90

percent confidence interval. You can use your new calibration techniques, right, and think: if we have a thousand-person technology company, what would be your 90 percent confidence interval for the number of people that you think would click on a phish in the next 12 months? Then I'll take whatever numbers we come up with, plug them into the model, and we'll go over the results that we get.
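Before intervals like these go into a model, one could check how well calibrated the estimators are by scoring them on a question with a known answer, as in the earlier Bachelor exercise. The sketch below is a minimal Python illustration; the ten intervals are invented for the example and are not the audience's actual poll responses.

```python
# Hypothetical 90% intervals from ten respondents for the question
# "how many seasons of The Bachelor have there been?" (answer per the talk: 27).
# These intervals are made up for illustration.
TRUE_ANSWER = 27
intervals = [(5, 12), (10, 20), (8, 15), (20, 30), (12, 18),
             (15, 25), (3, 10), (25, 40), (10, 14), (6, 16)]


def hit_rate(intervals, truth):
    """Fraction of intervals that contain the true value."""
    hits = sum(1 for low, high in intervals if low <= truth <= high)
    return hits / len(intervals)


observed = hit_rate(intervals, TRUE_ANSWER)
target = 0.90  # a perfectly calibrated group captures the answer 90% of the time

print(f"observed hit rate: {observed:.0%}, target: {target:.0%}")
print(f"gap: {(target - observed) * 100:.0f} percentage points")
```

A large gap between the observed hit rate and the 90 percent target suggests the intervals should be widened, which is what the equivalent bets exercise nudges people to do.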

And again, maybe you're basing this on your own organization. A lot of orgs I see, when they run phishing simulations, are at like 20 percent click rates, but some might be way higher and some might be way lower. I don't know, I just got a weird look from the audience. All right, so we're looking at, we'll say, 177 to 475; those are the last numbers I saw. So we'll put in 177 for our lower bound and 475 for our upper bound, and all of a sudden, when I move off this, we're going to see our results fill in here. So let's go over what results we get from doing a model like

this.
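For readers who would rather follow the method in Python than Excel, here is a minimal sketch of a one-risk Monte Carlo simulation. It is not the speaker's spreadsheet: the 10 percent annual event probability is a made-up placeholder, the 50,000 to 3,000,000 dollar interval is the breach-cost range mentioned in the walkthrough, and converting a 90 percent interval to log-normal parameters with a z-score of 1.645 is a common convention assumed here rather than something stated in the talk.

```python
import math
import random

Z_90 = 1.645  # z-score bounding the central 90% of a normal distribution


def lognormal_params(lower: float, upper: float) -> tuple[float, float]:
    """Convert a 90% confidence interval into log-normal mu and sigma."""
    mu = (math.log(lower) + math.log(upper)) / 2
    sigma = (math.log(upper) - math.log(lower)) / (2 * Z_90)
    return mu, sigma


def simulate_annual_losses(p_event, lower, upper, trials=10_000, seed=1):
    """Simulate `trials` fictitious years and return the loss in each one."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    mu, sigma = lognormal_params(lower, upper)
    losses = []
    for _ in range(trials):
        occurred = rng.random() < p_event  # did the event happen this year?
        losses.append(rng.lognormvariate(mu, sigma) if occurred else 0.0)
    return losses


# p_event is a hypothetical placeholder; the interval is the one from the talk.
losses = simulate_annual_losses(p_event=0.10, lower=50_000, upper=3_000_000)
average_loss = sum(losses) / len(losses)
print(f"average annual loss over {len(losses):,} simulated years: ${average_loss:,.0f}")
```

Because the log-normal draw can never be negative and has a long right tail, most simulated years cost nothing or a modest amount while a few are very expensive, which is the shape the talk argues for.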

Which ones? Yeah, that's that log-normal distribution, right, so the lower bound and upper bound are the 90 percent confidence interval. It's not a uniform distribution. A normal distribution is even on both sides; a log-normal is not, so more of the density sits toward the low end. It's kind of like when you're talking average versus median: the average might be skewed by a long tail in a probability distribution, which in this case it would be, whereas the median is less susceptible to skew. So here, for the lower bound and upper bound for this pretend company, and I could change these, we're saying the lower bound if a breach

occurred would be about fifty thousand dollars, whereas the upper bound they would have to pay is about 3 million if a breach occurred. So the input variables all kind of build on each other. In this one we're not saying that a breach will occur; we're saying, if it occurs, how much it's going to cost. But that doesn't represent our risk, because we don't know 100 percent for sure whether it's going to occur. So, looking at our average annual cost before, that's the first thing this model gives us, which basically just says: if we weren't to implement this software that we're thinking about, what's our risk right now, our inherent risk? We're

saying that in the next year, on average, based on our simulations, we're going to lose about three hundred thousand dollars due to phishing attacks. That, on average, is how much we will lose if we change nothing in our security posture right now. And we say that if we were to buy and implement the software that we're considering, maybe a Proofpoint or something, the average annual cost after would be about 182,000. That basically just means we run the model once with our current state, and then we run it once assuming we had the software in place, adjusting those estimates or data points accordingly. And then from there we can get a really

easy ROI number, return on investment, an ROI multiplier, which is basically just saying how much money we're saving if we were to buy the product, divided by how much it costs, both in terms of the licensing and the cost to install, implement, and keep it running at a high level of efficacy. So those are the first outputs. We can also look at our simulated loss histogram, which basically just shows, for every trial (how many did we do, ten thousand in this one), where they all fell. And then there are our loss exceedance curves, which, I think I'm short on time, so I'm

not going to go into these, but I could talk about them if you guys are interested; they're really cool curves. Anyway, I'm going to go back to this. That's the stats nerd in me that wants to come out a little bit, but I'm going to rein it in. So basically, to sum up, because I'm running out of time: a lot of the feedback I get on this is, okay, you're still just guessing, you still don't know your risk. Which, yes, that's what a risk is. It's going to be an estimate, it's going to be a guess. There's no way around that, because we don't know what's going to happen in the next year; it's in the

future. But it's a better one, because we're now using statistical models that don't inflate things and that aren't statistically working against us just because of the way that we're analyzing. And I want to point out on this slide that I'm not reinventing the wheel here. The insurance industry has been doing this for 30 years, and if you think about it, the insurance industry has a lot of the same problems that we do. How do you know if someone's going to get in a car crash or get a disease or something like that? How much should you charge them for a premium, and how much are you going to have to pay out if

some event happens to them? Well, we don't know the answers to those questions; those are risks that the insurance company is taking on. But we do have some data, right? We do know their family history, whether they're healthy, how old they are. We have some input data. It's the same in cyber security: we don't know what's going to happen, but we do know how much security awareness training we've done, we do know our average click rate, we do know about how many phishing emails we see a day, and we know how many other companies have been breached in the last year and have had to disclose it. So I think in cyber security we like to

think that we have these insane problems that no one's ever seen before, and maybe sometimes that's true, but not that often, in the data analytics world of things at least. So it's a better guess, and it's one that I think should be more widely adopted, not only in cyber security but really in all risk estimation in general. But yeah, that's all I've got. I do want to say, there's a really good intro book I just want to give a plug; it's called How to Measure Anything in Cybersecurity Risk, I think, or something like that. It gives a lot of these examples, and I

pulled some content from it; if you've read it, you could probably tell. Great book. Otherwise, I do have some free resources for other things, like template Excel files, so if you guys are interested, just come find me. Otherwise, there's my LinkedIn too, and yeah, thanks.
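As a closing illustration of two outputs described in the walkthrough, the ROI multiplier and the loss exceedance curve, here is a short Python sketch. The before and after costs are the approximate figures quoted in the talk; the 40,000 dollar control cost and the ten sample losses are made-up placeholders, and both function names are hypothetical.

```python
def roi_multiplier(cost_before: float, cost_after: float, control_cost: float) -> float:
    """Savings from the control divided by what the control costs."""
    if control_cost <= 0:
        raise ValueError("control_cost must be positive")
    return (cost_before - cost_after) / control_cost


def exceedance_probability(losses: list[float], threshold: float) -> float:
    """Share of simulated years whose loss exceeds `threshold`."""
    return sum(1 for loss in losses if loss > threshold) / len(losses)


# Figures quoted in the talk: about $300,000 before and about $182,000 after.
# The $40,000 annual cost of the control is a made-up placeholder.
roi = roi_multiplier(300_000, 182_000, control_cost=40_000)
print(f"ROI multiplier: {roi:.2f}")

# Tiny hypothetical set of simulated annual losses, for illustration only.
sample_losses = [0, 0, 25_000, 80_000, 0, 400_000, 0, 1_200_000, 0, 60_000]
for threshold in (50_000, 500_000, 1_000_000):
    share = exceedance_probability(sample_losses, threshold)
    print(f"P(loss > ${threshold:,}) = {share:.0%}")
```

Plotting the exceedance probability against a range of thresholds gives the loss exceedance curve; with a real simulation the list would hold thousands of simulated years rather than ten.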