
I am excited to share with you our next speaker, one of the speakers I'm personally excited about. She is an amazing security professional, and actually not just a security professional, she is a team leader: she leads security research for Windows Defender Advanced Threat Protection at Microsoft. Now I have to share with you a secret about our next speaker. Her name is Donna Burrell. Donna was here in Israel during the pandemic, but she got called by Microsoft to Seattle, so she is joining us from the West Coast. This next talk may not be as live as we would all like it to be, as she's not right here on this stage, but we're extremely thankful to her for joining us at this early hour of 4:00 in the morning, I think, or something around that time, all the way on the West Coast. Donna is going to be talking to us about eliminating alert fatigue, a fantastic topic for an after-lunch coffee break session. Grab a coffee, because this is going to be a fast, fun talk about eliminating alert fatigue and how to do better as a security professional.

Donna started her career in the famous 8200 unit, I know you know what I'm talking about, wink wink, say no more, and she's also worked for Google on their security team in Zurich, Switzerland. Donna is very passionate about operating systems and Windows internals, and in her spare time she also volunteers with high school students, mentoring the next generation of cybersecurity experts. I first met Donna through the BlackHoodie initiative, a reverse engineering seminar for women that first started in Vienna but has now become a global phenomenon, and Donna was working super hard to actually bring BlackHoodie to Israel this summer. It didn't work out because of the pandemic, but sign me up for next year. So I hope you're ready for this amazing talk about eliminating alert fatigue. Please welcome, give her a round of applause even though she's not here in the room, Donna Burrell. Thank you, Donna.
Hi everyone, thanks for joining my talk about reducing alert fatigue through better engineering. I am super excited to share with you some of the work we've done over the past two years. Before we start, let me introduce myself. My name is Donna, and I am the security research lead for Windows Defender ATP. I am originally from Israel and currently based in Seattle, Washington, at the Microsoft headquarters. I work on fascinating research projects, I have had the opportunity to present some of my research at conferences like Black Hat and BlueHat, and I love dogs. So let's talk about dogs for a second. I don't know about you, but my dog is a barker. When I got her I would be alerted every time she barked, but by now I have learned to ignore the barking, as I realize most of it is false alerts. What I experience is alert fatigue: I am tired of wasting my time investigating my dog's alerts. This is exactly what happens to SOC teams hunting for threats. If they get too many false alerts, they experience the same alert fatigue that we, as security vendors, would like to avoid. You are probably thinking: so why do we produce FPs in the first place? There are a number of reasons for that. First, we need to remember that most software and environments that we use are dynamic.
We install new software and update existing software, and in many cases these involve API misuse: benign software uses APIs in ways that may seem suspicious. Another reason for producing FP alerts is data gaps. Sometimes the security provider doesn't have the required data to determine whether an activity is malicious or not, and when that happens, some will generate an alert out of conservatism. One more reason for FPs is faulty detection logic. This could be related to a lack of testing, flawed assumptions, or other human mistakes. I'll talk more about how we can minimize this, but I should mention that the first two reasons are the core sources of FPs, and fortunately the human factor is normally insignificant. The last reason has to do with alert misinterpretation, when a TP alert is identified as an FP alert. As part of the work I'm presenting today, we made some product improvements to address this; however, it is beyond the scope of this engineering discussion. All right, so we understand the root causes for FPs; let's tackle them. The traditional approach to producing security alerts is to cover as many TPs as possible with the minimal amount of FPs. As the diagram shows, we set the threshold based on the context, balancing FNs and FPs: highly sensitive environments are conservative, to ensure they don't skip any TP. Generally, you want to cover most cases and introduce the least number of FNs while avoiding FPs.
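To make that threshold picture concrete, here is a minimal, self-contained sketch of the idea; the scores, the labels, and the FN budgets are invented for illustration, and this is not the actual Defender logic:

```python
# Illustrative only: choosing an alert threshold on a suspiciousness score so that
# more sensitive environments accept extra FPs in exchange for fewer misses (FNs).
history = [  # (suspiciousness_score, is_malicious) for past events
    (0.97, True), (0.91, True), (0.88, False), (0.73, True),
    (0.66, False), (0.52, False), (0.41, True), (0.20, False),
]

def fn_rate_at(threshold):
    """Fraction of malicious events missed if we only alert on score >= threshold."""
    malicious = [s for s, bad in history if bad]
    return sum(s < threshold for s in malicious) / len(malicious)

def pick_threshold(max_fn_rate):
    """Highest threshold (fewest FPs) whose miss rate still fits the FN budget."""
    for t in sorted({s for s, _ in history}, reverse=True):
        if fn_rate_at(t) <= max_fn_rate:
            return t
    return 0.0

print(pick_threshold(max_fn_rate=0.0))    # sensitive environment: never miss a TP, more FPs
print(pick_threshold(max_fn_rate=0.25))   # relaxed environment: fewer FPs, may miss some TPs
```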
This should be the approach when introducing a new detection algorithm. However, this approach introduces a trade-off between FNs and FPs. We were looking to shift the curve, eliminating FNs without compromising on FPs, through better engineering. In order to do this, we need to leverage data to drive engineering improvements, data that helps us understand FP generation, focusing on alert conclusion data. We leverage data from three different sources. First is customer feedback: for each alert we produce, we ask customers to provide us feedback. While we don't always receive feedback, it helps draw our attention to pain points. Next, we leverage expert graders, our in-house experts who verify alert accuracy. Initially they graded alerts randomly; however, in order to optimize their productivity, we developed contextual clustering, so instead of working through alerts randomly, they grade related alerts in bulk to save plenty of overhead. These two manual sources are great but not enough: in order to understand our detection precision rate, we need to scale our testing. For that reason, we introduced a machine learning model to predict alert conclusions, leveraging our manual inputs and scaling their impact. This tool predicts individual alert conclusions using multiple machine learning models to achieve high-confidence alert grades.
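As a rough illustration of that kind of multi-model grading, here is a sketch using scikit-learn; the features, the toy training data, the two-model ensemble, and the confidence bar are all assumptions made for the example, not the production design:

```python
# Hypothetical auto-grading sketch: independent models each predict whether an alert
# is a TP or an FP, and we emit an automatic grade only when they all agree with
# enough confidence; everything else stays with the human graders.
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

# Toy feature vectors for historically graded alerts (e.g. process rarity,
# signer reputation, prevalence) and their manual conclusions: 1 = TP, 0 = FP.
X = [[0.9, 0.1, 0.2], [0.8, 0.2, 0.1], [0.2, 0.9, 0.8],
     [0.1, 0.8, 0.9], [0.7, 0.3, 0.3], [0.3, 0.7, 0.9]]
y = [1, 1, 0, 0, 1, 0]

models = [
    LogisticRegression().fit(X, y),
    RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y),
]

def auto_grade(alert_features, confidence=0.7):
    """Return 'TP', 'FP', or None (leave the alert for a human grader)."""
    tp_probs = [m.predict_proba([alert_features])[0][1] for m in models]
    if all(p >= confidence for p in tp_probs):
        return "TP"
    if all(p <= 1 - confidence for p in tp_probs):
        return "FP"
    return None  # the models disagree or are unsure: no automatic grade

print(auto_grade([0.85, 0.15, 0.2]))  # graded automatically only if both models agree
print(auto_grade([0.5, 0.5, 0.5]))    # typically None -> routed to the expert graders
```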
So let's take a moment to understand where we are. We understand why we have FPs and the trade-off that reducing them involves, and we can leverage alert conclusion data to transform this trade-off. So let's examine how. We started by analyzing the alert generation process, breaking it down into its components to identify FP root causes. I like to refer to this process as a loop: producing new alerts that are being assessed, then being monitored to identify pain points to fix. Breaking the loop into steps helped us focus our attention on the most vulnerable components, to ensure our efforts move the needle. As part of this, we analyzed the process step by step to identify which kind of FPs can be avoided in each component. This included better engineering guardrails and reiterating engineering guidelines. In addition, we started to measure FP fixing time in order to minimize it. We then tackled the alert fixing process, introducing new tools that I will cover in the next slides. These tools enabled us to discover FP cases faster and solve them quickly, so even if we produce an unwanted FP, the fixing cycle will be closed fast, without customer reporting, and before customers even notice it in most cases. For example, we introduced a new component into the alert generation process: the FP filter. This filter combines rule-based logic and machine learning models to address FPs in real time, before they become customer facing. The rule-based filtering enables us to apply broad filtering for common FP patterns across different types of alerts, while the machine learning model predicts FPs based on prior conclusions and other contextual features. If the FP filter makes a definitive judgment on a specific pattern, we will circle this insight back into our detection logic; however, this component is mainly intended to pick up on unpredictable behaviors that cannot be programmed into the detection logic.
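A minimal sketch of how such a hybrid filter could be wired together might look like the following; the rule, the feature values, the scikit-learn model, and the suppression threshold are illustrative assumptions rather than the actual filter:

```python
# Hypothetical FP filter: broad hand-written rules run first, and a model trained on
# prior alert conclusions catches the patterns the rules don't anticipate.
from sklearn.ensemble import GradientBoostingClassifier

def rule_based_fp(alert):
    """Broad, hand-curated suppression for a well-understood benign pattern."""
    # Illustrative rule: a trusted, signed updater touching its own install directory.
    return alert.get("signer_trusted") and alert.get("path_matches_installer")

# Train the ML part on previously concluded alerts: features + grade (0 = FP, 1 = TP).
X_history = [[0.1, 0.9, 5000], [0.2, 0.8, 3000], [0.9, 0.1, 3], [0.8, 0.2, 7]]
y_history = [0, 0, 1, 1]
fp_model = GradientBoostingClassifier().fit(X_history, y_history)

def should_suppress(alert, features, fp_threshold=0.95):
    """True if the alert is filtered out before it becomes customer facing."""
    if rule_based_fp(alert):
        return True                                  # definitive rule match
    p_fp = fp_model.predict_proba([features])[0][0]  # probability the alert is an FP
    return p_fp >= fp_threshold                      # suppress only when very confident

alert = {"signer_trusted": False, "path_matches_installer": False}
print(should_suppress(alert, features=[0.15, 0.85, 4200]))
```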
Last, I previously mentioned the auto-grading tool, which is another addition to the alert generation process that enhances our detection capabilities. The auto-grades are fed into this flow, supporting the precision assessment, the detection fixing, and of course the FP filtering. The next step to reduce FPs through better engineering is to introduce new tools. The first tool we developed is a proactive anomaly detection tool that monitors alert production in real time and flags suspected FPs based on observed anomalies, such as volume spikes. This tool notifies us when it detects cross-customer anomalies; these are phenomena that impact multiple customers, making a breach less likely. This notification triggers an internal flow of urgent fixing to address the case.
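A toy version of the volume-spike idea could look like this; the baseline window, the spike factor, and the cross-customer bar are invented parameters for the sketch, not the real monitor:

```python
# Hypothetical volume-spike monitor: compare today's alert volume per detection rule
# against its recent baseline, and flag rules that spike across several customers.
from collections import defaultdict
from statistics import mean

# Alert stream: (detection_rule, customer_id, day_index)
alerts = [("rule_A", "cust1", 6), ("rule_A", "cust2", 6), ("rule_A", "cust3", 6),
          ("rule_A", "cust1", 6), ("rule_B", "cust1", 6),
          ("rule_A", "cust1", 2), ("rule_B", "cust2", 3)]

def suspected_fp_spikes(alerts, today=6, baseline_days=5, spike_factor=3, min_customers=2):
    per_rule_day = defaultdict(lambda: defaultdict(int))
    per_rule_customers_today = defaultdict(set)
    for rule, customer, day in alerts:
        per_rule_day[rule][day] += 1
        if day == today:
            per_rule_customers_today[rule].add(customer)

    flagged = []
    for rule, by_day in per_rule_day.items():
        baseline = mean(by_day.get(d, 0) for d in range(today - baseline_days, today))
        todays = by_day.get(today, 0)
        # A spike hitting many customers at once looks like a benign change
        # (e.g. a software update), not a targeted breach -> suspect FP.
        if todays > spike_factor * max(baseline, 1) and \
           len(per_rule_customers_today[rule]) >= min_customers:
            flagged.append(rule)
    return flagged

print(suspected_fp_spikes(alerts))   # e.g. ['rule_A'] under these toy numbers
```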
We also introduced an alert clustering system, which clusters alerts based on feature similarity and then delivers a conclusion for each cluster. This way we get an overview of alert patterns and can tackle them at scale, providing a conclusion such as TP or FP for each pattern. This allows us to review the alert patterns and their conclusions automatically and reduce time to fix.
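As a rough sketch of clustering alerts and concluding per pattern, using scikit-learn's KMeans purely for illustration (the features, the known grades, and the cluster count are made up):

```python
# Hypothetical alert clustering: group alerts by feature similarity and attach one
# conclusion per cluster, so graders review patterns instead of single alerts.
from collections import Counter, defaultdict
from sklearn.cluster import KMeans

# Toy alert feature vectors (e.g. process rarity, command-line entropy, prevalence).
alerts = [[0.9, 0.8, 0.1], [0.88, 0.82, 0.12], [0.1, 0.2, 0.9],
          [0.12, 0.18, 0.88], [0.15, 0.22, 0.91], [0.92, 0.79, 0.09]]
# Grades we already have for a few alerts (experts / auto-grading); None = ungraded.
known_grades = ["TP", None, "FP", None, "FP", None]

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(alerts)

# Propagate the majority grade inside each cluster to its ungraded members.
grades_per_cluster = defaultdict(list)
for label, grade in zip(labels, known_grades):
    if grade is not None:
        grades_per_cluster[label].append(grade)

cluster_conclusion = {label: Counter(grades).most_common(1)[0][0]
                      for label, grades in grades_per_cluster.items()}
print(cluster_conclusion)                             # one conclusion per alert pattern
print([cluster_conclusion.get(l) for l in labels])    # bulk grades for every alert
```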
To summarize, as part of our engineering efforts to minimize FP cases, we first analyzed the overall alert generation process for improvements. In addition, we introduced a new FP filtering component to filter FPs in real time. Next, we developed the automatic grading mechanism to grade alerts at scale for better conclusions, and then created the alert anomaly monitoring to detect FP cases in real time so we can tackle them quickly. Last is our alert clustering approach, which came up multiple times, as part of alert grading and as post-alert FP analysis. Our main takeaways from this process are: first, clustering increases our productivity, so that instead of reviewing FPs case by case, we can cluster similarities, review alerts in bulk, and tackle them at scale. Second, by breaking down processes we can better engineer a solution; it allows us to analyze steps along the chain, determine which ones can be automated, and of course scale human impact, empowering researchers. Next, like the old-fashioned engineering approach, we should always start small with a POC, prove its value, and then scale; everything I presented today started from a POC. Last, in order to produce meaningful results that move the needle, we must identify which key levers we need to optimize for. We optimize for time to discover, time to fix, and detection precision, to keep our eyes on the ball.
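As a small, hypothetical illustration of how these three levers could be tracked from alert records (the field names, timestamps, and grades below are assumptions for the sketch, not real data):

```python
# Hypothetical tracking of the three key levers: time to discover an FP pattern,
# time to fix it, and overall detection precision.
from datetime import datetime
from statistics import mean

# Each record: when the first FP alert fired, when the FP pattern was discovered,
# when the fix shipped, and the final grade of the alert.
records = [
    {"fired": datetime(2020, 5, 1, 8), "discovered": datetime(2020, 5, 1, 10),
     "fixed": datetime(2020, 5, 1, 14), "grade": "FP"},
    {"fired": datetime(2020, 5, 2, 9), "discovered": datetime(2020, 5, 2, 9, 30),
     "fixed": datetime(2020, 5, 2, 11), "grade": "FP"},
    {"fired": datetime(2020, 5, 2, 12), "discovered": None,
     "fixed": None, "grade": "TP"},
]

fps = [r for r in records if r["grade"] == "FP"]
time_to_discover = mean((r["discovered"] - r["fired"]).total_seconds() / 3600 for r in fps)
time_to_fix = mean((r["fixed"] - r["discovered"]).total_seconds() / 3600 for r in fps)
precision = sum(r["grade"] == "TP" for r in records) / len(records)

print(f"time to discover: {time_to_discover:.1f}h, "
      f"time to fix: {time_to_fix:.1f}h, precision: {precision:.2f}")
```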