← All talks

Make Alerts Great Again

BSidesSF · 2017 · 27:24 · 927 views · Published 2017-03 · Watch on YouTube ↗
Speakers: Daniel Popescu
Category: Technical
Style: Talk
About this talk
Daniel Popescu - Make Alerts Great Again

Why can't this be easier? Writing good alerts and keeping them actionable is hard. Ask anyone on any security team, ever. Alerts are notoriously either too noisy or don't have enough coverage, and finding the sweet spot is nearly impossible. Additionally, some alerts are idly sitting there functionally incorrect and don't actually work as expected (when was the last time you tested some of yours?). To make matters worse, there is a general lack of industry standard for alert definitions, priorities, and incident response steps. At Yelp, we have created tools and processes that enable the security team to keep a handle on our alerts, thus making the alerts actionable and maintainable. We do this by making sure we know which alerts are firing at what frequencies, having a runbook for writing new alerts, and utilizing self-service alerts whenever possible. Certainly no alerting solution is perfect. However, by implementing some of these tools, we've effectively improved the signal-to-noise ratio for most of our important alerts. This in turn relieves the security team of tedious tasks and enables us to work on more important (and interesting!) things.
Transcript [en]

Our next presentation is going to start now. It's going to be Daniel Popescu with Make Alerts Great Again. [Applause] Hello, hello.

Hello everyone. This talk is called Make Alerts Great Again, because that's what we did at Yelp. For those of you that don't know who I am, which is probably most of you: my name is Daniel Popescu, and I'm a security engineer at Yelp. I've been there for a little over a year, and prior to that I was at Microsoft for a number of years. Fun fact about me: I've been attending security conferences for the last 12 years, but this is my first time speaking at one, so thank you BSides, and thank you Yelp, for giving me the opportunity to speak with you today.

If you don't know what Yelp is, which I hope is none of you: Yelp is a company that connects people with great local businesses. As of a few months ago, here are some statistics about Yelp; the TL;DR on this slide is that we have 100 million monthly active mobile users and more than 100 million reviews in our system. To power a web application of this scale, there's a lot going on behind the scenes. Yelp has thousands of employees, thousands of servers, and hundreds of microservices deployed on all those servers, and believe it or not, all of these things generate a ton of logs. The servers all have endpoint monitoring software on them; the laptops all have endpoint monitoring software, and antivirus as well; all of those are creating logs, and all of these microservices are creating logs. Logs, logs, logs: tons of logs.

So as a security team, what do you do with all those logs? You build a security pipeline that looks something like this, which probably looks familiar to all of you. We have logs from some of the sources I spoke about before, plus tons of other sources I didn't mention and tons more that aren't on this slide. We collect those logs from the various sources, run them through our streaming pipeline, and index them into our data stores.

For us, the two main data stores are Splunk and Elasticsearch, and once we have that data in an indexable format, we're able to visualize and alert on it. For Splunk we use whatever built-in alerting mechanism Splunk has, and for Elasticsearch we built a framework called ElastAlert, which we've open sourced, and which we use for visualizing and alerting on the data that's in Elasticsearch. But there's a ton of data, and there's a ton of alerts too, so it's easy to fall into some common pitfalls of alerting. We're no different at Yelp; we fell into some of these problems as well.

Some of these problems: lack of visibility into alerts. How many alerts do we have? How often are they firing? Which ones don't ever fire? Some of the alerts were not actionable; a lot of them were going over email, and when you send alerts over email, a lot of the time they get ignored, no ownership is assigned, and nothing gets done as a result. There was no standardization: some people think certain things are really important, other people think those same things are not very important; some people think email is fine, some people want to page for everything that happens. There were no standards, and that's a problem. Functional correctness is a big problem too: how many of you have tested all of your alerts and know that they all work? Yeah, that's what I thought. And false positives are obviously a problem.

I'm happy to work at Yelp, where our alerts are not necessarily rainbows and unicorns (probably nobody's are), but we have a pretty good system that works for us, and I'm going to share some of those details with you.

For us: we have historical metrics on all of our alerts; we know how many alerts we have; we keep our alerts actionable by assigning ownership to all of them; we have clear incident response steps defined for all of our alerts; and we're pretty sure that all of our alerts are functionally correct. So let me walk you through some of the problems, and the solutions we came up with, to get to almost-rainbows-and-unicorns land. The first problem is lack of visibility. Our alerts are defined across multiple different systems: most of them in Splunk and in ElastAlert, which sits on top of Elasticsearch.

The problem is that there's no comprehensive dashboard that shows how many alerts we have, which ones are firing all the time, or which ones send email versus create Jira tickets versus page. In the past, when we wanted answers to those questions, we'd have to do a whole bunch of manual steps and produce a spreadsheet that would be out of date a few weeks later. So the solution we came up with is a service called the alert reporter. It's a standalone Python service that produces a weekly report in spreadsheet form. It aggregates by asking Splunk, ElastAlert, and whatever other alerting frameworks we have: give me all the details about the alerts that fired in the last week. The report gives us valuable insight into which alerts are possibly too noisy and how many alerts we have (I know we have 173 alerts, somehow; that's pretty awesome). It lets us take a step back, take a 10,000-foot view of our alerts, and see which ones are maybe firing too often and which ones have too many false positives, so we can focus our attention on the ones that are important to us.
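The talk doesn't show the alert reporter's internals, so what follows is only a rough sketch of the idea: count last week's firings per rule and write a spreadsheet. The `elastalert_status` writeback index is real ElastAlert behavior, but the field handling and the Splunk stub are assumptions for illustration, not Yelp's actual code.

```python
"""Sketch of an "alert reporter": aggregate last week's alert firings
per rule from each alerting framework and dump a CSV report."""
import csv
import datetime
from collections import Counter

from elasticsearch import Elasticsearch  # pip install elasticsearch


def elastalert_firings(es, days=7):
    """Count alerts ElastAlert recorded in its writeback index."""
    since = (datetime.datetime.utcnow()
             - datetime.timedelta(days=days)).isoformat()
    resp = es.search(index="elastalert_status", size=10000,
                     body={"query": {"range": {"@timestamp": {"gte": since}}}})
    return Counter(hit["_source"]["rule_name"] for hit in resp["hits"]["hits"])


def splunk_firings(days=7):
    """Stub: pull fired-alert counts from Splunk's REST API here."""
    return Counter()


def write_report(path="alert_report.csv"):
    counts = elastalert_firings(Elasticsearch(["http://localhost:9200"]))
    counts.update(splunk_firings())
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["alert_name", "firings_last_7_days"])
        for name, n in counts.most_common():  # noisiest alerts first
            writer.writerow([name, n])


if __name__ == "__main__":
    write_report()
```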

That visibility is super important, and if you don't know how many alerts you have defined, you should do something like this. Here's an example of what that report looks like. Next: actionability. This is the biggest problem that we had, and the one we solved. First of all, if your alerts are being sent over email, they're just not going to be actionable. You can't expect people to be monitoring their inbox all the time for critical alerts. What happens is you send these alerts to people's inboxes, people see a couple of them, then they create filters and never see them again.

That happens, and there's no ownership with emails. Ticketing systems are better: you can at least assign an owner to a ticket, and that's kind of cool, but there's still no enforcement that anything is going to get done with that ticket. It could just get lost in some Jira queue, and who knows whether the person the ticket is assigned to is actually going to do anything. The security team has to go look at the queue and follow up with people, and we don't want to do that.

Speaking of email, here's an example from back in the day when we used to do alerts over email. Here's a set of alerts that fire when people make changes to our AWS infrastructure without following proper change control procedures. As you can see, we have about 15 emails or so, all coming from our alerting framework, and in only one of the cases is a real human actually responding to the alert. That's interesting; what about all those other cases? Let's look at some of these emails. Here's the email from the human, Yannis. Yannis says: hey, this was me creating the RDS instances for some helpdesk ticket.

If we look at the email he responded to, here's the event inside it: there's Yannis creating a role called RDS-monitoring-role, which seems to corroborate his story. Now let's look at some of the other events, in the emails that nobody replied to. Here's an example where a user named Matt is adding the user J Endor to the admins group. When somebody adds somebody to the admins group, I want to know about it, and in this case nobody acknowledged it. Was this something normal? Is this a malicious insider? Is this malware? I don't know, because nobody acknowledged it.

And I probably wouldn't have scoured my emails looking for these events if I wasn't giving this presentation; these things get lost in email. Here's another example: the user Martin has somehow removed himself from the users group that requires MFA (multi-factor authentication). This is obviously a problem: if someone can remove themselves from the group that mandates MFA, I want to see the helpdesk ticket, or some kind of background context, on why they did it, and I don't want it to go unacknowledged.

Here's another event from one of those other emails. The user El Matthew has changed a network security group in AWS, and it looks like he's allowed basically any machine in the whole world to connect to any port on any EC2 instance associated with that security group. That looks like either something malicious or possibly a mistake, but I don't know whether El Matthew was ever notified that he made this change, or whether he even made the change at all; maybe it was malware, I don't know. I'd really like to talk to El Matthew about why he did that. So that's email; let's not do that.

Here's an example of some Jira tickets. Jira tickets are better, and if you're like me and really good with your ticket hygiene, you monitor the queue, and when a ticket comes in you acknowledge it and close it. But not everyone is as good with their ticket hygiene as I am. Here's a case with Yannis on my team, and an alert that fires when somebody logs into a production machine they haven't logged into in a long time; it's meant to detect malware attacks and things like that. In this case the ticket was created, and Yannis doesn't acknowledge it.

So I monitor the queue, find that this ticket is outstanding, and ping Yannis in Jira: hey Yannis, can you acknowledge this ticket? Another day goes by, no acknowledgement. I say: hey, ping, are you there? No acknowledgement. Another day goes by and I say: hey Yannis, can you acknowledge this ticket? I'm going to CC your manager. That ends up being kind of effective, but I don't really want to do that; I've got better things to do than sit there monitoring the queue and pinging people in Jira. So what do we do to solve some of these problems? Here's the solution for fixing the non-actionability.

First of all, just don't use email for alerts. Forget about it. I guess you can still send them to some alias, for when you're bored one day and want to go look historically at what alerts have fired, but don't expect any action to happen as a result of the emails that have been sent. As for Jira: Jira has a pretty cool feature called Service Desk. We've enabled the Service Desk feature on our security-alerts project, and that enables us to set up SLAs and queues. We can define that P0 (priority zero) tickets need to be turned around, or at least acknowledged, within one hour; we can set different SLAs for different priorities; and it's open-ended, because you can attach an SLA to any arbitrary JQL (Jira Query Language) query. The Jira queues are really cool too. You don't get them by default in Jira, but with a Service Desk project you can create different queues so you can see, at a quick glance, the distribution of the active tickets in your security-alerts project. I mentioned SLAs; up there you can see some examples of what the SLAs give you.
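Service Desk SLAs are configured inside Jira rather than in code, but the same JQL is easy to drive from a script. As a hedged sketch (the SECALERT project key, the "P0" priority name, and the "Acknowledged" status are made up for illustration), here is how listing tickets that have blown through a one-hour P0 SLA might look with the `jira` Python package:

```python
"""Sketch: list unacknowledged P0 security alerts older than an hour,
using the same kind of JQL you would attach to a Service Desk SLA."""
from jira import JIRA  # pip install jira

jira = JIRA("https://jira.example.com",
            basic_auth=("bot-user", "app-password"))

# Relative dates like -1h are valid JQL; swap in your project's values.
jql = ('project = SECALERT AND priority = P0 '
       'AND status != Acknowledged AND created <= -1h')

for issue in jira.search_issues(jql, maxResults=50):
    print(issue.key, issue.fields.summary, issue.fields.assignee)
```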

You'd be surprised at how effective it is when there's a little colored icon and a timer ticking down; it really gets people to want to take action and resolve the ticket before it turns yellow or red or goes negative. It's super effective. Psychology is awesome. But that's not the full answer, because people still might not be looking at those tickets, and then I have to get involved and refresh the JQL all the time, and I don't want to do that. So we built a service called the actionable alerting service.

The actionable alerting service enforces action on tickets, and it does this by doing two things. It finds tickets that are unassigned and attempts to find an assignee for each one, using a variety of heuristics that I'll talk about in just a second. And when it finds tickets that are assigned to people but past due, where the SLA has been breached, it knows how to escalate those tickets in various ways. It's a stateless service: it doesn't have a data store or anything like that. It just inspects the ticket metadata, and we put just enough information in the Jira ticket for it to carry out its duties.
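The talk doesn't show the service's internals; a minimal sketch of that stateless loop, under the assumption that the ownership hints live in ticket fields (the SECALERT project and the helper bodies are invented stand-ins, not Yelp's real code), might look like this:

```python
"""Sketch of the actionable alerting service: one stateless pass over
open security-alert tickets. All the state it needs lives in Jira."""
from jira import JIRA


def find_assignee(issue):
    """Ownership heuristics from the talk: extract the actor for
    self-service alerts, look up the PagerDuty on-call, or fall back
    to the alert's owner. (Stubbed here.)"""
    return None


def sla_breached(issue):
    """Stub: compare the ticket's age against its priority's SLA."""
    return False


def run_once(jira):
    open_alerts = jira.search_issues(
        "project = SECALERT AND statusCategory != Done", maxResults=200)
    for issue in open_alerts:
        if issue.fields.assignee is None:
            who = find_assignee(issue)
            if who:
                jira.assign_issue(issue, who)
        elif sla_breached(issue):
            # Escalation: nudge the assignee, eventually their manager.
            jira.add_comment(issue, "Past SLA -- please acknowledge.")


if __name__ == "__main__":
    run_once(JIRA("https://jira.example.com",
                  basic_auth=("bot-user", "app-password")))
```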

One of the ways the actionable alerting service assigns ownership to a ticket is something we call self-service alerts. The whole purpose of self-service alerts is to get the security team out of the critical path for getting acknowledgement on tickets. With a self-service alert, when someone does a common administrative task, like adding someone to the admins group or making some infrastructure change in AWS (opening firewall ports in network security groups, or whatever), the ticket gets automatically assigned to that person, so we don't have to go track them down. This is meant to do a couple of things.

It's meant to find sketchy internal actors doing weird things. It's meant to find mistakes, like someone accidentally opening all the firewall ports to the whole world on a certain EC2 instance. And it's meant to find potentially sketchy malware attacks, where someone's laptop is infected and is doing things on that person's behalf. In that case the ticket would get assigned to the person, they'd say, hey, I didn't do this, I didn't create this new user, and at that point the security team would get involved and we'd begin our incident response.

It's kind of on the honor system, so for the internal-bad-actor scenario you might think: well, if I'm an internal bad actor, I'm just going to acknowledge that ticket. But as part of our on-point process, we go back and retroactively audit all of those acknowledgements and make sure there's no funny business. Here's an example of a self-service alert. We have a Jira ticket, and in the ticket there's a little piece of metadata that says where to extract the actor name from the event. As you can see from the event, this was a Duo integration change.

In this case you can see duo_data.username is alec.t. The actionable alerting service looks up alec.t in Active Directory, assigns the ticket to him, and gives him a little ping in Jira. Sometimes the actor is not actually a real human; sometimes service accounts make these changes, which is common when the corp eng team has shared scripts or web applications that make certain changes for them. When that happens it's hard to find an owner, because you can't just assign a ticket to a service account.

In that case, the actionable alerting service will look up the user, determine that it's not a real user but a service account, track down the team that owns that service account, and ping them in Jira and in IRC or Slack or whatever chat mechanism they have: hey, group that owns this service account, one of you needs to please acknowledge this, because this thing happened. So that's self-service alerts, and that's one way the actionable alerting service assigns ownership.
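As a rough sketch of that actor-resolution step (the dotted-path convention and the directory helpers are assumptions for illustration, not the real implementation):

```python
"""Sketch: resolve a ticket's owner from the actor recorded in the event.
The ticket carries a dotted path (e.g. "duo_data.username") telling the
service where the actor's name lives in the event JSON."""


def extract_actor(event: dict, dotted_path: str) -> str:
    """Walk a dotted path like 'duo_data.username' through nested JSON."""
    value = event
    for key in dotted_path.split("."):
        value = value[key]
    return value


def resolve_owner(event: dict, dotted_path: str, directory) -> str:
    """Return a human assignee: the actor themselves, or, if the actor
    is a service account, the team that owns it. `directory` stands in
    for an LDAP/Active Directory client with these two lookups."""
    actor = extract_actor(event, dotted_path)
    if directory.is_service_account(actor):
        return directory.owning_team(actor)
    return actor

# e.g. extract_actor(event, "duo_data.username") -> "alec.t"
```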

Another way we can assign ownership is by referencing a PagerDuty schedule name in the ticket. This is super effective for getting the person who's currently on point looking at the ticket, instead of having it go to email or sit unassigned in the queue. Here's an example of that: the ticket has some metadata that names the PagerDuty schedule. This is an alert called "Google suspicious login", something pretty sensitive that we want the current on-point person to look at quickly. The actionable alerting service looks up the "malware on-point" schedule name in PagerDuty, finds that the current assignee for that schedule is Megan, and assigns the ticket to her.
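The talk doesn't show this lookup, but PagerDuty's public REST API v2 exposes the needed pieces. A minimal version (error handling omitted, token and schedule name as placeholders) could look like:

```python
"""Sketch: find who's currently on call for a named PagerDuty schedule,
via PagerDuty's REST API v2 and the `requests` library."""
import requests

API = "https://api.pagerduty.com"
HEADERS = {
    "Authorization": "Token token=YOUR_API_TOKEN",
    "Accept": "application/vnd.pagerduty+json;version=2",
}


def current_oncall(schedule_name: str) -> str:
    """Resolve a schedule name (e.g. 'malware on-point') to a user."""
    schedules = requests.get(f"{API}/schedules", headers=HEADERS,
                             params={"query": schedule_name}
                             ).json()["schedules"]
    schedule_id = schedules[0]["id"]
    oncalls = requests.get(f"{API}/oncalls", headers=HEADERS,
                           params={"schedule_ids[]": schedule_id}
                           ).json()["oncalls"]
    return oncalls[0]["user"]["summary"]  # e.g. "Megan"
```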

If it's not a self-service alert, and we think it's not important enough to justify getting the current on-point person assigned to it, then a kind of last-ditch effort is to assign the ticket to the owner of the alert. The owner of the alert is going to have the most context on what that alert means and how to deal with it, and that person can be held responsible for either following up with someone, initiating incident response, or acknowledging and closing the ticket. So now we've figured out how to assign the tickets to people; we have owners. But what happens when those owners are lazy and don't respond to their tickets?

In the past I had to go monitor the queue and say: hey, acknowledge this ticket; acknowledge this ticket or I'm going to CC your manager. Now we have a computer to do all that for us. It can ping the user in IRC, it can ping the user in Jira, and it can ping the user's manager in Jira, and you'd be surprised (actually, probably not surprised) at how effective it is when you CC someone's manager on communications. It really gets action to happen. Here's an example: a Jira ticket is past due, and a few days go by without acknowledgement. Then, on the second comment there, the actionable alerting service kicks in.

It says: hey Jose, this ticket is past its SLA for resolution, please take a look at it. And as you can see, a few moments later the ticket gets acknowledged. Here's an example similar to when I had to go bug people and CC their managers: the actionable alerting service notices that a ticket is past due, first tries pinging in Jira, and if that doesn't work, after another configurable amount of time it escalates by adding the person's manager to the ticket. As you can see there, it's super effective.
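A hedged sketch of that escalation ladder (the grace period, the `manager_of` lookup, and the Jira `[~username]` mention convention are stand-ins for the real integrations):

```python
"""Sketch of the escalation ladder: ping the assignee on the ticket
first; after a configurable grace period, loop in their manager."""
from jira import JIRA

GRACE_HOURS = 24  # configurable wait between escalation steps


def escalate(jira: JIRA, issue, hours_past_sla: float, manager_of):
    assignee = issue.fields.assignee.name
    if hours_past_sla < GRACE_HOURS:
        # Step 1: a polite, visible nudge on the ticket itself.
        jira.add_comment(
            issue, f"[~{assignee}] this ticket is past its SLA "
                   "for resolution -- please take a look.")
    else:
        # Step 2: add the assignee's manager as a watcher and mention them.
        manager = manager_of(assignee)
        jira.add_watcher(issue, manager)
        jira.add_comment(
            issue, f"Escalating: [~{manager}], this ticket assigned to "
                   f"[~{assignee}] has breached its SLA.")
```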

Another problem we had: there was no standardization with our alerts. You know how this is; there's no RFC for writing alerts, so one person thinks everything is really important and everything is P0, and another person thinks, eh, this is fine, it can wait until tomorrow. When there are no standards, you leave this to people's opinions, and then your alerts end up in various different states. It's hard to keep track of them, and when an alert comes in, it's hard to know whether to believe the priority, because at that point it depends on who the author was. So what did we do to fix that?

We implemented a playbook, a runbook for writing new alerts. It's a manifesto that defines what constitutes a priority-zero issue; when it's appropriate to page versus email versus file a Jira ticket (actually, it's pretty much never appropriate to email); what mandatory fields we require (at a minimum you should set the owner on the alert, so the actionable alerting service can do its job); and the various feature sets we have, like when it makes sense to make an alert self-service versus not. It also specifies the proper granularity for an alert.

For example, maybe you shouldn't have a P0 alert for "hey, a security group changed." Maybe it makes more sense to have a higher-priority alert for when the scope of access has been increased on a security group, while a decrease in the scope of access is less important and can be treated as a lower-priority thing to follow up on. Finally, the runbook talks about testing, and it basically mandates that any new alert you write needs to be tested. Speaking of testing: we've found in the past that a lot of our alerts were sitting there with latent bugs, such that the alerts didn't actually work.

Maybe it was a typo in the Splunk query, querying the wrong index name. Or maybe there was an assumption about the data: you assumed some field was always going to be present, but in the particular scenario you're trying to catch, that field is actually not present. That's a problem. Another problem is when the data sources themselves, or something upstream of your alerting framework, have problems. Sometimes a data source just drops off the map for some reason, either a full flatline or a dramatic drop in volume, like when you upgraded the Linux kernel on all your hosts and suddenly, after that day, no more logs came in, and you didn't find out for two months until you were investigating something else. That's a huge problem.

So what do you do to solve that problem? I touched on this earlier, but for all of the new alerts you write, you need to test them end to end. You can't call an alert done until you've actually seen it fire for the thing you're trying to catch. If you're trying to catch somebody creating a reverse shell, or doing something sketchy, or exfiltrating data: once you deploy that alert, go exfiltrate that data, go create a reverse shell, and wait five minutes to get that alert.
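The talk prescribes the practice rather than code, but the loop is simple to automate. A hedged sketch (the SECALERT project, the JQL, and the harmless netcat connection standing in for the "attack" are all illustrative):

```python
"""Sketch of an end-to-end alert test: perform the triggering behavior,
then poll the ticket queue until the alert shows up, or fail loudly."""
import subprocess
import time

from jira import JIRA


def test_reverse_shell_alert(jira: JIRA, timeout_s: int = 300):
    # 1. Trigger the behavior the alert is supposed to catch
    #    (a benign outbound connection attempt stands in for a shell).
    subprocess.run(["nc", "-z", "attacker.example.com", "4444"],
                   check=False)

    # 2. Poll until the alert materializes as a ticket, or give up.
    deadline = time.time() + timeout_s
    jql = ('project = SECALERT AND summary ~ "reverse shell" '
           'AND created >= -10m')
    while time.time() < deadline:
        if jira.search_issues(jql, maxResults=1):
            print("alert fired -- test passed")
            return True
        time.sleep(15)
    raise AssertionError("alert never fired; the rule is probably broken")
```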

If you don't get that alert, there's a problem, and you want to find it then, instead of realizing six months down the line that the alert doesn't work. And definitely create flatline alerts: make sure you have alerts that fire when the data disappears over some range of time, and not only complete flatlines but volume drops as well. And also test those flatline alerts.
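ElastAlert, the framework mentioned earlier, ships a `flatline` rule type for exactly this case. A minimal rule sketch in ElastAlert's YAML format (the index pattern, threshold, and recipient are placeholders to adapt):

```yaml
# Sketch of an ElastAlert flatline rule: fire when fewer than `threshold`
# events arrive within `timeframe`, catching both dead log sources and
# dramatic volume drops. Index name and email address are placeholders.
name: syslog-volume-flatline
type: flatline
index: logstash-syslog-*
threshold: 1000        # expect well above this many events per hour
timeframe:
  hours: 1
alert:
  - email
email:
  - "security-alerts@example.com"
```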

The last thing I'll talk about is false positives. Obviously there are always going to be false positives; they're really hard to prevent. For example, when the antivirus on a Mac laptop finds a Windows executable, that's probably not that big of a deal; that's a false positive. When you deploy a new production service and it's listening on some ports that were never being listened on before, an alert is probably going to fire, and it's not really a big deal; that's a false positive. So how do we rectify these false positives? Through automation. Automate your incident response steps. Make it so that when an alert comes in, you can make the call on whether or not it's a false positive really quickly, so you're not spending a whole bunch of time chasing your tail trying to figure out whether it's a false alarm or something real.

You definitely don't want to spend two days on something that ends up being a false positive. So write the scripts. And if you're writing a script that automates a whole bunch of steps for when an alert comes in, consider pushing that upstream into your alerting framework: why have a human being involved in hitting enter on some script when you can just plug it into the alerting framework itself?
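As a hedged illustration of pushing triage upstream, using the talk's own two examples as rules (every field name here is invented; a real hook would match your pipeline's event schema):

```python
"""Sketch of an enrichment hook that runs the same checks a human
would, and downgrades known-benign cases before a ticket is filed."""


def triage(alert: dict) -> dict:
    """Annotate an alert with a verdict; called by the alerting pipeline."""
    if (alert.get("type") == "antivirus"
            and alert.get("platform") == "macos"
            and alert.get("sample_type") == "windows_pe"):
        # A Windows executable flagged on a Mac can't run there.
        alert["verdict"] = "false_positive"
    elif (alert.get("type") == "new_listening_port"
          and alert.get("host") in recently_deployed_hosts()):
        # A fresh service deploy explains the new open port.
        alert["verdict"] = "expected_change"
    else:
        alert["verdict"] = "needs_human"
    return alert


def recently_deployed_hosts() -> set:
    """Stub: ask your deploy system which hosts changed recently."""
    return set()
```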

So how do we measure success? This is really hard for me to do, because in the past we didn't have much visibility into what was going on; we just had a bunch of emails that were not being acknowledged, tons of events piling up all over the place, and it was really hard to tell what was happening. What I can tell you today is that our active ticket queue is totally manageable, and the SLAs are breached less than 50% of the time, so that's good. We've gotten positive responses from the people triggering these alerts, which most of the time is our operations team and our corp eng team, and the security team is happy because we're no longer the critical path for getting traction when these alerts fire, so we can work on bigger and better things.

So here's the recap of the problems and the solutions; you can look at those on the slides later. The three things I really want you to take away from this talk: one, make your alerts actionable; two, make sure you have visibility into your alerting metrics; and three, make sure your alerts actually work by testing them. That's all I have to say. [Applause] Oh, I had one more thing to say, I guess: here are some social links and social media stuff for Yelp. Thank you. [Applause] Thank you, Daniel, on behalf of BSides. Thank you.
