
Dispatch: Crisis Management Automation

BSidesSF · 2020 · 25:43 · 769 views · Published 2020-03
Category: Technical
Style: Talk
About this talk
Marc Vilanova, Forest Monsen - Dispatch: Crisis Management Automation When Everything is On Fire We built Dispatch to automate our entire crisis management lifecycle, from initial report, to resource creation, participant assembly, task tracking and post-incident reviews. We want you to use it someday too, so we'll explain how it helps us, and why you should check it out.

We're security incident responders, and we face all the problems that security incident response teams face. In the very beginning stages of an incident, we know that something is going wrong. We have to open up an incident; we have to act. But like all the other teams, like some of you, we don't necessarily have all the information. We've got incomplete information. Maybe we have one piece missing; maybe there's a bunch of things we don't know. We don't know the full scope of the incident. We don't necessarily know what kind of incident it is in those very early stages. We don't know which of many systems or microservices may be affected. We don't know whom to involve, of course.

And if we don't know all of these things, we certainly are not going to know whether attackers are actually present and, if so, what systems they are bothering. So we know very little, and in order to figure this stuff out we have to investigate. So we get an incident commander involved, and they have to start looking into the major questions of the incident. If we don't know whom to involve, and we don't know which systems are affected, or any of these things, it's those questions that are key, and they've got to start asking all of them. So how do they do that? Well, they communicate. They're gonna open up a conference bridge or a private chat

channel and start asking which systems are affected, whom to involve, etc. They're probably going to create an email distribution list; they need to in order to communicate with other people. They're gonna create shared storage to store incident artifacts or logs. They're going to need to invite the right participants, probably based on the incident type and maybe severity. They have to talk to all those participants and orient them as they come in, make sure they know what's going on, maybe with a Conditions, Actions, Needs, or CAN, report; that's the form we use. And they're going to need to create access groups in order to manage access to all these resources that we've been talking about. They're going to

create an incident investigation document; that's where all the people who are contributing can come in and type in their information on what they've discovered. That's how we work. They might create spreadsheets for affected systems, etc., as well. So the context switching, back and forth, back and forth, of setting all of this up manually incurs a lot of overhead, and it's going to quickly mentally exhaust you. It's going to be very taxing, as I'm sure you may have faced. So, out of curiosity, is this already sounding familiar to anybody in the audience? Is this something that you've seen before? Okay, yeah, so there's a number of us who face this. So I'm gonna tell you, a big part of

the issue we faced with this exhaustion was due to that multitasking that we had going on. Multitasking may be fun at the beginning, it may be exciting, but quickly that exhaustion is going to happen. And we know now that it doesn't exist, that you can't actually multitask. You're actually switching back and forth between more than one thing; your brain doesn't actually do more than one thing. That taxes your brain a lot, and you're gonna be less effective at both of those things, and the current neuroscience bears this out. So our incident commanders were busy trying to answer the questions of the incident, and they were firing up all these documents, managing all these channels and the

people, and found that quite taxing. So on top of all that, how could we expect the incident commanders (that's us, we were doing this, and our teammates), how could we expect ourselves to perform well during the incident, to be on point, to be paying attention, to actively answer those questions of the incident? Well, we noticed it, and we felt that it was painful. We wanted to be better at it. So I'm Forest Monsen, with me is Marc Vilanova, and we're just part of the team that came up with a better answer to this problem for Netflix Detection and Response. A little bit about our approach to crisis management and incident response in

general: we want our team and the incident commander to take care of all this confusing stuff in a consistent and familiar way. We believe that as participants continue to participate in incidents, it builds their confidence in the process, and it also makes them more effective, we believe, the second, third, fourth time they get involved, when we ask them to be involved with us in an incident. That's important because it reduces the amount of time spent flailing around. We want to aggressively eliminate decisions that don't matter. We want to reduce our mean time to assemble; that's getting the right people engaged and getting them in the room. We want to reduce our mean time to

stable. For us, time to stable is the period of time during the active incident until we believe that the major questions of the incident have been answered; we call that stable. If you're familiar with the National Institute of Standards and Technology's (NIST's) stages of incident response, that's the containment phase, but what that looks like can vary depending on the incident type. Importantly, we also want to learn from every incident, and we absolutely need a system that helps us to do that. We want to take all these lessons learned from each incident and drive them back into the organization. We hope we're getting better at security with every incident. So for our first

attempt at this, to address these questions, we bought a third-party proprietary security orchestration and response tool that would give us workflows. We tried a workflow tool and we customized it to get ourselves those workflows. We added a bunch of our own Python code as well. There was a lot we liked about it; it was working for us and we learned from it. But ultimately we found that experience unsatisfying, because the tool wasn't actually built to do what we wanted it to do. So we built Dispatch. We released it into production just a few weeks ago, and we're relying on it every day now, and we're very excited about it. I just wanted to come talk about it. Marc

is gonna take you through how Dispatch answers these questions and automates away a lot of that pain we were feeling. He's gonna show you what that looks like with some screenshots from a participant's experience. Marc? Thanks, Forest. The first thing that we wanted to do with Dispatch was to standardize internal reporting. We wanted to move away from getting incident reports via email, or using a form with limited functionality, because we wanted to provide a great experience to our customers, our internal customers. So we ended up building our own intake form. It's very simple: it only requires an incident title and an incident description. If you as a reporter happen to also know the

incident type or the priority that we should give to this incident, you can also provide that, but it's not required, and we can always change it later. Once the incident form is submitted, we present the reporter with this information. On your left-hand side you can see the information that the reporter submitted to us, as a confirmation, plus the incident commander that got assigned to this incident, in case they need a point of contact right away. On the right-hand side, as resources are getting created, you can see them displayed on your screen, and then at the end you get a list of all the resources that Dispatch created for you, such as a Jira ticket, or a Slack private

conversation, some storage for files, an investigation document, an investigation sheet, and also we usually provide a link to our security incident FAQ in case it's the first time that you participate in an incident and have questions about it. Dispatch not only creates all these resources for you, so you as an incident commander can focus on the things that matter, but also helps us provide consistency in our process, like Forest was mentioning. It will also manage the resources throughout the lifecycle of the incident for you, so you can focus on what matters. Dispatch will also add the right people based on their preferences. We provide an interface for people to let us know when they should

be engaged. Some people need to be on every incident; some people only for specific types of incidents or priorities, or maybe because a keyword was mentioned in the title. So people can go and explicitly tell us when they should be engaged. We can also recommend on-call people based on their team's preferences as well. Dispatch will also page people based on the severity: if it's high enough, it will go and page the incident response team on call. It also notifies an incidents channel, a more general incidents channel, and a distribution list. These are for every incident, and if that incident changes its status, that allows us to keep all the stakeholders informed. These are a couple of examples of how

the notifications look. On the left-hand side you have a Slack notification; on the right-hand side, a similar one that goes out via email. Anyone in the incidents channel, the general channel where people can see when there are new incidents going on, can get involved in an incident by clicking a button. So on every incident notification there's a button that people can click to get themselves added to the incident channel and later to the incident. When people get added to the incident, Dispatch will not only add them to everything they need to be added to, but also provide them with context and access to the resources that they need. This is how the

welcome message, as we call it, looks when someone gets added to an incident. On the left-hand side you can see an ephemeral message that the participant gets right when they join the incident channel, and on the right-hand side, the email that they get with the same information. This accomplishes two things: one, people don't need to reach out to the incident commander to ask, "Where is the information that I need to get myself up to speed and start contributing?", and two, it frees the incident commander from doing that as well. Dispatch will also announce participants as they get added to the incident channel. As you can

see, we provide the name, the team, the location, and the role that they have as a participant in the incident. Some of you may have noticed that we didn't include the job title. We did that on purpose. Does anyone know why we did that? We did it to avoid the power differential, the executive swoop problem. We want everyone to be at the same level when responding to an incident, and we don't want an executive to drop into the incident channel and then steer the incident commander in a different direction than they were planning to go, because the incident commander is supposed to be the person with

the most knowledge and context about the incident. Dispatch also allows you to hand off the incident and assign roles to participants, such as scribe, liaison, reporter, or incident commander. We use Slack dialogs to do that. At the top left you have the Slack dialog where you can choose the participant that you want to assign the role to, and then at the bottom, the role that you want to assign to that person. Once a role gets assigned, Dispatch will announce that assignment in the channel, so everyone sees who assigned which role to whom. If it's incident commander, that change will propagate to all the systems. You can

also engage on-call people directly with Dispatch and page them if you want. We also use Slack dialogs for that. You can select the on-call service that you have defined in PagerDuty, and you can choose whether you want to page them or not. Dispatch also makes it very easy for incident commanders, or scribes supporting the incident commander, to write and share status reports. They can pop up this Slack dialog, which uses the Conditions, Actions, and Needs format, to write a report, and once submitted it goes to a distribution list and also gets posted in the channel for everyone to see. Dispatch also notifies about incident tasks, whether they are new

tasks or tasks that have been resolved. It creates tasks out of comments in any of the incident documents and keeps track of them on a scheduled basis. These are the screenshots of how these notifications look. Dispatch also reminds the task owners to complete those tasks if they haven't been completed on time. We noticed a decrease in resolution time for these tasks as soon as we started exposing them in the incident channel. They could be buried in a document, but surfacing them in the incident channel, where everyone is communicating, allowed us to reduce the resolution time. Dispatch also allows you to manage the lifecycle

of the incident. If the risk has been contained, you can mark the incident as stable; if the incident has been resolved, you can mark it as closed. It also calculates the incident cost. We designed a cost model; it's pretty simple at this point, but it basically looks at how much time people have spent trying to resolve the incident, and then we combine that with an average cost per employee to come up with a number that then gets published and shared. Dispatch is written in Python using the FastAPI framework, uses Postgres to store and manage data, and provides a Vue.js-based front end. It's plugin-based, so you can easily extend it. We have plugins

for Jira, G Suite, Slack, and PagerDuty. So what's in the future? Well, we have a few things we want to build. We want to continue building our recommendation system. We want to be able to recommend people and documentation, perhaps previous incident documents or maybe runbooks, based on the information that we know at that point during the incident. We also want to be able to timeline an incident, to have all the facts and the actions we're taking throughout its lifecycle, and also provide metrics, so you can know: okay, how many incidents did we have this month, how much did they cost us, how many incidents are we gonna have next month? We do forecasting as well.
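Editor's sketch: the cost model described in the talk (time people spent on the incident, combined with an average cost per employee) can be illustrated roughly as follows. This is not Dispatch's actual code; the `Participation` record, the `incident_cost` function, and the hourly rate are all assumptions made up for illustration.

```python
from dataclasses import dataclass
from datetime import datetime

# Assumed figure for illustration only; the talk does not give a number.
HOURLY_COST_PER_EMPLOYEE = 100.0


@dataclass
class Participation:
    """Hypothetical record of one person's time on an incident."""
    name: str
    joined_at: datetime
    left_at: datetime


def incident_cost(participations, hourly_cost=HOURLY_COST_PER_EMPLOYEE):
    """Sum the hours every participant spent and multiply by an average rate."""
    total_hours = sum(
        (p.left_at - p.joined_at).total_seconds() / 3600
        for p in participations
    )
    return round(total_hours * hourly_cost, 2)


if __name__ == "__main__":
    start = datetime(2020, 3, 1, 9, 0)
    people = [
        Participation("commander", start, datetime(2020, 3, 1, 11, 0)),  # 2.0 h
        Participation("scribe", start, datetime(2020, 3, 1, 10, 30)),    # 1.5 h
    ]
    print(incident_cost(people))  # 3.5 hours at the assumed rate
```

A single average rate keeps the model simple, which matches the speakers' description of it as "pretty simple at this point"; per-role rates would be one possible refinement.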

So to summarize: our incident response process was very painful, and so we built Dispatch to automate it. We have open-sourced Dispatch today, and we are hoping that you're going to check it out and find it useful. Feel free to send us feedback and contribute by sending PRs. And we have, I think, time for questions. [Applause] Integration? No, we don't have that integration. The question was whether there was integration with ServiceNow. No, but like we said, it's plugin-based, so you can write a plugin for it and submit a PR; we'd be happy to accept it. There's a question up there.
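Editor's sketch: to make "it's plugin-based, so you can write a plugin for it" concrete, here is a generic illustration of a plugin registry in Python. It does not reproduce Dispatch's real plugin interface; `TicketPlugin`, `register`, `get_plugin`, and the in-memory example are hypothetical names chosen for this sketch.

```python
from abc import ABC, abstractmethod

# Hypothetical registry mapping a plugin slug to its class.
PLUGINS = {}


class TicketPlugin(ABC):
    """Hypothetical contract that every ticketing integration would implement."""
    slug = ""

    @abstractmethod
    def create(self, title, description):
        """Create a ticket and return a dict describing it."""


def register(plugin_cls):
    """Class decorator that makes a plugin discoverable by its slug."""
    PLUGINS[plugin_cls.slug] = plugin_cls
    return plugin_cls


@register
class InMemoryTicketPlugin(TicketPlugin):
    """Stand-in for a real integration; stores tickets in a list."""
    slug = "in-memory-ticket"

    def __init__(self):
        self.tickets = []

    def create(self, title, description):
        ticket = {
            "id": len(self.tickets) + 1,
            "title": title,
            "description": description,
        }
        self.tickets.append(ticket)
        return ticket


def get_plugin(slug):
    """Instantiate the plugin registered under the given slug."""
    return PLUGINS[slug]()


if __name__ == "__main__":
    plugin = get_plugin("in-memory-ticket")
    print(plugin.create("Suspicious login", "Reported via intake form"))
```

Under this kind of design, the core application depends only on the abstract contract, so supporting another ticketing system means adding one more registered class rather than changing the core.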

So the question was around our treatment of PII, or personal information: how do we treat that during an incident to make sure that it doesn't get broader exposure? Is that right? Okay, cool. So we have a general set of protocols that we use for handling personal information, and we basically try to touch it as little as possible. We couple this with Netflix's freedom and responsibility culture, which means that we do have that get-involved button, and anybody can step into an incident, so at least we can see who else is involved in the incident and talk directly to them. So I think that it's

mostly the responsibility that we place on people, as well as the regular protocols that we follow. Dispatch doesn't have specific controls around that. Yeah, I think it's also worth mentioning that the way we give access to information is this: when people join the incident channel, we capture that event, and then we add them to a Google Group that then determines who has access to all the information. So we have the Google Group and the Google Drive, and then all the documents there. As people come and leave, we add them to or remove them from the Google Group. That way, we use the Slack channel as the source of truth

for the people that are involved in the incident.
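Editor's sketch: treating channel membership as the source of truth for an access group amounts to a set reconciliation. The function below only computes which addresses to add and remove; it makes no Slack or Google API calls, and `plan_group_sync` plus the example addresses are hypothetical.

```python
def plan_group_sync(channel_members, group_members):
    """Compute the changes that make the access group mirror the channel.

    channel_members: emails of people currently in the incident channel
    group_members:   emails currently in the access group
    """
    channel = set(channel_members)
    group = set(group_members)
    return {
        "add": sorted(channel - group),     # joined the channel, not yet in group
        "remove": sorted(group - channel),  # left the channel, still in group
    }


if __name__ == "__main__":
    plan = plan_group_sync(
        ["ana@example.com", "bo@example.com"],
        ["bo@example.com", "cy@example.com"],
    )
    print(plan)
```

The talk describes reacting to join and leave events as they happen; a periodic reconciliation like this one would be a complementary way to correct any drift if an event were missed.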

Where the... it's open source, and the question is where is it hosted? Yes, it's on GitHub, yes: Netflix/dispatch. Sorry, yeah, the link is kind of hidden in that text. Is there another question? Oh, we're sorry.

Okay, so the question was why didn't we implement voice communication, like an automated setup of a conference bridge, and is that something we're planning or have thought about? Yeah, okay. That's a really cool, that's a great feature. We just happened, around our culture, to centralize on Slack, even though we will jump right into a situation room or meet in person if need be. And Slack, we found, had all these benefits around the ChatOps capabilities as well as controlling the access. So it's not like we wouldn't ever do that; we've talked about it. We do have plans to create a Google Hangout in case people need to jump on a bridge,

because usually, if it's a high-severity incident, the channel will be very chaotic, and a way to get rid of that chaos is getting on a bridge, where the incident commander can then moderate the conversation. So we do have plans to incorporate Hangouts, but we don't have it right now. There's no... [Music]

Sure, yeah. The question was: we often have to work in documents, unstructured documents, so how do we normalize that data if we're gonna pull metrics out? And you said often the problem is you have more than one data system. Well, our solution to that is that we have more than one data system. We use Jira as a way to maintain structured, field-level details about an incident, and we do keep things in the document, but we can draw conclusions about our incidents over time by pulling results from there. Did you want to say more on that? Yeah, I just want to add that the reason

why we stick with Google Docs is because it's what everyone at the company is used to. We didn't want to go and create something new that no one really knew how to use and that we'd have to train them on. Using the tools that everyone at the company already uses makes the process easier and more familiar. But yeah, it's a challenge. I think we're planning to add some tagging or labeling of incidents in Dispatch later, but we don't have it right now. We have a question from Slido. It's: how difficult was it to get the rest of the company to use this reporting flow

and not use the old methods? A question from anonymous. Anonymous? Oh, they're looking for us. So the question, and you probably heard that already, was how do we get the rest of the company to use it. Well, I mean, we kept telling them about it. Do you have a better answer than that? No. If they were reporting to us through different channels, we would point them to the form, and then over time people started using it and we don't have to tell them anymore. And it's a pretty simple form. Like we said, it only requires a title and description, because we don't want people to get stuck on a form that

has a lot of questions when what really matters is that we kick that incident off, and then we can figure out the rest of it later. Cool, last question. Oh, okay, sure. Just curious if you ever thought about putting it on a segregated system, apart from primary infrastructure, so you could still... So the question was, have we thought about having it on a segregated system, so if, let's say, an incident affects our hosting platform where we have Dispatch, will we still be able to handle an incident? It is sitting in a different AWS account from prod, test, and other infrastructure. So yeah, security has its own prod and test accounts, so it lives there. Yeah, but if

there is a large-scale AWS collapse, we will be out of luck, I think. Thank you all very much. Yeah, thank you very much. [Applause]