
Practical Cyborgism: Getting Started with Machine Learning for Incident Detection

BSides DC · 2016 · 53:15 · 1.9K views · Published 2016-11 · Watch on YouTube ↗
Category: Technical
Style: Talk
About this talk
Organizations today are collecting more information about what's going on in their environments than ever before, but manually sifting through all this data to find evil on your network is next to impossible. Reliable detection of security incidents remains elusive, and there is a distinct lack of open source innovation. It doesn't have to be this way! In this presentation, we'll walk through the creation of a simple Python script that can learn to find malicious activity in your HTTP proxy logs. At the end of it all, you'll not only gain a useful tool to help you identify things that your IDS and SIEM might have missed, but you'll also have the knowledge necessary to adapt that code to other uses as well.

David Bianco (Lead Security Technologist at Sqrrl)
David J. Bianco, Lead Security Technologist, Sqrrl Data, Inc. Before coming to work as a Security Technologist and DFIR subject matter expert at Sqrrl, David led the hunt team at Mandiant, helping to develop and prototype innovative approaches to detect and respond to network attacks. Prior to that, he spent five years helping to build an intel-driven detection & response program for General Electric (GE-CIRT), where he set detection strategies for a network of nearly 500 NSM sensors in over 160 countries and led response efforts for some of the company's most critical incidents. David stays active in the community, speaking and writing on the subjects of Incident Detection & Response, Threat Intelligence, and Security Analytics. He is also the person behind The ThreatHunting Project and a member of the MLSec Project. You can follow him on Twitter as @DavidJBianco or subscribe to his blog, "Enterprise Detection & Response".

Chris McCubbin (Director of Data Science at Sqrrl)
Chris McCubbin, Director of Data Science, Sqrrl Data, Inc. Chris is the Director of Data Science and a co-founder of Sqrrl Data, Inc.
Chris' primary task is prototyping new designs and algorithms to extend the capabilities of the Sqrrl Enterprise cybersecurity solution. Prior to co-founding Sqrrl, Chris spent two years developing big-data analytics for the Department of Defense at TexelTek, Inc. and ten years as Senior Professional Staff at the Johns Hopkins Applied Physics Laboratory, where he applied machine learning algorithms to swarming unmanned vehicle ensembles. Chris holds a Master's degree in Computer Science and Bachelor's degrees in Mathematics and Computer Science from the University of Maryland.

Thanks to our video sponsors:
Antietam Technologies: http://antietamtechnologies.com
ClearedJobs.Net: http://www.clearedjobs.net
CyberSecJobs.Com: http://www.cybersecjobs.com
Transcript [en]

The BSides DC 2016 videos are brought to you by ClearedJobs.Net and CyberSecJobs.Com, tools for your next career move, and by Antietam Technologies, focusing on advanced cyber detection, analysis, and mitigation. Good afternoon, BSides! Thanks for having us here. My name is David Bianco, and I'm here with my friend Chris McCubbin. We both work for a vendor called Sqrrl, but we're not really talking about Sqrrl today. I want to talk instead about something where Chris is our real expert: machine learning. I, as a security analyst, have been getting more and more into incorporating machine learning tools into my daily work,

and I felt like we're getting to the point now where you don't really have to be a machine learning expert to get some value out of machine learning for common security tasks. So I wanted to put together a talk, and Chris and I collaborated on this. The central thesis of our talk is that it is now perfectly feasible, and perhaps even time, for most of the people in this room to go back to their office tomorrow, find some intractable big-data problem they've been having, and pull a new tool out of the toolbox: some simple machine learning. So that's what we're going to talk about today. We

have a demonstration program on GitHub (I'll show you the URL for that later) where you can actually see some machine learning code examples in Python. We're not going to talk a lot about the Python piece today; instead, we're going to talk a little bit more about the ideas behind machine learning, so you can make some intelligent choices.

So, show of hands: who in here has ever heard this piece of advice, that it's a best practice to do log review every day or every week? I'm disappointed, there's only like five or six people. Maybe that means everyone has just banished this from their minds; this is no longer a feasible thing to do, right? If this was Family Feud, it would be like, "Survey says: no." This is not really considered a best practice anymore, although when I started in security, a long time ago, it definitely was. A system administrator could actually sit down and read their syslogs, maybe with

a little help from grep or something like that to format things, but that's no longer the case. Even a small or medium-sized business can potentially generate gigabytes of log files every day, and when you get to an enterprise level, you are probably at least at terabytes, if not higher. So the problem is that we're so good at logging now, which is really a good thing (for us, visibility is always good), but it brings this problem with it: we are no longer operating at a purely human scale. My take on this is that's okay, because now we are at the point where we have the humans, like us, right? We're

really good at understanding context, be it the business logic that goes along with some process, or maybe some intelligence that we've read, or even the idea of "I know how this type of system is supposed to work": I know in general how an email server works, and this looks weird. We're really good at that. We're so good at it that sometimes we find patterns where there are really no patterns, like we find Elvis on toast. He's not really hanging out on your Wonder Bread, unless maybe you put some peanut butter and bananas on it; then he might be there. But our brain is wired for patterns, and we see that. That's a rare false

positive. We can make use of our pattern ability by combining it with some of the things that computers are good at: repetition, going through large volumes of data, applying an algorithm tirelessly, doing the same things over and over again. When you put those two together, you have what I think of as the closest real-world example of cyborgism I can come up with, a cyborg being part human, part machine. That's the title of this talk, right? Practical Cyborgism. It is now practical to combine the human and the computer with your big-data security problems (I just hit all the buzzwords, sorry)

to get some practical, useful analytics out of it. So here's a problem statement that you can't read, because our projector is not really good enough to read our slide, but it's basically a bunch of small-type Bro HTTP logs, a big wall of text. You don't really have to read it; just know that your logs are a big wall of text. Even if I show you just one screenful at a time, there's no possible value you could get out of it without staring at it for five or ten minutes. And if you think about the scale of data your organizations are collecting, do you really have five or ten minutes to stare

at each screenful? No, not really. So Chris and I have come up with, I don't want to call it a solution to this really, but a project; I would like to say it's a prototype, a proof-of-concept solution for this kind of problem: cutting through all the logs to get to the things we are interested in. We call it Clearcut, and it's available on GitHub. It's written in Python with some standard Python machine learning libraries, as an example that you'll be able to go look at after the talk to see how we did it. We don't necessarily think you're going to immediately put this into production use, but you could

definitely give it a try. The idea is that it's a functional proof of concept that you might learn from or extend. We work with Bro HTTP logs in this proof of concept, but we provide pointers and guidance on how to extend it to whatever kind of logs you have in your environment. So at this point I'm going to turn it over to Chris for a few minutes, and he's going to talk to you a little bit about some of the concepts of machine learning. After that, I will come back up and talk a little bit more about Clearcut

itself. We'll do a little screenshot demo, I'll talk about how to interpret some of the outputs of the machine learning process, and then we'll do a wrap-up. (Let me try to turn this on without too much feedback here.) Thanks, David. So yeah, I'll talk for a few minutes about some of the important concepts. All right, we'll start out here with a little analogy. I'm going to claim that good theory leads to good programs. The whole idea of computer science, the discipline, is that you want to come up with good theories, and those lead to good solutions to real-world problems. So who here has, you know, implemented

and optimized a nondeterministic finite state automaton compiler? I mean, maybe I did it in school, I don't really remember, but you probably use one every day. Who here uses grep or Perl regexes? Everybody. What you're really doing there is running a compiler for your nondeterministic finite state automata; you're really making one of those, but you don't care. You don't care that it's an NFA inside. You might need to know a little bit about its quirks, though. For example, grep and Perl (and everything based on Perl regexes) are actually implemented quite differently. Grep is based on a sort of compiler

that's more efficient and doesn't do backtracking, so grep doesn't have any bad cases; you can't make grep blow up on various regexes. Perl does do backtracking, and because of that you can write regexes that make Perl go essentially exponential in running time, even on very, very simple strings, with patterns like (a*)* or something like that. So it's helpful to know this. You don't really have to know the theory of why, but you might want to know that, yeah, sometimes Perl blows up on regexes. That's the analogy here: you don't have to

know the intricacies of, say, random forests to use random forests, or to use machine learning in general, but it might help to know that sometimes these things blow up, or what they're good at. So, machine learning has a long, storied history. It goes way back to huge names like Alan Turing, who invented things like the Turing test, and it's intertwined with artificial intelligence and statistics; you could have a whole talk on the history of machine learning, or of AI, machine learning, statistics, data science. I'm not going to talk about that much here. But basically there are two, and even two is kind of an

oversimplification, but there are two big branches of machine learning that we're going to talk about today: one is supervised, the other is unsupervised, and they do different things. So, supervised learning: you need labeled training data, that's the big thing. You need examples to show the machine to help it learn, and given those examples, it will try to figure out patterns that it can apply to unlabeled data, to try to label that data. So you have a bunch of pictures of cats and dogs, some of them labeled "cat" and some labeled "dog," and you pass them to the machine, and then you show it a picture

and it says, "I think it's a cat," and maybe it's right. If it's right, that's a true positive; if it's wrong, that's either a false positive or a false negative. You'll hear these terms. Is a cat positive? I guess it depends if you're a cat person, but usually you have a positive class and a negative class, or a true class and a false class. A true positive is something the model is right about that's in the positive class, and a false positive is something it says is positive but isn't. The other big area we're going to talk about is unsupervised learning.
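That true/false positive bookkeeping can be sketched in a few lines of plain Python. The labels below are made up for illustration, with "cat" arbitrarily chosen as the positive class:

```python
# Count true/false positives and negatives by comparing predicted labels
# against the real ones.  "cat" is the positive class here.
truth = ["cat", "cat", "dog", "dog", "cat", "dog"]
preds = ["cat", "dog", "dog", "cat", "cat", "dog"]

tp = sum(t == "cat" and p == "cat" for t, p in zip(truth, preds))  # said cat, was cat
fp = sum(t == "dog" and p == "cat" for t, p in zip(truth, preds))  # said cat, was dog
fn = sum(t == "cat" and p == "dog" for t, p in zip(truth, preds))  # said dog, was cat
tn = sum(t == "dog" and p == "dog" for t, p in zip(truth, preds))  # said dog, was dog
```

Every prediction falls into exactly one of the four buckets, which is why accuracy is just (tp + tn) divided by the total.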

So this is: I have no idea what's in my data. Keeping with the pictures example, I have a million pictures and I don't know what's in them: cats, dogs, aardvarks, clouds, whatever. So I'm just going to give this to the machine, and the machine is going to look at this data and come up with some of its own conclusions. It's going to come up with some of its own groups that it thinks these things are part of, or it's going to come up with what it thinks is normal. Then, when you pass it a new picture, it's going to tell you, "Oh, I think this is group two." Well, group two

might be cats, but the computer doesn't know that. All the computer knows is, "I learned about all these things in the pictures, and I think this looks like one group of things, and your new thing is in that group." Or it's going to say, "This new picture looks totally normal, that's fine," or, "It looks weird." But what's normal and what's weird depends on what you gave it in the beginning, and to give it context you have to apply some knowledge afterwards. Or you can just say, "I'm going to detect outliers, weird things that don't look like my original data." So that's what unsupervised learning will

give you. In Clearcut, we have an example of doing supervised learning with random forests and unsupervised learning with isolation forests. I think one of the reasons you're hearing a lot about machine learning nowadays is that there have been a couple of breakthroughs, I would say, that have really improved the performance of the algorithms, by close to an order of magnitude over what they were for a long time. In the case of random forests, they started solving problems at a level like 10 to 20 percent better, in terms of how often they got things right, than algorithms had for a period of maybe 10 to 15 years. So it

was a really exciting time, maybe five or ten years ago, when these things came out. Deep learning came out too; that also had a huge jump in performance, and you saw a big leap in the capabilities of these algorithms. Along with that, they moved out of the academic realm, and a whole bunch of people got together and started writing libraries to do these things for non-practitioners, and that's what we're using. So we'll talk a little bit about what these algorithms are and what they're good at. We've chosen random forests as an example, and that's a case of supervised learning. All right, so again, supervised learning:

you have a population of things, two types of things. This projector is kind of bad, I don't know if you can see it, but some of these things are blue, up here, some of them are orange, down here, and there's a whole bunch of gray dots we don't know about. So we have labeled data: we know what the blue things are (you really can't see unless you squint), we know what the orange things are, and we have a whole bunch of gray things we don't know. What we want to do is pass this to the machine, and the machine is going to find a function that separates these things. And

in this case, I don't know if you can even see this line here (there's a line right here; no, you can't really see it), but it's going to come up with a function that says: okay, there's a line here; everything to the upper right of this line is blue, everything to the bottom left of this line is orange. In this very simple case it's easy to say, okay, most of the things that are blue are up here, most of the things that are orange are down here, and we can easily draw a line between them. That would be a very simple classification algorithm, one that tries to make the maximum

separation between these two sets, and there are algorithms that do things like this. In this case you want to optimize something; we usually call this a loss function. We might want to optimize how many things we get wrong, or we might want points to be as far from the line as possible when we know things about the labels. None of these is going to be perfect, almost never; you can't see it, but even with this line here, maybe we have a blue one on the wrong side, or an orange one on the wrong side. These

will be your false positives or false negatives. But how close can we get to being perfect? It might not be a line; there are lots of different ways to create this function. It can be a circle, it can be a squiggly line. I actually retweeted something the other day that shows the way different algorithms separate different classes of things, based on where they are. Lines are great if everything is sort of on one side or the other, but if one group is in the middle and another group forms a doughnut around it, a line would really

be bad in that case, right? What you want to do is try to create a general algorithm that will handle all cases, and that's really hard. But a couple of algorithms, like this one, came out that do pretty well in most cases. So this is random forests, and one nice thing is that Python has a really good implementation of random forests in a toolkit called scikit-learn; I'll talk about that a little bit. The sequence here is: you identify positive and negative sample data sets (these are your labeled data sets), and you always have to clean and normalize the data; every data scientist here knows

that, right? Then you partition the data into training and testing data sets. What you do is take your labeled data and split it into two pieces: one piece is the set you train on, and you hold back an independent set to check that your algorithm is working correctly. You train the model using the software, then you test the model on your held-back set and see how well it does. You record how many times it gets things right, how many times it gets things wrong, you see how good that is, and then maybe drink a beer after that.
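That whole procedure (label, split, train, score on the held-back set) fits in a few lines of scikit-learn. This is a generic sketch on synthetic data, not the actual Clearcut code:

```python
# Sketch of the supervised-learning procedure described above, using
# scikit-learn on synthetic data standing in for cleaned, labeled logs.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Steps 1-2: a labeled, already-numeric data set.
X, y = make_classification(n_samples=1000, n_features=10, random_state=42)

# Step 3: partition into a training set and a held-back test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Step 4: train the model.
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)

# Step 5: score on the held-back set to see how often it gets things right.
accuracy = clf.score(X_test, y_test)
```

Swapping in real features extracted from your own logs is the only part that changes.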

So how does the training work? In particular, how does a random forest do things? A random forest creates a bunch of these things called decision trees. Before, we drew a line; that's like deciding what the y = mx + b is. A decision tree works by taking your set, picking one factor out of that set, one column of data, and asking: which column of data is the most predictive for your answer? It knows the answers for this set, so it can figure this out. Say we've loaded the Titanic data, and you're trying to determine whether a passenger lived or died.
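That "pick the most predictive column" step can be sketched in plain Python. The rows below are made-up, Titanic-flavored data in the form (is_female, age, survived), not the real data set:

```python
# Toy sketch of choosing the most predictive first split for a decision tree.
rows = [
    (1, 30, 1), (1, 8, 1), (1, 45, 1), (1, 22, 0),   # women, mostly survived
    (0, 35, 0), (0, 7, 0), (0, 50, 0), (0, 28, 0),   # men, mostly died
]

def split_accuracy(rows, column, threshold):
    """Accuracy if each side of the split predicts its own majority label."""
    left = [r[-1] for r in rows if r[column] <= threshold]
    right = [r[-1] for r in rows if r[column] > threshold]
    correct = 0
    for side in (left, right):
        if side:
            correct += max(side.count(0), side.count(1))
    return correct / len(rows)

# Candidate first questions: "is the person female?" vs. "older than nine?"
acc_sex = split_accuracy(rows, column=0, threshold=0)
acc_age = split_accuracy(rows, column=1, threshold=9)
best = "sex" if acc_sex >= acc_age else "age"
```

On this toy data the sex column separates survivors better than the age column, so a tree would ask about it first; a real implementation uses measures like Gini impurity rather than raw accuracy, but the idea is the same.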

It's a little bit morbid, but the first predictor is: is this person a man or a woman? If the person is not a man, odds are they survived, in the Titanic's case. So you can split it along those lines; that's taking one column and splitting it in one place, and you split it in the place that's most predictive. Then you take another column, the next most predictive: is the person older than nine, or younger than nine? If they're older than nine, they probably died. And you keep doing this until you reach a point at which this

whole tree is predictive enough. Now, how you determine that is one of those details you don't really have to care about. The problem here is that the tree might look at things that are important in your training set but not important in the real world, or overall; your training set might be biased. You go down a path by picking this column, saying "I want to look at male or female first," and that sort of biases the rest of the data set, and you can overfit. What does that mean? It means it's creating a tree that works really well on this tiny set

of data, but then, when you try to apply it to another set of data, it fails miserably. So what does a random forest do? With a random forest, you take your training set, take a random subsample of that set, take a random subsample of the columns in that set, and that is your little mini training set. You make a single tree out of that mini training set, and you do that, like, a hundred times; you get a hundred trees. To do a classification, you ask all hundred trees what the answer is, and then you take a vote, or you take the average, depending

on whether it's numeric or a yes-or-no question. It's a simple idea, but it turns out this works really well, avoids overfitting, and has a bunch of other nice properties. So, great: it won a lot of contests, it did a lot of great things, they put it in scikit-learn, and now you can create one with one line in Python; I'll show that to you. All right, so say we have a bunch of Bro logs. Let's label them! No, that sounds really terrible, right? There are other things we can do if we don't want to sit down and label, like, a thousand lines as "is this malware or not." You don't

want to do that. One thing you can do is use unsupervised learning. In this case we're going to do outlier detection, but you can also do clustering. Clustering is "show me the groupings of this data." With this data it's really clear (there are no colors in this data, so you're not missing anything): you can say, oh yeah, there are two clusters here. Humans are really good at this in certain circumstances; machines are not as good. Even just saying "there are two clusters here" is a pretty big deal for a machine. But another related thing is saying: okay, we have this data; which points are weird,

which points don't look normal? In this case, maybe these are normal, and maybe these are normal, but this guy over here, he's weird; he doesn't look like the others. This is called outlier detection. It's a sort of subset of unsupervised learning, and you can also pretend that outlier detection is a classifier: everything that's normal is labeled class zero, and everything abnormal is class one, the abnormal class, and then you have a classifier, right? Boom. So there are lots of ways you can do outlier detection on data. We're going to talk about one of them, one I think is pretty exciting, but for other ones, you can

compute all the inter-point distances. You take this point right here and ask, how far is it to all my neighbors? It's kind of far, on average. You take this point here: how far is it to all my neighbors, or maybe just the k nearest neighbors? It's not as far. So this guy is probably an inlier, and this guy is probably an outlier, because its average distance to its k nearest neighbors is high. There's a problem with that in big data, though: first you have to compute the k nearest neighbors, and that's not easy, and then you have to compute the distance to all those k guys, and that's not easy either
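At small scale, where the quadratic cost doesn't hurt yet, that average-distance-to-neighbors scoring looks something like this (the points are made up):

```python
# Distance-based outlier scoring in plain Python: a point whose average
# distance to its k nearest neighbors is large is probably an outlier.
import math

points = [(0, 0), (1, 0), (0, 1), (1, 1), (10, 10)]  # last one is "weird"

def knn_outlier_score(p, points, k=3):
    dists = sorted(math.dist(p, q) for q in points if q != p)
    return sum(dists[:k]) / k  # mean distance to the k nearest neighbors

scores = {p: knn_outlier_score(p, points) for p in points}
most_weird = max(scores, key=scores.get)
```

Note that scoring every point this way compares all pairs, which is exactly the n-squared scaling problem being described.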

in a lot of cases. So these are called distance-based methods. There are also angle-based methods, where you ask: what's the angular spread amongst all my neighbors? If you're on Mars or Earth, the angle to all the planets is going to be kind of wide; they're all over the sky. But if you're on Pluto, over here, the angle to all the planets is narrow, because you're an outlier. Again, though, you have to find the angle to all your neighbors. These are n-squared operations, because you have to compute them for each pair. That's bad; you don't want to do that. So there are these new isolation-based

methods that came out a couple of years ago. So what is this? It's sort of like, I mean, it's sort of like random forests; they both have "forest" in the name, right? This is the isolation forest, but they're qualitatively different. (You can't really see anything here; you take a blank screen... well, anyway.) You take your data, and you pick one random dimension. Say this is two-dimensional, x and y: you pick a random dimension, x this time, maybe y next time. Then you pick a random point between the max and the min, so you have to know the maximum and the

minimum for each one of your dimensions; that's not a big deal, that's a one-pass algorithm, O(n). You pick a random point in there, and you split your data set into two sets: whether each point is on the left or the right of that cut (or above or below it, for y). Okay, so it's split into two sets. Well, this is forming a node in a tree, like a binary tree: these guys are on the left, these guys are on the right. You do this recursively on every node: at every node you pick a random dimension, you pick a random spot, and then you split. Then you watch for things that get split

off into single nodes. If you have a single point left in your node, with nobody else left in that node, and that node is close to the root, you're an outlier; if it's far away from the root, you're an inlier. And why is that? It's because the closer you are to your buddies, the more splits it takes to split you off. There's bias in this too, and it's random; you could get unlucky. So you do it, like, 50 times, and then you combine the results, just like random forests. So what's so great about this? It's

O(n), right? You just have to pick a random spot and then split things according to that; that's just O(n). And the tree is usually, like, log-n high, so it's like O(n log n), but you don't have to do any comparisons between all pairs of things. And it turns out that this algorithm works really well; it works better than any other outlier detection if your data is totally numeric. So again: identify positive and negative sample data sets, if you have them. We said this

was unsupervised, but if you know that certain things are outliers, you can hold back a set, train on your data, and then check it on the set you held back, again saying class zero is the inlier and class one is the outlier. So this procedure is pretty much the same, and we have a second beer at the end, which is good. You'll notice that with a lot of these machine learning algorithms, you always follow the same procedure. The algorithm is different in step five, training the model, but the procedure is the same. The scikit-learn guys noticed this (I mean, a lot of people noticed it), so what

they did was create a library where the procedure, the gist of all that code, is the same; just the creation of the classifier is different, and all the APIs for the classifier are the same. These are two completely different types of learning, right? Random forests and isolation forests are on different branches of machine learning, supervised versus unsupervised, but you create the variable the same way, or a similar way: you create it with its hyperparameters, essentially, which is the number of trees and stuff like that. That's different. For random forests you have to have a set of

answers, right? You have to have your supervised set. But the way you do the fitting, making the model, that's exactly the same code; it's literally copied. And the way you test predictions on it is exactly the same; that code is literally copied too. So all you have to do, if you're trying to test different things out, is change the way the classifier is created (and maybe make a test set also). You can reuse this analysis script with nearly no change, play with the parameters, and find the algorithm that best fits your

problem, without really knowing how any of these things work. You don't have to, and these things are all optimized underneath; they use best-in-class algorithms, et cetera. It's kind of the grep of machine learning: just like with grep, you just type "random forest," you don't care what's inside, and it'll do the things you need it to do. So David's going to give a demo. A paper demo, huh? All right. First of all: who's convinced, after seeing the three or four lines of Python code, that our general thesis is correct, that you could actually write three or four lines of Python code?
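As a concrete version of that "same script, different estimator" point: in scikit-learn you can swap the supervised random forest for the unsupervised isolation forest and the fit/predict lines don't change. A sketch on synthetic data (the isolation forest simply ignores the labels, and its predict() returns +1 for inliers and -1 for outliers rather than class labels):

```python
# Same procedure, two different branches of machine learning: only the
# estimator construction differs, the fit/predict calls are identical.
from sklearn.datasets import make_classification
from sklearn.ensemble import IsolationForest, RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=400, n_features=8, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

results = {}
for model in (RandomForestClassifier(random_state=1),
              IsolationForest(random_state=1)):
    model.fit(X_train, y_train)    # identical call either way
    preds = model.predict(X_test)  # identical call either way
    results[type(model).__name__] = len(preds)
```

This is why you can play with hyperparameters and swap algorithms without rewriting the surrounding analysis script.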

Anybody, hands up? Okay, we got some hands, lots of hands, that's good. I have to say, one of the interesting things about working with Chris a lot (I'm not on Sqrrl's data science team, but I work with him a lot) is that we get into conversations where he's like, "Yeah, our data set has something like 34 dimensions." Only 34 dimensions? I am in a Doctor Who story; that is awesome. So in this part of the presentation I just want to walk through how I prepared the demo, and then we're going to give a screenshot demo. It's a screenshot demo; it's all terminal text, so it's really easy, but the training times, the things, take a

little bit too long to really comfortably fit into a presentation. Fortunately, I think you can mostly read this: it's a little diagram of how I prepared the data for this demo. This demo is about random forests; show of hands, who remembers, is random forest supervised or not supervised? Yes, good job. So that means we need labeled training data. Our problem here is to try to find things that are indicative of malware requests in our office sensor's Bro HTTP logs. Basically, I want to know if we've got some malware calling home. I'm not as concerned about what kind of malware it is, or anything

like that, although there are certainly techniques you can use to extend our binary classification (yes or no, good or bad, malicious or benign) into more complicated things, like "this is Dridex, this is Locky, this is whatever." We're not trying to do that. I just want to know: do I think this looks like it might be malware or not? I have a hundred thousand or two hundred thousand log entries a day, and I don't feel like spending all that time reading them, so I want to reduce them down and let the machine do something for me. We're doing random forests, supervised learning, so I need two data sets: I need a set of confirmed malware HTTP logs, and I need a

set of non-malware, benign HTTP logs. I did not really feel like going through and manually labeling a bunch of things from our office sensors, so what I did instead was look for a corpus of HTTP traffic that I knew was super likely (maybe not 100% likely, but super likely) to be malicious. So I went to a couple of websites: malware-traffic-analysis.net and Contagio Malware Dump both provide packet captures for malware samples. That's mostly all they provide; they don't provide benign samples, unless it's by accident sometimes. So I can take those, run all the pcaps through Bro, and retrieve a corpus of largely malicious

network transactions there's a few things in there that I tried to do a best effort to clean out things like connectivity checking to Google or CNN or someplace like that right obviously associate it with malware but not itself a malicious kind of command and control I won't claim that I got everything out because I think I ended up with something like 25 or 35 thousand entries in that corpus I didn't review all of them so there may still be some in there but my hope is they're statistically insignificant I tried my best to get the the most obvious ones out so that's I basically said everything that's left in this file of combined stuff is the bad corpus and

by the way this bad corpus is checked in to github so you don't have to do this yourself necessarily just to try it out the second thing I need is my own benign corpus so be by virtue of the fact that I look at our network security monitoring data I feel like we have a fairly clean office network so again instead of looking at 200,000 things for the day and saying do I think this is malware or not I actually said the vast majority of our traffic is benign I'm just gonna assume everything in here is benign and if there's a few things that might not be again statistically insignificant I hope so we just labeled that benign traffic and

all we did was combine that into one big data set where more or less every row had the benign or malicious tag on it. Then I took 80% of it for the training data and reserved 20% for the test data — that's this part right here; you can see I got the training and test sets.
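That labeling and 80/20 split can be sketched in a few lines. This is a hedged illustration, not the actual Clearcut code — the host names and data are invented stand-ins — using pandas and scikit-learn's train_test_split:

```python
# Hypothetical sketch: combine a malicious and a benign corpus,
# tag every row, then hold out 20% for testing.
import pandas as pd
from sklearn.model_selection import train_test_split

# Stand-ins for the two corpora (really you'd read Bro HTTP logs).
bad = pd.DataFrame({"host": [f"c2-{i}.example" for i in range(5)]})
good = pd.DataFrame({"host": [f"ok-{i}.example" for i in range(5)]})
bad["label"] = "malicious"
good["label"] = "benign"

combined = pd.concat([bad, good], ignore_index=True)

# 80% training, 20% test; stratifying keeps the class ratio the same
# in both splits.
train, test = train_test_split(
    combined, test_size=0.2, random_state=42, stratify=combined["label"]
)
print(len(train), len(test))  # 8 2
```

Stratifying isn't strictly required, but it avoids an unlucky split where one class is underrepresented in the test set.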

Then we had to do a little bit of feature extraction — this is built into the code. The idea here is that computers are great at computing against numerics and not so good at computing against text strings, so we had to find a way to convert especially the text strings into numbers. There are three cases, mostly straightforward. Some of the things are numbers, and they're legitimately numbers — the number of bytes transferred, say. Some things are numbers, but they're not really numbers — they're lying. HTTP response codes: you think they're numeric, but let me tell you, they're enumerated data types, because there's no sensible logic by which a 404 is greater than a 200, and you can't say 300 plus 200 is a 500 — it just doesn't work. They're enumerated data types even though they look like numbers, so we had to convert them into something you could do computations on. But more interesting are the strings in those fields — things like the user agents, the URLs, the list of HTTP

key-value parameters, or anything like that. Those were straight-up strings we had to convert, and we used two methods primarily — I won't go into too much detail, because I want to save some time here. First, there's the bag-of-words method, which basically says: we count all the words, delimited by whitespace or some kind of word delimiter, and record how many times each of those words appeared in your string. So in a typical user agent, "Mozilla" occurs pretty often, because most of them claim to be some kind of Mozilla thing, right? Or WebKit, or something like that. And you might have a column that says: was "Mozilla" in here? Yes — one.

Right — didn't have "WebKit"? No — zero. We also did the same kind of thing for what's called the bag of n-grams. An n-gram is just a substring of a string — at least two characters, or three, four, five; it could be as long as you want. We computed something that says: show me these combinations of letters and how many of each were in there. It's probably hard to see, but take the text "the quick brown fox." If you have, say, a 4-gram or a 5-gram — I think that's what they're using — so say you have the 4-gram "t-h-e-space": did you find "t-h-e-space" in your string? Yes. So basically you compute all the possible n-grams and record how many of each one you saw in each string, and it becomes a vector — a bit vector — that you can actually start doing computations on. If that sounds complex, it is, but fortunately I didn't have to do it: all of those things are built into Clearcut now, so you don't have to implement them yourself, you can just use them.

Oh my gosh — you guys cannot see the demos at all? That's fantastic; it's going to make the rest of this go really quickly. I'm going to try and see if this helps at all. I don't know.
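For reference, the three feature conversions just described — real numbers passing straight through, status codes treated as enumerated categories, and strings turned into bag-of-words and bag-of-n-gram counts — might be sketched like this. The log values are invented, and this is an assumption-laden illustration rather than the actual Clearcut implementation:

```python
# Hypothetical sketch of the three conversions described above.
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

logs = pd.DataFrame({
    "resp_bytes": [512, 1024],   # already numeric: passes through as-is
    "status_code": [200, 404],   # numeric-looking, but really an enum
    "user_agent": ["Mozilla/5.0 AppleWebKit", "EvilBot/1.0"],
})

# Enumerated type: one 0/1 column per distinct code, so the model
# never assumes 404 > 200 or that codes can be added.
status = pd.get_dummies(logs["status_code"], prefix="status")

# Bag of words over the user-agent strings.
words = CountVectorizer()
word_vecs = words.fit_transform(logs["user_agent"])

# Bag of character 4-grams over the same strings.
ngrams = CountVectorizer(analyzer="char", ngram_range=(4, 4))
ngram_vecs = ngrams.fit_transform(logs["user_agent"])

print(list(status.columns))            # ['status_200', 'status_404']
print("mozilla" in words.vocabulary_)  # True (tokens are lowercased)
```

CountVectorizer handles both the word-counting and the n-gram counting; only the analyzer and n-gram range differ between the two vectorizers.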

I'm just going to see if I can maybe move this — maybe you'll be able to see it better in here? Yeah, let's see... ah, no.

And I can't even read it. Yeah, I don't know if this will actually work, but we'll try it. Okay — yeah, it doesn't all fit on here, though, does it? I can't see... is this view... no, up here... can't read it... this one, right down here somewhere... zoom... whoops... oh, come on... let's try this. Yeah —

there, we're getting closer. Okay, that's good enough, I think. Whoo! All right, let's see if we can do this.

This is just an example. First of all, I'm running the training script — we come with two scripts, train_flows_rf for random forests and train_flows_if if you're going to use the isolation forest. That's because even though these are mostly the same script with a tiny difference, we could have combined them, but we were lazy. And this is pretty much all you say: I'm going to run train_flows, I'm going to give it the malicious data, and then I'm going to give it the benign data — I'm just naming two files, right? And it says: okay, yep, I'm reading it, I'm reading it, I'm reading it. Now I'm going to make the vectorizers — the vectorizers are the bag of words and the bag of n-grams; they convert the text into the vectors, creating the features and turning them into numbers. It sits here for a while doing that, then it sits here for a while doing training, and then it outputs this — skip all this stuff. What we have here is a matrix where we say: we said it was malicious, and the computer predicted that it was malicious after the training

— what is this, 1220? I can't read it — some large number of times, right? It got it right almost all the time. We said it was malicious and the computer said it was not malicious only — like 15% or whatever this is — so it made some errors, but not very many. Then we said it was benign and the computer agreed it was benign; sorry — the computer said it was malicious a few times, so we got a few of those kinds of errors as well. But most of the time, when we said it was benign, it was really benign. So just looking at that matrix, we think we did a pretty good job: we got mostly correct answers, with a small number of either false positives or false negatives.

Now, one of the things you do here is go through this process a number of times until you're satisfied with the answers — I didn't show you this — and when you're adapting it to your own log files, you need to figure out whether you're in the ballpark of something good or not. It's nice to be able to see this matrix to get the details, but it's hard to compare a whole matrix to one you produced before and see whether you're improving.
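The kind of matrix just described — and the single-number F1 score the script prints alongside it — can be reproduced with scikit-learn's metrics. The label vectors here are tiny invented examples:

```python
# Hedged sketch: score a model's predictions the way the demo output
# does, with a truth-vs-prediction matrix plus the F1 score.
from sklearn.metrics import confusion_matrix, f1_score

y_true = ["malicious", "malicious", "benign", "benign", "benign"]
y_pred = ["malicious", "benign",    "benign", "benign", "malicious"]

# Rows are the truth, columns are the prediction.
cm = confusion_matrix(y_true, y_pred, labels=["malicious", "benign"])
print(cm)  # [[1 1]
           #  [1 2]]

# F1 folds precision and recall into one number you can compare from
# run to run; above roughly 0.9 is the ballpark mentioned in the talk.
print(f1_score(y_true, y_pred, pos_label="malicious"))  # 0.5
```

Having one scalar per run is exactly what makes "am I improving?" easy to answer between training iterations.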

What we also have at the bottom is this F1 score. It's kind of a standard score for a model: it basically takes the ratios of all this stuff here and converts it into one number. In general, as a ballpark heuristic, if your F1 score is above 0.9, you're doing pretty good. Ours is 0.998, I think that says — suspiciously a little bit high, but pretty good; I'll show you why it's suspicious in a second. Now, here's a bonus on the training script: if you run it with -v, it can tell you what it thought the most influential features were. It doesn't say "this is the most influential benign feature" or "the most influential

malicious feature" — it just says these features had the biggest impact on the training. And you can see the biggest ones up here are things like "Mac OS" and "OS X." Yeah — because I made a mistake, which I elected to keep in here so that you won't make the same mistake. I went out onto the internet and got primarily Windows malware, and I have a primarily Mac place, so basically my model is more or less just telling me "if you have a Mac, you're good." That's not entirely true, because you can see there are some other things in here, like the number of dots it found in the domain names you're going to, and things like that — if you're going to an IP address, you have a lot more dots, right? But it's heavily biased toward Macs being safe, because we didn't have a lot of malicious Mac traffic. So when you're going out there: if you have typical Windows systems all around your enterprise, most of what you find on the internet is going to be fine for you. If you're not, you might have to do something a little better with your data than I did.

And now the model is trained, and by default it's saved — there are options to say where to save it, what to name it, whatever, but we basically took the defaults, so it's saved in the current directory.
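As a hedged sketch of where that -v report comes from: scikit-learn's random forest exposes per-feature importances after fitting. The two toy features here (a dot count and a byte count) and their values are invented for illustration, not taken from the real model:

```python
# Hypothetical sketch: a fitted random forest reports how much each
# feature contributed to its splits via feature_importances_.
from sklearn.ensemble import RandomForestClassifier

# Toy features per request: [dots in the host name, response bytes]
X = [[1, 100], [1, 120], [4, 90], [5, 110], [1, 130], [4, 95]]
y = ["benign", "benign", "malicious", "malicious", "benign", "malicious"]

clf = RandomForestClassifier(n_estimators=50, random_state=0)
clf.fit(X, y)

# One importance per feature; they sum to 1.
print(dict(zip(["host_dots", "resp_bytes"], clf.feature_importances_)))
```

Note that, as in the talk, the importances don't say which class a feature points toward — only how much it mattered.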

It's ready for you to go, and then you can use that trained model on your production log files. So here, again, is analyze_flows — no matter whether you're using random forests or isolation forests or whatever, it uses the same analysis, because we're just dropping the model we trained into the same APIs. You just give it a file of Bro HTTP logs, and notice what it said: it has something like two hundred and something things for a person to review, out of — I think it says — 187,000. So I ended up reviewing like 0.17% of the log entries. That's not too bad, because now we're getting back to the part where the human can actually do that part.
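The analysis step — persist the trained model, reload it "in production," and surface only the entries it flags — might look roughly like this. The model, features, and file handling are toy stand-ins, not the actual analyze_flows code:

```python
# Hedged sketch: save a trained model, reload it, and keep only the
# rows it flags as malicious for a human to review.
import os
import tempfile

import joblib
from sklearn.ensemble import RandomForestClassifier

# Train and save a tiny stand-in model (the real one comes out of the
# training script).
X = [[5, 0], [4, 0], [5, 1], [4, 1], [0, 5], [0, 4], [1, 5], [1, 4]]
y = ["malicious"] * 4 + ["benign"] * 4
model = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y)

path = os.path.join(tempfile.mkdtemp(), "model.joblib")
joblib.dump(model, path)

# Later: reload the model, score new log entries, and surface the hits.
clf = joblib.load(path)
entries = [[5, 0], [0, 5], [4, 1]]
to_review = [e for e, label in zip(entries, clf.predict(entries))
             if label == "malicious"]
print(len(to_review))  # 2 entries left for the human
```

Because the saved model is just a fitted scikit-learn estimator, the analysis side doesn't care which algorithm produced it — which is why one analysis script works for both the random forest and the isolation forest.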

This was the cyborgism part, right? We let the computer learn what to look for, we let the computer look for it, and if it found something it thought we needed to look at — there it is. It drastically reduced the number of log entries: it basically eliminated virtually all of them, so we didn't have to worry about them, and the ones that are left we can look at individually. Also, a bonus on this one: there's the same -v feature, which you can use to show what it found for each one. Again, it won't tell you whether the most influential

feature said it was benign or not, but these were the most influential features for that individual log entry, so you'll get one of these. It takes a long time to do these extra calculations, by the way, but sometimes it's kind of cool — right at the beginning, when you're getting used to it and you need to build some trust in your model, you can throw that up there and see what's really going on.

So now let's talk briefly about adapting this to other log sources. There are basically three things that you might want to do, and since I can't read it, I might ask Chris to help, since he

did this part — can you read this? There are three things, and if you want the slides to find out what they are, follow us on Twitter; they'll be up there in a minute. It's basically just saying: here are the three places in the code you need to check out, and what you need to change. First, the log file input: if you're not reading Bro HTTP logs, you might have to change the format of the CSV reader. Second, in the flow enhancer, you might have to change the columns that you're featurizing — you know, converting into numerics. And third, in the generator — one of the helper utilities that comes with it — you may have to change a few things so that it computes the bag of words or bag of n-grams on the right fields. In other words, most of what you have to do is straight-up data-oriented work, not machine-learning internals.
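As a rough illustration of the first of those changes — the log file input — here's what pointing a pandas reader at a Bro-style, tab-separated log with #-prefixed metadata lines might look like. The sample log text and column names are simplified stand-ins, not the real Bro HTTP schema:

```python
# Hypothetical sketch: adapting the reader for a tab-separated log
# source with comment-style header lines.
import io

import pandas as pd

raw = (
    "#fields\thost\turi\tstatus_code\n"   # Bro-style metadata line
    "example.com\t/index.html\t200\n"
    "example.net\t/login\t404\n"
)

# Skip the #-prefixed metadata, name the columns ourselves.
df = pd.read_csv(
    io.StringIO(raw), sep="\t", comment="#",
    names=["host", "uri", "status_code"],
)

# The second place to adapt: which columns get featurized.
feature_columns = ["host", "uri", "status_code"]
print(df[feature_columns].shape)  # (2, 3)
```

A different log source mostly means a different separator, a different set of header lines to skip, and a different column list — exactly the data-oriented changes described above.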

So, for the wrap-up, there are four key takeaways from today. The first is pandas and scikit-learn — especially scikit-learn. They're really active Python projects, and they are so well documented — on the internet, in O'Reilly books, whatever. They have done a great job of bringing these kinds of data-science and data-analysis tools to people like me who are not really data scientists, and letting us use them with — in my case — just a superficial understanding of the strengths and weaknesses, so I can maybe choose the best algorithm, but I don't have to know exactly how it works, and I don't have to know exactly how to program it, because it's already done. As security technologists, we can start using these things in our daily work as kind of black or grayish boxes. It's at the point now where, if you have a problem in front of you — a pile of data where you don't know how to find the pieces you want — you could

conceivably ask: is this something we could use either supervised or unsupervised machine learning to do? And just try it. Implementing the algorithms is not something you have to do; the machine learning part is really the easiest part. The hardest part is getting that data-flow pipeline Chris was talking about set up correctly, so that you get the right data, the right features, et cetera. It's straight-up data programming, not data science, so it's easy to do. It might take a little bit of practice, but it is not magic. And then on the next slide I'll show you the URL — which you probably can't read — for Clearcut on

GitHub, and our Twitter handles. I'm just @DavidJBianco; Chris is @_secretstash_. I promise I'll put the relevant links on our Twitter feeds, so you don't have to worry about whether you can read them up here or not. And if you go to my GitHub, DavidJBianco, you'll also see the Clearcut repo that has the sample malicious data and all the Python code. So with that — do we have any questions? Oh, there was a really quick question: a lot of the data we deal with has really long-tail distributions — like if you look at histograms of website hit distributions. Is the isolation forest kind of sensitive to that —

is it going to work well for that? So: they've done experiments, and it seems like the distribution doesn't matter that much for isolation forests, which is one of the advantages of using that technique. In some cases, like when you do have a long tail, the question is kind of "what is an outlier in that case," right? If you have a Zipf distribution or something like that, something can be, quote, normal but still unusual — it just doesn't happen often. So yeah, in some cases you will get false positives on things that are actually normal but just sit far out on the long tail — although that could be considered an outlier in some ways anyway. So there are considerations, of course, around the distribution of the data, but they've shown that isolation forests are good across lots of different distributions, and you can look at their paper. There was also a paper just last year — I forget which conference; I could post it — comparing isolation forests to the other methods, the other methods I was talking about, and isolation forests were like the clear winner. Yep.

Interesting — but maybe not bad all the time, right? So, I see that we're eating into the setup time for the next presentation. If you have any questions, Chris and I will be hanging around a little bit after the talk and you can ask us in person — but otherwise, thank you very much for attending.