
My name is David Bianco. I'm a security technologist at Sqrrl; that's secret code for "I've done incident response and incident detection for a long time," so I kind of tell the company how to do it and how the product should do it. Chris here is our director of data science. Want to say something about yourself?

Yeah, I have a background in math and computer science. I worked for a long time on, you know, killer drones and stuff, and then they put me on to doing social network analysis for the National Security Agency, so I'm, like, super evil. And now I work for Sqrrl.

He says it in the chillest way possible: "super evil." So I want to say, first of all, as a security subject matter expert, an incident detection guy, I never really got that much into the data science aspects of this before I came to Sqrrl. It was always, you know, write new signatures or new Bro policies or something like that, do some network forensics, some host forensics to trace things down. But when I came to Sqrrl, where we do big data security analytics (and now I've made our marketing people very happy), I started learning more about the data science aspects of it,
especially the machine learning piece. Machine learning is actually something I was really excited to come and talk to you about today, because I think it's a tool that most of us probably don't think we're ready to use, but in fact it is so simple to get started, and you can start making your computers actually give you some intelligent results. I'll say this, which is something Chris told me: machine learning might seem like magic to you, but guess what, somebody else has already cast that spell, and all you have to do is use it. There's machine learning for Python or Ruby or whatever your favorite languages are; all that stuff is there and ready to go. That's what we're going to talk about today: how you, as a probable non-machine-learning expert, can actually get started applying machine learning to the jobs you do every day, to help you become more effective. A lot of this is a combination of things: we're going to talk about the process of machine learning for incident detection, and we're also releasing a demo tool that you'll be able to either run as-is, to do some machine learning against your HTTP proxy logs to
find evil, or possibly adapt to other logs, or even other processes where you want to do what we call binary classification; we'll get into that a little later. So, when's the last time you heard this: "It's best practice to review your logs every day." Who's ever heard that? Raise your hand. Raise them high, keep them up. All right, now take your hands down if you still actually do that today: you review all your logs every day. There are a lot fewer hands up now, right? And why is that? Of course, we have too many logs today: too many different types, too many instances of the same log type. It's just not feasible, although it really is good advice. If you were able to review all your logs, put a human in front of them who could actually pay attention and concentrate enough to see all of them and find the evil things in them, you'd probably get pretty good results. But you can't. That's where this idea of machine-assisted analysis comes in, or what I probably should have called this talk: practical cyborgism for security operations. Computers are really good at some things and humans are really good at others; this is almost cliché, right? Computers are good at repetitive tasks and large-scale tasks: do the same thing a hundred thousand or a million times, and do it quickly. The algorithms work cheap: you don't pay them, and they don't take sick days. But they are terrible at the things we're good at, which is context and understanding. Humans are so good at finding patterns in data that we're even a little bit too sensitive: we find Elvis on toast. Where a pattern does not actually exist, we can still find it. But we can take advantage of that. We also have curiosity and intuition that says, you know, this doesn't look quite right, let's follow it up. And of course we have
business knowledge, to know: hey, these are two business units that don't ever talk, they shouldn't talk, they're legally required not to talk, so why are they talking? If you put those together, you get that cyborg piece: the analyst and the machine acting together, more like one, so you get good results from a massive amount of data, and you can get them really quickly. So here's a practical example: a Bro HTTP log, in this case of outgoing traffic, basically Bro's equivalent of an HTTP proxy log. The analysts in this room have probably already decided this is suspicious or malicious traffic, because (just my guess) they looked at the POST and then this domain name that nobody can remember or type. I always train my new analysts that if you see a domain you can't remember and type, consider it suspicious, with a few exceptions, like content distribution networks and other mostly internal things. These shorter domains are supposed to be things people remember and type; they get put in ads and whatever. Clearly no one is going to remember that one or be able to type it correctly, so it's a little bit suspicious. It took me far longer to tell you why I thought that than it took me to read it and decide it was bad. On the other hand, what happens when you get this many logs? This is only one screen out of a whole day's worth of data. You can't really do that; it's hard, you have too many logs. So the solution is to get rid of some of the logs. That's why our solution is called Clearcut: we got rid of a lot of the logs, and we found the bad logs right here remaining. I'm going to turn it over to Chris to talk about the
background and the data science pieces inside Clearcut, and then I'll be back in a little bit.

Thanks, David. So I'm going to talk a little bit about how we're doing the machine learning. Again, all of these things are built into libraries and we're just using them; I'm pulling the covers back on the algorithms a little bit, mainly to show you how to use them as an analyst, because there's some important stuff you'll want to do if you use this in real life. What we're using here is called a binary classifier, which is a subclass of supervised learning algorithms. What a supervised learning algorithm is trying to do is this (you can't really see the colors too well up there): you have a bunch of data, those are the dots, and some dots are orange and some are blue, and you know that in advance. That's your labeled data set: things you know are orange and things you know are blue. You want an algorithm to build a model that can tell you whether some gray dot that comes in is probably orange or probably blue. That's why it's called a binary classifier: there are two classes, orange and blue, or in our case maybe malicious and normal. A lot of times people call these positives and negatives, depending on the problem set. One way to do this in two dimensions is to just draw a line between the two sets, and that's one type of classifier, a linear classifier. Some very well-known classifiers are basically drawing a line between your two sets in some space (there's some transformation, blah blah blah), so the machine can learn the function.
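To make the orange/blue picture concrete, here's a minimal sketch of a linear binary classifier in scikit-learn; the two Gaussian blobs are invented toy data, not anything from the tool.

```python
# A minimal sketch of binary classification: two labeled classes of 2-D
# points (the "orange" and "blue" dots), a linear classifier, and a
# prediction for a new gray dot. The blobs are invented toy data.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
orange = rng.normal(loc=[0.0, 0.0], scale=1.0, size=(100, 2))  # class 0
blue = rng.normal(loc=[4.0, 4.0], scale=1.0, size=(100, 2))    # class 1

X = np.vstack([orange, blue])
y = np.array([0] * 100 + [1] * 100)   # the labels we "know in advance"

clf = LogisticRegression().fit(X, y)  # learns a separating line
print(clf.predict([[3.5, 3.2]]))      # gray dot -> probably blue: [1]
```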
Basically, there's a recipe for doing machine learning. The recipe isn't machine learning itself, but it's what you do when you want to do machine learning on a data set. First, identify your positive and negative sample data sets: get a set of stuff that's blue and a set that's orange. Then clean and normalize the data. Data is always messy (we learn this as data scientists; if you ever work with real data, there's some data you just want to throw away), and you might also want to add some things to the data that are helpful for the classification; we'll talk about that in a little bit. After that, take those dots whose color you know and partition them into two sets: a training set and a testing set. Then compute some features on the data set and train the model. Computing the features is a bit of the work you'd have to do yourself; we've done it for you in this tool we've written, but if you want to run it on your own logs, you might have to do some of that. Train a model: it's one line of Python. Test the model against the test data set to see how good it is: oh no, that's also one line of Python.

(The projector will be off for a couple of minutes and probably back on; this is the stuff the machine does not want you to know. It's been happening all day, by the way. About the address at the top of the screen: it expects you to sign in with Google to submit a question, but we'll also take live questions, so if you don't want to sign in to Google to ask one, that's fine. We just wanted people who were shy about asking questions to not have to speak up.)

So: train the model, one line of Python. Test the model, one line of Python. Evaluate the results, one line of Python. Drink beer: that one's maybe more than one line of Python, because it's three or four beers.
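For what it's worth, here's roughly what those one-liners look like in scikit-learn; make_classification stands in for the labeled log features, which come later in the talk.

```python
# Roughly what the recipe's one-liners look like with scikit-learn.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)  # stand-in labeled data

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)  # partition
clf = RandomForestClassifier().fit(X_train, y_train)  # train the model: one line
preds = clf.predict(X_test)                           # test the model: one line
print(f1_score(y_test, preds))                        # evaluate: one line
```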
[Laughter] And then you're done, and you can go use that trained model on real data and start classifying. So this is just a picture of one of those steps. The data we've used in this example comes from Contagio and some other sources of malicious data, and we get the Bro HTTP log version of it. For the normal data, we've taken some data from our own network for our examples and run it through Bro the same way. We get all the labeled data, and we split it into a bigger training set and a smaller test set. Now, feature extraction: this is where it starts to get interesting for you guys. We want to give the machine as much of a chance as we can to figure things out, and we don't have to be perfect. I think David mentioned before that he looked at that domain and said: this looks weird, why does it look weird? We can use our domain knowledge to help out the machine a little bit. It's a little bit hard for the machine to just take things as-is, boil the ocean, and say, yeah, this is malicious. So we're going to give it some help by computing features on the data. We'll do things like take that domain name and compute how likely it is to be English; we'll call that the entropy.
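As a hedged illustration, here's one common way to compute that score: plain Shannon entropy over the characters of the domain. The exact formula the tool uses may differ.

```python
# Shannon entropy of a domain name string: random-looking strings score
# higher than repetitive, English-like ones.
import math
from collections import Counter

def entropy(s: str) -> float:
    counts = Counter(s)
    return -sum((n / len(s)) * math.log2(n / len(s)) for n in counts.values())

print(entropy("www.google.com"))                 # lower: repetitive, English-like
print(entropy("x7f3kq9zt2vb8mwp1.example.net"))  # higher: closer to random
```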
These machine learning algorithms, especially the random forest we're going to use, want everything to be in numeric form, a number, and you saw that that record was not numeric in general. There are a bunch of strings, HTTP strings, and there are things like HTTP status codes which look numeric but aren't really numeric in the way the tree algorithm wants: code 400 is not twice as much as code 200, right? That doesn't make sense to the machine learning algorithm. Those are really enumerated types. So for enumerated types and strings, we use a method that's widely used in natural language processing, called bag of words or bag of n-grams. We take these string features and enumerated-type features and form a bag for each of them. A bag is a column of data, and the column says: okay, this record was code 400 or it wasn't. If it's code 400 it's a one; if it's not code 400 it's a zero. That's the bag, and there's one for each code in the HTTP codes, so we convert that one column in the original data into as many columns as there are codes. Same thing with string data, using n-grams: we pass a window over the string. For example, in the picture on the right those are 5-grams, so every five characters becomes a different bag. If the 5-gram "the q" (t, h, e, space, q) is present in our string, we put a one in that bag, and if it happens again in the same string, maybe we count it up to two. This produces quite a few columns if you do it for every single bag, and we want a way to reduce that, because the running time of the algorithm is proportional to the number of columns you have. So we use a technique called TF-IDF to determine which of these columns we really care about. That's another text-processing technique; it says, roughly, that the rarest things have the most informational content. It can determine that from the training data and keep only those columns, and later on, when we pass through test data or production data, it only looks for those bags and throws away all the others.
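A small sketch of the bag-of-n-grams plus TF-IDF idea, using scikit-learn's TfidfVectorizer with character 5-grams as in the slide; the domains and the max_features cap on surviving columns are illustrative choices, not the tool's exact knobs.

```python
# Bag-of-5-grams with TF-IDF weighting over string fields.
from sklearn.feature_extraction.text import TfidfVectorizer

domains = [
    "www.google.com",
    "x7f3kq9zt2vb8mwp1.example.net",
    "cdn.adnetwork.com",
]
vec = TfidfVectorizer(analyzer="char", ngram_range=(5, 5), max_features=1000)
X = vec.fit_transform(domains)        # one column per surviving 5-gram "bag"
print(X.shape)                        # (3 domains, number of 5-gram columns)
print(sorted(vec.vocabulary_)[:5])    # a few of the learned 5-grams
```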
So we want to give the machine an idea of the things that could be important. We think the different codes, the n-grams in the strings, the entropy, things like the number of dots in the domain name, are all important things. We don't know for sure; we have hints, and we want to give the machine those hints and let it work things out. So, we're using random forests. I have two slides on random forests; I hope I don't put you guys to sleep or anything. This is the magic part, right? You don't have to program this, it's already done, but I want to show you what's happening. A random forest is a bunch of decision trees put together. What's a decision tree? You take a log and ask a bunch of questions about it, and at each point it's a yes-or-no question, or maybe a greater-than/less-than question, and depending on the answer to that question you go either left or right in the tree, and the leaves are predictions. The decision tree you saw a second ago was based on Titanic survivor data. The first question it asks is: are you a man or a woman? If you're a woman, you have a high probability of survival, and at that point it figures out that no other column has enough information to give a better answer than "survived." But if you're on the male side, some other questions can provide more information for the decision, like whether you were a young male or an older male, over age nine and a half; that gives a little more information and a better answer. So that's how decision trees work once you have them. There are various techniques to create them; the greedy method is the most widely used. You take your corpus of data, you try to find the column that gives you the most informational content for the answer you're looking for, and you make that your first question. Then you split the data set on that question (whether sex is male or female, say) and do the same thing over again, until you hit some threshold of information content. This tends to be very good at fitting a model to a particular set of data, but it has a problem: it can fit too well, and that's what we call overfitting. If we tried to apply this model to another set of data, the gray dots, it might have asked questions that aren't meaningful for a real data set, just because you didn't have enough data. That's overfitting, and it will give you wrong answers in that case. One way we mitigate this is a technique called random forests. Basically, we build a bunch of these trees and ask them to vote, and the vote is a better answer than any individual tree. What's random about them? It's really pretty simple; there are two random things. You train each individual tree on a random subset of the data, selected with replacement, and for each individual tree you also select a random subset of the columns, so each tree ignores maybe most of the columns. You do that N times, and you form a classifier by averaging, or you could also have them vote. It turns out this is a really good classifier; it's called a bagging technique, building an ensemble of classifiers.
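Those two random ingredients map directly onto scikit-learn's RandomForestClassifier parameters (note that scikit-learn draws the random column subset per split rather than per tree, a minor variation on the description above); a sketch on stand-in data:

```python
# The two random ingredients of a random forest, in scikit-learn.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=2000, n_features=40, random_state=0)

clf = RandomForestClassifier(
    n_estimators=100,     # number of trees that get a vote
    bootstrap=True,       # each tree trains on rows resampled with replacement
    max_features="sqrt",  # each split considers a random subset of the columns
).fit(X, y)

print(clf.predict(X[:3]))              # the trees' majority vote
print(clf.predict_proba(X[:3])[:, 1])  # the averaged vote as a probability
```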
So we've written some scripts to do this on HTTP log data, and I'll let David describe how they're run.

Actually, I'd also like to point out that we chose random forest because it's probably the best one to use if you don't know which algorithm you should use (and there are a ton of other ones), because it's really good at figuring things out based on whatever features you throw at it. It's also probably one of the most popular ones, so even if you're not a Python person and you like something else, there's probably a random forest implementation for it. We're going to talk about training and testing the model and then running your own data through it, both in terms of how you do it in theory and in general, and, at the same time, how you use our tool to do it. The first step is training. Our tool consumes Bro-format HTTP logs; nothing magic about that, you could change it if you wanted to, but that's what we do because we have a lot of Bro data hanging around. The first step is to run the training algorithm with a sample of malware HTTP logs and a sample of our production logs, which are presumably, but not entirely verifiably, good. We run it with -o, where -o points at the bad stuff and the other argument is the good stuff. For the good stuff we took about one sample week of our logs, which came out to some large number of entries. The malware sample is roughly 37,000 samples, and that's actually in the GitHub repo. We didn't provide our own logs, so you have to provide your own normal traffic, but the malware sample in the repo is there. So we run the command this way, and when it says "building vectorizers," it's creating all those features Chris was just talking about, the bags of words and things like that. Then it does the training as he just described, creates that random forest with a bunch of random trees in it, and spits out what you need for testing. Skipping ahead: we read the Bro data into a pandas DataFrame, and each row gets a label; depending on which file it came from, we say it's either benign or malicious.
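A minimal sketch of that reading step, assuming a tab-separated Bro http.log with '#'-prefixed header lines; the column list here is abbreviated and assumed, and it would need to match your Bro version's actual fields.

```python
# Reading a Bro http.log into pandas: tab-separated, '#' metadata lines.
import pandas as pd

cols = ["ts", "uid", "orig_h", "orig_p", "resp_h", "resp_p",
        "method", "host", "uri", "user_agent", "status_code"]  # partial, illustrative

df = pd.read_csv("http.log", sep="\t", comment="#", names=cols, header=None)
df["label"] = "benign"  # label every row according to which file it came from
print(df.head())
```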
We convert all the strings using bags of words; you can see "bag of words" for things like method and status code, because, like Chris was saying, the status code is basically an enumerated type and the HTTP method is another enumerated type (GET, POST, whatever). Then n-grams for other things, like domains and user agents. We split it into about 80% training data and 20% test data and fed it through the random forest. At this point you may have noticed we haven't really done anything with that 20% test data set we reserved, but this is where it comes in. Now that we've made the trained model, we want to see how good our training is. So we took that test data, where each entry is still labeled benign or malicious, so we know what the answer should be for each one, and ran it all through the trained model to see how good we are. You can see here a little table. This first row is class zero, the good traffic: we predicted good 12,428 times, so yay us, and we predicted it was bad only 15 times. Pretty good, right? We said it wasn't going to be perfect, but it'll be in the ballpark. And the opposite: it was known bad, and we predicted it to be good only 19 times, versus predicting it to be bad 9,563 times. Great. Is this a good model, though? You can kind of look at these numbers and say, yeah, this is a good model.
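That little table is just a confusion matrix, and you can read the same four numbers off scikit-learn directly; toy labels here, with 0 standing in for the bad class.

```python
# The test-set results as a confusion matrix.
from sklearn.metrics import confusion_matrix

y_true = [0, 0, 0, 1, 1, 1, 1]  # 0 = bad, 1 = good (toy labels)
y_pred = [0, 0, 1, 1, 1, 1, 0]

# Rows are the true class, columns the predicted class:
# [[bad predicted bad,  bad predicted good],
#  [good predicted bad, good predicted good]]
print(confusion_matrix(y_true, y_pred))
```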
But when you run this in your own environment, your numbers may not come out quite so clear. Also, if you want to experiment with additional features and compare multiple runs, it's hard to compare whole tables. So we also compute this thing called the F1 score. It's a standard statistical score (you can see it on Wikipedia if you want), a mathematical combination of things like true positive rates and false positive rates that gives you one number, from zero to one, for how good you think your model is. Anything over about 0.9 is considered good. Ours is actually suspiciously good; as Chris was mentioning, we might have some overfitting there, and we'll talk about that in a little bit. But technically we're over 0.9, so we might consider this a pretty good model.
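Computing the F1 score is one more scikit-learn one-liner; toy labels again.

```python
# The F1 score: a harmonic mean of precision and recall, giving a
# single 0-to-1 number that's easy to compare across runs.
from sklearn.metrics import f1_score

y_true = [0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 1, 1, 1, 1, 0]
print(f1_score(y_true, y_pred))  # ~0.9 or above is usually considered good
```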
Now, a bonus: if you do the same thing and add -b, it'll take a little bit longer, but it will actually tell you the features that the random forest thinks are the most indicative either way. It doesn't label them as "this is more indicative of malicious" and "this is more indicative of non-malicious"; it just says these are the ones that count more. This may not surprise the malware analysts among you: user agents are often really influential, because malware screws up the user agent a lot, or uses a user agent that's totally legitimate somewhere but maybe not in your environment. We also have other things: the user agent entropy, the entropy of the subdomain, the body length of the response and the request, the number of dots in the domain, and things like that, all ranked. These are named as the feature type and then, if it's a bag of words or n-grams, the actual n-gram after the dot. Also, this is just the top 50 out of something like 3,000 features.
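The ranked list comes from the forest's feature importances; a sketch of pulling the top entries, on stand-in data:

```python
# Feature importances: which columns count more, with no direction
# (malicious vs. benign) attached, just as in the talk.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=100, random_state=0)
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

importances = clf.feature_importances_    # one weight per column, summing to 1
top = np.argsort(importances)[::-1][:50]  # e.g. the top 50 of all the columns
print(top[:10], importances[top[:10]])
```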
By the way, the training step writes a trained classifier out, by default into your temp directory; if you give it -h, or look at the repo, it'll tell you how to put that somewhere more stable for production use. The next step, now that we think we've got a good model, trained, tested, and evaluated, is to run it with some of our real data and see what we get. That's what analyze_flows does: you just give it the name of another Bro HTTP log. I designed this so you could basically run it every day against the previous day's logs, or you could do something fancier. It loads the log up, calculates the features on the new log (it obviously needs those features to run the machine learning), and then analyzes it. It'll go for a little while, and then it'll say: detected 298 anomalies out of 180,000 log entries. The machine is telling me I only have to look at 0.17% of all the logs that were in there, which is really good. Now, who thinks they can look at 180,000 logs? No. Who thinks they can look at 298? Yeah. So now we're hopefully back in the realm where an analyst looking at the logs they need to look at may again be possible, like it was 25 years ago.

(Sorry, is that per hour, per five minutes, per day? This is a day's log, so it was 298 log entries for the previous day. It depends on the size of your network logs; my network log had 180,000 entries in a day, but we have a small office. If you're a larger company you're going to have a lot more, and probably a lot more things to review, but hopefully it will still be in the ballpark of things you can do, and if it's not, you can tinker with the code, add some more features, or anything like that to help it do a little better.)
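The daily run boils down to load, featurize, predict, filter. This sketch shows that shape only; the pickle path, the saved (clf, vec) pair, the single "host" feature, and the 0-means-malicious convention are all assumptions for illustration, not the tool's actual layout.

```python
# The load-featurize-predict-filter shape of a daily run (hypothetical).
import joblib
import pandas as pd

clf, vec = joblib.load("/tmp/clearcut_model.pkl")  # hypothetical saved pair

new_hosts = pd.Series(["www.google.com", "x7f3kq9zt2vb8mwp1.example.net"])
preds = clf.predict(vec.transform(new_hosts))      # featurize, then classify

anomalies = new_hosts[preds == 0]                  # keep only the flagged rows
print(f"detected {len(anomalies)} anomalies out of {len(new_hosts)} log entries")
```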
Also, a bonus: this one takes a lot longer if you run it, but it's kind of instructive to try once or twice. If you run the same thing with -b, it will tell you, for each of the outputs, why it thought that; that is, what the most influential features were. This is actually showing you the features we took out of the Bro log (and by the way, if you want to go back to the original Bro log, it's line 431 in the original, so it can help you find it). Then it says this is something you should look at because it's different from most of your logs: user agent length was the most influential feature, and that's the user agent right there; clearly that's a lot smaller than most real user agents. The response body length, the domain name length, the user agent itself: as you might imagine, most of our office runs Mac OS and most of the user agents have Mac OS in them, and this one didn't have any of that, so it's saying, I expected to see these and didn't see any. It's kind of interesting, so run it once or twice, but it does take a substantially longer amount of time, so don't try to run it like that every day. And I'm going to turn it back over to Chris for a few more slides.

Yeah, so that's actually one of the pretty cool things about random forests and decision trees: they're one of the more explainable types of models out there. Some, like deep learning, are just "I have this crazy function." So there's one more feature that we added. I don't know how
much I want to go through this in detail, but if you don't have malware samples, or you don't trust our malware samples, you can also run a binary classifier with just one class of data, and there are various ways to do that. We call this one-class classification, because you're trying to fit a model just to the normal data and then find other, abnormal data. One way you could do that is to just generate gibberish, but that's generally not as good as another approach called noise contrastive estimation. What we do is create fake data and call it malicious, and that fake data is generated from the real data but has certain properties that make it look not realistic. An example is this chimera over here: if I have a classifier that's trying to classify animals, I'll generate this thing that has a lion's head and a snake's tail. NCE is pretty much like this with logs: we jam pieces of different real logs together and say, "this is malicious," and it ends up being able to classify your normal data better than just random stuff.
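One simple way to build those chimera logs is to shuffle each column of the real data independently, so every fake row mixes real field values that never actually occurred together; a sketch, not necessarily the exact scheme the tool uses.

```python
# Chimera "malicious" rows: shuffle each column independently so the
# field values are real but their combinations are not.
import pandas as pd

def make_chimeras(df: pd.DataFrame) -> pd.DataFrame:
    fake = df.copy()
    for col in fake.columns:
        # .to_numpy() avoids index alignment putting values back in place
        fake[col] = fake[col].sample(frac=1).to_numpy()
    return fake

real = pd.DataFrame({"host": ["a.com", "b.net", "c.org"],
                     "user_agent": ["Mac", "Mac", "Win"]})
fake = make_chimeras(real)  # label these "malicious", the originals "normal"
print(fake)
```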
If you don't pass in a malware file on the command line, with that -o option, it does this automatically, so you can see how that works if you look at the source; it'll just produce a classifier using this technique automatically. The normal data file, though, is not optional, so it doesn't have a flag. Obviously, this is something we made for you guys to show how easy it is to produce something like this. It's not perfect, and there are things that could definitely be improved, like more diverse malware samples: the better the data you start out with, the better your orange and blue data, the better your classifier is going to be. That's almost always true: better data, better classifier.

Can I say something about that? When I said we might have had a suspiciously high F1 score: almost all of our malware samples were Windows malware samples, and our office is mostly Macs, so it was pretty easy for the classifier to tell the difference between those two. When I say more diverse malware samples, I really mean we need to get some OS X malware in there, which we didn't have, because there wasn't a lot of it in those archives.

Yeah, there's a very old story, maybe apocryphal, about binary classifiers. They were trying to train an image classifier to find pictures of tanks, and it was very good on their test set, but it turned out that all the pictures of tanks were taken on a cloudy day and all the pictures of not-tanks were on a sunny day. It was really looking at the sky and saying, oh, that's a blue sky: they'd made a blue-sky classifier. And that's maybe kind of what we did: we made a blue-sky Mac OS classifier. So you have to be careful about that. But that's the data, and not even really the
code. Yeah, exactly, it's the data. Also, some malware does things that are pretty normal, like checking Google or checking a time server, things like that, and that could throw off the classifier. So you could pre-filter; this is part of the data cleaning step. You could pre-filter rows like that so they don't affect the classifier as much, because they're going to look too normal. So there are cleaning steps you can do. There's also, in scikit-learn, an option in the random forest code that allows for retraining the forest; it's called warm starting. What you could do is this: your trained model was built on old malware data and old normal data, and eventually that becomes stale in a dynamic environment. There are ways to take new normal data and new malware data; you could just combine it with your old data and retrain from scratch, or you could build a new model from scratch, but it can be helpful to take the old trained model and use it as a warm start for your new trained model. There are ways to do that in scikit-learn.
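Warm starting is a real scikit-learn option: with warm_start=True, raising n_estimators and calling fit again grows extra trees against the newer data while keeping the old ones. A sketch on stand-in data:

```python
# scikit-learn's warm_start on a random forest.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X_old, y_old = make_classification(n_samples=1000, random_state=0)
X_new, y_new = make_classification(n_samples=1000, random_state=1)

clf = RandomForestClassifier(n_estimators=100, warm_start=True)
clf.fit(X_old, y_old)        # the original model: 100 trees

clf.n_estimators += 50       # ask for 50 more trees...
clf.fit(X_new, y_new)        # ...which get grown on the newer data

print(len(clf.estimators_))  # 150 trees total
```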
So that's another thing we could do. It would also be nice to have plugins for different log types; we're a little bit hardcoded to Bro HTTP logs right now, and it would be nice to have that in plugin form, where you could say Bro HTTP logs or something else, like Bro's weird logs, things like that. Another thing: these algorithms can do not just binary classification but also k-class classification, which means it could maybe guess what kind of malware it is, or normal, so that's another thing we could add. But really, the main thing that would be nice to have is different log types, because this is kind of fixed to Bro HTTP logs. I put a little recipe in here for anybody who wants to tinker with the code to handle different log types: there are four steps you'd have to take, and three files you'd have to change. Basically, you have to read the data in differently and generate the features differently. If you do that, it should work on pretty much any other type of log data. I'm not going to go through those in great detail, but if you want to ask us any other questions about that, here's our contact information.

Anything else? No, just to say that our Twitter handles are on here, and I'll be
tweeting out the link to the slides shortly. The GitHub URL is on here too; I'll be flipping it over to a public repo probably a few minutes after we're done up here, so within, say, an hour, both of those things will be available and you can go to town. Also, due to our AV difficulties we lost any questions submitted through Google, so I guess we're ready to take all of our questions in person.

A real simple question: what Python ML packages did you use?

It's in the README. We use pandas and scikit-learn, plus a couple of other libraries for things like top-level-domain extraction; the README has a section that tells you which packages you need to install.

I have a couple of questions. The first question has to do with correlating the anomalies that your classifier produces. Say you have one anomaly and another anomaly: have you looked into how they can be correlated, in terms of them belonging to the same attack?

Yeah, we didn't try to do that with this; we were trying to keep it to a one-job kind of thing. In fact,
you'll notice that when we display results back to you through our tool, we don't actually show you the IP addresses on either side or anything like that, because for the most part, unless you're super owned, most of these are still probably going to be false positives. But the ones that do look like true positives should be pretty obvious, and you can look those up if you want. If you want to do that kind of correlation, I'm afraid you have to buy Sqrrl's product.

Okay, I expect there are going to be a lot of answers like that. The second question is about adversarial machine learning: have you looked at the model drift aspect of training your classifier, and at attackers adapting to your machine learning approach?

Yeah, it's a very interesting question. Retraining can help, but it's definitely one of, if not the, most challenging problems in machine learning today, for that reason and for others. If you're trying to classify the type of an iris or whatever, one of the classic problems, the iris is not trying to look like something else, right? And that's not true here, so it's a
very challenging problem.

Thank you. Clearly your training introduced two artifacts in your sample set. One is that your feature extraction is really, really sparse, which I imagine could lead you to faulty feature selection. The second is that you trained on a sample set with roughly the same number of samples in both classes, while real life of course has orders of magnitude more benign traffic than malicious traffic; with just about any loss function, predicting "no" is the safest thing to say.

Right. That's partly why we're using the F1 score to do evaluation: if you just use something like accuracy, accuracy is going to be very high if you just say no all the time, which is bad. You really want to use something like F1, which is actually the harmonic mean of precision and recall, so it takes that into account; it's going to be low if you do things like saying yes all the time or no all the time. But yeah, there are various ways to mitigate this. You want to collect as much malicious traffic as you can, of course. You can also use a combination of something like noise contrastive estimation with your labeled data; you can do both, so you try to make essentially fake malware traffic and make the boundary around your normal traffic as tight as possible. I think malicious traffic drifts a lot, and there's a lot of variety in it, but normal traffic is generally not like that: you go to Facebook a lot, you use OS X a lot, so those signals are in there, and this thing is going to pick them out if they're there.

So I know in your example you focused a lot on the proxy traffic, but
what other kinds of logs have you focused on, and have you tried to cross-correlate between different log types to get, let's say, a higher true positive rate?

Yeah, we haven't actually looked at correlating the logs for this. Again, we're trying to be a demo crossed with a real tool, so we scoped it down to looking at one log. Although it's perfectly possible: with Bro data, a lot of things in Bro are explicitly linked by their IDs. If you look at the Bro HTTP logs, they actually have file IDs that correspond to files that were uploaded or downloaded, so you could think of pasting those together, augmenting the HTTP log with the information about the files that went over that connection, and seeing how that affects things. For example, if I'm downloading an EXE file, maybe that's a little bit more malicious. Or maybe you'd find that the URL says JPEG but the MIME type Bro assigned, by looking at the bytes, actually says DLL, and that might be a useful feature to help classify as well. We just didn't do that, because we were trying to keep it more straightforward and simple, to make it easier to understand, since this is about getting started with incident detection.

So this is the kind of thing that the product itself is doing, is that what you're saying? You're just giving a sort of basic introduction?

I can neither confirm nor deny, Senator. Yeah, I think it's a good observation: this is one log at a time, and we're trying to classify logs. What I think you're really asking is that we want to classify the behavior of various things, like IP addresses or users, and to do that you have to do correlation, you have to do modeling, and things like that. That's the kind of thing we do in our real-life
jobs. I guess we have time for one more. There's a question: explain your logo? That's the GitHub logo. Yeah, we're Sqrrl, not GitHub; we're Sqrrl, we have the acorn. GitHub is a cat-octopus, the Octocat, right? Ours is an acorn that's also a shield; we paid those guys. Well, if there are no more questions, thanks very much.

[Applause]