
This next talk is going to be "Transfer Learning: Analyst Sourcing for Behavioral Classification," and it's going to be given by Tim Mathur and Ignacio Arnaldo right here. A couple of notes before we begin: if you have phones, please use them silently and make sure they're not going to disrupt everyone else during the talk. If there's time for questions later, there's a microphone in the middle of the room, and for the sake of future viewers of this talk (it will be recorded and posted on YouTube later, if you want to go back and listen to it again), we'd ask that you use the microphone so your questions are recorded too, which is much appreciated. Quickly, I'd like to thank all the sponsors, especially Tenable, Amazon, and Source of Knowledge; without them this event wouldn't be possible. Also a big thank you to the speakers, the other sponsors, and all the volunteers. If you have feedback on the talk that you'd like to give, you can just go online and there should be a form there that you can fill out; that's really appreciated as well. So, without further ado, the presenters.

Can you hear me? So, a little bit about myself: my background is in data science and AI. I started back in Spain, in Madrid, where I studied computer science.
Then I did a PhD in AI, and after that I moved to MIT, where I was a postdoc and we were basically building systems for machine learning at scale. Out of that, I jumped into PatternEx, a startup that is tackling cybersecurity problems, and we're actually using these techniques to detect some of these attacks.

Okay, so, just to set a baseline here of what we're trying to do; then Ignacio is going to get into the details of how we're trying to do it. You're all familiar with the kill chain, and of course you want to get it interrupted as far left as possible; at the far right, of course, that's trouble.
The problem is that today, with the products that are available to you, the false positives are way too high and the false negatives are way too high; hence the number of breaches that we have happening out there, and a very high demand for SOC analysts to handle all of this work. And if any of you have ever been a SOC analyst, you know it's frankly a pretty boring job; most people last about a year and say that's enough, I'd like to go do something that's a little bit more interesting. So this is the problem we're trying to tackle here, and we're trying to do it across multiple phases of the kill chain, again as far left, as early, as possible. For example, delivery: great, let's see if we can stop it from even getting into the organization. But we all know that it is going to get in, so can we at least stop it at command and control, as an example, or at exfiltration, before the intellectual property, or whatever it is that's important to your organization, has walked out the door and you look at it and say, oh, there it goes, too late at that point. You'd at least like to be able to detect it if you can, and block it as early as possible.

The problem is that you're trying to look at all of these connections that are being logged, maybe the internal ones, maybe the external ones, at various apps, at the network level, whatever it is, and those combinations of source IP address, destination IP address, ports, and so on: that's billions and billions, maybe even, depending upon the size of your organization, trillions of combinations per day. Let's be honest, there's no way a human is going to be able to go through that. I've been on that side myself; I know your pain. I would bet you that you are not even logging 40% of the data that's available to you, and the amount of data that you're actually reviewing in near real time is probably only about 10 percent of your enterprise. With numbers like that, there's no way that you're going to be able to find these types of attacks, so it's no wonder that we're missing over 80% of them. And if you look at Verizon's data breach report, when attacks are discovered, which is usually by a third party, not by the organization itself, 82 percent of the time the evidence is already right there in your log files. You've missed it because the volumes are too high and there are not enough trained analysts.

So what Ignacio is going to get into is how we train a model that I don't have to write SIEM correlation rules for. Correlations are great; I'm a big believer, and Splunk is a great product. However, trying to write correlations for all of the thousands, hundreds of thousands, of variables that you would have to account for is not going to work. You can do watchlists, you can do various other tricks that Splunk enables, and that's great, but you're never going to write enough correlation rules to cover it all. So, Ignacio, please: some of the challenges, and how AI addresses this.
Absolutely. So there have been a few presentations of people explaining different use cases for security and how they were using machine learning models. I want to say that, for me, there are a few challenges that are basically blockers to applying AI in InfoSec, and I want to really briefly go through those. Again, I'm going to do the comparison with computer vision, like a lot of the previous speakers have done, and the reason is that we've all seen the success in computer vision: how cars are self-driving, and the tons of apps where you throw in a picture and you get tons of information back. So why is that kind of thing not happening in InfoSec?

For me, the first reason is that data is not readily available. You can have access to malware samples, you can have access to bad domains, but how do you get access to, say, a lateral movement dataset? That doesn't even exist. Also, organizations are siloed, so they're not sharing their information, their data, and it becomes very difficult to truly understand those attacks, as opposed to computer vision, where you can just go to Google, to YouTube, to anything, and get millions or billions of images.

Another problem is that the data is not universal. What I mean by that is that images are images: you can take an image, put it in the right format, and it will be valid for your model. But in InfoSec what we're facing is tons of devices generating information. There are different vendors for each type of device (different vendors of firewalls and so on), so each will generate slightly different data, and then there are versions. If you take that and do the combinatorics, you will find that there are hundreds of different data sources that you need to understand how to ingest. Again, in computer vision, face recognition is face recognition here, in Spain where I'm from, and everywhere in the world, and the same goes for self-driving cars.

Now, one of the biggest challenges that I face on a day-to-day basis is that the data is not labeled. They tell me, you need to go here, take this data source, and detect this attack, and I say, well, show me some examples, and nobody can do that. Nobody can tell me, okay, this is what an example looks like, and the reason is that very few people can actually understand what is an attack and what is not. Whereas in computer vision, researchers have built datasets with billions of examples, because labeling is very easy: anyone can go and say whether in an image there's a cat or a dog, and that is a labeled example a machine learning model can ingest and learn to recognize.

Yet another problem is that the data is very dynamic: an attack that is valid today might change in six months. So you combine the facts that there are few labels, few people that can give labels, and that the labels have an expiration date (the examples have an expiration date), and it becomes very difficult to create datasets good enough to train accurate machine learning models.

I just wanted to provide an overview of what my life has been these past two years, because there's a lot of work that goes into making these machine learning models work. People have talked about the benefits of deep learning and how it can alleviate some of that, which I agree with to some extent, but in my opinion you need to either become a domain expert yourself, to understand those attacks and be able to detect them, or work closely with domain experts, such as SOC analysts, that can give you the information you need. What I need is to understand what the attacks look like, and there's a wide range of those; not all of them will be suitable for machine learning approaches, so you need to identify which ones you can actually tackle. Once you do that, you need to understand what data sources you need to parse and model to detect those attacks. After that, you need to understand what the right features are, that is, what the right numeric values are that will describe the activity of the attack. And finally you need to understand how to model those features, that feature vector. So it's a lot of work, and I would say it's one of the biggest challenges, because if you compare against computer vision, there you would only have one data source (images, or videos), whereas we need to deal with hundreds.
One of the things that we see can definitely improve this situation is to engage a human analyst, a domain expert, where the AI itself is engaging the expert. Basically the idea is to create a loop, and this is not new: this is active learning, machine learning research from the 90s. It's not a set of models, it's rather a protocol, where you have an AI system that wants to train a model, and the AI itself will query an analyst and say: you, domain expert, I know you have very limited bandwidth, so let me show you, say, a hundred examples; give me labels for those, tell me whether they're attacks or not, and I will learn from it. And that's basically what happens here: you have some smart logic to decide what to query the analyst about. Normally you want to show likely attacks, things that would make the model better, so that whenever the analyst labels them, the model improves. Once the analyst has given the labels, you train a new classifier, and that classifier can then be deployed. I'm not going to go into the deployment phase in this talk; it's all about training the models.
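To make the protocol concrete, here is a minimal sketch of that analyst-querying loop in Python with scikit-learn, using uncertainty sampling to decide what to show the analyst. The `ask_analyst` callback is a hypothetical placeholder for however a platform would collect labels; this illustrates the idea, not the speakers' actual implementation.

```python
# Minimal active-learning sketch: query the analyst on the examples the
# current model is least sure about, then retrain on the grown label set.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def ask_analyst(rows):
    """Hypothetical placeholder: show `rows` to a human analyst and
    return a 0/1 label (benign/attack) for each."""
    raise NotImplementedError

def analyst_loop(X, seed_idx, seed_labels, budget=100, rounds=5):
    labels = dict(zip(seed_idx, seed_labels))   # seed must contain both classes
    model = RandomForestClassifier(n_estimators=100)
    for _ in range(rounds):
        idx = np.fromiter(labels, dtype=int)
        model.fit(X[idx], np.array([labels[i] for i in idx]))
        pool = np.setdiff1d(np.arange(len(X)), idx)
        if len(pool) == 0:
            break
        # Uncertainty sampling: P(attack) closest to 0.5 is most informative
        p_attack = model.predict_proba(X[pool])[:, 1]
        query = pool[np.argsort(np.abs(p_attack - 0.5))[:budget]]
        for i, y in zip(query, ask_analyst(X[query])):
            labels[i] = y
    return model
```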
Okay, so now we have a system that can take the knowledge from one analyst, get his feedback, and improve models, where the AI itself is querying the analyst. Once we have that framework in place, the idea is: why not communicate across different organizations? Analysts are already providing the labels that are needed for the machine learning model, but they can only provide labels for things that they see at their environment. So what if we could put together different organizations and share the labeled data, so that the machine learning models become more accurate?

The goal of doing this: it's well known in machine learning that the more labeled examples you have, the better your models become. So if we share labeled data across organizations, what we would expect to see is better detection rates; at the end of the day you're learning from more examples, so the odds are you'll be detecting more things. At the same time, because we're going to get labeled data at a higher pace, we expect to learn faster: whenever a new attack happens, you don't have to wait until you have enough instances at your own organization, because if you get those instances from everybody, you can train an accurate model faster. You can also expect to detect attacks that you have never seen before but that someone else has. Why? Because if somebody detects an attack and gives you the labeled information, you can train a model that will detect that kind of behavior in the future. Translated into a detection performance plot over time, what we would expect to see is that the curve that uses transfer learning, that is, getting data from everyone, ends up with a better detection rate, learns faster (a higher slope), and doesn't start at zero, because you can already have examples of attacks that happened at an organization other than yours.
There are two ways to carry out transfer learning that are popular in the literature. One is to directly share the models: those are executable black boxes that take data in and generate predictions out, is it an attack or not. The other is to share the labeled data; we're going to go with the second, and we're going to explain why. Going back to what Tim was saying, people have already been sharing information, and doing transfer learning to some extent. What is happening today when you subscribe to a TI feed, or contribute to a community-oriented threat intelligence feed, is that you're sharing exact indicators, in this case domains that are bad. Anyone analyzing his own data can grep for these domains, and whenever there happens to be a match, you say, well, maybe I need to investigate this thing. What we're proposing here is that, to extend this so we can apply machine learning models, you should share not only the IOCs but also the features that describe the activity of those IOCs. I'm going to go through a specific example of how we do that later on.
In order to implement this, you need some infrastructure. Basically, every organization needs this label acquisition loop, where the analyst is being engaged and providing labeled data, and you need to connect all of them and build a central repository. Just like everybody today is sharing IOCs, we want to create the equivalent where people share not only the IOCs but also the features that describe the behavior of those IOCs. There are many options to do this. If you're familiar with threat intelligence, there's a format that is widely used: STIX. It's basically JSON, you can put any fields you want in there, and these entities already exist. Here I added an entity: you have the ID, the created time, the modified time, the name, which in this case would be a specific domain that we identified as malicious, then the description, where in this case we would add the label, which is delivery, and then the features. So if you want to share the IOC together with the features, you just need to append that field and propagate it through the network. And to consolidate the distribution, there's an open-source framework, TAXII, that can definitely be extended for this; it is used to propagate those threat intel feeds in STIX format. So the tools are already there.
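As an illustration, a shared entry could look like the sketch below: a STIX-style indicator with the behavioral features appended. The `x_features` property name is an assumption for this example (STIX 2 permits custom properties); the point is simply that the features travel with the IOC and its label.

```python
# Sketch of an IOC shared together with its label and behavioral features.
import json

indicator = {
    "type": "indicator",
    "id": "indicator--1b9e6a70-0000-0000-0000-000000000000",  # example ID
    "created": "2017-06-01T00:00:00Z",
    "modified": "2017-06-01T00:00:00Z",
    "name": "suspicious-domain.example",
    "description": "label: delivery",      # the analyst-provided label
    "x_features": {                        # appended behavioral features
        "domain_length": 25,
        "vowel_ratio": 0.32,
        "digit_ratio": 0.0,
        "phishing_keywords": 2,
        "tld_frequency": 0.001,
    },
}
print(json.dumps(indicator, indent=2))
```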
Now, there's one big problem, though, and this is probably the most technical part of the talk: who's to say that an attack at organization A is also an attack at organization B? Companies might have different policies; maybe what one company considers malicious, another company says, well, that's okay for me, I don't care about it, or it's perfectly normal. That will depend on the organization. So we cannot just blindly take data from the outside; we need to do some level of curation of the examples. Here is one suggestion. This is open, this is work in progress, and it's not like anybody has fully figured out how to do this, but one idea is the following. Given that you will have some local examples of what an attack looks like, at the local organization we can fetch the examples: in this case we flag the attacks in red, the benign examples in green, and in white everything for which we don't have a label. This is the representation you will see throughout machine-learning-oriented problems, where the rows are different entities (this could be domain one, domain two, domain three, domain four, domain five) and the columns are features that describe the domain: this could be the domain length, this could be the ratio of digits to consonants, and so on. So what you're going to be doing is: you have a set of attacks, or bad domains, that you have identified at your organization; you fetch bad domains that were identified by an analyst at another organization; and you do some magic here to understand whether those extra examples are helping you or not. That's where you need to apply some logic to curate. Basically, the goal is to decide: do I use the attack examples from a third organization for learning, or not? That can be tested via some kind of A/B test via cross-validation, which means you try with the external examples and without the external examples, and whatever works best, you do. At the end of it, once you have understood whether you need to use the attack examples from other organizations, you build your training set with the benign examples and the attacks, and then you can train your standard machine learning model that can be used to detect future instances of the attack.
So now I want to use the remaining time of the talk to provide an example of how to apply this technology. This is an example of URL analysis; people have already talked about it, and in my opinion it's a good fit for a talk because it's intuitive, and it's a good fit for a machine learning model as well, which I'm going to try to motivate. This is a domain that we found in one of the organizations that we have access to, and it was reported by a third-party feed as malicious; if you look at the timestamp, that was only yesterday, and we actually discovered it some time ago. Let me try to explain why we can actually detect these things; it's quite simple when you think about it. First of all, the domain looks phishy: you have somebody telling you to click here and update something, probably not good. But if you want to detect those with current feeds, you cannot be looking at the exact domain. What happens if the attacker changes only the top-level domain? The domain is exactly the same, only the top-level domain changes, so that might not be stopped by those blacklists. Now say that, for some reason, you managed to block those three: you understand that those three are bad and update your blacklist. What is the bad guy going to do next? Well, he's going to just modify it a bit, and sure enough, it's going to bypass the defenses. I could keep going; these are things that we saw in a very short period of time. As you can see, it's very tricky to keep up with these small variations; a blacklist won't do it. On the other hand, everybody can see a pattern here: these domains are very similar in length, have similar words, have weird top-level domains. So this is a good fit for machine learning.
So I'm going to do a quick experiment. We're going to analyze data from three different organizations over six months, data spanning from January to June 2017, and we're going to consider traffic to the top 10k Alexa domains, which we'll consider benign, plus some phishing attempts that we identified, of which there were only 488 unique domains, which means that some of the domains actually showed up twice. This is how we're going to try to detect these bad domains, and it's very simple, but it's for the purpose of the talk. Basically, we're going to list all the domains that we see, and for each of the domains, exactly as was explained in the previous talks, we're going to extract a set of features. So for google.es you will have the vowel ratio, the digit ratio, the number of phishing names, the other-characters ratio, the frequency of the top-level domain as a proxy for how trusted the reputation of the top-level domain is (that's something we identified: people register domains in very cheap top-level domains that don't have all the checks), the domain length, and the consonants feature. Very simple; let's not expect to detect everything with this, but it's just for the purpose of the experiment.
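A sketch of what those per-domain features could look like in code; the keyword list and the TLD-frequency table passed in are illustrative assumptions, not the exact ones used in the experiment.

```python
# Extract simple lexical features from a domain name.
PHISHING_KEYWORDS = ("update", "create", "download", "now", "free", "save")

def domain_features(domain, tld_freq):
    name, _, tld = domain.rpartition(".")
    n = max(len(name), 1)
    return {
        "domain_length": len(domain),
        "vowel_ratio": sum(c in "aeiou" for c in name) / n,
        "digit_ratio": sum(c.isdigit() for c in name) / n,
        "consonant_ratio": sum(c.isalpha() and c not in "aeiou" for c in name) / n,
        "other_char_ratio": sum(not c.isalnum() for c in name) / n,
        "phishing_keywords": sum(kw in name for kw in PHISHING_KEYWORDS),
        "tld_frequency": tld_freq.get(tld, 0.0),   # proxy for TLD reputation
    }

# e.g. domain_features("click-here-update-account.xyz", {"com": 0.52, "xyz": 0.001})
```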
At the end of it, we append the label: whether it was a benign domain or a delivery domain. In order to train a model, we're going to choose random forests. Other people have talked about neural networks; in my opinion, random forests are a very good fit for InfoSec use cases, because you are in a situation where you have a lot of benign examples and very few attack examples, and for those kinds of imbalanced cases these models are a very good fit. They also provide some level of interpretability: once they generate a prediction, we can dig into the model to understand why it predicted a domain to be good or bad. Then, when it comes to choosing your machine learning library, you have a myriad of choices. I use scikit-learn, which is what I use a lot; it has proven to be very robust in production environments, so I highly recommend it.

This is the first organization, not talking to anyone, just improving over time. What we're showing here is the six months, and on this axis the detection, the recall at 100, which means we show the analyst the top hundred domains that we identify and measure the percentage of attacks that we're catching, that we're actually detecting.
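A minimal sketch of the training and of the recall-at-100 metric just described, assuming arrays built from features like the ones above (label 1 = delivery domain): rank the month's domains by predicted badness and measure what fraction of that month's attacks land in the top 100 shown to the analyst.

```python
# Train a random forest and compute recall at k (k = domains shown to analyst).
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def recall_at_k(X_train, y_train, X_month, y_month, k=100):
    model = RandomForestClassifier(n_estimators=100, class_weight="balanced")
    model.fit(X_train, y_train)
    scores = model.predict_proba(X_month)[:, 1]    # P(attack) for each domain
    top_k = np.argsort(scores)[::-1][:k]           # the k most suspicious
    return model, y_month[top_k].sum() / max(y_month.sum(), 1)
```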
For example, at the second month, if you look at the top hundred domains, you're only catching 30% of the attacks; at the third month, as you learn from the previous months, your detection gets better. So that's a human in the loop training the platform over time: the model is getting updated every month, and as a result the detection rate roughly goes up. Now we do the same thing at the second organization, and we see that broadly there's an upward trend, but we get a dip here. I'm not exactly sure why, I haven't checked, but most likely the domains that showed up in the sixth month were very different from the domains that were identified before, and therefore the classifier was not good at detecting those patterns.

Now, what we're showing here is what happens if those two organizations had shared their labeled attacks, if they had created a common repository of attacks and learned together. What we see is a higher slope. For the blue one it's a very small one, a minor improvement for organization two, but for organization one we see that from the second month we're actually going from roughly 0.3 to 0.7. That's a very significant improvement: by sharing their attack examples, they can improve those models much faster. Now, what happens if a third organization joins this network, say, five months in? These two have already been sharing information, and a new one comes and joins the network. What happens in this case is that organization three, joining the network at the fifth month, is not starting at zero as these two did; it's actually starting at 0.5. So even if you have not detected any attack at your own organization, just by getting the data from the others you could already be detecting 50% of those attacks.
And this is a small comparison that we did of when we were finding these bad domains versus when they were being reported in blacklists. Without going into too many details, we saw that the median time of us detecting those was roughly 10 weeks earlier, because those domains have a very clear pattern and were easy to spot for these classifiers, whereas for a human it takes actually examining the domain, analyzing it, seeing whether it's good or bad, and then updating one of those lists.

Now I wanted to provide some hints as to what each of you can do at your organization to start deploying these solutions. The first thing is that I recommend URL analysis: it's a nice use case, and we have seen even in the other talks that it gives good results. One step that would have to be taken to improve it is to expand the features that we're using. When I said that this model provides some visibility, some interpretability, I meant that we can see which features are important. We've seen that the domain length seems to be important, and the number of phishing names is important, which is to be expected; basically it's whether the domain contains keywords like update, create, download, now, free, save, that kind of stuff. The others turn out to be not so important.
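A short sketch of how that inspection can be done with a fitted scikit-learn random forest: print the most influential features so analysts can judge them and propose new ones.

```python
# Print the most important features of a fitted random forest.
import numpy as np

def print_top_features(model, feature_names, top=10):
    order = np.argsort(model.feature_importances_)[::-1][:top]
    for i in order:
        print(f"{feature_names[i]:<20s} {model.feature_importances_[i]:.3f}")
```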
But the idea is that your analysts will have ideas of what to look for; you need to translate those into features, and once you have those features, you can train a model that will be more accurate. Another way to apply these technologies at your organization is to target other use cases, for which you need to follow a standard data science process. The first step is to identify the use case you want to solve; there could be use cases unique to your organization that nobody else has, so you need to identify what the problem is. Here we provide some hints: delivery via phishing domains, or DGA domains, are use cases that can be addressed easily. Then you need to understand where the attack will get logged: if there's a particular data source that is useful for your use case, you need to fetch it, and you need to be able to understand it and process it. Then you need to get attack examples, and then you loop through a cycle, which is where the data science work is, of identifying appropriate features (the right numeric values that will describe those attacks) and training and validating the model: you extract some features, train a model, and see how good the model is; if the model is not good enough, you cycle back and find new, better features. That's the data science cycle. Then, once you have a model that you trust, you need to decide whether you want to deploy it online or in batch. If it's batch, it means that maybe your organization only cares about looking at the logs every month or so, so you can do it retrospectively, and that's all good. If you want to do it online, it's another story: it requires some systems and some production people, you need to productionize those models. It's more involved, but it can definitely be done.
And then finally, you need to find peers, here in this conference, and understand which ones would be willing to share those labeled attacks with the features, and try to build your network. So, how many of you participate in ISACs or other information-sharing organizations? Not a high percentage. As was mentioned, most of the sharing today is probably done using STIX and TAXII, but there's no real automated, aggregated analysis where you can compare against your peers, and that's what we're talking about here: a way to do that. So with that said, we welcome any questions from the floor; there's a microphone in the room.
One of the use cases you mentioned is C2 beaconing, the C2 use case where you can track C2 through the proxy: you can do some machine learning on proxy logs and detect C2 beaconing, correct? So the question is: if I have to share information on proxy logs so that there is more data to train the model on, do I, as an organization, have to share all my proxy logs, or only the ones where there was an attack?

Just to provide some context: you would be sharing the features, which in that use case could be things like the number of connections, the periodicity of the connections to that domain, or the number of bytes sent in each connection, that kind of thing.
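For illustration, a sketch of those beaconing features computed from parsed proxy-log records; the input format (timestamp in seconds, destination domain, bytes sent) is an assumption for this example. A low standard deviation of the gaps between connections is the classic hint of periodic beaconing.

```python
# Per-domain beaconing features from proxy-log events.
from collections import defaultdict
import statistics

def beaconing_features(events):
    """events: iterable of (timestamp_seconds, domain, bytes_sent) tuples."""
    by_domain = defaultdict(list)
    for ts, domain, sent in sorted(events):
        by_domain[domain].append((ts, sent))
    features = {}
    for domain, rows in by_domain.items():
        times = [ts for ts, _ in rows]
        gaps = [b - a for a, b in zip(times, times[1:])]
        features[domain] = {
            "num_connections": len(rows),
            "bytes_sent_total": sum(sent for _, sent in rows),
            # low gap variance suggests regular, beacon-like connections
            "gap_stdev": statistics.pstdev(gaps) if gaps else 0.0,
        }
    return features
```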
So the sharing has to be in the context of a machine learning model; it's not that you are sharing raw logs. You have to keep a machine learning model in mind and accordingly share labeled data.

That makes sense, thank you.

My question would be: do you have to standardize on the type of machine learning algorithm you're going to use? It doesn't seem obvious to me.

You could argue that feature engineering is somewhat coupled with the model that you deploy. In this case, you can standardize on random forests; in my opinion, if you have the right features and enough data, they will work just fine. That being said, once you receive all the features from the others and you have your own features, at that point you can also try different models.
I had a question: how sensitive are your algorithms to labeled data that is not right, that is, mislabeled data?

In one of the slides we were showing a cross-validation loop with, say, A/B testing, where we take the data from the outside, determine whether it is helping, and then decide whether to keep it or reject it. I think that's the idea we need to apply here: we need some level of curation of those examples, so before training the model you first understand whether that data is useful for you, and if it's not, you just reject it.

So you don't apply a second layer of algorithms to figure out whether you have good labeled data or bad labeled data, or anything like that?

We do. As part of that receiving process there is all the logic that curates the data coming from other parties, and that curation means training models and understanding how they behave.

Just a question about the training data sample that you're using for the cross-validation: how do you ensure it's a representative sample to compare against?

For me, the way to do it: I get all the data from my organization and use that as benign examples, and then I take the data of the attacks and use those as malicious. Can I guarantee the data is representative, that because you're using your own data it's exactly what it's going to look like tomorrow when you deploy? No, that's probably hardly feasible. So, roughly, in these cases I would account for, say, weekdays and weekends: so maybe a week, a couple of weeks, three weeks, four weeks.
I was wondering whether there are any successful use cases you could share. I find the URL one a great example, but many of the problems in security, the attacks, are far more subtle to detect; it's often legitimate activities in certain combinations that result in the actual malicious activity occurring.

That is correct. As I was saying, my last two years were exactly that: identifying what the right use cases for this technology are and how we can address them. As part of that modeling decision, you need to understand what those attacks look like, so you need to know what to look for. You need to understand what the right entity is: do you want to have a classifier based on domains, or is it going to be source IPs, or are you going to be even more granular and say, I want to model individually every pair of source IP and domain, for example? That way, even though that source IP might be doing a lot of legitimate things, if you assume that the domain is bad, those connections will all be bad. That's part of the modeling decision. Then, once you understand that, you need to understand what the right features are, and iterate. Exactly. So I fully agree; those cases are a clear example of the challenges: where can I find a dataset to play with? Where can I find what an attack looks like?
That's one of the cases where you see all the talks about deep learning and so on trying to go from 90 percent to 99 percent, from 95 to 99.99, whereas here we need to go from zero to sixty, and that requires a different set of technology.

Hi. As someone who's new to this space: let's say you sold me on this, and I want to bring it back to my boss and say, this is a great thing, we should implement this, we'll work with all our partners who are going to share. What's the next step? What do I need to do to implement this?

Okay, I think it will depend a lot on the use cases you want to target, because some use cases will be very lightweight on the infrastructure side. Say you want to do some domain analysis: you can get your logs and dump the domains, or take them from wherever; it's simple, you can do it very easily. That will require a few scripts that extract those features and train machine learning models; there have been tutorials in the previous presentations, and it won't be more than 100 lines of code. Now, if you want to do other use cases that require you to look at long windows of time, say you want to detect command and control or other behaviors over a month or over two months, that means you need a whole big infrastructure to process those logs and extract the right behaviors by looking at a lot of data, and that becomes a big data problem. That's when it gets difficult.

And then, on top, the data sources: you must be logging the data elements that you need for whatever this is. So it's a multi-step process every time?

Definitely, but I would look at it along two different axes: one is the size, the volume, of the logs you need to parse, and the second is whether you need batch analysis or real-time analysis. Depending on those two factors, the work that is involved to make this thing work will be completely different.