
GT - ClusterF*ck - Actionable Intelligence from Machine Learning - David Dorsey & Mike Sconzo

BSides Las Vegas · 41:03 · 32 views · Published 2016-12
About this talk
GT - ClusterF*ck - Actionable Intelligence from Machine Learning - David Dorsey & Mike Sconzo Ground Truth BSidesLV 2014 - Tuscany Hotel - August 05, 2014
Transcript [en]

Instead of calling it ClusterF*ck, we should have called it Intern, because we basically automated the job of an intern. So, sorry to any interns in the audience — but not that sorry; a little publicly sorry, but inside, not really sorry. At a high level, what we're going to be looking at today is our attempt to use machine learning to group files together with math, and then from those groupings programmatically extract YARA signatures, so you can say, hey, find me stuff similar to that in our environment, or in VirusTotal, or wherever you want to look. Who we are: I'm Mike Sconzo, currently R&D at Click Security, and David Dorsey, the same. So, just two dudes; we

like a little bit of math, but not too much, and a lot of computers. Down at the bottom there's this web address, beautifully formatted; we have a bunch of IPython notebooks and code out there now that you can play with. These slides are awful, so we're not going to post them; we're actually going to post the IPython notebooks so you guys can have the code and play along at home. And we'll be releasing these two little utilities that you can just point at a directory of files and they will magically do things for you. Hopefully we'll have those up today or tomorrow — no promises. Okay, so Dorsey had to put this quote in there; you have some random

quotes throughout just because Mike said this the other day. I didn't, so that's why it's here; that's really all there is to say about it. I think it's the most brilliant thing he's ever said: it tells you something about what you've got in front of you. All right, so the basic question we asked is: can we group files just by their static features? We didn't want to run things dynamically. And then, could we go one step further: can we turn these groups into YARA signatures you can easily deploy? We decided to try this on Mach-O files and your Windows executables. We tried three different clustering algorithms: DBSCAN, k-means,

and mean shift, but mean shift didn't work too well for us, so we're not really going to talk about it other than to say that it didn't work. A little bit about DBSCAN: it clusters things by just grouping things that are densely close to each other — go figure. The clusters can be any shape; we don't have to worry about the shape. It scales very well to large numbers of samples, and as long as your clusters don't get crazy big it works really well. For the math folks: it handles non-flat geometry and uneven cluster sizes. One thing to note about DBSCAN — we'll just walk back and forth here — we used scikit-learn's

implementation. It has a really nice feature where you give it the minimum number of samples you want per cluster, and if a group doesn't meet that, it will say all of those samples are unlabeled, which makes a lot of sense if you're looking at files, right? You'd expect there's probably a fair amount of files that are similar to one another, with this tail of one- and two-file groups. So for all of our work we said: unless there are three files in a cluster, we're not interested in labeling it. Okay, so k-means is another way that you can group data. The parameter you provide to k-means is: you

say, hey, I think I should have this many clusters in my data, this many groups. So here's the number of groups; go forth and tell me which things are close to one another. One of the nice things with k-means that we found out is it tends to group things that are highly, highly similar together, with the same values, which actually works pretty well with the way we structured our YARA generation. And then mean shift — like I said, this one was really sad. We have a motto, "failure is always an option," and we totally exercised that option when we were playing around with mean shift, so we won't bore you with that.
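The DBSCAN setup just described — scikit-learn's implementation, a minimum of three samples per cluster, anything that can't meet it left unlabeled — can be sketched like this (toy stand-in data, not the talk's actual file features):

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Toy stand-in for a file-feature matrix: two tight groups plus two outliers.
X = np.array([
    [0.0, 0.0], [0.1, 0.0], [0.0, 0.1],   # group A
    [5.0, 5.0], [5.1, 5.0], [5.0, 5.1],   # group B
    [20.0, 20.0], [-20.0, 3.0],           # isolated "one-off" files
])

# min_samples=3 mirrors the "at least three files per cluster" rule;
# anything that can't satisfy it gets the noise label -1 (i.e. unlabeled).
labels = DBSCAN(eps=0.5, min_samples=3).fit_predict(X)
print(labels)  # [ 0  0  0  1  1  1 -1 -1]
```

The `-1` entries are exactly the talk's "unlabeled samples" — files that never joined a dense-enough group.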

So, a couple of things you can do to your data — and all of these have effects on how well, for some definition of well, depending on your use case, these algorithms function. Scaling, right: it normalizes features so you're not dealing with, you know, one feature that's zero to one and another one that's zero to five thousand, or maybe ten to a million. The other thing that we made pretty liberal use of was principal component analysis, and basically what that does is tell you which features in your data set contain the most information, and it gives you a nice ordering, so it's a way to say, all

right, instead of having 63 features — a sixty-three-dimensional space — I want something a little bit smaller: maybe 18 dimensions, or three, or two. For all the visualizations we actually used PCA to go from these stupid high-dimensional spaces that nobody can visualize, unless they're highly inebriated, all the way down to two and three dimensions. So now we'll jump into Mach-O binaries — kind of a running joke, Randy "Macho Man" Savage; all right, a couple chuckles, I'll take it; Dave even wore the shirt, dedication. When we talk about features, what it really means is various static attributes of these files. We used macholib in Python; it's a really nice Mach-O library.
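To give a flavor of those static attributes, here is a hypothetical, stdlib-only sketch that pulls a couple of them (load-command count, CPU type) straight out of a thin 64-bit Mach-O header — the talk itself used macholib, which does all of this properly:

```python
import struct

MH_MAGIC_64 = 0xfeedfacf  # little-endian 64-bit Mach-O magic

def macho_header_features(data: bytes) -> dict:
    """Illustrative only: grab two static features from a thin 64-bit
    Mach-O header (the talk's real pipeline used macholib)."""
    magic, cputype, cpusubtype, filetype, ncmds, sizeofcmds, flags, _ = \
        struct.unpack_from("<8I", data, 0)
    if magic != MH_MAGIC_64:
        raise ValueError("not a little-endian 64-bit Mach-O")
    return {"n_load_commands": ncmds, "cpu_type": cputype, "filetype": filetype}

# Synthetic header: x86_64 (cputype 0x01000007), MH_EXECUTE (2), 16 load commands.
hdr = struct.pack("<8I", MH_MAGIC_64, 0x01000007, 0x3, 2, 16, 0, 0, 0)
print(macho_header_features(hdr))  # {'n_load_commands': 16, 'cpu_type': 16777223, 'filetype': 2}
```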

So we picked out a handful of features: the number of load commands — I don't know, is anybody terribly familiar with the Mach-O format? It's actually really nice, way better than PE; it makes sense. It's divided up into load commands, and these load commands are basically sections, and these sections do various things, like: hey, load these libraries, or hey, this is the minimum version of OS X that I require to run. So we took the count and then also enumerated those. We also threw in the OS X version, information about the symbol table, and information about the dynamic loader. Mach-O also has these fat binaries — anybody familiar with the universal binary format? It runs on

PowerPC and Intel, and it does it because they literally cram two binaries into this one format, and then the computer goes, oh yeah, we're on PowerPC, I'll run this version. Those we actually broke apart and treated as two separate files. We went through a bunch of iterations with feature choosing; we didn't do a whole lot of analysis on whether we chose the best features — there are probably some better ones we could choose, and probably some we could get rid of — but what we'll present today worked well enough for our purposes to proof out the idea and get some interesting results. So, the Mach-O file source: we started out with 527 files from Contagio and

VirusTotal and VirusShare, and then we combed through the system directory on a couple of Mac notebooks. So in theory, if you think about compilers and toolchains, we should get some pretty natural groups, right? Different virus families will hopefully be grouped together, and things compiled by Apple will hopefully get grouped together. Then on all those files we took 63 features; the reason we have more feature vectors than files, once again, is that some of them were those universal-format binaries, so we pulled them apart, and now we're in essence dealing with 639 files. So once again we made liberal use of PCA, because — anybody here see in sixty-three dimensions? No?
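That scale-then-PCA step, as it might look with scikit-learn — random stand-in data with the talk's shapes (639 feature vectors, 63 features), reduced to 18 dimensions for clustering or 3 for plotting:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(639, 63))                # stand-in: 639 files x 63 static features

X_scaled = StandardScaler().fit_transform(X)  # zero mean, unit variance per feature

X_18 = PCA(n_components=18).fit_transform(X_scaled)  # smaller space for clustering
X_3 = PCA(n_components=3).fit_transform(X_scaled)    # for the 3D scatter plots
print(X_18.shape, X_3.shape)  # (639, 18) (639, 3)
```

`explained_variance_ratio_` on the fitted `PCA` object is the "nice ordering" mentioned above: how much information each component keeps.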

All right, I don't even know how to do that. So this is a really sweet high-resolution picture of the raw data in 3D: graphing the three most important features of each file, this is what our file layout looked like — all 639 blue dots. And this is what it looked like in two dimensions; looks like a sideways E, yeah, I don't know. So as a person, right, you might see, hey, maybe a lot of these things are similar, and then I've got these weird outliers. Let's see what the math does. Once again with DBSCAN, if there aren't three things in a cluster we're not really interested. We ran it against

that, and we got 21 clusters, ending up with 140 labeled samples and 499 unlabeled samples — sorry — which seems mostly awful until you realize that in no cluster, using just these 63 features that we more or less arbitrarily selected, were malicious files ever mixed with clean files. So, kind of interesting and really telling results. The unlabeled samples we were really angry about at first, but then we said, maybe it makes sense; maybe there are a lot of ones and twos. This is what the clusters look like: the number on the left is just the cluster number, the label size is obviously how many files are in that specific cluster, then how many of them were

malware, and then how many of them were clean. You can see a bunch of Apple ones got grouped together, and then potentially some malware families. It turns out it actually did pretty well: we grabbed this information from VirusTotal on the samples, and despite some relatively inconsistent naming from various antivirus vendors, overall it felt like it did a really good job of grouping like malware with like malware — even smart enough, if you will, to say, hey, here's a variant and it's pretty similar to the other variant. Now on top of this we said, all right, well, maybe we can get less-alike things to be slightly more

like one another if we do this PCA. I mean, as you think about it, if you go from a really high-dimensional space to a lower one you're in essence throwing information away. So we reduced it from sixty-three dimensions to 18, so we have these 18 features that we're using, and the numbers kind of inverted, right: we got thirty-four clusters, 452 labeled samples, only 187 unlabeled, and then 5 of the 34 had this mix of both malicious and clean. So you can iterate on this, and you can say, well, there's probably some really nice middle ground if you have that information on

what's malicious and what's clean, or what should be grouped together with what; but if you don't, it's probably still okay at grouping. So: more clusters, more files labeled, bigger cluster sizes. And this is what the colored results of the clusters look like — you can see groups of similarly colored things next to groups of similarly colored things in three dimensions, projected down from 63, whatever that means. A slightly better picture, so you can see more colored dots, and in two dimensions, because why not, and zoomed in. This one actually does a fairly decent job of visually showing, okay, I get it, there are red dots by red dots and

orange dots by orange dots, and yellow dots by yellow dots — or, if you're colorblind, a lot of brown dots with some blue ones mixed in. So once again, all right, maybe it did okay; we got some clean and some mixed, but overall there are still several samples from the same family mixed together. And it was at this level that we really noticed the inconsistency in labeling samples from antivirus, right: one antivirus might label a sample one thing, another one another thing, and then you'd switch and they'd be elsewhere. So, a really good example —

not, um... kind of interesting. Yeah, another Dorsey quote — so, I was talking with Russ from Attack Research and he said I could quote him on this, so you all bear witness that I quoted him; he's internet famous. So, a higher-level look at k-means: you have to give it the number of clusters, and unless you really know "I have this many clusters," there are several ways to say, all right, data, you tell me how many clusters you think I have. The simple formula for that is just the square root of the number of samples divided by two. One kind of interesting side note with all of this clustering, and you'll see it as

we go through: more often than not — and I'm sure you'll see it when you try this at home — you'll get one really giant cluster and a bunch of smaller ones. You can repeat this process on that big cluster to get sub-clusters if it doesn't feel right. That's where the art form of all this comes into play: understanding what you hope to get out of the data, or what you think the data represents, and ways to tease it out. So, out of the 17 clusters, 6 had both malicious and clean — fairly decent, nothing totally earth-shattering, but it's nice to at least see consistency among the various

algorithms. That's why we chose to do a handful of them: to do a little bit of a survey and say, all right, is one algorithm going to be better at file clustering, is one going to be worse, does it make sense to use a combination of both? k-means, in my opinion, actually gives you the prettiest graphs, so if you play with this at home or at work, give management the k-means graphs; it looks like you're super productive — really nice color bands, science is happening. So once again we wanted to try doing both PCA and scaling and compare: pretty similar, five

clusters, right, with both malicious and not malicious. A brief note on the code that will be released: you can just give it a command-line flag and say, do this with PCA and choose this many dimensions, or take a guess at how many dimensions it should use, and it'll do all that for you. So you don't really have to do anything but what I like to call the Homer Simpson school of analysis, where you just push a button and it does everything for you — hey, for some definition of everything. So once again, really pretty colored bands, and it kind of makes sense that you would get these really nice bands across

the various axes, since k-means really tends to gravitate toward grouping things with the same values together. So, jumping into the YARA piece — the cool part. This is actually the part that probably took the vast majority of the work, and right as we were wrapping up, on the YARA mailing list they said, hey, we've got some new stuff for YARA in the works and it's going to support file modules. Oh great, just wasted two months — not that I'm bitter; a little bitter. So it's broken up into a couple of different classes so maybe people can use them. Our approach was a little bit different: there are some really nice YARA signature generators online that

people have created, and they're generally focused around strings analysis. We wanted to try and create one that was contextually aware of file structure — and at least we're not crazy, because the YARA guys announced very similar functionality, so at least we've got a nice point of validation. There's this base object, which is just a YARA signature, and it actually does the rule generation, and then we've got both a Mach-O one and a PE one on top of it that support various things. So you can say: all right, for this file, create a signature that contains the file header, all the section names, and the specific information about these various sections, or whatever combination you want. We

tried to make it pretty flexible, so when we did the signature generation we could throw all 63 features at it and something would stick and make sense. The way we did the signature generation was pretty much a sledgehammer approach — there are much better ways to do this, and we can get into why some of our failures happened and where some of our successes were — but for each cluster it basically says: give me all the variables that are exactly the same, and for all of those that are exactly the same, we're going to put them together in one giant ANDed condition, and that's going to be

our YARA signature — and for that, right, we have load commands and all these other things. One of the cool things that we built in toward the end was all these variations of don't-care bits; the nice thing is you can make a stronger YARA signature by giving it bytes and then some don't-cares in between. We've got an example of that — I don't know if you guys can see it in the back very well, but this is actually what one of the YARA signatures generated from the clustering, run through this signature generator, looks like. It tells you, right, what samples are supposed to be in the cluster, it'll give you a cluster

label, the algorithm that generated it, and then all the various parts that were fed into the generator. For this one it had a couple of different symtab load commands, and you can see all the various question marks — in the YARA signature language those are don't-cares. So that was a way to say, hey, let's take more attributes and create a stronger signature, with values we don't care about or know are flexible within the file format. A lot of them have don't-cares for sizes: the load command will be present, but the size will be slightly different between binaries, so we chose not to pin that down. And then there's a condition.
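A heavily simplified, hypothetical sketch of that generator: byte values shared across a cluster get emitted as-is, fields that vary between binaries (sizes, offsets) become `??` don't-cares, and everything is ANDed in the condition. The function name and the example bytes below are made up for illustration, not the talk's actual code:

```python
def yara_rule(name, samples, shared):
    """samples: filenames in the cluster (emitted as a comment);
    shared: {label: list of byte-or-None}, None meaning a ?? don't-care."""
    strings, conds = [], []
    for i, (label, pattern) in enumerate(sorted(shared.items())):
        hexpat = " ".join("??" if b is None else f"{b:02X}" for b in pattern)
        strings.append(f"        $s{i} = {{ {hexpat} }}  // {label}")
        conds.append(f"$s{i}")
    return (
        f"rule {name}\n{{\n"
        f"    // samples expected in this cluster: {', '.join(samples)}\n"
        "    strings:\n" + "\n".join(strings) + "\n"
        "    condition:\n        " + " and ".join(conds) + "\n}\n"
    )

# e.g. an LC_SYMTAB load command whose size fields differ per binary:
print(yara_rule("cluster_7_dbscan",
                ["a.bin", "b.bin"],
                {"symtab_cmd": [0x02, 0x00, 0x00, 0x00, None, None, 0x00, 0x00]}))
```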

It worked okay, right? There were parts that were completely mind-blowing, where we thought we were geniuses, the smartest guys on earth, and then there were parts that were entirely humbling, where we realized we had no idea what we were doing. Here's a brief snapshot of some of the results. You can see there are awesome numbers; there's actually one cluster where the YARA signature didn't fire on any sample, there are a bunch where it found exactly the files that should have been in the cluster, and there are a couple more where the signatures probably weren't as restrictive as they could be and we got some bleed — that's

the one hundred twenty percent, or one hundred twelve percent, right, or two hundred thirty-three percent of the files. Basically what it's saying in the very last row is there were three samples in the cluster and the YARA signature flagged seven. And then we got to this one total failure, right: in this one there were two files in the cluster and the generated signature triggered on 363. On the upside, we found the two; on the downside, we found 18 thousand percent more than we really should have. So, you know, it's not perfect, but at least we got some good results. All right, this is kind of cool — so, how can we get better results?
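Those percentages are just signature hits over cluster size; a quick hypothetical helper makes the "total failure" row concrete:

```python
def signature_stats(cluster: set, hits: set) -> dict:
    """Sanity-check a generated signature against its source cluster:
    how much of the cluster does it cover, and how far does it bleed?"""
    true_pos = len(cluster & hits)
    return {
        "coverage_pct": 100.0 * true_pos / len(cluster),
        "fired_pct_of_cluster_size": 100.0 * len(hits) / len(cluster),
        "bleed": len(hits - cluster),
    }

# The failure case above: 2 files in the cluster, signature fired on 363.
print(signature_stats({"a", "b"}, {"a", "b"} | {f"x{i}" for i in range(361)}))
# {'coverage_pct': 100.0, 'fired_pct_of_cluster_size': 18150.0, 'bleed': 361}
```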

We could play more with the clustering — again, going to sub-clustering — look at our feature selection a little bit better, and we could support complex YARA signatures, right, this whole OR business: AND this, or that, or one of those and one of these. You can specify offsets in the generator itself; we just don't do any in here to say, look for this specific value at this offset in the file. So, a lot of room for improvement, and if people are interested in tinkering with this, we're totally up for people helping. So I'll let Dorsey go on with PE. We did much the same things with the PE files. We started with

just some of the simple features: the file header, the optional header, the data directories, and a little bit from the resource section. We had a thousand files total: 500 randomly selected from multiple Windows operating systems, under Program Files and the Windows directory, and then 500 randomly selected from VirusShare as well, just because it was a good source. Like we did with Mach-O, we're just going to take a quick look at the data and see how it looks. Here it is in 3D again — the wonderful blue dots, a big blob down there — so I said it kind of looks like a penis. Oh yeah.
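Those PE file-header fields sit at fixed, well-documented offsets; a hypothetical stdlib-only sketch of grabbing two of them (the talk used a real PE parser for the full feature set):

```python
import struct

def pe_coff_features(data: bytes) -> dict:
    """Illustrative only: pull machine type and section count
    from the COFF file header of a PE file."""
    if data[:2] != b"MZ":
        raise ValueError("not a PE file")
    (e_lfanew,) = struct.unpack_from("<I", data, 0x3C)   # offset of "PE\0\0"
    if data[e_lfanew:e_lfanew + 4] != b"PE\0\0":
        raise ValueError("missing PE signature")
    machine, nsections = struct.unpack_from("<HH", data, e_lfanew + 4)
    return {"machine": hex(machine), "n_sections": nsections}

# Synthetic minimal file: x86-64 (0x8664), 5 sections, PE signature at 0x80.
blob = bytearray(0x98)
blob[:2] = b"MZ"
struct.pack_into("<I", blob, 0x3C, 0x80)
blob[0x80:0x84] = b"PE\0\0"
struct.pack_into("<HH", blob, 0x84, 0x8664, 5)
print(pe_coff_features(bytes(blob)))  # {'machine': '0x8664', 'n_sections': 5}
```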

And here it is in 2D; it looks probably a bit more anatomically correct now. So these are the great things you get when you visualize your data. We ran DBSCAN on the raw data, minimum cluster size of three again: seven clusters, 36 samples labeled. Yeah, that wasn't so good, so I didn't spend the time delving into "well, at least those 36 are grouped well"; we just jumped straight into doing PCA. Much better — it still left about half the samples unlabeled, but we got 418 that were labeled, and it's not terribly surprising, at least to me, that we have so many unlabeled, since there are so many different malware

families out there that I could easily see we got one or two random ones together, and then, you know, all the different Windows files I could see being off by themselves. Seven of the clusters had both malicious and non-malicious files, so that wasn't too bad out of the total of sixty-three. Here's a breakdown: you see we've got some decent-sized clusters that were grouped, at least on the malware-theme scale, pretty well — for example, of the 58 in that one, all of them are malware. So we were kind of enthusiastic about these results, at least I was, and that makes me happy. So then we

actually took a look at it, and it did pretty well, to our surprise: it looks like it took multiple different versions of the malware family and grouped them all together. I was quite impressed. Now we look at it in 3D — you see that big blob there, you can't really see anything, so you zoom in on it and you can see a slightly bigger blob, but you kind of see things grouped together; I mean, it's the eyeball color test. In 2D I think you can see it better, so we zoom in again, and now you can see the grouping; I think it's a little bit

easier to see in 2D. Then we moved on to k-means. We chose a k of 22 using that simple formula we showed earlier, and got some pretty decent-sized clusters. Here it is, the penis again, but now it's multicolored, so that has to count for something. If you zoom in on it, it doesn't look like it anymore, but you can see — I mean, this is again the k-means pretty graphs. Then, back to the results: this is actually using PCA again, reducing to 20 dimensions; it didn't really change cluster sizes too much.
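That "simple formula" with k-means, on stand-in data shaped like the PE set: k = √(1000 / 2) ≈ 22.

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.random.default_rng(1).normal(size=(1000, 18))  # stand-in: 1000 files, 18 PCA dims

# Rule-of-thumb starting point from the talk: k = sqrt(n_samples / 2).
k = int(np.sqrt(X.shape[0] / 2))   # 1000 files -> k = 22
labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
print(k, np.unique(labels).size)   # 22 22
```

As the talk notes, this is just a seed value — sub-clustering the one giant cluster you usually get, and iterating on k, is where the art comes in.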

We'll catch that at the end again. So, PCA didn't change much: still the penis there, and zooming in, another pretty graph. Eight clusters had both malicious and non-malicious; you can see, like, the cluster labeled one — mostly malware but still a good amount of clean — while that bottom cluster has a strong number of Windows files. So now we jump into the PE YARA side. All this supports so far is the file header and the optional header, mainly because I extracted the RVA of the resource section, which doesn't do much good when you're trying to get at it from the file — I'll talk about that a little bit. It's the same thing as the Mach-O

program: it builds on all the common features, supports don't-cares, and everything gets handed in. In the end we added one extra feature where it'll go look for the common values throughout one field, and if it's the same it will put that value there, and wherever they differ you get the don't-care; you can kind of see that here with the file header and optional header. So here's how we did on the PE stuff. The first one there, my worst case, I only got a little bit over three hundred percent — so way less than 18,000, that's, you know... And actually, I think I

counted them: we had like twenty percent of the YARA signatures that hit one hundred percent. So this looks, yeah, useful, but if you look down there at the end we see some where we miss a lot, like cluster four there, one out of 100 — at least it's not zero — and I think there might have been, like, 15 where it misses everything, which I don't really understand, how it misses everything; I haven't figured that part out yet, because it was built from the cluster, but the numbers don't lie. Totally a work in progress, but, you know, we had some success.

And yeah, you know, maybe not, but the idea has merit, I hope. So, definitely room for improvement. Like I said, we don't support resources, so I have to convert the RVA to a file offset and then I can hopefully get some good clustering that way. We actually don't support offsetting from the PE header — the PE header can move around, so that can definitely throw off our signatures. We could catch more of our files, but then I would also not be surprised if this makes us catch more files that we shouldn't catch, so we'll see. And we'll start supporting ORs in there, not just ANDs, and fancier YARA stuff,

especially as their new module stuff is coming out. So yeah, you can try this at home; happy hunting. Do we have microphones? No? Yeah, I mean, we can repeat the questions; we have this one. [Audience] Hey, so how are you guys doing, like, testing and cross-validation? Without that, aren't you just going to, like, overfit like crazy? So, that's a good point. The only way we really know in this case that we are potentially over- or under-fitting is because we have pre-labeled data. I would imagine in the anticipated use case, I'll say, you generally don't know what you're trying

to label, which is why it's unsupervised, so it's hard to over- or under-fit unlabeled data, right? The output is the label, right — it's the cluster, this group. Instead of a priori saying all of these things in this group are, you know, malware XYZ and this is clean, and we're going to create a classification algorithm using random forests or SVMs or whatever else, this is the opposite: you're saying, I don't really know what my data looks like, so this is a data-exploration tool to say, tell me what my data looks like. And more so, the idea with this is: can we take things from

this more mathy Python realm that we work in and export them, so that people can find something — so if they've got a cluster, they can take something out of the Python realm and into wherever they have the ability to run a YARA signature. Does that kind of make sense? But anyway, the way we did the cross-validation was: since we had the labels, we were able to run the signature and say, did it hit on everything in the cluster, right? So we got our false positives, false negatives, and true positives by seeing what was in the cluster and then what the signature hit

on a given cluster. Okay. [Audience] Hi — what were the more informative features, and do you think you would be able to predict them when you were defining the features themselves? So, we did quite literally zero analysis on what the more informative features were; that's one of the follow-up things that we were hoping to do but never quite got around to. There's actually a really good paper that helped with the PE stuff, where they went through and found the seven most useful features to cluster PE files on, so that was really nice for the PE side, but it was just kind of a shotgun effect

for the Mach-O; you let the algorithm run. And yep, there were a couple of features we learned about from classification work that Dorsey did, like where the program jumps in — kind of like the main offset — which varies based on how old the binary is. That was just a really shitty feature that we figured out, in this domain, was awful. So that kind of stuff, where we understood the file format enough to go "this is entirely useless to us," we were able to remove; but as for a "here's a full analysis and this is why this is the most useful feature" — no, we just relied on the algorithm a little bit more. We

do have on our GitHub repo now some classification notebooks where we talk about the most important features, for the classification part at least, so we can use some of that knowledge for this. That's some of the reasons why we chose the file header, optional header, and resources: we had a little bit of prior knowledge. [Audience] You guys were talking about it being good enough up to a certain size — what are we talking about? Size, as in cluster size? You mean... oh, the size of the data that you're analyzing — scale, ah, I got you. Long day. So, some

of the algorithms are recursive and they just fall over after 10,000 or so, or a hundred thousand, right? A lot of that's a function of how many features you pull out as well, so unfortunately it varies, and since we didn't do any actual, true feature analysis I don't have a really good number for you, but good question. I thought I saw a hand — no? That doesn't count? [Audience] Hi, two questions actually. One is: can you filter the samples — can your algorithm filter by features? I mean, if I want to know which of those samples are calling some

libraries or doing this or doing that, can you do that, or do you have to do it at the very beginning? You'd have to do it up front — that would be part of the feature-selection part. You could do it; we didn't. We just did this smash-and-grab for relatively easy and hopefully informative features to extract. I mean, you could just as well do dynamic analysis on all of these: take all the registry keys and the various libraries loaded, parse them out and enumerate them, then feed those into the algorithms as features, and you'd probably get some really cool results. We didn't do any of that, but it's a really good idea. Okay.

[Audience] Second question is, have you tried combining this with indicators of compromise? No, but that's a good — I'd be curious, because I've been trying some stuff with these kinds of things, and when you combine two or more sources of information you get amazing results, because there are some similarities in between that you can play with and filter on. Good point. We've got five minutes, so a couple more questions. All right, we'll make sure you get your question in. [Audience] Are you making the data sets available as well? Probably not, but if you need help getting Windows or Mac samples, we can point you to where we grabbed the

vast majority of the samples online. Okay. [Audience] In which format are the results — just text files? Yeah, shitty text files — or would you want, like, JSON blobs? Oh, interesting — JSON blobs? Hell, I don't know; I think we did that for the Mach-O side — did we? We'll see what we did for the other ones and we'll do whatever we did before. Great, totally acceptable. No, it's this guy up front; I'm going to answer his question first.

He's so mad now because everybody keeps ignoring him. [Audience] Every cluster generates a signature? Yeah, yeah. We have homework — no, that's actually really good, and to be quite honest, I was totally against even looking at labels, and then Dorsey told me I was an idiot and that we should look at labels, so I'll blame him for not thinking of that as well, and I'll take a look. Mainly I just wanted to know how well we're doing; the clustering stuff is great, but how did it do? We needed some sort of metric. Thanks, Rick. Anybody else? Oh, all right, Pat, so many questions, that's fine. [Audience] It's a pretty simple question, right:

what algorithm — like, what mathematical function — do you use to determine distance between two features? It's generally Euclidean, except I think the default of DBSCAN uses Gaussian. Okay, I figured, right, because it's the easiest to do. Yeah, and you do get awesome results from it — anything you can think of; you could even try Manhattan. Awesome results, great place to start. For those who want to take this and do something really cool with it, there are a ton of other distances, distance algorithms — like Levenshtein, and, about to get mathematical on you, asymmetric graph isomorphism might actually refine that and allow you to get that 18,000 percent way down.
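For reference, the two numeric distances named here, on a toy pair of feature vectors — Euclidean is what the clustering mostly used; Manhattan is the easy alternative to experiment with:

```python
import numpy as np

a = np.array([3.0, 4.0, 0.0])
b = np.array([0.0, 0.0, 0.0])

print(np.linalg.norm(a - b))   # Euclidean (L2) distance: 5.0
print(np.abs(a - b).sum())     # Manhattan (L1) distance: 7.0
```

Most scikit-learn clusterers accept a `metric` parameter, so swapping distances is usually a one-line change.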

You know, that's right, yeah — and it might just be a case of really shitty signature generation, too. Yeah, right, exactly. I thought I saw a hand over here, and then one over there; maybe it disappeared, I don't know. Anyway, got more questions? That's okay. [Audience] Did you do any analysis of sub-clusters within, because based off of your features — especially with the YARA signatures — you could have a signature that will always match everything beneath it in another cluster? Yeah, no, that's kind of on our to-do, and then we ran out of time. I mean, we literally finished the slides, what, five hours ago this morning; you know how it goes — you get all the

data and you're like... yeah. Sweet. If there are no more questions, then — oh, this one over here. No? None? And that was another thing that we thought of, too: can we leverage strings and use, like, Jaccard distance or something else, and look at set unions to say, hey, these strings are common among all these files as well, rather than the purely numbers approach? Yeah, sure.
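The string-set idea sketched here: extract strings per file, then compare files by Jaccard similarity of their string sets (a hypothetical helper, not code from the talk):

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard similarity of two string sets: |A & B| / |A | B|."""
    return len(a & b) / len(a | b) if (a or b) else 1.0

strings_1 = {"connect", "http://evil.example", "GetProcAddress"}
strings_2 = {"connect", "http://evil.example", "LoadLibraryA"}
print(jaccard(strings_1, strings_2))  # 0.5 (2 shared out of 4 distinct)
```

One minus this value is a proper distance, so it could feed straight into the same clustering algorithms as the numeric features.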

So, more on where the problem comes from, at least from my point of view, and I think Dorsey's as well — we're researchers: you've got tons of files, and you generally know which one's which, but no matter how carefully you try to make sure everything's labeled, you just wind up with this pile of crap, like, I don't really know about these — so here, group them for me. And likewise, where you do know the groups: here, group it for me and see how well the grouping does, and then just generate the signatures for me so I can plug them into my incident-response process, or VirusTotal's API, or, you know, FireEye, or something that supports YARA signatures, to find similar

things, with you just being able to push a button. So, more exploratory and curiosity-driven than anything. I think we're out of time — sweet. Wow, perfect. Thank you so much.