
GT - Old Things Are New Again: Efficient Automatic Signature Generation for Malware Detection - Hyrum Anderson

BSides Las Vegas · 31:15 · Published 2019-10
About this talk
GT - Old Things Are New Again: Efficient Automatic Signature Generation for Malware Detection - Hyrum Anderson Ground Truth BSidesLV 2019 - Tuscany Hotel - Aug 07, 2019
Transcript [en]

All right, welcome back everyone. This is BSides Las Vegas 2019, day two. The name of this talk is "Old Things Are New Again," presented by Hyrum Anderson. Before we begin, just a few quick announcements. First off, we'd like to thank our sponsors, especially our Inner Circle sponsors Critical Stack and Valimail, as well as our Stellar sponsors Amazon, Microsoft, and Paranoids. It's support from these sponsors, along with our other sponsors, donors, and volunteers, that makes this event possible. Now, this talk is being live-streamed, so we ask that, as a courtesy to our speaker and to the audience, you make sure right now that your phone is set on silent.

Also, if you have any questions, please use this audience mic so that our YouTube audience can hear you; just raise your hand and I'll be sure to bring the mic over. And with that, we're ready to begin, so please let's welcome Hyrum Anderson.

Thank you for the introduction. I'm delighted to be back at BSides Ground Truth. My name is Hyrum Anderson, and I'm the chief scientist at an endpoint security company called Endgame. My secondary objective today is to be on somewhat of an apology tour for signatures; I spent the first five years of my career dissing them. My primary objective, actually, is to get you to lunch.

So I'll try to be brief in my remarks today, and probably just hope to pique your interest, so that hopefully you and I can chat in more detail during lunch, or you can find me at the Endgame booth at BSides. I think it's fair to say that one thing machine learning has done really well in security is malware, so we're going to talk about malware today; in particular, it's good at detecting malware before it executes, with a low false positive rate. The reason it's sort of taken off, I think the primary reason, is because you can detect new, never-before-seen samples at a modest false positive rate.

That's the primary driver for the success of machine learning for static malware detection. I think today any serious endpoint security company has to say the words "machine learning" when they talk about their anti-malware solution, because of this fact: it doesn't just memorize the past, it also projects a little bit into the future. One important element that I think is not often stated about what machine learning has done to our industry, particularly for malware, is that it has actually brought about a paradigm shift. What I mean by that is: in the old days you had cubicles of malware experts writing signatures, and today you have a data set and somebody just turning the crank, somebody who hardly has to know anything about malware in order to produce a new model.

So the automation component is actually a huge benefit to modern anti-malware products, and in fact this hand-cranking part is so easy that you don't really have to know much about malware; even a data scientist can do it, right? Now contrast that with the quote-unquote old school: we have signatures on this side. Signatures do not generalize to the future; they're terrible at predicting new families. But you know what they do well? They do exactly what you told them to do. They are excellent at capturing the known malware at almost a zero percent false positive rate, and in fact that is better, in most cases, than machine learning.

So what I want to start off with is this exercise of understanding the strengths and the weaknesses of these two things. Let me go back; that was a mistake. With signatures I can get a hundred percent true positive rate and almost zero percent false positive rate. That does usually require domain expertise and some sort of manual intervention, but at the end of the day, when you're finished, the signature is totally interpretable, in a way that machine learning isn't quite yet. What do I mean by interpretable? Here's a rule: this is a YARA

signature that detects LockerGoga, and it was written by Florian Roth. If you ignore almost everything except this bottom condition, which I've highlighted, you know exactly why a file is being called LockerGoga. It happens for exactly two conditions; there's an OR right here. Either it has a PE header (this is the MZ), its file size is less than 4 megabytes, and it has one of these five strings that begin with an x; or it contains this wide string, "This may lead to the impossibility of recovery of certain files." So this is a signature. It's imperfect, because by changing just a few things I can break it, but it exactly captures all of the LockerGoga samples already released in the wild, and you know exactly what it's doing.
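For the curious, here is a minimal Python sketch of the shape of that highlighted condition. The x-string values below are placeholders, not the contents of Florian Roth's actual rule, which is worth looking up on its own:

```python
def looks_like_lockergoga(path: str) -> bool:
    """Shape of the highlighted condition: (MZ header AND filesize < 4MB
    AND any x-string) OR the wide ransom-note string."""
    data = open(path, "rb").read()
    # Placeholder values; the real rule lists five specific x-strings.
    x_strings = [b"x_example_1", b"x_example_2"]
    # "Wide" in YARA means the UTF-16LE encoding of the string.
    wide = ("This may lead to the impossibility of recovery "
            "of certain files").encode("utf-16-le")
    pe_and_small = data[:2] == b"MZ" and len(data) < 4 * 1024 * 1024
    return (pe_and_small and any(s in data for s in x_strings)) or wide in data
```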

Where we want to live is actually in this sweet spot in the middle. We're going to let machine learning do its thing on the unseen stuff, but we also want to get away from the old way's scaling problem, the human scaling problem. We want the automation of machine learning, where I just point algorithms at data and out come signatures for all of the known bad samples. These rules are perfectly interpretable; you know exactly what each one is doing, and it doesn't require deep expertise: even a data scientist can turn

the crank and make these signatures without totally understanding what's happening. Of course, we will never have good true positive rates on new samples and new families; that is the role for machine learning. So the goal is to use these two things together, rather than one or the other alone. For our YouTube audience, who can't see my laser pointer: we want to use automatic signature generation to bring the best of automation together with low false positive rates. I just want to clearly point out that this desire to live in this sweet-spot middle is not at all a new desire or a

new thing; there are several excellent approaches and solutions out there already. I'm going to name three. The first, by Florian Roth, is called yarGen, and it was actually not intended to be totally automatic; it's a semi-automated thing. You point yarGen at a batch of samples and it will create candidate YARA rules that you should review by hand and tweak until they work right for you; it's meant to be refined by a human. The other two, VxSig and BASS, are very similar; they're based on the same work. VxSig was recently released by Google; Halvar Flake tweeted about it that it's essentially the same as BASS.

They're both based on Christian Blichmann's 2008 thesis; essentially, it takes IDA Pro disassembly and uses a longest-common-subsequence algorithm to find signatures that describe opcode sequences for malware families. For today I'm going to be talking about something more like the first approach: we don't want any reliance on disassembly; this is simpler; we just want to use the raw byte strings for this work. In my education I was taught, and I believe, that a talk is an advertisement for a paper. So here's a paper, which was presented just on Monday at a workshop on cybersecurity in Alaska, and now today at BSides. You'll note that on the author list I'm way down here; I do want

to call out my really smart colleagues from the Laboratory for Physical Sciences and from UMBC, the University of Maryland, Baltimore County, who have done far more work on this than I have. Please, if you can, find this paper today on arXiv; it'll give you a much more thorough explanation than I will today. So let's talk about n-grams real quick. N-grams have been the bread and butter for a whole host of machine learning applications, ranging from natural language processing to bioinformatics, and of course even malware. In particular, an n-gram is a way to splice a long string up into little tokens, where the number of tokens included is the number n. So the unigrams of "this

is a sentence" would be each of the words individually; the bigrams, or 2-grams, would be "this is", "is a", "a sentence"; and so on, with trigrams using three tokens per 3-gram. For malware, n-grams have been used on tokens that represent raw bytes, and that's what we're going to talk most about today, but all of the same techniques I'm discussing could also be used, for example, on disassembly mnemonics, or Windows API function call sequences, or Windows security event sequences; all the same things apply. But we're going to focus on the raw bytes today.
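A minimal sketch of that idea in Python (assuming nothing beyond the standard library; the sample bytes are just an arbitrary PE-header-like example):

```python
def byte_ngrams(data: bytes, n: int):
    """Yield every contiguous n-byte token in a byte string."""
    for i in range(len(data) - n + 1):
        yield data[i:i + n]

# Word-level analogy: the bigrams of "this is a sentence".
words = "this is a sentence".split()
bigrams = [tuple(words[i:i + 2]) for i in range(len(words) - 1)]
# -> [('this', 'is'), ('is', 'a'), ('a', 'sentence')]

# Byte-level 6-grams, the kind used for malware:
six_grams = list(byte_ngrams(b"MZ\x90\x00\x03\x00\x00\x00", 6))
```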

Now, what makes malware raw bytes really interesting and unique to me, compared to most of these other machine learning and natural language processing domains, is really two things. The first is the sequence length. Think about a document for natural language processing: if you have ten thousand words in your document, that's a giant document for NLP, but for malware, a million bytes is actually kind of average. So just the length of the documents and tokens is different, and it makes you wonder whether the n-grams we developed for document parsing are maybe not well suited for parsing a "document" as different as malware. The second thing I find interesting is just the size of the n-grams.

For most of machine learning in natural language processing, compression, and speech, it's most common to use unigrams or bigrams or trigrams, but almost never do you see n equal to four or more. There have been a few efforts with n equal to six, but almost nothing has exceeded that, and that's interesting for a number of reasons. The first is the sheer vocabulary size. We're talking about byte sequences, and the total number of possible byte sequences for 6-grams is something like two hundred and eighty trillion; that's the total number of possible combinations you could get. In

order to extract 6-grams on five terabytes of data (we did this), it took a twelve-machine cluster one month to find the 6-grams. So 256^n is painful, and we can barely do six. At the same time, if you think about malware, you almost wonder: is six really enough, if you want a gram to mean anything at all? With six bytes I can capture three wide characters: I can get the "Mic" of "Microsoft," and that's it. With n equal to six I can't even capture a full x86 instruction, which can go up to 15 bytes. So it makes you wonder, are we doing it wrong? If I

want a gram to mean something, am I doing it wrong? So the question we asked ourselves was kind of a crazy idea: what if there were a world in which we could extract a 256-gram, or a 512-gram, or a 1024-gram? Spoiler: the 1,024-gram we're going to call a KiloGram; do you get it? Okay. If that were possible, would it even be useful? Intuitively, the data scientist in you and me is thinking, that's a terrible idea, you're overfitting to maybe a single sample, it will never be useful. But spoiler alert: it turns out that signatures work in malware, and people reuse very long strings of bytes across a malware family, and we'll see that later.

Okay, so were it possible, we would like to explore whether or not we should even consider n-grams bigger than 6, which has never been done before, and we're going to approach it the following way. There is hope, and it's based on two observations. The first is that n-grams follow a power law. The power law means that there are a few 6-grams that are used everywhere, but the frequency drops off according to a power law, which looks linear on a log-log plot; most n-grams are used in only a few samples, and there are a few dominant ones. The second observation is that for writing signatures for malware, we actually only care about the dominant ones. If I want

to write a signature for LockerGoga, I want my n-gram to cover all of the LockerGoga samples, right? So I can throw away all the other n-grams. Think about that 256^n space: we only care about a tiny pinpoint in that space, the pinpoint representing the very top of this power-law distribution. Okay, so given those two observations, let's sketch a solution. First, we're just going to focus on the top K; let's call K a million. A million out of 280 trillion is a pin drop, but a million is still a lot of grams to worry about. And we're going to

use an algorithm that does two passes in a very efficient way. In the first pass we're just going to find candidates for the top K; we'll use a hash table for that. Hash tables have collisions; we'll deal with it. In the second pass we'll use that hash table as a filter to extract only the candidate top-K values, and this time actually retain the n-grams themselves. Let me actually go through this about three times, so it'll be very clear how simple it is. And I'm not going to show this today in the talk, but if you care about such things, go read the paper: because of the power-law behavior, and because we only care about the top K, we

can actually bound the error rate in recovering the true top-K tokens in the presence of these hash-table collisions. So we can say with certainty that we are actually giving you the real top K of the n-grams. Okay, let me just walk through what this algorithm looks like with a very fancy simulation. The blue thing is my hash table; think of this hash table as being, like, 16 gigabytes: it can fit in your laptop's memory. Then I have a bunch of files on disk, and for the sake of this demonstration we're going to say there are three grams in my

set of files, and I have a hash function, and the hash function computes a location for each gram; so remember, "this is a sentence" would get a location to look up in my hash table. Gram 1 says: I'm going to go to that location in the hash table and increment. Gram 2 says: okay, you belong over here. And let's say gram 3 is a collision, and I go back to that same hashed location. I've just shown you the first pass of the KiloGrams algorithm: I just count things in a hash table, and I throw away all the grams. Think about it: 234 billion things; I can't possibly keep around 234

billion 6-grams in memory or on disk, but I can keep around a 16-gigabyte hash table with counts in it. In the second pass, I'm now just going to use this hash table as a filter: every time I pass over one of these grams again, gram 1, I see that he belongs to one of the dominant buckets, so I'll keep that gram; I'll keep the actual bytes and store them in some data structure or database. Gram 2 did not meet the threshold, so I discard him, and that will be the case for the majority of grams I encounter. And then gram 3, the collision: I'll keep him too, and guess what, I've

recorded him and his count in this new data structure, and I can now tell the colliding grams apart: even though their hashes were the same, I have the original byte sequences back. Okay, that was the second description of the algorithm. For the last time through, I'm going to show you 34 lines of Python code that will allow you to implement this on your own. You can't read this because it's too small, but if you zoom in on your camera phone, maybe you can. There are four parts. I create a hash table; it's a numpy array. Then I do a map-reduce to accumulate counts into this hash table, and to make this fast you can

use clever hashing tricks, like a rolling hash algorithm such as the Rabin-Karp hash. The third part is an np.partition, a numpy partition sort, to find which are the dominant K counts in my hash table. And the last part is pass two, where I use that hash table as a filter and store my true n-grams in, guess what, a Python dict. That's my fancy space-saving structure. Super easy.
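The slide itself isn't legible here, but a minimal sketch of those same four parts might look like the following. This is my reconstruction under stated assumptions, not the paper's exact 34 lines: it uses Python's built-in hash and a small default table rather than the Rabin-Karp rolling hash and the 16 GB table from the talk:

```python
import numpy as np

def top_k_ngrams(files, n=1024, table_size=2**27, k=1_000_000):
    """Two-pass top-K n-gram extraction with a fixed-size count table."""
    # Part 1: the hash table is just a numpy array of counters
    # (sized to taste; the talk used ~16 GB).
    table = np.zeros(table_size, dtype=np.uint32)

    def bucket(gram: bytes) -> int:
        # Assumption: any decent hash works; the paper uses a
        # Rabin-Karp rolling hash so each gram costs O(1), not O(n).
        return hash(gram) % table_size

    # Part 2: first pass. Count every n-gram into its bucket and
    # discard the grams themselves; collisions are tolerated.
    for path in files:
        data = open(path, "rb").read()
        for i in range(len(data) - n + 1):
            table[bucket(data[i:i + n])] += 1

    # Part 3: np.partition finds the K-th largest count; buckets at
    # or above that threshold are the "dominant" ones.
    threshold = np.partition(table, -k)[-k]

    # Part 4: second pass. The table is now a filter: keep the raw
    # bytes of any gram landing in a dominant bucket in a plain dict,
    # which also disambiguates colliding grams by their true bytes.
    counts = {}
    for path in files:
        data = open(path, "rb").read()
        for i in range(len(data) - n + 1):
            gram = data[i:i + n]
            if table[bucket(gram)] >= threshold:
                counts[gram] = counts.get(gram, 0) + 1
    return counts
```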

Okay, so now let's look at some experiments that I hope you'll find interesting, now that we can compute n-grams of ludicrous sizes like 1024. Are they at all useful? First I want to tell you about the data sets; we have four of them. One was provided by an industry partner and contained two million and change portable executable (PE) files from 2014 and 2015; the years are important, as I'll mention later. Then we used the EMBER PEs from 2017. We have a PDF data set, and we also have some VirusShare data; the VirusShare set is just malware families, the top 20 Windows malware families, where the family label is given by AVClass. And we ran the following experiments across them. The first is: how large

should n be? Our data science intuition says that you're silly to choose a large n, because you'll be overfitting. So what we did is take the top hundred thousand n-grams across each of these data set corpora, build an elastic-net model, and try values of n ranging from n = 8 to n = 1024. The first thing is, your intuition and mine was correct: as you increase n, performance drops. The graph I'm showing here is a balanced accuracy; each of these data sets has a different size, so I'm weighting them and aggregating them in a balanced way into

this accuracy number, and the reason it's accuracy instead of another metric is that one of these data sets has no benign samples in it, so accuracy is a way to wrap them all together. What you should notice is that as I increase n, performance drops from close to 100% down to around 75 percent balanced accuracy. That's intuitive. One thing I found really surprising is that n = 32, which is a ludicrously large n-gram size for most applications, is really no worse than n = 8; that was kind of interesting. Totally unexpected and surprising is that when we went to n = 1024, this worked at all. In fact, if you

look at the accuracy numbers for n = 1024 on the bottom, we're getting something like 92% accuracy on many of the data sets. The other very unexpected thing: you and I have been taught since childhood that signatures are brittle and will only last so long. But in fact this last column here, the "to EMBER" column, is where we extracted these signatures on the 2014-2015 data and then applied them to the 2017 data, and that's the accuracy number we're reporting. If you look, the accuracies are in the high 90s for the not-so-ludicrously-large n-grams, and actually 72 percent for the 1,024-grams. So that's kind of interesting.
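As a rough illustration of that experimental setup, not the paper's actual code: each sample becomes a binary vector of which top-K n-grams it contains, fed to an elastic-net classifier. The featurization below and the choice of scikit-learn's SGDClassifier as the elastic-net solver are both my assumptions:

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

def featurize(samples, top_grams):
    """samples: list of raw file bytes; top_grams: list of byte n-grams."""
    X = np.zeros((len(samples), len(top_grams)), dtype=np.float32)
    for i, data in enumerate(samples):
        for j, g in enumerate(top_grams):
            if g in data:          # presence/absence of the n-gram
                X[i, j] = 1.0
    return X

# Elastic net = mixed L1 + L2 regularization on a linear model.
clf = SGDClassifier(loss="log_loss", penalty="elasticnet", l1_ratio=0.5)
# clf.fit(featurize(samples, top_grams), labels)
```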

What we did with this was the following. The intuition should be: if I use n = 1024, I'm going to have low recall but extremely high precision, right? There are going to be only a few samples that match such a specific byte string. So we're going to form the algorithm this way: we're going to create a YARA rule using these ludicrously large n-grams, these KiloGrams, and we're going to start at the extremely-high-precision, low-recall end, and we'll just keep decreasing n until we have covered our family well enough. Does that make sense? In words: we start with n = 1024 and we have a little while

loop: while n is bigger than 64, extract signatures, make sure they don't exist in another family or in a benign set, and then build a very fancy OR statement. I'm just going to OR them all: if this string exists, or this string, or this string, that's my YARA rule, okay? And repeat until I've met my target coverage.
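A minimal sketch of that loop, assuming files are raw bytes and that extract_top_grams is the two-pass extractor from earlier; the halving schedule for n is my guess, since the talk only says "decrease n":

```python
def appears_in(gram: bytes, files) -> bool:
    return any(gram in data for data in files)

def coverage(grams, files) -> float:
    hit = sum(any(g in data for g in grams) for data in files)
    return hit / max(len(files), 1)

def build_rule(family, other_families, benign, extract_top_grams, target=0.9):
    """Greedy KiloGram rule builder: start long, shrink n as needed."""
    chosen, n = [], 1024
    while n > 64 and coverage(chosen, family) < target:
        for g in extract_top_grams(family, n):
            # Keep only grams unique to this family: absent from the
            # benign set and from every other family's samples.
            if not appears_in(g, benign) and not appears_in(g, other_families):
                chosen.append(g)
        n //= 2  # assumption: halve n each round
    # The final rule is one big OR over the surviving strings.
    return lambda data: any(g in data for g in chosen)
```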

This really simple algorithm, which is a few lines of code: if I compare it to yarGen, and I will give the caveat, to Florian Roth, that yarGen was never meant to be fully automated and here we've run it fully automated, then compared to fully automated yarGen, KiloGrams outperforms it across the board on this set of 20 malware families. There are a few malware families on which both failed spectacularly, but for the most part we have very high F1 scores. Okay, so why else would you want really large n-grams? Well, if they're really large, you know exactly what they're doing, right? And I want to take you on a tour of some of the n-grams discovered in this malware corpus that I think are really interesting. So first: which n-grams are more common in malicious binaries than benign? Here's one. This gram that was discovered is a registry key, and you'll

recognize it as a registry key commonly used for Run-key persistence. Guess what: a lot of malware tries to persist using this Run key, and, knowing nothing about the malware's behavior, only that this key existed in 12% of all malicious files in EMBER, this signature was discovered totally automatically; it required no security brains, right? Here's another interesting one. This one did require brains, not mine, to understand what it did. We found, I think this was a 64-gram, maybe a 128-gram, a byte sequence that was very common in malware, and I handed it over to my friend Bill and said, Bill, what is this? He disassembled it, and he

said, look at these bytes: that's V-i-r-t-u-a-l-A-l-l-o-c, right? So this is assembling the string "VirtualAlloc" in machine code, to be used later with GetProcAddress so the malware can load DLLs dynamically in order to evade static machine learning detection. This is really cool, because this was in 8% of malicious binaries, and KiloGrams just found it; I wasn't looking for it. That's pretty cool. Okay, this next one is from the PDF data set; it's a 512-gram that got picked up. It turns out that malware authors are lazy, just like you and I, and they reuse code, and here is an

obfuscated JavaScript string used to exploit a particular version of a PDF reader. I honestly have no idea what it does, but it's used all over: this exact string you can find in hundreds and hundreds of malicious PDF samples, and KiloGrams just found it because it was popular. That's it; this is a cool signature. Okay, last deep dive into signatures; this one was also kind of fun. There was another signature used all over, in thousands of malicious binaries in the EMBER data set, and I gave it to Bill, and he tried to disassemble it, and it wouldn't disassemble. Finally he asked, where does this signature occur? And it turned out it lit up

everywhere in the resource sections. So we went to the resource sections, and it turns out this icon is used everywhere. If you see this icon on your desktop, do not click! Okay, so this is a totally brittle signature; I could change one pixel and the signature would fail, but guess what, malware authors are totally lazy, and they're reusing this icon all over the place. Okay, I'm almost done; I promised you that my first objective was to get you to lunch on time. If you'd like to chat more later, please do; I have lots more slides that show caveats and more interesting findings, but I'll

just give a summary. I talked to you about KiloGrams, which is a clever play on words that I didn't think of and wish I had. It's simple, and it's insanely efficient: remember how it took a month on a twelve-machine cluster to extract 6-grams? The secrets to KiloGrams are a streaming hash algorithm and a hash table; that's it, in a few dozen lines of code. And I can extract from 1 million samples on my laptop in less than a day, so it's pretty efficient. And it's actually surprisingly good, and interpretable. We saw that it's never going to beat

machine learning on never-before-seen samples, but for some malware families it has 100% detection at a zero percent false positive rate. So this is something you'd want to use in addition to your machine learning, for the known stuff, the known bad. And that's my final point. I guess the overall picture here, in my apology tour, is that signatures are terrible for detecting new things, but they are awesome for detecting the things that are already out there, and maybe you should use them. Thank you. [Applause]

We have time for about two or three questions; the more questions, the less lunch, just so you know.

[Audience question] Do you see the antivirus machine-learning industry starting to

focus on classifying the non-KiloGram-classifiable malware, rather than spending model capacity on the samples signatures already catch? Well, I've often thought about that. Think about neurons in your neural network, or think about boosting stages in your gradient-boosting classifier: why am I spending three boosting rounds trying to catch one family when one signature can do it all, right? So I've thought about that, and I think that's why a lot of mature companies have both signatures and machine learning as heuristics: sometimes you just want to get the easy ones right away, and let the machine learning focus on the hard ones. So I like that suggestion; I would agree

with that. John? [Audience question] To talk about the brittleness of the signatures still sticking around when using the n-grams: in NLP, skip-grams can be used to solve some of that brittleness. Have you had a look at that sort of thing? Yeah, we didn't look at it in depth. I feel like there are multiple ways to use this; in the paper we actually explored two approaches, one purely signature-based and one using the n-grams in a model. What I presented was all on the signature side, but on the model side they're definitely much more robust. [Follow-up] But fundamentally, at the n-gram level, instead of, say, "the

quick brown fox jumped over," a one-skip n-gram... Oh yeah, we didn't do that. So one thing about malware is that it's positionally insensitive: my .text section could be at a 0x400 offset or somewhere else. So one thing we never did is any skip-style downsampling of positions. But here's what we did do; let me show you this. Here are the most common 32-grams, and you notice something wrong about them, right? They're massively redundant. And this speaks to that: the naive thing to do here is to say, I'm going to skip every third byte and reduce my set, but because of the positional insensitivity, you are in

danger of missing out on something good. So instead we have this thing called a hash stripe: if the hash modulo some hash-stripe number, say 16, is equal to 0, I ignore that gram. That's a way to get around over-representing a single string. I guess I haven't thought about how that helps with the brittleness of the n-gram.

[Audience question] Just real quick, what hash function did you use? In all of our experiments we used a Rabin-Karp streaming hash. All that means is that, say I've got eight bytes coming in, I can put them through a shift register: I

multiply every position by a number, and when a byte comes out of the circular buffer, I know what to subtract and what to add. It's very fast; every hash costs one multiply and one addition.

[Audience question] I'm not super familiar with that, but is there any reason you didn't use any kind of locality-sensitive hashing function? Yeah, exactly: we don't want that. We want a strict hash function; we want to minimize the number of collisions, so things that are close to each other we do not want in the same bin.
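A minimal sketch of a Rabin-Karp rolling hash with that hash-stripe filter, as I understand the description. The base and modulus constants are arbitrary choices, not the paper's, and the stripe polarity (ignoring grams whose hash % stripe == 0) is taken from the talk's wording:

```python
def striped_rolling_hashes(data: bytes, n: int, base=257,
                           mod=(1 << 61) - 1, stripe=16):
    """Yield (offset, hash) for each surviving n-gram, O(1) per step.

    Rabin-Karp: treat the window as a base-`base` number mod `mod`;
    sliding the window removes the outgoing byte and adds the new one,
    so each hash costs roughly one multiply and one add.
    """
    if len(data) < n:
        return
    top = pow(base, n - 1, mod)   # weight of the byte leaving the window
    h = 0
    for b in data[:n]:            # hash of the first window
        h = (h * base + b) % mod
    for i in range(len(data) - n + 1):
        if h % stripe != 0:       # hash stripe: drop the matching grams
            yield i, h
        if i + n < len(data):     # slide: subtract, multiply, add
            h = ((h - data[i] * top) * base + data[i + n]) % mod
```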

[Applause]