
Hi, I'm Anderson, and I'm going to be talking to you about your model. Thank you for coming. So yeah, today we're going to talk about your model. Your model is not that special, but this is more of a journey and a statement about models than about you. You are special, and in fact your model might be special, but it's probably not your fancy model architecture that makes it so. That literally is the punchline of my talk, and if you don't have time for anything else, this would be a good time to leave. But if you'd like to stay for the other points of my talk, I'm going to take you on a tour through deep learning, starting simple and going to complex, showing how off-the-shelf concepts from deep learning and the image domain might be applied, and judging how successfully they can be applied, to building machine learning models for Windows PE malware. Lastly, I'll close with a few slides of editorial comments about where I think you should be spending your real effort and time in building a machine learning model.

Before I move on, I'd like to note that if you go right now to github.com/endgameinc/youarespecial, you will find code, with Jupyter notebooks, where all of this is available to you. What you would need to provide are data: buckets full of malware and buckets full of benign files. Then you too can create your own machine learning model from scratch.

First, for context, a little bit about me. I have a very fancy PhD, and the rest of the points on this slide are supposed to convince you that I have a lot of experience in machine learning, which apparently gives me license to have my photograph taken with a wrinkled shirt and tousled hair, as you've seen in this picture. I'll also note that since my PhD work, a lot of smarter people have come along, mostly made it irrelevant, and
have made machine learning accessible to the barbaric masses, and otherwise reduced my confidence and my ego in having gone out for a PhD. So with that, I want to tell you a story about how machine learning used to be in the old days. First there was dataset curation, where people would have to bring together datasets that were carefully curated; this would take an expert and lots of time. In step two, domain experts would have important meetings about which labels should be applied to which data samples, to curate the dataset. Domain experts would furthermore decide on how to describe those data elements. If these were images, there were lots of papers about features, like SIFT features, that were supposedly useful for characterizing objects within an image, so that models could be built around them. Then came model specification: a data scientist who spoke a lot of math would propose various model types, throwing around words like VC dimension and the no free lunch theorem and other things that would make him sound smart, which he would then present to his department, and they would believe him because these were big words. And out would come a model that was pretty decent, and it
was good for job security. But then deep learning came along, with its rich daddy and its millennial skinny jeans, and deep learning changed the perception a bit about what is achievable with machine learning. So in the new paradigm, and this is a little bit of a joke, one simply downloads some data from the world wide web, unleashes those images on Mechanical Turk, lets the masses label them, and immediately has a curated dataset. A model is trained; you'll notice that steps three through five are replaced by deep learning. And then comes model validation. No more does the data scientist necessarily need to speak math; instead, he imports TensorFlow or Keras. This has become extremely accessible to a lot of people, and that's a good thing, but what it has done is make me slightly less self-confident about my PhD and give the millennials more time to tweet about how great their models are.

With that introduction, I'd like to show you, from ground zero, how we're going to develop a PE malware detection machine learning model. If you were here earlier, there was a great talk that introduced machine learning and its building blocks; I'm going to
review, in just three minutes, some of those that are important for today's talk. Number one, and by the way, the notation will be like this: I'll draw a picture, give it a name and a little description, and show one line of code in a framework called Keras that would allow you to instantiate that structure. So, this is logistic regression. What it does is take some inputs and, from those inputs, create one number; if you choose an activation function that squashes that one number between zero and one, this is called logistic regression. It's a building block that one usually applies to the last layer of a neural net, so that one can create a model where one means malicious, zero means benign, and some number in between means somewhere in between. That's what logistic regression is for. Second are fully connected layers, and this is simply a transformation from an input, which could be the first input or some intermediate input, to a different representation, often using a smaller number of numbers to represent the input. That's achieved with, again, Dense in Keras; it's the same as logistic regression except that I might use a different activation function. A popular one is called the rectified linear unit, or ReLU, which is on for positive inputs, just passing them through, and off for negative inputs, passing a zero.
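As a rough sketch of the arithmetic behind these two building blocks (this is illustrative plain Python with made-up weights, not the code from the repo):

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))   # squashes any number into (0, 1)

def relu(v):
    return max(0.0, v)                  # passes positives, zeros out negatives

def dense(x, weights, biases, activation):
    """One fully connected (Dense) layer: z = W.x + b, then an activation."""
    z = [sum(w * xi for w, xi in zip(row, x)) + b
         for row, b in zip(weights, biases)]
    return [activation(v) for v in z]

# A made-up 3-number input through a 2-unit ReLU layer, then logistic
# regression (a 1-unit Dense layer with a sigmoid) as the output layer.
x = [0.5, -1.2, 3.0]
hidden = dense(x, [[0.1, 0.2, 0.3], [-0.4, 0.5, 0.6]], [0.0, 0.1], relu)
score = dense(hidden, [[1.0, -1.0]], [0.0], sigmoid)[0]
assert 0.0 < score < 1.0   # interpretable as "probability malicious"
```

In Keras, each of these is a one-liner, `Dense(units, activation=...)`; the only difference between a hidden layer and the logistic-regression output is the activation you pick.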
Lastly for this talk is the convolutional layer. A convolutional layer is actually very much like a dense layer, except that it's restricted to local interactions: each output, like f right here, takes as input only its three most adjacent neighbors. That size is called the kernel size; the kernel is the thing that's actually being learned, and the kernel is swept across the input, applied consecutively to, say, the first three inputs, then the next three, and so on, to produce the outputs. This comes from a long history of image processing, where convolutions are the key atom used to extract edges and other things. In convolutional neural networks there's usually not just one filter but many, many filters; here there are, say, 128 different little kernels of length three being learned, all represented in this output. In Keras that's Conv1D, where I specify the number of filters and a kernel size. These other layers I won't refer to today, but for the sake of completeness, and because I wanted to prove to you that my PhD was worthwhile, I'm showing recurrent neural networks and long short-term memory networks, which are useful for keeping state, much like a computer would, when learning from their input.
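The kernel sweep can be sketched in a few lines of plain Python (the kernel values here are made up for illustration):

```python
def conv1d_valid(signal, kernel):
    """Sweep one learned kernel across a 1-D input ('valid' padding):
    each output sees only its kernel_size adjacent neighbors."""
    k = len(kernel)
    return [sum(kernel[j] * signal[i + j] for j in range(k))
            for i in range(len(signal) - k + 1)]

# A size-3 kernel acting as an edge detector, swept over a step signal.
# In Keras, Conv1D(filters=128, kernel_size=3) learns 128 such kernels.
signal = [0, 0, 0, 1, 1, 1]
edges = conv1d_valid(signal, [-1, 0, 1])
assert edges == [0, 1, 1, 0]   # fires exactly where the signal jumps
```

The edge-detector kernel is the classic image-processing atom mentioned above, just in one dimension.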
So let's build our first model. This model is called a multi-layer perceptron, and the building blocks we're going to use are logistic regression for the output and dense layers everywhere else. In code, that turns out to be really just about five lines, maybe sixteen with comments: we define a model in Keras and, skipping the first line for a moment, we've added a dense layer and then an activation layer. These other elements, dropout and batch normalization, are sometimes helpful so that, as we saw in the previous talk, we don't overfit to the data. So for a fairly sophisticated model, in about eight lines of code I've created a multi-layer perceptron.

Now, the thing about this multi-layer perceptron is that it's going to take an input and learn intermediate representations of that input that are useful for the task at hand, and the task we're giving it is to decide whether the input belongs to the benign set or the malicious set of data we provided. This might seem like magic, but by applying simple calculus rules, one can define a loss function on the output and essentially assign blame for high losses all the way back to the input; where the blame is high, those are the things we're going to twiddle and tinker with until the loss becomes low. Deep learning is all about minimizing the loss, which in turn propagates, through something called backpropagation, all the way back to the input, so that all of these edges, which define how to multiply and add things together, minimize the overall objective.
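That blame-and-twiddle loop can be sketched for a single logistic-regression weight (a toy example with made-up numbers, not the repo's training code):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# One feature, one weight, one sample labeled 1 (malicious).
# Cross-entropy loss for label 1 is L = -log(p); calculus says the
# "blame" on the weight is dL/dw = (p - y) * x, and gradient descent
# twiddles the weight in whatever direction lowers the loss.
x, y, w, lr = 2.0, 1.0, -1.0, 0.5
losses = []
for _ in range(50):
    p = sigmoid(w * x)
    losses.append(-math.log(p))
    w -= lr * (p - y) * x       # step where the blame points

assert losses[-1] < losses[0]   # the loss goes down, as promised
```

Backpropagation is the same idea applied layer by layer, with the chain rule carrying the blame back through every edge of the network.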
This is nice, but what's the input? For our very first model (there's code provided in the GitHub repository) we're going to define features. I didn't spend a lot of time on these, but I'll show you they're actually pretty good for a first-cut model. Features are just a way to take the raw bytes of a PE file and condense them into, let's say, 2,500 different numbers that represent the malicious or benign file. Most of these have been published in previous works, and I've implemented them here in the code. They are things like: general file information, like how big the file is and when it was compiled; header information, like whether the checksum checks out for the PE file; section info, like the section names and section sizes (in a PE file there are different sections where data, code, resources, and the import address table are contained, and we extract information about those as part of the feature vector); imports information, where we literally query the PE file and ask which libraries are imported and which function APIs are used, and include that in the feature vector; and exports, if the PE file has any, like a DLL would.

Then there's a second class of features that don't require parsing the PE file at all. I just loop over the bytes and take a histogram of them, so if the file has lots of zeros in it, there will be a spike there, which might be indicative of some unused space or something. I also compute a byte-entropy histogram, courtesy of our friends Josh Saxe and Konstantin Berlin, which makes an association between byte values and the entropy of the neighborhood they appear in within your PE file: for example, whether ASCII characters appear in sections of relatively low entropy or relatively high entropy. And lastly, I extract some strings, and this is really simple: count the number of times I see a prefix that looks like a registry key, count the number of times I see C:\, really just string counting. I throw all of these together into a feature vector.

Now, the astute students will have noticed that for some of these things, like section names, there can be very many values; a section name can be almost anything eight characters long. Because I want a small feature vector, I'm using a common trick in machine learning called the hashing trick, which maps an input through a hash function that tells you which bin, in some notional fixed-size array, to increment your counter in. So this is like a noisy histogram of those input sections. We've employed that trick all over.
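Both no-parsing tricks can be sketched in plain Python (the hash function, bin count, and example strings here are stand-ins, not the repo's exact implementation):

```python
import hashlib

N_BINS = 16   # arbitrary small bin count, for illustration

def hashed_histogram(names, n_bins=N_BINS):
    """The hashing trick: a hash function decides which bin each
    arbitrary string (e.g. a section name) increments, giving a
    fixed-size, slightly noisy histogram."""
    hist = [0] * n_bins
    for name in names:
        digest = hashlib.md5(name.encode()).digest()
        hist[int.from_bytes(digest[:4], "little") % n_bins] += 1
    return hist

def byte_histogram(data):
    """A raw-byte feature that needs no PE parsing at all."""
    hist = [0] * 256
    for b in data:
        hist[b] += 1
    return hist

sections = hashed_histogram([".text", ".rdata", ".data", "UPX0"])
assert sum(sections) == 4 and len(sections) == N_BINS

hist = byte_histogram(b"\x00" * 100 + b"MZ")
assert hist[0] == 100   # lots of zeros -> a spike at byte value 0
```

However many distinct section names show up, the hashed histogram stays the same fixed size, which is exactly why the trick keeps the feature vector small.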
So that was complicated, but the point of this slide is that we're going to represent raw bytes as features, so that we can build a multi-layer perceptron whose first layer inputs this encapsulation of what a PE file means and outputs malicious or benign. If we train this (I think this was done on a hundred thousand samples, not a lot), then after waiting about half an hour we get performance like this. What this means is that on a holdout set of data, if I take a new sample that this model has never seen before, extract its features, pass it through my multi-layer perceptron, and then threshold the output at 0.88, a threshold chosen to give me a one percent false positive rate, then I can detect 92% of all malware that's never been seen before. So it's pretty cool; and it's basic, a few simple tricks.

Now this is where we have to level with ourselves: this is not a special model. This model took just a few lines of code; a multi-layer perceptron is kind of 1994 technology that we're applying here, except for some of the activation functions. So we want to make our model really special.
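Choosing that threshold from a benign holdout set can be sketched like this (the scores below are a toy stand-in; real ones would come from the trained model):

```python
def threshold_for_fpr(benign_scores, target_fpr=0.01):
    """Pick a threshold so that at most target_fpr of the benign
    holdout set scores above it (i.e. a 1% false positive rate)."""
    ranked = sorted(benign_scores, reverse=True)
    allowed = int(target_fpr * len(ranked))  # false positives we tolerate
    return ranked[allowed]

# Toy benign holdout scores, evenly spread between 0 and 1.
benign = [i / 1000.0 for i in range(1000)]
t = threshold_for_fpr(benign, target_fpr=0.01)
false_positives = sum(1 for s in benign if s > t)
assert false_positives <= 10   # at most 1% of 1000 benign samples flagged
```

The detection rate (here, 92%) is then simply the fraction of holdout malware whose score exceeds that same threshold.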
So let's begin our journey by throwing away this not-special architecture and diving into the deep learning that's been wildly successful for images, object detection, speech recognition, and machine translation, and see if we can apply it to malware detection. The point is that all of those applications are end-to-end deep learning: there's no magical feature extraction. You put in an image and out comes cat; you put in an image and out comes dog. But here we had to go through this laborious step of extracting features, which feels like extra work, the old way of doing things. So let's see what happens when we try to be more special.

The more special model we're going to choose is a fully convolutional neural network. In the image domain, all this means is that I take that kernel idea and learn some filters to apply to my image, represented by this set of feature maps; in turn I learn some more filters on those feature maps to learn a different representation, and so forth, until this neural network has learned, of its own choosing, to minimize the overall objective, features that are useful to tell apart dogs and cats and boats and birds. And at the end of this, plug in, guess what, my multi-layer perceptron. So all of this now becomes my feature extraction, and all of it is code like what I showed you, just a few lines long. This is really cool for images, because no longer do people have to write papers and papers about SIFT features; they plug it in, wait for a week, the model is trained, and then when they apply it to a new image they can just give it the image and out comes a label.

Now, the analog for this in malware: images are two-dimensional structures, and up front there are three color channels, red, green, and blue. Malware is different: no color, and one-dimensional. So our very first layer will be what's called an embedding layer. Essentially, for every byte value in the malware, I'm going to let the deep learning architecture choose what color to give that byte. Instead of being R, G, and B, I can choose the dimension; that's a parameter you can tune, and in our experiments I chose two, so we'll call it red and blue. From this, every byte in the PE file now has a color representation in a one-dimensional image, and then I apply the same magic, but with one-dimensional convolutions that sweep along one dimension, and then tack on our multi-layer perceptron.

Now that seems really cool. If I train this model, then no more PE file parsing: I give it bytes and it gives me malicious or benign. That sounds really special. (Question from the audience.) Yes, that comes from supervised learning: just as I gave it piles of data with images of dogs and cats and boats and birds, the same is true here; I give it piles of raw bytes that were labeled malicious or benign. I'd refer you to the code and the Jupyter notebooks, which walk through how that's done.
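A sketch of what the embedding layer computes at inference time (here the lookup table is random, standing in for learned values; in Keras this would be an Embedding layer):

```python
import random

random.seed(0)
EMBED_DIM = 2   # two learned "colors" per byte value, as in the talk

# In Keras this is Embedding(input_dim=256, output_dim=2): a learnable
# 256-row lookup table. Random numbers stand in for learned values here.
table = [[random.uniform(-1.0, 1.0) for _ in range(EMBED_DIM)]
         for _ in range(256)]

def embed(raw_bytes):
    """Replace each byte (0..255) with its learned 2-number 'color'."""
    return [table[b] for b in raw_bytes]

colored = embed(b"MZ\x90\x00")       # first few bytes of a PE-like header
assert len(colored) == 4 and len(colored[0]) == EMBED_DIM
assert colored[3] == table[0]        # byte 0x00 always maps to row 0
```

During training, gradients flow into the table itself, which is exactly the "let the network pick the colors" idea.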
It's as simple as creating a directory called benign and putting a lot of Windows files in there, and creating a directory called malicious and dumping VirusShare in there; that's what you'd have to do, though as John Seymour would tell you, don't only put Windows files in your benign directory, put in lots of diverse software. So we have a special model now, and our special model works. The picture is nice, and in code it's only about this long. I've written the code in a way so that those with mere-mortal GPUs can also run it: instead of cramming the whole PE file into limited GPU memory, I've actually broken it up into little pieces, and we operate on the pieces independently, which just helps with memory. That's what this TimeDistributed is about; there's nothing temporal about it, it's really just the pieces. So: do my embedding layer over all the pieces of the malware, do my convolution with a convolutional filter over all the pieces, and at the end of the day, again add on my multi-layer perceptron (there's a function called MLP for that) and have it output a 0 to 1. And this would be really cool, if only it worked. With an ROC AUC of 0.95, I've set a threshold of 0.85 to lock in less than a 1% false positive rate, and we detect about 50% of new malware samples.
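The chunking idea can be sketched like this (the piece size is an arbitrary choice for illustration):

```python
def chunk(raw_bytes, piece_size=512):
    """Split a file's bytes into fixed-size pieces, zero-padding the
    last one, so each piece fits in modest GPU memory. A TimeDistributed
    layer then applies the same embedding and convolution to every
    piece independently; nothing temporal about it, just pieces."""
    padded = raw_bytes + b"\x00" * (-len(raw_bytes) % piece_size)
    return [padded[i:i + piece_size]
            for i in range(0, len(padded), piece_size)]

pieces = chunk(b"\xAA" * 1300)
assert len(pieces) == 3
assert all(len(p) == 512 for p in pieces)
assert pieces[-1].endswith(b"\x00")   # zero padding on the final piece
```

Because the same weights are applied to every piece, memory scales with the piece size rather than with the whole file.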
Well, clearly this is not a very special model. Whereas the previous one was maybe 1994 technology, this is probably 2011. So let's get really special: let's go to 2014 or 2015 and apply some really cool technology. The reason you can't really read this slide is that it's hard to appreciate the scope of these models. This is called the Inception model, by Google, and each of these little structures here I've blown up so you can kind of see it. If you were to give a slogan to the Inception model, it would be: let the network decide what kernel size to use. Essentially, for every input we give it choices, a 3x3 or a 5x5 or a 7x7, and do all of those things, and let the network decide which of those to use, or in fact maybe blend them all together. That's the theme of the Inception model, and it's really cool, but as I was thinking about it, I'm not sure how applicable it is to malware; maybe it is. There's another model called ResNet, actually slightly older than the Inception model, whose sticker would be: let the network decide how deep to be. Everyone knows that in deep learning, deeper is better, and the trick in ResNet is that you stack together these kinds of structures, which have these weight layers but also a shortcut, so if the network wants to, it can totally bypass the deep layer, or if it wants to, it can use it. A challenge with deep learning is that the deeper you go, the harder it is to train, so what the ResNet authors did is stack together a bunch of these residual building blocks so that the deep learning can decide how many of them to include and how
difficult to make its own learning task, which is kind of clever. Plus, we like ResNet because with a little craftiness we can call it MalwareResNet. So for the first time today I'm introducing MalwareResNet to the world, a very special model. It looks like this in code: first we define a very fancy one-dimensional residual block in Keras, where we add together the pass-through of the input (remember, that's the short-circuit) and this basic block; and I hope you'll ignore the fact that there's a line commented out here, because it breaks on a current limitation of Keras that the developers will fix later, I promise. Ignoring that, this is a simple pass-through block, and if I stack these together, with a bunch of code that's even harder to read (because sometimes special models take more code), then this is what you get. One thing I'll note about this code is that if you try to run it on your Mac, it probably won't run, so I've added a function that removes some of those layers on purpose, so that maybe it will run on yours too, if you have a mere-mortal GPU.
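The shortcut arithmetic can be sketched in plain Python (a toy one-weight-per-unit F(x), not the Keras block from the repo):

```python
def relu(v):
    return max(0.0, v)

def residual_block(x, weights):
    """y = x + F(x): the shortcut adds the input back onto the block's
    output, so the network can effectively skip the layer by driving
    F(x) toward zero; that's how it 'decides how deep to be'."""
    fx = [relu(w * xi) for w, xi in zip(weights, x)]   # a toy F(x)
    return [xi + f for xi, f in zip(x, fx)]

x = [1.0, -2.0, 3.0]
# With all-zero weights, F(x) == 0 and the block is a pure pass-through.
assert residual_block(x, [0.0, 0.0, 0.0]) == x
```

Stacking many such blocks gives the network a cheap way to behave like a shallower network whenever extra depth does not help.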
So here is MalwareResNet, a very special model, and we'll see how it goes. You'll notice that with our very special model, choosing a threshold of 0.83 to achieve less than a 1% false positive rate, I can now detect a whopping 9 percent of new malware samples.

Now, I promised that my talk was not going to be about deep learning, but I did toy with these ideas for a very long time, in frustration, and I'll tell you a couple of things. I think there are ways to maybe get these to work, with a lot more effort, but the point I'd like to make is that, number one, deep learning actually does require significant work to get right; we can't just take concepts from images and expect them to work well for malware. Images have this property called smoothness: if you're a little pixel detector marching along an image and you come to a sharp gradient, you've probably hit an object; if the brightness changes, that's indicative of something. But a byte marching through a PE file looks a lot different: a printable character in the header, or in the import address table, means something much different than a printable character in the text section, where the code might reside. Also, for PE files there's so much parsing domain knowledge that's left on the table if we don't use those handcrafted features. So, and I'd love to chat with those who are interested, I've actually toyed with trying to use real deep learning while creating a few gadgets that encode some of this domain knowledge. For example, if you believe entropy to be an important feature, then you can build an entropy gadget that's really good at calculating entropy, or something like it, and the network can use that automatically.
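A minimal sketch of what such an entropy gadget computes, here as a plain function (a real in-network gadget would need to be differentiable or approximated):

```python
import math
from collections import Counter

def entropy(window):
    """Shannon entropy, in bits per byte, over a window of bytes: the
    kind of domain-knowledge value a gadget could hand the network."""
    n = len(window)
    return -sum(c / n * math.log2(c / n)
                for c in Counter(window).values())

assert entropy(b"\x00" * 256) == 0.0                   # constant bytes: 0 bits
assert abs(entropy(bytes(range(256))) - 8.0) < 1e-9    # uniform bytes: 8 bits
```

Packed or encrypted regions sit near the 8-bit end, padding near 0, which is exactly the signal the byte-entropy features exploit.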
I have 30 seconds to wrap up with my commentary about what you could do better with your time to make your model special, and the hint is that it probably has nothing to do with your model architecture. You can take that 1994 multi-layer perceptron with the features I've provided for you, and you should spend your time on the data. By that I mean that the biggest challenge in information security is the fact that the malware the customer sees is different from the malware you trained on, and the benign software the customer uses is different from the benign stuff you trained on. So paying careful attention to the labels of your benignware and your malicious samples (and I know of a whole PhD dissertation about this very topic) can be really beneficial. Furthermore, there's the prior distribution, that is, how much malware is on the network: the world that VirusTotal sees, for example, and the world that your customer's network sees are very, very different, and those need to be taken into account. At Endgame, there have been times when a model release has been almost nothing except changing the data labels and changing the data we've trained on.

OK, so this model was not very special, but it is pretty good. In closing, I'd like to reiterate that you are special, even if your model is not, and I've named a repo for you, called endgameinc/youarespecial. Your fancy deep learning malware classifier may work, but let's talk about how to do that later, and I hope you'll come and see me at Endgame. All right, thank you. I think I've overstayed my welcome, and we'll take questions afterwards, is that right? OK.