
All right, the last talk of the day, which means I get to go till, you know, 8, 9, 10, but whatever. Anyway, a quick overview of what I'm going to be talking about. Like it said, my name's Rob Brandon. I've been working in the field for 20-plus years, largely in incident response and malware analysis, but I did my PhD at the University of Maryland in machine learning and deep learning applied to security problems. I currently work on the threat hunting team at Booz Allen Hamilton Dark Labs, and semantic representations of machine code are really my jam. So, why even mess with this whole embedding thing? A lot of malware analysis and reverse engineering
problems are basically similarity questions. As an incident responder or network defender, when somebody gives me a new binary and says, hey, we think this was used in an attack, it usually takes me about 20 seconds to say, OK, yep, this is definitely malware. The really interesting question to me has never been "is this malware" so much as, OK, how does this work? Is it ransomware? Is it some kind of remote access trojan? Similarly, for the vulnerability discovery problem: I've got this new binary, are there any vulnerable code paths in here? One of the things I really care about there is whether it imports libraries from other things, whether it has code with vulnerabilities we already know how to exploit. Once again, a similarity problem.

There are a lot of tools out there for measuring code similarity. Things like BinDiff do graph comparisons, and those really don't scale to any significant number of binaries; doing an N-by-N graph comparison is just not computationally feasible. Similarly, things like sdhash and ssdeep work for some things, but they don't really encode the kind of semantic similarity data you want for these tasks. So this is where we move to embeddings. Embeddings are inspired by the concept of word embeddings in natural language processing, where you take your information, whether it's words, paragraphs, or whatever, and move it into some kind of high-dimensional space where similar things are located near each other. In this case, the goal is to take a function out of an executable binary and somehow learn a high-dimensional representation of it, so that functions that do similar things are located right next to each other in the space. That lets us do things like, say we've got five million functions and we want to figure out which of them the functions in our new binary are similar to: we can use some kind of locality-based search rather than a linear scan through the entire data set. Another interesting point about this type of technique is that a lot of problems in the machine learning space aren't really machine learning problems so much as representation learning problems. The real challenge is figuring out how to come up with a representation of your data that you can then throw a linear regression on top of to get the answer.
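To make that concrete, here's a minimal sketch of that kind of locality-based lookup, using scikit-learn's NearestNeighbors; the arrays, sizes, and neighbor counts are stand-ins, not the actual corpus:

```python
# Minimal sketch: locality-based lookup over function embeddings,
# instead of a linear scan over millions of functions.
import numpy as np
from sklearn.neighbors import NearestNeighbors

corpus = np.random.rand(5000, 100)               # stand-in for millions of embeddings
index = NearestNeighbors(n_neighbors=10, algorithm="ball_tree").fit(corpus)

new_binary_funcs = np.random.rand(200, 100)      # embeddings from the new binary
dists, idxs = index.kneighbors(new_binary_funcs) # 10 nearest known functions each
```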
When we look at function embeddings, there are a couple of broad ways to classify them. One of the ways I've started to look at function embeddings is by how much feature engineering is required to build them. We have some methods, such as the one I created when I was in grad school, that require no feature engineering: they just take the raw data to generate your embedding. A few others that have come out over the last year use more sophisticated domain knowledge to pre-process the input to the embeddings. This is a really new area of research. Like I said, I finished my PhD in this area in 2017, and just over the past year or so other people have independently come up with the idea and built their own embedding methods, so there really hasn't been any comparison of what these embeddings are learning or which method we should be using to create code embeddings. Here are a few of the methodologies that have come out over the past year. Generative RNNs is the method I came up with; I presented it at the DEF CON AI Village last year. This particular method works off of just raw bytes: you don't need to disassemble the function or do any other analysis of it.
When I was doing my initial work, I did do some normalization to take API calls and turn them from a sequence of bytes into some kind of hash, but the initial experiments showed no noticeable improvement, so I just left that particular technique out here. Basically, you take something called a character RNN, which attempts to generate new information based on the previous information it has seen. Like what's shown in the diagram over there, you take your function and show it to the network one byte at a time, and after each byte the network tries to predict the next byte it's going to see. You give it feedback at the end of each training cycle based on what it got right and what it got wrong, and at the end you have a network that has created its own internal representation of the particular machine code it was trained on. To generate the embeddings, you take your trained network, show it the sequence of bytes from a function, and at the end of that sequence you just take the state of the network; that's your embedding. In other words, the embedding is basically whatever information the network found useful for predicting each byte in that function.
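Here's a minimal sketch of that idea in PyTorch, assuming a GRU and 100-dimensional states; the actual model's architecture and training loop differed in details:

```python
# Sketch of a byte-level "character RNN": predict the next byte of a
# function, then reuse the final hidden state as the embedding.
import torch
import torch.nn as nn

class ByteRNN(nn.Module):
    def __init__(self, dim=100):
        super().__init__()
        self.embed = nn.Embedding(256, dim)       # one token per byte value
        self.rnn = nn.GRU(dim, dim, batch_first=True)
        self.out = nn.Linear(dim, 256)            # next-byte prediction head

    def forward(self, byte_seq):                  # byte_seq: (batch, length)
        h, state = self.rnn(self.embed(byte_seq))
        return self.out(h), state.squeeze(0)      # logits, per-function embedding

model = ByteRNN()
func = torch.randint(0, 256, (1, 64))             # one 64-byte function
logits, embedding = model(func)
# train by scoring each predicted byte against the actual next byte
loss = nn.functional.cross_entropy(
    logits[:, :-1].reshape(-1, 256), func[:, 1:].reshape(-1))
```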
The next method, word2vec for machine code, is slightly more complicated; it was presented by Ben Bruce at CAMLIS last year. What he's doing is extending the word2vec algorithm, which is largely used for natural language processing, to executable code. Now, my interpretation of this is based largely on a couple of conversations with him and his slides, so there may be some subtle differences between the way I implemented it and the way he did, but I think I've been largely faithful to the methodology; any issues with it are definitely my fault, not his. As you can see, this is an example of some assembly code that's been normalized into his modeling format. Registers are replaced, EDX or whatever, with just a token indicating that this is a register, and the same goes for memory accesses and dereferences: those are replaced with tokens meaning this is a memory address, this is a dereference. One of the challenges with this type of model is that it requires disassembling the code, and most people working in the disassembly space will tell you that disassembly is still far from a solved problem. As I'll show later, having to do that does lead to potential issues with this kind of work. This falls into what I would call a moderate level of feature engineering: you need to do some disassembly and maybe replace a couple of tokens, but you aren't doing any significant code analysis; you're not trying to compute the call graph or anything like that.
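A rough sketch of that kind of normalization; the exact token scheme in his model may differ:

```python
# Rough sketch of disassembly normalization: concrete registers,
# immediates, and memory operands become generic tokens.
import re

REGS = r"\b(r[a-z0-9]+|e?[abcd]x|e?[sd]i|e?[sb]p)\b"

def normalize(insn: str) -> str:
    mnemonic, _, ops = insn.partition(" ")
    ops = re.sub(r"\[[^\]]*\]", "MEM", ops)          # dereferences -> MEM
    ops = re.sub(REGS, "REG", ops)                   # registers -> REG
    ops = re.sub(r"\b0x[0-9a-f]+\b", "ADDR", ops)    # addresses -> ADDR
    return f"{mnemonic} {ops}".strip()

print(normalize("mov edx, dword [rbp-0x8]"))         # 'mov REG, dword MEM'
```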
The next method I'm going to discuss is called SAFE, Self-Attentive Function Embeddings, created by Luca Massarelli and a group I believe were based out of Italy. This is a little more complicated than the previous two models, but in a lot of ways it's a more sophisticated version of the word2vec approach, and you can get the model and code for this methodology at their GitHub site. In the previous model, you basically run word2vec on each of the instructions, take all the vectors that were computed for a function, and just add them together; the sum of all of them gives you your vector. The difference with SAFE is this: SAFE also takes each of the disassembled instructions, with a slightly more complicated tokenization scheme that looks at which memory range each memory access falls into, along with some other slightly more complicated things, but it's largely similar. It likewise uses word2vec to compute an embedding for each instruction, but then feeds these sequences of instruction vectors into a self-attentive recurrent network to compute the final embedding. So it's very similar to the word2vec method, just a little more sophisticated, and it falls into what I would consider slightly more significant feature engineering.
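A minimal sketch of the difference between the two pooling strategies; the shapes and attention form here are illustrative, not the exact SAFE architecture:

```python
# Sketch: word2vec-style pooling (just sum the instruction vectors) versus
# a self-attentive pooling like SAFE's (learned weights over instructions).
import torch
import torch.nn as nn

insn_vecs = torch.randn(1, 40, 100)   # 40 instructions, 100-dim vector each

# word2vec-style function embedding: order-insensitive sum
sum_embedding = insn_vecs.sum(dim=1)

# SAFE-style: run an RNN over the sequence, then attend over its states
rnn = nn.GRU(100, 100, batch_first=True, bidirectional=True)
attn = nn.Linear(200, 1)
states, _ = rnn(insn_vecs)                         # (1, 40, 200)
weights = torch.softmax(attn(states), dim=1)       # one weight per instruction
safe_embedding = (weights * states).sum(dim=1)     # weighted sum -> (1, 200)
```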
Finally, there's asm2vec, which was presented last year at the Jailbreak Security Summit by Sophia d'Antoine. This is probably the most complex embedding methodology I've seen anybody attempt to use to date. It includes a lot of compositional features beyond the raw bytes of the function, such as the function's neighbors in the control flow graph, and a lot of other very computationally intensive and very domain-specific things. Unfortunately, the code for this wasn't available, and I wasn't able to figure out how to recreate it just from the information that's out there, so I couldn't include it in this particular experiment.

So, on to the data set. I really wanted to find a data set with some very easy-to-define classes, but I also wanted code that was going to have some relation over time. A natural fit for me was the code base of the various BSD Unixes. If you look at the history: in 1992 you had 386BSD; a little later that got forked into NetBSD and FreeBSD; then back in '95, NetBSD got forked into OpenBSD. So we have code bases with a similar start, a similar origin 20 or so years ago, that have evolved significantly in the years since.
For a code similarity analysis, this looked like a really interesting corpus to take a look at. What I ended up getting out of this data set was 222 binaries, each one present in either /bin or /usr/bin across all three operating systems: your ls, your sh, whatever other binaries had a common name. The FreeBSD binaries I compiled with both -O0 and -O2; for NetBSD and OpenBSD I unfortunately was not able to get the make config files to work for changing the optimization flags from the default, and after a couple of days of banging my head on that I just gave up. Additionally, I dropped functions that weren't greater than 16 bytes, because I really didn't want a bunch of thunks in there; those are of course all going to be exactly the same code, and that would throw off a lot of similarity metrics. And anything over 2048 bytes is really just an outlier, size-wise, so to keep things easy I wanted a boundary there too. The criterion was: only functions greater than 16 bytes and less than 2048 bytes. The initial parsing of all the functions was done with Binary Ninja. I used it to determine where the function boundaries were, and all of these binaries were compiled with debug symbols, so there wasn't any significant work to do and no problem of determining where a function begins and ends. Binary Ninja was used to extract the machine code and the disassembly for everything except the SAFE model.
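Roughly what that extraction looks like with the Binary Ninja Python API; this is a sketch, the API details vary by version, and it assumes function bodies are contiguous:

```python
# Sketch: pull functions out of a binary with Binary Ninja and keep only
# those between 16 and 2048 bytes, skipping thunks and outliers.
import binaryninja

bv = binaryninja.load("/usr/bin/ls")   # newer API; older versions use BinaryViewType
bv.update_analysis_and_wait()

functions = {}
for func in bv.functions:
    size = sum(bb.length for bb in func.basic_blocks)
    if 16 < size < 2048:
        # assumes the body is contiguous from the start address
        functions[func.name] = bv.read(func.start, size)
```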
One of the things I really wanted to try out here concerns the SAFE group's released model: they trained it on 64-bit binaries from Linux, several user-space library binaries. So, can you just release a model trained on one set of code that generates embeddings, and still have those embeddings be comparable to embeddings trained using other code? I got some really, really interesting results on that, as I'll show you in a few slides. The other problem was that SAFE uses radare2 to disassemble the code it feeds into the model, and unfortunately radare2 is not always the most reliable binary analysis system. It broke trying to process a number of the functions, so after setting those aside I ended up with about 19,000 functions that went into the model.
As for the parameters used to train the models: the word2vec model I trained for 40 epochs; the RNN model I trained for 20 epochs, at which point it looked like it had largely stopped improving in perplexity; and the SAFE model I didn't retrain, simply because I wanted to see how well it generalized. Yes, I know this is definitely an experiment where I'm throwing a lot of different things out there, and over the next few years there are definitely a lot of interesting things to follow up on, maybe with a bigger data set, but like I said, I still got some really interesting results out of it. All of the models used a dimensionality of 100.
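For flavor, here's roughly what that word2vec setup looks like in gensim; my actual training code may have differed in details:

```python
# Sketch: train word2vec over normalized instruction "sentences"
# (one sentence per function), 100 dimensions, 40 epochs.
from gensim.models import Word2Vec

sentences = [["push REG", "mov REG, REG", "call ADDR", "ret"]]  # toy corpus
model = Word2Vec(sentences, vector_size=100, window=5,
                 min_count=1, epochs=40)
vec = model.wv["call ADDR"]   # 100-dim vector for one instruction token
```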
Whenever you're comparing anything on an embedding-type problem, you really want to make sure you use a constant dimensionality, because different dimensionalities have different capacities for encoding data. You don't want to say, OK, we're going to test this one with ten dimensions and that one with a thousand, because obviously a thousand dimensions can hold a lot more data than ten. Each data set was also built with a different compiler: FreeBSD was compiled with Clang 7.0.1, OpenBSD was built with Clang 6.0.1, and NetBSD was compiled with GCC 5.5. As far as the class breakdown goes, FreeBSD at -O0 had significantly more functions, so it was a significantly larger part of the data set, but I think it's still not imbalanced enough for data set imbalance to really cause issues, especially since most of the concerns here are with model generation rather than with classification. Another interesting side note: both FreeBSD sets came from the same source code, so there's obviously a lot of function inlining and other optimization going on between -O0 and -O2, which is something to keep in mind when you're trying to take two binaries and determine whether they came from a similar code base.
One of the things I like to do whenever I'm doing any kind of unsupervised or cluster analysis is throw up a scatter plot of the data I'm looking at. In this case these are just basic scatter plots of the data sets for the models, and just looking at them off the bat, you can see some really interesting things in the data. You'll notice that in the SAFE plot the classes are a lot more intermixed when projected down to two dimensions via t-SNE than in the other two; you get the cleanest separation with the RNN model, and the word2vec plot looks more like the SAFE one. In all cases you really don't get much more than some intuitive ideas about what kind of class separation you might have in the data, and some pretty pictures for a presentation, but I've still found this to be a fairly useful technique.
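The plots themselves are easy to reproduce; a minimal sketch with scikit-learn's t-SNE, using stand-in data:

```python
# Sketch: project 100-dim function embeddings to 2-D with t-SNE and
# scatter-plot them colored by OS class (FreeBSD / NetBSD / OpenBSD).
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

emb = np.random.rand(3000, 100)          # stand-in for real embeddings
labels = np.random.randint(0, 3, 3000)   # 0=FreeBSD, 1=NetBSD, 2=OpenBSD

xy = TSNE(n_components=2).fit_transform(emb)
plt.scatter(xy[:, 0], xy[:, 1], c=labels, s=2, cmap="viridis")
plt.show()
```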
One method I've come up with over the past couple of years for determining whether two embedding methods work similarly is heavily inspired by natural language processing. One of the main ways of testing natural language embeddings is to see how well they identify synonyms. For English and other languages, we've got plenty of lists out there that say "loves," "likes," "appreciates" are all similar words, all synonyms for each other, so they should show up in similar parts of the vector space. With binary code we really don't have that common ground truth that tells us, when we see these two functions, we should say they're similar. So instead I use the concept of consistency. Hard consistency means: if I take function f embedded with method A, and that same function f embedded with method B, then out of the entire data set, f should have the same nearest neighbor under both methods. For example, if I embed a function with SAFE, hard consistency with word2vec means that same function has the same nearest neighbor in the word2vec space as it does in the SAFE space. Soft consistency just means that the nearest neighbor of function f in SAFE is within the N nearest neighbors of the same function in word2vec, for some value of N. I actually got some really interesting numbers on this. The consistency between the RNN methodology and word2vec was basically nonexistent; the results showed essentially no consistency. But there was a very remarkable amount of consistency between the word2vec method and the SAFE method: on the order of 43% of the functions had hard consistency, and even with N at 10, which is obviously a small number for soft consistency, that brought it up to about 68% of functions consistent from one method to the other.
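Spelled out as code, the two metrics look something like this; a sketch over two matrices whose rows are the same functions embedded by each method:

```python
# Sketch: hard/soft consistency between two embedding methods A and B.
# Row i of each matrix is the same function embedded by each method.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def consistency(emb_a, emb_b, n=10):
    nn_a = NearestNeighbors(n_neighbors=n + 1).fit(emb_a)
    nn_b = NearestNeighbors(n_neighbors=n + 1).fit(emb_b)
    # neighbor 0 is the point itself, so look at columns 1..n
    nbrs_a = nn_a.kneighbors(emb_a, return_distance=False)[:, 1:]
    nbrs_b = nn_b.kneighbors(emb_b, return_distance=False)[:, 1:]
    hard = np.mean(nbrs_a[:, 0] == nbrs_b[:, 0])
    soft = np.mean([nbrs_a[i, 0] in nbrs_b[i] for i in range(len(emb_a))])
    return hard, soft
```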
The takeaway here is that, at least for this experiment, models that had similar feature engineering ended up having similar levels of consistency with each other. I thought that was very interesting, because simply adding a few vectors together is very different from training a self-attentive network over the same set of vectors. The much simpler model with the same featurization ended up learning the same stuff as the more complicated model. One thing I'm interested in trying later is upping the value of N in this experiment to see just how far that increases consistency, maybe an N of 20 or 100, and seeing how that changes things. So like I said, models trained on the same featurization tend to be consistent with each other, and models trained on different featurizations tend to be consistently inconsistent.

The additional test I did, which is also common in natural language processing: given a supervised learning task, how well does the embedding work? Most of the times I've tried this on code embeddings, it's been kind of a waste of time, because all of the supervised labels we have end up being trivially easy to model with almost any vectorization. In this case the supervised learning task was: which data set did this function come from? Is it FreeBSD, is it OpenBSD, is it NetBSD? Every single embedding methodology got 99-point-something percent on that. So in order to really exercise code embeddings, we need to come up with more challenging tasks and data sets that might give more useful comparisons, because saying everything works great isn't necessarily the most useful thing to say.
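That test is just a linear classifier on top of the frozen embeddings; a minimal sketch with stand-in data:

```python
# Sketch: the "which BSD did this come from?" supervised test, as a
# linear classifier over frozen function embeddings.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

emb = np.random.rand(3000, 100)          # stand-in for real embeddings
labels = np.random.randint(0, 3, 3000)   # 0=FreeBSD, 1=NetBSD, 2=OpenBSD

X_tr, X_te, y_tr, y_te = train_test_split(emb, labels, test_size=0.2)
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print(clf.score(X_te, y_te))  # on the real data, every method hit 99.x%
```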
So, some of the conclusions and takeaways from this. Tooling can definitely make things complicated: the disassembler problems make the SAFE methodology in particular hard to just take off the shelf and put into use without doing some significant engineering, say, putting a different disassembler on the front. Another thing: even if consistency between models is poor, that doesn't mean an embedding is not going to work for your particular use case. As I saw, even though the two different embedding feature sets learned different things, they both still worked for the end task at the end of the day. And, getting at where asm2vec was going, compositional features beyond the raw bytes of the code are definitely an area ripe for research; there are a lot of really interesting things we could do by trying to incorporate the control flow graph and other features from the binary into the model. One thing I think would be interesting to do from here: most of the work so far has naturally treated functions as the best semantic block of code to measure things in, since we humans tend to think in terms of functions, but from the computer's side it might be easier to look at basic blocks. A basic block is just a section of code with no control-flow changes inside it, and that sidesteps a lot of the problems you have when you're doing program analysis, like determining whether we've ended the function or just jumped somewhere else, or handling code that's nothing but jumps put in to defeat function analysis; there's a rough sketch of the idea below.
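To illustrate why basic blocks are attractive: splitting at control-flow instructions is cheap and local. A simplified sketch with Capstone (real basic-block recovery would also split at jump targets):

```python
# Sketch: split a byte stream into basic blocks by cutting after any
# control-flow instruction (jump/call/return). No CFG recovery needed.
from capstone import (Cs, CS_ARCH_X86, CS_MODE_64,
                      CS_GRP_JUMP, CS_GRP_CALL, CS_GRP_RET)

md = Cs(CS_ARCH_X86, CS_MODE_64)
md.detail = True   # needed for instruction groups

def basic_blocks(code: bytes, addr: int = 0x1000):
    blocks, current = [], []
    for insn in md.disasm(code, addr):
        current.append(bytes(insn.bytes))
        if any(insn.group(g) for g in (CS_GRP_JUMP, CS_GRP_CALL, CS_GRP_RET)):
            blocks.append(b"".join(current))   # control flow ends the block
            current = []
    if current:
        blocks.append(b"".join(current))
    return blocks
```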
So basic block embeddings are definitely a really interesting area that I plan on taking this into in the future. Other than that, there's my Twitter and email. Anybody have any questions?

Q: Thanks, awesome talk. This whole area is very much my jam, so I really enjoyed that. Having said that, I'm now going to hit you with kind of a sneaky question.
One of the things that's come up in looking at embeddings for natural language models is the question of bias. I think most of us are familiar at this point with the notion that if you translate from Hungarian, which doesn't have gendered pronouns, you get different gendered pronouns in English for "nurse" versus "doctor." To what extent do you think, based on your experience with this, that that might be an issue going forward? Obviously a lot of the code bases we have available to us, via malware aggregation sites or large code bases and stuff like that, are skewed away from normal behavior. So if you get into actually using this for reverse engineering, the quality of those embeddings, and whether or not they actually map to the right functionality, might be biased by that data and might lead you down the wrong path. Do you have any thoughts on that, or on how it might be mitigated?

A: I think that's definitely a very big potential issue, but I think one of the key differences we can leverage with code, as opposed to natural language, is that ultimately we're not really looking at semantic intent; we're looking at what the compiler generated. That's one of the reasons I think
moving toward a basic-block approach, where we look at how basic blocks combine and then derive meaning from those combinations, can potentially help a lot with the data set bias issues, because a compiler is only going to generate so many basic blocks, and humans are only going to code so many different things before they put an if statement in there or something else that causes a control-flow change. So I think that's definitely one possible mitigation, but yeah, it's definitely a huge area of potential concern, and one that I don't have a good answer for.

Q: Cool. I have a question as well, kind of on a similar track.
Have you looked at function embeddings where you inject, say, a single line or something into the function, to see whether the function embeddings would still be similar?

A: No, I haven't, but that would be interesting to do. I'm just reminded of a news post from last week about Cylance and malware classification. So yeah, that would be interesting, but I think these should be largely robust against that particular type of technique, just because the code you inject would have to, in some fashion, outweigh the code that's already there. I mean, if you look at, say, the word2vec or SAFE models, for whatever line of code you're going to put in, first off you'd have to know what vector is going to be generated by the instruction vectorizer, and then that vector would have to be of high enough magnitude to throw off everything else the model has already seen. So, I don't know; I think that's definitely interesting to try.
All right, well then, let's give another round of applause. [Applause]