
Hello again, BSides. Now, from our Breaking Ground track, Henry Reed and Emily Rexer will present Static Detection of Novel Malware Using Transfer Learning with Deep Neural Networks. Stay tuned.

Hello, we're very excited to speak at BSides Las Vegas. We're going to be talking about the static detection of novel malware using transfer learning and deep neural networks. My name is Emily Rexer, and I am responsible for a lot of the machine learning on this project. My name is Henry Reed; I'm responsible for the data collection and infrastructure, as well as the project direction. Others involved in the research include Nick Cohen, our senior cybersecurity project leader, who has helped us with some of his expertise, as well as Doug Woodward, who is our senior data scientist. A lot of the code that we still use today was written by Angela Tchaikovsky, our cybersecurity intern.

So, the problem: advanced adversaries develop custom malware that bypasses antivirus. When I say advanced adversary, this doesn't have to be a nation state; this could be a cybercriminal group, or it could just be a group of people who have the capability to write custom malware. It is important for us as a community, an open source community, to research and develop new anti-malware techniques to improve cyber capabilities against that malware, even if we lack the funding and the capability of a
large antivirus corporation. Developing those techniques helps the general community gain new skill sets and new capabilities that can then maybe be used in commercial, more sophisticated tooling, so it's important to keep researching this topic. Our approach was to base our work on the Intel and Microsoft machine learning approach, where they converted executables into images and then used those images with deep learning models to categorize executables as malicious or benign. This method allows for predictive classification, meaning that malware that hasn't been seen before can still be detected. It's a form of static detection, where no code execution occurs, and it's much easier to develop than something like a heuristic detection method, where code will be run. For some preliminary statistics to get you interested: for x86-32 on Windows, 96.5% accuracy in detecting malware that has not been seen before. The intent of this talk is to act as an introduction to the research, as well as a tutorial to recreate the research should you decide to do so. We'll begin with our background, then move on to our methodology, including data set collection, data set pre-processing, our models, and experiments; then we'll look at the evaluation and discussion and finish up with future work and the Q&A.

So, ordinary malware detection is done mostly in two different ways: signature detection and heuristic detection. Signature detection is pattern matching of raw bytes against known bytes that come from malicious samples. If you've ever used YARA rules, that's exactly how it works; it just matches things that it sees within the binary itself. Heuristic detection is, essentially, you run a binary, and if the binary does malicious things, you categorize it as malicious. The third method, which is new and up-and-coming, is machine learning; we have two examples of it in commercial tooling, but it's not very common. For example, FireEye's MalwareGuard uses raw bytes in both deep learning and non-deep learning approaches to identify malware. They explicitly state, and I quote, "machine learning allows for detecting and stopping new malware faster than conventional signature based approaches." Cylance is another example. Current industry-standard machine learning methods generally take text data off of a binary, similarly to ordinary signature detection, and then use it in combination with a machine learning model. This requires a step called feature engineering, which is a time-consuming process, and in the end you get a model that can tell you whether something is malicious or not with a percentage confidence. We are not doing that; instead, we're using images and an image recognition model. I'm going to pass it off to Emily, who will talk about this in more detail. So, I'm first going to talk you guys
through an introduction to the broad machine learning workflow. This is a flowchart, and we're running from left to right. We first start with data acquisition, which for us is collecting malware and benignware. Then we move on to data preprocessing, which for us is converting to images, and then we move into a cyclic model experiments phase, where we train and evaluate a bunch of different models to try to get the most accurate model. Once we're done with that, we take the most accurate model and push it into model deployment, but currently we're still in the model experiments phase. So now I'm going to talk a little
more about the specifics of the model. We're using neural nets in this project, which are just systems of connected nodes where the connections are modeled as weights. In this image you can see that the leftmost layer is the input layer, and that's where we're going to feed in our images. In the middle you have the hidden layer, which is sort of the black-box ML magic where everything goes on, and then you have an output layer on the right, and that is where our model will say either malicious or benign. We're using neural nets because they can model very complex functions and they can take in images, but some problems with neural nets are that they require very large labeled data sets, and they're very computationally expensive to create, partially because of those large data sets. Another thing to note is that model creation is non-deterministic, which means that even if you feed in the exact same set of inputs, you're not always going to get the exact same model. So in order to counter some of these potential problems, we're going to introduce transfer learning. Transfer learning is just taking an existing, pretrained neural net, adding some layers on top of it, and retraining it to fit a new data set. In this image, on the top you can see the original neural net that we're transferring from, and on the bottom you can see that we've added a couple of layers on top and retrained it to perform a new task. Advantages of using transfer learning are that it requires a lot less labeled data than training a neural net from scratch, and it's also a lot faster to train, partially because it requires less data; you can also take advantage of neural nets that perform similar tasks, which in our case would be image recognition. A disadvantage of transfer learning, though, is that the model you eventually create will not be as well fitted to the data as one trained from scratch, because you aren't retraining the entire model, you're just retraining some added layers on top. So here's an example of how we feed an image into a neural net; again, we're moving from left to right. We've got our image, three by three, so nine pixels. We flatten it out from top to bottom, left to right, and then we convert each pixel into a number from zero to one based on its shade, for us from black to white. You take each of these numbers, turn them into nodes, and feed them through the neural net.
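The flattening step just described can be sketched in a few lines of Python. This is a minimal illustration, not the project's code; the 3x3 example image, the function name, and the black-to-white 0-to-1 scaling are assumptions made for the sketch:

```python
# flatten a small grayscale image (pixel bytes, 0 = black, 255 = white)
# into the list of input-node values a neural net would consume
def image_to_input_nodes(pixels):
    """Read the grid row by row and scale each byte into [0, 1]."""
    return [byte / 255.0 for row in pixels for byte in row]

# a 3x3 image, as on the slide: nine pixels become nine input nodes
image = [
    [0, 128, 255],
    [64, 192, 32],
    [255, 0, 16],
]
nodes = image_to_input_nodes(image)  # nine floats between 0.0 and 1.0
```

A real pipeline would hand this off to a framework tensor instead of a list, but the idea is the same: one normalized number per pixel, one input node per number.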
So now that we have our models, we need a way to evaluate which model is doing the best. These are called metrics, and they're based off of the four rectangles in this chart: true positive, false positive, false negative, and true negative. The metric we're going to look at mostly here is accuracy, which is just the ratio of all correct predictions to all predictions, so that would be true positive plus true negative over everything, and here we're aiming for 100%. I'm now going to pass it back to Henry so we can talk about inspiration, data collection, and pre-processing.

We based our work on a paper called "STAMINA: Scalable Deep Learning Approach for Malware Classification," written by Intel and Microsoft. There are two goals in this paper. One is the malware classification problem: knowing that something is malicious, what family of malware does it belong to? The second problem is detecting malware that you haven't seen before, meaning: I don't know what this is; is it malicious or benign? That's the same problem that we're trying to solve. The approach is the one we described: you convert a binary to an image, then you transfer learn a deep learning model on that binary image. We found a few issues. First, the suggested image conversion process has impacted the accuracy negatively in most of our tests; we have an alternative system that we will talk about later. Additionally, the data set is proprietary. This is a major problem, because if you can't have that data set, you cannot repeat the research, and therefore you cannot validate the research. The secondary problem with the data set is the way they balance it. It's uneven, as you can tell: around 32,000 benign executables, give or take, and 59,000 malicious. The way they balance it is they take the benign executables and cut them in half, doubling the amount of benign executables they have. It's not a very good method; we'll talk about it in more detail a little later, but we can't use this in an operational sense. It would definitely work in a lab, but not in the real world.

So, issues with similar works. Whenever a cybersecurity engineer hears the phrase machine learning, their eyes roll out of their sockets and they immediately go blind. The reason for this is that the data sets are often not published, the code is not published, nor are any techniques for recreating the tooling published, and that makes the research unrepeatable and unverifiable, and basically useless in the field, because there's nothing really to run. We aim to do the opposite. We're currently on track to release the software in around two months, give or take, and we are working on
releasing the data set as well; more details on that later. So, considering that the Intel and Microsoft data set is proprietary, how do we recreate a similar data set? We mainly relied on VirusShare for a training data set. VirusShare is a repository made by Corvus Forensics, a New York DFIR firm. As of the time of writing, which is around a month ago, there are 38 million samples; I'm sure there's a whole lot more now. In our case, we specifically downloaded torrent files 380 to 389 inclusive and pulled all the Windows x86-32 malware less than five megabytes in size. We identified the file type using the Unix file command, so we just use the magic bytes to do that, and then we randomly selected ten thousand and fifteen thousand samples using the Unix shuf command. For an additional testing data set we used APT Malware, which is an anonymously uploaded GitHub repo. The whole point of that repo is that it contains malware written, allegedly, by five different nation states. We do trust that repo, because each malware sample has a hash associated with a threat intelligence report, either written by a corporate intelligence firm or written by an antivirus company; because there is that validation from third parties of those specific samples, we assume that the samples are true and accurate. For our benignware, we collected from three main sources. The first source is Windows 10: we simply installed it in a VM about six months back, or maybe nine, then we mounted that virtual disk in a Linux VM and pulled the EXE and DLL files out. We did the same thing with Windows Server 2019 around the same time, except we selected all the services, removing any conflicting ones, then mounted that virtual disk and pulled the executables out. Then we installed Cygwin 32, selected every single package besides source code packages, mounted those in a Linux VM, and pulled the files out. There are two small caveats with this process. If you're mixing and matching file systems as you're copying files, do understand that Windows is case insensitive while Linux usually is case sensitive, so be careful when copying files from one directory to another and avoid overwriting files; disabling clobber will help you with that. Also, do use fdupes or equivalent to identify duplicates, because you will have a lot of duplicates between Windows 10 and Windows Server, so it's important to identify those and remove them and not have them in your training data set. As a summary, we have two major data sets: we have 15k, which is an x86-32 data set for Windows, and we have 10k, which
is also an x86-32 Windows data set. Both of these contain two classes, malware and benignware. When I say 10k, what I mean is 10k malware and 10k benignware, for a combined 20,000 samples. Compared to Intel: the largest data set that we have is 15,000 for both classes, while they have around 60,000, give or take. For us, the benignware is the bottleneck; we can't go higher, because at most we had around 15,000 benign samples, and creating a benign data set is very difficult. We will talk a little more at the end about what we think the open source community should do about that, but that's really the big bottleneck: there are many data sets for malware binaries, but very few for benign ones.

The imbalance in Intel and Microsoft's data set, again, was fixed by cutting binaries in half and then converting them to images. That's not a very good approach, because a binary could have, for example, a very large icon; when you cut it in half, the second half will just be the icon, and the code, the variables, and the logic are going to be in a different file entirely. Or you could have a situation where the code, variables, logic, and everything else is now cut literally in half. Ignoring the PE headers and other parts of the PE file, this means that the machine learning model could simply learn (not that it's a person, but in effect) that things that are malicious are full binaries and things that have been cut in half are benign, and if it sees nothing that's been cut in half, nothing on the machine is benign. So that's not a good approach to use out in the field; it's very lab-limited. I'm sure the results work in a lab just fine, but again, we want an approach that you can repeat
and recreate, such that you can apply it to the real world. So, now that we've acquired the binaries, how do we convert them into images? The Intel approach is as such. You take the binary and convert it to a one-dimensional pixel stream: the null byte is black, 255 is white, and everything else is a shade of gray in between. Then, based on the pixel file size (the paper isn't very specific, but we assume that means kilobytes), you convert it to a 2D image using the table on the right. For example, if the pixel file size, quote unquote, is between 10 and 30 kilobytes, then the width of the image is 64 pixels, and the rest of the file determines the height, so in the end you get a rectangle. You convert the binary into a one-dimensional stream, you make that into a rectangle, and the last step is you make it into a square of size 299 by 299 or 224 by 224, depending on what your model wants, using either the nearest neighbor algorithm or bilinear interpolation; the paper doesn't specify which algorithm worked best. To emulate their work we simply stuck with nearest neighbor, because, visually speaking, just looking at the images, nearest neighbor makes less blurry images, and we assumed that would improve the results rather than having a slightly blurrier image. Our process is slightly different: we completely ignore the table, and we just take the square root of the number of pixels that we have and make that the width and height, so in the middle of the second step we have basically a square. Then we take that square and scale it down using the nearest neighbor algorithm to 299 by 299. This method, on multiple tests, provided better results than the Intel and Microsoft approach, so we recommend this step be repeated should you decide to recreate the research. On the left you see two images with no black border around them; those are images that were larger than 299 by 299 and had to be scaled down. The two images on the right are images that were smaller than 299 by 299; in those cases, we simply filled the background with black.

OK, so now that we've talked about collecting and pre-processing our data, let's get into the model experiments section. This is that looping process in the ML overview flowchart. Machine learning is kind of an optimization problem: you have a bunch of different variables, and you're trying to fine-tune them to figure out what works best, so that's what we're going to walk you through.
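As a recap before the experiments, the square conversion described above can be sketched in pure Python. This is a hedged illustration, not the project's implementation: the function names are made up, the resize is a naive nearest-neighbor sampler rather than a library call, and, as noted above, files smaller than the target would really be padded with black rather than resized.

```python
import math

def bytes_to_square(data):
    """Lay raw bytes out as a square grid of grayscale pixels (0-255):
    side = ceil(sqrt(byte count)), padding the tail with 0 (black)."""
    side = math.ceil(math.sqrt(len(data)))
    padded = list(data) + [0] * (side * side - len(data))
    return [padded[r * side:(r + 1) * side] for r in range(side)]

def nearest_neighbor_resize(grid, new_side):
    """Resize a square grid to new_side x new_side by sampling the
    nearest source pixel; no interpolation, so edges stay crisp."""
    old_side = len(grid)
    return [
        [grid[r * old_side // new_side][c * old_side // new_side]
         for c in range(new_side)]
        for r in range(new_side)
    ]

data = bytes(range(256)) * 4          # stand-in for a 1,024-byte binary
square = bytes_to_square(data)        # 32 x 32 here, since sqrt(1024) = 32
scaled = nearest_neighbor_resize(square, 299)  # the model's input size
```

In practice an image library would do the resampling, but the point of the square method survives even in this toy version: no lookup table, just square root, pad, and resample.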
Note too that training a neural net is non-deterministic, like I said before, so to counter this, for each model experiment we're running we're actually training and evaluating 10 models and then taking the average of the metrics to try to reduce randomness. Our data set split is 60/20/20 train, validation, test. Here's a list of all the variables we're going to mess with. For the model we're transferring from, we're going to use two ResNets and three DenseNets; the primary difference between ResNets and DenseNets is that DenseNets are more interconnected between the different layers. In terms of data set, Henry already mentioned this: we have a 10k and a 15k data set, just to see if adding extra data helps. Then we have two variables, unfreeze bottom 10 and fine-tune all layers; if these are turned on, that means we're training more of the original model, so theoretically it should be more accurate, because more of the original model is being fitted to our data set. However, this may lead to overfitting, which is when you fit too precisely to your specific data, and it also definitely takes longer to train. The last variable we're working with is the image conversion method: either Intel's table method or our square method.

So here are the 10k data set results. Note that these aren't the conclusions; this is just the raw data from our experiment and optimization process, and the number you should care about is in this outlined row, which is the highest accuracy of all the rows for 10k. I'm going to point out a couple of variables that matter: we found that the square image conversion method worked best and that transferring from DenseNet201 worked best, and we got an accuracy of about 96.4 here. These are the 15k data set results; again, this outlined row is the highest accuracy. Note again that the square image conversion method worked the best, DenseNet201 worked the best, and we got about 96.5% accuracy here. I'd also like to point out that ResNets in general do worse, and like I said before, this is probably because DenseNets are more interconnected between the layers.

So now I'll talk about our test data set, APT Malware. Let me walk you through the chart here; there's a lot of information in it. Each column describes an APT, so these are the results for APT 1, APT 10, 19, and so on, and the row describes the exact experiment. This chart only shows the Intel
image conversion method, as you can tell, and we have all the other variables here. Underneath each column you have an average and a standard deviation, so you can really see how different the accuracy is between the different models. We'll take a closer look here where it says average APT accuracy; that's per row, meaning for Intel image conversion with a data set of 10k, not unfreezing the bottom 10 layers and not fine-tuning all layers, using ResNet50, we have an accuracy of 84.98 for all of these combined. I'd like to direct your attention to this orange column called Energetic Bear. This specific APT in this data set contains malware called Havex, which is a remote access trojan that targets ICS systems, more specifically their OPC protocol. This is what Rob Lee would call stage 2 malware, which actually attacks the ICS/SCADA environment rather than the IT infrastructure around it. We hypothesize that it does so horribly in our tests because we don't have that kind of sample in our training data set; it's a very rare type of malware, so because we haven't trained on it, and because it's so much different from ordinary malware, it doesn't do so well. However, if we switch to the square image conversion method, you can see that all around the board the accuracy has increased:
there's a whole lot more green and a whole lot less white. You can also tell that DenseNet201 on average does generally better than the other methods here.

So now that we've gone over our results, let's see if we can analyze them. The best accuracy we got training on the initial data sets, the 10k and 15k data sets, was really 96.5, so why isn't it any better? Let's take a look at the file size distribution. The file size distribution describes the distribution of file sizes of the original binaries (benignware on the left here, malware on the right) before they were converted to images. The benignware executables tend to start off anywhere between zero and 100 kilobytes (there are over 5,600 samples of that) and then very quickly but smoothly taper off to single digits anywhere beyond 3,400 kilobytes. For malware it's slightly different. We still start off high, with exactly 5,000 samples of malware between 0 and 100 kilobytes (this is the 10k data set, by the way), but then it drops down, there's a bit of a hump here, it drops down again, a larger hump, and over here we're in the double digits in terms of the number of samples we have for larger malware. The same exact story repeats itself in the 15k data set. We hypothesized that this imbalance in the data set could cause the accuracy to be lower, simply because machine learning models do best when the data sets are well balanced, meaning there's an equivalent amount for one class as there is for all the other classes in the data set, and in this case there's that imbalance in the file sizes. There's just not enough samples to really go on, especially if there are one or two benign samples of a given size versus, say, 20 of malware. So the next thing we wanted to take a
look at is what's actually in our data set: what are we even training with? To do so, for the 10k data set we simply sent the hashes of that data set to the VirusShare API. The VirusShare API saves the VirusTotal results from when the malware was first introduced into the repository, so these aren't current VirusTotal results, where you can repeat the scans and get new scans and so on; the reason we're not doing that is that it costs money, and VirusShare does not cost money. Most malware is categorized as generic malware, so anything like potentially unwanted program, generic trojan, or just the word malware. Some are given numeric or otherwise non-descriptive identifiers; we categorize all of those as generic. Some more specific categories exist also. If we want to graph this information, we have to come up with a few rules, and they are as such. If malware is categorized using only generic categories, meaning non-generic categories don't appear for a given malware sample, we count the most common generic category; if there's a tie, we use the following precedence: PUP takes precedence over trojan, which takes precedence over generic. That means, for example, that for a malware sample with three hits for PUP and three hits for generic, only PUP will count. If a malware sample has any hit for a non-generic category, we count the most common non-generic category; so if there are, say, 30 hits for potentially unwanted program but one hit for spyware, then we count it as spyware and don't count the other hits. If there's a tie in non-generics, we count all ties; so if there's a tie between spyware and ransomware, we count that as a single hit for ransomware and a single hit for spyware. Given all those rules, this is the chart that we have. Understand that these numbers don't all add up to 10,000, because of that double counting possibility. So let's walk through each column. We have a
lot of generic malware, around 1,900 or so; that's unfortunately the reality, as we can't expect antivirus companies to meticulously identify what each and every malware sample does. We have quite a bit in the spyware category, around 2,400. Then, moving on to backdoor, this will be your reverse shells, your bind shells, things like that, around 1,500 of those. Droppers are the two-stage malware, so these are things that don't carry the malicious payload themselves; instead they're waiting, or they have the capability, to download that payload and load it later. The next category we have is riskware. Riskware is defined slightly differently depending on the vendor, but it's anything that can pose a risk to the user without being strictly malicious: this ranges from DRM circumvention, which I would argue poses essentially zero risk to the user, to software that can provide a backdoor, which creates significant risk. We also have adware, things that inject ads into your operating system, mainly in the browser. We have viruses and worms, which are pretty standard, coin miners, and exploits, which are compiled exploit code generally. Then we have rootkits, only eight samples of that. Hack tool could be anything from DRM circumvention to a Metasploit exploit to a Metasploit shell; those could be categorized as hack tools depending on the antivirus company. We also have crack tool, which is explicitly DRM circumvention. From there we also have trojans and PUPs, which are common knowledge, and ransomware, around 900 samples of that. Doing this allows us to see what we're actually training with, so in the future, if you have a different distribution of different types of malware, you'll be able to see that, see what you're training with, and hypothesize about why your model acts the way it does.

One of the other things that you might want to do when you're working with image recognition models is take a look at your false positives, false negatives, true positives, and true negatives, and see if there is a visual difference between them.
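Bucketing predictions into those four groups is easy to script. A minimal sketch, with hypothetical names and toy labels (True meaning malicious is an assumption for the example), that groups sample indices into the four rectangles and computes the accuracy metric from earlier:

```python
def confusion_buckets(labels, predictions):
    """Group sample indices into true/false positive/negative buckets,
    treating True as the malicious class."""
    buckets = {"tp": [], "fp": [], "fn": [], "tn": []}
    for i, (truth, pred) in enumerate(zip(labels, predictions)):
        # "t" if the prediction matched the label, "p" if it said malicious
        key = ("t" if pred == truth else "f") + ("p" if pred else "n")
        buckets[key].append(i)
    return buckets

def accuracy(buckets):
    """Correct predictions (TP + TN) over all predictions."""
    total = sum(len(v) for v in buckets.values())
    return (len(buckets["tp"]) + len(buckets["tn"])) / total

labels      = [True, True, False, False, True]   # ground truth
predictions = [True, False, False, True, True]   # model output
buckets = confusion_buckets(labels, predictions)
```

The indices in each bucket can then be used to pull the corresponding images for the kind of side-by-side comparison shown on the slides.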
So for example, for an image recognition algorithm, if you see that a lot of your false positives happen because people are wearing hats or glasses or masks, or covering their face somehow, then you can tell that, hey, maybe it's a problem because they're covering their face. Visually, as a human being, you can tell why the model does not do well with certain images. You cannot do this with the data that we have. I'm going to use my cursor here to give a few examples. Take a look at this false positive here and the true positive here: they're nearly identical, and there's almost nothing visually that will discern them. Sure, this one's slightly brighter, but there are examples of true positives that are also slightly brighter, like the one here. Really, there's no significant visual difference between these. For some reason this little block that looks like a TV screen tuned to a dead channel is still a true positive while other ones are not, even though they all basically look the same. If you look at the true negatives and false negatives, the same story repeats itself: this image looks basically identical to this one, and these images over here in this little tetris shape look very similar to this image and this image. There's not really that much of a visual difference. You can't tell that, hey, this thing has an enormous icon showing up and all the benign samples that have icons in them do badly; that's not the case. There's not much of a visual difference you can find just by looking at the images.

So, evaluation and discussion. The file size distribution charts: why is the accuracy not higher? Our hypothesis is as such: because there's a mismatch between the file sizes of the benignware and the malware, because they don't exactly match, the model isn't as well trained. If we had more samples that better fit that distribution, we could potentially have
better accuracy. The Intel and Microsoft paper also claimed that the larger the images got, and the more you had to scale them down, the more the accuracy would measurably decrease. We did not find a pattern for that. Our cutoff was five megabytes; the reason we picked that is we found that samples larger than five megabytes are pretty rare, and we wanted to limit our training data set. Given this cutoff, we did not find any consistent pattern showing that larger images, scaled down, somehow do worse or otherwise.

We also found generally that the best performing model is independent of the data set: DenseNet201 performed best for both the 10k and the 15k data sets. In terms of image conversion method, we found, holding all other variables constant, that our square image conversion method performs better than Intel's conversion method in about seventy percent of our tests, and we're working on running more tests for this. And then, for the variable where we fine-tune more of the layers of the original model, we found that whenever we flipped that switch, whenever we trained more of the original model, we always got higher accuracy; we think this is because retraining more of the original model makes it better fitted to our data set.

OK, so future work. We sort of just picked an image compression algorithm in order to fix a variable, so one of the things we could potentially try is using bilinear interpolation instead of nearest neighbor for image compression. Another thing we're looking at is transfer learning with other models; a paper mentions something called ResNet-RS that we're looking into. A big project would be to create our own neural network, and we're thinking about doing this in one of two ways: we can either base it solely on malware
images and then transfer learn to malware and benign wear or we could create a single class neural network so a single class being either malware or benign wear and then use an anomaly detection method to try to detect the other either malware or benignor so what do we think the open source community should should do where should they direct their efforts into in this research um one is that benign word data boarding really that's the main issue is the bottleneck is always the nine where it's really hard to acquire a data set of that and the big obvious reason is you can't just legally data hoard proprietary proprietary software that's been compiled the reason for that is there's copyright
and license issues and restrictions that don't allow unauthorized copying and distribution of those binaries. Hoarding free and open-source software that's been compiled and installed, the full final executable, should be fine, provided you comply with the license agreements. There are millions of malware samples for Windows but very small amounts of benign samples, and a balanced data set is required for this methodology. Software installers are also not the same as the final executable and can't really be used here. Even where there are large data sets of software installers, we can't use those, because an installer doesn't represent an already installed executable that just performs regular
operating system or application functions. An installer has a nice GUI and menus, and those things aren't generally present in a lot of the other software you would normally run on your system, so we can't just use installers to cover the benign-ware data. The other argument that might come up is: "I don't live in the United States, or in a country that has trade agreements or similar copyright laws and policies, so can I hoard it?" You might be within your legal rights in your area, but people from other nations would
now not legally be able to use your data set in their work. So even if there are sources of data, they're not really going to be available for everybody else. What we should really expand on are techniques to acquire benign ware yourself, as well as other methods to create more benign data sets without falling afoul of the copyright policies. One-shot learning is a new concept; it could potentially alleviate problems with data set size and should be explored, though it is relatively new. It would also be interesting to see if we could train on more than just the two classes of benign ware and malware, and instead train on
specific malware types. That way we could analyze the behavior of the software in a static manner: if we have classes of ransomware, backdoors, and spyware alongside the benign-ware class, we could potentially say we are 70% confident a sample is ransomware, or 90% confident it's a backdoor. That would be a lot more helpful than a binary result. In conclusion, using DenseNet-201 for both data sets, the highest accuracy we were able to get was about 96.5 percent. Unlike normal malware detection methods, we're using machine learning on images, which is unusual, and this allows us to skip
feature engineering, so it's a lot easier to develop and to deploy. Again, the end goal is to provide open-source, repeatable research. The best way to reach us is the Twitter handle on the slide; my DMs are open, so feel free to message me. We are working on releasing the source code and data set under an open-source license, contingent on the other licenses we must comply with, so that should hopefully come in around a month or so, give or take a few weeks. We have done this before, so it's not our first rodeo, and the release should be fine; we're really just fixing
the code and putting the final touches on it. We are also going to share the data set, so feel free to use it. When the code is available, I'll announce it on Twitter, but if you don't want to watch for that, we will make the code available at the link below, again under an open-source license. The goal is that you don't just get the code, you also get the models already trained, so you can download the models and actually use them. That concludes our presentation. Please let us know if you have any questions, and thank you for listening.
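The square image conversion the speakers describe can be sketched roughly as follows. This is our own minimal illustration, not the project's released code: the function name, the zero-padding to the next perfect square, and the pure-Python nearest-neighbor resize are all assumptions on our part.

```python
import math

def binary_to_square(data: bytes, out_side: int = 64) -> list[list[int]]:
    """Map a binary's raw bytes onto a square grayscale grid, then
    nearest-neighbor resample it to a fixed side length."""
    # Each byte becomes one 8-bit pixel; zero-pad so the bytes
    # fill an N x N grid exactly.
    side = math.isqrt(len(data))
    if side * side < len(data):
        side += 1
    padded = data + bytes(side * side - len(data))
    grid = [list(padded[r * side:(r + 1) * side]) for r in range(side)]
    # Nearest-neighbor resize to the CNN's fixed input size; bilinear
    # interpolation is the alternative raised as future work.
    return [
        [grid[r * side // out_side][c * side // out_side]
         for c in range(out_side)]
        for r in range(out_side)
    ]
```

A real pipeline would stack the resulting grid into three channels and feed it to a pretrained network such as DenseNet-201; swapping the index arithmetic for a weighted average of the four surrounding pixels would give the bilinear variant, and in practice a library such as Pillow handles both resampling modes.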
All right, welcome everyone. This is the Q&A session for our static detection of novel malware talk. We've got the authors here with us, and we're going to discuss a little more about what they presented. Personally, I'm really excited about this talk. What I love about it most of all is your attention to operationalization and real-world results. The base approach, supervised learning with a basic image classification model, is fairly vanilla, established machine learning, but all the magic here is in how diligent you folks have been with your data sets: using full binaries, using publicly
accessible training and testing data, or at least data that you're trying to get that way, trying to get a better balance of malicious to benign samples, and your willingness to work towards publishing all that in a way that provides reproducibility. So what would you say is your biggest wish for all this work you've been doing? What would you like to see it accomplish? Best case scenario: I want the attention of the closed-source entities that make antivirus software, specifically, to do two things. One is to consider this method, because they do have better
data sets than we do, more researchers than we do, and probably better researchers than us. So I want them to comment. It's likely not going to happen, but I'd want at least a white paper discussing the topic. I understand that Intel and Microsoft kind of went over it, but I want them to look at our specific process and see if there are any issues with it, because having a well-funded team of people like a corporation discuss that with the open-source community will help us overall. Help us help you, essentially. Sure. That's one. Two is really... sorry, let me pull back. Lots of closed
source antivirus companies do use machine learning, but the binaries are not publicly available. I understand that for benign ware, because you can't exactly violate copyright and licensing to publish them, but at least the methodology for acquiring that data would definitely be helpful. Yeah, you can't give us the data, but tell us what you did to gather it, so we could potentially go try to make those same arrangements ourselves with those same entities. All right, fair enough. Correct. From the open-source side: there are data-hoarding communities, like the DEF CON village that Dark Tangent runs. I forget the name now, but it's the Data
Duplication Village. So there is a community in the open-source arena that does data hoarding. While technically speaking we can't really share benign samples, at least sharing the methodology, or speaking with that community and opening up the conversation there, would help us acquire more samples to train with. And lastly, we called out some names here and there, and I'm hoping we'll get their attention to try to do research this exact way: to make it repeatable, to make it very operational, or at least focus on that side. There are many papers by
universities, corporations, and FFRDCs similar to us doing research on this kind of thing, but I'm hoping that instead of a vague overview of what was done, we get something very detailed, such that we can use it in an operational sense. Yeah, it was very refreshing when we got this proposal; the focus on these sorts of aspects is really what made it stand out. I see a lot of snake-oil machine learning come into all the conferences whose committees I'm on, and I really like the ones like
this, where it's "let's try to concretely solve a problem and show what we're doing" rather than "trust us, it's all working behind the scenes." So, I understand you're working on getting your results and your data sets to the point where you can release them, so folks who've seen this talk will be able to play with this themselves. What sort of license will you be releasing this under, and how can the community best give back and help you if they wanted to? So, we haven't decided on that
yet, but this is what I'm going to push for: hopefully a license such as the MIT license or the BSD 2-Clause. Essentially, all I want is that if somebody uses the research, they link back. I don't mind it being commercialized; I don't mind it being made closed-source. The GPL has a purpose, and for Linux and operating systems that's great, but for our research, I don't believe the open-source community can match corporate antivirus, so there needs to be a way for corporations to make closed-source versions. That's the intent. In terms of the
data set: the malware we can just share, and I'm hoping to go for a similar license. There are different licenses for data sets, so I'm hoping for something similar, where you just attribute us if you use it. For the benign ware, we'll have to see. Cygwin binaries could be GPL, which could prevent us from publishing the model files themselves, so you might have to recreate the models, which would be problematic; we'd have to speak to our lawyers and see what the situation is there. The hope is you'd have several files. One: tools and little scripts I wrote to help you pull data and verify
whether something is an executable or not as you install stuff. Two: the tools to convert binaries to images, both the Intel method and our method, so you can compare and contrast. And then hopefully a Dockerfile, where you can just build and run an image, give it a directory, and it will recursively search through all of it, find binaries, and give you results. Those are really the core things we're going to be looking at for the public release. Great. So Emily, sort of the same question: if you could wave
your wand and make some result come out of this, or alternately get access to something you don't already have, what would that look like? Any wish list on either of those topics for you? It's a big question. I have to say, first of all, I think it'd be really helpful for senior machine learning developers to take a look at this, because I'm very early career, so I feel like I'm probably missing a couple of maybe obvious things. So I think having a more experienced eye take a glance at this entire project would be useful, and we're working on getting a couple of people within the company to do that, but I do think researchers with a broad range of backgrounds, from other companies, would be super useful. Okay. I want to add to that: there are machine learning engineers who've never dealt with this problem, and there are cybersecurity engineers like me who have no machine learning background and can't code this. The niche of people with both of those skill sets interconnected exists, but those people generally exist at unicorn operations. Right, the unicorns, exactly. If you're trying to hire one for a reasonable price, let me know how that goes,
but having corporations comment on this would definitely help. Excellent. So basically we're looking for anybody out there who's got a line into one of these groups to step up with a contact, kick the link to them, and see what they think when they kick the tires. Awesome. Okay, so I guess one of the interesting things: you talked a little about some of the improvements you made in image processing and so forth. Do you feel pretty good about that? It's kind of amazing to me that, compared to companies like Intel
and Microsoft, for instance, you feel like you were able to get an improved process on that portion of this mechanism. What do you think is the reason behind that, or what do you think other folks who are looking to do this kind of work could learn from what you were able to achieve there? I think, first of all, it'd be very interesting to actually talk with the people who wrote that paper, because I'm curious how they came up with their conversion method; it seems to me less obvious than just making the binary into a square
image. Henry and I have talked about this, and we have no idea how they came up with it. Yeah, generally people have good reasons for the things they do, and anytime it seems like they missed something glaringly obvious, you've got to wonder what data they have that you don't, what they're keying off of that you aren't, or how their goals are different from yours. Yeah, totally. Something to add to that: I believe they cited one paper that may have mentioned that's the method they use, but the actual reason why isn't exactly
disclosed. I don't want to speculate, but it could also be that their data set just works better with that methodology; again, we have no way to verify that. Yeah. So I guess it's kind of an insurmountable problem. Absent voluntary collaboration and publishing from these kinds of closed-source companies operating in this space, there really isn't much way you can check a closed-source malware or antivirus product to find out how they're getting the results they're getting, or what data they're
using, or how they got that data, so it's kind of endemic to the space. Hopefully you folks can move the needle on that a little; I really appreciate everything you're doing here. So, if you want to pitch the ways people can get hold of you, which I think you listed in the talk, plus anything else you'd like to say in closing before we sign off for today. Yeah, so you should be able to reach me at medichenry on Twitter; my DMs are, again, open. And on GitHub, under The Aerospace Corporation, you should be able to see it, and we'll
have it listed there as a project. In terms of verifying, you mentioned the topic of verifying anti-malware, and that's really difficult, because if malware is known, most antiviruses will have its hash listed in a database, and that's how they're going to detect it; it's the easiest method. I wish there were a way to convince those companies to do demos of their anti-malware without using those pre-known hashes. Take your machine learning data set, remove some of the APT stuff: remove Stuxnet, remove other APT samples, and maybe some other malware samples like Mirai, however you pronounce it. Take them out of your training, train on the rest, and let us evaluate the result in an open-source manner. Yeah. Because if you don't do that, then you are basically trusting the corporation and its heritage rather than trusting the technical capability of the tool. I doubt I can convince a corporation to do that, but that is a hope. And to be clear, trust is a funny thing; there's certainly reputational trust, which is a valid trust model. Yeah, but it's nice when you can trust but verify, as the saying runs. Correct. Awesome, very much worth saying. All right, well, I really appreciate your time, both of you,
and look forward to hearing more about this when you get the rest of the data published. Awesome, thank you so much for having us; great conference. Thank you.
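As a closing aside on the tooling Henry describes for collecting benign ware, the step of verifying whether a file is a Windows executable can be sketched by validating the PE header. This is our own hypothetical illustration, not the project's actual script:

```python
import struct

def looks_like_pe(data: bytes) -> bool:
    """Check whether a byte string begins with a valid Windows PE header."""
    # A PE file starts with the DOS 'MZ' magic; the 4-byte little-endian
    # value at offset 0x3C (e_lfanew) points at the 'PE\0\0' signature.
    if len(data) < 0x40 or data[:2] != b"MZ":
        return False
    (pe_offset,) = struct.unpack_from("<I", data, 0x3C)
    return data[pe_offset:pe_offset + 4] == b"PE\x00\x00"
```

A collection script could walk a directory tree, apply a check like this to each file, and keep only the confirmed executables for the benign-ware data set.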