
All right, thanks. So my name is Phil Roth. I'm a data scientist at Endgame, and along with my colleague Hyrum, we've released Ember, an open-source malware classifier and data set. First, I want to talk a little bit about why we would want to do this and what motivated us. Open data sets push machine learning research forward. There have been a lot of advances in machine learning over the last 10 years in a lot of different areas, areas like optical character recognition, machine translation, and identifying objects in images. And there are a lot of reasons for that.
There have been big advances in hardware, and data sets are getting much larger, but an important reason, I believe, is the presence of open benchmark data sets for researchers to use. What you're seeing here is a plot of about 20 different open benchmark data sets from the machine learning research community and the number of times each is cited at a large machine learning conference called NIPS. You can see that benchmark data sets have been around for a long, long time, but it's really in the last 10 years that their use has taken off. One example from that previous slide is MNIST, an open data set of about 70,000 images of handwritten digits.
You can see some examples there. What researchers can do is train their model on 60,000 of those images and then test how well it identifies the correct digit in the leftover 10,000 images. It has become very important to the community; one of the leading researchers has said that MNIST is the new unit test. What he meant is that the algorithms have gone beyond the point where MNIST can measure how well they're doing, but even so, MNIST is still important as a sanity check to make sure the algorithms are doing what you expect them to do. Security lacks these data sets.
This is something I've been saying ever since I joined the security industry four or five years ago; you can see some examples of me talking about it over the years. But there are a lot of good reasons why security lacks these data sets. There's personally identifiable information in them. Companies might not want to release network logs or details of their network infrastructure because they're afraid attackers would learn too much from them. And companies don't want to release their own intellectual property. That last reason is a big part of why open data sets don't exist in the field I work in, which is static classification of malware.
That's just one application of machine learning in the security industry, but it's the one I'm going to focus on in this talk. So what are we trying to do? Pretty much just the antivirus problem: you have a new Windows PE file that you haven't seen before and you want to know, is it benign or malicious? There are a lot of ways to solve this problem, and we're coming at it with machine learning, so we're extracting all kinds of features from these files, as many as 2,000 or so. But in this simple example, let's imagine we're just using two features, maybe file size and number of imports.
If you take those features from all these different files, you can plot them in a two-dimensional space. The colors don't show up very well, but there's a red cluster you can call malicious and a blue cluster you can call benign. Then what you want to do is classify each new dot as it comes in, or divide that space up into what you think is the malicious region and what you think is the benign region. Simple rules let you do this, but not very effectively: you can see the red region here doesn't really capture those blobs, and there are some outliers that get misclassified.
Machine learning can really help here by defining better boundaries and giving you better performance. But there are so many options for machine learning algorithms; how do we know which one is best? That's where Ember comes in. Ember stands for Endgame Malware Benchmark for Research, and we want it to be known as a kind of MNIST for malware. So the name, the letters match up. I really like it. It's a great name, and it also gives me the opportunity to make this joke as often as possible. I'm not going to get tired of it, so you'd better not either. All right, so what are the details? What is in Ember?
It's an open-source collection of 1.1 million PE file hashes that were scanned by VirusTotal sometime in 2017. The data set includes metadata, derived features, and a model trained on those features, and there's a GitHub repository with code that lets you work with the data really easily. Importantly, it does not include the files themselves. That's because the benign files are companies' intellectual property; we don't own those files and we can't release them to the whole world, so we're only releasing derived information. The data set is divided into a 900,000-sample training set and a smaller test set, and the training set is split evenly between benign, malicious, and unlabeled samples.
The training set appears chronologically prior to the test data. What you're seeing here is a date histogram of the month each sample first appeared, and we're releasing that month with each sample. It's not perfectly precise, it's only the month, but it should let you do a lot. It's important for the training set to come before the test data because it reflects the nature of the antivirus problem: you train your model at one point in time, and you certainly want it to get the samples you trained on correct, but you also want it to get new samples right.
You want it to correctly predict whether a sample you haven't seen before, malware that authors are writing right now or will write in the future, is benign or malicious. That's why the test data comes from November and December of 2017 and the training data is all prior to that. Also, by releasing the month, we're letting you do chronological cross-validation, and you can quantify how quickly your model becomes out of date. You can train a model on all the data through May, say, then ask how well it does on data from August, September, and October, and see how much worse it gets over time.
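To make that concrete, here's a minimal sketch of that kind of chronological evaluation, assuming you've already vectorized the features into arrays and trained a model. The variable names and the month-string format are my own illustration, not something the Ember distribution dictates.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def auc_by_month(model, X, y, months):
    """Score a trained model separately on each month of held-out data.

    X is an (n, d) feature matrix, y holds 0/1 labels, and months is a
    numpy array of strings like "2017-08" (format assumed for illustration).
    """
    scores = model.predict(X)  # a LightGBM Booster returns probabilities here
    results = {}
    for month in sorted(set(months)):
        mask = months == month
        if len(set(y[mask])) == 2:      # AUC needs both classes present
            results[month] = roc_auc_score(y[mask], scores[mask])
    return results

# e.g. train on data through May, then watch the AUC drop month by month:
# for month, auc in sorted(auc_by_month(model, X_later, y_later, months_later).items()):
#     print(month, round(auc, 4))
```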
So this is what the data actually looks like. You can go grab it now; it's a 1.6 GB tarball on disk. Once you extract it, there are seven JSON files, and each line in each file is a JSON blob. Here we're just looking at the first three keys of the first line. Every JSON blob has these first three keys, which are metadata about the sample: the hash, the month it appeared, and the label, which is zero for benign, one for malicious, and negative one for unlabeled. There are a lot more keys after that, which contain the extracted features from each sample.
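As a quick illustration, here's roughly how you'd peek at that metadata with nothing but the Python standard library. The file name is a placeholder, and the key names are what I recall from the data set, so check the first line yourself.

```python
import json

# Each line of the extracted files is one JSON object describing one sample.
# "train_features_0.jsonl" is a placeholder; use whichever files the tarball contains.
with open("train_features_0.jsonl") as f:
    first = json.loads(f.readline())

print(first["sha256"])    # hash of the sample
print(first["appeared"])  # month the sample was first seen
print(first["label"])     # 0 = benign, 1 = malicious, -1 = unlabeled
```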
I'll get into more detail about what those features are later, but at a high level there are two kinds of features. There are those that can be calculated directly from the raw bytes of the file, like the byte entropy histogram, the byte histogram, and strings. Then there are features that require us to parse the PE header, and to do that, to understand the PE file format, we're using a library called the Library to Instrument Executable Formats, or LIEF. We've got to give a big shout-out to Quarkslab for open-sourcing this very useful library.
Calculating the features is a two-step process. You can't really read the details of this code, and it's just one feature, but the important point is that there are two different functions for calculating features. The first accepts the raw binary, the PE file itself, and generates a JSON blob; that JSON blob is what we're calling the raw features. The second function takes that JSON blob and vectorizes it into a list of float32s, and that's the feature vector you can feed to a machine learning model. So the data we're distributing, those JSON blobs, needs to be vectorized before you can train a machine learning model. It's an important step, and the code we're releasing lets you vectorize all those features without doing any more work; we've defined that for you.
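Just to illustrate that two-step shape, here's a toy version of a single feature. This is not the actual Ember feature code, only a sketch of the pattern: one function turns raw bytes into a JSON-friendly dictionary, and a second turns that dictionary into a fixed-length float32 vector.

```python
import json
import numpy as np

def raw_features(file_bytes: bytes) -> dict:
    # Step one: raw binary in, JSON-serializable "raw features" out.
    counts = np.bincount(np.frombuffer(file_bytes, dtype=np.uint8), minlength=256)
    return {"size": len(file_bytes), "byte_histogram": counts.tolist()}

def vectorize(raw: dict) -> np.ndarray:
    # Step two: raw features in, fixed-length float32 feature vector out.
    hist = np.asarray(raw["byte_histogram"], dtype=np.float32)
    hist /= max(raw["size"], 1)                     # normalize counts by file size
    return np.hstack([[np.float32(raw["size"])], hist])

demo_bytes = b"MZ" + bytes(range(256))              # stand-in for a real PE file
blob = json.dumps(raw_features(demo_bytes))         # what the data set distributes
vector = vectorize(json.loads(blob))                # what a model actually consumes
```

The distributed JSON lines correspond to the output of the first function, so only the second step is needed before training.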
So let's go through the categories of features we chose. The byte histogram is a simple count of how many times each byte value occurs in the file. The byte entropy histogram is a sliding-window entropy calculation that attributes each window's entropy back to the byte values that occur in it; there are more details in the paper I linked to there, and we've also written a paper about Ember that covers it.
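For the curious, here's one plausible way to compute a sliding-window byte/entropy histogram. The window size, step, and bin counts are my own guesses for illustration; the paper linked on the slide has the exact definition Ember uses.

```python
import numpy as np

def byte_entropy_histogram(data: bytes, window=2048, step=1024, bins=16):
    """Joint histogram over (window entropy, coarse byte value), flattened."""
    arr = np.frombuffer(data, dtype=np.uint8)
    hist = np.zeros((bins, bins), dtype=np.int64)
    for start in range(0, max(len(arr) - window, 0) + 1, step):
        block = arr[start:start + window]
        counts = np.bincount(block, minlength=256)
        probs = counts[counts > 0] / len(block)
        entropy = float(-np.sum(probs * np.log2(probs)))     # between 0 and 8 bits
        ent_row = min(int(entropy / 8.0 * bins), bins - 1)   # which entropy row
        hist[ent_row] += np.bincount(block >> 4, minlength=bins)[:bins]  # coarse byte columns
    return hist.flatten()
```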
Section information is an example of a feature that requires reading the PE file header: we list all the sections, along with the entropy, the virtual size, and other information about each one, plus which section is the entry section. For import and export information, we have a list of all the imports from each file and a list of all the exports, and for the imports we record which library each function came from. For strings, we can't just extract the strings and hand them over to you; we were a little worried about personally identifiable information and about divulging intellectual property. But we do find all the strings and tell you how many there are, what their average length is, a histogram of the characters that appear, and some other things.
We've also done some pattern matching: strings that we think look like URLs or registry keys get counted up, and we give you that information too.
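Here's a rough sketch of what that string-derived information might look like. The regular expressions, the minimum string length, and the key names are my own guesses for illustration, not Ember's exact definitions.

```python
import re
import numpy as np

STRING_RE   = re.compile(rb"[\x20-\x7e]{5,}")   # printable-ASCII runs stand in for "strings"
URL_RE      = re.compile(rb"https?://", re.IGNORECASE)
REGISTRY_RE = re.compile(rb"HKEY_")

def string_features(data: bytes) -> dict:
    strings = STRING_RE.findall(data)
    lengths = [len(s) for s in strings]
    chars = np.frombuffer(b"".join(strings), dtype=np.uint8)
    # Histogram of which printable characters appear across all extracted strings.
    char_hist = np.bincount(chars - 0x20, minlength=96) if len(chars) else np.zeros(96, dtype=int)
    return {
        "numstrings": len(strings),
        "avlength": float(np.mean(lengths)) if lengths else 0.0,
        "printabledist": char_hist.tolist(),
        "urls": len(URL_RE.findall(data)),
        "registry": len(REGISTRY_RE.findall(data)),
    }
```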
There's also more general information about the file, its size, its virtual size, and so on, plus header information taken directly from the PE header, about how the file was compiled, the compiler information, that sort of thing. I mentioned this already, but after downloading the data set you need to do feature vectorization before model training, and the Ember code base defines that for you. Feature vectorization took about 20 hours after I downloaded the data set onto this computer right here; we have a server with better parallelization where it can take as little as 10 or 30 minutes. After that, you can train a model. We've trained a model for you. We didn't make any special decisions about it; it's a very generic model, but we wanted to train it and distribute it for example purposes so it can serve as a benchmark. It's a gradient boosted decision tree model trained with LightGBM, an open-source implementation of the gradient boosted decision tree algorithm. That took about 3 hours, again on this machine here, so that was pretty easy.
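Here's roughly what that benchmark training step looks like when you call LightGBM directly. It's a minimal sketch assuming you've already vectorized the training data into X_train / y_train arrays; the parameters shown are generic LightGBM-style values, not necessarily what the distributed model used.

```python
import numpy as np
import lightgbm as lgb

def train_benchmark(X_train: np.ndarray, y_train: np.ndarray) -> lgb.Booster:
    labeled = y_train != -1                      # drop the unlabeled (-1) samples
    dataset = lgb.Dataset(X_train[labeled], label=y_train[labeled])
    params = {
        "objective": "binary",                   # benign (0) vs. malicious (1)
        "num_leaves": 64,                        # illustrative values only
        "learning_rate": 0.05,
    }
    return lgb.train(params, dataset, num_boost_round=500)

# booster = train_benchmark(X_train, y_train)
# booster.save_model("ember_benchmark_sketch.txt")   # LightGBM's plain-text model format
```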
Once you have that model, you can make predictions on every sample in the test set. The model spits out a number between 0 and 1: if it's closer to 1, the model thinks the file is more likely malicious, and if it's lower, the model thinks it's more likely benign. What I'm showing here is a histogram of those predictions on the test set, again red for malicious and blue for benign. You can see it's doing a pretty good job of separating the two classes, but there is some overlap. Once you've made all those predictions, you can make some statements about how well you think the model is doing.
We do that with a receiver operating characteristic curve and the area under that curve, so we get a score. Like I said, you get a score between 0 and 1, and then you want to pick a threshold and say, okay, anything that scores above this we'll call malicious, and anything below it we'll call benign. If we do that, how many do we get wrong? How many false positives do we have? I chose the threshold to get a 0.1% false positive rate, and at that false positive rate we get a 93% detection rate.
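In code, picking that threshold is just a matter of walking the ROC curve. A small sketch with scikit-learn, assuming y_test holds the true labels and scores holds the model's predictions:

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

def detection_at_fpr(y_test, scores, target_fpr=0.001):
    """Threshold and detection (true positive) rate at roughly a 0.1% FPR."""
    fpr, tpr, thresholds = roc_curve(y_test, scores)
    idx = np.searchsorted(fpr, target_fpr, side="right") - 1   # last point with fpr <= target
    return thresholds[idx], tpr[idx], roc_auc_score(y_test, scores)

# threshold, detection_rate, auc = detection_at_fpr(y_test, booster.predict(X_test))
```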
Huge disclaimer: this model is not MalwareScore. At Endgame, we train MalwareScore and distribute it as our production model; it's protecting customer machines right now, and it's great, in my totally biased opinion. This Ember model just doesn't perform as well: MalwareScore is better optimized, has better features, and is constantly updated with new data. The purpose of the Ember model is not to protect your machine; it's to serve as a benchmark, so researchers out there can choose a different machine learning model or a different technique for classifying benign and malicious samples and see how it compares against a common benchmark that everybody has access to. So no, I would not suggest using this model to protect your own machines.
Along with the data, we're releasing a code base. It makes it very easy to vectorize the features, train the models, and, importantly, make predictions on new PE files given the model we're distributing, and I'll show you that in just a second. Something I'm really proud of is that the code repository has a Jupyter notebook in the resources directory where I've defined the environment I trained the model in, what Python packages I used and what versions, so you can run that notebook yourself, train the model yourself given that environment, and reproduce all the graphics in this talk and in the paper. It's a very good explanation of how to do all of that in code.
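As a taste of what that prediction path looks like: the function and file names below are from my memory of the repository's README and may differ slightly, so treat this as a sketch and check the repo for the exact calls.

```python
import lightgbm as lgb
import ember   # the package installed from the GitHub repository

# Load the distributed model from its plain-text file (name is from memory).
booster = lgb.Booster(model_file="ember_model.txt")

# Score a new PE file's raw bytes; closer to 1.0 means "more likely malicious".
with open("suspicious.exe", "rb") as f:          # hypothetical path
    score = ember.predict_sample(booster, f.read())
print(score)
```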
So I hope researchers in this area pick up this data and this model and run with them. There are a bunch of different things we're hoping people do. The first category is beating the benchmark. This was a pretty easy model; we didn't make many customizations to it, and there are a lot of things researchers can do to improve its performance immediately. You can throw out the features we're distributing that aren't very interesting, or do feature engineering and come up with better features.
You can optimize the LightGBM model parameters with a grid search; that alone will immediately get you better performance, and there's a sketch of that idea just below. Or, the last one, semi-supervised learning: you can bring in information from the unlabeled samples to help you learn more about the structure of PE files if the labeled benign and labeled malicious data isn't enough. There isn't much about semi-supervised learning for this problem in the academic literature, and we're hoping this spurs some.
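For the parameter-search suggestion above, here's a minimal sketch using scikit-learn's grid search around LightGBM's sklearn-style estimator. The grid itself is arbitrary and just for illustration.

```python
import lightgbm as lgb
from sklearn.model_selection import GridSearchCV

param_grid = {
    "num_leaves": [31, 64, 128],
    "learning_rate": [0.01, 0.05, 0.1],
    "n_estimators": [200, 500, 1000],
}

search = GridSearchCV(
    lgb.LGBMClassifier(objective="binary"),
    param_grid,
    scoring="roc_auc",
    cv=3,          # plain 3-fold CV; a time-aware split by month would fit this data better
    n_jobs=-1,
)
# search.fit(X_labeled, y_labeled)      # labeled training rows only
# print(search.best_params_, search.best_score_)
```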
The second category of things you could do goes beyond gradient boosted decision trees. I've already mentioned asking how quickly these models go out of date; we can definitely measure that with this data set. But you can also look at featureless, neural-network-based models, and for that you would need independent access to the samples themselves. We can't provide that, but I'm hoping there are research institutions with access to VirusTotal; any of these files can be obtained with a subscription to VirusTotal or some agreement with them. Hopefully researchers with that kind of access can publish neural-network-style approaches to this problem and show how they compare. You can also take the offensive side and ask how machine learning could be used from an attacker's standpoint.
You can treat the Ember model as the thing to beat: take a malicious sample that the Ember model classifies correctly and ask how you can change it so it bypasses detection. And I just want to note that offensive research is very important to defenders as well, because we want to learn about those techniques, and it helps defenders in the end. All right, demo time. This is where talks are won and lost, so I wanted to bring a little hat from a winning team I know of, the 76ers, and hopefully that will help me a lot. Yeah, trust the process.
All right, let's see here. Can we see that? Yes. So I want to download some of the most recent benign files on VirusTotal and some of the most recent malicious files, and I just want to see whether the Ember model classifies them correctly. Keep in mind this is data that has only just been seen by anyone, right now, which means the Ember model, which was trained on data through October 2017, is going to be very out of date. But when I've done this demo myself, the Ember model has had a pretty good winning percentage, just like the 76ers, so we should be good. All right, let's see.
I'm running this through my phone, so I'm going to skip the 5 MB file and take the smaller one. I'm going to download the malware directly to my computer; I'm a trained professional, don't worry, I can do this. Let's see, this is a smaller benign file. Download. All right, sweet, got those. So now let's go over. I've already downloaded the Ember model and data, and you can see that here; that's the tarball right there, and I've already extracted it. These are the JSON features. The cool thing about LightGBM is that the model is inspectable: it's just text, a bunch of decision trees, and you can inspect what decisions the LightGBM model is making, which is pretty cool.
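Since the saved model is just text, you can also poke at it programmatically. A small sketch using LightGBM calls I'm confident exist; the model file name is a placeholder for whatever ships in the tarball.

```python
import lightgbm as lgb

booster = lgb.Booster(model_file="ember_model.txt")   # placeholder file name

print(booster.num_trees())                 # how many decision trees the ensemble has
print(booster.feature_importance()[:10])   # split counts for the first ten features

model_dict = booster.dump_model()          # full tree structure as plain Python dicts
print(model_dict["tree_info"][0]["tree_structure"]["split_feature"])
```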
So we got our two downloads there. This is the repo right here; I've already installed it, but you can just run the install again. There are instructions in the repository itself about what requirements and Python packages you need, and I've already done all of that. In the scripts directory, we give you a nice classify-binaries script that helps you make predictions on new PE files once you have a model. So let's point it at the Ember model and the downloads directory. All right, we're making predictions. We did all right, maybe, I hope.
The model loads very quickly, and you can see this one got a very high score: that's fa2. Let's hope that's the malicious one. Yes, it was. And the benign one scored pretty high, too; actually, in all my tests that's about the highest a benign file has scored. 149, yep, that was the one we downloaded. So, victory, just like the Sixers. All right. This is all available; the presentation is in Sched, so you should be able to get these links. And we've released a paper. This is the highlight sentence for me.
In the paper, Hyrum was gracious enough to train a neural-network-based model himself, based on some recent literature, and we compared its performance to this simple gradient boosted decision tree model. The gradient boosted decision trees did slightly better. So we're saying that, at the current state of the art, these featureless models aren't yet up to snuff compared to the older feature-based approach. We think there's potential in the field, though, and we hope this data set spurs that kind of research into featureless neural network models.
So I just want to say: bring it. All right. Download the data, download the code. Thanks a lot.
How did I do? Do I have time for questions? Okay.
Do you want to bring the microphone over, or you can just yell it out if you want.
Secondly, have you looked at packed samples and the entropy features of obfuscated code? As in, if I give you a packed Windows PE sample, how is your model going to detect it? Yeah, packed samples are definitely hard. I haven't looked at them in the Ember data set specifically, but we deal with them all the time when training MalwareScore. We can definitely detect them, like you said, with entropy-based features and so on, so the model knows about them; it's just hard to make a good decision because you can't learn very much about what's actually inside a packed file.
But so far we've gotten pretty good performance, and maybe people can take this and find even better ways to get features out of obfuscated files. And what's the score variance between two pieces of malware from the same family, variant A and variant B? YARA catches those pretty well because, apart from the SHA, you can code the different variants up in a YARA rule. How will Ember detect variants which are very similar but are different samples with different SHAs? So you're saying there's a common family and there's just a little difference between the samples?
Yeah. I mean, you would hope the machine learning model would find features that exist across all the different samples of the same family, and would be able to make decisions based on the ones it has seen and still make an accurate decision on a sample from that family it hasn't seen before. Okay, did that answer your question? Okay, good. Great initiative, thank you so much. Okay, thanks.
Yeah, I liked your talk. I had so many questions while you were giving it, but you actually answered quite a few of them toward the end. I guess I'm new enough to ML and deep learning to be dangerous; it's not my original technical expertise. But I was a little confused about the SHA-1, about using it, because unless you have an exact match it's just going to be different for everything, and in theory it's a uniform distribution too. Yeah, definitely. So it buys you nothing there. I should have been more explicit: we're distributing the SHAs of all the files,
the SHA-256 hashes, but they're not meant to be used as a feature you would train on. They're meant to say: if you have access to these files, here is the SHA, go and get the file yourself. So it's metadata, not data that's meant to be trained on. Got you. Now, I think you had information in there that would let me distinguish by operating system somehow, some metadata, because in my environment I'm almost all Linux and I couldn't care less about Windows, for example. Like malware for Linux versus Windows.
What are you talking about, that can't even run these. No, I mean the features, the features. Oh, okay. Yeah, we're just grabbing whatever is in the PE header there. Okay, but I can separate them out if I need to when I'm training? Yeah, you could do that, combining different models, training separate models for different subsets; that's definitely something you could do. Okay. And now I've lost track of my third question. I'll be around, definitely; we can talk after. Yeah, sounds good. Thank you very much, I really appreciate it.
I think you still have about five minutes left for questions if anyone in the audience still wants to ask some.
Now I remember my third question. One thing is that I'm getting other data: if malware comes in, I'm going to have logs, Splunk logs, which will tell me the behavior of whatever came in. And I was wondering about the features; you said you really just have three labels: malware, not malware, or unknown. Yeah. In the future, will you have more distinctions? I know there are a lot of different distinctions you could make. Well, it sounds like you're getting into behavior: when this was downloaded, when it ran, what did it do beyond that? Right.
That's definitely something you can do, but it's out of the scope of static malware detection, where you want to use the data without running it and make a decision with only that information. Well, I'd like to combine it with the dynamic data. Yeah, and there are more opportunities for that, but we're not including information about what these files would do if they were run. It's out of scope, but yes, you would want to move beyond that and build more complicated models with all the information you have. These are in the public domain, so they probably have descriptions associated with them, right? How they operate, what they do?
What's public? I mean, all the data's out there now. No, I mean the malware you've published the SHAs for, it's well known; it has behaviors associated with it. Yeah, a lot of it is probably commodity malware. We didn't pick it to be from certain families or anything; it's roughly uniform, just random. So yeah, there are probably very well-known things in the data set. So I can dig them out. Yeah, okay, great. Thank you.
If there are no more questions, let's take the opportunity to thank our speaker again. All right.