
Dynamic Analysis of Malware Using Runtime Opcodes

BSides Belfast · 2016 · 30:30 · 141 views · Published 2017-09 · Watch on YouTube ↗
About this talk
This talk presents a machine-learning approach to malware detection based on dynamic opcode analysis. The speaker describes building a large dataset of malware traces using virtualization and dynamic instrumentation, then demonstrates how clustering malware by runtime instruction sequences achieves better classification accuracy than traditional signature-based or AV-label methods.
Original YouTube description: BSides Belfast 2016
Transcript [en]

[Music] Thank you, and thank you for coming to this talk. My name is Domhnall Carlin, and I'm a second-year PhD student at Queen's. My journey towards cyber security started fairly recently: I spent 20 years managing and working in some of the most famous bars in Belfast, and having been made redundant one Friday afternoon, and then finding out that I was going to be a dad for the second time 24 hours later, I decided to get my act together and get a proper job. I'd always been interested in computers, from my first Spectrum 48K, programming stuff in BASIC, but with no qualifications in computers I decided to join Queen's and do the MSc in software development, which converts a primary degree into one in software development. For the dissertation I asked someone in CSIT to set me a project, which was the machine-learning analysis of malware, and from there I joined CSIT for a PhD. So I'm relatively new to the cyber security industry.

My talk should be in three parts. There should be a live demo running in the background (this is my first time presenting this outside Queen's, or presenting anything outside Queen's, I would say, so when I set off the live demo we'll cross our fingers and hopefully it'll work), then I'll talk about the data set that I've built and am currently building, and then the future work that we'll use it for.

Okay. So essentially, to build the data set I used a lot of different tools that are cobbled together in a somewhat symbiotic fashion to make our own system. Essentially we take a file, virtualize it, use dynamic analysis, and then assess it against a machine learning model that we've built on a data set. We called it ARC: partly just a random acronym, but also my boy's called Noah, so I said I'm doing the first thing after Noah, Noah's Ark. So I will try and get on with this: not only is it a live demo, and my first non-Queen's presentation, but I have to RDP into Queen's to get it running, and then from Queen's RDP into our malware lab to get that running, so hopefully this works. I'm just going to set this running and then go on with the presentation, and we'll look at the results at the end. It's just a simple shell script that executes the Python script, and that's the malware virtualizing in there. Okay.
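As a rough sketch of what that launcher does (the VM name, snapshot name and nine-minute budget here are stand-in assumptions, but the `VBoxManage` subcommands are the standard VirtualBox CLI):

```python
import subprocess

VM_NAME = "malware-guest"   # hypothetical VM name
SNAPSHOT = "clean"          # previously saved clean snapshot
RUN_SECONDS = 9 * 60        # each sample runs for nine minutes

def run_sample(sample_path, dry_run=True):
    """Restore the clean snapshot, boot the guest (where the debugger
    traces sample_path), then tear the VM down for the next file."""
    plan = [
        ["VBoxManage", "snapshot", VM_NAME, "restore", SNAPSHOT],
        ["VBoxManage", "startvm", VM_NAME, "--type", "headless"],
        # ... guest side: the debugger runs sample_path for RUN_SECONDS ...
        ["VBoxManage", "controlvm", VM_NAME, "poweroff"],
    ]
    if dry_run:
        return plan          # just report what would be executed
    for cmd in plan:
        subprocess.run(cmd, check=True)
```

The dry-run return makes the sequence inspectable without a hypervisor present.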

So the current context of my PhD is just new ways to detect malware. Signature detection is the most widely used approach in commercial malware detection, and as everybody knows, new malware instances must be captured, analysed for a signature, the signature stored and then deployed. So by definition you're behind the curve: you have to get a new sample of malware, take it, analyse it (normally manually) for a signature, then deploy that to all your users. Obfuscation techniques compound the issue (standard obfuscation techniques, polymorphism, and so on), and with recent advances in sophistication, some researchers have shown that signature-based detection methods are on the brink of failure. Then, with the advent of cloud computing as well as cloud storage, there's just another attack vector. So the motivation of the research is to develop a strategy for the detection of malware that's immune to modern obfuscation methods and applicable at the hypervisor level.

What we do is opcode analysis. 'Opcode', if you don't know it, is a portmanteau of 'operation code': the portion of an assembly language instruction that specifies the operation to be performed on the operand, so essentially the human-readable version of machine language. From the body of current research, opcode analysis can discriminate between malware and benign software; that's clear from the research. This bypasses some issues inherent in signature detection models, because you're actually examining the code. However, statically analysed files generally can't investigate packed or partly encrypted malware (sometimes they can), and they're prone to code obfuscation, junk insertion and the like. Dynamic analysis, on the other hand, allows the malware to reveal itself at run time, so you get what should be the true code. However, the data sets used in the literature are small, sometimes as small as 70 samples. It doesn't quite stack up: in the nicely worded introduction to a research paper they'll say there are 40 million malware samples, with new samples every day, and then they squander the methodology and say 'so we took 90 samples and ran machine learning on them'. Virtualization for dynamic analysis also tends to be detectable by modern malware.

In terms of my investigation structure, I'm at the point of creating the data set. We've generally critiqued the data sets that other malware analysis investigations have used as being too small, too few samples. We want to try and improve classification, that is, discrimination between benign software and malicious software, and beyond that we want to investigate malware types, look at uninvestigated malware types, and see how that feeds back into each segment. So the aim is to create a data set of processed instances of malware and goodware which is sufficiently deep in terms of quantities of samples

and sufficiently broad in terms of types of malware. The purpose of that is to increase classification, that is, discrimination between benign and malicious software; to enable more complex machine and deep learning (algorithms that need a lot more samples); and to increase understanding of malware types, so that's malware versus malware.

Building the data set, then: we have a few sources of malware, VirusShare and Malicia. Malicia came from a project that ended up producing a very good paper, and gave us 12,000 binaries: they basically took 300 vulnerable VMs, faced them onto the internet, pointed them at a couple of URLs, and watched as they collected all the malware for you. So we had all these samples, with the MD5 lists as well, and we put those through VirusTotal's scanners. VirusTotal very graciously donated a private key for a while to allow us to do more samples per minute, so we checked the MD5 of each file, pulled down the information that we wanted, and built a database with it. I look at executable files, and the problem was that we had, I think, 2.3 million files, and all we had was the MD5: no extension, no idea what each file was. Rather than picking randomly and trying to execute them, we needed some attributes.

So essentially this built an attribute database, and for my research it copied executables into separate folders, from a practical sense, but we didn't discard anything else. It looks like this: you see the MD5; a response code, which is just an internal check for me to make sure the record exists, so the Python doesn't fall over; the number of positives, which is the number of scanners that detected it as malware (which I'll explain in a minute); the file type and the type of file, which are two different things; and then the results of 52 AV scanners. So essentially we built a large, descriptive attribute database.

One of the problems, and it's pretty pervasive, is how you categorise malware into families. My future investigations will be delineated by malware type, so how do we correctly (if there is a correct way of doing it) assign these samples to categories? Do we impose our own categorisation or clustering, or use existing broad measurement tools? The problem is you have to start from somewhere. For example, this is one file, and you can read it: at the top there is the MD5, and these are the results of some of the AV scanners. Some of them are reasonably helpful: you see a lot of 'Trojan' in there. Some are a little less helpful: there's one that just says 'suspicious', so it's designated the file as suspicious with no real information beyond that. What I did was set the threshold for detection at 50% of AV scanners: if 27 out of 54 detected it as malware, I took it that it is. That's a talking point about the research which I can discuss later, but essentially I wanted to set it at a good intermediate level. The results weren't in a uniform format, so we had to find some way of saying: well, what do they all judge them by? There's the CARO 1991 naming format that scanners should adhere to (it was updated later on), but generally the only thing you need to really know is that the components, the family, the type and the variant, are delineated by punctuation. So essentially what we did was tokenize each label and parse it with majority-rules classification: when you scan through, the word that appears most often is chosen. So we designated this one a trojan, but it's a tentative label, which underpins the problem. One of the questions I was asked when I presented this internally at Queen's was: why are you spending so much time, in the first year of your PhD, processing the malware into these descriptive attributes? That's not actually the data you're capturing; it's just a description of the data that you hope to run machine learning on.
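That thresholding and majority-rules step can be sketched like this (the three-character token cutoff is an illustrative choice, not from the talk):

```python
import re
from collections import Counter

def is_malware(positives, total, threshold=0.5):
    # e.g. 27 of 54 scanners flagging the file counts as malware
    return positives >= threshold * total

def majority_label(av_labels):
    """Tokenise each AV label on punctuation (CARO-style names delimit
    family/type/variant with punctuation) and vote: the most common
    token becomes the tentative family label."""
    tokens = []
    for label in av_labels:
        if label:
            tokens += [t.lower() for t in re.split(r"[^A-Za-z]+", label)
                       if len(t) > 2]   # drop very short fragments
    return Counter(tokens).most_common(1)[0][0] if tokens else None
```

So a file flagged as `Trojan.Generic` by one scanner and `Win32/Trojan.Agent` by another gets the tentative label "trojan".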

The problem, as I described, is that the malware samples landed as MD5s, no extensions, no idea what they were, so it was necessary to divide the data set into the files of interest. For me that's executables, but in any big research centre other people may need data sets of all the other file types. Sharing the data was easily done, because all the information was stored: if somebody wants some poisoned PDFs, we've got a collection of poisoned PDFs; there's JavaScript, HTML, whatever you want. There's a big Android research cluster as well, so there's a lot of Android malware available. The questions about the new data set were on its structure and

volumes, so they require that information. So this is the general lifecycle of a malware sample: first of all it's diagnosed, and then we attribute it (again, these are tentative labels, but it's a starting point). Then we get to the fun bit, where we execute it. We used VirtualBox for virtualization, and its API is particularly nice. We used some hardening strategies, but not all of them, which I'll talk about in a minute. The guest OS was Windows 10, and we staged it to provide a similar environment to a normal user's. As with a honeypot or honeynet, you have to stage it to look like an actual system rather than a victim. So there are some hints that I picked up from various talks: poisoning the history; Microsoft Office with some documents in the document history; Flash and Java installed; stuff in the recycle bin as well, since some instances of malware check the recycle bin to see whether they're being virtualized; and other basic sorts of things, like setting the hard drive size to be over 60 gig, because any less than 60 gig and it reckons it's being virtualized. But there are some anti-virtualization techniques that we couldn't really get past. For example, with VirtualBox you have to use Guest Additions to control and manipulate files, and

in the end we couldn't really get past that. So we tried as much as possible, but we didn't go the whole hog with full anti-anti-virtualization, because the fact that malware checks for virtualization is an indicator of malware: we wanted to use that as a feature. So we've got a host operating system, and essentially we just launch a Python script (the thing that I clicked on there), and that launches the guest operating system. It loads a clean snapshot that's been previously saved; it loads a debugger, in this case OllyDbg; and, with the file that I want to investigate passed as a parameter, it launches the file. I run it for nine minutes. You could run it longer, because some malware will wait, or sleep for a while: some sleeps for 24 hours, some sleeps for 30 days before it then encrypts your hard drive. We run for nine minutes, and in the background the debugger traces, so essentially we get the assembly language instruction by instruction. One of the good things about dynamic analysis is that you get the instructions in sequence, as they occur, as they're passed to the processor, which is important for some machine learning algorithms, for example hidden Markov models, where sequence is the basis of the learning. Once we get the run trace, we tear down the VM, launch the next file, and that was

distributed over 14 computers, and we just let it run. We have a dedicated malware lab for the MSc in cyber security, and we literally just let it run, let it go. At the end of the execution we have the tracing part: in the background of the execution we've got a debugger running a run trace, which literally just stores all the opcodes as they come. I parse through those with a parser that essentially just pulls out all the instructions. This is the sort of trace that we get; a typical execution will be about 9 million lines long, so this is a small snapshot. What we look at are these, the opcodes. I discard everything else; I discard the operands because I'm not particularly interested in them. It's the sequence and the quantity of opcodes that we look at. This is then incorporated into the large data set. That database, essentially a CSV file, looks something like this: the top line there would be one instance of malware, so the opcode XOR occurs 10,622 times, CALL eleven thousand, and so on. Essentially we're counting opcodes as they occur. We do some manipulations of that for different data sets as well; for example, we can switch the counts to frequencies, the percentage of the opcodes in the entire run of that instance that each opcode accounts for.
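A minimal sketch of that parsing step, assuming each trace line has already been reduced to `opcode operands...`: counting opcodes, switching counts to percentages, and encoding the in-order stream for sequence learners such as HMMs:

```python
from collections import Counter

def count_opcodes(trace_lines):
    """One sample's feature vector: opcode -> occurrence count.
    Operands (everything after the first token) are discarded."""
    counts = Counter()
    for line in trace_lines:
        parts = line.split()
        if parts:
            counts[parts[0].lower()] += 1
    return counts

def to_frequencies(counts):
    # Switch raw counts to the percentage of the whole trace each opcode makes up
    total = sum(counts.values())
    return {op: 100.0 * n / total for op, n in counts.items()}

def to_sequence(opcode_stream):
    """Translate the in-order opcode stream into integers, the encoding
    fed to sequence learners such as hidden Markov models."""
    vocab, seq = {}, []
    for op in opcode_stream:
        vocab.setdefault(op, len(vocab))
        seq.append(vocab[op])
    return seq, vocab
```

The real parser runs over ~9-million-line traces; the logic is the same.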

XOR, for example. We also have a parser that puts just the opcodes in sequence and translates them into numbers, and that can be passed to hidden Markov models, depending on the toolbox that you use.

So, results. This is actually slightly out of date, because the system just runs eternally: I literally have a folder, I drop malware into it, and it just keeps going. The only problem is storage for the terabytes upon terabytes of run traces that we have. At the minute it's forty-eight thousand executed, processed and labelled dynamic run traces of malware. In the context of the literature, the next largest that I can find, in a lit review which is slightly old, is 6,700 dynamic, or 22,000 static. The data set is hopefully going to be published; we're just going to make it available to the wider research community so people can do whatever they want with it.

The counterpoint to the malware data set is the benign set, which is actually the hardest part to do, because it's harder to get benign software than malware. It's a sign of the times: it's harder to get legitimate software than it is to get malware. Malware execution is also automated, whereas with benign files it's a bit more complicated to actually execute them and get a decent trace. We got 1,200 benign files traced, and we then used a synthetic minority oversampling technique

called SMOTE, and essentially we used that to triple the benign data set. We verified that by comparing classifications: essentially, classification improved with the synthetic examples that we added. So we're sitting on 3,600 benign samples.

Work in progress: that's the data set, which was finished a few weeks ago, and this is what we're doing at the minute. We're trying to redefine the malware labels based on unsupervised machine learning, essentially blind clustering of the dynamic run traces. We're clustering a subset of the data, the Malicia data set, which I'll talk about in a second, and which has the unusual benefit of dual labels.
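The idea behind that oversampling step can be sketched as a minimal SMOTE-like interpolation (not the reference implementation; in practice a library such as imbalanced-learn is the usual choice):

```python
import random

def smote_like(minority, n_new, k=2, seed=0):
    """Minimal SMOTE-style oversampling sketch: each synthetic point
    lies on the segment between a random minority sample and one of
    its k nearest neighbours."""
    rng = random.Random(seed)

    def sqdist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        neighbours = sorted((s for s in minority if s is not base),
                            key=lambda s: sqdist(base, s))[:k]
        nb = rng.choice(neighbours)
        t = rng.random()   # interpolation factor in [0, 1)
        synthetic.append(tuple(b + t * (n - b) for b, n in zip(base, nb)))
    return synthetic
```

Tripling a 1,200-sample benign set this way adds 2,400 synthetic feature vectors that stay inside the region the real benign samples occupy.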

So one data set has two sets of labels, and if we do clustering with opcode analysis, that will potentially overlap both, and we'll be able to see how they map. The Malicia data set was twelve thousand binaries, and in machine learning terms it's very unusual to have data that can be labelled with two different sets of labels. The Malicia authors labelled their data using a wide variety of techniques, including icon analysis (which can let you know what kit it came from) and server analysis: IP addresses, landing addresses. But we were also able to put the Malicia binaries through 52 AV scanners using VirusTotal, so we had two different sets of labels for the same

data, and we wanted to add a third. So we took all those binaries, put them through our system, executed them and got their run traces, so we have a third set of labels that we can run clustering on: novel dynamic run-trace clustering labels. Essentially, for the same data we have three sets of labels, and we're working on seeing how these overlap. Our clustering of the run-trace labels might lean towards the AV scanner labels, it might lean towards the Malicia authors' data, or, hopefully, it's somewhere in the middle, because then you're not measuring the exact same information: you've got something new. We're also working on improving classification, benign

versus malicious. The ultimate thing you want to do is designate something as malware or as benign. Using basic classifiers (linear support vector machines, trees, k-nearest neighbours) we've got over 99% accuracy, 99.8% accuracy, so we can discriminate between benign software and malicious software 99.8 percent of the time. We used some ensemble classifiers to see if we could tweak that a bit, some boosted trees and subspace KNN, with 10-fold cross-validation. Then, with the size of the data set, we can use the likes of deep learning neural networks, so we used a scaled conjugate gradient backpropagation neural network.
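As a toy illustration of that benign-versus-malicious step (a stand-in, not the actual models from the talk), a k-nearest-neighbour vote over opcode-count vectors:

```python
from collections import Counter

def knn_predict(train, labels, x, k=3):
    """Toy k-nearest-neighbour vote: label x by the majority label of
    its k closest training vectors (squared Euclidean distance)."""
    def sqdist(a, b):
        return sum((p - q) ** 2 for p, q in zip(a, b))
    nearest = sorted(range(len(train)), key=lambda i: sqdist(train[i], x))[:k]
    return Counter(labels[i] for i in nearest).most_common(1)[0][0]
```

Each training vector here would be one row of the opcode-count CSV, labelled benign or malware.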

And again, 99% accuracy. These suit different problems and different data types: for example, the neural network needs to be trained once, and it takes a while to train it, but it can then classify a new sample within milliseconds. One bigger issue is that sometimes these performed quite badly on the benign set, with a lot of false positives comparatively. So work we're moving towards is rule-based learning, where one algorithm is RIPPER, which induces roughly thirty rules on this data set: quite literally, 'if it's over a certain count of one opcode but under another, it's possibly this'. We got a 92.4 percent F-score (an aggregated way of evaluating machine learning models) on the benign set, which is better than some of the others, and 98.7 percent overall. We also looked at reducing the run length. My supervisor, Philip O'Kane, covered this pretty well in one of his papers when he was doing his PhD on run length, and we were able to detect malware within a thousand opcodes, quite a short space of time in terms of execution. Looking towards the future, and a few PhDs after me, there's possibly going to be a hardware implementation of

this. We're trying to look at this at run time, for example in a hypervisor context, and to look at feature selection and extraction: one of the big issues in machine learning is reducing the number of features you have. I used 610 opcodes from the Intel instruction set; 304 of those didn't occur in the data set, so we're down to about 300 instantly, but there are other standard techniques for reducing the number of features so that you get more accurate and quicker detection. We're also looking at sequence-based learning, so HMMs, since you can look at the sequence as it occurs.
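A minimal version of that first reduction, dropping opcode columns that never occur in any trace (the column names below are illustrative):

```python
def drop_dead_features(rows, opcode_names):
    """Remove opcode columns whose value is zero in every trace: of the
    ~610 Intel opcodes, the 304 that never occur carry no information
    for the classifier."""
    keep = [j for j in range(len(opcode_names)) if any(row[j] for row in rows)]
    return ([[row[j] for j in keep] for row in rows],
            [opcode_names[j] for j in keep])
```

Further reductions (variance thresholds, mutual information, and so on) would follow the same column-filtering shape.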

From a hardware implementation point of view, you would see the sequence of opcodes occurring and could detect that as malware. Class re-sampling, then, to provide even training: we do have a class imbalance between benign and malicious, so we're going to re-sample and take an aggregate, to see whether that provides better learning. The thing I'm currently looking at is clustering to improve inter-malware classification, malware versus malware: get rid of the benign, and see whether we can tell the difference between one type of malware and others. Although benign software can be accurately distinguished from malicious files, 99.9% accurate, inter-class detection is poor when it's based on AV

scanner labels. At the AV-scanner level, with the 15 labels that we assigned, detection between those was actually quite poor: a deep learning classifier was 27 percent accurate (you'd be better off taking its answer and flipping it, which would get you 73 percent), random forest gives 74 percent, and a support vector machine gives 63 percent. So it couldn't detect the difference between these types of malware very well. That suggests the labels poorly describe the opcode representation of the data: they're fairly high-level descriptive terms that don't accurately represent, or map down to, what actually happens at an instruction-per-instruction level. And why do we want to know this? Exactly this: threat analysis. Different types of malware pose different threats; if you've got a nasty bit of ransomware sitting on your system, you're going to take a different course of action than if you've got a low-level pop-up or a browser modifier. So for our clustering we have 15 labels, 15 different types of malware that we've seen. We assigned these labels to our database, put them through the random forest classifier I mentioned, and got 74 percent accuracy, so seven and a half times out of 10 it could tell the difference between different types of malware. The next step was to throw away all the labels,

do blind, unsupervised learning. We applied various clustering algorithms and gave the data new labels, where the new labels become cluster 1, cluster 2, and so on, and it's up to us then to decide what each one actually means. We put these through the random forest classifier to compare with the AV labels, and then we look at the results. This is the EM clustering algorithm: we look at the true classes against the predicted classes. EM found eight clusters with varying degrees of success; you see cluster 2 there, true positives 56% of the time, not great. X-means clustering found four clusters which were classified well; however, two of the clusters were very small. For example, you'd have a cluster with twenty thousand samples and a cluster with 29 samples: not a very good model. X-means is just k-means clustering augmented by cost functions that essentially penalise a wrong decision, and that can be used to determine the optimal number of clusters. A lot of these algorithms will ask you how many clusters you want, which is kind of counterintuitive when you're doing unsupervised learning: you want it to tell you how many clusters there are. So we used X-means to find the optimal number, ran it, and got these results: you see cluster 1 there had 29 samples, 70% accurate. It's not a great model, even though the bottom-line accuracy is fairly good. So we took the X-means-proposed four clusters and said to k-means clustering: give us four clusters, see what you can do. This is what we got: it separated into four clusters for us, with a much better spread and much better sample sizes. You see the true positives: it can tell the difference between classes correctly 98% of the time, and it's got decent other metrics.
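A minimal k-means, the procedure underlying the X-means step described above (X-means additionally scores candidate values of k; here k is fixed by the caller):

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Minimal k-means over feature vectors (tuples of floats)."""
    rng = random.Random(seed)
    centres = rng.sample(points, k)
    clusters = []
    for _ in range(iters):
        # Assign each point to its nearest centre
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k),
                    key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centres[c])))
            clusters[j].append(p)
        # Move each centre to the mean of its cluster (keep it if empty)
        centres = [tuple(sum(xs) / len(xs) for xs in zip(*cl)) if cl else centres[j]
                   for j, cl in enumerate(clusters)]
    return centres, clusters
```

Run on opcode feature vectors with k = 4, the cluster assignments become the new labels fed back to the random forest comparison.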

One of the other things we got was the receiver operating characteristic curve, which was a hundred percent: essentially you map, or graph, true positives against false positives, and it's a hundred percent. The essential message is that when malware is relabelled using dynamic opcodes, the clusters can be distinguished more accurately. When you throw away the labels and say 'look at the information, separate it yourself', and you use opcode analysis, it's a lot more accurate at distinguishing these different types of malware. In these charts, the bars are the original labels of the data sets.
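For reference, a ROC curve plots the true-positive rate against the false-positive rate as the decision threshold sweeps across the classifier's scores; a minimal computation:

```python
def roc_points(scores, labels):
    """(false-positive rate, true-positive rate) at every threshold,
    for binary labels (1 = malicious) and higher-is-malicious scores."""
    pos = sum(labels)
    neg = len(labels) - pos
    pts = []
    for thr in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= thr and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= thr and y == 0)
        pts.append((fp / neg, tp / pos))
    return pts
```

A perfect classifier reaches a true-positive rate of 1.0 while the false-positive rate is still 0.0, which is the "100%" curve reported here.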

It's nice if you look at the cross-sections, the different colours, which are the new clusters: you can see that the new clusters have been spread across the entire range. So that's the work that we're doing at the minute, and I'm still fairly early on; hopefully I'm only halfway through. So, if I go back over to my live demo,

hopefully you can see at the bottom. This is just the process that we went through: it virtualized the malware, it parsed all the outputs, and it compared them against our machine learning model. You'll see at the bottom here that sample 1 is malware; the tag says that this is malware, that it's a trojan. That's just one model (we can drop any model in), but it's essentially the end-to-end version of the system we used to build the database. So, any questions?

[Question] So the question was: how do you know the malware actually started to work? I really look at the malware itself in execution, but there are other researchers who look at traffic analysis. If we get a good run trace, that essentially is the malware actually working. But we also look at the network traffic; that's not my research area, other people do it. In one night of executing 3,000 samples we filled all the space that we could possibly store of pcaps, so that's how we knew it was working: something was phoning home. Again, some of these malware samples can be a bit old, a few years, so the command-and-control servers might be dying; it may be trying to phone home, it may shut itself down. But again, we're

trying to model what a user would experience, particularly in a virtualized sense: if somebody actually executes this program, this is what they would experience. We're trying to model it from a higher level, just as with the virtualization and the hardening techniques; it's what somebody would actually experience. So if the malware checks to see whether it's being virtualized, then that's the behaviour the user gets. Yeah?

Yes. So the question is: have I come across malware that can detect the debugger? We use a masking tool called StrongOD that essentially masks it. We're not going to beat a hundred percent of them, all of the time, for every instance of malware, but we do use a masking tool, because it is known that you can check whether you're being virtualized and whether you're being debugged. I was able to test it on some legitimate software and some malware; if we get decent, long traces, it means that the malware is running. It may be checking, but again, that's a feature for us to detect.

[Question] The question was about the likes of McAfee offering this sort of thing. I work with CSIT, so among the member companies, for example McAfee, now Intel Security, they do offer that sort of thing, and I've seen a few of them. What I would love is essentially just a machine that I could network-slice, just run malware on, and then scrap it and do it again. With dynamic analysis it takes a long time: if you push 48,000 samples through, it takes a long time, and that's why in the literature the sample sizes tend to be very small. The first paper in this body of literature was 70 samples; you can't really do malware analysis with so few samples.