
We'll record the mic also. You want to turn it on? Test. Test.
We're going. OK, we have reached the last talk in this particular track, and I think we may have saved the best for last. We actually do have a 5 o'clock slot in Red and in Blue 1, but this will be the last talk in this particular track. So thanks, y'all, for sticking it out this afternoon. Our final speaker: I'll just get out of the way and introduce Wes Parker. All right.
Thanks, Bill. This talk was actually written several months ago. So, just a little bit of background, because Bill asked for some background, and I guess I do need to provide a little bit of explanation as to how I got here and all that. I'm actually from the Augusta area. We moved up to Atlanta, and I joined the global threat intelligence team for McAfee up there. That's where my security industry experience really started; that was the beginning of joining big data analytics to threat intelligence. So fast forward, not fast, but a little bit, about five years after that, I moved to Norse to create their malware pipeline. The initial requirements for the malware pipeline were really fuzzy: just get a whole bunch of malware
and do something with it. I mean, that really was it: get some things together, maybe do some stuff. That really was pretty frustrating. Several months later we actually had a working proof of concept, and we're on version 3 of the pipeline now, so we have a pretty good idea of what the pipeline is supposed to do. I wish it had been nice and polished when we started, but eventually I came up with two major goals for the pipeline. The first one is automated malware analysis, automated as opposed to manual. We have a team of analysts in the office; they go do incident response, and they pull apart a piece of malware and
they write up all kinds of incident reports. My job is to provide them as much automated intelligence as I can: looking over their shoulders, seeing what they do every time they open up IDA Pro or something like that, and writing an automated way to generate that intelligence, usually in Python. So that's the first goal, automated analysis. The second goal is taking all the automated analysis from all the samples that we run through and coming up with some machine intelligence. Now, we'll get into the specifics of what machine intelligence means, because it gets used a lot today as a cover-all buzzword. We'll get into
that. The malware pipeline can be divided into three broad categories: the sources, the features that we pull out of the sources, and the machine intelligence, the intelligence that we generate from the features. Each one of these areas has its own peculiarities and its own gotchas. So, for the sources. When we first started looking at the malware pipeline, we had some internal sources at Norse. Just to give you some background on Norse and what we do: the bread and butter of the company is passive sensors. You've probably seen the threat map on Norse's website; granted, a lot of vendors have these things, but that one is driven by hits on the passive sensors. This pipeline,
or this project, is taking the data that comes from those passive sensors, or initially was taking that data, the URLs it surfaced, and trying to find malware off those URLs. And instead of being a torrent like I had hoped it would be, initially it was pretty scanty. So we had to go hunting for malware sources, for samples. There are two major types of sources. One is organic, and that's where you have live, malicious URLs that you're trying to pull data off of. What we found out is we'll ingest something like 500,000 URLs, and out of that, only 20% or so will result in a binary. Malware fights being analyzed at every step of the way.
The first fight is with the server that you pull it off of. The server doesn't want to give you the binary unless it knows that you're going to be infected. So there's all kinds of work with our crawlers; we effectively had to come up with a honey crawler. There's a Python utility that looks really promising, though it's not the one we use: it's called Thug, and it's written in Python. Pretty much all of our tools are written in Python, just because that's where we found most of the frameworks to do this.
You have to present the right user agent. We actually spray our URLs with random user agents, and that provides a good intelligence layer: we can see which user agents are accepted more by malicious sources. It gives us an idea, and we haven't started publishing this information, but we should, of which user agents are more targeted by malicious servers. I can go ahead and give you the result of that: the primary target is IE8 on Windows 7. So, not the very latest, not really old. What I've found is most malware targets about two versions of Windows back. So anyway, organic URL sources: they fight you. They'll send you every error message they
can think of. They don't want to send the file, so they'll send trash data, they'll redirect you indefinitely, they'll keep the ports open. So the first thing we had to do for the malware pipeline was basically take into account failure every time we scrape from an organic source.
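To make that concrete, here's a minimal sketch of the user-agent spraying and expect-failure scraping in Python. The user-agent list, size cap, and the use of the requests library are illustrative assumptions, not Norse's actual crawler:

```python
import random
import requests

# Illustrative pool; a real honey crawler would carry many more.
USER_AGENTS = [
    "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; Trident/4.0)",  # IE8 / Win7
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)",               # IE7 / XP
    "Mozilla/5.0 (Windows NT 6.1; rv:24.0) Gecko/20100101 Firefox/24.0",
]

def fetch_candidate(url):
    """Spray a random user agent at a live URL and assume failure is normal."""
    ua = random.choice(USER_AGENTS)
    try:
        # Short timeout: malicious servers will stall you or redirect forever.
        resp = requests.get(url, headers={"User-Agent": ua},
                            timeout=10, stream=True)
        resp.raise_for_status()
        # Cap the read so the server can't feed us trash data endlessly.
        body = resp.raw.read(16 * 1024 * 1024, decode_content=True)
        return ua, body
    except requests.RequestException:
        return ua, None  # log which agent failed; that's intelligence too
```

Recording which user agent succeeded or failed per URL is what builds the "which agents do malicious servers accept" dataset mentioned above.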
The second thing that we learned with organic sources: when we go to different vendors to ask them for a sample of their organic sources, we have to be able to vet them for how good the sources are. What we found out is that the volume of URLs ingested and the number of valid binaries downloaded, the division of those two, gives us a yield. A good yield is about 20%; average is around 15%.
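As a back-of-the-envelope sketch of that vetting math (the numbers are illustrative):

```python
def source_yield(urls_ingested, valid_binaries):
    """Yield = valid binaries downloaded / URLs ingested from the source."""
    return valid_binaries / urls_ingested

print(source_yield(500_000, 100_000))  # 0.20 -- a good organic source
print(source_yield(500_000, 75_000))   # 0.15 -- about average
```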
The reason for that is most of these URLs that get published get taken down very quickly, by automatic systems, the site operator, whatever. So with organic sources you can get a ton of URLs coming in, but you still only get a relative trickle of binaries. The second type of source is an artificial source, a synthetic source. These are vendors that have a stream of malware that comes from someplace else. What we found out later is that those other places are usually memory dumps, or samples that are submitted through a website, or something like that. These are a lot safer, because they're given to you in a stream. But they're also unreliable, in the sense that if I'm given a memory dump, then that
memory dump usually has an invalid checksum in the header. That means the sample won't just run on the VM, which we'll get to in a little bit. So we have to gauge that source by how many valid, runnable binaries it gives us. The good thing about a synthetic source is that it's a really high yield, and there are several places where you can get a synthetic binary feed. Synthetic sources give you lots of stuff to play with, even historical malware. So once we get a torrent of binaries coming in, we need to treat them all about the same, regardless of which source they came through. Another thing that we found out is that there's a
lot of overlap between the different vendors. So what we do in our pipeline is prioritize the free sources. Look at Mozilla and their Safe Browsing stuff: it's a great source for free feeds, and there are others like that. Then we layer the other sources on after that and look at the delta, the difference between the different sources. What we found out is that at a certain point some vendors just overlap entirely, and then it's just a matter of price: which one's got the better price? So, after we got the malware in... actually, I should back up and say we're indiscriminate in how we pull files down from these URLs. We pull down HTML, JavaScript, images, whatever the server gives us. So
we don't know that it's malware, unless the feed, the synthetic feed, tells us; we'll get to that in a second. So the next thing we have to do is features: pull features out of whatever it is we got. The features are the mining part of data mining. This is where all the work goes in. I don't know if any of you have ever worked with a machine learning system, but this is where the hard part is. The actual implementation of the algorithms is relatively easy; relatively, anyway. Pulling out the features is the hard part. There are three main sets of features that we pull out of our binaries. The first is a simple surface scan.
I say surface scan because, ideally, this should be one pass through the file. Every now and then I want to add more into this initial scan, like REST queries going out to VirusTotal and doing a hash lookup there, and all this other stuff, and that always bogs down the pipeline. So the first step is figuring out the file type, the file size, just basic information about the file. There's not really much calculation involved. The other initial step here is pulling out the strings in the file. Now, many of you have worked with pulling apart malware, and I know that most malware is packed, so you're probably not going to get that much useful stuff out of the strings of a packed sample, but you will get some sense of where it came from.
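A minimal sketch of that one-pass surface scan in Python; the python-magic dependency and the strings regex are assumptions for illustration:

```python
import hashlib
import os
import re

import magic  # python-magic package, assumed here for file typing

PRINTABLE = re.compile(rb"[ -~]{5,}")  # runs of 5+ printable ASCII bytes

def surface_scan(path):
    """One pass, no network calls: type, size, hash, strings."""
    with open(path, "rb") as f:
        data = f.read()
    return {
        "file_type": magic.from_buffer(data),
        "file_size": os.path.getsize(path),
        "sha256": hashlib.sha256(data).hexdigest(),
        "strings": [s.decode("ascii") for s in PRINTABLE.findall(data)],
    }
```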
The second feature set that we pull out, and now we're getting into the real heavy lifting of actual static analysis of malware: based on the file type we pulled out, we send it to a specific static analysis stream. For example, we have a separate stream for PDF analysis, a separate stream for Android analysis, and a separate stream for PE analysis. I'll talk mostly about the PE analysis, the portable executables: DLLs, EXEs, and all that. Those are the Windows files, and that's generally where all the action is. So, the static analysis of a PE file. First, we unpack the file. We call out to several reputation services with a hash of the file and see if they've seen
it before. If they have, there's really no reason to keep doing more static analysis or dynamic analysis, so we're kind of short-circuiting the process. We want to see what DLLs the file reaches out to. We want to fingerprint the file. There's a really great framework for this called peframe, a Python utility, really great.
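Here's a rough sketch of that import fingerprinting and hash lookup step, using the pefile library (which peframe itself builds on); error handling is trimmed for brevity:

```python
import hashlib

import pefile

def pe_imports(path):
    """Which DLLs does the file reach out to, and which functions?"""
    pe = pefile.PE(path)
    imports = {}
    for entry in getattr(pe, "DIRECTORY_ENTRY_IMPORT", []):
        dll = entry.dll.decode(errors="replace").lower()
        imports[dll] = [imp.name.decode(errors="replace")
                        for imp in entry.imports if imp.name]
    return imports

def sample_hash(path):
    """The hash we'd send to reputation services before doing more work."""
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()
```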
Peframe is actually freeware. Most of the stuff we use is freeware, partly because this started as a proof of concept. So we go through the file and do everything short of running it. Some of the more interesting things, the parts that specifically feed machine learning, come from going through the file and generating an abstract syntax tree of the execution flow of the file. The more advanced stuff based on the abstract syntax tree: most malware has short circuits in it, where if it detects certain things about the system it will just shut off, say if it's not able to reach certain servers, or if it detects a VM in the background,
or especially if it detects a debugger; it hits that little short circuit. Generating the abstract syntax tree in the static analysis phase lets you come up with tools to basically patch the malware so it runs all the way through to completion. The ultimate goal here is to find C2 servers, command and control servers. The other feature from static analysis that's really interesting is opcodes. There's a wealth of information in opcodes, the instructions from an assembly language perspective.
Each one of those instructions is an opcode, but you don't take just one; you combine them into what in this kind of analysis is called an n-gram. In natural language processing you use n-grams of words; you use the same basic principle for executables. The ideal size that we found is five, five opcodes at a time, and then we use those as fingerprints through the rest of our analysis.
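A minimal sketch of opcode 5-gram extraction; it assumes the Capstone disassembler, and in practice you'd feed it the unpacked code section rather than raw file bytes:

```python
from capstone import Cs, CS_ARCH_X86, CS_MODE_32

def opcode_ngrams(code, base=0x1000, n=5):
    """Slide a window of n opcodes across the instruction stream, NLP-style."""
    md = Cs(CS_ARCH_X86, CS_MODE_32)
    mnemonics = [insn.mnemonic for insn in md.disasm(code, base)]
    return {tuple(mnemonics[i:i + n])
            for i in range(len(mnemonics) - n + 1)}
```

Each 5-tuple becomes a fingerprint that can be compared across samples regardless of how the file was repacked or rehashed.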
So the last set of features that we generate are dynamic. We actually have a Cuckoo system. Anybody here ever heard of Cuckoo? Yeah. Cuckoo's wonderful for dynamic analysis of malware. The problem is it's not really built for scale, so we had to write some code around it to move the files in and out. The biggest thing is keeping Cuckoo from falling over under the load. Dynamic analysis provides a lot of really good data: you're actually watching the malware run. And if it doesn't short-circuit itself, then it will go all the way through to reaching out to the C2 server, which is ideally what you want to see. Usually that's not the case.
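For a sense of the glue code involved, here's a sketch against Cuckoo's REST API; the host, port, and five-minute timeout are illustrative, and the real scaling work is the queueing and babysitting around calls like these:

```python
import requests

CUCKOO_API = "http://127.0.0.1:8090"  # assumed Cuckoo REST API endpoint

def submit_sample(path, timeout=300):
    """Queue a sample for a ~5 minute detonation run."""
    with open(path, "rb") as f:
        resp = requests.post(f"{CUCKOO_API}/tasks/create/file",
                             files={"file": (path, f)},
                             data={"timeout": timeout})
    return resp.json()["task_id"]

def fetch_report(task_id):
    """Pull the JSON behavior report once the run completes."""
    return requests.get(f"{CUCKOO_API}/tasks/report/{task_id}").json()
```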
One thing that we found out real fast with dynamic analysis is that our guest machine, the one that's injected with the Cuckoo agent, needed a lot of patching to make it look like every other box. There's a great utility out there called Pafish to analyze your guest machine and help keep it from being detected by the malware sample as much as possible. So this is really a cat-and-mouse game: I'm trying not to be detected as I run a malware sample in my sandbox, and it's trying to detect me, back and forth. That's why I mentioned, back in the static analysis part, that it would be great
if we were able to patch the malware like a piece of software and get it to run all the way through. Another interesting part about this: I now have a Cuckoo cluster that I administer, each box runs eight samples at a time, and I didn't isolate every VM instance. So one thing that I've found is that you can actually watch viruses attack other viruses across the VM network space, which is kind of fascinating. So, the last part: once you get all these features generated and stored in some system that can handle it, now comes the real fun part of this, the intelligence: coming up with a machine learning algorithm that's able to generate models that
fit your data. So machine learning gets thrown around as a buzzword in a lot of sales pitches. In particular, there are two things that we want to do with machine learning: categorization and prediction. We get a lot of unknown binaries from the organic sources, and we need to figure out whether they're bad. We see files coming across the network, we see file hashes, and we need a way to say: this is probably good, this is probably bad. This is the problem space that every AV vendor is in. The other thing is that we need to categorize malware, because we may see
hundreds of thousands of samples a day, but only a small fraction of those are new samples. Everything else is really just repackaged, rehashed. There was a story out today about the Kaspersky thing. Did y'all read about the Kaspersky news article? No? Two years ago, Microsoft accused Kaspersky of generating their own malware samples. Not infecting their clients, but generating new malware samples and uploading them to VirusTotal to skew VirusTotal's results about which AV vendor was more accurate. Effectively, what they were doing was, back in the static analysis phase I showed you earlier, they would take the file and change one bit, which would change the entire hash of the file, and they'd
go right back to VirusTotal, and it would get rated as bad, which of course it was; they knew it. So, machine learning: two main roles, categorization and prediction. For categorization, we generally don't know what we're looking for; we don't know how many categories of malware there are. So what we're doing is fingerprinting files, and entropy is what we use: just a basic Shannon entropy calculation on the files, and on their different sections. If it's a file like a PE file, then we can break out the sections in there and calculate the entropy of each one. That gives us a good vector of how similar files are to each other.
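A minimal sketch of that per-section entropy fingerprint; pefile is assumed (it even ships its own get_entropy helper, but the calculation is short enough to spell out):

```python
import math
from collections import Counter

import pefile

def shannon_entropy(data):
    """Basic Shannon entropy, in bits per byte, of a byte string."""
    if not data:
        return 0.0
    counts = Counter(data)
    total = len(data)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def entropy_vector(path):
    """One entropy value per PE section: the similarity fingerprint."""
    pe = pefile.PE(path)
    return [shannon_entropy(s.get_data()) for s in pe.sections]
```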
So categorization is unsupervised learning. We let the algorithm just generate however many clusters it finds. And that's what we use to determine whether we have what we call an emerging hash, something coming out that we haven't seen: it doesn't fit into any of the clusters that we have so far, so it's something new. That's how we generate that.
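The talk doesn't name the exact clustering algorithm, so as an illustration here's the idea with scikit-learn's DBSCAN, which decides the number of clusters on its own; it assumes the entropy vectors have been padded to a fixed length:

```python
import numpy as np
from sklearn.cluster import DBSCAN

def cluster_samples(entropy_vectors):
    """Unsupervised clustering; the algorithm picks how many clusters exist."""
    X = np.array(entropy_vectors)  # assumes fixed-length, padded vectors
    labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)
    # Label -1 means the sample fits no existing cluster: a candidate
    # "emerging" family we haven't seen before.
    return labels
```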
The second type, the last type of machine learning that we've used so far, is prediction. Prediction, like I said, is good or bad, and it's really aided if you have some authoritative source to tell you that this sample, this data, is bad. Going back to the synthetic feeds: most of those vendors will give you files already marked as bad. Or you can go off to VirusTotal, and there are a few other places where you can look for an authoritative result. So you've got bad files. Then, one thing that we do in our dynamic analysis is inject known benign activity: we actually take the top Alexa sites and run them through Internet Explorer. That generates a lot of noise, a lot of traffic, and all of that traffic is marked as benign. So out of that you've got known good and known bad, and that's the supervised set, the supervisory set of data, that we run through our
classifier. That gives us a good model to run over the rest of our unknown files and say: this one's probably bad, this one's probably good. Now, the cool thing about the pipeline that we've set up, and really the whole point of this talk, is that if you're analyzing or seeing files across your systems, you could set up the same sort of pipeline and come up with a way to prioritize some files over others for analysis. This is where the predictive analytics also comes in: a high score gets prioritized to go through the dynamic analysis and static analysis. Low score, low priority.
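A sketch of that supervised side with scikit-learn; the choice of a random forest here is an assumption for illustration, as the talk doesn't name the classifier:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def train(features, labels):
    """labels: 1 = known bad (vendor-marked), 0 = known benign (Alexa noise)."""
    return RandomForestClassifier(n_estimators=100).fit(features, labels)

def prioritize(model, unknown_features):
    """Order unknown samples so the most suspicious get the slow analyses first."""
    scores = model.predict_proba(unknown_features)[:, 1]  # P(malicious)
    return np.argsort(scores)[::-1]  # highest score, highest priority
```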
The reason is that every step of the feature generation, surface, static, and dynamic, gets progressively slower, by orders of magnitude. So we actually use our own pipeline to feed back into the prediction sets; theoretically we're getting smarter over time, though we still have to work on some of that, and there are papers around that. Anyway, that's machine learning. It's wonderful. There's actually a ton of technology involved with it. I'll mention a few of the frameworks and some of the systems behind it, and if you're interested in the nuts and bolts of the technology, feel free to ask. I'll go ahead and tell you that there are three clusters that we use for the pipeline: the control cluster, the
Cuckoo cluster for running dynamic analysis, and the Enbar cluster, which is our Hadoop system for machine learning. So all the tools that you'd use for cloud maintenance or cloud automation, we have to use here to hold everything together. One of the big differences, especially with our Cuckoo cluster, is that when we pull samples off, we assume those servers are tainted somehow. So we had to come up with automation to wipe them all out, reinstall them automatically, and bring them all back up.
So with that, any questions? How often do you have to recalibrate when you run through? Like, does it go through smoothly one time and then... It's pretty lumpy. The question is how often we have to recalibrate the dynamic analysis. And the answer there is that
the yield from dynamic analysis varies greatly. What we do is, every four hours I wipe out all the boxes; I recycle them every four hours. I've played with that interval: a week was way too long, and an hour was way too short to get a good yield out of it. Part of it is that a sample will run, and we give it a good five minutes to run. One thing we also noticed is that many samples are time-based, so we have to effectively time-warp the system. As the run goes on, the clock skips forward in time, and the sample actually takes the bait. So by the end of the run, it
should be like five weeks in the future, and hopefully by then it will have decided to run. But time-based execution also goes back to the static analysis and patching the malware, because that's another challenge.
So it's still trial and error? Yeah, it is. Because, like I said, every step of the way it's fighting you. The only step where it's not fighting you is after you've generated the features; that's the part I love, because the fight is finally over.
Are you using some kind of taxonomy, like STIX or something like that, once you pull out IPs and other things from the sample? Yeah, great question: whether or not we use a taxonomy like STIX. We actually use STIX and OpenIOC, generating those from both static analysis and dynamic analysis. We don't have a reconciliation process just yet for the two different types of files that we generate, plus manual analysis. But if one of our analysts comes along and generates their own STIX file, this actually helps them out, because it gives them a head start on the STIX generation.
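For flavor, here's what automated indicator generation can look like. The talk predates STIX 2, so this uses today's stix2 library as a stand-in, and the C2 address is made up:

```python
from stix2 import Indicator

def c2_indicator(ip):
    """Wrap a C2 address recovered from analysis as a STIX indicator."""
    return Indicator(
        name=f"C2 server {ip}",
        pattern=f"[ipv4-addr:value = '{ip}']",
        pattern_type="stix",
    )

print(c2_indicator("203.0.113.7").serialize(pretty=True))
```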
Do you store those in a database, or do you just keep flat files per sample? It's kind of a dorky subject, but yeah, we're storing these as flat files right now, and we're looking to turn that into graph storage,
which is something that I would highly recommend. If you go with a graph storage system, it's very good: probably Neo4j or TitanDB, something like that, depending on how large it gets.
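As a sketch of what moving those flat files into a graph could look like, here's the idea with the official Neo4j Python driver; the URI, credentials, and schema are placeholders:

```python
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687",
                              auth=("neo4j", "password"))  # placeholders

def link_sample_to_c2(sha256, ip):
    """MERGE keeps the graph deduplicated as the same IOCs recur."""
    with driver.session() as session:
        session.run(
            "MERGE (s:Sample {sha256: $sha256}) "
            "MERGE (c:C2 {ip: $ip}) "
            "MERGE (s)-[:BEACONS_TO]->(c)",
            sha256=sha256, ip=ip,
        )
```

The MERGE semantics are also part of why a graph helps with the normalization question that comes up in a moment: recurring IPs and samples collapse into single nodes.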
So when you find repetitive malware, malware that's out there that you know is repetitive, now you've got the signature for that particular one. Is that what you push out to customers? Yeah, we sell a feed of malware intelligence right now. By the way, one problem I forgot to mention on the organic sources: once a server is infected, pretty much every URL you try on that server is going to be malicious. That's one thing I've seen. So the crawler itself has to be smart enough to know: I'm getting strung along by this server, I need to bail. It's returning the same file 50,000 times.
So in essence, you've got to create a normalization table for it too. Yes, and on normalization, that's where the graph database comes in very handy. Just a quick question: you said you were running eight samples at five minutes apiece. How many boxes? Eight, right now. I think we're scaling up; there's talk about it. Personally, even though dynamic analysis is really sexy, I like static analysis, because I think it scales a lot better. It's something that you can run on hardware like GPUs to actually boost your processing power. Dynamic analysis looks nice...
So when you look at samples via entropy per section of the PE file, you're saying that's a good way for you to cluster? Just because, you know, if sample A and B have the same entropy per section, is that how you do it? If they have similar entropy, yeah. And it's actually more than that: the machine learning part is not just one-to-one on entropy, it's entropy in similar groupings of sections. So one of the first routes I took with this was trying to set up a neural network for that. A neural network, real fast: you have your inputs, you have your outputs, and in the middle you have hidden layers. Those hidden layers, the way
they explain it, are like synapses in the brain: these two inputs, when seen together, usually give a strong correlation to this output. That's really simplified, but that's where the groupings of entropy come from. To give you an idea of the features, and I spent a lot of time on this, an average or decent set of features would be somewhere around 50 to 100,000. If you look at that from a relational point of view, that would be like 50 to 100,000 columns in a database, which is obviously not going to work well for something like MySQL or any other relational database. That's why specialized key-value stores like HBase exist. I don't
think Cassandra's really geared toward it, but you do see those really wide stores there too.
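A minimal sketch of that hidden-layer idea with scikit-learn's MLPClassifier; the layer sizes are illustrative, and the inputs would be the feature vectors described above (entropy groupings, n-gram flags, and so on):

```python
from sklearn.neural_network import MLPClassifier

# Two hidden layers; each learns which input combinations correlate
# strongly with the output, synapse-style.
model = MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=500)
# model.fit(features, labels)   # labels: malicious / benign
```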
But that's why we generate so many features: so that we can come up with those hidden factors. Going back to your question, entropy, I've found, is the best one for clustering. There are also things like language; that's why you're pulling strings out and then figuring out what the dominant language, or mixture of languages, is. I think it was the Sony hack that came through Korea. Malware is actually an agglomeration of several different pieces of code: they'll get a packer from over here, and they'll get a domain generator from over there, and a few other pieces. That's why I say it's helpful to go through static analysis, to go ahead and just say, hey,
this is an off-the-shelf DGA, so don't worry about going into that, we've already seen it. That's what saves our analysts time. It's really good stuff; you get mountains and mountains of data. It's also maddening, too. The whole title of the talk came from, well, first of all, the requirements being really vague in the beginning. Second, every step of the way is just a whole area by itself. Static analysis is a whole field.
Setting up the dynamic analysis is just a challenge in and of itself, scripting the Windows installations and all that stuff and trying to hide yourself from the malware that's running. Then, on top of all that, there's a cloud setup where you've got hundreds of thousands of samples coming in, and you're moving them through the process and getting them back out.
Come on up quick and help us give away some goodies. Does anybody know the name of the product that the Norse map is attached to? For a lock pick kit? For a lock pick kit. What was the question? The name of the product that the map is attached to. Norse... I don't know who pays attention to the product. I heard it's Norse mythology.
Well, we do lean on Norse mythology internally. The name of the pipeline is Ginnungagap: G, G. The Norse god of...? That would be Loki. Not Loki, but I'll have to give it to you. Ginnungagap, G, G. They definitely would not have named it Loki, I'm pretty sure.
Got another one? Oh, um... I'll put you on the spot. So, okay, how about, for a lock pick kit, first hand: which year of B-Sides Augusta is this? Second. Okay, right here in the red. Third. Third, correct. How about for a lock pick kit... okay, you're from Southward or Southern, I'm not really sure how you pronounce it. So, for a Blue Team Handbook, Incident Response Edition, I just have to know: what are you knitting? A shawl. A shawl? Yes. Okay. While you're in here learning blue team stuff, I think you deserve a Blue Team Handbook. Just for knitting, and that's all. That's the first time I've ever closed with that one.
Good. Right. So, that's all. Thank you. Sorry to put you on the spot.
Okay. And our final thing, just one more.
This is for a disk duplicator. Does anybody have a need for that?
Like a real disk duplicator. I don't know how it counts. Yeah, like, why do you need that anymore? It's like if you're working on antiquated technology, for backwards compatibility. It's a trap! That's it. So we're going to have to go... I guess we'll just go easy: standard TCP port for LDAPS? 636.
Go ahead, Trent. 636. You guys can argue over it. OK, thanks, guys, for coming out. Thanks, Wes.