
Automating Bulk Intelligence Collection

BSides Charm · 2017 · 26:33 · Published 2021-05
About this talk
SOCs face massive volumes of logs, alerts, and suspect files that make manual analysis impossible. This talk covers techniques to automate data mining and malware analysis to extract key indicators of active threats, including strategies to tune automation for accuracy, manage false positives, and derive organization-specific intelligence from attack campaigns.
Original YouTube description
Automating Bulk Intelligence Collection Every SOC is deluged by massive amounts of logs, suspect files, alerts and data that make it impossible to respond to everything. It is essential to find the signal in the noise to be able to best protect an organization. This talk will cover techniques to automate the processing of data mining malware to derive key indicators to find active threats against an enterprise. Techniques will be discussed covering how to tune the automation to avoid false positives and the many struggles we have had in creating appropriate whitelists. We’ll also discuss techniques for organizations to find and process intelligence for attacks targeting them specifically that no vendor can sell or provide them. Presenter: Gita Ziabari Gita Ziabari is working at Fidelis Cybersecurity as a Senior Threat Research Engineer. She has more than 13 years of experience in threat research, networking, testing and building automated frameworks.
Transcript [en]

Hi everyone, thanks for coming to my presentation. My name is Gita Ziabari and I work at Fidelis Cybersecurity. I'm going to talk about automating bulk intelligence collection: the concept of automation, when automation is needed, and techniques to automate the processing of data-mining malware. Then I'll introduce a tool called YALDA that is used for data mining, give an overview of its algorithm, and share the GitHub link so you can download it for free. I'll also show you how to use it, which is

pretty simple: all you need to do is give it the files, and it will analyze them for you. So what is the concept of automation, and when do we need to automate something? That is the main question we need to ask ourselves before we write automation.

If you are repeating a procedure over and over, you should consider that automation is needed. How many of you here do automation on a daily basis? And how many of you use tools for analyzing your data? Of course. Even though the result of each manual test or manual analysis will be different, the procedure you follow is roughly the same every time, so at least part of it

could be automated for extracting the data, and then you could start analyzing. If you have large-scale data, automation is needed, because it's impossible to analyze every single file manually. Human error is another factor that automation can eliminate. Now, you may eliminate human error but introduce a lot of false positives instead; how is that going to be controlled? I'm going to talk about that shortly. And of course, when we have something automated that actually works and gives valuable

results, it just makes our life easier.

We need to consider that anything can be automated, even bad ideas, but bad ideas usually give you bad results: lots of false positives. You spend the time, and all you get back is bad results, meaning false positives. You want to avoid that, and to avoid it, you need to apply some intelligence in your automation.

If you want to do data mining, you have large-scale data and a lot of information, and there is so much malware out there that it's impossible to analyze every single sample. We know this ourselves; it's just impossible. Every day there is new malware, a lot of it, and new campaigns are being built; even the best team of analysts can't analyze every one. So automation is needed. If you want to plan for automation, you need to critically examine the information: what information are you going to extract? You need to define it and see what is needed.

Draw meaningful, actionable conclusions based on observations and information, and define the expected results. That is the key factor: you need to know what you are looking for. You are looking for malicious data, but what exactly, what information are you interested in extracting? Define it, make a checklist for yourself, and then develop the automated framework. Data mining is like finding a needle in a haystack. You have to take baby steps and be prepared for what you want to do. So start by analyzing the given samples manually, carefully, one by one, and see which keys are interesting. Make a platform for extracting those keys. Just write:

okay, I am interested in extracting these types of keys. For example, I can find domains, URLs, executables. What am I supposed to get, and what am I interested in extracting from the data I'm analyzing?

Analyze the data based on the obtained results. Whatever you find, write it down: okay, I'm interested in these keys. Have something for yourself before you start scripting. Then, taking these baby steps, start writing an automated tool that extracts data based on the obtained results.
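Those baby steps can be sketched as a minimal extraction pipeline. This is only an illustration of the idea, not YALDA's actual code; the magic-byte sniffing and field names are my own assumptions:

```python
import hashlib
from pathlib import Path

def detect_type(data: bytes) -> str:
    """Crude magic-byte sniffing; a real tool would use python-magic or similar."""
    if data.startswith(b"%PDF"):
        return "pdf"
    if data.startswith(b"{\\rtf"):
        return "rtf"
    if data.startswith(b"PK\x03\x04"):
        return "zip"
    return "unknown"

def analyze_file(path: Path) -> dict:
    """Build one record per file: the keys you defined in your checklist."""
    data = path.read_bytes()
    return {
        "sha256": hashlib.sha256(data).hexdigest(),
        "file_size": len(data),
        "file_type": detect_type(data),
    }

def analyze_dir(directory: Path):
    """Walk a sample directory and return one record per file."""
    return [analyze_file(p) for p in directory.rglob("*") if p.is_file()]
```

The file type drives everything downstream, because each type gets its own decoder; the other keys fill in the rest of the record.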

Here is an example of the checklist. Based on your findings and whatever research you want to do, you need a checklist: I want to write a tool that extracts these keys. Of course, not every key applies to all of the malware and data you are analyzing. Some of it is in MIME format, some in JSON format, some has embedded objects; different files give you different information. The main thing is to define the keys you're interested in, in a dictionary format. For example, for this particular case I'm interested in the file

type, because the decoder I'm going to write is based on the file type. Then the file size, the SHA-256, embedded objects in the file, executables in the file, and any URLs or domains. So make something like this a dictionary. I'm a fan of Python for scripting, and when I write it down I say: okay, I'm looking for this key. Define the keys; the values are what your automated tool is going to extract from the file. Some keys may have just one value; for others you may get a list of the information you're extracting. And you can construct the

information, and this is going to be the expected result that you get. Consider inserting the data into a database. That's another important factor, because when the data is in a database you can extract it whenever you want: you have a safe place to store it, and based on the query you want to run, you can search on a particular file type, or hashes with a particular file size, or hashes that have executables or domains in them. Whatever information you want, you can automatically start extracting it, which is quite cool. As I said, start with manual analysis. But when you are

doing manual analysis, it's very important to take the right path and the right direction.

You need to consider that every malware family has its own structure; they are not all the same. To define the pattern, you need a deeper understanding of what is specific about this family. Is it delivered as phishing in emails? What file type is being sent? What information could I extract? Some of the variables and configuration items you see are not necessarily valuable; they may just use default values, and you may want to leave them out when you write the pattern for your automated tool.
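One way to keep per-family patterns is a small table of family-specific regexes; the family name and patterns below are invented for the sketch, derived the way the talk describes: by manually analyzing samples first.

```python
import re

# Illustrative per-family extraction patterns (names and regexes are
# assumptions for this sketch, not real campaign signatures).
FAMILY_PATTERNS = {
    "usps_phish": {
        "attachment": re.compile(r"\.wsf$", re.I),           # delivery artifact
        "url": re.compile(r"https?://[^\s\"')>]+", re.I),    # IOC to extract
    },
}

def extract_iocs(family: str, text: str):
    """Apply only this family's URL pattern; default-valued config noise
    that carries no intelligence is simply not in the table."""
    return FAMILY_PATTERNS[family]["url"].findall(text)

print(extract_iocs("usps_phish", 'href="http://evil.example/track.php"'))
```

Keeping the patterns per family, rather than one generic regex for everything, is what keeps the false-positive rate manageable.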

This is from a campaign that was being used, analyzing the body of an email. A third party provided this to me in JSON format, MIME style. When I started looking at the files, the body of the email claimed to be from USPS Ground, but the source, the From address, was from Russia, if you take a look at it. It says: go ahead and download this attachment, and you'll see the shipment. Extracting the embedded objects, I saw the attachment comes as a zip containing a WSF file. Analyzing the WSF file, it

gave me a lot of malicious domains. Here it is. So see?
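A minimal sketch of that decoding step, assuming the third-party feed wraps each base64-encoded MIME message in a JSON field (the field name `raw` and the message contents here are assumptions for the sketch):

```python
import base64
import email
import json

# Hypothetical feed record standing in for the third-party JSON/MIME format.
record_json = json.dumps({"raw": base64.b64encode(
    b"From: tracking@usps-ground.example\r\n"
    b"Subject: Shipment notice\r\n"
    b"\r\n"
    b"See attachment.\r\n").decode()})

msg = email.message_from_bytes(base64.b64decode(json.loads(record_json)["raw"]))
sender = msg["From"]          # feed this to a sender-country blacklist check

attachments = []
for part in msg.walk():
    name = part.get_filename()
    if name and name.lower().endswith((".zip", ".wsf")):
        # bytes of the suspect attachment, ready for the WSF analyzer
        attachments.append((name, part.get_payload(decode=True)))
```

From here, the WSF bytes would go to a domain-extraction pass, exactly the parser-per-format approach described next.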

For my particular data extraction, because a lot of JSON-format data is being passed to me, this is a good place to start. I can look at one sample, analyze it, and build a platform: domains will be my key, and the list of domains will be my values for this case. So all I need to do is write a parser. Let's review what we have so far for analyzing this set of malware, a set of JSON files submitted to us. We see that

it repeats the same pattern for this particular campaign: I have JSON MIME, and I'm analyzing the keys. I can get the From address and build up a blacklist of suspicious sender countries; I would say Russia could be suspicious, or China, or North Korea. The body is base64 encoded, so you can decode it, get the analyzed body, and download the attachments. Analyzing the files, if an attachment is a WSF, you can extract the domains. So, baby steps are taken for extracting the information. Now look at the chain. The first link in the chain is the most important part when you have

a suspicious or malicious file. What campaigns usually do, and they're getting pretty smart now, is give you a file that contains a link; it downloads something from that link; the downloaded file has another link, executes it, and fetches another executable. But the most important thing to extract is the very first file that leads to the chain. So detect the chain, find the first link, and extract from it. Here is an email campaign featuring a PDF attachment.

As of now they are actually quite active and sending emails. This is what it does: it's a zip file with a PDF in it. When you open the PDF, it has a URL link that downloads a JavaScript file; the JavaScript file itself has a link and downloads an executable file. So the first thing we need to detect and extract is the PDF. Now guess what: the PDF is fully undetectable by AV engines; no AV engine is able to detect it. Here is what I'm going to share with you: if you run `strings -a` on the PDF file, you would see that, okay, it has a domain.
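A minimal sketch of pulling that URL out of the raw PDF bytes with a regex. This mirrors the `strings -a` approach and assumes the objects are uncompressed; real samples often compress object streams, so a proper PDF parser may be needed in practice:

```python
import re

# The /URI action object layout (/Type /Action /S /URI /URI (...)) is
# standard PDF; the sample bytes below are invented for the sketch.
URI_RE = re.compile(rb"/Type\s*/Action\s*/S\s*/URI\s*/URI\s*\((.*?)\)", re.S)

def extract_pdf_urls(data: bytes):
    """Return every /URI target found in raw PDF bytes."""
    return [m.decode("latin-1") for m in URI_RE.findall(data)]

sample = (b"<< /Type /Action /S /URI /URI (http://malicious.example/invoice.js) >>\n"
          b"endobj\n")
print(extract_pdf_urls(sample))
```

The moment the regex hits, you have the first link of the chain and can block the campaign at the PDF, before the JavaScript or the executable ever arrives.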

The strings output shows an object with a /Type /Action entry referring to a /URI, then the URL, ending with endobj. You can write a regex for it, parse and decode the PDF, and extract the URL. Then in your analysis you can say: it has a malicious domain, so it's malicious, basically, and you can stop it right there, as soon as you get the PDF. There are a lot of campaigns going on, and very good information comes out on a daily basis, where the analysis has already been done by someone else. Then

what you can do is just go ahead: all the detailed information is available. All you need to do is read it, find the pattern, and write it down for scripting. So: analyze known malicious samples for a fingerprint. How many of you have heard of CVE-2017-0199?

Okay, it's pretty new. Microsoft has now patched it, but it's an RTF document where the OLE object executes a particular URL; it has a link that goes and downloads a document. Again, when I look at it, I see the pattern: it has a link, and the end of it is doc. So HTTP, and doc at the end of the link, in an RTF document. So again, I'm going to write an analyzer for RTF, a sort of decoder, and extract that HTTP-doc pattern. These are the main things I'm going to grab; this is just the regex I'm going to

use for my analysis. And this is the comment section that appears in all the files that come with this campaign. When you write that down, you're able to detect this particular family, for which the analysis was done by other analysts. Now, there are a lot of tools you can also use in your analyzers to help you arrive at a pattern for automation; use them. Foremost is one that is really valuable, at least for me, because it can extract embedded objects in an automated way. So go ahead and use it. Here is an example: this hash is quite malicious, and when I applied Foremost, it found a lot of

files in it, in different formats. Right away I would say: DLL and EXE, there is something wrong with this file. And there is actually something wrong with these files; they are also malicious. I ran it in the Cuckoo sandbox and found that it copies itself to AppData\Local\Temp, begins using FindFirstFile and opening files, reads each file before overwriting it, opens the file it wants to infect for writing, and writes the virus. It writes over the original file, and it appears to attempt to infect all files, not just the executable files. Now it's time to connect the pieces.

You categorize the information you have collected, and now: how can you connect all of it to build your framework? For the path I chose, the file type is the main thing, the first step I look for; based on the file type, I start analyzing in more detail and extracting the data: domains, embedded objects, anything interesting I can extract from it. Of course, you can get the SHA-256, the file type, the file size, and some extra information to build up a nice dictionary, or for inserting into your

database. You have a lot of keys you could add, and you can use it like: okay, this file is malicious, or this file is suspicious, with this information. Then if you are running another piece of research, you can say: I'm interested only in PDF files, run a database query on the PDF file format, and get all of the information that has been extracted.
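In MongoDB that query would be along the lines of `collection.find({"file_type": "pdf"})`; here is the same idea sketched over a plain list of record dicts, with field names and values invented for the illustration:

```python
# Stand-in for a MongoDB collection: one dict per analyzed file.
records = [
    {"sha256": "a" * 64, "file_type": "pdf", "file_size": 48732,
     "domains": ["malicious.example"], "flag": "malicious"},
    {"sha256": "b" * 64, "file_type": "rtf", "file_size": 9120,
     "domains": [], "flag": "suspicious"},
]

# Equivalent of db.samples.find({"file_type": "pdf"}):
pdf_hits = [r for r in records if r["file_type"] == "pdf"]

# Or: hashes whose extracted domain list is non-empty.
with_domains = [r["sha256"] for r in records if r["domains"]]

print(len(pdf_hits), len(with_domains))
```

Because every record carries the same keys, any later research question becomes a one-line query instead of a re-analysis.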

So, baby steps are very important. Now I'm introducing YALDA. It's an automated tool based on what I just described for you: a data-mining tool for extracting malicious data such as URLs, domains, and embedded objects. It can be used as a file scanner for detecting whether a file is malicious, as a tool to obtain data categorized by file format, as I just described, or as a base tool for any research: if you're doing research and want more information about the data you're analyzing, you can use it. It's written in a way that lets you add more functions

and use it based on your own needs. It's very easy to read and follow; use it for whatever needs you have and just start expanding it. It can also be used as a testing tool to analyze detection ratio. It's not an AV engine; it has a long, long way to go to become an AV engine. It's just for analyzing files, but I'm going to add more features in the near future. I'll share the GitHub link; go ahead, download it, and use it. You can add more functions to it. Check the link on a weekly basis, or follow me on Twitter.

And I'll let you know when a new feature is added. The main thing it does: it takes the files in the directory you pass it, decodes each one, analyzes it based on the file format, and extracts objects based on the patterns I found in my research. Then it starts mining. You give it files in a given directory. If it detects mail-format data, it decodes it and gets the mail attachments; if there are subdirectories or compressed files, it gets the actual files out. Then it extracts the embedded objects

in the file and decodes it, if I have a decoder for that file type. Based on the information it gets, it records URL, domain, file size, file type, SHA-256, some attributes, and a flag for whether it is malicious, based on the analysis. Again, data storage is really important, because you can then query on the information you're grabbing: file type, file size, SHA-256, the list of domains and URLs that were extracted, and the embedded objects. If there is an executable object, I would call it malicious; and if it

looks suspicious, the flag is raised as suspicious, et cetera. Now we have all of this information. If you want to use the domains, for example, they are not necessarily malicious; it's based on the pattern we follow, so you need to apply some quality control. There are lots of open-source domain lists out there you could use, or, based on the category you need, you can buy the filters that are available and apply those feeds as a filter to your tool and start using it.
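That quality-control step can be as simple as dropping anything on an allowlist before treating it as an indicator. A sketch, with the allowlist contents invented here; in practice it would come from an open-source top-sites feed or a purchased filter, as just described:

```python
# Illustrative allowlist; a real one would be loaded from a maintained feed.
ALLOWLIST = {"google.com", "microsoft.com", "usps.com"}

def filter_domains(extracted):
    """Keep only extracted domains that are not on the allowlist.
    Normalization (case, trailing dot) matters, or benign hits leak through."""
    return [d for d in extracted if d.lower().rstrip(".") not in ALLOWLIST]

print(filter_domains(["Google.com", "malicious.example"]))
```

This is exactly the whitelist tuning mentioned in the talk description: without it, the automation drowns in false positives.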

In conclusion: if you want to do data extraction, what matters is knowing exactly what steps you want to take and which keys are interesting for you to extract. And as I mentioned, take baby steps and find the right pattern. Not everything you see is necessarily usable, but some of the variables and keys are very interesting, so extract those, build the pattern, start automating just that particular pattern, and then connect all the dots. This is the GitHub for YALDA if you

want to start using it. It's in the Fidelis Cybersecurity

GitHub, and you can go ahead and use it.

To use the tool, you need Python 2.7.6 or later; I tested it on Linux. The Python modules to download are the ones listed here, and of course you need MongoDB, because it inserts all the information into MongoDB. Using YALDA is pretty simple. All you need to do is write some basic information into it: you specify the data directory, the place where you put your samples, and a directory where you want the mail attachments to be downloaded (it doesn't matter if that directory doesn't exist yet). And if you want it to clean up, so that all of the files that have already been

processed and entered in your database are cleared from the directory for a new set, you set that to one; if you want to keep the data, leave it as zero. Then you give it your MongoDB information. It will create a new collection called phishing domain and do the job for you, analyzing the files and inserting them into the database. Follow me on Twitter; if there is any update, I'll let you know, and there will be more updates on this one. So thank you so much, everyone. Questions?

Not with YALDA, actually, but if you go to the Fidelis GitHub, I'm pretty sure we have a tool that is more advanced; it's called BarnCat, and it does do that job and you can get the information. But this one is just for analyzing the samples that you provide. Anyone else? Okay. Thank you.