
The BSides DC 2016 videos are brought to you by ClearedJobs.Net and CyberSecJobs.com, tools for your next career move, and Antietam Technologies, focusing on advanced cyber detection, analysis, and mitigation.

Thank you for coming to this talk. I'm very excited to give it. Just to get started: I want this to be a more interactive session, and I hope it's a learning experience.
This is a standard disclaimer: the opinions are my own and don't represent any other entity. A quick review of the agenda. Excuse me, I'm still recovering from a cold; in case my voice is not clear, please let me know and I can repeat. First, a quick introduction. I have more than 16 years of experience in information security operations, mostly on the blue team, taking care of incident response, vulnerability management, and the rest of the operations side, primarily in the financial services industry. The key takeaways from this session, in my view, are two important aspects: the ability to use graph analytical techniques, and the ability to use machine learning techniques, to solve many of our problems.
I hope this session gives you some more insight into how to approach those two aspects, and as we move along I'll give some pointers to tools that will help you move ahead with them. Some basic introduction first: all of us are very familiar with threat hunting and have been discussing it for a while. Proactively searching for threats is the essence of threat hunting. If you look at the maturity model shown here, which is from a SANS analyst paper, it highlights how visualization and machine learning can help as maturity improves.
So now that we understand what threat hunting is and why we need it, it's also important to understand the threat hunting mindset. The funny part: yesterday I was preparing for this, and my son came in and said he didn't understand what this was. So I told him: if somebody calls 911, that's alert driven; if a police patrol consciously goes out looking for suspicious activity, that's exploration-driven hunting. I thought that spontaneous example was very good. I don't want to go in depth into each and every aspect; this is from a blog by Anton Chuvakin at Gartner, and I think the slides will be available after the talk. The key idea: all of us are familiar with hunting the known bad stuff, the YARA rules, the IDS signatures, and now we are getting into a phase where we want to identify activity that isn't necessarily signature based. There are many reports showing that some threat actors barely use malware at all; only a small percentage of their activity involves malware, and everything else uses standard built-in Windows tools, PowerShell being one of many examples. So if your administrators use Windows tools and the bad guys use the same Windows tools, we need the capability to distinguish good from bad based on the context, and I think that's the key.
So we've discussed what threat hunting is and the mindset we need to be in. For us to be successful, we need a platform that lets the security team hunt effectively and optimally. At a high level there are two capabilities. One is the capability to pivot: having graph visualization and understanding the context around all the different indicators is very, very important, and as we move along I'll show you some of the key advantages of having a graph. The other most important aspect is the capability to integrate advanced scripts, basically Python (SciPy) or R scripts, into whatever platform is used to store the data. Logically, this is how we can think about it. You have your data sources, and we'll quickly discuss which data sources to focus on, and you capture all of them into a big data analytics platform. The key thing to remember is to collect structured, semi-structured, and unstructured data, and not focus only on machine data, because some of our contextual information lives in unstructured data. So as much as possible, collect all of it.
Moving up the framework: this diagram is based on Apache Spark, but in reality it could be Hadoop with other machine learning libraries on top; the principle is the same. You have a big data platform, and on top of it a machine learning library (Apache Spark uses MLlib) and a graph layer (Apache Spark uses GraphX for property graphs). The idea is to map your data sources to a standard data model or to one of your own. For you to have context, all your data needs to be in the same normalized pattern, and the way you do it is either to adopt an industry-standard model or to define your own framework and map every source back to it, so that you have consistent data throughout. Once you have that, you can use it to run any of your scripts.
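To make that normalization idea concrete, here is a minimal Python sketch of mapping two hypothetical log formats into one common schema; the field names are my own illustration, not any particular standard.

```python
# A minimal sketch of mapping heterogeneous events into one common schema.
# Field names are hypothetical, not from any specific standard.

COMMON_FIELDS = ["timestamp", "src_ip", "dst_ip", "user", "action", "source"]

def normalize_proxy(event: dict) -> dict:
    """Map a (hypothetical) proxy log record to the common model."""
    return {
        "timestamp": event["ts"],
        "src_ip": event["client"],
        "dst_ip": event["server"],
        "user": event.get("username"),
        "action": event["method"],
        "source": "proxy",
    }

def normalize_windows(event: dict) -> dict:
    """Map a (hypothetical) Windows authentication record to the common model."""
    return {
        "timestamp": event["TimeCreated"],
        "src_ip": event.get("IpAddress"),
        "dst_ip": None,
        "user": event.get("TargetUserName"),
        "action": "logon" if event["EventID"] == 4624 else "logon_failure",
        "source": "windows",
    }
```

Once every source passes through a mapper like this, any downstream script can query one consistent set of fields regardless of where the event came from.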
Apache Metron is just one example; there are many others. I wanted to highlight it because it shows all the different aspects that need to be present in a system: if you want a good threat hunting platform, these are the functions and principles it should operate on. One of the key fundamentals is the capability to bring in all sorts of data. Here you see all your network, antivirus, and IDS telemetry: network, security, and host data. You also have enrichment, which is context from both external and internal sources; one of the examples here is the different roles of the employees. Then it all goes into the data store, and the component on the right is where you do your analytics. Because it uses Lucene, the analyst can run free-form, free-text searches to understand all the different context, and by plugging in machine learning scripts you can enhance the data and generate additional alerts and additional context. That's just a sample architecture, for reference; there are many other ways to implement this.
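As a rough illustration of that free-form search capability, here is a hedged Python sketch of a Lucene-style query against an Elasticsearch index; the host, index name, and field names are assumptions for the example.

```python
# Sketch: a Lucene-style free-form search against an Elasticsearch index,
# the kind of pivot an analyst might run. Host, index, and fields are
# illustrative assumptions.
import requests

query = {
    "query": {
        "query_string": {
            "query": 'powershell AND (user:admin* OR dst_ip:"10.0.0.5")'
        }
    },
    "size": 50,
}

resp = requests.get("http://localhost:9200/events/_search", json=query)
for hit in resp.json()["hits"]["hits"]:
    print(hit["_source"])
```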
So, data collection. I don't want to take too much time on this; all of us are aware we need to get the data from the assets we need to protect. This slide is endpoint based: if you capture all this critical process-oriented data from the endpoint, it helps you analyze whether any malicious or weird activity is going on, and there are many tools for that. How you collect it is up to your environment, but the idea is to collect data from all hosts, all assets, all endpoints, including from a forensic perspective, so that even if a bad actor modifies data on the host, you have a logged copy showing exactly what was going on. It's also important to spend time understanding the different log sources. One example: in Windows there is a specific setting that lets you capture command-line arguments in process creation events, and that is very, very useful when you dig into the activities performed by a threat actor. Having that kind of granularity in the logs really helps.
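As a sketch of what that granularity buys you, here is a small Python example that flags suspicious command lines in exported process-creation events (Windows Event ID 4688 with command-line logging enabled); the input format and the patterns are illustrative, not a vetted detection set.

```python
# Sketch: flag suspicious command lines in exported process-creation events
# (Event ID 4688 with "Include command line in process creation events"
# enabled). Input format and patterns are illustrative only.
import json
import re

SUSPICIOUS = [
    re.compile(r"powershell.*-enc", re.I),                  # encoded PowerShell
    re.compile(r"\bbitsadmin\b.*transfer", re.I),           # BITS download abuse
    re.compile(r"\bwmic\b.*process\s+call\s+create", re.I), # remote exec via WMI
]

def hunt(path: str) -> None:
    with open(path) as f:
        for line in f:
            event = json.loads(line)
            cmd = event.get("CommandLine", "")
            if any(p.search(cmd) for p in SUSPICIOUS):
                print(event.get("Computer"), cmd)

hunt("process_events.jsonl")  # hypothetical JSON-lines export
```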
So spending time to understand your environment and capturing all the required logs definitely helps a lot. For data collection from the network perspective, we have intrusion detection systems and similar sources that give us exactly what happened on the wire, along with the context. In my mind, the biggest difference now is that we can collect a lot of data from other contextual sources, primarily the HR database internally and services like PassiveTotal externally. We didn't have this rich set of contextual information earlier; at least in my mind, the focus used to be on a very limited set of data, maybe Bro, an IDS, or antivirus. But combining authentication and access logs with intrusion detection logs is very, very powerful, and as we move along we'll see some examples. This slide summarizes what we've discussed so far; it's a very logical representation. The idea is to capture all your data and put it in one single place where you can secure it and safely process it.
On the left there are three ways we can enhance our detection mechanism, or hunting. The rule base is always there: by design we are core engineers who understand how a threat vector operates and design YARA rules or IDS rules, and that will always help us detect known stuff; it has its place. The same goes for time-series analysis: denial of service is a very good example, where the attack is volumetric, so you need an understanding of your baseline and set thresholds from it. We won't discuss those two much further, but I wanted to highlight that they are still very important ways to detect threats. Property graph analytics and machine learning we will see in depth as we move along. The idea with these two techniques is to take the security events, apply the contextual data, and in a way automate your contextual analysis. Right now, what all your SOC analysts do is take a security event and add the context to see exactly what happened; both machine learning and property graph analytics help us automate that contextual analysis.
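A minimal pandas sketch of that automated contextual analysis, assuming hypothetical CSV exports of IDS alerts, authentication logs, and HR roles:

```python
# Sketch: automate contextual analysis by joining IDS alerts with
# authentication logs and HR context. Column names are hypothetical.
import pandas as pd

alerts = pd.read_csv("ids_alerts.csv")   # ts, src_ip, signature
auth = pd.read_csv("auth_logs.csv")      # src_ip, user, logon_type
hr = pd.read_csv("hr_roles.csv")         # user, department, role

enriched = (alerts.merge(auth, on="src_ip", how="left")
                  .merge(hr, on="user", how="left"))

# Surface alerts where the user's role makes the activity unusual,
# e.g. PowerShell signatures coming from non-IT departments.
hits = enriched[
    enriched["signature"].str.contains("PowerShell", case=False, na=False)
    & (enriched["department"] != "IT")
]
print(hits[["ts", "user", "department", "signature"]])
```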
So I think we've pretty much covered what kind of threat hunting platform we need: one where we can store structured, unstructured, and all other types of data, and where we can run advanced machine learning techniques. From a process perspective, at a high level, we create use cases, apply some of these techniques, move the results out to the other team members, get feedback, and keep refining. As we move along with some examples, this will become clearer. So what's different now, what are the enablers? The biggest enablers in my mind are cloud technologies, which let us implement all of this quickly. You don't need to do the plumbing; in many organizations it's very difficult to stand up a big log platform, but now Splunk has Splunk Cloud, Elastic has its Elasticsearch cloud offering, Apache Spark is available through Databricks, and Amazon offers Elasticsearch as a service. There are a lot of options now.
The time it takes to build any of these has become really short, and that helps us focus on what we do best rather than on the plumbing. Also, big data analytics technologies are evolving not for cyber security but for the Internet of Things and other fields, and we simply take advantage of that: Apache Spark, streaming, and the rest are evolving rapidly, and we can definitely ride along. To summarize the key takeaways: build a threat hunting platform that integrates all your data sources, choosing the best option for your environment, and create a process that incorporates all your teams. That's vital, because if you have the technology and not everybody uses it, you may not get the feedback needed to improve your platform. So we've discussed the requirements to take care of when designing a threat hunting platform. The second part is about some of the data science techniques, and as I mentioned earlier, there are two that are very powerful: graph analytics and machine learning.
And because there are now options that make these easy for all of us to deploy, I believe they will help us a lot as we move forward. As for the skills that matter on this journey: substantive expertise is our core information security knowledge, which enables us to do what we do. Hacking skills here mainly means data manipulation and data munging, making sure all your data is in a single format with no missing rows. And math and statistics knowledge, which all of us went through at some point, so now it's just a matter of combining the three. There are a lot of techniques that help us here. One example I always think about: you can always read a PCAP from the hex. If you can read hex, you can read a PCAP, but you have Wireshark, which automatically understands the protocols and shows you exactly what's going on.
Similarly, for machine learning and these other technologies, we now have encapsulating frameworks that give us an easy interface to implement many of them; in this analogy, they are the Wireshark. Now, this is a much-quoted line: defenders think in lists, attackers think in graphs, and as long as this is true, attackers win. The important message here is to form relationships between your data. It's basically a data model: you can have a property graph, or something very similar to entity-relationship modeling, where you define the different entities and then identify the relationships between them. Once you do that, and it's applied to all your data, identifying hidden relationships becomes easy and second nature, and it's extremely powerful. This slide is just the basics of what a property graph is. Neo4j is a very popular graph database; it's open source, and all of us can try it.
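If you want to try the property graph idea hands-on, here is a minimal sketch using Neo4j's Python driver and Cypher; the labels, relationship type, credentials, and values are illustrative assumptions.

```python
# Sketch: load a few indicators into Neo4j as a property graph and pivot
# with Cypher. Credentials, labels, and values are illustrative.
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687",
                              auth=("neo4j", "password"))

sample = "44d88612fea8a8f36de82e1278abb02f"  # illustrative hash

with driver.session() as session:
    # A hash connects to an IP; domains could hang off the same IP.
    session.run("""
        MERGE (h:Hash {value: $hash})
        MERGE (ip:IP {value: $ip})
        MERGE (h)-[:CONNECTS_TO]->(ip)
    """, hash=sample, ip="198.51.100.7")

    # Pivot: everything within two hops of the hash.
    result = session.run("""
        MATCH (h:Hash {value: $hash})-[*1..2]-(n)
        RETURN labels(n) AS type, n.value AS value
    """, hash=sample)
    for record in result:
        print(record["type"], record["value"])
```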
Next is ThreatCrowd, where I wanted to show an example of how this helps in reality. In this case we just had a hash, and we gave the hash to ThreatCrowd. The hash was related to an IP address and a specific signature set, and just by pivoting into the IP address, you see a lot of other domains, which you may or may not know, that are related to it. Having this contextual information helps you see whether any further systems in your environment have been compromised.
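The same pivot can be scripted against ThreatCrowd's public API; this sketch uses the v2 endpoint and response fields as I recall them, so treat both as assumptions and check the current documentation.

```python
# Sketch: pivot from a hash to related IPs and domains via ThreatCrowd's
# public API. Endpoint and response fields as I recall them; verify
# against the current docs before relying on this.
import requests

def pivot_hash(sample_hash: str):
    r = requests.get(
        "https://www.threatcrowd.org/searchApi/v2/file/report/",
        params={"resource": sample_hash},
    )
    report = r.json()
    ips = report.get("ips", [])
    domains = report.get("domains", [])
    print("IPs:", ips)
    print("Domains:", domains)
    return ips, domains

pivot_hash("44d88612fea8a8f36de82e1278abb02f")  # illustrative hash
```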
That is a very simple example of how graph technology can be extremely useful for us. Similarly, BloodHound, which was released at Black Hat this year, is extremely powerful. The way it works is that it queries Active Directory, creates a graph from the Active Directory metadata, and builds that graph for us to query. Configuration is very simple, it's open source, and you don't need elevated privileges: an ordinary user account is enough to derive all the Active Directory permissions, including which groups and users can derive credentials, all the way up to domain admin. So you can actually query the relationships, the paths by which an attacker could obtain domain admin. It's a very powerful tool that demonstrates the power of graph technology.
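Since BloodHound stores its graph in Neo4j, you can also query it directly; here is a hedged sketch of a shortest-path query to Domain Admins, with the group name following BloodHound's uppercase NAME@DOMAIN convention for a hypothetical domain.

```python
# Sketch: query a BloodHound graph (stored in Neo4j) for shortest attack
# paths to Domain Admins. Credentials and domain name are hypothetical.
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687",
                              auth=("neo4j", "password"))

with driver.session() as session:
    result = session.run("""
        MATCH p = shortestPath((u:User)-[*1..]->(g:Group {name: $admins}))
        RETURN [n IN nodes(p) | n.name] AS path
    """, admins="DOMAIN ADMINS@EXAMPLE.LOCAL")
    for record in result:
        print(" -> ".join(record["path"]))
```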
Another powerful example, moving along, is this one: we take the Bro, Nmap, and Remedy data and feed it into a taxonomy, or ontology, which is again a graph. It can be a property graph or a Resource Description Framework model; either way it's an entity-relationship model where you specify the different entities and how they relate, for example how an asset relates to an IP. This is an example ontology for reference, tying all the different entities together. Once you apply this, and I'll explain later how to implement it, it quickly gives you aggregate answers. How many vulnerable systems have this malware? These kinds of correlations would never have been possible before. You can ask how many attackers used a given vulnerability, you can pivot from one point along any path in the relationships, and you can aggregate over those relationships. That's the beauty and power of this technology. Here, what we're trying to do is identify the most vulnerable asset, just by mapping all the different data into the same taxonomy. I think that pretty much covers graph technology.
For implementation, as I mentioned, there are many options. Apache Spark has the GraphX module on top of it, Elasticsearch has a Graph module, and Neo4j is open source and can plug into whatever data source you have: Hadoop, Splunk, or anything else. There are many other technologies as well, and BloodHound is one place where you can try it, see the power, and then expand to your other data sets. The biggest benefit, at least when you start, is that you understand your data better. You don't need a big data set: even with a small Neo4j instance, just load two or three CSV files and you will understand your data better. And even if you end up looking at a commercial solution, you'll know what to ask and can pose intelligent questions to make sure what's being offered will meet your needs; that's a big advantage. It also gives you very good insights, and it's very intuitive: by nature humans are visual, and I strongly believe graph technologies will help us. All the technologies I mentioned, Apache Spark GraphX, Elasticsearch Graph, Neo4j, are open source, and I think even BloodHound is open source. This ontology framework is based on RDF, that is, RDF triples.
That's another flavor of the entity-relationship model: the subject-predicate-object form. For example, you can say the system has an IP address of 1.1.1.1, so the system and the IP address become related, and similarly you can define various relations for each and every entity.
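A minimal sketch of the triple idea with Python's rdflib, using a made-up namespace and predicates, plus a SPARQL query doing the kind of aggregation described earlier (which vulnerable systems also have malware):

```python
# Sketch: subject-predicate-object triples with rdflib. Namespace and
# predicates are made up for illustration.
from rdflib import Graph, Namespace, Literal

SEC = Namespace("http://example.org/sec#")
g = Graph()

g.add((SEC.host42, SEC.hasIP, Literal("10.1.2.3")))
g.add((SEC.host42, SEC.hasVulnerability, Literal("CVE-2016-0101")))
g.add((SEC.host42, SEC.hasMalware, Literal("dridex")))

# Aggregate across relationships: which vulnerable systems have malware?
q = """
SELECT ?host ?vuln WHERE {
    ?host <http://example.org/sec#hasVulnerability> ?vuln .
    ?host <http://example.org/sec#hasMalware> ?mal .
}
"""
for row in g.query(q):
    print(row.host, row.vuln)
```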
I hope that was interesting. Similarly, parallel coordinates are another way to look at the data: a multi-dimensional view where you can see different columns side by side and understand the relationships between them, so that if particular values spike, it quickly highlights which columns are the most prominent. It's another way to visualize some of your data.
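pandas ships a simple parallel-coordinates helper, so here is a small sketch with illustrative feature columns and labels:

```python
# Sketch: a parallel-coordinates view of multi-dimensional event features,
# colored by a label column. Feature names and values are illustrative.
import pandas as pd
import matplotlib.pyplot as plt
from pandas.plotting import parallel_coordinates

df = pd.DataFrame({
    "bytes_out":    [120, 90000, 150, 87000],
    "conn_count":   [3, 240, 5, 198],
    "distinct_dst": [1, 57, 2, 49],
    "label":        ["normal", "suspect", "normal", "suspect"],
})

parallel_coordinates(df, "label", color=["#4daf4a", "#e41a1c"])
plt.title("Event features, parallel coordinates")
plt.show()
```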
The next aspect we'll look at is statistical analysis, basically risk scoring. Many frameworks use risk scoring, and this again comes back to your environment: if you have a strong handle on the people side, in terms of roles, who is responsible for what and who the administrators are, a very good understanding of your environment lets you design a very good risk-score-based approach. Some of it may be threat-actor-TTP based: you can say that if there is externally facing PowerShell, that's probably malicious, so increase the score; if the asset is outside the network, increase the score again; maybe you have different controls. Then you aggregate and prioritize what to look at. I don't know if the example on the slide is visible, but the idea is this: there's a lot of noise in login failures, so you have to bubble up the riskiest ones based on some criteria. Here we used standard deviation plus a few other parameters. If you see continuous login failures for one hour, you increase the risk score by 20; if the count is more than two standard deviations above the baseline, you increase it again. Then you aggregate (in this case into a summary index; this was done in Splunk), sort, and take action on the topmost scores.
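Here is a pandas sketch of that risk-scoring logic; the talk's version was built in Splunk with a summary index, so this is only an analogue, and the thresholds (60 failures an hour, two standard deviations) are illustrative.

```python
# Sketch in pandas of the risk-scoring approach described above.
# Thresholds and the input CSV (ts, user) are illustrative.
import pandas as pd

df = pd.read_csv("login_failures.csv", parse_dates=["ts"])

# Login failures per user per hour.
hourly = (df.set_index("ts").groupby("user")
            .resample("1H").size().rename("failures").reset_index())

scores = hourly.groupby("user")["failures"].agg(["mean", "std", "max"])
scores["risk"] = 0
scores.loc[scores["max"] >= 60, "risk"] += 20   # sustained failures in an hour

z = (scores["max"] - scores["mean"]) / scores["std"]
scores.loc[z > 2, "risk"] += 10                 # well above the user's baseline

# Sort the aggregate and act on the topmost scores.
print(scores.sort_values("risk", ascending=False).head(10))
```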
This, again, is a mechanism by which you can prioritize and reduce the noise level, and baselining can be done very effectively with this risk-scoring methodology. The approach works really well for denial of service and similar cases, where you know your average, calculate your daily and monthly baselines, trigger alerts from them, and get notified quickly whenever you're above threshold. On to time-series analysis: one of the basic techniques there is the simple moving average. The idea is that you calculate the amount of, in this case, data, or whatever it is you're measuring, per time bucket.
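A minimal sketch of that moving-average baseline in pandas, assuming a hypothetical CSV of timestamped byte counts; the 2x alert multiplier is illustrative.

```python
# Sketch: a simple moving average over traffic volume, alerting when the
# current bucket exceeds the average of the previous five buckets.
import pandas as pd

traffic = pd.read_csv("traffic.csv", parse_dates=["ts"], index_col="ts")["bytes"]
hourly = traffic.resample("1H").sum()

# Average of the previous five buckets (shifted so "now" isn't included).
baseline = hourly.rolling(window=5).mean().shift(1)

alerts = hourly[hourly > 2 * baseline]   # 2x multiplier is illustrative
print(alerts)
```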
You get a graph from it, and you can overlay it to make the picture visible to others. This slide is a DDoS example where we simply overlaid the current traffic on the baseline, and I think that gives a good idea. The simple moving average is the same thing: you calculate your average, compare against your previous five buckets, and go from there. Similarly, you can identify beaconing using fast Fourier transforms; the slide is just an example. Those are some of the techniques you can deploy as you start hunting, to quickly identify unknown activity.
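Here is a hedged numpy sketch of the FFT idea: bin connection timestamps, transform, and score how dominant the strongest frequency component is. The bin size and scoring are illustrative, not a tuned detector.

```python
# Sketch: spot beacon-like periodicity in connection times with a fast
# Fourier transform. A strong spike in the spectrum suggests a
# fixed-interval callback. Bin size and scoring are illustrative.
import numpy as np

def beacon_score(timestamps: np.ndarray, bin_seconds: int = 10) -> float:
    """Ratio of the dominant frequency component to the mean component."""
    t = timestamps - timestamps.min()
    bins = np.zeros(int(t.max() // bin_seconds) + 1)
    np.add.at(bins, (t // bin_seconds).astype(int), 1)
    spectrum = np.abs(np.fft.rfft(bins - bins.mean()))
    return spectrum.max() / (spectrum.mean() + 1e-9)

# A host calling home every 60 seconds scores far higher than random traffic.
regular = np.arange(0, 3600, 60, dtype=float)
print(beacon_score(regular))
```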
Machine learning is where I believe the automated contextual analysis really happens. How many of you have tried Microsoft Azure ML Studio? It's very intuitive and helps a lot in getting started, and some of the examples I'll go through are based on it. There is also Databricks with Spark, which gives you the capability to run MLlib-based scripts: as long as you upload whatever contextual data you have, you can do the analysis there.
One of the biggest advantages is that, as I said, we never had this kind of rich contextual data before, and now that we have so many data points to feed in, we can experiment. One of the most prominent applications is finding a bad domain, and we can use the same technique to identify malicious or randomly generated files. But let's go through the process first and then a few examples, so we can tie the two together. The key principle in machine learning is to have the program automatically learn what the output should be and then give you the results. I mentioned risk scoring, which you have to build manually if you know your environment; now think about having somebody do that for you. That's machine learning: you can have your risk scoring created automatically by feeding in known data points, and that's very powerful.
There are two types: supervised learning and unsupervised learning. In supervised learning, the target values are known and we label the training data, so we also know the outcome we want. I don't know if it's a good analogy, but it's like boutique ice cream, like Tutti Frutti or Sweet Frog, where you get your ice cream and add your own flavors on top: you already know the result you're after. Unsupervised learning is like a chef improvising a new dish he's never made, seeing what he can do with what he has; k-means clustering is one example, and unsupervised methods are mainly used for learning about your data. Supervised learning is mainly used when you know the outcome and have a standard data set. That's a very high-level overview of supervised versus unsupervised learning. This picture classifies the different types of machine learning at a very high level: at the top, regression and two-class classification are supervised (regression is for numeric targets, while two-class and multi-class classification are for labels), and clustering, which groups things into peer groups, and anomaly detection, which finds the outliers, both fall under unsupervised.
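A small scikit-learn sketch of the unsupervised side: k-means peer-grouping of hosts, then flagging the points farthest from their cluster centers as outliers to hunt on. The features and the choice of k are illustrative.

```python
# Sketch: unsupervised peer-grouping of hosts with k-means, flagging the
# points farthest from their cluster center as outliers. Features and k
# are illustrative.
import numpy as np
from sklearn.cluster import KMeans

# Rows: hosts; columns: e.g. GB out, distinct destinations, logon count.
X = np.array([[1.2, 3, 5], [1.1, 2, 6], [1.3, 3, 5],
              [9.8, 40, 2], [1.0, 2, 5], [10.2, 44, 1]])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
dist = np.linalg.norm(X - km.cluster_centers_[km.labels_], axis=1)

# The hosts farthest from any peer group deserve a closer look.
print(np.argsort(dist)[::-1][:2])
```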
So that picture is a quick way to understand the different methods. As I mentioned earlier, Microsoft Azure ML Studio gives you a very simple, visual way to implement a machine learning pipeline. In this specific example, we get a lot of intel feeds and other feeds full of text documents that we have to convert before we can use them. This is a text classification example: you feed in the unstructured data, and Microsoft's feature hashing module automatically extracts the features and makes them available for relationship mapping. The same thing can be done with MarkLogic: you can load a bunch of unstructured data, extract entities, map the relationships, and use them for further analysis. If you look at the steps required: you feed in the data; in preparation you make sure everything is in the same format; in pre-processing you make it ready for the modules that extract the features; and then you train and evaluate the model. The way this works is that you apply the algorithm to a training set to get a model, then apply the model to held-out test data to measure how it performs, and from there you tweak it and iterate. I'm not sure all of this is visible, but the key point I want to highlight is that it's visually intuitive: at every step you understand what's taking place, and you can walk through each and every phase.
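For readers without Azure ML Studio, scikit-learn's HashingVectorizer is an analogous feature-hashing step; this is only a sketch of the same pipeline shape, with a toy training set.

```python
# Sketch: a scikit-learn analogue of the feature-hashing step described
# above, turning unstructured threat-report text into vectors for a
# two-class classifier. The tiny training set is purely illustrative.
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import LogisticRegression

docs = ["dridex campaign targeting banks via macro droppers",
        "quarterly patch notes for the finance application",
        "new c2 infrastructure observed for the same actor",
        "meeting minutes from the architecture review"]
labels = [1, 0, 1, 0]   # 1 = threat-relevant, 0 = not

vec = HashingVectorizer(n_features=2**16)
X = vec.transform(docs)

clf = LogisticRegression().fit(X, labels)
print(clf.predict(vec.transform(["actor reuses dridex c2 domains"])))
```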
One real application of this came up in an article a while back, from IBM Watson. One of its biggest tasks is ingesting data from many different sources and mapping it to the different entities; the entities here are things like campaign, indicators, and target, shown on the right. By extracting the associated text, it can quickly identify which articles relate to which threat actor, which makes a broad understanding of the different threat actors possible. There are many other uses. Going back to machine learning itself, there are many ways to apply it in our environment, many ways to explore. One is identifying malicious domains or files, in particular randomly generated domains or file names. The way we look at it: all of us have access to our proxy or firewall data, and we feed that, together with external contextual features, into the model. You can calculate entropy, look at the ratio of consonants to vowels and the number of character transitions, and pull in the PassiveTotal score, the Alexa rank, and VirusTotal results. All of these together give very high fidelity, so you can build a two-class classifier that says whether a particular domain is benign or malicious, deciding based on all these different features. Many of the features are open source, and many are already available in your environment.
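A sketch of the lexical features in Python; the external reputation features (PassiveTotal, Alexa, VirusTotal) would be joined in separately from their APIs, and the feature set here is illustrative rather than complete.

```python
# Sketch: lexical features for a two-class benign/malicious domain model,
# as described above: entropy, vowel ratio, character transitions.
import math
from collections import Counter

def entropy(s: str) -> float:
    counts = Counter(s)
    return -sum((c / len(s)) * math.log2(c / len(s)) for c in counts.values())

def features(domain: str) -> dict:
    name = domain.split(".")[0]
    vowels = sum(ch in "aeiou" for ch in name)
    transitions = sum(a.isalpha() != b.isalpha() for a, b in zip(name, name[1:]))
    return {
        "length": len(name),
        "entropy": entropy(name),
        "vowel_ratio": vowels / len(name),
        "transitions": transitions,   # letter/digit switches
    }

print(features("google.com"))
print(features("xk2j9qzr1f.biz"))   # DGA-like: higher entropy, more transitions
```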
The idea is to train the model and then use it. You can also apply machine learning to risk scoring, as I mentioned; that's a very powerful example. Having the system automatically calculate the risk score based on the context of where you logged in from is very powerful: the type of login, where you logged in from, the device or asset you logged in from, and the resource you tried to access all come together to rank what the highest-risk activity is. Those are just a few examples I could think of. As I mentioned earlier, there are a lot of use cases we can play with and a lot of tools available: Azure Machine Learning Studio, R, Apache Spark, and Python has a lot of tutorials, so there's plenty of material to start with. And this is beneficial because, again, we now have the data; previously we had no way to put things together.
Now that we have everything in one place, we can actually connect the dots. And that's the final slide: there are a lot of possibilities in how we can use this technology, not only for threat hunting but also for vulnerability remediation. At any point in time you have to report to your management which assets are highest risk, and having this framework lets you map the relationships, aggregate quickly, and produce those results. Similarly, you can do fraud detection, and you can automate many of the CIS security controls, which are all very difficult to implement manually; with this automated contextual analysis you can, for example, answer whether something is your asset or not, because you already have features from your Remedy and Bro logs that you can just integrate. You can do cloud access security monitoring. There's a whole range of avenues opened up by having this platform and applying both machine learning and graph analytics. I hope I have convinced you that this is a powerful toolset.
I'm still learning, and this is very exciting. I'd like to hear more about best practices; there's a long way ahead, and because this is so new and the technologies are evolving so rapidly, it's a very exciting and also very challenging time. I really hope each one of us uses these technologies to make ourselves better. Again, thank you for your time. This has been a learning experience, at least for me. Thanks a lot.