
Our next speaker will be Ahmed Mirzay. Ahmed is a senior security data scientist on the protection teams at Elastic, and today he is going to talk to us about practical threat hunting with machine learning. Take it away.

Thank you. Hello everyone. I'm very glad we're gradually getting back to in-person events. I usually love to have eye contact with the audience, but I can't see you very well. That's fine. I thought, now that we are in an AMC theater, having this thing may be a good idea. This talk was initially planned to be given by my colleague Craig, but unfortunately he couldn't make it for personal reasons. That's why I'm here.
Craig has been active in cybersecurity for several years, especially in threat hunting. He's been the chief security architect and a principal at several security startups. As for myself, I've been active in cybersecurity for more than a decade, most recently in Android security, mobile and desktop malware analysis, reverse engineering, and applied machine learning in security. So today I'm going to be talking about practical threat hunting with machine learning. I hope you enjoy it.

Right off the bat: why should we use machine learning for threat hunting? This is the question we may ask ourselves in many application contexts where we want to apply machine learning these days. You may say, "I'm not going to apply machine learning to hunt threats, and I don't have any plans to do it in the near future." In that case, let me correct myself: why should I use machine learning for threat hunting? Personally, I'm opposed to applying machine learning blindly to any application context. However, it's a really good tool, an additional tool in our toolbox for solving problems. It's valuable for finding threats where conventional methods don't do well, and it's also very good at automating and scaling existing threat hunting techniques.

Two questions are often asked in threat hunting. One: how do we make sure we don't trade alert fatigue for anomaly fatigue, because these two are interrelated? The other: how is this new approach based on machine learning different from things like stack ranking or outlier detection by running really large queries?

We have machine learning jobs at Elastic that run continuously in what we call bucket intervals, which are literally time intervals, sometimes 15 minutes or 60 minutes, depending on which one gives us the best results. As for what we do to make sure we don't get too much output, we've done a lot of work on that at Elastic. We have something called datafeeds, which are Elasticsearch queries that provide the input to the models. We measure the signal-to-noise ratio of those queries, put exceptions on the noisy results, and exclude those before the data reaches the models. So it's a kind of pre-filtering. This is the way we do it.
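To make the pre-filtering and bucketing idea concrete, here is a minimal Python sketch. It is not Elastic's implementation: the event field names, the exception list, and the helper names are all assumptions made for illustration. It drops events that match an exception and counts what remains per 15-minute bucket and host, which is the kind of series an anomaly detection job would then model.

```python
from collections import Counter
from datetime import datetime, timezone

BUCKET_SECONDS = 15 * 60  # 15-minute bucket interval, one of the spans mentioned in the talk

# Hypothetical exception list: (field, value) pairs that were found to be noisy.
EXCEPTIONS = {("process_name", "backup_agent.exe")}

def is_excluded(event):
    """Return True when the event matches any (field, value) exception."""
    return any(event.get(field) == value for field, value in EXCEPTIONS)

def bucket_start(ts):
    """Floor a timezone-aware datetime to the start of its bucket (epoch seconds)."""
    epoch = int(ts.timestamp())
    return epoch - epoch % BUCKET_SECONDS

def bucket_counts(events):
    """Drop excluded events, then count the remainder per (bucket, host)."""
    counts = Counter()
    for event in events:
        if is_excluded(event):
            continue
        counts[(bucket_start(event["timestamp"]), event["host"])] += 1
    return counts
```

The filtering happens before counting on purpose: anything excluded here never reaches the model, so it cannot later show up as an anomaly.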
The other thing we do is a sort of multivariate correlation. We join machine learning alerts at Elastic with the other types of alerts we get, for example malware detection alerts, conventional search- or query-based alerts, and behavioral alerts. We merge all of that together, and one thing we have found is that when two or more of these detection pipelines agree that something is malicious or suspicious, confidence tends to go way up. Another thing we have done recently is clustering alerts based on risk score, host and user risk scores, and this is something we are still working on. There is a blog post that explains how we're doing this at Elastic and walks you through some of the functions. I really recommend looking at it.

So this is a summary of what we do in terms of machine learning-based anomaly detection. After this summary, let's dive right into some examples, some case studies, which can be divided into three main categories: those that use endpoint data, those that use network data, and those that use cloud API audit logs. Let's start with the case studies that use endpoint data.
These are some of the jobs we have at Elastic that use endpoint data. We have around 64 different machine learning jobs that use endpoint data, meaning the properties of running processes from the endpoints. For example, we look for rare processes and rare process creation paths, like parent-child relationships. We have another job that is pretty good at finding obfuscated PowerShell script block content.

The point is that rare processes are not always anomalies. (I'm really hot. I don't know. It's okay.) We should be looking for rare processes not only on a single host but across the entire host population. That's why we have one job looking for rare processes on a single host and another job looking for rare processes across the entire host fleet. You may say that some application types manifest rare processes occasionally. For example, finance applications run processes that show up once per month or once per quarter. So a rare process on something like a finance server could be fairly uninteresting. However, a rare process on something like a domain controller, an AD or DNS server, or an Exchange server could be more interesting and of higher priority, because at least the plain vanilla versions of those tend to have less behavioral variance or spread.

The first case study is this one. We were working on our detection methods for a campaign called Nobelium back in May 2021. This is a campaign that has been active for a while, so we kicked off a high-priority project to make sure we were actually able to detect it. What we found was that we could detect this campaign using some of the unsupervised machine learning jobs and alerts, even without any prior knowledge of what the IOCs or the behaviors looked like at that time.

We got a couple of detections on Nobelium, not just one. One was activity from an unusual working directory. In a well-regulated environment, for example a Windows environment with heavy security policies, it's sometimes unusual for users to be running things from their profile directories, and one detection was related to this. Another job found network activity from an unusual process name.
That process was talking over the network, and that was the second detection we got. As far as other possibilities, we found other things that were interesting. For example, we found a very strange DLL name in one of the samples when we detonated it. Anomalous DLL activity or DLL names can sometimes be interesting and fruitful. However, it's going to be a little more difficult because, as you may know, there is just a huge number of DLL loads. DLL names tend to have more cardinality, more uniqueness. You're going to see many DLL names both on a single host and across the entire host fleet. So DLL activity is good, but you have to find the anomalous or rare DLL activity.

Another thing that was interesting is beacon activity. If you can find beaconing, you can possibly detect and block attacks at the very earliest stages, and this is something we've been working on recently. So this was the most notable case study where we used endpoint data, basically the properties of the running processes on the endpoints.
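The distinction drawn in the endpoint section between a process that is rare on one host and a process that is rare across the whole fleet can be sketched in a few lines of Python. This is an illustration only: the function name, the input shape of (host, process) pairs, and both fixed thresholds are assumptions, and the fixed thresholds stand in for the baselines an unsupervised job would learn.

```python
from collections import Counter, defaultdict

def rarity_report(observations, host_threshold=0.05, fleet_threshold=0.3):
    """Classify each (host, process) pair as rare on its host and/or rare in the fleet.

    observations: iterable of (host, process_name) pairs, one per process start.
    rare_on_host: the process accounts for a small share of that host's starts.
    rare_in_fleet: the process appears on a small share of all hosts.
    """
    per_host = defaultdict(Counter)
    hosts_running = defaultdict(set)
    for host, process in observations:
        per_host[host][process] += 1
        hosts_running[process].add(host)

    total_hosts = len(per_host)
    report = {}
    for host, counts in per_host.items():
        host_total = sum(counts.values())
        for process, n in counts.items():
            report[(host, process)] = {
                "rare_on_host": n / host_total < host_threshold,
                "rare_in_fleet": len(hosts_running[process]) / total_hosts < fleet_threshold,
            }
    return report
```

A process that is rare both on its host and in the fleet is the more interesting candidate, and, following the talk's reasoning, more so on a low-variance server such as a domain controller than on a finance server.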
The second series of case studies I want to discuss are those that use network data. In the network realm, as you can see, one of the most worthwhile and interesting things you can do is plot the data geographically. You can digest the results in a relatively short time, and it's a simple technique, so I really recommend it. As you can see, the machine learning jobs we have that use network data look for things like an unusual destination country, unusual spikes in network traffic, and unusual spikes in network denies.
The first one in particular looks at unusual spikes in network traffic to a particular country. These are some of the things we rely on to detect threats using network data.

Let's have a look at the case study here. What you can see is a couple of detections we got that were based on unusual spikes in traffic to a particular destination country. The typical column shows what the network traffic volume would typically be, the actual column shows what it actually was, and the description shows how far the actual was from the typical. What's great about our machine learning jobs overall is that we're not giving them any numbers. We're not saying, for example, "alert when the network volume goes above 10,000 flows per hour." We're not giving any static number; everything is dynamic. So our machine learning jobs can find peaks and troughs that are not merely peaks and troughs but are unusual for the network where the incident is actually taking place.

In this case study, we were beta testing one of our network anomaly detection jobs, and what we found was a malware instance that was getting stopped at the border firewall while trying to get out and connect to its C2. As you can see, the C2 was in a country that we didn't have a relationship with, so it was a good catch. But you may object that geographic detection relying on destination country is not a good way to detect threats, and I agree, because threat actors can evade it very easily just by not being a geographic anomaly. In other words, you may get hit by an attack from a place you would not imagine, maybe even from your own country.

So this is a case study where one of our machine learning jobs could detect the Sunburst C2, and the way we did it was not through geographic anomaly, because this was not a geographic anomaly. What we did was leverage a job that looks for DNS tunneling. This is something we actually adopted from one of our customers, who liked to look for an unusually high number of child domain names under a single parent domain in DNS events. We adapted that into a machine learning job, and with it we could detect the Sunburst C2. In this case there was a DGA, a domain generation algorithm, in place, and there were just a lot of child domain names for one parent domain. That's why we could detect it. So here we didn't rely on geography; we leveraged a different technique, looking for DNS tunneling, and it was a notable example of the detections we got using network data.
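The child-to-parent signal described for the DNS case can be illustrated with a short Python sketch. The function names are hypothetical, the parent domain is approximated naively as the last two labels (a real implementation would consult a public suffix list), and the fixed `min_children` cutoff is only a stand-in for the dynamic baseline a machine learning job would learn.

```python
from collections import defaultdict

def registered_parent(fqdn, labels=2):
    """Naive parent domain: the last `labels` labels of the name (assumption, no suffix list)."""
    parts = fqdn.lower().rstrip(".").split(".")
    return ".".join(parts[-labels:])

def unique_children_per_parent(queries):
    """Count distinct child names observed under each parent domain."""
    children = defaultdict(set)
    for name in queries:
        normalized = name.lower().rstrip(".")
        parent = registered_parent(normalized)
        if normalized != parent:  # the bare parent itself is not a child
            children[parent].add(normalized)
    return {parent: len(names) for parent, names in children.items()}

def suspicious_parents(queries, min_children=50):
    """Return parents with at least `min_children` distinct children, sorted by name."""
    counts = unique_children_per_parent(queries)
    return sorted(parent for parent, n in counts.items() if n >= min_children)
```

A generated or tunneled channel tends to produce many one-off subdomains under one parent, which is why counting distinct children, rather than total queries, is the useful measure here.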
Probably the most interesting part is the detections that use cloud API audit logs, because we are moving toward using the cloud more and more. So let's have a look at the detections we got using cloud API audit logs. Machine learning anomaly detection can also be used as a go-to technology in the cloud space, but the cloud domain is a little bit different. We have virtual machines, we have virtual servers, and we have containers that run operating systems and look kind of like virtual servers. But the cloud world is different. There are services that don't traverse virtual networks; they are not accessible from the virtual networks your virtual servers are on. So you cannot observe them using, for example, virtual firewalls, intrusion detection systems, or flow logs. You cannot observe them by any of these means, and this is where the machine learning jobs come in. Another aspect is that cloud incidents often lack clear evidence or indicators of misuse, so it's sometimes very hard to work out what happened from the cloud API audit logs. The other important thing is that the difference between normal user activity and a hijacked user context involved in malicious or suspicious activity is often a matter of nuance. I will explain more about this later. I hope the part about virtual servers was clear, because it is very important: there are services that you basically cannot observe or monitor because they don't traverse the virtual networks.

There is also the ATT&CK matrix for cloud techniques, and the overwhelming majority of these techniques are related to credential access. In the cloud incident world, most of the time we're talking about credential access: cases where somebody has obtained a set of valid credentials and is trying to impersonate a user, persist in the control plane, and use services in the control plane to achieve their goals.

So let's have a look at some of the case studies we had using cloud API logs. This is one of the incidents we could detect, one from 2016. There are several similar incidents in public records, but this is one example. There were many dimensions to it, but there was a very significant and interesting cloud dimension that I decided to talk about, and it was related to the exfiltration method. The exfiltration method was fairly novel at the time, and I'm not sure many people had thought about it. Once someone gets access to an account, they can use supported functionality to share snapshots of the virtual servers with other cloud accounts. That was the novelty: once the attackers had accessed a cloud account, they used that supported functionality to share the snapshots with an account under their control, a way to forklift data back to the attacker's account. This is what it looks like when someone shares a snapshot of a virtual server with another account, and it is one of the cases where you will see somebody else's account, because this is the account number the snapshot was shared with. The font is not very good, but that is what you can see here.
So this is one example. The other one was a case study where somebody was trying to bounce through a service called a web application firewall, or WAF. They could bounce or pivot through this web service in order to reach another service called the metadata service. The metadata service is an information service in cloud accounts where virtual servers can ask questions about themselves and get answers. Sometimes virtual servers can interrogate this metadata service and ask for temporary credentials to authenticate to a different service, depending on the roles assigned to them. So this is another thing they can do.
In this case study, somebody was able to bounce through this web service, interrogate the metadata service, and grab a credential out of it. That credential was actually usable from outside the cloud, so they were able to log in, access that cloud account, and access the data that was stored in a data store called S3. I'll talk about the ListBuckets method. If you're not familiar with it, this is basically a method with which you can enumerate a data store like S3.

We have around five different machine learning jobs here. These jobs look for things like a rare method, meaning rare calls or requests for a particular cloud user; a rare city or rare country for an API method; and spikes or outliers in error messages. What we've found is that if somebody has a shell on a virtual server and is running a post-exploitation framework like bark to pivot through the control plane, or possibly to move laterally among virtual servers, it will create a burst of error messages and access or authentication denied messages. So looking at this is also pretty useful.
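As a rough illustration of spotting a burst of error messages without a hard-coded volume limit, the Python sketch below compares each bucket's error count with the median of the buckets before it, scaled by the median absolute deviation. This is a simple robust statistic chosen for the example, not the model Elastic's jobs use; the function name and the default parameters are assumptions.

```python
from statistics import median

def spike_buckets(counts, min_history=8, threshold=6.0):
    """Return indices of buckets whose count is far above the earlier buckets.

    counts: error-message counts per bucket, oldest first.
    A bucket is flagged when (count - median) / MAD of the preceding buckets
    exceeds `threshold`. MAD is floored at 1.0 so flat histories do not
    divide by zero or flag tiny changes.
    """
    flagged = []
    for i in range(min_history, len(counts)):
        history = counts[:i]
        med = median(history)
        mad = median(abs(c - med) for c in history)
        scale = max(mad, 1.0)
        if (counts[i] - med) / scale > threshold:
            flagged.append(i)
    return flagged
```

Because the baseline is recomputed from each series' own history, the same code adapts to a quiet account and a busy one, which is the property the talk contrasts with static thresholds.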
The other thing is a rare error. Rare errors have observability use cases: what we have found is that a rare error will sometimes predict imminent cloud service failure, meaning that a cloud service is likely to degrade or break down in the next few minutes. So looking at this is also helpful.

We look for anomalous commands or methods mainly to detect credential access in the cloud domain because, like I said, credential access is one of the most common things you see. The challenge in detecting anomalous commands is that these commands are sometimes also used by operations, SRE users, or developers who are doing break-fix work or troubleshooting something. So it's not that easy to detect anomalous commands. Take the ListBuckets method you saw earlier as an example. If you remember, the threat actor in that case study, persisting with the compromised credentials, was using ListBuckets to enumerate the data store, which was S3 in that case. Even in a very small cloud account, this method is used too often for us to alert on every call. In a medium-sized cloud account, you may see tens of thousands of calls to this method on a monthly basis. So it's very hard to detect anomalous commands.

The other thing is that we should not only detect these calls but also be able to identify suspicious or malicious user activity, meaning the cases where these methods are being called from a hijacked or illegitimate user context. This is more important than the first one, I would say.
Let's have a look at some case studies that are specifically related to the topics I discussed. You can see here a detection that we got. In this case it was a user, Craig, calling the ListBuckets method to enumerate a data store. This is not a method this user calls a lot, so it's anomalous and rare. We could detect that and alert on it, and then we can simply go and ask the user, "Why are you calling this method?" So this is one thing that we can do.
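The idea that a method can be routine for the account as a whole yet rare for one particular user can be sketched as follows. The function name, the (user, method) input shape, the `svc-backup` account in the usage note, and the `max_share` cutoff are assumptions for illustration; a learned per-user baseline would replace the fixed cutoff in practice.

```python
from collections import Counter, defaultdict

def rare_calls(history, new_events, max_share=0.01):
    """Flag new (user, method) calls that are rare in that user's own history.

    history, new_events: lists of (user, method) pairs from API audit logs.
    A call is flagged when the method makes up at most `max_share` of the
    user's past calls. Users with no history are always flagged.
    """
    per_user = defaultdict(Counter)
    for user, method in history:
        per_user[user][method] += 1

    flagged = []
    for user, method in new_events:
        seen = per_user[user]
        total = sum(seen.values())
        share = seen[method] / total if total else 0.0
        if share <= max_share:
            flagged.append((user, method))
    return flagged
```

With a history in which a hypothetical `svc-backup` account calls `ListBuckets` constantly and `craig` never does, only Craig's call is flagged, even though `ListBuckets` is one of the most common methods in the account.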
There are some methods that are especially important. Not all methods are important, but some matter more: privileged methods, identity and access management methods, or sensitive methods. These methods are not called by normal users a lot. Developers or SRE users may call them to troubleshoot something or to do some break-fix work, but whenever you see a call to a sensitive or privileged method from an unusual user context, that could be an indicator that something fishy is going on. These are some other methods or examples.
Here you can see that the first method is SharedSnapshotVolumeCreated. This is a method that had never been called by this user before. It's very rare, so we could detect it using one of our machine learning jobs. The other one is RunInstances. This is very interesting, because many cloud accounts are automated, or at least controlled by automation, so it's very strange for users to log in and launch a virtual machine or run things manually. When you see this, it's a very strong indicator. More privileged methods like AssumeRole, for example, have also featured in a number of threat models and scenarios. These privileged and sensitive methods should not be coming from unusual geolocations or countries either, so we should take that into account as well.

The same goes for the ConsoleLogin method. What you can see here is a console login by a user called Craig, again, and we could detect this. We can do the same thing for authentication or login events, and we have had pretty good results plotting source anomalies or geographic anomalies.
Let's return to the metadata service I discussed earlier. This is a very important service, and it was an ingredient of the second case study. To review: in that case study, the threat actor was trying to pivot or bounce through a web service to get to the metadata service, extract a credential out of it, log back in to the account, and then enumerate the data stored in that account. So the metadata service is very important. It's used programmatically, and it's used a lot.
Cloud access management services can call this service, automation developed and instrumented by users can call it, and many other services may call it all day long. Now, how can we detect anomalies or threats related to this service? (Thank you.) We can rely on two things: a rare process name calling this service, or a rare user context or username calling this service. These are the two ways we can detect those threats. Now I want to show you examples of each of these scenarios.
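The two signals just described, a rare process name and a rare user context calling the metadata service, can be sketched in Python as below. The sketch assumes the events have already been filtered down to calls to the metadata service; the field names `process_name` and `user_name`, the function name, and the `min_seen` cutoff are hypothetical.

```python
from collections import Counter

def rare_metadata_callers(baseline, new_events, min_seen=5):
    """Flag metadata-service calls made by a rarely seen process or user.

    baseline, new_events: lists of dicts with "process_name" and "user_name",
    already restricted to calls to the metadata service (assumption).
    Returns a list of (event, reasons) for events with at least one reason.
    """
    process_seen = Counter(e["process_name"] for e in baseline)
    user_seen = Counter(e["user_name"] for e in baseline)

    findings = []
    for event in new_events:
        reasons = []
        if process_seen[event["process_name"]] < min_seen:
            reasons.append("rare_process")
        if user_seen[event["user_name"]] < min_seen:
            reasons.append("rare_user")
        if reasons:
            findings.append((event, reasons))
    return findings
```

Keeping the two reasons separate matters because they point to different follow-ups: a rare process suggests someone is querying the service by hand, while a rare user suggests a context that normally has no business talking to it.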
What you see here is a process. We're going to look at an example of the first thing, a rare process calling the metadata service. In this case it's curl: a user is trying to interrogate the metadata service by hand using curl, and this is really unusual. You may ask why. It's because these users work in a fully automated and regulated environment, so it's really unusual for them to do this by hand. The second one is an unusual user context, an unusual user calling this service. In this case it's a user called Craig. He doesn't normally do this and has no reason to do it. So this is a case study for the second item.

Now, in terms of future directions and the road ahead, there are several things we're doing right now. I think Craig has more than 700 unsupervised machine learning jobs in development at the moment, and he has done some work on detecting threats and LOLBin activity, especially living-off-the-land activity and those hard-to-find things, using unsupervised machine learning models.
He's also done some work, and plans to do more, which I may get involved in, on risk clustering, alert clustering, and multivariate correlation: combining or joining these machine learning alerts with conventional alerts, behavioral alerts, and the other types of alerts we have. So these are the things we plan to do or are doing right now, and this is a summary of how we do it at Elastic to detect threats. If you have any questions, I can maybe answer them generally. If you have more specific questions, you can shoot Craig an email directly.
No problem. Any other questions?
Okay, thank you. [Applause]