
Clustering Of Web Attacks: A Walk Outside The Lab

BSides Leeds · 2018 · 56:59 · 301 views · Published 2018-01 · Watch on YouTube ↗
About this talk
Abstract: A lot of research has been done on clustering attacks of different types using many machine learning algorithms, with high rates of success. This was mainly done from the comfort of a research lab, with specific datasets and no performance limitations. In this session I will share my experience with clustering attacks in near-real-time scenarios where performance is a key factor, and where reality punches lab statistics in the face. I will discuss some of the challenges we experienced during the research, like: 1) Applying a clustering algorithm to a stream of data. 2) Extracting meaningful features from limited data. 3) Translating different features into something we can calculate distance from. Speaker Bio: Gilad Yehudai is an algorithm developer and security researcher in Imperva's web application research group. Gilad develops algorithms and solutions using state-of-the-art machine learning, and also researches new security threats and vulnerabilities. Gilad holds a B.Sc. and an M.Sc. in Mathematics from Tel Aviv University. He has a very analytical and technical background with experience in both statistics and machine learning. A math geek by day and an avid snooker player by night (and vice versa).
Transcript [en]

So, hi. This talk is called clustering of web attacks, but it's actually about telling stories. What stories? Stories about attacks on web applications. What happens is that today we have many security solutions that protect web applications, and what they do besides protecting them is output many alerts, actually tons of alerts, and there is a very big need to find the stories, or the essence, behind this huge amount of alerts. So this talk is not about how to tell the stories; it's about how to find the stories worth telling. It's also a walk outside the lab, because I'm gonna share with you some of the dilemmas and some of the decisions we had to make along the way in order to get what we wanted out of the clustering of web application attacks.

First let me introduce myself. My name is Gilad, I have a master's degree in mathematics, and my background is mostly related to statistics, machine learning, and mathematics. I work at Imperva. Imperva is a security company with a wide spectrum of security products that protect web applications, but also databases, file shares, etc. What I do there is research; I work on the web application part of the company, and my work is mostly related to developing new products for web applications. That's it about Imperva, I'm not gonna mention it again in this talk.

So here's what I'm gonna talk about today. First I'm gonna discuss why we should cluster web application attacks and find the stories behind them. Then I'm gonna share with you our solution, how we did it, and some of the decisions we had to make. And finally I'm gonna share some of the results, some of the stories that we found behind the attacks. Let's start with some background about how we protect web applications and what the alerts are that we find.

In order to protect a web application, one possible way is to use something called a web application firewall, or WAF for short. A WAF is a network component that basically filters requests that go to the web application; it's a firewall that separates the application from the outside world, and every request sent from the outside world to the application is filtered by the WAF. The WAF decides whether it is an attack, and then stops it, or whether it is a legitimate request, and lets it go through. There are many kinds of WAFs; I'm not gonna go into how a WAF works, but the one thing I will say is that besides the fact that the WAF stops the attacks, it also reports them, and what the user of the WAF gets is a log of all the alerts that it found.

Here is a log, for example, a truncated log. We can see, for example, the ID, the type of the attack, the targeted URL, and the country the attack came from. Usually the log contains many more attributes, but let's look at this one. If we look closely we might be able to find some kind of pattern behind the attacks, but doing it this way is kind of hard, so let's rearrange them a little bit, and then we actually can find the stories behind them.

In this rearranged log, for example, there's a story about a scanner from China doing many kinds of attacks, and as you can see all the attacks are coming from the same subnet, the same class C. There's another story about a comment spammer from Russia, which is coming from many different IPs but always targeting the same URL pattern: news/<some topic>/comments. And there's another story about a remote code execution attack that is always targeting the same URL, index.php, but coming from many different countries, a distributed origin. So yeah, if we look closely we might be able to arrange the alerts in some way that finds the stories behind them.

But in real life it won't really work, because in real life we don't have just fifteen or a couple of dozen alerts. A customer usually has a couple of servers, a couple of applications, and gets around a couple of hundred thousand up to a couple of million alerts a day, so it just can't practically be done manually. We need an automatic way to find the stories, and that is what we tried to do: we take the alerts, where each alert captures a single suspicious request, and we try to narrow them down to a couple of clusters, maybe a couple of dozen clusters, where each cluster tells a story. The cluster captures some kind of wider phenomenon about the attack: the origin of the attack, the intent of the attack, and maybe the target.

That's basically what we're trying to do, and we want to give the customer some kind of report about what happened: what happened to you yesterday, what kinds of attacks did you see. For example: you saw SQL injection attacks on 18 of your servers coming from the same subnet, and this attack contained about 11K alerts. It's much easier saying it this way rather than showing 11K separate alerts and trying to figure out what's common to them. Or: a server from China is targeting your websites that run WordPress, using specific WordPress vulnerabilities. That's the kind of story we can tell, but in order to tell it we need to construct the clusters, to find the patterns behind the alerts.

I've said "clustering" a couple of times, so let me elaborate a little. Clustering is an area of research in machine learning, and machine learning is an area of research in artificial intelligence. The goal of clustering is to take data and group it in such a way that two items in the same group are more similar to one another than two items in different groups. For example, if we want to cluster a scatter plot in the plane, we can do it so that each color represents a different cluster. Clustering is not a specific algorithm; it's a family of algorithms. There are many different kinds of clustering algorithms, but every clustering algorithm you will find has basically three basic ingredients.

The first ingredient is the data: you need to decide what data you want to feed into the algorithm, and not just what data, but also how you want to structure it, and how you want to enhance the raw data that you get. The second ingredient is a distance measure between two data points. When you want to cluster a scatter plot it's very easy to calculate the distances, but we're trying to cluster attacks on web applications, and finding a way to know when two attacks are similar to one another is not such an easy task; I'll talk about it later. And the third ingredient is the algorithm itself: it should take the data and the distance measure, somehow digest them, and output the clusters, the stories. We need to find the right algorithm for us, and we'll see that for us it was actually quite a challenge to choose an algorithm that would perform well.

So I'm going to show you our solution, how we did it, and discuss all three basic ingredients, the data, the distance measure, and the clustering algorithm, that worked for us in relation to web application attacks. Let's start with the data. Every web application attack is basically an HTTP request; that's the raw data, the rawest data we have. An HTTP request has quite a lot of information inside it: we know the method, we know the URL the attack was targeting, we know all the headers (some headers are more important than others, and we need to decide which ones), and we know the parameters. But looking at the whole raw request is not enough; we need some kind of framework, some method to structure it. For example, we might not need all the headers; some matter more than others, and the rest might just be noise in our data, so we need to decide on the right headers for us. Maybe we want to separate the parameters located in the POST body from the parameters in the query string; maybe they have different mechanics that we need to use. So there are a lot of decisions to make about how we want to structure the raw data.

Besides the HTTP request, there's some additional data that might help us. One important attribute is the source IP; it may indicate the origin of the attack, who the attacker was. Another important attribute is the type of the attack: the WAF that stops the attack tells us whether it was, for example, an SQL injection, a directory traversal, or cross-site scripting, and that may also help us correlate between different attacks. We also know the time of the attack, so maybe we can correlate different attacks using their timelines, when each attack happened. And sometimes we also have information about the attacked application: was it an online store, or some kind of financial or banking application? It might make a difference later in the clustering.

Alright, so we have the raw data, but we can actually extract much more out of it, in a phase called feature extraction: we take the raw data and extract more out of it. A feature is basically any attribute in the data: the IP is a feature, the URL is a feature, the user agent is a feature. Everything we have in the data is a feature, and we want to extract more out of it. Let's look at a couple of features that we can extract more features from, and one of them is the IP. The IP may indicate the origin of the attack, but just by looking at the IP we can find out much more. Do you have any suggestions of what we can find from the IP? Yeah, IP geolocation, that's great. ASN, yeah, that's great. So we can find a lot of things from the IP, for example the geolocation: the country, the region, the city. Sometimes the country is not enough; knowing that two attacks came from the US is not a lot of information, the US is huge, and maybe knowing which state might help us more. We also know the exact coordinates, the provider, the ASN. Another important thing is that sometimes an attacker uses some kind of anonymity framework in order to hide the origin, so knowing whether an attacker uses this kind of anonymity framework might help us later on; for example, knowing that the IP is a Tor node or some kind of anonymous proxy will also help us.

Another important feature we have is the URL. The URL may indicate the target of the attack, which page or resource the attack is targeting. So what extra features do you think we can extract from the URL? Not everyone at once. By the way, we separate out the query string and use it to build the parameters. So, a couple of features we can extract from the URL. First, the file extension: is the attacker trying to target a JPEG, an HTML page, or a PHP page? It makes a difference. A second thing is whether there's some kind of pattern in the directories the attacker is trying to target, like we saw before with the attacks on pages of the form news/<topic>/comments. And another important feature we can extract from the URL is information about the attacked application itself: for example, if an attacker is trying to target a page called wp-config, which is a special page for WordPress applications, it may indicate that they're trying to target a WordPress application. That's also very important information that we'll use.
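
As a rough illustration of this kind of URL feature extraction (the function and field names here are hypothetical, not Imperva's actual implementation), a toy Python version might look like:

```python
from urllib.parse import urlsplit
import posixpath

def url_features(url: str) -> dict:
    """Toy feature extraction from a targeted URL: the file extension,
    the directory pattern, and a crude hint that a WordPress-specific
    page (wp-config) is being probed."""
    path = urlsplit(url).path
    ext = posixpath.splitext(path)[1].lstrip(".")   # jpg / html / php ...
    return {
        "extension": ext,
        "directory": posixpath.dirname(path),        # e.g. /news/<topic>
        "wordpress_hint": "wp-config" in path,       # WordPress target page
    }

print(url_features("/wp-config.php"))
# {'extension': 'php', 'directory': '/', 'wordpress_hint': True}
```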

So now we have the data; we have all the features for our algorithm. We have quite a lot of features, about forty to fifty different ones; there's not enough time to cover everything. The next thing we need is a distance measure, the second ingredient. If you want to find the distance between two points in a plane, there's an exact formula for it; it's very easy. But we're trying to do something much harder: take two attacks on web applications, which have many complex features, and find the distance between them. I'm gonna give a couple of methods to calculate the distance, but before showing specific methods, here are a couple of considerations we had to make.

The first consideration is that we have a lot of features, and the features are quite different from one another, so we need to find a distance measure for each feature on its own. I mean, we can't find a single distance measure that works for IPs, URLs, file extensions, and so on; we need to look at every feature by itself and find its own distance measure. The second consideration is about normalization. By normalization I mean that all the distances will finally be numbers, but they should be normalized, or in other words on the same scale. For example, say we have two features: feature A, whose distance ranges between 0 and 1, and feature B, whose distance ranges between 0 and 100. That's not the same scale, so if two attacks are very far from one another in feature A, they'll have a distance of 1 or so, but compared to feature B they'll still look very close to one another, and that's a huge bias in the results. So we need distance measures that are easy to normalize, and we normalize them all onto the same scale; a good scale is usually 0 to 1. That's the typical choice, but it's not the only way.

And the last consideration is that even with a distance measure for each feature, that's not yet what we're after. We're trying to find the distances between attacks, so we need a way to take all the per-feature distance measures and combine them into some kind of total distance between the attacks themselves, attacks on web applications.

So let's see a couple of distance measures that might work. The first one is a distance between strings; we have a lot of strings in the data: the URL is a string, the file extension is one, the user agent, many features. The Levenshtein distance between two strings is the minimal number of single-character edits you need in order to get from the first string to the second. For example, to go from pictures/cat.jpg to pictures/dog.jpg you need exactly three single-character edits. By single-character edits I mean you can insert a character, delete a character, or substitute a character; those are the edits allowed in the Levenshtein distance. This distance actually worked quite well for URLs; it measures some kind of similarity between the strings.
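
A minimal sketch of the Levenshtein distance just described (the dog.jpg filename is my illustrative guess at the second path in the spoken example):

```python
def levenshtein(a: str, b: str) -> int:
    """Minimal number of single-character insertions, deletions, and
    substitutions needed to turn string a into string b (dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                  # deletion
                            curr[j - 1] + 1,              # insertion
                            prev[j - 1] + (ca != cb)))    # substitution
        prev = curr
    return prev[-1]

print(levenshtein("pictures/cat.jpg", "pictures/dog.jpg"))  # 3
```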

But will it work for all of our string features? The answer is, surprisingly or maybe not so surprisingly, no. For example, take the file extension: is CSS close to CSV? CSS is a stylesheet format related to HTML; CSV is comma-separated values, like an Excel file. They have no relation to each other at all, but if you take the Levenshtein distance between those two strings, or any other string-similarity distance (there are many more), they will be pretty close to one another, just because of the C and the S. So this distance measure won't work for all features, and we need another distance measure for this kind of feature. Actually, we have many features that are strings where the fact that two strings look similar has no meaning at all; for example the country: if two country names sound the same, what does that mean?

So another method to measure distances is called the discrete distance, and it's a very easy distance measure that works for any kind of data: take two objects, and if they are the same the distance is 0; if they are not the same the distance is 1. This distance measure is already normalized between 0 and 1, and it works quite well for certain strings; it works pretty well for file extensions, for example, if you want to separate them from one another. You can also do a kind of weighted discrete distance if you have granular data. For example, we have the country, the region, and the city from the geolocation that we extracted before, and we can say: if two attacks are coming from the same country, give them some distance x; if they are coming from the same country and the same region, give them distance y, which is smaller; and if they are coming from the same country, region, and city, give them distance z, which is even smaller. It's just a granular way to describe distances, and it's not such a bad distance measure to work with.
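
A sketch of that weighted discrete distance over geolocation; the concrete values below are illustrative stand-ins for the x > y > z of the talk, which doesn't give numbers:

```python
def geo_distance(a: dict, b: dict) -> float:
    """Weighted discrete distance on granular geolocation: the more
    location fields two attacks share, the smaller the distance.
    0.6 / 0.3 / 0.1 play the roles of x > y > z from the talk."""
    if a["country"] != b["country"]:
        return 1.0   # different countries: maximal distance
    if a["region"] != b["region"]:
        return 0.6   # same country only (x)
    if a["city"] != b["city"]:
        return 0.3   # same country and region (y < x)
    return 0.1       # same country, region and city (z < y)

print(geo_distance({"country": "US", "region": "CA", "city": "LA"},
                   {"country": "US", "region": "CA", "city": "SF"}))  # 0.3
```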

The next feature I want to talk about calculating a distance for is the IP. The IP is a very important feature; it indicates the origin, and there are many ways to calculate distances between IPs, so let's talk about a couple of them. The first one is to use the geolocation. We know the exact coordinates of each IP, so we can calculate the distance between every two IPs using the real geographic distance between those coordinates: if we have an IP from one city and an IP from Paris, we calculate the distance between the two cities, and that's the distance. This distance function actually doesn't work so well; it's a pretty bad distance function, for a couple of reasons. First of all, the real location of the IP usually doesn't tell us much: two IPs may be close to a border but come from completely different countries, so the distance between them will be pretty small. Another bias is that there are some huge countries in the world, like China: two attacks coming from China can be extremely far from one another even though they do have some correlation between them, while two attacks coming from different countries in Europe can be pretty close to one another. And another disadvantage is that it's not so easy to normalize this distance function; there's no natural way to normalize it to be between, say, 0 and 1, or 0 and anything. So we need to look for another distance function, and to do that we need to look at the structure of the IP.

I'll only discuss IPv4 here. The structure of IPv4 is that every IP is made of four numbers, each between 0 and 255, so we can look at the IP as four-dimensional data, and we know how to calculate distances between four-dimensional points; there's an exact formula, and it's very easy. We can do it this way, but to improve the distance function we need to give different weights to the different parts of the IP, because the numbers on the left part of the IP are more significant than the ones on the right. So for example we could give weights of 1, 10, 100, and 1000, and if you want this distance measure to look more like a Euclidean measure, the regular measure you know, you can take the square root of the formula. This distance function might also look pretty good at first glance, but it has a couple of disadvantages. The first is that the weights are pretty much arbitrary: if you calculate distances with other weights, maybe you'll get better results, maybe not, but they're kind of arbitrary, and there's no natural way to choose them.
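
A sketch of that weighted four-dimensional distance, with the largest weight on the leftmost (most significant) octet. The exact weights are assumptions on my part, which is exactly the drawback being described:

```python
import math

def ip_octet_distance(ip1: str, ip2: str) -> float:
    """Weighted Euclidean distance between two IPv4 addresses viewed as
    4-dimensional points; the leftmost octet is weighted heaviest."""
    weights = (1000, 100, 10, 1)
    a = (int(x) for x in ip1.split("."))
    b = (int(x) for x in ip2.split("."))
    return math.sqrt(sum(w * (x - y) ** 2 for w, x, y in zip(weights, a, b)))

print(ip_octet_distance("10.0.0.1", "10.0.0.2"))  # 1.0: only the last octet differs
```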

The second disadvantage is that you can normalize this distance function, but that's also not very natural, because the distance could go from zero to something extremely large. So we need yet another way to look at the IP, and there is one: looking at the IP as thirty-two-dimensional data. It may sound like a lot of dimensions, but trust me, this distance function is actually pretty good. Every IP is made of four numbers between 0 and 255; that's exactly 8 bits each, and 8 times 4 is 32, so each IP is represented in 32 bits. What we do is look at the mutual prefix of the two IPs from the left, take the size of this mutual prefix times 1/32 (the 1/32 normalizes it to be between 0 and 1), and then take 1 minus the result, to get a real distance function. This is actually a pretty good distance function to work with: it looks at the structure of the IP, it's already normalized between 0 and 1, and it's very easy to interpret why we got a given distance. So this works much better than the previous methods for calculating distances between IPs.
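
This one is concrete enough to sketch directly: distance = 1 - (length of the common bit prefix) / 32:

```python
def ip_prefix_distance(ip1: str, ip2: str) -> float:
    """1 - (shared leading bits)/32 between two IPv4 addresses;
    already normalized to [0, 1]."""
    bits = lambda ip: "".join(f"{int(octet):08b}" for octet in ip.split("."))
    a, b = bits(ip1), bits(ip2)
    shared = 0
    while shared < 32 and a[shared] == b[shared]:
        shared += 1
    return 1 - shared / 32

print(ip_prefix_distance("10.0.0.1", "10.0.0.2"))     # 0.0625 (30 shared bits)
print(ip_prefix_distance("10.0.0.1", "192.168.0.1"))  # 1.0 (first bit differs)
```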

So let's say we found a way to calculate a distance for every feature; as I said, it's something between 40 and 50 features. But what we really want is to calculate the distances between the attacks themselves. We know how to calculate distances between IPs, between URLs, between file extensions, and so on, but how do we calculate distances between the attacks? A common and popular way is what's called a weighted sum of the distances: the distance between the first attack and the second will be some weight w1 times the distance between the IPs, plus some other weight w2 times the distance between the URLs, and so on, up to, say, wd times the distance between the user agents, where d is the number of features we have in the data. But where do these w's come from? In machine learning they're called hidden weights; these are the weights that will actually determine our distance function, and we need to find them somehow. There are a couple of methods to find these weights; I'm going to show you two of them, though there are more. The first way is through domain expertise and trial and error, just manual setting: we know the features, we extracted them, and we should know which features are more important than others.
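
The weighted sum itself is straightforward once the per-feature distances exist; here's a minimal sketch with made-up weights and only two features for brevity:

```python
def attack_distance(a: dict, b: dict, feature_dists: dict, weights: dict) -> float:
    """Total distance between two attacks: sum over features of
    w_f * d_f(a, b), where each d_f is assumed normalized to [0, 1]."""
    return sum(weights[f] * dist(a[f], b[f]) for f, dist in feature_dists.items())

discrete = lambda x, y: 0.0 if x == y else 1.0   # the discrete distance from earlier

feature_dists = {"country": discrete, "extension": discrete}
weights = {"country": 0.7, "extension": 0.3}     # hidden weights (illustrative)

a = {"country": "CN", "extension": "php"}
b = {"country": "CN", "extension": "jpg"}
print(attack_distance(a, b, feature_dists, weights))  # 0.3
```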

So we can do some trial and error and manually set the weights, and it works pretty well if you have a lot of data. There are a couple of disadvantages, though. First, it's hard to generalize this result, because when we do this kind of manual setting we do it on a couple of data sets, a couple of trials and errors, but when we want the algorithm to work in real life it will run on data sets it has never seen before, and it's pretty hard to generalize something we tuned manually. The second thing is that it's a lot of work, and trust me, it really is a lot of work doing it manually; it can be done, but it takes a lot of time.

Another method is more structured, and that is using supervised learning. Supervised learning is the more common practice in machine learning, where you get labeled data and you want to predict something related to the data, and that's what we're trying to do here. So we need labeled data about our features, about our clusters, and that's not so easy to get. The way to get labeled data is to take a data set and cluster it manually, I mean manually finding the clusters themselves, and then we need to decide which labels to give and what will actually go into the supervised learning algorithm. The data we use is every pair of alerts: each pair of alerts gets a label, and the label is either that they are in the same cluster or that they are not. So what we have now is many pairs of alerts, and for each pair we know whether it is in the same cluster or not, and we use some kind of classification algorithm to find the weights. It might sound a bit fuzzy, but it's actually very common practice in machine learning; binary classification, where you have two different classes, is like the most basic task people do when they learn machine learning. I'm not gonna go into how it's done, but if you read a little about machine learning you'll see it's not that hard.

There are also a couple of disadvantages to this method. The first is that it's very hard to get the labeled data, because to get it you need to take a data set and cluster it manually, and clustering manually is a very hard task: you need to take, say, a couple of hundred thousand alerts and find the patterns between them by hand. It can be done, but it can't be done for many data sets, because it just takes too much time. The second disadvantage is that this method is prone to overfitting, mostly because we don't have much labeled data; overfitting is when a machine learning algorithm just fixates on the noise present in the data itself. Overfitting is a very large problem in basically all machine learning algorithms, and specifically in this one, so that's also a disadvantage. And the last one, an important one, is that this method misses the structure of the cluster. Say we have a cluster with three alerts: alert A is close to alert B, and alert B is close to alert C, but alerts A and C don't have much of a relation between them; it's some kind of chain of alerts, and this method misses this kind of structure because we only look at the pairs.

So the way we did it is some kind of composite of the two methods: we did a kind of manual setting of the weights, and then we used a classification algorithm to improve them, to adjust the weights to be better. There are many methods to find the weights, and it's actually not such an easy task.

Alright, so we have the data, we have the distance function, we know how to calculate distances between attacks; we need the last ingredient, the algorithm itself. For the algorithm we used a somewhat less common kind of clustering, which is called clustering in streaming mode, and the reason is that when we do the clustering we need to somehow store the alerts in memory, and we just have too many alerts. Each customer has between a couple of hundred thousand and sometimes even a couple of million alerts a day, and if we want to do this for all of our customers, it's just too much data to store in memory. So we need a way to do it in streaming, and in streaming what we do is take an alert, cluster it somehow, do something with it, and then throw it away; we don't store the alert in memory. This is different from the usual batch clustering: in batch clustering we have all the data at the beginning, we ingest it into some algorithm, and the algorithm outputs the clusters; when people talk about clustering, that's what usually happens. But we had to do clustering in streaming mode, where we have some current clustering state, the clusters that exist right now, and a new event arrives, and we need to update the clustering state somehow: take the event, cluster it, output an updated clustering state, and then go back from this new state to get another event, and that goes on all the time, online, in streaming mode.
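
The streaming loop just described can be sketched as a skeleton; `update_state` here is a placeholder for the actual (unpublished) update rule that clusters the event and possibly splits or merges clusters:

```python
def stream_cluster(events, update_state, initial_state=None):
    """Clustering in streaming mode: each event updates the current
    clustering state and is then discarded, so only the (small) state
    lives in memory, never the raw alerts themselves."""
    state = initial_state if initial_state is not None else {}
    for event in events:
        state = update_state(state, event)  # cluster it, maybe split/merge
        # the event itself is thrown away here; only `state` is retained
    return state
```

With a trivial update rule that just counts events per key, the skeleton behaves like a streaming histogram; the real update rule maintains cluster summaries instead.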

Let's focus a little on clustering in streaming mode and a couple of considerations we had to make. The first is that we have a very limited amount of memory; as I said, we cannot store all the alerts in memory, but we can store something, and we need to decide what we want to store, and to do it in some kind of smart way. The second consideration is that decisions have to be made in real time: an event comes into the system, we need to cluster it right away, do something with it, and then throw it away. We can't just wait for a couple more events to arrive; in online streaming mode we need to make the decisions right when the alerts enter the system. And the third consideration is that we must have the ability to undo our decisions. What do I mean by undoing decisions? For example, if you look at this clustering state, all the points in blue are in the same cluster; at some point the algorithm decided these points should be in the same cluster, but after a new event entered the system it split the cluster, so it undid a decision it had previously made. We need a way to let the algorithm undo its decisions.

I'll talk a little about how we did it; the algorithm itself is kind of technical, so I'm not gonna go into the details, I'll just describe in general the methods we used to achieve this kind of clustering algorithm. The first thing we did is store aggregated statistics about the data instead of the data itself. For example, if a URL was attacked 100 times, instead of storing this URL 100 times in memory, we can store it once and record that it was attacked 100 times. That saves a lot of space, and we can keep this kind of aggregated data in memory.
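
The aggregation idea maps directly onto a counter; this toy version keeps one entry per URL rather than one stored alert per hit:

```python
from collections import Counter

# One Counter entry per URL replaces 100 stored alerts.
cluster_urls = Counter()
for alert in [{"url": "/index.php"}] * 100:   # 100 identical toy alerts
    cluster_urls[alert["url"]] += 1

print(cluster_urls)  # Counter({'/index.php': 100})
```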

So that's the first thing. The second thing is that we are able to undo decisions, but we undo them based on aggregations, meaning that if we want to take, let's say, an alert out of a cluster, we cannot take a single alert; we need to take some aggregation of alerts and take them out. It does make the algorithm a bit weaker, but for our data it worked well. These are some of the constraints we had to deal with when we build algorithms that work online, on real data. And the

third thing we did is that we actually used two distance functions, not one. One is a light distance function that doesn't use all the features; it uses just a very small subset of the features we have, so it's very easy and takes minimal time to compute, and every time an event comes in we use the light distance function first. This light distance function really does two things. The first thing is that it tells us when two events should be in the same cluster, and it does that very well; but the second thing is that it doesn't reliably tell us when two events shouldn't be in the same cluster.
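A sketch of that idea: compute a cheap distance first, and only compute an expensive, full-feature distance when the cheap one says the events are far apart. All feature names and thresholds here are invented for illustration.

```python
def light_distance(a, b):
    # cheap: compares only a couple of features
    return abs(a["url_len"] - b["url_len"])

def heavy_distance(a, b):
    # expensive: compares every feature (40-50 in the real system)
    return sum(abs(a[k] - b[k]) for k in a)

def same_cluster(a, b, light_thresh=2, heavy_thresh=10):
    # trust the light function only when it says "same cluster"
    if light_distance(a, b) <= light_thresh:
        return True
    # otherwise fall back to the heavy, full-feature distance
    return heavy_distance(a, b) <= heavy_thresh

e1 = {"url_len": 10, "param_count": 3, "payload_len": 40}
e2 = {"url_len": 11, "param_count": 3, "payload_len": 42}
print(same_cluster(e1, e2))  # True, decided by the light function alone
```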

So when the light distance function tells us that two events should be in the same cluster, we say all right, we trust this distance function, and we move forward to the next event. But when it tells us that two events shouldn't be in the same cluster, we don't trust it, and we use a heavier distance function. This heavier distance function considers all the features that we have in the data, all 40 or 50 of them, and it takes much more time to compute, but using this granularity in computing distances we are able to make the decisions in real time. So that's basically our solution. The algorithm

itself, well, it's kind of technical to go into the details, but the field of clustering data in streaming mode is actually pretty well researched; there's a lot of work in this field and we're not the only ones doing it. So now we have all the ingredients: we have the data, we have the distance functions and we have the algorithm itself. Let's look at some of the results. We tested the algorithm on many data sets, and each data set contained alerts for a single customer. A single customer doesn't mean a single application; usually a customer defends a couple of applications. But it's a single customer, and a time frame of around

two days. So each data set is a single customer over two days. Let's look at a couple of stories that we found. The first story is about attacks on Apache Struts. Struts is a pretty popular framework for developing web applications; it's part of the Apache Software Foundation. There were many vulnerabilities released related to Apache Struts, and there still are, so that's a story we can tell about Apache Struts attacks. What's interesting to see is, first, that all the attacks came from a single country, China, but from very different IPs, from different regions, and these regions are quite far away from one another in

China. But all the attacks were using the same attack tool, auto Spider 1.0, and another important attribute is that all of the attacks were targeting vulnerabilities of Apache Struts, but different kinds of vulnerabilities; there were actually vulnerabilities from 2017, 2016 and even 2013. So this attacker used some kind of distributed network to attack a single customer over a very short time period, trying all kinds of Apache Struts vulnerabilities in order to find the one that works, because maybe he thinks this customer uses Apache Struts and he can somehow find the vulnerability that will work for him. So that's the first story. The

second story is about a scanner called OpenVAS. OpenVAS is an open vulnerability scanner, a pretty popular one, and the way it usually works is that it sends a couple of requests to open a session against the attacked application, and then it launches the real attack. Well, the real attack contained many, many alerts, in this case almost 2,000, and the kinds of attacks are very varied; it actually tries practically everything: directory traversal, SQL injection, cross-site scripting, remote code execution, almost everything, against very distributed targets. But that's not the attack itself; that's only part of what happened. The first

three rows here are the rows from the last slide, so actually that's not even the whole attack; the real attack contained about 45,000 alerts, and this is just a fraction of it, because the board is not big enough to show everything. So basically what this story tells us is not about the origin and not about a specific type of attack, but about the method the attacker is using, because every kind of attack here is targeted at the same customer and works in the same way: it sends a couple of requests to open a session, and then it launches the real attack, which consists of something around 2,000 requests, every time. So

this is a very targeted attack on a single customer from many different places around the world. You can see Singapore, South Africa, the US, Italy, and there are many other countries the attack originated from, all targeting the same customer. So someone is trying to find something that will get through at this customer. And we actually found many more stories, sometimes related to the origin, sometimes related to the URL or the data the attacker is targeting, and using the clustering we can find the real stories that are interesting for the users of the web application firewall. So I'm going to share with you a couple of next steps,

a couple of things that we think should happen with this kind of algorithm. The first is: let's say we run the algorithm and we find the clusters, we find the stories; maybe we can use these stories in order to automatically create or suggest new blocking rules. For example, if we see an attack coming from a single user agent, using some kind of tool, targeting a specific URL, maybe we can use these attributes of the attack in order to block the attacker before we even process the request itself, thus providing extra protection and extra security for the customers. The second is that right now we're working

basically on data sets of one or two days, but not much more; maybe we can look at clusters across time. For example, let's say we have some application that is attacked every day for a month with the same method; maybe we can do some kind of clustering of the clusters that we already saw, in order to find some wider phenomenon, some pattern that recurs in the attacks on this application, but over a larger period of time. And there are many other things that can be done in this area. A couple of key takeaways that I think are important from this talk: the first is that we have many alerts

in the data. I mean, that's what security products do: they protect and they alert. But what we actually want is not many alerts; it's just a few stories, to understand what really happens in the application. That's what this talk is all about, and that's what we're trying to do: take the huge amount of alerts and boil them down to a couple of dozen stories that are really worth telling, because that's really the essence of what's happening when the application is attacked. The second point is that clustering won't work out of the box. I mean, it's not the kind of algorithm where you can take the

data, put it into an algorithm and hopefully get clusters. You need to do a lot of work on the process: you need to do a lot of listening to the data, you need to decide what distance measure to take, you need to decide on the right algorithm. Even if you don't work in streaming mode, there are still many different algorithms out there and you need to choose the right one for you. It's just not the kind of algorithm that works out of the box. And the third point is that this field of research, clustering web application attacks, or cyber attacks in general, is not a new field; there are a couple

of academic papers about it and there's been much research in this area, but I think there's a lot more research to be done, a lot more exciting attacks to see, especially using different kinds of data, different kinds of attacks or different kinds of systems. So thank you very much for being here, thank you BSides for having me, and I hope you have a lovely conference. If you have any questions, I'd be happy to answer them.

The question is what kind of tools we use to get the data. So what my company does is run a web application firewall, and it has many customers, so we get the data from seeing those logs; this data is not publicly available, unfortunately. As for how we process the data: there are many open machine learning libraries, mostly in Python, and we used mostly Python to do it. scikit-learn is a popular one, and for deep learning there are many others: TensorFlow, PyTorch, DyNet if you know it. So there are many different libraries, and we used mostly Python to process the data.

Yeah, okay, it's a good question. The question was about supervised versus unsupervised learning, and whether we see any research related to neural networks. I think there's much that could be done with neural networks here. One thing that neural networks especially require is a lot of data for the algorithm to work, and the problem is that we don't have that much data. Also, neural networks usually work in supervised learning settings, and we don't have much labeled data here; that's actually a big problem in many other fields of research related to neural networks, that we don't have enough data. So if someone

comes up with a lot of data somehow, they might do something really fascinating and really brilliant with neural networks in this field. Yeah, actually we found a lot of stories related to attackers spoofing the user agent. We saw a cluster where the attackers tried to create many different user agents, but they always had the same prefix and some kind of randomized suffix, so if we use some kind of Levenshtein distance it will catch it. If the attackers completely randomize the user agent into gibberish, then the Levenshtein distance won't work, but we still have a lot of other features.
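To make the prefix point concrete, here is a plain Levenshtein (edit) distance in Python applied to two made-up spoofed user agents; agents that share a fixed prefix and differ only in a short randomized suffix stay close under this distance:

```python
def levenshtein(s, t):
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (cs != ct)))   # substitution
        prev = cur
    return prev[-1]

# Spoofed agents sharing a prefix with a randomized suffix stay close:
a = "EvilBot/1.0 xqzt"
b = "EvilBot/1.0 pmnr"
print(levenshtein(a, b))  # 4: only the four random suffix characters differ
```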

Usually attackers have some kind of pattern in the URL, and we also have other methods to identify a tool, not only the user agent. For example, we can look at which headers exist and which don't, the order of the headers, the attack methods. So basically the algorithm doesn't rely on a single feature; that's, let's say, one of the biggest advantages of this method.
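For example, header presence and order can be folded into a simple feature; a toy fingerprint over made-up header lists might look like this:

```python
def header_fingerprint(headers):
    """Fingerprint a request by which headers appear and in what order."""
    return "|".join(name.lower() for name in headers)

tool_a = ["Host", "User-Agent", "Accept"]
tool_b = ["User-Agent", "Host", "Accept"]  # same headers, different order

print(header_fingerprint(tool_a) == header_fingerprint(tool_b))  # False
```

Two tools that send the same headers in a different order produce different fingerprints, which is exactly the kind of signal the user agent alone would miss.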

That's a great question: how do we measure the success of the algorithm? That's a basic problem with all unsupervised learning algorithms: you don't have a benchmark that you can compare against. So one way is to create a benchmark yourself: take a data set, cluster it yourself somehow, then run the algorithm on that data set and compare the two. Another method is to get feedback from users of the algorithm: give them the algorithm and listen to what they say. Another method, which we actually use quite a lot, is to literally test the algorithm on

many data sets and see whether it works well or not, and if someone has a lot of domain knowledge then it can work that way. But there's no one great way to tell whether it worked. Yeah, the follow-up question was, based on the methods I just mentioned, how accurate is it. It depends on how you define false positives, because a false positive can be two events that should be in the same cluster but are not, or it can be a story that is based on the logs we gathered but is not accurate, so it is essentially telling us something which didn't really happen.
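One way to score the agreement with a hand-made benchmark is a pairwise measure like the Rand index; here is a small plain-Python version (in practice a library metric such as scikit-learn's adjusted_rand_score plays this role, and the labels below are invented):

```python
from itertools import combinations

def rand_index(labels_true, labels_pred):
    """Fraction of event pairs on which two clusterings agree."""
    pairs = list(combinations(range(len(labels_true)), 2))
    agree = sum(
        (labels_true[i] == labels_true[j]) == (labels_pred[i] == labels_pred[j])
        for i, j in pairs
    )
    return agree / len(pairs)

truth = [0, 0, 1, 1, 2]        # hand-labeled benchmark clusters
predicted = [0, 0, 1, 2, 2]    # what the algorithm produced
print(round(rand_index(truth, predicted), 2))  # 0.8
```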

Okay, so basically we did a couple of benchmarks and the algorithm worked pretty well on those benchmarks, but of course we built the benchmarks ourselves and we also tested on them, so we would expect it to work well there; I can't give you a specific number like X percent false positives. Okay, so to the first question: yeah, we observed a couple of cases related to this, but if you choose the algorithm correctly then, let's say, you can have an alert A which is close to alert B in a couple of features, and alert B close to alert C in a couple of other features, and they will all be in the same cluster, and you get some kind of

chain of alerts in the same cluster. So if you choose the algorithm correctly, it will handle this kind of situation; it also depends on what you want, because sometimes it's something you don't want, but we were aware of it. And to the second question: no, we didn't consider any clustering methods that are not distance based. I don't think I know of clustering methods that are not distance based; I mean, in any clustering you need some kind of distance measure, some kind of proximity, you usually need to consider distances between data points. If not, I'll be happy to learn otherwise.

So thank you very much. If you have any questions I'm still here, but I need to finish, so thanks. [Applause]