
Hello everyone, good morning, good afternoon, good evening, wherever you are in the world. My name is Vivek Malik, and together with my colleague Kumar Vikramjeet I'll have a short discussion about a project that we recently open sourced.

Next: who are we? A couple of words about us. We are Adobe's Security Intelligence team, which is part of Adobe's Security Coordination Center. The purpose of the team is to do data science research in the security field. We mostly focus on reactive security, which is basically identifying threats that cannot be detected by conventional means. In other words, we use the logs and security data that we collect from Adobe's assets and try to find anomalies and the bad stuff.
The members of this team are Andrei Cotaie, Tiberiu Boros, Kumar Vikramjeet, and myself, Vivek Malik.
Next: OSAS, or One Stop Anomaly Shop. This is a machine learning framework aimed at discovering anomalies in a given data set. The open source project is an implementation of several of the Adobe Security Intelligence team's patents, white papers, and other projects. The goal of OSAS is to make it as easy as possible to experiment with different data sets, algorithms, and feature combinations, and to find the most balanced combination for your own use case and data. But most importantly, OSAS will try to answer why an anomaly is actually considered an anomaly. To better understand this, let's talk quickly about anomaly detection.

Next: the anomaly detection principle. This principle is as simple as pie: anomalies represent extremely rare events.
The more complex question is how. There are some well-known algorithms for outlier detection, such as Local Outlier Factor, Isolation Forest, k-means with the distance from the center of a cluster, reconstruction error, and so on. Sometimes these algorithms flag anomalies and our SOC and security experts don't understand why they are actually anomalies, and the simple answer, that anomalies represent extremely rare events, is not enough for them. So, coming back to the anomaly detection principle, let's look at the picture on the slide and try to identify an anomaly. Everyone's first reaction would be that the egg of a different color, the brown egg, is the anomaly, and maybe you're right. But if you look closely at the picture, all the eggs are wearing face masks except maybe two or three, so maybe that's the anomaly I'm actually looking for.
The bottom line is that the way you prepare your data is going to influence what kind of anomalies you generate, so data preparation is fundamentally important.
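As an aside, here is a minimal sketch of how two of the outlier detectors mentioned above can be run with scikit-learn on a toy numeric data set; the data and parameters are purely illustrative and not part of OSAS itself.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(42)
# 500 "normal" points plus two obvious outliers
X = np.vstack([
    rng.normal(loc=0.0, scale=1.0, size=(500, 2)),
    np.array([[8.0, 8.0], [-7.5, 9.0]]),
])

# Isolation Forest: -1 marks an outlier, 1 marks an inlier
iso = IsolationForest(contamination=0.01, random_state=42).fit(X)
print(np.where(iso.predict(X) == -1)[0])

# Local Outlier Factor: same -1/1 convention, but fit_predict only
lof = LocalOutlierFactor(n_neighbors=20, contamination=0.01)
print(np.where(lof.fit_predict(X) == -1)[0])
```

Both detectors will typically flag the two injected points, but, as the egg example shows, they cannot tell you why a point is an outlier; that is the gap OSAS tries to fill with its labels.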
Now let us try to bring more context to our egg situation and briefly describe each one of them. We see that a couple of eggs appear to be in love, one is sneezing, one definitely has eye problems, some are mad, some are scared, and so on. This is important because the fact that we have a brown egg is not the most exotic feature anymore: the sneezing egg becomes important, the egg that is cranky is also unique, and so on. In other words, what we propose is the following: we describe each observation, eggs in this case, with a set of labels, and then let anomaly detection algorithms decide which ones are actually the anomalies. The way we generate the labels becomes the entire magic.
So you see in this picture that one egg is white, has a mask and open eyes, and needs eye surgery; that becomes the outcome of the algorithm. Similarly, another egg is white, has a mask and open eyes, and is in love, or maybe has chicken pox and looks surprised; that becomes the outcome of the algorithm we run. Another egg is white, has a mask and open eyes, but is angry, so probably that's the anomaly we are looking for. These are the labels we are trying to implement with One Stop Anomaly Shop.

Next, let's talk about the tags, or the label generators. We have three types of tags.
I'll start by saying that when we designed OSAS we had semi-structured log data in mind. This type of data is usually found in access logs or error logs from Apache, Tomcat, Hubble, and so on; Hubble, by the way, is another open source project from Adobe that you can check out. This information is semi-structured because you can infer attributes from it: for instance, while an Apache log is text-based, you can extract the timestamp, IP address, URL, response code, and so on.
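As a small illustration of what inferring attributes from such a log looks like, here is a sketch that pulls structured fields out of a single Common Log Format line with the Python standard library; the log line is made up, and real access logs come in several variants:

```python
import re

line = '203.0.113.7 - - [10/Oct/2023:13:55:36 +0000] "GET /index.html HTTP/1.1" 200 2326'

pattern = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) [^"]*" (?P<status>\d{3}) (?P<size>\S+)'
)
print(pattern.match(line).groupdict())
# {'ip': '203.0.113.7', 'timestamp': '10/Oct/2023:13:55:36 +0000',
#  'method': 'GET', 'url': '/index.html', 'status': '200', 'size': '2326'}
```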
One important aspect is that this data contains many attributes with unbounded values, for instance the IP addresses and URLs. These attributes generate a large feature space; by comparison, the number of examples that can be contained in a manageable data set and processed by machine learning algorithms is relatively small. These two factors combined generate something known as data sparsity, an unwanted effect that makes machine learning algorithms overfit the training data and generalize poorly on previously unseen examples.
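To get a feel for the sparsity problem, consider what happens if you naively one-hot encode an unbounded attribute such as the client IP; the numbers below are synthetic and only meant to show the shape of the problem:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# ~5,000 events, client IPs drawn from a fairly large address pool
ips = [f"10.0.{rng.integers(256)}.{rng.integers(256)}" for _ in range(5_000)]
df = pd.DataFrame({"client_ip": ips})

one_hot = pd.get_dummies(df["client_ip"])
print(df["client_ip"].nunique())   # thousands of distinct values...
print(one_hot.shape)               # ...so almost as many feature columns as rows
print(one_hot.values.mean())       # and nearly every cell in the matrix is 0
```

With only a few thousand examples to learn from, a matrix like this is exactly the kind of input that makes a model overfit.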
To cope with this, OSAS reduces the large feature space by employing a tagging strategy on the raw data set. This is done before feeding the data to the anomaly detection or classification algorithm. The tagging strategy uses predefined recipes for specific field types and is based on three types of tagging strategies: standard tags, text tags, and expert knowledge.

Next: there are five types of label generators under these tagging strategies combined, and we're going to go through each of them separately. Let's take a look at the multinomial field. For this field type, the detection counts unique attribute values, and usually there are fewer than 10 unique attribute values. The kinds of models that you can run on multinomial fields are the statistical distribution of values, labels based on value frequency, and a special tag for any unseen values.
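As an illustration only (this is not the actual OSAS recipe), a multinomial-style label generator can be sketched roughly like this: learn the value frequencies of a low-cardinality field at training time, then emit a frequency-based label per value, with special tags for rare and unseen values.

```python
from collections import Counter

def fit_multinomial(values):
    """Learn how often each distinct value of a multinomial field occurs."""
    counts = Counter(values)
    total = sum(counts.values())
    return {value: count / total for value, count in counts.items()}

def tag_multinomial(field, value, freqs, rare_threshold=0.05):
    """Turn one raw value into a label; rare and unseen values get special tags."""
    if value not in freqs:
        return f"{field}_unseen".upper()
    if freqs[value] < rare_threshold:
        return f"{field}_rare".upper()
    return f"{field}_{value}".upper()

freqs = fit_multinomial(["GET"] * 90 + ["POST"] * 9 + ["TRACE"])
print(tag_multinomial("method", "GET", freqs))     # METHOD_GET
print(tag_multinomial("method", "TRACE", freqs))   # METHOD_RARE
print(tag_multinomial("method", "PATCH", freqs))   # METHOD_UNSEEN
```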
For numerical fields, the detection again counts unique attribute values, and more than 10 percent of the values are unique; in numerical fields, all values should be numerical or there should be none. The types of models that can run on them are standard deviation, mean, median, and labels based on Gaussian probability.

Next, field combiners: the detection mechanism takes multinomial fields that can be combined together, and the models that are executed on field combiners are the statistical distribution of values, labels based on value frequency, and special tags for unseen data.

Next, the text field: the type of detection is non-numerical, you can only count unique attribute values, and more than 10 percent of the values are unique. The algorithms you can compute are an n-gram language model that computes the perplexity of each example, the mean and standard deviation of that perplexity, and then labels based on Gaussian probability.
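For the text field, the idea of computing the perplexity of each example with an n-gram language model can be sketched with a tiny character-bigram model; this is a simplified stand-in, not the model OSAS actually ships:

```python
import math
from collections import Counter

def train_char_bigrams(strings):
    """Count character bigrams (with start/end markers) over a training corpus."""
    bigrams, contexts = Counter(), Counter()
    for s in strings:
        chars = ["<s>"] + list(s) + ["</s>"]
        contexts.update(chars[:-1])
        bigrams.update(zip(chars[:-1], chars[1:]))
    return bigrams, contexts

def perplexity(s, bigrams, contexts, vocab_size, alpha=1.0):
    """Per-character perplexity under an add-alpha smoothed bigram model."""
    chars = ["<s>"] + list(s) + ["</s>"]
    log_prob = 0.0
    for prev, cur in zip(chars[:-1], chars[1:]):
        p = (bigrams[(prev, cur)] + alpha) / (contexts[prev] + alpha * vocab_size)
        log_prob += math.log(p)
    return math.exp(-log_prob / (len(chars) - 1))

urls = ["/index.html", "/login", "/images/logo.png", "/api/v1/users"] * 25
bigrams, contexts = train_char_bigrams(urls)
vocab = len(set(c for u in urls for c in u)) + 2  # + start/end markers
print(perplexity("/index.html", bigrams, contexts, vocab))                       # low
print(perplexity("/a.php?cmd=%63%61%74+/etc/passwd", bigrams, contexts, vocab))  # much higher
```

Examples whose perplexity sits several standard deviations above the corpus mean would then receive the corresponding label.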
Finally, expert knowledge: the detection is manual in this case, the models you run are keyword- or regex-based, and the labels are generated for matched instances.

Next, the anomaly detection algorithms. By default, OSAS has four anomaly detection algorithms: Isolation Forest, Local Outlier Factor, an SVD-based anomaly detector, and the statistical n-gram method. The first three are built on scikit-learn, and if you want to know more about how they work we suggest you consult the official documentation and the papers associated with them. The statistical n-gram method is an algorithm designed by us: it uses statistics to compute the probability of observing a combination of tags and computes the anomaly score as a sum over what is known as the negative log likelihood.
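The intuition behind that score can be shown with a heavily simplified, unigram version of the idea (the real method also models combinations, i.e. n-grams, of tags): count how often each label was seen in training, then score an example as the sum of the negative log probabilities of its labels.

```python
import math
from collections import Counter

def fit_tag_stats(tagged_examples):
    """Count how often each label occurs in the training data."""
    counts = Counter(tag for tags in tagged_examples for tag in tags)
    return counts, sum(counts.values())

def anomaly_score(tags, counts, total, alpha=1.0):
    """Sum of negative log likelihoods of the example's labels (add-alpha smoothed)."""
    vocab = len(counts) + 1  # reserve some probability mass for unseen labels
    return sum(-math.log((counts[t] + alpha) / (total + alpha * vocab)) for t in tags)

train = [["WHITE", "MASK", "EYES_OPEN"]] * 40 + [["WHITE", "MASK", "ANGRY"]]
counts, total = fit_tag_stats(train)
print(anomaly_score(["WHITE", "MASK", "EYES_OPEN"], counts, total))   # low: common labels
print(anomaly_score(["WHITE", "MASK", "ANGRY"], counts, total))       # higher: one rare label
print(anomaly_score(["BROWN", "SNEEZING"], counts, total))            # highest: unseen labels
```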
Next: in a nutshell, the pipeline contains three main modules. On the left side we have the data acquisition module, which uses a security incident and event monitoring system at runtime and statically compiled data for training. The middle section is the labeling, or data grooming, module, which applies labels for each interesting attribute type. Finally, the right section is the scoring module, which uses one of three strategies to assign scores to examples. The three scoring strategies are: getting the tagged, labeled data; computing supervised and unsupervised scoring models; and generating a supervised, risk-based anomaly score for your SIEM.
During training, we use static data sets to compute statistics and language models for the attribute labels, label the data set, and compute the models for the three scoring strategies. At runtime, we just apply all three elements of the pipeline using the precomputed models and generate a score for each individual, previously unseen example in the data set. Next, let Kumar take over.

Thank you, Vivek. I'm Kumar, and I'll go through the demo. In this demo we'll cover three important aspects of OSAS: first, we'll look into how to get it up and running; next, we'll run OSAS with the default configuration; and after that, we'll customize OSAS with expert knowledge and compare how the behavior changes and how the improvements impact the OSAS results.
So let's get OSAS up and running. On the GitHub page we can find all the links to get started with; the page has a quick start guide. There are basically two ways to install OSAS: you can either download a pre-built Docker image, which is the easier way to get it up and running, or alternatively you can build the image locally. Another option is to just work directly on the Git repo and make changes as per your needs. Next we'll go to the folder where our data set is stored. In this folder we see there is a CSV file, which is the default data set we are going to use for this demo.
This data set has around 5,000 events, and we have also inserted some malicious events which we want to detect as anomalies, which is our end goal. We can start by pulling the Docker image and then running the docker run command. In this command we specify two ports: the first port exposes the OSAS web service, and the second port exposes the Elasticsearch/Kibana front end. The last argument mounts the local folder into OSAS so that it can access the default data set that we are using.
Once we run it, it will start the two web services, which are basically the front end and back end of OSAS. We'll open both Kibana and the console link. Once we go to the console, we can check whether our test data set is present. We also provide three other OSAS endpoints which are not mentioned in the README file but can be used. The console basically provides command-line interaction with OSAS; using the automated pipeline endpoint you can run the entire workflow, and using the generate-config endpoint you can generate the config. With run-full-process you can run the entire process without any intervention, and with generate-config you can generate the config just by using the web interface.
Let's go to the dashboard. This is a default installation of the Kibana dashboard; you can go to the dashboard link, where you'll see there are about five dashboards that you can customize as per your needs, and you'll get all the stats related to your anomalies. Next, what are the steps involved in building the test pipeline? Basically, there are three steps. First, you generate a config file by executing the auto-config script; here we will execute the auto-config script, and it takes as input the data set and the config file that you want to generate.
This config file is basically a set of configurations for the label generators and the anomaly detection algorithm that you want to use. It contains all the label generators that the model will be using, and once you execute the script it will also contain the parameters needed to generate the model. After this, you have to train the pipeline, and the train-pipeline step uses the config file that you just generated. The last step is to run the pipeline, which takes as input the model file and the input file, which is our default data set.
We'll look into it in depth once we start working with the data set. These endpoints are not mentioned in the README, but you can access them; they are still a work in progress and will be fully built out once we are done with them.
Now let's use the default data set and go through the OSAS execution. First we will use the console link to go through the entire process. Let's check whether the data set is there or not: our folder is the apps folder, and inside it the default data set is present. Next we'll proceed with generating the config file. We'll execute auto-config, pass the data set as the input data set, and give it a default config argument.
Once it's done, if we want to edit it we can just go and do so; we can see it's present there, and we can open it in a text editor and edit it as per our requirements. At the bottom we can see that we are using statistical n-gram anomaly detection as the scoring algorithm.
Next we can go forward with training the pipeline, using the config file that we already generated. This will produce the model file that we want to run the pipeline with; once it finishes, we will have the model file.
There is a small typo, so I'll just rerun it.
After this, we have to execute run-pipeline, which will use the model file we just generated from the default configuration of OSAS. We'll pass the same input file and model file, and we just need to specify the output file, that is, the results file that we want to use.
Once we are done with that, we can see the results in the Kibana dashboard. Now we'll go to the dashboard link and see the results and stats related to the execution. It's a default Kibana installation, so the username and password are just admin/admin, and you can configure this front end as per your requirements.
In the dashboard we can see that there are about 5,000 observations and that we have used the statistical n-gram anomaly scoring method. We can also see the max score and min score that we got for the data set, and there are around 43 unique labels. Next, in the histogram panel we can see the score distribution of all the observations: most of the observations are centered between zero and one thousand, and then there are small humps with a smaller number of events.
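If you prefer to inspect the run outside Kibana, the results file can also be looked at directly with pandas; the file name and the "score" column name below are assumptions, so check them against your own output file:

```python
import pandas as pd
import matplotlib.pyplot as plt

results = pd.read_csv("results.csv")   # the output file produced by run-pipeline
results["score"].hist(bins=50)         # same picture as the Kibana histogram panel
plt.xlabel("anomaly score")
plt.ylabel("number of events")
plt.show()
```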
There are other panels that we can explore: we can see the rarest 10 labels that got generated, and also the top 10 labels. At the bottom there is a panel for the top 10 anomalies; it shows the tag sets and the scores they generated. Labels, or tags, are especially important if you want to see why an event is anomalous. Next we go to the Discover link, where we can see the individual events that were characterized as anomalous. Here we can add a filter; say we want to see all the events that generated a score greater than 1000, we can just create a filter for that, and we want to select the score and command fields.
Next, we'll sort the score in descending order so that we get the max score at the top. These are the most anomalous events, or commands, seen in the data set, and we can go through and analyze them.
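The same filter-and-sort step can be reproduced outside Kibana with a couple of lines of pandas, again assuming the output CSV has "score" and "command" columns (adjust the names to your data):

```python
import pandas as pd

results = pd.read_csv("results.csv")
top = results[results["score"] > 1000].sort_values("score", ascending=False)
print(top[["score", "command"]].head(10))   # the most anomalous commands first
```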
So this was the default configuration, and we can see that there are quite a few anomalies which might not specifically be malicious in nature, so we would have to go and investigate them. Next we'll look into how to enhance the process and improve detection. In this anomaly set we see a netcat command; this netcat command is one of the commands that we inserted into the data set, and although it is anomalous, it ranks pretty low in score. So we'll see how we can improve the detection process in the next step: we will add some security-based expert knowledge and improve our labeling by using that knowledge base.
We can run the entire process using the console, or we can just use the web interface; let's go through the web interface here. First of all, we generate the default configuration just like we did last time and submit it. Once we submit, we are presented with the default configuration, which we can confirm, and then we get an option to change the config as per our requirements. We had used the statistical n-gram anomaly detector, and there are other options we can choose from; we will choose Local Outlier Factor here.
Then we'll remove some of the label generators, as we don't feel that those fields add any value to the anomaly detection process; we are just removing the label generators for fields whose values can be essentially random and are not useful for score generation. Next, we will look at how to add knowledge-based labeling. We'll use the keyword-based generator type, since we want to add our knowledge-based keywords for commands, so we'll change the field name to command and change the keyword list that we have selected. I'll just copy-paste here; this is the list of command keywords that we came up with through our security research. We can add plain keywords, or we can add regex-based keywords, as you see in the next example, where we are using regex-based labeling; it can match IP addresses or paths.
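To illustrate what keyword- and regex-based expert labels look like, here is a small sketch; the keyword list and label names are made up for the example and are not the list used in the demo:

```python
import re

SUSPICIOUS_KEYWORDS = ["nc", "ncat", "wget", "curl", "base64", "chmod"]   # illustrative only
IP_RE = re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b")
PATH_RE = re.compile(r"(?:/[\w.-]+)+")

def expert_labels(command):
    """Generate expert-knowledge labels for a raw command string."""
    labels = []
    tokens = command.split()
    for keyword in SUSPICIOUS_KEYWORDS:
        if keyword in tokens:
            labels.append(f"COMMAND_KEYWORD_{keyword.upper()}")
    if IP_RE.search(command):
        labels.append("COMMAND_CONTAINS_IP")
    if PATH_RE.search(command):
        labels.append("COMMAND_CONTAINS_PATH")
    return labels

print(expert_labels("nc -lvp 4444 -e /bin/bash"))
# ['COMMAND_KEYWORD_NC', 'COMMAND_CONTAINS_PATH']
```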
Once we're done with it, we'll just save it, and we can see the file is named tailored_ followed by the previous file name. Next we'll go to the console and repeat the whole process. First we'll check the config file that we just generated, that is, tailored_ plus the previous file name. Next we'll train the pipeline using the new config file.
We will supply the same input file which we have been using, provide the new config file, and just let it execute.
We also need to specify the model file that we are going to generate from the train-pipeline process.
It will generate the model file which we will use to run the pipeline, just like in the previous process. Once we have the model file generated, it's a good idea to have it named differently from the previous step so that you can repeat the whole process as many times as you want. Next we'll run the pipeline using the same input file; you can change the input file to a completely new file to test it against the model we have generated in this step, or you can just use the same input file, depending on your requirements. Once this is done, we'll again go to the dashboard and check out the results.
Again I'll copy the dashboard link, and in the Kibana front end we'll go and observe whether this causes any significant changes in the anomalies.
This time we have used the Local Outlier Factor anomaly scoring method, and we'll see that the score range has changed a lot: we have generated around 125 unique labels, that is, three times as many as last time, and the score range is also very wide.
If you look at the histogram of the score distribution, we see that most of the scores are located near zero, and after that there are very few events with a very high score, so it looks like it did a very good job of clustering the anomalies, and we are not getting many events with a very high score. Next we can look at the rarest labels. These labels got generated because of the keyword-based label generator we have used; you can see they are named command_keyword followed by the command that we put in as a keyword. You can also see the top 10 labels here.
Same as last time, you can look at the top-scoring anomalies and the labels and scores generated for them. We'll again go to the Discover link and check out the scores; we'll use the same filter and then choose the score, command, and labels fields.
If we sort the score in descending order, we'll see that the netcat command we observed in the last iteration now appears with the highest score, and the other commands that we inserted also have high scores. These are the malicious commands that were injected into the data set, and that is very relevant to the security investigation that someone might be doing. So we can use OSAS on IDS logs or service logs; it has a very open implementation.
As we have seen, OSAS is very easy to deploy; given its integration with Kibana and Docker, it's very easy to use, and you can repeat the whole experimentation process many times by changing the configuration file and adding expert knowledge as per your requirements. In general, as with any ML approach, you have to test which features will add real value to your ML process, and you have to try out all the anomaly detection algorithms as per your requirements and see which ones give you better results. Next, we'll go forward with the questions. This is the GitHub link where you can go and fork the project; please feel free to try it, and if you like the project, please star it. You can also get more details about the project at this Medium link, and there are other Adobe-related resources where you can go and check out Adobe projects. Thank you, I hope this was informative, and please feel free to reach out if you have any questions. Thank you.