
So, welcome. If you were wondering whether you are in the right room: this is not about the SOS distress signal. It is about the Stochastic Outlier Selection algorithm. The base of the research was actually done for maritime safety, so there are some hints from pirates of the high seas for those of high finance. I'll talk a little about an application in a financial context, but it can be used in all kinds of anomaly detection scenarios. Normally I don't start with a slide about myself, but this is the first time I'm speaking in Charlotte and I don't see many familiar faces, so I figured I'd introduce myself briefly. I work out of Winston-Salem, North Carolina, doing data science research. I've been doing that for a few years, and for some reason I tend to gravitate toward anomaly detection, outliers, things of that nature, hence why it's quite applicable to this conference. If you need to get hold of me, Twitter is @f_dion, or my email, fdion at dionresearch. A little more background: I've been doing data-science-related projects for about eleven years or so, and initially I didn't think there was much of a future in it.
It felt like yet another data-mining approach, right? And then you start digging deeper and deeper, and the rabbit hole has no bottom. So that's a little bit about me. This stochastic outlier selection algorithm came about because there was a problem. You might not be familiar with it, but in the case of the Port of Rotterdam, for example, they have consoles where operators watch all the boats coming in and going out, and this repeats itself across the globe; the same thing you have for airplanes, you need for boats too. Although it might seem a lot simpler, because we're pretty much in 2D versus a plane, where you have the third dimension, there are other factors: boats go a lot slower than planes, so it's harder to detect minute changes, things like that. The other thing is that these are human operators monitoring maritime safety and security. They watch these large displays, and every so often it's: oh, nothing has changed; okay, nothing has changed; still nothing has changed; and so on and so forth. You can imagine how numbing the effect is. It turned out their specific problem is a universal problem when it comes to dashboard views of security in general.
The moment you start looking at a screen that looks the same day in, day out, you just zone out, and despite the visual aid, anomalies often remain undetected.

This is not teaching you how to become a data scientist; I'm just briefly explaining these terms. Regression: think of, say, "I want to know how much I should pay for a house." You do your research, look at the square footage and things like that, and what you're doing, without knowing it, is regression: given a set of data, you're trying to figure out the predicted price for a house with different characteristics. Classification is different, but it's equally something we do fairly often. For example, if I bring up, say, a Blu-ray and a cassette, there's obviously not a lot in common; the only thing is one is Star Wars and one is Star Trek, so they both start with "Star" in the name. Beyond that, how do you know whether to classify these things as similar? This is text, this is an episode being narrated, this is a movie: are they similar? It sounds like a pretty simple thing to do, but then you get another item: oh, this is a VHS tape, and it also says Star Wars, so these should probably be classified as the same thing, but they're on different media. When you start adding all of these together, it gets confusing as to what counts as a class. Clustering, briefly, is similar to classification, but basically you're saying: here's all the stuff I've got; algorithm, figure out which things go together, based on one particular approach, maybe a nearest-neighbor calculation or something like that. In certain algorithms you can also specify how many clusters you expect to see, and it will try to fit the clusters to the data. That's pretty useful on its own, but the issue with clustering is that it gets difficult quickly as you add features. To represent the result in a visual form that you and I can understand, you need to see it in a two-dimensional, maybe three-dimensional space; beyond that it's fairly difficult to visualize clusters in n dimensions, where n is maybe a thousand. The next one helps us with that: dimensionality reduction. When you have too many features, you can try to eliminate some, or you can do dimensionality reduction, where you map these features back down to maybe 2D; we'll show an example of that. And then anomaly detection, which is what we'll mostly focus on today. Anomaly detection we do naturally: we know "this does not belong" or "this doesn't feel right." You might be lying in bed, you hear a noise: hey, what was that? It's something we are naturally able to do, but if we can automate it and be really good at it, it alleviates the pressure on the human who has to perform day in and day out, watching a display and trying to figure out what's going on. There's no perfect algorithm for any of these things.
These are just models, estimations of reality, and sometimes they are very wrong; but as long as they are not completely wrong, as long as they are right enough, they serve our purpose. If you're really interested in learning more about anomaly detection, I have a podcast on the subject. We will not cover the other bread-and-butter data science topics such as feature engineering, deep learning, or decision trees; we just don't have the time, but they can be incorporated into, or used alongside, anomaly detection.

I will use the Jupyter notebook environment. If you've never seen it, it's actually quite useful for various information-security-related things too, because it documents all your steps. There's a kernel for Jupyter that lets you run bash commands, for example, so if you're an expert in bash you can document that too, and integrate your graphs, charts, and text within the notebook; at the end you have a report. Behind all of that is Python, the programming language. From scikit-learn we'll use only t-SNE, t-distributed stochastic neighbor embedding, which is related to SOS. And pandas, of course, is how we load data and hold it in a nice tabular format in memory so we can do manipulations.

So I mentioned SOS, but what is SOS? It's affinity-based outlier selection with a probability score. Wow, that sounds like too much, right? Affinity-based: first of all, what is affinity? Going back to my different media: does anybody know what this is? No? It's an eight-track. What's on it is actually somebody's home recording of a Star Wars episode, which I thought was kind of weird, but I ran into it. So again, this is Star Wars, on magnetic tape. These two have an affinity because they relate to the same subject and are on the same medium, as far as features go. But this one is audio and this one is video, so that's where they start diverging. If I bring this one back: would you say that these two have more affinity, or these two? Possibly, so again it depends on how many features you have. Exactly. To define affinity we actually have to define a step before that, which is a dissimilarity metric: how unrelated two things are. Why do we need that? Because it allows us to build a nice symmetric matrix, which we can then transform into an affinity matrix. I'm not going to go into the details of that; there is a 209-page PhD thesis just on this, so it's beyond today's scope. And by the way, I didn't write any of it; that's Jeroen Janssens, whom I credited at the beginning and will again toward the end. This notion of affinity has also been used for clustering, in the case of affinity propagation, and for dimensionality reduction, which would be SNE or t-SNE: techniques that let you reduce a large feature space into something that can be visualized in two dimensions, or simply modeled more easily, because the more features you have, the more of a challenge modeling can be.
easily by using two dimension or actually to be able to be modeled more easily because the more features you have it it can be a challenge sometimes to model these things right the algorithm we're not going to go through it that's yes a lot of text but we can go through it visually and that's what I was kind of hinting at here we have six data points one two three four five six these could be network packets these could be bank transactions these could be I mean you name it or they could be the features of this state as far as the coercivity of the magnetic material or I mean who knew who knows what could be useful right but we we
have to define these things so once we have these in this case it just two features these two features though are going to be combined because as a representation of just item one item one is then going to be compared to its itself which means that it's not dissimilar at all it's it's completely the same but we actually don't want that we just ignore because you can't really I mean it doesn't bring you anything to compare something with itself so we just skip those and then we'll compare it with two three four five and six and six is very green so that means it's more dissimilar than anything else with one and we repeat that now imagine the
challenge here is that if we have six elements we have to have six by six data points now in the matrix right I mean if you add a million that means you have a million by a million this is getting quickly intractable from that perspective right unless you have you know an infinity of memory so at some point you have to start looking at how much time period I want to cat and also how much data do I really want to look at once and can i partition that naturally so I can distribute it across a computing question so sure yes yes yes yes yeah so you only have to do half of the calculations here yes
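A tiny numpy sketch of that dissimilarity matrix, with made-up values for the six points (five close together, one far away):

```python
import numpy as np

# Six toy data points with two features each, as in the slide example.
X = np.array([[1.0, 1.0], [1.5, 1.2], [1.2, 1.8],
              [0.8, 1.4], [1.4, 0.9], [6.0, 6.0]])

# Pairwise Euclidean dissimilarity matrix: D[i, j] is the distance
# between point i and point j.
diff = X[:, None, :] - X[None, :, :]
D = np.sqrt((diff ** 2).sum(axis=-1))

# The matrix is symmetric with a zero diagonal, so only the upper
# triangle, n * (n - 1) / 2 entries, actually needs computing.
print(np.allclose(D, D.T))         # True
print(np.allclose(np.diag(D), 0))  # True

# Point six (index 5) is far from everything else: its row has the
# largest average dissimilarity.
print(D.mean(axis=1).argmax())     # 5
```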
Well, let me rephrase that: assuming we're using a distance metric in Euclidean space, something like that, that's true. What I mean by Euclidean space is that, physically, this point is the outlier, the dissimilar thing; not looking at anything else, I decided my metric would be the distance between the points. For other kinds of data that might not be as good a choice, but yes, in the case of a Euclidean distance the dissimilarity matrix is symmetric. The affinity itself, though, is not, and that is where a lot of the work in this research went: how to go from that dissimilarity to that affinity, and then to a binding probability; for each point, how likely am I to be the neighbor of this other data point. That's basically it, but the matrix form isn't useful to us; what we need in the end is an outlier probability. So we convert this binding-probability matrix into an outlier probability for each data point, and we end up with six values, each ranging between 0 and 1. The 0.5 threshold is an arbitrary decision: anything above 0.5 is called an outlier, or an anomaly, and anything below is not. Of course this can be adjusted, and if we decide we don't want it unsupervised but want to supervise it, we can tune that level. In this case, the laser pointer is showing this point as dark red, so it is very likely an outlier, and visually, since this is fortunately a fairly simple example, we can definitely see that this point is an anomaly compared to all the others. But it's not always like that. I did use this in conjunction with other techniques, in a hybrid system where there's voting involved, to identify outliers; I can't really go into much detail there.
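To make the dissimilarity-to-affinity-to-outlier-probability pipeline concrete, here is a heavily simplified numpy sketch. The real algorithm derives a per-point bandwidth from a perplexity parameter; this sketch cheats with one fixed Gaussian bandwidth, and the function and variable names are mine, not from the scikit-sos code:

```python
import numpy as np

def sos_sketch(X, sigma=1.0):
    """Toy stochastic outlier selection with a fixed bandwidth sigma
    (the real algorithm adapts the bandwidth per point via perplexity)."""
    diff = X[:, None, :] - X[None, :, :]
    d2 = (diff ** 2).sum(axis=-1)            # squared dissimilarities
    A = np.exp(-d2 / (2.0 * sigma ** 2))     # affinity matrix
    np.fill_diagonal(A, 0.0)                 # ignore self-comparisons
    B = A / A.sum(axis=1, keepdims=True)     # binding probabilities; rows sum to 1
    # Outlier probability of point i: the chance that no other point
    # picks i as a neighbor.  B's diagonal is 0, so the full column
    # product over (1 - B) is exactly the product over j != i.
    phi = np.prod(1.0 - B, axis=0)
    return phi

cluster = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                    [1.0, 1.0], [0.5, 0.5]])
X = np.vstack([cluster, [[8.0, 8.0]]])       # five neighbors plus one loner
phi = sos_sketch(X)
print(phi.argmax())  # 5: the isolated point gets the highest outlier probability
```

With the 0.5 threshold mentioned above, only the isolated point would be flagged here.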
With financial data, one thing you will learn, and you've learned it already if you've been doing information security for a while, is that the data somebody gives you, supposedly clean and incredibly accurate, usually isn't: there are missing values, errors, indexing issues, all kinds of stuff. The raw data probably doesn't work out of the box, so you have to dig into it. You may have to clean the data; you may have to engineer new features, meaning combining different fields. For example, in network security, maybe I expect a specific request to always have an answer; if it didn't get an answer, that's a new flag, that's going to be a feature. And sometimes we just have too many features and have to drill down and reduce the number. Everybody following so far? Good.

The other aspect: my first thought might be, I really need to flag anything that looks vaguely suspicious as an anomaly, right? Well, every outlier, every anomaly you bring to somebody, they are now obligated to research. They have to go through it and figure out: was this fraud, was this whatever; actually they don't even care about the intent, just: did something happen here? You cry wolf too many times, and by itself, unsupervised SOS is a little too trigger-happy in labeling things as outliers. That's where a hybrid system is probably needed to do a better job. At the end of the day you really want a human who can go through, bulk-select, and say: no, this is just normal business. As I mentioned, SOS is but one of many outlier selection algorithms, and again, ensembles: when you have multiple algorithms trying to tag something, if I have five algorithms and three of them flag something as an outlier, there's a good chance it is. Or even better: if, like SOS, they assign not just outlier-or-not but a probability, I can use those probabilities in the voting mechanism. That's a little more advanced.

So, to use this, there's a bit of installation that needs to happen. There's something called Anaconda; I don't know if you've heard of it, but you can download it from the Anaconda website, and it installs not just Python: it comes with the Jupyter notebook, with pandas, and with scikit-learn. The only thing you'll have to install manually is scikit-sos, which I'll show here. Once Python is installed, through Anaconda or by hand, I would do a pip install of scikit-sos; in this case I had already installed it, so it tells me the requirement is already satisfied. That's the installation. From the command line we can actually use SOS once it's installed; by itself it doesn't do anything, because it's expecting me to pipe data to it. I had already grabbed, from the UCI machine learning repository, which hosts a number of public data sets, one about computers, actually from the 1980s, so you probably won't recognize the brands. If I cat that file and pipe it in, it basically gives me the probability for each record.
You see it's pretty close to 0.5 here, but in certain cases, depending on your features, you might see a lot of 0.9s or things like that. So it can be used at the command line, but for this demo, and for general use, it's probably best to use it through a Jupyter notebook, which I have already set up so I don't have to type and spend too much time on this. I'm going to read this data, and the first thing I see is that there's no header, which is always disappointing. I transposed the first five records so we could read them on screen, but no: no header, so I don't know what these columns represent. That's not great, so let's read it again, this time supplying the header. When I say you might have to do some work before you can actually use the data, this is the kind of thing I mean: figuring out what data you have. What are all these logs? Is this like a web log, configured in a standard fashion, or did somebody decide they wanted things in a different order, different data fields, things like that? So yes, it needs some work from that perspective. Aha, now I have all the very popular computer brands, right? The adviser and all. No? Okay. What was interesting about this data set is that it includes the published performance, and it was used to test regression: estimating the performance of a machine from a few very simple things, the cycle time in nanoseconds, the RAM, the cache. If you think of a computer nowadays, it's probably not that interesting to know just these things, because performance is impacted by a bunch of different factors, but in 1987 that was the case; you could do that. So, what does SOS do with this?
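The header wrangling looks roughly like this; the inline rows stand in for the UCI file, and the column names follow that data set's documented attributes as I recall them:

```python
import io
import pandas as pd

# Stand-in for a headerless CSV like the UCI CPU performance data.
raw = """adviser,32/60,125,256,6000,256,16,128,198,199
amdahl,470v/7,29,8000,32000,32,8,32,269,253
burroughs,b1900,330,64,2000,8,1,4,23,20
"""

# Read naively: pandas promotes the first data row to a header,
# so one record silently disappears into the column names.
df_bad = pd.read_csv(io.StringIO(raw))

# Supply the names explicitly instead: all rows stay data.
cols = ["vendor", "model", "myct", "mmin", "mmax",
        "cach", "chmin", "chmax", "prp", "erp"]
df = pd.read_csv(io.StringIO(raw), header=None, names=cols)
print(len(df_bad), len(df))  # 2 3
print(df.columns.tolist())
```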
I could initially say: give me all the data, all the fields, run my SOS detector, and let's see. Oh, I got an error: can't multiply sequence by non-int of type 'str'. What happened is that I included the brand, like "amdahl"; that's a string, and the algorithm doesn't know anything about that. It's not a number it can put in a matrix and multiply, so that's not going to fly. That's one of the challenges with feature engineering: if you have strings, say an HTTP verb like GET or POST or PUT, or the type of packet, or whatever, those are categorical data, not numbers, so you have to convert them into numbers before SOS can use them. Here is what I'm doing with pandas; and I will post these notebooks on GitHub. Again, my email is up there, just drop me a line, but if you look at github.com/fdion I already have a bunch of different things there, and I will eventually post this when I get home. I'm converting the vendor and the model to category codes, rerunning the prediction, and this time it worked. Let's see the output. Oh, errors: I haven't assigned the scores yet. There: our scores are right there. Imagine these were packets; the next notebook will be from an actual network capture, but here we see these scores, and I said anything greater than 0.5 is an anomaly. I could also sort descending, which is really ascending equal to false, and now it shows that the Burroughs and Amdahl machines are at the top of the anomalies, based on their specs compared to their performance. So we wonder what that really represents; and really, because I did no job of selecting my features, I'm probably getting garbage here. I would have to select only the fields I really need.
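The category-codes trick and the score thresholding from the notebook can be sketched with pandas like this; the rows and the score values are made up for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    "vendor": ["amdahl", "burroughs", "vax", "amdahl"],
    "myct": [29, 330, 90, 26],
    "score": [0.91, 0.34, 0.62, 0.88],   # hypothetical SOS output
})

# Strings can't go into the affinity computation; map each distinct
# category to an integer code first.
df["vendor_code"] = df["vendor"].astype("category").cat.codes
print(df["vendor_code"].tolist())   # [0, 1, 2, 0]

# Flag anything above the 0.5 default threshold, highest score first.
outliers = df[df["score"] > 0.5].sort_values("score", ascending=False)
print(outliers["vendor"].tolist())  # ['amdahl', 'amdahl', 'vax']
```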
If I redo this selecting just these fields, say the machine cycle time, the maximum RAM, the cache, and the maximum channels, then I can see my results are different: a VAX machine is now at the top of my outliers. So you see how sensitive this is to the features, as it tries to figure out the affinity between the data fields. If you have too much data, at some point it's actually the noise that's fitting the model, so you have to be really careful about what you select.

Now let's see if we can visualize this. I said SOS was related to t-SNE; t-SNE is a dimensionality reduction, so I'm going to reduce that data set to just two dimensions, transform it, assign the result to x and y, and plot it. This is interesting, because it's definitely grouping different sets of data in different ways, as you can see. This is a static graph, but with an interactive graph you could hover and see what the individual data points are, so you can support a hybrid system with human interaction to review some of these things. You could then color-code it based on the probability you've assigned through SOS, and now you're really gaining a lot of knowledge from your data set about whether you're looking at anomalies or not.

Let's look at the other example briefly: network packets. This is a recent one, from 2017, from the same UCI repository, which is a pretty common repo for testing machine learning algorithms. I'm going to click on the link: the data is the burst header packet flooding attack on an optical burst switching network data set, and there's a bit of information about what the fields are, why they were selected, and the research paper tied to it. So this is more your typical infosec type of work. Again, we read the file; no, we can read the file if we are looking in the right place; okay, now we are. We have the node number, the bandwidth rate, the packet drop rate, the full bandwidth; you can see there are a lot of attributes available. Again, we would have to do our due diligence to figure out which features are really important for us, and whether there are features missing. That's where you have to spend time thinking: well, I need a temporal signal; maybe I need to know how much time it has been since I've seen such a low rate for that node, things like that. Is that a normal cycle that's happening? That could be flagged as another feature. But without that, we'll just assume that the bandwidth rate, the packet drop rate, the full bandwidth, the lost packets in terms of rate and bytes, and the received rate are what we want to look at.
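The visualization step described above, reducing the data to two dimensions with scikit-learn's t-SNE so the points can be plotted and colored by their SOS probability, looks roughly like this on synthetic data:

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
# 20 points in 6 dimensions: two tight clusters plus one isolated point.
X = np.vstack([
    rng.normal(0.0, 0.3, size=(10, 6)),
    rng.normal(4.0, 0.3, size=(9, 6)),
    np.full((1, 6), 20.0),
])

# Reduce to two dimensions for plotting; perplexity must be smaller
# than the number of points.
emb = TSNE(n_components=2, perplexity=5, random_state=0).fit_transform(X)
print(emb.shape)  # (20, 2)
# In a notebook you would now scatter-plot emb[:, 0] against emb[:, 1]
# and color each point by its SOS outlier probability.
```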
Okay: running SOS again, detector, predict; it ran pretty quickly. Now we assign the score, and we look at scores greater than 0.5. What is interesting: look at the last column, where this data set had actually been classified manually. It shows "Block": these were blocking the traffic. And all of the records with a very high probability of being outliers are tied to packets that were actually malicious. That's kind of cool: I spent only a few minutes with this data set and it just spit that out. However, obviously, there are suspicious packets we didn't catch that don't even show as outliers; and of course, if you have a flooding event that's pretty intense, your anomaly might be "normal" to a certain extent, so you've got to keep that in mind. But if you have multiple detectors and combine them in a voting algorithm, you get closer to reality. Again, these notebooks will be available; let's go back to the presentation.

We did the notebook demonstration; now, best practices. Don't boil the ocean: start small. Take a very small data set, work through it, and figure out whether it's working on something where you already know the answer, where you've already done the work of labeling things: okay, this is an outlier, this is definitely an attack, or whatever you're looking for. Also, eliminate the features that aren't really helping; for example, if I had the speed of your car in both miles per hour and kilometers per hour, one of those is useless information.

(Question about selecting features.) Yes, I didn't plan to cover that, but briefly: it is very difficult. I've been doing this for about eleven years, as I said, and you kind of start getting a feel for what's important. The other thing is you have to have good tools to measure some metrics, so you can actually look at the quality of the data. I did an assignment recently where they assumed the quality of the data was at a certain level, and it turned out to be a bit lower, even for very critical data fields. If you don't have any way to measure that, you don't know which features are of very poor quality; you don't want to use those, obviously, or maybe you can get a different source for the same information. But yes, eliminating features is a challenge. That's where you could use something like t-SNE, as I was showing, or similarly PCA; there are a dozen or so techniques for doing that.
The best, though, particularly if you have data where you know the actual results, is to look at how each of these features, not just individually but as groups, impacts your results. There are statistical techniques for that; we can talk about it later, but there are ways to select the features that matter most, and you always want to start with those, then see whether you get a tiny incremental gain by adding more. scikit-learn has some of this built in, feature elimination and feature selection, and it also has pipelines. Pipelines are your automation for these kinds of problems, just like Ansible would be for DevOps work, or, as we learned today, maybe also for orchestrating attacks on large networks; pipelines are the scikit-learn equivalent.

Also, think of your data, as we did for networks, as temporal data. If you have enough traffic that one hour holds all the data you need to find outliers, you can start with that and just have a window that constantly looks at the previous hour of data up to now and see what sticks out. That way you don't have to scale on really expensive hardware or cloud services, with as much RAM and as fast a CPU as you can afford.
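A trailing one-hour window like that is easy to express with pandas; the per-minute traffic series here is synthetic:

```python
import pandas as pd

# Hypothetical per-minute packet counts over three hours.
idx = pd.date_range("2017-01-01 00:00", periods=180, freq="min")
traffic = pd.Series(1.0, index=idx)
traffic.iloc[150] = 50.0   # a burst partway through

# Score only the trailing one-hour window instead of the whole history.
window = traffic.loc[traffic.index[-1] - pd.Timedelta(hours=1):]
print(len(window))   # 61 rows: the last hour, inclusive of both ends
print(window.max())  # 50.0: the burst falls inside the window
```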
It's time you save during these different processing steps. I used very small examples, but when you start talking about hundreds of thousands of records that you want to process at once, trying to find an outlier, you end up again with that problem where both dimensions of the matrix are a hundred-plus thousand records, and you basically have to compare each data point with every other data point. If you do need to scale beyond very fast servers, you can use multiple machines: there's a Spark implementation, if anybody is familiar with Spark, which is mostly used for big data processing.
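The back-of-the-envelope memory math for that full pairwise matrix, n by n 8-byte floats:

```python
# Memory needed for a dense n x n matrix of 8-byte floats, in GiB.
def matrix_gib(n):
    return n * n * 8 / 2 ** 30

print(round(matrix_gib(10_000), 2))  # 0.75 GiB: fine on a laptop
print(round(matrix_gib(1_000_000)))  # 7451 GiB, about 7 TiB: intractable
```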
But in this case it's not that it's big data; we just need a ton of RAM, so it's expensive to run. If you search GitHub for "stochastic outlier selection," besides Jeroen Janssens's implementation for scikit-learn, there is a Spark implementation out there too. And with that we are exactly at questions. Anybody have a question? Yes.
(Question about how established SOS is.) So, in fact, this is the second edition of Aggarwal's Outlier Analysis, from Springer, and he doesn't mention SOS here as far as I know; I just upgraded from his first edition, so yes, it's relatively new. But if you look at Janssens's thesis, he compares it to other algorithms and how well they work. Also, let me go back: affinity propagation has been used for clustering, as I mentioned, by Frey and Dueck. And Hinton, if you've heard anything about deep learning you've heard that name before, developed SNE, and van der Maaten continued that work to create t-SNE. These are all fairly expensive: anything affinity-based, any pairwise comparison where you have to compare one element with every other element, is always expensive, but it's the gold standard from that perspective, without taking any shortcuts. So again, you've got to measure: do I have enough time to do this? If not, are there other techniques? And as I mentioned, it might be that you need a set of five different algorithms to identify your data with enough confidence, particularly if your data is not labeled in any way, because then you're going to have to look at every single specimen that is flagged as anomalous. Yes?
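A minimal sketch of the two voting schemes mentioned, hard majority voting on labels versus averaging the probabilities; the detector scores here are invented:

```python
import numpy as np

# Hypothetical outlier probabilities from five detectors for four
# data points (rows = detectors, columns = points).
scores = np.array([
    [0.9, 0.2, 0.6, 0.1],
    [0.8, 0.3, 0.4, 0.2],
    [0.7, 0.1, 0.7, 0.1],
    [0.9, 0.4, 0.3, 0.3],
    [0.6, 0.2, 0.8, 0.2],
])

# Hard vote: a point is an outlier if a majority of detectors say so.
votes = (scores > 0.5).sum(axis=0)
hard = votes >= 3
print(hard)           # [ True False  True False]

# Soft vote: average the probabilities instead of the binary labels,
# which keeps the confidence information each detector provides.
soft = scores.mean(axis=0)
print(soft.round(2))  # [0.78 0.24 0.56 0.18]
```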
(Question about the cost of features versus data points.) Features, yes: in the case of features, I did say try to reduce them, but they have a lot less impact on memory usage than the actual number of data points. You still have to do a bunch of calculations to reduce them into your dissimilarity matrix, and the dissimilarity matrix in the GitHub version is hardwired to Euclidean distance, but nothing prevents you from using another type of dissimilarity metric, and that will also play a role in the effect of the features. But yes, to answer your question: features are less expensive; the number of data points is more expensive in terms of scaling, because you basically multiply the number of data points by itself and that's the size of your matrix; it's a square. And again, if you want to download these things, they're pretty easily available. I'm seeing a lot more notebooks being released for information security in general, so you'll want to check those out. And again, my contact info is right here, with the different illustrations, the video, and so on. Yes, I believe I will: I actually organize a Python user group in Winston-Salem, North Carolina, and I might do one in Charlotte too. Do I know anybody from Charlotte? Yeah, a few. I might have different examples there as well. And that's pretty much it, I guess. Thanks.

[Applause]