
When the Magic Wears Off: Flaws In ML For Security Evaluations - Lorenzo Cavallaro

BSides London · 1:01:05 · Published 2019-06
About this talk
Full Title: When the Magic Wears Off: Flaws In ML For Security Evaluations (And What To Do About It) - Lorenzo Cavallaro

Academic research on machine learning-based malware classification appears to leave very little room for improvement, boasting F1 performance figures of up to 0.99. Is the problem solved? In this talk, we argue that there is an endemic issue of inflated results due to two pervasive sources of experimental bias: spatial bias, caused by distributions of training and testing data not representative of a real-world deployment, and temporal bias, caused by incorrect splits of training and testing sets (e.g., in cross-validation) leading to impossible configurations. To overcome this issue, we propose a set of space and time constraints for experiment design. Furthermore, we introduce a new metric that summarizes the performance of a classifier over time, i.e., its expected robustness in a real-world setting. Finally, we present an algorithm to tune the performance of a given classifier. We have implemented our solutions in TESSERACT, an open source evaluation framework that allows a fair comparison of malware classifiers in a realistic setting. We used TESSERACT to evaluate two well-known malware classifiers from the literature on a dataset of 129K applications, demonstrating the distortion of results due to experimental bias and showcasing significant improvements from tuning.
Transcript

Thank you, Mark, for this opportunity. As Mark mentioned, I am a professor of computer science and chair in cybersecurity at King's College London, where I lead the Systems Security Research Lab, and we work at the intersection of program analysis and machine learning for systems security, so a pretty broad remit. I have a confession to make. Because I've been working at this intersection for quite some time, my confession is that I believe we haven't been using machine learning the right way. This is a bit of a bold statement, I understand, so let me rephrase it, smooth it down a little bit, but also point out to you

that we're getting to the brink of a machine learning crisis. This is what people are saying all over, and it was recently presented at the American Association for the Advancement of Science by researchers from Duke University: is machine learning causing a science crisis? Because it's a bit unclear what's happening most of the times we actually deploy these algorithms. There's a fear that we're not able to properly reproduce experiments; there's a fear that the data sets we're using are biased in one way or another, and I'm not even touching on the social aspects of fairness, though that's also a problem there; there's a fear that although we might

have high-performing predictions, we're not able to interpret or explain why a prediction actually happened. And this is all important because, at the end of the day, we're not just going to trust a number; we want to answer these sorts of questions surrounding that number. But if you look at our history as a community, the security community, and my perspective is really a bit more from the academic side, we've actually been using and relying on machine learning for a number of years. Since around 2012 there's been a surge, even in academic conferences, even top-tier conferences, of papers that accept and

work in machine learning and apply machine learning directly to cybersecurity, from program analysis to natural language processing and so forth. Don't get me wrong: machine learning, or AI, whatever you want to call it, has been really promising on a number of tasks; think about recommendation systems, image classification, object detection and recognition, speech recognition and so forth. If we look at security, in our context, we've been using machine learning for quite some time in a number of different settings. I believe we sort of started off doing anomaly detection, and we're still doing it, but then we started to look at machine learning in classification settings, so supervised learning, and unsupervised learning, so clustering, and so on. But the

bottom line is we can imagine all sorts of different tasks: doing binary classification to detect whether something is malicious or not, trying to identify malicious network flows, identifying anomalies in network behavior, and so forth. Now, if you have a look at recent approaches, and I'm an academic, so we really try to explain everything and how it all works, if you look at recent publications over the past years in top-tier conferences, you can see that everyone is boasting very high accuracy results. It's not uncommon to find papers with tantalizing results, with 0.99 of F1 score, or whatever score

you're interested in, it could be precision, it could be recall, but with that kind of high number. So the question is quite natural: it seems the problem is mostly solved. After all, we're dealing with statistical approaches, so it is unlikely we're going to get to 100 percent; 0.99 is, in most cases, pretty good, because it already gives us performance that outperforms humans at these types of tasks. There's a problem, however, and there is something that we as a community, the security community, have been quite neglecting, and it is that usually we've been borrowing these practices and algorithms from the machine learning community and most of the time we've

been applying and deploying these approaches as a black box. We're only now starting to have a pretty good understanding of how these approaches work and what assumptions these algorithms require, and one assumption that most of the algorithms actually require is the assumption of i.i.d., which stands for independent and identically distributed data. This means that if you are in a supervised setting, so a labelled setting where you have labels for your data, and you want to train a system, the idea is that the objects that belong to the training data set and the testing data set are drawn from the same distribution, from the same underlying probability distribution. We might not know what this

distribution is, and it doesn't matter: as long as the source generates objects from the same distribution and we're using them for training and testing, these algorithms work pretty well. So they work well in what is known as a stationary context. In security, however, we often don't have the luxury of stationarity; most of the time we work in a non-stationary domain where the statistical properties of the data that we're trying to represent and deal with change over time: new threats, variants of malware, variants of vulnerabilities, variants of attacks that you observe in a network, and so forth. So there is a point where a model will start decaying, so the performance

of a model will start decaying over time, and that's just life. We now need to evaluate algorithms with this new mindset, because doing the typical evaluation, which I'll outline in a couple of slides, doesn't give you the whole picture. It just provides you one answer: how the algorithm would perform in the absence of what is known as concept drift. Whenever you have drift, the testing data set drifts, so the properties of these objects drift away from the training data set, and this is a pretty endemic problem in our community. Now, the extent to which this problem affects a specific security domain is unknown, so we need to

basically measure what the effect of concept drift is in a specific domain. We might find out that when you do malware classification in the Windows domain, concept drift doesn't affect it at all; we might find out that when you do it on Android, on the other hand, it affects model performance a lot and the model performance decays over time; or you might find other domains where you actually need to pay attention to it. One thing that we don't usually pay attention to is that concept drift is really intertwined with the type of abstraction that you use to describe an object. So take the example of a binary classification problem where you have

malware and goodware, and you train a classifier. What does it mean to train a classifier? It means solving an optimization problem: most of the time you want to minimize a loss function, and you have a labelled data set, so you know how well you're performing with respect to your optimal scenario as you minimize this loss function. To do this, however, you need to represent an object with a proper abstraction, and therefore you need to represent the object in a proper feature space. A feature space, you have to imagine, lets you represent an object as a vector, so n-dimensional numbers that

represent the object itself, because a machine learning algorithm doesn't inherently understand how to represent a piece of software, right? You need to give it an abstraction that makes sense from a mathematical perspective, something it can reason about, and this abstraction can vary: it could be sequences of bytes, sequences of system calls, or something more complex like a graph that captures dependencies between arguments of system calls, and so on. All these different abstractions have different implications in terms of the accuracy of the algorithm, the presence or absence of concept drift, and so forth. And this is something that we have been neglecting as

a community. So what I would like to do in the rest of the time is try to show you the effects of concept drift in this setting, in settings where there is non-stationarity, which affects basically most of the security application domains that we know of, and also how we can perform a sound evaluation. It turns out that a lot of the time, when you have training and testing data sets, people use and combine them in rather creative ways, and you end up having good results not because the algorithm actually performs very well, but because you are playing with the data set a bit

too much, and that no longer represents a realistic scenario. So how can we perform a sound evaluation that removes the experimental bias which would otherwise inflate the performance of our classifier? The problem, I believe, is endemic. We identify these experimental biases in the temporal and in the spatial domain, and I'll tell you in a second what I mean by this, but I believe the problem is endemic to the community. However, as I mentioned, to understand the extent to which this problem affects a specific application domain we have to measure it; therefore it's hard for me to generalize, to borrow some machine learning terminology, to any security domain. The

methodology I'm going to talk to you about is generic, but the effect of this problem needs to be measured on a specific application domain, and for this reason I focus on Android security, so binary classification, and the use case is Android malware detection. The reason we're doing this is because there is an abundance of data available to any researcher, and therefore we can perform sound experiments. In particular, the data set I'm referring to is called AndroZoo and it's maintained by the University of Luxembourg; there are references down here in the slide deck if you're interested in getting access to the data set, and it comprises, up to now, roughly eight

million applications, including malicious software as well, so benign and malicious software. The data set might be a little bit biased because it's mostly Western-centric: there's an abundance of applications from Google Play rather than third-party markets; there are also applications from third-party markets, but they're in the minority. But as long as we are aware of a potential bias, it's fine, because we can take an informed decision. The data set that we work with covers 2014 to 2016, and at that time we were talking about 5 million applications. We didn't have the resources to analyze, and I'll tell you

in a second what it means to analyze, five million applications in a reasonable amount of time, so we decided to downsample this data set to 130,000 applications that are scattered pretty much evenly throughout the timeline, with an average of 10% malware throughout this entire timeline, and I believe this is still a pretty large and reasonable data set to work with. As for the idea of having 10% malware, I'll tell you a bit later on why 10 percent and not another number, but basically this comes from what vendors report and from private communications we had with other researchers: the belief is that this is the ratio of malware that you observe in Android, and

this is an average, so there are some spikes, but it's an average. This is an important number, however, because if you want to identify and measure the effect of concept drift and how the classifier decays over time, and you want to perform a sound evaluation, you need to have this class ratio right; if you don't have the class ratio right you end up inflating the results, and I'll motivate this with some experiments later on. Now, once we identified a data set that we want to work with, we need to understand which approaches we would like to evaluate, which approaches we would like to reason about to understand to what extent

concept drift is a problem, and how much of a problem it is, and so on. This is a bit of a delicate question, because one way would be to try to evaluate as many approaches as possible, but of course there's always a trade-off that you have to pay attention to when doing this type of activity, because you have limited resources and limited time, basically. So what we decided to do is select machine learning algorithms that were representative of their own category, and program analyses, to create the abstractions that I was telling you about before to represent programs, that are again representative

of their own categories. In particular, we focus on two main algorithms, and at the end of the talk there's going to be a third one. Both approaches were published at NDSS, one in 2014 and one in 2017. For those of you not familiar with academic conferences, this is one of the top four conferences in security: there are a bunch of conferences, and there are four that the community considers the top four, which are USENIX Security, IEEE Security and Privacy, NDSS and ACM CCS. So this work has been peer reviewed, and it reports results that are pretty good in

binary classification tasks on Android. And again, the intention of this talk is not to point the finger at these pieces of work; I just took them as representatives of the state of the art to show what the effect of concept drift is and what might happen if you don't enforce a proper, sound evaluation of your classifier. The first approach is called Drebin, but I'll actually just call it algorithm one for the rest of the talk. It relies on a very simple automated static analysis: you take an Android application and the approach basically extracts strings that are meaningful; it could be URLs, it

could be strings embedded in the bytecode, or it could be APIs that appear in the bytecode of the Android application. Then you encode this information as a bit vector. Let's say that we identify a hundred different such strings; then the bit vector for a specific application, the way I represent this application, this abstraction, is just a vector of a hundred dimensions, and you have a one if you have that specific string or a zero if you don't have it. Now you have to do this for a hundred and thirty thousand applications, and you end up with a very large feature space that is very sparse, so

most of the entries are zero and you just have a few ones here and there. It's a very large feature space, but it's a very simple static analysis, very lightweight. It doesn't aim to capture any dependency between any of the actions: if you see that there is access to the IMEI, so personally identifying information, and you see there's network communication afterwards, you cannot really say that that information has been leaked through the network communication, because there's no data flow analysis that tells you this; it could be completely coincidental, or there could be a leak. But it's a very lightweight static analysis that you can do, and it gives you an abstraction. The machine learning algorithm it relies on is a simple linear SVM, so a linear classifier, and the results are pretty good: I believe they report roughly 95% precision at a 1% false positive rate, I remember that just by heart, but you'll see a couple of graphs later on.
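Just to make that encoding concrete, here is a minimal sketch of what such a pipeline could look like in scikit-learn. The string extraction from the APK is assumed to have already happened, and the tokens and labels below are invented for illustration; this is not the original authors' code.

```python
# Sketch: binary "has this string or not" features plus a linear SVM.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

# Each app is reduced to the set of strings extracted from it (URLs, API names,
# permissions, ...); these are toy examples, not real extraction output.
apps = [
    "android.permission.INTERNET getDeviceId http://ads.example.com",
    "android.permission.INTERNET openConnection",
    "sendTextMessage getDeviceId http://bad.example.org",
    "getSystemService openFileOutput",
]
labels = [0, 0, 1, 0]  # 1 = malware, 0 = goodware (made up)

# binary=True gives the 1/0 bit-vector encoding; the matrix stays sparse,
# which is what makes the very large feature space workable.
vectorizer = CountVectorizer(token_pattern=r"\S+", binary=True, lowercase=False)
X = vectorizer.fit_transform(apps)

clf = LinearSVC().fit(X, labels)               # the simple linear classifier
print(clf.predict(vectorizer.transform(["getDeviceId sendTextMessage"])))
```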

The second approach is different. It's from three years later and it relies on a more sophisticated program analysis: in particular, it builds a call graph out of every application and it encodes the call graph as a Markov chain, where the features, these dimensions that I mentioned to you to represent specific programs, are the transition probabilities

in this graph. You end up with, again, another vector with some numbers, and this is the way you represent this application; you do this for all the applications and you end up with your feature space. The algorithm that this second approach, MaMaDroid, relies on is a random forest, so a decision-tree-based algorithm, completely different from an SVM. And at the end of the talk I'll also show you something else that relies on deep learning, because it's in a way the third category of algorithm that we can rely on to identify these things. Of course there are others, but they haven't really been explored in the security community, so we didn't want to add an approach that hasn't been published yet or hasn't been vetted by the community.
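In the same sketchy spirit, the Markov-chain idea can be illustrated like this; the real approach derives the chain from an abstracted call graph, whereas here each app is reduced to a toy sequence of abstracted calls, and all names and labels are invented.

```python
# Sketch: row-normalised transition probabilities as features, fed to a random forest.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

STATES = ["android", "java", "self"]           # abstracted call "families" (illustrative)
IDX = {s: i for i, s in enumerate(STATES)}

def transition_features(call_sequence):
    counts = np.zeros((len(STATES), len(STATES)))
    for src, dst in zip(call_sequence, call_sequence[1:]):
        counts[IDX[src], IDX[dst]] += 1
    row_sums = counts.sum(axis=1, keepdims=True)
    probs = np.divide(counts, row_sums, out=np.zeros_like(counts), where=row_sums > 0)
    return probs.ravel()                        # one flat feature vector per app

apps = [
    ["self", "android", "android", "java", "self"],
    ["self", "java", "java", "android"],
    ["self", "self", "android", "self", "android"],
]
labels = [0, 0, 1]

X = np.vstack([transition_features(seq) for seq in apps])
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, labels)
print(clf.predict(X[:1]))
```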

So, when it comes to temporal bias, it's pretty simple, and I'm sure everyone that has ever used machine learning in any context knows that you need to evaluate the model. We're talking in this talk about supervised learning, so we have a labelled data set, a labelled training data set and a labelled testing data set; of course you have to train the model without looking at the testing data set, but you evaluate the model on the testing data set, and you have the

labels, so you basically know how good your performance is. A common way to evaluate these algorithms is to use a best practice known as k-fold cross-validation, and it works this way, it's very simple: you take your whole data set, you partition it into k folds, k chunks; you train your model on k minus 1 chunks, on all but one, and you test your model on the k-th chunk, and you repeat this process k times and average the results. What is the reason to do this? The reason is that you would like to reduce the possibility that, whenever you have your entire data

set and you have to split it into training and testing, you get a particularly lucky split where your classifier performs really well. Because you're doing this repeatedly and every point is used for training and for testing, just not at the same time, you're lowering the possibility of overfitting the model and of having a very lucky partition. Now, this works pretty well, and I'm not advocating not using it, we should absolutely use k-fold cross-validation or hold-out validation, but the problem is that this works very well assuming that your data set is representative of the population and it doesn't change over time, so, again, that the entire data set is stationary.
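For reference, this is roughly what that standard evaluation looks like with scikit-learn, using synthetic placeholder data; it produces a single averaged number and has no notion of time at all.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 20))                # stand-in for the app feature matrix
y = rng.integers(0, 2, size=1000)              # stand-in labels

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(LinearSVC(), X, y, cv=cv, scoring="f1")
print(scores.mean())                           # one number, averaged over the 10 folds
```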

If you're working in a context where there's non-stationarity, well, eventually whatever results you get out of this k-fold cross-validation are not realistic anymore. Why are they not realistic? Because if you unroll your data set on a timeline, and you have a timestamped data set, every object is timestamped with some information, if you unroll this on a timeline, and as I mentioned you partition the data set into folds and you train on k minus 1 and test on the k-th one, it might happen that you actually train with knowledge from the future in this timeline and you test on objects that are coming from the past. That's cheating if you're deploying the

model, right? If you are at the point where you deploy the model and you work in this non-stationary setting, where things might change over time, well, you cannot train the model with things that you haven't even seen yet. You just have to reason: OK, this is my entire data set, I train the model with this piece of information, and I deploy it; from here onwards I see new objects. Now, some of these objects might actually be drawn from the same distribution that I trained my model with, and that's fine, so this will give you the average performance of the model in the absence of concept drift. But for the objects that will start

drifting away, which will start being represented with properties that are different from the distribution you trained your model with, these objects will start being misclassified by the classifier, because they just belong to a different distribution. But if you do k-fold cross-validation only, then you are mixing up the timeline, so you're using knowledge from the future and testing on the past. Now, there is an interesting question about whether we should be forgetting about the past, but I'll keep this as a follow-up question or for offline conversation. So if you are relying only on k-fold cross-validation, this is what happens: you're going to be using knowledge from the future to test

on objects potentially from the past, and this of course gives an inflated result in your evaluation, because it's not realistic of when you deploy the algorithm: it doesn't capture the natural trend of concept drift that you will be observing over time. And the question is understanding, given an algorithm and a specific program analysis, so a feature space abstraction, how rapidly it decays over time. This is what we're trying to measure. We call this constraint C1, and its violation, using future knowledge in training, means results might be inflated. There is a second violation in the temporal domain that is a little

bit more subtle. The second violation basically concerns the classes themselves: we're working with a binary classification case, so we have goodware and malware; you split the timeline into training and testing, and that's fine, but when you actually use the data to train your model, you should be using benign and malicious samples that come from the same time slice. There might be a situation in which you train with benign software that is from 2013 and you train the model with malicious software that is from 2015, and maybe you test your model with

software that comes from 2016 onwards. So C1 holds in this scenario, but the problem is that when you train a classifier, the classifier tries to solve an optimization problem, and because you're giving it two classes, benign and malware, that are far away from each other on the timeline, there's no guarantee that you actually have a machine learning algorithm that learns good versus bad; you might have a machine learning algorithm that picks up artifacts and learns old versus new. There are APIs that didn't exist in 2013 but do exist in 2015, some of them are deprecated, and so forth, and the classifier might actually be picking up those indicators

as the main ones to derive and draw the decision boundary and actually separate the two classes. But once you deploy the model, you observe objects that might fit into that timeline or not, so you again add a little bit of bias, an artificial bias, to the evaluation. So ideally you should be drawing the objects of the different classes you train your classifier with from the same time slice. And this is a way to visually represent what I just mentioned: in k-fold cross-validation the entire data set is used for training and for testing, not at the same time, but eventually all of it, so you end up

using knowledge from the future, and therefore you might inflate your results. Then there is a first form of temporal inconsistency that of course shouldn't be happening, because it's even worse: you're only using data from the future and you test it on the past. This is of course a scenario that is not possible in deployment. The question here would really be: is this data from the past something that I can still observe once I deploy the model? If the answer is yes, then of course it makes sense to perform that kind of evaluation, but if you don't know the likelihood of that happening, then of

course this evaluation is again a bit biased. Then there is the temporally inconsistent class ratio of goodware and malware: this shows you that although all the training points on the left-hand side are antecedent to the testing points, or at least overlapping with those on the right-hand side, you have malware and goodware drawn from two different time slices, so again this might induce the classifier to pick up artifacts, and the problem is that, again, we don't really know most of the time what the classifier is doing underneath, depending on the abstraction that you are using. So to try to remove bias you should be training a classifier with a data set

that is split in this way: goodware and malware are drawn from the same time slice, and all the training is antecedent to all the testing. Again, I'm not saying that we should get rid of k-fold cross-validation; k-fold cross-validation provides an approximation of the performance of the classifier in the absence of concept drift, so whenever you deploy your classifier and the objects that you observe in real life, in your feed, still fall within the distribution that you trained the classifier with, that's good, and the performance you should be expecting is the one you measured with k-fold cross-validation.
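As a rough sketch (field names are made up, this is not the TESSERACT API), a temporally consistent split over a timestamped data set could look something like this: the filter enforces C1, and the check afterwards is a crude stand-in for C2, making sure the training classes are drawn from the same time slices.

```python
from datetime import datetime

def temporal_split(samples, split_date):
    """samples: list of dicts with a 'date' (datetime), features 'x' and label 'y'."""
    train = [s for s in samples if s["date"] < split_date]    # C1: train strictly in the past
    test = [s for s in samples if s["date"] >= split_date]

    # Crude C2 check: every (year, month) slice used for training should contain
    # both classes, so the classifier cannot learn "old vs new" as a shortcut.
    slices = {}
    for s in train:
        slices.setdefault((s["date"].year, s["date"].month), set()).add(s["y"])
    single_class = [k for k, labels in slices.items() if len(labels) < 2]
    if single_class:
        print("warning: training slices with only one class:", single_class)
    return train, test

samples = [
    {"date": datetime(2014, 1, 10), "x": [0.1, 0.0], "y": 0},
    {"date": datetime(2014, 1, 20), "x": [0.9, 0.3], "y": 1},
    {"date": datetime(2014, 6, 5),  "x": [0.2, 0.1], "y": 0},
    {"date": datetime(2014, 6, 9),  "x": [0.7, 0.8], "y": 1},
    {"date": datetime(2015, 3, 2),  "x": [0.8, 0.7], "y": 1},
]
train, test = temporal_split(samples, datetime(2015, 1, 1))
print(len(train), "training objects,", len(test), "testing objects")
```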

But whenever these objects start drifting off, falling into a distribution different from the one you trained the classifier with, then the question is: OK, how quickly does it decay over time, and how quickly do I need to do something, like retraining the classifier, for instance, or using other techniques that I'll mention in a second? So this is what actually happens in terms of decay of classifier performance in this context. The graph shows a plot where precision, recall and F1 are shown, and how they decay over time. The setting is this one: we train the model on one year's worth of data and then we test it on the subsequent two years. This is just an example; from an operational perspective you might

not want to train it on a year's worth of data, you might want to train it on a month's worth of data, or three months' worth of data; this is just an illustrative example. So we train with one year's worth of data and test over the subsequent two years, and this is algorithm one, so a linear SVM with very lightweight static analysis; again, the two are intertwined, so we tend to reason only by looking at the machine learning, but the two are intertwined, and in fact you'll see a different graph for the other algorithm.
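The evaluation loop behind a plot like that can be sketched as below; the names are illustrative, and the TESSERACT framework mentioned at the end provides this kind of time-aware split as part of its API, but the idea fits in a few lines.

```python
from collections import defaultdict
from sklearn.metrics import f1_score
from sklearn.svm import LinearSVC

def evaluate_over_time(samples, train_until):
    """samples: dicts with a zero-padded 'month' string (e.g. '2015-03'), 'x' and 'y'."""
    train = [s for s in samples if s["month"] < train_until]
    clf = LinearSVC().fit([s["x"] for s in train], [s["y"] for s in train])

    test_by_month = defaultdict(list)
    for s in samples:
        if s["month"] >= train_until:
            test_by_month[s["month"]].append(s)

    # One F1 per test month: this sequence of numbers is the decay curve in the plot.
    return {month: f1_score([s["y"] for s in batch],
                            clf.predict([s["x"] for s in batch]))
            for month, batch in sorted(test_by_month.items())}
```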

Here, if you pick the blue line, that's the harmonic mean of precision and recall, so the F1 score. You see that of course it goes a bit up and down, and why up and down? Because, again, some of the objects in a given time slice do not drift too much from the training model, while others eventually start drifting a lot. You can see that there is anyway a downtrend, so the performance of the classifier over time will decay, and this is what you would be expecting because it's a non-stationary context. We started with something a little below 0.8 and it goes down to 0.4, something like that. If you

look at the dashed red line, however, that is the performance of the classifier in a k-fold cross-validation setting, so something like 0.94 of F1 score, and this is again the performance that you would obtain in the absence of concept drift: if the points that you observe do not drift from the model you trained the classifier with, that is the performance you should be expecting, provided there are no other biases in the evaluation, and the other bias that we have to look at is spatial bias. But if you then start encountering concept drift, this is how quickly the classifier decays over time, and this is informative because at that point you

can understand, say, I cannot accept a performance that goes below 0.7, and given that decay, depending on your operational situation, well, at that point you know that you have just a couple of months that you can live with this model, after which you need to do something: you either need to retrain incrementally, but there's a cost associated with relabelling the objects, or you might rely on classification with rejection or active learning; there are all sorts of techniques that you can use to try to counter the performance decay of the classifier over time. I'm not sure whether I'll have time to talk about these, but it's in the slides, and there is a paper that

describes all of this at length, and there is a lot of stuff there about active learning, incremental learning, classification with rejection. OK, so these are the two temporal constraints that we must enforce if we want to understand how our classifier would perform in the presence of concept drift. If you're not interested in that, that is fine; however, even if you are not interested in concept drift and how the classifier's performance decays over time, there's still another dimension you have to look at to make sure that your evaluation is not biased from an experimental perspective, and this is what we call spatial bias in this work. The

reason is very simple. When you have your data set you split it into training and testing, and let's say again it's a binary classification problem, so you have to decide what the class ratio is. In some cases you cannot really decide it, but in other cases it's easy to decide: if you have a data set available, like AndroZoo, you can download applications and you have to decide, OK, how many benign applications do I want in my training data set, how much malware do I want in my training data set. So you have to decide a class ratio. This is fine for training purposes; you can play with

the class ratio as you wish; there are implications, because you're making the algorithm more sensitive towards one class or the other, but you can. The question is: can you do the same in the testing data set, can you actually play with the class ratio in the testing data set? The problem is that if you play with the class ratio in the testing data set, well, you're just playing with reality, because the class ratio in the testing data set is representative of what you observe in the feed that you get in real life: new attacks that you receive, new network flows that you receive. Those cannot be artificially manipulated when you perform these experiments in the lab, so you

need to understand what the natural class ratio of the problem you're working with is. Again, for Android malware classification it seems reasonable to consider 10 percent malware and 90 percent benign software; for other domains, Windows might be the opposite way, and for malicious URLs it could be a different ratio as well, but you need to understand the phenomenon you're dealing with, because if you play with the class ratio in the testing data set you inflate the results. The example that I have here is quite simple to understand. Let's assume that you have a classifier that is reasonably good at detecting malware but raises a lot of

false positives, which means that in this case, if you look at the precision metric, precision is defined as true positives over true positives plus false positives, so if you have too many false positives then precision goes down. Now, what I can do in my evaluation is craft a testing data set with a class ratio of a lot more malware: I know that my classifier makes mistakes in misclassifying benign software as malware, so what I can do in my testing data set is just provide less benign software with respect to malware, so I give more malware and I provide less benign software. In that case the class ratio changes, and as a matter of fact

the false positives lower, because I'm not feeding the classifier as much benign software, and the true positives increase, because I'm feeding the classifier more malware. At that point, if you look at the orange-yellow line, you see that you get a boost in precision just because you played with the class ratio in the testing data set. So the point is that we cannot do that: the testing data set should be representative of real life, so we need to understand the class ratio of the problem we're dealing with and we need to enforce that class ratio strictly.
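A small worked example makes the inflation obvious. Fix a hypothetical classifier's detection rates (say a true positive rate of 0.9 and a false positive rate of 0.1, numbers made up for illustration) and only change the malware fraction in the test set:

```python
def precision(tpr, fpr, malware_fraction):
    tp = tpr * malware_fraction                 # true positives per test object
    fp = fpr * (1.0 - malware_fraction)         # false positives per test object
    return tp / (tp + fp)

for mw in (0.10, 0.50, 0.90):
    print(f"{mw:.0%} malware in the test set -> precision = {precision(0.9, 0.1, mw):.2f}")
# 10% -> 0.50, 50% -> 0.90, 90% -> 0.99: same classifier, very different story.
```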

In some of the tasks that we have to deal with, the class ratio is natural: if you just monitor network traffic, well, that's what you observe there. But if you're building a data set and you can play with the data set, because it's hard to harvest the data another way, well, you need to be sure that you enforce the right class ratio, otherwise the results are just inflated, and this is something that has happened; let me just keep that for a couple of slides. So the question is: can we play with the class ratio in the training

data set? Yes, we can do that, because playing with the class ratio in the training data set just makes the classifier more sensitive towards one class or the other. Of course you have to be careful, because by making the classifier more sensitive towards one class you improve the detection of those objects, but as a catch you might end up misclassifying many objects of the other class. So what you have to do is find a sort of sweet spot: by playing with the class ratio in the training data set you can improve the performance of the classifier a little bit more, because you use your data set in a more

intelligent way. Now, of course, you don't have to try at random, so we designed an algorithm; it's an empirical algorithm, so there's no mathematical proof, but basically the algorithm finds this sort of sweet spot. There is an error that you have to be willing to accept in the classification, and once you set that error, the algorithm proposes the class ratio that better suits the task at hand. Again, this is really intertwined with the machine learning and the program analysis you're working with, so the algorithm is generic, but the class ratio for one algorithm is different from the class ratio for another.
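A plain grid search captures the gist of that tuning step, though the actual algorithm in the paper is more principled than this; the data, the candidate ratios and the subsampling strategy below are all illustrative.

```python
import numpy as np
from sklearn.metrics import f1_score
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)

def subsample_to_ratio(X, y, malware_ratio):
    """Keep all malware (label 1) and subsample goodware (label 0) to the requested ratio."""
    mal, good = np.where(y == 1)[0], np.where(y == 0)[0]
    n_good = min(int(len(mal) * (1 - malware_ratio) / malware_ratio), len(good))
    keep = np.concatenate([mal, rng.choice(good, size=n_good, replace=False)])
    return X[keep], y[keep]

# Toy data standing in for the real feature matrices and labels.
X = np.vstack([rng.normal(0, 1, (900, 5)), rng.normal(1.5, 1, (100, 5))])
y = np.array([0] * 900 + [1] * 100)
idx = rng.permutation(len(y))
X_tr, y_tr, X_val, y_val = X[idx[:700]], y[idx[:700]], X[idx[700:]], y[idx[700:]]

# The testing/validation ratio stays fixed (that is reality); only training is rebalanced.
for ratio in (0.10, 0.25, 0.50):
    clf = LinearSVC().fit(*subsample_to_ratio(X_tr, y_tr, ratio))
    print(f"training malware ratio {ratio:.0%}: F1 = {f1_score(y_val, clf.predict(X_val)):.3f}")
```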

You can see that in these two plots: for algorithm one the class ratio sweet spot is around 25% of malicious software in the training data set. The testing data set's 10% is fixed, it's life, we can forget about it; with the training I can play a little bit more, and here I can have a class ratio of 25% malicious software and that gives me the best performance I'm interested in. But for this other algorithm we find that the best class ratio is at 50%, so it's a balanced situation, 50% malware, 50% benign software. Again, the algorithm finds the sweet spot that is tailored to that specific machine learning algorithm and that

specific program analysis that the machine learning algorithm relies on to create the feature space. So, just to show you, I plotted it here in these plots, just look at the one at the bottom: the shaded blue line is the original performance, the F1 of the decay graph that I showed you at the very beginning, the decay of that graph for both algorithms, and the solid blue line shows you how you can already improve the performance of the algorithms over time just by playing with the class ratio in the training data set. So same algorithm, same

optimizations, same program analysis; you're not doing anything, you're just using the data set in a little bit more intelligent way, on the training, because you cannot do anything on the testing, and this is, I would say, a very good improvement in performance. So, to sum up, to summarize a little bit about the constraints: if we're interested in understanding the performance of our algorithm in a non-stationary scenario, wherever there is concept drift, then we can no longer just rely on a k-fold cross-validation type of evaluation; we need to enforce temporal constraints to be sure that we evaluate the algorithm over a timeline in a proper way, and at

the same time we need to be sure that we also don't introduce any bias by playing with the class ratio in the testing data set, because otherwise that represents an unrealistic scenario. If we enforce these constraints, then we can capture the performance of these algorithms, or any others we have, over time, and this is what it looks like once we enforce everything. Here I just wanted to introduce something else: for those familiar with AUROC, that's just the area under the ROC curve, which captures with a single number how good your performance really is over different thresholds. We borrowed that kind of

reasoning: a plot is nice, but it would be better to have a number that I can use in a program to compare things. This number is what we call AUT, which is the area under the curve over time of the performance metric you are interested in. Say you are interested in the F1 score: the AUT captures the area under the F1 score over the timeline you are interested in. Again, here as an example we train on one year and test on, sorry, 24 months, for algorithm one and algorithm two, and we show over those two years how the

performance decays. From a practical perspective you might not be interested in the whole 24 months, you might only be interested in 3-4 months, and then you just limit the AUT to those three or four months. And then here you have a number for each approach, 0.58 for one of them, that you can already use to compare: say you have a couple of approaches and you don't know which one really performs better in your context in the presence of concept drift, then you can use these numbers to evaluate your approaches and try to understand, OK, this approach works a little better than that other one. But keep in mind this only looks at performance over time.
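A back-of-the-envelope version of such an area-under-time number uses a simple trapezoidal rule over the per-month scores; the exact definition and normalisation in the TESSERACT paper may differ, and the two score series below are invented.

```python
def aut(per_period_scores):
    """Trapezoidal area under the metric-vs-time curve, normalised to [0, 1]."""
    s = [float(v) for v in per_period_scores]
    area = sum((s[k] + s[k + 1]) / 2.0 for k in range(len(s) - 1))
    return area / (len(s) - 1)

# Illustrative per-month F1 scores for two hypothetical classifiers over 12 months.
steady  = [0.75, 0.74, 0.73, 0.72, 0.72, 0.71, 0.70, 0.70, 0.69, 0.69, 0.68, 0.68]
brittle = [0.95, 0.90, 0.80, 0.70, 0.60, 0.52, 0.45, 0.40, 0.37, 0.35, 0.33, 0.32]
print(round(aut(steady), 2), round(aut(brittle), 2))   # roughly 0.71 vs 0.55
```

In this toy comparison the steadier classifier comes out ahead even though it never reaches the brittle one's initial 0.95, which is exactly the kind of comparison a single cross-validated number hides.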

We're not trying to answer any question about how the algorithm performs in the absence of concept drift, or whether the abstraction I'm using, which gives me that curve, is useful to explain the predictions; these are all open questions. So this is just one answer, but don't take it as the holy grail. Now I wanted to show you the real effect of experimental bias, and how it could really trick us into believing that we're working with something that's really good when in fact it is not. Here the first two plots are about the two algorithms that I just mentioned before: algorithm one, a linear SVM, so a very

simple linear classifier with very lightweight static analysis that extracts strings, and I'm oversimplifying, but just go with me on that: extract strings, represent them as a bit vector, 1 if you have it, 0 if you don't have it, simple as that, very quick, a linear classifier. In k-fold cross-validation the classifier performs around 0.94 or something like that. There are two straight horizontal lines, one black and one red: the black one is the one reported in the paper, on a different, smaller data set, and the red is the one

that we reproduced on our own data set, which is much larger. You can see that the two lines are pretty close; there is a bit of fluctuation there because it's a statistical approach, but it's roughly the same, it's not too bad. The other lines capture what happens if you have concept drift, so how the performance decays over time, how quickly it decays in the presence of concept drift, and this is what we already mentioned before. Now, interestingly, let me just look at the third plot for a second and I'll go back to the second one. The third one is instead something that is based on deep learning, because we also wanted to

evaluate how deep learning would affect the performance of the classifier in the presence of concept drift. I'm not giving much detail on how that works, but it's the same in terms of program analysis, the same program analysis as the first one, just a different machine learning algorithm, so in this case same abstraction but different ML algorithm. You can see that in the absence of concept drift, so in a k-fold cross-validation setting, the paper's F1 score is a bit above 0.8, so 0.81, 0.82; compared to the first one it's not a winner, right? Now, our performance on our data set, when we reproduced those experiments, is a

bit below 0.9, and it's a bit higher because they didn't perform hyperparameter optimization, so there are some technicalities, but it's all right. You can see that now it's much closer to the first algorithm, so I wouldn't discard it in favour of the first one, because the performance in the absence of concept drift is actually pretty close. But, interestingly, you can observe that the third one, the one based on deep learning, seems to resist time decay better compared to the other one. The reason is still a bit unclear, because we don't really know, I mean, we know deep

learning, but we don't really know what the different layers really capture, but you can see that there is a smoother decay over time. The catch is that, on average, there is a much lower performance compared to the first one. So again, I'm not trying to say that the first one is better than the third one or vice versa; I'm just saying that if you have the right methodology you have information that you can use to take decisions, and depending on your case you might say, look, 0.9 or whatever it is there, I don't really care about that, but I do care that I have a more stable performance over time. If

that is what you're looking for, then maybe you should go for this one, but again it might be harder to explain the reason for a classification. If, on the other hand, you're not interested in time decay, or you are only interested in a few months and then after a few months you have a chance to retrain your classifier or things like that, you can probably use the first one. But I wanted to draw your attention now to the second one. The second one is the most recent algorithm, it's been published pretty much in the same timeframe as the third one: again a more complex program analysis, with a random forest as the

machine learning algorithm. In the paper they claimed that their algorithm performs at 0.99 of F1 score, which is pretty high, right? And that's in k-fold cross-validation, and that's the black dashed-and-dotted line at the top of the second plot. Now, of course, because that's k-fold cross-validation, the first and the second constraints in the temporal domain are violated, but that's fine if you are interested in the k-fold cross-validation result, so in the absence of concept drift. But you also have to consider whether there is a violation of the third constraint, so whether they played with the class ratio in the testing data set. It turns out that they played a

lot with that, and I'm not saying they did it on purpose, it just happened that they built their testing data set in a way that is completely unrealistic: I believe that the class ratio, instead of being 10 to 90, was 90 to 10, so 90% malware and 10% benign, which basically boosts the classifier performance to 0.99, but it inflates the result. Now, if you enforce the right constraint of having a class ratio of 10% malware in the testing data set, the performance in k-fold cross-validation actually drops to 0.83 or 0.84. Now, if that had been the reported number, probably this paper wouldn't have been

accepted, but I'm not saying that this is bad; I'm saying that we have to be careful when we use machine learning, not just write three lines of scikit-learn and then we are all machine learning specialists, right? We need to try to understand the implications of using the algorithm, especially when it's combined and coupled with program analysis as well, and then there's the decay over time. So this is what I just mentioned. A couple of points that I can probably conclude with: we can use the AUT as a valid metric to measure a baseline performance of an algorithm, and

again this is intertwined with the program analysis that we rely on, and then use that to explore different strategies that I can engineer to address concept drift, be it incremental retraining, online learning, active learning, classification with rejection; the AUT will give me a number that quantifies the performance and the cost of each of these strategies, and I have a methodology that I can use to take more informed decisions. There are a bunch of other things that we could do: we could have a class ratio that actually changes over time as well, because now we have a class ratio that is on average 10% over the whole timeline, and of course

that might not actually be very accurate or realistic, so we could have a variable class ratio that changes over time, but for that we need measurement papers, we need to understand what happens in real life. We might also try to understand whether what we're using to train our classifier enables us to detect objects that we haven't seen before but that fall within the same distribution, and what the catch is, how we are forgetting about the past, because ideally we should be using the whole data set, but not all the data is alike. So there are a bunch of questions that we can look at, and in all of this

I haven't mentioned the buzzword adversarial machine learning. Adversarial machine learning on purpose manipulates the input space to create objects that are close to each other in the input space but very far away in the feature space, so that they can be misclassified, though it depends on what type of attack you do, whether it's targeted or not. So how does concept drift relate to adversarial machine learning, and what is the effect of adversarial machine learning not on images, because adversarial machine learning has been quite prolific in the image domain or in the speech and audio domains, but what happens when we move away from that abstraction, from things that human

beings can relate to, and we look at software? We need to start thinking about adversarial program generation, so programs that can be automatically generated to create software that causes a classifier to misclassify. These are all food for thought and things that we are also working on in my lab. So, to summarize, this is the paper that is on my lab web page, and if you want you can read the full details of what I just mentioned, plus more, because there's a whole part about incremental retraining, active learning, classification with rejection that I didn't have time to tell you about; it's quite thorough, with all the details, so I'll be more than happy for you to

follow up with me. So my final points, I believe, are just: we need to use machine learning not just as a black box, but we need to understand the implications and the assumptions that are required to do it properly in the security domain; we need to be sure that when we evaluate an approach there's no temporal or spatial bias, and if there is, that we are aware of it and of what it means; and we can rely on some other metrics, like the AUT, in addition to other metrics, to evaluate the performance of the classifier in this context, the context of concept drift, and how quickly the

model decays over time. Oh, and I haven't mentioned that all of this has been released as open source, so if you're interested you can just drop me a line and I'll give you access to the Bitbucket repository. It's semi-public in the sense that we like, as a measure of impact, keeping track of who's getting access to the code and whatnot, so it's easier for me if you drop a line saying who you are and which institution you're working with, just for private, personal record, and I'll give you access to the Bitbucket repository. All the code has been released open

source, and all the data for the experiments: we cannot release the data set directly, you need to obtain the data set from AndroZoo, but we released all the hashes of the applications that we used in all the experiments, and we also release the feature space, so the output of the program analysis that we ran to create that feature space, which takes a bit of time to do if you want to reproduce the whole thing we just talked about. So it's all there; for the sake of reproducibility you can actually do it. The code is scikit-learn compatible, so it follows the same API; there are only a

couple of APIs added on top for the timeline, so the time split that you want to work with, and the AUT is in there as well, plus the other things that I mentioned about incremental retraining and so on. So I'm almost done, thank you all, but a shameless plug: I'm the general co-chair for ACM CCS, one of the top four conferences I mentioned at the beginning. The conference is usually held in the US, but for a few years it has been going back and forth, US, Europe, US, Europe, and it's going to be held November 11 to 15; go to the website if you're interested. It's one of the top four academic

conferences, so there are very good quality talks; it's a different vibe, it's an academic conference, so it's good for understanding what that kind of conference is like. If you're interested, drop me a line or go to the website, I'd be more than happy to talk to you more about it. And without further ado, I believe I've almost used up all my time slice; if we have a few more minutes, I'll be more than happy to take any questions. Thanks. [Applause] Questions? At the back, yep. So the question is about who labels the apps, right? That's a very good question,

and it's an open problem. The short answer is that it's a very open problem, because that is about the ground truth, so everything I've shown is good modulo the ground truth: if something is wrong, of course the curves will look different, but the methodology is still valid, it's just that the ground truth was not good enough. For this work we just relied on the AndroZoo data set, and for labelling purposes apps are fed to VirusTotal, with just one constraint: we didn't use data from the last year, so the study finishes at 2016, and we're expanding it up to 2018, but we were not

using the last year's worth of data, because there's been a recent piece of work that shows that within a year labels shouldn't be trusted, because they can actually change; there is a bit of uncertainty and instability, but a year after an app has been flagged as good there is a good likelihood that it is good. There is a whole set of caveats, because now we're talking about Android apps, so there is dynamically loaded code, there are libraries, ad libraries and stuff, so it's a very complicated space, but given these assumptions I believe we tried to limit the risk of dealing with a bad ground

truth, though it can still be there, yeah. [Music]

Another question. So the question is about whether, for the APK, we consider a situation in which you could inject things at runtime. Not in this case: all the program analysis that we dealt with is static analysis, and it's already very complicated this way, but of course I'm working a lot on this, and the conversation is also about adversarial program generation. I'm working on adversarial program generation where we do adversarial attacks, but not just in the feature space, because that is not realistic: from the feature space we then automatically generate a program that, whenever it is analyzed by a classifier, generates features that will cause the misclassification. There's a whole bunch

of issues there, because you have side effects that you bring in; when you do the program analysis you have to be careful that what you inject is not executed at runtime, because you have to guarantee semantic equivalence. There are all sorts of huge problems; with dynamic analysis it's even worse, so one step at a time. At the back over there, then I'm going to suggest one more and then we'll break. So the question is how we differentiate between behaviors that are malware and behaviors from legitimate tools that look like malware. The answer is that we don't; we just take the data set as

it is: somebody told us, through this VirusTotal process, vetted and stable for more than a year, that this is good and this is bad. There is another, longer answer to your question, and it is that it's still quite unclear what the classifier learns, which is very on topic for this BSides. It really varies with the abstraction that you're using, but most of the time it doesn't pick up on what you believe would be the behavior that you're interested in; it picks up on artifacts that just make a nice decision boundary. There is a lot of research that is needed there, and

it ties back to the program analysis, it ties back to explainability, it ties back to adversarial machine learning, because the idea is that if you constrain things, if you spend more time on the engineering of the features and you're using features that are also hard to manipulate, you constrain a bunch of things, you might actually help explainability, but you might not have very good performance, so it's unclear yet. Final question and then we'll have to wrap up. So, sorry, the question was whether we looked at the cost of feature engineering and how that affects

all of this. So, with my lab we are working on this; it's just a small lab and there are far too many things to work on, but hey, if you guys are from industry and want to donate funds for my research, I use it for non-profit purposes, everything is really open, we all do research. It's a very complex problem because of what I mentioned, that there are implications: performance is not the only thing; we've been trained to look at performance as the answer to the problem, but it's no longer that, because there are a bunch of other issues. So you care about performance, sure, but in many situations

there are other concerns. I had a very interesting conversation with a vendor that I'm not sharing anything about, well, you might guess, and the way it works is they have very good detection rates and whatever, but they don't really care about that, of course they do care, but they want to be able to understand whether what the classifier spits out can be trusted. So the classifier says this is malicious software; how do I verify it without having a human being reverse engineer the app? They really want to have something that supports analysts in understanding: oh, the content is bad, and because I have this automated, security-centric generated description of the behavior of the program, yeah, I

can read it in ten seconds and say, oh yeah, now I trust it. So that's fine, but that abstraction has a cost: when you just extract strings it's a very lightweight cost; when you build a data dependency graph, if you're familiar with that, it's a very different type of cost. So I don't have an answer; we are really working on it quite heavily, but I don't have an answer yet. Lorenzo is here now and all afternoon, so please feel free to find him and ask questions. Thank you very much. [Applause]