
The Role of Data Visualization in Improving Machine Learning Models

BSides Las Vegas · 2017 · 26:13 · Published 2017-08
About this talk
Phil Roth explores how data visualization supports machine learning model development, drawing on his experience building MalwareScore and Bit Inspector at Endgame. The talk covers visualization techniques for model evaluation (ROC curves, confusion matrices), strategies for different audiences (self, team, company, public), and tools including D3, Kibana, Yellowbrick, and Google Facets.
Original YouTube description:
GT - The Role of Data Visualization in Improving Machine Learning Models - Phil Roth Ground Truth BSidesLV 2017 - Tuscany Hotel - July 26, 2017
Transcript

[Host] ...and with that, welcome Phil Roth.

[Phil Roth] All right, thanks a lot. Like you just said, I'm Phil Roth, a data scientist at Endgame. I'm going to talk about the role that data visualization can play when you're building machine learning models. An alternative title might be: screenshots of this internal visualization tool I built to test MalwareScore, along with a lot of lessons I learned building it that will hopefully help you out. At the end I'll also throw out some resources and data visualization tools that I really enjoy using, and maybe you will too.

Like I said, I'm Phil Roth. My background is in physics; I used a machine learning algorithm getting my PhD. I moved on to making images out of radar data, but I definitely wanted to get back to machine learning, and that's why I came to Endgame.

Let me talk about the two products I work on, one external and one internal. MalwareScore is a machine-learning-first solution built for detecting and preventing malware. It's a model that operates on Windows executable files. It's based on static features, and it tries to classify whether it thinks those executables are benign or malicious. It's very lightweight, it executes very quickly, it's deployed to customer machines, and it's also available at VirusTotal. We're proud of those scores; you can go look them up publicly on VirusTotal, dig around in them, and see how they work.

Bit Inspector is an internal tool for communicating progress, soliciting feedback, and identifying errors related to MalwareScore. It's a tool I built while building MalwareScore to help me do those things.

These are the tools I used to build it. I'm a Python guy, so I built the web front end in Flask. It uses a little bit of D3 to get some of the visualizations I couldn't get in pure Python plotting, and it also generates some static plots with matplotlib. It also connects to multiple internal Endgame data and processing resources, and that led to an advantage I didn't expect when building it: the code for this web front end is in Python and uses a bunch of those resources, so that code is now available to all the Endgame employees. When someone comes to me and asks, "How do I grab the data for this sample?" or "How do I upload this sample to our processing pipeline?", well, Bit Inspector does all those things and the code is there, so I can just refer them to how Bit Inspector does it, and they can go ahead and grab that code and do it on their own. That was one of the unexpected advantages of building this.
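Bit Inspector's code isn't public, so as a rough illustration only, here is a minimal Flask sketch of that kind of front end: a sample page plus a matplotlib plot rendered server-side. The routes, template, and score-lookup function are all hypothetical stand-ins.

```python
# Minimal sketch of a Flask front end for model results, in the spirit of
# the tool described in the talk. Routes and helpers are hypothetical.
import io

from flask import Flask, Response, render_template_string
import matplotlib
matplotlib.use("Agg")  # render static plots without a display
import matplotlib.pyplot as plt

app = Flask(__name__)

def get_scores(sha256):
    # Placeholder: a real tool would query internal data stores here.
    return {"v1": 0.92, "v2": 0.88, "v3": 0.95}

@app.route("/sample/<sha256>")
def sample_page(sha256):
    return render_template_string(
        "<h1>{{ h }}</h1><img src='/sample/{{ h }}/plot.png'>", h=sha256)

@app.route("/sample/<sha256>/plot.png")
def score_plot(sha256):
    scores = get_scores(sha256)
    fig, ax = plt.subplots()
    ax.bar(list(scores), list(scores.values()))
    ax.set_ylabel("score")
    ax.set_title("Scores across model versions")
    buf = io.BytesIO()
    fig.savefig(buf, format="png")
    plt.close(fig)  # avoid leaking figures between requests
    return Response(buf.getvalue(), mimetype="image/png")
```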

There are two main pages in Bit Inspector; there's a little bit more, but these are the main interfaces. The first shows all the information it has about one Windows executable file, and some screenshots of that are shown here. You have the hash and a link to VirusTotal, then all the different versions of MalwareScore and how each of those versions scored the file, and on the right a visualization, which I'll go into later, showing some of the features that MalwareScore is based on. This is a great resource when one of the domain experts has a file and wants to know how our model scores it and how it scored it in the past.

The other way to look at Bit Inspector is the model page, which shows all kinds of information about one version of MalwareScore. There are static files available there for other people in the company to download and use, and then lots of plots about the performance of that model. I'll get into those plots right away.

Whenever you're talking about visualizations to evaluate machine learning models, there are two basic ones that you definitely need to be using, and both are in Bit Inspector. The first is the ROC curve. The ROC curve, and the area under it, is really the main way that we at Endgame compare models to each other, and it's a great way to know how well you're doing. I'm not going to go through it in detail here; I just wanted to show it as a resource, include the code we use to generate our ROC curves, and make these slides publicly available, so hopefully this can help somebody out.
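The code from his slides isn't reproduced here; as a stand-in, a minimal sketch of the standard scikit-learn and matplotlib approach to an ROC curve, using toy data:

```python
# Minimal sketch of ROC curve plotting with scikit-learn and matplotlib
# (a stand-in for the slide code, not Endgame's actual code).
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import auc, roc_curve
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = RandomForestClassifier(random_state=0).fit(X_train, y_train)
scores = model.predict_proba(X_test)[:, 1]  # positive-class probability

fpr, tpr, _ = roc_curve(y_test, scores)
plt.plot(fpr, tpr, label=f"AUC = {auc(fpr, tpr):.3f}")
plt.plot([0, 1], [0, 1], linestyle="--", color="gray")  # chance line
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.legend()
plt.show()
```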

Then there's the confusion matrix, which is also shown in Bit Inspector. That's just a table where the columns represent the predicted class and the rows represent the actual class, and we show it for all the data and also for different subsets of the data. When the actual labeled class matches what was predicted, that lands on one diagonal, and that's when you're doing a good job; the other diagonal is when you're doing a bad job. Just by looking at the numbers in that confusion matrix, you can calculate true positive rates and false positive rates over the whole data set and over certain subsets. And as a resource, the slides include the Python code we use to generate those static plots.
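Again as a stand-in for the slide code, a minimal sketch of computing a confusion matrix and the rates derived from it with scikit-learn, on toy labels:

```python
# Minimal sketch: confusion matrix plus TPR/FPR with scikit-learn.
# Toy labels; in scikit-learn's convention rows = actual class and
# columns = predicted class, matching the layout described in the talk.
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 0, 1, 0]  # 1 = malicious, 0 = benign (toy data)
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()

tpr = tp / (tp + fn)  # true positive rate: malicious files caught
fpr = fp / (fp + tn)  # false positive rate: benign files flagged
print(cm)
print(f"TPR = {tpr:.2f}, FPR = {fpr:.2f}")
```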

All right, with those basics laid out, let's get into what I view as the role of data visualization, or how it's helped me as I build a machine learning model. The first idea is feature experimentation. Visualizing your features over all the samples in the data you train on is very helpful. When you're paging through your training set and looking at these pages that show the features, it gives you a sense for what is in your data, what a sample might be, and why a model might be scoring it one way or another. I'm not going to get into too much detail about what these features are; they've been talked about before, and I'll just refer to a link that describes what the sliding-window byte entropy is. But I will say that visualizing that byte entropy this way really allowed us to very quickly get a grasp of what might be in a file: high-entropy data shows up high on the y-axis, and empty data shows up very low. So at a glance it gives you an idea of what kind of sample you're dealing with, and that might give you a sense for how it's being scored.
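The talk defers the feature's definition to a link; as a rough illustration under the usual definition (Shannon entropy of byte values in a sliding window), a minimal sketch follows. The window and step sizes, and the input file, are arbitrary assumptions, not MalwareScore's actual parameters.

```python
# Minimal sketch of sliding-window byte entropy: Shannon entropy of byte
# frequencies in each window. Window/step sizes are arbitrary choices.
import numpy as np

def sliding_window_entropy(data: bytes, window=1024, step=256):
    entropies = []
    for start in range(0, max(len(data) - window, 1), step):
        chunk = np.frombuffer(data[start:start + window], dtype=np.uint8)
        counts = np.bincount(chunk, minlength=256)
        probs = counts[counts > 0] / len(chunk)
        entropies.append(-np.sum(probs * np.log2(probs)))  # 0-8 bits/byte
    return entropies

with open("sample.exe", "rb") as f:  # hypothetical input file
    print(sliding_window_entropy(f.read())[:10])
```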

So not only can you get a sense for what the features look like across all the data in your training set, you can also use data visualization to get a very quick sense of how your model is performing. Like I said, the way we do that is with ROC curves, where we can compare models to each other, but also by showing the scores for different samples over time. Another great way to use data visualization is to find problems and red-team your model.

This has been invaluable to us at Endgame. It's a big, scary step when you've built a model, you think it's doing well, and you want to move forward and deploy it on customer boxes. You really need to build confidence in what you're doing: confidence in the model, and that it's going to do what you want it to do. The best way I've found to do that at Endgame is to open up the model to the rest of the company and say: hey, beat this model, show us where it's lacking and where it's going wrong.

So we made an interface in Bit Inspector to do exactly that. Domain experts can come in, and if they know about a hash that's usually classified wrong by other AV vendors, they can plug it in there or upload the file itself. That's a great way to obtain more data and make sure that everything we know about is in our databases and in our data pipeline. And on the sample page, where it shows all the scores through time, there's also an interface for reporting: if a sample is scored malicious and you know it's benign, you can report it and give it to us.

We keep all those bug reports in our own little database and feed those samples back into our training data. This is a little bit like active learning, though active learning is really about building a model that suggests to domain experts which samples they should label and which would best help the model. We're not doing that yet; we definitely want to get to that level, but right now we're just soliciting feedback and asking our domain experts to red-team our models.

What comes out of that whole process is problems: subsets of samples that we're getting wrong. Once you find those, you change your labels, you change your model parameters, and you work on getting that subset correct. What data visualization can do then, and what we do in Bit Inspector, is track that solution over time. As we train more and more models, we want to make sure that all the problems we fixed stay fixed and that we don't break them at some point. Breaking our samples out into different subgroups, and then plotting their confusion matrices and histograms of their malware scores, really helps us be confident that once we've solved a problem, it stays solved.
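A minimal sketch of that kind of per-subgroup tracking with pandas and matplotlib; the score column, subgroup names, and synthetic data are all hypothetical stand-ins:

```python
# Minimal sketch: per-subgroup score histograms, to check that fixed
# problem subsets stay fixed. Subgroup names and data are hypothetical.
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "score": np.clip(rng.normal(0.7, 0.2, 300), 0, 1),
    "subgroup": rng.choice(["dotnet", "installers", "packed"], 300),
})

fig, axes = plt.subplots(1, df["subgroup"].nunique(), sharey=True)
for ax, (name, group) in zip(axes, df.groupby("subgroup")):
    ax.hist(group["score"], bins=20, range=(0, 1))
    ax.set_title(name)
    ax.set_xlabel("malware score")
plt.show()
```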

So I'm going to get into how Bit Inspector evolved and how different people in the company started using it. But before that, I want to talk about what you're doing when you spend time improving your visualizations. In my mind, the things you're improving are the explainability, the trustworthiness, and the beauty of your visualizations.

What do I mean by those three things? First, explainability: can this visualization be understood on its own? If the creator of the visualization is not in the room when someone is looking at it, can the viewer still understand the point it's trying to get across and all the data that's in it? There are basic things you can do to increase explainability, like labeling your axes, showing the units, and titling the plot so everybody knows what's going on. Those are things you definitely should be doing, but I want to highlight some extra steps you can take, and almost stress them because I don't do them enough: adding annotations, and adding explanations in readable prose that let people just read and figure out what you're trying to get across.

A great example of annotations especially is this xkcd comic, a visualization of the temperature of the Earth over a long, long period of time. I just love it for all the annotations it shows; they give you a great sense for what Randall Munroe is trying to show you.

So what do I mean by trustworthiness? This means: can you trust the source of this visualization? It's very easy to generate a chart in Excel, or make a plot in Python and stick with the defaults, but that's not really going to make the viewer trust that you know what you're doing. To get that kind of trust you want to add some styling, and this is especially true for media outlets: consistent styling makes it look like you know what you're doing. I think The Economist does a great job at that.

Lastly, beauty. I'm not really going to try to define that, and a lot of people have different opinions on which data visualizations look the best. This is one of my favorites, from the site pudding.cool, showing the number of unique words used by different rappers in their lyrics. I really like it, I think it looks really cool. It definitely takes a lot of time to generate something that's really catchy like this, and that's one of the ways you can spend time improving a visualization.

All right, let's get into who I was building Bit Inspector for over this long journey of building MalwareScore. The first audience is definitely myself. When I was building Bit Inspector, really I just wanted to convince myself that I was doing something useful. At that level you don't need to spend a lot of time on your visualizations; you can leave the explainability low and skip all those other things, because you know exactly what you're after. You have a question in your head, you're trying to answer it by generating a visualization, and you're going to get that feedback immediately, so you can pretty much leave the defaults on.

You want to convince yourself that you're doing something useful, so I'm going to use this as an excuse, since my background is in physics, to drop a quote from a famous physicist: you want to convince yourself, but you are actually the easiest person to fool. One of the ways you can fool yourself is in this model building loop. You try something, you look at the results, and if something looks wrong, you think about it, fix it, and try something new. You're always going through this process, the scientific process, the model building process. But at some point something might look right, and you break out of the loop and say: I'm done, everything's great. There are still many things that could be wrong, though, and you could be fooling yourself. There could be two things wrong that are cancelling each other out, making just one of the metrics you're following look right. That's how you fool yourself. The important thing is to look at good results, sit back, and ask: how could I still be wrong, and what else can I test?

All right, so you've built a model, and you have visualizations and metrics that convince you you're doing things right. The next audience is the rest of your team, and here the purpose is to communicate what you've done and get feedback on what you didn't consider. At Endgame we have a bunch of data scientists with a lot of different backgrounds who have trained different models on all kinds of data sets, and it's those varied backgrounds that are going to give you new ways of looking at the problem. You'll get valuable feedback by doing this, but you need to put a little more work into your visualizations at this stage, and definitely add context: where the sources of data are, and what model parameters went into the model training, because your data science team is going to be very interested in those things. Pretty much everything is the same for domain experts, except the context you're adding is a little different: for me, it's things like PE header information, hashes, and links to VirusTotal. Once you've added that and opened these visualizations up to the rest of the company, you're going to get valuable feedback from your data science team and your domain experts.

All right, so you've done all this. Now the malware score is an important part of your product and it's been put out to the public, but now it's important that managers and executives can look at it and figure out what the progress is, what the current state of the art is, and where the problems are. This is the point where it's really important to ramp up the explainability of your plots. The people looking at them won't have the machine learning background to know what everything means, and at various times Mark or Jamie would come over, sit down next to me, and ask: what is this plot, what does this mean? I'm happy to explain it to them, but I see that as a failure of my own to make the visualization truly stand on its own.

The last audience is the public. We have a technical blog; I like it, it's great. We put a lot of work into explaining where we're coming from and the techniques we use to build machine learning models. At that point you not only want the explainability of the plots to be very high, this is also the time to ramp up how nice the visualizations look.

All right, those were some general thoughts; now let's get into a tour of some of the tools and resources that I use, and that you might find useful too.

Tim Hopper, a data scientist I definitely respect, made a webpage called pythonplot.com that compares the plotting syntax of various Python plotting packages. There are certain tasks, and he accomplishes each of them in a variety of Python plotting packages, which are listed there. It's a common complaint in Python that the plotting tools aren't that great: they're pretty verbose, and they're not as nice as maybe some other statistical languages. I think this is one step toward improving that, and maybe toward getting the Python plotting community behind one package.

Jupyter notebooks: I use these all the time. They're great for exploratory data analysis; you can look at a plot, change something about how the data is gathered, and regenerate that plot right there in the notebook. I use notebooks maybe not as much as I should, just because when typing in them I don't have all my Emacs keybindings, so I get frustrated typing in those little boxes. But they're great for keeping the data-reading code right there so you can keep rerunning it with new ideas.

Kibana is something we at Endgame have known about for a while, but we've only really gotten into it very recently. It's for rapidly building dashboards that constantly update themselves. There is some extra work that goes into building a Kibana dashboard: mostly you need to translate your data from wherever it is into Elasticsearch, because these plots are based on Elasticsearch queries that are then displayed and constantly updated.

Shout out to Daniel Grant, who made all of our recent Kibana dashboards, which again are making us more and more confident in MalwareScore. I want to highlight something Daniel told me recently, because it opened up another role of data visualization for me: publicizing what you've done to the rest of the company. Sometimes that's really important. Sometimes you've made or accomplished something that's really hard to express, but once you've made a data visualization that looks awesome, everybody's going to be going there and saying: yeah, now I start believing in this.
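A minimal sketch of that translation step using the official elasticsearch Python client; the index name and document fields here are hypothetical, not Endgame's actual schema:

```python
# Minimal sketch: pushing model scores into Elasticsearch so a Kibana
# dashboard can query them. Index name and fields are hypothetical.
from datetime import datetime, timezone

from elasticsearch import Elasticsearch  # pip install elasticsearch

es = Elasticsearch("http://localhost:9200")

def index_score(sha256, model_version, score):
    doc = {
        "sha256": sha256,
        "model_version": model_version,
        "score": score,
        "@timestamp": datetime.now(timezone.utc).isoformat(),  # time field
    }
    es.index(index="malware-scores", document=doc)

index_score("e3b0c442...", "v3", 0.97)
```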

D3 I mentioned as something Bit Inspector is built on. A lot of data scientists have a background in R or Python, so JavaScript is definitely something else to learn, and it takes a while to ramp up on. The cost of making D3 visualizations is high, but the payoff is the customization possibilities: you can do anything you can imagine in D3. If you have the time and the inclination, I think it's definitely worth learning, but realize that you're not going to do exploratory data analysis with it; you'll be putting in a lot more work to make your visualizations look exactly like you want.

Yellowbrick is something I should probably get more into. It's a project made by District Data Labs in DC, and the idea is to have pre-baked model evaluation visualizations that adhere to the scikit-learn API. You can see an example of one of them here. If you're using scikit-learn, you already have your training data in this form, with lots of feature vectors and labels, and you're feeding those into scikit-learn anyway, so it's great to just use Yellowbrick to automatically generate visualizations that tell you more about your model.
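A minimal sketch of that pattern with Yellowbrick's ROCAUC visualizer; the classifier and data here are toy stand-ins, not the model from the talk:

```python
# Minimal sketch of Yellowbrick's scikit-learn-style API: the visualizer
# wraps an estimator, and fit/score produce the plot. Toy data stand-in.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from yellowbrick.classifier import ROCAUC  # pip install yellowbrick

X, y = make_classification(n_samples=2000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

visualizer = ROCAUC(RandomForestClassifier(random_state=0))
visualizer.fit(X_train, y_train)   # fits the wrapped estimator
visualizer.score(X_test, y_test)   # computes scores and draws the curves
visualizer.show()                  # renders the finished plot
```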

Two weeks or so ago, Google released Facets, which I was really excited about. It's early, I don't totally understand it yet, and I want to work with it some more, but right now I think it's the best option for truly responsive exploratory data analysis. I've always been a little skeptical of getting anything done really responsively, like changing something and getting a plot back fast enough to keep up with the speed of thinking of ideas and exploring the data, but I think this Facets project is really getting to the point where you can explore your data interactively. And I do have time, so I guess that means I'm going to try a demo of Facets. Things can go totally wrong; no guarantees.

All right, this is a notebook that ships with the Facets project. It's based on an open-source data set that's read from the internet and then converted to JSON. Right here is all the magic: you have your data in a JSON string, and you just feed it into this code, and that generates this interactive visualization. This is some example data; Google doesn't ship the data with the project, but it ships the code to read it.
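The "magic" cell he's describing follows the pattern in the 2017-era Facets demo notebooks, roughly as below; the DataFrame is a toy stand-in, and the import path for the web component depends on how Facets was installed:

```python
# Rough sketch of embedding Facets Dive in a Jupyter notebook, following
# the pattern in the project's demo notebooks. Toy DataFrame stand-in.
import pandas as pd
from IPython.display import HTML, display

df = pd.DataFrame({"file_size": [1024, 2048],
                   "label": ["benign", "malicious"]})
jsonstr = df.to_json(orient="records")

# The href below assumes the Facets nbextension is installed; adjust the
# path for your environment.
HTML_TEMPLATE = """
<link rel="import" href="/nbextensions/facets-dist/facets-jupyter.html">
<facets-dive id="dive" height="600"></facets-dive>
<script>
  document.querySelector("#dive").data = {jsonstr};
</script>
"""
display(HTML(HTML_TEMPLATE.format(jsonstr=jsonstr)))
```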

All right, so then what I did was take some of my data and try to explore it. I wrote some code to take some of the samples we have and feed them into pefile, so I could generate section sizes, file sizes, and a bunch of other information: things that definitely are features in our model. I read that data into the notebook here, and then, once Facets is installed in your environment and everything, I just used the same magic and fed the data in. The resolution isn't great, but let's start clicking around.
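A minimal sketch of that kind of feature extraction with the pefile library, including the overlay ratio he discusses in a moment; the feature names are illustrative, not MalwareScore's actual feature set:

```python
# Minimal sketch of extracting simple static features with pefile.
# Feature names are illustrative, not MalwareScore's actual features.
import os

import pefile  # pip install pefile

def extract_features(path):
    pe = pefile.PE(path)
    file_size = os.path.getsize(path)
    # Overlay: data appended past what the PE header accounts for.
    overlay_offset = pe.get_overlay_data_start_offset()  # None if absent
    overlay_size = file_size - overlay_offset if overlay_offset else 0
    return {
        "file_size": file_size,
        "num_sections": len(pe.sections),
        "section_sizes": [s.SizeOfRawData for s in pe.sections],
        "overlay_ratio": overlay_size / file_size,
    }

print(extract_features("sample.exe"))  # hypothetical input file
```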

You can change what the colors represent right here. Right now it's the label, which means the red is malicious data and the yellow is benign data. The cool thing is you can break that data out based on the different features you fed it, so I'm going to do file size here. Oh, that doesn't look good. It looks like the resolution isn't going to let me do great things, but I guess that looks a little better. I think what the display is doing is that each dot can also represent something in your data set, so if it's an image, the dot can be the image you're looking at. That's not helping us here, so let's clear that. It's still not great, but you can start fooling with this and breaking your data out. I think it's going to be very useful for me in taking malicious and benign data, breaking it out, automatically generating histograms, and seeing what might be a great feature.

Take the overlay ratio, for example. A PE header defines how large your file is, but a lot of times there's some extra data on the end, and the ratio of that extra data to the actual size of the PE file is one of our features. I broke the data out based on that and the file size, and when I was fooling with this maybe a week ago I found a group of data, and it looks like exactly malicious data, with a certain file size and a certain overlay ratio. This tool really lets you break the data out and see those groups, and there might be something special about that group. I'm not sure there is; it's big enough that there are probably a lot of different things in it. But I think the next step is to dive in and see if there's anything interesting or special about those malicious files, and then I'll start looking at that.

All right, that's it. Thanks a lot, and I'll open it up to questions.

[Audience] I have a quick question for you. Let's say you have a bit of polymorphic code. Is that going to change the malware score you assign to the code, and if so, do you want that change reflected in the visualizations you produce as a result?

[Phil Roth] MalwareScore is totally based on static features, so if there's something that unpacks at runtime, we're not going to look at that; we're just going to see the data as it is on disk. So no, it's not going to change the score. But we're ramping up our dynamic analysis, and we want to maybe get some of that involved in a bigger model. Right now we're focused on models that can be shipped to endpoints and evaluated very quickly, so we haven't built deeper inspection models based on dynamic analysis like you described.

[Audience] All right, thanks.

[Host] All right, thank you Phil, and everybody remember to follow him on Peerlyst.

[Applause]