
Hello everybody, we're going to get started. I'd like to introduce you to John Anderson. John is on the open-source security team at Intel. He's from Portland, went to PSU for computer engineering with a focus on embedded systems, and did his honors college thesis on machine learning. He started at Intel as an intern and has been an employee for the past five years. Security research, embedded systems, machine learning, and data flow programming are his current interests. Let's welcome John.

Thanks, guys, thanks for showing up. So I'm going to talk to you a little bit today about Intel's dependency review process and how we automated some of it using machine learning. So who am I? I'm John, as we said.
I'm an open-source security guy at Intel, so I've been playing around a lot the past few years with Linux containers, concurrency, web apps, and Python, and with machine learning a little more recently. While I was at PSU doing my undergraduate honors thesis on machine learning, I was also interning at Intel, and so I got the chance to work on this project where I could apply some of the stuff I was learning about machine learning to a real application. At Intel we have a dependency review process that is part of the release process for any given piece of software, and so the software at
Intel goes through this whole security process before release: you have to make sure that none of your dependencies end up on this ban list, because a lot of software out there can be pretty bad, especially if you're just googling for something and you find it on GitHub, and it says it does AES and it's half the size of OpenSSL. If you went to Matt's talk yesterday, you might know why that is a bad thing and why you should not use it. So we have this review form. You can't really see it here, and this is not the exact internal one; this is the version that went through legal review. It's got a bunch of little A-through-F answers, and for each category it's looking at things like maintenance, security testing, and testing in general. What we were doing is this: we, as the open source security team, were put in charge of reviewing everybody's open-source dependencies for all of Intel. Intel writes a lot of software, so we were reviewing thousands of open-source dependencies, and we thought it would be great if we could automate this, because we have other jobs, real things to do, and we don't want to be sitting there looking
at open source packages all day; it's pretty mind-numbing. So we took this data set of the form answers. John Whiteman and I took the same approach on this first pass: we grabbed this data set, we got the URLs to all these git repos, we have the review forms, and we have the classification. So this should be a pretty straightforward problem, right? We should be able to take this form, map it to the classification, train a machine learning model on it, and it should tell us yes this is good or no this is bad, based on our trained classifications. But we ran into
a problem pretty much right off the bat, and that was that reviewers weren't consistently filling out these forms. Big surprise, right? We're charged with doing this mind-numbing task that takes a lot of time, and we're really excited to go click every single drop-down. Part of the reason for this was that, right off the bat, you could sometimes tell whether something was good or bad right away, from signs like: was this thing written in 2005 and never touched since? Does it implement some crypto in a very non-standard way that one guy wrote and no one else has looked at? Is it a parser? All of these are early warning signs that we might just say: bad, you cannot use this thing, you may not ship with this software. There are also signs of things being good. Take systemd: we may not like systemd, but you can use it, because we know that people do maintain it. So this first attempt was a failure; we only got around 60% accuracy. We had to go back to the drawing board and think about our approach. If we don't have the data in the form, then we're going to have to
generate a new data set. Instead of using that form data, we're going to generate some answers that are sort of like that form data. We don't know what they're going to be yet, but we know that there's some data that, when you put it through a model, will tell you this is good or this is bad at a high accuracy, because after reviewing all these packages we knew there were some things that we could put into numbers; we just didn't quite know what they were. It must be possible. So we took this iterative approach: generate the data set, train the model, and if the model
gives low accuracy, that means we have the wrong data set, so go back and grab some different features. We came up with this plugin-based system where we scrape these different pieces of data that we're using as the features for the model, so that we can quickly swap these features in and out as we find out that this one doesn't matter or this one does. The idea is that we can swap features out very quickly, find which ones matter and which ones don't, and ideally have a simple framework to do it, because if I just want to count the number of authors, that should be as simple as git log, grep for Author, sort, uniq. That's very easy. The problem comes into play when you start running multiple of these at once and they're all running git, and git apparently does some weird lock files on repos, so you can't actually run git concurrently. If you want to scrape all of these repos and iterate on this, then this regeneration process for the data set needs to be fast, because I can't sit here and spend two days regenerating my data set only to go back and find out that, no, that wasn't the right set of features; that would be a huge waste of my time. So we needed something that gave us a little more flexibility around how we write these data-scraping plugins without needing to know too much about the other things that are scraping data as well. The framework we came up with is basically a directed graph of operations that get run concurrently, while the execution framework deals with locking: we define the data types, and we say whether a given data type needs to be locked before it's used in an operation.
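To make the author-count scraper concrete, here is a standalone Python sketch of the git log / grep / sort / uniq pipeline just described. This is hypothetical illustration code, not one of the actual plugins; the function names are made up.

```python
import subprocess
from typing import Set


def unique_authors(git_log_output: str) -> Set[str]:
    """Collect the distinct author lines from `git log` output,
    the same job as `git log | grep '^Author:' | sort | uniq`."""
    return {
        line[len("Author: "):].strip()
        for line in git_log_output.splitlines()
        if line.startswith("Author: ")
    }


def count_authors(repo_path: str) -> int:
    """Run git against a local checkout and count unique authors.
    (Hypothetical helper; a real scraper plugin would wrap this call.)"""
    log = subprocess.run(
        ["git", "-C", repo_path, "log"],
        capture_output=True, text=True, check=True,
    ).stdout
    return len(unique_authors(log))
```

Simple on its own; the trouble, as described above, starts when many of these run against the same repo at once.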
The git repo, for example, is a data type that needs to be locked. Once I've parsed some of that data, I can throw it back into this network of running operations, and I can now run things concurrently, or, for CPU-bound work, in parallel; we don't want to be running CPU-bound things merely concurrently. That then releases the resource lock for the other operations that actually do need to be working on the locked resource, like other things that need to go actively scrape the git log or whatever it is. This is the directed graph showing how we would calculate the ratio of number of lines of comments to number of lines of code, which was a rough estimate of how well documented the code is. What you can see here is: we take the URL of the git repo, we clone it, and then we go find what the default branch is, unless it was provided, because we want to detect whether the default branch is master or maybe something like 5.3-point-something. The other thing that was happening here is that we were doing this on a quarterly basis, so we're collecting time-series information, because, as you can imagine, the maintenance of git repos and their other properties all vary over time. Somebody may not start out having their project in public Coverity and then later put it in public Coverity, so quarter by quarter these data points change. We have a set of operations that takes a quarter start date, goes back X number of quarters, and generates the commits associated with the dates for those various quarters. Since this is all running concurrently, part of what happens here is that all the possible permutations of every input set for any given function, also known as an operation, get run. As soon as you generate all of those quarter dates, it's going to dispatch all of these operations to run concurrently: for each of these quarter ranges, it checks out the repo at that default branch and that quarter date, runs cloc, and then we did the division to find the ratio.
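The core of that ratio operation can be approximated in a few lines. The real pipeline ran cloc, which understands many languages; this toy version, with a made-up function name, only counts full-line Python # comments, just to make the calculation concrete.

```python
def comment_to_code_ratio(source: str) -> float:
    """Toy approximation of the comments-to-code ratio cloc computes:
    full-line comments divided by non-blank code lines."""
    comment = code = 0
    for line in source.splitlines():
        stripped = line.strip()
        if not stripped:
            continue  # blank lines count as neither
        if stripped.startswith("#"):
            comment += 1
        else:
            code += 1
    return comment / code if code else 0.0
```

In the real data flow this division is just the last operation in the graph, fed by the clone, branch-detection, checkout, and cloc operations.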
So this is how this execution environment works: the nice thing is that it runs everything concurrently, manages the locking, and runs every unique permutation of inputs. So we're good to go. Coming back to the main framework: we've got this system for doing plugins on the data set generation, and we also came up with a system for doing plugins on the machine learning models, as well as a system for doing plugins on the data sources. Now we can iteratively swap out the data generation pieces, and we can swap out the model portion. We can say: am I using scikit, which is what we started with, because scikit will tell you the feature importances of each of the pieces of data you're putting into it. It might say: hey, you gave me this ratio of how many comments there are on every pull request, or this measure of the diversity of authorship, like, is each line of code committed by a different person? If you throw that through the random forest classifier, it has this nice property, being a classical machine learning algorithm, of telling you which pieces of feature data were important in getting good accuracy on the prediction. That helped us throw out the features that didn't matter. But once we got to a reasonable accuracy, around 80 percent, I think it was 82 percent, we said, okay, we want to switch this to a machine learning model that hasn't been around for many, many years, so let's try some neural networks through TensorFlow. Since we've got this plugin framework, we basically say: generate the data, but instead of going through the scikit model, go through the TensorFlow model. It's just a command-line flag to swap scikit for TensorFlow.
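The feature-importance step might look roughly like this with scikit-learn's RandomForestClassifier. The feature names and the synthetic data are invented for illustration; they are not Intel's actual features or data.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 200
# Stand-ins for scraped per-repo features (names are illustrative only):
# column 0: comment/code ratio, column 1: author diversity, column 2: noise.
X = rng.random((n, 3))
# The label depends only on the first two columns, so they should rank
# as important and the noise column should not.
y = ((X[:, 0] + X[:, 1]) > 1.0).astype(int)

clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
for name, importance in zip(
    ["comment_ratio", "author_diversity", "noise"], clf.feature_importances_
):
    print(f"{name}: {importance:.3f}")
```

Inspecting `feature_importances_` like this is what let the team discard features that weren't contributing to accuracy.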
Then we're training using TensorFlow, and we get 90% accuracy. That's good enough right there, so we're just going to start using this 90%-accurate model to drop the banhammer on things as people submit them, so they don't have to wait four to six weeks to hear that they need to go change all their code because they used a banned dependency. And then, like I was saying, we also swapped the data sources in and out. When we went to take this into the production environment, the production environment had this wacky MySQL setup, so we created a custom data source that knows how to interact with all these various MySQL tables. That's probably going to be the case if you're integrating with an existing application: either you don't know what the database is going to be like, or you do know, and it's going to involve custom logic, so you need to figure out how to integrate with that custom logic, and you wrap it in a plugin. This is what it looked like once it got integrated: it's pulling and pushing the data from that production database; you put in the URL, it runs the data set generation, and then it runs the prediction through the trained model.
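A source plugin of the kind described might take roughly this shape. To be clear, this is not DFFML's real interface, just a sketch of the abstraction: something that yields feature records and accepts predictions back, with the backend (the custom MySQL tables, in production) hidden behind it.

```python
from abc import ABC, abstractmethod
from typing import Dict, Iterator


class Source(ABC):
    """Sketch of a data-source plugin abstraction (hypothetical shape,
    not DFFML's actual API)."""

    @abstractmethod
    def records(self) -> Iterator[Dict]:
        """Yield one dict of scraped features per package."""

    @abstractmethod
    def update(self, key: str, prediction: str) -> None:
        """Write a prediction back to wherever the data lives."""


class MemorySource(Source):
    """In-memory stand-in backend; the production plugin did the same
    job against the custom MySQL tables."""

    def __init__(self):
        self._rows: Dict[str, Dict] = {}

    def records(self):
        yield from self._rows.values()

    def update(self, key, prediction):
        self._rows.setdefault(key, {"url": key})["prediction"] = prediction
```

The point of the abstraction is that the model code never needs to know whether it is talking to CSV, JSON, or that wacky MySQL setup.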
It says: yes, I think this is a good thing, or no, I think this is a bad thing. So what's behind all this? We've got this wonderfully named library called Data Flow Facilitator for Machine Learning. Legal gave me that name, so it's long, and it's generic and descriptive. What does this thing do? It provides the abstractions around sources, around generation, and around models. We've got a few built in for you, but the plugin-based system does not require that you write a plugin that gets contributed back into the main source code of the repo, so people can publish plugins, just have them on PyPI or in their git repos, and you can install them from people's random sources and they will work as part of the existing ecosystem. We've got sources for CSV files, JSON, and MySQL, and we've also got some models for TensorFlow and scikit to do a few things there. We provide a consistent API across the command-line interface, the library, the HTTP API, and then, by extension, the JavaScript API that works with the HTTP API. This is just a bunch of console.log output from doing all the same things we were doing in Python, now from JavaScript hitting the HTTP API. So then you have access to all these
plugins: you install somebody's plugin, and since it all fits in the framework, it knows how to extract all the data you need to configure that model, or whatever it is, and display it to you through this other API. So what else have we done with this thing? We wrote this meta static analysis tool for Python called shouldi. If you're familiar with pip, or basically any package manager, you usually type pip install or apt-get install whatever. So we created this thing that downloads the PyPI package that you were about to install and runs some static analysis tools on it. Right now there are only two static analysis tools that it runs, but there can be lots more, because we have this system where we're using these operations, and the amount of code it takes to add a new plugin to analyze the source code is, I think, about 20 lines, if it's auto-formatted with this thing that makes it really long. We take the package name, we grab the JSON information from the PyPI API, and this is the data flow for that: we extract the URL, we extract the version, we download the package contents, and we run Bandit.
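The first steps of that data flow, pulling the version and the source-tarball URL out of the PyPI JSON API response, can be sketched as a plain function. The function name is made up, but the JSON fields (`info.version`, `urls[].packagetype`, `urls[].url`) are the ones the real API at `https://pypi.org/pypi/<package>/json` returns.

```python
from typing import Tuple


def sdist_info(pypi_json: dict) -> Tuple[str, str]:
    """Given a parsed response from the PyPI JSON API, return the
    version and the URL of the source distribution (sdist)."""
    version = pypi_json["info"]["version"]
    for release in pypi_json["urls"]:
        if release["packagetype"] == "sdist":
            return version, release["url"]
    raise ValueError("no sdist published for this release")
```

In the real tool these are separate operations in the graph, so the download can start as soon as the URL is extracted.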
Bandit is a tool that does source code analysis to look for things like SQL injection in your Python code. Then we also run this other tool called Safety, which checks for open CVEs, known vulnerabilities, in packages. All of this runs concurrently, and when you're scheduling subprocesses, because this is all back on asyncio, the nice thing is that when you call out to a subprocess it automatically gets run in parallel, so you're getting concurrency and some parallelism for free. Also, if you're doing CPU-bound tasks, it's really easy to just say, hey, schedule this in a thread, and it works; it'll go do that for you.
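The subprocess-scheduling point can be sketched with plain asyncio: two stand-in "tools" launched with create_subprocess_exec run at the same time while the event loop awaits both. The commands here are placeholders, not the real bandit and safety invocations.

```python
import asyncio
import sys


async def run_tool(*argv):
    """Launch one analysis tool as a subprocess; while it runs, the
    event loop is free to launch and await the others."""
    proc = await asyncio.create_subprocess_exec(
        *argv,
        stdout=asyncio.subprocess.PIPE,
        stderr=asyncio.subprocess.PIPE,
    )
    out, _ = await proc.communicate()
    return proc.returncode, out.decode()


async def main():
    # Placeholder commands standing in for bandit and safety;
    # asyncio.gather runs both subprocesses at the same time.
    return await asyncio.gather(
        run_tool(sys.executable, "-c", "print('bandit-ish result')"),
        run_tool(sys.executable, "-c", "print('safety-ish result')"),
    )


results = asyncio.run(main())
for code, out in results:
    print(code, out.strip())
```

Since each child is a separate OS process, you get real parallelism across the tools even though the Python side is single-threaded async code.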
We've also got this concept where, since we've abstracted to this level, you can take these data flows and deploy them in different kinds of environments. We could deploy this as a command-line application, like we did, where we say shouldi install this thing, and it runs the checks and gives you the results; or we could deploy it behind an HTTP API, coincidentally the same one that serves all the machine learning models and stuff. Basically, we export to YAML files, or JSON files, or whatever config format you want, because there's a serializer and deserializer plugin, because, if you haven't noticed yet, I like plugins. So you can export these data flows, and then you can say: run this specific data flow for the HTTP API, and you can overlay other things on top of it to extend it. Say I had this base set of operations for this data flow, but now I want to create a new data flow. I want to take these shouldi operations that we wrote, one of which was download the package contents and run the Bandit operation, but now, instead, I'm going to take the package contents and concurrently link that into those existing git operations that we had. And I didn't have to write any code; all I had to do was modify the YAML file. Now, if you can see at the bottom here in this demo, it's giving you the lines-of-comments to lines-of-code ratio. Basically we have a very small YAML file that sets up this overrides situation, and then you can overlay different operations to create new data flows by chaining operations together and saying how they should be connected. This is the graph that shows how we took that package contents and are now using it in those other git-repo operations.
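An overlay might look something like the fragment below. The keys and operation names here are purely illustrative, a guess at the general shape of such a file rather than shouldi's real YAML; the point is only that re-wiring one operation's input to another's output is a config change, not code.

```yaml
# Hypothetical overlay file (illustrative names, not the real schema):
# feed the downloaded package contents into the git comment/code ratio
# operation instead of a fresh clone.
overlays:
  operations:
    lines_of_comments_to_lines_of_code:
      inputs:
        repo_directory: download_package.package_contents
```

Because the execution framework only cares about the graph of typed inputs and outputs, chaining the shouldi download operation into the git analysis operations needs nothing more than this kind of edit.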
Same code: all we had to do was tweak some files, and now it's all linked up and running concurrently together. So where do we go from here? You can check out the machine learning integration usage example, which is very, very similar to the code that had to be written to integrate this within Intel's environment, and you can also check out shouldi, which is this meta static analysis tool. Hopefully you guys can go contribute, because it should be very easy to write little operations. If you want to contribute, we have weekly meetings at 9:00 a.m. PST, or PDT right now, and we've also got a Gitter and a mailing list, and the meeting links are on this website. I just updated all the documentation and rolled releases this morning, so hopefully it all works. Any questions? Yes.
I am