
My name is Hilary. I'm a senior software engineer on the data science team at Sophos, and I also just wrote part of a book on malware data science — I'd say the bits on deep learning, which is a type of machine learning, since I really like this stuff. Today I'm going to give you a pretty high-level overview of machine learning, run through a few quick examples, and then delve a bit more deeply into deep learning, a subset of machine learning, since it seems to be all the rage today. For those of you who are new to machine learning — and I think there may be a lot of you in the audience — that's great: this will be a fairly fast introduction, since we only have about 50 minutes, but I'm going to try to keep things pretty simple. To the experts in the audience: feel free to sit back, relax, and feel lovely and smug, since you've probably heard most of this before. Also, as a side note, if you have a question or something is confusing, raise your hand and definitely ask — I have a few extra minutes, and probably everyone else is thinking the same thing. OK, so what is machine learning? Let's start from the top. Super, super generally, machine learning just means any sort of mathematical process that
looks at some data and tries to derive rules — to predict something about that data, predict something about future data, or explain that data in some way. That's really, really general, and you can think of a ton of things that apply here, but more concretely, we can think of machine learning models as just big functions. You have an input, like a picture of an apple; you apply this machine learning function that we build; and you get some output — maybe the word "apple," or a numeric representation of it. We can do the same thing with malware. This is actually what the data science team at Sophos spends most of its time on: we build deep learning models, which are just big functions that take static files — a doc file, a PDF file, or what have you — as input, and output a score from 0 to 1 indicating whether the file looks malicious. OK, so the question is: what are these functions, and how the heck do you make them? I want to walk through a toy example to showcase this. Let's say we're working with HTML files and we want to build a machine learning model that can predict whether a file is benign or malicious.
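To make the "model is just a big function" framing concrete, here is a toy stand-in written in a few lines of Python. This is NOT the real Sophos model: the weight and bias values are completely made up, and real models learn those numbers from data.

```python
import math

# A toy stand-in for the "model is just a big function" idea -- NOT the real
# Sophos model. It maps a file's raw bytes to a score between 0 and 1, using
# made-up parameters (the "knobs" we'd normally learn from training data).
def toy_malware_score(file_bytes):
    weight, bias = 0.001, -5.0                    # invented values
    weighted = weight * len(file_bytes) + bias
    return 1.0 / (1.0 + math.exp(-weighted))      # squash into (0, 1)

score = toy_malware_score(b"<html>...</html>" * 100)
```

The shape is the whole point: bytes in, a single 0-to-1 score out, with a couple of tunable parameters in between.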
The first step in building a machine learning model is called feature extraction. Say we have an HTML file: we need to represent it numerically, in a way we think will be useful for the model's task. Because this is a toy example, we're going to extract two features — file size and the number of HTML elements — and we'll very much pretend that these two things are incredibly indicative of file maliciousness, even though that's a lie to a certain extent. If we do this across a bunch of known malicious and benign HTML files, we end up with a table with one row per file and two columns, and in machine learning lingo we'll be operating in a two-dimensional feature space. Side note: if you've ever heard someone say, "Ah yes, our models operate in a 512-dimensional feature space," that just means they probably have 512 columns in their table, and each item is just 512 numbers — it's not that fancy. But we're operating in a two-dimensional feature space, which is handy, because we can plot our data. On the x-axis here we have file size, on the y-axis we have the number of HTML elements, and every dot in this plot represents a file in our training data.
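The feature-extraction step just described can be sketched in a few lines. The regex tag count is a crude stand-in invented for illustration — a real pipeline would extract far richer features.

```python
import re

# Toy feature extraction: represent each HTML file as two numbers, file size
# in bytes and a crude count of HTML elements (opening tags). Illustration only.
def extract_features(html):
    file_size = float(len(html.encode("utf-8")))
    n_elements = float(len(re.findall(r"<\s*[a-zA-Z]", html)))
    return [file_size, n_elements]

# One row per file, two columns: a two-dimensional feature space.
files = ["<html><body><p>hi</p></body></html>", "<div></div>"]
table = [extract_features(f) for f in files]
```

Each row of `table` is one dot on the plot: x is file size, y is element count.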
So we're going to build a machine learning model that looks at this data. Because this is a lovely toy example, our data is highly clustered: all of our known malware files are down here on the bottom, and our known benign files are up there on the top. In machine learning — at least in this example — our goal is basically to build a model that looks at our data and finds a decision boundary, which is this black line here, so that when a new example comes in, the model can look at it and say, "OK, this test file is above my decision boundary, and it's pretty far away from it, so I'm pretty sure it's benign — not going to block it." That's the basic idea. So how do you find these decision boundaries? How do you make models that draw these lines? There are a ton of different methods — and sometimes you're not finding decision boundaries at all, you're finding clusters or other things — but whatever the method, you can think of the learning process with a sort of knob metaphor. Imagine the shape of our decision boundary is determined by a bunch of knobs on a control panel. By looking at the training data we have and turning our knobs into just the right positions — they're really just parameters in our mathematical function — we can get a decision boundary that's good. To do this, we define something called loss, which is basically error. In this example, loss could be the percentage of misclassified observations, and we use various mathematical processes to find decision boundaries that minimize loss. This could be as simple as perturbing the line a bit so it goes right here, recalculating loss, and saying, "Oh, our loss decreased, because now this blue dot is with its friends — we're happy, so we'll keep that little random parameter change." There are different ways to do this, but that again was a very, very general description.
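That knob-turning loop — perturb a parameter, recalculate loss, keep the change if loss went down — can be written out directly. The data points and knob ranges below are invented for illustration.

```python
import random

# Toy training data: (file_size, n_elements, label), label 1 = malicious.
data = [(10, 2, 1), (12, 3, 1), (11, 2, 1), (50, 40, 0), (55, 42, 0), (48, 39, 0)]

def loss(a, b, c):
    """Fraction of misclassified points for the boundary a*x + b*y + c = 0."""
    wrong = 0
    for x, y, label in data:
        predicted = 1 if a * x + b * y + c > 0 else 0
        wrong += predicted != label
    return wrong / len(data)

random.seed(0)
a, b, c = (random.uniform(-1, 1) for _ in range(3))   # three random "knobs"
initial_loss = loss(a, b, c)
for _ in range(2000):
    knob = random.randrange(3)                 # pick one knob at random...
    delta = random.uniform(-0.1, 0.1)          # ...turn it a tiny amount...
    trial = [a, b, c]
    trial[knob] += delta
    if loss(*trial) < loss(a, b, c):           # ...and keep it only if loss drops
        a, b, c = trial
```

Real training replaces the random nudges with calculus, but the objective — parameters that minimize loss on the training data — is exactly the same.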
So now I want to go through a few actual examples of machine learning functions — mostly very classic examples — using our toy dataset. OK: the first method for finding a decision boundary that I'm going to introduce is very simple — it's just a decision tree; you may have heard of it. To create a decision tree, we first split our data into two groups by a single feature threshold. We have two features, and we can split the data by putting a threshold on either feature, so all this really means is that we're choosing a horizontal or vertical line to split our data in two. And — sorry, what I should have said — we choose this line in order to minimize loss. We might test out this line, and this line, and this line, calculating loss every single time, and realize, hey, the percentage of misclassified observations is lowest when I draw the line here: all the orange guys are on this side and all the blue guys are over on that side. Once we've done that, we just ask the same question again and split each of the two groups into two more groups — sorry, wrong way — giving us four groups. You could do this again and again and grow a really big decision tree, but for now we'll just have four groups. What you end up with is a decision tree like this, which is just a series of boolean questions. It's a function with an input — a file, as defined by our two features — and an output: which group the file ended up in. So the point is, we can take an input test file, run it through the decision tree, and realize that, when we ran training examples through, files that ended up
in this group — what's called a leaf node — were malicious about 7% of the time, so our output maliciousness probability score is going to be about 7%. OK, so that was a decision tree.
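The tree-growing procedure just described — try thresholds, keep the split with the lowest loss, then score new files by their leaf's malicious fraction — looks like this for a single fork, on invented toy data:

```python
# Growing the first fork of a decision tree by brute force: try a threshold on
# each feature at every observed value and keep the split with the fewest
# misclassifications. Toy data: (file_size, n_elements, label), 1 = malicious.
data = [(10, 2, 1), (12, 3, 1), (11, 2, 1), (50, 40, 0), (55, 42, 0), (48, 39, 0)]

def split_error(rows, feature, threshold):
    """Misclassifications if each side of the split predicts its majority label."""
    total = 0
    for side in (True, False):
        labels = [r[2] for r in rows if (r[feature] <= threshold) == side]
        if labels:
            total += min(sum(labels), len(labels) - sum(labels))
    return total

# Feature 0 is a vertical line, feature 1 a horizontal one.
best = min((split_error(data, f, r[f]), f, r[f]) for f in (0, 1) for r in data)
err, feature, threshold = best

def leaf_score(rows, sample):
    """Fraction of training files in the sample's leaf that were malicious."""
    side = [r for r in rows
            if (r[feature] <= threshold) == (sample[feature] <= threshold)]
    return sum(r[2] for r in side) / len(side)
```

A full tree just applies the same search recursively inside each group; `leaf_score` is the "7% of files in this leaf were malicious" output.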
A bit more complicated than a decision tree is something called a random forest, which is another type of machine learning function. The basic idea is to make a bunch of different decision trees and then have them all vote together to decide whether — in our example — a file is benign or malicious. To make a bunch of different trees, each tree is first trained on a random subset, a resampling, of your data set. So instead of growing one tree like before, we might grow a hundred: one tree will only look at some of the dots in our data set, the next one will look at different dots. The second special thing about the trees in a random forest is that, instead of letting each fork split on whatever feature best minimizes loss, each fork is only allowed to split on a random subset of the features. Does that make sense? It's probably a bit out of scope to really delve into why this is so useful, but the basic idea is that you're jumbling up all these decision trees to make them more variable, and then averaging their results, so you get a smoother approximation of the benign/malware boundary. If you look back here and imagine these boxes were a little bit transparent, and you created a bunch of different decision trees and layered them over each other, you'd end up with a much smoother surface. That's the idea behind a random forest.
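Both randomizations — bootstrap resampling of the dots, plus restricting each tree to a random feature — show up directly in a sketch like this (toy data invented for illustration; the "trees" here are single-split stumps to keep it short):

```python
import random

# Toy random forest: 101 tiny one-split trees, each trained on a bootstrap
# resample of the data and restricted to one randomly chosen feature, with
# predictions averaged. Data rows: (file_size, n_elements, label), 1 = malicious.
data = [(10, 2, 1), (12, 3, 1), (11, 2, 1), (50, 40, 0), (55, 42, 0), (48, 39, 0)]

def train_stump(rows, feature):
    """Best single threshold on one feature; each leaf predicts its mean label."""
    best = None
    for row in rows:
        t = row[feature]
        left = [r[2] for r in rows if r[feature] <= t]
        right = [r[2] for r in rows if r[feature] > t]
        if not left or not right:
            continue
        err = (min(sum(left), len(left) - sum(left))
               + min(sum(right), len(right) - sum(right)))
        if best is None or err < best[0]:
            best = (err, t, sum(left) / len(left), sum(right) / len(right))
    if best is None:                           # degenerate resample: constant leaf
        return lambda x: sum(r[2] for r in rows) / len(rows)
    _, t, p_left, p_right = best
    return lambda x: p_left if x[feature] <= t else p_right

random.seed(1)
forest = []
for _ in range(101):
    sample = [random.choice(data) for _ in data]   # bootstrap resample of the dots
    feature = random.randrange(2)                  # random feature for this tree
    forest.append(train_stump(sample, feature))

def predict(x):
    """Average the trees' votes for a smoother decision surface."""
    return sum(stump(x) for stump in forest) / len(forest)
```

Averaging 101 slightly different, slightly wrong trees is what produces the smoother "layered transparent boxes" surface.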
OK, another example of machine learning is linear regression — economists love these for some reason. A linear regression is a super simple function: it's just a linear combination of your inputs. So if we go back to our data, we can say: I want my maliciousness prediction to be a linear combination of my inputs, file size and number of elements. We have a, which is a parameter — a knob on our control panel that we'll tweak during training — times file size, our first feature, plus b, another parameter, times number of elements, plus c, which is a bias term. We can use — well, in real life, high-level programming languages; in college, some matrix algebra — to calculate exactly what a, b, and c should be. But think about what would happen if a and b were positive: a file with a large number of elements and a large file size would get a higher score. We want our maliciousness score to be higher when those features are low, because that's where all of our malware is, so if we used the math that I won't explain to you, we'd probably end up with negative numbers for a and b. So that's linear regression: just a linear combination of inputs.
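Here's that model fitted on invented toy data. I've pre-scaled the features to [0, 1] and swapped the matrix algebra for plain gradient descent, purely so the sketch stays short and stable — but the punchline is the same: the malicious files are the small ones, so a and b come out negative.

```python
# A linear regression for maliciousness: score = a*file_size + b*n_elements + c.
# Features are pre-scaled to [0, 1] so plain gradient descent stays stable; this
# stands in for the closed-form matrix algebra. Malicious files (target 1.0)
# are the small ones, so we expect the fitted a and b to come out negative.
data = [(0.10, 0.05, 1.0), (0.12, 0.07, 1.0), (0.11, 0.05, 1.0),
        (0.90, 0.95, 0.0), (0.85, 0.90, 0.0), (0.80, 0.92, 0.0)]

a = b = c = 0.0
lr = 0.05                                   # learning-rate knob
for _ in range(5000):
    grad_a = grad_b = grad_c = 0.0
    for x1, x2, y in data:
        err = (a * x1 + b * x2 + c) - y     # prediction minus target
        grad_a += 2 * err * x1
        grad_b += 2 * err * x2
        grad_c += 2 * err
    a -= lr * grad_a
    b -= lr * grad_b
    c -= lr * grad_c

def score(x1, x2):
    return a * x1 + b * x2 + c
```

After fitting, small files score near 1 (malicious) and large ones near 0.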
OK — all of the examples I just gave you technically sit under a big umbrella called supervised machine learning, which I didn't mention before. All that really means is that our training data had known labels for the thing we're trying to predict: we had a bunch of files, we knew certain ones were malware and certain ones were benign, and our goal was to learn from those files so that when we see a new file, we can guess whether it's good or bad. Especially in the corporate world this is really, really useful, and it's by far the most commonly used type of machine learning. But sometimes we're asking other questions. Maybe we have a bunch of HTML files and we know they're all malicious, but we're interested in how they differ; or maybe we think there are two malware families but we don't know which files belong to which. We can use other types of machine learning to learn more about our data even when we don't have known labels. These kinds of algorithms sit under an umbrella called unsupervised machine learning. They usually do things like clustering, or even learning how to efficiently compress your dataset, and they're mostly concerned with figuring out the underlying structure of your data. The basic difference is that instead of minimizing loss as defined by the difference between your predicted labels and your real labels, we're
minimizing loss as defined by some other objective, which I'll give an example of. I'm going to go through a classic example of clustering — so, unsupervised learning — which is the k-means algorithm. k-means is a clustering function. Say we have all these data points, and we say: I think there are two clusters in these data points, but I don't really know where they are, and I don't know which observations — which files — belong to each cluster. k-means is meant to figure this out and give us those two clusters. Here, happy faces indicate cluster centroids. K in this example is just two, because that's easier to make slides for, so I've assumed there are two groups; but K could be any number, like three, or four, or a thousand — well, actually no, because we don't have a lot of data. K is just a parameter that you choose, and this is how you do it. k-means is a procedure you use to cluster data — a sort of training process that develops a function telling us where our cluster centroids are. To find two clusters, we first pick two random points in our feature space and just call them centroids. I've picked two random points and made them yellow and purple: these are our centroids. The next thing you do is assign every single data point in your data set to its nearest centroid — all the purple dots are now assigned to the purple centroid, all the yellow dots to the yellow centroid. Next, you update the centroid locations: you ask, what is the average location of all the data points assigned to me? So the yellow guy takes the average file size and the average number of HTML elements of its points and moves to that spot in the feature space. And then you just iterate: you reassign all your data points to the nearest centroid, you move your centroids based on the data points that were just assigned to them, and you do it again, and again, and again, until you basically reach some stopping point. That can be based on a few different things. You can define loss as something like the average distance from data points to their cluster centroids, look at your loss over time, and say, "OK, my loss isn't really continuing to decrease." Or you can look at your cluster locations and say
hey, they've kind of stabilized, they're not moving anymore — and treat that as your stopping point and stop the procedure. Then you have your two clusters and their assigned data points. OK, so that was k-means. So far in this talk I've described the basic ideas behind machine learning and given you a few examples, all of them on a very simple toy dataset with only two features and a few dozen data points. So what happens when there's more?
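The whole k-means procedure — pick starting centroids, assign points, move centroids, repeat — fits in a handful of lines. The data is the same invented toy set, and I've simply capped the iterations rather than checking for stabilization:

```python
import random

# Bare-bones k-means with k = 2: pick two starting centroids, then alternate
# "assign each point to its nearest centroid" and "move each centroid to the
# mean of its points". Toy data: (file_size, n_elements).
points = [(10, 2), (12, 3), (11, 2), (50, 40), (55, 42), (48, 39)]

def dist2(p, q):
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2

random.seed(3)
centroids = random.sample(points, 2)          # two starting guesses
for _ in range(20):                           # iterate a fixed number of times
    clusters = [[], []]
    for p in points:                          # 1. assign to nearest centroid
        nearest = min(range(2), key=lambda i: dist2(p, centroids[i]))
        clusters[nearest].append(p)
    for i, members in enumerate(clusters):    # 2. move centroid to the mean
        if members:                           # empty cluster: leave centroid put
            centroids[i] = (sum(p[0] for p in members) / len(members),
                            sum(p[1] for p in members) / len(members))
```

Note there are no labels anywhere in this code — that's what makes it unsupervised.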
If we added a third feature, then instead of finding a line we'd be finding a plane — really, a curvy plane — in our three-dimensional feature space. But the idea stays exactly the same: we're looking at our training data in order to optimize some function we've decided upon, so that we can find a good decision boundary. Of course, in real life, where we're working with hundreds of millions of HTML files and thousands of features, this gets a bit complicated — we're trying to find decision boundaries in feature spaces that are just massive — but the objective stays exactly the same: we're trying to find a decision boundary by looking at loss and optimizing parameters. However, this can be really, really difficult. Finding good decision boundaries in super high-dimensional spaces, especially when some of the features have weird, complex, nonlinear relationships to one another, can be quite tricky. This problem comes up a lot in image processing, natural language processing, and cybersecurity, because the data you're working with is really rich and very context dependent. If you think about a pixel in an image, or a word in a paragraph: the meaning of that pixel doesn't transfer to the next image very easily, and the meaning of a word is very context dependent, so it's often hard to extract meaning that can be projected onto new test samples using machine learning. This applies to software security because often we're looking at program files and trying to decide whether they look malicious, which is sort of similar to natural language processing, if you think about it — you're just looking at a bunch of code, a bunch of language. OK, so what do you do in real life when working on these difficult sorts of problems with really large feature spaces? Well, ask any data scientist what the best machine learning algorithm or model is, and they'll probably tell you, "It depends — go away." Unless they're
very drunk — I use that one all the time. But at least for malware detection — if you're trying to figure out whether a static file is malicious or not — that's a really complex problem, and we also have a ton of training data, and given all that, at least in my opinion, the best machine learning approach you can use is deep learning, which is what the rest of this talk is going to focus on. I can't convince you why that's true without actually explaining deep learning, so I'm going to do that first. OK — when you hear about machine learning beating experts at chess and at Go, or performing facial recognition, speech recognition, and language translation — all of these very complex, historically human-centric tasks — that's basically all deep learning, which is just a specific family of machine learning functions. Deep learning is often discussed as this sort of magical AI panacea, which is only kind of true — but the basic concepts are actually super simple and really, really interesting, so I want to delve into that. What actually is deep learning? According to most of the internet, deep learning basically just means deep neural networks. So what is a neural network? It's a machine learning function, so it has inputs and outputs, and like the
name implies, a neural network is just a network of neurons stuck together. So what is a neuron? A neuron is also just a little function, with some inputs and an output — let's figure out what it is. OK, so this is a neuron; there are two components. First, we have our inputs — there could be a hundred of them, but right now we have two, x1 and x2, which could be our two features for a given HTML file — and each input has an associated weight parameter, one of the knobs we're going to optimize during training: a parameter in our function. The first component of a neuron is called the weighted-sum component. I've added some sample values here: we're just taking a weighted sum of the inputs — 1 plus 6 equals 7 — and adding a bias term, which I'm randomly setting to 1 here, so we get 8. The second component of a neuron is called the activation function. Activation functions need to be differentiable, and you usually want them to be nonlinear, but they tend to be incredibly simple. As an example, literally the most popular, most commonly used activation function out there is the rectified linear unit, or ReLU, and all it asks is: which is bigger, my weighted sum plus bias, or 0? And it returns
whatever is bigger. That's the most common one, and they tend to be very, very simple. So in this neuron, given our inputs, the output is just going to be 8, because 8 is bigger than 0. As you can see, this is a pretty simple function that you could write out on a piece of paper: max of zero — because we're using a ReLU activation — and the weighted sum plus bias, equals some number. And this is a very small example of an actual neural network. On the left-hand side we have our inputs — in this example, three feature inputs. Each of these is sent to a neuron, and based on the weight and bias parameters in each neuron, you get a different output. All of those outputs are then separately sent to this final neuron, which takes the first neurons' outputs as its inputs and gives us the final output score for the entire model. So you might be able to imagine that with the right weight and bias terms — and maybe a couple thousand more neurons — a neural network like this one could theoretically take features from an HTML file as input and output some sort of meaningful score that predicts file maliciousness.
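Written out, a single neuron really is this small. The numbers below reproduce the slide example: weighted sum 1 + 6 = 7, bias 1, ReLU, output 8.

```python
# One neuron, written out by hand: a weighted sum plus bias, then a ReLU
# activation. The numbers match the slide example: 1 + 6 = 7, bias 1 -> 8.
def relu(z):
    return max(0.0, z)          # "which is bigger: my value or 0?"

def neuron(inputs, weights, bias):
    weighted_sum = sum(w * x for w, x in zip(weights, inputs))
    return relu(weighted_sum + bias)

out = neuron(inputs=[1.0, 2.0], weights=[1.0, 3.0], bias=1.0)   # -> 8.0
```

A network is just many of these wired together, with one neuron's output becoming another's input.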
Because if you look back at our neuron, remember: the first part of a neuron is just a linear combination of the inputs, which is sort of like a regression, so with the right parameters you can get something interesting going on, at least. So how do you get these weight and bias terms? The model optimizes them during training by looking at your training data. I won't go super into the nitty-gritty math details, but the basic idea is that we start off with randomized parameter values. Say we take a malicious HTML file — with three feature values of, say, one, four, and two — and we feed it into our neural network. We know it's bad, so we know the output we want is 1. We propagate it through our network and end up with a final output of, say, 0.239, and now we just ask one question: how can I modify my parameters — the knobs — such that our output moves just a little bit closer to 1? In real life we use some fancy calculus — derivatives and chain-rule magic — to calculate this very quickly, but you can think of it in a much simpler way: the model takes a parameter, moves it a tiny, tiny amount, and re-evaluates the output. If the output moved in the right direction, we know that's the direction in which we want to change that parameter; if it didn't, we know it's not. That's the basic idea: we're just evaluating all these files and nudging our parameters a tiny little bit in the right direction each time. We do this over millions of samples — in our toy case a few dozen data points, but in real life millions — and when you do it over and over and over again, you end up with parameters that make this neural network give you really, surprisingly accurate outputs.
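The "nudge a knob, see if the output improved" intuition can be run literally on a tiny two-neuron network. Everything here is invented for illustration, and real training computes the same directions with calculus (backpropagation) instead of trial and error:

```python
# The "turn a knob a tiny amount and keep it if the output improves" view of
# training, for a tiny two-neuron network. Real training computes the same
# update directions with calculus (backpropagation); this is only the intuition.
def relu(z):
    return max(0.0, z)

def network(x, params):
    w1, w2, b1, v, b2 = params
    hidden = relu(w1 * x[0] + w2 * x[1] + b1)   # one hidden neuron
    return relu(v * hidden + b2)                # one output neuron

x, target = [1.0, 2.0], 1.0                     # a known-bad file: want output 1
params = [0.1, 0.2, 0.0, 0.3, 0.0]              # "randomly" initialized knobs

def loss(p):
    return (network(x, p) - target) ** 2

step = 0.01
for _ in range(200):                            # many tiny nudges
    for i in range(len(params)):
        for delta in (step, -step):             # try the knob both ways
            nudged = list(params)
            nudged[i] += delta
            if loss(nudged) < loss(params):     # keep only if we got closer to 1
                params = nudged
```

After enough nudges, the network's output for this file sits very close to the target of 1.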
OK — what I've shown you so far is a neural network, but the network I showed you before was not actually a deep neural network, because there was only a single hidden layer. A hidden layer is basically any layer in between our input and output layers — I should actually use this laser pointer. So in the previous slide we had one hidden layer; when you have many hidden layers, that's called a deep neural network, and this is an example of a very tiny one. Deep neural networks are really, really powerful because their structure basically allows them to learn a deep, nested hierarchy of concepts. In other words, lower layers learn to develop simple transformations of your original input features, and later layers learn to use those representations as their own inputs. So this layer might represent a somewhat complex transformation of the original inputs; the next layer gets to use those outputs as its inputs, and if an earlier layer found something interesting, a later layer can use it in different ways to figure something else out. That's why I say they learn a deep, nested hierarchy: a concept is learned here, and then it's used in different ways by later layers, which allows for a super efficient representation of ideas and patterns. So this kind of neural network is able to map out incredibly complex patterns
because later layers are able to take advantage of lower layers' learned patterns in different ways — there's a sort of multiplicative effect. OK, I want to do a really quick demo to give you guys an idea of what this means — let's see if I can click on the right thing... there we are. This is a really cool website built on TensorFlow, the open-source deep learning framework produced by Google, and it basically lets you build really tiny neural networks with some sample data, for fun. Here on the left we have our input features, x1 and x2, and here we have an actual neural network that we can build — we can add hidden layers and take hidden layers away — and this graph shows our final output neuron, which is just going to give us a score. This is our training data, and we want to find some decision boundary, so when I press this play button over here, the website is going to have the machine learning model we've built look at that training data, optimize the parameters, and try to find a decision boundary. So let's press play. Right now this is just one neuron connected to a final output neuron — a pretty simple neural network, if you can even call it that — and it does very well, because our data is pretty simple. But what happens when we have a bit more complex data? Let's pretend that file maliciousness is defined by number of elements and file size in a circular pattern — I don't know why. Can we find this with one neuron? Not really: one neuron is just a linear combination of the inputs with a little activation function, so how is it supposed to find a circle? It can't. However, if we add more neurons, maybe it will be able to. With two, it does a little bit of a better job, but it still hasn't found that circle; if we add three neurons it does a little bit better — it actually finds a nice shape. So that's really cool, but let's throw something a little bit harder at it: this spiral shape. And again, we're not telling our neural network, "Hey, look for a spiral shape," or "Look for a circle" — we're just giving it the data and saying, please solve this for me. Even with a lot of neurons in a single layer this is pretty tricky: given a ton of neurons it would eventually find the spiral shape, but it's very, very hard, and it won't get there fast. However, if we add some more hidden layers, it can start using that benefit of a deep, nested hierarchy of concepts. So... something's happening, but it's not great, so I'm going to add two more layers, give them some neurons — six or so — hope this works, and press play to train. What I want you guys to look at while this is training are the feature activations in these little blue boxes. They're basically showing the decision boundary that each neuron is finding. In this first layer they're very simple — basically just lines, because each is just a single neuron — but these later layers are learning much, much more complex feature representations, because they're taking the first layers' neurons as inputs. This final layer has the most complex feature representations, and — while I drink some coffee — it ends up being able to find a sort of spiral shape, which is pretty cool, because we never told it to find a spiral shape. And if we gave it more data, it would do an even better job. OK, so hopefully that gives you guys an idea of the power of deep learning. Let me go back to the slide deck.
OK, cool. So, deep learning. One really cool thing about deep learning is that it scales super beautifully with a ton of data. When you throw a bunch of data at a lot of classical machine learning methods, one of two things generally happens. One: your computer explodes, because it's just computationally intractable — it's not going to happen. Two: your model just doesn't get much better. A model trained on thirty million files does great, and then when you train it on forty million files it's basically the same model, because there just isn't much model capacity — there's no more space to fit all these new patterns in; it's not flexible enough. Deep learning in general is not like this: it scales beautifully with a ton of data, and you usually keep getting better and better and better, even when you've trained on something like a hundred million files. That's super cool, and it's related to this idea of a deep, nested hierarchy of concepts: we're able to learn concepts very, very efficiently because of our multiple layers, and we can also just store a ton of information because of the size of these models. Normally, when we're actually creating these neural networks in real life, they have thousands of neurons and millions and millions of parameters — that's a lot of information we're able to compress in there, where other machine learning models often have far fewer parameters. So the size of these models, combined with their deep structure, allows them to learn super, super complex feature representations and scale with a ton of data. Oh, and I did want to mention: this is great for cybersecurity, because we have a ton of malware data, so it ends up working really well in theory and in practice — at least the team I work on generally finds that deep learning models are about the best you can do, and they often blow other methods away. So that's pretty cool. What I showed
you here was actually an example of a feed-forward neural network, which is basically like a cheese pizza: you can toss on other toppings and get a new type of neural network, but this is sort of your bread and butter. There are lots of other kinds of structures — for supervised learning, for unsupervised learning, for different types of machine learning — that you can look into. OK, so to sort of wrap up what we've learned so far: machine learning models are just functions — they have inputs and outputs, and a bunch of parameters that we optimize by looking at our training data and trying to minimize loss, and there are various mathematical procedures to do this. Deep learning is a subset of machine learning — just a special family of machine learning functions — and it just means deep neural networks, at least according to most people. Neural networks are just networks of neurons; a neuron is just a weighted sum and an activation function; and a deep neural network is just a neural network with lots of hidden layers. They're really, really excellent at performing complex, human-like tasks, and given enough data they can get really extraordinary results. OK — let's see how much time I have left... OK, good. So I want to give a quick
run-through of one of the deep learning models that we've built on our team. It might not be easy to see this, but this is sort of the architecture of our short-string detector, which we use for URL classification. So we give it a string like, you know, facebook.com, and it looks at it and tells us if it thinks it's suspicious or not. Um, we generally pair this with our HTML detector for browsers. So I'm gonna run through the structure of this neural network more to give you an idea of what production neural networks sort of look like, and less to fully understand what's going on, so if you do not understand a word, that's okay, it's just
sort of a basic idea. Um, so first we take a URL string as input. It's one-hot encoded, and we apply an embedding that's learned during the training process, so instead of a one-hot-encoded character vector, this is transformed into a 32-dimensional feature vector, so that things like uppercase letters are closer to each other in this 32-dimensional space than they are to lowercase letters; they cluster. Next up we use convolutional neurons, which are a special type of layer of neurons, to extract n-grams. This is usually used in image processing, because you sort of take a sliding window over your data to really make it focus on localized
portion of your portions of your data and so are you we're using it on an actual string just sort of extract engrams um these feature activations are summed up and sent to a bunch of fully connected layers each layer has about a thousand 24 neurons right so that means that there's about a million parameters for each of these connected layers it's sent through three of those and then finally aggregated through what's called a sigmoid output neuron to give us a final suspiciousness score so that's it um what's really cool is that this entire sort of design can be defined in Python just with this code right so you don't need to understand any of this
code but that entire design is right there and just a few lines of cortex and so which full-rate is that in 2018 you can just use super high-level packages and programming languages to to sort of design these models and test them on your computer um maybe not super super quickly and maybe not 100 million files without a bit of money that you can try out this stuff so this is we actually um for this model this is defined in Kerris which is a wrapper to tensorflow which is also available in python which is that open source in cream work produced by Google okay so I'm gonna demo this real fast if we can
Okay, so I'm gonna load up my model right here and test out some URLs. So that I don't cherry-pick, feel free to shout out URLs that you want me to test; I think you can Google them to get the exact URL. As an example, you can put in some URL and it will give us some output, like 0.005. This is very close to zero, so our model thinks this is probably not going to infect my computer. Um, however, if you give it something a little bit more suspicious-looking, like an Amazon login-credentials URL, it's gonna give you a high score. Um, so yeah, any
suggestions? Go ahead. "I don't know, paypal.com, maybe paypal.com/login/lots-of-random-letters?" So the basic idea is that it hasn't memorized a bunch of URLs; it's figured out what looks malicious in a URL, so that even if you give it a URL it's never seen before, it can say, hey, it looks like it's asking you for credentials, and it looks like this is kind of like the PayPal word but a little bit different. Um, or, oh, I didn't put HTTPS, yes, it looks like this is HTTP instead of HTTPS, right, so this is 0.99 because there's
not an S, and, I hope this works, when you add an S it's much safer. Um, right, so stuff like this. So that's the basic idea. Um, yeah, let's see, I'm gonna go to Reddit and get a random URL. See, what's something that's safe for work? Let's go to the machine learning subreddit,
maybe. Oh gosh, Reddit shows me I need to read it more often. Okay, so, I promise, huh, oh, there's no link. Um, we've done these before. Sure, where's that? Oh, oh, I see, one, two, or three, okay, there's a link somewhere. Oh really? Okay, okay, cool. All right, let's see what happens. I hope I won't get fired. Okay, very low, yay. And Reddit itself, let's see, hopefully it will be very, very low. Um, yeah, so this is the basic idea behind these models: we train them on hundreds of millions of URLs so that, you know, they can learn these patterns and figure out that this is not
gonna hack your computer. Okay, so let's see, I can go back to this. Um, I have extra time for questions, so, okay, that's actually the end of my talk, I'm a bit early. Um, but if you want to learn more, I highly recommend things like YouTube; 3Blue1Brown is really, really awesome at explaining neural network theory. Um, Wikipedia, Coursera. I also just helped write a book, so that's fine, you know, what have you. Yeah, so feel free to learn; it's very, very fun. And that's it, so, yes, any questions? Yeah, that's great, I like this one. Oh, okay, great. "Um, this is probably a really basic or even
dumb question. I'm having trouble understanding where hidden layers come in. How do you test or detect for that, or how do you set them up?" Okay, so, yeah, definitely, let me go back through this. Okay, so what you do is: you're not sort of growing hidden layers during the training process. You define your architecture and then you train. So yeah, it's sort of like a hyperparameter. Um, so you're saying, I want a deep learning model with, like, seven layers, and then you train, and then the parameters in that neural network are optimized. "Oh, sorry, yeah, so the actual structure of the model, the actual design of the model, or
the parameters, so, the optimized parameters?" Okay, yeah. No, so they don't self-generate. So, like, in Python, in Keras, you'd say: I want my first input layer, and I want it to have three neurons, and then this is gonna feed to a new layer, and I want it to have four neurons, and this is gonna feed to a new layer with four neurons, and that's gonna feed to a new layer with two neurons. So you define that, and that's fixed; it doesn't change during the training process. And each of these neurons has parameters, its weight and bias terms, and during training all we're altering is just these parameters. Yeah, okay, cool. Yeah, for sure.
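To make that answer concrete, here's a minimal sketch in plain NumPy (not Keras, and the 3, 4, 4, 2 layer sizes are just the made-up ones from the answer above): the architecture is the fixed shapes of the weight matrices, and training only ever changes the numbers inside them.

```python
import numpy as np

def relu(x):
    # activation function: zero out negatives, pass positives through
    return np.maximum(0, x)

rng = np.random.default_rng(0)

# Fixed architecture: 3 inputs -> 4 neurons -> 4 neurons -> 2 neurons.
# Each layer is a weight matrix plus a bias vector; these SHAPES never
# change during training, only the values inside them do.
params = [
    (rng.normal(size=(4, 3)), np.zeros(4)),
    (rng.normal(size=(4, 4)), np.zeros(4)),
    (rng.normal(size=(2, 4)), np.zeros(2)),
]

def forward(x):
    # each neuron is a weighted sum (one row of W) plus a bias, then ReLU
    for W, b in params:
        x = relu(W @ x + b)
    return x

out = forward(np.array([1.0, -0.5, 2.0]))
n_params = sum(W.size + b.size for W, b in params)
print(out.shape, n_params)  # 2 outputs; (4*3+4) + (4*4+4) + (2*4+2) = 46 parameters
```

Those 46 numbers are exactly what backpropagation nudges during training; the list of shapes stays fixed, as the answer says.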
"Yeah, as you're training the model, do you train the first layer first, and then is it important to hold that fixed as you train the upper layers? Like, is that order important?" That's a great question. So it's all trained at once, with math. Yes, google backpropagation, probably watch a YouTube video to learn more, but basically we use the chain rule and some really fun derivative math to calculate which direction all these parameters should be pushed, sort of all at once. "As a follow-up to that: how do you debug incorrectly trained layers, or is that not a thing? Like, for example, you trained this deep net and
you found something odd, so you want to debug it. Is that not possible?" Um, so I guess it depends what you're looking for. You can definitely look at the feature activations, so you can kind of debug it to a certain extent, but deep neural networks are famous for being kind of black boxes, so it can be very difficult, especially, yeah, if it's a hidden layer, because you don't really fully understand what the inputs or the outputs are. So it can be tricky. Um, but you usually can just do things like: did my model get better when I added a hidden layer? If it did, that's great; if it didn't, "what did you
input into that? So, did you go through and configure each neuron with what the weighted average should be? Because you were just going through it saying plus, plus, plus, plus, and it looked like magic, but I'm gonna assume that you had pre-configured those." So it's really simple, and in Keras specifically you basically work in layers. So usually with neural networks you have layers of neurons, so what you usually do is say: I want a layer of ReLU neurons, with that ReLU activation function, and I want there to be 1,024 of them, right? And then you just say: I want another layer, and I want it to be fully connected, which means that every
single neuron in this layer is connected to every single one in the next. So that's why you can design these with amazing speed, because you're not actually defining every single neuron. "So you're not choosing each individual neuron, with, yeah, the weighted average or whatever?" Yeah, I mean, you could, but usually you're working with big layers with thousands of neurons, just because, yeah, that's what you want to do. Um, so it makes things a lot easier.
Yeah, yeah, there are definitely some really cool ways to do that; LIME is a really, really good one. Oh yeah, yeah, that was great. We actually tried out some of those methods on our team and it was super awesome; we actually tried it on the URL model to see those strings. Okay, cool. Um, yeah, for sure, any others? Yeah. Yes, yes, so, yeah: is it trial and error? No, because that's not efficient, but it's a really nice way to think about it. So basically, if you think about it, every single neuron is a weighted combination of the inputs, which is differentiable, put through a differentiable activation function, and you're just sticking these guys together,
so a massive deep learning network is all differentiable. So as a result we can basically just use chain-rule magic to calculate how my output will change if I change a certain parameter, because it's just a partial derivative, and we calculate that super efficiently by going backwards layer by layer, which makes it super fast, and that's how we train these. Yes? Sorry, yes. So, I mean, a background in data science is really nice, a background in, like, probability theory is lovely and I think will always help you out, but I think if you're smart and you're ready to learn,
like, anyone can really do this, especially if you have a background in programming. So yeah, I really don't think there are barriers. Like, honestly, there are so many amazing resources on the internet nowadays. If you really want to, and you're willing to spend the time learning, and you're interested in this stuff, you can totally do it.
Um, let's see. Yeah, so there's A Visual Introduction to Machine Learning, I think that's what it's called, so I recommend this. It's like a visual blog with a ton of really awesome D3-based visualizations, which a book, although, you know, yay for books that I helped write, a book can't do, so that's really awesome, that's super cool. Yeah, so I think this is an introduction to machine learning; I forget if it goes into deep learning later, maybe it doesn't, but it's a really good introduction to the basic idea of machine learning. I think the
3Blue1Brown YouTube author is really awesome. If you're interested in summaries of machine learning papers, I highly recommend The Morning Paper; he does super awesome TL;DRs of complicated machine learning papers. The Morning Paper, very smart dude. Um, it's a blog, yeah, it's just a website with a big blog on it. This guy just reads scientific papers from various different domains, including deep learning and machine learning, and summarizes them in really nice blog posts. I think those are probably my best recommendations. Um, yeah, any other questions? Yeah, oh, sorry, yes. "So, the datasets that you're working on are trying not to be detected; they don't want to be trainable,
so do you just retrain as you collect more data into your datasets, or are you also adding attributes over time?" Um, so we try to train on really, really fresh data. So yeah, we're often training on, like, 30 days of never-before-seen files, because we want to focus on being able to detect files that signatures will miss, so we definitely focus on that. In terms of altering, you're asking if we alter the feature space, like the features that we're extracting? I think we do that a lot less often. Um, one benefit of deep learning is that you don't actually have to spend a lot of time working on complex feature extraction, because if you give a deep
learning model pretty raw features, it'll do the feature extraction for you, right, because it's learning these complex feature representations. So you don't need to say, oh, here's the length of my file, or, you know, the number of A's in my file, if that meant anything; it'll figure that out for itself. Um, so at least for us that tends to be pretty stable, but there's always room for improvement, for sure.
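To illustrate the difference, here's a hedged sketch, with made-up toy features echoing the "length of my file" and "number of A's" examples from the answer: hand-engineered features a human picks, versus the raw character-level input a deep model can consume directly and learn its own representations from.

```python
import numpy as np

# A hypothetical URL, just for illustration
url = "http://paypa1-login.example.com"

# Classical approach: a human picks the features (toy examples,
# echoing the "length" / "number of A's" idea from the talk)
hand_engineered = np.array([
    len(url),              # length of the string
    url.count("a"),        # number of 'a' characters
    url.count("-"),        # number of hyphens
    int("login" in url),   # does it mention credentials?
])

# Deep learning approach: just hand over the raw characters (here as
# integer IDs) and let the network learn its own feature representations
vocab = {ch: i for i, ch in enumerate(sorted(set(url)))}
raw_ids = np.array([vocab[ch] for ch in url])

print(hand_engineered, raw_ids.shape)
```

The second representation carries everything the first one does and more, which is why, as the answer says, the feature space tends to stay stable while the model keeps learning.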
Um, yeah. Um, so deep learning in particular, well, machine learning too, there's something called adversarial examples, which are sort of a big hole in machine learning defenses. So malware authors can definitely, if they have access to a bunch of inputs to one of our models and the model's answers, there are certain ways to craft other models, to craft inputs, that will get us to give the wrong answer. And that's a big problem in machine learning and malware detection that a lot of people are thinking a lot about, and another good reason why you really can't just have machine learning; you really want machine learning and signatures
and other methods of detection. So that's probably the biggest hole I can think of. "Do you know of use cases where I should not use deep learning, use cases it cannot do well?" So, when you don't have a lot of data, deep learning probably is not the best bet. Um, if you really want your problem to be explainable, also probably not your best bet. But if you're really, really concerned with just results, and you have a lot of data, then deep learning is a great idea. "You do an excellent job with deep learning and machine learning tools; do you know of any other industry players that may be doing a decent job, in your opinion?" I
think there are a lot of players out there that are doing a great job, but I also do not really have access to their results, so I plead the fifth. But I think lots of people are working on a lot of super interesting things. But I also like my team, so...
"For your talk, do you use any automated strategies to train your model, or other automated strategies to optimize, like, the structure?" Yeah, um, so we've done automated hyperparameter searches to a certain extent, but, like, Google's tried building a machine learning model to predict the optimal structure of a neural network, and that's great, but it also costs a lot of money, because training neural networks, when you're training them on hundreds of millions of files, is quite expensive. Yeah, so we've mostly limited it to hyperparameter searches and not anything more complex, purely because we don't want to spend, like, two million dollars training our next neural
network. Cool, I think that's all the time we have, so thank you so much, and feel free to hit me up afterwards with questions. [Applause]