Detecting Malicious Certificates Using Machine Learning

Name: Detecting Malicious Certificates Using Machine Learning
Uploaded: 2017-10-15
Duration: 40 min 43 s
Description: Researchers present machine learning algorithms for identifying malicious SSL certificates with high accuracy using publicly available data and Python libraries. The talk covers feature extraction from certificate metadata, classification using Logistic Regression, Support Vector Machines, and Rando

BSides DC 2017 · 40:43 · 495 views · Published 2017-10

Speakers

Abhishek Sharma, Khaled Al-Hassanieh, Jason Reaves

Tags

Category: Research, Technical

Topic: Cryptography, Malware Analysis, Threat Intel

Style: Talk

Mentioned in this talk

Frameworks

scikit-learn

Languages

Python

About this talk

Researchers present machine learning algorithms for identifying malicious SSL certificates with high accuracy using publicly available data and Python libraries. The talk covers feature extraction from certificate metadata, classification using Logistic Regression, Support Vector Machines, and Random Forests, and addresses the challenge of deploying models in low-prevalence environments where legitimate certificates vastly outnumber malicious ones.

Original YouTube description

We present machine learning algorithms for detecting malicious certificates with a high level of accuracy. The performance of our algorithm meets the demands of deploying such models in a product. Interestingly, the key ingredients for building such models are all publicly available! However, one still needs to connect the dots, i.e., collect representative good and malicious certificates from various online sources and/or network traffic, as well as identify which "cookie-cutter" machine learning algorithm, available as Python libraries, to use.

Key takeaways from our presentation: Understand how to leverage the fact that SSL certificates contain information in a structured format to build machine learning models. It is embarrassingly easy to build algorithms for detecting malicious certificates using Python libraries. We will share results for three of them: Logistic Regression, Support Vector Machines, and Random Forests. Identify which attributes are important for distinguishing between malicious and legitimate certificates. The main challenge in deploying these models is the low-prevalence environment, i.e., on average, your network traffic will have orders of magnitude fewer malicious certificates than legitimate ones. How do we fine-tune machine learning algorithms to perform robustly in such environments?

Abhishek Sharma (Senior Data Scientist at Fidelis Cybersecurity): Abhishek Sharma is a Data Scientist and Team Lead at Fidelis Cybersecurity. He develops predictive models for detecting malware and data-science-based products to enhance the productivity of security analysts. Prior to Fidelis Cybersecurity, he was a Researcher at NEC Labs America, where he researched how to use machine learning and data science to improve the efficiency and robustness of complex physical systems found in the power and manufacturing sectors. He has published more than 20 articles in peer-reviewed journals and conferences, and holds 7 patents. He received his Ph.D. in Computer Science from the University of Southern California, Los Angeles, and his 5-year Integrated Masters degree from the Indian Institute of Technology, Delhi.

Khaled Al-Hassanieh (Senior Software Engineer at Fidelis Cybersecurity): Khaled Al-Hassanieh is a senior software engineer at Fidelis Cybersecurity. His work combines software engineering and machine learning. He develops and productizes models for malware detection and other security applications. Before joining Fidelis Cybersecurity, he was a postdoctoral researcher at Los Alamos and Oak Ridge National Laboratories. As a theoretical physicist, he developed theoretical and numerical models to study condensed matter systems. His research has led to 29 peer-reviewed articles in renowned journals. Khaled holds a Ph.D. in Physics from Florida State University.

Jason Reaves (Threat Researcher at Fidelis Cybersecurity): Jason Reaves is a threat researcher at Fidelis Cybersecurity. His work primarily focuses on reverse engineering data structures, algorithms, and botnet protocols found in malware. He develops signatures to detect threats, scripts and programs to automate data or malware configuration collection, and frameworks to automatically harvest threat-related data from various sources to further various research projects. Before joining Fidelis Cybersecurity, he worked primarily on banking Trojan research in the financial industry, creating frameworks that pretend to be infected clients in order to automatically harvest configuration and targeting data.

Transcript [en]

The BSides DC 2017 videos are brought to you by ThreatQuotient, introducing the industry's first threat intelligence platform designed to enable threat operations and management, and DataTribe, a new kind of startup studio co-building the next generation of commercial cybersecurity, analytics, and big data product companies. My name is Khaled Al-Hassanieh, and the talk is Detecting Malicious SSL Certificates Using Machine Learning. I know on the schedule it says "in your spare time"; I didn't add that here. I'll get to that a little later, but it's kind of an accurate statement. So the motivation is pretty much well known by now: there is an increasing share of encrypted network traffic, let's say half

or more than half of network traffic is encrypted, and there is a push to monitor traffic for malware and other threats without man-in-the-middle, without decryption. The SSL certificate is a fundamental part of the encrypted traffic metadata, and it provides us with information that we can use. As with any machine learning or data science project, first you need the data, and in this case we need labeled data, meaning we need certificates that we know are bad and certificates that we know are good. More or less, because there can be some certificates that we think are benign and that turn out to be malicious, but in

general benign certificates and malicious certificates. Collecting benign certificates is not a big problem; there are many available. We use Censys as a source; there are millions of certificates there, and we sample the trusted certificates from there. We can also sample from the Alexa 10,000 domains, and from trusted certificate authority root certificates; these are self-signed, and there are a few hundred of them. Getting the malicious certificates is a more challenging problem, and this is done by our threat research team, particularly Jason Reaves. Basically, they collect the IOC data of different malware, and for the ones that use SSL we store this

data, and then we go back and scan it, meaning the IPs and ports, and pull out the certificates. For this study, so far we have a data set of about 13,200 certs, including roughly 4,300 malicious certs and around 9,000 benign certs. We want to keep the data set from being too imbalanced: we could collect millions of benign certificates, but that wouldn't be very helpful. If we look at one sample cert, some of the fields there are the serial number, the begin date,

which is the issuing date, in this case 2014, and the end date, the expiry date, which in this case is 2026. The subject fields include the country (US), the state, locality, organization, organization name, email address, common name, and in this case there is an unstructured name, which is not usually there. For the issuer, in this case we only have one field, the organization. Then there are the key length, the key type, and the signature algorithm. In general there can be more fields for the issuer, including all of these fields. So far it doesn't look very bad; the names don't look terrible. If you look at the issuing and end dates, there is a

large time span, and this is usually uncommon for regular certs. It can be common for root certs, meaning certificate authority root certs, but this is not a root cert, so there is something that might be malicious. Another important thing in a cert is the extensions; these are additional fields that can be present in a cert. I'm going to zoom in on the extensions here. The ones that are present in this cert are: basic constraints, which includes a CA flag saying whether this is a certificate authority, and there can be another field called path length, meaning how many certs can be authorized in a chain. The zero in this field means that these

extensions are not critical; each extension can be critical or not critical. Key usage, which in this case has three entries. Extended key usage, which is relevant mostly for leaf certificates. And subject alt name, which in this case has a DNS name. Other extensions can be present in addition to the ones that we see in this certificate, including issuer alt name, subject alt name, authority key identifier, certificate policies, and so on. As I mentioned, each extension can be critical or not critical. Now let me talk a little bit about the machine learning classification algorithms that we use. The basic idea is that we want to build a model, train a model

using the data and some algorithm, and after that we give it a certificate and it will tell us whether it's malicious or not. We use these well-known algorithms for classification: random forests, logistic regression, and support vector machines. At any point in this work we can switch between the algorithms; we're not really committed to any of them. However, we compared the accuracy, and in this case the performance we mostly care about is the false positive rate. Of the three, using the same data sets, the random forest gives us a better, lower false positive rate: 0.3 percent (I'll get into the details later) compared to

0.5 percent for logistic regression and 0.7 percent for support vector machines. In addition, and this is important, beyond the labeled data set that we use for training, we want to test how the classifier does in the wild. So we collect data from a real network, not related at all to the training data, and we test the algorithm there as a sanity check, because it might work for the data set that we have, but then when we put it in the wild it might go crazy and give a lot of alerts. So for the rest of the presentation

I'm going to be using random forests. A few words about random forests: a random forest is an ensemble of decision trees. So what's a decision tree? This is an example of a decision tree that I got from Wikipedia. Unfortunately it's not a happy one: it's a classification of passengers by whether or not they survived the Titanic. It goes like this: is the passenger male? If no, you have a high probability of surviving, 0.73, a 73% chance. If yes, you go down and look at another feature, another property of the passenger, the age: is it greater than

9.5? If yes, then you have a very high chance of not surviving. If no, then you go down further; you need more information, so you look at another attribute, in this case the number of family members, and you go yes or no. So this is a decision tree. Now, to build the decision tree, you want to determine which features you look at, which features you decide on, and the split. This is done using information gain: first you look at the feature that gives you the highest information

gain from the split, which in this case is the gender, and you go down. By the end of the training you have built multiple trees, which gives you your random forest, an ensemble of trees. Then, when you have a new certificate, you pass it through all these trees, look at the decision of each tree, take the average, and that tells you whether it's malicious or not. We use the widely used, well-known scikit-learn Python library. These calculations are not highly demanding; I do that on my

laptop, so that's part of why we say "in your spare time": it's not highly demanding. Now I'm going to get into the core of the work and look at the results. The first step: we have the certificate, we have the attributes; what do you use from these attributes? You can't just give the certificate as it is, so we want to extract information in some representation of these attributes. We have these general options. For a general feature, let's say issuer country, we can use just the value as it is, the issuer country is the US, for

example, or whether it is present or absent; when there is no actual country, this can be another way to encode it. The length of a string value is useful in the case of the serial number, because the serial number by itself doesn't really mean much. The extensions are similar: one option is just to look at whether the extension is present or absent, critical or not critical. For some extensions we dig deeper and look at the specific values, since most extensions have multiple values, and for some extensions we look at the number of values: number of subject alt names, number of issuer alt names, and so on.
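The general encoding options described above can be sketched in a few lines of Python. This is only an illustration: the dictionary layout of the parsed certificate, the `EXTENSIONS` list, and the function name `encode_general` are assumptions made for this sketch, not the speakers' code.

```python
from typing import Any

# Hypothetical list of extension names used only for this sketch.
EXTENSIONS = ["subjectAltName", "keyUsage", "certificatePolicies"]


def encode_general(cert: dict[str, Any]) -> dict[str, Any]:
    """Apply the general encoding options to a parsed certificate (assumed dict layout)."""
    features: dict[str, Any] = {}
    issuer = cert.get("issuer", {})

    # Option 1: use the value as it is (for example the issuer country).
    features["issuer_country"] = issuer.get("C", "absent")

    # Option 2: encode only whether the field is present or absent.
    features["issuer_country_present"] = int("C" in issuer)

    # Option 3: use the length of a string value (the serial number).
    features["serial_length"] = len(cert.get("serial", ""))

    # Extensions: absent, present and critical, or present and not critical,
    # plus the number of values the extension carries.
    extensions = cert.get("extensions", {})
    for name in EXTENSIONS:
        ext = extensions.get(name)
        if ext is None:
            features[f"ext_{name}"] = "absent"
        elif ext.get("critical"):
            features[f"ext_{name}"] = "critical"
        else:
            features[f"ext_{name}"] = "not_critical"
        features[f"ext_{name}_count"] = len(ext["values"]) if ext else 0
    return features


# Made-up certificate, loosely shaped like the sample discussed in the talk.
sample = {
    "serial": "00a1b2c3",
    "issuer": {"O": "Example Org"},
    "extensions": {
        "subjectAltName": {"critical": False, "values": ["example.test"]},
        "keyUsage": {
            "critical": False,
            "values": ["digitalSignature", "keyEncipherment", "keyCertSign"],
        },
    },
}
features = encode_general(sample)
```

Categorical values such as `"absent"` or `"not_critical"` would still need to be turned into 0/1 columns before they can be fed to a classifier.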

I'm going to get into this feature extraction from attributes in some detail. This is basically the bulk of the work: the algorithms are well known and well established, and the bulk of the work is actually two parts, obtaining the malicious certificates and doing the feature extraction. For the serial number, we use the length. The value of the serial number doesn't mean much; if two serial numbers are close, that doesn't mean there is similarity. We use the length of the serial number because there is a correlation between the certificate authority and the length of the serial number, so

there is some information there. Issuing and end dates: how do you use those? We use the validity period in years (the certificate we saw has a validity period of 12, so that's a feature, 12) and whether it's expired or not. The version of the certificate can be 2, 3, etc.; we use the value of the version. The key length has specific values, 2048 and so on, so we use the value itself. Key type, similarly. Signature algorithm, we use the value itself. For the extensions as a set, we use the number of extensions and whether there are extensions or not; there is

some redundancy here, but the main idea is to use the number of extensions. For each of the extensions, the encoding takes three different values: absent (the extension is not there), present and critical, and present and not critical. That's the general encoding; for some extensions we dig deeper. We look at the number of subject alt names and the number of issuer alt names. Key usage has a specific set of values, and any of these values can be present or absent, so we look at each value, for example digital signature: if it's there, it takes a

value of one; if it's not there, it takes a value of zero. Other values are encipher only and so on; I think there is a set of nine values. Extended key usage is similar. For the authority key identifier, we use the value as it is, because in principle there is a limited number of certificate authorities, so the value set is not overwhelming. For the subject and issuer fields: for the issuer country, we use the value as it is (US, Japan, China, etc.), and whether it's valid or not is another feature. We sometimes see values like "XX", "--", and so on; these are

not valid countries, so we use that as a feature as well. For all other issuer fields we use the values, because it's a more restricted set than the subject values. For the subject, we use the country, similarly to the issuer country: the value and whether it's valid or not. For the other subject fields, because there is a large set and it can be anything, the encoding is whether it's absent or present, so 0 or 1. Now, with all this we're still not ready to give it to the machine learning algorithm; we have to do a

little more encoding. Let's say you have a small training set, and you collect the values; this is basically the universe, the feature space that you end up with. You have some certificates with issuer country China, US, Japan; some certificates with version 2 or version 3; serial number lengths 3 and 7; some certificates where key usage is critical; some that have the value digital signature in key usage; and so on. This is a small subset. If we look at one of these certificates, it will be a vector of 0s and 1s. In this case the certificate has issuer country

US, it has version 3, it has 7 as the length of the serial number, and so on. So we take this numerical vector of zeros and ones and we feed it to the machine learning algorithm that we trained, and it will tell us whether it's malicious or not. Basically, this is the well-known machine learning pipeline with details related to our work on certificates. Now we start looking at some results. What we're mainly concerned with are the false positive and false negative rates, with more emphasis on false positive rates; I'll explain why. As I mentioned before, the data set that we

use here has 13,200 certs, and we do an 80/20 train/test split, meaning we sample 80 percent of these certificates and use that as our training data set, we train the model, and then we use the remaining 20% of the certs for testing. We look at two measures. The false positive rate is the number of certs that the classifier says are positive (malicious) out of the number that are actually negative (benign), meaning it wrongly classified them as positive. The total number of negatives is the ones that are truly

negative and classified as negative, the true negatives, plus the ones that are wrongly classified as positive. The false negative rate is very similar, the mirror image: the number of malicious certificates that were classified as negative, meaning benign, over the total number of positive certs. This is a random forest, and it gives us a probability, so we vary the probability threshold, meaning where we set the threshold at which we decide a cert is malicious. If we increase this probability threshold from 0.5 to 0.9, meaning that we want to be more sure that the certificate is malicious, the false

positive rate, as expected, will decrease from 0.3% to 0.1% (the values shown are multiplied by 100, so they are percentages). The false negative rate will increase as we increase the threshold, because we want to be more sure, so we miss some certificates that look malicious but don't look very malicious; their probability is, let's say, 0.5 or 0.6. We miss these, and the false negative rate increases. So this trade-off is there, and you have to decide which one you care about more. In our case we care about the false positive rate, because there is a large number of benign certificates, and if the false positive rate is

high, it will give you a lot of warnings, a lot of certificates flagged as malicious that are not actually malicious, so you will be overwhelmed by the number. So these are the numbers we get: 0.1 percent and 5 percent. I can say a few words about what that means. A 0.1 percent false positive rate means that if you have 1,000 benign certs, the classifier will tell you that one of them is malicious. In return, five percent, or five

point five percent, means that out of one hundred malicious certs you will miss five or six; the classifier will tell you they're not malicious. So this is what the numbers mean, and this is a trade-off. As I mentioned before, it's very important that we test this on independent, real data that we collect from the field, to make sure that these results are not just a property of the data set that we have and that the classifier works well in the field. Now, how do these results depend on the data set size? Why do we look at

that? Because, as I mentioned, the number of malicious certs that we have is an issue; we're bottlenecked on that, and we want to make sure that this doesn't hurt us. Here I'm plotting the false positive rate as a function of the number of certificates in the data set. These numbers, 2,000, 4,000, etc., are sampled from the big data set, which is about 13,000 certs. We do the sampling many times, and for each sample we do the train/test split, also many times. We calculate the false positive rate, which is what we care about, take the average, and eventually we get this result. This is for a probability

threshold of 0.5, which we saw before in that table, where the false positive rate was around 0.3%. At the beginning, for a small data set, the false positive rate is high, then it decreases and saturates. If you fit a curve to this plot, it saturates at around 6,000 to 8,000 certs, meaning that we have a good number of certs for training. With that said, we continue to harvest malicious certs; we're not stopping here. Why? Because if we want to squeeze out more accuracy, a lower false positive rate, we can still look at more

data, we can still do more feature engineering, and we can look at other, more complex methods, meaning deep neural networks, and see if this trend changes, if the false positive rate keeps decreasing with data set size, meaning we can still gain from collecting more data. We can also still gain by making the classifier work better in the field. The final set of results I'm going to show is the feature importance. We looked at the attributes and how to extract features from them; we used a bunch of them, and we want to get some intuition as to which of these features, which of these attributes, is more

important in determining whether a certificate is malicious or not. This is the table of the most important features, with the feature importance as a percentage; this is what we get from the random forest classifier. What I'm showing here are the features that have importance higher than 1.1 percent, so this is the last one here; other features have importance lower than 1%. These all add up to around 93%, so this is the bulk. If we look at these features: certificate policies, which is an extension, has the highest importance. Authority info access: this

usually has links to the certificate authority URLs. Certificate policies also has links to specific policies related to SSL certificate authorities, or PKIs, and so on. Subject alt names gives different names for a website and so on. This one is TLS web server authentication in extended key usage. Critical key usage means whether the key usage extension is critical or not. Authority key identifier is an ID for the certificate authority, and so on going down. This one here is the issuer country, whether it's valid or not, and then here is the value of the issuer country, the issuer organization, and so on, the length of the serial number, and so on. What we see here is

that extensions have particularly high importance, in addition to issuer fields, so you cannot get away without looking at these attributes. I also want to point out that there is a lot of randomness here: it's a random forest, there is sampling and so on. So you might run another set of calculations, with sampling and averaging, and you might not get this exact order or these exact numbers, but the most important ones will be there. The one in third place might be second instead, and so on, but the most important attributes are going to be there; the order might change a little bit.
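A minimal sketch of how such a feature-importance ranking can be read off a scikit-learn random forest is shown below. The data here is synthetic and the feature names are invented for illustration; only `RandomForestClassifier` and its `feature_importances_` attribute come from scikit-learn, which the talk names as the library used. The numbers it prints are not the talk's results.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 2000
y = rng.integers(0, 2, size=n)  # synthetic labels: 1 = malicious, 0 = benign

# Illustrative 0/1 features. The first two are made more common in benign
# certs (an assumption of this toy data); the last two are pure noise.
feature_names = [
    "ext_certificate_policies_present",
    "ext_subject_alt_name_present",
    "noise_a",
    "noise_b",
]
policies = np.where(y == 0, rng.random(n) < 0.95, rng.random(n) < 0.10)
san = np.where(y == 0, rng.random(n) < 0.90, rng.random(n) < 0.40)
noise_a = rng.random(n) < 0.5
noise_b = rng.random(n) < 0.5
X = np.column_stack([policies, san, noise_a, noise_b]).astype(int)

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# feature_importances_ sums to 1; multiply by 100 to read it as a percentage.
importances = forest.feature_importances_
ranked = sorted(zip(feature_names, importances), key=lambda p: p[1], reverse=True)
for name, importance in ranked:
    print(f"{name}: {100 * importance:.1f}%")
```

Changing `random_state` (or removing it) is one way to observe the run-to-run variation the speaker mentions: the exact percentages shift, while the strongly informative features tend to stay near the top.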

Now I'm going to take the first five attributes and look at them in a little more detail. I look at the prevalence of these attributes in benign certs and in malicious certs. Basically, the idea is to get some intuition as to what a malicious cert looks like and what a benign cert looks like, what to expect. Certificate policies has the highest importance, and it turns out that this extension is present in 95.89 percent of the benign certs, while it's present in only 9.75 percent of the malicious certs. Authority info access: same thing, very similar numbers. Subject alt name: again similar numbers. This is subject

alt name meaning whether it's present or not; it has a slightly higher presence in malicious certs than the other two. Critical key usage is mostly present in benign certs and mostly absent in malicious ones. Authority key identifier is a similar story, except that it's a little more prevalent in malicious certs, which is most likely why it gets lower importance than the attributes above. We go back to the sample cert that we saw at the beginning and look at the extensions, since we saw that the extensions have particularly high importance. These are the extension features that have high importance, and we look at them in

this cert. Certificate policies is not present here; it should usually be present. Authority info access is not present. Subject alt name is present. TLS web server authentication is there. Critical key usage: the key usage here is not critical. And the authority key identifier is not present here. So we have some bad features, on the wrong side of the feature split, and two that are OK, and it turns out that this is actually a malicious cert.
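The prevalence comparison and the feature-by-feature look at the sample cert can be sketched as follows. The eight toy rows and the helper names `prevalence` and `typical_class` are made up for this illustration; the real prevalence figures are the ones quoted in the talk, not the ones computed here.

```python
def prevalence(rows, feature):
    """Fraction of certs in `rows` in which a 0/1 feature is set."""
    return sum(row[feature] for row in rows) / len(rows)


# Toy data only: 1 means the extension is present.
benign = [
    {"certificate_policies": 1, "authority_info_access": 1, "subject_alt_name": 1},
    {"certificate_policies": 1, "authority_info_access": 1, "subject_alt_name": 1},
    {"certificate_policies": 1, "authority_info_access": 1, "subject_alt_name": 1},
    {"certificate_policies": 0, "authority_info_access": 0, "subject_alt_name": 1},
]
malicious = [
    {"certificate_policies": 0, "authority_info_access": 0, "subject_alt_name": 1},
    {"certificate_policies": 0, "authority_info_access": 0, "subject_alt_name": 0},
    {"certificate_policies": 0, "authority_info_access": 0, "subject_alt_name": 0},
    {"certificate_policies": 1, "authority_info_access": 1, "subject_alt_name": 1},
]


def typical_class(feature, value):
    """Return the class in which this feature value is more common."""
    p_benign = prevalence(benign, feature)
    p_malicious = prevalence(malicious, feature)
    like_benign = p_benign if value else 1 - p_benign
    like_malicious = p_malicious if value else 1 - p_malicious
    return "benign" if like_benign >= like_malicious else "malicious"


# Extension presence as described for the sample cert in the talk.
sample = {"certificate_policies": 0, "authority_info_access": 0, "subject_alt_name": 1}
verdicts = {name: typical_class(name, value) for name, value in sample.items()}
```

This per-feature view only gives intuition, as the speaker notes; the actual decision comes from the trained classifier looking at all features together.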

Now, finally, some summary and conclusions of all this work. We applied machine learning successfully to classify SSL certificates. We calculated the feature importance and prevalence, and this gives us a good intuition for looking at a certificate and guessing whether it's good or bad. It's not exact: sometimes you can see a lot of these features that look good, but there are other features, and it ends up being bad. But you get a good intuition. We studied the effect of the data size; being limited by the number of malicious certs, we studied

that, and we see that it doesn't really hurt us. Some future directions: obviously we're going to keep collecting malicious certs, dig deeper into features, and try to squeeze out higher accuracy. Again in the spirit of looking for higher accuracy, we will look into whether using more sophisticated algorithms will help us, mainly deep neural networks, which are very trendy now. We will also look at the cert in a broader context. In this study so far we've been looking at the certificate itself, and we can look at a broader context, meaning the certification chain, the parent, the child, and so on, more metadata from the TLS traffic, and so on.
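As a recap of the pipeline summarized above (0/1 feature vectors, an 80/20 train/test split, a random forest, and a probability threshold that trades false positives against false negatives), here is a rough sketch using scikit-learn. The data is synthetic, and the feature layout, class ratio, forest size, and the helper `rates` are assumptions of this sketch, so the rates it prints are not the talk's numbers.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 4000
y = (rng.random(n) < 0.33).astype(int)  # synthetic labels: 1 = malicious

# Five informative 0/1 features (set more often in benign certs) and five noise ones.
signal = rng.random((n, 5)) < np.where(y[:, None] == 1, 0.2, 0.8)
noise = rng.random((n, 5)) < 0.5
X = np.hstack([signal, noise]).astype(int)

# 80/20 train/test split, as described in the talk.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)
p_malicious = forest.predict_proba(X_test)[:, 1]  # probability of class 1


def rates(y_true, scores, threshold):
    """False positive rate and false negative rate at a probability threshold."""
    predicted = scores >= threshold
    false_pos = np.sum(predicted & (y_true == 0))
    false_neg = np.sum(~predicted & (y_true == 1))
    fpr = false_pos / np.sum(y_true == 0)  # FP / (FP + TN)
    fnr = false_neg / np.sum(y_true == 1)  # FN / (FN + TP)
    return float(fpr), float(fnr)


results = {t: rates(y_test, p_malicious, t) for t in (0.5, 0.7, 0.9)}
for threshold, (fpr, fnr) in results.items():
    print(f"threshold={threshold}: FPR={100 * fpr:.2f}%  FNR={100 * fnr:.2f}%")
```

Raising the threshold can only shrink the set of certs flagged as malicious, so the false positive rate cannot go up and the false negative rate cannot go down, which is the trade-off described in the talk.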

So with that, thank you. This work has been done by myself and Abhishek Sharma on the data analytics team at Fidelis, and by Jason Reaves on the threat research team; he has this project of collecting the malicious certs. We are hiring, so if you're interested please talk to me or to Daniel, with the Fidelis T-shirt, and you can look at the Fidelis Cybersecurity website. All right, thank you.

Yes, actually, it's not only the Alexa 10,000. We use the Alexa 10,000 and sample from that, but we also have Censys, which has millions of certificates, and we sample from those. That set is not necessarily the most

popular. Well, it has flags for whether these certs are trusted or not. Again, a cert can be trusted and malicious, but we're relying on the fact that the percentage of malicious certificates among trusted certificates is low. So there might be a little bit of mislabeling, but, you know.

Yes? What's your actual response time, are you talking about... Oh, no, it's definitely not hours, not minutes; it's close to real time. So the response time in real life, meaning out there on the network: you're collecting the certificates, you want to classify them, and how fast is that, right? I

we haven't gotten into that detail yet. We're working on putting that in the product. We have a database; we collect from the research side and from the field. Looking at one certificate is extremely fast; I don't think it's anywhere near a second. So, yeah, true.

Specific types of malware: have we partitioned the malicious set of certificates into different malware types, different actors, and so on? We haven't looked at that yet. One reason is that we don't have a large data set, so if you partition it further, the training is not going to be good enough. Yes?

[Music] Yes, so I think there is some work on tweaking malicious certificates so that they look benign, basically fooling the classifier. This is something I haven't mentioned here, but it is something that we are going to be working on. A

feature that is not present in the certificate itself? I don't have one. As I mentioned, we're going to look at the broader context, and I'm sure we'll get something there. Thank you. Yes?

So it's like this: in our product we have multiple components. One of them is the sensor, which is the one that looks at the wire and collects the metadata; it also collects certificates as part of the metadata. We send it to our collector, where we store the metadata, so we'll have a database of certs there, and then we're going to look at the certificates in that database. We could do that in real time: once the sensor captures a certificate, we could classify it, but

this is not the architecture we're designing. In the architecture we're designing, the sensor sends it to the collector, and then we go to the collector and look at the certificates in the database. Yes?

so the reason I ask is we want to be better owned object based I guess we do go time of learning long before a few machine learning is not the certificate yet I see yeah we can't go it but I think this is not what we were choosing to do

I don't have the details; it's Jason that's doing the work. I would guess that some of them are not still live. I don't know what the ratio or the percentage is, or how limited we are by that, but I'm sure it's there. Yes? The code: I can't answer that, sorry. I have to check, but if you can send me an email, I will let you know. No problem.

All right, thank you.
