
Building a Predictive Pipeline to Rapidly Detect Phishing Domains

BSides Charm 2018 · 40:14 · 41 views · Published 2021-05 · Watch on YouTube ↗
Category: Technical
Style: Talk
About this talk
Phishing domain detection has traditionally been reactive, but monitoring SSL certificate issuance in real-time via CertStream enables proactive identification. This talk presents a Python-based machine learning framework using logistic regression to classify newly registered domains as phishing based on linguistic and structural features, demonstrated with a live command-line tool and practical analysis of caught domains.
Original YouTube description
Building a Predictive Pipeline to Rapidly Detect Phishing Domains Registering a new domain, requesting an SSL certificate, and installing it on the server got much cheaper for threat actors thanks to the LetsEncrypt Certificate Authority. Detecting new phishing domains has always been a reactive process for security teams; just like malware, one cannot provide threat intelligence on phishing domains before they're registered and operationalized. The development of CertStream adds an interesting dimension for how this process can be improved. SSL certificates, and the domains for which they are issued, can now be monitored in real-time. Security analysts have intuition on what a phishing domain looks like when they see it. Building a predictive pipeline to detect SSL certificates issued to new phishing domains can be accomplished very simply using supervised machine learning. In this talk, I'll introduce a Python-based framework for building this predictive pipeline from scratch.
Presenter: Wes Connell (@wesleyraptor) Wes is especially motivated and passionate about dramatically improving data hunting tradecraft within the cyber security domain. He has a very broad range of technical interests, particularly in securing hardware, software, systems, and networks. When he's not hacking the planet, he enjoys playing more golf than is healthy and painfully rooting for the Washington Capitals.
Transcript [en]

oh yep

all set okay great wasn't sure if there's an intro uh thank you for being here thanks for having me thanks to the volunteers and the audio video team by way of introduction my name is wes connell and the topic for my presentation is building a predictive pipeline to rapidly detect phishing domains now this is some work that i had done in the november december time period i'm going to publish my slides and i've also pushed all of my source code to github and i'll be doing a live demo during my presentation as well a little bit about myself i am the security analytics lead at patternex so we are a security startup based out

in silicon valley and before that i was at northrop grumman fortunately for me i was in a role that was doing a lot of security research so a lot of predictive modeling a lot of cyber attacks and things like that and that's for me i've always been a security guru and i was fortunate to sit next to data scientists and things of that nature so for me there's been a lot of dialogue about data science and artificial intelligence and for me all of my everything i've learned has been on the job so i don't have a master's i don't have a phd but i do have a ton of domain experience and i've always been very interested in

statistics and so that's kind of my background here's what the agenda looks like so i'll start off with a little bit of a primer some of the ingredients that go into the work that i had done and also the elements of where i'm applying this work so the first of which is the certificate transparency log network i'll get into some of the phishing i think there's a lot of you know in this crowd it's universally understood whether it's malicious downloads or credential theft and then some of the ingredients again that make up supervised machine learning i'll go into the implementation some of the training data the features feature engineering extraction and the accuracy that i'm uh using to uh

determine if you know the suitability of my project and at the end again we'll circle back with the command line tool and this ipython notebook that i've built and i should have it's a 50 minute time slot so i've got a lot of flexibility and should have plenty of time for q a and i'll interject as i go along so with that for my audience awareness how many of you are familiar with the certificate transparency log network okay great a couple hands and how many of you have ever trained a predictive model okay more hands than i was expecting this is great so the certificate transparency log network is all about open auditing and monitoring for the

issuance of ssl certificates and you might notice you know what on earth does this you know could this possibly have to do with twitter there's the twitter check mark and i thought that was a good analogy so i don't work at twitter but i would imagine that you know for validating identities let's say if you were to look for mark wahlberg you would search for it on twitter and there might be 30 or 40 accounts and your visual cue for which one is actually him is the account that has the check mark and i imagine that if instead at twitter of having you know they being the sole authority they had outsourced that to 170

different authorities and said we're not going to do any monitoring no auditing they will individually you know use their different barriers and metrics for validating identities that would sound a little bit crazy but that's exactly what we have with ssl certificates today so when you buy a domain name and you want to prove that hey i own this i want to get an ssl certificate they use those certificates to validate you know who that is and there's no way you've got lots of different attacks whether it's issuing subordinate certificates uh or a certificate authority gets popped you have the issuance of these certificates and again there's no monitoring so that's what this is all about

it was a project that started in 2013 and for the work that i've done it's the data input so every time an authority issues an ssl certificate it also gets logged and when i have that certificate specifically i can see the common name that it's being issued to the domain name and i'm looking for heuristics and things that look like phishing so this is pretty much as soon as they go operational there's a python library called certstream and there's a security team based in silicon valley that was nice enough to write this and this is available on github it is a you know you have this log network for me i really like doing prototyping in

python so this is a python based utility you import it you can just run it i'll show you what it looks like as well and you get this fire hose of domain names that you'll see on the right hand side and then for the problem space we're looking at phishing and if that's not exponential i don't know what is but this is the adoption of phishing attacks being conducted via https and you know when you look historically especially the last three or five years on different attacks you've got exploit kits um which you know they're still around they're still present but the sophistication for them can be pretty high versus phishing you think you can get a domain name for free

for now you can get an ssl certificate from some of these authorities for free and you can get hosting for free so it's very cheap and in most security problems or programs i think most of us would agree that humans are the weakest element so that's i think why you're seeing this popularity this is a diagram from a group called phishlabs and again you can see every quarter there's a lot more phishing attacks being conducted via https and this is also what inspired me to build this project so for me i'm more in the predictive analytics and then you know the security space and this was a really popular prototype from an anonymous researcher named x0rz and he

was looking for suspicious phishing keywords he had a couple different characteristics that i used and borrowed in the work that i had done and you'll see at the bottom he's got predetermined scores and weights for the different criteria and i thought you know when you run it it works great this isn't a comparative thing but i just thought you know if he had published this and he didn't have the scores how do i know if that's the most optimal and how do i know if he didn't miss any keywords or you know over time yeah this is very static maybe six months from now there's something you know another you know thing like blockchain

or some of those exchanges that are not being included so how do i know um if that's representative and again from my background i do a lot of supervised machine learning and those are the questions i had asked so looking if this is the most optimal scoring framework um if these are good keywords and then a really important characteristic that i'll get into is a levenshtein distance where this is really good for me and it characterizes uh string similarity so um i'll get into that in more detail as far as building a predictive model an important characteristic is that you have historical training data and that i have we've seen phishing

attacks there are tons of phishing domains and then secondly can we characterize you know what i look at when i say that's phishing or that's not phishing in the form of features and specifically can i do this with python code and the answer is yes so i'm going to go into what that looks like in just a moment but we're going to do a live demonstration so this is the command line utility i've built and you're not doing real software development if you don't have awesome ascii art so that's there and i'm going to go ahead and let that run so periodically we'll hopefully see domains pop up that look like phishing and again i'll circle back on this

in just a few moments a little background on supervised machine learning so what this looks like we're going to start in the top left the training text documents and images and the training data for this work is these host names these fully qualified domain names api.support.microsoft.com for example versus that third one home.pavpalcountry.somerandomdomain.com so i've got libraries that have labels of these 5 000 domains are benign these 5000 are phishing and again those labels are directly beneath that and 0 would represent benign and 1 represents phishing and then just you know adjacent to that you've got feature vectors and that's just a fancy word it's about characteristics how you describe the data and the feature vectors with those

labels is what gets passed to that algorithm it's not like you just throw a bunch of domain names into this black box and all the vendors say it's magic it's a unicorn it's just how we describe them and what that looks like is right here so these aren't exactly representative but each record here would be a domain from my training set and you know the columns those are individual features from my feature vector and it kind of encapsulates how i describe them and you'll see two of those are zero the labels that would be benign and one is phishing and what the algorithm is doing is looking at the differentiating

characteristics between those two classes and when you think back about the keyword based you know rule based tool it's going to determine the appropriate weights based on my training data which for me is huge because then i can iterate very quickly as the training data evolves or if i have to update my features if there are things that i should be looking at that would distinguish these two classes that i'm not and then i don't have to go adjust every single feature and then at the bottom after i train the model the data input this is that certstream framework those are all of the hosts the fully qualified domain names from the certificate transparency log network and

as they come in i'm extracting those characteristics that i had from training scoring it with my predictive model and then i'm saying this is either phishing or it's not phishing any questions on that
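
that stream-extract-score loop can be sketched with the certstream python library mentioned above; the message layout below follows certstream's certificate_update format, and the scoring hook is left as a commented stub since the trained model isn't shown here:

```python
# sketch of the data-input side: pull domain names out of certstream's
# certificate_update messages so each one can be featurized and scored

def extract_domains(message):
    """Return the domains on a newly logged certificate, or [] otherwise."""
    if message.get("message_type") != "certificate_update":
        return []
    return message["data"]["leaf_cert"].get("all_domains", [])

# a trimmed example of the message shape certstream hands to the callback
sample = {
    "message_type": "certificate_update",
    "data": {"leaf_cert": {"all_domains": ["home.pavpal-country.example.com"]}},
}
print(extract_domains(sample))  # ['home.pavpal-country.example.com']

# live wiring (requires `pip install certstream`; opens a websocket):
# import certstream
#
# def callback(message, context):
#     for domain in extract_domains(message):
#         pass  # extract features, score with the trained model, alert if phishy
#
# certstream.listen_for_events(callback, url="wss://certstream.calidog.io/")
```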

okay so i've broadly you know kind of defined the machine learning life cycle depending on the problem that you're solving uh this can vary but for the work that i've done and doing supervised machine learning where i have the labels i know what the data is this is a rough idea of what my process looked like so starting in the top right we have this idea i think i can build a predictive model that encapsulates how i identify phishing domains i'm going to normalize the data so again for me it's that certstream framework that helps a lot but maybe if you're deploying this in production you're looking at urls you're looking at domain names

it's really important that you're very consistent with that and then moving on to feature exploration i'm extracting features i'm engineering i'm looking at combinations of them i might be doing some down selection you select the algorithm and then different algorithms have different parameters and that's kind of that feature exploration part and then you deploy and again what's really important here is there can be lots of oversights and it's important to iterate quickly so for example two of the features i looked at were the keyword of account and another one being service and for the first week that i deployed this it being tax season i flagged every accounting service as being phishing and then i just added that as a

feature provided some examples of it and then it learned that that's not a phishing domain as far as the data acquisition what i had done was i just took a rough list of words to use as a regex against the certstream feed to say you know rough idea of kind of what i'm looking for and i let that run for about two months and again it's all uncharacterized there are lots of you know what i would say are false positives but it's close to the criteria that i'm trying to catch and i got some of those keywords from that popular prototype from x0rz i also looked at um there's a security researcher swiftonsecurity

and he's published a bunch of exchange rules and i took some of those words to use for that another one is phishtank so this is a pretty popular phishing resource that's got urls and domains and targeted brands which i'll get into that was important and then i looked at the top alexa domains so those are the top you know the most popular million domains and just because they're popular doesn't mean they're not malicious but generally with phishing the sites are brand new and they won't be in that list and then again model correction from the different iterations that i had done for the keywords i just split you know from everything i'd flagged i split it

by um the special characters so it's difficult to see on the monitor but in python you can import the re library and do a split on \W and then i just get this really long word list and i didn't take out the top level domains but you'll see a few like account apple apple id support paypal and i just eyeballed about 100 of these and used those as keywords next was popular brands that were frequently targeted and this is the phishtank website you can search by targeted brand which again was really helpful for me so there are a bunch that i didn't think about and there are plenty of banks i

naturally thought jpmorgan or hsbc or bank of america but you'll see a few more in that list fifth third bank first federal bank of california and several others i also added a few different email providers and then i searched those to help seed my training data and i found examples of both from the data that i had started with this one's really important the levenshtein distances from phishing words and what this does is it measures similarity between two strings so i would say give me things that have a distance of one from paypal or from apple id and you'll see those in the top right and this is something that i think

is very simple but very impactful because i don't know every combination of when you know somebody creates a domain name and tries to make it look like paypal or apple id what it's going to be and that's what signatures are really good at but what i can just say is you know take all those fully qualified domain names split by the special characters and see if it has a distance of one to you know these 30 different brands i targeted and you can see variants like playpal and parpal and that's pretty important for me same thing with apple for the targeted brands again this is crucial so there are lots of brands that i was

looking for and there's a huge delineation between the presence of where that brand is so if it's in the second level domain like we have on the left hand side that's microsoft.com and those are all domains that i had flagged i wouldn't say that they're malicious i mean you know one of the attacks is subdomain hijacking so that's certainly possible but in general those are a bunch of domains that have microsoft but what it's saying is those are hosts at microsoft.com versus on the right hand side the phishing data i've got the brand in a subdomain which says that that's a host on some other domain and you can see some of those like a couple down

dash apple.com.updateyourinfo-secure so apple.com when you look at it in your browser it would probably throw off you know plenty of non-technical folks but it's actually a host on another domain and the thing that's great is in just a couple lines of python i can implement the logic to look for this a few other features i looked at the tlds the top level domains and used them as categorical features some of the ones have pretty bad reputations like tk and ml and part of that being that they're free um i looked at the number of dashes the number of periods and the domain entropy which is kind of a measure of randomness as far as the algorithm i use logistic

regression and some would argue that this is really just advanced statistics but for me it's very fast to train very fast to predict it's good with lots of features and i particularly like it because of this diagram at the bottom you know it's a lot more popular to say i did deep learning and trained a neural net but that's for me very unnecessary it requires a lot more training data the complexity of tuning the different parameters is also very difficult whereas this again is about as simple as it gets and what it's doing is it's assigning for each feature you'll see w0 w1 for each one it's calculating the feature weight and

the offset which is something specific to logistic regression and will determine that classification and what the bottom is showing is that those are the words or characteristics that are most indicative of that class so the positive class for me is phishing and you'll see the histogram the ones with the largest columns in blue some of those are bank of america in a brand subdomain or paypal has a levenshtein distance of one the login keyword icloud keyword things like that on the opposite side you'll see like the tld is com you've got accounting service as a keyword you've got paypal in the second level domain so those are the

characteristics that based on my training data the algorithm assigned those scores for and again this is exactly what i'm trying to get because when i saw that rule-based utility i was like how do i know that you know paypal should be weighted at 85 versus 60 things like that now this is also indicative of my training data and it's very important that when you build predictive models your training data is representative of global environments so just because i have the highest score for bank of america in the sub domain doesn't mean that that's actually the truth if the data i trained on is not representative of everything you know globally it might just be that in my training data i have

all of them but that's another thing that's really important to take into account when you're training predictive models for measuring performance i'll get into this in just a moment i've got a jupyter notebook that has all of these metrics i've got you know the area under the curve the precision recall very very high but ultimately when i was doing the deployments i had some false positives so in the jupyter notebook the metrics are a little bit lower it's not quite four nines it's probably around 99.2 and two of these metrics for example for precision it's when the classifier cries wolf how often is there actually a wolf and that's another big thing with uh

predictive modeling the authenticity and the integrity when it says something's on fire is it actually on fire conversely you've got the recall rate which says out of all of the phishing domains that exist how many am i catching so depending on different thresholds when you train your predictive model they can perform differently or be noisy but catch everything and different enterprises have different risk appetites so that's something that i'll show you as well so we'll pivot back to the command line and it looks like we've got a few good ones you'll see paypal.ga up top uh it's in red you can't really see that but let's see facebookcustomersupport.tech.blog that's probably not legit yeah

appleid.apple.com.security.subscriptions.ml but you'd be surprised i mean especially you've got attacks that do url padding i don't have any right here but um it might be like apple.com and then it will have like 10 dashes and on your mobile phone you don't really see past the dashes as the url just gets cut off and it looks like apple.com or facebook.com or offerup.com but i mean here's some examples that i think universally we can agree that those are not legitimate and you know i used some keywords but there are plenty of different characteristics that i use to encapsulate how i identify those and it's flagged them for me um again there's a lot here so um
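
lookalike strings like the pavpal and playpal examples above are exactly what the levenshtein-distance feature described earlier catches; here is a minimal pure-python sketch of that check (the brand list and the near_brand feature name are just illustrative):

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

BRANDS = ["paypal", "appleid"]  # illustrative targeted-brand list

def near_brand(token, brands=BRANDS, max_dist=1):
    """Feature: is a domain token within edit distance 1 of a targeted brand?"""
    return any(levenshtein(token, b) <= max_dist for b in brands)

print(near_brand("pavpal"))   # True  (one substitution away from paypal)
print(near_brand("paypal"))   # True  (distance zero)
print(near_brand("google"))   # False
```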

so that's good and to give you some perspective of you know how noisy this is i'm probably getting one hit every 10 seconds or so and you know the certstream library takes about 10 or 15 seconds to fire up but i've seen it go usually around 200 domains per second and as high as maybe 4 or 5 000 per second so that's kind of what it's looking at again it takes about 20 seconds to really get going but it's looking at that using my predictive model that encapsulates my intuition what i think looks like phishing and it's you know detecting those domains it's getting a little bit faster

there you'll see so again this is a really simple application that i built but that doesn't mean it's not impactful so on the theme of doing demos i'll show you here's what it looks like there's a jupyter notebook that comes with it and one of the things that also inspired me to do this is i've attended plenty of really high quality talks on predictive modeling and machine learning and ai and then as soon as i walk away i'm thinking like okay like how do i get started like i've never done a lot of that stuff i don't have all the resources or it's not blatantly clear or maybe i'm really curious on how they work but

all of that's been abstracted away this is it doesn't matter if you're a software developer a stock analyst if you know what a virtual private server is and you can spin one up in digitalocean or aws and run a bash script it'll build everything for you and this notebook is one of those utilities and it's kind of like going from you know scratch all the way to training a predictive model so we can walk through some of this kind of describes what i'm doing in this notebook i've already seeded it with all of my training data so phishing domains and those benign domains and kind of you know describing some of where i got them the code that i use to

run this and all you have to do is press shift enter and then it'll load that into some lists here's how i define and compute my features and you know this is very simple like building a predictive model using spatial features at patternex we've got predictive models that look at temporal features that look at entities over time and we look at different data sources as well so again this is pretty straightforward and then i get into a little bit of the discussion about the features that i had extracted and i've reviewed most of these again here's all the code to do that
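
the feature computations described here can be sketched with just the standard library; the exact feature set below (dash count, period count, entropy, a keyword check) is illustrative rather than the notebook's literal code:

```python
import math
import re
from collections import Counter

def features(fqdn: str) -> dict:
    """Illustrative structural features for one fully qualified domain name."""
    # split on non-word characters, as described earlier in the talk
    tokens = [t for t in re.split(r"\W+", fqdn) if t]
    # shannon entropy of the string as a rough randomness measure
    counts = Counter(fqdn)
    entropy = -sum(c / len(fqdn) * math.log2(c / len(fqdn))
                   for c in counts.values())
    return {
        "num_dashes": fqdn.count("-"),
        "num_dots": fqdn.count("."),
        "entropy": round(entropy, 3),
        "has_login_keyword": "login" in tokens,
    }

print(features("login.paypal-secure.example.tk"))
```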

you just press shift enter a little bit more dialogue on you know computing the features and how that works and what it does so an example here is you know i've got three different domains espn.com tourproject.com and another and the top level domains are com org and tk and when you're looking at categorical features what it's doing is it's creating a category for each unique top level domain so for espn.com the tld com feature would be one and everything else would be zeros and then for each respective domain that's what it's doing so then you compute those here's where i'm training the classifier so i'm assigning the labels and i've already extracted those features so the

features have a label and i just hand it to the algorithm and here is the one line you know if you're concerned that logistic regression is not really machine learning you could just put random forest or you could put something else and that's where it's training the classifier the silver bullet or black magic whatever you like now in production when i'm actually training predictive models and see the note right there generally you'd want to do some kind of grid search you want to evaluate and this is so common that it's built into scikit-learn you might want to look at multiple algorithms you might want to look at multiple parameters for each algorithm you might

have different ways that you extract features and you want to run all of them in parallel you can do that as well but for this project it's out of scope and it's because i had done that and i think this is such a simple and easy problem for what i built that it doesn't really matter which algorithm i was using i was getting the same performance and then again here a lot of the classifier metrics i had mentioned precision and recall there's plenty of different metrics that you can use again here i've updated this a little bit and you'll see the weights have changed based on my training data i found a lot more

examples of dropbox ebay yahoo so those now have very high weights and then on the other side you'll see the top level domain being com having paypal in the second level domain which when i originally got the training data that i started with there were over 700 domains from paypal that were very long but they were legitimate and i had incorrectly flagged them as being phishing but once i extended my training data to include them with the label of this is not phishing the algorithm learns how to distinguish those domains from actual phishing based on those features here's the precision and recall again and what this is doing is based on different

thresholds so this is the output of my classifier as a score between zero and one so the default malicious threshold is 0.5 if you were to increase that to 0.6 or 0.7 or 0.8 your precision would go up but the recall might go down so i'd have more certainty that the ones i'm flagging are actually phishing but i might miss a few in that gray area and based again on your risk appetite or if you're doing tuning or throttling of any predictive models that you're running this is an important piece to look at a couple more the true positive and false positive rate uh this is very similar to the above chart but

this kind of gives you a rough idea of how noisy it might be or how effective it is and again these numbers i didn't really get into this but when you have your 10 000 domains when you've got your training data what you want to do is split that into the data that you train on and then the data that you evaluate it's usually an 80 20 split so i would take 8 000 domains to train and i've got 2 000 that i test against and i know the labels so i know if my model is accurate when it makes its predictions this one's among the easiest to understand the classification report there's also a confusion matrix here
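
that 80/20 split-and-score step looks roughly like this in scikit-learn; the two-column feature vectors below are made-up stand-ins for the real features, so only the workflow mirrors the talk, not the numbers:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, precision_score, recall_score
from sklearn.model_selection import train_test_split

# toy, cleanly separable stand-ins for the real feature vectors:
# column 0 ~ "brand in a subdomain", column 1 ~ "levenshtein distance <= 1"
X = [[1, 1], [1, 0], [0, 1], [0, 0]] * 25   # 100 "domains"
y = [1, 1, 1, 0] * 25                        # 1 = phishing, 0 = benign

# the 80/20 train/evaluate split described above
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y)

clf = LogisticRegression().fit(X_train, y_train)
pred = clf.predict(X_test)

print(precision_score(y_test, pred))   # when it cries wolf, is there a wolf?
print(recall_score(y_test, pred))      # of all the phish, how many are caught?
print(confusion_matrix(y_test, pred))  # rows: true label, cols: predicted label
```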

this is the one i wanted to get at what this is showing is when the true label is not phishing how frequently it correctly says it's not phishing in the top right when it's not phishing but the predicted label is phishing those are false positives the bottom left is false negatives where it's actually phishing but you're saying that it's not and that's another metric that you can use to gauge the effectiveness and here i've got some new data again i'll show you some of this in the command line tool but you can train your model right here
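
the threshold discussion a moment ago can be made concrete: the classifier emits a probability between zero and one, and where you cut it trades precision against recall (again with toy stand-in data, not the talk's real training set):

```python
from sklearn.linear_model import LogisticRegression

# tiny stand-in training set: the single feature is a suspicious-keyword count
X = [[0], [0], [1], [1], [2], [3]]
y = [0, 0, 0, 1, 1, 1]
clf = LogisticRegression().fit(X, y)

# the classifier emits a probability for the phishing class...
score = clf.predict_proba([[3]])[0][1]
print(round(score, 3))

# ...and the threshold is a policy choice: 0.5 by default, higher if your
# risk appetite favors precision over recall
for threshold in (0.5, 0.8):
    print(threshold, "phishing" if score >= threshold else "benign")
```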

you can plug in whichever ones you want and then run them and you get the scores right there so you'll see i didn't flag paypal.com or apple.com but i did flag pavpalverify.com supportapple.xyz things like that and i've got a few examples of you know what i've caught we saw those domains that i'd flagged during the demo if i'm a soc analyst or maybe i kind of extend my classifier to only look at my company's brand you would then rather than just looking at the domain name do some kind of more dynamic analysis either look at it in a sandbox look at passive dns look at the whois registration so i've got a few things here i used a

honeyclient at the top which is just an emulated browser that i can use from the command line and when i tried to hit it i think it was my user agent the server would then return a 403 forbidden error and when i looked on virustotal you can't see the detections here but there were no hits and then when i visited in a sandbox you can see that's the ssl certificate and behind that is the fake paypal login screen another thing that i found quite a bit of are these phishing kits so when i would visit the page maybe it's still not fully operational so i can just visit it i get this directory

listing and i find these zip files and i took a couple of them and submitted them to virustotal and you can see this one is flagged as a web shell it could be because it's just so new and it hasn't really gone live yet or maybe there's a specific resource on this domain that they're using to run the phishing attack but nonetheless i've got over 100 that i've collected and it's pretty interesting and some of the work that i'm doing after this again i've open sourced this you can use this to the extent that you'd like but um this was more about showing you know some of the plumbing of and the different independent variables

that go into training a predictive model you know for me as far as actually solving the phishing problem i would want to do a little bit more orchestration where when i flag something that looks like phishing go fetch the domain name or go fetch the a record from that domain see if it's shared hosting or dedicated hosting maybe look at passive dns to see some of those things maybe look at virustotal maybe you've got a threat feed of your own if you have a sandbox and to do as close as i can to self-adjudication of new phishing domains again this is not a unicorn it's not magic it's not a silver bullet and with

With that said, I recognize there are plenty of flaws in this problem, this approach, and this predictive model. Number one is international characters. With Punycode encoding, my classifier is looking at UTF-8 strings, but internationalized labels show up as "xn--", which could throw it off. There are plenty of examples of phishing attacks that use Punycode, where a special or international character, say one that looks like a little "l", makes the domain look just like paypal.com. My classifier is not really optimized for that, but you could extend it to handle it.
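Detecting the "xn--" case is straightforward on the raw string. This sketch is not part of the talk's tool; it uses Python's built-in `idna` codec to flag Punycode labels and decode them so the Unicode form could also be scored:

```python
def has_punycode_label(domain):
    """Flag domains whose DNS labels use Punycode ('xn--' prefix),
    which often indicates internationalized lookalike characters."""
    return any(label.lower().startswith("xn--") for label in domain.split("."))

def decode_idna(domain):
    """Decode Punycode labels back to Unicode via the stdlib 'idna' codec,
    falling back to the original string if decoding fails."""
    try:
        return domain.encode("ascii").decode("idna")
    except UnicodeError:
        return domain
```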

Secondly, the suspicious attributes might only exist in the URI: I've got the fully qualified domain name, but the resource path contains "myaccount", "paypal", whatever, and my data input doesn't look at that. Again, you could extend this to look at full URLs, add a new data input, and provide labeled examples in your training data. Third, CertStream only covers domains that are being issued SSL certificates. If an attack only uses HTTP and never gets a certificate, I'm not going to get that data, so I can't see it or flag it.
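For context, consuming CertStream from Python looks roughly like the sketch below. The message layout follows the `certstream` client library's documented shape; the live listen call is shown only in a comment since it needs a network connection.

```python
def extract_domains(message):
    """Pull hostnames out of a CertStream 'certificate_update' message."""
    if message.get("message_type") != "certificate_update":
        return []  # ignore heartbeats and other message types
    return message["data"]["leaf_cert"]["all_domains"]

# Live usage (assumes `pip install certstream`):
#
#   import certstream
#
#   def callback(message, context):
#       for domain in extract_domains(message):
#           ...  # score the domain with your classifier here
#
#   certstream.listen_for_events(callback, url="wss://certstream.calidog.io/")
```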

One of the things I'm looking at is zone transfers from a lot of really popular top-level domains. You can go to icann.org and, looking at for example the .review top-level domain, do a daily download of everything in that zone and diff it every 24 hours; that shows you the new domains from that day. That would be one way to see domains that are newly registered but maybe don't have SSL certificates yet.
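The daily zone diff described here reduces to a set difference over two downloads; a minimal sketch, assuming you already have yesterday's and today's zone contents as lists of domain names:

```python
def new_domains(yesterday_zone, today_zone):
    """Domains present in today's zone file but not yesterday's,
    i.e. registrations from roughly the last 24 hours."""
    return sorted(set(today_zone) - set(yesterday_zone))
```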

Another really important one is Let's Encrypt, the free SSL service: they added wildcard certificate support last month, and I thought that was pretty interesting. I've got a ton of data, and I figured that if I keep running against CertStream and the Certificate Transparency log network I might not flag as much, because rather than something like appleid-support-whatever.yourdomain.com, the certificate would just cover *.yourdomain.com. But I haven't seen any drop-off in the number I'm flagging. I'm not really sure why that's the case, and I don't expect it to last for too long, but that's what I'm seeing now. Given that I have all this training data, another option is that if you've got the ability to do content filtering for DNS, you could run the predictive model there.

When you're doing the DNS lookups, even if the target server is using a wildcard certificate, you would see the full host, the fully qualified domain name. And because this is really lightweight (the data input is only a string, a fully qualified domain name) it's really quick, so maybe you could do some kind of blocking. There are plenty of different authorities, and I've gotten a lot of questions like "have you talked to Gmail about incorporating this?" You certainly could. The biggest thing here, I think, is that my recall rate is unknown.

What I mean by that is: of all the phishing domains getting SSL certificates, I don't know how many I'm missing. As you saw when I ran CertStream, there was a flood of domain names, a thousand to five thousand per second, and calculating the true recall rate would require ground truth on every single one of those, which I don't have. What I can do, because the number of alerts and flagged domains is so manageable, is calculate my precision rate: when I say something is phishing, is it actually phishing? Again, it's available on GitHub; I'll leave this up for just a moment.
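The precision computation described above is simple once the flagged domains have been hand-verified; recall, by contrast, would need ground truth for the entire stream:

```python
def precision(true_positives, false_positives):
    """Of everything flagged as phishing, what fraction really was phishing?"""
    flagged = true_positives + false_positives
    return true_positives / flagged if flagged else 0.0
```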

I published this about two weeks ago, and I'll show you what it looks like in just a moment. It's called StreamingPhish, and the follow-on project I've named Infishtigate (get it?). I hope to have that done in time for Black Hat; I'd like to present it there. Again, there are lots of directions here: you could use unsupervised learning for the feature extraction, you could look at all these different data elements, you could look at the A records to see hosts that frequently host phishing domains. You could certainly do that. This is more of a response to what's out there,

especially if you were at RSA last week. For me, after five years working with security vendors doing predictive modeling, there's a lot of discussion, a lot of hype, and a lot of skepticism about the capabilities of machine learning, and no one really talks about the independent variables; it's always "it's dynamic, it's very robust, it's proactive." It would be really nice to see end users ask: what does your precision rate look like? What about recall? Where do you get your training data? And really understand all these different independent variables.

Some use cases, like this one, I would say are very simple; for others, not so much. I just thought this might be helpful. Again, this is what it looks like on GitHub: the account is wesleyraptor and the project is streamingphish. I've got some description there, and I managed to build a decent graphic showing how this works: all these different certificate authorities issue SSL certificates, which get aggregated by the Certificate Transparency log network. I've Dockerized everything in the application. I was using Python 3 and I really like doing development on Linux, but I realized a lot of people may use Python 2

or have Windows or Mac, so I've got an automated install script for Linux workstations (Debian, Ubuntu, Fedora), but if you've got Docker and Docker Compose installed on Windows or Mac, you can get this up and running in under five minutes. Here's a little more about the containers: there are three, one for the notebook, one for the command-line utility, and one for the database. If you train multiple classifiers, you have to keep the features extracted at training time consistent, so I store everything in this database, and you can swap classifiers and do all of that.

Again, I've got quite a bit of documentation on how to do this. To install on Linux, you just click this; it says to use the install script. Here's what it looks like for Windows and Mac: I didn't automate the install of Docker and Docker Compose, but their websites have documentation on that, and once you have them, all you do is check out the project, navigate into the directory, and bring it up. And that's that.
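The check-out-and-bring-it-up step, assuming Docker and Docker Compose are already installed, looks roughly like this (repo path per the talk; treat the exact commands as a sketch rather than the project's official instructions):

```shell
# Clone the project and start the containers defined in its compose file.
git clone https://github.com/wesleyraptor/streamingphish.git
cd streamingphish
docker-compose up -d   # notebook, CLI utility, and database containers
```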

That's everything I had. I hope you liked it. Whether you think I'm a genius or you think I'm an idiot, I'll be around for a little bit after the talk. Thanks again for staying to the end of the day. Thank you very much. [Applause]