
Bad Neighborhoods – Data-Driven Detection of Malicious Internet Infrastructure

BSides Las Vegas · 2021 · 50:04 · 67 views · Published 2021-08 · Watch on YouTube ↗
About this talk
Sophos AI researchers present a machine learning approach to predict malicious IP addresses by exploiting hierarchical structure in the internet. Using convolutional neural networks and transformer-based models, they identify clusters of malicious infrastructure and demonstrate that incorporating ISP-level context significantly improves detection accuracy across spam and web-based threat datasets.
Original YouTube description:
GT - Bad Neighborhoods – data-driven detection of malicious internet infrastructure - Tamás Vörös Ground Truth BSidesLV 2021 - Camp Stay At Home - July 31 Video Tags: bslv2021-gt-bad_neighborhoods-1046317
Transcript [en]

Hey everybody, this is Urban Martin, and I help run the Ground Truth track at BSides Las Vegas. The next talk is called Bad Neighborhoods: Data-Driven Detection of Malicious Internet Infrastructure, by Tamás Vörös. Tamás is a data scientist at Sophos, where he researches various machine learning models for security applications, and he has a master's degree in computer science. If you have any questions, please hold them for the live Q&A immediately following the talk; questions can be submitted via the Ground Truth Discord channel. Thanks very much, and I hope you enjoy the talk. So, hey everyone. Thank you very much for taking the time to listen to our talk, called Bad

Neighborhoods: Learning Malicious Infrastructure at Internet Scale. This is a research project at Sophos AI. My name is Tamás Vörös; I'm a data scientist with Sophos AI. Two of my brilliant colleagues contributed to this project: Richard Harang, at the time with Sophos AI, now with Duo Security as a senior tech lead, and Joshua Saxe, the chief data scientist at Sophos AI.

In this talk I would like to cover our data-driven ML approach to learning malicious network infrastructure, and not only learning it but being able to predict it. To put it very simply: we take an IP address, the most basic core building unit of the internet, as input, run it through our ML models, and have the models predict whether that IP address will be involved in some kind of malicious activity, such as sending out spam or hosting a malicious domain. In effect, we would like a reputation score assigned to each of the

IP addresses. Here we have the agenda laid out, and I hope it will help us understand why we think this is even possible. First, we are going to review the hierarchical structure of the internet, starting from an IP address and going up all the way to IANA. After that we will show that, on the two datasets we collected, we can spot a hierarchically uneven distribution of maliciousness over the IP address space. Once we can see that there is signal in the hierarchical structure of the internet, we propose two deep learning architectures, or IP representations, that have

our intuitions about the uneven distribution of maliciousness over the internet baked into the design: in effect, finding the best hammer to hit the nail of predicting IP address maliciousness. Once we do that, we highlight one of the weaknesses of using only IP addresses as model input, and propose a solution. We keep the IP-only input, but significantly improve the performance of our models by closing the gap caused by non-contiguous IP prefixes belonging to one ISP.

Why do we think this whole thing is worth doing? Because as a regular computer user I am exposed to all sorts of threats

out there: I check my email, and there could be phishing emails; I browse the internet, and there are malicious websites. If I fall for a phishing email, or click a malicious website with some bad luck, I am infected with a virus that takes control of my machine, and then malicious actors are controlling my machine and extracting data from it over the internet. What all these very basic and common attacks have in common is that they all have an IP address involved in their execution. The phishing email comes from a domain that has an IP address. A malicious website is hosted by a server that has an IP address.

If someone gains control over my machine, the IP address the control traffic comes from shows up in audit logs. And the data exfiltration obviously goes somewhere over the internet, which requires an IP. So if we are to support a multi-layered approach,

it is a reasonable goal that at every stage of that approach we can say something about an incoming IP address, and a worthwhile goal to explore. Here is a quick review of the hierarchical structure of the internet. It is a simplified view, but hopefully it drives home the idea. Let's look at the bottom right corner, where our domain, sophos.com, is hosted.

If you ping that domain, you can see it is hosted by one of these IP addresses, and that IP address belongs to a prefix, here a /20, meaning the 20 most significant bits are fixed and the remaining 12 bits are variable, so the prefix encompasses a larger IP space. This prefix belongs to Akamai, an internet service provider, and Akamai owns multiple prefixes, not necessarily contiguous, as I mentioned before. Akamai, in turn, reports to the regional internet registry, ARIN.

And on top of ARIN there is IANA, supervising everything. So the quick question: if you were to assign a reputation to any of these artifacts within the internet infrastructure, at which level should you assign it? Should I assign it to sophos.com? To this very specific IP address? Should I do it at the prefix level, or the ISP level? Or should I say everything from ARIN is bad or good? And if I did that for this branch, can I pick a specific branch and say, OK, everything from Cloudflare is good, everything from Akamai is good, or should I go further down? Our answer is: we don't know.
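As a concrete illustration of the prefix arithmetic in this hierarchy, Python's standard `ipaddress` module can express the containment relations the talk describes. The addresses and prefixes below are illustrative only, not a claim about Akamai's actual allocations:

```python
import ipaddress

# An IP sits inside a prefix, and an ISP may own several,
# possibly non-contiguous prefixes.
ip = ipaddress.ip_address("23.45.67.89")
prefix = ipaddress.ip_network("23.45.64.0/20")        # first 20 bits fixed
isp_prefixes = [prefix, ipaddress.ip_network("104.64.0.0/10")]

print(ip in prefix)                          # True: /20 spans 23.45.64.0-23.45.79.255
print(any(ip in p for p in isp_prefixes))    # True: ISP-level check over all owned prefixes
```

The same membership test can be run at any level of the tree, which is exactly the "at which level should we assign reputation" question.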

Let's figure it out from the data, and let the data drive the appropriate level of reputation on this tree. So what we did: we gathered two separate datasets to model two different goals. One is an empirical dataset collected from customer telemetry, which we call the web dataset in this presentation. As malicious ground truth we used the IP addresses of domains labeled as malware repositories, phishing sites, or call-home servers, and as benign ground truth the IPs of domains labeled as social network infrastructure or search engines, for example Facebook or Google. We have roughly half a million samples for

that. Then, for the more standard spam-based dataset, we used two static blocklists. For the malicious ground truth we used the Spamhaus XBL dataset, which contains the IP addresses of spam engines and hijacked PCs. For the benign side we used the DNSWL dataset, which contains the IP addresses of email service providers known to react quickly once they get infected and clean up fast, so malicious behavior on those IP addresses is less likely. We downsampled this list so that it roughly matches the size of the empirical dataset. For the empirical dataset, we fixed a half-year time period and resolved the IP addresses of

domains during that period. Before we start modeling, it is beneficial to visualize the data, to see whether there is any signal worth modeling. So,

before looking at the data, a quick recap of what Hilbert curves are. Hilbert curves are the de facto tool for visualizing IP addresses, because an IP address is one-dimensional by nature: if we take the IP address as an integer and compare it to other integers, the ordering is one-dimensional. What a Hilbert curve does is take the one-dimensional IP address, in our case,

and project it into a 2D space, and it does so in such a way that the locality of the IP addresses is preserved: if two IP addresses are close to each other in the 1D space, they will be close to each other in the 2D space as well.
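The classic index-to-coordinate conversion for a Hilbert curve is a short loop; here is a minimal sketch, with an illustrative helper that places an IPv4 address on a 256×256 grid using its top 16 bits (one common convention for these heat maps, not necessarily the exact one used in the talk):

```python
def d2xy(n, d):
    """Map Hilbert-curve index d to (x, y) on an n x n grid (n a power of two).
    Consecutive indices land on adjacent cells, which is exactly the
    locality property used for IP heat maps."""
    x = y = 0
    t = d
    s = 1
    while s < n:
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        if ry == 0:                       # rotate the quadrant when needed
            if rx == 1:
                x, y = s - 1 - x, s - 1 - y
            x, y = y, x
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y

def ip_cell(ip, n=256):
    """Place an IPv4 address on an n x n Hilbert grid using its top 16 bits."""
    a, b, c, d = (int(p) for p in ip.split("."))
    top = (a << 8) | b                    # top 16 bits -> 65536 cells
    return d2xy(n, top)
```

Plotting `ip_cell` for every labeled address, red for malicious and blue for benign, reproduces the kind of picture described next.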

So let's look at the Hilbert curves: what do they say about our data? On the left we have the Hilbert curve for the web-based dataset; on the right, for the spam dataset. Each dot on these curves is an IP address: each red dot is a maliciously labeled IP address, and each blue dot is a benign-labeled IP address.

The good news is that there are clearly separable clusters: this is probably a more benign region of the internet, and this, let's say, a more malicious region. We can spot the same clusters in the spam dataset as well, though, that dataset being sampled from blocklists, it is far more clustered than the empirical one. So there is actually a signal in there that we can try to model. The question is, if we want to model an IP address, how are we going to feed it to a machine learning model that operates over numbers?

So what is the best way to represent an IP address? An IP address has multiple representations by nature. There is the standard dotted-decimal form that everyone has seen somewhere; under the hood, an IP address is just an integer, so that is a trivial representation too; and there is the binary representation of the same IP address. These are the representations we are going to baseline our models against, and we will see why we think our models better reflect the intuition that there is a hierarchical bias. For example, take the binary form of this IP address: if we had these two samples in our training set, and these IP addresses have the same

label, they match on the least significant bits but not on the most significant bits.
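To make the pitfall concrete, here is a small sketch; the two addresses are made up for illustration:

```python
def ip_to_bits(ip):
    """32-bit binary string for a dotted-quad IPv4 address."""
    a, b, c, d = (int(p) for p in ip.split("."))
    return format((a << 24) | (b << 16) | (c << 8) | d, "032b")

# Two hypothetical same-label samples: they agree on the low
# (least significant) bits but sit in very different parts of the space.
lo, hi = ip_to_bits("10.0.0.5"), ip_to_bits("200.0.0.5")
print(lo[-8:] == hi[-8:])   # True: the last octet matches
print(lo[:8] == hi[:8])     # False: the most significant octet differs
```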

There is a chance that the model learns that the key factor is in those matching low bits, which is counter to what we want, so we want to make sure that our model understands correct subnet arithmetic.

This is our first proposal. We take the IP address and its binary representation; only the left part is the representation, the prefix column is just for illustrative purposes. This binary bit string represents the /32 prefix.

This is nothing new compared to the previous slides. What we do next is take almost the same bit string, except we zero out the least significant bit, so the row represents the /31

prefix, meaning the second row is responsible for representing the two IPs in that subnet.

In the next row we again take the binary representation of the IP address and zero out the two least significant bits, which now corresponds to the /30 prefix, meaning it is responsible for representing four IP addresses,

and so on. Thus we get a 32-by-32 matrix in which every subnet of the IP address is represented: each row

of the matrix is a subnet of the IP address.
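The construction above can be sketched in a few lines of plain Python, assuming the convention that row k zeroes out the k least significant bits:

```python
def subnet_matrix(ip):
    """32 x 32 matrix for an IPv4 address: row k is the address with its k
    least significant bits zeroed, i.e. the /(32-k) prefix it belongs to."""
    a, b, c, d = (int(p) for p in ip.split("."))
    n = (a << 24) | (b << 16) | (c << 8) | d
    rows = []
    for k in range(32):
        masked = n & (~((1 << k) - 1) & 0xFFFFFFFF)
        rows.append([(masked >> (31 - bit)) & 1 for bit in range(32)])
    return rows   # row 0 = the /32 (the IP itself), row 31 = the /1 prefix
```

Each row is one prefix level, so the whole hierarchy of subnets the IP belongs to is present in a single fixed-size input.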

Why do we think this is good? In deep learning there is a specific family of models called convolutional neural networks, or CNNs. CNNs are the go-to approach for image modeling, and this is a super-high-level recap of how they work. A CNN takes an image as input, does its processing, which is not specifically interesting in our case, and then outputs whether it is a bear or not. The key part of processing an image with a CNN is sliding a so-called convolutional window, which you can see here, all over the image: over the ear of the bear,

face of the bear, and the completely empty background. The reason this is so effective is that the model can map each segment of the image to an importance: the empty background is probably not very predictive of whether this is a bear, while the face of the bear is highly predictive, and you get the idea. But why am I talking about bears when we were talking about malicious IP addresses?

CNNs work over matrices, such as an image. What we just constructed is a matrix of the subnets of an IP address, so suddenly we can take a convolutional window, just as with an image, and slide it over each row of the subnet matrix. And why is that good?

Here is what we can observe. We have our observations in the training dataset. Say the convolutional window looks at the first row: OK, it was a benign IP address; that is a good indicator that this IP is benign. But we want to generalize beyond one IP address, so we move further down, and just as with the bear, this representation gives the model the capability of pinpointing, say, a subnet it has seen with a lot of IPs, where 85% of the IPs in that subnet were predictive of benign behavior. The model gets the chance to assign a good reputation to that specific subnet by sliding the convolutional window over all of the subnets. Moving down to the most significant bits, that is probably just white noise, because it evens out over the whole internet; it is not going to be predictive of maliciousness, and it is not likely to be picked up by the model. The point is that this matrix has all 32 subnets, so the model gets the chance to observe all of them and decide for itself at what level it should score the IP address.
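The sliding-window idea can be sketched without any deep learning library; the filter values below are arbitrary stand-ins for what a trained CNN would learn:

```python
def slide_window(matrix, filt):
    """Slide a filter of len(filt) rows down a subnet matrix, one step per
    row: the vertical analogue of a convolutional window scanning an image."""
    h = len(filt)
    acts = []
    for top in range(len(matrix) - h + 1):
        act = sum(f * x
                  for frow, mrow in zip(filt, matrix[top:top + h])
                  for f, x in zip(frow, mrow))
        acts.append(act)
    return acts   # one activation per band of adjacent subnet rows
```

Applied to the 32×32 subnet matrix, each activation scores one band of adjacent prefix levels, which is how the model can localize reputation to a particular subnet depth.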

So this is our first proposal. As for our second proposal, we turn to transformers. Transformers are a real powerhouse; they are now the state of the art for sequence-to-sequence architectures. If you have ever used

machine translation, it was probably transformers running under the hood. It is a super-complicated architecture, but there is one specific part of it that we would like to highlight, which is why we think it is a good fit.

It is called self-attention. Look at this example: suppose the task is to translate the sentence "the animal didn't cross the street because it was too wide" into French, and we are done up to this point. The word "it": what is it supposed to refer to? What the self-attention mechanism allows the transformer to do is bake into its encoding the important words that are relevant to "it". The transformer model is designed in such a way

that the word "it" can pay more attention to "street" and not to irrelevant words while it is being encoded. So, just as with the bear, where we took the idea of convolutional networks and applied it to IP representations, this is sentence-to-sentence translation; why do we think it is good for IP addresses?

Because this time we have a weird sentence that does not contain English words but parts of an IP address.

We used the pretrained DistilBERT model from Hugging Face, just a minor detail, and fine-tuned it on IP addresses. What this allows the model to do, when encoding, for example, a less significant part of the address, is look elsewhere: arguably, ISPs behave consistently at specific levels, upholding their policies, so if we are going to make a decision, we do not know exactly at which level it needs to be made, but it is probably not the least significant quad, maybe not even in this quad. And this is actually a snapshot of the attention this model pays, for example, when encoding the number 21. What it learns to do

is pay attention in this encoding to the more significant quads

of the string itself. And this is exactly what we want: to emphasize the more significant parts. We do not know exactly which ones, but we want to emphasize the more significant parts.
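The "weird sentence" is just the IP address broken into tokens. The talk does not spell out the exact tokenization scheme, but one plausible version, treating each quad and each dot as a token, looks like this:

```python
def ip_tokens(ip):
    """One plausible tokenisation of a dotted-quad address for a sequence
    model: each quad and each separating dot becomes a token. Illustrative
    assumption; the talk's exact scheme is not specified."""
    out = []
    quads = ip.split(".")
    for i, quad in enumerate(quads):
        out.append(quad)
        if i < len(quads) - 1:
            out.append(".")
    return out

print(ip_tokens("151.101.2.21"))   # ['151', '.', '101', '.', '2', '.', '21']
```

Self-attention then lets the encoding of a low-order token like "21" attend to the higher-order quads, which carry the hierarchical signal.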

So these are the results, presented as ROC curves, one for each dataset. What is a ROC curve? A ROC curve displays the trade-off between false positive and true positive rates. On the x-axis I have the false positive rate, meaning that, based on evaluations, I can decide to allow my model a 10^-2 false positive rate, so one in a hundred IPs will be a false positive, in exchange for catching 40% of the malicious IPs. That is what the ROC curve shows, and clearly, as you allow more false positives, your true positives go up, and vice versa. Oh yes, sorry, one more thing: we measure the goodness of our models in terms of AUC, the area under the curve. The more area under these curves, the better the predictive power. What you can see is that our convolutional neural network

approach has the second-highest AUC, and the transformer approach has the highest AUC, so we beat the trivial approaches with both our proposals, and we do it consistently on both datasets. On the web dataset the AUC is relatively low, so this would not be strong enough for standalone deployment, but it can be a viable part of a multi-layered approach, as an additional signal. The spam dataset, which differentiates IPs from the two blocklists, is a significantly more clustered and easier dataset; there we can achieve, at a 10^-3 false positive rate, a 60% true positive rate, which is maybe standalone-ish model performance. So on both

datasets, the CNN and transformer approaches win. The rest of the encodings are close behind, and they differ somewhat between the two datasets, but our two approaches are consistently in front. OK, so these are the results when we use models that have only IP addresses as input. The question is,

can we do better? And if yes, then how? Let's fix an ISP and drill down into it. I picked Cloudflare for investigation, and this is a hypothetical scenario in our training set. As mentioned before, an ISP can own multiple IP prefixes, and those are not necessarily contiguous. So what happens in this scenario? We have this first prefix, with 5,000 samples of training data and 5,000 samples of test data. We train our model on the 5,000 training samples and predict the 5,000 samples in the test set. It is maybe fine to say that we have observed enough behavior from this specific prefix to say that, OK, we

are confident about making a decision. But what happens in the second case, where we have another prefix from the same ISP and only one IP address of it in the training data? Should we be confident making predictions for 5,000 IP addresses based on one IP address? Probably not. On the right you can see the representations of the IP addresses.

Here we have, for example, one of Cloudflare's prefixes, and here the other cluster of its prefixes. What happens when we split in such a way that one prefix has only one, or a few, IP addresses as representatives? Would it be more reasonable to generalize from a different prefix of the same ISP when we are so low on samples in this prefix? That is what we set out to explore. So we built a different model. In our previous experiments we had the IP-only input, encoded in the subnet-matrix representation, which went through a convolutional layer and predicted whether the IP was malicious. What we do now is extend

the set of inputs: we take the ISP of the IP address, in its textual form, and wire it in another branch into the same neural network. An important detail: the model is no longer predicting whether an IP address is malicious; it is predicting whether an IP-plus-ISP pair is malicious. This is certainly doable, as there are many ways to look up a GeoIP database, but our main goal was to predict on a single IP address, and this is

sort of going in the opposite direction. So let's see if we can do better. The information about which ISP an IP address belongs to is encoded in the IP address itself. So we build a totally different model

from the ones we had before. Before, we predicted whether an IP, or an IP plus ISP, was malicious. Now we build a model that takes an IP address as input, uses the same convolutional approach, and as output tells us which ISP that IP belongs to. For this we use the MaxMind GeoIP ISP database; that is basically our output. Essentially we are compressing the database, trying to predict the ISP just from the IP, which should be doable because the information is within the IP address.

It is really a simple compression of the database. Why is this good? Sorry, I forgot to mention: for this specific model we use a special loss instead of the standard sigmoid or softmax, called CosFace (see the arXiv paper), which was designed for face recognition, but we saw how it can be analogous to our case. What this loss does is force the model to learn embeddings of the inputs, internal representations in the model, such that intra-class distance is minimized and inter-class distance is maximized. What does that mean? It means that if I look at the IP address representation

at the end of this layer and plot it in 2D, I can see that before this training we had these two separate clusters for the Cloudflare ISP, and a huge overlap with all of the other IPs.
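For reference, the CosFace idea, scaled cosine similarity with a margin subtracted from the target class, can be sketched in a few lines. The scale `s` and margin `m` below are typical values from the CosFace paper, not necessarily the ones used in the talk:

```python
import math

def _norm(v):
    return math.sqrt(sum(x * x for x in v))

def cosface_logits(embedding, class_weights, target, s=30.0, m=0.35):
    """CosFace-style logits: scaled cosine similarity to each class centre
    (here, each ISP), with margin m subtracted from the target class so the
    model is pushed to pull same-ISP embeddings together."""
    e = [x / _norm(embedding) for x in embedding]
    logits = []
    for i, w in enumerate(class_weights):
        cos = sum(a * b / _norm(w) for a, b in zip(e, w))
        logits.append(s * (cos - m) if i == target else s * cos)
    return logits
```

Feeding these logits into a standard cross-entropy loss yields the margin-based training objective; the margin is what forces the separation between ISP clusters described next.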

What CosFace does is contract these representations, not into a single dot, but closer to each other, meaning the intra-class distance is minimized; they are no longer that far apart. And the inter-class distance is maximized, meaning it tries to put more distance between the two different ISPs. Why is this good for us? Because we again build a new model, very similar to the one that had IP and ISP as inputs, except without the ISP input: we take the previous model that we trained and plug it in, in place of the ISP, and we plug it in such a way that we freeze that submodel's weights. That

means that when we start training this model, those weights are never going to change, and essentially we have the

compressed version of the MaxMind database in there.

Why is this good? When we plugged in the ISP input directly, we were predicting whether the IP-ISP pair was malicious, which carries the additional engineering cost of having the ISP information present at inference time. But with this trick, where we stack the two models,

we are again just predicting whether the IP is malicious, as opposed to whether IP plus ISP is malicious. So let's look at the results. What does this do? Again we have the web and spam dataset ROC curves, with the very same results from the IP-only experiments.

You can see with the orange line that there is a significant improvement from including the ISP, even in its one-hot encoded form. Then we have the IP-plus-CosFace-pretrained-model version, which is not as good as the IP-plus-ISP AUC but significantly better than the IP-only AUC, and its benefit over IP plus ISP is that it takes only the IP as input, nothing additional. And similarly

to the web dataset, on the spam dataset there is the IP-only result, the improved ISP result, and, similarly, the CosFace-pretrained model interpolates between the two. So,

since the AUCs were already stronger here, it is not as significant an improvement, yet it improves there as well.

So why did we go through all this exercise?

It is because, using these IP-only models, we generated a heat map of the IP space. Again, each dot here is an IP address; these are randomly generated IP addresses, scored by the model trained on the web dataset with IP-only inputs. We can see that the model successfully picked up the malicious clusters in multiple places, and even if we zoom into a specific region of the internet, the granularity goes down to arbitrary, or near-arbitrary, depth to locate specific malicious regions.

And that's about it. Thank you very much.

Hey, campers. Welcome to Camp Stay at Home. I hope you all got to take a nice refreshing dip in the data lake this morning. We are here with Tamás Vörös for some Q&A. I will lead off with a question from Gabe the engineer, who asks: what improvements could be made outside of training the model? Improvements in data, additional domain expertise, et cetera.

Yeah, so hello everyone, and thank you for listening to this talk. It is a good question. I think it is a general direction in AI to put more emphasis on data quality over additional modeling. But to be specific, one answer that comes to mind: this whole talk was about the preferential bias of malicious actors over the IP space, and that thought can be expanded to other artifacts on the internet. We might as well include the name server of a domain as a preferential choice of a malicious actor, covering more interactions between IP addresses by having an edge between an IP address, a name server, and another IP address, so that knowledge can be

transferred between two IP addresses just as with an ISP. The same could be done using WHOIS artifacts; I like the WHOIS registrar or registrant name. So that is one aspect where we could improve: including more data, more features, more features orthogonal to the IP. We could also do a better job with labeling, because with IP addresses it is always a little sketchy, since an IP address can host multiple things.

For now we have thrown away everything that was hosting mixed content and went exclusively with purely malicious and purely benign. We could instead have said, for example, that our ground truth is an IP address that hosts at least five malicious items, and be more precise there. But this was a first proof of concept. So those are two improvements I can think of.

Cool, OK. Another question asks: between the different methods, how hard was training? Did it take more or less resources depending on the method? And was there any sensitivity to hyperparameters?

I think with transformers it is always the tough question that they are huge. So,

if resources are one aspect of picking a model, it would lean towards the convolutional network, because it is much smaller in size

than the transformer. Though that holds for the random forest too: it is significantly smaller than a neural network, and maybe faster to train. But,

yeah, that is one aspect. So yes, there are resource differences, the transformer being the heaviest, then the convolutional network, then the random forest baselines, and obviously the accuracy trades off in the opposite direction. As for hyperparameter tuning, there were some differences, but nothing significant; it is within a reasonable, ignorable range.

Cool. Can you go into a little more detail about the advantages of using the CosFace model over the first CNN you described? Yeah, so it is more of a trick. When we first trained the IP-only model, which literally has just the IP information as its input, it was giving a performance, especially on one of the datasets, that was not the performance we expected. And then there was a thought that

it might be due to the fact that we have poor coverage, if you recall the Hilbert curves, and that we could fix that by including the ISP as an additional feature. But we really did not want to walk down that road; we wanted to have an IP-only model too, and the thought was that the information about which ISP owns an IP address is baked into the IP address itself, so it is something we should be able to infer. Taking the original model and stacking the CosFace model into it encompasses the ISP information just by stacking the two models, while keeping the original single IP input. And we use CosFace because

we tested softmax and the so-called face family of losses, SphereFace, ArcFace, and CosFace, and CosFace happened to come out the winner. The idea is that CosFace generates a representation of an IP address that is better separated from other classes than with softmax. For us, one class was one ISP, and we wanted the representations of two separate ISPs to be well differentiated. That is why we picked CosFace: it does a better job than softmax at representing classes with bigger margins between them, and with this trick we can still work from just the IP address. Yeah. Yeah, kind of a related question. Somebody mentioned

that in the past, one thing that has come up is that often more complex, fancier models do not do as well in classification. So the question is: in this case, what about these made the more complex models better, do you think? Yeah. Yeah, so

we purposely tried to architect the convolutional method such that it supports our hypothesis of the hierarchical bias that the IP address has: the ISP, the prefix, the IP, and the different subnets. This is something we specifically built the representation for, to bake in this specific intuition, so that it was only one step to use the convolutional network, and I think it picked up on the intuition, because we baselined it against the set of approaches we showed on the slides, and a few more research-type ones, and this approach came out ahead. The same goes for the transformer: we designed around the idea of this hierarchical bias, and so far

these are the representations we have found that best capture it.

OK, one more quick question before we go, again related to training. How often do you think you would need to retrain a model like this to keep up with changes in malicious IPs? Does the hierarchical structure tend to remain more or less the same, or how often do you think it would change? Yeah. So I guess the easiest answer is: as often as you can; that is clearly the best thing you can do. But the underlying idea is that if someone is hosting malicious content, there are the URLs, which they can change super fast; then there are the domains, which change less frequently than a URL; and then there are the IPs, which change much

less frequently than the domains, so they are more stable across time. At least that is what we think. If there is a bulletproof ISP, it is maybe not as easy to switch the set of IPs that ISP owns over time as it is to switch a URL. But certainly, it needs to be retrained as frequently as you can manage. OK, great. Well, thanks again for your talk. Great job. Thank you very much. And yeah, enjoy the rest of the conference. Thank you. Thank you, guys. Bye.