Preserving Telecom Infrastructure Using Machine Learning

BSides Calgary · 2021 · 27:19 · Published 2021-12
About this talk
Ali Abdollahi explores how machine learning can enhance intrusion detection systems (IDS) for telecom and network infrastructure. The talk contrasts traditional signature-based IDS with next-generation ML-based approaches, explaining how ML overcomes the limitations of rule-based detection by automatically learning from network traffic patterns, handling zero-day threats, and adapting to evolving attack techniques. Key topics include IDS/IPS architecture, training methodologies, data hygiene, and practical deployment considerations.
Transcript

Hello everybody, welcome to my session. This is Ali, and today I'm going to talk about the telecom industry and how we can take advantage of machine learning to preserve telecom infrastructure. If you're ready, we can go ahead. I'm a technology nerd and infosec expert, and I have been working for almost nine years. I'm an invited speaker and trainer at many international conferences, like many BSides events, and this one, BSides Calgary, is very lovely. I really love to play with bits and bytes, signals, and radio frequencies, and I'm really happy to have you here at this great event, BSides Calgary 2021. So if you're ready, let's dig in.

So, intrusion detection tools. I'm going to talk about some of the tools that are used to prevent intrusion. One of the tools I'm focusing on in this presentation is the intrusion detection system, commonly referred to as an IDS. Another one is an IPS, or intrusion prevention system, which you have heard of many times during your professional career, and I will discuss this in more detail in a moment. One that you're almost certainly familiar with is the firewall. To understand these systems: a firewall is basically something that looks at the source and destination IP addresses of the connection, and it looks at the ports as well as the services,

and after that it decides whether or not to allow the traffic to come in or out of the network. An IDS, on the other hand, is much more sophisticated. It's going to look at the traffic and look for patterns, some predefined patterns and rules. Basically, it's going to look for anomalies and for signatures. In this session I'm going to talk about how we can use machine learning to determine if the traffic is malicious or not. Now, the difference between an IDS and an IPS is simply that an IPS is like an IDS that has permission to take action, so when it detects malicious activity

it will go ahead and prevent it.
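As a rough illustration of that distinction, here is a minimal Python sketch. It assumes some detector has already decided whether a packet is malicious; the names `Verdict` and `handle_packet` and the mode strings are hypothetical and not taken from the talk or from any real product. In detection-only mode the system raises an alert for a human to review, and in prevention mode it blocks.

```python
from dataclasses import dataclass


@dataclass
class Verdict:
    action: str  # "allow", "alert", or "block"
    reason: str


def handle_packet(is_malicious: bool, mode: str) -> Verdict:
    """Decide what happens to one packet; `mode` is "ids" or "ips" (hypothetical names)."""
    if mode not in ("ids", "ips"):
        raise ValueError(f"unknown mode: {mode!r}")
    if not is_malicious:
        return Verdict("allow", "no detection")
    if mode == "ips":
        # Prevention mode: the system has permission to take action.
        return Verdict("block", "detection in prevention mode")
    # Detection-only mode: traffic is not stopped, a human reviews the alert.
    return Verdict("alert", "detection in detection-only mode")


print(handle_packet(True, "ids").action)
print(handle_packet(True, "ips").action)
```

Running a new IPS in the detection-only mode first is one way to see how many alerts it would raise before it is allowed to block anything.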

An IDS, on the other hand, will simply raise an alert so that a human can look at the alert for further analysis and decide whether it is actually a malicious incident or just normal traffic. There are use cases for each of these. For instance, when you're testing out an IPS, it should basically be in IDS mode so you can make sure that it's not faulty, and by faulty I mean that it perhaps triggers too many false positive alerts. Right now we're going to talk about IDSs and the types of IDS.

Sorry, sorry, Ali, really quick.

Your slides aren't advancing for me. I still see your title slide, and I don't know if that's happening for anybody else. Can you see the current slide, "IDS Varieties"? The slide I see is "Preserving Telecom Infrastructure Using Machine Learning". Oh, let me stop sharing and share my screen again. Okay.

So is it okay right now? Yes, now I see "IDS Varieties". Perfect. Okay, and the previous one was "Intrusion Detection Tools". Sorry about that, and thank you for letting me know.

So, IDSs fall into two varieties: traditional IDSs and next-generation intrusion detection systems. Traditional IDSs, which have been used for many years, rely on rules, patterns, and signatures. In other words, as traffic comes in and out, the IDS monitors it and applies a set of rules, a set of signatures for example, to decide whether this traffic should be allowed or flagged as malicious or suspicious. One of the great challenges with

traditional IDSs is that these rules and signatures are very specific, and as a result you constantly have to update these patterns. It's kind of like worrying about the latest flu, where your best shot is to stay up to date. Unfortunately, the problem with traditional IDSs is that even if you're consistently updating your intrusion detection system, the bad guys are always coming up with new attack techniques. There are so many categories, and they're constantly evolving. As a result, old intrusion detection systems just cannot keep up. Let me give you a tangible example. For ease of explanation I use a firewall here, but the same ideas

apply to IDSs. One type of attack that hackers use is called packet fragmentation, which allows them to evade the rules that would be applied to their packets to determine that they're malicious or even suspicious. Another type of attack that hackers and malefactors can use is spoofing the source IP address. The third method of attack is spoofing the source port. Stopping any single one of these attacks requires great technique, deep domain knowledge, and plenty of expert hours. But the problem is that these are just three types of attacks on firewalls that are off the top of my head. What about the hundreds or even thousands of other attacks that

attackers are constantly coming up with, improving, and evolving? How can you possibly keep up with all that? As much as we can hope to keep up with the attackers, to figure out their methods and techniques, to improve our own defenses and security mechanisms, and to keep constantly updating and working together, the reality is that if we look at the statistics, we can see that unfortunately it is pretty much impossible to keep up. If it were possible, the number of security incidents would be decreasing rather than increasing, whereas we know that they are increasing in number, scale, and cost.

So, hopefully, human brains and human technologies do offer us a solution. It's not that we cannot keep up at all; it's that we cannot keep up using these old-fashioned techniques. So let's go ahead and look at using artificial intelligence and IDS for next-generation network security in different kinds of networks, not limited to telecom infrastructure but also IP backbone infrastructure and other types of infrastructure. This promising solution is artificial intelligence, and in particular its subfield, machine learning, or ML. Artificial intelligence is so promising for cyber security because it is scalable. It is a hot topic nowadays, and it is scientific in the sense that you can always

recreate your experiments, control for variables, debug it, understand why one thing happened or another, why it made a certain prediction, and learn lessons from it. It is able to stop zero-day threats, attacks, and exploits, in other words threats that have never been seen before, which is something that traditional methods have had absolutely no success with. It can be adapted to be fast. It can be adapted to work in real time, as fast as you need it. For instance, if you need it to be put on a network as packets come in and out, that can be done. It can also be adapted for low-memory situations and other tough circumstances in your

network. It is also a very flexible framework. Finally, it is tunable: it can be tuned to satisfy your goals, your organization's targets and objectives, or even your customers'. In other words, you can decide on the relative importance of catching a certain threat compared to another threat, or compared to a false positive, a false negative, or any related alarm, or compared to letting through benign traffic or normal flows. You have full control over all of these decisions. If you know that a particular attack is especially deadly and malicious, you can set this in your artificial-intelligence-based solution so that it knows that this is a much more costly attack and therefore will

treat it as such, radically increasing the odds of this attack getting caught. So now that I've discussed the numerous benefits of machine learning, what exactly is it? Machine learning is the science of applying sophisticated statistical algorithms, designed to be scalable and suitable for use with our computer systems and infrastructure, to automatically learn from data. In particular, the next-generation intrusion detection system is a machine-learning-based intrusion detection system that is going to automatically learn from the traffic that it monitors inside our network, and it doesn't matter what kind of network. In this slide I'm going to discuss the architecture of a machine

learning based IDS. A machine-learning-based IDS, also known as an NG-IDS or next-generation IDS, consists of three main parts: the training data, the model, and the objective. Let's discuss these in more detail. Training data is the raw material used to infer rules, patterns, and parameters. Basically, you feed the training data into a machine learning model, so the model is very important in this kind of procedure. A machine learning model is a statistical algorithm that takes in the data and automatically infers rules, probability estimates, and so on. But before the machine learning model can infer the rules, it needs to know what it's hoping to do: what it is

predicting, what category of data we're going to move forward with, and what kind of trade-offs you would like to make, for instance in terms of training time versus accuracy. The objective provides most of this information by telling the model what it's aiming to achieve: the goal, the target. Important examples of objectives are accuracy, precision, recall, and true and false positive rates. There are many other objectives as well, such as F1, but in cyber security the most important goal is measuring the true and false positive rates. Accuracy, although it's intuitive, can be deceptive because the data sets are so imbalanced. We will

be discussing the objective in more detail later. Right now, let's talk about model development, which is very important. To actually develop the next-generation IDS, the process is that you take the data, which you have built an understanding of, and you split it into two parts: a training set and a testing set. Oftentimes the split is an 80/20 split, with 80% going into training and 20% going into testing. This all depends on the size of the data and the type of application. In the case of an IDS, the data has a temporal component, meaning that it matters when an event is observed. For that reason it

often makes sense to make the training set, say, the first 80% of the data in time and keep the last 20% for testing. As another example, say you have one year of data: you put ten months into training and two months into testing, and it's the last two months of the year that go into testing. I hope that is clear. Once you've set up the split, you must set aside the testing set and never look at or touch it until the very end of training. What you're going to do next is split the training set once more, into a new training set and a validation set.
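The time-ordered split described here can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the events are made up (one per day for 100 days), the helper name `temporal_split` is hypothetical, and applying the same 80/20 ratio to the second split is a choice made for the example, not something the talk specifies.

```python
from datetime import datetime, timedelta


def temporal_split(events, first_percent=80):
    """Sort events by time and cut once: the earlier part first, the later part second."""
    ordered = sorted(events, key=lambda e: e["time"])
    cut = len(ordered) * first_percent // 100
    return ordered[:cut], ordered[cut:]


# Hypothetical events: one per day for 100 days.
start = datetime(2021, 1, 1)
events = [{"time": start + timedelta(days=i), "label": "normal"} for i in range(100)]

# First split: the earliest 80% for training, the latest 20% set aside for testing.
train_full, test = temporal_split(events, 80)

# Second split: the training part is split once more into training and validation.
train, validation = temporal_split(train_full, 80)

print(len(train), len(validation), len(test))  # 64 16 20
```

Because every cut is made on sorted timestamps, everything in the training set happened before everything in the validation set, which in turn happened before everything in the testing set.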

So you're going to fit the model to the training set. You're going to form a collection of promising model candidates, and each of these is going to be trained on the training set and then evaluated on the validation set. One of these is going to win as the best-performing one. You will take this model and test it out on the testing set, and this will give you an unbiased evaluation of how your model should perform in real life, assuming your data is representative of real life, I mean real traffic flows. So, to say that again: you take your overall data set and split it into three parts: training, validation, and testing.
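The selection step can be sketched as follows. Everything in this example is an assumption made for illustration: the validation and testing data are made up, the candidates are fixed threshold rules standing in for models that would really be trained on the training set, and picking the winner by true positive rate minus false positive rate is one possible objective, not necessarily the speaker's.

```python
def rates(y_true, y_pred):
    """Return (true positive rate, false positive rate, accuracy); 1 = attack, 0 = benign."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    tpr = tp / (tp + fn) if tp + fn else 0.0
    fpr = fp / (fp + tn) if fp + tn else 0.0
    return tpr, fpr, (tp + tn) / len(pairs)


def make_threshold_model(threshold):
    """Stand-in for a trained model: flag an event when its score reaches the threshold."""
    return lambda score: 1 if score >= threshold else 0


def evaluate(model, data):
    labels = [label for _, label in data]
    predictions = [model(score) for score, _ in data]
    return rates(labels, predictions)


# Hypothetical candidates (in practice each would be fitted on the training set).
candidates = {
    "always_benign": lambda score: 0,
    "threshold_0.5": make_threshold_model(0.5),
    "threshold_0.9": make_threshold_model(0.9),
}

# Made-up, highly imbalanced (score, label) pairs: 95 benign events, 5 attacks.
validation = [(0.1, 0)] * 90 + [(0.6, 0)] * 5 + [(0.95, 1)] * 3 + [(0.7, 1)] * 2
test = [(0.2, 0)] * 18 + [(0.55, 0)] + [(0.8, 1)]

# Compare the candidates on the validation set only.
results = {name: evaluate(model, validation) for name, model in candidates.items()}
best_name = max(results, key=lambda name: results[name][0] - results[name][1])

# Touch the testing set once, at the very end, with the winner only.
test_tpr, test_fpr, test_accuracy = evaluate(candidates[best_name], test)
print(best_name, results[best_name], (test_tpr, test_fpr, test_accuracy))
```

On this made-up data the `always_benign` candidate reaches 95% accuracy while catching no attacks at all, which is why the comparison uses true and false positive rates rather than accuracy.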

The testing part gets set aside for the very end, just so you can have a good estimate of your actual performance. Then you're going to train various models on the training data and compare them to one another on the validation set. Finally, the winning model will be evaluated on the testing set to give you a prediction of its performance. The reason we are so careful about how we handle data, also called data hygiene, is that every time you do anything with a data set, you're essentially leaking information from it. Let's say that I set aside a test set, then tested out a model on it, and saw how the model is

performing. I have now leaked out some information there. The more information I leak out about the data set, the more likely I am to cherry-pick a model that performs well on the testing set but not in real life. As a simple analogy, you might know the game 20 Questions, where someone comes up with some sort of object and your job is to ask questions until you can guess what it is, such as "Is it an animal?", yes or no, "Is it a plant?", yes or no, and so on. Each time you ask, you are

leaking some information about the answer. So if you play and you get the answer, you don't want to play again with the same exact object, knowing that it's the same object, because now you can guess it in one try: you already have all the information about the answer. It doesn't mean that you're a better guesser; it just means that you have all the necessary information. The same goes for model development. It doesn't mean that the model is better; it's just that it already knows what the data is like. For that reason, it's imperative to keep up the data hygiene practice. Finally,

once you're done evaluating the model on the testing set and you're happy with its performance, you can take the same model and retrain it on the whole data set, because you no longer care about overfitting and information leaks. Now you can finally deploy it and reap the benefits of your hard work with a self-learning intrusion detection system. So what about network intrusion detection data sets? There are two main ways in which you can obtain data, which, as you might remember, is the basis of your intrusion detection system. The first way is to collect your own data, and the second way is to use already

existing, pre-assembled data sets. Each strategy has its own pros and cons. For example, when you collect your own data, you can make it as specific and relevant to your own circumstances as can be. On the other hand, if you don't have a cooperative relationship with a large cyber security enterprise, it's unlikely that you will be able to collect the right amount and the right type of data, and in that case you should go with the pre-assembled data sets. Another difference is that collecting your own data is obviously more labor-intensive than simply downloading a data set that someone else

has already created. Finally, you want to make sure that the data you collect is also properly labeled, and for that you need to make sure that you actually understand what is going on in the network. In a pre-assembled data set, this has already been done for you by experts in the field. So for someone starting out in this field, it makes the most sense to take the following road. First, understand how to collect your own data, which I will show you in a moment. Then access and learn from pre-assembled data sets, which I will also show you. Once you explore these pre-assembled data sets, you will

understand what type of data you need, what the data looks like, and what size of data is relevant to your own case. At that point you can stick to these pre-assembled data sets or, now having the knowledge, you can start collecting your own data set. If you work at a large enterprise, or work with such an enterprise, then they will have already collected all kinds of this data for you, and you can simply proceed from there. So, moving on from collecting data, let's now discuss some pre-assembled, pre-labeled data sets. The classic data set for IDS is the KDD Cup data set, and this data set was assembled for a

competition, and its setting is a military one, where they had a large network on which they simulated different attacks. This data set has the advantage that it's very accessible and well studied, so you can look to the literature and see different approaches to tackling it and using it for training, feature engineering, and analysis. Another data set is called CIC-IDS2017. It is similar to KDD Cup, except much more recent. It's a little harder to work with, and for obvious reasons it's been less well studied, so I recommend that you only check it out after you've had some experience with the KDD Cup data set.
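To make "pre-labeled data set" concrete, here is a small Python sketch of reading records where each row is a list of features followed by a label. The rows below are made up for illustration and are not actual rows from the KDD Cup data; only a handful of columns are shown, with names in the style of that data set, and the helper `load_records` is hypothetical.

```python
import csv
import io
from collections import Counter

# A small, illustrative subset of columns; the last column is the label.
COLUMNS = ["duration", "protocol_type", "service", "src_bytes",
           "dst_bytes", "num_failed_logins", "root_shell", "label"]
NUMERIC = ["duration", "src_bytes", "dst_bytes", "num_failed_logins", "root_shell"]

# Made-up rows in a KDD-Cup-like layout (NOT real rows from the data set).
SAMPLE = """\
0,tcp,http,181,5450,0,0,normal
0,tcp,http,239,486,0,0,normal
0,icmp,ecr_i,1032,0,0,0,smurf
0,icmp,ecr_i,1032,0,0,0,smurf
0,icmp,ecr_i,1032,0,0,0,smurf
0,tcp,private,0,0,0,0,neptune
"""


def load_records(text):
    """Parse comma-separated rows into dicts and convert numeric columns to int."""
    records = []
    for row in csv.reader(io.StringIO(text)):
        record = dict(zip(COLUMNS, row))
        for name in NUMERIC:
            record[name] = int(record[name])
        records.append(record)
    return records


records = load_records(SAMPLE)
label_counts = Counter(record["label"] for record in records)
print(label_counts.most_common())
```

For the real data, scikit-learn ships a loader, `sklearn.datasets.fetch_kddcup99`, which downloads the data set on first use; counting labels as above is a quick way to see how the classes are distributed before any training.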

Let's do a quick overview of what the data set is like. First, it consists of a wide range of different attack samples as well as normal traffic. You can see there are denial-of-service attacks, user-to-root attacks, remote-to-local attacks, probe attacks, and a lot of normal traffic and normal flows. It is highly imbalanced, as you will see when you actually handle the data. Each event, which you can think of as a data point, consists of a large number of features. Features can be things like the number of seconds, the number of connections, or the duration, and a feature can be a protocol type like UDP,

TCP, or ICMP. Just like when we use Wireshark to collect events, it can be many other things, which you will see, like source port and destination address. In addition, if you survey the literature, you will see that a lot of feature engineering has also been done: people have figured out ways to take the existing features and construct new features that they expect will be even better, even more indicative of whether the connection is malicious or benign. Here you're seeing a collection of these engineered features, things such as whether a root shell was obtained or not, and the number of failed login

attempts. So if you have the domain expertise, you can always take existing data and construct additional features. If the features you come up with are good indicators of compromise, of whether the connection is malicious or benign, then your classifier, the machine learning model, will be able to take advantage of these features and perform even better. Now, going back for a moment to features, in case you're not familiar with this concept: one of the best ways to understand features is the saying, "If it walks like a duck, quacks like a duck, and looks like a duck, then it's a duck." So

here you can think of three features: the walk, the type of sound that it makes, and the appearance. Then finally there is a label, which is what type of animal it is. In this case there are three features which indicate that the label should be "duck". The same thing applies here, or more generally in machine learning classification: you have a bunch of features, and then the label is what the thing actually is.

So thank you very much, guys. I was really happy to be part of this great event, and I hope you enjoyed this session. I'm open to answering questions and discussing this further. Have a nice day.