
Long title — let's just say natural language processing and anomaly detection, and even that's quite long, especially if you've just had lunch. So this is me, otherwise known as the goose. The intent: if it looks like a duck and it quacks like a duck but it needs batteries, you probably have the wrong abstraction. But also, if you haven't figured it out yet, I am the golden goose — so brush up beside me with a badge and apparently my face shows up, which I didn't consent to beforehand, but that's Mike, that's Mike for you. I'm getting used to it; it gets better every year at PyCon. One year Mike blocked my wireless mic right in the middle of the talk, so, good times. So yeah, besides being a bit of a wasp and just kind of an all-over-the-place kind of person, I'm very much a Python dev and DevOps person. We have a meetup — it feels like we died a bit this year because we were slow, but we're still alive, so please attend in future, as we build up towards the bigger events as well. As for the general outline: first of all, I'm not a machine learning expert at all, and I want to pitch this in a way that you can take something home and go experiment yourself.
This is what I've been working with in my Master's — it was mostly on anomaly detection at first, and then it took a bit of a detour, which I'll talk about. The idea is that a lot of Python machine learning tools have become easy for the everyday person to use. Not that with great power doesn't come great responsibility, but you should at least try the tools and see what they can do for you — it's not just the domain of the super nerds, and most of you are probably in that domain already. So, like I said, this is an introductory talk: the research isn't done, I'm not an expert, and there might be lots of words, especially after lunch, that are troubling. Sometimes, especially on a day like this, I feel like I'm talking like this — I haven't slept much, and I might have lead poisoning thanks to the badge, but we'll see about that; only burns on the hands this year, so that's a good sign. Hopefully there's something here you can understand.
So, generally, the problem. At first I just looked at anomaly detection, because for me the holy grail is to find something that you didn't even know about — a new zero-day, something no one has seen before. The problem is that we're building a lot of microservices, we've got huge log volumes, and writing grep and scripts just to analyse logs becomes tiresome. We've got so many sources, and not everything is necessarily in a time series already — so how can we make this a little easier in future? Static analysis doesn't catch everything either: it's still difficult to define normal behaviour for our systems, and the different log formats and the high volume make it a difficult problem to solve. Not that I'm saying the existing methods aren't working — it's about trying to find new ways to do things. So the hypothesis is that natural language processing and machine learning techniques could help scale, identify previously unknown vulnerabilities, and just make your life easier. Because I've given this talk before, I thought I'd briefly brush up on some terms people might not know. System calls: you're talking to the kernel to execute things on the system and to access a resource — things like process creation, memory allocation, devices, the network, file access, and so on; there are even syscalls around protecting the kernel itself. Dynamic analysis: I looked at static analysis first, but from a scale and DevOps perspective I found it interesting to actually run as much code as possible and find the vulnerabilities in the logs afterwards.
After all, with rapid change and people pulling any Docker container off the internet, dynamic analysis is going to be essential: run it in real time, see what it actually does, and check that it behaves the way we expect. There are so many different definitions of artificial intelligence, and it's become such a big thing, that I think it's easier to break it down: we've got natural language processing, machine learning, computer vision, speech, and many others, and we see them all over the place already. Machine learning focuses much more on the data and analytical models. What I was looking at is natural language processing: processing our natural languages in a way that a computer actually understands and can respond to, and deriving data from our own languages — anything you type into Google these days, or even Siri. And lastly — not necessarily lastly — we actually want to classify what we find: we want to group things together, classify them as abnormal or normal, as part of a group or not. Machine learning is used a lot in classification and makes our lives easier there.
And then, for anyone who might not know: when we're talking about logs, it's anything the application saves about what it's been doing while it's running — activities, things that were accessed, errors, execution. An anomaly, in its most basic form, is just a deviation from the norm: something that went wrong, something we weren't expecting. Luckily the human mind is quite good at identifying anomalies, especially visually, but here we want to find anomalies in the data itself. If you look at Elasticsearch, for instance, it already has some anomaly detection built in for time series data — it's not a new thing, though I haven't used it enough myself. Essentially, a lot of our systems do the same thing over and over every day, so we really want to find the things that are genuinely abnormal — not Black Friday being flagged as a DDoS, but actual overloading traffic, or an increase in syscalls that's completely out of the norm. The big thing is that we want some kind of baseline, something that looks normal, and then we find the abnormalities. And we get different kinds of abnormalities — to quickly touch on that: point anomalies, contextual anomalies, and collective anomalies. The hypothesis I have is that even though this data is coming from a machine, to some extent there's human language in it — it's words we can understand — so we can analyse it, to some extent, as a human language. How can we actually do that?
I went and looked at different sources for logs — finding logs that already have anomalies in them was a bit difficult — and I looked at application logs, event logs, and system logs. At the time I decided to focus mostly on syscall logs, especially because you can pick up on a lot: files being changed or opened, sockets being created, processes attaching to sockets. There's a lot of what you might call anomalous behaviour we might be able to pick up based on what the program is doing: is it connecting to the internet, is it exchanging files, is it mapping memory, is it changing or attempting to access other processes? So I'm going to look at the system calls themselves. The problem with syscalls, though, is that there are a lot of events on a system — a lot of logs — but they might give us good insight into what we're looking for. An easy example of a system call is opening a file, just to give an idea of what we'll be looking at later.
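To make that concrete, here's roughly what opening a file looks like in an strace capture — a minimal sketch; the exact arguments and return values below are illustrative:

```bash
# Trace only the file-related syscalls of a command:
strace -f -e trace=openat,read,close cat /etc/hosts

# Typical (illustrative) output:
#   openat(AT_FDCWD, "/etc/hosts", O_RDONLY) = 3
#   read(3, "127.0.0.1\tlocalhost\n"..., 131072) = 220
#   close(3)                                 = 0
```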
For my research I decided to focus on Linux syscalls on x86, because of the data I could access: VirusTotal has a lot of ELF binaries if you ask them for academic access, and thankfully, being at Rhodes and doing my Master's, I was able to get access to quite a lot of binaries. For the most part I used strace just to look at what the binaries were actually doing and get data out of them — strace is easy, and we've all used it at some point. But then I also found Sysdig. I didn't use their whole ecosystem — they have a whole cloud ecosystem that's quite focused on monitoring and security, very much on Docker, which was one of the things I found useful in my research, and they were also focusing on cloud and Kubernetes, so you could extend it quite a bit. It allows you to trace all the syscalls for a specific container, so I didn't have to build a lot more tooling around the container. The only problem is that it has a kernel module, and if you've got secure boot it might give you some trouble. The idea was that I would isolate one of the binaries in a container and then look at what it was doing inside Docker — I'll get to that in a bit.
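As a sketch of the kind of invocation (the container name here is hypothetical; `-j`, `-w`, and the `container.name` filter field are standard sysdig options):

```bash
# Stream the syscalls of a single container as JSON:
sudo sysdig -j container.name=suspect-binary > events.json

# Or save a binary capture to replay and filter later:
sudo sysdig -w capture.scap container.name=suspect-binary
```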
The relevance for me was that it catches things like opening files, accessing memory addresses, or stat'ing things, and we already have some syscalls that we know are problematic — anything that starts another process, say — so we have an idea of what to look at, and we especially know what to look at on the network side. Especially because I was looking at Docker, the problem areas I saw were file system access, network access, and process manipulation; with the Docker vulnerabilities lately — especially because you're running a daemon on the host — the network calls were important to look at. It does have some issues, though: large log sizes (just doing simple testing I had to keep runs very short, otherwise I'd run out of RAM processing them), permissions, and then shipping the logs off the host. Obviously I'm executing malicious binaries, so I wanted to isolate my machine as much as possible: I ran everything in a virtual machine, inside Docker, and hardened Docker as well. So there are some challenges in your environment too, like actually shipping the logs off to a central host so you can process them. Why containers? Not just because I'm really interested in them, but because I wanted to isolate the binary from the noise of the overall system to some extent, so it wasn't aware of everything else running, and to isolate my work machine as well. Also because Docker itself has seen interesting vulnerabilities around the daemon, the runtimes, and the network being attacked. And because I'd found Sysdig, and it works well with Docker, a lot less tooling had to be built around getting the logs straight out of the Docker daemon for the container that's running.
We'll get to some of that next. I also looked at secure computing mode (seccomp) in Linux. Because I'm trying to find things that haven't been found before — things anomalous to normal behaviour — I also want to run things in a limited way: I don't want a container just running as root, the easiest attack surface; we actually want to see if we can find something that's completely abnormal. seccomp lets you block specific syscalls, which is rather handy in that sense. I spoke about this last year in the lightning talks: Google's gVisor — I'll get to gVisor later — already has default seccomp rules applied, plus further abstraction for each Docker container, so the network is abstracted and the process is isolated. But not all cases are accounted for, which is kind of the point: I don't want to block everything, I want to see what's outside the baseline. Here's just an example of a seccomp profile, which you can easily apply to any Docker container if you aren't using something like gVisor when it's running.
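A sketch of what such a profile looks like — an allow-list variant; the syscall list is illustrative, not a hardened profile — applied with `docker run --security-opt seccomp=profile.json ...`:

```json
{
  "defaultAction": "SCMP_ACT_ERRNO",
  "architectures": ["SCMP_ARCH_X86_64"],
  "syscalls": [
    {
      "names": ["read", "write", "openat", "close", "exit", "exit_group"],
      "action": "SCMP_ACT_ALLOW"
    }
  ]
}
```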
So why gVisor? If you're not running a hypervisor or a full virtual machine, you're much closer to the actual host surface — the syscalls, the filesystem, the network. So Google built a very lightweight, fast user-space kernel to isolate the process and some of the network calls, and they've also applied default seccomp rules for things they find problematic. When they released Python 3.7 on Google Cloud, this was the default way their cloud functions actually ran. It looks like this: you've got a shim in between the host kernel and the application, and it proxies system calls and network calls, with a block list as well, so the host kernel stays isolated — not completely isolated, but it adds a level of abstraction. Another reason I looked at gVisor is that they had an interesting list of things they did support and things that weren't supported yet — a hint of which syscalls they still found problematic. Postgres support was only added about six months ago, I think because of the way it uses process IDs and concurrency, if I remember correctly. That also gives me an interesting list of things to use as a baseline: I can run Prometheus, Postgres, or Elasticsearch in a container, treat that as a baseline, and then compare it to something else. gVisor is also interesting because it integrates with Kubernetes. When I initially looked at this research, the vulnerability of the Docker daemon itself would add further attack surface to Kubernetes: everything's an API, and because of the number of services each running as a container, you could potentially hit multiple nodes and multiple workers. gVisor also allows you to scale across multiple containers, across pods. Now, the natural language processing — I hope I'm not going too fast and everyone's still with me. Why natural language processing? I came across a couple of interesting articles and tutorials, especially around log noise — people doing research about the vast amounts of logs in different formats — and also research along the lines of looking not just at application logs but at syscalls and other sources of security incidents.
And there's the fact that it can parse unstructured messages, and lets an infosec or sysadmin or DevOps person further enhance the processing of logs, and the speed of it. At the end I'll have links to an interesting exercise and tutorial around this; it's primarily built around CI, not so much infosec, but it's a good example to get going with. The big thing is that we want to establish some kind of baseline. For my own baselines I looked at just running Ubuntu by itself, running Elasticsearch, and running Postgres for a while, under some kind of sane workload. The problem, obviously, is that we're assuming that workload is safe at that point in time — a potentially dangerous assumption, but one we have to make. Then we transform the data — the natural language — into something we can process, train a model, have it learn, tweak the model, and once we've got a model that works, we test it against the anomalies coming in, and then constantly add to the baseline and retrain. The first step is to tokenize the language coming in: we want to put it into a format the machine learning algorithms can use. Here's a basic example: we want to look at frequency and where in the document each term is. We've got two documents, "the quick brown fox" and "jumps over the lazy dog", and we build a vector of where each word occurs.
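In scikit-learn terms, that's what a simple count vectorizer does — a minimal sketch of that exact two-document example:

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the quick brown fox", "jumps over the lazy dog"]
vec = CountVectorizer()
X = vec.fit_transform(docs)          # 2 x vocabulary-size sparse matrix

print(vec.get_feature_names_out())   # the vocabulary it discovered
print(X.toarray())                   # term counts per document
```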
But I looked at multiple tokenizers, because there's a lot of information in each syscall and we want different ways of pulling out the words and the meaning. A very simple one, very easy to implement, is whitespace tokenization: you just break the text up on the whitespace between each item. Another interesting one I found is the Treebank tokenizer — it's trained on a lot of words from newspaper articles and is rather good at picking things up. But at the time I chose the hashing vectorizer, as it was efficient for the problem I was trying to solve: when we tokenize, we create a hash of each word or item we pick up and put it into a sparse matrix encoding, and that's how we break it down so we can process it later.
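A minimal sketch of that step — the log lines here are made up, but the vectorizer usage is standard scikit-learn:

```python
from sklearn.feature_extraction.text import HashingVectorizer

# Stateless: tokens are hashed straight into a fixed-width sparse vector,
# so there's no vocabulary to keep in memory as the logs grow.
vec = HashingVectorizer(n_features=2**18)
X = vec.transform([
    'openat(AT_FDCWD, "/etc/passwd", O_RDONLY) = 3',
    'connect(4, {sa_family=AF_INET, sin_port=80}) = 0',
])
print(X.shape)   # (2, 262144)
```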
As an example — and in the tutorial you can follow as well — we just map some of the data to see what it looks like, with the green dots being what we call our baseline and the red dots being what we're testing, so we can see where it's classifying things. I also looked at the term frequency–inverse document frequency (TF-IDF) vectorizer — looking not only at the term frequency in each line but across the document as well — but I haven't used that any further. Then, obviously, the point of the research is to classify the data: once we've processed it into a usable form, we want to say this piece of data is grouped with, or classified alongside, other data. One of the classic examples where machine learning is used is spam detection in your email. It's quite a simple case: for the most part we want to classify something as either spam or not, so when we train it we're essentially asking, do we think it's spam or not, and how confident are we? We have a bunch of classification algorithms, and a big part of my research is looking at the ease of implementing them. I don't want something overly complex — I'm not trying to save the world, and I'm not trying to make something people can't use. It has to be simple; it doesn't have to be the most accurate, but we need some point from which to start looking at things. So for the most part I've used k-nearest neighbours: I want to look at how close one data point is to another in order to classify them. The easy way to remember it: birds of a feather flock together. We want to classify clusters — essentially you've got a big database of your baseline, and you ask where a new point lies within that baseline.
So we store the entire training set, which makes it easy to implement — you don't have to constantly train and re-learn too much, and it makes predictions just in time — but, as my laptop knows, it can take quite a bit of processing. Like I said, I wanted to keep this simple: I'm not a machine learning expert myself, so I wanted something I could learn quickly, get results with in the research, and then take further.
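A toy sketch of the idea with 2-D points, using scikit-learn's NearestNeighbors — which really does just store the training set and compute distances at query time:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

baseline = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
nn = NearestNeighbors(n_neighbors=2).fit(baseline)  # "training" = storing

dist, idx = nn.kneighbors([[0.5, 0.5], [9.0, 9.0]])
print(dist)  # the second query point is far from everything in the baseline
```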
A visual representation of what the algorithm does: it tries to classify things based on how far they are from other data points. We add a new example — a new data point — and try to classify it based on what's around it, choosing a number of neighbours k to use in the classification. To see how that works with a log line: it works out the distance of each new line from the baseline. We've got a baseline that we assume is secure and working, and then we add a new line — some event that happened, like accessing a file or accessing the network — and we work out how sure we are that the event is outside the norm, or close to something else. The example data doesn't say anything much about what actually happened, but it shows how it's represented in the example I talked about earlier: for each new line coming in, it gives you a k-NN score and also the closest line — the closest data point — from the baseline. I also looked at the random forest, but I'll leave that for further research: it's a rather interesting classification algorithm that works really well, but I just haven't had time to look at it much.
So, to go back: largely, we've got a baseline; we take the log lines — each line in the document — and transform them into a matrix we can use in the model; we test the model to see how accurate the baseline is; and then we add new lines and see whether each is an anomaly or not. That's largely the focus here — and just to remind you, we're using the hashing vectorizer and then classifying with k-nearest neighbours. At the end we'll see if the demo still works, because I haven't made any sacrifices, so you might be sacrificing me at the altar. The contribution is to show that simple tools and simple techniques, which already have impact and success in other fields, can be applied here to see if we can find anything completely out of the ordinary.
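Stitched together, the whole pipeline is only a few lines. This is a sketch of its shape under those assumptions, not the actual notebook code, and the file names are hypothetical:

```python
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.neighbors import NearestNeighbors

# One syscall event per line, already stripped of JSON metadata.
baseline_lines = open("baseline.log").read().splitlines()
new_lines = open("run.log").read().splitlines()

vec = HashingVectorizer(n_features=2**18)
nn = NearestNeighbors(n_neighbors=1).fit(vec.transform(baseline_lines))

# Distance to the closest baseline line = a crude anomaly score.
dist, idx = nn.kneighbors(vec.transform(new_lines))
scored = sorted(zip(dist[:, 0], new_lines, idx[:, 0]), reverse=True)
for score, line, i in scored[:10]:
    print(f"{score:.3f}  {line!r}  closest: {baseline_lines[i]!r}")
```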
Obviously I had to get sources of data — I spoke about that earlier: ELF binaries from between 2017 and 2019, including everything from botnets on IoT devices and routers to ransomware. A really mixed bag, so I had to filter through it; even though it's all just labelled "ELF binary", you get other architectures in there as well, so I filtered it down. Here's an example of Mirai that I found in there — it wasn't applicable, but it's an interesting example of what they already had. And like I said, if you want to try something like this, there's a really good example called "Quiet log noise with Python and machine learning". It's very much focused on CI, so it's not a security-focused thing, but you can use the same examples and the same logic and apply them to security. It goes step by step through everything it's doing — the steps, the theory it's applying — and it's a Jupyter notebook you can just download and use. So, let's see.
OK, so this isn't a very specific example: I've just taken Ubuntu as a baseline, run it for some time, and then used one of the ELF binaries as the test case. Maybe just to talk about the setup, since there's a slide about it: I used QEMU virtualisation with KVM and Ubuntu, with Docker running inside, and gVisor as well. gVisor is as simple as compiling a binary, adding it to a config for Docker, and adding it as a runtime.
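The Docker side of that, roughly (the runsc path depends on where you installed the binary): drop the runtime into /etc/docker/daemon.json, restart Docker, and start containers with `docker run --runtime=runsc ...`:

```json
{
  "runtimes": {
    "runsc": {
      "path": "/usr/local/bin/runsc"
    }
  }
}
```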
So I've got the virtual machine running, Docker isolating things, and then Google's gVisor: I'm trying to create as much of a funnel as possible so that anything strange pops out, while also trying to protect the machine itself. This is based on Sysdig's representation of the syscalls. Oh great — just give me one second. Of course, of course: the sacrifice wasn't made, and the demo gods are not happy. Let's see... oh yes, that's massive.
OK, I think we're back. So again: an Ubuntu baseline, and one of the malicious files we have. We're purely reading each event in as a separate line into a list so we can process them. Sysdig does love adding JSON and a whole bunch of other information at the beginning, so I'll clean that up — but what you can see is all the various syscalls happening while it's running. That's just the baseline; then I cleaned it up and removed all the JSON data from Sysdig so that we just have the syscalls we're looking at. So we've got the baseline file and the run file.
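A sketch of that clean-up step — note that the field name is an assumption about the JSON shape Sysdig emits, not a documented schema:

```python
import json

def load_events(path):
    """Read sysdig output one line at a time, keeping only the event text."""
    events = []
    with open(path) as f:
        for raw in f:
            raw = raw.strip()
            if not raw:
                continue
            try:
                obj = json.loads(raw)
            except json.JSONDecodeError:
                events.append(raw)            # plain text line: keep as-is
                continue
            if isinstance(obj, dict):
                # "evt.info" is a guess at the useful field; adjust it to
                # whatever your sysdig output actually contains.
                events.append(obj.get("evt.info", raw))
    return events

baseline = load_events("baseline.json")
```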
Looking at it, I've limited the length of the data I'm using, just because I ran out of RAM quite a few times — obviously Amazon's amazing, you can just spin up massive machines until you forget about them and the credit card takes the hit — but ideally, if you're starting to look at syscall data, you either want to clean it up, removing things or looking at specific parts of it, or get more efficient algorithms, or a huge machine. So, looking at the shape of it — it's going to be difficult to read this, and I've even removed some of the data so you can see better. Probably because it's running in a Docker container, we see a lot of the green data all in the same place, and the red, obviously, is what the malicious binary was doing and where it landed in the vector space. Then we look at the size, fit it, and once we've fed it to the model — the k-nearest neighbours — we look at the distances and the various examples found in the baseline. This isn't a good example — it's not an accurate one at all — but I've added a little more info at the bottom: here we're looking at the distance for a new item, a comparative example in the baseline, and then the actual item we tested. Like I said, this isn't a very good representation yet; it's purely something to show at the moment. So, obviously, pretty dirty-looking graphs, but all of it can be tried yourself. I'll try to do a follow-up next year once I've got more results, as I'm still writing the Master's and working on it. But the amount of code you have to write, the amount of work it takes, the speed at which it runs — so far it's looking good; I just want more concrete results before we talk about it. And that's pretty much it — I suppose that's probably about a seven out of ten.
Yes — any questions? OK, so the question is: if green is the good binary and the red dots are the bad, what methods am I using to look into it in a more isolated way? I've done none of that in this example, but what I've started doing is applying more seccomp rules to the Docker container and limiting the binary some more, because a lot of the red dots are potentially just things like memory being mapped as the binary is executed, or the binary being opened and read. So I'm still busy working on the isolation part, and also on other ways of visualizing it, because it's still quite difficult to look at. For the most part, with the k-nearest neighbours algorithm, you start looking at the classifications for specific things and then dig into those to see which ones are problematic. But it's also a bit of a shot in the dark, because I'm trying to find things that haven't been seen before. Part of the difficulty is that you also want to look not just at a single line but at where in the document that line is: if you think about any binary, it's executing multiple steps — a vulnerability isn't one syscall or one action. So that's what I'm struggling with: once I find something, retracing the rest of it through the document. There's still work in that, and for the most part it's still a lot of manual searching and human intervention on my side. As for the research — sure, it would be lovely to find something completely automated, but at the moment it's very much: find an anomaly, like with Elasticsearch's machine learning features, and then actually go deeper and look at the data itself. Especially with time series, at least you can slice that piece of time and look at it more closely. One of the assumptions I'm making is that a lot of the binaries — the malware — are going to try to do something immediately to establish a foothold, and then I'm trying to look for things that happen further down, periodically, as well.
OK, if I had to summarize that: endpoint detection and things like that already look at anomalies, so how does this differentiate? At the moment what I'm looking at is mostly just point detection: if we look at a classification, these two points are so far from the rest of the points that something's wrong. If I think about it, Microsoft has — what's it called again — its antivirus on your machine, and SentinelOne has something similar for Linux, from what I learnt talking to someone. That's a bit more of a contextual anomaly: you're looking at anomalies happening across events, not just one single one — you're on your Windows machine, you got an email, and something happened. At the moment I'm merely looking at identifying a single problem, with some human intervention after that. A lot of the current products out there, I'd say, look very specific to a platform or something; I'm trying to be generic. I'm focusing on syscall logs for Linux, but ideally, in the long run, any kind of human-readable text: ingest it and look for weirdness. But then the context is lost, right? And I think that's where endpoint detection works fairly well, because you know the hardware you're running on, there might be a certificate tying this to that machine — if that answers your question. Not at all, at the moment — I just looked at making my own data, because finding existing data is hard. VirusTotal was a fairly decent source, because they had logs and they had the binaries, but I haven't looked at it deeply enough. Right now I'm looking at binaries and just seeing whether there's anything someone might have missed, then looking at it further and building something around that.
So, to some extent: how do I account for different things running in the same environment? Partly that's where Docker comes in, because I'm looking at a single isolated process. But that would probably get into the contextual stuff, where you'd ask why the database server is doing something similar to what the application server is doing — an anomaly — and I haven't really looked at how you would differentiate that at all. There's also a lot to be said for just removing noise: applications start in a certain way, and even if it's a malicious binary, some of the syscalls will be the same. It depends — I tried not to just go for something like network logs, pcap files, and so on, but ideally, if you look at pcaps and you do it on a single network device, with everything running through that same device, that would be one way of doing it. Here, though, I'm on a system in a container: I don't know what's around me. I'm on the host, but the actual container I'm looking at is isolated from the rest — well, that's the assumption I'm trying to make, because I want the baseline to be safe, not already communicating out. And that's part of the problem: potentially a container could already be talking to something else — and can you then use the Docker daemon's TCP socket to get even further? That's a hard one; I'll think about that one a bit.