
Mining Software Vulnerabilities in SCCM with NIST's NVD: Data Challenges and Machine Learning Solutions

BSides Las Vegas · 2017 · 24:01 · 156 views · Published 2017-09 · Watch on YouTube ↗
About this talk
A technical deep-dive into matching complex SCCM inventory data against NIST's NVD vulnerability database using machine learning. Loren Gordon covers the challenges of dealing with unstructured registry data, the structure of NVD datasets, and a divide-and-conquer approach using fuzzy matching and scikit-learn to automate vulnerability correlation at enterprise scale.
Original YouTube description
GT - Mining Software Vulns in SCCM / NIST's NVD - The Rocky Road to Data Nirvana - Loren Gordon. Ground Truth, BSidesLV 2017 - Tuscany Hotel - July 26, 2017
Transcript [en]

I'm Loren Gordon, and we're going to be talking about a software project that used simple machine learning to mine fairly complex data: SCCM inventory data and NVD vulnerability data. I'm going to start off with a quick overview of the production network where this was living and working, and then dive right into the challenges, because the one takeaway that I think you should remember from this is the same thing we heard yesterday: know your data. Get intimate with the data, and that makes the machine learning work much better. So a good part of the presentation is going to be talking about these complex data structures: what they are, and the difficulty of mining them.

I'll also talk about the dirty, unstructured data that I had to deal with, and then finally we're going to talk a little bit about people issues, which was interesting. I'm a technical security architect at Ubisoft; I've worked at a number of other places, a major world-class telco, et cetera. I just have a passion for everything that's technical security. Everything that I say is my personal opinion and has nothing to do with my company, of course. Now, the network: it's a worldwide network. There are maybe 11,000 team members right now, spread across 18 countries and 26 studios, and it's studio-centric. The interesting part about this network is that the company encourages creativity, so the developers and the individual studios have their own environments and their own software, and local IT has a lot of autonomy.

This poses an interesting challenge for patch management, because the question becomes: software is moving in, software is moving out, vulnerabilities are being published every day, so where is the vulnerable non-Microsoft software, and on which hosts, today? So here was a great idea, I thought at the time anyway: Microsoft SCCM has reliable inventory data, we already have an agent, and we don't need to install another one. And NIST's NVD has up-to-date vulnerability data, maybe not exactly all of the vulnerabilities, but a good, significant list of vulnerabilities.

Put the two together: we already have the data, so let's do some good patch management with it. This avoids expensive licensing, because we're using free, public data. The vulnerability data became a decent flat-file feed that was fed into a back-end big-data mining application, so this was part of a larger project. And someone told me this is impossible to do: you can't take the chaotic registry data and actually match it with the formalized, structured NIST NVD data. I kind of like it when someone says that, because it gives me a challenge. So let's talk about the complex data structures, very quickly if possible. Microsoft System Center Configuration Manager is the application that people love to hate: it's indispensable for the management of enterprise-scale Windows networks.

It has a back-end Microsoft SQL Server database that is very, very complex: there are sixteen hundred tables in the thing and sixty-two hundred views, and everything runs as little distributed components, with WMI tightly integrated into it. There's a quick list, which you can't really see, of all the components, about fifty-odd of them. They're mostly DLLs running in threads, and there are also some services. They communicate among themselves with flat files that move through disk-based inboxes and outboxes, and they also use in-memory queues. Now WMI, as I said, is tightly integrated into SCCM, and everything is architected using WMI.

Client-side, the agent talks to the managed host using WMI; server-side, SCCM talks to the client with WMI. The server exposes a WMI interface, so the key objects are available through WMI, and the console application that you see, the management application, is a WMI application. SCCM populates its database using six discovery methods: four that target Active Directory, one that is SCCM talking to the client, and a last one that is SCCM going out onto the network. Active Directory is the one that interests us most; the four methods look at forests, groups in Active Directory, users, and hosts. The heartbeat discovery is the only one that's mandatory, and that's SCCM talking to the client: is the agent installed, is everything working well?

It pulls in a little bit of data at that point, and it runs every seven days by default. The network discovery is disabled by default, and that pulls in all kinds of data that you wouldn't associate with SCCM, things it's pulling in from DHCP servers, SNMP services, and so on. The thing to understand, and this is true of any data-mining effort, is garbage in, garbage out. If the Active Directory has data that's extraneous or not reliable, and that's pulled into SCCM, that's not good; you can't rely on it. So the important thing that I found was to make friends with your SCCM administrator.

There are six discovery methods: which ones are enabled in your specific production environment, and what are the polling intervals? Is it going and getting the data every month, or every day? Is the data reliable? The administrator's career depends on knowing which data is good and which data is reliable, so he's a really good person to talk to. Also, get hands-on and explore: I spent many wonderful hours with Microsoft SQL Server Management Studio looking inside this database and looking at the different views. Active Directory can also be used to augment the inventory data; we do this in the tool that we're going to see briefly at the end of the presentation.

Also, Google is your friend, and the Safari technical library has a number of good books on SCCM; one book especially has good SCCM internals. To get at the SCCM data, the best approach is to query the SQL database directly, not through WMI. This is simpler, more direct, and probably performs better. I also learned that you have to go after the views and not worry about the tables: the views are more stable, they're better documented, the community works with the views, and the permissions are already in place. Also, if you go after the tables directly, you can maybe lock the production database, and you're not going to make friends with the ops people if you do that.

Microsoft has also done some heavy lifting: they distribute stored procedures that populate additional data into the views, so the views are the way to go. We can see the WMI influence in the SCCM views: the WMI class name becomes the SCCM view name. It's truncated at 30 characters, but you can still recognize the actual class. The WMI property names become the column names in the SQL views, and the column names have a zero appended to avoid conflicts with reserved words. The inventory data that SCCM gets from hardware inventory discovery goes into views prefixed with v_GS, and the discovery data goes into v_R views for the scalar properties and v_RA views for the arrays.

There are also views that hold metadata about the other views; for instance, v_SchemaViews lists all the views and what kind they are. One of the important views is the one that gives us the software, and the one that turned out to be useful out of all of the sixty-two hundred, well, there are actually two: v_GS_ADD_REMOVE_PROGRAMS and v_GS_ADD_REMOVE_PROGRAMS_64. Be careful, because there's another set of data that WMI pulls into similar views. These views are populated from the uninstall keys that go into the registry; the other set of data comes from programs that are installed with the Windows Installer, and that's only a subset of all of the installed software.

We're looking to do patch management, so the uninstall-key views give us a much more complete picture of all the software. There are also views we use to find the hosts: v_R_System is the one to use in most situations. It populates from all the different discovery methods, and there are about sixty columns in that view, very, very useful. There's also v_GS_SYSTEM, and as I was hitting these different views I was wondering, should I use one or the other? They both have the host information, but v_GS_SYSTEM only populates when the hardware inventory runs, so the agent has to be there, has to be installed, has to be active, and it's less accurate for those reasons.

Also, it only pulls in about ten fields, so basically it's a no-brainer: v_R_System is the one to use. All this data in the preceding section was really hard to find; I have some references at the end, and I'm going to put these slides on the internet for your reference.
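To make the direct-SQL approach concrete, here is a minimal sketch of pulling the installed-software inventory out of the views just described. The server and database names are placeholders, and the exact column list is an assumption based on the standard ConfigMgr reporting schema (note the trailing zeros on the column names).

```python
# Minimal sketch: read installed software from the Add/Remove Programs views
# and join to v_R_System for the host name. Server/database are placeholders.
import pyodbc

QUERY = """
SELECT sys.Name0        AS host,
       arp.Publisher0   AS vendor,
       arp.DisplayName0 AS product,
       arp.Version0     AS version
FROM   v_GS_ADD_REMOVE_PROGRAMS arp        -- populated from registry uninstall keys
JOIN   v_R_System sys
       ON sys.ResourceID = arp.ResourceID  -- discovery data, ~60 columns
UNION
SELECT sys.Name0, arp64.Publisher0, arp64.DisplayName0, arp64.Version0
FROM   v_GS_ADD_REMOVE_PROGRAMS_64 arp64   -- 64-bit uninstall keys
JOIN   v_R_System sys
       ON sys.ResourceID = arp64.ResourceID
"""

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=sccm-sql.example.com;DATABASE=CM_XYZ;"  # placeholder site database
    "Trusted_Connection=yes"
)
rows = conn.cursor().execute(QUERY).fetchall()
```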

Now, NIST's NVD is fairly well known. It's a formalized, structured format; the stable version is XML, and they also have a beta JSON version now. There are two main NIST data sets we want to look at. The first is the CPE, which is a list of all the vendors and products; it's a single file that lists all of those. The second is the CVE data: there's one CVE file for each year, and it has the list of all the vulnerabilities. So let's take a quick look at the CPE file. It starts with a header with the version and date, and then here's a typical item. The first thing that you see at the top is a title, a human-readable description of the product and the vendor, and then the cpe-item is the one that's interesting, because it has all the structured, formalized data.

We'll go through it very quickly. The first field is the dictionary version, and then the part, what NIST calls the part number: 'a' is for application, 'o' is for OS, and 'h' is for hardware; we're only interested in the 'a' entries. Then come the vendor, the product, and the versioning, and, if the vendor does this kind of versioning, the updates, the service packs, the minor versions. This particular item was for a WordPress plugin; I specifically chose it so that you can see that the target software is also mentioned in the description.
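For illustration, here is a tiny parser for the CPE 2.2 URI fields just described; the example URI is made up, and the field order follows the 2.2 specification.

```python
# Sketch: pull the structured fields out of a CPE 2.2 URI.
CPE_FIELDS = ("part", "vendor", "product", "version", "update", "edition", "language")

def parse_cpe22(uri: str) -> dict:
    assert uri.startswith("cpe:/")
    values = uri[len("cpe:/"):].split(":")
    return dict(zip(CPE_FIELDS, values))   # zip tolerates short URIs

item = parse_cpe22("cpe:/a:oracle:jdk:1.7.0:update_55")  # illustrative URI
assert item["part"] == "a"   # 'a' application, 'o' OS, 'h' hardware
print(item["vendor"], item["product"], item["version"])
```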

Now, the NVD is a list of vulnerabilities. In a typical NVD entry there are three separate parts: the CVE, which is the basic vulnerability information; the CVSS, which has the impact information; and the CWE, which has the augmented vulnerability description. Let's take a look at a typical NVD entry. The first piece is the CVE, and we see the CVE ID, which uniquely identifies this vulnerability; it's all been tagged and named by NIST and MITRE. Then comes the CPE entry, so the vulnerability entry points back to the vendor and product information from the CPE file. The second section is the CVSS, which has the vulnerability impact: we see that this particular vulnerability has network access, medium complexity, and if it fires it's going to completely compromise integrity.
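A hedged sketch of walking one year's CVE feed and pulling out each CVE ID with its CPE references: the element names are those of the 2.0-era XML feed as best I can tell, matching on local names sidesteps the schema namespaces, and the filename is a placeholder.

```python
# Sketch: stream one year's NVD 2.0 XML feed and collect CVE -> CPE links.
import xml.etree.ElementTree as ET

def local(tag):
    # '{http://...schema...}cve-id' -> 'cve-id'
    return tag.rsplit("}", 1)[-1]

for _, elem in ET.iterparse("nvdcve-2.0-2017.xml"):   # placeholder filename
    if local(elem.tag) != "entry":
        continue
    cve_id, cpes = None, []
    for child in elem.iter():
        if local(child.tag) == "cve-id":
            cve_id = child.text
        elif local(child.tag) == "product":   # vulnerable-software-list items
            cpes.append(child.text)           # cpe:/a:vendor:product:version
    print(cve_id, len(cpes))
    elem.clear()   # keep memory flat on big feeds
```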

The last section is the CWE, which has HTTP references to the vulnerability information and also a description of the vulnerability in human language, so that we can understand what it does and how it works. The CVE data is available as a daily feed. As I mentioned, in the tool we pull in the XML, because that's the stable format. It's available compressed, gzip or zip, and there's a meta file that they provide; the meta file has the SHA-256 hash and also the file size, so you can pull down this little meta file and find out whether you should pull down the full feed or not.
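Something like the following is all the meta-file check needs; the feed URL is illustrative of the 2017-era XML feeds, and the field layout assumes one name:value pair per line.

```python
# Sketch: fetch the tiny .meta file and compare its SHA-256 to what we hashed
# last time; only pull the full feed when it has actually changed.
import urllib.request

META_URL = "https://nvd.nist.gov/feeds/xml/cve/nvdcve-2.0-2017.meta"  # illustrative

def feed_changed(last_sha256: str) -> bool:
    meta = urllib.request.urlopen(META_URL).read().decode()
    # Lines look like 'sha256:ABC...' and 'size:12345'
    fields = dict(line.split(":", 1) for line in meta.splitlines() if ":" in line)
    return fields.get("sha256", "").strip().lower() != last_sha256.lower()
```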

Okay, all of that was the data, so now you can understand its complexity. Now, how do we get the inventory data out of SCCM and match it to the formalized NIST vendor, software, and vulnerability data? First of all, a wise choice of tools is important, and secondly, a divide-and-conquer approach was the key to success. The tools are the usual suspects: Python, scikit-learn, Docker, pandas. Above all, the basic approach was to keep it native: I used Windows to talk to Windows, meaning SCCM and Active Directory, and that simplified things, and then Linux for the more Linux-friendly things like Docker, pandas, and scikit-learn. Also, we only looked at third-party software, because that's what really interested us in our particular environment; we have a whole bunch of it, and the Microsoft vulnerabilities are left to the Microsoft management systems.

The divide-and-conquer approach was basically: match the vendors first, and then match the software, because each vendor has his own list of software. In the registry, each vendor has a set of installed software, and in the CPE data file it's the same thing: there's a vendor, and he has a list of software. So we get the vendors right, and if we can do that fairly accurately, then the software becomes fairly easy to match. It becomes two separate, simple machine-learning classification problems, essentially.
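The shape of that divide-and-conquer pass might look like the sketch below, where match_vendor and match_product stand in for the trained classifiers; the function and parameter names are hypothetical.

```python
# Sketch: resolve each registry vendor to a CPE vendor first, then only
# compare products within that vendor's catalogue.
def correlate(registry_inventory, cpe_catalogue, match_vendor, match_product):
    matches = []
    for reg_vendor, reg_products in registry_inventory.items():
        cpe_vendor = match_vendor(reg_vendor, cpe_catalogue.keys())
        if cpe_vendor is None:
            continue   # get the vendor wrong and nothing below can match
        candidates = cpe_catalogue[cpe_vendor]
        for reg_product in reg_products:
            cpe_product = match_product(reg_product, candidates)
            if cpe_product is not None:
                matches.append((reg_vendor, reg_product, cpe_vendor, cpe_product))
    return matches
```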

And the data is small: it's not a big-data problem at all, it fits on one PC. So the data can be manually labeled: this is a match, this is not a match, this is a match. Very tedious, but really useful. The features were extracted using fuzzy matching and string lengths. So here's some sample vendor data. We can see the registry data from SCCM is all over the map: different capitalization, words that don't help us like 'LLC', all kinds of variations on names. The CPE data from NIST is very, very formalized and structured, and the challenge is first of all to match these two data sources together.

Also, in the registry data a vendor can have maybe five or six different names. For instance, in the production data I pulled just before coming to BSides, there are about six different ways of naming Oracle in the registry data. So the basic approach is standard machine learning: tokenization, throw away the stop words, and then pull out the features. For tokenization we had to be careful about separators to split things properly into tokens; we're also dealing with different languages, it's Unicode, and we had to watch which exact separators we used to get the proper tokens.

The stop words are words that don't add anything to the matching, things like 'project', 'software', 'limited'; basically, when you find them, you throw them away. Levenshtein edit distance, the basis of fuzzy matching, is the number of single-character transformations, add, remove, change, needed to get from one string to another. There's a very nice package called fuzzywuzzy which calculates simple ratios: whether the two strings are simply a match, or whether one is a subset of the other. It breaks the strings into tokens and looks at sets of tokens to see how good the match is. We used all of these ratios in the tool, and that gives a very nice feature set; we also used the string lengths of the different names.
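A minimal sketch of that feature extraction, assuming a hand-picked stop-word list (the list below is illustrative) and the four standard fuzzywuzzy ratios plus string lengths:

```python
# Sketch: tokenize, drop stop words, then compute the fuzzywuzzy ratios and
# string lengths as the feature vector for one (registry name, CPE name) pair.
import re
from fuzzywuzzy import fuzz   # pip install fuzzywuzzy[speedup]

STOP_WORDS = {"software", "project", "limited", "llc", "inc", "corporation"}

def tokenize(name: str):
    # Split on whitespace, punctuation and underscores (re is Unicode-aware).
    tokens = re.split(r"[\s\-_.,/()]+", name.lower())
    return [t for t in tokens if t and t not in STOP_WORDS]

def features(registry_name: str, cpe_name: str):
    a, b = " ".join(tokenize(registry_name)), " ".join(tokenize(cpe_name))
    return [
        fuzz.ratio(a, b),             # plain edit-distance ratio
        fuzz.partial_ratio(a, b),     # best matching substring
        fuzz.token_sort_ratio(a, b),  # order-insensitive
        fuzz.token_set_ratio(a, b),   # set-based, ignores duplicates
        len(a), len(b),               # the string-length features
    ]

print(features("Oracle America, Inc.", "oracle"))
```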

Some observations. As we were labeling the data, we realized that, as I mentioned, matching the vendors accurately is crucial: if we match vendor A to vendor B, the software is just not going to match. Also, the data set is small, about 10,000 vendors altogether, so we can actually go into the vendor data set and start manually labeling at least the vendors that we think are important, and a lot of them. This really helps to drive the accuracy up, because of course you use the manually labeled data together with the machine-classified data to do the final vendor matching.

Which algorithm do we use? Well, we used simple k-fold cross-validation, obviously. This means splitting the training data into k sets; you use k-1 sets to train the algorithm and then validate the algorithm's performance against the labeled set that wasn't in the training data, and you rinse and repeat for different algorithms, which gives you an idea of each algorithm's accuracy. No surprise: the random forest classifier was one of the best. This is essentially a randomized forest of decision trees, where the estimator is the average of all the separate trees.
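The bake-off could look like this sketch, using scikit-learn's cross_val_score; the candidate list is illustrative, and synthetic data stands in for the labeled (features, is_match) pairs.

```python
# Sketch: k-fold cross-validation over a couple of candidate classifiers.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Stand-in for the manually labeled feature vectors (6 features as above).
X, y = make_classification(n_samples=1000, n_features=6)

candidates = {
    "random_forest": RandomForestClassifier(n_estimators=100),
    "logistic_regression": LogisticRegression(max_iter=1000),
}
for name, clf in candidates.items():
    scores = cross_val_score(clf, X, y, cv=5)   # 5 folds: train on 4, validate on 1
    print(name, scores.mean())
```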

To tune the algorithm, we did a randomized grid search with cross-validation, very, very simple. The randomized search is basically defining different parameter values or distributions and then doing a randomized search over them, using cross-validation to evaluate each draw. The software matching came in at around 98%, which is about what you'd expect from a randomized forest classifier. And that was the machine learning piece.
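A sketch of that tuning step with scikit-learn's RandomizedSearchCV; the parameter ranges and iteration count are assumptions, not the talk's actual settings.

```python
# Sketch: randomized search with cross-validation over random-forest settings.
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = make_classification(n_samples=1000, n_features=6)  # stand-in labeled data

param_distributions = {
    "n_estimators": randint(50, 300),
    "max_depth": randint(3, 20),
    "min_samples_split": randint(2, 10),
}
search = RandomizedSearchCV(
    RandomForestClassifier(),
    param_distributions,
    n_iter=25,   # 25 random draws instead of the full grid
    cv=5,        # cross-validate each draw
)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```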

The interesting thing we found, though, is that there's also a ton of dirty data, and the data wrangling was the other big aspect; this surprised me, not being a data scientist. The Active Directory had about a thousand extraneous hosts; SCCM doesn't manage everything, and laptops disappear, flee the network. The versioning in CPE varies wildly from vendor to vendor, and Java was the worst example. Unicode was also a challenge. Lots of hands-on time with the data. Use defensive coding, obviously: validate all input, triage missing data, initialize missing data or get rid of it, because otherwise it causes problems. And get rid of the extraneous data as fast as you can: the Microsoft data that we're not looking at, the duplicated AD entries, et cetera. Then we also discovered that heuristics were really useful to speed up the matching and make it more accurate. For instance, if a token was only one or two characters long, we threw it away.

If the first word of the CPE vendor string was in the tokenized registry string, that meant it was probably a match; if it wasn't in there, it was probably not a match. For products, we had release information on both sides, so we required at least a partial match between the release information. Cheating like that sped everything up and also made it much more accurate. And when all else failed, code around the obstacle: for the Java versioning we just had to write code to handle it.
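Those heuristics might be coded roughly like this; the helper names and the underscore-separator assumption for CPE vendor strings are illustrative.

```python
# Sketch of the heuristic short-circuits: drop tiny tokens, and require the
# first word of the CPE vendor to appear in the tokenized registry string
# before paying for the full feature extraction and classification.
def useful_tokens(tokens):
    return [t for t in tokens if len(t) > 2]   # one- and two-char tokens add noise

def plausible_vendor_match(cpe_vendor: str, registry_tokens: set) -> bool:
    first_word = cpe_vendor.split("_")[0]      # CPE vendors use '_' separators
    return first_word in registry_tokens       # absent -> almost certainly no match

# Only run the classifier on pairs that survive the cheap check:
# if plausible_vendor_match(vendor, tokens): score = clf.predict_proba(...)
```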

The people issues, I'll just talk about briefly; it was surprising, I wasn't expecting this. I took my great idea to the ops team. They were kind enough to meet with me; we had a conference, an on-site meeting, and the most important person, the ops architect, was six time zones away. It was the end of his day, he speaks French, and the meeting was in English, so you can imagine what happened. And he's the man on the wall, of course, on the conference screen, while literally everybody else is having a grand old time with me in the room. So this was disastrous: I talked technology instead of presenting from an ops point of view, the SCCM architect didn't really connect, and everything sort of died. And then the VP came to town. He heard the presentation, he loved it, blessed it, wanted his dashboard for yesterday, the typical VP, and it had to cost nothing and take no resources.

The ops people rapidly became concerned with all of this, because it was going to give visibility at C level. They started making noises about SCCM production database performance, and this was totally understandable. Instead of direct production access, they suggested that I use a secondary, non-production DB that they use for reporting and queries. This was a nice little siding where they were going to shunt me off the main line: it turned out that this data underwent arbitrary, black-box ETL transformations on its way from production into the reporting database that made the nice reports. So I eventually decided to drop that; it was not the best idea.

To get around the people issues, the solutions are people solutions, and it's not easy. First of all, we operated in pirate mode, a typical skunkworks project running under the radar: using Docker, moving from Ubuntu to Windows, running on laptops, scrap PCs, anything we could get our hands on. Make deals, sell your grandmother to the highest bidder, anything to get resources and access. Then deliver quietly: we're telling the VP it's not ready yet, you slow it down, undersell it, this is new technology, we're not sure about the reliability yet, whatever. And give the ops people time to adapt to the change. The ops people are dump-truck people: we have to help them understand that this newfangled airplane is not a dump truck and how you fly it, and give them ways to use the new technology and have control over it.

Some motherhood lessons learned, and that's about my time, and also the end of the presentation. There's my contact information: the tool is on Docker Hub and also on GitHub under my account. You can reach me at the handle on the slide, and I'll be happy to chat with anyone as long as people want to chat with me after the presentation. So I guess that's it. Thank you. [Applause]