Security Data Science Teams: A Guide to Prestige Classes

Name: Security Data Science Teams: A Guide to Prestige Classes
Uploaded: 2023-10-25
Duration: 51 min

BSides Las Vegas · 2023 · 51:00 · 62 views · Published 2023-10

Speakers

Erick Galinkin

Tags

Category: Career

Topic: AI Security, Career & Soft Skills, Threat Intel

Style: Talk

Mentioned in this talk

Tools used

pandas

Frameworks

Cobalt Strike, scikit-learn

About this talk

Erick Galinkin explores the growing landscape of data-driven security roles—Data Scientist, ML Engineer, Data Engineer, AI Researcher—and the blurred lines between them. The talk maps skill overlaps, outlines pathways for security practitioners entering data careers, and argues for clearer role definitions. Through case studies and audience Q&A, Galinkin covers theoretical foundations (probability, statistics, information theory), practical tools (pandas, scikit-learn), and real-world challenges like LLM governance for SOC workflows.

Original YouTube description

Ground Truth, 17:00 Tuesday. As more of security becomes driven by data, a menagerie of job titles has cropped up across the industry. Data Scientist, ML Engineer, Data Engineer, AI Researcher, and more have become de rigueur job titles, but the lines between each role remain blurry, especially for early-career and non-data folks. In this talk, we talk about where the skills of these roles overlap, how to pursue a security data career, and, crucially, offer some hot takes on why maybe we need some clearer lines. Erick Galinkin

Transcript [en]

Good afternoon, everybody, and welcome to BSides Las Vegas Ground Truth. This talk, "Security Data Science Teams: A Guide to Prestige Classes," is given by Erick. Erick is a hacker and computer scientist working as principal researcher in Rapid7's Office of the CTO. At present, Erick leads R&D supporting Rapid7's Managed Detection and Response service. An alumnus of Johns Hopkins University, he has published a number of academic papers and given talks on security decision theory and artificial intelligence applications for security at conferences from AAAI and GameSec to DEF CON's AI Village. He has spent his entire life in different parts of information security, ranging from threat intelligence and malware analysis to cloud security and security architecture. Before we begin, I have a few announcements

to make. We would like to thank our sponsors, especially our Diamond sponsors, Adobe, and our Gold sponsors, Prisma Cloud, Semgrep, BlueCat, PlexTrac, Toyota, and ConductorOne. It's their support, along with our other sponsors, donors, and volunteers, that makes this event possible. We have a few policies that we want everybody to pay attention to. These talks are being streamed live, except in Underground, and as a courtesy to our speakers and audience we ask that you check your phone and make sure it is in silent mode. We also have a photo policy here: the BSides Las Vegas photo policy prohibits taking pictures without the explicit permission of everyone in the frame, so if you want to

have a picture or a photo, make sure you have the explicit permission of the people in the frame. That being said, we would like to welcome Mr. Erick to the stage. [Applause] Thank you so much for that beautiful introduction. It is a pleasure for all of you to be here. I am surprised at how many people turned out, given that there was a nice little break between the two talks, so thank you all for being here. I wouldn't be excited to speak to an empty room, but I am excited to speak to a room that has at least seven or eight people in it. So with that, my name is Erick Galinkin.

As was mentioned, I lead AI research at Rapid7, and I'm going to talk a little bit about security data science teams and what that means. So, to begin: what is security data science? I think the clear definition is the study of security data to extract meaningful insights, and if you disagree, that's fine; I have a microphone and you don't. So, a little bit about what security data means, because that feels like it can mean a lot of things. This usually means the analysis of things like logs, whether that's system, firewall, or load balancer logs. I have spent so much time on load balancer logs. God,

please, I don't ever want to look at load balancer logs again. Files, so this can be executables, documents, scripts (read: malware), or other artifacts, like packet captures, which don't quite fall into logs or files. I'm sure that some of you are coming up with things I haven't mentioned yet, and there are lots of things; use your imagination. If it relates to security and you can extract data from it, you can probably do security data science on it. Security data science is, of course, done by security data scientists. What does it mean to be a security data scientist? Well, it means

that you're someone who does security data science. You're welcome. Most security data scientists come from two backgrounds: either data scientists who are interested in security, so typically this is somebody who started a PhD in physics and decided they wanted to make actual money, or security analysts who are interested in data, which is my background, so I have a little bit of a bias here and I acknowledge that up front. Now, when we think about security data scientists, and especially these data scientists who are interested in security, one of the points that I like to make to aspiring, young, new-hire security data scientists is that it's

kind of a prestige class. For those of you who somehow are not nerds but are listening to this talk, prestige classes are a concept from role-playing games: there are prerequisites to reach a prestige class. If you want to acquire a prestige class, you have to be a certain level, you have to have certain attributes, you have to have certain traits, you have to be an existing class, and then you kind of prestige into the prestige class. There's a certain level cap before you can get to your prestige class; it's not an entry-level thing. And when I say that,

I get a lot of reactions like, "Is this gatekeeping?" And yeah, sorry, yes, it is. I have a fun anecdote that will help you understand, so I'll tell you a little bit about a malware classifier that was built by data scientists. They started with this big corpus of malware, literally millions and millions of malware samples, and they did all their analysis and picked it apart and identified the features and how they were going to featurize it and how they were going to build the classifier, and then they trained a whole classifier on this. And this is a true story from

a former employer. So how did it do? Well, it got above 90% accuracy on the test set. It did incredibly well: excellent F1 score, excellent AUC. If I remember correctly, it was like a 0.96 AUC. For those of you who don't know what AUC is, that's the area under the curve. One is literally perfect. The AUC basically measures the trade-off between false positives and false negatives; the higher it is, the better. So that's incredible, it's an unreal classifier. And so what were the two most important features for the classifier? The number one most important feature for determining whether or not an executable was malware was the system language. Number two was the

compiler. For those of you who have ever thought about malware for a moment in your life, you may realize that these are not features that are particularly important in determining whether or not an executable is malicious. So these data scientists went off on their own, built a classifier, and said, "Here you go, it's awesome, it's so good." And we were like, "Hell yeah, what does it do? Explain it to us." And they were like, "Yeah, it just checks the system language. If it's Chinese or Russian, it's pretty much always malicious. If it was compiled with Borland Delphi, it's pretty much always malicious." And it's like: nope, absolutely not, wrong, wrong. Which is to say,

security data science requires security skill and data skill. If you're a low-level character, that is, you've just graduated college, you may not have the right balance of skills to be a good security data scientist to start. That's not to say that you can't get there; of course you can get there. As you start off in your data science journey, in your security journey, and you aspire to become a security data scientist, you'll acquire more experience, you'll acquire ability points, and you can put those ability points in different parts of your skill tree. In role-playing games, skill trees are a way that, as you build up your levels,

you will unlock new skills. Some skills are prerequisites to other skills; sometimes you need to have both skills in the line to get to that third skill. You need to have your spheres or your ability points, whatever analogy makes sense for you. But it's tough to move directly to, say, assessing the security of large language models if you've never trained a logistic regression classifier. You need to grasp what's happening under the hood before you can really get to the point where you're making well-reasoned, valid assertions about what is happening where. And there are a lot of skills that can go into being a security data scientist. I've put a bunch up here. I'm not going to read

them, but one of the things is, especially if you're thinking about security data, it's tough for people to reason about "I built something that tells me whether or not an HTTP stream contains malicious network traffic" if you've never analyzed malicious network traffic. You can build that classifier, but when you get a false positive, when you get a false negative, it's going to be really difficult for you to understand why that happened, explain it, and fix it. A lot of times data scientists, data people in general, get stuck on this notion that all we need is more data: we just get more data and then we train it

some more and then it works. That's not always the case, because you have these weird, ambiguous corner cases, especially in network traffic, which is a nightmare to do analysis on. You see things like this: we were training a classifier for anomalous data transfer, and one fun thing is that printers sometimes get a lot of data. You send a lot of data to a printer; some printers, depending on the make and the model and the protocol, don't actually receive that much data. So does it look like exfil or does it not look like exfil? Well, I guess that depends on whether it's a Lexmark or a Xerox. And of course, if you don't know how to look

at that pcap and say, "Oh, okay, yeah, this is weird, it's using this printer protocol that wasn't in our training set," you're not going to get that. It can be really tough. And so, as we're looking at the skills and thinking about the different skills, whether that's good old-fashioned AI, deep learning, data visualization, containerization and deployment, MLOps, et cetera, that brings us into job titles. Job titles are something that drive me uniquely insane, because, well, we'll get into it. Some common titles you see: machine learning engineer, data scientist, data engineer, data analyst, MLOps engineer, et cetera. And so you can kind of break up the responsibilities of the role.

I don't need to read this list to you. You don't have to read it; you can take a picture of it, it's fine, or a screen capture if you're watching remotely. What's up. But essentially there is some overlap in the roles, and there are some really defined things. MLOps is almost completely disjoint from a data scientist. There's overlap between the ML engineer and MLOps, and overlap between the ML engineer and the data scientist. My job title is AI researcher, and that's not on here because it is silly. The problem is that this is my idealized version, because most orgs end up structured like this, where everybody has the job

title data scientist and we don't distinguish at all between whether you are doing the deployment, whether you're doing the maintenance, or whether you are just doing data visualization. You work with data, you science the data, and therefore you are a data scientist. And so my hot take is: maybe we should just stop using that title. No more data scientists. I think that by putting that restriction on ourselves, we kind of force ourselves to think about how those titles might matter and how we can delineate those roles and responsibilities. I'll talk a little bit more about that shortly, but when we're thinking about the roles and responsibilities of

security data organizations, that's presenting security findings to leadership in digestible ways. That usually means, hopefully, something other than a pie chart, but sometimes they want a pie chart even though it doesn't actually tell you anything meaningful. Please stop using pie charts. Presenting security data in stakeholder-relevant ways: if your stakeholder is a SOC analyst, well, a chart is not going to be nearly as helpful to a SOC analyst as something they can read and take action on. A lot of times all a SOC analyst wants is red or green: is it bad, or do I not care about it? And that really matters. How you present those findings

does matter, and it's where that data visualization skill comes in, a skill that I am sorely lacking. Developing task-specific data models and machine learning models: if you don't have a data model that makes sense, it is going to be very difficult for you to train machine learning models. Using the wrong data structure can be a total nightmare, especially if you're dealing with text data and you've turned it into JSON and then you need to return that JSON as a string and then your model chokes and dies on it and you can't figure it out for three weeks. Not that that happened to me like a month ago. And then, of course,

enhancing the ability of analysts to deal with at-scale data, by which I really do mean taking the deluge of data that SOC analysts are faced with and turning it into something that they can cope with: people who don't find using a Jupyter notebook exciting, people who don't want to train models, just people who want to find evil and get rid of it. So a key line that was missing from the earlier chart is that understanding of security processes. It's really important for analysts, data scientists (if you're going to use that term), and ML engineers to understand those security processes; that way you don't write a classifier that

depends on the system language and the compiler. So how do we think about understanding security processes for data scientists, for people who are coming from, say, a physics PhD into working in a security organization? I don't want to imply that you need to be an expert. You don't need to be a super-competent reverse engineer to know how to write a malware classifier. It helps, but you don't have to be. What's important is that you can work with those subject matter experts and you have enough of a background to understand what matters to them and how they do their jobs. If you spend a day with a malware analyst,

you're going to very quickly learn what matters. They're going to say, "Oh, it's making this API call, it's importing these libraries, we've got packing in here." All of these things are hints that something might be malicious, and you learn how to deal with them and reason about them together. And so when you get a classifier that does weird things, you can say, "That's not right," and you don't have to wait until two days before it goes to production and customers freak out; you can catch it early on in the process. I've mentioned this a couple of times at various get-togethers, at meetups, and

even to my own organization, and one point that I always get is, "But security data scientists, they're all so busy." And, like, so what? I think that that's an excuse. We are busy, but this matters. It's important that you have the appropriate skills and that you invest in the right parts of your skill tree to do the job that you're assigned. So what is the job that you're assigned, and how do you structure your team? It's important, when you're building your party, when you're building your security data science team, that you collect the different skills, the different strengths and

weaknesses, so that you can support whatever your organizational mission is. So I'll give a lightly fictionalized real-world example of my party, my team. We have me; I'm kind of a min-maxed rogue. I've invested a lot in my Dex and Charisma. I've got very high security skill and high machine learning skill, but I cannot write Terraform. Gun to my head, I could not write Terraform. I love everybody who does; my brain doesn't process Terraform, it doesn't make sense, I can't do it. I've tried. Golang and Terraform, those are the two I can't do. If you love Golang, I actually

don't apologize. Data visualization is just not a place I've spent a lot of time. I can build some basic charts, like if I can do it with plt.plot. I'm a filthy Python user. I'm sure there's somebody in here who loves R; I'm sure Gabe is listening somewhere, and to him and to Bob Rudis, I apologize. I do know that ggplot is better; I'm just never going to learn how to use it. So I'm very poorly skilled in data visualization, and so when I'm trying to build out my party, I want to bring on Jamie, who's our tank. And by the way, I do have the permission of

these people to show their faces; these are real people. So Jamie comes from a real dev background. She's wonderful, she's brilliant, kind of familiar with security but newish to it, but she's competent at data processing, ML, and data visualization, more competent than me, and she brings up all of the infrastructure and ops stuff that I can't do. If it involves a .tf file, if it involves a vars file, I go, "Jamie, can you please help me? I need you for this, I can't do it." Somebody said "ECR" to me; that's gibberish, that's nonsense. EKS? Never heard of her, don't know her, we're not friends.

AWS doesn't make sense to me, but it makes sense to Jamie, and so I am happy to do my backstabs and whatever, and she is happy to tank the AWS damage for me. And then we've got another member of our party, Robbie, and Robbie is just a wonderful druid, kind of mid-range, like 15s on every stat. He's fine, he's good at everything. Not min-maxed, he's got no dump stats, he's really built a balanced character, and he's wonderful, he's really, really good. And so we have this party with these complementary skills. We've got me working on the deep security stuff and

being able to mentor them on the security side of things. I have a lot of background and deep knowledge in the machine learning side of things and the large language model side of things, so I can cover for them there. Jamie is happy to bring up all of our infrastructure and manage it for us, and then Robbie's kind of just an all-rounder, whatever you need. But there's something missing: we don't have any casters. So my party, even though I have tried to build it very carefully, doesn't have anybody who's really, really good at data visualization, and for what we're doing now that's okay, because we're mostly supporting these

internal operations, SOC people. But if somebody asks us, "Hey, can you write an executive report and put it out to the world?" I say, "No, I can't, I don't know how to do that. I'm going to build you really ugly charts," and I'm going to have to go to our BI team and say, "Hey, can you help? Because you all build beautiful charts all day and I don't know how to do that." So it's really important to build that balanced security data science org, and it's really important to have a deep well of security knowledge to pull from, especially when you have data scientists who are coming in from this

non-security background. And so with that, I kept this incredibly short. I am all set and happy to release you all to ask me questions and then to go eat dinner, so thank you for your time. Thank you, Erick, wonderful talk, really interesting. If you have any questions, you can use this mic and ask Erick your question. Thank

you. I've asked a question at everything that's going on in this room. I'm curious to know how you deal with LLMs, because who knows what's going on under the hood? When I push regenerate, regenerate, I get all this stuff back, and there are all these ways you can sneak things in; the previous presenter talked about how you can reposition something. I deal in governance and I'm trying to get my own head around that, and I'm just wondering if you have any thoughts on that. I'm so glad you asked that question. No, I really do love it; this is something I've spent a lot of time on,

and so I'll tell you a little bit about how I've dealt with it as we're prototyping some stuff, which is that when I'm building it, it really depends on the audience: who's my consumer? I've trained some language models to work with our SOC analysts, to support them, and one of the things that SOC analysts do, other than try to break my system (no, thank you, John), is ask it questions about things like indicators of compromise. These are things where accuracy is incredibly important. It really matters, because an IP address that was benign yesterday might be malicious today, and the trouble is that with a

large language model, even if you can guarantee that it memorizes all of its training data, which is its own governance problem, by the time you're done training a language model with 7 billion, 35 billion, 70 billion-plus parameters, that data is going to be stale if it's about an indicator of compromise: a domain, an IP. It may not have seen a hash before. Are you going to put every possible hash in there? No, of course not. So one of the things that I've done is build guardrails that ask what kind of question you're asking, and in my case, for our SOC analysts, if it's about an indicator of compromise or if it's about a

vulnerability, I don't even have it talk to the language model. I have a guardrail (NVIDIA has built some wonderful guardrails in NeMo; there are a lot of ways to do these guardrails), but if you're asking it about an indicator of compromise or a vulnerability, a particular CVE ID, what I do is say: don't talk to the language model, short-circuit, go query a structured data source. Go query all of our telemetry and pull it back (there's a separate system for doing that), and then return that structured data that tells you this IP address is a Cobalt Strike command and control IP.
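As an illustration of the short-circuit described here, a minimal sketch in plain Python (not NeMo Guardrails, and not the speaker's actual system). The names `TELEMETRY`, `extract_indicator`, `route_question`, and `ask_model` are hypothetical, the regular expressions are a deliberately crude stand-in for a real question classifier, and the IP address comes from a documentation-only range.

```python
import re
from typing import Callable, Optional

# Hypothetical structured source standing in for telemetry / threat intel.
TELEMETRY = {
    "203.0.113.7": {
        "verdict": "malicious",
        "label": "Cobalt Strike command and control IP",
        "last_seen": "2023-08-01",
    },
}

# Crude patterns for the kinds of questions that should bypass the model:
# CVE IDs, IPv4 addresses, and MD5/SHA-1/SHA-256 hashes.
INDICATOR_PATTERNS = (
    re.compile(r"\bCVE-\d{4}-\d{4,}\b", re.IGNORECASE),
    re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
    re.compile(r"\b(?:[0-9a-fA-F]{64}|[0-9a-fA-F]{40}|[0-9a-fA-F]{32})\b"),
)


def extract_indicator(question: str) -> Optional[str]:
    """Return the first indicator-like token in the question, if any."""
    for pattern in INDICATOR_PATTERNS:
        match = pattern.search(question)
        if match:
            return match.group(0)
    return None


def route_question(question: str, ask_model: Callable[[str], str]) -> dict:
    """Send indicator questions to the structured source, everything else to the model."""
    indicator = extract_indicator(question)
    if indicator is None:
        return {"route": "model", "text": ask_model(question)}
    # Short-circuit: the language model is never asked for facts about the indicator.
    return {
        "route": "structured",
        "indicator": indicator,
        "record": TELEMETRY.get(indicator),
    }
```

An indicator that is missing from the structured source comes back with `record` set to `None`, so the caller can report "no data" instead of letting a model guess.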

It's a known malicious IP. Then you take that, return that, and have the language model return something that's readable to an analyst. It says something like, "IP 8.8.8.8 (probably not that one) is malicious. It's a Cobalt Strike command and control IP. It was last seen on such-and-such a date." And then the analyst goes, "Oh no, we have to do something about that," and then they can ask a follow-up question, like, "Okay, well, how do I remediate a Cobalt Strike infection?" That is not going to get sent to a structured data source; that is going to go to the language model that's been fine-tuned on all of this security

data, all of these reports and whatnot, and then it's going to say, "Oh, well, reset credentials, quarantine the machine, restore from a known-good image," whatever, all of that advice that it gets trained on. And so we pair the language model with trusted structured data sources to retrieve that relevant information, and that helps us ensure that in cases where accuracy is really important, we circumvent issues around hallucination. Which, man, what good branding from language model providers; it's making things up. So that's how I have dealt with it. There are certainly other

ways to do it. Having a chaperone language model is another idea that I've seen, where you have a language model read the conversation between the user and the language model and ask: Is this going okay? Does it look like the user is trying to do bad things to this language model? Does it look like this language model is saying things it probably shouldn't say? And then what that chaperone can do is short-circuit the conversation and push the language model back in, to be like, "Oh, I actually can't answer that question, I don't know the answer to that. I am a helpful, harmless language model. I cannot tell you how to build a bomb."
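The chaperone idea can be sketched the same way. This is a minimal illustration under stated assumptions, not an implementation the speaker describes: `keyword_chaperone` is a hypothetical keyword check standing in for the second language model, and `chaperoned_reply`, `ask_model`, and `REFUSAL` are illustrative names.

```python
from typing import Callable

REFUSAL = "I can't answer that question."


def keyword_chaperone(user_message: str, draft_reply: str) -> bool:
    """Hypothetical stand-in for a chaperone model: True means the turn looks unsafe."""
    blocked_topics = ("build a bomb",)
    text = f"{user_message} {draft_reply}".lower()
    return any(topic in text for topic in blocked_topics)


def chaperoned_reply(
    user_message: str,
    ask_model: Callable[[str], str],
    is_unsafe: Callable[[str, str], bool] = keyword_chaperone,
) -> str:
    """Get a draft reply, let the chaperone read the whole turn, refuse if it objects."""
    draft = ask_model(user_message)
    if is_unsafe(user_message, draft):
        # Short-circuit: the draft never reaches the user.
        return REFUSAL
    return draft
```

Passing the chaperone in as `is_unsafe` keeps the wrapper independent of how the judgment is made, so a model-backed check could replace the keyword list without touching `chaperoned_reply`.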

So those are the two models, for want of a better term, that I have seen work. I'm sure there are others; it is an emerging field, but I think that those are some really strong ways to do that. And again, the security data science point that I want to drive home is: if you've never worked in security, you may not realize that shoving a bunch of indicators of compromise into your model is not actually helping anybody, and may actually be confusing. You need those guardrails in place, because whatever the WannaCry domain is, it's not malicious anymore, probably. It's probably sinkholed, right? That's sinkholed? I don't know,

somebody ask Marcus, he'll know. Absolutely, question, go for it. Hey, could you go back to the slide where you have the roles and the check marks? Cool. So I have a couple of questions about this. I guess the first one is: you have these broken out as separate things, and you could read this chart as sort of a progression of skill from left to right, but I'm not sure that's quite accurate. Could you maybe give some thoughts on that? Yeah, so I think it's organized from left to right just so that I could fit it all on one chart, but it's definitely not a progression. MLOps

is... I mean, I would be lost without MLOps, and I would be lost without data engineering. I cannot build ETL pipelines, on the far left side of that, and I cannot deploy and manage my own infrastructure, on the far right side of that. I'm very comfy in the middle part, and you could reshuffle these roles, these responsibilities, however you want; this was just more aesthetically pleasing to me. I do think that the one place where there is kind of a progression, and again it's one of the issues I take with the term data scientist, is that data analyst is sort of a precursor to the data scientist, where

as you develop these more sophisticated modeling skills, you become a data scientist. That said, as I mentioned with the business intelligence team, I am not very competent at data visualization; they are data analysts who are amazing at it. So I kind of see the data scientist as somebody who's dealing with at-scale, large-scale data and pulling insights out of it in a programmatic way, whereas a data analyst is more like: run this one SQL query, dig in super deep on that, and pull out the individual insights. I think that a lot of people would take umbrage with that. I think I

might even take umbrage with it, but that's how I'm envisioning it for the purposes of this chart and literally nothing beyond it. So that's how I see it: this isn't really a progression. An ML engineer is not necessarily above and beyond a data scientist; it's just that they are more focused on the machine learning care and feeding and deployment and all of that, where a data scientist may need a broader set of skills to be able to clean the data, collect the data, explore the data, and understand it. They need some of those analytic skills, where an ML engineer can get away with not

necessarily understanding the particulars of the data and how it is stored, structured, and cleaned. They need to understand what the implications of it are. They need to understand: what is this data, what does it mean, how should this look if a person does this? If you were doing it manually as opposed to automating it, what does a relevant input look like and what does a relevant output look like? But they don't necessarily need to be concerned with how to extract the individual features from this data. I think that this maybe is more of a progression of the data, from data ingestion to model life cycle management. I think that that may

be the progression that is captured here: the data engineer needs to bring in the data, store it, ETL it (extract, transform, load, for those who don't know what ETL means), make it live in the database, make the database happy. I'm sure there are other things data engineers do; I am not a data engineer, so apologies to every data engineer listening, I'm sure you do more than that. The analyst kind of pokes at it, the data scientist figures out how we want to model it, the ML engineer helps build, test, and train that model, and then the MLOps engineer will go ahead and make sure that that

model is deployed, scalable, and productionized. [Music] That leads into my second question, actually, which is looking at this in terms of the development life cycle, from exploration to model development to model deployment to ongoing maintenance. I was wondering if you could talk a little bit about how each of these roles fits into that cycle. Definitely. So the first pass through, because it is a cycle, that initial data collection, training the model, testing the model, deploying the model, really gives us this from left to right, top to bottom. But of course models are not trained once and then perfect

forever. You have things like concept drift and model drift. If you trained a model for malware detection on data that was collected from 1999 to 2008, it would probably not do very well on modern samples. These things evolve. Similarly with network traffic: I mean, comparing the network traffic that I remember from the halcyon days of 2014, when I was tracking exploit kits, to what modern networks look like now, oh my God, there are so many outbound connections now to things that I didn't... but why? So all of that is to say, at some point you're going to

need to say, hey, this model is no longer as good as it was, and that's where MLOps, in automating the pipelines and deploying the models, sees that the model tests are not hitting on the current data set and kicks it back to the data scientist to say: do we need to re-evaluate the features that we're using? Do we need to re-evaluate the data sources that we're using? If you have to re-evaluate the data sources that you're using, if you aren't collecting the right features, you may even have to kick it all the way back to the data engineer and say, hey, we need to pull more stuff in, we need to pull in something

different we need to pull in contextual information we need to add to that and then that goes back through that same cycle again from top to bottom left to right which is okay now we have the data that we believe we need we pull out the features we create the model we test the model we evaluate the model we say this one's good enough and then you kick it into deployment you build the tests and then you you know uh have that you know continuous integration continuous deployment and make sure that it runs and scales appropriately and then inevitably 3 months from now somebody's going to go Erick your thing is not working anymore I'm going to say

I know uh and then they're going to come back to me and I'm going to have to revisit that life cycle right so it is definitely uh I think a a key and maybe underappreciated part of ml life cycle management is the Run model tests uh testing your models is very important uh and and I don't think that we do enough of it and I don't think that we think enough about it as a community uh but you know testing those models on known good data uh or data that you know what the results should be and then also the newest crop of data and saying okay does this perform in a comparable way um and then monitoring

that over time to say is it just that like last week was just a rough week for the model it got like a lot of bad news and it just wasn't happy uh or is it that it is degrading because there is some change in the way that whatever we're monitoring uh whatever we're evaluating for is constructed and is working right um yeah does that yeah awesome hi um so as an experienced security practitioner um who has aspirations towards data science I'm finding that a lot of the materials that I can find are mostly about how to use specific tools about how to you know use specific programs I'm wondering if you have any recommendations or advice about

how to get a more fundamental theoretical grounding in the topic yeah I thank you for that question I actually love that question so if you're a security practitioner looking to get into the data side of things I think that there is some value to certain specific tools right knowing how again this is very Python biased apologies to anybody who uses R or some other language that they prefer um right knowing how to use pandas like I still have to Google the documentation all the time and I've been doing this for a minute uh you know knowing how those data frames work how you load data what the data structure should look like all of that is incredibly important but

when it comes to the theoretical foundations I find that revisiting good old-fashioned statistics and probability uh knowing probability theory you don't necessarily have to go to like measure-theoretic probability uh although you can if you're like hardcore um or if you're a huge loser nerd who loves math uh either way uh getting into that like deep probability theory is super helpful because a lot of these models are fundamentally uh probabilistic right there are good old-fashioned AI models like logic programming inductive logic programming one of my favorite things in the entire world uh love it so much that's no probability other stuff all probability so being able to understand you know uh birth-death uh models and uh

Markov chains and those sorts of things right become really important I think that if you can work your way through like a Casella and Berger uh or you know thinking about like gosh oh there's a really good book on probability theory that I can't remember the name of that I want to recommend and I I'm blanking but things like stochastic processes are incredibly important to understand right if you understand Markov processes and Bernoulli processes and Poisson processes you can usually get to the point where as you're looking at this data and contextualizing the data and thinking about it from a security perspective if you're armed with that knowledge you can go oh this is a Bernoulli process like

yeah it's going to emit an event at some random interval and I'm either going to get a zero or a one at any given time step and I can just model it that way and you say okay well how do I turn that model into something that's usable that lets me be predictive and then you can start thinking about your regressions or decision trees or whatever um the other thing that I think is really important to think about and shout out to my cryptography friends on this one is also information Theory right so when you look at your data and try to figure out well is my data actually telling me anything is this just noise or does it

contain information understanding how information Theory quantifies information will give you a good underpinning because the way that a lot of these machine learning models work is by reducing entropy right you're trying to reduce the cross entropy the entropy between your predictions and the True Values and so if you grasp what the entropy is and how it's doing that you can start to say okay well the reason that my model is just like crazy and doesn't make sense is that the data I'm feeding into it doesn't actually contain enough information for it to develop good predictions which is something I found out when I was writing my Master's thesis on so Network traffic and it was

terrible um but you know such is life so I think yeah those are my recommendations is statistics just good old-fashioned like Casella and Berger um stochastic processes and information theory give you a good theoretical foundation for understanding it and then really getting some hands-on experience with the tools even if it's like just building toys um knowing how to manipulate data in something like pandas uh or whatever data frame thing R has uh and knowing how to just do like basic scikit-learn models that'll get you pretty far um as exciting as large language models are and as exciting as neural networks are uh one fun secret in security that makes me feel certain ways about people who I

won't name uh is that most of our data is tabular data and decision trees actually work way better on tabular data than neural networks do and so when you start talking about large language models with security people uh some of them are like this is the best thing ever and some of them like our eyes roll back in our heads and we're like yeah but like what is it actually doing we're not dealing with language we're dealing with you know logs and and time stamps and network traffic that's that's tabular data that's not text Data uh and so just understanding how we as security analysts think about these things uh is a huge Boon yeah hi hi uh we always talk uh a lot

about ML models but uh what do you think could be other outputs of a security data science team like you are talking how you feel your team lack a little bit of that visualization skills so maybe if you had this skill you could maybe be producing dashboards or maybe products that could uh facilitate investigation or risk management for other team so 100% uh what you think about these kind of other outputs yeah no these that's I love that question thank you for it um these other outputs are actually really really important and I would like to shout out my colleagues at GreyNoise in particular um Bob Rudis if you're watching hello I love you um

where one of the things that they do incredibly well is they have these really good data visualizations that show you what the trends are how things are scaling what kind of you know information and inference they're doing and these outputs are actually incredibly valuable right so I'm biased toward the modeling side of things because that's where my expertise is but you look at some of the reports that are put out by you know wonderful organizations Rapid7 included um and then the dashboards and you know things put out by organizations like GreyNoise Labs that do a phenomenal job of really capturing well what are the vulnerabilities that are being exploited in the wild right now right you can

watch the trend lines go up and down and that's incredibly valuable information for practitioners to say okay well I'm not really concerned about this vulnerability because nobody really seems to be exploiting it but this one is on an upward trajectory even though it's been available for a while right now you can dig into that from a research standpoint and say okay well what is precipitating more exploitation of this vulnerability maybe somebody just dropped a Metasploit module and so everyone and their mother can now exploit it for little to no effort great but also seeing okay well this thing had no exploitation and now it's starting to go up you can say all right well it's a priority for me now to see

if we're exposed to that right so that data visualization component and these other outputs these reports these you know dashboards and those sorts of things are actually incredibly important for risk management and risk reduction um and I I don't want to undersell that so yeah definitely incredibly

valuable hi uh love your right WR because you have both like this kind of data science and security but I'm sure that most of us is not the case in my uh in my case inside my company is like are different teams separate teams one team takes care of all the data science things all the processing visualization and other my team just take care of the security so I have like this kind of uh start collaboration starting going on with them so I'm not sure of from your point of view when you're starting to to work with other data science teams what kind of security considerations we should have to other data processing teams that

doesn't have this security uh backgrounds yeah definitely I think what's important is if you are the so if you have these disjoint security data science teams right being able to as the security person communicate what matters and what you need to see right the way that data scientists in general work um is what is what is my input and what is my output right and they'll happily fill in the middle it's not too dissimilar from a machine learning model right is you tell me what inputs you think about and you care about you tell me what the output you need is and most of the time they can fill in the blanks uh and I I think that that's what's

really important is being very very clear about you know if I give you an executable I want red or green it's good or bad right if I give you you know uh a bunch of logs I want you to Output the logs that might be worth looking at right what are the interesting what are the anomalous what are the weird logs because a lot of times uh data science folks can understand that problem from a data science lens where you say I have all these logs I want to know which ones matter you know a good data science team will then ask you well how do you figure out what matters because one of the

things they can do is they can cluster the logs and they can say okay well you know if it doesn't fit neatly into a cluster if it's not close enough to a centroid or whatever then show it to somebody because it's out of the normal it's weird um but if you say well these are the attributes that we typically care about then they can say okay well maybe we don't cluster all the logs maybe we parse them out and then only worry about you know what are the uh command line arguments right like I don't care if you're invoking PowerShell I care if you're invoking PowerShell yeah I I do but like maybe I I

pretend I don't care I don't care what you're doing on PowerShell but I do care about those arguments those are the things I really care about okay great well now we can featurize on those arguments so being really clear about what your inputs are and what outputs you're looking for I think that fosters really good collaboration and you can start to give them a sense of you know what matters to you as a security practitioner and as long as you have a competent and engaged data science team you know learning from them and having them tell you well this is how we thought about it this is how we approached it can give

you a good iterative cycle to start with one project and then on the next one say okay well let's try something a little different and then you all will become more familiar with their jargon and vernacular right because sometimes we're just not speaking the same language and they can become more familiar with your you know security jargon and then that'll just give you better communication overall I think that's that's really uh you know communication is the foundation of all good relationships you're welcome all right wonderful thank you everybody for your time this was awesome I really appreciated those questions [Applause] yes
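
The model-test step discussed earlier in the Q&A (score the model on data where the right answer is known, score it again on the newest data, and watch the gap over time) can be sketched roughly as follows. This is a minimal illustration and not code from the talk: the function names, the `tolerance` of 0.05, and the `patience` of 3 weeks are all assumptions chosen for the example.

```python
from typing import List, Sequence


def accuracy(predictions: Sequence[int], labels: Sequence[int]) -> float:
    """Fraction of predictions that match the known labels."""
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


def is_degrading(
    baseline_score: float,
    weekly_scores: List[float],
    tolerance: float = 0.05,  # assumed acceptable drop from the baseline
    patience: int = 3,        # assumed number of consecutive bad weeks
) -> bool:
    """Return True only if the most recent `patience` weekly scores are
    all more than `tolerance` below the baseline score.

    A single bad week does not trigger this check, which is one simple
    way to separate "a rough week for the model" from sustained drift.
    """
    if len(weekly_scores) < patience:
        return False
    recent = weekly_scores[-patience:]
    return all(baseline_score - score > tolerance for score in recent)
```

With a baseline of 0.95, weekly scores of 0.94, 0.80, 0.93, 0.94 would not be flagged (one rough week), while 0.94, 0.88, 0.86, 0.85 would be, since the last three weeks all sit more than 0.05 below the baseline. A real pipeline would likely use a metric better suited to imbalanced security data than plain accuracy.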

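The last answer's idea of featurizing command-line arguments and surfacing whatever is "not close enough to a centroid" might look something like the sketch below. It is a simplification under stated assumptions, not the speaker's method: it uses a single centroid (the mean of all feature vectors) instead of a real clustering algorithm, the three features in `featurize` are hypothetical, and the cutoff of mean distance plus `k` standard deviations is an arbitrary choice for illustration.

```python
import math
import statistics
from typing import List


def featurize(command_line: str) -> List[float]:
    """Turn a command line into a small numeric vector.

    The features are hypothetical: argument count, total argument
    length, and number of dash-prefixed flags. The executable name
    itself is dropped, since the arguments are what matter here.
    """
    args = command_line.split()[1:]
    return [
        float(len(args)),
        float(sum(len(a) for a in args)),
        float(sum(a.startswith("-") for a in args)),
    ]


def flag_outliers(command_lines: List[str], k: float = 2.0) -> List[str]:
    """Return the command lines whose feature vector lies unusually far
    from the centroid of all feature vectors.

    "Unusually far" is assumed to mean more than `k` population standard
    deviations above the mean distance to the centroid.
    """
    vectors = [featurize(c) for c in command_lines]
    center = [statistics.fmean(column) for column in zip(*vectors)]
    distances = [math.dist(v, center) for v in vectors]
    cutoff = statistics.fmean(distances) + k * statistics.pstdev(distances)
    return [c for c, d in zip(command_lines, distances) if d > cutoff]
```

Given nine copies of `powershell -NoProfile Get-Date` and one invocation carrying a long `-EncodedCommand` argument, only the long one ends up past the cutoff. In practice the features would come from the security team saying what they care about, which is the point the answer makes about being clear on inputs and outputs.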