
GT - Advancing Internet Security Research with Big Data and Graph Databases - Andrew Hess

BSides Las Vegas · 14:20 · 14 views · Published 2016-12
About this talk
Ground Truth - BSidesLV 2015 - Tuscany Hotel - August 05, 2015

…cloud-delivered network security solution, and we also have Investigate, which is a search engine tool that allows someone like a security analyst to look in and get really powerful insight computed from the data that we see from our roughly 70 billion daily queries on our DNS network. So for today, we're going to talk about who I am, the team I work on, and what our role is in OpenDNS. We'll go into the project I've been working on, which is called the Intel DB, how that sits inside OpenDNS as a company, and how it affects our product and our research. Then, how researchers are able to leverage that to make more powerful research.

Then, how that ultimately affects the entire process of how research is conducted at OpenDNS. About me: I'm a software engineer at OpenDNS, I've been there for about a year and a half, and I'm on the research systems team. Our goal is to provide researchers with very powerful tools, and ultimately to serve as the connecting point for getting data from research or third-party sources into our production environment so that it's actually enforced. I mentioned the Intel DB; I'll talk about it a little bit more now. For the engineers here, it's ultimately a distributed graph database. We built this primarily with Titan DB, which is an abstraction layer that hooks into HBase.

We use Elasticsearch to index it, and we use Kafka to do really fast loading. But this is a security conference, so we'll focus more on what it means from a research perspective. Ultimately, it's a big sandbox of a lot of data. We can take in third-party feeds from sources like Kaspersky feeds or VirusTotal; they all provide really powerful lists of domains that we want to block. But we can also compute tons of data internally. These are taken from our DNS log data, and we can say: hey, these are the most popular domains, based on the percentage of people visiting compared with IP requests. We have all sorts of jobs like that.
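
The "really fast loading" role Kafka plays can be pictured as a queue of feed records that a consumer drains in batches before writing to the graph store. Here is a minimal stand-in sketch using Python's standard-library queue; the record fields and batch size are invented for illustration and are not the production schema.

```python
from queue import Queue

def load_feed(records, batch_size=3):
    """Drain feed records in fixed-size batches, Kafka-consumer style."""
    q = Queue()
    for rec in records:
        q.put(rec)
    batches, batch = [], []
    while not q.empty():
        batch.append(q.get())
        if len(batch) == batch_size:
            batches.append(batch)  # a batch would be one bulk graph write
            batch = []
    if batch:
        batches.append(batch)      # flush the final partial batch
    return batches

feed = [{"domain": "bad%d.example" % i, "source": "third-party"} for i in range(7)]
print([len(b) for b in load_feed(feed)])  # [3, 3, 1]
```

Batching is what makes bulk ingestion of large third-party feeds cheap: one write per batch rather than one per record.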

There are jobs DGAs can compute, DNS records, research prototypes, WHOIS data, and, what I really like, support tickets. Maybe someone emails OpenDNS and says: hey, this looks like it should have been blocked, or, it's blocked and it shouldn't be. We have humans actually touching that data, so we can take it from Zendesk, put it into our graph database, and it becomes really powerful data points. The key takeaway here is that we want to see the relationships between all the data. The database itself contains all this metadata, and when we want to know what to do, we ask the graph.

The graph takes in all these data points and makes a decision. The obvious question would be: should this be blocked or not? But you can ask all sorts of questions against it. When we built this, we really wanted to make it work at scale. Just one quick observation: it's really easy to spin up a malware attack, and it's really easy to spin up a lot of them. In order to work against that sort of environment, we wanted to make sure that we could work in bulk, and that means a lot of things actually need to be automatic.

That goes back to the point I just made: instead of something being determined by a human, we have our graph make that decision for us. That's ultimately done by the decision engine we have, which takes in all of that metadata and makes a decision on the fly. It takes a long time to build something like this, and we want it to last for a long time. For that to work, we want it to be extremely easy for new research, new models, and new data sources to integrate and play well with everything that's already in there, and ultimately provide better insights into what we're doing.
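
A toy version of the "ask the graph" decision engine might look like the following. The node names, fields, and scoring rule are all invented for this sketch; the real system derives its verdicts from far richer graph relationships.

```python
# Invented metadata graph: each domain node carries signals from feeds,
# internal models, and traffic stats.
GRAPH = {
    "evil.example": {
        "categorized_by": ["kaspersky_feed", "internal_dga_model"],
        "whitelisted": False,
        "pct_customers_visiting": 0.02,
    },
    "shop.example": {
        "categorized_by": [],
        "whitelisted": True,
        "pct_customers_visiting": 4.1,
    },
}

def should_block(domain, graph=GRAPH):
    """Block iff at least one source flags the domain and it isn't whitelisted."""
    node = graph[domain]
    return bool(node["categorized_by"]) and not node["whitelisted"]

print(should_block("evil.example"))  # True
print(should_block("shop.example"))  # False
```

Because the decision is a pure function of the metadata in the graph, it can run automatically and in bulk, which is the scaling property the talk emphasizes.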

So let's take a quick look at what the schema looks like in the graph, and then we'll talk about what that means internally for our product. Let's say I have an internal source; that could be someone like me, or it could be a DGA; it could really be anything. It adds this domain, and we make a node for that. There will be a node for the TLD, so you could look up .com and it would be a huge mega-node. This internal source has also categorized the domain as malware, so that's cool. Let's also look at a third-party feed, and that third-party feed added a URL.

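
The schema walk-through above can be sketched as a tiny labeled-edge graph: vertices for a source, a domain, its TLD, a category, and a URL. The edge labels and names here are illustrative only, not the production schema.

```python
class Graph:
    """Minimal labeled directed graph standing in for the Intel DB schema."""

    def __init__(self):
        self.edges = []

    def add_edge(self, src, label, dst):
        self.edges.append((src, label, dst))

    def out(self, src, label):
        """Follow edges with a given label from a vertex."""
        return [d for s, l, d in self.edges if s == src and l == label]

g = Graph()
g.add_edge("internal_source", "added", "bad.example.com")
g.add_edge("bad.example.com", "has_tld", ".com")   # .com becomes a mega-node
g.add_edge("bad.example.com", "category", "malware")
g.add_edge("third_party_feed", "added", "http://bad.example.com/payload")

print(g.out("bad.example.com", "category"))  # ['malware']
```

Storing everything as vertices and edges is what lets later queries traverse from a support ticket to a domain to its category in one walk.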

I'm not a researcher, so these are somewhat generalized questions, but it is hard being a researcher, because you have a lot of clues about where to look, but sometimes it's more art than data-driven. You ask: okay, what exactly should I research? What's going to be the most useful thing for our customers to start adding to our product? You might also say: hey, I have a data set and I've written something to detect what I think I'm looking for, but what else am I getting, and how problematic will that be when I push this out to production? Will my data set be very different from what is live?

I'm not really sure how this will actually perform in the wild. With Intel DB, what we're pushing right now is making data lead the way. For one example: we have something like the support ticket I mentioned, and we can pair that up with DNS log data and our internal stats. We might have a bunch of Zendesk-type tickets indicating an anomaly with a type of botnet attack that we don't have good coverage of, and we could look at our DNS log data and say: hey, we actually have a handful of customers that are seeing this. That would be more urgent than something that none of our customers are actually visiting.
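
Pairing tickets with DNS logs to decide what to work on next could be sketched like this; the domain names and query counts are made up for illustration.

```python
# Ticketed domains (possibly repeated) and live DNS query counts.
tickets = ["botnet-c2.example", "quiet.example", "botnet-c2.example"]
dns_hits = {"botnet-c2.example": 5200, "quiet.example": 0}

def prioritize(ticket_domains, dns_hits):
    """Rank ticketed domains by how much live customer traffic they see."""
    scores = {dom: dns_hits.get(dom, 0) for dom in set(ticket_domains)}
    return sorted(scores, key=scores.get, reverse=True)

print(prioritize(tickets, dns_hits))  # ['botnet-c2.example', 'quiet.example']
```

The join is the point: a ticket alone says "someone noticed this," while the traffic data says "our customers are actually hitting it," and the combination is what drives the prototyping decision.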

That could drive the decision to spend some time there and develop a prototype to start to detect and enforce it. Let's say you create a prototype, v1, or let's say v0.1, and you just want to test it out and see how it works. You can integrate it right away with Intel DB, and it will safely test against the environment: the data from your model goes into our database without affecting production, but at the same time you can compare how it would perform and get the relationships with the other nodes that exist there.

That gives very clear visibility into what's happening, and when we actually send things to production we're using that same database, so it's very clear what is going on. When you push that out, you have a lot more confidence about what you're sending out to start enforcing. A couple of things you could look for: is this whitelisted or not? Is this something that something else in our security offering is already covering? Would I be blocking new stuff, or stuff that would just add more mass? And how many of our customers are actually hitting what is trying to be detected or blocked?
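
The checks listed above can be sketched as simple set comparisons between what a prototype would block and what the graph already knows. The field names and numbers are invented for illustration; they are not the real report-card columns.

```python
def report_card(prototype_hits, whitelist, already_blocked, visited):
    """Compare a prototype's hits against whitelist, coverage, and traffic."""
    new_blocks = prototype_hits - already_blocked   # genuinely new coverage
    no_traffic = prototype_hits - visited           # hits no customer visits
    return {
        "whitelisted": len(prototype_hits & whitelist),
        "new_blocks": len(new_blocks),
        "pct_no_traffic": round(100 * len(no_traffic) / len(prototype_hits), 1),
    }

card = report_card(
    prototype_hits={"a.example", "b.example", "c.example", "d.example"},
    whitelist={"a.example"},
    already_blocked={"b.example"},
    visited={"a.example"},
)
print(card)  # {'whitelisted': 1, 'new_blocks': 3, 'pct_no_traffic': 75.0}
```

A high `pct_no_traffic` is exactly the orange-highlighted anomaly described next: lots of detections, but nothing customers are actually visiting.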

What we have is what I call a report card. Bear with me, it's kind of a big table, but the first two rows are just examples of things that have been in our database for a long time, and let's say the second two rows are new prototypes that were added. It helps you, at a glance, get an understanding of what a prototype is doing compared to everything else we have, and then we start to indicate anomalies that we see. For prototype number one, I highlighted in orange the percentage with no traffic: it looks like a lot of the hits it's producing are things no one is actually visiting right now.

That doesn't mean they won't be. What we try to provide are breadcrumbs, so that our researchers can say: hey, I think this is a pretty good start; what should I do next, where should I look next, and how can I make this better? We try to give stats so that you can see an anomaly, and researchers can then focus their time advancing their model in the right spot. Now that we've talked about how researchers can leverage our Intel DB to create more powerful security offerings, as well as how it connects with our enforcement system, it seems natural that it should flow together in what I call the research development pipeline.

Let's walk through it on a timeline. We take our existing data that's coming into our database, providing clues about what we should do, and ultimately there's a decision one way or another: hey, we should focus our effort on this. A researcher will start building some sort of prototype to detect it, and it can be anything; I'm deliberately leaving this abstract. After some amount of time, they'll be ready to plug it into the Intel DB, and they'll see exactly how it would do with the report card that it gives.

The anomalies we help indicate will give them clues: how can I refine what I've built so far, how can I make this better? Ultimately, there's a point in time where they say: yeah, that data looks really good, and we're ready to put this into the wild. What's really cool is that this whole time it has had no effect on what we actually enforce in our product, but all we have to do is flip a switch to change that. That handoff procedure is super easy for us, we've tested the whole time, and we see what's actually going to happen.

So there's a lot more confidence as we go through it. I like to think of it as a Puppet no-op procedure, if anyone here is familiar with that. The advantage that brings is confidence: when a researcher pushes something out, they have a much better understanding of what they're actually doing. Because we provide these clues to help them produce better research, they're able to produce these models slightly faster, which increases their velocity. And because we give them these breadcrumbs, they know where to look, and our offering is a lot more accurate.
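
The flip-a-switch handoff can be pictured as a feature flag, in the spirit of Puppet's no-op mode: the prototype writes its verdicts into the graph the whole time, but nothing is enforced until the flag is flipped. The class and names here are illustrative only.

```python
class PrototypeModel:
    """Toy model that runs in shadow (no-op) mode until enforcement is enabled."""

    def __init__(self, enforce=False):
        self.enforce = enforce      # False = shadow mode, nothing enforced
        self.graph_writes = []      # verdicts always visible in the graph
        self.enforced = []          # verdicts that actually took effect

    def classify(self, domain):
        verdict = ("block", domain)
        self.graph_writes.append(verdict)
        if self.enforce:
            self.enforced.append(verdict)
        return verdict

m = PrototypeModel()
m.classify("bad.example")
print(len(m.graph_writes), len(m.enforced))  # 1 0  (shadow mode)
m.enforce = True                             # flip the switch
m.classify("worse.example")
print(len(m.graph_writes), len(m.enforced))  # 2 1
```

Because shadow and enforced runs share the same code path and the same database, the behavior observed during testing is the behavior you get in production.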

"Always be advancing" is kind of a slogan we use internally. What's really cool is that our research is pushed into our product, and then all of the data from our product goes right back into how we do our research. It's a continuous cycle: we take all of the DNS logs that we see, which are enforced by the security that we provide, and we push those back into our graph, so we see all of it as it happens, relatively in real time. In summary: I talked about what my team does at OpenDNS, and I gave a brief overview of how the Intel DB works.

I covered the design decisions made within it, how it sits inside OpenDNS as a data resource between research and enforcement, how we leverage the data stored in that database to give researchers a better way to produce what they do, and finally the workflow that creates and how it improves our security offering in the end. I just want to give a huge thanks, not only to BSides for hosting me and making this an awesome experience, but also to the team that helped build this product with me, because it's not a one-person effort. It's been about a year and a half in progress so far, and it will be an ongoing project.

We'll continue to add new data sets, refine things, and always make sure our researchers are working fast and effectively. That's all. I'll be happy to answer any questions, or afterwards I'll be out in the hallway. Thank you very much.

Go ahead.