
Making Big Datasets Searchable by Calum Boal

BSides London · 14:40 · 119 views · Published 2022-01
About this talk
Calum Boal presents Crobat, a high-performance API for searching Rapid7's Project Sonar dataset of 1.8 billion DNS records. The talk covers the shift from a MongoDB-based approach to a sorted-file architecture with Redis indexing, enabling reverse DNS lookups, TLD enumeration, and subdomain discovery at scale while handling thousands of requests per second.
Show transcript [en]

All right, sorry. Hi everyone, I'm going to be giving a presentation on Crobat, which is a very fast API for searching Rapid7's Project Sonar dataset. So, let's get started. Who am I? I'm a senior pen tester and Head of Security Engineering, and this is my first time speaking at a conference in person. I made this project, and that's my Twitter in case you want to take a look. Right, so let's start with what this is. If people aren't familiar with Rapid7's Project Sonar dataset, it's a massive dataset of DNS records gathered through various means: for example, scanning the

internet and parsing common names out of certificates, looking at things like certificate transparency (CT) logs, and so on. It's huge: about 183 gigabytes of JSON data, with 1.8 billion records in it. Now, it's really useful data, because normally if you want to do anything with DNS you either have to fetch and parse it yourself, or brute-force it, which is fine for forward DNS lookups, but when you want to start doing reverse DNS lookups it becomes a bit more tricky. If you can access this data very quickly, it allows you to do some very rapid research,

like: give me every single S3 bucket on AWS. It won't get you all of them, but it'll get you a lot of them. Or you can say: here's an AWS IP address, let's do reverse DNS on that and see what's pointing at it. So it's an extremely fast DNS API. The old version, which I'll talk about in a bit, took around 100 milliseconds to do a lookup; the new one takes a few nanoseconds. If you try it now it won't take a few nanoseconds, because it gets about 8,000 requests a second now, but

yes, we can do reverse DNS queries. It supports single-address reverse DNS queries, or you can give it a CIDR range, so you can pull reverse DNS records for an entire /8, an ASN, or an entire /16, and it'll just give them to you, no problem. You can also find TLDs: if you've got a domain and you want to see what other TLDs it's registered under, usually you'd have to brute-force that, but with this you can just ask for that data back and it'll give you all the different TLDs, which can be useful for finding other domains. And obviously you can find subdomains.

It's not going to be the most comprehensive way of doing it, but you get results back instantly, so it's helpful for getting a quick look at what things are like: does this company have a VPN? Let's have a look. It sees thousands of requests per second in prod now. As you can see here, these are the Cloudflare stats for 24 hours. I don't know if you can actually see that well, but in the past 24 hours it had around 500 million requests, so it carries a lot of load. So now let's talk about how this is done. The old approach, which I had running in prod for quite a

while until it eventually started to degrade really badly under the load it was getting, was basically this: you take a domain name like that and split it into these components here, which allows you to put it in MongoDB, or any kind of database, and do a full-string index on the data, as opposed to having to do a full-text search. So you could build a composite index on, say, the domain and the TLD, and that allows for very fast querying, because it's matching an exact value rather than having to do substring searches.

The difficult part of the solution was getting the data into that shape in the first place, because with all the different TLDs out there you can't just chop off the first two labels after splitting on dots: some TLDs have three labels, some have four, and some are faux TLDs that don't really exist but are set up by companies. So there's the main parser up at the top, which can handle 400 million plus DNS records a second. It uses suffix arrays to do that, as opposed to regex or anything else. It's on GitHub, so you can use it; you can see
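The suffix-array parser itself is more involved, but the core problem it solves, splitting a domain on a multi-label TLD, can be sketched with a longest-suffix match against a known suffix list. A minimal Go sketch; `splitDomain` and the toy TLD set are illustrative, not the actual parser:

```go
package main

import (
	"fmt"
	"strings"
)

// splitDomain breaks a fully-qualified domain name into the three
// components stored in the index: subdomain, domain, and TLD.
// knownTLDs stands in for a real public-suffix list; a naive
// "chop the last label" split fails on multi-label TLDs like co.uk.
func splitDomain(fqdn string, knownTLDs map[string]bool) (sub, domain, tld string) {
	labels := strings.Split(fqdn, ".")
	// Try progressively shorter suffixes until one is a known TLD.
	for i := 1; i < len(labels); i++ {
		candidate := strings.Join(labels[i:], ".")
		if knownTLDs[candidate] {
			tld = candidate
			domain = labels[i-1]
			sub = strings.Join(labels[:i-1], ".")
			return
		}
	}
	return "", "", ""
}

func main() {
	tlds := map[string]bool{"com": true, "co.uk": true}
	sub, dom, tld := splitDomain("vpn.mail.example.co.uk", tlds)
	fmt.Printf("%s | %s | %s\n", sub, dom, tld) // vpn.mail | example | co.uk
}
```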

the path. So, the old approach was fast, but not fast enough; it crumbled under load even though it was already optimized. And it assumed that databases were the best solution for this, because, at least to me, I thought: people who write databases know what they're doing, probably more than I do, so we'll just trust that and we'll get the best results, right? We've got a fast database, fast indexes, great; they're smarter than I am, move on. But the difference is that databases have to work with very different constraints on their data. For example, you have data that changes, so you need indexing strategies that support

changing data, whereas this data doesn't change: we pull it from Project Sonar, import it, and index it, which means we don't have to worry about the data changing, and that opens up a lot of other ways in which we can search it. So: our dataset is static, sorted data has magical properties in computer science for doing things quickly, and we're going to chuck out the database. Right, so here's the way it works. We have a conversion tool which takes Project Sonar data and converts it into our Crobat format; we sort that data; and then we use Crobat to calculate an index over that data,

so we can instantly jump to the place in the file where the data we want lives. The Crobat server is then just a gRPC/HTTP server that serves that data back to you on request. So, let's look at how this approach works. As you can see, it starts out the same: we take the domain name and split it up into its parts, but then we rearrange it into domain, then TLD, then subdomain, in a CSV format. What this means is that if you index this and you're just looking for all the different TLDs of onsecurity,

you can search and jump by the first field; or, to get results limited to just onsecurity.io, you use the first two fields, and then you just have to reconstruct every line from that data into its original format. With reverse DNS, what we've done is convert the IP address to a decimal format, because that's easily sortable, searchable, and comparable, and then it goes into a CSV like this. Then we sort the data, so everything lines up, which means that if you know where the first occurrence of something is in this list, all you have to do to get
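The decimal conversion mentioned above is just a big-endian read of the four octets; a quick Go sketch (the function name is mine):

```go
package main

import (
	"encoding/binary"
	"fmt"
	"net"
)

// ipToDecimal converts a dotted-quad IPv4 string to its 32-bit
// integer form, which sorts, compares, and buckets like any number.
func ipToDecimal(s string) uint32 {
	return binary.BigEndian.Uint32(net.ParseIP(s).To4())
}

func main() {
	fmt.Println(ipToDecimal("1.2.3.4")) // 16909060
}
```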

the rest of them is read forward through the dataset until you pass them. This is especially helpful with reverse DNS, because now, if you're doing CIDR lookups, you just take the min and max IP in that CIDR range, convert them to decimal, jump to the start, and read forward until you hit the end. With the previous version, you'd have to do a different query for every IP in the CIDR range, calculate them all and request all of them, and there's a lot of overhead there because you're doing lots
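The min/max trick for CIDR lookups can be sketched like this (names are mine, and Crobat's actual code may differ): compute the first and last address of the range as decimals, then seek to the first and scan forward.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"net"
)

// cidrBounds returns the first and last IPv4 address of a CIDR range
// as 32-bit integers. With the reverse-DNS file sorted by decimal IP,
// a CIDR query becomes: seek to min's byte offset, then read lines
// until a line's IP exceeds max.
func cidrBounds(cidr string) (min, max uint32, err error) {
	_, ipnet, err := net.ParseCIDR(cidr)
	if err != nil {
		return 0, 0, err
	}
	min = binary.BigEndian.Uint32(ipnet.IP.To4())
	mask := binary.BigEndian.Uint32(ipnet.Mask)
	max = min | ^mask // network address OR inverted netmask
	return min, max, nil
}

func main() {
	lo, hi, _ := cidrBounds("10.0.0.0/8")
	fmt.Println(lo, hi) // 167772160 184549375 (10.0.0.0 .. 10.255.255.255)
}
```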

of lookups. So, are we finished? Is that it, just sort it and we're done? Well, no, because we can do better than that: we can get O(1) complexity on retrieving this data. Okay, how do we do that? The answer is: use hashmaps. With a hashmap, no matter how big it is, getting a value back always takes pretty much the same time, because you put a value in, it gets hashed, and the hash tells you exactly where it's going to be in the data.

I tried using a hashmap initially for this, but the indexes you end up with are quite large, so instead I'm using Redis, which is basically the same thing, just a bit fancier. So here's how Crobat builds its index. As you can see (I grepped this out of the file, so some of the reverse DNS data shown isn't quite accurate; it would all be in order, but here it's not), we sort everything, then we scan through the file looking at the domain, which is the first field of each comma-separated line, and we log the byte offset of that line in the file into Redis. Then we keep reading forward, and when we hit a different domain we write that one to Redis too, and so on, all the way through. So now we know the exact position in the file where every domain's data starts: as you can see here, redis-cli GET att tells you the exact byte offset in the file. We do the same thing for the reverse DNS, but if you did that directly for every single IP address you'd end up with a huge index, so instead it rounds every decimal IP down to
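The single pass that logs first-occurrence byte offsets can be sketched in Go; here a plain map stands in for Redis, and the three-line sample data is made up:

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// buildIndex scans a sorted "domain,tld,subdomain" file once and
// records the byte offset of the first line for each domain. In
// Crobat the offsets go into Redis; a map stands in here.
func buildIndex(data string) map[string]int64 {
	index := make(map[string]int64)
	var offset int64
	sc := bufio.NewScanner(strings.NewReader(data))
	for sc.Scan() {
		line := sc.Text()
		key := line[:strings.IndexByte(line, ',')]
		if _, seen := index[key]; !seen {
			index[key] = offset // first occurrence: log the byte offset
		}
		offset += int64(len(line)) + 1 // +1 for the stripped newline
	}
	return index
}

func main() {
	data := "att,com,www\natt,com,vpn\nexample,com,mail\n"
	idx := buildIndex(data)
	fmt.Println(idx["att"], idx["example"]) // 0 24
}
```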

the nearest thousand, or hundred thousand, whatever you want your bucket size to be, and puts that in. So when you query, it converts your IP address to decimal, rounds it down, retrieves the index entry, and then scans forward until it finds the data you're looking for, and keeps scanning until it's past it. So, I can do a quick demo, I think.
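The rounding-down step is plain integer arithmetic on the decimal IP; a tiny sketch, using a bucket size of 100,000 (one of the sizes mentioned):

```go
package main

import "fmt"

const bucketSize = 100000

// bucketKey rounds a decimal IPv4 down to the start of its bucket.
// The Redis index then needs one entry per bucket rather than one
// per IP; at query time you look up the bucket's byte offset and
// scan forward until you reach (or pass) the exact IP you want.
func bucketKey(ipDecimal uint32) uint32 {
	return ipDecimal - ipDecimal%bucketSize
}

func main() {
	// 1.2.3.4 is 16909060 in decimal; it lands in bucket 16900000.
	fmt.Println(bucketKey(16909060))
}
```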


Does that work? Yes, there you go. So let's get back the subdomains for att.com... you can see it finishes in just under a second, and if you count the lines, there are seventeen thousand. We can do reverse DNS too, a one-to-one lookup... you get a lot of stuff back. If you were to do AWS or something, you'd get about four gigabytes of data back from it. So yeah, that's basically it. Are there any questions? [Audience question] It's 180 gigabytes. Sorry, yes, the question was how big the dataset is: it's 180 gig. [Audience question] About an hour or less, I think.

So, Rapid7 gather the data and they release it. The question was whether the data goes stale. What Rapid7 do is run these scans every week or so and then release the data, so we're basically taking that new data, ingesting it, and throwing out the old stuff. Anything else? Yes, you can go and download it yourself, which you'd think some of the people doing thousands of requests a second to this would do, because it seems like they're just scraping the whole thing, but apparently not.

So, it's because it's too fast.

Oh yeah, sorry. Yes, the whole thing is completely open source; it's on GitHub as SonarSearch. Additionally, I should probably mention that this is available here, so you can go to the site and test it out like this, and it'll give you back all your domains, same with the reverse DNS lookups and the TLDs.

Sorry... this is the GitHub for it. There's a command-line tool you can install, written in Go, which will stream the data from the server over gRPC, so you don't have to do any pagination; it just streams until it's done.

The reverse DNS stuff is very useful. If you're looking at a host, for example, and you want to see what other vhosts are on it, it's very easy to reverse that. If you look at a company's ASN and you want to find out what domains are on it for vhosts, you can run the whole ASN through reverse DNS, and it becomes very easy to find assets they have. Similarly, if you're looking at one range and you know the client's got one domain, you can do reverse DNS through that and often you'll find other top-level

domains. I don't mean TLDs; I mean full domains which they use and own, which you otherwise wouldn't know about, because they've got nothing to do with the company in terms of naming and so on. All right.