
I for one welcome our new Cyber Overlords! An intro to ML in cybersecurity

BSides Lisbon · 2016 · 58:26 · 711 views · Published 2016-11
About this talk
In this talk we will present some techniques that we use on a day-to-day basis in our research, where we combine our internet-wide data scanning and acquisition platform with ML/data science techniques, which allows us to find things faster or extract results in a more automated way. We will focus on practical cases and examples that even our audience at home will be able to use if they want. We will start by giving a very brief introduction to the data science world and talk about: technologies; techniques; how these relate to infosec; algorithms and how they can be used; how people can come into the world of data and machine learning; and data visualization techniques and the best choices for different types of data. A couple of the examples we will look at are how to classify images such as VNC or X11 screenshots, OCR, network scans and using machine learning to classify them, and the use of natural language processing to analyze CVEs. We will look at scoring and classification algorithms and how they can be used on IP addresses, and we will talk about the use of learning and how we are applying it in real life. We will also talk a bit about a data analysis and classification pipeline architecture: we will look at the different technologies, what they do and how they can be used.
Some specific examples of our research that should give you an idea of some things we will talk about can be seen here: https://blog.binaryedge.io/2015/11/10/ssh/ https://blog.binaryedge.io/2015/09/30/vnc-image-analysis-and-data-science/ https://blog.binaryedge.io/2015/08/10/data-technologies-and-security-part-1/
About the Speaker: Tiago, Filipa, Ana and Florentino swim in data every single day. From looking at what people are downloading to how they are exposing themselves, we LOVE DATA! Tiago (@Balgan) is the CEO and Data Necromancer at BinaryEdge; however, he gets to meddle in the intersection of data science and cybersecurity by providing his team with lovely problems that they solve on a daily basis. Filipa (@filipacsr) is the Data Diva at BinaryEdge; she dances the macarena with numbers to get them to tell her all their dirty secrets. Florentino (@fbexiga) is the Data MacGyver at BinaryEdge; on a daily basis he needs to deploy infrastructure used to analyse big and realtime data. When not doing that he can be found creating models to analyse data: give me an orange, I'll give you a Skynet. Why an orange, you ask? I'm hungry and like oranges, there! Ana (@ana_barbosa90) is the Data Ferret at BinaryEdge. She is small and hides between the 110th and 111th characters of the ASCII code to see and show data in that unique perspective of someone who can't reach the box of cookies stored on top of the capital 'I'.
Transcript [en]

Alright, thanks everyone for coming to our talk. I hope you enjoy it. It is a bit weird, because, you know, BSides is a purely security conference, but our talk is actually a mix between a bit of business, a bit of data science and a bit of security. I think you guys will find it interesting, but hey, give us some feedback afterwards. So our agenda for today is this one: we're going to explain to you guys who we are and why we're talking about this, and where we use machine learning and data science in cybersecurity. We're going to talk about something called the image workflow, which my colleagues will explain afterwards, and we'll talk a

little bit more about what it takes to analyze an image in an automated form, and also talk a little bit about data visualization. When I said just now that we're going to talk business, data science and security, it's exactly those topics: security in general, because the data we're playing with is security related; data science, because of the image analysis; and the business part, because you guys can have the best data in the world, but if you don't have a proper way to show it to your clients it's not really worth much. So my name is Tiago, I'm CEO of BinaryEdge, and the team that's presenting here today all belongs to BinaryEdge. Florentino is our

data engineer. He takes care of all the plumbing involved between us getting the data and sending it from one place to the other; he takes care of all that. Filipa does the proper data science and machine learning part at BinaryEdge: she works with machine learning models, and she does data quality analysis so that we know we're not losing results, or that we can optimize algorithms in scanning, things like that. And Ana receives all of this data from us and then transforms it in a way that is useful for our clients and for the people who use that data. They're all going to talk about their different areas. As a

whole it might seem like four different talks, but you have to understand that this is a pipeline: there is me telling them I want this data (I'm the business requirement); there are Florentino and Filipa working to make sure we get that data correctly and transport that data correctly; then there's Ana transforming my requirements into something I can hand to the people that asked it of me. If I had to describe BinaryEdge, this is the perfect image to describe it: we play with machine learning, we have some hacking skills, security is our domain expertise, and there are lots of statistics, because, you know, you can put it in a fancy way, but machine learning and data science is just, you know, statistics. So how exactly did we

get here, to mix data science and security? Right now BinaryEdge scans 200 ports of the entire IPv4 space per month; that's about 1.4 billion events a month. And then we also do torrent monitoring. Right now that number is way, way bigger: we're monitoring way more than 746,000, and this keeps increasing every month, and it also generates a lot of events. So as you guys can imagine, we've got this huge, to use a buzzword, data lake, and we needed a way not only to make it useful for our clients but also for us to do some exploratory data analysis, which brings us to interesting things that I'll show you guys in a little bit. When I say we scan the

entire IPv4 space, it leads to things like this. This is one way of representing our scans. If you guys look closely, the map doesn't actually have any lines; it's all based on geolocation, and each dot is an IP address where we found some service running. Another way of showing the same data is like this: if you see some color on the different ASNs, it's services that have been found on IP addresses associated with those ASNs. Essentially it's looking at the different types of coverage that we have: geolocation coverage, or coverage on the different ASNs. Associated with this type of data and this quantity of data, as I

said, many questions appear: how many IP addresses did we find in job X versus job Y? What's the most common service on port 80? How many SSL certificates are expired in the entire world? All of these questions led to a combination of two things that was the answer, data science and machine learning, and that's how we decided to create the team that now is the data science team. So data science is actually pretty cool. There are different areas; they're separate things, and sometimes people confuse them a little bit. For data science, we use it a lot for doing the initial analysis and cleanup of the data, and exploratory data analysis, so, you know, actually extracting

some knowledge from this data, understanding some correlations, things like that; data visualization, which I will show you in a little bit; and knowledge discovery as well, which I'll talk about a little bit further on. In machine learning we've got things like classification; clustering, so joining things together in smaller groups; identification; and similarity matching, which you guys will see being used in our image workflow, which we'll show in a little bit. And then I've got a note on regression. Regression is usually used for things like forecasting and prediction. It is my personal opinion, and I stress that this is only my personal opinion, guys, that regression doesn't work that well

in security, and some of the reasons for that are these things. Right now machine learning scenarios are not prepared for cybersecurity, in my opinion. You've got lots of adversarial scenarios, so the classifiers are not even working with people that, you know, are amicable and want them to be good. For prediction, the scenarios and data are sometimes too volatile, and you don't have proper sources of truth data that you can use to train your models. And of course there is the lack of data, in quantity and quality, to train your models. So there's still a lot of work to be done to actually properly use machine learning in security. What

I'm saying with this, is it that machine learning is not useful in security? No, I'm just saying that it's not a silver bullet. You have to grab the problems that you have in security and break them into smaller problems, and about ten percent of those small problems can be solved with some technique in data science and machine learning. But it's definitely not a silver bullet, and if anyone is trying to sell a product as if it is, they're lying. There are some good cases of machine learning being used: things like antivirus, 10 detection, IDS/IPS, and source code analysis, where there are lots of people that have made some really good progress. And also for sentiment

analysis of, like, emails, tweets and social networks of employees. So say you guys have got a company and you know the Twitter handles of its employees: you can collect some of the tweets they write about the employer, do some sentiment analysis, understand how they feel about the employer, and it's possible that you can detect some outliers that could be internal attackers. But again, it's all still a bit hocus-pocus, though parts of it also work. I could stay here all day and talk to you guys about our data collection. This image shows you what we currently acquire, but I'm not going to do that today. I am showing you this

image, however, because I just want to run through some use cases of why having all this data is important and interesting. Anyone knows what this image is representing, or can anyone guess? Let's do it, anyone? OK, so the three big balls that you see, the three big circles, the white, the purple and the blue, were the three biggest data leaks that happened this year and that were shared via torrent. Essentially what we do is grab these torrents and have our system monitor them, which means we can see which IPs are sharing that data, which ones are seeding, which ones are downloading it, all of that. And the fun fact is that

when we cross the data we got from all those three torrents, as you guys can see, there are some IPs that downloaded, you know, two of them. We then cross this with trying to identify who those IPs belong to, and, fun fact, all those IPs either belong to China or are from military institutions. This is what's interesting, because all of these data leaks were of private governmental data from Turkey, the Philippines and the US, if I remember correctly. So, you know, states are also monitoring this type of stuff; they're also downloading these leaks and using them for something. Another case, this one: anyone wants to try and guess what it is? Again, the ball

an appointment. So the red circle down here was the Turkish torrent that was leaked, and we were quite lucky, because when this was announced we were awake and we just started monitoring from the beginning. There was one IP address, this one right here, that was sharing this data for a really long time and straight from the beginning. So it's possible (again, this is all interesting correlations, right?) that this was the initial seeder of that data. From that we triggered a query on our other platform, the port scanning one, and we found that that IP address belonged to a Turkish government institution and had RDP open. The RDP screenshot had an email

on it. That email had a Facebook account associated with it, which was from a system administrator that belonged to the same government institution that the leaked data belonged to. So again, it's an interesting correlation. It might not mean anything, maybe it does, but there is the possibility that the initial leaker of this was an internal person. So as I mentioned, we've got lots of types of data, but today in this talk we're just going to focus on that tiny, tiny, tiny piece over there. Why is that piece interesting? Because we usually come to situations like these, where we get millions and millions of screenshots from VNC, X11 and RDP of the entire internet, and we needed a fast way to search

through them, because I don't want to waste my time looking at Windows lock screens, and I don't want to waste my time looking at Linux consoles that are locked. I want to see useful stuff, and so did our clients, so we worked on building this. Let's hope the internet works. Let's see, does it work? Cool. So, images: as you guys can see, these are all screenshots that we take of the entire internet, and this is quite boring. Who agrees? Cool. So what if we wanted to look at some really cool stuff, like SCADA? Anyone doesn't know what SCADA is? It's critical systems, right: the stuff that runs the water pipes, the stuff that runs the electricity to your

home, things like that. We want to search, for example, for "alarm". In reality, what happens here is that I did a search on the text that's inside these images. I am not searching banners; I'm searching the text that's inside these images. Let's see: I typed "alarm", and as you can see, right here, "alarm". People might think that this is a fluke, so shall I, you know, try a challenge? Someone give me a keyword. Don't go too complicated, right, give me something common. Bank? Nuclear? Nuclear, let's see then. Oh, this is going to be interesting. OK, so can anyone help me out, does anyone see "nuclear" somewhere? Please tell me you do. Yes, yeah, in the middle: physics, nuclear physics.

So the man deserves a clap at least, come on. What other cool stuff do we do with images? We found something really interesting. By the way, I hope all of you are older than 19, because we're about to get NSFW on this. We've got a mobile provider in India; interestingly enough, they sell this model of Android phones, and all of them have X11 open to the internet with absolutely no authentication. So you find stuff like this: these are all mobile phones in India that are completely open. Actually no point. So yeah, as you guys can see, this is some of the

stuff that we find. What else? Let's just go back up here. So if I go back to Windows, right, and if I type something like J-O... sorry, let's just go "John". Can the audience use it? You can, but just a second, I'll let you guys play around with it in the workshop, right? I just wanted to show you guys something real quick here. So you guys see this John Donnelly, whatever. One other thing that we do... oh, I'm in them reigns, just a second, sorry all of you for that. Images, John, can we get John here? Oh come on, just give me a second. Right, so a couple of things that we do as well, as you guys

can see here: every time an image comes into our pipeline, we calculate similar images to that original image. What this allows is that if we find something interesting, like Honeywell, we can just click on it and get access to all the other IPs that have that same image, without having to manually look for it. I just wanted to quickly show you guys something here, but it's being complicated. Let's just see, and I promise it's my last one. Come on, let's see if I'm lucky. Alright, so the other thing we extract: the cute guy, yeah. We also detect that there is a face present in the images, and we do this for

millions and millions of images; every image that we acquire, we do this for it. One last thing that you guys might find interesting as well, those of you that know Windows well: going back to our John Donnelly guy, he's got an email right here. We extract all these emails from these things and we throw them against an API we have internally that collects all the data leaks we find in forums. If you guys have heard of Have I Been Pwned, we've got a copy of Have I Been Pwned internally. So we automatically know: there's this set of IP addresses of a local company, they have RDP open, they have this user, this

user has Facebook profiles, LinkedIn profiles, they've got this email that we extract from RDP, and they are present in these data leaks. So it's a lot, a lot of information. And I'm now going to cross over to Florentino. Oh, just before that: we went to Pixels Camp, a conference about two or three weeks ago, and everyone had a public IP on the Wi-Fi. Yes? Sure, the privacy, or what not, for us. Yeah, so one of the first things that we did when we started the company was hire a firm of EU and Swiss lawyers, and there is a set of guidelines that we need to follow to be within the law. So, and that's

where privacy is essential. So, you know, then I just go: lawyers, take care of this, and it doesn't become my problem. At Pixels Camp everyone had a public IP address, and automatically we caught RDPs and things like that open, because people still get protected by NAT. You know, everyone always criticizes NAT as security by obscurity and all that, but security by obscurity through NAT has been, I think, with us for a really long time. And now I'll pass over to Florentino, who will actually talk to you guys about some details on things that we do. So now that Tiago has scared all of you, I'm going to get really technical. You just saw a portal, which is maybe the

best thing we have, and it's really interesting to take a look at what is behind that portal. We gather a lot of data about these images, and the way we provide that data, and other sources of data, to the portal is through microservices, right? We found that this was a really simple way of creating an architecture that can deal with several types of data, and it's really lightweight. So how do we get this thing? When we scan the internet, or any target, there are types of scans, those for VNC, RDP and X11, that may or may not generate a screenshot, and if we get screenshots we create a unit of work that

we then send to what we internally call our image workflow. Now, what is this? We decided that we wanted a robust way of processing all these images, and, as you might guess, finding logos, detecting faces and running OCR is kind of heavy, so we needed to create something that scales. There are options on the market to create things that are distributed and scalable, but we found that we didn't really like any of them, so we built our own. We did this by creating simple Python modules, each one responsible for a certain task, and linking all of them. We have, on the cloud, let's call it the queue, and we

put work there, and the modules just go there and pick up work. What this allows is that if you need more power for a certain task, you just launch another process equal to the others, and it just needs to go get work from the queue and then store the data it collects in the database. This way you can easily distribute and scale. There are some concerns: you have to do a lot of error handling and stuff like that, and when you work with the cloud there are lots of errors that you don't know you're going to get, so it's kind of an iterative process. This is not perfect yet, but it works pretty well.
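A minimal sketch of the queue-and-workers idea described here, under stated assumptions: an in-process `queue.Queue` stands in for the cloud queue, a Python list stands in for the database, and the names `run_workers` and `handler` are hypothetical, not BinaryEdge's actual modules. It only illustrates identical workers pulling work and one failing item not stopping the rest.

```python
import queue
import threading


def run_workers(tasks, handler, num_workers=4):
    """Process tasks from a shared queue with identical workers.

    Returns (results, errors). The in-memory lists stand in for a database.
    """
    work = queue.Queue()
    results, errors = [], []
    lock = threading.Lock()

    for task in tasks:
        work.put(task)

    def worker():
        while True:
            try:
                task = work.get_nowait()
            except queue.Empty:
                return  # nothing left to pick up
            try:
                output = handler(task)
                with lock:
                    results.append((task, output))
            except Exception as exc:
                # Record the failure so one bad item does not kill the worker.
                with lock:
                    errors.append((task, repr(exc)))
            finally:
                work.task_done()

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results, errors
```

Scaling up for a heavy task is then just a matter of raising `num_workers` (or, in a distributed setup, starting more copies of the same worker process against the shared queue).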

If you can afford using cloud services, do it, because it takes a lot of work to maintain a lot of aces and all of that, and there are errors everywhere, so if you can avoid that overhead, please do. So what exactly do we do? I'm going to talk about the workflow as a whole and not each module, but essentially we generate a notification each time we get a new screenshot. We take that image and we start by extracting the target metadata, and then we create this signature. You can think of this signature as encoding the whole image in a string. As you might have guessed, you lose detail, but this allows for fast comparison. So how

this works: this is really a hash. Now, for pretty much most of you here, when we talk about a hash we immediately think about cryptographic hashes, and this is different. Cryptographic hashes are more about the properties of the file itself, while this type of hash, called a perceptual hash, really takes the properties of the image itself. This allows an image that is slightly rotated, or has a different color, to be classified as the same. So when we receive an image we just generate that small string, the signature, which is very quick, and then we generate small partial substrings. Then, in the database, we index all of these small strings, and

when we want to search for similar images we go to that index using those substrings and we get partial results, partial matches, and then we only have to compare a small subset of images. This way we can obtain, as you have seen, the similar images on the fly and very quickly; it wouldn't be scalable to compare every image against every other image. Afterwards we perform a little test on the images. As you might guess from what you saw before, a lot of the images that we get are really just black screens, and that's no use: we're not going to run OCR against that, it's not going to give us anything. So we perform a
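The signature-plus-substring-index idea can be sketched as follows. The talk does not name the specific perceptual hash, so this sketch swaps in a simple average hash as an assumption; `average_hash`, `SignatureIndex` and the 4-way split are hypothetical names and choices. Images are plain lists of grayscale rows (at least 8x8) to keep the example dependency-free. This particular hash tolerates brightness shifts; it is not a demonstration of rotation tolerance.

```python
from collections import defaultdict


def average_hash(pixels, size=8):
    """Return a size*size-bit signature string for a grayscale image."""
    height, width = len(pixels), len(pixels[0])
    cells = []
    for by in range(size):
        for bx in range(size):
            y0, y1 = by * height // size, (by + 1) * height // size
            x0, x1 = bx * width // size, (bx + 1) * width // size
            block = [pixels[y][x] for y in range(y0, y1) for x in range(x0, x1)]
            cells.append(sum(block) / len(block))
    mean = sum(cells) / len(cells)
    # One bit per cell: brighter than the image average or not.
    return "".join("1" if cell > mean else "0" for cell in cells)


def chunks(signature, parts=4):
    """Split a signature into (position, substring) keys for indexing."""
    step = len(signature) // parts
    return [(i, signature[i * step:(i + 1) * step]) for i in range(parts)]


def hamming(a, b):
    """Number of differing bits between two equal-length signatures."""
    return sum(x != y for x, y in zip(a, b))


class SignatureIndex:
    def __init__(self, parts=4):
        self.parts = parts
        self.signatures = {}
        self.by_chunk = defaultdict(set)

    def add(self, image_id, signature):
        self.signatures[image_id] = signature
        for key in chunks(signature, self.parts):
            self.by_chunk[key].add(image_id)

    def similar(self, signature, max_distance=3):
        # Only images sharing at least one exact substring are candidates.
        candidates = set()
        for key in chunks(signature, self.parts):
            candidates |= self.by_chunk.get(key, set())
        return sorted(
            image_id for image_id in candidates
            if hamming(signature, self.signatures[image_id]) <= max_distance
        )
```

With 4 substrings, any two signatures that differ in at most 3 bits must share at least one substring exactly, so the candidate lookup cannot miss a match as long as `max_distance` stays below the number of parts.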

little test, which my colleague Filipa is going to talk about in more detail afterwards, to check if the image has any relevant information or not, and if it does we then perform some more interesting stuff on it. We have several tasks: logo detection, text detection, OCR; you can do porn detection if you want. We enhance the image for each one of these tasks, and when we say enhance, we mean applying certain filters, like, for instance, grayscale. Then we perform these tasks, and then with the results, well, if we have anything else we want to do, this is when we do it. For instance, when we get the results from OCR, you might want to run

the emails against our data leak API, and so on and so forth. Now I'm going to pass over to Filipa, who will talk in more detail about the algorithms that we use in those steps. So, as Florentino said, the first step of our image workflow is filtering: we try to just remove all the images that do not contain relevant information for our case, like completely blank screenshots. One very obvious way of doing this type of filtering is just to check how many colors an image contains, and if it contains only one color we can discard it. But if I have something like a completely blank screenshot with

a white pixel somewhere, well, this type of image also doesn't give me any useful information. And if I apply the rule of also discarding images that contain only a few colors, I'm going to lose very interesting information, like command lines. So a more flexible way of doing this type of filtering is using the concept of entropy, more specifically Shannon entropy, which is given by this equation, and for images it gives us a measure of the complexity. High entropy values correspond to greater information content, or to more complex images, for example. So if I have a completely blank screenshot, all the pixels are alike, so the entropy will be 0, because I really don't have any type of

information in this image. But if I have, like, a command line, the entropy will be greater than 0, because I have a little bit of information, and in the last case I have a highly complex structure, so the entropy will be much bigger. OK. So with this in mind, we can then define a threshold to keep only the images containing a certain degree of information. Now I'm going to show you a notebook that I prepared for this presentation, where I'm going to show you how you can use very, very simple libraries in Python to detect the faces, to detect the logos and to extract all the
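A rough sketch of the entropy filter described here, assuming the Shannon entropy of the grayscale value distribution, H = sum over values of p_i * log2(1 / p_i). The function names and the `threshold=0.1` cut-off are hypothetical; the talk does not state the threshold actually used. Images are lists of rows of 0-255 values.

```python
import math
from collections import Counter


def shannon_entropy(pixels):
    """Entropy, in bits, of the pixel-value distribution of a grayscale image."""
    counts = Counter(value for row in pixels for value in row)
    total = sum(counts.values())
    # H = sum(p * log2(1 / p)); a single-colour image gives exactly 0.
    return sum((n / total) * math.log2(total / n) for n in counts.values())


def is_worth_processing(pixels, threshold=0.1):
    """Keep only images whose entropy exceeds an (assumed) threshold."""
    return shannon_entropy(pixels) > threshold
```

Unlike a plain colour count, a blank screen with one stray white pixel still scores very close to zero and is filtered out, while the cut-off can be tuned so that sparse but useful screens such as command lines are kept.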

texts. So the first thing I'm doing in the first line is just importing OpenCV and a few libraries that are image tools, and here is an example of the type of image we find a lot. Of course this is an image created by us for this presentation, but the truth is that we find a lot of this type of Windows 10 screenshots, where we have the user's photo, the email and the name. So our goal is to detect that this type of image contains a face, to extract the email and the name, and also to classify the screenshot as Windows 10. Let's go. So let me start with the face detection. The most used process to detect faces is,

of course, to build a model with a lot of images containing faces and images without faces. You can train your own model, but there are plenty of models pre-trained with thousands and thousands of images, and one of these models is the Viola-Jones framework, which is very popular although it was published in 2001, so a long time ago. Nowadays we have more sophisticated things, like deep learning, but for this presentation I'll talk about the Viola-Jones framework, because it's really simple to use, it's fast, it works really well in most cases, and it's available in OpenCV. The details behind this really clever framework are out of scope for

this presentation, but you can find all the details in their original paper, entitled "Rapid Object Detection using a Boosted Cascade of Simple Features". So for this presentation I'm going to show you how you can use this framework in Python. The first thing to do here is to load your image in grayscale, because it simplifies the process a lot. Then you have the Viola-Jones framework available in OpenCV, in the class called CascadeClassifier, and you have to load the XML file that contains the pre-trained classifiers, in this case for frontal faces. You also have other classifiers available, for smile, profile, upper body, lower body. So you just have to load the XML file of the things that you

want to find in your image, and then you just have to apply the classifier to your image. As you can see, we have a set of parameters that must be tuned depending on the type of image you have, and if faces are detected it will return the positions here, and then you can draw a box or a square around the face to see where the faces were detected. So if I just apply this to our RDP screenshot, you can see that both faces are detected. But if I change this value here just a little bit (this parameter is related to the scale of the image), you can see now that the smaller face is not detected, and if I

increase it just a little bit more, both faces are not detected, right? So this tells us that these parameters are really sensitive, and you should know what they mean, because the success of this algorithm will depend on the type of image you have, and so you have to optimize these parameters. You can find all the details, again, in the Viola-Jones paper and also in the OpenCV documentation. So now let's move to the logo detection. The really most simple approach to detect logos is to use this concept of template matching. So I have a template and I have an image, and I want to know if the template exists

somewhere in my image. So what I can do is just slide the template over my image and compare that part of the image with the template. In Python, in OpenCV, you have a function called matchTemplate that compares two images and returns a metric, for example the correlation coefficient, and you know that if the correlation is very strong, both images are really similar. Then you can create a function like this one to slide the template over the image. For this example my template is a Windows 10 logo, and you can see that it can find it in the right place. So this is a really, really simple approach, and there are two

major problems with that. The first is that if your template is much smaller or much bigger than the logo that exists in your image, you won't be able to find it. Second, if we have a lot of iterations in this for loop, of course it will be a very slow process. So depending on the type of image you have and depending on your goal, you can use more sophisticated things. For example, you can train a model that learns to recognize certain logos in your image, but for that you have to have a reasonable number of images for training, and sometimes it's really difficult to get such a data set. And finally, let's try to extract all the
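As a dependency-free sketch of the sliding-template idea described here: the function below moves the template over every position and scores each patch with the correlation coefficient. The names `correlation` and `find_template` are hypothetical, and images are plain lists of grayscale rows. In practice OpenCV's `cv2.matchTemplate` with `cv2.TM_CCOEFF_NORMED` computes the same kind of score far faster; this pure-Python loop also makes the slowness of the nested iteration easy to see.

```python
import math


def correlation(a, b):
    """Pearson correlation between two equally sized flat lists of pixel values."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    da = [x - mean_a for x in a]
    db = [y - mean_b for y in b]
    denom = math.sqrt(sum(x * x for x in da) * sum(y * y for y in db))
    if denom == 0:
        return 0.0  # a flat patch carries no pattern to correlate with
    return sum(x * y for x, y in zip(da, db)) / denom


def find_template(image, template):
    """Slide template over image; return (best_score, (row, col)) of the best match."""
    image_h, image_w = len(image), len(image[0])
    tmpl_h, tmpl_w = len(template), len(template[0])
    flat_template = [v for row in template for v in row]
    best = (-2.0, (0, 0))  # correlation is always >= -1
    for r in range(image_h - tmpl_h + 1):
        for c in range(image_w - tmpl_w + 1):
            patch = [image[r + y][c + x] for y in range(tmpl_h) for x in range(tmpl_w)]
            score = correlation(patch, flat_template)
            if score > best[0]:
                best = (score, (r, c))
    return best
```

Because the template is compared at its own pixel size only, a logo that appears much larger or smaller in the screenshot will not score well, which is the first limitation mentioned in the talk.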

texts. So in Python you have this OCR engine that is called pytesseract, and you may think that applying an OCR tool is a very straightforward process, but it can be a real challenge, because the ideal scenario for an OCR tool is to have an image with a white background and the letters in black. So if I just apply this to our original image, you can see the results, and it's really bad, right? I don't have anything here. Now if I convert the image to grayscale and apply the OCR again, as you can see it's a little bit better, but it's not perfect, because my goal is to extract all the text, or at least the email and the name.

So let me show you the results that we are obtaining at BinaryEdge. As you can see, we are able to extract all the text, but it's not perfect, because I have some lines there, but at least we are meeting our primary goal, which is to detect all the text. So the trick to applying an OCR tool is just to apply filters to your image before you apply the OCR tool, because your goal is to just erase the background and keep the letters that you want to recognize. So, summarizing: from our RDP screenshot we detected both faces, we extracted the email and the name, and we also found the Windows 10 logo, so we can classify
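A minimal sketch of the kind of pre-OCR filtering described here, aiming for black letters on a white background. The talk does not say which filters are actually applied, so the grayscale conversion, the fixed `threshold=128` and the majority-colour inversion are assumptions, and the function names are hypothetical. Images are lists of rows of `(r, g, b)` tuples; the result would then be handed to an OCR engine (for example pytesseract's `image_to_string`), which is not called here.

```python
def to_grayscale(rgb_pixels):
    """Convert rows of (r, g, b) tuples to rows of 0-255 luminance values."""
    return [
        [round(0.299 * r + 0.587 * g + 0.114 * b) for r, g, b in row]
        for row in rgb_pixels
    ]


def binarize_for_ocr(gray, threshold=128):
    """Return a black-text-on-white image (0 = text, 255 = background)."""
    binary = [[255 if value >= threshold else 0 for value in row] for row in gray]
    flat = [value for row in binary for value in row]
    # Assume the background is the majority colour; flip if it came out dark.
    if flat.count(0) > flat.count(255):
        binary = [[255 - value for value in row] for row in binary]
    return binary
```

A login screen with light text on a dark wallpaper is the typical case where the inversion step matters: without it the OCR engine sees white letters on black, which is the opposite of the ideal input described in the talk.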

the screenshot as a Windows 10 screenshot. And now I'm going to pass over to Tiago. Yes, alright, so back to me for closing this up. As I mentioned to you guys, we've got this whole scanning engine, and before I hand over to Ana, the reason why I'm doing a small, you know, second demo is that she's going to show you lots of different types of data and how to visualize them, so I want to show you guys some of that data before she talks about how to visualize it. So let's try and do something here. Here's a stream; let's see if we can pull this off. Tell me now. So what you guys are going to see

here on the left is going to be my user stream, and my user stream is essentially the stream where the results I ask for in a job get delivered. Now we are going to try and request a scan. OK, so can you guys pick a country, please? Russia? A nice country. I don't know... France? Sure, let's go with France. Anyone knows the country code for France? Is it FR? Yeah. Alright, so does anyone not understand what I'm writing on the left? You can see it? You can't see it? Oh, this one. Anyone knows how to increase the font size? So I would just go... essentially, what I wrote here, I

know... I'm not really... increase font size, ah yeah, yeah, that makes my text bigger. Is it like that? No, not with... Command plus? Yes, the problem is this is a Mac and I don't even have a plus, because it's just

good, that was good. OK, let me know when you can actually see. Never mind, I'll just point it out, you'll be fine. So we're triggering a job: we're essentially telling the platform, please scan France on port 5900 using our VNC module, right? And our VNC module is the one that takes screenshots on VNC. What you guys are seeing here is a real-time scan of all the IPs in France; it's detecting which ones have port 5900 open, and hopefully in a second we'll start seeing some screenshots coming in. I'll leave that running for a second, and in the meantime... So we've got modules for all sorts of stuff: we've got modules for SSH, we've got modules for

MQTT, we've got modules for RDP, and I'm going to show you an example for SSH. So that means if you ask to scan France with the SSH module, this is all the data we return to you guys: the port, all the ciphers, all the algorithms, all the public keys that the server returns. We return all of this information, and we return this sort of information for all the different modules, for all the different protocols, which means it's different types of information, and therefore they need different visualizations. And this is what Ana is going to talk about. Let me just see if we start... well, I'll leave this running and maybe we'll start to

get some screenshots, because France actually has quite a large IP range. Ah, because we have the GeoIP database thing that essentially tells us these ranges are assigned to these countries, and they get updated every 24 hours. It's MaxMind and some others; there are some other databases. Yes? No, not right now. Um, so I'll just pass over to you for a second and you can do your part, and in the end maybe we'll look at this. So I'm going to talk about data visualization, and I just want to show you the pipeline, or the workflow, we use

for data visualization, for all types of data. So first, when I receive the data, like in the format you saw on the black screen, I need to see the variables and what's interesting, and think about the audience and what I want to represent. And then I'm going to think about the presentation, the details I want to show, the tools to use, and finally finishing up, but I'll show you this in detail. So first, before showing all the types of visualization, I want to tell you that for me it's really important to experiment and spend some good time testing different visualizations. Try it with the team, or friends,

colleagues, and see what works. This here is a really bad example. We wanted to show how many open ports an IP has, and we ended up not using it because it was really over-complicating the idea we wanted to transmit. So next, the results we have here: I had to trim it a little bit because it was a really huge amount of data that you wouldn't be able to see. But a simple visualization is this Venn diagram. Not all visualizations need to be complex; they just need to be understandable. Next we have the top 10 web servers, and here I want to mention that it's really important to pay attention to the

details and be precise when creating a visualization. For example, if I hadn't anchored the x-axis at 0, it would distort the way we look at the data. And consider the calculations that you need to do to be precise in the lengths or areas or whatever you are showing, to give the right information and not the wrong perception. Here we have relational data: on the top you have the six email protocols, and then they are grouped according to their encryption, and they are color-coded too, to make it easier for you to read. Then here we analyzed the big data technologies, MongoDB, Memcached, Redis:

each square represents two terabytes of data, and we analyzed over a period of a year, so you can see the differences in the amounts of data exposed during that period. And then, spatially of course, the 10 countries most vulnerable to Heartbleed. And going back to the images again, with the OCR results, here's a simple word cloud of the top 20 words most commonly found on VNC screenshots. Then I'm going to show you this data in two ways. In this one I added a legend, title, subtitle, and a font that's easy to read. And now I'm going to show you a different version: it's the same visualization but done differently, which is going to highlight the importance

of spending some time on the details. Here you can see it with different fonts, for example ones considered sloppy, like Times New Roman or Comic Sans, and it doesn't show the title in such a beautiful way as you've seen before. It makes a difference when you're presenting this to clients. And just to finish up this part: color is a great way to give emphasis to some data. You can see that the first two boxes are the ones that have the highest counts of IP addresses. And then one thing that we always discuss at BinaryEdge is, of course, automation versus originality, because those visualizations were created in Illustrator, or it could be

any other tool, but it took a while. So it would be much faster to do it with a programming language, Python or R, and just do the fine tuning in Illustrator. But of course it doesn't have the same effect and loses a little bit of the originality, so it's important to find a good balance between the two. And then, just to finish up: it's really important to document every step of the process, like the calculations for areas and every choice you make for the visualization according to the type of data that you have. Review everything, and for the future consider what could have been done differently and what could be better, of course. And take constructive feedback, even if it means letting go of a

visualization, like the one I showed in the first slide. So, to finish up our talk, just two more slides. First one: we published a study of the entire IPv4 space, publicly available at ise.binaryedge.io. We looked at a couple of protocols, and some of the slides you've seen today are extracted from these reports, so feel free to have a look. The other thing as well: we've got a little something for you guys. We've created a small comic that explains how we came together as a data science team; anyone that wants one, feel free to come and get one. And, any questions? That's it from us. Yeah, ask away. Oh sorry, please go

on. Which database do you use to store the results? Depends. Right now we've got quite a few different technologies just for databases, because depending on the type of query you're doing, and depending on the type of data you're storing, you want different types of databases. And what happens when the internet is all IPv6? We're already scanning IPv6. We're definitely not going to do it the way we did it before, because it's impossible. But essentially what we do right now is, with the torrent stuff, and we've got a couple of servers on the NTP pool, we're starting to collect these lists of IPv6 addresses that are already allocated, and instead of

brute forcing it the way we did with IPv4, we're just going to start scanning those until we get to the same level of scanning we have on IPv4. My first question was about IPv6, but I have another one: are you doing it actively and comparing with other types of tools, like OSINT tools, to do correlation between, say, social media and the information you are gathering about the devices? So, as was mentioned before, that's kind of the line where we stop. These correlations that have to do with the torrents are private for us, and we would never provide our clients the

ability to do that on our platform. We're a data gathering company. I'm happy to say that the data correlations we do for our studies are private, internal, and we wouldn't do that. If you buy the raw data, you do whatever you want with the raw data. By the way, it's not really whatever you want, because we do vet our clients; we don't just sell to anyone. You know, maybe some of you work for the Russian mafia: if you come up to BinaryEdge and we're not able to vet your background, we will not sell to you. Do you sell to governments? Okay, never mind.

About the devices: are you extracting metadata, like the screen size, and comparing different devices? So you can grab the banner. Let me just see, because if I can get this running again, this will answer your question. So essentially we extract as much metadata as we can: if we try to identify the service, we'll extract a banner, we extract which version of the service is running in there, and that correlates with an internal CVE database that we have. So yeah, we extract as much as we can, and the screen,

yes, if this comes back up again and hopefully is taking screenshots, let's see. Come on, give me just one. If anyone sees a link pop up in there, please let me know. And in the meantime, more questions? Yes? How long do you hold on to data? Six months, the maximum that we're legally allowed to. Next question. Data quality? We used to, in the beginning, and that's why we created the data science team. So we've got this really big client, I'm not allowed to say their name for NDA reasons, and they're really, really picky about their data

quality. They're awesome, and it's great, and it makes us work really, really hard on our data quality. And this is, by the way, one of the differentiating factors that we have versus our competition. If you go to Censys, Shodan, whoever else is scanning the internet: Shodan, for example, they've got 2.5 million IP addresses that have an FTP server running on them; we've got 16 million. The reason for that: data quality, improving our algorithms on scanning and all that. A lot of the work that we do is on data quality. What else? Oh yeah, so where was the screenshot? There we go. So okay, let's just go

down here for a second. First, can you see that? That's a part of the metadata that we extract, so we have essentially the title of the computer and the screenshot. Hopefully I won't get in trouble. Let's see what we got here. Well, it's just a lock screen, but this was a real screenshot. We did not go to our history database; it was really taken right now. For Portugal we get lots of things like POS systems, things like that. Another one will pop up, we'll see. Any more questions for me? Yes? On data protection, I have to ask explicitly, if I can:

are you affected by that regulation? Where do you stand? So if I remember correctly, you're talking about the new EU data protection, right? Yeah. As a company we're not affected by it right now; at least from what we understand, we're not really affected by the EU data protection. I have a question. Yes? Is there any way for an ISP to say to you, or other scanners out there, I don't want to be scanned? Yes. One of the things that came from our lawyers is that we need to have a blacklist. Send us an email with your ranges and a way of proving that you are the owner of, or responsible for, those ranges, and we'll put

them on the blacklist, and you'll never hear from us again. Yes?
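
The opt-out blacklist described in this answer can be sketched in a few lines. This is only an illustration under stated assumptions, not BinaryEdge's implementation: the ranges are reserved documentation networks used as placeholders, and `is_blacklisted` and `filter_targets` are hypothetical names.

```python
import ipaddress

# Hypothetical opt-out list: ranges whose owners asked not to be scanned.
# These are reserved documentation networks, used purely as placeholders.
BLACKLIST = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("198.51.100.0/25"),
]


def is_blacklisted(ip, blacklist=BLACKLIST):
    """Return True if the address falls inside any opted-out range."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in blacklist)


def filter_targets(targets, blacklist=BLACKLIST):
    """Drop opted-out addresses from a list of scan targets."""
    return [ip for ip in targets if not is_blacklisted(ip, blacklist)]


targets = ["192.0.2.10", "198.51.100.200", "203.0.113.5"]
print(filter_targets(targets))
```

Checking targets against the list before a job is dispatched, rather than discarding results afterwards, means an opted-out range is never contacted at all.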

So, uh huh. So I'm going to risk playing a little bit with the devil here again, right? Let's see what happens. Our CTO, are you back there? How do I do a free-text search? You write "text"? Really? Okay, just that? Are you sure? So essentially you've got free-text search on every single field of data that we collect. So no correlation, but you're able to do text search across all of these things as well. Again, the correlations that we do are more on the level of: okay, we've got this IP address, we've seen it here, we've seen it there, and we explore a little bit about, you know, what services

are running on it. More? I would like to go back to the same subject: you sell data to European countries? Okay, yeah, sure. Now, in regards to the European law, it's actually something that came out quite recently. We've got lawyers that essentially will look at it and try to understand how it works. Because again, and this is probably the one important thing: we don't try even default usernames and passwords. We do not try them. This is

things that are completely open on the internet without any type of authentication. Okay, yes? What's your standard scanner stack? So you've got different phases, as I explained. The first thing we do is check which ports are open, and then it follows this pipeline as well. For which ports are open, we've got custom scanners that we've written for TCP and for UDP. They're separate scanners, just something custom we wrote. You know, it's not very hard doing scanning on the entire internet; masscan and ZMap do it. What's hard is actually dealing with the bugs that both well-known scanners have, and being able to receive all those messages at the same time. So our scanners are really

custom, you know, written in C, TCP and UDP scanners. We then have a custom-written service identification scanner, and that's a hundred percent compatible with the likes of nmap and a couple of other scanners that identify services, so that we're able to have compatible probes, because we do have clients that like to send us the probes that they've written and make them compatible with our platform. So, custom-written scanners by us, to be able to identify services really fast. Your scanning nodes, I assume, probably get blocked and shut down every once in a while. How have you dealt with that? So that's essentially, I would say, fifty, sixty percent of Filipa's job. She has this really cool

dashboard, yes, this really cool dashboard where she can compare all the different jobs that have the same targets over time, and we essentially try to monitor what the quality of the IPs is looking like. Are we losing something? Because sometimes it's not even about being blocked; sometimes, you know, one of the providers of our minions, as we call them, goes down and we're losing data from there. And the system needs to be able to think: the tasks I sent there are going to be lost, try to send them somewhere else. And a lot of the work that, you know, TR back there has done is focusing on building a really good

job manager that can handle all that stuff. Yeah, a little story as well: we tried playing around with different types of orchestration frameworks and we essentially just ended up building our own, because it's really crazy. You know, it's something that I would have thought would have evolved a little bit better, but you go to these guys that do task management and job distribution and they say: yeah, you can have this task, and then you can code in the next one. All right, but I want dynamic tasks: I want that when this gets detected, it automatically generates a new task. Yeah, we don't do that. So we ended up building

our own. Oh, yes? Are you doing some kind of inspection of files, like content inspection, so you can download a file and extract metadata from the content itself? We've got what we call a recon platform, which I'm not showing here, and essentially that allows us to do things the other way around: we put in a domain and it will go and get all of those files and extract names from those files. So it's in the opposite direction: just pulling the files and extracting metadata. Yeah, okay. Yes? Were there trends or things that you came across through your data analysis that weren't, like, ever

in your imagination? Yes, ah, absolutely. So VNC, it's scary stuff, right? You find your factory open to the internet; we found all sorts, heaters, things like that. But then we went and scanned MQTT, and that's when it became real fun. So we found prisons online, like the controls of the doors of the jail, on an MQTT broker open without authentication. We found nuclear research facilities that have sensors that are measuring the nuclear reactors, or something like that, sharing that data in there. Have a look at the report we published; we've got the MQTT stuff in there, and for me, in my opinion, it's personally the scariest chapter of the entire thing. Another funny thing that we

found: Ana showed it in her slides, we found that there was a break here. Do I still have time? Two minutes? Cool. This thing right here, Redis: if you check here, there was a huge drop starting in August, it's the yellow stuff, and from our side we saw that drop but we didn't think much of it. Fun fact: the guys from Duo Labs, I don't know if you guys have heard of them, really cool people, they do some really good products, they came to talk to us, and actually they found out the same thing, but they found out why. There was some guy that went on

Shodan, extracted the whole Redis list, and was logging into the Redis servers, encrypting all the data, leaving a URL, and locking the Redis, actually making it secure, because the data was no longer exposed. And that's why we see that drop: over a few days, this guy decided to lock all the Redis on the internet. Yes? What's the mean time from discovery to classification? Okay, so yeah, I don't really know the answer to that, so I'll just give you a metric: we can do the entire country of Portugal, which is 5.2 million IP addresses, in about five minutes for one port. So, you know, the thing is,

the way we have our platform prepared, if a customer is willing to pay, we just raise our scanners from 100 sensors to 200 and the time gets halved. So it depends. I can't give you an exact metric because there isn't an exact metric; there's just the one we use regularly, which is usually, I think, 50 to 60 sensors scanning. What else? Okay. Just to follow up on that: what's the largest you give a client to use? Oh yeah, usually if it's really, really large scans they just consume our firehose. So we've got two things that we sell: we've got two firehoses, one of torrent data, one of

scanning data, which is a 242 per month. And if it's really large scans, it compensates to just consume the firehose. Unless they want something really custom, then they just do on-demand scanning, and then it's usually smaller scans, because it's for their clients or for smaller infrastructure, for example ones that just want monitoring, things like that. Oh, okay. Yes? Do you do internal scanning as well? So, really cool stuff: we've got an app on the App Store, Apple TV, iOS, with which essentially you guys can scan your internal network, and

it acts, so to say, as a minion that you would find on our normal infrastructure. The reason for this, two reasons: first, we want to start working out what we can get internally and how to handle internal data. Second thing: you're able to categorize the different devices on your network, and I want these guys in a couple of months to write a machine learning model to categorize devices, and I need good data for that. Sorry, is the information from the device sent to you, or is it anonymous? Anonymous. By the way, we don't send the source IP that it came

from, the public IP. Are images sent? What images? No, there are no images. The data that the app sends is just: I had this amount of IP addresses with these devices. Yes? Okay, I think I'm done. Thank you guys very much.
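
As a rough check on the scan-speed figures quoted in the Q&A (Portugal, 5.2 million addresses, about five minutes for one port, with the time halving when the sensors double), here is a back-of-the-envelope sketch. It assumes scan time scales inversely with the number of sensors, which is a simplification. The 55-sensor baseline is an assumed midpoint of the "50 to 60" mentioned, the talk does not say how many sensors the five-minute figure used, and the function names are hypothetical.

```python
def scan_rate(addresses, seconds):
    """Average number of addresses probed per second."""
    return addresses / seconds


def scaled_time(base_seconds, base_sensors, sensors):
    """Estimated scan time, assuming perfectly linear scaling with sensors."""
    return base_seconds * base_sensors / sensors


PORTUGAL_IPS = 5_200_000   # figure quoted in the talk
BASE_SECONDS = 5 * 60      # "about five minutes" for one port
BASE_SENSORS = 55          # assumption: midpoint of "50 to 60 sensors"

rate = scan_rate(PORTUGAL_IPS, BASE_SECONDS)
print(f"{rate:,.0f} addresses/second overall")
print(f"{rate / BASE_SENSORS:,.0f} addresses/second per sensor")

# Doubling the sensors halves the estimated time under this model.
print(scaled_time(BASE_SECONDS, BASE_SENSORS, 2 * BASE_SENSORS), "seconds")
```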