"OSINT'ing at Scale", Ben Menzies, CSides July 2020

BSides Canberra · 2020 · 43:46 · 419 views · Published 2020-07
Transcript [en]

It's been really exciting, because we can extend the presentations to a much wider audience than if we had done this face to face. But like I said, this isn't about us talking, it's about our two great speakers tonight. First up we have Benzies, talking about OSINT, or open-source intelligence, and he'll be talking a little about how to scale that and how he uses it for what he does. Benzies has a great bio that I'll paraphrase terribly: he's been in IT since he was 17, is really a veteran of IT now, and works in IT security, with a long and varied career leading to where he is today. So let's all welcome to the stage Ben Menzies. Thank you very much, over to you.

Hey, thanks very much guys. All right, let me share my screen first. Here we go. How are we going, can we all see that? Looks great? Perfect, okay, cool. First of all, thanks for having me here at CSides. It's a great opportunity to talk about the things I do outside of work, which relate very much to what I do at work as well. It feels good, even with this virtual audience, to be able to do this sort of stuff outside of work in a virtual environment and, like you just said, present to a much wider audience of people.

So today we want to talk about OSINT, and how we do it at a much larger scale. First of all, an introduction. Thank you, Silvio. I am Ben, but a lot of people refer to me as Benzies, and I have that username mostly reserved online: you can see my Twitter account, my Twitch streaming, my LinkedIn and my GitHub, mostly reserved as Benzies. I'm a husband, I'm a father, I'm a mountain biker, and I stream those mountain biking events live as well, so outside of all this you're very welcome to join me on my mountain biking adventures.

To start you off, I have a little quote here which I decided to coin myself, a twist on the famous saying that a picture is worth a thousand words: I think the data is worth a lot more. The purpose of today's talk is to go through three key points. We'll be looking at capturing user-generated data related to coronavirus, or COVID-19; we want to baseline that data and find out the limitations around it; and then, what do we do once we've got it? One of my friends asked me, why do you want to do this? And a lot of the time it is a good question. I'm interested in finding out what we can get online and what is available to capture, but mostly: what are people saying about coronavirus at the moment, who are these people, and are there any trends in this sort of data?

So to start with, we want to look at capturing the data, and I see this as a three-stage approach. The very first point in focus is to ask some questions about what we want to do. Instead of the who, what, when, where and why, I want to focus on the what, where, why, who and when. In more detail: what are we searching for, as a really good starting point; where is it coming from; why do we know it's right, instead of why are we doing this; who shared it; and when was this information shared?

The second stage is the tools we want to use. The most common tool OSINT people use is a web browser, because most of the tools provided for OSINTing are delivered through one: simple things like Google searching, Twitter, or Yandex. We'll look briefly at using the APIs, but we won't go into those too deeply, because this is OSINT and we want to focus on freely available data. Then we'll look at batch capture, the scripts and apps that will help us along this journey.

The last part is automating all of this. Once we know what we're looking for and how to get it, how do we automate that so we have a continuous feed of data into an application of our choice? There we can focus on cron jobs, or any sort of data streaming application.

So the first stage is asking the questions about the data we want to capture: the what, where, why, who and when. To make it obvious, what we are looking for is coronavirus, and we've got a few keywords to begin with: corona virus (with a space), coronavirus, covid19, covid-19 and covid_19. Those are some of the keywords people use on social media when talking about coronavirus. The where is mostly social media at this point, because we are looking for trends within people, and we've got at least six options. There are obviously more, but the six here are Twitter, LinkedIn, Reddit, Facebook, Instagram and TikTok, for anyone who still has TikTok installed. Today's focus is mostly on Twitter: as ambitious as I wanted to be, I realised the enormity of the work if I started covering every social network.

Why do we know it's right? We can go to any of these social media applications and type in the specific keywords to search for. You can see where I've highlighted the search field and sorted by latest, and the keywords are highlighted in the few tweets shown there. We can see the same in the LinkedIn data: run the same search terms, sort by latest, and the keywords pop up there as well. Who is it coming from? That's probably the more obvious answer: we see the usernames highlighted in these events. And the when is also quite obvious: we get these events within a short window of time, typically within a few minutes, even a few seconds, so we know this is all quite active.

The next part is the tools we want to use. A web browser is your most obvious tool, and Firefox specifically is my preferred browser for these kinds of things, because we can separate tabs into containers, and containers keep cookies, and the information shared between sites, isolated from each other. In this example I've created many different containers, specifically to separate Facebook and LinkedIn from each other so the targeted advertising is reduced, but we can do the same thing if we're using different accounts on Twitter or Facebook.

The other benefit of using a more advanced web browser like this is that we can access the console to look at the data being generated. The next tool is APIs. Twitter has great API features and provides a lot of data when you search with it. There are some limitations, namely that you will need to pay beyond a point, but for free use it's quite good. It does step a little outside the OSINT boundaries, in that you need to register an account and capture the data through that account, but luckily it's not too bad, and with Insomnia, the tool shown here, we can easily generate these queries and then create a script, all generated through the application itself, to easily create new jobs.
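
To make that concrete, the kind of request Insomnia builds can be reproduced in a few lines of Python. This is only a sketch against Twitter's standard v1.1 search endpoint; the bearer token is a placeholder you obtain by registering an app, and the query parameters are illustrative rather than taken from the talk.

```python
# Sketch: the kind of query Insomnia generates against Twitter's v1.1 search API.
# BEARER_TOKEN is a placeholder, not a real credential.
import requests

BEARER_TOKEN = "YOUR-BEARER-TOKEN"

resp = requests.get(
    "https://api.twitter.com/1.1/search/tweets.json",
    headers={"Authorization": f"Bearer {BEARER_TOKEN}"},
    params={"q": "covid19 OR coronavirus", "result_type": "recent", "count": 100},
)
resp.raise_for_status()
for status in resp.json()["statuses"]:
    print(status["created_at"], status["user"]["screen_name"], status["text"])
```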

The next part is to look at what scripts and apps are already available, to leave the hard work out of it. Since today we're focusing on Twitter data only, the twint application is the preferred application in this circumstance. Then, for the automation, for getting the data and streamlining this a little more, two really easy ways to approach it are cron jobs, which can be timed and batched, mostly on Linux systems, or a data streaming application.

Now, the thing with cron is that it's pretty straightforward: you just need a script and you need to time it. You can make a lot of scripts and run them in parallel, and here I've specifically highlighted GNU Parallel, a Linux application for running multiple jobs at the same time and capturing data quickly. But as I found out later, there are some limitations to running twint in parallel, so I added a little quote after I figured that out: it's mostly for people who like to live, laugh and love. The other way to streamline all of this is a data streaming application, and in this example we have two we can use: ELK and Splunk.

So now that we've figured out what we want to get and how we're going to get it, we can start getting that data. As I mentioned before, we have the application called twint, and it is fantastic, a pretty amazing tool actually, and very easy to use. The GitHub page explains how to install it, and you can install it through various methods, the easiest being pip: you can literally type pip install twint and get going, or go right down to grabbing the source and installing from there.

On the GitHub page they provide some examples of how to capture data. In the example here we're looking for the keyword covid19, but we can also specify usernames instead of just keywords. There's an option to capture tweets from a certain period of time, to export them as CSV or JSON files, and even to export straight into an Elasticsearch instance. Once we've figured out the data and how we want to capture it, we can automate that with cron jobs and output those files quite easily as JSON, or use the option to export directly into Elasticsearch. Easily enough, you can run a container that starts up Elasticsearch, and once you've started directing all that data into it, you have a data feed.
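
As a rough sketch of that workflow using twint's Python configuration: the keyword, date and file name below are my own illustrative choices, and the crontab line in the comment is just one way to schedule it.

```python
# Sketch: capture tweets for one keyword with twint and export as JSON.
# For a continuous feed, schedule it from cron, e.g.:
#   */15 * * * * python3 /opt/osint/capture.py   (illustrative path)
import twint

c = twint.Config()
c.Search = "covid19"             # keyword (c.Username would target a user instead)
c.Since = "2019-07-01"           # capture tweets from a certain period of time
c.Store_json = True              # export events as JSON...
c.Output = "covid19-twint.json"  # ...into this file (Store_csv works too)
# c.Elasticsearch = "http://localhost:9200"  # or stream straight into Elasticsearch
twint.run.Search(c)
```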

But I decided that was too easy, so I wanted to set myself a bit of a challenge and put this into Splunk. I guess the question is: why did I want to do this? Well, first, it was a challenge for me. I prefer using Splunk; it's my tool of choice for data streaming. I thought it looked pretty easy, and again, I like Splunk, it's very easy to use. But I want to make a point about when I started this journey: it *looked* easy. You're probably thinking at this point that I could have done a manual search and imported what I needed into Splunk, or created the cron job and ingested the output automatically through Splunk's inputs. But I thought I could make this a little better and build an app for Splunk, because I thought it looked easy.

On building a compatible app: Splunk has a few options to make this journey a bit easier. There's the Splunk Add-on Builder; you can do Python scripts, batch scripts, and I think there's another input type that escapes me off the top of my head. These automatically generate code and various variables for you to build on, so that code is quite easy to work with.

twint also has some modules which give examples of how to code your own custom inputs for this sort of thing. In their example they look for Donald Trump's tweets over a period of time and export them as JSON, all in Python code. But as I started coding all of this from twint's module examples, I ran into a bit of dependency hell, and the xkcd comic here shows exactly what that felt like. pip makes life a little easier, because you can install all of twint's dependency requirements into a specific directory instead of the system Python directory. Even then there was still some trouble behind all of that, and the real reason is that Splunk ships its own Python SDK, separate from the operating system's, mainly so that you can run Splunk on any operating system. So after a bit of struggling to generate code for Splunk myself, I decided to see if I could put this into a container. Luckily, I have a bit of experience with Docker containers, and they are absolutely fantastic to work with a lot of the time, because they are super easy to build and all the dependencies are met within the container.

They run separately from Splunk's input processes, which is a real advantage: I can restart Splunk at any point and the ingestion, the data capture, continues to run in the background. The other advantage is that I can export the events from the container itself and put them directly onto the host's volume storage.

To save you the time of going through all the problems I hit generating this code, I'll mostly highlight the important bits of how it all works together. Python is fantastic here; it makes it so easy that I can just import the docker module and the twint module to help me out. Actually, I don't think I need the twint import in this example; I think I just left that in by accident. I specify a few variables: the keyword passed in from Splunk, and another option to translate the tweets as they come in. All of that is sent to the docker module to run on the Docker daemon on the host, and once the run finishes, the container removes itself, which cleans up the directory and the processes really nicely.
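
A minimal sketch of that container launch using the docker-py module. The image name, command flags and paths below are assumptions for illustration; the talk doesn't show the actual values.

```python
# Sketch: run a (hypothetical) twint-wrapping image on the host's Docker
# daemon, write results onto host volume storage, and clean up afterwards.
import docker

client = docker.from_env()  # connect to the Docker daemon on the host
client.containers.run(
    "twint-capture:latest",                      # hypothetical image name
    command=["--search", "covid19", "--translate"],
    volumes={"/opt/splunk-output": {"bind": "/output", "mode": "rw"}},
    remove=True,   # the container removes itself once the run finishes
    detach=False,  # block until the capture completes
)
```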

Just to visualize how this code works: starting on the left, we specify the Splunk input, which causes Splunk's Python SDK to execute the Python script and start a volume for the container. The Docker container then starts up the twint application, which pushes the tweets off to the Google Translate API, which feeds back down into storage, where Splunk reads that JSON file and ingests it into the indexes. Once the full process completes, it removes the Docker container. I think it's pretty cool, and I'd been wanting to work on a project like this for a while, so this was a pretty good excuse.

So what does this look like from Splunk's point of view? It's actually quite nice once you get all of this working: you can avoid the CLI and just use Splunk's GUI instead, and it's pretty straightforward at this point, once you've done the hard yards. You can specify a name for this type of input. The keyword is really important. Then there's a little feature of mine, the "since" option, which is how many minutes before the current time you want to search; in this example I use 60 minutes.

The option to translate the tweet or not is really important too, because there are limitations to running the translation, and then there's the interval this runs at, which is also really important; we'll get to that soon. The rest doesn't actually matter: where we specify source type and index has no effect, because I couldn't figure out how to send this directly into Splunk rather than writing to a file and having Splunk pick the file up. And this is how Splunk picks up the file: you have an input monitoring your directory, in this example the /opt/splunk output directory, looking for any file name that ends in twint.json. The code outputs files named with the date-time, the keyword used, then twint.json, so we can easily verify the time the events were created and which keyword was used. The props.conf specifies how the data is parsed, turning it into something a lot easier to work with in Splunk. You're probably thinking at this point that it's like a cron job with extra steps. You're right. But it works, I'm happy, and I can leave it there for now.
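
Sketched as Splunk configuration, the file pickup described above might look roughly like this; the sourcetype name, index and timestamp settings are assumptions based on the description, not the actual config from the talk.

```
# inputs.conf -- monitor the output directory for twint JSON files
[monitor:///opt/splunk/output/*twint.json]
sourcetype = twint:json
index = osint

# props.conf -- parse each line as JSON so the fields are easy to work with
[twint:json]
INDEXED_EXTRACTIONS = json
TIME_FORMAT = %Y-%m-%d %H:%M:%S
TIMESTAMP_FIELDS = created_at
```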

Now that we have this all working, how do we baseline it, and what can we do with it? We need to capture an event at this point and go back through our verification of the what, where, why, who and when, and in this particular event I've highlighted all of those. Starting with the what, we can find the keyword we used, coronavirus or covid. Where is it coming from? We mentioned Twitter, and we're still focusing on Twitter. Why is it right? We can verify through the actual tweet itself that it picks up the keywords. Who is it from? We have that field right here, "username". And the when: in this particular event it's formatted to my local time, but the events come in as UTC. It's really nice because it comes in as JSON; it's really well formatted and very easy to work with. That's a debate I'll get into another time, but in my opinion all data should be JSON or XML formatted.
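
For illustration only, a captured event has roughly this shape. The field names below follow twint's JSON output; the values are invented.

```python
# Hypothetical event, values invented; field names follow twint's JSON output.
event = {
    "date": "2020-07-16", "time": "09:41:12", "timezone": "UTC",
    "username": "example_user",                       # who shared it
    "tweet": "Another week of covid19 lockdown...",   # contains the keyword
    "language": "en",
    "likes_count": 94, "retweets_count": 12, "replies_count": 3,
    "hashtags": ["covid19"],
    "link": "https://twitter.com/example_user/status/1",  # invented URL
}
```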

Now that we've figured out what we want, we want to baseline this data and focus on one thing at a time. So I set the new inputs to focus on the keyword covid19, and I set my time period to this time last year: whenever I run this particular input, it will focus on this point a year ago. And like I said, I went wild at this point. I wanted everything related to covid19 from a year ago, and I wanted to translate all of those events too, because I noticed a lot of them are not in English. But I soon learned there are some limitations around all of this; we still need to work within the OSINT boundaries, and we needed to find a better way to do it.

So now that we're starting to get data, what limitations do we hit? Google is kind enough to tell me: at this point I'm hitting "too many requests for this URL". I'm hammering twint, I'm hammering Twitter, and I'm hammering the Google Translate API. Thankfully, or unthankfully, through automation it keeps retrying until it fails entirely, and then it tries again. Here we have a graph that shows exactly that: a fairly inconsistent data capture, with peaks and troughs. I needed to find a way to smooth that out, and thankfully that was pretty simple. All I needed to do was set a smaller time window for capture, set a delay between each of the keywords I used, roughly two to three minutes per keyword, and stagger those cron schedules in Splunk so it ran more smoothly. Once I figured out the best timing for capturing data, I started to see much more consistent data delivery through Splunk, as represented in that graph.
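
As a sketch of that smoothing, reusing the twint setup from earlier: one small capture window per keyword, run in sequence with a gap between keywords. Only the two-to-three-minute delay and the smaller window come from the talk; the keyword list and file names are illustrative.

```python
# Sketch: sequential, rate-friendly captures instead of hundreds in parallel.
import time
from datetime import datetime, timedelta

import twint

KEYWORDS = ["covid19", "coronavirus", "lockdown", "pandemic"]  # illustrative subset
WINDOW_MINUTES = 60  # a smaller time window for each capture

for keyword in KEYWORDS:
    c = twint.Config()
    c.Search = keyword
    # only look back over a small window rather than a huge period
    since = datetime.utcnow() - timedelta(minutes=WINDOW_MINUTES)
    c.Since = since.strftime("%Y-%m-%d %H:%M:%S")
    c.Store_json = True
    c.Output = f"{keyword}-twint.json"
    twint.run.Search(c)
    time.sleep(150)  # two to three minutes between keywords
```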

So now that we've got some data, what do we do with it? There are some really simple things to do at this point. In these dot points I want to focus on: which hashtags come up as the most popular, who the noisiest users are, which tweets are liked the most, and who has the most mentions in a tweet, which also indicates noisy users. Once I had a baseline, I focused mostly on hashtags, because I wanted to use those as keywords. In the keywords here I have quite a few things: china flu, coronavirus, covid and all the variations of covid19; one of my favourites, covidiots; lockdown, pandemic, rona, second wave, which has come up fairly recently; stay home, and trumpvirus, another personal favourite of mine. Once we've decided on the keywords, we go through the feedback loop: each time we capture a new one, we collect it, we analyse it, we verify the data is correct, we course-correct, which is just adding a new keyword, and then we go through the same process again.

But once I started capturing this data, I noticed something was a bit wrong with it: there are so many users, and it's so noisy. There are bots, there are news articles being tweeted, and there are just a lot of unpopular tweets as well. So I needed a way to start, to borrow a term a friend of mine used, "cleaning the garbage". I had some very simple logic checks, represented here in the Splunk query language using eval statements and comparisons, where we specify a greater-than or less-than value for a field. The logic checks I used in these examples are: the tweet length must be greater than a certain number of characters, to eliminate short tweets where people only use the hashtag; the tweet must have some sort of popularity, so we know a conversation is actually happening; and I removed anything with bit.ly links, because I found most of those were news articles. Once I had that data, I could go back through my verification steps again: I could look at the particular tweet itself by copying the URL inside the JSON data, visiting the tweet, and verifying that all of my keywords and usernames are met. Now I know my data is correct.
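
The checks themselves are Splunk searches, which aren't reproduced here; sketched instead in Python over the raw JSON file, with illustrative thresholds, the same "cleaning the garbage" logic might look like this.

```python
# Sketch: drop hashtag-only, unpopular and news-link tweets from a twint file.
import json

def keep(event):
    text = event["tweet"]
    return (
        len(text) > 60                # longer than a bare hashtag
        and event["likes_count"] > 5  # some conversation is actually happening
        and "bit.ly" not in text      # bit.ly links were mostly news articles
    )

with open("covid19-twint.json") as f:
    events = [json.loads(line) for line in f]  # twint writes one JSON object per line
clean = [e for e in events if keep(e)]
```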

Once we're here, we can start thinking about the use cases we want to generate out of this, so we can actually get some value from the data. Here I've created some really simple ones, where we focus on graphing the data, tabling the data, and some more advanced things like sentiment analysis. To start with, I wanted to focus on the most popular tweets per day, and here is my graph to visualize that. I'll step through this search a little first. We specify where we're looking for the data, which is an index; we group the data per day; we put everything into lowercase, specifically the username, because when I didn't have the username in lowercase I was generating duplicate events; then we chart the data so we can visualize it on a per-day basis, and stream it into this graph. And then we have this really nice view of who has the most popular tweets per day. That isn't necessarily one tweet; it could be multiple tweets, but we look at the popularity of that person on a given day.
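
Without reproducing the Splunk search, the same per-day aggregation can be sketched in pandas over the cleaned events from the previous step; the second aggregation here also covers the highest-frequency tweeter described next.

```python
# Sketch: per-day popularity and per-day tweet counts from cleaned events.
import pandas as pd

df = pd.DataFrame(clean)  # "clean" is the filtered list from the sketch above
df["username"] = df["username"].str.lower()     # lowercase to avoid duplicates
df["day"] = pd.to_datetime(df["date"]).dt.date  # group the data per day

# most popular tweeter per day: total likes per user, then the day's maximum
likes = df.groupby(["day", "username"])["likes_count"].sum()
print(likes.groupby(level="day").idxmax())

# highest-frequency tweeter per day: count tweets instead of summing likes
counts = df.groupby(["day", "username"]).size()
print(counts.groupby(level="day").idxmax())
```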

The next one is the highest-frequency tweeter per day. Instead of popularity, we're looking at who's making the most noise, and it's a pretty simple thing: we count the tweets from a particular user in a day, chart them on a daily basis, and it gives this really nice visualization of who is making the most noise about coronavirus at the moment. Then we go on to the highest users of hashtags; again, it's about finding out who is making the most noise. And the last one was to find out who, on average, uses the most characters in a tweet. I still have a bit of work to do on that one: I was finding it was going outside the boundaries of Twitter, which has a 280-character limit now; sometimes it was reporting 340 characters, which I found odd. But I averaged most of those out, so it should be a bit more accurate, and again it's about the noise of these users. To save you the pain of going through each one individually, I'll focus on these last bits, which are the table and the more advanced analytics.

Once we've verified how we get this data and can visualize it, we can start to correlate the time of the event, the user who created it, and the popularity of that particular tweet. It may be a little hard to see in this example, it's really hard to fit these screenshots onto such a small screen, but at the very top we have the time of the event, the user who created the tweet, the tweet itself, and how many likes are on that tweet on that day. In this table the most popular one is right near the bottom, where there are 94 likes for that particular tweet on that day.

Then on to the more advanced material, where we can start looking at the sentiment analysis and the n-grams of these tweets, and find out what the top terms, the most used words, are across all of them. Surprisingly, or perhaps unsurprisingly, "covid" is the most used word out of all the tweets I've captured, which should be sitting at around 400,000 events now, and then, I guess unsurprisingly as well: coronavirus, pandemic, people, lockdown, case, social, get, trump, virus and go. The bottom graph shows the sentiment analysis behind these tweets. It may be only somewhat accurate; I probably need to train the data models a little better. We're showing more positivity than negativity, but then again that could reflect people having a positive view of the current situation and environment we're in rather than a negative one, whereas the negative views could be focused on Trump and all the political arguments at the moment, and the neutral views again come down to the data model set we have here.
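
As a sketch of both steps: term frequency with collections.Counter, and sentiment with NLTK's VADER analyzer. The talk doesn't name the sentiment model it used, so VADER is purely an assumption here, and the code again reuses the cleaned events from earlier.

```python
# Sketch: top terms across all tweets, plus coarse sentiment buckets.
from collections import Counter

from nltk.sentiment import SentimentIntensityAnalyzer  # needs nltk.download("vader_lexicon")

texts = [e["tweet"].lower() for e in clean]

# unigram frequency across all captured tweets ("covid" came out on top)
terms = Counter(word for t in texts for word in t.split())
print(terms.most_common(10))

sia = SentimentIntensityAnalyzer()
buckets = Counter()
for t in texts:
    score = sia.polarity_scores(t)["compound"]
    buckets["positive" if score > 0.05 else "negative" if score < -0.05 else "neutral"] += 1
print(buckets)
```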

So now we've got tweets, we've got keywords, and we can identify users through the use cases we have, and we go back through the feedback loop: once we've identified noisy users, we can start focusing more on the people making that noise, the people who are popular around coronavirus, and generate more analytics behind them. I think that's pretty interesting. At this point I've been talking for about 30 minutes, and I actually have the option here to do a live demo if people have questions, so I'll open this back up to Silvio and Kylie and the Slack channel. If you have a question of the data, I'll be glad to see if I can answer it for you. Over to the Slack.

Are there any questions for Ben that he can do a live demo on? I can see a few people talking; there was discussion about Splunk versus Elastic earlier, so I think we have a few questions.

The stream is probably slightly delayed, so it will take a moment for people to start asking questions for the live demo. While we're waiting, there were some other questions asked during the talk. Silvio, do you want to turn your video back on? Of course; this is classic streaming. Did you want to ask Eleanor's question first, Silvio?

I've actually got a couple of questions of my own. I'm not an OSINTer by any stretch of the imagination, but I do know that, just as there's Kali for pen testing, there are Linux distros for OSINTing. One: what do you think of those distros, are they useful or not? And two: are you thinking of getting this tool you've built into an OSINT distro? I love that you've done tech for the human stuff here: it isn't just about the analysis of humans talking, it's about designing the tools that let the analysts do the work as well.

My personal opinion on OSINT distros is that they're quite limited, unfortunately, because a lot of them are built around US OSINTing. Once we start looking at Australian data, it's actually a lot more difficult to find people's phone numbers and various things like that, so I find them a lot more useful for US searches than for Australian searches. That's partly why this particular tool is interesting: we can focus on people who are in Australia and on the data they generate. In terms of releasing it for an operating system, that would be difficult because this build is quite Splunk-focused, but the twint application is designed to work with Elastic and on any Linux operating system, so most of those OSINT operating systems would actually run it quite well. You'd just have to think about how you want to baseline your data and keep capturing from a certain period of time. I absolutely can release this on my GitHub profile; I just need to clean up the code a bit first, but I will provide it and send out the links.

I have seen those all-encompassing lists on GitHub, the full universe of OSINT links that you can go through. Are there any Australian-related ones that are actually useful for us?

I have searched for useful Australian material, but again a lot of it is very US-focused. The Australian things we need to capture are mostly government websites, or whatever Google searching we can do at the moment. There are username searches and email searches that are quite handy, but a lot of the data used to find usernames and email addresses is US-focused again, because it's usually things like Gmail, Facebook, Twitter and so on.

Are there any websites that literally do an Australian business search, like ASIC or something like that?

I haven't found anything related to that yet, but it's a very good point. I have used ASIC and various tools like that, which are just government websites, to do OSINT work.

We're not that small a country; surely there should be something out there.

There were a few other questions, not just Silvio's. Louie asked: have you done any Levenshtein, string-edit-distance style analysis on the tweets?

Probably not, because I'm not sure I fully understand what he's asking. I think he means fuzzy matching of tweets. No, I haven't, but I imagine with some of this sentiment analysis and the more advanced tools we could focus on stuff like that. We're focusing on capturing keywords rather than on what the tweets contain in particular, but the tweets hold the most interesting information, and we can use that to go through the feedback loop again: once we find something interesting, we put it into the keyword search and start downloading that data.

And there was a question from Eleanor about the approach you're using now, the feedback loop of finding keywords and associated keywords. Is there a way to automate the keyword association, so you can automatically find new keywords and extensions?

Absolutely. But do you want to automate that kind of thing? It could become a nightmare; you'd have to have some pretty powerful servers to run those searches. Like I said with GNU Parallel, I used up all 32 GB of memory on my server within an instant trying to do several hundred searches at a time. So you could, and you'd have to time it very carefully, but you could automate things like that: you could find the top keywords per week and then start focusing on those. Splunk is very flexible in terms of application development like that.
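
As a sketch of what that automation could look like (my assumption, since it hadn't been built): count the hashtags across a week's cleaned events and feed the most common ones back in as the next round of search keywords.

```python
# Sketch: closing the feedback loop automatically via hashtag frequency.
from collections import Counter

def top_hashtags(events, n=5):
    """Return the n most-used hashtags in a batch of captured events."""
    tags = Counter(tag.lower() for e in events for tag in e["hashtags"])
    return [tag for tag, _ in tags.most_common(n)]

# "clean" is the filtered event list from earlier; feed the winners back
# into the capture loop as next week's keywords
new_keywords = top_hashtags(clean)
```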

Another question from Louie: what's the funniest tweet related to this topic that you've found?

That's a great question. Twint does have Unicode search compatibility, so we could find the Unicode for the laughing face and look for tweets containing it. Why don't I try that quickly? What's the laughing face called, "laughing face"? Let's look up the Unicode for it, copy that, and dump it into here. We won't even use a keyword; let's put it into the field extraction I have here and look at the last 24 hours, and see if we can actually generate anything out of that. Oh no, nothing's worked. Ah, the wildcards in between would probably have helped me out. There we are, and if we search for that: okay, great. Then we can ask the data to be a bit clearer and count by username and the translated field. Did I not spell something correctly? I think "translated", yes. I'll just copy-paste; it makes my life a bit easier. There we are.

So, is there anything funny in there? I have seen some funny things. "Very tropical", I like that first one there. But I can't remember the funniest tweet, and I should have saved the ID of it, because there is so much information in these tweets themselves, and the API provides even more data on top. If we look at this particular example with the laughing face, we can see the translated field and the tweet are both in English, but the same data also shows whether it was originally in English or another language. Then there are other things like the translate ID, the link, which I showed in the presentation, any photos related to the event, and the replies too. It really is quite rich data from Twitter.

Following on from that, Ian asked: were there any Silvio-related tweets in there?

I'll have to add that as a keyword and get back to you on that one. I did see a tweet earlier that said macaroni isn't salad; that was pretty funny, that's a good joke. Tell you what, why don't we ask the question of the data? Let's see if we can find Silvio at all... nope. You're lucky: we haven't talked about coronavirus at all. Damn it.

Ian also asked one other question.

You mentioned that you did it the hard way during your talk. Why do you think it's important not to do things the easy way?

Great question. If you do it the easy way, you're really focusing only on the data. I wanted to focus on the whole package, and I wanted to put this in Splunk because I prefer Splunk as a data streaming tool. It was hard because they had provided me the answer the whole time: I just had to use Elasticsearch, and I only needed to create cron jobs to streamline the data process. But if I had done that, I wouldn't have learned much about the twint module at all, or about the dependencies required to run this application. I think I learned a lot more trying to do this the hard way than just following the easy methods they provide and putting it straight into Elastic.

I've got a silly question, I don't know if it's valid, but I'm curious myself: are most OSINTers focused on the engineering and the tech, or just on the analysis? Are there many people who combine both the engineering and the other side as well?

It's a bit of both, I suppose. I don't see a lot of people going to the extent I do to capture OSINT data; a lot of it is focused on where a picture was taken, where a video was taken, those sorts of things. But when you get into the more detailed and technical work, people are building tools exactly like this, like the twint application, to download more data and look at it on a bigger scale. Because if you just use the application to look for a keyword at a particular time, you're not building a baseline of the data; you're looking at everything that's available at that point. If I used twint to look for you, Silvio, I could find every tweet you've ever tweeted.

I've got some embarrassing ones out there.

We could, if we've got time, fire up my...

Please don't. Oh, is that the time?

It could be interesting, Silvio. But it's data, right? We're not going to see any embarrassing photos or anything like that; we would just see the data behind it.

It is very interesting. I love combining the engineering with the actual applied uses of the engineering. Engineering isn't just about building towers without reason; it's about applying things and having an effect on the real world.

Yeah, absolutely.

All right, any other questions? I think there are a few more on the Slack, so if you want to jump onto the Slack...

No problem, I'll start answering them in there.

Definitely a few questions about Elastic versus Splunk.

That's what I was trying to be careful about: I don't want to make this a talk about Splunk versus Elastic. Splunk is the tool I used, and we'll just leave it there.

Look, thank you very much for talking. It was a great talk; I love to see engineering combined with the actual application of engineering. Thank you very much.

Thanks to you guys as well. It's been a great opportunity to talk, and again, you can find me on Twitter, LinkedIn, Twitch or GitHub.

And he does some great streams as well, by the way, mountain biking. He's very good, and he's actually very talkative while he's cycling up these insane...