
Call the Plumber: Your Documents Are Leaking

BSides Charm · 2022 · 32:43 · 79 views · Published 2022-07
Category: Technical
Difficulty: Intro
Style: Talk
About this talk
Documents posted online for marketing or collaboration—PDFs, Word files, spreadsheets—often leak sensitive metadata and confidential information through search engines. This talk examines how threat actors exploit document metadata to map internal infrastructure, extract usernames and software versions, and discover accidentally exposed confidential files via Google dorking. Practical open-source defenses and detection methods help organizations protect documents through proper sharing controls and metadata stripping.
Original YouTube description
For most organizations, posting brochures, contract templates, whitepapers, and various forms of marketing collateral online is a standard practice. And for most threat actors, this can surreptitiously provide a wealth of information about the organization they are targeting. In this talk, we will examine why cyber criminals benefit from the public sharing of organizational documents, how they make use of the metadata contained in the documents, how misconfigurations and lack of user awareness can lead to data leaks, and propose practical / open source methodologies organizations can employ to protect themselves.
Nick Ascoli (@kcin418)
Nick Ascoli is the founder and CEO of Foretrace, an External Attack Surface Management (EASM) solution. Prior to starting Foretrace, Nick was a Cyber Research Scientist and Consultant with Security Risk Advisors and has published several open-source tools including pdblaster and TALR. Nick has been a speaker at Black Hat Arsenal, SANS, and BSides conferences on SIEM and UEBA topics.
Transcript [en]

good morning everyone thank you for thank you for coming to my talk how is that is this distance okay can you guys hear me all right awesome so as you can see by the title uh this talk is gonna be about documents uh very osint heavy talking a lot about data leakage document leakage and uh document metadata now if you were here for the opening remarks um that was my cell phone that went off really loud so as punishment if anyone wants to play a ringtone really loud at any point during my talk that is well deserved and welcome so a little bit about who i am i'm nick ascoli i'm the founder and ceo of a

company called foretrace we're an external attack surface management product uh formerly i was a senior cyber research scientist with a company called security risk advisors um i've made some open source tools all very osint-y and i've spoken at a bunch of conferences and if you like to talk or want to chat about this kind of stuff that's my contact information i'm pretty new to twitter so uh don't expect much over there now equally as important to this topic is who i am not first and foremost i am not a plumber um though i have been known to find some leaky buckets haha um i cannot be trusted with anything blue collar beyond uh putting together ikea

furniture um i'm also not a lawyer while this is a strange um disclaimer to have to give for a talk like this uh the legal the legalities of accessing someone else's documents from the public internet um is messy at best there's not a lot of litigation uh beyond a couple small examples i'll talk about but uh we're going to be talking a lot about google dorking finding documents on the internet um from you know the comfort of your own home and the legality around viewing those files uh you know in short i do not endorse viewing someone else's files without their permission however it's very easy to do um now this slide you might find

offensive uh due to its simplicity but when i'm talking about documents in this talk specifically these are the kinds i'm talking about the file types that spin the cogs of the world of business these are the files that we interact with these are typically the file types we're trying to protect um as enterprise defenders from getting out these are the files that store everything that a business uh is doing now in terms of publication i break publication down into two categories the least common one is intentional this is you know releasing marketing collateral releasing product documentation as a company releasing press releases anything where you're posting a pdf a word doc a spreadsheet a publicly accessible sharepoint url to

the internet um which is typically linked to the corporate domain um what i'm more interested in and find a lot more common is unintentional so while these can also be linked directly to the corporate domain via like a misconfigured iis server uh or collaboration software that's been shared via url these are documents which have been published to the internet and what i'm considering publication is it's available via a public url and it's been indexed by a search engine so google or bing has picked up the url to this file um whether it's downloadable whether it's accessible in browser and made it so that someone from the outside with the link can view it there's a lot of

reasons for this that i see regularly and this is you know anecdotal there's there's a lot of other reasons but uh collaboration softwares particularly sharepoint onedrive and g suite making files shareable via url this is the biggest culprit for very very internal files accidentally getting indexed by google because they are hosted when you visit the files from the public on live.officeapps.com or something those links are being indexed by search engines all the time so accidentally making a file shareable via url or intentionally doing it to share it with consultants is going to make it public-facing in many cases um you know internal document storage misconfiguration so if you're working in like a lab or a manufacturing

environment there's all kinds of you know very good uses for document metadata and document repository storage uh that you know is required for compliance with a bunch of stuff um and those servers uh often we find misconfigured uh accidentally exposed to the internet you know and then the obvious stuff misconfigured git repos leaky buckets you know all that good stuff um and before we get into the technical side um a little story time this is why i gave the legal disclaimer so there was a gentleman uh robert hutchinson a climate activist in london who was doing some research on a big evil company big developer who was going to tear down some playgrounds um in his

community this is a true story uh now robert was googling the company found their internal note-taking software online um took a screenshot of the notes he found within it which had some interesting info about them like i don't know bribing or you know trying to influence people in the community to vote for what they were doing um and he shared those notes via twitter the next morning you know his door gets kicked in the met police cyber crime unit uh has taken him into custody you'd think they'd do some due diligence to see if there was actually a crime here because they were the cyber crime task force but uh no they did not so

after a brief investigation like six hours um robert is released uh because the metropolitan police cyber crime unit determined that because those files which they were internal like these are internal corporate notes they are proprietary um they are confidential but because a search engine indexed them they were considered a part of the public domain after that so how does that apply to classified information how does that apply to you know all kinds of internal marked uh documents who the hell knows no one knows like this is the only this is one of the only examples of actual litigation we have uh which is why i call um intent to publicize whatever whatever intentional or unintentional it doesn't

really matter once it's out there it's a part of the public domain and enforcing uh you know some kind of legal action against someone who's got their hands on it is going to be pretty difficult now those logos at the bottom are all examples of companies who've had similar issues whether it's from third parties accidentally leaking buckets third parties with note-taking software that became publicly exposed the company themselves accidentally exposing um an entire sharepoint like directory which exposes all the files within which then become indexed by google and very easy to access there's really no shortage of examples in recent history of this being very problematic so how do we find corporate documents i'm sure most of you are familiar with

google dorks they're super super easy to do basically what a google dork is is using google's uh sort of you know pseudo-advanced search syntax to narrow down your search results to very specific things so in the examples i'm showing here we're looking for specific file types associated with a domain um in that first one you can use wildcards in these searches too so in the first one i'm saying um in in the url that i'm looking for in the search result i want the word admin to be there the file type that the url is pointing to i want it to be a spreadsheet and the site is gov dot anything which the first is gov.ng

there's a bunch of other interesting ones all those files by the way were absolutely not supposed to be exposed to the internet super fun search um and uh every example i have uh going down is other examples pointing at different corporate entities but specifying what i want to see in the text so i'm saying in in the second search site blank.com show me docx so show me document files with the word proprietary in the file because google has indexed the contents of the files as well it's not just indexing html pages it's indexing the text within the files that it finds also and the last search um of course it would be very crass to

call out uh you know the the wonderful venue that's hosted us but the name of that uh the name of that entity rhymes with samaritan schmotels if you do if you plug in their site with pdf and intext proprietary it comes up with all kinds of things that are tagged confidential proprietary not for public release all that kind of stuff and exploit db is the best these are really simple dorks uh which find super juicy stuff now if you want to find even juicier stuff exploit db has a google hacking database filled with dorks the most fun category um for me at least is files containing juicy info so this is how to search specifically for file

types that have been exposed by certain types of software that are sitting on the internet and again uh i do not condone viewing these files um without permission of the owner so this is something we're gonna come back to a few times but the adversarial value of a document we're going to talk about metadata and contents now talking about metadata specifically what i mean when i say metadata are the the properties attached to a file that are really only used in an advanced way by the sort of lab software i was talking about so like in a manufacturing environment you've got heavy fim on these files you've got people editing them that have super important pharmaceutical contents in

them it's very important to track that metadata that's a good reason for file metadata to exist beyond that releasing files with metadata attached especially to the public is not super necessary but when i'm talking about metadata i'm talking about the things in the file that show how this file was made who made it when it was made and we'll get into the specific properties in a second contents i don't even need to explain that's just what's in the file so talking about metadata extraction of metadata from a file um is easiest done using exiftool there's a couple different iterations of this in different languages but exiftool is by far the gold standard in metadata

extraction they've got a command line and a gui version for windows but the command line version is much more popular and you can extract metadata from roughly 200 file types so in this example i'm pulling the metadata out of a jpeg that's literally the command you install the utility exiftool point it to the file it will pull down everything um and there's so much metadata associated with the file i just cut it out in the second example i'm grepping just for just show me the key value pairs that have gps um in the field so this is showing me uh as an example for a jpeg file you know with most smartphones most digital cameras

are recording gps information about the photo this is certainly not news this is like in the wild this is probably what you'll see metadata used for most and exiftool used for most is reporting on you know osint tracking via geo data stored in photos um so we've got like latitudinal and longitudinal locations of the photo taken its location above sea level very very specific pieces of metadata um can be contained within a file but what we're talking about is uh you know office productivity file types so using exiftool you can do the same thing you can run it against any kind of file what we're looking at here and these are files that i pulled this week i

presented them in a little more of an easy to digest way but powerpoint a word document exported to a pdf and then a regular old pdf so there's a lot of interesting things that exiftool is pulling out of the file metadata here that that i'd like to point out uh first of all when pulling these files what i'm most often finding in the author fields is um either actual active directory credentials so internal credentials which is very valuable to an adversary a lot of them have first and last name combos not super you know that's not really that interesting um it gives me the name of someone on the inside but um like an ad credential or any sort of idp

credential is a lot more valuable because i can take that you know brute force the subdomains and start poking into things that don't accept emails but only accept actual usernames so in that document in the middle we've got an active directory username in the metadata of a word document we've got the timestamp that the document was created and the only reason that's interesting is because we've got a software version below of the tool that created it and it's an outdated version it's adobe acrobat pro from 2017 which we can see in uh at the tail end of 2021 is still running on this host so really the value the value in that and the value that we

piece together from a lot of documents is we can start to put together an inventory of software the more often we see a piece of software reoccurring in a company's public documents intentional otherwise the better chance we have at putting together an internal inventory of like what is probably on the gold image of most of these laptops which makes building a payload you know you've got a much higher degree of confidence when building a payload some some of these uh some documents you'll pull exif from will have down to the dll used to produce that document so you've got a dll that you can you know potentially hook or do something uh interesting to now some other fun uh

point outs here are in both of these we've got confidential markings and these are documents i pulled from the internet with very basic google dorks these are probably um casb tools or dlp tools tagging these documents as confidential as they're perusing internal resources as they're picked up you know by the sensors as they're flying around but they're still making it onto the internet despite a casb tool or a dlp tool tagging it and probably trying to prevent its publication online due to some misconfiguration um both of these documents uh contain confidential information and were specifically tagged as confidential but were sitting with a publicly available google indexed url um online and they've got they've all got interesting tags which you can

also use the tags to put together what security suites like what what security suite uses this kind of tagging infrastructure so you can put together again a good idea of software versions and types running on the inside in addition to uh user names and a bunch of other stuff a file i looked at yesterday had like someone an employee's name and their manager in the pdf's metadata i don't know why these uh i don't know why products do it that way but a couple open source scripts to get started in looking at the metadata of your own organization or of whoever the hell you want is pymeta and powermeta the only difference between the two

pymeta is a little more maintained um and is written in python powermeta is in powershell so you know depending on your environment you can get up and running with one of these really quickly uh what both of them do is dork google and bing for files pull them all down um extract the metadata with exiftool and put all the files in a directory for you so i encourage you to check those out so to conclude on metadata alone uh what we've just looked at what is the adversarial value of metadata a few things one is an internal inventory of username uh usernames and conventions so sometimes host naming conventions sometimes uh os versions an inventory of internal

software so we can start to put together depending on how many documents a pretty confident gold image of what this enterprise might be running on the inside like it's super easy from the outside to poke at the vanity domain and see what software is running on the outside so like maybe we can tell they're using salesforce but now we can tell what's going on on the actual hosts running inside the company um and then obviously the identification of tagged confidential content which is really only something i've seen in the last couple years with the adoption of casb and the slow death of dlp traditional dlp is tagged confidential documents making their way onto the internet despite protection so
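Editor's sketch: the inventory idea summarized above can be illustrated with a few lines of Python. This is a minimal sketch, not the speaker's tooling. It assumes each document's metadata has already been parsed into a dictionary (for example from ExifTool's JSON output), and the field names and sample values below are made up for illustration.

```python
from collections import Counter

# Fields assumed to matter here; the talk mentions author, creator and producer values.
FIELDS = ("Author", "Creator", "Producer")


def build_inventory(records):
    """Count how often each value appears per metadata field across documents."""
    inventory = {field: Counter() for field in FIELDS}
    for record in records:
        for field in FIELDS:
            value = record.get(field)
            if value:
                inventory[field][str(value).strip()] += 1
    return inventory


# Made-up sample records standing in for parsed metadata output.
sample = [
    {"Author": "jdoe", "Producer": "Adobe Acrobat Pro 2017"},
    {"Author": "jdoe", "Creator": "Microsoft Word", "Producer": "Adobe Acrobat Pro 2017"},
    {"Author": "asmith", "Creator": "Microsoft Word"},
]

inventory = build_inventory(sample)
for field, counts in inventory.items():
    print(field, counts.most_common())
```

Values that recur across many documents are the ones most likely to reflect a standard build, which is why the counts matter more than any single hit.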

talking about document contents i'm going to show you some cool dork examples again stupid simple dorks that yield incredibly interesting results and i'll get a little more specific with each one on how you can narrow down what you're looking for but the most interesting takeaway like back in 2007 when most metadata document metadata research was done um a google dork was probably going to show you things linked directly to the parent domain of the company in this example i am dorking a dot gov domain looking for file type xlsx what i'm given is a blank state police spreadsheet i click on it and it takes me to sharepoint so the reason that's happening despite

the fact that you would think i would have to do site view.officeapps.live.com show me spreadsheets um i've blurred it out so it's kind of hard to see but that second blue box in the office apps url um so this is a document shared via onedrive or sharepoint is the tenant name for this particular file so google is smart enough to know this file is associated with this domain um because it's in the tenant location in the url so now exposed sharepoint files are accidentally um despite not being on your domain actually are associated with your domain in the public and dorkable so that's super bad news this example is you know it gave me some defense logistics agency report of

transfers from uh the defense logistics agency to some law enforcement offices that i've blurred out and it's pretty boring stuff air conditioners bench presses but there was some cool stuff in there like rucksacks and body armor and i'm sure the more i scrolled through results the more i would have found um more interesting logistical information uh but another example is getting a little more specific with terms and just how easy it is this was the first search i ran while making this presentation and it immediately came up uh with good results so this is uh this was an internal um this was not like something hosted on their vanity this is something hosted on the vanity domain kind of but through a

back-end application that they use to share presentations internally so running this dork against this particular company i'm just saying site companyname.com um file type pdf and in the text of the pdf i want to see the word confidential so the first link immediately business confidential i click on it super internal powerpoint that is absolutely tagged on every slide business confidential and you can see in all the other files i'm getting the same kind of internal markers so if i knew more about this company if they had a specific marking they used on everything to tag it as internal i could just plug that in and get even more filtered results but you know you can just read in the results

proprietary and confidential this document contains confidential information no disclosure duplication whatever so um were these intentionally disclosed i don't know it sure doesn't feel like it um and it especially didn't feel like it with this example so this was a different .gov domain i pointed to where i got a little more specific because i know a little bit more about how this particular organization runs so i said intext show me case numbers in spreadsheets and again i'm taken to a sharepoint site so this this sharepoint issue presents itself um once again and clicking into uh despite the fact that it says pdfs in the link this is a spreadsheet it's tagged as xls by google when i click

into it this particular spreadsheet had on every single you can see i'm pretty deep in there i'm on um column a0 on every single row it had a unique email first name last name combo and address for an individual associated with a particular case type um and there were nine just shy of ten thousand um of these rows about nine thousand so that's 30 000 unique um first name last name address and email combos now was this intentionally released it sure didn't feel like it but maybe and it's probably not a good look to have something like this i'm sitting out in the public my best guess usually the the case for things like this is they're sharing this

data with a sister agency they're sharing this data with a consultant and it's because they're not linked to their identity infrastructure in any way they share via url and even when you try to narrow that down it doesn't always work uh quite how you think it should so this sharepoint uh file and i didn't even look at the other ones who the hell knows what was in them um was incredibly juicy right out of the gate so in terms of the adversarial value of a public document the contents are obvious right potentially passive collection of internal data and compromising data exposed via collaboration software and in every example i showed you collaboration software or internal misconfigurations
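Editor's sketch: the dorks used in these examples all combine the same few operators (site, filetype, inurl, intext). The helper below is a minimal illustration of how such a query string could be assembled. The function name is hypothetical and example.com is a placeholder for your own domain; it only builds the text you would paste into a search engine.

```python
def build_dork(site=None, filetype=None, intext=None, inurl=None):
    """Assemble a search query string from the operators used in the talk."""
    parts = []
    if site:
        parts.append(f"site:{site}")
    if filetype:
        parts.append(f"filetype:{filetype}")
    if inurl:
        parts.append(f"inurl:{inurl}")
    if intext:
        # Quote multi-word phrases so they are matched as a phrase.
        term = f'"{intext}"' if " " in intext else intext
        parts.append(f"intext:{term}")
    return " ".join(parts)


pdf_query = build_dork(site="example.com", filetype="pdf", intext="confidential")
admin_query = build_dork(site="example.com", filetype="xlsx", inurl="admin")
print(pdf_query)
print(admin_query)
```

Swapping the intext term for an organization's own confidentiality marking is the narrowing step the speaker describes.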

now the reason i say potentially passive is because when you used to run a google dork and download a bunch of documents from a company's website you'd be cutting logs on a web server now these files are hosted somewhere else with the rapid adoption of saas you're cutting logs somewhere that they you know are probably not going to this enterprise's siem so while it's not you know it's not actually uh passive in that you are cutting logs somewhere you're probably not cutting logs somewhere that the organization has access to in the case of sharepoint you know maybe and maybe the logs can be requested but um what you're doing you know dorking already on a corporate

domain it looks like normal activity it's normal web server activity it's not going to be hitting any like advanced ueba alerts um or triggering much on the waf if you're just downloading files right it looks pretty normal um and now you're even one step removed from the complexity there uh and the last uh content um value i have there's wireless ransom now this is not common it's something that i think could be more common but without in all the documents i showed you these are potentially things that would have been interesting finds during an internal red team um depending on the scope of what i'm looking for uh i am finding you know we're finding

files through these dorks that previously before the adoption of these cloud services and the exposure via public urls would have taken um phishing initial access uh privilege escalation phishing a location of the sharepoint server pulling down the files seeing what's interesting i have now done through like two clicks so uh what a lot of ransomware actors do these days you see in their campaigns is they'll not only say you know we've got everything pay us and we'll decrypt all the hosts but also uh if you don't do it by x time you know we've exfiltrated a lot too we'll start releasing it now without having to compromise an organization in a way that again cuts

any logs you can say i found this uh super confidential spreadsheet about your upcoming merger pay me or i release it so without actual compromise um you can take content for ransom via its exposure which is something you could always do but it's become significantly easier with the rapid adoption of cloud services hosting files just this is just another example of that value in action so the bottom example is the contents i've cut no logs and i'm looking at a spreadsheet of 30 000 unique contacts uh unique emails and they were not emails with for that entity there were people's emails um 30 000 unique records the content you know the value of that is obvious now

with the metadata i've got in this pdf above that i pulled from a similar entity i've got pscript5.dll so i've got a possible dll to hijack i've got the specific name of a dll here i've got software versions known to be running on the host i've got an uh some kind of idp associated uh username both in the creator field and in the author field um this is a pdf but i see that it was created with microsoft word from the title of the file so i'm getting a pretty good idea again and i've also got an os version in the producer field i've got os x you know 10.4.6 quartz and then

the utility used to create the pdf from word which was pdfcontext so again i'm putting together the more documents i have the more confidently i can put together this inventory of internal um you know runnings and usernames for uh stuffing portals that don't accept emails you can easily pull emails with like a hunter.io but where are you gonna pull internal creds from and it's these files in many cases uh so how do you prevent and detect this so prevention is uh stupidly easy in the most common um applications but it's not it's not as simple to implement for the business right like the folks in infrastructure and the folks on the business side might be pretty resistant

to eliminating this because this is a really common way to share files in a totally safe way with other people it's just the um the capability itself can be misused you know in a way that's uh super dramatic and super unfortunate um on sharepoint and onedrive it is as easy as uh sliding a little slider down to least permissive and on g suite excuse me it's as easy as clicking uh the off button on uh link sharing in the these are both in like the admin tenants that's where you do it now you can also get a little more specific maybe you don't want to disable link sharing you can disable link sharing to anyone who's

not in our identity infrastructure so you can get a little more granular and still allow link sharing but in a in a safer way but documents in many many cases when shared via url from g suite sharepoint or onedrive once that url exists um it will be indexed especially if you're like in chrome i'm trying to figure out all the scenarios um i'll put together a blog on all the specific scenarios that make the url indexed by google but there's a hell of a lot of them so when you make these uh when you make a file shareable via url it will be indexed by a search engine in most cases and now that is searchable by the right

person with the right dork now the prevention of metadata if you've already got a ton of files out there and you put stuff out that's super useful and a part of normal business practice for you um you can strip the properties using like uh you know the right click in windows go to properties clear metadata that does not clear most of the interesting stuff um so this is the only method that i advocate for which is using the exiftool utility with -all= and the file name uh with that argument uh we see on the left i have my file as i originally pulled it we've got that dll we've got all that you know the user

names in it once i run exiftool with -all= it pulls all of that interesting stuff out now i've heard of ways to uh you know still retrieve that information but most people aren't doing that so this is a safe um way that i endorse to strip metadata out of a file before sharing it if you do have to share files publicly which now you know you should just put the information on a web page but if you still need to share a file you can do it safely in a super simple detection pipeline uh that anyone can do you can do this on a box you can throw this into a lambda um if you run a tool

like pymeta uh to download all your files you know on a cron job and then have a super simple script in bash or python or whatever to uh look at the metadata in all the files your org has out there you know are we accidentally exposing the author field the creator field are we are we exposing internal software packages um and then also grep the files contents for keywords that you're interested in like our specific tagging for confidential or our merger coming up um and especially if you've got something like your merger coming up uh run this search with those same keywords against like this is public domain stuff run it against everyone involved in the merger

the two consultancies the law firm the mergee and see if they're not accidentally exposing files you know about the merger coming up these are things in the public domain you are uh legally allowed to be searching that especially if it's if it's a merger you're involved in but it's super easy to run this against yourself and run it against whoever that whoever else you want who might be uh might have this information and might be accidentally leaking it and then you know through an api send that alert wherever you know wherever it suits you now why does this matter a lot of us know about metadata already know about google dorks the reason is the landscape for dorks

has changed quite a bit with the mass adoption of collaboration software and the you know switch to saas hosted infrastructure not a lot of file sharing things are on-prem anymore um which is you know which is the way of the future it's something i advocate for of course uh the amount of exposed files has increased uh dramatically over the last like two to four years um there's several reasons for that that i link here most of the the vendors who are doing your like uh the scraping that you would hope would be uh pulling these are not um so the the changes in the public landscape this is that's the reason why these things are still sitting out there

in such huge numbers um and this is uh this is the final note i'll end on but this is 2007 beloved child's character bob the builder um seeing a document so this is a document profile these are all files i pulled from a domain um a couple days ago up at the top are files linked directly to the corporate domain right these are files that were intentionally released by the company totally manageable job uh for bob the builder we can handle this this is like six files um now everything else you see uh below that are this is 2022 bob the builder it's no longer 2007 we're using collaboration software um those are all files hosted by third-party services

now exposed and associated with this same corporate domain um so obviously now we are swimming in a river of uh bob the builder's tears um and that that is the uh that is the ostensible end of this presentation but i hate to end on such a somber note uh and i'd like to enforce that this is a solvable problem right like link sharing uh finding these files it is solvable it is a little bit chaotic right now and a bit endemic in the uh sheer number of uh sheer volume of exposed files but it is most definitely a solvable problem uh so that's it that's my presentation i appreciate you guys coming and i think we have a little

bit of time for questions i've been doing this
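Editor's sketch: the prevention and detection steps from the end of the talk can be sketched roughly as follows. This is an illustration under stated assumptions, not the speaker's pipeline. It assumes the files have already been downloaded and their metadata and text extracted by some other step; the function names, field list and sample values are hypothetical. The only ExifTool option used is `-all=`, the one the speaker describes.

```python
# Metadata fields the talk calls out as leaking usernames and software details.
SENSITIVE_FIELDS = ("Author", "Creator", "Producer")


def strip_command(path):
    """Return the ExifTool argument list that asks it to delete all metadata."""
    # "-all=" is the option the speaker describes as "minus all equals".
    return ["exiftool", "-all=", path]


def find_issues(metadata, text, keywords):
    """List sensitive metadata fields that are populated and keywords found in the contents."""
    issues = []
    for field in SENSITIVE_FIELDS:
        if metadata.get(field):
            issues.append(f"metadata:{field}={metadata[field]}")
    lowered = text.lower()
    for keyword in keywords:
        if keyword.lower() in lowered:
            issues.append(f"keyword:{keyword}")
    return issues


# Hypothetical inputs standing in for one downloaded document.
metadata = {"Author": "jdoe", "Producer": "Adobe Acrobat Pro 2017"}
text = "Business Confidential: draft notes on the upcoming merger"
issues = find_issues(metadata, text, ["confidential", "merger", "proprietary"])
print(issues)
print(strip_command("brochure.pdf"))
```

The argument list could be handed to `subprocess.run` on a machine with ExifTool installed. By default ExifTool keeps a copy of the untouched file with `_original` appended to the name, so that copy should not be published alongside the stripped one.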

mm-hmm uh

you open that up um

well that's unfortunate but a great example of uh the actual consequences of metadata leakage and yeah file history is stored in a lot of file types the user names of the people involved the time stamps all that stuff so very cool and sorry that happened any other questions in the back uh sure yeah i can share the slides somehow you got it yes do you think organizations are even aware of this um in my in my like notifications uh you know almost never i think only sort of now are companies starting to kind of be like hey what's on the outside like what can but and kind of rightfully so like the the priorities have been you know

let's roll out edr let's make sure all of our uh internal detections are looking good only now are people sort of having the time and the space to be like okay well wait if we're looking specifically just from what the average adversary is looking from like credentialed from the outs uncredentialed from the outside what can they see so in most cases uh not aware at all like if i'm showing them something that i found here it's the first time they've seen it um but i do think that's changing um i'm hopeful that that's changing at least there was another question yes

yeah that's a great uh that's a great idea i haven't i'm usually looking for corporate owned files but in the juicy i mean exploit db has tons of searches for file types or contents within file types so you could totally look for with cool dorks look for malware samples hosted on domains like look for uh open directories that are hosting exes or something facing the internet so that's a good idea and i have not done it but totally something to do all right i think i have to make room for the next speaker so thank you all so much much obliged [Applause]
