
So, my name is Dylan, and I'm going to be talking about a tool that I published about a year and a half ago and have been slowly iterating on ever since, called truffleHog. The objective of truffleHog is to identify secrets in source code. Before we jump into how that works, I want to cover the basics of why secrets in source code are bad. First and foremost, they can lead to breaches, and I think we're all familiar with a couple of notable examples, but I'll go over a few. They can also greatly aid in lateral movement: you can imagine one particular computer in the environment being compromised that has access to source code; if a bad guy gains access to that source code, they can gain credentials and then move to the hosts your source code has access to. Along the same line, it can help elevate privilege as well — they can jump from one environment to another. One of the things that's tricky here is that exploitation is sometimes hard to detect, and the reason is that the attacker is often using the secrets in exactly the way they were intended to be used. That's not to say you can't write detections, but it's a lot more difficult than, say, an exploit you can fingerprint. Another example is workstations: they get lost, and they may have
secrets on them. And in general, source code is just very leaky: you can imagine scenarios where the .git directory is accidentally exposed, or where developers copy/paste things over to Pastebin, so source code can often end up leaking out in unexpected ways. I have a couple of tragic examples here. One is where one of the BSides organisers had an AWS token in his personal GitHub account; a researcher found it and reported it through HackerOne, and Reddit was kind enough to make the report public so that everybody can see the vulnerability, and paid $2K for it. That's one of the less impactful vulnerabilities, but ramping up the impact a little bit: there was a researcher who crawled GitHub looking for Slack access tokens, and he found thousands of them, some of which belonged to high-profile companies. You can imagine a bad guy taking advantage of this and squatting inside of your Slack. Then there's another example where a different bad guy went to GitHub in 2015 and crawled for AWS keys, and one individual incurred a bill of over two thousand dollars because the bad guy used his key for Bitcoin mining. And last, the example that is probably fresh in everybody's mind here: the recent Uber breach, where 57 million user accounts were exposed to a researcher because, again, a credential
was put in GitHub, and the researcher was able to auth with the credential and gain access to the 57 million user accounts. So: secrets in source code are bad. Now that we're all on the same page on that — this is not a talk telling you how to manage your secrets or which secrets-management solution to use. There are tons of them out there; a couple of examples are up on the board. You should do a lot of research and figure out the one that makes the most sense for your environment. truffleHog is a tool that is intended to get the secrets from your source code into your secrets solution. So there's a border collie up there — that's truffleHog — and the secrets are being herded into the secrets solution. So where does source code live? It sounds like an intuitive question; you may immediately say it lives in git, but the reality is source code actually lives in a bunch of places, and the more you think about it, the more places you end up finding. Version control is at the top of the list, but source code also lives in package managers, and source code lives in mobile applications: when you download a mobile app, you download the source code that runs in it, and that source code can have secrets in it — there's actually a talk on that earlier today. Slack is another common place
that source code gets pasted into. Websites: when you render a website, you're rendering a bunch of source code, and secrets can show up in that HTML. But I'm going to spend most of my time talking about this last bullet here, and that's revision history — not necessarily the current version of your source code, but a version that is still accessible through your version control. I have an example here of a sample repository I'm sure most people have heard of: React, from Facebook. You can see the green on the top there is code that was added over time, and the red on the bottom is code that was taken away. This repository isn't special — it's pretty normal in this regard — but the point I'm trying to make is that as much, if not more, of the source code was actually removed from the project but still lives in the old version control. There's as much or more source code living in the past as there is in the current revision, and that's a problem when you imagine all of that past code potentially still holding keys and secrets that are live. So why is it that these things sometimes end up in the old versions of the code? Well, simply put — I don't know if anybody in this room falls into this category, but at least at
one point in my career, when I was a developer, I may have pushed a secret and then thought the best way to handle that mistake was to commit over the top of it. Here's a list of commits — you can search GitHub for the string "removed password" and find thousands of examples of developers who committed passwords and then committed over the top of them rather than removing the commit. Another example: maybe an entire feature is removed. You can imagine, if you're working in AWS, maybe first you think S3 is a good solution for temporarily storing some data, and then later you change to SQS; you remove a big section of the code, you put in a new credential for the new use case, and the old credential may still sit in the old, buried version of the code that's still in your version control. And the last example: typically in a large company, when you submit a project to get open-sourced, there's a security review process it undergoes, and prior to that review there's a good chance your developers will want to clean the code up a little, because they know it's not in a perfect state and they don't want a lot of back-and-forth with the security team. So if they have secrets in their source code, there's a good chance they'll try to clean that up, and one of the ways to do that is to commit over the top of it. And I'd be willing to bet that in a lot of cases, when the pen tester or security engineer gets a chance to look through that source code, they're only looking at the latest revision — I personally would never go through all the old revision history manually if I were doing a code review for a security audit or an open-source review. So I have an example repository here — I cleared this with someone at Netflix, so it's okay for me to put this
up here. It's an example of an old AWS key that was committed to a public GitHub repository and then committed over the top of. You can see from the minus signs in the red that this was removed from the project in an early incarnation, but the secret still remains. Now, the secret is no longer live, if you're wondering, but it could just as easily have been — there were other cases where I identified secrets committed over in a similar fashion that were still live. So we need a way to scan these old commits. No one, like I mentioned, is going through the negative code contributions, and grep doesn't find these: the way git stores its blobs, grep doesn't identify them — they're stored in compressed binary content, so if you just grep for some anti-patterns you won't be able to find them in the files. This was really the thinking and the reasoning behind why I made truffleHog: the intent of truffleHog is specifically to go through the old revisions of the source code — all the branches, all the old commits — to help identify secrets that were potentially buried, though it also looks through the latest incarnation as well. It's open source, you can find it on my GitHub, and it's pinned specifically to git version control.
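[To make the grep point concrete: git stores loose objects under `.git/objects` zlib-compressed, so a secret's bytes never appear literally in the files grep would search. A minimal stdlib sketch, using AWS's own documented example secret key rather than a real one:]

```python
import zlib

# A loose git object is roughly: b"blob <size>\x00" + content, zlib-compressed.
secret = b"aws_secret_access_key = wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY"
blob = b"blob %d\x00%s" % (len(secret), secret)
stored = zlib.compress(blob)

print(secret in stored)                   # False: grep over .git/objects sees nothing
print(secret in zlib.decompress(stored))  # True: the secret is still fully recoverable
```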
If you're thinking of using this for SVN, you're going to have to convert the repository to git first. It digs all the way back in time, and it finds secrets. When I originally wrote this tool, I was faced with the problem of how to identify these secrets, and the path of least resistance for me at the time was to just look for high sources of entropy. What that means is: if I saw a bunch of characters together that were random, without any predictable order to them, I would flag that as a likely secret. Here you can see the same commit we saw before, and the AWS secret key is flagged as a problem. If you'll notice, the AWS access key ID also looks pretty random, but that wasn't flagged — and the reason is that this method isn't perfect. It has a lot of limitations, and one of them is that the character set is something I had to predefine: I said to look for random sources of entropy in the base64 character set, and AWS access key IDs don't fit that. So this method has some limitations. On the plus side, it's pretty good for pen testers: if you're pen testing an application, you can reasonably go through all the false positives. It's good for an open-source review for the same reason — you've got time to go through all the results and find interesting things.
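[The entropy check itself is simple. This is a back-of-the-envelope sketch of the idea, not truffleHog's exact code: Shannon entropy over a predefined character set — which is exactly why an all-uppercase AWS access key ID can slip past a base64-tuned detector.]

```python
import math

# A base64-style charset; a hex charset would be handled the same way.
B64_CHARS = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/="

def shannon_entropy(data, charset=B64_CHARS):
    """Bits of entropy per character of `data`, counting only chars in `charset`."""
    if not data:
        return 0.0
    entropy = 0.0
    for ch in set(charset):
        p = data.count(ch) / len(data)
        if p > 0:
            entropy -= p * math.log2(p)
    return entropy

# AWS's documented example secret key scores high; ordinary prose scores lower.
print(shannon_entropy("wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY"))  # ≈ 4.66
print(shannon_entropy("the quick brown fox"))                       # ≈ 3.47
```

[A scanner then just slides this over candidate strings and flags anything above a threshold.]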
It's pretty good for bug bounty as well, because you've got all the time in the world and a whole bunch of researchers — with enough time, they'll be able to triage all the false positives. But the big downside of this method is that it doesn't do well with the DevSecOps model, and it doesn't scale. In the same repository I showed before, there were a bunch of false positives, and this slide shows an example of one: the developer had committed a URL that contained a bunch of entropy, and truffleHog flagged on it. So to combat this problem — to fit the DevSecOps model better, to put this into the
DevOps pipeline and deliver results more directly to the developers — I decided to spike on developing a bunch of high-signal regular expressions. You can see there's a list here; it's grown over time, and it's still a lot smaller than I ultimately want it to be, but I've been very strict about only allowing regular expressions that flag for pretty high-confidence reasons. The pros are pretty much what I described before: it reduces the noise a lot. It's also customizable — I've got a flag that lets you give it an additional set of regular expressions for new rules, and you can remove the old rules and replace them with ones that make more sense for your environment. So if you're running in Azure, remove the AWS rule and replace it with an Azure rule. They can also be used to identify low-entropy secrets: if you're pretty confident about a certain string that will always match a password, you can include that regular expression, and where truffleHog's entropy detection would fall short, this will be able to identify it. So it scales a lot better. The cons are that this will miss the types of secrets you don't know about — when I run truffleHog with the entropy mode on, it always identifies secrets for weird services that I've never heard of.
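[As an illustration of what a high-signal rule set looks like, here's a tiny sketch in the same spirit. The rule names and patterns are my own stand-ins (the AWS and Slack patterns follow those services' documented token shapes), not truffleHog's shipped list:]

```python
import re

# Each rule maps a name to a pattern that should only fire on real tokens.
RULES = {
    "AWS Access Key ID": r"AKIA[0-9A-Z]{16}",
    "Slack Token": r"xox[baprs]-[0-9A-Za-z-]{10,48}",
    "RSA Private Key": r"-----BEGIN RSA PRIVATE KEY-----",
}
COMPILED = {name: re.compile(pat) for name, pat in RULES.items()}

def scan_text(text):
    """Return (rule_name, matched_string) for every rule that fires."""
    return [(name, m) for name, rx in COMPILED.items() for m in rx.findall(text)]
```

[Because the rules are plain name-to-pattern data, swapping in your own set — an Azure rule instead of AWS, or a regex for an internal password format — is just a different dict, which is what the extra-rules flag feeds.]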
And I haven't been able to go in and enumerate a regular expression for every single secret out there, so you miss some of that. Another downside is that it still requires a little bit of manual triage: even when it's accurately flagged a secret, you don't know whether it's live, and you don't know whether it's maybe a test private key — which, I've found, is a pretty common thing. So there's still some manual triage, which is less than ideal. This is the model for how I've integrated this into a DevOps pipeline in the past: you have some commit hook that fires a truffleHog scan; truffleHog runs on that commit with your customized regular expressions; the results then get sent to a triage system, where you go through and identify whether or not each key is live; and that triage step ultimately feeds a remediation step. The remediation step isn't always straightforward, because you not only have to remove the secret, you have to rotate it — and you have to do it without taking down a service in production, which can be a pain, because sometimes you don't know where these keys belong. So those are the different steps, and that second step there, truffleHog, is the one I've open-sourced so far.
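[The commit-hook step can be as simple as shelling out to the CLI. This is a hypothetical wrapper, not part of truffleHog itself; the flag names (`--regex`, `--entropy`, `--rules`, `--since_commit`, `--json`) are from truffleHog's CLI as I remember it, so verify them against `trufflehog --help` before wiring this into a pipeline:]

```python
import subprocess

def trufflehog_cmd(repo, since_commit=None, rules_file=None):
    """Build the truffleHog invocation a commit hook would fire."""
    cmd = ["trufflehog", "--regex", "--entropy=False", "--json"]
    if since_commit:
        # Scan only from the last known-clean commit onward.
        cmd += ["--since_commit", since_commit]
    if rules_file:
        # Your environment-specific rules, as a JSON name -> pattern map.
        cmd += ["--rules", rules_file]
    return cmd + [repo]

def run_scan(repo, **kw):
    # One JSON finding per line on stdout; the triage system consumes these.
    return subprocess.run(trufflehog_cmd(repo, **kw),
                          capture_output=True, text=True).stdout
```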
But if you remember back to the beginning, I talked about a bunch of different places that source code lives — it doesn't just live in git — so what about all those other places? There was a talk earlier today about finding secrets in Android apps, and secrets exist in every single place you can imagine source code existing, I promise. So recently I spiked on package managers. The two I've been looking at are npm and PyPI, and similar to git, they also have a revision history: you can push a package to npm and give it a version, then push a new version, and then another. So if you accidentally push a secret in one of the old versions, you can bury it under newer ones, and if you only looked at the latest version you'd never know. In addition to that, when you push these packages to npm or PyPI, they don't pull from git — they pull from the file system. So if you're tracking your working directory in git and you've got some sort of test script that you haven't staged or committed, there's a good chance it'll end up accidentally packaged into npm or PyPI even though it never made it into git.
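[Like git branches and commits, the registries expose that full version history. Here's a sketch of pulling every published version of a PyPI package through its JSON API — the `/pypi/<name>/json` endpoint and its `releases` map are the interface I've used, and the sort here is plain lexicographic for brevity (real code would use a version-aware comparison):]

```python
import json
from urllib.request import urlopen

def versions_from_metadata(metadata):
    """Every released version recorded in PyPI's JSON metadata --
    the old releases a scanner has to walk, not just the latest."""
    return sorted(metadata["releases"])

def pypi_versions(package):
    # PyPI's JSON API lists all releases of a package, oldest included.
    with urlopen(f"https://pypi.org/pypi/{package}/json") as resp:
        return versions_from_metadata(json.load(resp))
```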
So if you remember back to what I said about open-source software reviews: again, there's a good chance that your security engineer is only going to look at the version control in git. They're not going to pull down the package and read the source code that was published to the package manager, so that source code by and large goes unreviewed, even though there can be differences. I went ahead and scanned a whole bunch of packages to see if these problems were systemic, and what I found was: if a package on npm had the string "AWS" anywhere in its description, there was about a two percent chance of that package containing a live AWS token. What's worse, in most cases that AWS token didn't show up in git — it was in the package, in either a testing script or an environment-variable script designed to be sourced from, and that just got packaged up and sent to npm or PyPI without ever making it into git. And the last bullet point here: maybe it's experimental code, where the developer knew they were just temporarily inlining a secret or a password and would never actually commit it to git — their intention was to pull it out at some point and make it an environment variable — but when they packaged for npm or PyPI, they weren't paying attention to the fact that it was pulling in all the files in the file system, and that experimental code accidentally got shipped.
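[One way to catch that class of mistake before publishing is to diff what the tarball will contain against what git actually tracks. A hypothetical pre-publish check, not one of my released tools:]

```python
import subprocess

def untracked_files(tracked, packaged):
    """Files that will ship in the package but are not tracked in git --
    prime candidates for accidentally included secrets."""
    return sorted(set(packaged) - set(tracked))

def git_tracked(repo_path):
    # `git ls-files` lists everything git knows about; anything in the
    # tarball but absent here came straight from the working directory.
    out = subprocess.run(["git", "-C", repo_path, "ls-files"],
                         capture_output=True, text=True, check=True)
    return out.stdout.splitlines()
```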
So I made a new tool, and I'm calling it santaHog — which is a really dumb name, but I'm excited about it because I really want to see the sketch artist draw a santa hog. My original thinking: it's like a package that you open and you get secrets, so it's Santa, and "hog" is just for consistency. Other than that, it's the best I could come up with. Basically, santaHog currently scans npm and PyPI packages; it goes all the way back through all the old revisions of the package, and it runs the same exact engine that truffleHog does — the same regexes, and the same entropy detection that you can turn on and off. It's also open source and available on my GitHub; I made it public last night, so you can visit my GitHub and use it. This is the tool I used to enumerate all those AWS tokens, which I reported, but my suspicion is that if you look for different keywords you will find lots more live tokens in npm and PyPI. In addition to that, if you run internal mirrors of npm and PyPI, there's a good chance — maybe an even higher likelihood — that those will also have live secrets. So this is an example of the output; it's not quite as
pretty as truffleHog's currently, but here I'm scanning a package called tchannel, which is one of Uber's open-source packages, and you'll see it flagged on a bunch of stuff: an AWS key and some private keys. If we look a little closer, we see something interesting: these keys didn't show up in the code that Uber wrote. They showed up in a directory that Uber didn't mean to publish — the node_modules directory. If you're familiar with Node, node_modules is where all the dependencies for the package you're using live. So what they accidentally did here was publish their package with all the dependencies, the dependencies' dependencies, and the whole tree of sub-dependencies of the project they were working on. You see several layers of node_modules there, and we end up with AWS keys from packages that were never written by Uber, that Uber didn't know about in any capacity, but that were accidentally published because they shipped all these other packages with their code. This is interesting, and we'll come back to why it can cause some problems later. So this is the revised DevOps pipeline: basically, you have hooks for when a new package is deployed as well as git commit hooks, and they feed into the same
triager, and the remediation stage is the same. So, I could use some community help on this project — it's been just me and random contributors I don't know committing to it for the last year and a half — and there are a lot of features I could use. For starters, I need a lot more regular expressions: there are a lot of secrets I'm not looking for, and they need to be high-signal so they don't make it really noisy. It would be really nice if you could specify a range of commits. Currently you can specify a single commit — you tell truffleHog to scan from this commit onward — but the limitations are, number one, it will clone the entire repository, and it would be nice if it only cloned the relevant sections; and number two, you can't specify a range, only "from this commit onward". Another thing that would be really nice is multithreading. Currently it's single-threaded, so when I run it I usually just spin up a bunch of instances of truffleHog, but it would be nice if I could specify as a command-line argument how many threads I want. And there are a lot more features that I'm sure you all can come up with and offer to me — I'm glad to accept pull requests. I am a little bit slow, because I'm managing a bunch of different pull requests, but if the
feature is small, well-contained, and really easy to understand, I'm happy to merge it in. The last thing I did this past week: because santaHog was relying on truffleHog's regular expressions, I pulled the regular expressions out of the main truffleHog project and put them in their own repository, and I pulled them out of a Python file and made them a JSON file so they can be used by other projects as well. There's another project called gitleaks — basically feature parity with truffleHog, which someone wrote in Go — and when I looked at the source code, the author had copy-pasted all my regular expressions over. My hope is that by pulling these out into their own JSON file, you all can pull that file down and point at one reference, so we can have one community repository where we all update and keep these rules in one place. That was my thinking for pulling them out. There's one more thing I want to touch on here: that triage step on the graph. I mentioned that in a lot of cases it has to be manual, but in some cases it can be automated — specifically when you're auth'ing to public APIs. If you've got a live key, you can use that key against the public API and very easily
tell whether or not it's live, automatically, without a human having to do the verification. I have one simple example of that here: a simple Python function I wrote that takes an AWS key and returns true if the key is live and false if it isn't. You can imagine taking the output of truffleHog, sending it directly to these verifiers, and then, based on the result of the verifier, automatically knowing whether or not you need to take remediation steps. There are some drawbacks to doing this, though, and that brings me back to the tchannel example.
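[The slide shows the actual function; this is a sketch of what such a verifier might look like, using STS `GetCallerIdentity`, a call that works for any valid key pair. boto3 is an assumed dependency, and — per the caveats that follow — only point this at keys you're authorized to test:]

```python
import re

def plausible_aws_pair(access_key_id, secret_key):
    # Cheap format gate before any network call: "AKIA" plus 16
    # characters, and a 40-character secret.
    return (re.fullmatch(r"AKIA[0-9A-Z]{16}", access_key_id) is not None
            and len(secret_key) == 40)

def aws_key_is_live(access_key_id, secret_key):
    """True if the pair authenticates against AWS STS, False otherwise."""
    import boto3  # assumed dependency
    from botocore.exceptions import ClientError
    client = boto3.client("sts",
                          aws_access_key_id=access_key_id,
                          aws_secret_access_key=secret_key,
                          region_name="us-east-1")
    try:
        client.get_caller_identity()  # succeeds only with valid credentials
        return True
    except ClientError:
        return False
```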
Let me start by saying I'm not a lawyer, and this is not legal or financial advice, but you should be careful with the Computer Fraud and Abuse Act. If you are scanning everything, more likely than not you will end up testing keys that accidentally got packaged into source code and don't belong to you — because you're pulling in dependencies, you're pulling in code that was copy-pasted — and if you auth to a system with those keys, you're probably violating the CFAA: you're authenticating to a system you didn't get permission for, or in a way you didn't have permission to. You probably should not use these auto-verifiers for bug bounties either, for the same reason — actually, even more so — because a lot of the time bug bounties will not allow you to auth with credentials you've extracted. If you find a credential, usually a bug bounty says: stop testing, don't actually auth with it. So if you do write these auto-verifiers, be very careful with them and the way you use them, because you may accidentally wind up in some legal trouble. These are all the resources — truffleHog, santaHog, and the regexes are all available on my GitHub — so feel free to check them out and use them for your own personal projects. That's the end of my slide deck, but I do want to open it up for questions and
community engagement, in case anyone has any ideas for the future of the project. [Applause] Yep — so the question was: what does the triage process usually look like in your DevOps pipeline? For me, ideally, in a perfect world, you'd be able to give the results directly to the developers. Realistically, even with my high-signal regular expressions, sometimes there will still be false positives, and sometimes you'll have keys that aren't live anymore. For the cases where you can build out automatic triagers, you can use them to identify which keys are live, and one hundred percent of the time you can deliver that result directly to the developer, because you know there's a problem. For the cases where you can't build automatic triagers, it's a little different: if you try to give that to developers, you may end up with too much noise, and if that's the case you'll have to manually refine the truffleHog rules a bit. The other side of that is, if you're a small enough company, or have a large enough security team, you may be able to dedicate some of your DevOps or security resources to help with triage. Yep — yeah, so you do have to be really careful with the secrets themselves, for obvious reasons: you don't want to propagate them to more places. The way I've handled that in the
past is: if you're using GitHub, just construct a URL with the line number and the commit hash that links directly to the problem — but don't actually take the secret and replicate it, and don't store the secret in a database, because then you're just creating a centralized repository where all the secrets live. In a sense you still have that with the URLs, but at least you're not copying the password and putting it in more places.
Yeah — so there is a really good tool for cleaning up the history. The problem is you don't want to accidentally take down production services, so that one can be really tricky, and usually it's going to require a lot of human intervention to figure out what that key is and where it's being used. If you try to just automatically pull it out of source code, or auto-rotate the key, you may end up in trouble. But I think your question was more towards: after you've already identified the problem and rotated the key, how do you remove it from source code? There is a good tool for that — I don't remember the name, but it's a good point, I'll try to provide a link afterwards or something like that. I think Ben had a question. ...Oh, that's pretty good — I like "sin dog". Yep — yeah, so the question was: are there ever situations where getting the regex just right is really difficult, and you maybe need some kind of custom code to increase confidence? I have thought about that a lot. I know the tool that Lyft open-sourced has a confidence rating — if it meets this, add one point; if it meets that, add another point; and if it exceeds a threshold, report on it. But I really wanted to design truffleHog in such a way
that you can confidently take on arbitrary rules from arbitrary sources without worrying too much about the threat of arbitrary code being executed in such a sensitive, secrets-handling capacity. For that reason, I've strayed away from allowing arbitrary code to help identify these secrets. There's a question back there?
Yes — I think that's the one I was thinking of. So BFG Repo-Cleaner is the one that makes it really easy to scrub a secret out of your version control, and it would be interesting to chain the two projects together and have them automatically work together in some way. ...Yep?
Totally — or apt-get, or RPM, yeah. It's a problem, I can guarantee it. I haven't written the code for it, but I promise you, if you can think of a package manager, it's got secrets in it. I accept pull requests. Yep?
Yeah — so this is going to be the last question, because I was given the time-out. Basically, yes, that is a problem: URLs often have hashes in them, and that can trip the detection. I've tried to craft the regular expressions so that either the secret has to be wrapped in quotes, which is pretty uncommon for URLs, or you expect the name of the service followed by a secret of the right length, to try to cut down on noise. But sometimes, even with all that, you still get that problem. I think it can be solved with slightly smarter regular expressions, but I've done my best to cut down on it as much as I can. Again, it's a good point though. All right, I think I'm out of time — thanks, everyone. [Applause]