
with no option to create your own. You have to answer these 15 security questions. What? And with the Adobe leak of the password hints, yeah, at PasswordsCon that year, when we had lunch, we were doing crossword puzzles with those. Ever since that episode: security questions shouldn't be used for online security. That's my opinion. Security questions could be different, but some of them ask you for favorite things, and favorite things change. Your favorite thing last year is different from your favorite thing now. Jim has a perfect blog of, you know,
a collection of stupid security questions, so that's one of them. So yeah, if you have any, do send them to Jim. And there are all these variations in the answers, if you went to Central High School, capital... I mean, okay, it's 10:00, so we're going to stop. Before I introduce the first two speakers, I also just want to tell you that the next talk, Making Password Meters Great Again from Alan Cordwell, is cancelled. It will be replaced by Michal Špaček, who will be talking about how websites store your password and how they disclose it. He's basically crowdsourcing a list of companies that will actually tell you how they store your password. So either
you can be on the good list or you can be on the [ __ ] list. That's going to be the 11th episode, and then I can do the official introduction: Crafting Tailored Wordlists with Wordsmith, by Sanjiv and Tom from Payment Software Company. Please go ahead. Thank you. Yeah, thanks guys for coming out. There are a lot of cool talks happening in this 10 a.m. time slot, so we're thankful you came to this one, and we're sure that some people had a late night last night, so thanks for making it. Some quick formalities: Tom's the guy with the beard. I'm the Canadian. Don't
hold it against me. Sorry. We're both pentesters with PSC. PSC specializes in PCI assessments, and we also do pentesting in non-PCI contexts. Our day-to-day involves going through large enterprise organizations and their various network segments, trying to find cardholder data. We're also looking for pentesters, so if you know any, or if you're interested in pentesting, come and see Tom and me after, or Joe over here in the front, and we'll be happy to talk to you. So, before we jump into a quick primer: what's Wordsmith? Well, it's basically a tool which can generate dictionaries.
The only thing that we're doing differently is that we're generating dictionaries based on US states, and specifically on geolocation data. Geolocation data basically boils down to cities, landmarks, zip codes, area codes, towns, and that sort of thing. We'll get into the exact data sets that we're collecting, and into some statistics, a bit later. We're taking these word lists and cracking against large hash sets, or just hash sets in general, to identify what sort of passwords people have and whether they're introducing geo-based words into their passwords. So yeah, we're going to go through a quick primer here of
the basic authentication process, as well as the difference between passwords and hashes, and dictionary attacks. We've timed it. It should take about 3 minutes; it's about eight slides in total. For those of you who already know about password attacks, dictionary attacks, hashes, and that sort of thing, there's going to be an image on the next slide, and if you can tweet us the hash type that's in this image, we have some swag that we're giving away. We've got Case Logic backpacks, I think phone speaker amplifiers, and a single selfie stick for someone who really wants a selfie stick, I guess.
So those are our Twitter handles. I think they're also in your brochures. Or go and check out Wordsmith. I just made the repo public, so you can find it there. And just as a quick show of hands, how many pentesters are in this room? Does anyone do pentesting? Joe, a couple of guys over there. Great. So, if you've ever done any sort of man-in-the-middle attack on your network, and I guess if you're a blue teamer who's ever done a man-in-the-middle attack on your network, this hash type should look pretty familiar to you. So yeah, here it is. You've got a couple of seconds here to take a look at
that and send us a tweet at our Twitter handles, and we're happy to give away some swag after. But for now, we'll head back to the primer, and Tom is going to walk you through it. Thanks, Sanjiv. So, let's talk about a simple authentication process. This is Bob, and the extent of my Microsoft Paint skills. Despite what Bob's Prohibition-style hat might suggest, Bob is a user in a Windows environment, and this is Bob logging into a Windows host with a username of Bob. This might be locally on a workstation in order to unlock it. This might be remotely via something like RDP. And
we've taken the liberty here to unmask the password field, so you can see that Bob's password is password123. On submit, when Bob presses Enter, that input of password123 is put into a one-way hash function. The output of that is a fixed-length character string, which you see represented below. That's what we call a hash. What's important to note here is that this is a one-way function. We can't reverse this process by putting a hash into the hashing function and retrieving the original cleartext password. So after we've hashed the password, we put together Bob's username and password hash and we send it to the
authentication server. This might be locally in a SAM database. If you're joined to a domain, this might be an Active Directory domain controller's NTDS. This backend database, and this is a little bit of an oversimplification, holds a listing of all the users and their password hashes. So we're not storing passwords in cleartext here. From there we basically just do a lookup of Bob's supplied credentials. We find the record for Bob and we match the supplied password hash against the stored one. If it's correct, we allow the login. If it's incorrect, we'll bump up the failed login count and
deny the login. So, we can't reverse this hashing process. How do we convert a hash back to its original string? The answer is there's no direct way, but what we do have are a very particular set of words, words that make password cracking a nightmare for hashes like these. In particular, we use a dictionary attack. So, what are dictionaries? They're simply large lists of words, usually grouped together by some type of theme. They might come from password breaches like LinkedIn or Yahoo or Adobe. There are also great word lists out on the internet that you can find for free, like RockYou or Top 10K. There's even some paid
ones like Uniqpass. Despite its price tag, Uniqpass is a password list that any pentester or auditor should have in their toolkit. So, in order to carry out a dictionary attack, we need a few prerequisites. First is a solid dictionary, or a good word list. The second thing we need to know is the hash type. In this case, with Bob, we're using NT hashes, or NTLM. If you're authenticating against a Unix or Linux type server, it's going to be some variation of MD5 or SHA-1, usually with a salt. The third thing that we need is a list of password hashes, and these are usually exfiltrated from compromised systems,
maybe a local Windows workstation or an Active Directory domain controller. These are what we're doing the lookups against. The steps for actually carrying out a dictionary attack boil down to a three-step process we call the guess-encrypt-compare cycle. Our guesses are words that we're plucking from the word list, and we iterate through them one by one. From there, we take the input word and put it into our encryption algorithm, in this case the NT hashing algorithm, and it outputs this fixed-length string. Then we take that hash and do a lookup against our list of obtained password hashes. If they match, then we
know we can map that back to the original word that we guessed, and we have our cleartext password. And Sanjiv, we move on to Wordsmith. Yeah. So, as we mentioned briefly at the beginning of the presentation, Wordsmith is just a word list generation tool for US states and, I guess, geolocation-based data. So what kind of geo data is in a word list? Well, we've got things like cities and towns. We also have landmarks, so in Nevada you're going to have Area 51, things like the Hoover Dam, that sort of thing. We've got streets and roads. We have zip codes, sport teams, colleges, common names, and area
codes. Now, why geolocation data? Well, it's really interesting. I guess it's kind of a marriage between curiosity, password analytics, and just general human behavior. I remember I was on an internal penetration test for a client in a very small state, and as part of my post-exploitation process I tend to go from system to system and scrape credentials out of memory, just using kiwi or mimikatz or something like that. That usually enables me to collect a large number of passwords, which then lets me move into another network segment or access applications which unlock greater depth into that environment. Now, collecting all these passwords, I realized a
common trend for several of these users, and that's that I couldn't crack these, because these were specific geolocation-based passwords that weren't going to exist in any password list that's currently out there. Things like sport team names or colleges or other things related to geolocation. So I thought to myself, well, it would be pretty neat if someone put together a word list generation tool. And then that kind of transformed into: well, we'll just put together a word list generation tool for geolocation data. And as we'll see in some statistics a bit later, we found out that we've limited some guess-encrypt-compare cycles and
been able to actually turn this into something quite useful. So I should probably mention where all this data is coming from. Well, Wikipedia and the US Census have a ton of this data, and it's readily available to the public. All we've done is pull it, scrape it, and put it into nice little phrases and words which appear in word lists. OpenStreetMap is another good source as well. We've also had to put together a collection of data sets for area codes, because that was a little harder to parse using our parsing engine and required a bit heavier parsing. So we actually have some custom data sets which we made
as well. So Tom is going to talk to you about how Wordsmith works, and then we're going to jump into a demo. So yeah, take it away. Cool. So the GitHub repo is live now. When you do your initial git pull, you'll see these files listed there. On the right, you see the actual Wordsmith Ruby file. Next to it there's also the sources YAML file. It's basically just a simple configuration file for all the internet sources where we're pulling down data. We broke it out like that to hopefully make it a little easier, in a modular design, to be open to extension and for easier
management of our internet sources. Next to that, you see a data.tar.gz, which is basically just a compressed data archive where we've already pre-scraped all of the data that we're using for Wordsmith. You'll also find a Gemfile, just to make installation a little bit simpler. And there's a README there, which you'll see in the repo, that'll walk you through some of the dependencies and installation. So, when you run Wordsmith for the first time, it's going to do a couple of checks for some of the files that it needs. If it doesn't see them, it'll unpack that data.tar.gz file and expand it into the current working
directory in a subdirectory called data. That data directory is mostly categorized by state, with the exception of some of the custom data that we've had to massage into place. At the top level you'll see the directory for the area codes; names, which we've pulled from the US Census: first names, last names, baby names; sports, which are mostly the big four sports in each state at this point; and then the states themselves. Below that you see an example of what kind of files you would find in the California directory. You see a cities.html there, colleges.html, landmarks, roads, zips. These are words very specific to that state. And if you notice the HTML
extension, these are actual HTML source files that we've pulled down from our internet sources. The reason we've done it like this is because we've added an update option within Wordsmith. It's a - flag. So sometime down the road, if you'd like to update your data manually, you specify the update flag and it'll actually go out to all the sources and update your local data repository for you. To parse this data we're using gems like Nokogiri and Spidr, and we do all these lookups offline, so locally, just for speed and performance. So a word list that's been generated by Wordsmith kind of looks like this. I'm using an example that Sanjiv
went through earlier, with roads from Nevada, in particular Fremont Street. The word as it comes out of Wordsmith looks like that. There's a capital F, there's a space in there, there's a period at the end. So we add in some very basic mangling for words. We can split on spaces and break out Fremont and Street into two separate words. We can remove special characters. We can remove spaces. There's also an option to convert all the words to lowercase, if that's your preference. You can also specify a minimum character length. So, let's say you've compromised a domain where you know the password policy has a
minimum character length of eight. Well, you can specify the minimum length flag here, and it will drop all words that are not at least eight characters in length. And now Sanjiv will take you through the demo. I've seen some really bad things happen with live demos in the past, so let's hope this all goes well. All right. So, is that text okay? How's that? Can everyone see that? Maybe go a bit bigger. Cool. Okay. So, as Tom mentioned, these are the initial git pull files. And if we just go ahead and run Wordsmith for the first time, you're going to see that all these files get unpacked. There's also a
warning message that you see here, where it says CeWL is not found in path. That's because I'm running Wordsmith on my OS X system and I don't have CeWL installed. But if you're running this on Kali or something, it's going to pick that up in your PATH variable and you'll be able to use CeWL. Now, CeWL is there because we've integrated support for things like domains and infiles. For those of you who are unfamiliar with CeWL: basically, if you specify a domain name, let's say client.com or facebook.com, it'll go out to that domain, look for unique words, and scrape them from that application. And the default CeWL
settings, I think, stay within the scope of that domain name, and it only goes to a certain recursive depth and follows any hyperlinks that take you to other pages in that scope. It'll pull the client's name. It'll pull other unique words and strip out some of the common words like "of", "the", any connecting words, things like this. You can also have an infile where you specify various client domain names. All this is just to better populate the word list. So we've integrated support for CeWL; however, you don't need to use it, because we also have some other options. Now, as we kind of mentioned, from a
top-down approach it all starts from a state. Typically when I generate a Wordsmith word list I'll use the all option, and all is going to give me cities, colleges, landmarks, phone numbers, roads, teams, zip codes, and also names. Names would be like common last names and baby names. You have no idea how many times I see a first name or a baby name as part of a password. So, I guess let's run through an example. We'll set the state to California and we'll look at some common sport teams there. As you can see, these are all of the sport teams in California. And
that's just doing a basic lookup on the HTML files that we've already pre-pulled. Now, these are great, but we can also mangle these and get every single permutation of these words. As you can see here, we've got Sacramento Kings, SacramentoKings, Sacramento, and somewhere up here there'll probably be Kings as well. There's only one instance, even though there are probably some other teams here which have Kings in their name, because we also do a sort and unique so you don't get duplicate words. What's also pretty neat is that we do things like zip codes, so we've got every single zip code in Nevada, or
landmarks. If we set the state to DC, we can look at landmarks, and we'll probably see things like the White House, or the Lafayette Building, things like this. And if we set the state to, maybe, Massachusetts, we can look at some of the colleges that exist there. We'll probably see Harvard or MIT at some point in here. So, as I mentioned before, these are just some of the singular options that you can set per state. We can also do multi-state. So, for example, California and Nevada, and grab me all the area codes for those two states. Now,
this is pretty verbose output. Um, so, uh, I guess the inverse of that would be quiet output. So, let's set the state for California. We're going to grab everything we possibly can. And, uh, we're not going to have it as verbose as this. And so, you can start seeing some, um, things here of how many landmarks there are, how many zip codes there are. But, this isn't really useful because we're not getting any words. So, we can go ahead and output this to a file like California.txt and it'll collect all this data and stick it into a word list for you. But, as we kind of mentioned before, these are just the actual words themselves.
These aren't the mangled versions. So, we can specify the M flag and change the output file to California mangled. Now you can see that there's a little churning time here, especially for the roads, because there are 250,000 roads that we now have to mangle, which outputs almost double the amount. Like Tom mentioned earlier, we've split on spaces, we've concatenated, we've stripped symbols, and things like this. Are there any other options that anyone wants to see from this help area, or do you want to see if a particular college that someone went to in a state shows up? Can you compare the number of roads in California versus nationally? I'm wondering how much overlap there is and
how much benefit there is to restricting it by state. Sure. So we also have an option built in here for all, so you can churn through every single state and create a mega word list for everything as well. So yeah, that's totally an option. As you can see, it'll just start spitting out... everything is just an array. Mhm. Well, yeah, but that's with the quiet option. If I remove the quiet, it's basically just going to keep on churning like that. Yeah. So, sorry, is there another question? Can I ask, how do you account for local or regional businesses? Do
you go ahead and specify the -u option and put in a URL? That's a great question. Yeah. So, to repeat the question, he asked, "How do you account for local businesses?" So, that's the - flag for the CeWL integration. Let's say you're testing a particular client. There's probably going to be some password variation in there that's the client's name plus 123. That CeWL integration will scrape that client's web application, pick out that client's name, and put it into this word list for you as well. I don't actually have internet connectivity here, because we're at BSides and I don't want to connect to the Wi-Fi. So I can't show you
the CeWL integration aspects of it. But yeah, that's basically our Wordsmith demo. What's really interesting are some statistics that we're going to show you guys. And I think you've heard me talk enough, so I'll let Tom kick that off. Cool.
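The mangling options described in the demo (split on spaces, remove special characters, remove spaces, optional lowercase, minimum length, then sort and unique) can be sketched roughly as follows. This is a minimal Python illustration and not Wordsmith's actual Ruby code; the function name `mangle`, its parameters, and the sample entry "Fremont St." are assumptions made for the example.

```python
import re


def mangle(words, lowercase=False, min_length=0):
    """Return sorted, de-duplicated variants of each base word (hypothetical helper)."""
    variants = set()
    for word in words:
        word = word.strip()
        if not word:
            continue
        # Remove special characters, e.g. the trailing period in "Fremont St."
        cleaned = re.sub(r"[^A-Za-z0-9 ]", "", word)
        candidates = {word, cleaned}
        candidates.add(cleaned.replace(" ", ""))  # spaces removed (concatenated)
        candidates.update(cleaned.split())        # split on spaces
        for candidate in candidates:
            if lowercase:
                candidate = candidate.lower()
            # Drop anything shorter than the minimum length
            if candidate and len(candidate) >= min_length:
                variants.add(candidate)
    # A set plus sorted() mirrors the "sort and unique" step
    return sorted(variants)


if __name__ == "__main__":
    for entry in mangle(["Fremont St.", "Sacramento Kings"], lowercase=True, min_length=8):
        print(entry)
```

With a minimum length of eight, short fragments such as "st" or "kings" are filtered out, which matches the password-policy use case Tom described.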
Go. So, we wanted to measure how effective geolocation-based word lists were. To do this, we did a couple of tests. Some of the prerequisites for these tests: we first built a hash cracking rig locally in our shop, and we got our hands on some real NT hashes. We grabbed hashes from actual real-world internal penetration tests for clients in Massachusetts, which was over 400 hashes; Wisconsin, about 2,000; and New York, which was about 500 hashes. As for the hash cracking rig itself, our weapon of choice for cracking NT hashes is hashcat. The hardware is fairly modest. It was just an NVIDIA GRID K520.
But even with that, we could get about three billion guess-encrypt-compare cycles per second, so we're churning through passwords fairly quickly. And just last week Sanjiv put a post on his blog, for those interested who want to build their own hash cracking rig, and he goes through the process of doing so in Amazon AWS for doing some advanced work. It's got about 57 ... kind of can be translated into 2,011 Active Directory user accounts, which can kind of further be translated into 2,011 employees, although that might not necessarily be true because there might be some accounts being shared. So we'll just say 2,011 Active Directory user accounts. Now, as Tom had previously
mentioned, the Top 10K word list is just a collection of the top 10,000 passwords. Things like password, love, god; anyone who's seen Hackers will get that reference. We're taking those 10,000 words and injecting each individual word into the d3adhob0 rule set, which is 58,000 rules. These rules can prepend symbols, append numbers, lowercase words, camel-case words. So for every single word, you're doing 58,000 different permutations of that word based on this rule set. The top 10,000 passwords took about 2 seconds to run against these rules and uncovered 237 of this organization's passwords, which means that 237 Active Directory user accounts had a word from the Top 10K
word list as a root word in their password string. RockYou uncovered another 1,094 passwords, which puts us at 66% cracking success in total across all these hashes. Our Wisconsin generated word list took 12 seconds to run and uncovered another 11%, which is 229 passwords. What's key here is that this Wisconsin word list consists solely of geo-based words. So at this point we'd cracked 66% of the passwords, and the additional 229 passwords are all things like cities, sport teams, landmarks, zip codes, and things to that effect. So, there's a question over
there. Are they discrete, or would there be duplicates between RockYou...? There would be duplicates. This is a cumulative cracking session. Basically, your hash-cracking pot would already be populated with 66% of these passwords at this point, so it's a cumulative addition. So yeah, we uncovered another 12, or sorry, 11% of passwords that were all geocentric. If I remember correctly, for Wisconsin some of the really common passwords were things like Green Bay Packers, and first names and baby names as well. Massachusetts had a smaller hash set, about 400 Active Directory user accounts. That makes Top 10K with the 58,000 rules run in about a second and
recover 52 passwords, which is about an eighth. That's really surprising, because it just goes to show that some organizations have really weak password complexity rules and enforcement. RockYou recovered a staggering 65% in 24 minutes with the d3adhob0 rule set, and our Wordsmith generated word list recovered another 56, which were found in about 12 seconds. Again, these are all geolocation-based words. For Massachusetts, I mean, people always use sport team names like Red Sox and things like this, but what's really interesting is you'll see a lot of city names as well, like
Boston, Boston Marathon, Cambridge, and Harvard and Fenway for landmarks as well. New York: 552 hashes, so 552 Active Directory user accounts. What's really surprising with New York is that zero were recovered with Top 10K, which we can attribute to several things. Either the Active Directory domain controller has a third-party plugin, so this New York organization has imported a list of known compromised or bad passwords into Active Directory through some sort of third-party module, which technically prevents users from creating bad passwords; or, as a non-technical control, they just have great security awareness programs, or long and complex password requirements,
or things like this. That being said, the RockYou word list, which takes about 26 minutes to run, uncovered about 220 passwords, and our New York Wordsmith generated word list recovered an additional 59. As you can imagine, some of the popular passwords would be landmarks like Empire, or, I think I have some examples here. Yeah, Empire, Broadway. There's also one user in particular who had the state abbreviation NY, then five numbers which are the zip code, and then a symbol as part of his password. So yeah, Tom's going to summarize that last segment there. Cool. So some of the conclusions of this... Oh, yeah. Go ahead. I was just wondering,
did you try the New York data, the New York hashes, against the Wisconsin list? No, we didn't. No. Because that would be very useful, to see whether you're actually getting value from paying attention to geography as opposed to just what's on these lists. Absolutely. So you need to do that crosswise. Yeah. To extrapolate that further, we could also just run all states in general against that New York hash set as well. But the real takeaway here was for that particular state and that guess-encrypt-compare cycle. For one state it took 22 seconds, but for all states it might take an hour. Who knows? We
haven't actually tried to do that yet. That's, yeah, a great question. So, some of the conclusions from this testing. We've got a little bit of confirmation bias here, in that we'll get into the psychology of how users choose passwords, but we know that users like to choose passwords that are near and dear to them. They base passwords on things that they know: the street they grew up on, the name of their child. That's what we're seeing reflected here in our results. And with that, there's a little bit of a time and CPU cycle trade-off, in that instead of using a blanket word list and looking for low-hanging fruit, we're spending a
little extra time up front to craft a more tailored list, and we're spending less time, or fewer CPU cycles, on crunching those less pertinent words. It's a small sample size here, but in these cases we cracked at least an additional 11% of passwords in a reasonable amount of time. I think this speaks a little bit to the relevance of the generated passwords. So, next steps for Wordsmith, where we see this going. We're always thinking about data. We have ideas for more. It's difficult to actually marry an idea for data with good sources for it. But we'd like to expand on, say, the sports.
We've seen users who love to have their favorite athlete or their favorite player as the base word of a password. Maybe team mascots or names of stadiums. You could include famous people like politicians, actresses, or actors. State symbols, things that are very relevant to a state, such as the motto or state song or state flower. And we've gotten recommendations from the community too. Larry Pesce recommended looking at regional food, cuisine, or agriculture. From the code design perspective, it is modular. We'd like it to be even more so, just to have it open to extension. We think that this framework could be
extended to include not only states but also provinces or territories or even other countries. We could even change the granularity of how we're looking for this. When Sanjiv and I were conceiving Wordsmith, we were thinking about what scale we wanted to start with. Do we go as macro as continent, or do we drill down to country, state, city, road, address, actual geo coordinates? So maybe in a future version, instead of inputting a state, a user inputs an address or a pair of coordinates and a distance, say 50 miles, and says, "Give me all the words that Wordsmith can generate in a 50-mile radius." So Sanjiv and I are both believers
in free and open source software. We believe that everyone should have access to all the source code at all times. We also believe that we're not the smartest people in this room. So if you have any ideas for data or for features, or if you have experience with querying APIs for geocentric data, we would love to talk to you. Please send us a pull request. Submit issues. Hit us up on Twitter. The repo is listed there. We'd love to share this with you. So, with that, this is our contact information. Also, if you replied to Sanjiv's hash challenge earlier
via Twitter, feel free to meet up with us. We'll either be in the back room or in the passwords track. Just bring some verification that it's actually you on the other side of the tweet. But with that, we thank you for coming out, and we open the floor to questions. Yeah, thanks guys. [Applause] Well, I'm pretty sure there are questions for this, and I have questions myself, tons of them actually. But as an example, I just wanted to tell you that in the UK there is one government organization that keeps on tweeting, again and again and again, that a safe and good password that is also easy to remember is made up of
three easy words. That's what they keep saying all the time: three easy words. Now, on December 4, 2015, hashcat put out a tweet saying "important announcement", and there was a hash value. Very shortly afterwards, Solar replied to that tweet, saying the hash, if you can crack it, says "hashcat open source". So that's the way hashcat announced that hashcat was going open source, and Solar Designer cracked "hashcat open source". That's a three-word passphrase with spaces in between the words. So Jeremi Gosney responded by asking Solar how, because Solar said that he cracked this by using a 10-line focused word list. So 10 words. Mhm. He put them into his word list and then he cracked a three-word passphrase,
and the words that he put in were: hashcat, is, open, source, will be, "willbe" without a space in between, sourced, GPL, under, license. And he cracked it. So that was like, dude, yes, that's pretty good. Now I'm going to kick off with the first of my questions. First of all, have you looked into the simple fact that there can be, as an example, physical geographical locations whose names consist of more than one word, so you have a space in there? So, the Sea of something, or whatever you want to call it. Does Wordsmith today actually take that into account, or
will you just say that anything with a space is two different words? So no, we take that into account. If it does have a space, we keep the original string, we split on that space, and we concatenate that string as well. So we get the common permutations of that particular phrase, and each word is also in the word list on its own. Now, we didn't want to do too many permutations, because we thought that the hashcat rules people would use afterwards would do that for them. We didn't want to make the word list too inflated, because hashcat rules would take care of that in the cracking process.
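The space handling described in that answer could be sketched roughly as follows. This is a minimal illustration in Python, not Wordsmith's actual code; the function name `expand_phrase` and the order of the output are assumptions made for the example.

```python
def expand_phrase(phrase):
    """Return word-list candidates for a place name that may contain spaces.

    Hypothetical helper, not taken from Wordsmith: it keeps the original
    string, adds each space-separated word, and adds the words joined
    together without spaces.
    """
    words = phrase.split()
    candidates = [phrase]
    if len(words) > 1:
        candidates.extend(words)           # each word on its own
        candidates.append("".join(words))  # concatenated form
    # Drop duplicates while preserving order (dicts keep insertion order).
    return list(dict.fromkeys(candidates))


if __name__ == "__main__":
    for name in ("North Carolina", "District of Columbia", "Nevada"):
        print(expand_phrase(name))
```

Case changes, digit suffixes, and character substitutions are deliberately left out of the sketch, since the speakers say they leave those to hashcat rules at cracking time.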
Okay. And to keep the keys unique: for instance, when you're inputting a state like North Carolina, which has a space in it, or District of Columbia, with two spaces, we substitute out the spaces in our keys. We basically URL-encode them, just to make sure we keep the keys unique. Yeah. Okay. So, in Europe, when doing Passwords there, I have had Sebastian Rav do two talks at different times about generating word lists based on different wikis, including Wikipedia. His talks are essentially about how he created the Wikipedia word list and the issues of identifying what's a password, what's a passphrase, and what's
just random gobbledygook in there. I highly recommend his talk from Cambridge last year, because he could also prove that Han Solo is mentioned in the Bible, which is kind of cool. It just depends on how you break up all those spaces and so on. But Han Solo is actually mentioned in the Bible. So, questions? I'll go first. First, thanks for a really good talk. Thanks. This is as much a comment as a question: I think you have a very powerful tool in having these three sets, and I hope you can do some more studies on it. In particular, there was a
discussion in one of the earlier talks about whether blacklist dictionaries in password generation are dangerous, because when you reject people's password they just add "1" or "123" to it. You have a place to actually do an empirical test there with the New York database: go back to that top 10K word list, apply some software that does various munging to it, and see how many users, having been rejected, in fact just did a simple transformation. So I encourage you to do more with it. Great, sure. If you want to send us your organization's hashes, that
would be great. Well, you actually asked that of the guy who invented Diceware, so he does have a word list to provide, so to speak. So, to ask a different kind of question: when evaluating an individual word within the context of a dictionary or a hash, is there any metric that can be generated for understanding how likely it is to appear in word lists globally? This is, for example, "password", "god", "iloveyou", "123", etc. We are commonly told that yes, this appears in many, many dictionaries, but we don't have a metric or score to tell us how likely it is to
appear and how dangerous it is within the context of a password. Is there any way we can generate that? Is there any Bayesian analysis or probabilistic analysis that has been done on corpora of words that could tell us, given a word, how likely it is to be cracked by one of these tools? I don't know. To my knowledge, the community hasn't done any sort of collective analysis on every single breached word list that's out there, as well as a collection of words, to identify the singularity or the commonality of a single phrase across all these lists. However, that
being said, in our penetration testing reports we use a tool called Pipal, P-I-P-A-L, and that shows, as a percentage, how many users are using a given word or root word, in the context of all the user accounts or hashes that we have recovered. So the only thing I can think of is marrying Pipal with all those common word lists out there. But that would take a lot of computing power, and it's almost a project in its own right. But that's a great idea. Sounds like a great talk. Yeah, you should do
that. Yeah. Questions? Yeah, I was wondering, on the hash set that you were working on, did you do any combination of the word lists? Did you combine the word lists that were generated by Wordsmith with the top 10,000 or with RockYou? I know that there's a function in hashcat for doing that. Yeah, it's separate from the mangling rules, right? Yeah. No, we didn't combine word lists. We were more interested in the chunking aspect: this is what the top 10K recovered, this is what RockYou recovered, this is what we recovered after that entire pot had been populated. That's a great... Yeah. So if we
can combine "password" and "White House" together, that'd be great. So that's maybe something else that we could have done. When you started out doing this, obviously there's massive amounts of data available for doing the geolocation part. But did you start making Wordsmith because the data was available and you saw you could easily use it, or did you have a theory, or did you eventually prove, that lots of people are actually using geolocations as part of their passwords, so we need this input? What was the reasoning
behind starting it? So part of it was this: as we compromise Active Directory domains, we dump password hashes and we start cracking. One of the things we would do after churning through RockYou and the top 10Ks is generate our own custom word list based on root words, for instance the company name. We'd do some of the common transformations on that, converting O's to zeros and a's to at signs, etc. Mhm. And we'd also add names of local things, like the street the company's on or the address where the building resides, and we noticed we started getting hits
after hits after hits using that pattern. So then Wordsmith came about as a way to automate and weaponize that process. Okay. And as a second part to that, we just thought it'd be kind of cool to do this, because no one else has really done it before. And going back to the inception of Wordsmith: as I was scraping credentials out of memory, I would see that people had been using these phrases in their passwords. So that was another catalyst for this tool. Okay. Question for you guys. You couldn't have asked us in an email later on? Yeah, I
could have asked you in an email later on. But I'm more interested in getting your feedback for everybody. This will come up in your review, don't worry about it. This is Joe. He's our boss. Yeah, for us to be out here. During the course of your testing and analysis, other than the one hashcat rule set, did you find any others that were more efficient, that extended what you were doing on top of the targeted list? And what would those rules be, for the rest of us? Sure. So we used three different rule sets, but we found that the two other rule sets existed within this hob064 rule set. So we
used rule onerule, which contains 5,000 rules, and we used hob064, which contains the top 64 rules from the d3adhob0 rule set. We do also have metrics on that, which I can post to Twitter or to the GitHub repo. But this hobo rule set was a collective 58,000 rules that encompassed all of those. I'd just like to say that it looks incredibly useful, and I'd like to encourage you to make it work in other countries, particularly the UK. We would love for you to help us do that. Yeah. Well, what have you people got against the letter U? Collodial. So,
what's some advice on how to tweak it for the UK or for another country? Yeah, that's a really good point. I was in the Netherlands and also in the UK recently doing a pentest, and I showed this to some of the clients I was working with and told them to check out the talk if they had time. They mentioned the same thing. Tom and I were speaking about that, and that's why we extrapolated and made it modular now. So we hope to build on that, and hopefully there could be a UK-based implementation in
Wordsmith too. I didn't know if this was covered, but: mask processing. So if you want to say, show me all passwords that have a capital, that are length 12, and all that stuff. Was that covered in this version of Wordsmith? So, there is a feature... pull it up, if you want. Yeah. There is a feature to specify minimum character lengths. There is a feature to convert to all lowercase. By default it's usually going to come out capitalized; that's how we pull it out of the HTML sources. But you can massage the data to a degree. So yeah, we set the state here to California.
Z, or zed, is for the zip codes and K is for the character length. So going back to password policies, if your organization has a minimum of seven, we could specify seven here. We all know that zip codes have five characters. So if we specify a minimum length of three, we're still going to get all the zip codes. Four, we're going to get all the zip codes. Five, we're going to get all the zip codes. But if we set it to six, we're not going to get any. And we're just drilling down on zip codes here. If we went for "all", which is every single option available in Wordsmith, and set it to six,
you'll see that we'll churn through everything: every single road, city, landmark, zip code, area code, whatever. Anything under that six-character specification will be removed from the word list. But if you... I'm sorry, did you want... Sorry. I was just curious, since you mentioned sports teams, and I've seen a lot of users definitely use sports teams: if you use CeWL on, say, the FIFA website, would that be able to scrape those team names, or would we have to manually enter them? Because soccer is huge, at least in the Bay Area. So, absolutely. Give me 10 minutes after the talk and I'll
answer that question for you, because we'll do it together. We'll type in FIFA.com and we'll see if it gets some team names. Yeah. One of the things I've been saying many times over at PasswordsCon is that if you are a student looking for an assignment, or if you are a security researcher and for some magical reason actually have spare time, please feel free to contact me and I will make your life a living hell for the next 10 years with work to do. I have lots of things that I would like to see researched, and one of the things I'm still looking for
is this, and that's also the reason why I asked why you initially started doing this. One of the things I've been interested in, and I'm asking around whether it could be done at all: I would like to see somebody make a tool that puts the base words of passwords we find in leaks into different categories. As an example, "house" or "car" is a physical object, while "dream" or "anger" is not something physical. So my question is, could we analyze password leaks or passwords from pentests and categorize them? Because I'm interested in
looking into what kinds of categories of words people are actually using when they create passwords. One of the things I have done is analyze passwords based on gender, facial hair, and hair color. I have statistics on that. Women with red hair have the best passwords, and guys that look like Unix gurus have the absolute worst passwords. I have evidence of that. But I'm very curious about the categories. So if somebody's interested in work to do, give me a call. And with that, we are going to take a 10-minute break before we move on to the next speaker. So again, Sanjiv and Tom,
please. Thank you. Thanks. Yes. [Applause]
So next talk is about
Okay, it's time to get started. Again, for those that haven't been in the room or looking at the schedule for the last five minutes: we have replaced the original talk by Adam Cordell with Michal Špaček from the Czech Republic. I'm absolutely sure this is going to be just as good a talk as the one Adam was planning. This is about disclosing password hashing policies. Michal has been with us at PasswordsCon several times before. And, yeah, take it away.
Hi everyone. My name is Michael, or you can just call me Michal in Czech, if anybody here speaks Czech. Anybody here speak Czech? Oh yeah, my girlfriend. Yeah, that's enough. Oh, she's here. Yeah. So my talk is about disclosing password hashing policies. By that I mean that companies who are storing user passwords probably should do something to them, something really nice, not something nasty, and they should tell their users what they do to the passwords. So I'm going to talk about who has already done that, how they do it, and stuff like this. The duct tape here is just, you
know, if you don't want to disclose anything, just put it over your mouth. That's why it's on the first slide. And yeah, let's go. So, please raise your hand if you have ever seen a message like this when trying to register: "Your password must be six to 20 characters." Please raise your hand. Oh my god, it's almost everyone. Cool. Now one more show of hands: who has wondered why? Please raise your hands. Who was wondering why they do this? Okay, cool. Me as well, of course. So you're not alone. There are other people wondering why companies are putting password policies like
that in place, that the password must be at most 20 characters. So sometimes they just tweet, asking the companies why they do that. Like this guy, who I know: he asked Tripet why they require his password to be exactly 9 to 64 characters long. He was wondering why they do that, and he tweeted at them. So people sometimes do that. Most of the time they don't get any answer; companies just don't reply at all. Sometimes the reason is that the companies don't know why they do that. What happened to me once when I asked was that the guy who actually set
up the rules had already left the company. So the company didn't know why they limit the length of the password. They tried to fetch the information for me, but they failed. So most of the time you just don't get any answer if you ask, which is kind of bad. Sometimes people are afraid that when the company is limiting the length of passwords, it is storing them in plain text in the database. If you limit passwords to, say, 64 characters, users are afraid that the company is limiting the length because they are storing the passwords
in a database column which is 64 characters wide, and they just cannot fit any more. So sometimes users are afraid that the company is storing passwords in plain text if it limits the length of the passwords. It might be true and it might not. The limit actually doesn't reveal anything, because sometimes the rules are there for whatever other reason, not the password storage. This is especially true when there is a password breach or database leak. In that very moment people wonder even more, and they ask how the company is storing their passwords, whether
they store them in a really bad way or in a secure way. I have a story from Shotbow. Shotbow is a company running Minecraft servers, and they got breached, I think a few months ago. They announced it on their forum and people started wondering. For example this guy, froze J, asked what hashing algorithm was used for storing the passwords, because the original announcement said only that the attackers got hold of "one-way encrypted passwords", nothing else. So he was wondering what hashing algorithm was used for storing the passwords. He
asked on a public forum, and he got really interesting answers, like this one from another member of the forum. And thanks to Bruce from passwordresearch.com for sending me this so I can use it in my talk. Yeah: "If they told you that, there would be no point in the encryption," and there is a headbang emoji. So let's not go into details here, because hashing and encryption are something different, and Kerckhoffs's principle, and yeah, let's just move on. Well, the official answer was this one. It wasn't much better. They said: don't worry, the passwords were hashed and salted and managed professionally.
No idea what that means. They didn't specify the algorithm; they just said it was managed, salted, and hashed professionally. I'll publish my slides, and there is a link if you want to verify it and check it and maybe comment on it. I don't recommend that, but yeah. So luckily there are companies who are actually not afraid to completely disclose their password hashing policies. Yeah, I'll wait on the next slide for you, Per. Yeah, luckily there are companies who are really not afraid to disclose complete details of what they are doing to user passwords, for example Facebook. This is a screenshot from Alec Muffett's talk from
Passwords14 in Norway. The talk was about Facebook's password hashing policies, authentication, and everything. So this is a slide from his talk, and this is what Facebook does to their passwords. They do a lot of things to the passwords, but they have reasons for that. Seriously, I recommend watching the talk, because he talks about that for, I don't know, 40 minutes or something. So they use several layers, but at the core of it there's scrypt here and some HMACs here. They have reasons for that, so I will not go
into details. Seriously, just watch the talk. But this is completely what they do to user passwords, and they are not afraid to disclose it and tell us. There are other companies who are not afraid to tell us what they do to their passwords, for example LastPass. Facebook did that in a talk, so it's not somewhere on their site, it's not on Facebook.com; it's just in a talk by a security guy from Facebook. LastPass publishes it on their site, and they say that LastPass utilizes the
PBKDF-something with SHA-something to turn your master password into the encryption key. They have more details there as well; this is just a short text copied from the site. So they are also not afraid to disclose how they store user passwords. The same thing goes for 1Password. They are also not afraid to tell us what they do to user passwords. They have released a 60-page PDF which completely describes the security design of 1Password for Teams and 1Password for Families, which is really nice. And 1Password is also doing a really nice
thing: they are sending Jeff to Las Vegas every single year. I don't know why, but thanks from me. Yeah. Okay. So there are also some other nice companies and smaller services which are doing the same thing: they disclose exactly how they store user passwords. One nice example is Scott Helme's Report URI, report-uri.io, which is a really nice service for aggregating Content Security Policy reports and HTTP Public Key Pinning reports. And he's got this important information which says: "Want to know our password hashing policies? Sure, check out our frequently asked questions." He's got this on the login page and on the signup
page as well, right next to the field where you enter your password, which is really nice. And these companies, you know, Facebook's got a lot of private data. LastPass probably as well. The same thing goes for this service: they've got a lot of reports from Content Security Policies. And they are not afraid to disclose how they store user passwords. There are more companies like this, and I have actually started collecting them. Some people collect empty beer cans and stuff like that. I collect sites.
This is my site, michalspacek.cz, and you can find there a link to a subdomain called pulse. I have several sites like that. I will show the site in a few seconds. It's supposed to be part of a bigger survey of the internet. I call it Pulse because I got heavily inspired by the work from 18F, which is a US government something. They scan US government websites and publish a score for how good the encryption is there and everything. So I want to do something similar, but I'm just a single,
you know, one-man show, so it takes more time. But let's move on. So this is why it's called Pulse, because they call theirs Pulse as well. I got really heavily inspired by that. The site looks like this. Right now I think I have only 20 companies, because it's not that easy to get official information on how companies store passwords. If you look at the site, here I have the company called Datadog and the site app.datadoghq.com, and they have disclosed that they use bcrypt for storing passwords. So there are more sites like this. Some of
them are Czech, because I'm from the Czech Republic. I'm asking the companies directly, and they know that they should tell me, because otherwise I'll just, you know, make really bad public PR for them. So here are the companies, the sites, and the algorithms they use. I also came up with a rating system for how good they are. Datadog is rated B. This company is rated F. We will learn why. Wherever I have more details about the company or about the password hashing policy they use, I also try to put them on the site. So here I have a Czech company
which I was working for in 2014. We gave a talk about what we do to our user passwords, and you can also find on my site that the company uses bcrypt with a cost of 10, that they also do some encryption on the hashes, and that they have disclosed it on Twitter and in a talk. Every time I put something onto my site, it must already be public information. I don't put in anything else, because sometimes you learn how companies are storing passwords just by doing, let's say, penetration tests. I don't put that there. It must
already be public information. Somebody must have already disclosed it somewhere: in a talk, on Twitter, on Facebook, in the docs, or somewhere. So every time, I include the disclosure and a link to it. I also make a snapshot of the disclosure, so that if they later think it was not a good thing to disclose, it's still on the internet and it will stay on the internet. So, a bit about the rating system I've come up with. The rating system works like this. If a company wants to score a really nice grade in
my rating system, it needs to use slow hashes. That means bcrypt, scrypt, PBKDF2, or Argon2. I call them slow hashes right now just for lack of a better name. And if you want to score a really perfect grade, like an A, you also have to disclose that in your docs. If you disclose it somewhere in a talk, in a blog post, on Facebook, or on Twitter, it's hidden. Nobody will look for it there, because blog posts just disappear in time, and that's also true for Twitter and Facebook
posts. So if the company wants to score a perfect grade, they need to write it in the docs, because if you are looking for the information, that's probably where you want to look. You will probably not go through the blog or Facebook or Twitter. That's why some companies, even if they use bcrypt, have grade B: they told us in a talk and not that officially. Then there are other hashes, like, you know, SHA-1, SHA-2,
SHA-3, and other hashes like MD5. If the company uses something like MD5, SHA-1, or SHA-2, and they at least salt it and stretch it, meaning they do several iterations of it, they score C. If they just salt the hash, they score D. And if they use plain MD5 or plain SHA-1 or something like that, they score E. If they encrypt the passwords instead of hashing them, they also score E. It could be worse. F is for fail, and that's plain text. There are some companies who are storing plain text as
well, unfortunately. So A and B are reasonably safe. C, D, E... C could be safe as well, but we are not sure. So these ones are not really nice, and this one is not nice at all. So sharing is caring, but some don't care, so they don't share. So is it okay to share or disclose the password hashing policy of the company? Well, I think it is okay, especially if the company uses bcrypt or scrypt or other hashing functions designed to store user passwords. If they use functions not designed to store user passwords, like MD5 or SHA-1 or something, it's better for them
to fix that, to use something better, and then they can disclose it. There's no point in not disclosing it if Facebook and LastPass and 1Password are disclosing what they do to passwords. There's no point in hiding it. Some companies are afraid that if they disclose what they do to user passwords they will get hacked, that they will become a target. Well, I have bad news for them: they already are a target. And I'm not talking about Target the company. So, this is Datadog. I have Datadog on my screenshot
somewhere here, I think. Yeah, they scored B because they use bcrypt, and they had been using bcrypt even before the data breach they suffered, I think a month ago. So they were using bcrypt and they got hacked as well. So how the company stores the passwords doesn't really affect whether it gets hacked: companies get hacked, and they will get hacked whatever they use. It's worse if they use plain text or MD5 or plain SHA-1 or something like that, but they will still get hacked. There is another company who got hacked
even though they were using bcrypt, and that's Ashley Madison. I think the motivation for hacking Ashley Madison was completely different from user passwords, but still: even if they disclose what they use, they get hacked, and even if they don't. I think Datadog didn't disclose before they got hacked, and they still got hacked. So there are some tricks for how users can actually investigate how a company is storing user passwords. One of the tricks is here: it exploits PHP's way of comparing two strings. It works like this. If you are able to sign up to a
site with a password like 240-something, blah blah blah, and then you are able to log into the site with this other password, QNK-something, then you can be pretty sure that the company is using plain MD5 to store user passwords, even without the company telling us. The trick works like this. The MD5 hash of 240-something starts with 0e and then there are some more digits. The same goes for the MD5 of QNK-blah-blah-blah: the hash is also 0e and something. In PHP, if you take two strings which start with 0e followed only by digits, PHP compares them as
zeros, because it thinks each one is zero times ten to some exponent, something like that. So PHP compares them as zeros. Yeah, exactly. So it's possible to detect the password hashing policy if you are able to sign up with this password and then log in with that one. There are more examples like this. It also works for SHA-1 and for plain text. I've got them on my GitHub and you can try them. If it doesn't work, it doesn't mean that the company is not using MD5. They can be doing something else; for example, they can be comparing the strings with
three equal signs, not just two. But if it works, then it's definitely MD5. And I have found one site storing passwords in plain text just by using this or a similar trick. So even users can do the detection themselves; they don't need the company to disclose it. Yeah. So, does that mean that one in 256 passwords is subject to this? Because one in 256 hashes would begin with 0e, if they're comparing this way. This is really bad. Yeah. But it depends on how they are comparing. If they are comparing with three
equal signs then this doesn't work, but I think that anyone who uses MD5 is probably also comparing with just two equal signs. Or they can be fetching the data from the database in a different way. But if they are using two equal signs here, then yeah, exactly. It's less likely than that, because all the other characters need to be digits. Yeah, exactly, yes, they need to be. Yeah, the rest of the hexadecimal string needs to be digits. Yeah, thanks.
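The trick and the probability question in that exchange can be sketched in Python. This is a simplified emulation of how PHP's loose `==` treats two numeric strings, not PHP itself; the helper names are made up for the example, and the two inputs are widely published "magic hash" strings, assumed here to be the ones shown on the slide.

```python
import hashlib
import re

# Simplified pattern for what PHP treats as a numeric string (assumption:
# no surrounding whitespace, which is enough for hex digests).
NUMERIC = re.compile(r"[+-]?(?P<mantissa>\d+(?:\.\d*)?|\.\d+)(?:[eE][+-]?\d+)?")


def numeric_value(s):
    """Return the float value of a numeric string, or None if not numeric."""
    m = NUMERIC.fullmatch(s)
    if m is None:
        return None
    if set(m.group("mantissa")) <= set("0."):
        return 0.0  # zero times ten to any power is still zero
    return float(s)


def loose_equals(a, b):
    """Hypothetical stand-in for PHP's `==` applied to two strings."""
    va, vb = numeric_value(a), numeric_value(b)
    if va is not None and vb is not None:
        return va == vb  # both look numeric: compared as numbers
    return a == b        # otherwise: ordinary string comparison


def md5_hex(password):
    return hashlib.md5(password.encode()).hexdigest()


if __name__ == "__main__":
    h1, h2 = md5_hex("240610708"), md5_hex("QNKCDZO")
    print(h1, h2)
    print("strict equal:", h1 == h2)
    print("loose equal: ", loose_equals(h1, h2))

    # Chance that a random 32-hex-digit hash is exactly "0e" + 30 decimal digits.
    p = (1 / 16) ** 2 * (10 / 16) ** 30
    print("about 1 in", round(1 / p))
```

The one-in-256 figure only covers the "0e" prefix. Requiring the remaining 30 characters to be decimal digits makes a match far rarer, which is the point made in the answer; a strict comparison (three equal signs in PHP) avoids the problem entirely.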
So yeah, sometimes you can just use this. The slides will be made available online. Anyone else? Can I go to the next slide now? Just asking. So, people are also able to tell what the hash is just by looking at it. If the database leaks, you just look at the hashes and you can be pretty sure that, you know, this is not bcrypt. Who thinks this is a bcrypt hash? Oh, great. So this is MD5. Just by looking at the hash you will know whether this is MD5 or SHA-1 or a bcrypt hash or something like that. So even if
the company gets hacked there's no point in not telling us what exactly they were using for storing the um the user passwords. There is a nice example um from Anthony Ferrara who has done this um he wrote an application hgsterspot.com and uh he tried to prove that you know security through obscurity doesn't work he has done that um yeah he gave the users uh two passwords one is password the other one is apple he gave them two salts per the password and he gave them resulting hash what he what the site was missing were the exact algorithms how the hash was calculated. So he just gave the users passwords and the salts and the resulting hash
and uh he gave the users I think it was 15 different algorithms like this and um the goal was to come up with the hash of a password foo and salt bar without actually knowing the algorithm. So you had to reverse the algorithm the the hashing algorithm you had to reverse that and calculate a new hash for for a new password just by looking at the hash and knowing uh passwords and salts. Well that's quite um interesting. Well the results were really uh nice. So there were people I think it was 15 algorithms in total and there were 14 people uh sorry uh there were three people who have found 14 of the algorithms just by
looking at the hashes just by looking at the hashes and trying to calculate the passwords and and everything. There was one guy Matias Globe who has actually hacked the app and made the app, made the server leak the algorithms somehow. So he found a misconfiguration but he was able to to generate all the 15 algorithms. So just by looking at the hashes people are able to reverse that even if the algorithm is more, uh, even if the algorithm is different than just plain MD5 or something. Uh Anthony was doing some really weird things to the to the passwords like you know making them reverse and and stuff like this. I I will just not disclose
any details. If you want to try that just go ahead. Uh so just by looking at the hashes even if they are not plain MD5 or plain SHA-1 just people are looking uh people are able to reverse that quite easily. Um if any site is using any open source software they are actually disclosing by by design they don't need to even tell us because you know open source software it's just open source. So uh so people can look at that and just um tell how exactly the site is storing user passwords. Uh also open source software makes it quite easy to fix actually security bugs in in password storage. Uh I have one example here. Uh this is um Presa. They
were using they were using plain MD5 to store user password. Uh actually they they were using salt as well but it was a static salt. It was a salt in a configuration. So the salt was for so the salt was the same for uh all the user passwords and somebody told them that hey you you guys should switch to bcrypt from MD5. So they did that uh they did it like this. Uh they were calling password_hash which is a PHP function to calculate the bcrypt hash and they prepended uh salt uh to the user password. The thing is that the salt was uh 56 characters long and you know bcrypt is trimming the passwords at 72 characters. So 72 minus
56 uh I think it's something around 16 maybe. So they were actually truncating user passwords at 16 characters without you know even telling the users. Uh this was a security issue. Um I was able to fix that just by looking at the code and you know making a pull request just fixing it in in in a few minutes. So that's what I like on on uh open source that you can actually fix simple things or not that simple quite important things like really really easily. Um they got more issues here like you know they called it bcrypt SHA-256 for whatever reason and they just called it encrypt not hash but I have fixed that
in in uh in next revisions so that's fixed as well. Um so yeah uh so I think that it's okay to disclose uh how the company is actually storing passwords because uh the company doesn't need to be afraid. There are other companies disclosing password storage, like Facebook and Twitter as well. And just look at my site. There are some nice companies. And it's okay to disclose how the company is storing the user passwords especially if the site uses um so-called password hashes, bcrypt, scrypt, PB-something, and Argon2. And if this if the site is not using any of these uh special password hashes, they should just fix that and then disclose and they should definitely let me know so that I
can put them on my site because I think that uh if they appear on my site with nice grade that uh the users will love it more because they can feel more you know confident that the company um knows what they are doing and stuff like this. Uh even if the company is using you know um hashes like MD5 and they switched to bcrypt um the users will quite love it because they were like oh yeah you were doing something wrong before but right now you are doing something nice and you are not afraid to tell us that you screwed up before so yeah I've been there and done that as well um so I
think it's okay to disclose uh especially if company is using slow password hashes and yeah that's it from me I think yeah There are some questions. Okay. Thank you. Questions. Arnold. Um, one quick question. In your A and B grade, uh, ratings, it wasn't so clear from the slide. Are you requiring salt for the A and B ratings? Uh, sorry. Are you are you requiring salt uh for the A and B ratings? Yeah, because all the all the algorithms like bcrypt, scrypt, and the other ones, they all they already require the salt. So, yeah. Yeah, I am. Yeah. Oh, Jeff. Okay. Just simple one. Um, are you familiar with Plain Text Offenders? Yeah, I am. Okay. I'm just wondering
whether you kind of I want to take it to a slightly different direction because Plain Text Offenders is more like just a shaming but I want also want to you know thank the companies for doing great job. So yeah I know them. Yeah. Um uh you know you said it's it's it's possible to tell the difference between different hashes but can you tell the difference between just looking at the hash of something without the password of what's an scrypt, what's a bcrypt and what's using PBKDF2 or three? Is it possible to tell those apart? Uh most of the times, yeah, because bcrypt usually starts with dollar sign, 2-something, dollar sign. So that's a bcrypt. The
scrypt is slightly longer and yeah, sometimes it is possible. I think it's most of the times. Yeah. Yeah. Okay. It probably if it would be encrypted or encrypted, I'm sorry, encoded to Base64, it could take some more time, but I think that it's possible. Yeah. More questions. your uh pulse project. Is it possible to contribute or is it closed? Yeah, definitely it's required to contribute. Okay, great. It it's it's mandatory to contribute. And how do we do that? Uh just tweet me or send me an email uh to a link to public disclosure because I bills in his shorts. Uh Bruce from Password Research already done that. They he sent me like three links to to sites who are actually
disclosing there is a password. Yeah, it's mandatory. I cannot do this alone. Thanks. You know, this this has been an ongoing discussion for many years at Passwords Con. We are still fighting to to make companies to disclose how they are storing your passwords. And my own personal opinion on this is basically that, you know, if they don't want to disclose, you should just, you know, expect the worst. Yeah. Uh it's like either you can be on on the good list or or I will put you on the [ __ ] list. It's it's that simple. I I will probably rename my project. Yeah. Yeah. But I mean over and over again for every new
leak we see we see unsalted MD5 still going on. We still see unsalted SHA-1 and we see all kinds of bad implementations. It's like somebody I'm not going to say who if if it is board of directors or or if if it is the developers but somebody is just not watching the news for the past 10 15 years about all the leaks. It's like oh there's a leak. Well the obvious question anybody on the survey on the board of directors should be asking their own organization is how do we store our customers passwords? They don't they don't ask those questions. They just assume well we're not stupid so this is not never going to happen to
us and we need to change that. So again thank you. Maybe one remark I was talking about Ashley Madison that they are storing passwords in bcrypt but they have um done something really bad to the passwords as well. They were storing them also in MD5 like besides the bcrypt passwords. So uh if the company scores uh A or B because they tell us that they store passwords in bcrypt, they can they still have a lot of opportunities to screw up. But um this is not possible to verify. Um we just need to trust them. If they tell us that they use bcrypt, we just need to trust them that they use the bcrypt. So um it's not really possible to verify it
without hacking the company and I'm not going there. So yeah. And if you're interested in the Ashley Madison case as well, uh, CynoSure Prime, which is one of the really good groups doing password cracking for Crack Me If You Can and and and other password hashing uh cracking competitions, they did a talk at Passwords Con at the University of Cambridge in the UK last year where they talk about the well the how Ashley Madison had done pretty much everything wrong in their password implementation. So initially bcrypt oh look looks good and then you know kaboom. Okay so we're going to take a break until 12. Uh and the next speaker up is Bruce Marshall who will also do a really
uh uh interesting talk about how you should be proactively handling password breaches from other sites to your benefit pretty much. So back again at 12. Thank you. Okay. Thank you. [Applause]
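Two points from the talk above lend themselves to a short sketch: recognising a storage scheme from the shape of a leaked hash, and the truncation arithmetic in the static-salt example (72 minus 56 leaves 16). The Python below is illustrative only. `guess_hash_scheme`, `usable_password_chars`, and `bcrypt_effective_input` are hypothetical names, the length checks are heuristics rather than proof of a particular algorithm, and the truncation is simulated with a byte slice instead of calling a real bcrypt library.

```python
import re

BCRYPT_MAX_BYTES = 72  # bcrypt ignores input beyond the first 72 bytes


def guess_hash_scheme(stored: str) -> str:
    """Guess a storage scheme from the format of a stored value (heuristic)."""
    # bcrypt: $2a$, $2b$, $2x$ or $2y$, a two-digit cost, then 53 characters
    # (22 of salt plus 31 of hash) from the bcrypt Base64 alphabet.
    if re.fullmatch(r"\$2[abxy]?\$\d{2}\$[./A-Za-z0-9]{53}", stored):
        return "bcrypt"
    if re.fullmatch(r"[0-9a-fA-F]{32}", stored):
        return "md5"  # 32 hex characters; other 128-bit hashes look the same
    if re.fullmatch(r"[0-9a-fA-F]{40}", stored):
        return "sha1"  # 40 hex characters; other 160-bit hashes look the same
    return "unknown"


def usable_password_chars(static_salt_len: int) -> int:
    """How many single-byte password characters survive after a prepended salt."""
    return max(0, BCRYPT_MAX_BYTES - static_salt_len)


def bcrypt_effective_input(static_salt: str, password: str) -> bytes:
    """The bytes bcrypt would actually use when the salt is prepended."""
    return (static_salt + password).encode("utf-8")[:BCRYPT_MAX_BYTES]


if __name__ == "__main__":
    salt = "s" * 56  # stand-in for the 56-character static salt
    print(usable_password_chars(len(salt)))
    same = bcrypt_effective_input(salt, "a" * 16 + "X") == \
        bcrypt_effective_input(salt, "a" * 16 + "Y")
    print("passwords differing only after character 16 collide:", same)
    print(guess_hash_scheme("5f4dcc3b5aa765d61d8327deb882cf99"))
```

With a 56-character prefix, any two passwords that share their first 16 characters feed identical bytes to bcrypt, which is the silent truncation the speaker describes.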
Come on.
on your slides yet?
I don't know. [Music]
Okay. So, this upcoming talk now, proactive password leak processing with Bruce Marshall. Uh, I know the schedule says uh 25 minutes. It's going to be uh a bit more than that up to maybe 50 minutes. I don't know. We'll see. Uh, and as I said before, you can very easily go without food for several days. So even though there's lunch afterwards, I will highly recommend you to stay here and listen to the entire talk. Uh because this is going to be a good one as well. Uh Bruce is one of those that have been with us ever since we started doing Passwords Con in Las Vegas four years ago. So I will just leave the
stage for him. Go ahead Bruce. All right. Thanks, Per. So yeah, as the uh I run the passwordresearch.com website which I started 10, 15 years ago and now not quite 15 years ago uh to try to gather some of the research uh specifically starting with the academic community. Um but then I've since added more from events like this which I would consider non-academic for the most part and trying to share that information make it more accessible and essentially provide an index to people like us who work in the private you know industrial government type fields so we can benefit from some of that knowledge and my last couple presentations as Per mentioned I've I've presented several different
times on security questions on uh passphrases like diceware, xkcd-pass style passphrases um and password expiration. And some of those have been driven by data that I've found like the security questions and password expiration was based on password data that I had and some some uh internet dumps that had security questions and answers. Um and this is one that's been a little bit different because I I started out just hearing about these companies looking at leaks like Microsoft going out. They announced here a couple months ago they were going to start, you know, or maybe they've already been doing it for a while, but they they announced at least that they were going to be looking at password
leaks and looking at things like that to try to protect their accounts um of their of their users, their customers. And so, as I was collecting this information, I started hearing more and more about it. And I decided that I wanted to kind of try to gather what we know right now that companies are doing and talk about the techniques that they're using and talk about the different alternatives you have if you're considering this or um if you're debating whether it's even worth your time. Um and one of the reasons I wanted to do that now was because uh in some ways in some things in the industry we kind of get broadsided by stuff by
either auditors or standards you know OWASP or SANS or somebody comes out with a new guideline on something then we have to figure out how to do it. Um so this is kind of my way of helping start that conversation along before we're too far in where you know 99% of the industry isn't doing anything related to this. Um so let's get right into it. Uh, and like Per said, I'm there was some confusion. I I had thought that I had a 55 minute slot, so I have about that much material prepared, but I'm going to try to pare it down to uh probably closer to 30 35 minutes. Try to be respectful, but we'll see. I I no
guarantees. Have you people have you had breakfast? So, you can wait. Go on. Yeah, just just raise your hand and Per will bring you a little cup of juice and I've been locking up doors. There's no way of getting out that way. Right. So, password reuse, if you're not sure what that term means, basically is a person using a password on multiple sites or multiple applications. Typically, in this case, we're primarily interested in the internet, but you could say in a corporate environment or something like that. Um, and so to try to measure the extent of it, I'm going to give you some some stats. And this is where I'll kind of not spend a lot of time, but there's
several different ways we can kind of measure that. And and there's additional stats that I don't have just because there's more than we really need, but just to give you a rough sense of it when when actually asked what they do, um, which can be somewhat problematic because people either idealize what they do or they're they underestimate what they do. Um roughly, you know, anywhere from 46 to 60% of people say that they at least use passwords on several sites or they reuse passwords um um for different places they go on the internet. We can actually see that too um in some research that's looked specifically at password leaks which password leaks are essentially if a company gets hacked
like if you were in for Michael's last presentation talking about some of the data that gets dumped out um they can compare passwords between the different users that are in both of those dumps both of those uh that were hacked from both of those organizations and see if they matched and kind of get a better feel for that. And so one of the research papers um academic research papers was looking at how many had exact matches versus slight modifications. You know maybe the password policy is different on one site than the other. So they had to add a capital letter where normally they wouldn't have one or add a number or maybe they some people like to
do little prefixes like for Facebook the first three letters are FAC things like that where they felt like they were reasonably predictive. Um, and I guess I skipped it on the first slide, but these numbers in brackets here. I'm big on uh references since that was the point of me starting my website was to point people back to the original sources of data. Um, and I'll have the references section at the end. So, if you see something you want to dig in further, just write down the number and um, you can see actually where it comes from and do some reading yourself. Um, Troy Hunt also did some comparisons between Yahoo Voices, Sony Pictures, and similar type
of thing. He saw that around a little bit under 60% used the exact same password within that sample and 2% had slight capitalization differences between those passwords. And finally, probably even the most accurate or at least more insightful is monitoring what people do uh within their web browsers. Um and two different one study and one kind of industry type study have looked at that. Trusteer. It's a little, it's getting a little bit old now, but I don't imagine things have changed too terribly much. Uh had a basically had a browser extension that would monitor where people were using their passwords for those th those customers. And I want to say that they had like four million different people
that had that that extension installed. And they were able to see that just specifically focusing on financial sites that 73% of the people used their bank password or or you know their credit union whatever to log into at least one other site which you would think that's very bad and you would be right. Um and then also the fact that they may have a different ID but they but they um on other sites too but they use the same password. And then um in a smaller review of just more more recent but smaller review of university students, they did the similar thing, installed a browser extension, looked at how many different sites they had versus how many
passwords and and there's a lot more details in the studies like I'd encourage you to dig into if you're really interested, but 85% had fewer passwords than websites they went to. So why is this a problem? Um because of the ATO, or account takeover, threats kind of what that that type of uh attack has been labeled where we as site owners start getting attacked because our users have made choices to reuse their passwords. Now account takeover is not just a result of password reuse. It could be a result of poor password choice. It could be a re um someone's computer having a Trojan installed in it. You know there's there's different reasons for it.
Password reuse is one of the threats uh as one of the the causes of of account takeover and credential stuffing has kind of been the name that I think Shape Security kind of, um... Akamai monitors a lot of the sites out on the internet and they're able to see some more insights on their customers beyond just a single customer type of situation and they released this year in one of their reports um like nearly you know a a million IPs being used in in one single attack against a financial customer um throughout that period 427 million accounts checked and a different customer which I think they said was in the entertainment industry was the the
second one here you know 817,000 different IPs and looking at those IPs of course because they've got insight into both of those able to see that there's about 70 70% overlap between them so they either they're using the same botnets or it's the same gang of people running these different types of attacks or there's you know there there's definitely um some correlation between those different attacks. So they're out there, they're trying whether it's your site yet or not. I mean only you probably you know or hopefully your logs will tell you. Uh one of the interesting things that also Akamai said in that same report was that when new password leaks come out like LinkedIn and MySpace and some of
the others we've seen this year, they see spikes in account takeover credential reuse type activity. They're specifically monitoring that types of stuff. Um, and one of the biggest examples of that happened last year was Taobao, the Chinese kind of like an eBay reseller type site. Um, they were hit in the middle of October with uh what authorities later said was a collection of 99 million credentials that successfully got them into about 20 20 that matched at least 20.5 million active users on the Taobao site. Now, Taobao said that they didn't actually get in. They got blocked by maybe, you know, some um contextual authentication type stuff. You know, they came from the wrong IPs or they had suspicious browser
strings or something like that. But um regardless of how many they actually blocked, um the resulting crime resulting from getting into those accounts was around $1 million worth of fraud uh that they detected and then had to deal with on their site. Uh in less widespread nature. There's been a lot. TechCrunch here just in the the last couple weeks um got hacked into due to a shared password in their content management system and briefly someone posted a fake article. Uh GitHub had problems the same type of thing after LinkedIn came out. Uh the most recent LinkedIn uh leak that uh their users were being attacked with reused credentials and they had to respond to that. And chatba um was
another one that said that someone got in because of a third party breach and the credentials being the same as on their one of their administrators.
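The dump comparisons mentioned earlier in this talk (exact matches versus slight capitalization differences for users who appear in two leaks) can be sketched in a few lines of Python. This is only an illustration of the idea, not the method used in any of the cited studies. It assumes each dump has already been reduced to a mapping from a user identifier to a plaintext password, and `classify_reuse`, `reuse_stats`, and the sample data are all hypothetical.

```python
from collections import Counter


def classify_reuse(pw_a: str, pw_b: str) -> str:
    """Classify how one user's password on site A relates to theirs on site B."""
    if pw_a == pw_b:
        return "exact"
    if pw_a.lower() == pw_b.lower():
        return "case-only"  # same password apart from capitalization
    return "different"


def reuse_stats(dump_a: dict[str, str], dump_b: dict[str, str]) -> Counter:
    """Count reuse categories over the users present in both dumps."""
    counts: Counter = Counter()
    for user in dump_a.keys() & dump_b.keys():
        counts[classify_reuse(dump_a[user], dump_b[user])] += 1
    return counts


if __name__ == "__main__":
    # Made-up sample data, keyed by email address.
    site_a = {"ann@example.com": "hunter2", "bob@example.com": "Sunshine",
              "cat@example.com": "tr0ub4dor", "dan@example.com": "letmein"}
    site_b = {"ann@example.com": "hunter2", "bob@example.com": "sunshine",
              "cat@example.com": "correcthorse", "eve@example.com": "letmein"}
    stats = reuse_stats(site_a, site_b)
    total = sum(stats.values())
    for category, count in sorted(stats.items()):
        print(f"{category}: {count} of {total} shared users")
```

A fuller version would also try the other small transformations the speaker mentions, such as an appended digit or a site-specific prefix, before calling two passwords different.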
So it's hard to quantify how many people have suffered account takeover due to password reuse because we often don't know why their account was taken over unless they say specifically, oh, I had the same password between eBay and my Facebook account. But we know as far as just account compromises in general um roughly you know 25% of the population has experienced that in in the past year and had to deal with the the outcomes of that um based on this survey. So why does that happen? I don't really want to spend too much time on this. There's lots of different reasons for people to want to get into accounts. Often you think like why does someone
get want to get into my Starbucks account? I can understand eBay or my bank but why my Starbucks account? Typically there's some way they can either get money out of it. they can get, you know, social proofing, they can do different things that are going to add value to them or that they would can sell to someone else that is interested in doing those things. So, I guess one of the questions that we kind of have to answer is, is it our responsibility to care? Someone's chosen to reuse a password and there's been research that says that people reuse passwords in part to deal as a coping mechanism to deal with the overload of passwords that they have. um you know
they don't try to choose something super complex so they can remember it. They try to reuse it in certain situations because they want to relieve that memory burden that they have of having you know if you had to memorize a password for every single service uh or account that you have um that could be overwhelming. So they made the decision to reuse that password. Maybe they weren't as informed as they should they could have been but they did make a decision about that. Um so we do have the option and most of us are in that default option right now which is we're not doing anything. we wait until a an account gets hacked and
we respond to it and you know we can continue to do that if we want to. So from a perspective of back to what do users know there's an excellent paper um I don't think Dr. Cranor is in here, but her team at Carnegie Mellon has done great research over the last few years. But one of the papers that I recommend pretty much anybody dealing with password policy or authentication decisions is this one called I added a exclamation mark at the end to make it secure, which is where they sat down in a lab and they asked people to create passwords for like a newspaper site, a bank site, and an email account. And they looked at what they did and kind of
had them talk through the process. Okay, I'm this is my bank account, so I want a stronger password or this is my email, so I want it to be something I can type in quickly. They they they got that feedback from the people that they talked to. And then they also talked to them after the fact. Um, you know, why did you choose to make your bank password the same as your email password? Um, and so they specifically got feedback on password reuse. And a lot of people say, you know, kind of like, well, you know, it's a bad thing to do and um, I probably shouldn't be doing it. I'm not as concerned because
it seems to not have any consequences for me. They're part of that, you know, maybe 75% of the people that haven't had their account taken over that they can't trace back to um password reuse being a problem, which, you know, that's I you can't argue with experience. In some cases, they somebody can reuse a password and never have problems with it. It depends on some of the other factors of if that password is going to get disclosed um through an attack or a breach somewhere. So, but part of the problem is that they don't have the same education that some of us have as to what constitutes good and bad password decisions. What constitutes um risky situations that may
expose their password to compromise. Um three of the different people that they talk to, you know, say, "Hey, if it's if I've got a good password, I reuse it. I don't see a problem with that. Um my reused password is not easily guessed. No one can guess my reused password." Well, so the researchers then took the passwords that were generated as part of this lab experiment, used hashcat blindly without, you know, with someone that didn't know what the values were to crack those resulting passwords or attempt to crack the passwords. And two of these three people had their passwords cracked. So maybe not the best, you know, judges of of someone being able to guess their
password. So that's my perspective. Um, I guess one more here. This was a guy that uh recently just came out here. Uh he was contacted by a news agency because his password had been breached as part of a O2 compromise over in I guess England. He talked about well you know I reused that for O2 and eBay and Gumtree and up to that point he'd considered himself secure online and internet savvy. So um you know from his perspective maybe he had thought his password was good enough but the point being um sometimes we have to provide that guidance back to users um you know we establish a minimums or we establish standards and users say well if you're
saying I only have to you know do six character passwords then that must be sufficient. Um, so you know the Office Space, uh, flair scene here came to mind as far as that argument of you know well you why did you set the standard this if this wasn't sufficient and password reuse it's a little bit harder for us to set standards but um our actions do kind of speak to that same question of is this acceptable behavior is it not acceptable behavior. I mean users do also kind of have mixed feelings about who's responsible for that. Um, in one survey, 56% said the sites that they visit had ultimate responsibility for their for their
account protection. Uh, and another 39% said that websites are to blame if they have account compromises because they didn't offer the right security features, whether it's multifactor or stronger passwords or whatever it may be. Um, they're placing some of that blame on it. So, I would say, you know, given this, we kind of have a shared responsibility. Um, Alex Stamos who was who's now the Facebook CSO uh spoke at a AppSec conference here last year where he said he was asked in the Q&A session what's the biggest challenge for Yahoo and he said user security and then broke further down and talking about how they deal with password leaks and password compromises um and saying you know in theory there's
nothing we can do about that right the user's choosing to share their password they're making a choice we you know we we we don't really have any role in that choice, but he said in practice it means they need to kind of readdress how they're dealing with passwords, how they're dealing with their users, and how to limit the the risk of those compromises because a user is not capable or not willing to do that for themselves. So, I I thought that was a a very pertinent quote related to this discussion. So, there are there are different things you can do. We're going to focus on password leak processing, but I did want to kind of talk through these and I'm
not going to spend as much time. I was going to go into blacklisting a little bit more and contextual risk-based authentication a little bit more. Um, but one of the bad I would say bad things you can do is to enforce regular password expiration. Um, and that's been talked a couple other sessions uh in this conference where if people are changing their passwords every 60 to 90 days, they at least can't have an exact they probably won't have an exact match to their other accounts because they'll be incrementing the number on the password that they've been forced to change. You know, maybe a slight difference. And as we talked about earlier um in some of the leak
processing people are able to guess those transformations fairly easy. So not a great way. Um incident driven would be like Citrix GoToMyPC and, uh, Pandora. Was it Pandora? Oh, Carbonite. Carbonite was the other one. Whoops. There we go. So Pandora and Carbonite, or, uh, GoToMyPC and Carbonite detected password guessing attacks on their sites and just reset everybody's password, forced everybody to change their passwords. That's a fairly, you know, scorched earth type of policy. Um it can work, but how often are you going to be able to carry that out and not upset your users? I mean, there was a lot of people that said, "I had good passwords
on your site." you know, especially with Carbonite where you're doing backups with it and now you got to change service passwords and stuff to make sure that actually continues to back up. Um, that can cause a lot of disruption that your users may not want to deal with. Um, you can design unusual password policy requirements. You know, if you start making sure that people have to start it with no space, you know, no symbols in their password or they have to have a uh, you know, two symbols in the middle of their password or uh, Mark Burnett has a great site um, or Twitter feed of uh, PWTooStrong where he retweets all the different terrible
policies and different sites have. There's there's ton and I don't think that typically they're designed to do that but that is one way that someone could approach this this problem saying we'll come up with some crazy requirement that way they won't you know possibly have the same password on our site as others. You can assign random passwords to users um or they don't have to be random. I mean they could be semi- random or whatever but you could assign passwords to users. Uh Linux Mint was hacked earlier this year. their their forums where their forums and their partner sites or uh community sites were their their database for those were hacked. And as a result of that, one of
the choices they made was to just randomly assign all their users passwords um for the for the new passwords. That lasted about a week. Um after that, they realized that people weren't really happy with that choice. They wanted to be able to choose their own passwords. For some of us, it's not as big of a deal. We just plug it into our password manager and go on our way. But for others, either they're trying to have to memorize that or writing it down or they're just not pleased with having to deal with that uh possibility. Very high-security sites, you might get away with that. Maybe there's um you can justify that and your users aren't going
to um react too strongly. Um eliminate passwords altogether. Um, and this is what kind of Yahoo with their Yahoo account key and and some other sites like Medium have adopted where they just basically say if you want to log into the site, plug in your username, we'll send you an email, you click the link in the email that has basically a session key that logs you into the site. You don't need a password anymore. Everything will be done through your I mean your your email you need a password for. But that's not our responsibility, you know, their responsibility anymore. Um, that's, you know, for some sites I think again lower security that may be an option you're willing to go with. Um,
just because, you know, they have to worry about their email password, but from your perspective, there's nothing else you really have to worry too much about. Uh, two-factor, multifactor authentication, two-step verification. I mean, we've been encouraging that for years. So, regardless of password reuse, uh, because of all the other password and account takeover threats, it's a good idea. Um, it probably goes without saying, uh, but it is nice if your password isn't the only line of defense. So, if someone guesses your password, they still can't get into your account. Uh, that's ideal. Um, blacklists from leaked passwords. I'll talk briefly about this, but, um, Jim
mentioned yesterday in his talk about the new NIST standard that there's some push towards this, instead of just, uh, reacting to people's bad choices after the fact, when you see it in a leak and you tell them, "Yes, that was a bad password." I mean, not all leaked passwords are bad, but, um, presumably some of the passwords are, and then you would be telling them not to use them. You could just create that list from scratch or create it from leaked passwords. Um, there are services out there. Um, Password RBL is one I talked to. Um, he's got millions of passwords, essentially, that he's compiled from leaks, um, from attack tools that have password and username, you know, combo lists,
uh, from different places like that, that you can subscribe to. You can generate your own blacklist. Um, but you can essentially try to prevent those from the start for all users, rather than just saying this one user can't use this one leaked password that we found they were using somewhere else. Uh, contextual risk-based authentication. There's lots of different names for it, but it's essentially looking at other factors that you're monitoring more passively, um, that are associated with the user's login experience. So, you're looking at IP and geolocating that IP against their normal location. So if I log in from the United States and suddenly I'm coming from Europe, they may flag that
for a, you know, they can essentially flag that as a higher-risk type of authentication transaction. Uh, LinkedIn is doing this and did a great talk at the Enigma, uh, conference this year. Um, David Freeman there, I think it was called "Server-side Second Factors," and he talked about essentially the formula that they use to determine risk, you know, browser agent, time of day, all the different IP factors. And he also talked about their success at combating things like account takeover fraud, the fact that just by looking at country they could eliminate 90-some percent of, you know, those bot, the automated large-scale attacks, because they were coming from other
countries, um, and some of the other things like that. So, that's certainly something that I think, regardless of whether you're doing this just to combat password reuse, is a good thing that you should be looking at if you don't already have it implemented. All right. So, finally, the approach we're going to talk about in more detail is looking for password leaks on the internet from other sites, uh, possibly from people claiming it to be your site, and then comparing that to your own users. So here's my obligatory kitten picture. But so your goal in doing this is to reduce account takeover. You'd probably want to eliminate it, but at least to reduce account
takeover based on risks that you know to be there. If this password leak is out there and other people have access to it, those accounts by logic have a higher chance of being attacked. So you're trying to get ahead of those attacks, um, for, you know, presumably the riskier accounts, at least from this perspective. Uh, I haven't seen or heard feedback from the companies that are doing this. I would assume that there is some money savings from dealing more proactively with account takeover threats before the account takeover has happened, when you're not engaging customer service and, um, you know, admins or whoever else has to deal with compromised accounts, and possibly
the loss of business that goes along with customers being frustrated that you didn't protect them, even though maybe it was their bad choice that led to the account takeover. Um, and it also, as we talked about, demonstrates some security commitment to your users, to your investors, uh, your auditors, your, you know, management team, whoever is kind of in that field of needing some reassurance. So, if you've never seen leaks before, if you're not quite sure, I'll give a brief, uh, explanation. Typically, it's going to be just, you know, data that's posted on the internet. Um, my experience, and I'll talk a little bit about the data that I collected a couple years ago, was
that most of, at least, the small-time compromises come from things like SQL injection, uh, where sites, predominantly running PHP, were not programmed properly to resist SQL injection, so the attacker can then dump their entire user database, get the passwords, you know, usernames and things like that. Sometimes, in bigger cases, it comes from just outright server compromises, where they compromise an application server and then have connections back into the database server that they can pull the data out of. Um, but there's also cases where Trojans and malware are collecting it. Uh, I think Trustwave did an analysis of the Pony botnet, and they had several hundred thousand passwords that had been collected by the Pony
botnet, um, over, I think, a six-month period or something like that. Um, phishing, of course, tends to be somewhat lower scale. Man-in-the-middle attacks. And then there's compilations, where people may just collect data over time and then whittle it down and say, okay, I've grabbed Gmail addresses from seven different sources and now I'm going to put them all in one file and call it, you know, a Gmail password dump, um, even though it didn't technically come from a hack of Gmail. Um, some will be duplicates. People kind of view this as a bragging right. In some cases, they put their name on the top and say, "Hey, we hacked the FBI or we hacked, you know, Gmail."
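
A compilation like that can look a lot bigger than it really is. As a rough sketch only, assuming each source has already been parsed into (email, password) pairs, something like the following would show how much of a merged file is duplicates. The function name and the sample data are made up for illustration; they are not from any tool mentioned in the talk.

```python
from collections import Counter


def summarize_compilation(sources):
    """Merge several lists of (email, password) pairs and count the overlap.

    Emails are lowercased and stripped before comparison (an assumption;
    a real pipeline might normalize differently).
    """
    seen = Counter()
    for entries in sources:
        for email, password in entries:
            seen[(email.strip().lower(), password)] += 1
    total = sum(seen.values())
    unique = len(seen)
    return {"total": total, "unique": unique, "duplicates": total - unique}


# Hypothetical sample: two "sources" that share one credential pair.
sources = [
    [("a@example.com", "hunter2"), ("b@example.com", "letmein")],
    [("A@example.com", "hunter2"), ("c@example.com", "qwerty")],
]
summary = summarize_compilation(sources)
print(summary)
```

Running a count like this before any heavier processing gives a quick sense of whether a "new" dump is mostly recycled data.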
Um, and some leaks, of course, as you've probably heard, don't just contain password data. You know, the Ashley Madison, uh, compromise contained all sorts of personal data. The healthcare breach that was, I guess, leaked, that was kind of given publicity this week, has all sorts of healthcare information. So, they're not just limited to, you know, username, password, email. Um, there may be lots of other data within them. So if you're deciding to do this, you kind of have to make some decisions, and you can be somewhat flexible on this. It's not a binary yes, we're doing everything, or no, we're not doing anything, type decision. Um
probably the most important one is for you to decide if you're going to be looking for data that comes from your own site, looking for signs of compromise, or looking for signs that someone is sharing data that supposedly comes from your users or your employees. Um, after that, you're probably going to be looking at leaks that are easy to process. So the larger leaks that have plain text passwords, where you don't have to worry about cracking anything, you don't have to worry about, um, going through much effort to parse the data or to go out there and find the data, um, are another, I guess, easy hurdle to overcome. Like I said, the larger leaks, and then all the
leaks you can find. And this is kind of, in talking with, um, Michael Coates, who's the Twitter, uh, privacy and security officer, what he talked about. That's kind of their approach: they're trying to find everything that's out there, um, that they can deal with, and try to process it to protect their users. But as you might guess, that also requires greater commitment of time and resources to try to address that. So, if you haven't seen password leaks before, they come in all sorts of shapes and sizes. These are some from, like, Pastebin-type sites, where they're just, you know, text-file-type formats. Um, you know, usernames, passwords, emails. Some are going to have them in different
order. Some will be tab delimited, some will be space delimited, some will be, you know, comma delimited. Uh, some will have hashed passwords, some will have plain text passwords, some will have passwords in a different table than they had usernames, so you'll have to try to correlate them. My point just being that there's not a one-size-fits-all type of approach to sucking in that data and having to process it. And in this case, also, you can see their passwords over here in parentheses that they've already cracked before they published the dump out there. This one's more like a SQL-type statement, um, with all the different fields within there. So, I looked at this back, it's been,
you know, three, four years now, um, to try to get an idea of the scope of password leaks. I wrote a blog post about it, which you can see here in the reference. But, uh, essentially, over a two-month period, I looked at how many dumps I could find that specifically had, I think my criteria was, more than 10 usernames and passwords in them, because some might just have, like, two or three, and, uh, some might be, like, one single person had a Trojan on a system and they dumped all his information, kind of like a doxing-type attack. But these were the different results. So December, 154 different password dumps, 125 that had specifically
an organization or a site named as the source of that dump. Whether that was accurate or not, I didn't verify it. Um, plain text passwords, there were 66 dumps and 40 dumps, you know, resulting in 221,000 and 61,000 passwords respectively. And then similar with hashed passwords. So, you can kind of see there's some variation, and there's also variation in size. Uh, the dumps with less than a thousand passwords, there was a pretty good number of those. And a lot of these are just smaller, less secure sites, where, you know, a university in India throws up a CMS for their students to log in and get courseware, or a small, you know, retailer
or something like that throws up a site. Um, and I didn't count up the emails. There were emails included in the December dumps, but for the January dumps, I actually did say what percent of the dumps had emails in them, to give you an idea. The rest either had just usernames, or some of them may have just been passwords by themselves. That gives you an idea of whether you're going to have access to those emails to know if they're your users or not. Um, so, you know, this was a decent amount of work for me to parse through, because there were dozens and dozens more
dumps that I saw that didn't have password information in them. They were just config files or other data that was not passwords, that I didn't want to have to deal with. Um, and this was even with, like I said, automating. I had to manually review them, but I could at least automate the, um, scanning for them. Uh, when you were looking for emails in there, I mean, emails are being used as usernames. Did you at any point see any leaks, or did you ever think about also looking for usernames that are something other than just an email address? Like, you know, Twitter as an example, you can log in using your phone number if you have given them
that. Uh, you can log in on Twitter using your handle, or you can log in using your, yeah, uh, email address. No, I didn't do any counting of usernames specifically. Um, you can see, like, in some of these dumps, like, uh, this first one here on the left side, it's got a, I guess, a screen name, maybe. That's their username, and it also has their email. I couldn't tell you what that site allowed or used as a login. Um, we know that a lot of sites on the internet, of course, use emails, but yeah, there may be cases where they could do either/or. And we'll talk more about whether you want to parse
usernames against your own users or not, but, um, that can be problematic. So, that's, you know, kind of like the kitchen sink. You're looking for as much as you can. You're sucking that in. Um, and I'll talk about some tools here in a minute that may make that easier for you. But you could also just look at the larger scale. And this is kind of a sampling of what's come out and what I would say has been generally available in the last few months of this year. Um, some of these didn't come from this year. Like LinkedIn, we know, came from, I think, 2013. 2012? Okay. And then MySpace is around
that same time. And the Twitter one wasn't really Twitter. It was some other site. And they had 400 million entries, but only 32 million of them were unique entries. And so, you know, you may still, um, find that that data is still pertinent. And we'll talk about LinkedIn, um, here in a little bit, but, uh, that data may still be relevant even if it's older. But as far as general availability, um, these are kind of what you're looking at as far as large numbers. Clearly much larger than the, you know, few-tens-of-thousands password dumps that we see with just crawling Pastebin. And most of these you can't find on
sites like Pastebin. You have to look for either people tweeting about them, or there's torrents, or, you know, maybe some other file sharing sources in the underground. The nice thing is, I guess, once more people get them, they tend to share them more. Like LinkedIn at first was kind of harder to get a copy of, and now it's fairly easy if you know where to look for it. See? All right. So, some tools. Netflix is one of the organizations that does process password leaks. Um, and they came up with a tool, a Ruby on Rails application, a few years ago, which they open sourced, called Scumblr. And that's not really a
password leak processing tool so much as an intelligence gathering tool that looks for password leaks. So, it's a tool that they have for, uh, data sources like the Pastebins, but also Facebook, uh, Twitter. They can scan Twitter. Um, I'm trying to think, probably Google searches and some other stuff like that. But they essentially make it so if they want to put in "Netflix password" as a term in there, it gives them a workflow and kind of a checklist for them to go through of all the different sources that the tool's found since the last time it's been run. You know, you could schedule it to run
daily or whatever and retrieve that information, decide whether it's a threat you need to deal with, whether it's actual leaked passwords or not, and then, um, process it. Yes? You also need to be really careful with Scumblr. By default, it will attempt to use email, and its original email destination was actually not registered. So, it was actually sending some of the reports to an unregistered email address. Okay, that was patched. So, he mentioned, um, that Scumblr's output, as far as alerting you that you need to process data from it, was originally email, and there were some quirks if you're using an older version of it. So, make sure you're using a current version. But that's kind of
their solution, and it works in with some of their other tools as far as, like I said, the workflow behind how they process those leaks and decide, because they're looking for other stuff too. If someone says, "Hey, I hacked Netflix," but they're not dumping passwords, they're also interested in that type of data. So, it's more of a general intelligence gathering tool, but it can be used specifically for password leaks. Dumpmon is a Twitter account, but it's also an open source project where they are crawling Pastebin for you. And I believe that was one of the primary sources I used when I was doing my research back in 2012 and '13. Um, but you could customize that, since you've got the
source code, to, instead of posting on Twitter, do the same type of thing: write it into one of your ticketing systems or send yourself alerts when you find specific strings that seem to match password dumps you want to deal with. Um, there are some sites, there's not really... Hashes.org is pretty much the best site that I'm aware of as far as current, um, hash dumps. Um, and they kind of disguise the name, like LinkedIn has a one instead of an I, because I guess they're trying to make it harder to search for. But, um, they don't only have the raw hashes, they also have cracked hashes. So that
can save you a lot of time if you want to just, um, take advantage of the work that they've already done. Um, InsidePro is, like, I think, a Russian password cracking site, and they have forums where, uh, people share hashes and stuff like that. There's all sorts of other, um, occasional sources that may have either older data or may leak something every once in a while, but it's kind of hard to find one good source on the internet that has everything you need in one place. Yeah? Uh, InsidePro is a little bit more nefarious than that. It's basically frequented by a large number of people with excess GPU capacity. You can basically post large
dumps of uncracked hashes. Yeah. And get them cracked. Um, I've been watching specific actors responsible for some of the large cloud service provider breaches, and when they pick a new target, uh, they'll pick, like, the hashes for maybe 50 or 60 employees at a particular cloud provider and pump them into InsidePro with, like, anywhere from a $5 to $100 bounty per hash, and they'll usually get around 20 to 25% of the accounts, um, reversed within about three to four hours. Yeah. So that's how effective. Yep. So he mentioned basically that InsidePro does kind of like a crack-for-pay, um, in their forums, where you can say, "Hey, I
want these cracked." And you can offer a bounty for it. And, um, so it's not always just, like, good-natured password sharing. Sometimes it's more of the criminal element involved in that, too. Uh, of course, a lot of the dumps we've seen this year have been offered for sale initially, like the LinkedIn and MySpace data, on, you know, underground sites where they sell everything from drugs to, you know, whatever else that you wouldn't be able to sell on a normal site. Um, those can be a little bit harder to track down, just because you need to get access to them. Sometimes they're a little bit more careful about that, sometimes they're not. Depends on
which sites you're talking about. I'm not aware of any, like, underground, like a carding-pro-type site that's specifically focused on credentials, you know, where they're only interested in credentials. So, it tends to be mixed in with the other data sales or things like that, um, credit card sales, other types of data dumps that are out there. Um, law enforcement may also... I've heard of cases, I think with Time Warner Cable, where the FBI, like, said, "Hey, we found this data we think may be associated with you, and you might want to take a look at it." Um, I wouldn't count on them doing that for you. If they
do, great. But it's probably more of a, you know, try to know who your local cyber security person is in the Secret Service and FBI, or your national law enforcement agency of choice. But, um, it's nice if you get it, but I wouldn't count on it. Um, there are a couple of companies that are specifically focused on providing you with either password leaks or allowing you to check your users against the password leak data that they have, uh, collected. Um, Hold Security, um, is run by Alex Holden, who you've probably heard of in the news in the last few years, but, um, they have a pretty, I
guess, mature, uh, program for going out there and getting large data leaks from their underground connections. They do crack some of the passwords themselves. They put in, you know, they said, like, about a week's worth of effort. They do find a lot of plain text passwords. Um, and their idea is that they sell you essentially an API service to query users and find out if your users, whether they're employees or customers, have passwords in their database, and if you need to crack them, you can, you know, do further cracking on your own. LeakedSource is a little bit newer. Um, they've also kind of focused more on the end user, um, as
far as you can go out there, plug in your email, and they'll show you which sites your credentials have been leaked through. Um, but they are also offering a business option. Uh, again, they do some password cracking themselves. They've got the API, and you would have to subscribe to that. Uh, threat intelligence service providers I'm not as familiar with, and I would imagine that if they found a leak that had your name on it, they'd probably tell you about it, but they may not be willing to share, like, you know, a Netflix dump with you if they found something like that. I don't have any... like I said, if you've got
feedback on that, certainly feel free to pitch in, but, um, I know that they'll at least alert you, you know, hey, LinkedIn had this big credential dump, and it may be something you're going to want to be aware of, but they may not provide you with that data directly. LeakedSource, though, doesn't just allow you to look at your end users' data. It allows you to search anybody's data. Right. And he mentioned that LeakedSource doesn't restrict you to just searching for your own users' data. You could essentially search for any matching emails within the dump. Um, and I would imagine Hold Security is probably the same way, but, uh, I don't know that for sure.
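
Whichever source a dump comes from, the text-file formats described earlier (tab, space, or comma delimited, plain text or hashed) have to be normalized before anything else can happen. The following is a minimal sketch under stated assumptions, not a complete parser: it assumes one record per line, treats the first email-looking token as the account, prefers any field that is 32 or 40 hex characters as the credential, and otherwise falls back to the last remaining field. The `md5?` and `sha1?` labels are guesses from length alone (other algorithms produce the same lengths), and plain text passwords that contain a delimiter will be split incorrectly.

```python
import re

# Loose email pattern, good enough for pulling candidates out of paste-style dumps.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
HEX_RE = re.compile(r"[0-9a-fA-F]+")
# Assumed delimiters: whitespace (tabs, spaces), comma, semicolon, colon, pipe.
DELIMITERS_RE = re.compile(r"[\s,;:|]+")
# Length-based guesses only; several hash algorithms share these lengths.
HASH_GUESS_BY_LENGTH = {32: "md5?", 40: "sha1?"}


def parse_line(line):
    """Return (email, credential, kind) for one dump line, or None."""
    match = EMAIL_RE.search(line)
    if match is None:
        return None  # header, footer, or a line without an email
    email = match.group(0).lower()
    # Drop the email itself, then split whatever is left into fields.
    rest = line[:match.start()] + " " + line[match.end():]
    fields = [f for f in DELIMITERS_RE.split(rest) if f]
    if not fields:
        return None
    for field in fields:
        if HEX_RE.fullmatch(field) and len(field) in HASH_GUESS_BY_LENGTH:
            return email, field.lower(), HASH_GUESS_BY_LENGTH[len(field)]
    # Assumption: with no hash-looking field, the last field is the password.
    return email, fields[-1], "plaintext?"


# Hypothetical sample lines in three different layouts, plus a banner line.
lines = [
    "-- dumped by someone --",
    "jsmith\tjsmith@example.com\t5f4dcc3b5aa765d61d8327deb882cf99",
    "alice@example.com:hunter2",
    "bob,bob@example.com,secret",
]
records = [r for r in map(parse_line, lines) if r is not None]
for record in records:
    print(record)
```

Anything this sketch returns would still need the manual review mentioned earlier, since the field-order guess can easily be wrong for a given dump.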
So you've got these leaks, whether you're pulling them from large sources or single sources. You've got to, uh, make some choices on how to process that. The first one, like I mentioned, there's duplicates. So, it'd probably be good to have, like, an indexing type of history where you're saying, "Okay, we've already done, you know, LinkedIn from 2012. We've already done blah blah blah." So you don't waste time going back over the same data, um, especially if you're parsing a lot of the small dumps, where you're probably getting duplicates every month if you're doing that. Um, the cleanup and conversion of data. Some of this can be automated. Just
things like I said, I showed you kind of the different formats. There's headers and footers and different columns, and you can sometimes use regular expressions to pull out hashes and email addresses, but getting the other data may be problematic. Um, and then, of course, presumably you want to filter that and focus on your users. You just care about the email addresses that match your user accounts. There's another decision about which users you're worried about. Um, Pandora was the one that got the leaked data, and any user with an email in that dump had their password reset, or, you know, got flagged for password reset. They didn't
try to crack the passwords. They didn't try to make sure that they were the same. Um, that, as you might guess, angered some people, uh, that didn't use the same passwords. Probably frustrated some others that weren't sure why that was going on. Um, and again, it's kind of like that nuclear option of, we're not really willing to put much effort into this, but we want you to be secure. Uh, so they're kind of pushing that back into the user's responsibility realm. Um, any user with a username appearing in the list: kind of back to your question about whether you should look at usernames, that can be problematic just because usernames aren't necessarily
unique, you know, globally unique. Email addresses are a little bit better about that, although they can be reassigned over time. Um, but, you know, J. Smith is a username. There's probably a lot of sites out there with a J. Smith that isn't your J. Smith. Um, uh, Alex Stamos also mentioned in that same presentation from, uh, AppSec California that they do strip off the, um, recipient part of email addresses and check that against matching usernames within Yahoo, at least they did at the time, um, for matches. But they were specifically also looking for password matches. If by coincidence J. Smith one and J. Smith two had the
same password, J. Smith was still getting a password reset, even if he wasn't necessarily the same guy. And of course, the more precise option is to say, if the username and email specifically match, that's who we're going to worry about putting through the reset process. So, depending on what type of data you're dealing with, um, like I mentioned, sometimes you're getting hashes, sometimes you're getting plain text. Plain text, you're normally going to take that and put it through your normal hashing process. Uh, you may have to pull down the user's salt, if you're, hopefully, please, hopefully, salting your passwords, so you would have to retrieve some of that data. Um, if you have an existing API that this works
well with, it may be fine. Otherwise, you may have to decide to design something specifically for password leak processing, either, you know, because you're dealing with different, uh, performance issues, or you do things in your normal login process that don't make that, um, you know, the right approach to take for this. Um, if they are hashed, you're going to identify the hash, you know, whether it's MD5 or SHA-1 or something else, and then have to decide how much effort you're going to put into cracking that. Um, my recommendation would be to say, we want to prevent people from getting their passwords cracked, you know, given something similar to what the amount of
effort of an average attacker is going to be. So maybe that's a week, maybe that's two, three days. Um, you're going to have to try to make that decision yourself. Um, as far as the different approaches, I mean, we don't really need to get too much into that. You're going to try different approaches to crack the passwords. Um, you can certainly customize that: if your policy is eight characters and higher, you don't have to try to crack passwords shorter than eight characters. That can be problematic as far as the different hashes. Um, this is data from a presentation that Rick Redman from KoreLogic did, uh, a few years ago, where over a six-month period they gathered data on
what types of hashes they saw in the dumps that they were capturing. MD5 was, like, 40, 46% of all the dumps they saw. SHA-1 was, I don't know, 4%, it was much lower, but you can kind of see you're dealing with a lot of different options, and this is just the top ones. They had, I don't know, a list of 30 or something that they gathered. Um, and you can look at that source for more data, but the idea just being, it's not going to be an easy "everything's MD5 or SHA-1." You may have to deal with more options than that. Um, one alternative to hashing, um, or to going through the cracking process,
is something that we learned about with the Facebook presentation that Alec Muffett did a couple years ago at Passwords, uh, PasswordsCon in Oslo, or, I'm not... Trondheim? Trondheim. Okay. He talked about how Facebook deals with passwords, and one of the things he mentioned was that, you know, they use, I think, scrypt, HMAC, and some others, so they're doing better security, but he mentioned that they do have an option for feeding the normal user passwords through MD5 first in the login process, and then into the more complicated, uh, more secure systems. So if you find an MD5 dump, and your users' passwords have already been hashed ahead of time with MD5 initially and then
your more secure alternatives, you can just skip that MD5 hashing step and put them through your more secure process. You know, again, maybe having to pull down, um, seed or salt values, but you could make that comparison even without having to crack the original password. Just, like I said, assuming MD5 to MD5, SHA-1 to SHA-1, you've got comparable, uh, types of approaches. Um, if you're going to do that for more than one, like if you're going to do MD5 and SHA-1 and who knows what else, you may have to have additional records stored within your account database. You may not want to go through that much trouble. It kind of depends on how
serious you want to take it, and how much trouble you want to save yourself. One approach I thought of, which I'm not sure if it's good or bad because it's got some drawbacks, is, instead of trying to crack those passwords, you decide to, um, just compare users when they log in: take the plain text password they provide to you and hash it. If you've already determined how those passwords were hashed, and they have a matching record in the different dumps, make the comparison then, without having to do any cracking on your own. So that's going to compare both the secure and less secure passwords. The problem is you kind of have to keep that database around for an
indefinite period of time. Maybe users don't log in except every couple of months. Um, it's definitely more overhead. You may not want to have to deal with that, but it would get you out of some cracking work. Uh, so, what to do and tell your users? Um, you could just notify users. You don't necessarily have to force them to reset their password. Um, that's kind of what LinkedIn did. The initial LinkedIn dump had, like, some six million passwords in it. So, I understand that they made those users reset passwords if they had a matching password. Um, but the most recent dump, that had all, you know, 117 million, um, accounts in it, um, had not
been reset for the most part. In fact, LinkedIn reported that of those, like, 117 million users, more than 100 million had not reset their password at all. And that was after all the media attention. You know, LinkedIn was in the news a lot about being breached and passwords being compromised, but they didn't tell users to change it. And some users probably said, "If LinkedIn's not telling me to do it, I don't need to worry about it." So, that's kind of putting it back in the user's hands, but they may not make the decision you want them to make. So, if that's concerning to you, don't let them make the decision. You can lock the account. Um,
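
Going back to the MD5-first idea from the Facebook talk mentioned earlier, here is a minimal sketch of how a leaked MD5 hash could be compared against stored credentials without cracking it. Everything in it is an assumption for illustration: PBKDF2 stands in for whatever slow hash a site really uses (the talk mentions scrypt and HMAC, but not the actual construction), the function names are hypothetical, and the comparison only works for dumps of plain unsalted MD5.

```python
import hashlib
import hmac
import os

ITERATIONS = 100_000  # illustrative work factor, not a recommendation


def md5_hex(password: str) -> str:
    """Inner layer: plain unsalted MD5, the same form a leaked MD5 dump contains."""
    return hashlib.md5(password.encode("utf-8")).hexdigest()


def slow_hash(md5_digest_hex: str, salt: bytes) -> bytes:
    """Outer layer: a salted slow hash over the MD5 hex digest (PBKDF2 as a stand-in)."""
    return hashlib.pbkdf2_hmac(
        "sha256", md5_digest_hex.encode("ascii"), salt, ITERATIONS
    )


def enroll(password: str):
    """What would be stored at signup: a per-user salt and slow_hash(md5(password))."""
    salt = os.urandom(16)
    return salt, slow_hash(md5_hex(password), salt)


def leaked_md5_matches(leaked_md5_hex: str, salt: bytes, stored: bytes) -> bool:
    """Skip the MD5 step: feed the leaked digest straight into the outer layer."""
    candidate = slow_hash(leaked_md5_hex.lower(), salt)
    return hmac.compare_digest(candidate, stored)


# Hypothetical user whose password also shows up (as MD5) in a third-party dump.
salt, stored = enroll("hunter2")
print(leaked_md5_matches(md5_hex("hunter2"), salt, stored))
print(leaked_md5_matches(md5_hex("something else"), salt, stored))
```

Supporting a second inner algorithm such as SHA-1 this way would mean storing a second record per user, which is the extra trouble described above.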
you can either do a custom unlock workflow, where you say you have to go to this page to, you know, recover from this, which some sites have done. Most of the larger ones seem to try to push people through the normal forgotten password workflow. Because you've already got that in place, it's better to reuse that same code and functionality. The problem can be, like in the Tumblr case where they had to reset some user passwords, that this is literally the only message they gave their users: "It's time to reset your password," once they tried to log in. They didn't know why. They didn't have any context for why this was
happening for them. So, if you're going to put them through your normal password workflow, you may want to have some sort of a flag you can set where it says, you know, "click here for more information," and it points them to a URL where you can explain why that's happening. Um, you can do what Microsoft and Facebook and some others do, which is to not lock the account necessarily. They can still log into the account, but once they log in, they're required to go through, uh, a password reset workflow. Um, but they're also subjected to that secondary authentication during that process. So they're looking at: are you coming from a country I know, are you
coming from a browser I know and things like that. And of course invalidate session tokens is also important if you got uh mobile apps or persistent cookies or something like that where um even if they're not they may not be logging in. So you can't necessarily force them to your your login workflow. Got to kind of rush through this but there's some important things to tell your users why it's happening. There's going to be confusion on their part. Um, I don't recommend that. There's a question about the name of the third party. I would say typically you don't want to tell them where the leak came from. Um, and that may cause some confusion, but as we'll talk about from
a privacy standpoint, that could be problematic as well. Make sure you emphasize that your site wasn't compromised; this came from a third party. Tell them whether or not unauthorized access was detected on their specific account. That's also helpful for them to understand that you're protecting them. Education, media, stuff like that. Nuisance leaks: I'm going to skip through this. Basically, you may have to process data that didn't come from you. For example, there was a release a few months ago involving large email providers, and here's the actual headline that came with it: you know, a breach hit these big major email providers. Well, it wasn't a breach. That's just the way
that Reuters decided to spin it. But they had to process this data, and they found that, as a percentage of the accounts that were in there, a very minimal amount was actually valid data. The vast majority was either old data or fake data. Regardless, they had to go through and process it. So you're calling 2% minimal? Well, 2% of what was in there, of what they were claiming. Clearly 2%, meaning 476,000 accounts, is pretty significant. But as far as the percentage goes, even 1% would be incredibly significant at any normal breach scale. Yeah. My point is, like I said, not that you
wouldn't care to reset this data, just that the data itself was for the most part bad data. It wasn't something that you would normally have to worry about, except for that small percentage where you do. Was it really bad data, or is it just mischaracterized data, like you were saying before, where somebody accumulates all these breaches and then pulls out just the Gmail addresses and says, hey, Gmail's been breached? Yeah. And a lot of times we don't know. I'll mention, and I thought I had some headlines from it, some Pandora confusion. Well, I guess I don't have it in here. Someone basically said that Amazon Kindle was hacked. They had a record 80,000
records and that turned out to be entirely fake data. So sometimes it's completely made up. Sometimes it is just old data from collections that are years old that's no longer valid. Um but any regardless um so quickly going over this there is some risks involved with processing this. Like I said users confused about how do you know how do you know my password was in this leak? Are you keeping my password in plain text? Are you encrypting it? When Facebook announced their policy there was a lot of comments on there and as well as other news articles about it from users who were confused about that. Um, some people if you lock the accounts, they're not going to be able
to get back into their accounts. They've changed email addresses, they don't have access to their email, their phone numbers changed, whatever recovery options you have for them aren't going to work. So, just be aware you're going to have to deal with that in some cases. Um, privacy concerns. Uh, if you tell me, you know, we found your password in Ashley Madison or we found your password in Fur Affinity, I'm going to say like, I'm not sure I want you to know that I was using that site. Um, I haven't heard of people specifically being concerned about this, but I'm not I would not be surprised if I heard um concerns from users being voiced and notification
fatigue. You know, if this is happening every month, I could certainly see that being a problem. Um, legal risks, the short answer is talk to your legal team because, um, you're dealing with stolen data. Um, there from a US law perspective, federal law perspective, uh, if you're not actively using that to compromise sites, you might be okay. Um, from a trade secret, intellectual property standpoint, it kind of depends on what data you have. Um, and of course, avoid actually testing those credentials. So, successes, um, you know, WordPress went out there, looked at Gmail data that was compromised, found 100,000 users that use the same credentials, and they were able to reset those people's passwords before they got compromised. you know,
Yahoo, um, Alex Stamos again mentioned that they in in some of the bigger password dumps they deal with, 10 to 20% of the entries match their users and they're able to get those passwords reset before it causes much trouble. And Twitter also felt like it was successful. So, I like to feature my niece in every year's uh, slides and this is she's plugging the password leak. Of course, this isn't going to solve all your problems um, when it comes to account takeover, but it's it may help you and uh, demonstrate your commitment. So, I'll open up for question and answer. I'll just kind of slowly tab through the references so we can get it recorded in
the video, but I also have a link to my slides at the end so you don't have to worry about writing these down. But any questions?
Uh, this is really more of a comment maybe to help. Um I recently did a uh analysis of the LinkedIn data uh against a large uh e-commerce site and uh just against the subset of uh commercial customers there were a about a 16% overlap uh between the LinkedIn data and their active customers. And of those approximately 300,000 users uh only 10% of them had changed the password since the LinkedIn breach. So that kind of gives you some sense of what the potential exposure is from these breaches and giving you the incentive to go, you know, to go down this road. Yeah.
Yeah. The customer angle um I guess I I agree only 10%. We tried resetting passwords, sent password reset email to about 4,000 people and about 200 people changed their passwords. So it's it's it's it's a tricky issue. Uh yeah, you can it's you can easily uh offput customers um from your website uh easily by doing that. Also an interesting observation um so all the um all the all the leaks that recently have been for sale for $1 and all that stuff. Uh it it seems like the information uh that's been sold now is probably the same one that uh maybe Alex Holden has uh has had for a number of years only now got into
somebody else's hands. Yeah. Yeah. A lot of them seem like it's people that are they've maybe not weren't even directly in involved with the hacking when it originally happened, but they were one of the, you know, inner circle of people that happen to have that private data and they've decided now, well, I might as well sell it and try to get some money out of it or share it or whatever the case may be. Um, so you mention briefly touched on um the possibility of storing user passwords in a weaker hash algorithm in order to compare against leaks or leak data. Um, do you have any advice for making your system robust such that um,
if you decide to follow that strategy, you don't make your own system more risky or more prone to a dump? Sure. And let me be clear on that. My advice is not that you store an MD5 password hash record for your users. It's that you hash with MD5 and then you use your scrypt or bcrypt or PBKDF2 on the resulting hash of that. The idea being that you're starting from a known place, the MD5 hash. So if you get more MD5 hashes in from a leak, you can then, starting from that step, put them through your stronger password hashing process and
then compare the results without having to crack those original MD5 passwords. But yeah, don't store an extra record with MD5 or SHA-1 or anything weaker than that. Okay. So, I'm going to stop it there and I will actually give you the opportunity to have lunch. When we get back again, the next two talks are really taking us into the psychology and linguistics area of passwords, and I'm really looking forward to those talks as well. So, be back here at 2:00. Okay. Thank you.
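The layered hashing idea from that last answer (MD5 first, then a slow salted hash over the MD5 result) can be sketched roughly as follows. This is only an illustration under stated assumptions: it uses PBKDF2 from Python's standard library as the outer hash, and the function names, salt size, and iteration count are hypothetical choices, not anything given in the talk.

```python
import hashlib
import hmac
import os
from typing import Tuple

ITERATIONS = 200_000  # assumed work factor, not from the talk; tune for your hardware


def md5_hex(password: str) -> bytes:
    """Inner step: MD5 hex digest, the form many leaked hashes arrive in."""
    return hashlib.md5(password.encode("utf-8")).hexdigest().encode("ascii")


def strengthen(md5_digest: bytes, salt: bytes) -> bytes:
    """Outer step: slow, salted PBKDF2 computed over the MD5 digest."""
    return hashlib.pbkdf2_hmac("sha256", md5_digest, salt, ITERATIONS)


def enroll(password: str) -> Tuple[bytes, bytes]:
    """Return (salt, stored_hash). The bare MD5 digest is never stored."""
    salt = os.urandom(16)
    return salt, strengthen(md5_hex(password), salt)


def leaked_md5_matches(leaked_md5_hex: str, salt: bytes, stored: bytes) -> bool:
    """Check a leaked MD5 hash against a stored record without cracking it."""
    candidate = strengthen(leaked_md5_hex.lower().encode("ascii"), salt)
    return hmac.compare_digest(candidate, stored)
```

Because each user has their own salt, a leaked MD5 hash would be checked against the record of the matching account (for example, the one with the same email address) rather than against every record at once.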
Is the password meter happening today or maybe
Hey. So, uh, hey, no brown M&M's. You should ask
the Android lock pattern analysis on how people create lock patterns on Android phones. So now I'm messing around with this Chinese student for her PG project. We are also looking into some password-related stuff, obviously. So we'll see next year if there will be any talks from that research as well. So yeah, do I have any... Jeff, do you have any funny password stories to tell? I'm kind of like, you know, this is the 10th time, so I think I've told the majority of my password jokes so far. We should just go back to that password thread on Twitter. Oh yeah, that one.
Yeah, a password walks into a bar and I forgot. You canot Yeah, I forgot it. Yeah, I did that on Twitter. I I just I just wrote, you know, a password explain that. Well, a password walks into a bar. And as an example, there was one um uh one allow any funny characters. Yeah, we don't allow any funny characters. That's one of them. Yeah. And you know, this actually sets the bar pretty low. Um you know, jokes like that. Yeah, we could also do for for biometric or for pin codes for that matter. A passphrase walks into a bar. So, uh actually you could do we allow special characters. We don't We do allow special
characters or we don't walk. Yeah. You have a big smile. You look You do look like an emoji. What are you doing here?
A passphrase walks into a bar and and and and the barman says, "We don't have space for you. Yeah, I'll be here all week. Okay, I I Okay, I get it. You know, you know, I'm going to refresh that thread on Twitter again, you know, because that was actually I mean there was It was really good. Yeah, there was lots of, you know, really crazy examples there. So, I shall rerun that one. Passes, passwords, pins, biometrics maybe. Uh I'm not sure. Biometrics. Global entry is just the worst. Global entry. Global entry. [Laughter] One of one of the things that I understand, you know, every time I go come here to the US, I have to give away
all my fingerprints at border control. But, you know, this could be related to the fact that I have been military police, so I know very well how to do a pat-down. But these TSA people, the few times I have been the subject of a pat-down, they don't seem to really know how to do it. I mean, are they afraid of touching something, or what is it? It's like, you know, do it properly. Well, maybe it's just me. Smoke afterwards. Well, I haven't, you know, I haven't
actually seen any of the agents qualify for that one yet. Maybe it's just me. Hey, it's 2 o'clock. So, this is Matt. He's been with us before at PasswordsCon here in Las Vegas. And like 20 minutes ago, he told us that he only pulls the doctor card when he's mad at somebody. But Matt actually has a PhD in password cracking and optimizing how he can crack your passwords. And he told me, he actually put it out on Twitter, that he was going to try to make this a little less academic, and I can only say thank you for that. And
I look forward to the talk. Go ahead. Thank you very much, Per. So I got my start, as he mentioned, basically when I decided to go back to college, and I was very lucky to be able to get involved with the E-Crime Investigative Technologies lab at Florida State University. The whole idea behind that lab is to do research and develop better forensics tools for law enforcement. And as you can imagine, cracking passwords definitely falls under that category. So that was a tremendous opportunity. I've since graduated and found myself a real job, and since I like that real job and I
want to keep it, I really need to mention that all the password cracking and password security research that I'm doing is on my own time, and so all these opinions are my own and not my employer's. So I was spending a lot of time trying to think about what I should submit here to PasswordsCon. What would be fun? And I just wanted to mention a couple other little hobbies that I'm doing right now when it comes to password security. One thing is that IBM right now is offering a real-life quantum computer free for any researcher that wants to use it. So you can log right onto this here. It's a 5-qubit
computer. So it's a real one. It's not like those D-Wave ones, where they're only somewhat kind of like quantum computers. It's basically the old mainframe days: you write up your program, you submit it to them, it gets queued, eventually they run it, and then they send you the results back. It's just a tremendous, wonderful opportunity. Now, quantum computers have the potential to really impact the password security landscape. I won't really talk about that because there have been other conference talks that have covered it. But what I really wanted to highlight once again was that this is free
and this is something that anybody can do. So I highly recommend that everyone here in this audience go out sign up and then just you know play around with this here a little bit. Uh even if you don't care about quantum computers just do it for the bragging rights you know have it open on your screen on your work when your coworker comes by you can be like and they ask you like what are you doing like oh I'm just programming a quantum computer you know what are you doing I mean are you still using a classical computer that's you know really lame. Uh so I mean that's just kind of highlight about that. Yeah, but that's when your
coworker then reminds you that they're a PhD. Um, so another uh you know um research topic that I'm very interested in right now is trying to figure out uh what a hashing algorithm for biometric uh measurements should look like. Uh because this is a really big problem right now. Um and I'm not talking just about you know your iPhone logging into that or potentially logging into a website with your fingerprint. I'm talking about these large databases of biometric information that's currently out there right now. I mean, for example, uh who here had their fingerprint stolen from an OPM hack at all? I knew I'd catch a couple feds that way. Um but it sucks, doesn't it? I mean, and
the problem with biometric information is once it's stolen, it's gone. Okay? There's no getting that back. You can't change it. It's just gone. So this is an area where we really need better research in order to try to limit the value that these biometric databases provide to an attacker. So that's all cool stuff. But I'm going to talk to you about modeling instead. I know this is generally not something that people really get super excited about, like, oh hey, we get to use different types of regular grammars and try to figure out how to go ahead
and uh model how people you know behave. Uh but it's something that I feel is very useful for probably just about everyone in here in this room. Uh because there's definitely offensive um you know uses for this here when it comes to you know developing better password cracking algorithms because if you can model someone and how they create passwords you can definitely you know weaponize that model in order to you know crack their passwords better. But there's also lots of defensive applications as well. So you can start talking about how do you make better you know password security policies um and how you know there's a couple other applications as well. So regardless of whether you're on the offensive side or
the defensive side, or you just want to do some more research, there's a lot of stuff here that I hope can be useful for you. It could very easily be said, though, that we don't need better modeling techniques. I mean, current attack techniques are amazingly good right now as it is. This was based upon the more recent MySpace leak: CynoSure Prime posted a write-up of how they were attacking it, and they managed to achieve a 99% success rate against all these MySpace users. I mean, that's really good. Whenever you can say that you cracked 355 million passwords, you don't need to up your game. Your
game's already up there. Okay. You don't need to really advance your techniques that much. But there still is a lot of room for improvement in areas where that improvement is really needed. For example, when you talk about computer forensics, hashes are getting stronger. Especially when you're talking about full hard drive encryption or other types of file encryption, you're really limited in the number of guesses that you can make. Or people actually start deploying Argon2, eventually, maybe. On the defensive side, we really want to go ahead and have these better models
so we can tell what's a strong password versus what's a weak password. Uh, so we can make more sane password policies. And then also I'll talk about Honeywords, which is a way to be able to make basically fake passwords in order to mess with people. Uh, but what it really comes down to is that we have all these different data sets that are getting released and that we're, you know, cracking and we're learning about. And a lot of times, you know, we'll go ahead, we'll run some statistics against them. You know, we'll make some great, you know, graphs. uh Per and Jeremy made this one here about the the LinkedIn list and there's a lot of little cool
tidbits in there and a lot of kind of useful information but kind of the question comes back to how are we going to actually make use of this information? Uh how can we go ahead and use this to improve our password cracking techniques or improve our defensive techniques in order to make uh people more secure. Uh so that's what I was really kind of curious about is how can we go ahead and kind of automate you know the the learning process of you know here we have this leak and how do we you know incorporate that into our knowledge of how people create passwords. So when I first really kind of started working on this research um my approach
was I wanted to make an automated rule generator. So I wanted to parse a list of cracked passwords and then just create mangling rules based upon that list in order to feed into John the Ripper. Hashcat wasn't really around at that time, but that's the general approach that I wanted to take. And I ran into a lot of different problems, which I'm going to talk about really shortly here. But it still is a very useful technique. So if you're looking for tools that do this type of thing, I would highly recommend either Bartavelle's rulesfinder or PACK, the Password Analysis and
Cracking Kit. The problem is both of those tools haven't really been updated in the last couple years. So if you're looking for a research project and Per hasn't already tasked you with something, I would say that this is an area where there definitely can be a lot more work done. Probably the biggest problem I was running into when trying to create an automatic rule generating kit was that the words that people use in passwords have very different probabilities associated with them. A whole lot of people use password. Some people use
monkey or baseball, and then you have people down there using, like, piano or zebra. And you really want to be able to represent that, and it's hard to do in a lot of the password cracking sessions that you run. So for example, if you want to use the entire KoreLogic rule set that they released, you have to run a really small dictionary against it if you ever want it to complete. So a lot of times, in order to crack passwords, I'll have a very targeted input dictionary that I'll run really advanced rules
against and I'll have a much larger input dictionary that I'll run, you know, much more targeted rules against in order to try to do that. Uh, but that's a really it's a big pain in the butt. I would like to be able to just go ahead and run my cracker and have it be able to take into account the different probabilities of these different words. So that way I do a lot of advanced mangling rules automatically with these really very common passwords and then you know less mangling rules with the less common passwords. Also combining different mangling rules is kind of tough especially when you're talking about learning from a data set. So doing a very basic idea of like okay
you know people put digits at the end or they you know capitalize the first letter uh is pretty straightforward but when you start combining it so like okay they capitalize a second letter in the word and they replace the a with an at symbol and now they add digits to the end and then they have a special character. Um that's really hard to go ahead and figure out how to combine those different rules uh together. And when you start to do that too, the number of rules that you have just absolutely explodes where all a sudden you're generating millions, if not billions of rules. And I'm not exaggerating here. That's actually what's going to happen when we start
trying to go ahead and uh do this types of modeling. Also, when you do a lot of this training, it's often not very predictive. So you might not see something exactly in your training set, but you'd want to try that in your cracking session. Uh so you want to be able to support that. Um now with some of these bigger data sets, that's not as much of a problem. I mean, when you have 300, you know, 60 million passwords, you're probably going to see just about everything. Uh, or if you don't see it, then you generally don't care about it that much. U, but when you're trying to make a more targeted rule set against,
let's say, a corporate entity, you might only have several thousand passwords, and you want to be able to learn based upon that. And so that's what really drew me to probabilistic context-free grammars. Because when we talk about how people create passwords, we generally always talk about probabilities. So we say, okay, 90% of people capitalize the first letter of their password, or 80% of people put digits at the end, or 5% of people use their dog's name in their password. And so these are all probabilities. So it'd be really nice if we could take all these probabilities that we're talking about, the
probabilities that we learn from these different, you know, data dumps and and directly incorporate them into a model and then generate password guesses based upon those probabilities. So that way we don't actually have to manually create all these different rules. We kind of let you know that the probabilities kind of sort it out for themselves. And oh, sorry. Yep. Kitten. Oh, yep. That was def definitely a good kitten there. That's pretty much my my cats. Oh, it is kitten twice. Kitty. No, the instead of Oh, that's uh that's that's intentional there. I know. Yeah, but we don't get it. Sorry. Oh, it's a it's a kind of a LOL cat speak where they'll go intentionally make it
the cats speak incorrectly and not use proper grammar, which I'll admit is probably a little ironic since I'm talking about grammar here and proper grammars. Good. Yep. So I can talk about all this and how probabilistic context-free grammars are good, but really the proof is that when other researchers have done studies on this, probabilistic context-free grammars generally tend to mimic how people create passwords better than alternate methods. This was done by Carnegie Mellon University last year when they were testing different password cracking attacks, and they have
basically a password guessability lab that you can submit passwords to and have them run these attacks against it, and probabilistic context-free grammars really did better than just about everything else. And that includes, let's say, the pros: the KoreLogic team was brought in, and they definitely know a lot about passwords. And while they eventually were able to crack more passwords, if you really want to talk about just how accurately it modeled, the precision of the grammar, PCFGs pretty much did better than anything else. So that's my spiel on why this probably is going to be interesting to you guys, hopefully.
So I'm going to have to do a little bit of academic stuff here, and I apologize. But what I really wanted to do was talk a little bit about what probabilistic context-free grammars are, because there's a lot of misconceptions, and probably the biggest misconception is that they're complicated and require lots of math. It's kind of great when I get up and talk to people and they're like, "Oh, he's talking about probabilistic context-free grammars. He must be smart." Not true. Well, hopefully it's true, but it doesn't automatically flow from that. Probabilistic context-free grammars are actually probably easier to use than Markov models. I mean, there's
really no math required when you talk about context-free grammars themselves. It's more like Mad Libs. Any type of grammar can be defined by four different things: non-terminals, which can be replaced and are basically variables; terminals, which are the constants and can't be changed; replacement rules mapping variables to the constants; and your start variable that you start out with. And what you do with a context-free grammar is you start with your start variable and, like Mad Libs, you just keep replacing until everything that you have is a constant. So let me go over an example of that. Let's start with this very basic
context-free grammar that's kind of modeled after passwords. You have a start variable S that goes to letters and digits, an L that goes to three different words, and a D that goes to three different digit strings. So with the context-free grammar you start with your start variable and then you just go ahead and replace it. So you replace that with L and D. Now, since it's context-free, each of these two variables can be replaced independently of the other. So if I want to, I can go ahead and replace that L with anything. So for
example I can replace it with baseball, and I take that D and I replace that with 82, and since everything now is a constant, I can't replace it anymore. Now I have my final terminal. So as I said, it really is just like Mad Libs. You just keep on doing this. And while you can get more complicated with your grammar itself, essentially all you're doing is just this right here. And that's context-free grammars. Now, when we talk about probabilistic context-free grammars, unfortunately, there's a little bit of math involved. But really, the only math that you do is multiplication. Nothing more complicated than that whatsoever. There's no calculus or anything else.
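The Mad Libs style replacement just described, together with the multiplication step, can be sketched in a few lines of Python. This is a toy illustration, not the speaker's tool: only "baseball" at 10% and "82" at 25% are taken from the talk's worked example, while the other words, digit strings, probabilities, and the function names `expand` and `ranked_guesses` are assumptions made up for the sketch.

```python
# Toy grammar: non-terminals S, L, D; everything else is a terminal.
# Only "baseball" (10%) and "82" (25%) come from the talk's example;
# every other word, digit string, and probability here is an assumption.
RULES = {
    "S": [(("L", "D"), 1.0)],
    "L": [(("password",), 0.7), (("monkey",), 0.2), (("baseball",), 0.1)],
    "D": [(("123",), 0.5), (("82",), 0.25), (("1",), 0.25)],
}


def expand(symbols, prob=1.0):
    """Yield (guess, probability) for every way to replace all variables."""
    for i, sym in enumerate(symbols):
        if sym in RULES:  # leftmost variable: try each of its replacements
            for replacement, p in RULES[sym]:
                rewritten = symbols[:i] + replacement + symbols[i + 1:]
                # The running probability is just multiplied by the rule's.
                yield from expand(rewritten, prob * p)
            return
    # No variables left: everything is a constant, so this is a final guess.
    yield "".join(symbols), prob


def ranked_guesses(threshold=0.0):
    """Guesses at or above a probability threshold, most probable first."""
    guesses = [g for g in expand(("S",)) if g[1] >= threshold]
    return sorted(guesses, key=lambda g: g[1], reverse=True)


if __name__ == "__main__":
    for guess, p in ranked_guesses():
        print(f"{guess:14s} {p:.3f}")
```

Calling `ranked_guesses(0.15)` instead keeps only the guesses above that cutoff, which is one simple way to bound the search space; a real cracker would generate guesses lazily rather than enumerating and sorting the whole grammar.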
You're not doing matrix multiplications or anything. It's just standard multiplication. And what you do is that every single replacement that you have has an associated probability with it. So that the final probability of your terminal is the product of every single replacement it took in order to get to that terminal there. So just kind of going through that previous grammar that we had there. Uh you start with your start symbol and probability of that's 100% because that's where you always start. And this grammar here there's only one replacement. So the probability of the next replacement there is still 100%. We want to go ahead and replace the L with baseball. The probability that's 10%. So 10% times 100% is, you know, 10%
for this current running probability that we have. We replace the D with 82, which has a probability of 25%. And 25% times 10% is now 2.5%. And what this means is that the probability of this grammar generating the string baseball82 is 2.5%. And this is incredibly powerful. So if you're cracking passwords, it's pretty obvious to see how you'd use this probability. You want to start with the most probable password guess, try the second most probable one, then try the third most probable one, and so on. Or if you're trying to do it a little bit faster and parallelize it, you can take some shortcuts and instead
say, I want to go ahead and generate every single password guess that falls above a certain probability threshold. Uh so that way you can kind of limit the key space that you're searching uh against um while still you know making use of this grammar. It can be used also for like password policies for defensive purposes where you basically go and tell someone hey I'm sorry your password's probability is just way too high uh because you actually have a probability associated with that password that they just created. Um, and what's more, you can go ahead and and suggest other alternate replacements according to your grammar that would reduce the probability of their password uh down to an acceptable
level. So that way people can go ahead and create passwords kind of the way they want, or at least somewhat related to the way that they want, while still meeting your security requirements. Yep. Why wouldn't it... certain digits may correlate differently with the words? Okay, so that's a good question. So the question was, how does this capture, let's say, the probability that certain digits might occur with certain words at a higher or different probability? And the short answer is, at least when using context-free grammars, it doesn't. And the reason why I chose context-free grammars versus context-sensitive grammars is
mostly because the training is much easier, and when it comes to passwords I found that there's very few context-sensitive replacements. People might choose their dog's name, but then they don't choose, you know, 82 related directly to their dog, for the most part. It might occur occasionally with, let's say, your daughter's name and then her birthday or something like that. So there might be some context sensitivity, but that's really hard to model, unfortunately, and I'll actually talk about how you can cheat a little bit and get some context information and kind of mash it into a context-free grammar. So that's something that I've struggled with. But
the long story short is that in most passwords, I found anyway, you don't have those context-sensitive replacements. And then the other thing I'll talk about a little bit is just honeywords, which as I said are ways to be able to create fake passwords that look real. So that's kind of the end of the academic side of things. I want to talk more about the practical things because I promised I'd do that. So this has been around for a while. And about last year I decided I wanted to completely recode my entire probabilistic context-free grammar. Part of the reasoning behind that was that it was becoming very hard to
modify. I had basically baked in the um um a particular type of grammar in order to model how people create passwords and that was limited and I wanted to make changes to it. And what was actually really driving that was I was trying to get into uh passphrase uh modeling with contextfree grammarss. Uh, so I was uh scraping Per's Twitter account and his blog post and stuff like that uh in order to try to make a paraphrase generator. Uh yeah, I mean so actually some get GitHub uh results there. Um but basically I was breaking my code left and right. I was getting really annoyed by it. And what so I really wanted to do was kind of go
back and be able to make a kind of a cracker uh that uh would be able to work with any generic contextfree grammar. And what this means is I could really focus my time on trying to mess around with the modeling of how people create passwords. And I just have a tool that would just kind of work in order to be able to uh test see how effective those models were when uh cracking people's passwords. Uh so it's coded in Python so that way it'll go ahead and you know run pretty much wherever you want it to go. Um it works with both John Ripper and Hashcat. Uh it is open source so you can
go ahead and grab it and modify it to your heart's content. Uh, and there's a lot of documentation I've been trying to put up there on the wiki itself. And so, kind of going back to, you know, why was I doing this, there's a lot of improvements in the the grammarss that I'm going to be talking about. So, um, beyond from what I was doing before, you know, another group at Florida State University decided to go ahead and modify the grammar a bit. And they had absolutely huge, you know, improvements in the results of just going ahead and changing how the grammar was laid out. And I think there's a lot more room for improvement even beyond this here. So
what I really want to be able to do is allow other people to go through and make their own grammars, make their own training programs, and then have the tool for them to test how well they work. So I include one default training program on the GitHub page right now. I'm probably going to release a couple more as they mature a bit. But basically what you do is you run it on your list of cracked passwords, and it'll generate a grammar for you automatically. The only thing that I really recommend is that you install the chardet character-detection Python package when you do this. It's not required. It'll give you a warning, which most people will probably ignore. But as we learned from last year's KoreLogic Crack Me If You Can competition, dealing with different encodings in your training set is a real pain in the butt. I found this Python package actually handles a lot of that for you automatically, which is really nice, because we've gotta show some love to the non-English speakers out there too.

So I don't expect you all to read this. But basically, when you run this, it creates a config file that describes the grammar that you have. You'll start with your start symbol up here, it'll have different replacement values in there, and it'll actually say, okay, your eight-character letter strings are stored in this directory. And this goes back to the fact that you can now modify your training program to produce completely different types of grammars, and the cracker will automatically read this in and dynamically use it. So that way you can use the same password cracker, or rather the same guess generation tool, for, let's say, passphrases versus password cracking.

So now I'm going to talk a little bit about the basic model that I'm using right now to model how people create passwords. And what I expect when I go through this is good questions like the one over there, where basically you ask, wait, there's got to be a better way to do this, that seems kind of ad hoc or crazy, there's totally a better way to represent something. And that's what I kind of want, because, quite honestly, for every single choice I made here, there's a
couple of alternate choices I could have made. So I really want people to go out and make their own training programs, make their own grammars, and then make a better model than what I have.

So right now what I do is I first scan the password and break it up into what I call a base structure. This kind of resembles a Hashcat mask, where you break everything up: okay, it's alpha characters, digits, or other. And I associate a number after each as well, so this is, like, a three-letter word or a four-digit number. The reason why I associate those numbers directly after it is so I can represent, say, two four-character words versus one eight-character word in my base structure. So for example, cat123 would be parsed as an A3 D3, or a password with a leading symbol, a nine-letter word and 16 on the end would be parsed as an O1 A9 D2.

So the next question, though, is what about capitalization, because every once in a while someone capitalizes their password even if they don't have to. What you can do is have a replacement go to another replacement. So basically you can have a rule that your alpha characters go to essentially a capitalization mask that gets applied to your password as well. And those capitalization masks have a probability associated with them too. So that way your A3 can go to, let's say, uppercase-the-first-character 10% of the time, and then that will go to, let's say, cat, a capitalized Cat, 80% of the time.

Now you can get even further into that as well. We can assign probabilities to individual words, and this is really powerful. That way we can say exactly how probable the word password is and how often we want to use it in our test set. And what's cool about context-free grammars versus some other techniques like Markov models is that you can dynamically change this without having to retrain on your entire data set. So let's say I train this on RockYou. I can go back later on and say, okay, I'm attacking some other data set, so I'm going to change the probabilities of these words and make these other words associated with that data set really high. That way we can start directly using things like your target's pet's name or the name of
the site that they go to, and so on. And we can do the same thing for digits as well. So that way you can raise the digits for their birthday or their zip code to be much higher than you would normally see from your training set.

So the next improvement I've been making is modeling keyboard walks, because those are really popular, especially in your corporate environments. I mean, if you're giving these people these crazy password requirements, that's obviously what they're going to do. It's much easier to remember a keyboard walk than some random passphrase. So I added a new non-terminal to the base structure, K, just to represent keyboard walks. So now we have alpha characters, digit characters, keyboard walks, and other. For example, 1qazpass would be a K4 A4, or you could have multiple keyboard walks, so you have multiple K4s back to back. That represents how people have a tendency to do a keyboard walk here, a keyboard walk here, and a keyboard walk over here sometimes.

Now, one cool thing about this is that during training it basically creates a dictionary of all the different keyboard walks it finds. So if you want to, you can rip this data out of the grammar and use it as an input dictionary for your normal password cracking attacks as well. You don't actually have to use a context-free grammar to make use of this data. And there aren't a whole lot of good keyboard walk data sets out there, so this makes a pretty easy way to get one.

Now, one thing I want to point out, though, is that you have to be really careful in training this, because there are a lot of words out there that look like keyboard walks that aren't. And you don't really realize that until you start running this and you're like, why the heck is Drew one of the most popular keyboard walks? Well, it's a name, but if you look at your keyboard, it's a keyboard walk as well. So you do have to start adding some whitelisting for that. One other thing that I've been dealing with that causes a lot of false positives is that I was trying to include hitting the same key multiple times as part of a keyboard walk. I don't know how much people actually do that. I've been looking through the data sets, and it's not really that common, it
seems. So I have to admit, in my training I'm probably going to just exclude that so I don't produce a keyboard walk like tree, which once again looks like a keyboard walk to the algorithm.

So, as was mentioned earlier, while rare, there occasionally are context-sensitive replacements as well, and you kind of want to shoehorn those into your cracking session. Probably the most common examples of that are emoticons. You don't want the funny smiley face to be parsed as an A5 S1 A1, where it's a word, a special character, and then another word. Also things like #1, I mean, that's a very common context-sensitive replacement, and you want to be able to model that. So what I did was I really cheated, that's what it comes down to. I created a new class that says, okay, these are context sensitive, I'm going to manually enter those in my training data set, and I'm just going to treat them as a whole other replacement and add them together. That's pretty limited right now. As I said, there's not a whole lot of context-sensitive data I've been seeing in the training sets, but if you want to extend that, you can start adding all sorts of emoticons there if you really want to.

So, letter replacements. This is actually something I'm still working on right now. It's proved to be a little bit trickier than I originally expected. But one way to handle this is to treat it as another mask for your replacements, just like we do with capitalization. So that way your three-letter word goes to your capitalization mask, then that goes to your word, cat, and then that goes to potentially a replacement at a certain probability, like, let's say, replacing the a with an at symbol. Unfortunately this does complicate the grammar quite a bit by adding another level into it. But what I'm really struggling with is that training is difficult for this, because you have to have a way to differentiate that cat1hat is not a replacement, it's just a digit added right in the middle. And then you can have words that have multiple replacements as well, like password with the at symbol and the zero. So that's still something to really work on. Luckily, not a whole lot of people still use replacements. It shows up if you're really trying to crack passwords, but it doesn't hurt the grammar that much not to have them. Really, I just want to be able to model this because it is something that does occur.

So that's the current grammar. If you download it right now and run it, that's the grammar it's using in the back end. But I wanted to take just a couple of minutes to talk about potential grammars that I've been thinking about and eventually might get off my butt and start trying to code, because I think there are much better
ways that we can go about doing this. What I'd really like to do is take a step back, get rid of those base structures, and have a grammar that models the thought process people use when they're creating a password. Because people don't immediately create a mask and start filling it out when they create a password. You say you want them to create a password, and they immediately go to a kind of coping technique to create it. They say, okay, I want to create a dictionary-based password, so, the name of my dog. Or I want to create a passphrase, or I want to use a keyboard walk, or you can get things like, I'm going to use my email address. So that can be the very first step in your context-free grammar: figuring out which attack method you really want to model.

Next after that, let's say they choose a dictionary-word-based attack. The next replacement would be: do you want to choose a word from your input dictionary, or do you want to dynamically generate a word from, let's say, a Markov model? I'm going to talk about this in a little bit, but probably the biggest weakness in current context-free grammars is that they're really dictionary based, really dictionary focused, and our input dictionaries really suck. That's where you start seeing Markov models and neural networks being much more effective over a longer cracking session than context-free grammars. The reason for that is that your input dictionary has basically exhausted all the words that it could possibly crack. When I start looking at the passwords I didn't crack, a lot of them use really basic mangling techniques. It's just that I didn't have that word in my input dictionary. So you need to have a way to dynamically generate those, because trying to keep an input dictionary up to date with all the new Pokemon characters is just a losing proposition.

But let's say you choose a word from your input dictionary. The next step in your context-free grammar might be to apply required digits. Then you can have a probability of appending the digits to the end, inserting the digits at the front, or performing a letter replacement, and you can have other types of replacements as well. At the same time you'd be applying required special characters if need be, or checking for length requirements as well.

Now, for some of you formal grammar fans out there, you might be looking at this and saying this looks a lot like a finite state machine. Okay, I know there's probably, like, one of you, but the answer is yes. One nice thing about context-free grammars is you can model finite state machines with them. So this way we can start testing out how effective this is, and then if this turns out to be a really effective way to model people's
passwords, we can then go back and make a much more simplified model using a finite state machine. That way we can potentially make it faster or make it smaller. Actually, I can't cover this part here, but as I said, one of the biggest problems with context-free grammars is the fact that they are dictionary based. Probably the best example of a context-free grammar that handles brute force really well is actually the PRINCE attack in Hashcat and John the Ripper. What this does is it just takes an input dictionary and starts combining all the words in that input dictionary together. Where it performs essentially a brute force attack is that if you have one-character words in your input dictionary, it's going to start combining one-character words together, and it essentially does a very basic brute-force-style attack while at the same time adding digits to the end of your longer words. So that's one approach, but I think that we can probably do better and start using some of these more advanced Markov models as well to make it just a little bit more precise. So that's definitely on my to-do list.

So, I'd mentioned passphrases earlier. This is one of those big problems. One question I had was, can I use a context-free grammar to dynamically generate passphrases for use in password modeling? Probably the best example of a context-free grammar generating passphrases is SCIgen, which is MIT's CS paper generator, which is hilarious, by the way. Basically some researchers coded this up and started making fake scientific papers and submitted them to conferences to see whether they would be accepted. Might try that sometime with Passwords Con here. The Nmap one, if you've ever read that one, is hilarious. I probably won't talk too much about it because it's not safe for work, but it's a very good example of that. So I thought there might be some potential there.

As I said, I started trying to model this, and one problem I ran into was that your generic model of taking, like, a proper noun, verb, adverb and noun, even when applying some probabilities to that, just creates too large of a search space. It just wasn't very effective at all. So a successful grammar probably needs more context-sensitive information than a pure context-free grammar can give you. But as I said, one way that you can start getting more information is to really target this against a particular individual: not saying I want to generate passphrases for everybody, but I want to base these passphrases on one particular writer. One thing I highly recommend, and I have a little bit of code for it in my GitHub repository, though it's in kind of an iffy state, is to use what's called the Natural Language Toolkit, which is a Python library that you can run a corpus of data through, and it'll automatically parse it down into nouns, verbs or adverbs and break it up into different structures. So that way you can use this in your training set. One thing I really want to try, though, a much newer toolkit, is Google's new Parsey McParseface. I'll admit 90% of that is just due to the name of the tool. But basically they released this as part of their TensorFlow toolkit, and it does a lot of the same types of things, parsing an input dictionary and breaking it up into these different types of grammar structures.

So the next question, though, is, okay, we have this great and wonderful grammar, we spent all this time training it, we spent a lot of time trying to figure out how it works.
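As a rough illustration of the passphrase training idea described above, here is a minimal sketch that assumes the text has already been tagged with parts of speech. The tiny hand-tagged corpus, the tag names, and the function names are all hypothetical. In practice a tagger such as the Natural Language Toolkit would produce the (word, tag) pairs from the target's own writing.

```python
from collections import Counter, defaultdict

# Hand-tagged toy corpus (an assumption for illustration); a real run would
# take these (word, tag) pairs from a part-of-speech tagger.
TAGGED_SENTENCES = [
    [("dogs", "NOUN"), ("chase", "VERB"), ("cats", "NOUN")],
    [("cats", "NOUN"), ("eat", "VERB"), ("fish", "NOUN")],
    [("birds", "NOUN"), ("sing", "VERB"), ("loudly", "ADV")],
]


def train(tagged_sentences):
    """Count sentence structures (tag sequences) and the words seen per tag."""
    structures = Counter()
    words = defaultdict(Counter)
    for sentence in tagged_sentences:
        structures[tuple(tag for _, tag in sentence)] += 1
        for word, tag in sentence:
            words[tag][word] += 1
    return structures, words


def probabilities(counter):
    """Turn raw counts into relative frequencies."""
    total = sum(counter.values())
    return {item: count / total for item, count in counter.items()}


structures, words = train(TAGGED_SENTENCES)
structure_probs = probabilities(structures)   # plays the role of base structures
noun_probs = probabilities(words["NOUN"])     # replacement values for one tag
print(structure_probs)
print(noun_probs)
```

The tag sequences stand in for base structures and the per-tag word counts stand in for replacement values, so the same guess generator could in principle consume them.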
So how do we actually make use of this grammar? How do we get from the academic side back to more of the practical? I'm going to start with the defensive applications first, because I really want to push those. One that's been really fun is honeywords. Honeywords are a kind of synthetic password: you algorithmically generate these passwords, and you can use them for different types of opportunities. They were originally thought of as being like a canary type of alert sensor for your data sets. You create these fake honeywords, and then if you ever see someone try to log in with those honeywords, you know you've been pwned. So you can set that up as a kind of alert. I have a few disagreements with how it was laid out in the original papers that talked about it, but one really cool tool I saw out there was this one right here, which basically installs a client on people's computers and installs a fake account or password into the computer's memory. This password is unique to that computer. It's a valid password, and you could use it to log in. It shows up in memory. If you logged into the machine and dumped the credentials, it's going to look pretty much normal, but that password is only valid on that one particular computer. So it won't help you actually move laterally. But if someone tries to authenticate with it against the Active Directory account, it'll send out an alert. That way you can start detecting some of the lateral movement that's going on in your network if someone tries to pass the hash with the wrong hash. So that's a really cool tool, but part of that is you have to be able to make a password that fools an attacker, so that they think it's actually a real account versus something that's set up to fool them, because they're going to see the client tool on the computer that's inserting those passwords as well.

Other uses for that: if you're setting up a huge honeypot, if you want to create a fake Active Directory account or a server for people to break into, you don't want to have to hand-jam a thousand passwords into it. So this is a really quick way to generate those passwords and have them look fairly legitimate. Also, if you're going to run a password cracking competition and you need to create a large data set that looks somewhat real, this is probably one way to do it. You can run it, and these passwords are going to look a lot like the passwords that you trained on. I didn't put this in the bullet points, but if you just want to mess with people, you could create fake password data dumps too. So there are just a lot of fun things. This is one of those things where originally, when I started playing with it, I was like, yeah, it's not really worth that much, but it's really easy to implement, so why not? And the more I've been playing with it, the more fun I've been having with it.

So included in the tool set is a tool to generate honeywords from whatever grammar you create. You basically just tell it which grammar you trained and how many passwords you want it to generate, and it'll go through and do that, and it's actually pretty fast too. So you can generate as many
passwords as you want.

The other thing is password policies. As I mentioned earlier, if we can assign a probability to a password, we can then say, okay, your password is too probable. Where this is really nice, though, is that, as I said, you can recommend passwords that are very similar to the password the person created but have maybe one or two transitions that are different from how they selected it, which would lower the probability of their password. That way they can create a password that's kind of the way they want it, but it's a little bit more secure. Now, how you recommend these transitions has to be randomized too, because you don't want everyone who chooses the same password to automatically get the same replacement. But you can start to do this in order to stretch out the search space an attacker would have to cover to crack them, while at the same time hopefully not infuriating users that much. And going back to the keynote speech, you can start teaching people what makes a secure password and what doesn't by showing them examples of this. So when you show password83 versus kitten83 and ask, are these passwords the same probability, people will be like, okay, kitten83 might be a little bit lower probability.

That being said, about one week before I presented here, Carnegie Mellon University came out with this amazing paper, absolutely amazing, about modeling passwords with neural networks. The one nice thing about them is that the resulting neural network they can send out is really small, so you can incorporate it in, let's say, a browser. So I would say, if you're really interested in making more secure password policies, check out their work as well. They're going to be presenting this at, I think, SOUPS in a couple of weeks. USENIX? Okay, sorry. Yep. So that would probably be the better way if you want to create better password models for password policies.

That being said, after my depression wore off and I had a chance to really read through that paper a couple of times and realized that I... What are you doing here? Yeah, I know, that was exactly what I was thinking. I had a chance to look at this, and there are pluses to using probabilistic context-free grammars, and probably the best one is that I can incorporate their work directly into my grammar. I mean, there's a Star Trek convention going on right now, so I can be like the Borg, essentially. I can take their techniques and, just like with Markov models, incorporate them in order to basically generate better strings. So actually this makes me very excited. Unless they can show that their entire model is better than using probabilistic context-free grammars at all, it's something that can basically make my tools better.

So what you're probably all interested
in, really, though, is cracking passwords with probabilistic context-free grammars. The first thing I have to say, and this is probably going to be disappointing to everybody, is that the current implementation is god-awful slow. Okay, so when I run it, it generally generates about 50,000 to 100,000 guesses a second. Now, that may seem like a lot unless you're used to using your Hashcat cluster, where you're making billions of guesses a second. So if you're talking about attacking a fast hash, don't use this, okay? You're much better off using even pure brute force or something like that, because you're making so many more guesses that the increase in precision just isn't worth it. Where this is useful, though, is when you're dealing with hashes that are really slow. So let's say you have a large list of salted hashes, or you're trying to attack something like full hard drive encryption, or you're doing something where the slowness of the guess generation doesn't really impact your hashing, because the hashing just takes so much time. So there are definitely still uses for this. The performance of this is very dependent upon the grammar. I won't go into that too much here. But
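The fast-hash versus slow-hash point above comes down to which stage is the bottleneck. The following is a back-of-the-envelope sketch only: the generator rate is the upper figure quoted in the talk, while the two hash rates and the function names are hypothetical values picked to show the two cases.

```python
def effective_guess_rate(generator_rate: float, hash_rate: float) -> float:
    """Guesses actually tested per second: the slower stage sets the pace."""
    return min(generator_rate, hash_rate)


def hashing_capacity_used(generator_rate: float, hash_rate: float) -> float:
    """Fraction of the available hashing speed the generator can keep busy."""
    return effective_guess_rate(generator_rate, hash_rate) / hash_rate


GENERATOR_RATE = 100_000          # guesses/s, upper end quoted in the talk

# Hypothetical hash speeds, chosen only to illustrate the two cases.
FAST_HASH_RATE = 10_000_000_000   # a fast hash on a GPU cluster
SLOW_HASH_RATE = 5_000            # a deliberately slow hash

for name, rate in (("fast hash", FAST_HASH_RATE), ("slow hash", SLOW_HASH_RATE)):
    used = hashing_capacity_used(GENERATOR_RATE, rate)
    print(f"{name}: {used:.4%} of hashing capacity kept busy")
```

With the fast hash almost all of the hardware sits idle waiting for guesses, while with the slow hash the generator is no longer the limiting factor, so the better guess ordering is essentially free.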
basically, depending on how many replacements you have and how many values map to a certain probability in those replacements, it generates essentially very fine-grained word mangling rules. So the less fine-grained you make it, the less overhead you have in trying to generate guesses in probability order, and the faster the whole thing runs. If you're interested in that, I can talk to you more about it later. Long story short, though, you can totally crack passwords with this right now, but it really is geared more as a research tool.

That being said, probabilistic context-free grammars don't have to be slow. I mentioned earlier that PRINCE is really just a probabilistic context-free grammar, just a very specialized one, and I don't think anyone has said that anything Atom has ever written is slow. Okay. So you definitely can take a grammar and go back and make it much faster if you want to.

Also, when talking about how to make these grammars, you don't really have to make things really fine-grained. There's this really fun law called Zipf's law. I just love it because of the name, but basically it says that the frequency of a word is inversely proportional to its rank. What that means is the most probable word is used a whole lot, the second most probable word is used a little bit less, but then you start getting down to this tail end here where all the probabilities of the different words look pretty much the same. And if we just flip this around, it looks like just about every single password cracking session ever. So the question was whether Zipf's law applies to passwords. There actually has been a paper on this, and it says yes, passwords follow this type of distribution, which is something we should all know: password123 is really common, zebra123 not so much. What this means is that modeling the probability of these first ones here is really important, but after that you can just be like, yeah, screw it, everything's about the same probability. And this is a way that you can make the grammar a little bit more effective, or at least scale a bit better in performance.

Also, Carnegie Mellon University, they've definitely been doing their own work on probabilistic context-free grammars as well. Their approach to reducing the overhead was to spend all the time up front generating basically these rules, then save them to disk and sort them, and then it's already there. So all that overhead just goes away and they can make guesses at a much faster speed. I mean, it's not like we're using those disks for rainbow tables anymore, so we might as well use them for something. So that way we can start using these precomputation attacks as well. The problem is that when you do something like this, you lose the ability to dynamically change your grammar after the fact. So if you want to model it towards, let's say, a particular target, this precomputation attack is not going to work quite as well.

So to deal with this, I'm starting to look into moving the grammar onto the GPU. The current PCFG cracker that I have right now tries to generate all the guesses in probability order. It'll generate the first one, then the second one, then the third one. On the GPU, you really don't care. You just want to generate all the guesses that fall within a certain probability threshold. So that way you can parallelize it. So what
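A minimal CPU-side sketch of the threshold idea just described, assuming a toy grammar for a single word-plus-digits base structure. The values, probabilities, and function name are hypothetical, and this is not the tool's or Hashcat's implementation. Instead of emitting guesses in strict probability order, it emits every guess whose probability clears a cutoff, in whatever order the walk happens to reach them.

```python
from typing import Iterator, List, Tuple

Slot = List[Tuple[str, float]]

# Toy grammar for one base structure (a word followed by digits).
# Values and probabilities are invented for illustration.
SLOTS: List[Slot] = [
    [("password", 0.5), ("kitten", 0.3), ("zebra", 0.2)],
    [("123", 0.6), ("1", 0.3), ("83", 0.1)],
]


def guesses_above(slots: List[Slot], threshold: float,
                  prefix: str = "", prob: float = 1.0) -> Iterator[Tuple[str, float]]:
    """Yield every (guess, probability) with probability >= threshold.

    No ordering is promised. A partial guess can only get less probable as
    more slots are filled in, so a branch is dropped as soon as it falls
    below the threshold.
    """
    if prob < threshold:
        return
    if not slots:
        yield prefix, prob
        return
    for value, p in slots[0]:
        yield from guesses_above(slots[1:], threshold, prefix + value, prob * p)


found = dict(guesses_above(SLOTS, threshold=0.1))
print(sorted(found))
```

Because each branch is independent of the others, the work can be split across many workers, which is the property that matters for a GPU. The cost is that guesses within the threshold are no longer tried from most to least probable.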
I'm doing is starting to modify Hashcat. Hashcat's open source, which I'm really psyched about, and version 3.0 is really cool too. I'm basically copying some of the techniques that Hashcat already uses in its Markov mode, which is kind of a modified beam search: you look at the different replacements, only do the first few replacements up to a limit that you specify, and skip everything else. This way you don't have to do any math whatsoever on the GPU, which is really big. It also reduces the size of the grammar, because any single
replacement that falls below that threshold, you don't even need to load onto the GPU at all. So basically we're trading precision for recall: we're trading the ability to make guesses in probability order for the ability to make guesses massively parallelized. Now, I'm only in the very starting stages of this. I've forked Hashcat, and I've basically been breaking it left and right. It's going to be a long time before I actually get this working. But this is definitely something that I think is coming down the road, if not by me then by somebody else who can actually program well. So, as my closing
thoughts, just to show you the differences in the grammars from last year: I ran my previous grammar, trained on RockYou and targeting phpbb.com. This is John the Ripper's incremental mode, which is kind of a Markov-based attack. This is the previous probabilistic context-free grammar, and here's the one that we currently have released. So it's definitely getting a little bit better. It's only 1 million guesses, but still, for 1 million guesses, all of a sudden we're cracking 35% of the passwords in this data set that's completely different
from the data set that we trained on. So when we talk about things like salted hashes, where it really helps to crack a lot of them quickly because that reduces your workload for making future guesses, this is something that's nice to run first, before you start moving into your more traditional attacks. Now, when we expand this out to, let's say, 50 million guesses, you notice it really hits this plateau. As I said, that's because of the limits of the dictionary itself, where the words just aren't matching up with the uncracked passwords that we're seeing
here. So I'm hoping that once I integrate Markov modes, we can start getting this to look more like a Zipf's law type of distribution, because right now it's just kind of hitting a wall. But that being said, when you look at the difference in how long it takes for the previous method to catch up to it, right here, it's about 1/40th the time in order to crack the same number of passwords. So this is where I'm really big on making the grammar itself more precise, because even though it's not
as fast, or you're not getting those speed bonuses, you're reducing the cracking session time you have to run because you just have to make fewer guesses. So that's about it. If you have any questions, feel free to contact me on Twitter. I have a tendency to use it like IRC, I really apologize; I'll just post like 20 things. Or if you have questions, feel free to ask them now. Questions? Okay, just a couple of quick points. I don't think that your context-sensitive non-terminal is a cheat, or that it should be called context-sensitive. Oh, if someone's saying I'm not cheating, I'm happy with that. Yeah.
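
The trade-off described in the talk, giving up strict probability order in exchange for guesses that can be generated independently, can be sketched roughly as follows. This is only an illustration under stated assumptions: the toy grammar, its probabilities, and the function names (`prune`, `guesses`, `guesses_in_order`) are all made up here, and this is not the speaker's cracker or Hashcat's actual code. The ordered version simply sorts everything, which is fine for a toy but is not how a real PCFG cracker walks the grammar.

```python
from itertools import product

# Toy grammar with made-up probabilities (an assumption, not real training data).
TOY_GRAMMAR = {
    "word":   [("password", 0.50), ("monkey", 0.30), ("zebra", 0.15), ("qwfpgj", 0.05)],
    "digits": [("123", 0.60), ("1", 0.25), ("2016", 0.10), ("7391", 0.05)],
}


def prune(grammar, keep):
    """Keep the `keep` most probable replacements per slot and drop the probabilities."""
    return {
        slot: [text for text, _ in sorted(repls, key=lambda r: -r[1])[:keep]]
        for slot, repls in grammar.items()
    }


def guesses(pruned, structure):
    """Enumerate every combination for one base structure; no probability math is done."""
    for parts in product(*(pruned[slot] for slot in structure)):
        yield "".join(parts)


def guesses_in_order(grammar, structure):
    """For contrast: all guesses in strict probability order (brute-force sort)."""
    scored = []
    for parts in product(*(grammar[slot] for slot in structure)):
        prob = 1.0
        for _, p in parts:
            prob *= p
        scored.append((prob, "".join(text for text, _ in parts)))
    scored.sort(key=lambda s: -s[0])
    return [guess for _, guess in scored]


structure = ("word", "digits")
pruned_guesses = list(guesses(prune(TOY_GRAMMAR, keep=2), structure))
ordered_guesses = guesses_in_order(TOY_GRAMMAR, structure)
```

In this toy, the pruned run yields four guesses without touching a probability, but it includes `monkey1` and misses `zebra123`, which the ordered run ranks higher. That gap is the loss of ordering that gets traded away for parallelism and a smaller grammar.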
Also, when you're talking about Zipf's law with password distributions, I assume you're talking about David Malone's research. Yep. My understanding of that, and also from my look at the Adobe stuff, is that what he was saying is that this stuff superficially looks like Zipf's law, but when you do a proper statistical analysis, it isn't. It has a tendency to break down at the tail there. But when you look at the front portion of it, it actually does follow it pretty well. So Matt, I've been using your tool for about eight years now, so it's pretty great. Years and years back, there was really no memory limit.
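
As a rough companion to the Zipf's law exchange above, here is a minimal sketch of the "superficial" check: rank passwords by frequency and fit a straight line to log(count) against log(rank) over the head of the list. The function names and the synthetic counts are assumptions made for illustration. A slope near -1 on the head is only an eyeball test, not the proper statistical analysis mentioned in the question.

```python
import math
from collections import Counter


def rank_frequency(passwords):
    """Return password counts sorted from most to least common."""
    return [count for _, count in Counter(passwords).most_common()]


def loglog_slope(counts, head=None):
    """Least-squares slope of log(count) vs. log(rank) over the first `head` ranks.

    Needs at least two ranks, otherwise the denominator is zero.
    """
    counts = counts[:head] if head else counts
    xs = [math.log(rank) for rank in range(1, len(counts) + 1)]
    ys = [math.log(c) for c in counts]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den


# Synthetic counts following count = 1200 / rank exactly (not real leak data).
synthetic = [1200, 600, 400, 300, 240, 200]
slope = loglog_slope(synthetic)
```

On a real leak, one could pass `head` to fit only the most common passwords and compare that slope with a fit over the whole list, which is one simple way to see the tail behaving differently from the front.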
And so the limit on time was how much memory you had, because it didn't really free up memory. So are you running in somewhat reasonable constant memory now, or is memory, or space, still a problem? Memory is still a problem. In the previous version I had capped it out, where it would basically free up that memory, and one thing I'm really struggling with now with the generic context-free grammar is that it's hard to make use of those optimizations. So what I'm doing in order to try to free up that memory is I'm just dropping some very low probability
transitions on the floor, so that when you're running this it'll run in a constant memory space, but it'll eventually terminate much sooner than the full grammar would. So you're not going to be able to run across the entire grammar. So you can't run for infinite time in constant memory? Yeah, you can't run for infinite time in constant memory. Now, if you want to do it not in probability order, where you do something like a beam search, you can start going back through the data set and run it in constant memory. But you're not going to have it in probability order. Last one. So I asked
a variation on this question during the Wordsmith talk at 10 a.m., and that is: when you generate a grammar and generate the probability table for each individual word, have there been attempts to expose that publicly? Which is to say, given n+1 password leaks, can I check to see what the pr