
The BSides DC 2017 videos are brought to you by ThreatQuotient, introducing the industry's first threat intelligence platform designed to enable threat operations and management, and DataTribe, a new kind of startup studio co-building the next generation of commercial cybersecurity analytics and big data product companies.

Hey guys, thanks everyone for coming out and supporting your local BSides, and thanks to the community and the organizers for having us as well. Tom and I are really excited to talk to you about a little tool that we wrote called wordsmith, and about geolocation-based password generation. Let me get some formalities out of the way: my name is Sanjiv Kawa, and my roots are in dev and IT.
I've had a pretty fortunate career so far. I've seen everything from QA, in terms of security-based application testing and functional testing; I've done some Java development (any Java devs out there? I feel your pain); I've also been on the helpdesk, been a system administrator, and been a security consultant. Currently I'm a senior pen tester at PSC. PSC does a bunch of PCI-based assessments; that's kind of our bread and butter, and the way I play into that is I basically spend a lot of my days traversing large enterprise networks, trying to overcome segmentation boundaries and find card data. Something I've been getting into a lot recently is binary analysis and exploitation as well. We're always looking for pen testers, so if you're interested, come find me after the talk and we can grab a beer and talk shop.

I'm Tom Porter; find porterhau5 on Twitter, and I maintain a blog at porterhau5.com. My background is on the blue side, doing network situational awareness and flow data analytics; more recently I've been doing penetration testing and a little bit of red teaming. I've done some extensions on the BloodHound project, and I recently joined the FusionX red team as a security consultant. But my true passion: in a previous life, I was a
professional baseball player. So if you guys have anything you want to talk to us about regarding our presentation today, you can find us out in the lobby; if you want to talk about why the Braves just finished under .500 for the third season in a row, you'll find me at the bar.

So today we're gonna be talking about wordsmith. At its core, at its foundation, all wordsmith is is a wordlist generation tool, but specifically for geolocation data. It got created because Tom and I wanted to work on a cool project together and wanted to write some code together too. It was conceived because, as we would go out on
site and see more passwords in the wild, we would find that people's passwords contained things that were really specific to that area: things like sports team names, common baby names in that particular area, or even something as specific as road names and street names. So what we wanted to do was grab a whole bunch of geolocation data and wrap it all up into a tool, so people can dynamically create geo word lists on the fly. Primarily, our first use case was to crack password hashes, and we'll talk about that a little later as we do our demo, but something else we wanted to do was extend wordsmith past just password spraying and password attacks, so we started looking at geo-specific username generation. A scenario might be: you're out in Minnesota and you want to pull down the most common first names and last names and create usernames for a particular client that you're attacking that week; you can do that, which we just thought was kind of cool. In addition, wordsmith is pretty modular and extensible, and as we progress through the presentation we'll talk about some of that in detail.

So, show of hands: who's familiar with a dictionary attack? Wow, okay, it's basically everyone in here; that's awesome. For those of you who aren't, we'll just
quickly do a one-slide primer; it's not going to take a lot of time. Essentially, the first thing you need for a dictionary attack is a dictionary, and it just consists of a word list. This particular word list is just apple, banana, and cherry; our geolocation word list might have things like street names and religious texts instead. Now, when you type a password into an application or into a domain, it gets stored in a hashed format on disk. The way that works is that it goes through a hashing algorithm, which, in most cases, outputs a fixed-length, high-entropy string. There are various different hashing algorithms; common ones are NTLM, MD5, SHA-1, and so forth. What's really important to note here is that a hashing algorithm can't be reversed, in the sense that there's no decrypt function or reverse-hashing function, because that would circumvent the intended use case of hashing data. So what you end up doing is a guess, hash, compare cycle: using that word list we just talked about, you run each word through the hashing algorithm used for the target hash you're trying to crack, and once two hashes compare as equal, or collide, you can trace back to the exact word in your dictionary that produced the match.
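That guess, hash, compare cycle can be sketched in a few lines of Ruby; this is a stand-alone illustration rather than wordsmith code, using MD5 for simplicity:

```ruby
require 'digest'

# Pretend this is the hash we recovered; in a real attack we'd only
# have the hash, not the password that produced it.
target_hash = Digest::MD5.hexdigest("cherry")

# Our tiny dictionary.
wordlist = %w[apple banana cherry]

# Guess -> hash -> compare: hash each candidate and stop when the
# candidate's hash collides with the target hash.
cracked = wordlist.find { |word| Digest::MD5.hexdigest(word) == target_hash }

puts cracked  # prints "cherry"
```

In practice you'd read the target hash from a dump rather than computing it here, but the loop is the same.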
That's basically a dictionary attack in a really simplistic form.

So, this is our second time presenting wordsmith. We presented wordsmith version 1 at BSides Las Vegas last year; it was our first conference together, and we got a lot of love. BSides has been really good to us, really cool people. Wordsmith version 1 was very specific to the United States of America, and it worked from a state-down approach. Think of Nevada, for example: you'd type that into wordsmith and it would spit out all this data for you; you can see some streets and roads, colleges, and major sports teams, although I don't think Nevada actually has a major sports team, so that's a bad state example, but you get the gist.

One of the additional features we built into wordsmith was integrating CeWL. I'm not sure if you guys know what CeWL is; it's made by digininja, Robin Wood. Let's say you're testing a client, be it grocerystores.com or hugeelectronicstores.com: you can supply a list of URLs to CeWL, and it will go out to that client's website and pull down everything that looks like a string in between all the HTML tags. So now you can do really targeted password cracking, once you marry that with your geolocation data from wordsmith.
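CeWL handles the crawling for real, but the core idea of pulling candidate strings out of a page's HTML can be sketched like this (an illustrative snippet, not CeWL's or wordsmith's actual code; the tag-stripping regex is a simplification):

```ruby
# Toy HTML standing in for a page fetched from a client's website.
html = "<html><body><h1>Huge Electronics</h1><p>Sale on laptops today</p></body></html>"

# Strip the tags, then split the remaining text into candidate words.
text  = html.gsub(/<[^>]+>/, ' ')
words = text.scan(/[A-Za-z]+/).uniq

p words  # => ["Huge", "Electronics", "Sale", "on", "laptops", "today"]
```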
Wordsmith version 1 would also do some basic mangling: splitting on whitespace, trimming, and adding special characters. You could also specify a minimum character length, which is quite useful if you know the minimum character length for a domain (for example, seven on Active Directory or OpenLDAP) or for an application: you can say, only generate words that are seven characters and above, and trim everything else out.

A couple of things we learned from wordsmith version 1, both after the talk and in the following months, came from feedback from the community, and the top three requests were these: we needed more countries available, because it was US-only; the way wordsmith version 1 was coded was quite static, in the sense that you couldn't really introduce your own data, and if you did, it would probably break; and it was quite limited to the English language. The way you interact with wordsmith on the command line is still going to be English, but the data sets underneath were all predominantly English-based. So, taking all those requests, wordsmith version 2 was born, and we just pushed it out to GitHub
today, and there'll be a link that you can grab it from later. We now collect geolocation datasets for over 230 countries and territories, basically every single country and territory in the entire world, and we also have multi-language support: the popular languages, like French, Spanish, Portuguese, and German, all exist as dictionary files within wordsmith now. Because this was such a ground-up rebuild, I don't think we actually reused any of the code from wordsmith version 1. Tom put together a really beautiful CLI, a command line interface, that you can interact with; I'm partial to it because we wrote wordsmith, so you might hate it, but we kind of like it. And as we went out on site and saw more passwords, we saw that a lot of people introduce things like gods' names and book names and verse names, so we thought to introduce religions: we've got a couple of different formats of the Bible, as well as the Quran and some other popular religious texts, in wordsmith 2. Again, its primary use is password cracking, but we've extended that, so you can now start doing some really cool geolocation username generation too, which we'll get into a little bit later. So, speaking about a couple of the data sources here:
Show of hands, who knows about the CIA World Factbook? Cool, okay, a lot more people than I thought; I actually didn't know about it until I started looking at data sources I could pull from for wordsmith. It turns out, shout out to the CIA, that they've collected a ton of metadata on a lot of different countries around the entire world. This is what a Factbook entry looks like; I'm gonna make this screen a little bit bigger here. You get some really cool words, and a lot of cool dates as well, like 1776, when the nation was founded; the chance of that being appended or prepended onto passwords might be high. The things we found most interesting here are these anchor points: population counts (you'll see why that's cool later on), the different ethnic groups and languages, both official and non-official languages of countries, and religions. As you dig more into the Factbook entries, you'll find they're pretty well updated, and you'll see things like political party names, current senators, current presidents, things like that; a pretty cool data source. In addition to that: when we were thinking about the concept of introducing languages into wordsmith,
something we started with first was just raiding the *nix package repositories and grabbing all the language files out of there. It turns out those language files are more technically focused, with a lot of operating system lingo and technical lingo in them, and you can't really build a full lexicon of a language from those sources. Pivoting off that idea, we found that spell-check files, specifically the Hunspell files from OpenOffice and LibreOffice, have some fantastic community contributions, so we used a lot of spell-check files for the languages that we generated, and we found that gave us at least more data for the different languages. If you want to know how dictionary files, affix files, and spell-check files work, and want to listen to me drone on about them for an hour, we can talk later, but it's not that interesting.

Some other data sources: Wikipedia, which can be a nightmare to parse, just because of all the span tags and div tags; if you want to learn how to do that quickly and efficiently, I can tell you about that a little later as well. We're covering most of the US in terms of sports teams and colleges and universities, and we've got almost entire-world coverage for landmarks and archaeological sites.
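Going back to the spell-check files for a moment, to make that idea concrete: Hunspell-style .dic files have a word count on the first line and one entry per line after that, with optional affix flags following a slash. A minimal sketch of extracting plain words (wordsmith's real parsing may differ):

```ruby
# A tiny Hunspell-style .dic fragment (illustrative, not a real file).
dic = <<~DIC
  3
  apple/S
  banana
  cherry/MS
DIC

# First line is the entry count; affix flags follow a '/' and are dropped.
words = dic.lines.drop(1).map { |line| line.split('/').first.strip }

p words  # => ["apple", "banana", "cherry"]
```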
So if you're looking at Italy, for instance, you're probably gonna have a lot of famous landmarks in there. Project Gutenberg is another data source we're using; it's an open-source-minded community that's digitizing cultural works, and it promotes the creation and distribution of those works. That's where we're pulling most of our religious texts from. Some of the other data sources, like OpenStreetMap and US census data, Tom has spent a lot of time looking at, so I'm gonna pass it off to him to speak about those.

So, OpenStreetMap, for those of you who aren't familiar, is a community of mappers who have come together to build their own open-source version of something akin to Google Maps, and we can take that data, parse it locally, and grab interesting things. This was our first foray outside the US, which was the initial purview of wordsmith, now branching out across the world. For now we're grabbing roads, cities, and counties, but there's other data there that we're going to grab too. The United States Census is our canonical source for area codes and zip codes, and we can do some interesting things with those, with appending and prepending and some of the attributes. And the Social Security Administration
maintains this list of the most popular baby names in each state, dating back to 1910, so we can take that data, do some things with attributes, and do some username generation, which we'll show.

To get wordsmith, at the command line you do a git clone of the repository URL (it's included again at the end of the presentation), then you cd into the directory, then you can run bundle install; we include a Gemfile to help with that. That's only necessary if you want to use the CeWL integration; it's not mandatory for the straightforward stuff, which will work out of the box with all the other options. When you run wordsmith for the first time, you'll see output that looks like this: it unpacks some data and creates a data directory where all the flat files are stored for querying, and this is what that structure looks like. We recommend that you check out the README first, or look at the examples with the -E flag; they're pretty similar. Now I'll give you an overview of the different components of wordsmith and how you can interact with it. The primary Ruby script down there at the bottom is wordsmith.rb, and the compressed data archive that we include all of our words in is that data archive you see there; upon first run, wordsmith is going to unpack that data
and then create that data directory that you see near the top.

When you're using wordsmith, there are two primary components you need to be aware of: you select your boundaries, which are your inputs, and then your attributes, which are the types of words you're getting. On the boundary side, those are specified with the -I, or input, flag, and they're the areas of the world that you want data for. That might be a particular country; some countries are broken down into states or provinces, and sometimes they go down even further, into cities. We even have this notion of custom regions, something we created so you can build custom mappings of certain boundaries and marry them together. So you specify your boundaries, then you specify your attributes; on the right you see some of the different attributes that we have: cities, colleges, landmarks, and so on. If you look at the example at the top, you can see what an invocation looks like: it's ruby wordsmith.rb, with -I USA as our boundary input, and then -r for roads and -l for landmarks. Now, when you're using those boundaries, it helps to know a little about the structure of that data directory, because that's what we're querying. If you look at the data directory at the
top level, you see all of these subdirectories that are three letters long; they correspond to the ISO alpha-3 country codes, so the US is USA, Great Britain is GBR, Germany is DEU, South Africa is ZAF, and so on. That's the top level, and the countries are where you start drilling down. If you go into one of them, say the USA, you'll see it broken out by two-letter state abbreviations, from Alaska out to Wyoming; we do something similar with Canada and its provinces, like BC. There are also some text files that reside in these directories: you see cia.txt there, which is one of our attributes; that's all the data we've parsed out of the CIA World Factbook for the US, so if you use the CIA option as an attribute, you get the data in that text file. A lot of these countries also have a YAML file, which is kind of our configuration file, where we're storing some metadata about the country. If you drill down further, here into the North Carolina directory, you see even more attributes, and this is where most of our data resides; in the US, at least, you'll see area codes, cities in North Carolina, and so on. You can define even more granular restrictions if you want: for the boundary of Charlotte, the city of Charlotte, there's a text file in there like sports.txt, and if you look at it you see Charlotte Hornets and Carolina Panthers.

So how does that translate over to the wordsmith command line? Basically, you take all the subdirectories and join them with dashes: if it's the USA, it's just USA as your input; if you're interested in North Carolina, it's USA-NC; Charlotte would be USA-NC-Charlotte. If you want to specify multiple boundaries, you can do that too; on the command line it's just CSV format, so for the USA and Canada it'd be USA,CAN, and if you want the DMV area, it's USA-DC,USA-MD,USA-VA.
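Since the dash syntax mirrors the directory layout, you can think of a boundary string as a path into the data tree; a hypothetical sketch (wordsmith's actual internals may differ):

```ruby
# Hypothetical illustration: a dashed boundary string maps onto
# nested directories in the data tree described above.
def boundary_to_path(boundary)
  File.join("data", *boundary.split("-"))
end

# Multiple boundaries are supplied as CSV on the command line.
inputs = "USA-DC,USA-MD,USA-VA".split(",")
p inputs.map { |b| boundary_to_path(b) }
# => ["data/USA/DC", "data/USA/MD", "data/USA/VA"]
```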
With the CIA World Factbook data, one of the elements we could pull out was the population of each country, so we took that and ranked the countries from most populous to least populous, and with that you can now supply a number for your input: if you give it ten, it goes out and grabs the ten most populous countries; 25, 50, whatever you're interested in.
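That numeric-input behavior, rank by population and take the top N, can be sketched like this (illustrative population figures, not wordsmith's actual data):

```ruby
# Illustrative population figures keyed by ISO alpha-3 code
# (rounded, not wordsmith's actual data).
populations = {
  "CHN" => 1_379_000_000,
  "IND" => 1_282_000_000,
  "USA" => 327_000_000,
  "IDN" => 261_000_000
}

# A numeric input of N selects the N most populous countries.
n = 2
top = populations.sort_by { |_code, pop| -pop }.first(n).map(&:first)
p top  # => ["CHN", "IND"]
```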
So what if you want to get data for a very large region, and you don't want to type out all the different countries? Maybe you're interested in getting all the data for Europe. We have this notion of a region, and there's this regions CSV file; it's like a configuration file. Here we're grepping out the Europe alias, and you can see how it's structured: the alias name is the first value, Europe; the second value is a description of the alias, 'the continent of Europe'; and the third value is all the boundary members, so you've got Germany and Italy and Poland and so on, all wrapped up into Europe. Then, back at the wordsmith command line, you just specify your alias of Europe and it's gonna pull in data for all those different countries. Wordsmith comes prepackaged with a bunch of these regions already defined; you can list them out with the -R flag.
Here are some examples: from the US you've got New England, and we broke things down into the Southeast and the Far West; we've got world organizations like NAFTA and the EU; we've got each of the continents, like South America and Africa; and we have this catch-all alias of 'all', which is basically every single country, so if you want the purview of wordsmith to be every single region that we have data for, you'd use all. So, with boundaries defined, now we have attributes; these are the types of words that we're generating. You can see a lot of them listed there, and we're not going to go through all of them; the ones I want to call your
attention to are the first one underneath the input options, the -a flag, which captures every single attribute listed there, and the other special one, the -b flag just underneath it. That's our miscellaneous/other option: it can grab text files that you've added yourself, or that don't fit one of the predefined attributes, and we'll see an example of that in a bit. Some examples of what this looks like on the command line: if you're grabbing zip codes for California, your input would be USA-CA and then -z for zip codes; for all the roads, cities, and landmarks within England, you'd specify your input as GBR-ENG and then -r -c -l; and if you want to grab all the attributes for Asia, Asia is one of those regions that we have a definition for, covering all the Asian countries, so use that with the -a flag and it's gonna grab every single possible word that it can for Asia.

Now, if you're curious what attributes are available for each boundary, we have this thing called child nodes. You give it an input boundary, in this case GBR for Great Britain, you give it the -C flag, and it shows you this hierarchy of data. It starts with Great Britain at the root, and it shows you that you
can get cities and counties and landmarks, and then traverse down into Scotland and Wales and England; England is broken out by historic counties, and in some cases, like Sussex, those are broken out into administrative counties, so you can get as granular as East Sussex or West Sussex. This matters for getting cities, counties, and roads: when you call a certain attribute, say all the cities for Great Britain, it starts at the top, traverses down to each of its child nodes, and then combines all that data together.
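That child-node traversal can be sketched as a simple recursive walk; the hash structure here is hypothetical, not wordsmith's real data layout:

```ruby
# Hypothetical boundary hierarchy: each node carries an attribute
# (cities) plus child nodes to traverse into.
tree = {
  "GBR" => {
    cities: ["London"],
    children: {
      "SCT" => { cities: ["Glasgow"], children: {} },
      "ENG" => { cities: ["Bristol"], children: {} }
    }
  }
}

# Collect an attribute from a node and, recursively, all its children.
def collect_cities(node)
  node[:cities] + node[:children].values.flat_map { |child| collect_cities(child) }
end

p collect_cities(tree["GBR"])  # => ["London", "Glasgow", "Bristol"]
```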
We touched on the CIA World Factbook and some of the metadata that we pulled out of it; this is an example for Japan. In that directory you'll see the YAML file for Japan, and in the middle you can see what that config file format looks like: we're grabbing the population, the most common languages spoken, and the most popular religions, and we can use these for some interesting queries. We've already talked about the most-populous-countries query. For religions, with the -g flag, we've used data from Project Gutenberg, where we've parsed out several religious texts: the King James Version of the Bible, the NIV, the Quran, some of them in different languages. When we look up a country in its YAML file, we can see markers for which religion it's part of, and we can pull in a religious text for it.
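The metadata lookup can be sketched against a small YAML fragment; the field names here are guesses for illustration, not necessarily wordsmith's actual schema:

```ruby
require 'yaml'

# Hypothetical country metadata in the spirit of the per-country
# YAML files described above.
config = YAML.safe_load(<<~YML)
  population: 126000000
  languages: [japanese]
  religions: [shintoism, buddhism]
YML

p config["population"]  # => 126000000
p config["religions"]   # => ["shintoism", "buddhism"]
```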
In the middle there, you see one way we're parsing those texts: Bible book, chapter, and verse, so you see things like Genesis 1:1. I don't know how many of you have done hash cracking in the South, but there are a lot of John 3:16s and Psalm 23s hanging out there. At the bottom you see another way we're parsing Bibles, which is just grabbing every single word possible from them. On the languages side, with the languages option, we found which languages were the most frequent across all the countries we had data for, grabbed the 13 most popular ones, starting with English and working our way down, and built dictionaries for English and French and German and so on. So we look up a marker like the US, we see that it has English and Spanish as its two most popular languages, and we pull in the Spanish dictionary and the English dictionary.

Now, with this rewrite of wordsmith for version 2, we talked about our modular design; this is what it looks like in action. Let's imagine that you wanted to include words for all the lakes in Minnesota: you go out, you find a data source, you parse it, and then
you create a text file called lakes.txt and place it in the appropriate directory, so here it's placed there under the USA tree. The format should be newline-delimited and sorted alphabetically, which just helps with some of the processing, and then you can immediately query it with wordsmith using that -b option I mentioned before: it's a data file that doesn't necessarily match one of our predefined options, but we still want an easy way to grab that data. The reason we formatted things like this is that we wanted it to be really simple for folks to add their own data and contribute back to the project: all you really need is a text editor and the ability to go out to GitHub and upload a file, and you can submit pull requests to us, and that way you can share data with everyone else.

So, we've got our boundaries, we've got our attributes; now we want to start tweaking the output. By default, wordsmith prints all the words to the console; if you want to write out to a file instead, you just use -o, and if you want to quiet the output so you're not clogging up your console, you can use -q. The middle section there is for placing restrictions around the words that you're generating, or for doing some word mangling: you can set a
minimum length and a maximum length for your words, and you can use -D for Windows default complexity. Say you only want to generate words because you're cracking hashes where the password policy was the Windows default, an eight-character minimum with three out of four character classes (upper, lower, numeric, special): that option will filter out any words that don't meet that standard. -j lowercases everything, in case you want to port the list into a hash cracking tool. The next three options let you strip out spaces, strip out special characters, or split on spaces; if you want to do all three of those things at once, just use the mangle option, -m; it's a very basic word mangler. The bottom section there is for prepending and appending certain words, like zip codes, area codes, or your own custom word list.

As an example of tweaking the output, let's say you're generating roads for DC: you use USA-DC as your input and -r for your attribute, and one of the words that pops out is 'Pennsylvania Ave.', with a space in it and a period at the end. If you drop a -m at the end of that command and start mangling that word, you can remove the special characters, split on spaces, or remove spaces entirely, and you've turned one word into seven; just some basic word mangling. You can specify a minimum character length on top of that with -k, so here we're gonna keep words that are eight characters minimum, and that -D will do the Windows default complexity, so now we're only passing words that meet the eight-character minimum with three out of four character classes. If you want to write all of these words out to a file, the syntax looks like that: in this example we're doing -a for every attribute of DC, -q to quiet the output, and -o to write it out to a file.
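The "one word into seven" mangling idea can be sketched like this (an illustrative stand-alone snippet rather than wordsmith's actual mangler, so the exact variant set differs):

```ruby
# Basic word mangling in the spirit of the -m option: remove special
# characters, split on spaces, and remove spaces entirely.
def mangle(word)
  variants = [word]
  variants << word.gsub(/[^A-Za-z0-9 ]/, '')  # remove special characters
  variants += word.split(' ')                 # split on spaces
  variants << word.delete(' ')                # remove spaces entirely
  variants.uniq
end

p mangle("Pennsylvania Ave.")
# => ["Pennsylvania Ave.", "Pennsylvania Ave", "Pennsylvania", "Ave.", "PennsylvaniaAve."]
```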
This is kind of what the output looks like: it will enumerate through every single attribute of your boundaries, show you how many words it generated for each, and then at the very end it does this big sort and unique and writes all those words to a file. For DC, this generates a little over 1.2 million words.

I saw this article about a year ago in Ars Technica about an alleged drug dealer who got rich selling illegal drugs on Silk Road. His machine was eventually confiscated, his PGP key was found and cracked, and it turned out his password was a dictionary word followed by 209, which happened to be the area code of where he lived in California. Sanjiv and I saw that and were like, this is too good, we have to include that functionality in wordsmith; the article came out right when we released wordsmith last year. So you can use the options you see on the side to append or prepend zip codes and area codes, or to even splice in your own word list. As an example of how you might use that: you have this file years.txt that has 17, 2017, and 2017! in it, and then you go to wordsmith and generate some words. Here we're generating words for DC, grabbing the colleges with -f, mangling the output with -m, and then we're going to
append all of the years to it. So you have Gallaudet, and then you see all of our custom words appended to it: Gallaudet2017, or Gallaudet2017!; or for Georgetown it's gonna pull out the mascot, the Hoyas, and you get Hoyas2017!. You're generating words that look like they might actually fit a password for someone who lives in the area.
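The append step is essentially a product of base words and suffixes; a quick stand-alone sketch (not wordsmith's code):

```ruby
# Appending a custom suffix list (like years.txt) to generated words.
words    = %w[Gallaudet Hoyas]
suffixes = ["2017", "2017!"]

combined = words.flat_map { |w| suffixes.map { |s| "#{w}#{s}" } }
p combined  # => ["Gallaudet2017", "Gallaudet2017!", "Hoyas2017", "Hoyas2017!"]
```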
Names, like we touched on before, come from the Social Security Administration, and we're doing a couple of things with them. One, you can just grab them as normal attributes: all the first names for a particular area, or all the last names, or all names together. We're also using them for username generation; these are the only files that we keep that aren't already sorted, because they're arranged from most popular name down to least popular. So, on to username generation. We saw this as a new avenue: our focus with wordsmith initially was generating words for password cracking or password spraying attacks, and now we had this idea that we can generate usernames that are particular to a certain area. For the formats, there are some examples you see at the top: first initial plus last name, first name plus last name, down to first name dot last name. There are
some options at the bottom if you want to truncate usernames or control how many usernames you're generating. To give an example: if you're generating first names and last names for the US, you see James Smith, James Johnson, James Williams, and so on; at the bottom it's first name dot last name, the same output with the dot in there. It's grabbing the most common first names and matching those up against the most common last names; that's how it goes about the generation process. Let's say you're going up against a mainframe that only likes eight-character usernames: you can use this --truncate option and specify a number to truncate all your usernames at a certain character length. If you want to adjust how many names you're generating: our sane default was one hundred, so it's going to grab the hundred most common first names and the hundred most common last names and match them together, which gives you 10,000 usernames. If you want more, you can specify the name depth option, and it'll use the 250 most popular first names and 250 most popular last names, or if you go up to a thousand, it's generating a million different usernames for you. And with that, I'll pass it back to Sanjiv for the demo.
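The pairing scheme just described can be sketched as a product of the two name lists; illustrative lists here, not the SSA data wordsmith ships with:

```ruby
# Pair the most popular first names with the most popular last names,
# in a first-initial-plus-last-name format.
first_names = %w[james john robert]
last_names  = %w[smith johnson williams]

usernames = first_names.product(last_names).map do |first, last|
  "#{first[0]}#{last}"          # e.g. "jsmith"
end

# Optional truncation, like the --truncate option (8 chars here).
usernames = usernames.map { |u| u[0, 8] }

p usernames.first(3)  # => ["jsmith", "jjohnson", "jwilliam"]
```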
So, last year we did a live demo and it went pretty well, I think, but this year we decided to go with a pre-recorded demo, just for the context. What we're gonna be doing is generating a word list for the country of Ireland. I was recently in Ireland doing a pen test, and I extracted the NTDS.dit and dumped the NTLM hashes for a particular Irish client. We're gonna use this Irish word list to crack those Irish hashes, and then look at some of the cool, interesting geolocation-based passwords that we recovered. Let me get out of this.

Has anyone ever heard of asciinema before? A couple of people, one or two, cool. Whenever I'm on a pen test, I record my console output. Before, I was using script, and it's great: it records anything you put into the console out to log files, if you want to preserve evidence or track what you're doing, but I found that escaping control characters can be a bit of a pain. Asciinema creates these really nice JSON blob files that you can play back, and copy and paste text out of really nicely. So, that aside,
essentially what I'm doing here is creating an Irish word list using the three-letter ISO country code. I put the quiet flag in there because I don't want to see the verbose output of all these input options printed to the screen, but when I'm creating word lists I'm typically including all of the options, and I might toss in some mangling in there as well. Can you see that on screen? Okay, just going to fast-forward a little bit here. Yeah, so that's the generation of the Irish word list: you're seeing the CIA demographic data, you're seeing the OpenStreetMap data come out, and some of the Wikipedia data as well. Religions is probably Catholicism and the like, and languages is probably going to include the English dictionary — I think there's an Irish/Gaelic dictionary in there as well. But yeah, roughly twelve seconds to generate an 800K-word list. Quickly touching on roads: initially, as I was looking at roads as a data set, I thought that a lot of the data in there would basically be numeric street values, right? But what I came to realize is that there are so many roads, even in the United States, that are named after influential people — Martin Luther King Boulevard, for example — or New York Avenue, and those datasets transfer across the world.
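Conceptually, that generation step just pulls words from each data source and merges them into one deduplicated list. Here's a minimal Python sketch — the function name and sample data are made up for illustration:

```python
def build_wordlist(*sources):
    """Merge words from several geo data sources (demographic data,
    OpenStreetMap roads and cities, Wikipedia, dictionaries) into one
    case-insensitively deduplicated word list, preserving first-seen order."""
    seen = set()
    wordlist = []
    for source in sources:
        for word in source:
            cleaned = word.strip()
            if cleaned and cleaned.lower() not in seen:
                seen.add(cleaned.lower())
                wordlist.append(cleaned)
    return wordlist

cities = ["Dublin", "Cork", "Galway"]
roads = ["O'Connell Street", "dublin"]       # duplicate collapses
demographics = ["Catholicism", "Gaelic"]
print(build_wordlist(cities, roads, demographics))
```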
So you'll have a nice set of names in there as well, which might be part of people's passwords. Yeah, so we've created this word list, and now I'm going to start cracking some hashes using hashcat. I'm just removing the hashcat potfile — you shouldn't really do this, because it keeps a nice mapping of every crack you've ever done, but for the demonstration purposes of this video... These are real NTLM hashes extracted from an Irish client, so we're using hash mode 1000 for NTLM, and then a straight wordlist attack with the d3adhob0 rule set. I don't know if you guys are familiar with that, but Tom
has had great success with it. It's about 50,000 different rules — rules that add different mangling options, append or prepend certain things, do title casing, camel casing, character substitution — so a's become 4s and e's become 3s, that sort of thing. What you're doing is taking that 800,000-word list you've created and performing all of those substitution operations on every single one of those words, so you're expanding your list from 800,000 candidates to 800,000 times 50,000. Yeah, so looking at this hashcat output here: there were 533 unique hashes. There could have been maybe 600, but it distills down to unique values because some passwords get reused — for certain service accounts, for common domain user accounts, whatever that may be. And of those 533, within a minute and 20 seconds we cracked 101, which is a pretty decent recovery rate of about 20% for a geolocation-generated word list. And this is just on my MacBook Pro, so you're not going to have the fastest GPU cracking speeds here, as opposed to a cracking rig of some sort. But yeah, looking at some of these passwords, some of the more interesting ones are on the right-hand side.
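To get a feel for what rule-based mangling does, here's a toy Python stand-in for a rule engine — a handful of lambdas in place of hashcat's roughly 50,000-rule set, just to show how the keyspace multiplies:

```python
# toy mangling rules: each rule turns one base word into one candidate
leet = str.maketrans("ae", "43")
rules = [
    lambda w: w,                  # pass-through
    lambda w: w.capitalize(),     # title casing
    lambda w: w.translate(leet),  # character substitution: a -> 4, e -> 3
    lambda w: w + "1",            # append
    lambda w: "1" + w,            # prepend
]

def apply_rules(word, rules):
    """Yield one candidate per rule, like hashcat applying a rule file."""
    return [rule(word) for rule in rules]

print(apply_rules("galway", rules))
# an 800,000-word list times a 50,000-rule file gives the effective keyspace:
print(800_000 * 50_000)  # 40,000,000,000 candidates
```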
You'll get the common sort of city names, and — having recently been there — you'll also get things like landmark names, and interestingly some really cool Irish last names as well. This last one is an actual real address: there's a business located at that address, and the password is just what the address is. Now, you're not always going to be on site extracting hashes for just one particular geo set, for an organization based in one country, right? The more common scenario is multinational organizations, which span across different countries and have satellite
offices everywhere, and even headquarters in different countries too. So this next statistic is for a multinational organization whose hashes I recently obtained, with offices in the USA, Australia, and Canada. As you can see in the wordsmith command there, I'm using a flag that takes all the words in the generated word list and makes them lowercase, because I'm going to be applying more sophisticated cracking rules to them later, in terms of d3adhob0. So yeah — top 10k is just the top 10,000 passwords list, and it grabs you about 256 passwords in four seconds in conjunction with those
d3adhob0 rules. rockyou, which is a really common word list, grabs another 476. And then our geo word list for Australia, Canada, and the US is about 7.3 million words; it takes about 30 minutes to run and grabs you another 241. So looking at some of those passwords: Australia's got a really high character-length password in Queenslander, which is just someone who lives in the state of Queensland, and you've also got a couple of different cities and towns there — all sort of geo-based data. Canada, nothing too interesting there, just the country name and a popular Canadian city. Matthew2222 is probably a reference to a book and a
verse and chapter of the Bible. And then for the USA you've got common tourist destinations, sports teams, transportation lines, and political candidates as well. Okay — Sanj has spent a monumental amount of effort creating parsers for all these different repos. Sometimes we're going out to the internet and scraping websites; sometimes we're pulling down files locally and parsing them in various formats. We wanted to preserve that, so we created another GitHub repo, wordsmith_parsers. You guys can go in and see the random assortment of bash and Ruby and Python scripts there. It's also an opportunity for you to contribute back — so if you've submitted some
data and you write some parsers to get that data, feel free to send us a pull request. As for future work and where we see wordsmith going from here: data is always going to be at the forefront of this project. We'll probably dig deeper into things like OpenStreetMap — there are a lot of other interesting attributes you could pull out of there, things like cafes and airports and bodies of water, interesting proper nouns specific to a particular boundary. Popular song lyrics — this was a recommendation from Patrick Fussell, sitting right over there: what if you took the Billboard top 100 for a given country at a certain time, found all those songs, and then
parsed out all the lyrics from those songs? You could break them up by phrases or verses and drop those into a word list. If you've got ideas, we'd love to hear them — send us a pull request on GitHub, submit a new issue, reach out to us on Twitter. We'd love to chat with you, especially if you have skills like GIS skills. For Sanj and me this is kind of a fun project, a hobby on the side; we don't have that GIS background or experience, so if you do, I think we can get a lot of value out of you — feel free to reach out. Multiple-language speakers too: sometimes we're reaching out to these really obscure foreign-language websites trying
to find data sources, and Google Translate can only take you so far sometimes, so if you've got those skills that'd be great. Or if you just like hunting in the deepest, darkest parts of the internet and scraping data from there, that would be very helpful too. In terms of design, and where we see version 3 of wordsmith going — my pipe dream is changing how you can specify your inputs. Instead of just doing it at a country level or a state level or a county level or whatever, what if you put in an address, or a pair of geo coordinates, and then said: give me all the words within a 50-mile radius of this point? That's where we'd like to go for version 3. So with that, I'd like to say thank you, and I think there's some time for questions.
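That coordinate-plus-radius idea could be sketched with a haversine distance filter. The Python below is just an illustration — the function names are made up and the town coordinates are approximate:

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_MILES = 3959

def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two coordinate pairs."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + \
        cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * asin(sqrt(a))

def words_within_radius(center, places, radius_miles=50):
    """Keep only the place names whose coordinates fall inside the radius."""
    lat, lon = center
    return [name for name, (plat, plon) in places.items()
            if haversine_miles(lat, lon, plat, plon) <= radius_miles]

places = {
    "Dublin": (53.3498, -6.2603),
    "Bray":   (53.2026, -6.0983),
    "Galway": (53.2707, -9.0568),
}
# centered on Dublin: Bray (~12 miles) is in, Galway (~115 miles) is out
print(words_within_radius((53.3498, -6.2603), places))
```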
[Question: have you looked into using demographics, and what kind of demographics in particular?] Yeah, that's a really good point. No, we haven't — that's a really good point, though. I guess it would be useful for a very specific targeted attack, maybe for a small set of users or one particular user, right? That's a great point that we hadn't considered, but we'd love to talk about it afterwards. Any other questions? [Question about UTF-8 support.] Yeah — so with some of the language files we're using, we've got anglicized versions of those, so UTF-8 helps there. And then particularly with the Russian character
set, some characters had to be stripped, but they can also be converted so they're read properly. So yeah, that's a good question. Anyone else? Cool — well, we'll be around and will be attending the other talks. It's great having all you guys here. Thanks so much, and we hope you have an excellent con. [Applause]