
Does anyone here use a voice assistant regularly? No? Okay. Usually when people think about voice systems, they think about a special-purpose voice assistant device, so let's start there. What's there to study in the devices? The bad news is these devices are really simplistic: a microphone, some processing, and a speaker. That's it. And the physical device itself is pretty much a solved problem. There was a vulnerability in the first version of the Amazon Echo Dot that let you JTAG onto it and get root access; the last two versions haven't had that weakness. But from that first one we were able to tell that under the hood they're running an Android-based operating system, with the Alexa software on top of it, which is fairly interesting. In general, though, these devices are solid. They're the best case of the Internet of Things: fully managed by the companies that make them, updated regularly, firmware hardened. It's hard to get these devices to do much of anything. The other thing is that the traffic is end-to-end encrypted. There are several pcaps of the data streams floating around, and you can capture your own, but they're all encrypted; I haven't seen any unencrypted streams from these devices.

So if the device itself is pretty solid, where do we go from there as researchers? One thing I want to point out is that voice assistants are, by their nature, disembodied. The first thing you should think about is not the physical packaging, like an Amazon Echo Dot; those are really just ways to tap into the real interface, which is the voice. They're on your phone, and they show up everywhere: they started showing up in cars, in TVs, even the Fire Stick remote has a trigger button. A lot of public places are starting to have a voice assistant feature, there to serve you and help you orient yourself. They're also part of chatbot systems; several clever developers have decided to take a Slack bot even further and add a voice front end to it. The point is, these assistants are getting synthesized into everything around
everyone. So that's where I want to focus: the virtualized instances, because there's a commonality across all the platforms. Fun side note here: Siri went by a different name before they renamed it. Another interesting note on Siri is that the company that put the speech technology together, Nuance, envisioned it as an entire operating system. They had a way to program it on the fly using a syntactical language, just by talking to it; there was all kinds of really cool stuff in there. Then Apple bought it, and at that point they crammed it into the phone, and basically until recently it stayed there.
Oh, a little more background on voice systems that I meant to point out. We've had a long-standing desire in computing to cram voice assistants into pretty much everything, and if you think about it, a lot of our sci-fi has a voice assistant component. In earlier versions of this talk I had KITT from Knight Rider; you interacted with the car via voice. Even on Star Trek, how do they interact with the computer? "Computer, do this thing." Everything was voice. The creator of Tetris, whose name I can't really pronounce, had a job before he became famous that involved trying to create a voice command interface for mainframes. I don't know if they were ever successful, but that gives you an idea of how old the desire to interact with your computer by voice is; it's something that feels natural to us. Another random bit of trivia: AT&T wanted to be able to detect spoken-out numbers, for various reasons.
The problem is that building a model of voice requires a lot of data crunching, so just like the rest of the machine learning world, it lagged for a long time until we had the processing power and a bunch of other pieces in place.

That brings me to the Platonic ideal, as in the ideal form, of what a voice assistant is. You can pretty much sum it up in this structure: there are three different machine learning systems involved in a voice assistant, and that's really what the entire platform is. The first part is voice-to-text: just understanding the human speech and turning it into text. Then there's the natural language processing: working out what the person actually said and meant. And then there's the assistant logic itself. Each of these systems is an area of research all by itself, and it's truly fascinating, so I'm just going to look at the first two. If you want to catch me later to talk about the other one, feel free.

The first thing we need to do is pick apart the acoustic landscape. There's a whole acoustic space that we can hear, and our brains are wired such that we flawlessly, or not always flawlessly, but automatically filter out the things we know are less important. Programmatically, we have to train a system to do that, and that's where digital signal processing comes in. I have the ranges here for what we can produce with our voices versus what we can hear, and notice that we can hear a much wider range than we can produce. Another random bit of trivia: a typical male speaking voice sits roughly in the 85 to 180 Hz range at the fundamental, down into the deep baritone of a James Earl Jones.
There are whitepapers on this from the phone carriers; cell carriers especially are very interested in how much of this range they can throw away. The less frequency range you have to care about, the less processing you have to do; there are all kinds of benefits. One thing that's tempting when you start in this space is to throw out the frequency ranges that the voice could never produce. The problem is that there is information at the higher ranges, outside of what the voice produces directly: a sound wave affects the ranges above it, we can detect that, and it matters in processing. For example, it's tempting to cut out everything between roughly three and six kilohertz, but that band carries our ability to understand consonants. Another big thing to point out: pulling sound apart is an inherently lossy process, and the more we pull it apart, the more that intersects with security.

Back to the brain filling in sounds that aren't actually there but that it knows should be there. This trick was used to get around the problem of Alexas being triggered by commercials. The Alexa "dog burger" commercial was one of the first that didn't trigger people's devices, and what they did was just cut out a frequency range. A human brain fills that gap in, but from a digital signal processing standpoint the data is simply missing, so the model doesn't match and the device doesn't start recording.

So how do we pull voice apart? We run it through what's called a fast Fourier transform; I've pronounced that wrong many times. What we're effectively doing is taking a complex signal in the time domain and decomposing it into its component parts. That's useful because, like I said, the acoustic space has a lot going on.
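As a rough sketch of both ideas, and this is my illustration, not Amazon's actual pipeline: decompose a signal with an FFT, zero out the 3-6 kHz band the way the commercial trick does, and resynthesize. The sample rate and tone frequencies are made up for the demo.

```python
import numpy as np

SAMPLE_RATE = 16_000                      # a common voice-pipeline rate
t = np.arange(SAMPLE_RATE) / SAMPLE_RATE  # one second of audio

# A vowel-like 200 Hz fundamental plus a consonant-like 4 kHz component.
audio = np.sin(2 * np.pi * 200 * t) + 0.5 * np.sin(2 * np.pi * 4_000 * t)

# FFT: decompose the time-domain signal into its component frequencies.
spectrum = np.fft.rfft(audio)
freqs = np.fft.rfftfreq(audio.size, d=1 / SAMPLE_RATE)

# The commercial trick: notch out 3-6 kHz. A listener's brain restores
# the gap, but a model that relies on those bins no longer matches.
spectrum[(freqs >= 3_000) & (freqs <= 6_000)] = 0
filtered = np.fft.irfft(spectrum, n=audio.size)

out = np.abs(np.fft.rfft(filtered))
peak_200 = out[np.argmin(np.abs(freqs - 200))]   # fundamental survives
peak_4k = out[np.argmin(np.abs(freqs - 4_000))]  # consonant band is gone
print(peak_200 > 1_000, peak_4k < 1e-6)          # True True
```

The human ear tolerates that missing band; anything pattern-matching on the raw bins does not, which is the whole point of the trick.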
A good thought experiment is to sit and listen to how many different types of sounds you can hear going on at the same time. That's the problem a voice assistant has to work with, and one of the most hostile environments is an airport: announcers in the background, people all around you, all different ranges of voices. It's pretty harsh. One of the things Bose does, and I know others do this too, is that their headphones try to latch onto your voice while you're speaking so that's all they pick up. The problem is that when you stop speaking, the algorithm starts ranging around and latches onto something else. One of my colleagues was on a call at an airport, and when he stopped talking, his headset ranged out and latched onto the airport announcers as if that were his voice. The other hostile space is a car. One of our first installations of Alexa was in our car, because I hate fiddling with the sound system while driving; I'd rather just say "play this music." The problem there is you've got engine sounds and all kinds of other noise going on. The amazing thing is that the Alexa actually picked up our voices really well.

So the first part of a voice system is detecting sound events, and by events I mean things we can decompose down to something recognizable: glass breaking, smoke alarms, even gunshots. Lots of cities have gunshot detectors; that's a sound system that picks up the signature of a gunshot and, because there are several detectors around the city, can triangulate where it came from. One thing I thought was interesting is that one of my first ideas for a voice assistant application was this sort of guard-duty thing, and it exists: Alexa Guard. It's not widely publicized, because there are still some false positives.
But just the idea that it can detect a glass break, that's a sound event. One other interesting note: it doesn't work on all Alexa devices, because like I was saying earlier, the Echo Dot is pared down to basically just a microphone with very little processing behind it, and detecting more than a handful of sound events takes processing. Here's what sound events look like if you plot them using what's called a mel spectrogram. Generally, sound events don't happen in one frame of sound. Sound comes in as samples; at a sample rate of 16 kHz you get 16,000 samples a second, and for a mel spectrogram you take that raw data and chunk it up into frames. If you've ever looked at a waterfall representation of sound, you were looking at a spectrogram of some sort. In this visualization you can see that it makes sense: for a jackhammer you can see the rhythm, and how each hit floods all the frequencies. You can see wind noise as a low, dull rumble; a milling machine sits a bit higher than engine sound; and talking is the interesting one, because it's not uniform. That non-uniformity really matters for a voice assistant, because voice is all about phonemes, the parts of speech that carry intelligibility. Think phonics, sonic reading: that's basically what the recognizer is detecting, those parts of speech, not single letters. Speech recognition doesn't go from audio straight to text; it goes through these phonemes. And one interesting thing about phonemes is that they change over time; there's actually an entire field devoted to
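The framing step I just described can be sketched in a few lines. The 25 ms frame and 10 ms hop are common speech front-end defaults, and I'm leaving out the mel filterbank that would normally follow, so this is a plain magnitude spectrogram, not a true mel spectrogram.

```python
import numpy as np

SAMPLE_RATE = 16_000
FRAME_LEN = 400   # 25 ms of audio per frame
HOP = 160         # 10 ms hop between frames

def spectrogram(samples: np.ndarray) -> np.ndarray:
    """Chunk raw samples into overlapping windowed frames, FFT each one.

    Returns an array of shape (num_frames, FRAME_LEN // 2 + 1): one row
    per time step, one column per frequency bin -- the waterfall view.
    """
    n_frames = 1 + (len(samples) - FRAME_LEN) // HOP
    window = np.hanning(FRAME_LEN)
    frames = np.stack([
        samples[i * HOP : i * HOP + FRAME_LEN] * window
        for i in range(n_frames)
    ])
    return np.abs(np.fft.rfft(frames, axis=1))

# One second of a steady 1 kHz tone: every frame peaks in the same bin,
# the uniform horizontal stripe you'd see for a machine-like sound.
t = np.arange(SAMPLE_RATE) / SAMPLE_RATE
spec = spectrogram(np.sin(2 * np.pi * 1_000 * t))

peak_bins = set(spec.argmax(axis=1).tolist())
bin_width = SAMPLE_RATE / FRAME_LEN        # 40 Hz per bin
print(peak_bins, next(iter(peak_bins)) * bin_width)   # {25} 1000.0
```

Speech, unlike that tone, moves its energy between bins frame to frame, which is exactly the non-uniformity a recognizer keys on.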
studying speech patterns. Who here used Google Voice back when it was still GrandCentral? Remember that? Way back in the early 2000s, GrandCentral was one of the first hosted voice-calling and voicemail systems; you could add it to your phone plan and have it store your voicemails, and eventually it grew into Google Voice. The point is, they were collecting samples of people speaking from all over the country. The amusing thing is that I had a deeply Southern friend whose messages it always transcribed wrong. Always. Does anybody use Google Voice or anything like it today? How have you found its transcription? Better than Microsoft? Better than Microsoft. So the point is, it takes a lot of samples to understand different patterns of speech. Remember the first Dragon dictation products, where you had to read it stories for thirty minutes or an hour? That's because it was building a model of your phonemes, yours specifically. A friend of mine had one of those systems, handed it to someone else and said "here, talk to it," and it didn't understand them at all.

The other interesting thing is that the pronunciation of a word changes over time, and even day to day. One of the patents Amazon has is for detecting when you're sick: your breathing affects how you pronounce words. There are all kinds of other acoustic signals that are really interesting that I'll gloss over, but I'll just note that your body makes all kinds of noises: your gut, your heart, your lungs. All of that can feed into acoustic models that detect interesting things about you.
So not only can it detect whether you're sick, especially a respiratory illness, but the variance of your voice over time also plays in, and this has a very big security component. If I remember the story right, Amazon sent about 1,700 of one user's Alexa recordings to the wrong person. What we can deduce from that is that Amazon is continually retraining models on the stored data. We can also deduce that a given person's recordings are more than likely batched together into one big blob, and that's probably what led to the incident: the employee figured "the data is in this batch here," and the batch swept in a bunch of another customer's recordings too. Now, where the reporting goes wrong is the idea that the Alexa is continually recording and uploading. It isn't; that would take an enormous amount of bandwidth and processing. Alexas, and all these devices, cheat by listening for a keyword, and that's really the only processing done on the device itself: it continually samples the acoustic space looking for a collection of phonemes that matches the wake word. That's also why, going back a moment, Alexa Guard can only run on certain devices: you have to be able to add those extra sound models onto the hardware. Beyond that, yes, once you trigger the device it starts recording, pulling the audio stream apart, and shipping it up to the cloud, and one of the things the service then tries to do is identify the speaker.
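Here's a toy sketch of that on-device listening loop, with a made-up phoneme alphabet. Real wake-word engines score acoustic features probabilistically rather than string-matching, so treat this purely as an illustration of the "nothing leaves the device until the keyword matches" behavior.

```python
from collections import deque

# Rough phoneme sequence for "Alexa" -- an illustrative alphabet, not
# any real acoustic model's output.
WAKE_PHONEMES = ("AH", "L", "EH", "K", "S", "AH")

def wake_word_monitor(phoneme_stream):
    """Yield ('idle' | 'streaming', phoneme) as audio is consumed.

    While idle, only a rolling window the size of the wake word is kept;
    once the window matches, everything after it would go to the cloud.
    """
    window = deque(maxlen=len(WAKE_PHONEMES))
    streaming = False
    for p in phoneme_stream:
        if streaming:
            yield "streaming", p          # now recording upstream
            continue
        window.append(p)
        if tuple(window) == WAKE_PHONEMES:
            streaming = True              # wake word matched
        yield "idle", p

stream = ["HH", "AY", "AH", "L", "EH", "K", "S", "AH", "P", "L", "EY"]
states = [s for s, _ in wake_word_monitor(stream)]
print(states.count("streaming"))   # -> 3: only audio after the trigger
```

Note that everything before the match, the "HH AY" chit-chat, never leaves the rolling buffer, which is the property Amazon leans on when it says the device isn't always recording.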
There's actually a paper that came out recently where, once an Alexa triggers, it tries to anchor itself to the voice that triggered it. If your family has ever played with an Alexa, you know the pattern: one person triggers it to change the song, and before they've thought the request all the way through, another kid slips right in with their own song. That's the sort of thing the Alexa developers are working on; not that problem specifically, but making it more responsive to the person who actually triggered it.

Going back to those recordings being sent to the wrong person: one way to guard against this class of problem is to flood the space with masking sound. This actually helps your devices operate better overall. I was on a conference call recently with a headset near an AC unit, and after going through this material a few times I realized the headset was probably operating better precisely because the AC was flooding out the background: it blocks the other noise from reaching the model. How many of you have ever studied noise cancellation? Effectively, what noise cancellation does is generate sound to create an acoustic barrier for the operator.

So that's my pitch for waterfalls, and I want to throw this out as part of audio security: we have the concept of firewalls in networking, and I would argue the acoustic equivalent, a waterfall for your audio security, would be wonderful. And yes, this is a reference to TLC; these images say go chasing waterfalls. Noise helps isolate sound. One interesting thing I found when I started moving into the audio space: silence and sound are basically the same thing, it's just the intensity that differs. If you lower the gain far enough, you're getting static either way. Silence is a very subjective concept.
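The cancellation-versus-masking distinction is easy to show numerically. Under ideal assumptions, a phase-inverted copy of the noise sums to exact silence, while masking just buries the noise in a louder floor; the hum frequency and amplitudes here are arbitrary.

```python
import numpy as np

SAMPLE_RATE = 16_000
t = np.arange(SAMPLE_RATE) / SAMPLE_RATE

hum = 0.8 * np.sin(2 * np.pi * 120 * t)   # an AC-unit style drone

# Active noise cancellation: emit the same wave 180 degrees out of
# phase; the two pressure waves sum to (ideally) nothing.
anti_hum = -hum
residual = hum + anti_hum
print(np.abs(residual).max())             # -> 0.0

# Sound masking ("flooding"): the hum is still present, just buried in
# a broadband noise floor so it no longer stands out to a listener or
# a model trying to latch onto it.
rng = np.random.default_rng(0)
masked = hum + rng.normal(0.0, 0.8, hum.size)
print(np.abs(masked).max() > np.abs(hum).max())   # True: louder overall
```

Real cancellation is far harder than this sketch, since the anti-noise has to be predicted and aligned in real time, but the arithmetic is exactly this.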
Microsoft has a soundproof room that's one of the most acoustically dead rooms ever built; its ambient noise level measures in the negative decibels.
Anyway: what if we took the concept of a waterfall and applied it to an Alexa? We'd basically have an Alexa parasite, and this exists; it's called Project Alias, and you can find the exported build files. Inside this hat there's a Raspberry Pi and two speakers. The speakers continually flood the microphones of the device underneath, and the Raspberry Pi acts as a buffer: I can program it to listen for a different set of acoustic triggers, and on a match it stops flooding the device, plays the real wake word, and passes through whatever I wanted to do. Most people aren't going to pay an additional fifty or so dollars in parts and labor, plus the soldering, to do this, but I present it to give you an idea of one way we can go about securing these devices.

Alongside all this acoustic work, there's the question of figuring out who is speaking to the device. Some of it we can tell easily, just mathematically: I showed the pitch ranges between male and female voices earlier, and there's a distinct pitch range for children too, so "this is probably a child" is easy to call from pitch alone. What's much more difficult is pinning down the identity of the person speaking to the device.
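Here's a minimal sketch of that easy pitch-based grouping, using autocorrelation on synthetic tones. The cutoff frequencies are rough illustrative values, and a real system would be working with noisy speech, not pure sine waves.

```python
import numpy as np

SAMPLE_RATE = 16_000

def estimate_pitch(samples: np.ndarray) -> float:
    """Estimate the fundamental frequency from the autocorrelation peak."""
    corr = np.correlate(samples, samples, mode="full")[samples.size - 1:]
    lo = SAMPLE_RATE // 400   # ignore lags above 400 Hz...
    hi = SAMPLE_RATE // 60    # ...and below 60 Hz, outside voice range
    lag = lo + int(np.argmax(corr[lo:hi]))
    return SAMPLE_RATE / lag

def classify(pitch_hz: float) -> str:
    # Rough cutoffs for the "easy" grouping -- this is not identity.
    if pitch_hz < 160:
        return "adult male"
    if pitch_hz < 255:
        return "adult female"
    return "child"

t = np.arange(4_000) / SAMPLE_RATE        # a quarter second of audio
labels = [
    classify(estimate_pitch(np.sin(2 * np.pi * f0 * t)))
    for f0 in (110, 210, 300)
]
print(labels)   # -> ['adult male', 'adult female', 'child']
```

That coarse bucketing is about as far as raw pitch gets you, which sets up the next point.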
Now I'm going to make an assertion here: there is not enough data in the human voice range to give us enough certainty to guard against collisions. What that means in English is that any voice unlocking is inherently insecure.
People bring up training the model very tightly on your particular voice, but even then, I'll argue it's spoofable; we'll get to that. This section really applies to biometric security in general. It all comes down to how much input data you can scan, and the ugly secret of biometric security is that it's all doing the same sort of signal differentiation and pattern matching. It's all fuzzy, and it's all lossy, so there's always going to be a much higher statistical chance of a collision than if you were presenting, say, a public key.
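To put some back-of-envelope numbers on "not enough bits": with n bits of usable entropy, the birthday bound says collisions become likely around 2^(n/2) enrolled templates. The ~20-bit figure for a fuzzy voiceprint is my illustrative guess, not a measured value; the contrast with a real key is the point.

```python
import math

def collision_prob(n_bits: float, users: int) -> float:
    """Approximate P(at least one collision) among `users` templates,
    using the standard birthday-problem exponential approximation."""
    space = 2.0 ** n_bits
    return 1.0 - math.exp(-users * (users - 1) / (2.0 * space))

# Suppose lossy feature extraction leaves ~20 effective bits (assumed).
print(round(collision_prob(20, 2_000), 2))    # -> 0.85: collisions likely
# A 128-bit secret key over the same 2,000-user population:
print(collision_prob(128, 2_000) < 1e-30)     # -> True: effectively zero
```

Whatever the true entropy of a voiceprint is, it sits orders of magnitude below even a modest cryptographic key, which is why the fuzzy matcher has to accept collisions.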
The most an Alexa can actually do is determine that there are different speakers in the room. It can't even determine that you today are the same you as tomorrow; even in the course of one conversation it will decide something changed and that you're now a different person altogether. To deal with that, and because the companies behind these devices want to make them responsive, the tolerances have shifted. It used to be that we could talk to the Alexa in the car just naturally, and trigger it from across the room, which is really a great capability but also really abusable. Going back to the recordings problem, Alexas have gotten a lot stricter about what they accept as an input range; basically, we now have to yell at the Alexa in the car.
So, back to speaker identification. Remember the line from Sneakers: "My voice is my passport." Banks loved that idea. Mine has been collecting voice samples of interest from customers for something like ten years; I've had to opt out several times. The interesting thing is that voice identification had already been demonstrated to be inaccurate and lossy, and the bank rolled it out anyway. What makes this even more terrifying is that in researching this, I've come to the conclusion that voice identification is a problem that hasn't been solved, and yet my bank insists the voice it hears is my passport. "Of course it's secure," they say; surely they've worked out the other bits. And just in case you thought my bank was unique: HSBC thought this was a great idea too. HSBC had this big campaign about how accounts couldn't be broken into, and a reporter over in the UK showed how easy it was to defeat: his twin, and not even an identical twin, got into his account by imitating his voice. People who know you can do this really well. We've got a guy in our office who, if he sits with somebody for a week or so, picks up their mannerisms, their verbal tics, all of it. Anybody who has children knows how they mock each other. The point is, it's not that hard, and people can do it live; you don't even need advanced technology. Nature does this too: mockingbirds, and the lyrebird, which is famous for mimicking all kinds of acoustic sounds around it. Something out there can sound exactly like me, more than once.
That's my point: there are only so many bits. So voice shouldn't be used as a single factor. If you want to use it as part of a multi-factor scheme, maybe; as a single factor, no. Other applications of voice unlocking run into the same thing. I recently switched from an iPhone to a Pixel, and I saw that Google actually allowed voice unlocking; it was one of the few phones that did. At least Google had the wherewithal to warn you when you turn it on that you're making your device less secure: just so you know. And in fact, within roughly the last month, they've been walking back voice unlock and that capability altogether; nothing else was going to allow that to continue.

So that's the first machine learning system, getting from voice to text. On to the second one. How are we doing on time? Okay. The second part is natural language processing and taxonomy. Here's a basic breakdown of what happens when you talk to one of these. When we first got ours, people would just talk to it like an old friend, rambling at it. No: it's a program, and the people doing the natural language processing don't have unlimited flexibility. You have to think about what you want to tell it, and then tell it in the expected format.
The format is: the wake word, "Alexa"; then the action verb, "tell" or "play"; then the invocation name of the skill; and then the variables. That's it. That's the form. If you've never had trouble dealing with your Alexa devices, it's probably because you're naturally telling it what it needs, in the order it needs it, in the format it needs. Even with that constraint, Alexas, and Google's assistant too, have a pretty good range and are pretty good at working out what you meant, your intent; that's all natural language processing. One of the security implications is that the companies keep a really tight grip on the taxonomy of the voice assistant. The taxonomy is the classification of this whole utterance stream, plus the invocation names. What they also do is cheat inside that taxonomy to craft a better experience. Say "Alexa, play" and it starts playing a song: it inferred that whoever said "play" must mean music. Search engines do the same thing; they know to switch you to a different search vertical, so if you're searching for car parts, it doesn't just run a plain term search, it drops you into the parts-shopping taxonomy.
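The rigid shape of that format can be sketched with a hypothetical grammar. Real Alexa slot-filling is statistical, not a regex, and the skill and slot names here are invented, but the wake word / verb / invocation / variables structure is the point.

```python
import re

# Hypothetical utterance grammar: wake word, action verb, invocation
# name, then free-form slots introduced by "to" or "for".
UTTERANCE = re.compile(
    r"^(?P<wake>alexa)[,]?\s+"
    r"(?P<verb>tell|ask|play|open)\s+"
    r"(?P<invocation>[\w ]+?)"
    r"(?:\s+(?:to|for)\s+(?P<slots>.+))?$",
    re.IGNORECASE,
)

def parse(utterance: str) -> dict:
    m = UTTERANCE.match(utterance.strip())
    if not m:
        return {"intent": "fallback"}   # NLP has to guess from here
    return {k: v for k, v in m.groupdict().items() if v}

print(parse("Alexa, tell daily horoscope to read me Taurus"))
# -> {'wake': 'Alexa', 'verb': 'tell', 'invocation': 'daily horoscope',
#     'slots': 'read me Taurus'}
print(parse("turn it up"))   # -> {'intent': 'fallback'}
```

Everything that falls outside the grammar lands in the fallback bucket, which is exactly where the platform's "cheats" (guessing you meant music) kick in.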
Part of the implication with these assistants is that we don't own any of that. You don't know what happens to what you say, which is a huge problem, but you also have no verification of the app you've just triggered. Now think about what that means for children talking to Alexas, asking it to play something or set timers, or my smartass kids and the things they start asking it.

There's also no sense of context. Session context is an area of active research in voice systems, but for now there's no session at all. Think about that: we're effectively in the pre-web era here; there's not even the equivalent of a web session. Do you know what the security implication of that is? This is a great time for me to point out the book I'd give to the first person who can tell me the security implication of having no session.
Right: because the session was already open. That's how that website attack works, and that's how the Google Docs one worked too. There's no verification, and they didn't layer in anything like a second factor after the fact: if you have authorized your device to do something, anyone who can talk to it can do that thing, and that's abusable. When I went up to Seattle a while back, there were hotels with Alexas in all the rooms. Sounds pretty cool, right, for somebody who likes the promise of voice systems: "turn the lights off." But who can turn off the lights in my room? Anyone who asks. That's the session problem: every time you talk to the Alexa, every command stands entirely by itself.

Does anybody remember the case of the Alexa that recorded a family's conversation and sent it out? What happened goes back to sound events and this parsing chain. The Alexa thought it heard its wake word, then it thought it heard a "send message" command, then it thought it heard a contact's name, and then it thought it heard a confirmation. Those pieces lined up, all natural language processing, so it did what it was built to do: it opened an audio channel, recorded, and sent the result over to that contact. The Alexa parsed it all out exactly as designed. So what was the response?
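That accidental-send chain can be sketched as a tiny state machine. The stages and trigger phrases are illustrative, not Amazon's actual intent model; the point is that each stage pattern-matches independently, so a noisy conversation can satisfy all of them in order.

```python
# Each stage just needs its trigger phrase to appear somewhere in a
# heard fragment -- there is no session, speaker check, or global sanity
# check tying the stages together.
PIPELINE = [
    ("wake_word", "alexa"),
    ("intent",    "send message"),
    ("recipient", "contact name"),
    ("confirm",   "right"),
]

def run_pipeline(heard):
    stage = 0
    for fragment in heard:
        if stage < len(PIPELINE) and PIPELINE[stage][1] in fragment:
            stage += 1
    return "audio sent" if stage == len(PIPELINE) else "idle"

# A conversation that merely *sounds like* each trigger, in order:
overheard = [
    "something like alexa in the background",
    "we should send message later",      # misheard as the intent
    "that contact name again",           # misheard as a recipient
    "right, exactly",                    # misheard as confirmation
]
print(run_pipeline(overheard))   # -> audio sent
```

Four independent false positives in a row sounds unlikely, but over millions of always-listening devices it was only a matter of time.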
The response was to lock it down. You were probably going to pitch me in a moment about the Alexa being so responsive; well, they just locked it down. That's why, every time I talk to my Alexa in the car now, I have to be exact. Remember this case? It's because of incidents like that one that I now have to use the full rigid phrasing. So what I'll leave you with is this: there are still some really big challenges to solve when it comes to voice. Audience and identity: there are not enough bits of information in a voice for you to reliably identify yourself, much less authorize a particular person. And that feeds into privilege separation: Alexas with long-running session tokens are effectively just always authorized.

Okay, we've got about ten minutes, so let's take questions.
My estimation is that the models are tied really closely to you, because they need that data: they need to know that you're male, that you probably live around Kentucky, all those bits of information, to train their models. So yes, they have identifiable information on each one of the voices. In fact, there are two open murder cases right now in discovery where we know there was an Alexa present and we know it was triggered at some point, and those two pieces are part of what the prosecution wants. So far, as far as I know, Amazon has been holding the line of "we're not going to make it easy for you to get this data." Then there's the other part, the control part: based on recent articles, the Amazon employees reviewing recordings have been trading them back and forth in their Slack rooms.
Yes, yes.
That hasn't been solved, and that's the reason so much data is required. In other versions of this talk I go into do-it-yourself voice assistants. Since we're at a hacker conference, and especially since voice assistants are everywhere, I would love for someone to build their own voice system that's actually secure. The trade-off is that it will be closely tied to whatever narrow subject, and whichever voices, you train it on. That rich stored data is exactly what lets me walk up to a kiosk and have it understand me. So it goes back to the waterfall idea: pay attention to what you're saying around these assistants.
Yeah, so the reason that happens is what's called the far-field problem: the farther you are from the device, the worse the audio comes in, and the looser the model has to be. Music detection systems, the song-tagging services, work the same way: they sample the space, they have certain pieces that need to hit, and they have to assume there will be dropouts, like when you're trying to tag a song and some guy is talking over it. The same thing happens with far-field activation. The reason I have to shout at it is that up close, the tolerances for dropouts are a lot smaller; it wants a clear, direct signal before it activates, so you end up filling the car with yelling at an Alexa. Where you'll see a lot of improvement is when we start having more near-field setups: a headset right at the sound source that you can fine-tune to one person, say at a concert. But that doesn't change the activation problem in the general sense. And I'm not aware of any consumer device that lets you set a free, open-source wake word of your own. That's part of the protection: if they let people just name the device whatever they wanted, a lot of these tricks would go away, but it would increase the cost of the device itself.
[Inaudible audience question.] They're surprisingly... [inaudible]. Right, and that comes back to the far-field problem again. Okay, I think that's my time. Thank you.