Deobfuscating JavaScript malware

Name: Deobfuscating JavaScript malware
Uploaded: 2019-05-29
Duration: 22 min 27 s
Description: BSides Canberra 2019 Adrian Herrera - "Deobfuscating JavaScript malware"

BSides Canberra · 201922:27450 viewsPublished 2019-05Watch on YouTube ↗

Speakers

Adrian Herrera

Tags

CategoryTechnical

StyleTalk

About this talk

BSides Canberra 2019 Adrian Herrera - "Deobfuscating JavaScript malware"

Show transcript [en]

right now we have a great speaker a Canberra local we have Adrian Herrera talking about diab for skating JavaScript malware so let's welcome him to the stage

boom thank you thank you for sticking around and I'll try not to get it into too much of the lunch break weave my talk alright so yes Silvie I said I'm here to talk about JavaScript and specifically deal for skating ugly JavaScript malware all right so a little bit about myself so I wear two kind of hats in my day job one hat I'm a researcher with the defense Science and Technology Group here in Canberra and my other hat is my academic hat and I've recently started a PhD at the ANU also here in Canberra and my my manga research is around reverse engineering and applying what science program analysis to reverse engineering problems and today is one example of that alright

so I'm not a JavaScript dev at all in fact I kind of just learned how to write JavaScript this week when I was preparing for this talk but I'm told there's very widely used and there are lots of different frameworks out there lots of different you know tough forms and tools and things for JavaScript and that's kind of what's happened let's bring that back up

yes all right we're back that's not really the focus of today's talk what I'm more interested into talking about is this kind of JavaScript which is a malware sample that does some malicious stuff and has been heavily obfuscated by the author to make the reverse engineering process by the malware and it'll analyst really hard so I've kind of cut the sample off a bit to display it on the slide but if we look at the full thing here believe make this a bit bigger this one this one

right so this is their original sample here and yeah it's pretty hard to make sense of it keeps going on and on and on and our aim is to apply static analysis to this kind of malware - you know strip it back to its bare minimum and figure out what it's doing so yeah so the ultimate aim is to make this kind of thing more readable to learn our analyst so why do we want to do this well if you're at mirror analyst you already kind of know that obfuscation you know hinders your analysis and makes it harder makes it more time-consuming it makes it just life a lot tougher and this is not just true of like JavaScript

and scripting languages but you know you can you can obfuscate binaries and you can of the scape scripting languages other than Java Script so this is I believe vbscript it's pretty tricky to read it's not really clear what it's doing I guess other than the name start exploit maybe gives you some indication but other than that it's kind of tricky to understand what it's doing and then of course there's JavaScript as well so you know if I'm malware or thought and I want to make it hard for the analyst I would apply all these different kinds of escaping techniques and we want to kind of undo all of that work all right so the approach that I took when

building a tool for Java skating JavaScript was to borrow a bunch of ideas from compiler theory and and apply them to the deification process and so a bit different to compiler theory you know instead with compilers we're trying to generate executable code at the end of it here we're trying to essentially get more readable source code instead and so what else are our goals so I I've got here you know focus on semantic transformations and what I mean by this is that we can you know there are a bunch of tools already out there that allow us to preprint source code so one of them for javascript is uglify yes which is the one I kind of accidentally

did before and so you know by pretty printing I mean inserting white space indentation and new lines to make it you know more easier to read but it's still kind of hard to figure out what it doing there's still a lot of curried there to try and to try and get an understanding of so we're not really interested in that what we're interested in is doing some more I guess interesting transformations to to make this more very readable and we want to reuse a bunch of existing tools all right so for those who haven't studied compilers before or for those who have maybe you've seen a similar kind of diagram to this in a compilers course

but this is a typical compiler workflow it starts with the lexer or the tokenizer which takes your source code breaks it up into tokens and assign some products and kind of category to those tokens so maybe you've got you know tokens for white space you've got tokens for keywords you've got tokens for variable names etc and then that stream of tokens goes into the parser and the parser there's some more kind of analysis on that and generates a par 3 based on some kind of grammar and then from the pass 3 we can strip away the uninteresting things like whitespace for example and we get an AST or an abstract syntax tree which I mentioned in his

first talk of today and then on that ast we can apply various optimizations and for a compiler the optimizations the name of those optimizations is to produce either smaller code or faster code and so there a bunch of different tricks the compilers used to do this and then after we've done that we want to generate executable machine code so that's kind of a typical compiler workflow now how can we apply this to the office Kishin well it's pretty similar so here's another diagram we've got three of the same four elements that we saw in the previous slide so we still got a lexer we still got a parser instead of a code generator at the end

we just want to generate source code back out again and now instead of optimizing for speed or size we kind of want to optimize in a sense for readability so this is essentially what deification is and the nice thing about you reusing this kind of workflow is there are already a bunch of existing tools that give us parsers and lectures and generators for various kind of languages including Java scripts all right so I kind of touched on this before but the lecture turns the characters into words and then the pazzo is kind of the interesting thing so we take that extreme of words we generate a par 3 according to some grammar and then we

discard a bunch of uninteresting things to get the ast and then what we want to do is we want to apply some sort of analysis and some sort of transformations on the ast to do a deal for skating all right so the tool that I based my work on is this one it's called safe it's out of a research group in South Korea it's essentially a JavaScript analysis framework it was not really made for the purposes of security but just for a generic analysis of JavaScript code and it already gives us the things that we kind of need so it gives us the lexer gives us the POSIX generator it also gives us a bunch of

its intermediate representations so we can go do different kinds of analyses on top of those ir's but in the end I didn't end up using that kind of stuff all right so this is a snippet of the sample we saw at the beginning again chosen so I can fit it on the single slide and this has been pretty printed with a graph ideas and so what we can do is we can run safe over this to generate an ast and kind of do some analysis on that ast all right so this is what the ast would look like if we were to visualize it as a tree because that's all it is and I've got it here as well

interactively and so we can kind of look at the source code here this is the same source code on the left and the ast is on the right and we can kind of match up the different elements in the source code to the nodes in the tree so you know at the top level we have our entire program our program is made up of a number of source elements these are two variables which are these two here plus this if statement here and we can you know continue expanding out different elements of the tree so this variable is this one here and it's been assigned the undefined value and so on all right so let's look at actually I

guess the meat of this work and that is how we build a deal for scalar to to do things to this ast so if you're familiar with compilers again many of these kind of terms may look familiar so these are typical compiler optimization techniques and it turns out that we can reuse a bunch of them to make code easier to read as well so I'm not going to talk about all of them today I'll just talk about a subset of them because of time which is the three highlighted here and what we essentially do in our D obfuscator is we keep running and applying these different transformations over the top of our ast until the tree

stops changing or we've reached a fixed point and at that point we can kind of serialize the tree back to source code and hopefully at the end of it we've got more readable code all right so let's start with the simplest optimization which is constant folding and so a constant folding does is it looks for constant expressions in code and then just evaluates them statically rather than waiting for the code to run and to evaluate at runtime so if we go back to our example here and we change this one so if we look at our code here we can hopefully you can see a bunch of places where we can kind of pre evaluate what

the result will be without actually having to run the code so for example in this if expression here we've got these three strings concatenated together and but what we can do is we can just pre compute this rather than doing this at runtime and just get a single stream which is easier to read and the other location is this one here so we have these two integers added together and it once again we can just combine them and spit out new code with the complaint or the added value of those two numbers so if we look at the tree and figure out how we're going to do this so if we look at this if statement here

what that looks like in a s2 form we can see the tree in the if statement is made up of these two sub trees I guess which consists of two operations which are the addition or the concatenation I guess because they're strings and then we can see on either side of that operation we have two string literals and so if we if we recognize that pattern while we're walking the tree then we can essentially just perform that operation and replace this subtree here consisting of these three nodes with a single node which is the concatenation of those two strings all right so the slides kind of just repeat what I just said so here's the tree again and so you know

we're running our D office Gator it's walking the tree it sees this kind of pattern and it says well they're just two string literals I can pre-compute the final value of that and then we get undefined what happens is we get a new subtree and once again we can pre-compute the result of that and we get a simplified string so very simple technique but it makes things slightly more readable so it's not just strings that we can do this to we can do this for integers so again in the code we had a hundred plus eighty eight so we can pre-compute that and get 188 and because java scripts a terrible language and lets us do very terrible

things we can have all sorts of different operations on different types of data so for the top one here we have a string and then int and so what happens if you're a JavaScript person you probably know that the the int becomes a string and then we get a string at the end of it what about this one here if we have ten - true so true becomes an integer value 1 and then we have 10 minus 1 so we can replace that with 9 not 1 is 0 times that by float get 0 and this one here is the same as the top one just the types of reverse but it still we end up

with a string and so we can encode all these rules in the obfuscator when we walk the tree and we just look for those that contain these types and you know we look up our rules and we say oh okay we can just replace this node with a simpler node and to do this we kind of have to understand what these operations mean so you know and this in this one this operator here means plus but for Strings that means concatenation and so this kind of requires I guess my point here is that it requires understanding of the semantics or what it actually means when you see this operator in your code all right so going back to our example so we

run that and we get these collapsed strings here all right so how do we implement it I guess I've already described this we start with we write a bunch of rules we start with our root node in the tree we walk the tree and we just apply the rules to get simplified nodes and then we just keep recursing until again we reach a fixed point all right so that's constant folding a slightly different technique is constant propagation and so a constant folding looks for two literals with some operation applied to them constant propagation is that we know that a variable can only ever hold one value and therefore we can just replace every use of that variable with that value so

here's our code again and if we look closely we can find some instances of where we can do this propagation which are here so in this function at the top we have this variable here which is set to null and then the only time that variable is ever used is this just return from that function and so we are we can kind of wreck it again recognize this pattern in the tree and just remove this variable and have the function just return null similarly here this variable here is only used is only written two up here it's set to undefined and it's not written to anywhere else in the code and so once again we can just replace that

variable or to get all right so this is the substitution step so we substitute those variables values where they're used and then we can just delete those unused variables so once again we've removed a bunch of code with the aim of you know getting easier code to read and the the other benefit of this is that we can now start to see where we can reapply some of our the previous technique constant folding because we've now got a simpler AST so for example here we can make a rule in our constant folder to say that the type of undefined is the string undefined and so we can replace that and then we can have another rule that says if we have

two strings on top of equality then we can just collapse that to true and then if we have you know if true then we know that that will always be the case so we can just remove the if condition altogether so you know by building this simple primitive we can you know keep applying them and stripping away the code to the bare minimum all right and then we delete darn use variables all right so constant propagation is a pretty easy idea to keep your head around the implementation compared to constant folding is a bit more trickier I have to pass over the ast a number of times one to propagate the constants and then go back and look for redundant

assignments and remove them and because I'm a program analysis nerd this is implemented as an abstract interpretation which I won't go into the details of that today all right and so the third trick that I wanted to talk about is function inlining so this is another pretty easy one it's kind of obvious when you look at it so now we have this function here it doesn't do anything except for return null and so we can just replace any calls to this function with the value that it returns which we know is always going to be no and then we can just delete that functional together all right so it's called here in a switch statement so and

we know it's always going to be null so we can expand it to that value that it returns and then we can just remove this function we it's no longer called anywhere in the program and so once again we've got even less code to look at and again we can apply other techniques so you know if we're switching on null or on a constant value than any of the other case cases in our switch statement they can never execute and so we can just delete them all together because they're essentially meaningless they're all dead code all right so function inlining again kind of easy to implement you just find what you know you define what it means to be inlinable so for me

I guess I took the simplest path which is if a function I need contains a single return statement and that return statement only returns a literal expression then we consider that to be inlinable and then so once we've collected all those functions again we can just walk the tree and find where those functions are called and just we'll replace it with the value that it gets with that it returns so if we go back to our example where's my tree gone all right doesn't matter and then we keep it closing until again until we reach the fix point all right so I've talked about three kind of easy compiler optimizations I've yet there are a bunch

more we can also talk about but essentially that aim is just to remove as much dead code as possible and by applying different ones they enable us to remove dead code that appears with another kind of optimization alright so let's go back to our malware that we were looking at before which is this one and so yeah so we can see that there's a lot of code here there's a lot of functions that just return you know single values so we can apply our function inlining to them they're a bunch of variables here that I only get assigned a single value so we can propagate those values throughout the code and we keep doing different things

until we can't go any further so let's run the tool on top of that code so yeah so safe is the tool and I've just added a deal for skate sub command to it and then we give it you know the javascript file that we want and the output file that we wanted to go to and then we run it and then we get you know it terminates hopefully and then we can look at it and see what we get so now we had 460 odd lines of code before and now we've reduced it down to 11 so we've done a bunch of extra things here so I've renamed variables to make them more memorable so I just picked animal names

so that's why we've got these funky animals all stare at the code but now hopefully we've got JavaScript that is much easier to understand what it's doing because we've stripped away or the craft that really had no effect on the on the mail at all all right so this is the same kind of slides here so we started with 680 lines of code once it was when it was pretty printed and then once we jump to skate it we reduce it down to 11 lines of code and congratulations we feel that our analysts now you have off the skated POW shell inside it that you have to deal for skate but again you know these these

techniques are kind of generic and you could in theory you know build a static analysis for the powershell and do the exact same thing so this sample just happened to have PowerShell in it but you know I've tried this on a bunch of other samples and it has a similar effect it deletes lots of code and you kind of makes it much easier to read all right cool so by looking at IDs from compilers we can build some pretty powerful tools such as adding obfuscator this idea is not really that now but it's kind of effective I guess from malware analyst and you know it's applicable across different kind of languages not just JavaScript so as long

as you can generate as long as you have a parser for a language that gives you AST that you can walk and transform then you can essentially apply these techniques to that ast to reduce it down to something more meaningful and yeah I guess that's a summary that's the end of my talk thank you very much you

Deobfuscating JavaScript malware

Related talks