
A Glance at Interpreted Language Bytecode Trickery

BSides Tampa · 2021 · 41:21 · 47 views · Published 2021-04
About this talk
Chris Lyne: A Glance at Interpreted Language Bytecode Trickery

When performing 0day vulnerability research, targeting unfamiliar products, an initial goal is to gain an understanding of the product functionality as best as possible. This helps the researcher to map out where the weak points in the product might be and quickly identify where to prioritize research. Ideally, the researcher would like to get his or her eyes on the underlying source code. One specific technique I encountered was in a product consisting entirely of compiled Python 2.7. My attempts to decompile the Python were unsuccessful, and I didn't know why. This turned into an opportunity for me to dig into the inner workings of Python to uncover why decompilation was not working. Prior to exploring these protection mechanisms, I was completely unfamiliar with what was going on under the hood of both Python and PHP. In this talk, I would like to share some of my key learnings. We will start from the ground up, discussing how interpreted languages work, what bytecode is, and finally, we will look at the protection mechanisms in more detail and how I was able to bypass them. I hope that this talk can give other researchers a leg up when they are faced with a similar protection mechanism down the road. If you enjoy reverse engineering, security research, CTFs, and/or programming in general, this talk may be for you!

WEB: https://www.bsidestampa.net
DISCORD: https://discord.gg/FhdkSNa24P
TWITTER: https://twitter.com/bsidestampa
MERCH: https://bsides-tampa.launchcart.store/

About BSides Tampa: B-Sides Tampa is an Information Technology Security Conference hosted by the Tampa Bay Chapter of (ISC)², a registered 501(c)(3) non-profit organization. The purpose of B-Sides Tampa is to provide an open platform for Information Security industry professionals to collaborate, exchange ideas and develop long-standing relationships with others in the community.

The B-Sides Tampa IT Security Conference took place virtually on March 27th, 2021.
Transcript [en]

Okay, is my audio coming through right now? Chris, can you hear me?

Just to verify here, I can hear you. Can you hear me? Okay, perfect. Thanks, Mike. We will go ahead and get started. So thank you all for coming out to my talk. As the title says, today we'll be taking a glance at interpreted language bytecode trickery. My name is Chris Lyne. I'm a senior researcher at Tenable on the Zero Day Research team, and the team's goal is to find unknown vulnerabilities in third-party products. We also coordinate disclosure with product vendors when we find vulnerabilities, and, just like we're doing today, we also share our research with the community. Some background for this talk: there were two research projects that kind of sparked the content for

this. I was looking at Druva inSync, which is an endpoint backup software, and I was looking at Nagios XI, which is an enterprise server network monitoring software. As you see, Druva was written in Python and Nagios is written in PHP. So the idea initially is to get your eyes on the source code and try to make sense of how the product works under the hood. Hopefully we can understand it well enough to find some vulnerabilities in it. The problem was that both of these products had implemented some form of code protection, so I wasn't able to read the source code. Some topics we'll cover today are interpreted languages, PHP and Python. We'll look at bytecode,

we'll also look at some bytecode protections, and also how I went about bypassing them to understand application logic. So first we'll talk about compiled versus interpreted languages. As you see on the left here, some examples of compiled languages are C, C++, or Go. Basically, a compiled language requires the programmer to explicitly compile the source code into an executable format, which is machine code: it runs on the operating system, which runs on the processor architecture. Today we'll be talking about interpreted languages, which are a little different. The programmer doesn't have to explicitly compile the source code; however, there is an interpreter program that will interpret the source code and convert it into

bytecode format, which can then be executed by a virtual machine. For instance, the Python virtual machine will execute Python bytecode. So first we'll talk about the Python opcode remapping, which I saw in Druva inSync. When I first started researching Druva, I realized that it was built with py2exe. What led me to believe this is that, if you look at the Procmon output at the top here, you can see that when the inSync.exe process launches, it loads up a Python 2.7 DLL and it also loads up a library.zip. If you were to look in the library.zip, you would see a bunch of .pyc files, and these are all Python modules. This

behavior is indicative of py2exe. What py2exe does is allow the Python application developer to distribute the final product as a Windows executable, so there's a Python interpreter that's delivered with it, along with all of the required modules. You might be asking, what is a .pyc file? It's a byte-compiled Python file, and a lot of times modules are converted into that byte-compiled format in order to help speed up the load time: when you go to import a bunch of modules into your code, it loads faster because it's already compiled. However, if you were to try to open up that .pyc in a text editor,

you wouldn't be reading source code like you would in a regular Python source file. It's a binary format. So in order to get that Python source code back, you would have to use a decompiler such as uncompyle6. In the screenshot below, you can see I took the Python struct module, ran it through uncompyle6, and decompiled that .pyc file back into source code. It's pretty simple; there are just three imports. The protection I encountered was that I wasn't able to decompile the Druva inSync modules that were delivered with it. When I tried to run uncompyle6 on the struct module that was packaged with Druva,

I got an interesting error. It said unknown magic number 62216 in the file. So I figured there was something funky about the file, and that led me to dig deeper into the .pyc file format. In order to dig into it, I started with a hello world. The Python script was just one print statement. I used the py_compile module to create a .pyc from that source file, and below you'll see a hex dump of the resulting .pyc file. The first four bytes are what's called the magic string, the second four bytes are the timestamp of when the file was created, and the remainder of the bytes are a code object.
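A minimal sketch of the header layout just described, written for this transcript rather than taken from the talk. It assumes the Python 2.7 layout (a 4-byte magic string whose first two bytes hold the magic number, then a 4-byte timestamp); newer Python versions use a longer header. The function names are illustrative, and the table lists the magic numbers documented for the Python 2.7 alpha series, which does not include 62216.

```python
import struct

# Magic numbers documented for the Python 2.7 alpha series (illustrative table).
KNOWN_27_MAGIC = {62171, 62181, 62191, 62201, 62211}


def parse_pyc27_header(data):
    """Split a Python 2.7-style .pyc into (magic_number, timestamp, code_bytes)."""
    if len(data) < 8:
        raise ValueError("too short to be a .pyc file")
    # Bytes 0-1: magic number, little-endian. Bytes 2-3: carriage return, line feed.
    magic_number, crlf = struct.unpack("<H2s", data[:4])
    if crlf != b"\r\n":
        raise ValueError("magic string does not end in CR LF")
    # Bytes 4-7: timestamp as a little-endian 32-bit integer.
    (timestamp,) = struct.unpack("<I", data[4:8])
    # Everything after byte 8 is the marshalled code object.
    return magic_number, timestamp, data[8:]


def is_known_magic(magic_number):
    """True if a decompiler relying on the table above would recognise the file."""
    return magic_number in KNOWN_27_MAGIC
```

A decompiler that relies on such a table has no entry for 62216, which is consistent with the error shown in the talk.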

We'll talk a little bit more about the magic string and the code object. As you just saw, the magic string contains a number called the magic number, and that represents which Python version compiled it. We saw that 62216 was an unknown magic number. If you were to look in this list, which is a documented list of magic numbers, you don't see 62216, so that kind of explains why the decompiler didn't know what to do with the file. But if you notice, there's a 62211, a 62201, a 62191. These are all magic numbers that line up with Python 2.7a0,

and each magic number is a new revision. As you can see, with these magic numbers it introduces SETUP_WITH, it introduces BUILD_SET, MAP_ADD, SET_ADD. These are all new instruction types being added. These instructions all map to a specific opcode. For instance, on the highlighted line here, you can see that the CALL_FUNCTION instruction maps to opcode 131, and so on; there are plenty more instructions: MAKE_FUNCTION 132, BUILD_SLICE 133. There's a module in Python called the opcode module, and if you were to dump the opcode map using that module, this is the output you would see for a normal Python 2.7

installation. I want you to take note of where the arrow is pointing: the CALL_FUNCTION instruction has an opcode of 131. Also keep in mind that DUP_TOP is a 4. We'll just look at those two for now. But if we were to look at the Druva opcode map, you would see that CALL_FUNCTION has a value of 111 and DUP_TOP has 64. Before, they were 131 and 4. So it's interesting that the Druva opcodes have a different instruction-to-opcode mapping. I told you we'd talk about code objects a little bit more, so again, here's the graphic. The code object follows the magic string and

the timestamp, and this piece is very important because the code object contains all of the instructions: it contains the opcodes and the operands for each instruction. That code object can also be executed. So inside that .pyc, if we were to read starting at byte index 8, like we saw, we could read the code object. Using the marshal module, we can load those bytes into a code object, and if we were to execute the code object, we get the expected output, hello world. Inside that code object, like I said, is where the instructions are, and there is a field called co_code which has all of the raw bytecode. So, like, d,

for instance, actually represents an integer value that is an opcode, and then there are operands in here as well, which we'll see. Like I said, the idea was that I wanted to decompile the Druva inSync modules; I wanted to be able to see their source code. But an intermediate step prior to being able to decompile something is disassembling it, and the dis module allows you to disassemble a code object. With our hello world example, we can see that just printing out hello world results in quite a few instructions. First we load the hello world string constant, which is stored in the constants tuple at

index 0. That item is printed, then a newline is printed, and finally we return a value of None.
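The load, execute, and disassemble steps described above can be tried directly. This sketch was written for this transcript and targets Python 3, so the instruction names it prints differ from the Python 2.7 listing on the slide (where a bare print statement disassembles to LOAD_CONST, PRINT_ITEM, PRINT_NEWLINE, LOAD_CONST, RETURN_VALUE). As a simplification, it compiles the source in memory instead of reading a .pyc from disk.

```python
import contextlib
import dis
import io
import marshal

# Compile a one-line script into a code object, as the interpreter would
# before writing it into a .pyc file.
code = compile("print('hello world')", "hello.py", "exec")

# marshal is the serialization used for the code object stored in a .pyc.
blob = marshal.dumps(code)
restored = marshal.loads(blob)

# The restored code object is executable; capture what it prints.
buffer = io.StringIO()
with contextlib.redirect_stdout(buffer):
    exec(restored, {})
output = buffer.getvalue()

# co_code holds the raw bytecode; co_consts is the constants tuple.
raw_bytecode = restored.co_code
constants = restored.co_consts

# dis turns the raw bytes back into named instructions.
instructions = [(ins.opname, ins.argrepr) for ins in dis.get_instructions(restored)]
for opname, argrepr in instructions:
    print(opname, argrepr)
```

The exact instruction list depends on the Python version that runs the sketch, which is itself a small reminder that opcode maps are tied to the interpreter.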

Okay, so that was a simple hello world example. In the real world, applications are way more complex than that. You're not going to run into many hello worlds; you're going to run into object-oriented programs with classes and methods and whatnot. So here's another example. This is my hello class. You can see that there is a constructor, which doesn't do anything, and there's a method that just says oh hey. Again, this is a very simple example; however, it makes the code objects much more complex. If we were to disassemble this code object, the one instruction I want to point out to you is this LOAD_CONST,

because it loads a code object out of that constants tuple, and if we were to look at that constants tuple, we would see at index 1 that there is indeed a code object. Now, if we disassemble the code object that's in the constants tuple, on the next one here, check out the lines I've pointed to: there are even more code objects being loaded. As you might imagine, these code objects get pretty complex when you have very complex classes and such. So my next step was: knowing all this, how do I fix the opcodes? The approach I took was to remap the opcodes. Since we had the opcode map for

Druva and we have an opcode map for a normal Python 2.7, we could disassemble the code objects found in the Druva .pyc files. You go through the raw bytecode and look for opcodes. When you find an opcode, check the map: 111 maps to CALL_FUNCTION, and we also know that the correct Python 2.7 opcode is 131, so we replace 111 with 131. And so on for each instruction: if we find a 64, we know it's a DUP_TOP, and we replace it with 4. That's how you go about fixing the opcodes. Now, it's a little bit more complicated than that because of those nested code

objects, so we're going to work through this diagram real quick. We want to decompile a protected .pyc file, so that's our input. We know it has a magic string, it has a timestamp, and then it has that code object that we're interested in. We pull the code object out of the .pyc file, and then it gets sent as an argument to a routine that fixes code objects. This is where all the magic happens. When that code object comes in, the co_code field is fixed: it gets scanned for opcodes, and those opcodes get remapped. And we know that the constants tuple can contain more code objects, so

the next step is to loop over that tuple and look for code objects. If there are any more, we send them through this routine again, and that's where the recursive nature comes in. The output of this function ends up being a new code object: co_code is remapped, and the constants tuple contains more code objects that get remapped as well. This results in a brand-new .pyc file, and in order to make it work with uncompyle6, you need to set a documented magic number. It doesn't matter what the timestamp is, but if we did it today, we would set it as March 27, 2021. And then we push that brand-new

code object in here. Unfortunately, I can't show you the Druva source code because it's intellectual property. So we've seen how a .pyc file can be remapped, and that's a static file. The next protection we'll look at is SourceGuardian, where we'll be looking at a runtime solution. When I opened up Nagios XI, it had a lot of PHP code, and a large chunk of it looked very similar to this. If you look at the top here, you'll notice that first a function is checked for existence, and that's the sg_load function. If you creep down here to the bottom, you'll see that

sg_load gets called, and the argument to it is this big blob of character data. This represents an encoded file. Any PHP file that gets encoded by SourceGuardian will end up looking like this, except the argument will be different. What I like to call this is the SourceGuardian wrapper. When you execute an encoded file, the file knows how to decode itself and execute itself, and all of the decoding magic takes place in sg_load. I'll take a quote off of the SourceGuardian website. They say: our PHP encoder protects your PHP code by compiling the PHP source code into a binary bytecode format, which is

then supplemented with an encryption layer. So they compile your source code and then encrypt it, and sg_load decrypts and then executes the bytecode. Okay, so we'll take a quick glance at some PHP internals here. PHP bytecode is similar to Python bytecode; however, I don't know of a PHP concept like the .pyc file, that is, a static, compiled file. However, there are PHP extensions that precompile scripts, and one that you might recognize is OPcache. OPcache improves PHP performance by storing precompiled script bytecode in shared memory, thereby removing the need for PHP to load and parse scripts on each request. It serves the same function as a .pyc

file: it's there to make load time much faster. In general, even if it's not cached like this, PHP code will be compiled into bytecode, which is executed by the Zend VM runtime. PHP and the Zend Engine also provide hooks for extensions that allow developers to control the PHP runtime in ways that are not available from PHP userland. Some hooks that we'll see moving forward are zend_compile_file, which is the hook for when PHP code is compiled into bytecode, and zend_execute, which is there when the bytecode gets executed by the VM. We'll also see some opcode handlers being overwritten. An opcode handler is just a C

function that is designed to handle a specific instruction. For instance, if an ECHO instruction were encountered, it would know to print to standard output. Various PHP functions can also be overwritten and hooked, like var_dump. Hooking is a very useful process for debuggers. If you've ever debugged code, you will have set breakpoints on various functions where you want to stop execution when that function gets hit. That is an example of hooking a specific function and overriding it. Next we'll look at the Vulcan Logic Dumper extension. This is an extension I made heavy use of in order to decode SourceGuardian at runtime. The way it works

is, well, we'll start with this example. We have a hello world. If you were to run Vulcan Logic Dumper on an unencoded PHP file, it would show you the instructions that are compiled. So this compiles into bytecode, and this is a representation of the bytecode. We can see that it executed properly, and you have two instructions, an ECHO and a RETURN. VLD hooks at compile time: the point when source code is compiled into bytecode is where VLD, out of the box, will do its hooking. I've shown you a little snippet here of what the extension looks like inside of the compilation routine. First the original zend_compile_file is called,

it converts the source into bytecode, which is represented in an op array, and then there's a function called to dump the op array. So this is the output of that vld_dump_oparray. Now, if you were to run VLD against a SourceGuardian-protected file, you wouldn't get the output that you're interested in. The reason is that we're looking at compile time. A SourceGuardian-protected file contains all of the code necessary to decode, really to decrypt, and then execute the protected bytecode. The problem is that if we're hooking at compile time, the only instructions we're going to dump are the wrapper instructions. As you can see,

we check for the existence of sg_load, and further down, what you can't see, is where sg_load gets called. But we're interested in the argument to sg_load, not this wrapper code that tells it to decode the protected bytecode. Just to give you an idea of what happens here with the hooks: when a SourceGuardian-protected file is launched by the PHP interpreter, all of that wrapper code is compiled into bytecode, so the wrapper code gets turned into an op array that includes the call to sg_load. Then all of that compiled wrapper code gets executed, which in turn calls sg_load. sg_load fires,

and it decrypts the encrypted bytecode and calls zend_execute to execute that decrypted bytecode. So you can see that there are two calls to zend_execute here; we're executing two sets of bytecode. We're interested in the second hook of zend_execute because that is when the decrypted bytecode is executed. From here, I modified the Vulcan Logic Dumper extension to instead dump opcodes when zend_execute fires, versus zend_compile_file like we saw before, and I set it to dump on the second invocation of zend_execute. Now here is my function that fires when the hook is caught.

First I print out execute, then I call vld_dump_oparray, and then I actually execute that op array. With the hello world example, you can see that execute was called, and it clearly executed fine because we got the output we expected, but no instructions were dumped. So I had to figure out why, and the way I went about doing that was by debugging the PHP process. I debugged running that SourceGuardian-protected hello world file, and I set breakpoints on zend_execute. Like we talked about, we are interested in the second invocation of zend_execute. So here we go: we hit the breakpoint, and here I print out the op array. Notice

that the line start is zero and the line end is zero. Interestingly enough, if you were to look at all of the instructions, they all have a line number of zero, which is kind of strange. Each of these little outputs here represents a zend_op structure. A zend_op has a handler, operands, a line number, and an opcode, so this represents an instruction. That vld_dump_oparray function ends up calling a vld_dump_op function that dumps out an individual instruction. When I looked at that source code, because I was trying to figure out why no instructions were being dumped, I saw this little if block down here:

if the line number is zero, then the instruction does not get dumped. So I ended up commenting that out, and sure enough, I got some output. The top up here represents the hello world that I protected with SourceGuardian, and below is the expected output that we're aiming for, which was dumped from an unprotected hello world using VLD. If you look at the differences, the protected file has an additional operation, and it's this JMP that's prepended at the beginning. What's also weird is that if you were to follow the jump, it takes you to instruction 2. So if you were to execute all these instructions,

it would just jump and return. There would be no echo, which clearly isn't the case. So I needed to dive in a little bit deeper. I started with this sample: we generate a random number, either zero or one, and depending on the random number we output either a one or a zero. It's pretty simple. When I encoded that and ran it through my VLD, I got similar output to what we saw before. Again, this is the expected output from an unencoded sample, and this is an encoded sample. Again we got additional instructions, 12 versus 10, and we got that prepended JMP. Similarly to the hello world, if you

were to follow that jump, it takes you straight to operation 4, which is a SEND_VAL 1, and beneath that you have a DO_FCALL rand. If you were to follow this logic, it takes you to the rand call, but only 1 is sent as an argument; 0 is never added as an argument. So this doesn't match up with what we have down here: a SEND_VAL 0, a SEND_VAL 1, and a DO_FCALL rand. Additionally, if you notice, there's a JMPZNZ down here; that instruction doesn't even exist, there's just a JMPZ. So clearly we have some weird stuff going on. If we go back to the debugger, and this

is the hello world, remember that there were three instructions: a JMP, an ECHO, and a RETURN. I talked about the handler: the zend_op has an opcode handler. If you were to look at the handlers here, the JMP handler function does not have a symbol associated with it like the other two. ECHO has a symbol associated with it, and it clearly looks like some sort of Zend Engine function, and RETURN also has a Zend symbol associated with it. So that's kind of strange, right? I checked out the loaded libraries and their addresses, and that JMP handler address actually fits in the range of addresses for the SourceGuardian

loader extension, and that's what this funky file name is: the SourceGuardian extension. I set a breakpoint on that, and when the JMP handler fired, it confirmed what I just saw: that JMP handler exists inside the SourceGuardian loader extension. Now, if we were to enter that SourceGuardian JMP handler function, something that stood out to me was a specific call instruction, which calls a function pointer. When I stepped into that function, it ended up calling the Zend JMP handler. So the SourceGuardian JMP handler does a whole bunch of stuff and then ends up calling the Zend JMP handler, which is kind of strange. If you were to take a look at the

JMP instruction prior to entering the SourceGuardian JMP handler, and then compare it to the instruction right before the Zend JMP handler is called inside the SourceGuardian JMP handler, you would see that operand 1 actually changes. Since it's a JMP, operand 1 is a jump address. Basically, what that means is that before the Zend JMP handler handled the JMP instruction, the address changed, which means that the jump would have gone to a different instruction than what we saw in that output. To recap on that, here's the SourceGuardian JMP handler. When that function is fired, the current operation, which in this case was a JMP, is referenced, and then the operands,

which is the jump address, are deobfuscated. Then the Zend VM's JMP handler is called with the valid operand, the valid jump address, and then, prior to exiting that SourceGuardian JMP handler, the operands are obfuscated yet again. Now we'll start to get into the solution. The solution is that you have to fix each operation that's obfuscated. What I did, for example for the JMP, was create a function based on that SourceGuardian JMP handler function. Basically, I copied all of that logic and modified it a little bit: I allowed the deobfuscation to occur, but instead of allowing the JMP handler

to fire, I set that JMP instruction's handler to just point to the Zend JMP handler, and then I didn't allow the operands to re-obfuscate. So in essence, I'm fixing that JMP instruction. Now, that's just fixing the JMP instruction, but there's an entire op array that needs to be fixed. Before, if you remember, when we dumped an op array, we would be dumping obfuscated values, and then when those instructions are executed by zend_execute, the operands are deobfuscated, the Zend handler executes the instruction, and then the operands are obfuscated again. So my solution was to fix the op array, dump the op array, and then execute.

When I fixed it, like I said, I modified the SourceGuardian handlers: I allowed them to deobfuscate the operands, and then I set the handler on the instruction to the Zend handler. When I dumped the op array, it was dumping the fixed instructions, and when zend_execute fired, the operands were fixed and the handler was the correct one. In addition to the JMP handler, there are several other instructions that were being obfuscated, and as you can see, a lot of them have control-flow implications: we've got jumps, we've got gotos, we've got conditional jumps, try/catch. There were five different SourceGuardian opcode handlers for these groups of

instruction types, so I ended up having to create five different modified opcode handlers, and they follow the same logic: let the operands deobfuscate, set the handler to the proper handler, and then don't allow them to re-obfuscate. Here's a snippet of my code in the VLD extension. Right here I loop through the op array and look at each individual instruction. If the opcode ends up being a JMP or GOTO, then there's a specific fixer function that I fire, and again in this switch statement, if it's 46, 47, 152, or 158, then I run another routine, and so on.
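The loop-and-dispatch structure described here can be modelled in a few lines of Python. This is a toy model written for this transcript, not the author's C extension. The `Op` class, the handler labels, the XOR key, and the JMP and GOTO opcode numbers are all assumptions made for illustration, and the XOR merely stands in for whatever SourceGuardian really does to operands, which the talk does not spell out. Only the opcode group 46, 47, 152, 158 comes from the talk.

```python
from dataclasses import dataclass

JMP, GOTO = 42, 100                      # assumed PHP 5 opcode numbers
CONDITIONAL_GROUP = {46, 47, 152, 158}   # group quoted in the talk; the name is ours
TOY_KEY = 0x5A                           # invented; not SourceGuardian's real scheme


@dataclass
class Op:
    opcode: int
    operand: int
    handler: str  # stand-in for the C function pointer stored on a zend_op


def fix_jump(op):
    """Deobfuscate the jump target and leave it deobfuscated."""
    op.operand ^= TOY_KEY
    op.handler = "zend_handler"  # hypothetical label for the engine's own handler


def fix_conditional(op):
    """Second fixer; in this toy it reuses the same made-up deobfuscation."""
    op.operand ^= TOY_KEY
    op.handler = "zend_handler"


def fix_op_array(ops):
    """Walk every instruction and apply the fixer for its opcode, if any."""
    for op in ops:
        if op.opcode in (JMP, GOTO):
            fix_jump(op)
        elif op.opcode in CONDITIONAL_GROUP:
            fix_conditional(op)
    return ops


def dump_op_array(ops):
    """Return one printable line per instruction."""
    return [
        "%3d  opcode=%d operand=%d handler=%s" % (i, op.opcode, op.operand, op.handler)
        for i, op in enumerate(ops)
    ]
```

Calling `fix_op_array` before `dump_op_array` mirrors the order from the talk: fix, then dump, then execute, so the dump shows real operands instead of obfuscated ones.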

Like we saw before with Python, classes introduce another layer of complexity, so I'll start with an example real quick. We've got class one and class two. Each one has a func one or func two, which echoes one or two. Each of the classes also has a function that will not be called; they're called not used one or not used two, and they return a value of one or two. In the main logic of this PHP script, I generate a random number, one or two, that determines which class is instantiated and, in turn, which function is called. When I ran VLD against that, here's the output of that main

logic. We can see that a random number is generated, and depending on that, either class one or class two is instantiated and func one or func two is called. In this case, func one was called and class one was instantiated, so you can see the disassembly of func one, and the output shows that one was the random number generated. What I found was that the output didn't include the unused functions, and it didn't include the unused class, which was class two. So there was one more step I had to take in order to fully dump a SourceGuardian-protected file and all of its opcodes. What we've

been doing is dumping the main op array; however, there's another piece that I had to tap into, which is the class table and also the function table, and these define entries. In this case, there was a class one in the class table and a class two, along with their function entries. We didn't have just a static function defined in the file, so I didn't have to tap into the function table for this particular example. But basically, we fix the main op array, fix the op arrays in the class table, fix the op arrays in the function table, and then we can dump it. After I tapped into the class table and function table, my modified VLD would then dump

unused functions and unused classes. As you can see, not used one was dumped here along with func one, and you can also see that class two was dumped in its entirety: we can see func two and not used two.
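The extra traversal described above can be sketched as follows. This is an illustrative Python model written for this transcript, not code from the modified VLD: plain dictionaries stand in for the engine's function table and class table, and the `fix` and `dump` callables, along with every name here, are hypothetical.

```python
def iter_op_arrays(main_ops, function_table, class_table):
    """Yield (label, op_array) for the main script and every table entry."""
    yield "main", main_ops
    for func_name, ops in function_table.items():
        yield func_name, ops
    for class_name, methods in class_table.items():
        for method_name, ops in methods.items():
            yield "%s::%s" % (class_name, method_name), ops


def fix_and_dump_all(main_ops, function_table, class_table, fix, dump):
    """Fix and then dump each op array in turn, keyed by its label."""
    report = {}
    for label, ops in iter_op_arrays(main_ops, function_table, class_table):
        report[label] = dump(fix(ops))
    return report
```

Because the walk covers every entry in both tables, methods that never run, like the unused ones in the example, still end up in the report.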

So thank you all very much for joining. Just to recap, we talked about remapping Python opcodes in a static .pyc file, and we also talked about fixing PHP bytecode at runtime. I hope that if you ever run into something like this down the road, this is a good reference for you to refer back to, and hopefully you have a leg up. If you're interested in diving a bit deeper, I've written a couple of blogs, which are on the Tenable Tech Blog, and I've also dumped all of my code for the Python remapping and the SourceGuardian decoder extension. At this point I'll open it up for questions.
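For readers who want the Python remapping from the first half of the talk in concrete form, here is a minimal sketch. It is not the author's published code. It assumes Python 2.7's instruction encoding (one opcode byte, plus a two-byte operand when the opcode is 90 or higher in the standard map) and uses a small `ToyCode` class as a stand-in for a real code object so the example runs on a modern Python. `ToyCode`, `build_translation`, `remap_bytecode`, and `fix_code_object` are names invented here.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class ToyCode:
    """Stand-in for a code object: raw bytecode plus a constants tuple."""
    co_code: bytes
    co_consts: tuple = ()


def build_translation(vendor_opmap, standard_opmap):
    """Map each vendor opcode number to the standard number for the same name."""
    return {
        vendor_opmap[name]: standard_opmap[name]
        for name in vendor_opmap
        if name in standard_opmap
    }


def remap_bytecode(raw, translation, have_argument=90):
    """Rewrite only the opcode bytes, skipping over operand bytes."""
    out = bytearray(raw)
    i = 0
    while i < len(out):
        fixed = translation.get(out[i], out[i])
        out[i] = fixed
        # After translation the standard rule applies: opcodes at or above
        # have_argument carry a 16-bit operand, so the instruction is 3 bytes.
        i += 3 if fixed >= have_argument else 1
    return bytes(out)


def fix_code_object(code, translation):
    """Remap co_code, then recurse into code objects nested in co_consts."""
    new_consts = tuple(
        fix_code_object(const, translation) if isinstance(const, ToyCode) else const
        for const in code.co_consts
    )
    return replace(
        code,
        co_code=remap_bytecode(code.co_code, translation),
        co_consts=new_consts,
    )
```

With the two mappings mentioned in the talk, `build_translation({"CALL_FUNCTION": 111, "DUP_TOP": 64}, {"CALL_FUNCTION": 131, "DUP_TOP": 4})` gives `{111: 131, 64: 4}`. A full tool would use complete opcode maps for both interpreters, then write the fixed code object out behind a documented magic number and a timestamp.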

Good, good lecture. Oh, thank you. Does anyone have any questions?

Up here, uh, all alone. If you're a room monitor... she may have been having trouble with her mic. And I don't see any questions in the Q&A, but we can go back to the floor, and maybe you can hang out at the tables for a little while, and if anybody has any questions, they can come and ask you. It was a great talk, though. Thank you. Oh, thank you very much. Yeah, thanks all for joining, and thanks to BSides Tampa for having me.