A Glance at Interpreted Language Bytecode Trickery by Chris Lyne

Name: A Glance at Interpreted Language Bytecode Trickery by Chris Lyne
Uploaded: 2021-05-07
Duration: 36 min 8 s
Description: View slide decks and full list of talks available at: https://www.bsidesdub.ie/past/2021.php

BSides Dublin · 2021 · 36:08 · 35 views · Published 2021-05

Speakers

Chris Lyne

Tags

Category: Technical

Style: Talk

Mentioned in this talk

Tools used

Procmon, uncompyle6, Vulcan Logic Dumper


Transcript [en]

Thanks, everyone, for joining my talk today. We'll be taking a glance at interpreted language bytecode trickery. As Karen said, my name is Chris Lyne. I'm on Tenable's Zero Day Research team. The team's goal is to discover unknown vulnerabilities in third-party products. We also coordinate disclosure with the product vendor, and we do quite a bit of sharing our research with the community.

Some quick background on the talk: it started off with two research projects. In one I looked at Druva inSync, which is an endpoint backup solution; that one's written in Python. The other project was Nagios XI, which is enterprise server and network monitoring software.

That one's written in PHP. The main goal when I first started these projects was to get hold of their source code, read it, figure out how the product works, and see if I could find vulnerabilities. But in both cases I was presented with bytecode protection, so I wasn't able to just read the source code like I had hoped.

Some topics we'll go over today: we'll talk a little bit about interpreted languages, we'll talk about what bytecode is, we'll talk about the protections I encountered, and we'll go through how I bypassed them.

Real quick, let's talk about compiled versus interpreted languages. On the compiled side of things, you're probably familiar with C, C++, or Go. Those are all compiled languages. You have your source code, and the programmer has to explicitly compile that source code into an executable format, which is machine code that targets the operating system and the processor.

For this talk we'll be focusing on interpreted languages such as Python, PHP, or Ruby. These are different from compiled languages because the programmer doesn't have to compile the code explicitly. There is an interpreter that does that job, and the interpreter converts the source code into bytecode, which is similar to machine code except that it targets a virtual machine environment: the Python virtual machine or the PHP virtual machine.
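As a small illustration of an interpreter turning source into bytecode for its virtual machine, the following Python 3 sketch compiles a one-line program in memory and lists the resulting instructions. The snippet and the placeholder file name `<example>` are illustrative and not from the talk, and the exact instruction names differ between Python versions.

```python
import dis

# A one-line program, compiled in memory instead of being run directly.
source = 'print("Hello World")'
code = compile(source, "<example>", "exec")

# The code object carries the raw bytecode and the constants it refers to.
raw_bytecode = code.co_code
constants = code.co_consts

# dis turns the raw bytes back into named virtual machine instructions.
instruction_names = [ins.opname for ins in dis.get_instructions(code)]

print(raw_bytecode.hex())
print(constants)
print(instruction_names)
```

The programmer never asked for a compile step here beyond calling `compile`; when a script is run normally, the interpreter does the same work implicitly.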

First we'll take a look at Python opcode remapping, which was in the Druva inSync product, and then we'll take a look at the protection, code objects, and fixing the opcodes.

With Druva, one of the first things I did was analyze the program behavior. At the top here you can see Procmon output, which shows some events that took place when the inSync program launched. As you can see, a Python 2.7 DLL was opened, and a library.zip archive was opened as well, which is down here. This type of behavior is indicative of a Python application that's built with py2exe. Basically, py2exe allows the developer to write Python code and then ship the application as a Windows executable. Essentially it ships the application with Python and all of the Python libraries that are required, which is what you see in here. If you notice, the files in here for the most part end with .pyc. Again, those are the Python modules.

You might be asking, what's a .pyc file? The answer is that it's a byte-compiled Python file. If you were to load up a .pyc file in your text editor, you wouldn't be able to read Python source code like you would with a .py file.

The idea of compiling a Python file like that is to help speed up the load time. This is especially true with modules because, as you saw, there are a lot of modules, so if they're already compiled when they're imported, it really speeds things up. However, if you wanted to read the source code of those compiled files, you would need to use a decompiler such as uncompyle6. That's what I've shown in this screenshot: I used uncompyle6 to decompile the Python struct module, and as you can see, it works pretty well. Basically that module is just a few imports.

I encountered the protection when I tried to decompile the struct module that was shipped with Druva inSync. When I tried to decompile it, I got this interesting error: unknown magic number 62216. Clearly there was something in the .pyc file that uncompyle6 didn't like, so my next step was to figure out the structure of this .pyc file and what was going wrong.

When I was trying to learn about the .pyc format, I started off with a hello world application. It's just one print statement. I used the py_compile module to create a .pyc from that source code, and at the bottom here you see a hexdump of that .pyc file. The first four bytes of a .pyc file are what's called the magic string. The next four bytes are the timestamp of when the file was created, and following all of this is a code object. We'll talk more about the magic string and code object as we go.

That magic string contains a magic number, and if you remember, the error before said unknown magic number 62216. This is a list of documented magic numbers. For example, 62211 corresponds with 2.7a0, and actually many of these correspond with that Python version. With each magic number you can see that something is introduced.
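The header layout described above can be checked with a short sketch. This is written for Python 3 rather than the Python 2.7 used in the talk, so it is an approximation: Python 3 headers are longer than the eight bytes of 2.7, which is why only the first four bytes are read here. The file names `hello.py` and `hello.pyc` are placeholders. In both versions the magic string is a two-byte little-endian magic number followed by `\r\n`.

```python
import importlib.util
import os
import py_compile
import tempfile

with tempfile.TemporaryDirectory() as workdir:
    source_path = os.path.join(workdir, "hello.py")
    pyc_path = os.path.join(workdir, "hello.pyc")
    with open(source_path, "w") as handle:
        handle.write('print("Hello World")\n')

    # Byte-compile the source file into an explicit .pyc path.
    py_compile.compile(source_path, cfile=pyc_path, doraise=True)

    # Only the first four bytes (the magic string) are needed here.
    with open(pyc_path, "rb") as handle:
        magic_string = handle.read(4)

# The magic number is the first two bytes, little-endian.
magic_number = int.from_bytes(magic_string[:2], "little")

print(magic_string, magic_number)
```

A decompiler uses this number to pick the bytecode dialect, which is why an unrecognized value such as 62216 stops it before any instruction is read.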

So we introduced SETUP_WITH, BUILD_SET, MAP_ADD, and SET_ADD. These are all Python bytecode instructions. All of these bytecode instructions have a corresponding opcode. For instance, the highlighted line here, which is defined in the opcode module, shows that the CALL_FUNCTION instruction maps to opcode 131. MAKE_FUNCTION maps to 132, BUILD_SLICE maps to 133, and so forth. There are quite a few of them. If you were to import the opcode module and dump out the opcode map, you can verify the number we just saw: CALL_FUNCTION maps to opcode 131.

Take note of that one, and also take note that DUP_TOP maps to 4. If we do the same thing with the Druva installation, importing the opcode module that shipped with it and checking the opmap, notice that CALL_FUNCTION has a totally different opcode number there, and so does DUP_TOP. I'll go back so you can see: 131 and 4 in standard Python, versus 111 and 64 in Druva.
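Given two opcode maps keyed by instruction name, a number-to-number translation table follows directly. The sketch below uses only the two pairs quoted in the talk; the dictionary names and the helper `build_translation` are hypothetical, and a real table would have to cover every instruction name in both maps.

```python
# Values quoted in the talk; a complete table would list every opcode name.
STANDARD_OPMAP = {"CALL_FUNCTION": 131, "DUP_TOP": 4}
VENDOR_OPMAP = {"CALL_FUNCTION": 111, "DUP_TOP": 64}


def build_translation(vendor, standard):
    """Map each vendor opcode number to the standard number with the same name."""
    missing = sorted(set(vendor) - set(standard))
    if missing:
        raise KeyError("no standard opcode for: " + ", ".join(missing))
    return {vendor[name]: standard[name] for name in vendor}


translation = build_translation(VENDOR_OPMAP, STANDARD_OPMAP)
print(translation)  # {111: 131, 64: 4}
```

Joining on the instruction name is what makes the shipped opcode module so useful: it gives the vendor's numbering without any reverse engineering of the interpreter itself.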

Now we'll look at code objects a little more in depth, as that's where the instructions and opcodes are contained. Take the code object that's in the .pyc file: if we read from a .pyc file, take the code object bytes that start at index eight, and use the marshal module to load those bytes, you can see that there is a code object in there. If we execute it, we get the expected output, hello world.

In that code object there is a field called co_code, which contains the raw bytecode. This string you see right here is raw bytecode. The d at the beginning here is an opcode, and opcodes that take arguments have their operands following them.

Since we're trying to decompile a compiled Python file, I want to tell you about disassembly real quick, because it's an intermediate step on the way to decompilation. Those opcodes can be disassembled: you take a code object and use the dis module to disassemble the bytecode. As you can see, hello world ends up being quite a few instructions compared to the Python source file. You have a LOAD_CONST, which loads the hello world constant defined in the constants tuple. That constant is printed, a newline is printed, and then the script returns None.

In the real world, obviously, we're not going to see hello world applications. We'll see more object-oriented applications, which are much more complex. This is still a fairly simple example, but take the Hello class: it has a blank constructor that doesn't do anything, and then there's a method called say_hello that just says "oh hey". If we disassemble that code object, I want you to notice that the instruction at offset 9 loads a constant which is a code object. If we look into that constants tuple at index 1, you see that there is indeed a code object.
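The nesting of code objects inside the constants tuple can be seen with a short Python 3 sketch. The class below is a guess at the shape of the example on the slide, not a copy of it, and the helper `walk` is hypothetical. The code object is round-tripped through `marshal`, the same serialization used for the code object stored in a .pyc file.

```python
import marshal
import types

source = '''
class Hello:
    def __init__(self):
        pass

    def say_hello(self):
        print("oh hey")
'''

module_code = compile(source, "<example>", "exec")

# Serialize and reload, as a .pyc reader would after skipping the header.
restored = marshal.loads(marshal.dumps(module_code))


def walk(code, depth=0):
    """Yield (depth, name) for a code object and every code object in its constants."""
    yield depth, code.co_name
    for const in code.co_consts:
        if isinstance(const, types.CodeType):
            yield from walk(const, depth + 1)


found = list(walk(restored))
for depth, name in found:
    print("  " * depth + name)
```

The module-level code object holds the class body as a constant, and the class body in turn holds one code object per method, so any tool that rewrites bytecode has to descend through all of them.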

Now we can disassemble that code object too, and in fact it contains instructions to load even more code objects. As you can see, a basic class can start to get pretty complex: you have nested code objects.

Now we'll start getting into how I went about fixing these code objects in order to decompile them. My strategy was to read a .pyc file and take a look at all the opcodes in the bytecode. We have the mapping for the Druva opcodes and we have the mapping for normal Python 2.7. If we see a Druva opcode of 111, we know that it's a CALL_FUNCTION instruction, so we can replace that opcode with 131. If we see a 64, we know it's a DUP_TOP, so we replace it with a 4, and so on.

Because of the complex nature of code objects and how they can be nested, the algorithm had to be recursive. Bear with me, we're going to go through this diagram real quick. We're reading a .pyc file, which has a magic string, a timestamp, and the code object we're interested in. We read the code object, and then there is a main function that I implemented for fixing code objects. When the code object comes in, the raw bytecode in the co_code field is remapped, like we saw before. We also saw that the constants tuple can contain nested bytecode, so if there are more code objects in the constants tuple, we loop through those, pull out each code object, and call this routine again. That's where the recursion comes in. The whole idea is to produce a new code object with remapped opcodes in co_code and remapped opcodes inside the nested code objects in the constants tuple.

Ultimately, in order to make decompilation work with a standardized tool, I set the magic number to 62211, which was Python 2.7a0. I also set a timestamp of today, and then the new code object is stuffed in there. After I ran that routine on all of the .pyc files that were delivered with Druva inSync, I was able to decompile them. Unfortunately I can't show you their source code because it's their intellectual property. So that's Python opcode remapping.

Next, when I was looking at Nagios XI, I ran into PHP SourceGuardian, which is a proprietary protection mechanism. When I was looking through their code base, they had all sorts of PHP files.
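To make the Python remapping routine from the Druva part of the talk concrete, here is a minimal sketch of the recursion. It is not the speaker's code: `ToyCode` is a hypothetical stand-in for a real code object so that the example runs on any Python version, the table holds only the two opcode pairs quoted in the talk, and the instruction layout assumed is that of Python 2.7, where opcodes of 90 and above are followed by a two-byte argument. Writing the result back out with a new magic number and timestamp is left out.

```python
from dataclasses import dataclass
from typing import Dict, Tuple

HAVE_ARGUMENT = 90  # Python 2.7: opcodes >= 90 carry a 2-byte argument

# Only the pairs quoted in the talk; a real table needs every opcode.
VENDOR_TO_STANDARD: Dict[int, int] = {111: 131, 64: 4}


@dataclass(frozen=True)
class ToyCode:
    """Hypothetical stand-in for a code object: raw bytecode plus constants."""
    co_code: bytes
    co_consts: Tuple[object, ...] = ()


def remap_bytecode(raw: bytes, table: Dict[int, int]) -> bytes:
    """Rewrite opcode bytes only, stepping over argument bytes."""
    out = bytearray(raw)
    i = 0
    while i < len(out):
        vendor_op = out[i]
        if vendor_op not in table:
            raise ValueError(f"unmapped opcode {vendor_op} at offset {i}")
        standard_op = table[vendor_op]
        out[i] = standard_op
        # Instruction length is decided by the standard opcode number.
        i += 3 if standard_op >= HAVE_ARGUMENT else 1
    return bytes(out)


def fix_code(code: ToyCode, table: Dict[int, int]) -> ToyCode:
    """Return a new code object with this level and all nested levels remapped."""
    consts = tuple(
        fix_code(const, table) if isinstance(const, ToyCode) else const
        for const in code.co_consts
    )
    return ToyCode(remap_bytecode(code.co_code, table), consts)


# The inner argument byte 64 looks like an opcode but must stay untouched.
inner = ToyCode(bytes([111, 64, 0]))
outer = ToyCode(bytes([64, 111, 1, 0]), (None, inner))
fixed = fix_code(outer, VENDOR_TO_STANDARD)
print(list(fixed.co_code), list(fixed.co_consts[1].co_code))
```

Walking instruction by instruction, rather than replacing bytes blindly, matters because an argument byte can share a value with an opcode, as the inner example shows.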

I could read the source code of some, and then there were others that looked like this. I don't know about you, but when I look at this, I can't read any of it. There's no programming logic in here. Basically, this is an encoded file. If you were to run the SourceGuardian encoder on a PHP source file, you would end up with something like this. Notice at the top here that it checks to see if the sg_load function exists, and if so, further down the sg_load function is called. What this code does is allow SourceGuardian-protected files to decode and execute themselves.

If we zoom in on that sg_load call, there's a big string of stuff, and this is the programmer's logic that we're interested in. I'll take a quote from the SourceGuardian website, where they describe their product: "Our PHP encoder protects your PHP code by compiling the PHP source code into a binary bytecode format, which is then supplemented with an encryption layer." So they compile the PHP code and then encrypt it. This sg_load function has to first decrypt and then execute the bytecode. This argument here is the encrypted bytecode, so that's what we're after.

We'll take a look at a few quick internals concepts. PHP bytecode is similar to Python's in that it's a bytecode format that the VM understands. I'm not aware of a concept like a .pyc file in the PHP world; however, there are PHP extensions that will precompile scripts. Here's a quote: "The OPcache extension improves PHP performance by storing precompiled script bytecode in shared memory, thereby removing the need for PHP to load and parse scripts on each request." Same idea: scripts are precompiled into bytecode for speed purposes. In general, when a PHP script is executed, it gets compiled into bytecode, which is then executed by the Zend VM runtime.

PHP and the Zend Engine also provide different hooks for extensions, and these hooks allow extension developers to control the PHP runtime in ways that are not available from PHP userland. Some hooks that we'll talk about coming up are zend_compile_file, for when PHP source code is transformed into bytecode, and zend_execute, since that bytecode is then executed. Also, each instruction, just like in Python, has its own specific opcode, and the opcode handlers, the functions that know what to do with those instructions, can be overridden as well. Something I'd like to point out here is that hooking is extremely useful for debuggers. If you think about debugging you've done, you probably set a breakpoint on a function name. That's the same concept: you would hook that function name.

In order to pull out the bytecode I was interested in, I used an extension called the Vulcan Logic Dumper. VLD allowed me to dump the instructions of a PHP script. It does this at compile time, so the point when the source code is compiled into bytecode is the point where it dumps it. As you can see, there's an ECHO for hello world and a RETURN 1. VLD, like I said, hooks zend_compile_file.

This is a snippet from VLD that shows how that works. First the script is compiled into an op array, and that op array contains the bytecode. Then VLD dumps the op array; that's the output you saw, produced by vld_dump_oparray. If you were to run VLD as is, hooking at compilation time, against a SourceGuardian-protected file, you wouldn't get the output we're interested in, which is what the Nagios XI developer implemented. What you would see instead is the SourceGuardian wrapper I was talking about before. This is all the bytecode that would run in order to decode and execute the protected bytecode. As you can see at the top here, like we saw in the source code, it looks for the sg_load function, and if it exists, eventually it gets called. The output is pretty lengthy, so it's not all shown here.

Let's take a step back real quick and visualize the hook that we want. We know we don't want it at compile time. When a SourceGuardian-protected file is launched by PHP, all of the code is in there to decode the compiled, encrypted bytecode. At compile time, as we just saw, that wrapper is compiled into bytecode, which is an op array, and this includes the call to sg_load. After the wrapper is compiled, that code gets executed, which in turn executes sg_load. When sg_load fires, it decrypts the encrypted bytecode and then calls zend_execute to execute the protected bytecode, and that's the bytecode we're interested in. So the second invocation of zend_execute is what we want to hook.

My next step was to modify VLD, and I did just what we talked about: I hooked the second invocation of zend_execute. Here's my VLD hook. I wanted to see when execute ran, so I printed out "execute", then the op array would be dumped, and finally the bytecode would actually be executed. With the hello world, we can see that the hook was hit, "execute" was printed, and clearly the instructions were executed successfully. However, you don't see any instructions like we did earlier. The question was, why am I not seeing any instructions? At this point I decided I needed to debug the PHP process as I launched this encoded file. I set a breakpoint at zend_execute, and as you can see, it was hit twice. We're interested in the second hit. At the second breakpoint I printed out the op array, and as you can see, there are three instructions.
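The reason for hooking execution rather than compilation can be modeled with a toy engine. Everything in this sketch is hypothetical and written in Python purely for illustration: `ToyEngine` is not the Zend Engine, the instruction `CALL_LOADER` stands in for the wrapper's call to the loader, and base64-encoded JSON stands in for the real compile-and-encrypt step. What it does show is the pattern described above: save the original execute function, install a wrapper that dumps the op array, then call the original.

```python
import base64
import json


def encode_payload(op_array):
    """Toy stand-in for the encoder; the real product compiles and encrypts."""
    return base64.b64encode(json.dumps(op_array).encode()).decode()


def decode_payload(blob):
    """Toy stand-in for the loader's decryption step."""
    return json.loads(base64.b64decode(blob))


class ToyEngine:
    """Hypothetical engine with a replaceable execute function."""

    def __init__(self):
        self.execute = self._default_execute  # the hook point
        self.output = []

    def _default_execute(self, op_array):
        for opcode, operand in op_array:
            if opcode == "ECHO":
                self.output.append(operand)
            elif opcode == "CALL_LOADER":
                # The loader decodes the payload and runs it via execute again.
                self.execute(decode_payload(operand))


def install_dump_hook(engine, dumps):
    """Replace engine.execute with a wrapper that records each op array first."""
    original = engine.execute

    def hooked(op_array):
        dumps.append([list(op) for op in op_array])
        return original(op_array)

    engine.execute = hooked


protected = [["ECHO", "Hello World"], ["RETURN", 1]]
wrapper = [["CALL_LOADER", encode_payload(protected)]]

engine = ToyEngine()
dumps = []
install_dump_hook(engine, dumps)
engine.execute(wrapper)

print(dumps[1])  # the second invocation carries the protected instructions
```

The first dump only ever contains the wrapper, which is all a compile-time hook could see; the protected instructions appear in the second dump because they exist in decoded form only once the loader hands them to execute.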

And line_start is zero and line_end is zero. That was weird, because if you looked at all the individual instructions in the op array (the instructions are zend_ops), each of them had a line number of zero as well, which was kind of strange. If you look at the structure, you can see it has a handler, operands, and a line number. I dug into the VLD source code to try to figure out why the instructions weren't being dumped, and I ran into a function that dumps individual instructions. There I found this interesting if block: if the op's line number equals zero, which it does for all of them, we return and don't dump anything.

I went ahead and commented that out, and I got some new output, which was pretty exciting. At the top here is my modified VLD run against an encoded hello world file, and at the bottom is VLD run against the hello world prior to being encoded, which is what we're expecting to get. However, the output I got was interesting because the number of ops increased. You'll notice there are three up top and two at the bottom, and there is an additional instruction prepended: a jump at the beginning. If you trace the jump to the instruction at offset two, it would simply return and the echo would not get executed, so the control flow is a little off.

I wasn't sure why this was happening, so naturally I started creating more samples. This was my next sample: I generate a random number, which will end up being either zero or one, and depending on that random number I echo one or zero. I reran the test, with the encoded version at the top and the unencoded version at the bottom. Notice that again there are more instructions, with an additional jump. If we follow the jump to four, what happens is that a function call to rand takes place, but only one is passed as an argument.

If we jump straight to four, then zero doesn't get added as an argument, so that doesn't match up with the source code. Additionally, in the unencoded output down here there's a JMPZ, but in the encoded output there is no JMPZ, so it looks like that instruction got changed.

I went back to the debugger for the hello world. If you remember, there were three instructions: a jump, an echo, and a return. Something that stood out to me was that the jump handler address didn't point to a symbol like the other two. As you can see, the echo instruction's handler has a symbol associated with it, ZEND_ECHO_SPEC_CONST_HANDLER.

The return has a symbol associated with it as well, so that was kind of strange. I looked into the address of the jump handler to see where it was in the loaded libraries, and sure enough, it pointed into the SourceGuardian extension. This weird file name is the SourceGuardian loader extension. I set a breakpoint on that address, and once that function fired, I was inside the SourceGuardian extension.

When I disassembled that SourceGuardian jump handler function, there was a specific instruction that stood out to me, and that was the call. Inside the SourceGuardian jump handler, it calls another function through a function pointer. When I stepped into that function, sure enough, the Zend jump handler was called. So basically a bunch of stuff happens when a jump is encountered, and then the actual Zend jump handler is called. If you look at the operands of the jump instruction before entering the SourceGuardian jump handler, and then compare them to inside that handler right before the Zend jump handler gets called, you notice that operand one changed. Since it's a jump, operand one is a jump address, so the address of the jump changed, which means the jump would go to a different instruction.

To recap, this is the logic inside the SourceGuardian jump handler: first the jump operation is referenced, the operands are de-obfuscated, and then the default implementation of the Zend jump handler is executed. After that, the operands are re-obfuscated to their original state.

Now we'll start getting into my solution. I started off with this jump instruction: how can I fix just one jump? My solution was to take the SourceGuardian jump handler function as a model and create my own new function, so that when a jump was encountered, my function would run. I copied some of their code to allow the operands to de-obfuscate. However, I did not allow the Zend jump handler to run. Instead, I set the current zend_op's handler to be the actual Zend jump handler, and I didn't allow the operands to be restored to their obfuscated state.

That's just one operation, but we have to fix the entire op array. If you remember, before, when we dumped an op array, it would just show the obfuscated values. Then, once the op array was executed by zend_execute, the SourceGuardian handlers would run for each zend_op: the operands would be de-obfuscated, the real Zend handler would run, and the operands would be obfuscated again. With my solution, I added a step at the beginning to fix the op array. Using the same logic from the SourceGuardian handlers, I would allow those zend_ops to de-obfuscate as intended, and then I would set the handler for each operation to point to the actual Zend handler. Then, when I dumped the op array, you would see the corrected operands, and when the op array was executed, the zend_ops would execute correctly because the operands were de-obfuscated and the handlers pointed to the Zend handlers.

I told you about the jump handler, but there were several different SourceGuardian opcode handlers for different types of instructions. I've grouped them into five different handlers here, so SourceGuardian had five different handlers.
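The handler behavior and the fix described above can be modeled in a few lines. This is a toy model in Python, not SourceGuardian's or the speaker's code: the XOR with `KEY` is an assumed placeholder because the talk does not describe the real operand transformation, the true jump target of 1 is likewise an assumption chosen so that the obfuscated value matches the offset 2 seen in the dump, and all names are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List

KEY = 0x03  # assumed placeholder; the real transformation is not given in the talk


@dataclass
class ToyOp:
    """Hypothetical instruction: a name, one operand, and a handler pointer."""
    name: str
    operand: int
    handler: Callable[["ToyOp"], int]


def engine_jmp_handler(op: ToyOp) -> int:
    """Stand-in for the engine's own jump handler: returns the jump target."""
    return op.operand


def protected_jmp_handler(op: ToyOp) -> int:
    """Models the protected handler: de-obfuscate, run the real one, re-obfuscate."""
    op.operand ^= KEY
    try:
        return engine_jmp_handler(op)
    finally:
        op.operand ^= KEY


def fix_jmp(op: ToyOp) -> None:
    """De-obfuscate once, keep it that way, and point at the real handler."""
    op.operand ^= KEY
    op.handler = engine_jmp_handler


# One fixer per protected handler; the talk describes five such groups.
FIXERS = {protected_jmp_handler: fix_jmp}


def fix_op_array(ops: List[ToyOp]) -> None:
    """Apply the matching fixer to every protected instruction before dumping."""
    for op in ops:
        fixer = FIXERS.get(op.handler)
        if fixer is not None:
            fixer(op)


jump = ToyOp("JMP", 1 ^ KEY, protected_jmp_handler)
dumped_before = jump.operand            # what a dump shows while still protected
target_at_runtime = jump.handler(jump)  # what actually happens when executed
still_obfuscated = jump.operand         # the handler restored the obfuscated value

fix_op_array([jump])
print(dumped_before, target_at_runtime, jump.operand)
```

Because the protected handler restores the obfuscated operand after every call, a dump taken before or after execution is equally misleading; only a permanent rewrite ahead of the dump, as in `fix_op_array`, yields readable operands.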

Depending on the opcode value, it would determine which handler was called, so I ended up having to create five of my own. As you can see off to the left here, I loop through the op array, look at every instruction, and depending on the opcode value I call a specific function to fix it. As you can see at the top here, if it's 42 or 100, I call the fix jump function, which handles those opcodes, and so on. So we've seen how to fix an op array.

Like in the Python world, classes create a different layer of complexity. Here's another example. We have a class one and a class two, each of which has either a func one or func two that echoes one or two. Each class also has a function that is not used, not used one or not used two, which returns one or two. I define these classes, and I also have a kind of main function here: generate a random number, one or two, and depending on the value, either class one or class two is instantiated and either func one or func two is called.

When I ran my modified VLD against this, here's the main that we saw: we generate a random number, and depending on the value, either class one or class two is instantiated and func one or func two is called. For this particular case the random value was one, so as you can see, func one was dumped. However, the not used one function was not dumped, and class two was not dumped at all. I only got the output for the instructions that were executed: the main and class one's func one.

In order to dump it all, I had to tap into the class table and the function table. Like we saw before, I have that fix op array routine, which fixes the main. Then I have to tap into the class table, which contains class one, class two, and all their functions. In this case we didn't really have any functions outside of the classes, but all of these would need to be processed as op arrays, and at that point we would dump the op arrays. I was then able to view the instructions for not used one and for class two in its entirety.

Thank you all for joining. To recap, we talked about remapping Python opcodes in a static .pyc file, and we also fixed PHP bytecode at runtime. Hopefully this talk helps you down the road when you encounter a similar protection mechanism. If you're interested in diving in a little deeper, I have written a couple of blogs on the topic, and all my source code is available online. At this point I will open it up for questions.

Thank you, Chris. Very interesting. Normally I can have an opinion on most talks that are given, but this was like a different language to me, so very interesting and very technical. Thanks for that. I'm just going to check the questions tab there in Swapcard, just give me a second. There have been no questions yet, so I'll give folks another 10 or 20 seconds, and if there are no questions, then we'll give you back maybe eight minutes. Okay, there are no questions coming in, Chris, so thanks again for your time today. The feedback in the chat was very good, very informative, from lots of folks. Thanks again for your time, Chris, and enjoy the rest of your day.

Awesome, thank you for having me.
