Karta - Fast Source Code Assisted Binary Matching - Eyal Itkin

Name: Karta - Fast Source Code Assisted Binary Matching - Eyal Itkin
Uploaded: 2019-11-02
Duration: 29 min 13 s
Description: Karta - Fast Source Code Assisted Binary Matching - Eyal Itkin BSidesTLV 2019 - Tel Aviv University - 24 June 2019

BSides TLV · 2019 · 29:13 · 243 views · Published 2019-11

Speakers

Eyal Itkin

Tags

Category: Technical

Style: Talk

Mentioned in this talk

Tools used

Ghidra, IDA Pro, Karta


Transcript

Okay, hi everybody, sorry for the technical issues. I'm going to present an IDA plugin that I wrote, called Karta, but let's start from the beginning. I'm Eyal Itkin, I'm a vulnerability researcher at Check Point Research. Usually my research projects are focused on embedded devices and network protocols, which was the background for this specific IDA plugin. You can see my research projects, or suggest your own project, on my Twitter. Let's begin with the motivation for this plugin, which is that malware researchers, or even vulnerability researchers in firmware projects, usually need to identify open source versions inside a given binary. This could be the version of OpenSSL that

is being used, or libpng or zlib, which you'll see often. You want to identify the open source for specific reasons. Sometimes you want to help researchers in the reverse engineering process, because they want to find out what the binary does: does it support TIFF, does it have libtiff inside it. Another reason is locating 1-days in a given firmware: in many, many firmware projects you don't need to find a 0-day, because they don't update the open sources, so you most probably have a given 1-day that is exploitable, but you need to find out which open sources are statically linked inside the firmware and what the version of each open source is. And hopefully we could do that

automatically. We want to identify libc functions, because we don't have good heuristics to find strlen or memcpy, strncpy or socket. When you reverse engineer a file on Windows you will have it for libc, but if you have an embedded file, say an HP printer that for some reason runs on an ARM CPU, you need to find it all on your own. And one more reason, which several people here in the audience know because I worked with them on a given project, is that I worked three days to reverse engineer a specific SNMP module in some device, after which I found out that it used an open source called net-snmp, and I

knew the open source already and I simply couldn't realize that I was reversing a known open source. So I don't want to reverse engineer net-snmp over and over again: if we had an automatic tool to identify it and match the functions, I could skip directly to the interesting modules instead of reverse engineering the same open source that is already known and has no good exploitable vulnerabilities. The problem with current tools is that usually the firmware is huge: a specific firmware we used in a previous research project has roughly 65,000 functions, in Cisco routers you have hundreds of thousands of functions, and even in TeamViewer, which is a simple

process on your Windows computer, you have roughly 150,000 functions. Binaries are pretty large, and we need to cope with that and find our functions inside these large binaries. And this is the problem, because most bindiffing tools depend on the size of your binary and not on the size of the file or open source you want to match inside it. I mean, if the library size is k, most of the tools depend on n, the binary size, and not on k, which means that although BinDiff and Diaphora are really good tools for binary diffing or patch diffing, they perform poorly on huge embedded firmware files: they simply can't find small known chunks inside a big firmware. When

I used the OfficeJet firmware project, the IDB grew to roughly three gigabytes when uncompressed, and when I used BinDiff it took three hours and it found nothing. We want something that takes into account both the size of the firmware and the size of the library we want to find, and hopefully the complexity will depend on k rather than on n. Some background: last year we had a research on the fax protocol, here's the link for the full research. The firmware for our test case was an HP OfficeJet, a simple, common HP OfficeJet, and the purpose of the research was to find a remote code execution over the fax

protocol, which worked. Now we want to look at this project and see how the plugin, Karta, would have helped us in the research. We wanted to develop a debugger that we could load using any 1-day we found in the firmware, for the web interface for example: exploit a 1-day, use the debugger, and then debug the fax worker. In order to look for 1-days we need to identify the open sources, we need to find the vulnerable functions, and eventually we used Devil's Ivy from gSOAP, which is not the important part. The hard part for us was manually identifying the open sources and manually identifying the vulnerable functions. If

we had used Karta, for instance, we can see that the basic output once you execute the script is this nice list, which shows you that the open sources that are used are libpng, zlib, OpenSSL, and a vulnerable version of gSOAP, because if you simply type "gSOAP vulnerabilities" into Google you'll find that this specific version is vulnerable to Devil's Ivy. And now the most important part is that Karta would match the functions and simply tell us: here is the vulnerable function, start the exploit from this point. And yes, simply out of the box, you take a firmware, you find out the vulnerable open sources which are

embedded inside it, match the configurations to find a function, and that's it. An additional research, this time from Project Zero: they have a series of blog posts about WebRTC at this link, and they found one specific CVE which is not in Chrome itself but in a library called libvpx, and they specifically said that this vulnerability most probably affects more versions and more products. Okay, let's take TeamViewer for example, execute Karta and check if it's vulnerable to the same vulnerability which Google found. And we found a vulnerable version of libvpx, actually a really vulnerable version, because it's outdated by two years. This is the code of the function with the

vulnerability, and pay special attention to the numerical constants inside it, because when we executed Karta we found it in IDA, and we can see the same numerical constants, and it took roughly ten seconds to find it. So the tool works, it's already on our GitHub account, now let's check out how it works. Okay, we want to map the binary, and this is actually a really descriptive mapping of the functions in the binary. We are interested in a specific open source which is found over there, and it should be somewhere inside the large firmware file, a large executable, and we're not interested in all of the executable, we only want to find the open source. If

we zoom in and pay special attention to each of the functions in the selected area, we will see that the selected area starts with all of the functions from one specific compiled source file, followed by all of the functions from an additional compiled source file, and so on and so forth. We can see that essentially the compiler compiled each file independently and then linked them all together, which means that instead of matching each function on its own we can use this behavior in order to match full files one to another instead of matching single functions. And that's the basic idea behind Karta: we want to map the files inside the binary's address space. This

is a plugin, it's already on our GitHub account, and we are focused on binary matching and not binary diffing. The matching is done on a geographical basis, which means we want to locate the open source inside the binary and then locate each file inside it. One important feature of our plugin is that it's architecture agnostic: we compiled the configuration for libpng, for example, on x86 and then matched it on an ARM big-endian 32-bit binary. You simply need to compile the configuration once and use it everywhere. Now let's see how it works. The first stage is fingerprinting the open sources which are used inside a binary. It's an

independent stage, it doesn't have anything in common with the matching phase, so we can simply do the fingerprinting and that's it. We want to find which open sources exist and which versions are used, and hopefully we'll get a specific version. For now it works with a basic string search, and you can say: okay, you simply search for given strings, why would that suffice to find libpng or zlib or net-snmp? And it turns out it's good enough, because if you look at this string, a descriptive string from libpng, there's no reason that the credit for the developer will be in a string inside the binary; usually it's in comments in the open

source, but he decided to compile it in instead. We should in the future improve this to search for constants or anything other than basic strings, but even if we look at the common OpenSSL we can see that it's OpenSSL, the specific version, the specific module, and even the compilation date for the open source, in a string inside the binary. It turns out that in most open sources that we looked at, which are popular in embedded and endpoint devices, you have such descriptive strings compiled inside. A basic string search is enough for the beginning. Now that we know that it uses libpng or OpenSSL, it's time to match the specific version inside our binary.
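The string-search fingerprinting described above can be sketched roughly as follows. This is a minimal illustration, not Karta's actual implementation: the regular expressions and the `fingerprint` helper are assumptions, modeled on version banners that libpng, zlib and OpenSSL are known to compile into their binaries.

```python
import re

# Illustrative patterns only (assumptions, not Karta's real signatures).
# Each one targets a descriptive banner string the library compiles in.
FINGERPRINTS = {
    "libpng": re.compile(rb"libpng version (\d+\.\d+\.\d+)"),
    "zlib": re.compile(rb"deflate (\d+\.\d+\.\d+(?:\.\d+)?) Copyright"),
    "OpenSSL": re.compile(rb"OpenSSL (\d+\.\d+\.\d+[a-z]*) "),
}


def fingerprint(blob: bytes) -> dict:
    """Return {library name: sorted list of version strings found in blob}."""
    found = {}
    for name, pattern in FINGERPRINTS.items():
        versions = sorted(
            {m.group(1).decode("ascii") for m in pattern.finditer(blob)}
        )
        if versions:
            found[name] = versions
    return found


if __name__ == "__main__":
    # A made-up blob standing in for a firmware image.
    sample = (
        b"\x00\x01junk libpng version 1.2.29 - May 8, 2008\x00"
        b"\xff deflate 1.2.3 Copyright 1995-2005 Jean-loup Gailly \x00"
        b"OpenSSL 1.0.1j 15 Oct 2014\x00"
    )
    print(fingerprint(sample))
```

Because this stage only scans raw bytes, it is independent of the disassembler and of the matching phase, which is the property the talk relies on.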

Sorry. Okay, we want to locate what we call anchor functions, which are strong, descriptive functions that won't easily be confused with functions from other projects or other open sources, which means that they should have quite unique artifacts. These could be unique numerical constants, such as these constants from SHA-2, I think, or it could be some really, really long string from this version of libpng: if you have this string inside your firmware, it most probably holds libpng, and if one function refers to this string you will know what this function is. This means that using unique enough constants we can search throughout the entire binary for anchor functions, and once we have found a single function we can

zoom in, because we know that the open source will be compiled into this specific area and it will look like this: we will have k-1 functions potentially before this match and k-1 functions potentially after this match, but the entire open source will be compiled into this specific area and it won't be scattered around the firmware's address space. In other words, once you find the first anchor (we have many; in libpng we have roughly 50) and we have found all of the anchor functions, we have matches to begin with. And now for the next phase. Okay, now we can draw basic file boundaries around these matches, exactly like we did when we

zoomed in on the library itself. If you know the function is contained in a file with five functions and we already found one, we have basic boundaries around it, and we can do the same all over again. Sometimes two files will be adjacent enough to border each other, so we'll have a better limit, simply saying: okay, we have a given set of functions that could be beneath us, because we have a lower bound from the upper bound of the file before us. Now we have basic file boundaries; essentially we know where several files could reside. Some of the files will be floating: we don't know where the files

could be inside the scoped area, and we refer to them as omnipresent, they could be anywhere. When we find a single match for that file we'll pinpoint its location. Only now, when we have this specific scope, can we start building a canonical representation for each function. We want to know the size of the function, the size of the stack frame, the numerical constants, the strings, the call graph, everything we need to know about each function, but we only need to do it for an order of magnitude of k functions instead of n, meaning that for libpng we need to do it for roughly 400 functions instead of 100,000 functions. Once we build this canonical

representation we can refer to that function later on, and it will look like this: we know what the frame size is, the numerical constants, which functions are being called, and from this point onwards we will only work on this canonical representation of each function. We don't need IDA anymore. This also means that if you want to implement the basic pre-processing phase for your reverse engineering tool, for example radare2, you only need to implement this part; the rest of the logic is completely disassembler independent, it only needs these representations. Now we need to look for file hints. You find them in Microsoft binaries and in many other devices: you have trace strings from traces and log outputs, which often

contain the name of the file, so it will be tcp.c or rdp-something.c, and this means that there are some functions which contain the name of the file inside them. If you already know where the file should be, we could look inside it and search for these references and find more matches, and sometimes you can locate more files inside the scoped area; this gives more information. And inside each file we have agents: they're like anchors but weaker, they have locally unique numerical constants or locally unique strings, and we can find them with a high true-positive probability, which means that the chances for error are slim. In libpng, on the OfficeJet

test case, at this phase we had roughly 70 functions that were already matched, even before we tried the traditional matching of scoring similarities, call graph and every other heuristic, which is the last phase. Here you do everything that every other bindiffing tool or matching tool does: you walk over the functions, you score each feature, you check which features match and which features don't match, and you have a basic scoring function to declare whether the functions are similar enough to be declared as matches. This part is generic, you can simply take it from your own tool, plug it in, and it will work. The most important part is the geographic location, because we have additional

scoring we could use, and we have penalties as well. Geographic matching means we have several assumptions which are hard rules we must obey. This means that a candidate for a given file must reside in the scoped area of that file. If you know that sha1_init should be in sha1.c, there's no reason to look for it in md5.c or aes.c or any other file. If you already know that sha1.c has four candidates and we don't know which candidate is which function, we only need to look for these four matches and that's it. So we have a much smaller space to look in, meaning

that it will be much faster and the probability for error is slim. In addition, most compilers tend to preserve the function order, which means that if I found sha1_init, the next function will most probably be sha1_digest and then sha1_finalize, because that's the order of the functions inside sha1.c. Karta gradually learns whether the compiler preserves the order or not, and when adaptively boosted this gives a significant number of matches with high probability, if Karta found the compiler to preserve the order. And walking on the neighbors of each function we could say: okay, we had a streak, a function from the file matched to a given

binary function, the next function matched to the next function in the binary, and so on, and we can even guess that, okay, maybe even the next function will match. We check, and if the score is high enough we'll match them. That's how Karta works. And now, how do you use it? Karta compiles configurations from the open source, because it's source code assisted binary matching. This means that you simply take your favorite open source, you compile it in debug mode so we won't have compiler optimizations, you get the binaries, and Karta is trained on these and later on writes the configuration. We have several precompiled configurations on our GitHub. Now, when you have a given

set of configurations, the first phase will be identifying the used open sources. Once we identify the specific version and we have a configuration for it, the matcher will take the binary and the configurations and will try to match them, as simple as that. Hopefully it will find most of the functions inside the binary, and we have an annotated binary to start the research from, instead of reverse engineering the same functions all over again. Thumbs Up: okay, this is a new part we added today, it's on our GitHub account, or it should be on our GitHub account now. When

working on x86 binaries IDA does a pretty good job. When working on firmware files, for example ARM files for research projects, IDA could be improved: it fails to find transitions between ARM and Thumb code sections, it fails to find many functions. We need to improve the analysis so that Karta can be used, and instead of manually identifying where a function starts and where a function ends, which could be pretty meticulous, I developed an additional part to be used with Karta, which uses really basic machine learning in order to improve the analysis of IDA. It's trained on your own IDB and then it improves the same IDB, from the same file. You could

already find it on our GitHub account, and this is the blog post, which was published right now. Thumbs Up should be used as a pre-processing phase before using Karta. We checked it on several firmware files, and we actually used it on a MIPS binary before we started a current research project. For the HP OfficeJet it took 40,000 functions to 70,000 functions, and on the MIPS file it took 200,000 functions to 50% more. I think it works quite well for us, maybe it will work for you too. And now for the results. We tested the matching results after we manually analyzed libpng, zlib and OpenSSL. We have several disclaimers, because OpenSSL

is huge: it has 5,000 functions, only 3,000 to 3,500 functions were compiled inside the OfficeJet, and I had to manually analyze each of the functions, because that's how you test it eventually. In libpng we had 300 functions and we found all of them, all of the functions that were inside the HP OfficeJet firmware. We had no false positives, and it took less than 30 seconds to find all of libpng inside a huge binary. For zlib it was even easier, because it took roughly 20 seconds with no false positives, and even in OpenSSL it takes less than 20 minutes to find the vast majority of functions inside OpenSSL, with 8 false

positives out of almost 3K functions. So in this test case it worked quite well. Later on we tested it on Windows binaries and other binaries and it seems to be consistent. These are the results from when I executed Karta on a VM on my laptop, and these are the timings, so these are not the best timing conditions, it probably could be faster on your computer, but it's fast enough. Even on OpenSSL you can execute it, go eat something, and before you get back it's finished. On libpng, libvpx, zlib, it simply finishes fast and that's it. Why use Karta? There are many good open source and closed source tools that could be used for binary matching or

binary diffing. You have FunctionSimSearch from Google, Diaphora, and Pigaios from Joxean, you have BinDiff, and a new version of BinDiff just came out. Every program here is good, but each program should be judged against the goal of the developer of the program, because if you read the documentation of BinDiff, BinDiff was developed for patch diffing, for finding similarities in different, close enough binaries. It wasn't designed to match open sources in a huge firmware file. Pigaios is the same, sorry, Diaphora is the same, because Pigaios is more matching oriented. Eventually you could use BinDiff, but it wasn't designed for binary

matching, and it's really great for patch diffing, so use it for patch diffing instead. If we compare the tools in our specific scenario of matching open source functions, we could see this basic table, and essentially there was no other tool that is architecture agnostic, that works quite well on large binaries, and that is aimed at matching the functions. It's hard to match the functions and be architecture agnostic when you use fuzzy hashing on the functions themselves, and you can't use all of the features from the open source when you try to match, because Pigaios tries to parse the sources on its own, and the macros and structs it finds will differ from each

execution, and if you simply compile it with a regular compiler you don't have to emulate a compiler to derive the features. Each tool here is good; most of them weren't designed for binary matching, they were designed for patch diffing, and they work well for patch diffing. That's it, this is the tool, thank you for coming. That was great, thank you so much. High five! First BSidesTLV talk ever, and hopefully more soon. Remember, if you want to be a speaker at BSidesTLV, you have to submit a talk in our call for papers. Questions? We have time for one question for Eyal.

Okay, so the question was: the first stage was anchor functions, and is it sensitive to different compilations and different compilers? Anchor matching uses numerical constants with high enough entropy or long enough strings, and the compilation doesn't change that, so we only need to scan IDA to search for strings or numerical constants. If you compile your own configuration without optimizations you will know which constants are connected to which functions, and on the binary itself, the compilation doesn't change what you look for at all. So the anchors are quite stable, hence the name. In libpng we had roughly 50 anchors and it finds most of them. Just before I

finish: we have a booth outside. If you wish to work on such projects in a research group, we have a course for new employees, it's called CSA, and you could visit us at the booth in the chill zone.
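As a companion to the anchor discussion in the talk and in the answer above, here is a minimal sketch of anchor search followed by the k-1 scoping step. It is an illustration under stated assumptions, not Karta's code: `Func`, `find_anchors` and `scope_window` are hypothetical names, functions are assumed to be listed in address order, and an artifact counts as an anchor only when exactly one library function and exactly one binary function reference it.

```python
from dataclasses import dataclass, field


@dataclass
class Func:
    """Hypothetical stand-in for a function's canonical representation."""
    name: str
    artifacts: set = field(default_factory=set)  # strings / numeric constants


def _owners(funcs):
    """Map each artifact to the set of function indices that reference it."""
    table = {}
    for idx, func in enumerate(funcs):
        for artifact in func.artifacts:
            table.setdefault(artifact, set()).add(idx)
    return table


def find_anchors(library, binary):
    """Return {library index: binary index} for uniquely owned artifacts.

    Simplification: if two artifacts disagree about the same library
    function, the later one silently wins.
    """
    lib_owners, bin_owners = _owners(library), _owners(binary)
    matches = {}
    for artifact, lib_idx in lib_owners.items():
        bin_idx = bin_owners.get(artifact, set())
        if len(lib_idx) == 1 and len(bin_idx) == 1:
            matches[next(iter(lib_idx))] = next(iter(bin_idx))
    return matches


def scope_window(matches, k, n):
    """Bound the contiguous k-function library inside an n-function binary.

    The library must cover every matched binary index, so it cannot start
    before max_match - (k - 1) or end after min_match + (k - 1).
    """
    lo = max(0, max(matches.values()) - (k - 1))
    hi = min(n - 1, min(matches.values()) + (k - 1))
    return lo, hi


if __name__ == "__main__":
    library = [
        Func("helper", {0}),
        Func("sha_init", {0x6A09E667}),
        Func("other"),
        Func("png_error", {"a really long descriptive string"}),
    ]
    binary = [Func(f"sub_{i}") for i in range(10)]
    binary[2].artifacts.add(0)  # the constant 0 is too common to be an anchor
    binary[3].artifacts.add(0)
    binary[5].artifacts.add(0x6A09E667)
    binary[7].artifacts.add("a really long descriptive string")

    anchors = find_anchors(library, binary)
    print(anchors, scope_window(anchors, k=len(library), n=len(binary)))
```

Everything after this step only has to look at the functions inside the returned window, which is why the later phases scale with k rather than n.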
