
Podobieństwa i różnice, czyli analiza malware'u dla leniwych (Similarities and Differences, or Malware Analysis for the Lazy)

BSides Warsaw · 2017 · 46:06 · Published 2017-10
Category: Technical
Style: Talk
About this talk
Author: Maciej Kotowicz
Transcript [en]

Mako, let's get to work! So, welcome to the after-dinner slot. I don't know if dinner was good for you, but for me it was quite OK. I'll be telling you about some strange things. The problem is that we've just changed the configuration and I can't see anything on my slides, so we'll see how it goes. If someone doesn't know me: my name is Maciej Kotowicz, I work at CERT, where I deal with strange things related to malware analysis. Everything there is going under the army now, so it won't be the whole world. I mainly do malware analysis, exploitation, Dragon Sector, which I think I didn't write down... no, I did write it down. And some other nonsense. As we all know, the Internet is a very dangerous place, right? Trojans, worms, spam.

One of the things I do is hunting for all of this and figuring out whether it's really dangerous. A lot of zeros and ones; you stare at bytes. Usually it looks like this: many windows, some code. Let's do a quick poll: has anyone here done malware analysis? Does anyone already know what I'm going to talk about, should I just skip ahead? Maybe I will, right? Okay, if someone hasn't done it and has no idea what reversing is about, then this will be a really weird presentation, but we'll try to do something about it. I have an hour for this; I won't try to use the whole hour. The person after me has a million slides and will be going longer than me.

As I said, analyzing malware is a strange job. It's doing the same thing over and over and over again. And, as it says on the slide, it certainly won't end any time soon. Malware keeps growing; there's more and more of it. Supposedly it's getting more interesting and complicated, but that's bullshit: it's boring too. That's what this presentation is about: how not to do the same thing all the time, and how to make your life easier. Excuse me, it's very hot in here, let me take this off. And everything is off. Okay. What I'm going to talk about is, broadly, diffing of code, looking for similarities and differences in it, to make life easier and not analyze the same thing in circles and circles.

The methods I'm going to talk about are mainly used for that, but they also get used for something like attribution. Everyone probably remembers this picture. I hope so, at least; I don't want to talk about it, it was very boring, at least for me. What matters is attribution: the attempt to say that something belongs to something, that someone did something, that someone is responsible for something. And one of the main methods used for it now is code similarity. Plus, of course, infrastructure similarity, and so on. And to convince you that it works a bit, I'll try a little demo, which probably won't work, but let's see what happens. It will be very fucking weird. I'll take it with me.

Let's try it. Technology keeps changing. I remember when this started, in a beautiful room, was it at the Polytechnic? Yes. Where there was literally nothing, and it worked. What you can see here... it doesn't work the way I need it to. What you can see here is a copy of IDA with a freshly loaded piece of malware. I analyzed it once, but that's very boring, so we're loading it again. A lot of functions, a lot of unknown stuff. So we'll try to make life easier with the weird things I'll talk about later. I've got it here.
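The workflow in this demo, persisting names from one analysis and re-applying them to the next sample, boils down to a lookup table keyed by a hash of each function. A minimal sketch of the idea (the helper names and JSON format are my own illustration, not the actual on-disk format of rizzo or IDA; note that hashing exact bytes breaks as soon as the compiler changes anything, which is exactly the limitation discussed later in the talk):

```python
import hashlib
import json

def func_hash(code: bytes) -> str:
    """Identical function bytes give an identical hash."""
    return hashlib.sha256(code).hexdigest()

def save_names(named_funcs: dict, path: str) -> None:
    """After an analysis, persist {function bytes: name} as {hash: name}."""
    db = {func_hash(code): name for code, name in named_funcs.items()}
    with open(path, "w") as f:
        json.dump(db, f)

def apply_names(funcs: list, path: str) -> dict:
    """On the next sample, look up every function we have seen before."""
    with open(path) as f:
        db = json.load(f)
    return {h: db[h] for h in map(func_hash, funcs) if h in db}
```

In an IDA plugin the two byte strings would come from walking the function list, and the returned mapping would be used to rename the matched functions.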

As you can see, I like one-letter script names; why call things properly? Ah, it doesn't work like that. The script I'm using is quite old and this IDA is new, so it doesn't work as it should, but we'll try. It's not easy, trust me. Look, there's a mark. Ah, there's a success sign. Something happened; something reversed itself. So it does make our lives easier, if we reversed something earlier. As I said, this work is very repetitive; we keep reversing the same things. So if we can prepare some information up front, we can recreate our previous work. It's not that easy to show that a function is the same, or at least similar.

So the techniques I'll talk about try to show how to approach the problem of not reversing the same thing twice, by writing down your work and recreating it. You can see that some... I don't know, can you see anything on the left side? Let me shrink the window. Maybe it will be better this way. You can see that some new functions have appeared and they have names, so we don't have to worry about them. We trust our previous work, of course: that we did it well, and that what's called asinit really is asinit, and not "sink the nearest nuclear power plant". One could do either, this or that. These are some simple signatures doing this; the tool itself has its drawbacks, which I'll talk about later. But for now we can see that something works and we can make our lives easier.

Let's try something even funnier. I don't know if it will work. There was supposed to be an attribution demo, which doesn't work. Oh, never mind, I don't see anything happening on my computer. I'll try it later. So we didn't pray well enough and it didn't work out. I'll get drunk. I lost my slides, I don't know where they are. Oh, here they are. Okay. So, these were signatures, and everyone claims that signatures are dead, AV is dead, nobody gives a shit. And now we have to move on to, what's hot now? Machine learning, right? Machine learning, big data, behavioral analysis, whatever. Throw some features at it and something will come out. Of course, I claim that signatures are not dead at all; they're quite good. The problem is choosing what we use these signatures for, and what the source of these signatures is. We can build signatures on raw bytes, and that's the part people say doesn't work, because malware is dynamic, it's packed, it modifies itself, it makes you tea on the fly, or whatever, depending on how sensational we want the presentation to be. But instead we can pick certain characteristic features of, for example, the functions in that malware, and build signatures on top of those, which tell us that this function has these and those properties. And if we define them well, then another function with the same properties is likely the same function. These features can be the disassembly of the code, they can be constants. For example, let's try it, pathologically. Maybe it will be visible.
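A classic example of such a characteristic feature is cryptographic constants. A small sketch (the MD5 initialization constants are real; the byte-scanning heuristic and function names below are my own illustration):

```python
import struct

# Well-known MD5 initialization constants; seeing all four in one
# function body is a strong hint that it is MD5 init
# (SHA-1 adds a fifth value, 0xC3D2E1F0).
MD5_INIT = (0x67452301, 0xEFCDAB89, 0x98BADCFE, 0x10325476)

def looks_like_md5_init(code: bytes) -> bool:
    """True if the body contains all MD5 init constants as
    little-endian 32-bit immediates."""
    return all(struct.pack("<I", c) in code for c in MD5_INIT)

# A made-up function body: mov eax/ebx/ecx/edx, <constant>; ret
fake = (b"\xb8" + struct.pack("<I", 0x67452301)
        + b"\xbb" + struct.pack("<I", 0xEFCDAB89)
        + b"\xb9" + struct.pack("<I", 0x98BADCFE)
        + b"\xba" + struct.pack("<I", 0x10325476)
        + b"\xc3")
```

The same trick works for AES S-boxes, CRC tables, and so on: the constants survive recompilation even when everything around them changes.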

It doesn't work. Let's try it on AES. Code that isn't relocated doesn't work. Never mind. If we have some cryptographic operation, for example the initialization of hash functions, it has constants that make it very easy to recognize what the function is. Since they keep repeating, we can build a signature on those constants, and it'll be quite decent. For example, we can make a beer riddle out of it: who can recite the MD5 constants from memory? Redford, can you? It's simple. I don't know, I'm just standing here. And back again. Good. But they already know them, so no beer for them, that would be pointless. It was an invitation for someone else to answer, but since there was no one else, tough. Let's move on.

References to strings, to tables, things that are unique to a given function. Obviously, if a function references some tables, it wants to do something with them, so that's a good characteristic. We can try to estimate how complicated a function is. There's something called cyclomatic complexity; I'll tell you how it's calculated a bit later. It's a notion from graph theory, which I haven't had anything to do with for a long time, but we'll try to talk about it. Functions, if we represent them as something called a control flow graph, a graph of control flow, form, well, since it's called a graph, a graph, of course. So we can measure that graph and say something about it: about its shape, the number of nodes or edges. That says a lot about the characteristics of a function. So does which other functions it calls, which system functions it uses, which API calls it makes. There's a lot of research on just that: trying to estimate what a given function does based only on the system functions it uses.

Are there ready-made databases of such characteristics? I said "there are", and I lied a bit, actually. There are, sometimes as add-ons to tools, but of course they're tied to a given tool, so the databases are per-tool. IDA itself ships with a certain number of patterns, it's called FLIRT, and it mostly covers the base libraries of a given environment: the Delphi runtime, the Windows runtime, whatever. These are very simple signatures, so they work accordingly. There were some projects, which I'll talk about at the end, built on the tools themselves to get at this, but there aren't many of these databases; usually you just build one yourself and work off it. There should be something here. Oh, there is. And now, having certain characteristics, having these bytes, we can express the signatures in some form. The most obvious is of course... this is not my good day. The most obvious form is the pure one, just searching for bytes. We take the bytes belonging to a given function, we search for those bytes, we find them, yep, we have this function. That gives certainty that it's exactly it. But searching for raw bytes is really slow, so it doesn't work well. We can make hashes out of them, assuming the hash function is any good: MD5, SHA1, whatever. That's fast enough to compare, and precise enough to say that if something has this hash, then it is exactly these bytes. And we can build patterns over bytes, like regexes, which say that these bytes can look like this,

or like that, which affects the size of the signature, its accuracy, and so on. And we can apply the same to features: take some feature and hash it in a way that lets us search our database quickly. One of the methods of building these is n-grams. This is one of the first places where I have no idea what I'm talking about, and as we go on I'll know less and less, because this year I decided I'd learn something new and, instead of showing what I've already been doing for a while, do something I have no idea about. So: I learned it yesterday, you're learning it today, and maybe by tomorrow we'll have learned it together.

You can find n-grams wherever any machine learning is involved, so right now: everywhere. Probably if you order coffee from Starbucks, or wherever you order coffee from, it's done by machine learning, and they divide the coffee into n-grams and send you the individual bytes. The idea is that we take, and it's a very beautiful word, "n-gram", which says nothing, a substring of length n starting at each successive byte. So instead of searching a million bytes we search half a million. Great discovery. But you can obviously compute tons of these, and some Tesla will chew through it. We can do it on bytes, but we can also do it on mnemonics: if we disassemble a function, we get a string of mnemonics and can operate on those. And that somehow works too. Signatures get built on this, mainly in machine learning. I'm not much of a fan of that idea, so I mention it just for completeness and don't really use it. The previously mentioned FLIRT technology, however, is based on n-grams. The people from Hex-Rays, the company that develops IDA Pro and the Hex-Rays decompiler, create those signatures by looking at the first 20-30 bytes of a function, I don't remember exactly how many, and build signatures on that: if a function starts with these 30 bytes, then it's probably whatever. You can guess that it doesn't work quite right.

A slightly better idea is masking, masking the code first of all. If we take some simple code from some random place in some random malware, we see it looks like this. What is very compiler-dependent is of course the allocation of registers and the layout of variables on the stack. If we consider only a binary signature based on bytes, it tells us nothing, because the bytes themselves carry no information that this is a call to OpenProcess or anything else; they're just some bytes saying we're going to jump over here. So what can we do to make it more generic? We can mask out the values that vary: first of all the offsets of variables on the stack, then the addresses that calls and other jumps target, things that depend on where the code sits and not on the shape of the function. And for this, Yara is ideal. These two slides are probably the only real work, I mean actual coding, that I put into this presentation. We can take the function, disassemble it, put register wildcards where the registers vary, put byte wildcards where we don't know which bytes will appear, and generate a nice string that says "this code is this code".
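The masking step can be sketched in a few lines. This toy version (my own illustration, not the speaker's actual script) wildcards only the 4-byte relative operands of `call` (E8) and `jmp` (E9) by scanning for the opcode bytes and emits a Yara-style hex pattern; a real implementation would walk a proper disassembly instead of guessing opcodes from raw bytes:

```python
def mask_pattern(code: bytes) -> str:
    """Turn raw bytes into a Yara-style hex pattern, wildcarding the
    rel32 operands of E8 (call) and E9 (jmp) so the pattern survives
    relocation and re-linking."""
    out, i = [], 0
    while i < len(code):
        b = code[i]
        out.append(f"{b:02X}")
        if b in (0xE8, 0xE9) and i + 4 < len(code):
            out.extend(["??"] * 4)   # mask the 4-byte displacement
            i += 4
        i += 1
    return " ".join(out)

# push ebp / mov ebp, esp / call <rel32> / ret
code = b"\x55\x8b\xec\xe8\x10\x20\x30\x40\xc3"
print(mask_pattern(code))  # 55 8B EC E8 ?? ?? ?? ?? C3
```

The resulting string drops straight into the `strings` section of a Yara rule.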

Let's go back. Someone doesn't know what Yara is? Yara is a tool for matching byte patterns. It's quite sensible, made by people from VirusTotal, then bought by Google together with VirusTotal. So it's a success by association, because Google is great, or whatever. And it works quite well; it's quite a sensible language. It allows matching at the level of nibbles, so 4 bits, you know what I mean. I mean, is there some other architecture? Well, not exactly, because it works per architecture: on a different architecture you have completely different bytes, because you have different instructions. We'll get to things that survive cross-compilation in a moment, the "smarter" ones, i.e. the graphs. I'll get to that. But this is very simple. This is what I use to build my signatures, the ones I need to find specific locations in code, for the tools we use to automatically extract information from malware. And that use case is good, because we know exactly what we're looking for; we just want to get to one piece of code. Using it to say whether a given function is... doing it for all functions is a bit... not optimal, because the signatures get big, since there's a lot of code in a function. But for simple patterns it's very nice. And you can express quite a lot, because beyond whether certain bytes match, we have a whole logic language on top. Yara is generally nice, I recommend it. You can even find a guy who made an entire presentation in Yara alone. He talked about Yara, turned Yara on, and the slides flew by. Kind of awesome. Yes.

A little about hashing. One of the methods of hashing a feature is something called the Small Prime Product. It was invented, I don't know where, but it was probably first used by people from zynamics, now Google again. I don't know why Google keeps coming up everywhere; this is not a sponsored presentation. It's a very nice technique that lets us avoid the problem of instruction reordering: the fact that after recompiling a given piece of code, the instructions will look a little different. What we do here is take some characteristic, in the original use it was mnemonics, assign a prime number to each mnemonic, to get unique factors, and multiply them. Multiplication is commutative, so it doesn't matter in what order the instructions appear; the value will be the same. For good features, mnemonics for example, it gives a lot of accuracy. We can use it not only on mnemonics but also on the AST of a given function. So it's fairly universal; we can go a bit beyond the assembler itself. And this is the basis of

saying that a given basic block is that basic block. If we take the CFG of a given function, we have some flow. I wanted to say something, because I see you're sleeping well. Sorry. But I understand the lack of sleep, so I'm sharing the problem; I'd like to sleep well too. Anyway: this identifies a basic block with great accuracy. It's unique, it's what we're looking for, and it will be used in a moment. What do we have next? Cyclomatic complexity, which I mentioned before. It's a measure of code complexity, quite a simple metric. We take the graph, count all its edges and all its vertices, and compute M = E − N + 2P, where E is the number of edges, N the number of nodes, and P the number of connected components. In IDAPython you can express it in five or six lines. Sure. The vertices are basic blocks; the edges are the transitions, i.e. jumps, conditional and unconditional. Maybe I should actually have started from the beginning with what I'm talking about. I assumed everyone would know; probably wrongly. So, this tells us how complicated a graph is. It's used not only for what I'm describing, but first of all for building general code metrics, really. Only later did someone think: "Ah, let's use it somewhere else and see what happens."

We can use it when starting to analyze malware. Because when we get a piece of malware, there's a lot of code, a lot of functions, and we don't really know where to start. And experience, or my previous presentation, or someone else's, says it's worth starting from a place that's complicated, like cryptography. Generally, where something is complicated, something interesting is happening, so that's where to look for something interesting. The same applies to malware analysis, but also to finding bugs. Because everything I'm talking about, its use for malware analysis is secondary. You'll find much more on the Internet if you search for "patch diffing". Simply put: we take a patched binary, we take the previous version, we do a diff, we look at it, and here three bytes have changed, which means there was a bug here; let's try to trigger it. That's the main use. Only recently have people started thinking about making other things easier with it too. Cyclomatic complexity can likewise be used to start reversing: we take something that is complicated and we look at what's going on there. Maybe it's an interesting function from the malware point of view, maybe from the bug-hunting point of view, because, you know, there's a chance that complicated code has more bugs rather than fewer. And now some magic. I think I need to get drunk, because it's going to be hard. As I said, there will be more graph theory here. It's a bit like my late-high-school days, which were a

long time ago, when I was a good boy and not so much... whatever, I didn't go to conferences. So now a bit of magic, again from the zynamics people. It was invented, I don't know, 10 years ago, and nobody has come up with much better since. It's a certain way of expressing the characteristics of a graph. The problem with a graph is that it's very hard to search for. You can't really ask a database about graphs, about what things in the graph look like. You can ask about ints, at worst about strings. Ints are perfect. So we'll ask about ints. As I said, the graph gets turned into tuples. A tuple looks like this: in the simpler form, we take an edge of the graph and ask what its characteristics are. What can we say about an edge? We can say how many edges go in and out of the two vertices it connects. What else? Since we're talking about basic blocks, we want to... right: we impose some order on these graphs, usually a topological order, as far as that's possible. We want the graph sorted by flow: at the very beginning a node with no incoming edges, at the very end a node with no outgoing ones. So we start at the function entry and end at the ret.

And now, if for each edge we have the in- and out-degrees of its two vertices plus their topological order, that says a lot about what the graph looks like. That's a lot, of course, but not enough, because we can have many functions with exactly the same shape and completely different properties. So we also add a characteristic of the basic blocks themselves, the vertices; in this case, for example, the product of primes over the mnemonics in a given basic block. We could add more there, but I don't know what. Ah, and some per-block counts, loops and the like, get folded in on each side as well. And this gives us tuples. Tuples are cool; tuples are much simpler than graphs. But they're still not numbers. There are, obviously, magical mathematical methods of turning tuples into real numbers, namely multiplying each position of the tuple by the square root of a prime, where the prime depends on the position in the tuple. Did you get that? Neither did I. But it works. It's very simple: we take each value in the tuple and multiply it by, say, the square root of 2, of 3, and so on, the root of the i-th prime, or something like that. It can be found on the Internet. And this gives us a number. And this number is quite nice: it's unique enough to represent exactly what we want. And having this, well, it doesn't represent

it that much, but it has the nice property that it allows you to decompose it back into the graph. I won't talk about that, though, because it's complicated and I have no idea about it at all. So that's what was chosen. And these two things are really the basis of the BinDiff tool, which is very well known to people who do vulnerability research or malware analysis. They are also the basis of all those tools zynamics once created, i.e. VxClass and BinCrowd and whatever else, besides BinNavi of course. BinNavi is something completely different, related only to graph analysis. But it is also part of the whole codebase that Google took over.

You could probably do a bit more with these graphs. The problem is, if anyone knows anything about graphs, that graph isomorphism is hard: checking whether one graph can be transformed into another, and we know what isomorphism is. Searching for subgraphs is outright NP-complete, so we are not feasibly able to say that something is part of a given graph. And in general, the similarity of graphs is also hard; generally, graphs are hard. I don't know. So it's all cool, but unfortunately, apart from the fact that it's easy to write a paper about it, it's not really applicable to much, because it's hard to find algorithms that are both fast and accurate. Yes. We can go a little further. We can say that something is similar because it behaves similarly. If we have a function, a complete black box, 100 bytes, we don't care what it is, and let's say we managed to guess the calling convention, we give it three arguments: some buffer, some other buffer, and a length. We get back that one buffer now contains the other buffer, exactly "length" bytes of it. It's easy to say: I think it's memcpy. If it moos like a cow, it's probably a cow. I think the original saying is longer, but I can't remember how it goes. I remember it was good. Yes, we can achieve things like this with, maybe not simple, but by now easily available technologies like emulation or symbolic execution. Yes, and SMT solvers, which are not exactly the same thing, but never mind. We can emulate a piece of code and observe its input and output. Assuming we have a good emulator, with simple functions it works very well. As I said: for simple things. For more complicated ones, or if we want to be more precise, we can execute the code symbolically, write down all the constraints that get created, then try to solve them with some SMT solver, like Z3 or another, and compare those constraints. If they are the same constraints, then it's probably exactly the same code. But that's also neither easy nor scalable.
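Stripped to its core, the black-box test looks like this in plain Python. Here the "unknown function" is an ordinary callable (my own simplification); in practice you would drive the candidate machine code through an emulator such as Unicorn and compare memory before and after:

```python
def behaves_like_memcpy(func, length: int = 16) -> bool:
    """Black-box probe: call the unknown function as (dst, src, n) and
    check whether dst ends up holding the first n bytes of src."""
    src = bytes(range(length))
    dst = bytearray(length)
    func(dst, src, length)
    return bytes(dst) == src

# Two "unknown" 100-byte blobs, standing in as Python callables:
def mystery_a(dst, src, n):          # actually a memcpy
    dst[:n] = src[:n]

def mystery_b(dst, src, n):          # actually a memset(dst, 0xCC, n)
    for i in range(n):
        dst[i] = 0xCC
```

Running the probe labels `mystery_a` as memcpy-like and rejects `mystery_b`; a real harness would try several input shapes to rule out coincidences.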

The idea is to build a system that lets us connect to a database, find all the information about the functions we have, pull the data about them, and not worry about it. Instead of hundreds of functions, we reverse two or three, because we've already seen those hundreds somewhere. Or, if we want to do attribution, we know that this piece of code belongs to OpenSSL, this is curl, this is something else, and so on. Fortunately, we don't have to build it ourselves, although that would be very nice, because there are some tools that work more or less correctly. I'll try to walk you through them. The one I showed at the beginning, where of course I lost one of the "z"s, is called rizzo, and it's a very simple script by a guy who goes by devttys0 on the Internet; he was mainly hacking routers before it was fashionable. Hi Michał, greetings. He created it because he lacked decent heuristics for finding library code in large firmware. As you saw, it works quite well; I use it with good results. After every analyzed piece of malware it takes all the functions that already have names, derives some characteristics from them, makes some hash out of that, and writes it into a simple file, a Python pickle probably, and later you can restore it nicely.

The downside is the same as FLIRT's. FLIRT, as I said, works on a much stricter principle: it takes a piece of the beginning of a function and matches names against it. The main problem with these tools is that they only carry the name of the function, nothing more. One of the things that really matters during reversing is having the types we operate on. If a function takes some arguments, we want to see what those arguments are, what the prototype looks like; we'd like to import those signatures instead of reversing them again. Unfortunately, none of these tools has that. That's why they're sad. There have been some attempts at this, and these are projects that come and go. BinCrowd from zynamics is dead. Back when I was looking through these things, I thought it was one of the better designs. It worked exactly as I said: it took the features of a given function, sent the information about the function to a server, stored it, all good. But it has been dead for longer than I've worked at CERT, so I never really got to use it.
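Several of these tools reduce to computing an order-insensitive hash per function. The small-prime-product trick mentioned earlier fits in a few lines; this is a sketch of the idea, and the real zynamics implementation differs in details:

```python
from itertools import count

def primes():
    """Unbounded prime generator (trial division; fine for small counts)."""
    found = []
    for n in count(2):
        if all(n % p for p in found):
            found.append(n)
            yield n

# Each mnemonic gets a fixed prime.  The assignment itself is arbitrary;
# it only has to be stable across analyses.
PRIME_FOR = {}
_gen = primes()

def spp(mnemonics) -> int:
    """Small-prime-product hash: multiply the primes of all mnemonics.
    Multiplication commutes, so compiler reordering of the instructions
    does not change the value."""
    h = 1
    for m in mnemonics:
        h *= PRIME_FOR.setdefault(m, next(_gen))
    return h
```

Two basic blocks with the same instructions in a different order hash to the same value, which is exactly the property the talk relies on for basic-block identity.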

The newest one is the one at the bottom: FIRST, the Function Identification and Recovery Signature Tool, from Cisco, from Talos. It's an attempt to finally make a good tool, nicely packaged. But I'll get to it in a moment. First, CrowdRE, which was, in my opinion, the closest thing to an ideal working tool. It worked exactly as needed: it stored not just the signature of the function but also the types, and you could import everything nicely, decide whether a given function fits, whether you trust the person who uploaded it, because of course it all lived in some cloud, which people didn't like, since we don't want to share with others, but that's another matter. It worked quite well and let you import a lot of data. When I was reversing Zeus for the first time, it helped me a lot: I reversed only 40% of the code, not 100%, because most of it turned out to have been done earlier. I don't know much about Polichombr, because it's French and everything about it is in French. But they have some cool ideas for further hashes, which work quite well. I recommend reading their paper, in French, if anyone can read it. FIRST, as I said, is in its first version, a test, and it's the only one that is fully public. But I don't recommend reading the code, unless you want nightmares. It's more or less thousands of lines of code in one file, all at once, class after class, functions inside all of them.

And the funny part is how it works: it takes all the bytes of a given function and sends them to the server, and on the server it does everything again, so it decodes the instructions and runs the matching engines. There are three engines defined by Talos: one matches the exact same bytes; one works post-disassembly, taking masked bytes with the variable pieces I mentioned earlier stripped out, and compares those; and, I think there was one more, also very simple and not very useful, one just computes a hash of the mnemonics, sorry, a SHA-1 of the mnemonics. All of which is very nice and beautiful, but we lose all the data IDA already has, the whole graph structure, which is quite important. And it doesn't save types either. But there are also tools that are more interested in differences than similarities. That's the previously mentioned BinDiff, which is probably still being developed, I think. It was supposed to be open-sourced; that didn't work out, and it used to cost about 200 dollars. "Ah, so you get a binary here, but you can't see what's inside." Well, you can always reverse it. And it works, somehow. There's turbodiff, DarunGrim, then there was patchdiff. These are all attempts to make a BinDiff once

again, from the moment zynamics was bought by Google: people expected sales to stop and started complaining that we'd be left without these tools, so everyone set out to create their own. I don't think any of them is still developed. I saw there are some presentations about r2 here. I have a very specific opinion about r2, so allow me just to note that there is radiff2, which I don't know if it works at all. I tried to do something with it once and it kept segfaulting, so maybe it works better now; that was a few years ago. Then there is the newest project, by a man from... not from Catalonia, but somewhere nearby. Also from Spain, sort of. From the Basque Country, exactly. Thank you. His name is Joxean... no, if you ever find me, you'll kill me for this presentation. What? Never mind. He's matalaz on the Internet; among other things, he's the author of a book about antivirus hacking. And this is a rather new tool, only two years old: Diaphora. It has all the heuristics I mentioned, plus a lot more. First of all, it's the first tool to use the fact that we have a little more than just disassembly, we also have decompilation, so we can work on that too. It tries to apply all these heuristics to the decompiled code, if we have a decompiler. Recently, one of the engines that generate the graphs for Diaphora is also radare2. It works quite well; there was a presentation at the last r2con. I recommend it, not even to see how it works with radare, but how it works at all. It's quite sensible.

What else could be done with this? You could try some other funny hashes. Yesterday a friend told me there are things like MinHash and SimHash, which can be used to combine different features. MinHash and SimHash are hashes used by, again, Google. MinHash was used in AltaVista, and SimHash in the crawler, which is known to look for similarities between pages. So there's a chance they work quite well. I don't have any other ideas. If someone has, I'd love to hear them. But in reality, someone has to sit down and write concrete code that works as it should. Because, unfortunately, the tools that are usable and open aren't really there any more. I'm giving this presentation, among other things, to organize my own knowledge. Though I can believe I've created more chaos than order. But I like chaos. So yes, if anyone would like to work with me on something like this, I invite you to contact me. If not, that's also fine. Maybe I'll finally pull myself together and do it. If any of this was interesting, or you like masochism, or you don't want to end up working for the army in the near future, we invite you to NASK. And that

would be it from me. Any questions? How do you deal with code obfuscation? I don't know. Something like that? Not really. These are static tools, of course. The big plus is that when reversing malware, the code is usually not obfuscated. It's packed at the beginning, but by the time we're reversing it, it's usually no longer obfuscated; that's very rare. You can just dump it with a debugger first and then use these tools on the dump. So it's the same for packers? You don't want to identify packers too much, and above all you don't want to analyze them. You unpack with whatever method works. I don't unpack malware by hand at all; it's a waste of time.

I'm very grateful to the people who upload all those millions of videos on YouTube about how to unpack malware; it's a waste of time. All of it can be done semi-automatically, which I've already mentioned before. So yes: once it's unpacked, and if it's still packed you can write tools that unpack it, we keep working on what we already have. Anything else? Thank you very much.