
The Exploit Factory: Building a home exploit mining cluster

BSides DC · 2018 · 29:46 · 87 views · Published 2018-11 · Watch on YouTube ↗
Category: Technical · Team: Red · Style: Talk
About this talk
In this talk John Dunlap will present a method for building a small-scale compute cluster oriented toward large-scale smart fuzzing on a home budget. John will present methods for converting performance concepts normally reserved for scientific computing applications into practical "fuzz farm" techniques. Topics such as high-speed multithreading, vectorization, process management, and cluster node management will be discussed in a manner friendly to those new to scientific computing.

John Dunlap (Senior Security Engineer at GDS) is a NYC cyber security expert. He has given presentations on his exploit development research both at home and abroad, including talks at Defcon, Derbycon, and Australia's Ruxcon. He is a major proponent of hacker culture preservation and a supporter of the international demoscene. He specializes in reverse engineering, exploit development, social engineering, and source code analysis.
Transcript [en]

BSides DC would like to thank all of our sponsors, and a special thank you to all of our speakers, volunteers, and organizers for making 2018 a success. My name is John Dunlap, and this is a talk called The Exploit Factory. It's about how you can DIY yourself a little compute cluster to do some fuzzing, find some vulnerabilities, and maybe make some exploits. This is for you if fuzzing is a hobby, or maybe you're running a small business and you're looking to spend a few hundred dollars a month on a very powerful fuzzing rig. As a secondary concern, we'll think about what kinds of optimizations we can learn

from the scientific computing world and apply to the fuzzing rig we're building, in order to get some kind of gain. Instead of just using off-the-shelf images or whatnot, we'll try to do something to our fuzzing setup to make it run faster, maybe at the cost of a little extra engineering time. Say we're a home security researcher like myself and we want to use some of our acumen to get more return on our dollar; that's what we're going to talk about here today. This is me, John Dunlap. I hack like this: I find CVEs, I like to do binary exploits, I like to do fuzzing, I like to do reverse

engineering, and I also do a lot of bio work. I did a talk at DEF CON earlier this year about encrypting data in living organisms: editing E. coli and putting ciphers in them. Pretty cool stuff. Here's my formal about-me card with all my personal details, if you want to get to know me. If you want to check out my previous talks, here are some links; it's all on my GitHub. DNA encryption, machine learning, FPGAs, the history of hacking (even deep dives into that), 3D printing security, and return-oriented programming gadgets for exploit development. So I do

a lot of research as a researcher and reverse engineer. By the way, while we're on the topic of talks, did anyone go to DEF CON this year? I was covered in sand for a couple of days. It was great; I think I can still taste that sand. It took a while to get out of my teeth. So who is this talk intended for? People who like to do fuzzing and want to do it faster. I think over the last couple of years everyone's discovered that AFL is this magic beast that finds bugs, but sometimes it's slow, on some platforms it's extra slow, and for some kinds of software it's completely non-viable. This is

for people interested in getting more out of their fuzzing sessions for the money they're putting into them. I'm one of those guys who has to take everything way too far, and as soon as I started doing fuzzing stuff I thought: how can I throw way too much compute power at this? How can I over-engineer the daylights out of it? That's kind of what I did, and it turned into an interesting experiment in where you can get a little more bang for your buck. Now, here are my motivations for this talk. I was limited in my fuzzing setup. I wanted a semi-automated system; I wanted to be able

to throw jobs at AFL through a job queue, because if you've ever used a fuzzer, there's a setup process that's kind of time-intensive: you have to find a way to throw stuff at it, and I wanted something that could be replicated without a lot of overhead. I ran into a lot of problems doing 3D printing research. There's this whole ecosystem in the 3D printer world of slicers. If you've never used a 3D printer, a slicer is kind of like a compiler that turns a 3D model into a set of instructions the machine can understand; it's the same stuff that CNC machines use, called G-code. I found that

there were a lot of vulnerabilities in the software, but the software ran so slowly that it was a little immune to fuzzing, and it wasn't so helpful in the automation stage. So in this talk we're going to explore techniques for porting high-performance scientific computing optimizations to a small-business or individual fuzzing rig. We're going to evaluate each of the standard techniques that people would learn in a scientific computing class and see if they offer any benefit for our purposes whatsoever: see if we can't get more crashes per hour of compute time, see if we can't get better crashes, and see if we can't get some better automation, which limits the pain of setting this up.

Now we're going to define some trade-offs. Everything in engineering is a trade-off, right? We can't have everything. The way a lot of people or companies would approach this situation is to just throw power at it: why prematurely optimize when you can just throw a couple more CPU nodes at it? For me, I'm not made out of money. I'm willing to spend an inordinate amount of money on this, for some reason, because I'm fascinated with cluster computing, optimizations, and optimizing compilers. But what we're going to try to do is not spend all the money; we're going to take the money we have and trade

engineering time to get the profit. All right, here are some previous fuzzers. For time's sake and complexity's sake I'm not going to get too deep into these; hopefully there'll be a version of this talk someday where I benchmark my system against them, but I didn't, and honestly they're not optimized in quite the same way I'm aiming for. These kinds of micro-optimizations are honestly, in a more widely applicable system, not super advisable, because you're going to trade a lot of engineering time for what you get. But I like to tinker, this is fun, and I think a lot of people will get something out of it. So before we get too deep into the

application side of this, let's talk in general about some high-performance computing optimization techniques. Roughly: if you were going to take a class on scientific computing, what kinds of techniques would you learn to optimize your C code to make it faster? We're not generally talking about big algorithmic optimizations; we're talking about things like optimizing for cache. We'll go over this in detail, and then we'll go over how we can apply those optimizations to a fuzzing scenario, whether it helps at all, and in what scenarios it does help. Just keep in mind this comes from an optimization kiddie; I never worked for Sandia or anyone like that.

I just took a couple of Intel classes, bought the Intel optimizing compiler, and learned everything I could about why this stuff works, so please enjoy my leet techniques, if you will. Also, just as a warning, you might hear a lot about these guys; I'm not trying to advertise for them, it's just what's out there if you're looking to write super fast stuff. So let's do a quick overview of the categories of optimizations we're going to apply to our system. We might look into applying SIMD optimizations, which is same-instruction-multiple-data parallelization of a sort; we might be doing threading with

OpenMP; we might be doing optimizations for cache, various kinds of multiprocessing, GPU acceleration, perhaps some FPGA acceleration, and optimizing compilers. What we're not generally talking about are improvements to the algorithms themselves, and the reason is that, number one, finding a fuzzer that generates better test cases isn't the point of this talk. What we're asking is: given the test cases we have, how can we generate them faster, and how can we get through the program's execution states faster? We don't necessarily have control over the algorithm we're using and improving at that level. I think everyone's right to point out that improving the algorithmic complexity is the best way to do this before micro-

optimizing things. So let's move on to the individual optimizations. I think a lot of people who do C programming have heard of SIMD optimizations, so I beg your pardon while I describe it in a little more detail than is probably necessary, for the people who are not familiar with this kind of thing. We're talking about same-instruction-multiple-data: Intel and AMD style assembly instructions that allow us to do addition and multiplication on vectors instead of scalars. Your compiler already does optimizations like this, but it's imperfect. Compilers have to be conservative in their behavior: they can't optimize when it's potentially unsafe, only when it's provably safe, and

that means they leave out a lot of opportunities for what we can do. If we want to trade engineering time for compute time, and we think we can make a good engineering decision to optimize further, we might save a good bit of money where it applies. If it helps you visualize what SIMD is in terms of registers: we're used to sixteen 32-bit registers, but SIMD gives you registers of up to 512 bits, and when we're doing things like matrix multiplication, a very thorough SIMD optimization might really give us some degree of speed-up, on the order of 10 or 20%.

C programmers who aren't into this will say you can't do that, the compiler will always do it better, and it's just not true. I'm not going to entertain that. Like I just said, the compiler only sometimes does it; it has to be conservative, and we will find situations where we can help ourselves by doing it manually, and hopefully the compiler or library we're using enables us to do that efficiently. If you're wondering about the general case, because when I got into this I asked, how do I find the cases where this doesn't happen: conveniently, if you use the Intel optimizing compiler, it will just tell you. It will look at every loop

in the entire program and tell you where it gave up, where it didn't do it. If you're manually looking through code for situations where the compiler gives up, you're looking for data-dependency situations. In many cases the compiler decides that one piece of data is unsafely dependent on another; I'll show that in a second. If the compiler thinks that, it will never try to optimize, but in a lot of those situations we can teach the compiler that it's okay, or we can just do things manually. So here's a case for y'all to think about of what an unsafe-to-parallelize situation looks like: the Fibonacci sequence. Most people are familiar with it; it comes up

in coding interviews and that kind of thing. In this case, because we're computing each value from previous values, we can't do it all at the same time; it's inherently serial, and in this case we can't just refactor things into matrix multiplication. So keep that sort of dependency in mind when we're thinking about how to speed up fuzzing on our compute cluster: is our operation serial, or can it be refactored into the form of matrix multiplication? Next, threading. With threads we get context switching with other pieces of the process, but it's not deterministic; we can't count on any particular timing of the threads.

We can use a library like OpenMP to add some safe and smart threading, but we still have to worry about data races; that's what smart parallel programming is about, and I can't get too deeply into it here, but you can use OpenMP pragmas like this to define mutexes and whatnot. We're going to think about whether that will help our fuzzer situation, whether it's even worth worrying about. We also might think about cache optimization, which is almost always totally manual: arranging data accesses so they avoid cache misses. And again, does that work with fuzzing: can we do an entire fuzz job and keep it in the cache? Most people think this is black magic, but it's actually pretty approachable.

Go and read the compiler docs. There are also special caching architectures you can get: if you have a Xeon Phi, they have MCDRAM and shared-memory architectures like that, which make it easier to have a much larger cache that more seldomly invokes the miss penalty. Here are the docs for that, or the description of what it is. If you use the Intel optimizing compiler there are even options to do it automatically, but again, you have to double-check that the compiler is actually doing it. We'll explore in a second whether these are useful for our purposes. Or you might have a multi-host setup, like the cluster we're hoping to build.

Then we have to find a way to spread out our algorithm. We have to think about MapReduce-style approaches: can we split up the action of fuzzing? Are we going to split up the generation of fuzzing test cases among multiple CPU nodes? Are we going to split up the execution? Or are we going to do something even wilder and split up individual parts of the process? With MPI we can actually go through and say, only do this loop on certain hosts, and that can be really useful. Here's an example of what the OpenMPI pragmas look like. This is kind of a deep topic, so we're going over it fast, but I think you get the idea.

Then one idea that I'm just going to put in your mind, which we're not going to explore too deeply at all, is GPU acceleration. This is kind of the future, and NVIDIA has some really cool technology for automating these kinds of optimizations, automatically turning C code into a format that can be executed on the GPU. But for fuzzing it's just not there; I have never seen a GPU-accelerated fuzzer, so we're going to leave that. Let's talk about our tech stack, what this is made out of. Here's a high-level map of our architecture: we have a job queue, we've

got fuzz-job pre-processing (we're going to set up processing prior to things being run by the fuzzer), we're going to run our fuzzers, we're going to do some crash triage, and we're going to sort over our output. A big part of this that most people wouldn't be familiar with, if you haven't done any cluster computing, is that style of job queue: a script that can kick off the jobs and also measure their efficiency, saying it took this many seconds to run this algorithm; you get the idea. I went for a traditional-style queuing option called Torque, which is freely available and pretty easy to install, and gives you the same

sort of API for kicking off jobs and queues, qstat and whatnot, that you would see in most college clusters. So I imitated that, or imitated a cluster setup Intel uses. A lot of fuzzers will require some additional work before fuzzing can start: traditional fuzzers may require rule sets for fuzzing outputs, some directions for where to fuzz; active-guidance fuzzers like AFL may require cross-compilation or binary instrumentation. So up front we have to write some scripts for this thing to kick off before it can run a job, and that's going to be part of the cost of running a job. We're just going to assume we're running AFL for guided fuzzing

and BFF for traditional fuzzing, and for crash triage tools in our experiments we're going to use crashwalk and AFL crash analyzer; I think I only have numbers later for one of those, so keep that in mind. Just to think about how we're going to string all this together: we want to think about how we do triage from fuzzing, or whether we want to keep the triage asynchronous from the fuzzing. Again, there are actually a lot of options for how you split things up across nodes and what is actually useful. For output, we want the largest number of good crash test cases possible, and we also want some data on which crash test cases are most

interesting to us. So now we're going to consider the viability of each of these optimizations, how each could potentially work on our system, and see, as we go on, whether any of them are viable speed-ups for our scenario. Fuzzers in general have a weird bottleneck, in that the fuzzer itself may only be doing a fraction of the work: depending on the target software, it's often possible for the target software to eclipse the fuzzer's CPU usage. So really, speeding up the fuzzer itself is a little bit of a waste of time, though we're still going to explore that. It appears that the biggest gains are in speeding up the execution of

the test case without changing its semantics. Most of the delay is in how the fuzzing software kicks off new iterations of the target: it forks new processes. This is a big thing if you've ever seen AFL on macOS, because it's horrible there; the fuzzer will spend a lot of its time waiting for the fork of our target software to finish. There are possible situations where the parent process takes more time than the child, but it's uncommon. So the first set of experiments was: would running SIMD optimizations, AVX-512, on the fuzzer itself give us any gain? And really, in a lot of cases the answer is no: so little of the time is spent in the fuzzer itself, as opposed to the target software running the test case, that it doesn't make a huge difference.

We'll look at numbers in a second. In most cases there is a way to refactor things into matrix multiplication, and Intel advertises a maximum difference of 6x performance in the transition from AVX-256 to AVX-512 for optimal workloads, but the most I observed was about a 2% speed-up switching things to AVX-512, so it's kind of not worth it. Then there's the reasonable consideration that maybe if we thread our fuzzing processes we'll get some gain out of it. The problem is that most fuzzer authors have already done this and thoroughly integrated OpenMP, so there aren't so many opportunities

to speed it up there; fortunately, it's smart people doing smart things already. It's very unlikely we'll find much for the fuzzer to do while it's waiting for child jobs to finish anyway. Then there are a lot of people trying to get a good OpenMPI fuzzer going; I've seen experiments and papers, but nothing finished. I think the preferred approach is that the fuzzer's work gets sharded onto multiple compute nodes and everything meets up in the end. But one question I don't see explored quite enough is: is there another variation on this approach that uses our resources more efficiently? Is it possible we could use OpenMPI pragmas to spend all of our time in the

hard parts of the software we run? Is there a way to do that safely and still get the same crashes? I think that's a deeper area of research that people should explore. We're not going to bother with cache optimization for the fuzzer itself; again, it's probably not going to work out. What about the target software? Is it possible that we might be able to safely recompile the target software with optimizations, so we spend less time in it when running a test case? In most cases the answer is yes: use an optimizing compiler and you can speed up 20 or 30 times on great stuff, so it's possibly worth it. But you also have to consider for yourself how much engineering

time you want to trade to get the speed-up, and whether the target program is, how do you say, amenable to it. In my 3D printing test cases it really was, because it had linear algebra in it, but in something a little more linear it might not be. In my own tests (I'll give you charts in a second) I observed about a 10% speed-up in generation of fuzz cases for appropriate workloads, as compared to the control. So what about OpenMP on the target software? This gets really complicated. If you try to find a way to add threading that wasn't already there, you run into the fact that it's really hard to do that without side effects and causing

data races that weren't there in the original software. There could be a possibly massive speed-up if you design it correctly, but then again, if it's running on the same system as the fuzz test case generator, you're probably running out of threads anyway, so let's skip over that. Then we might want to explore how we can better split the running of the target software over nodes; I touched on that before, so we don't have to go much deeper into it. Again, splitting things up in smarter ways would be a cool idea; we tried some more creative things and they didn't work out. And what about triage software? Triage software presents an especially daunting

task, because crash triage software has all the problems of running the fuzz test case, but worse. The way a lot of it works is that the crash triager needs to invoke a debugger for each job it's running, and then the debugger must invoke the target program and run it until it crashes. That means we're waiting around on two forks instead of one, and waiting on a crash that might take an undefined amount of time. Currently there's no solution to that other than to parallelize everything, and it's just down to the number of nodes you have. So there's nothing to gain from SIMD, not even from OpenMP or MPI, other than the obvious thing:

parallelizing in the most obvious way you can think of. So, to sum up what we learned in this section: the great majority of the work is in the child process. Optimizations to the fuzzer itself don't make as much difference to overall performance as optimizations to the child processes, but those optimizations have to keep the semantics of the original program. There might be opportunities to speed things up further if we set up exotic threading or MPI paradigms, but we're trading a lot of work for the results we get. So why don't we set up a fuzzing race? Having learned what we've learned, how much gain can we get for our crazy fuzzing rig for the amount of

hobbyist engineering time we're willing to budget? Here's our very informal, possibly a little unscientific, experimental design. We're going to set up several of our imagined fuzzing cluster designs and run them head-to-head with each other, and we may learn some things from the performance. Here's my VPS setup: I did this on DigitalOcean with their high-performance compute nodes, running Torque for control and queuing, using the setups described before: Torque + AFL + crashwalk, Torque + AFL + AFL crash analyzer, and Torque + BFF + crashwalk. Again, I don't think I have BFF numbers here for you, so I apologize in advance. And here's my cluster, where it lives. For our target software, I wish I'd written

this talk after Trail of Bits gave their talk on benchmarking fuzzers, otherwise I would have integrated some of the test suites they had there. But I used [inaudible] and an old version of libmagic that I knew to have a lot of crashes; it's one of the typical go-to targets for AFL. My parameters are this: each fuzzer gets to run for 10 minutes, controlled by the queuing system. I run it once on the control node, then with trivial parallelization, then with more, and then with more. And here are our results, and they kind of conform to our intuitions from before: the easiest way to get extra speed out of these things is to add extra by-hand

SIMD optimizations. Adding nodes obviously adds a lot of speed but also a lot of cost, so it's not the best return on investment, and OpenMP is kind of dicey. Someday I'll get a more scientific graph of this, but I really hate using GNU Octave, so maybe I should just hurry up and get a Mathematica license. All right, so that covers what I did. I think you might have questions before I move on to some of my conclusions, so I'm going to open the floor. Do you have questions? Oh, it's quiet; that happens when I've been going so fast. Okay, nobody, all right. Distributing this across multiple nodes worked very well; we already knew that.

Adding some manual optimizations helps a little bit, but not enough to be worth it outside of a hobbyist perspective in most cases, and it's very labor-intensive. If you're curious about the costs of a setup similar to mine: I did this for about 300 bucks a month worth of fuzzing, and we got a good profit on it in bugs, and if you were smart about how you set it up, maybe you could do even better than I did. So yes, it has a doable break-even. I don't think it would be any cheaper doing it with an at-home cluster and paying for your own electricity and cooling and such; just back-of-the-envelope,

it seems like it's not viable. We didn't add any slick continuous integration tools to this, and we didn't have any automatic deployment of nodes; that might be stuff to add in the future. I'm definitely continuing this research. I really want to find a way to GPU-ize this; I feel like that's the future of it, but it's really hard to do without side effects. I think a lot of the future of this is strategically doing chunks of programs instead of the whole thing, but it's hard, and there are all kinds of exotic solutions and variant fuzzing approaches out there. But that's pretty much where we're at. Well, that's the end of my talk. If you

have any other questions, let me know. My name is John Dunlap; here are my contact details, and you can grab the slides right here.

[Applause]