
[Music]
Let's get started. First of all, thanks to everyone who is present here. I know it's Sunday, but I hope that over the next 25 to 30 minutes we're going to make this talk pretty good and you'll learn something worthwhile. So it's kind of a journey for the next 25 to 30 minutes; hopefully, let's get through it. Today we're going to talk about crimeware. We typically focus on one of the main scenarios, which means we're dealing with botnets, we're dealing with different sets of crimeware, but in this particular talk we are going to look into how the attackers, or the botnet operators, are deploying botnet command and control panels. This is one of
the most important things to know when we talk about security intelligence in the field of malware, and it actually gives us a lot of interesting insights, and we'll take a look at them. So, a little brief background of mine, and I just want to give a little disclaimer: most of the research that we're going to discuss during the course of this talk is totally independent; nothing relates to my employer. I always believe that the security community has given you so much, and we are in the world of the shared responsibility model, so let's stick to it. But at the end of the day it is always our individual strength to
actually give back to the security community as well, and that drives me towards a lot of interesting things apart from my job responsibilities. So we'll take a deep dive; let's get it started. What are we going to talk about today? We're going to look into the world of cybercrime. We're going to talk about why botnets are growing. We're going to look into the basic HTTP-based command and control communication channels and try to see what they are all about. We're going to look into some of the real-world command and control panels and how they look. And then the most important part: what are the techniques that you
have to adopt to actually go after and find these kinds of command and control panels, typically the ones running their channels over the HTTP protocol. Then, we did some empirical analysis, and we're going to discuss the results of that, which actually gives you more of a feel for what we are talking about. Then we have a little bit of discussion on the arms race, and we'll conclude with questions and answers. So I'll start with a little analogy. These days, more of the security threats lie in the context of data movement. Earlier we had some complexities around data storage and all that, and we have been
advancing in that field. But right now, with new compliance regimes coming up, GDPR and all those kinds of things down the lane, we are more individually going after the context of what I call data movement chaos. What I actually mean is that, with the mobile revolution of the last ten years, new devices keep coming up. So what I feel is: more devices, more targets; more targets, more threats; more threats, more exploitation; and more exploitation results in more security breaches. If you've looked around in recent times, we talk about so many security controls, solutions, devices; everything has been
deployed, but still we are facing security breaches. What is that all about? Considering that scenario, it's very important to understand the complexities involved around data movement. That resonates back to the scenario where more devices are present in the world, and more devices means more launch pads for the attackers to trigger attacks. That's where we come to the context of bots. What does bot mean? Basically, you compromise the device, you install your malware, it becomes part of the botnet, and then you can perform a lot of nefarious activities using that network of bots. But again, at the end of the day, the
more important part is that in technology things are growing exponentially, and so are the threats, and so are the infections. So let's take a little look at the growth in botnets. If you look around at what we have seen in recent times, new botnets keep coming out, whether they are developed with new code or with reuse of existing code. In 2008 SpyEye came in, Zeus came in, and then there were a variety of botnets using the same code, Citadel and a lot of others. These days we are looking at infections of Agent Tesla and a lot of other botnets. This thing is actually growing
exponentially, and that is actually a little bit worrisome down the lane, because at the end of the day a lot of nefarious activities will be carried out on the internet that involve users, devices, organizations, enterprises, whatever it is. Attackers are basically after the money. What I mean by that is that earlier in the growth of botnets people were doing it for fun and profit. We still see denial of service, but these days the adversaries are financially motivated, and that's what we are seeing. At the end of the day, if they steal your credentials, they're going to sell them in the
underground community. They have well-defined, structured marketplaces in the underground economy, and they are strengthening that economy day by day. That's what we are seeing right now. Just one more stat on the growth in botnets before we start the real talk: we can see a lot of IoT botnets coming up, targeting either IoT or other devices, desktops, laptops, whatever it is. The most important point in understanding botnet growth is how they are abusing the protocols. So I'm going to take a bit of a deep dive into the basics, and then we'll talk further. When I
talk about protocols, I always feel that at the end of the day there are standard protocols: there is HTTP, IRC, P2P, and a lot of other protocols, but in the end it is still a client-server model. Your systems are infected, they have to connect back to the C&C server, and they're still using the same protocols for communication. The most important artifact is how they are abusing that protocol: are they using it the same way the RFCs define it, or are they actually deviating from that when using the protocol for C&C communication? We'll take a look at that. So let's
start with a simple, basic C&C architecture, specifically for HTTP-based botnets. If you look at this particular graphic, on this side you can clearly see that there's a bot executable. What I mean by that is that your system is compromised and malware is running inside it, which we call a bot in botnet terminology. Now, since the bot is running inside the system, it's all-powerful: it can subvert the integrity of client applications, whatever it is. But we are focusing here on the C&C communication channel, not the infiltration, because it has already infected the system. We're going to
focus primarily on the protocol scenarios. The bot has to send your information back to the C&C server. In that context, is it right to say that the bot extracts the information and sends it back to the server, the server receives it, processes it, and stores it in the database, and the attacker can come back, look at the information, and then sell it in the marketplace? But if you look at the three main components, before your data gets back into the C&C admin panel there is a C&C gate as well, which is a very important design component of C&C deployments when we talk about
botnets. What does that gate mean exactly? The gate scrutinizes every single communication in which the bot sends data back to the C&C server. So when the bot is sending data, the gate is going to filter it: it looks at whether data is coming in, which kind of bot is sending what kind of data, and whether there is a checksum associated with that data. If there is any anomaly, the C&C gate is going to filter that traffic out. So it's like a cleaning process. When security researchers do some kind of analysis, they can still connect back to the C&C domain and try to mimic the behavior of a bot, but in that case there are
certain mechanisms they have to go after to make sure that they mimic the traffic exactly the way the bot sends it. That's why these adversaries, the botnet operators, came up with the design of the C&C gate: they want to make sure that they get clean data. In the underground economy the value of clean data is way, way higher than that of garbage data, because that's where they get more money. The data is clean, it's validated, it's verified, and that's what they are going after. So all these design components tie back to the underground economy: how they collect the data, how they sell the data, what
kind of data is getting into the C&C panel. This is a very important artifact in the design. So the bot is going to send HTTP traffic, HTTP requests in the form of HTTP POST, back to the C&C server, and the gate is going to filter it, making sure that the data is coming in as expected. Here I'm trying to solidify the point I was discussing earlier; it's not something we are just talking about. Look at this: we were analyzing some of the code of this botnet, the Pony botnet, which we actually got hold of while doing penetration testing of some of the
C&C servers, trying to see how it is deployed and what it is all about. When we were analyzing the code, it was very clear that the concepts we discussed earlier were implemented directly. Take a very simple use case: what if no data is sent by the bot to the C&C server? The gate is going to filter it. If you look at the code, it shows you how they are filtering it. It's PHP-based code, and it validates the fact that the gate is there and is a primary component of the botnet C&C architecture. Looking at the second one, this is the
admin panel. So we said there's a gate, and then there is an admin panel. The admin panel code shows you how it interacts with the database, what kind of whitelisting it has implemented, and how the data goes into tier 3, as we call it, the database. So the gate and the admin panel are interrelated: every request has to be validated and verified by the gate before it goes on to the C&C domain. This is the basic architecture we're going to target, and then we'll discuss the results of the empirical analysis. Before moving further, I want to give you a feel for how the real-world C&C
panels look. If you look at this slide, we have a LokiBot C&C panel, and there is an Agent Tesla C&C panel. Agent Tesla is a kind of stealer that is being distributed a lot these days, and its basic purpose is to steal data, send it back to the C&C servers, store it there, and then reuse it in the cybercrime marketplace. Another one is the Pony C&C admin panel, then the Citadel C&C admin panel, and things like that. One important fact about Citadel is that it originated by reusing the same code as Zeus, so some of the mistakes that the Zeus botnet had in 2009-2010, this botnet overcame. In
2014, I think, we gave a talk at Black Hat where we discussed how to actually pen test C&C panels, and we highlighted some of the weaknesses in the panels. These days, when we test the panels in 2018, after three years, most of those weaknesses have been eradicated. That means the attackers are also watching what kind of research is going on in the security research community and making sure they fix those issues, so that the infection process gets stronger and stronger. The next set of panels, if you look at the administrator panels, is a coin miner, a Godzilla one, and a lot of other things. All of this is
meant to give you a feel for the infections that are running inside the compromised machines. They are doing Bitcoin mining and other kinds of things as well, and all that information still goes back to the C&C servers, so that whatever they mine, they get the data from the compromised machine. These are very important: whenever you do penetration testing, or try to test these kinds of C&C servers, you encounter all these scenarios a lot. You then have to have your techniques and tactics, performing reconnaissance and other kinds of things, to find out if there is
some kind of exploit that you can trigger to actually get a shell out of it, and things like that. At the end of the day you deal with all of these on a regular, day-to-day basis, and the most important part is to make sure that once you've got hold of a panel, you gather the maximum intelligence out of it, and that you use that intelligence to build strong detection solutions or share that research with the community, so that if somebody wants to use it, they can go ahead. So far we have talked about the basic architecture, we have talked about the design
components of the botnet, and we have looked at some of the C&C administrative panels supported by some of the botnet code. Now let's talk about some of the techniques that we use to actually go after and find C&C panels. We understand what it looks like in theory, but at the end of the day, as a security operations person sitting in a cloud, in a data center, wherever it is, even on my local network, I want to know what I have to look for so that I can detect and fingerprint C&C communication, typically at the HTTP level. So, a couple of points. I mean, these are the
wide variety of techniques that we use, but concentrated on a specific set of attack vectors. First of all, we all talk about static analysis, performing reverse engineering. Attackers make mistakes: they hard-code things, they hard-code the C&C domain or the C&C URL. You just reverse-engineer it or debug it and get an idea of how the code works in the underlying assembly. IDA Pro and all these kinds of tools are there, so you get a feel for it, you are able to find the C&C domain, and then you start your penetration testing afterwards. But the attackers found out that this is a very easy
technique for us: when the C&C domain or URL is simply passed as a string in the binary, it's pretty easy; strings and a lot of other tools will get it done. So then they said, okay, we don't want to do that, we're going to have a different mechanism, which we call domain generation algorithms: some advanced algorithm embedded in the binary, so that when the binary, the malware, executes in the system, the algorithm is triggered at the same time and generates the C&C domain, the C&C URL, dynamically. So in that case, unless
you run the malware in a controlled environment and attach it to a debugger, you don't have any idea what the C&C domain is going to look like; or you reverse-engineer it completely, extract the algorithm, draft pseudocode from the algorithm, write proper Python code from that, and then you get an idea of how the domains are being generated. That is the power of domain generation algorithms, DGAs, and when DGAs generate domains we term them algorithmically generated domains; you can build C&C URLs out of them as well. Then there's dynamic analysis: in a controlled environment, in a VM, where you go
ahead and run the malware, you dump the pcaps, and you start analyzing how the network traffic looks: HTTP GET requests going out, POST requests going out, and you try to find anomalies in that. A simple anomaly, for example: you're sending the data over HTTP POST but the body is encrypted. That is basically a deviation from what you'd expect under the HTTP RFC standards, because HTTP is a clear-text protocol; you don't need to have encrypted content in the body. So with a couple of these things you get this intelligence, you analyze more malware, you look at the characteristics of other bots, and then you figure out, okay, these are the
attacks, these are some of the techniques they are using. Another use case: whoever the authors are, whether Russian, or from somewhere in Europe, or from some other nation, whenever they are writing any kind of code they make mistakes, in HTTP headers or in other things. You keep encountering these kinds of anomalies while you're doing reverse engineering, dynamic analysis, and things like that. The third is the most important part: from the static analysis and the dynamic analysis you extract some data, you build some signatures, and then comes the part that we all love, which is Google. Using Google search dorks you can still
detect a lot of C&C panels; I'll show you in a bit. You have to build patterns, use Google dorks, and then just run your Python code, and it keeps on detecting similar sets of web deployments. You can clearly see that it is so easy because Google indexes the C&C panels as well. The thing is that most of the C&C deployments don't put any restrictions on the search bots, so the bots just go after them, get every HTML page into the cache, and then the index is ready, and you just use the Google dorks and they give you a
feel for it. Another one is hosting providers: data centers, the cloud, AWS, where people are hosting their VMs. Those VMs are also getting infected and compromised, and people are deploying C&C panels there as well, so you need to have some scanning scripts to go after them and find patterns there too. Lastly, one important factor is threat intelligence gathering from open source platforms such as pastebins and other content sharing platforms; you get a pretty good feel from those. There are many good players in the market right now which are actually
mining that information for us as well. There are a couple of other techniques too, but by combining all these techniques together you can really build a good fingerprinting solution and see how to detect C&C panels, specifically those that are using the HTTP protocol. So let's move further and give you a quick feel for it, a quick sniff of the reverse engineering side. Looking into hard-coded C&Cs, you can clearly see one of the examples showing that the malware is actually using the Tor network to send the data out. The other one is pseudocode that I drafted
after looking into the algorithm, and then we wrote the Python code to see how the domains are generated in an automated fashion. You have the seed value, you have the token; after reverse engineering you draft the code, put those in, and run the Python code, and it's going to give you all the domains. That's the way it is done from the static analysis perspective. Looking at the dynamic analysis, when we look into the network traffic, you can clearly see in the one on top, the first snapshot here, that the bot is running the domain generation algorithm. It is generating the
domains on the fly and sending out DNS queries for them, expecting that one domain out of the hundred generated should resolve, the one the attacker is going to set up for the C&C communication. So it starts in an iterative fashion: first domain, second domain, on to 200 or whatever the number is. Once one domain resolves, it gets the IP address and starts communicating. But you can't detect these kinds of things directly from reverse engineering unless you fingerprint and find out what the algorithm is all about, run it in a controlled environment, get all the domains, and then see from there.
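The seed-and-token idea described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the algorithm of any botnet mentioned in the talk: the function name `generate_domains`, the SHA-256 based derivation, the seed string, the use of a date as the token, and the TLD list are all made up for the example.

```python
import hashlib
from datetime import date


def generate_domains(seed: str, token: date, count: int = 100,
                     tlds: tuple = (".com", ".net", ".info")) -> list:
    """Hypothetical DGA: derive `count` candidate domains from a seed and a date token."""
    domains = []
    for index in range(count):
        # Mix seed, token and a counter so every round yields a different name.
        material = f"{seed}:{token.isoformat()}:{index}".encode("utf-8")
        digest = hashlib.sha256(material).hexdigest()
        # Map each of the first 12 hex digits to one of the letters a-p.
        label = "".join(chr(ord("a") + int(ch, 16)) for ch in digest[:12])
        # Rotate through the assumed TLD list.
        domains.append(label + tlds[index % len(tlds)])
    return domains


if __name__ == "__main__":
    # Print a handful of candidates for an assumed seed and date.
    for name in generate_domains("example-seed", date(2018, 1, 1), count=5):
        print(name)
```

Because the output depends only on the seed and the token, an analyst who recovers both from the binary can regenerate the same candidate list the bot would walk through, and then check which of those names actually resolve.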
Once the domain is resolved, the bot is sending data back to the C&C server using the HTTP protocol, and that's the example I was giving you earlier: it's still using POST, it's HTTP communication, but the data is encrypted. These are the kinds of anomalies we need to figure out in order to know how to look at the data from the compromised machines. Now the interesting one that I was talking about, on Google dorks or any search engine dorks. This example is primarily for LokiBot. We analyzed a couple of C&C panels and found that the authors of this particular botnet are actually
using a random web page name, basically a PHP-based page with a random-looking name. But it wasn't exactly a random page: it looks as if it was randomly generated, but across a wide variety of deployments the page was the same. So it just tries to fool you into thinking that there might be other randomly generated web pages, but that wasn't the case. So you build a simple signature, you run it through the Google dork engine, and you start finding C&C panels on top of it. This was just a simple Google search, and you can clearly
see the panels coming up. You can see which domains are compromised or specifically created for deploying the C&C panels. That is one of the most important examples. Another example is Agent Tesla: you can again build a Google dork for it, pass it to the dork engine, and you can clearly see that Google is indexing the C&C panels that are deployed either on compromised domains or on the dedicated ones that we discussed earlier. So you can still get hold of where the Agent Tesla panels are. But these kinds of
things, these kinds of signatures, this kind of intelligence, you can only gather once you perform some kind of static or dynamic analysis, or a correlation between all those techniques, to get a better feel for it. These are the most important techniques. Security is not all about one layer; it's about multiple layers and the correlation between those layers, figuring out how to connect the dots and making sure that we have a good story before we go further and perform further analysis or any kind of experimentation. Based on all these things, we analyzed a couple of C&C panels. We have been working in the same field for a couple of years, and
what we came across is that experiments had not been conducted in the context we really wanted to look at: dump the data, dump the URL list, basically the C&C URLs, and dissect those URLs to gain more security intelligence out of them. What I mean by that is: I want to check how the domains are generated, I want to check how random the URL is, what its entropy is, and I also want to check how many C&C URLs were using HTTPS and how many were using a Tor communication channel. Things like that give you a feel for
what we are going to obtain as part of the security intelligence, and you can use that intelligence to build detection solutions. If you want to mine something out of your Splunk, you can use it: if I see these kinds of patterns, feed them into a Splunk app, and then you can clearly see what you get out of it. So we did this empirical analysis. What did we want to cover? In this particular case we wanted to look at where domain names are not used and IP addresses are used, non-standard HTTP port based web communication, DNS-based C&C communication, and encrypted
communication channels, and how the botnet C&C panels are deployed, meaning what kind of web technology the attackers are using to deploy these panels: is it PHP, ASP, ASPX, whatever it is. And we wanted to perform a quantitative and qualitative analysis on top of that, because when we are performing experiments we really want to see the volumes. Just as an example, in the field of SCADA, people say we don't have enough companies in that space building protection or detection solutions. The reason for that is that we have seen only a very
few attacks, for example Stuxnet and Duqu. Although they were very aggressive, the volume was not that high, and volume drives the market to a certain extent. But talk about Bitcoin mining, talk about other botnets that are pretty standard: these have been going on for years and years, and that's why antivirus still exists and firewalls are still there. People are not happy with antivirus, but it still has to be on your machine. That resonates back to my earlier point: security is not about one control but about multiple controls and the correlation among them. We also wanted to perform entropy mapping, because let's say you want to write some fuzzing tool: you
get one C&C URL, you are looking at that particular URL, and you want to fuzz more. You write Python code, or any other code, that takes this URL as a baseline and tries to iterate over the web page name, iterate over the numbers, and things like that. When you have an idea of, and control over, a certain set of C&C URL data, you can clearly see what kind of entropy is in play for those URLs, and that helps you write good URL fuzzing tools and check whether you can find more C&C
panels out of it. For example, we have seen recently that the domain is the same but the attackers have deployed multiple C&C panels on that same domain, as well as one panel per domain. Usually we assume that if there is one domain, one resource, only one C&C panel will be there. No: there can be multiple directories, and multiple C&C panels can be there. We have seen this before, and it is also an important fact when you study botnets in this context. So, considering all that, we cleaned the data, we looked at it, and some of the inferences were
very, very good. We found that when HTTP communication takes place, let's say the bot is sending an HTTP POST request out, there is a Host header in it, but it's not necessary that the Host header contains a domain name. That means that if the bot is sending the HTTP traffic using an IP address as the destination, there will be no DNS traffic for it. If your solutions are based on the assumption that we first need to see DNS traffic and then the HTTP traffic, this basically bypasses them, because there is no domain: the browser is not going to resolve any domain,
it's just going to initiate a socket connection to the IP address and start sending data on the fly. These kinds of things are very important. We saw that close to eighteen percent of the C&C deployments that we analyzed were actually using IP-based communication. But, as we discussed earlier, look at the volume: for DNS-based C&C communication, eighty-one percent were using domain names. So it makes sense that we need DNS-based solutions that dissect the DNS traffic and give us an idea of what the anomalies in the DNS traffic are: the domain generation algorithms, the DNS handling, and all those sorts of things.
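The IP-versus-domain split described above is the kind of measurement that takes only a few lines of Python. The following is a minimal sketch, assuming every C&C URL in the list carries a scheme such as `http://`; the names `host_kind` and `host_kind_shares` are hypothetical, and the sample URLs use reserved documentation addresses rather than data from the study.

```python
import ipaddress
from collections import Counter
from urllib.parse import urlsplit


def host_kind(url: str) -> str:
    """Classify the host part of a URL as 'ip', 'domain' or 'unknown'."""
    host = urlsplit(url).hostname
    if not host:
        # No scheme or no host: nothing to classify.
        return "unknown"
    try:
        # Succeeds for IPv4 and IPv6 literals, raises ValueError otherwise.
        ipaddress.ip_address(host)
        return "ip"
    except ValueError:
        return "domain"


def host_kind_shares(urls) -> dict:
    """Return the percentage of URLs per host kind."""
    counts = Counter(host_kind(u) for u in urls)
    total = sum(counts.values())
    return {kind: 100.0 * n / total for kind, n in counts.items()}


if __name__ == "__main__":
    sample = [
        "http://192.0.2.10/panel/gate.php",   # IP literal, so no DNS lookup precedes it
        "http://example.com/panel/gate.php",
        "http://example.net/a/b.php",
        "http://example.org/x.php",
    ]
    print(host_kind_shares(sample))
```

A URL that lands in the `ip` bucket is one a DNS-first detection rule would never see, which is why the two shares are worth tracking separately.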
But this actually gives you the idea that domain names are still being used to a great extent. We also looked into non-standard HTTP ports. I have a lot of discussions in the community where people say that if you are making an HTTP connection, it's going to be on port 80, 443, 8080, 8081, and so on. That is not entirely true: you can deploy a web server on any port, you just have to bind to that particular port to start listening for web connections. What we found is not a lot, but still: of the twelve thousand C&C URLs that we analyzed, 1.2 to 1.3 percent of them
were actually using non-standard HTTP ports. That means that if your solution at the firewall, or any signature that you have deployed as part of your security operations, is only looking at TCP ports 80, 443, and 8080, the traffic can bypass that rule on the fly. So this is also interesting intelligence that we need to incorporate into the tools. Encrypted SSL connections: yes, we've seen those, where there were no validated certificates, but it was only 0.5 percent. So adversaries, botnet operators, attackers are still going after clear-text HTTP communication to a great extent. It
doesn't mean that they are not using HTTPS; advanced botnets might be using it. But for the widely distributed infections, attacks that are not really targeted in nature, they go with, okay, we don't care, we just want to send the data out. And why are they able to abuse the HTTP protocol? Because it's a very default configuration: every firewall needs to allow HTTP traffic, because that's how we browse the web. So they abuse that behavior, use HTTP as the communication channel, and send the traffic out over it. Then you need to have IDS/IPS, whatever it is, to actually dissect that content and make sure that nothing
sensitive goes out. But when you correlate all this information, it gives you a very different picture of how you need to build a solution. Whether you want to apply machine learning, define some algorithm on top of it, or go after natural language processing, whatever it is, this is the intelligence that feeds your machine learning algorithms. Sometimes you can clearly see anomalies based on the data alone, but this kind of intelligence makes things way more advanced when you use a data-based approach for detection and protection. Server-side deployment: PHP is the preferred choice, open source, easy to code, and all
that. 96 percent of those URLs were using PHP-based C&Cs, so the command and controls they have deployed are written in PHP. We saw only 0.625 percent using ASP, ASP.NET, and things like that. So, based on the volume analysis, at the end of the day PHP is the preferred choice. When you're building your signatures, your rules, and other things like that, this is the intelligence you really need to go after: okay, I can narrow the window for the other server-side deployments, but for this particular one I need to have a
stronger set of rules. We looked into C&C panels deployed on compromised WordPress sites, because for WordPress, Joomla, and other things of that kind, some vulnerability comes out, and then you simply run the exploit and start deploying web resources, web pages used for distributing infections or other things like that. We found that the attackers used compromised WordPress deployments in 4.8 percent of cases, which is still a good number. So, considering all that analysis, we typically picked seven or eight metrics, used these variables to build a matrix out of them, and tried to get a feel for how these botnets are being
operated. We understand the architecture, we understand the design components, but at the end of the day these stats, these numbers, give you a feel for where we need to go when we try to build advanced network detection and protection solutions. Another important one is how the top-level domains are being used: when the attacker is registering a domain, what kind of domain is it going after? Is it .com, is it .org, or whatever it is? Without a doubt, we found that approximately thirty-eight to forty percent were using the .com domain, which means that they have their C&C
deployed on something like abc.com. Similarly, we analyzed a lot of others: .ru, .net, .biz, and .info are heavily used domains where they have deployed their C&C panels. Moving further, on entropy: we found that the majority of URLs were not that random in nature. Let's say you have abc.com, then you go further and there's a resource, and then there is a web page name. You can still build a fuzzer and keep running it using different resources, and if the server resolves it, if it actually accepts your request, you get
a feel for it, that there's a similar kind of resource there. So from the entropy perspective, we found that they were not randomizing the C&C URLs to any great extent. I think we're just about to close, so I just want to talk a little bit about the arms race. We all know it's about attackers and defenders. At the end of the day, advanced exploits are coming up on one end, and security protection and detection mechanisms are put in place on the other, but still we see security breaches. This race is never going to end; whoever thinks the race is going to end, well, then the market would be gone.
So the most important question is who is going to take the first step. As security practitioners and researchers, who is going to come up with strong, security-intelligence-driven solutions based on what we have seen in the past and what we expect in the future? Again, at the end of the day we are dealing with a world where data movement is at an all-time high, and the components and the way they are designed are all complex. Your data has just left your phone and you don't know where it is going. We have seen many cases where we just don't have any clue. We talk about privacy all the time, but the world works in a different way.
This is the reality of the world we live in. So, considering all that, at the end of the day we need to be proactive, we need to share research with the security community, and we need to make sure that we build stronger protection and detection solutions. I think I'm pretty much out of time, so if you have any additional questions, we can talk now or we can talk later. [Applause]