
Androzoo APK Search: A Search Service Of Meta-Data

BSides Luxembourg · 2018 · 24:37 · 292 views · Published 2018-10
Category: Research
Style: Talk
About this talk
We introduce Androzoo APK Search, an online service for querying structural information extracted from Android malware. The service is supported by an Elasticsearch cluster which can be leveraged by security experts to access a broad set of meta-data, including developer certificate information, source code elements, manifest information and antivirus labels collected for 1 million malicious applications. Androzoo APK Search can be accessed through a REST API and integrated with external projects via any HTTP client. Compared to other platforms, our solution supports a fast access model for retrieving the list of applications which match a specific feature (e.g., a call to a given method name). Thus, our system enables the community to track indicators of compromise related to Android malware. With more than 900 fields extracted through static analysis, experts can also exploit the meta-data that we provide to devise better detection systems and prevent the propagation of malicious samples. Finally, Androzoo APK Search can be used to compute analytical metrics and create a baseline for the characterization and the classification of malware families.
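
Since the service is described as a REST API reachable from any HTTP client, a lookup of the kind mentioned above (applications that call a given method name) could be prepared roughly as sketched here. This is a minimal illustration, not the service's documented interface: the base URL, the index path `apk-index`, the field name `methods.name` and the helper names are all assumptions. The Basic-authentication scheme (e-mail plus API key, Base64-encoded) follows the description given in the talk itself.

```python
import base64
import json

# Hypothetical values: the real host, index name and field names must be
# taken from the service documentation.
BASE_URL = "https://apksearch.example.org"
INDEX = "apk-index"


def basic_auth_header(email: str, api_key: str) -> dict:
    """Build an HTTP Basic Authorization header from an e-mail and API key."""
    raw = f"{email}:{api_key}".encode("utf-8")
    token = base64.b64encode(raw).decode("ascii")
    return {"Authorization": f"Basic {token}"}


def feature_query(field: str, value: str, size: int = 10) -> dict:
    """Elasticsearch term query: documents whose `field` holds `value`."""
    return {"size": size, "query": {"term": {field: value}}}


def build_request(email: str, api_key: str, field: str, value: str) -> dict:
    """Assemble URL, headers and JSON body without sending anything."""
    headers = basic_auth_header(email, api_key)
    headers["Content-Type"] = "application/json"
    return {
        "url": f"{BASE_URL}/{INDEX}/_search",
        "headers": headers,
        "body": json.dumps(feature_query(field, value)),
    }


if __name__ == "__main__":
    req = build_request("analyst@example.org", "SECRET", "methods.name", "getDeviceId")
    print(req["url"])
    print(req["body"])
```

The resulting dictionary can be handed to any HTTP client, for example `requests.post(req["url"], headers=req["headers"], data=req["body"])`.
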
Transcript [en]

Oh, hi everyone. I hope you all had a good lunch and you're not too sleepy. So let's welcome Médéric Hurier, who is going to talk about Androzoo APK Search, a search service of metadata related to Android malware. Médéric is currently a PhD student as well as a teaching assistant at the University of Luxembourg. So, I give you the floor. Okay, thank you. So yes, the topic of my presentation will be Android malware. I know that it's not a topic that may interest all of you, so first let me ask: who is interested, for his company or personally, in Android applications, like

for pen testing or reversing? That's what I thought. Okay, I'll try to keep my presentation interesting. First, a bit about me: I'm a doctoral researcher and teaching assistant [unclear]. I'm also a freelancer: I work as a data scientist mentor at OpenClassrooms [unclear]. My thesis topic is focused on Android security, machine learning and big data, and I'm especially interested in dissecting and understanding the behavior of Android malware. I added some links to my social accounts and my code repository. About my team: I work as a researcher at SnT. It's

a centre in Luxembourg whose name stands for Interdisciplinary Centre for Security, Reliability and Trust. Our centre is focused on creating partnerships with industrial partners to address industry problems. Within this centre, we are working in a research group called SerVal, which is specialized in security, reasoning and validation. We have a team of about 50 people, and you can find the link here in the presentation. Within this team we have a sub-team which is specialized in Android security. Our areas of expertise are mainly malware and [unclear] detection, and also static analysis. We have some good tier-one

publications that we published at research conferences like USENIX, [unclear] and [unclear]. Now let's discuss the problem. The problem I want to present today is the propagation of Android malware and, as Boromir is saying here, one does not simply prevent the propagation of Android malware. Well, he didn't survive in the movie, but I want to back this up with some numbers. The first two points come directly from the Android security blog: in 2017 there were 3.5 million applications on Google Play, and the same year the team removed 700,000 applications which were considered harmful or malware, which is a lot. And by the way, this does not include the applications

that are currently not detected; it's just the ones that they are currently able to detect. And if I add a report from G DATA, a computer security company, they say that there are 344 new malicious mobile applications discovered every hour, which is a lot. There is no way humans can analyze this amount of malware; we would basically have to turn everybody in the world into a reverse engineer. So, to try to understand the cause of this problem, I think we have to think about the process: how new malware are created and how they are understood. For me, it comes down to an

automation problem, because in this fight they have a clear advantage and we do not. On one hand, the attackers are able, with some scripts, to automate the creation of new malware: they can repackage it, obfuscate the code and use new techniques to try to avoid current detection. On the other hand, we are lacking resources to analyze this amount of code, and it's more difficult for us to analyze because of all the things they do to try to bypass our current security measures. We have no way to keep up with the current rate of propagation. And finally, and this is important, compared to other areas where machine learning can

be used to automate this task, security is one of the most difficult ones. For me, it compares to the medical field. If you are on Amazon and you want Amazon to recommend some products, and you don't buy the product, it's not a problem: they lose a bit of money, but there is no grave consequence. On our side, if we miss even one malicious file or, on the other hand, we block a legitimate one, it has consequences: someone doesn't get his report, someone can be fired, or you can just crash the whole infrastructure of your company. So that's why it's more difficult for

us to provide a fully automated solution, since we cannot guarantee the behavior of the application. It's the same if you go to a robot doctor: would you trust its opinion? The first question you will ask is why it thinks it is this disease or that one, and we are currently not at this stage. So, for me, to try to keep up with the attackers, we need to develop even more automation, and also to share information. What I am presenting to you, Androzoo APK Search, is for me part of the solution. What we are doing here is providing the security community with an online service that allows you to search

metadata related to Android applications. Our current service is based on 1 million malicious applications, and we provide read-only access, which means that it's not like MISP, where people can share information: we collect the malware, we process them, and you can issue queries on our system to get information. The use cases that we found could be useful for the community: first, to try to identify applications which are sharing the same components. If you identify that a class, a file or a method is involved in some malicious activity, your first reaction will be to check in this kind of database what the other applications are that use this indicator, to

try to block it. You can also try to be a bit more advanced and assess the trends of malicious artifacts; the goal is to see if something is really common or if something is falling out of the trend. And finally, our goal with this platform is to create a tool that other people can use to create automated solutions. If you want to know more, this is the link to the documentation on the website: it's androzoo.uni.lu/apksearch. Here is some of the information that we are currently indexing. So, the metadata: for those who do not know it, I'll try to

explain. An Android application is just a zip file: an APK file is just a zip with a bit of Java metadata. It contains files, which can be images or shell scripts, and we get information like the name, path and signature of these files. We also retrieve some meta-information like the market of origin of the application, its size, and the package name associated with it. We also report the antivirus labels, and here we already integrate one of our solutions, which tries to come up with a single family per malicious application. You also get the developer certificates, so who created the application: so the

issuer and the signature of the certificate. And finally, the most important thing: some code objects, which are methods, fields, classes, everything you can find in Java. We also have strings and invokes. The goal is to extract as much as possible for creating signatures and for finding related applications. You also have some manifest information which is specific to Android: the permissions, the activities, intents and so on. I tried to represent on this picture how things can be represented visually. It's like a graph: the round ones are the applications and the rectangular ones are the components. And the interesting thing here, like in a social network, is to try to

study the relations: which application includes which artifact, and vice versa. Okay, I'll briefly discuss the architecture. It's based on Elasticsearch, a full-text search engine that we use to index this information; we use it to store documents indexed by attributes. We process the extraction of metadata for each application with Python using Celery, so we distribute it across our cluster, and we expose the Elasticsearch API behind Apache, where we enforce some basic authentication. One thing this diagram does not show is that the system is hosted by the university and is, I would say, a research service: it's best effort, so don't expect an SLA or a guaranteed service level; that's not the case.
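
The pipeline outlined in this part of the talk (open the APK as a zip archive, extract per-file metadata in Python, store one document per application in Elasticsearch) can be pictured with the following minimal sketch. It is an illustration under stated assumptions, not the team's code: the document layout, the field names and the index name `apk-index` are hypothetical, and the distribution of the work with Celery is left out. The NDJSON shape produced by `to_bulk_lines` is the standard format of the Elasticsearch bulk API.

```python
import hashlib
import io
import json
import zipfile


def extract_file_metadata(apk_bytes: bytes) -> dict:
    """Return a document describing the files packed inside an APK (a zip)."""
    files = []
    with zipfile.ZipFile(io.BytesIO(apk_bytes)) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue
            data = archive.read(info.filename)
            files.append({
                "path": info.filename,
                "size": info.file_size,
                "sha256": hashlib.sha256(data).hexdigest(),
            })
    # Hypothetical document layout: one entry per application.
    return {
        "sha256": hashlib.sha256(apk_bytes).hexdigest(),
        "size": len(apk_bytes),
        "files": files,
    }


def to_bulk_lines(index: str, doc: dict) -> str:
    """Serialize one document as an Elasticsearch bulk `index` action (NDJSON)."""
    action = {"index": {"_index": index, "_id": doc["sha256"]}}
    return json.dumps(action) + "\n" + json.dumps(doc) + "\n"


if __name__ == "__main__":
    # Build a tiny in-memory archive standing in for a real APK.
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("AndroidManifest.xml", "<manifest/>")
        archive.writestr("assets/run.sh", "#!/bin/sh\n")
    document = extract_file_metadata(buffer.getvalue())
    print(to_bulk_lines("apk-index", document))
```

Using the application hash as the document identifier keeps re-indexing idempotent: processing the same APK twice overwrites the same document instead of creating a duplicate.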

We try to do our best to handle everybody's traffic, but it's possible that sometimes the service will not keep up. Now I just want to explain a bit more how you can access this API. We provide a little authorization mechanism which is based on HTTP Basic auth: you just have to send your email and an API key over HTTP by adding a header, and it will authenticate the request. It's not too complex: you first need to encode your key with Base64, which is a simple algorithm to send things over the Internet. And to get a key, you first have to send us an email at

androzoo@uni.lu; we will send you the key, and then you will be able to send the Base64-encoded key with each request to get information. For the access method, the nice thing with Elasticsearch is that it is based on a REST API which provides you with JSON documents, so it should be familiar to everybody who uses an API over the Internet. Any HTTP client or any JSON client can process information from the API; some examples: jq, curl, wget, HTTPie and Python requests. Now I will provide you with some examples. The goal is to show you what you can do with our service and, as the

prerequisite, this is all you need to type in Python to get access to it: once you type these three lines, and you have the key of course, you will be able to access the service and issue some requests. I also added the link if you want to know how to create more powerful Elasticsearch queries. There are nine examples; I will not explain all of them in detail, and for those who are not so interested in the examples, it's still interesting to know a bit more about how Elasticsearch works if you want, for example, to integrate it into a SIEM solution. So this is how it looks: we need to provide a path to

our metadata index, which is called apk index, and this is the simplest query, which simply returns all the documents which are currently indexed. You can also create queries like in SQL, like this one: the query is a JSON document sent with the request, and this one returns all the applications for which the antivirus say it comes from the [unclear] family. You can also get a single document: it will retrieve all the metadata related to the application, and below you can see the list of metadata, which is what I mentioned previously. Next one, multi-document: it's possible to retrieve documents in bulk. Okay, this one is

a bit more interesting because it creates a logical operation: here we are interested in every application which contains a Chinese translation and which includes a .sh file. When I was at the university, the request took less than five seconds to return. That's a nice way to gather samples if you have some suspicions about a country or [unclear], and it's a good way to get potentially dangerous applications. You can also do some analytics, similar to a GROUP BY in SQL. Here we are computing the total size of our set of APKs; I think it's something like ten terabytes of APKs that we are currently hosting. And you can

also create diverse types of queries like this. You can also retrieve applications with a LIKE-style match on the package name associated with the application. If you are looking for a domain, for example if you are working in a bank, you can ask which applications include your domain in the package name. It can be useful, for example, if you want to retrieve the applications which try to impersonate your own company. This one is a bit complex, but you may be interested to know that there is something called the scroll API in Elasticsearch that lets you retrieve a lot of data efficiently. So if at some point you are saying "maybe I don't have enough results" or "it's taking too much time", use the scroll API, and if it

doesn't work, send me an email. And the last one: this one is useful, for example, for researchers when we want to study some specific features. You can, for example, select every application that we are hosting which contains the READ_PHONE_STATE permission. All right, I'm finished with the examples, so now I'll just present what we intend to do in the future with our service. The first thing I want to do is to include more applications. We divided it into three stages. The first stage was to import every malicious application with the most important features, so that's where we currently are

now. We also want to include the benign applications in our set, so in total about seven million applications, which account for sixty terabytes of data. And finally, we want to include additional features that help other people create models to detect this kind of application. We also want to provide other services on top of APK Search, which I will detail in the next slide. Finally, at some point I built a service called [name unclear] to provide an inspection of applications, which was returning more data than what we are currently serving. This service is currently broken, but if I have some time I'll try to repair it and make it available again. If people are interested

in integrating APK Search into their products, you can contact us. For example, we would be interested in integrating this solution with MISP, and some of my colleagues are already in discussion to try to do so. I'll also give you some related work, things I did in the past, again to try to better understand Android malware. One of my first works was to try to unify the labels of antivirus reports. When you receive something like a VirusTotal report, which contains labels from many antivirus, they all give a different name, with a different syntax and different semantics. So the goal of our tool is to try to say:

"OK, this malware comes from this family." If there are some common terms, like one antivirus calls it [unclear] and another calls it [unclear], and they always occur for the same applications, we say it's a synonym. I also created STASE, which is a set of statistical metrics to try to qualify a malware dataset. Here the goal was for researchers to guarantee that the data they were using to train machine learning models was not biased, for example by including a set of antivirus too different from the rest of the set. So we try to evaluate here some questions like: are some antivirus detections

too generic, what is the confidence level, and are there some similar detections across antivirus? Finally, we are currently working on a new approach called [name unclear]. Our goal is to try to find the discriminative components, so the components which appear in a single malware family but which do not appear elsewhere. We say that these features are characteristic of a given family, and we are interested in them as a first step to locate the malicious behavior of unknown applications. This is still in progress; if it's interesting to you, let me know. So, in conclusion, Androzoo APK Search is an online service that people in the community can use to retrieve information about Android

applications. Our goal is to try to enable the community to get easy access to Android malware, with a better understanding of their characteristics, what they do and how they harm users. This dataset is available to every researcher under some small conditions; you can check them on the website. For industry actors, you will need to contact us: this is on a case-by-case basis, but we already have some collaborations with partners to integrate our results. That's it; I'm open to any questions. I just hope that you are a bit interested by the work and by the things we are doing as researchers. Awesome, thank you.

[Applause]
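
Two of the queries walked through in the talk, the logical combination (a Chinese translation together with a `.sh` file) and the GROUP BY-style total of APK sizes, map onto standard Elasticsearch query DSL constructs (`bool`/`must`, `wildcard`, and a `sum` aggregation). The sketch below only builds the JSON bodies; the field names `resources.locales`, `files.path` and `size` are assumptions chosen for illustration and will differ from the fields the service actually exposes.

```python
import json


def chinese_with_shell_script() -> dict:
    """Applications that ship a Chinese translation AND at least one .sh file."""
    return {
        "query": {
            "bool": {
                "must": [
                    {"term": {"resources.locales": "zh"}},  # hypothetical field
                    {"wildcard": {"files.path": "*.sh"}},   # hypothetical field
                ]
            }
        }
    }


def total_apk_size() -> dict:
    """Sum the `size` field over all documents, returning no individual hits."""
    return {
        "size": 0,
        "query": {"match_all": {}},
        "aggs": {"total_size": {"sum": {"field": "size"}}},
    }


def read_total_size(response: dict) -> float:
    """Pull the aggregation value out of an Elasticsearch search response."""
    return response["aggregations"]["total_size"]["value"]


if __name__ == "__main__":
    print(json.dumps(chinese_with_shell_script(), indent=2))
    print(json.dumps(total_apk_size(), indent=2))
```

Setting `"size": 0` in the aggregation request asks Elasticsearch for the computed total only, which keeps the response small when the index holds millions of documents.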

Thank you for the nice presentation. I was just checking the website [unclear], and I saw on the access page that it was quite complex to actually get access to the service: it mentions something about institutions and permanent positions and stuff like that. So is it possible for someone who wants to play around to get access in an easier way, maybe? So, you have to send us an email. I think it's best to discuss this via email, because it's true that it would be irresponsible for us to send malware to people without any prerequisites. So currently our condition is that it has to be for research, and the request has to be sent by

people coming from an institution, like, in your example, [unclear], and then they need to agree [unclear], and so on. So the best thing is to send us an email, and we can discuss how we can share the information. Yeah, thank you. Anyone else? Okay, so I do have a few questions. Assuming I am a malware analyst and I discovered some strange things with this Androzoo, am I able to download the samples? Yes. I didn't mention it, but Androzoo is part of a larger service that one of my colleagues created, where we have 7 million applications that you

can download. The thing is, previously we didn't have any system to do queries like "I'm interested in applications that use Bluetooth or Wi-Fi", for instance. So yes, you can use Androzoo to download the applications and try to understand how to create a solution. As you may see, this was mostly tailored for researchers, but if you have encountered something [unclear] in your company, I think it's also an interesting source to get information about malicious samples. Again, I have a question about the new features that you want to implement, mainly the machine learning ones. Will this information be available through this interface? Like, I don't know,

the clustering, all these classifications that you will do: will they be available through this search? It will be available, I think, yes, but mostly as documents, which means you will be able to get the results, but I don't think it would be too valuable to search on them. Here I was mostly interested in what I call binary features, which is something where you can say "is it in the application, yes or no?", like a file: the file is here or is not here. But for a machine learning algorithm we are mostly interested in computed metrics, for example the total number of

permissions, which is something not too useful for analysts, but if you want to create some behavioral model it's interesting. So it will not be searchable like the fields that you have here, but it will be available through the same interface. Again, that's best effort, so it depends on the time I have to allocate to this. OK, and my last question is about this [name unclear] service: is this something which analyzes the APKs dynamically, or is it just static analysis? For the moment it's only static analysis. We tried to create a service which could include many types of analysis tools, Androguard for instance, and return

its results. Compared to other known solutions, I can't remember the name, but the one where, with a web application, you can get quite a lot of information about Android applications; for us, our goal was to provide a wide set of analyses, not necessarily to compete for the best analysis compared to other existing solutions. Okay, so I'm finished with my questions. Does anyone else have some questions? No? Okay, so thank you again, Médéric, and thank you for your attention.
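
The distinction drawn in the Q&A between searchable binary features (is a given file or permission present, yes or no) and computed metrics meant for machine learning (for example the total number of permissions) can be illustrated with a small sketch. The document layout with `files` and `permissions` lists, and both helper names, are assumptions made for this example only.

```python
def binary_features(doc: dict, files=(), permissions=()) -> dict:
    """Yes/no features: is each requested file or permission present?"""
    present_files = set(doc.get("files", []))
    present_permissions = set(doc.get("permissions", []))
    features = {f"file:{name}": name in present_files for name in files}
    features.update(
        {f"permission:{name}": name in present_permissions for name in permissions}
    )
    return features


def computed_metrics(doc: dict) -> dict:
    """Numeric features derived from the document, e.g. permission count."""
    return {
        "n_files": len(doc.get("files", [])),
        "n_permissions": len(doc.get("permissions", [])),
    }


if __name__ == "__main__":
    # Hypothetical metadata document for one application.
    sample = {
        "files": ["AndroidManifest.xml", "assets/run.sh"],
        "permissions": ["android.permission.INTERNET"],
    }
    print(binary_features(sample, files=["assets/run.sh"],
                          permissions=["android.permission.BLUETOOTH"]))
    print(computed_metrics(sample))
```

Binary features lend themselves to exact-match search, whereas counts like these are more naturally consumed as inputs to a model than queried directly.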