
Project Ava - Can Machine Learning Be Used To Complement Web Penetration Testing? - Matt Lewis

BSides Cymru Wales · 2019 · 51:46 · 395 views · Published 2019-10
Category: Technical
Style: Talk
Transcript [en]

Okay, good afternoon everyone — prynhawn da. It's lovely to be here. I'm from Wales originally, now exiled in London, so it's good to be back in my homeland to talk to you about Project Ava. For those who are wondering, Ava got its name from the AI robot in the Hollywood movie Ex Machina — a very good film — but I often get asked where the name came from, and that's the image there.

So, who am I? I'm the UK Research Director at NCC Group. I've been in cyber security for about 17 years, straight out of university, so it's pretty much all I know in terms of my career. I spent about 10 years as a CHECK Team Leader, CREST-certified, mostly in web application testing and network intrusion analysis. I was in the KPMG pen test team, I worked at IRM as a pen tester, and I started my career at CESG, which was part of GCHQ, straight out of uni. And yes, I'm a Swansea boy: I did my first degree at Swansea in computer science, then went to Oxford for a year to do my master's in the same subject.

The reason I'm here today, in terms of my research interests: I'm a big fan of tooling, methodology and automation. Anywhere within security testing where we can get some sort of efficiency gain is of great interest to me, and this project was a nice way to explore some new techniques in that realm.

What I'll go through today: I'll give an overview of what Project Ava was — there were ten parts, or phases, to the project — and fundamentally I want to share our lessons learned. The research we did here was very exploratory and very new to us. We hold our hands up that we're not experts in this domain, and that's exactly why we want to share it: to get some discussion going about what the role of machine learning might be within web application testing and, more broadly, penetration testing. We'll talk about the different approaches we tried — supervised, unsupervised and reinforcement learning — and we also looked at expert systems as a type of approach. I'll go off on a brief tangent too: one of the things that happened during the research, while we were playing with natural language processing, was that we saw some potential applications of machine learning in a social engineering context. It had nothing to do with what we set out to do, but it was an interesting tangent and I'll go into it. Then I'll summarize the outcomes of our research and what we feel the next steps are, not just for us but as an industry.

Project Ava came about mainly because, as you're probably all aware, machine learning and AI are not just buzzwords now; they've crept into the cyber security space, and lots of products, systems and services use those techniques. We wanted to understand whether we could use any of them within our web application testing process. NCC Group is probably one of the world's largest security testing teams — we do hundreds of web app tests every week, all around the world — so if there's any way we can gain efficiency by using machine learning or AI, that's of course of interest to us. And I'm not saying that just so we can take on more work. What we want is to use some level of automation for the low-hanging fruit, the easier stuff, to let us focus on the more manual, complicated parts of web app testing that maybe aren't achievable with current AI or machine learning techniques. We set out to augment our human-based testing rather than replace it. We didn't want to try to create an AI robot that just does web app testing; we knew we'd be doomed from the start if we tried to be that ambitious.

So we were looking for ways to complement how we already do web app testing. I mentioned the efficiency gains; we also wanted to use the project as a vehicle to learn about machine learning. Like I said, it's not typically NCC's strength — we know security and penetration testing very well, but we're not really data scientists — so we wanted a project that would help us upskill in the different concepts in that area, and help us understand whether the ML/AI thing is just hype or actually has some positive applications. It was quite an ambitious project for us in terms of scale: we got approval for about 400 person-days of effort, because we knew there'd be a lot to do, and it was purely experimental, with no preconceived ideas about output. We didn't set out to create something we would take to market and start making money from — that just wasn't the rationale at all. We were fully expecting that maybe nothing would come of it, but we would explore what was possible along the way. Most importantly, even though I'm the one here talking today, all of the research and the hard work was done by my colleagues on team Project Ava, who are listed here. All the work and credit is theirs; I just gave a little direction here and there, and was a bit annoying now and again trying to get deliverables out of them, in a management sort of way. This is all their work that I'm presenting today.

Part one, very briefly, was just to look at what frameworks are out there. Like I said, we were starting from ground zero. We understood there were a few different frameworks people use for AI and machine learning, most of them open source, so we downloaded them all and created some virtual machines — playgrounds — to experiment with them against free and open data sets. Not web app testing data; all sorts of different things, just to get a feel for which frameworks might be best fit for purpose. The TL;DR was that three kept cropping up and seemed the best fit for our purposes: scikit-learn, which is Python-based, a bit more in the academic space but very good for quick prototyping; TensorFlow; and Keras, as a sort of higher-level wrapper around TensorFlow. TensorFlow, coming out of Google, seems to be one of the more prominent frameworks in this space, so there's a lot of support and guidance out there. The Python aspect was also attractive, because we know Python quite well and have a lot of Python coders in the organization. It was only for those reasons that we honed in on these for the rest of our research — no slur on the other packages; I'm sure they're equally good at what they do. Caffe and Caffe2, for instance, which came out of Facebook, are much more geared towards image recognition and that sort of thing, and that's not really what we were after for our research purposes.

As part of this we also looked at the cloud-based solutions out there. I don't know if any of you have had a chance to play, but most if not all of the public cloud providers offer a number of quite simple wizards that let you — without much, if any, knowledge of machine learning and AI — experiment with lots of things in this space: certainly Microsoft, AWS, Google, and IBM Watson (formerly Bluemix), which I'll come on to in our social engineering piece; we also looked briefly at the Alibaba Machine Learning Platform for AI. If you haven't seen it, this screenshot is from the Amazon machine learning framework in the cloud, and it's a very simple next-next wizard: if you've got some data, it walks you through uploading it, selecting your features, and training some sort of model, and then hey presto, you've got a machine learning model in the cloud that you can start pumping real-world data into. It was quite interesting to understand that these exist, how easy they are to use, and how much they abstract away the data science — the hardcore maths and algorithms in the background — making machine learning very accessible to most people. Similarly, this one is from IBM, one of their data science playgrounds. It's a little more detailed; you probably need to know a bit more about what you're doing here.

But here you can drag and drop your process — this one is an image classification system — and then, very similarly, next-next-next and there's your machine-learned model. While we were looking at those cloud solutions, and particularly IBM Watson, which has a lot of powerful AI and machine learning capability for all sorts of things, one of the quirks of research happened: you sometimes go off on a completely unexpected tangent just because something looks interesting and curious. That's what happened while we were playing with their natural language processing capability. Where it came about: they have a playground you can sign up to for free — for 30 days, I think — for IBM Watson access, and you can play around with a number of different NLP features. One of them says: type in your Twitter handle, and we'll use natural language processing over a massive trained model to work out what your personality is, based on how you write and what you've written. So I submitted that, based on about 3,800 words from my tweet history, and it came back quite quickly with a summary. Initially I was a little miffed, because it said I was a bit inconsiderate, shrewd and heartfelt. I'd never thought of myself as inconsiderate before, but

I think a lot of that might be because on Twitter I tend to adopt a slightly different persona — a bit more sarcastic and angry; aren't we all — and I think that was picked up. This particular demo is geared more towards people in market research who do profiling for targeted ads, but it picked up that I have experience playing music, which is true — I play a few instruments — and it picked up that apparently I don't like country music, which is also true; I'm not a great fan. So this piqued our interest: with a fairly limited amount of data it could pick out things that seemed quite accurate about me as an individual.

We probed it in more detail. In the background there's a full-on API into Watson's NLP that you can interact with and query; basically, if you have a lot of data, you upload it and it comes back with much more information about people's personalities. Specifically, it's aligned to what's called, within psychology, the Big Five. I'm not a psychologist, but I read into it as part of this, and apparently our personalities — our individual psyches — can be broken down into five main aspects: agreeableness, conscientiousness, extroversion, emotional range and openness. Within each of those are facets that break things down further, so with agreeableness you've got things like altruism and cooperation — how cooperative might you be as an individual, how modest, your morality, sympathy, trust, that sort of thing.

This got us thinking. We do a lot of phishing exercises; social engineering is part of red team engagements, and while we do try to hone in on specific individuals of interest, this takes it a bit further if you have this information on your targets. Our hypothesis was: if someone scores highly on trust — they come across as particularly trusting — maybe they're more likely to click on a link or download something, so you're exploiting some aspect of their personality. We wrote a quick little tool to try to demonstrate this, called PIMS, which stands for Personality Insight Manipulation Suggester. You give it a bunch of text — someone's tweet history, their Facebook data, a thesis they've written; the more data the better, and the more applicable to your target the better — and it comes back with a breakdown of those facets. Anything around 0.75 or above is deemed a strong indicator that the person likely does have that facet. For me, apparently, I'm fairly highly sympathetic and trusting, which makes me a bit nervous — maybe I need to tone that back a bit. We haven't done any in-depth tests with this, but it seems an interesting curiosity that might carry some weight if we're trying to maximize the likelihood of a target clicking on a link by exploiting some aspect of their personality. I'll put my hand up: it feels a bit scummy, a bit invasive. But it's also interesting that, if we run some tests in this space and get strong correlations, it might be a good way to understand how different individuals need to be trained in different ways about the risks of cyber security and phishing. So we're looking at that further, as a curiosity, and we may loop it into our phishing platform, Piranha, as a future extension.

Just for fun, since we'd collected all my tweet data, we also played around with Markov chains at the same time. Markov chains let you do a number of different things, like simulating text based on a corpus of text. I used a script, slightly modified from GitHub — it's not training as such — against all my tweet history, and then tweeted some of the outputs with the hashtag #MarkovMatt so people knew it wasn't me. This one here — "I hear GDPR is a diuretic; one kilo of that stuff will have some interesting effects" — was quite funny. The one at the other end started talking about the Welsh Assembly and ninjas and all sorts of strangeness. It sort of makes sense and sort of doesn't: it's a mishmash of a lot of the stuff I've previously tweeted about. But it was an interesting, and funny, quirk of what we'd been looking at.
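That Markov-chain text simulation can be sketched in a few lines. This is a generic sketch of the technique, not the actual GitHub script mentioned in the talk; the function names and parameters are illustrative:

```python
import random
from collections import defaultdict

def build_chain(corpus, order=1):
    """Map each word n-gram (the state) to the words observed after it."""
    words = corpus.split()
    chain = defaultdict(list)
    for i in range(len(words) - order):
        chain[tuple(words[i:i + order])].append(words[i + order])
    return chain

def generate(chain, length=20, seed=None):
    """Random-walk the chain from a random start state, emitting words."""
    rng = random.Random(seed)
    state = rng.choice(list(chain))
    out = list(state)
    while len(out) < length:
        followers = chain.get(state)
        if not followers:   # dead end: this state never appears mid-corpus
            break
        out.append(rng.choice(followers))
        state = tuple(out[-len(state):])
    return " ".join(out)

# Fed a tweet history, the walk produces a plausible-sounding mishmash
# of things the author has actually written.
chain = build_chain("i hear gdpr is a diuretic i hear the welsh assembly likes ninjas")
print(generate(chain, length=8, seed=0))
```

Because every emitted word and transition was observed in the corpus, the output "sort of makes sense and sort of doesn't", exactly as described above.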

Like I said, that was a massive diversion — we had to stop ourselves getting too deep into it, though we have spun it off into a separate research stream. So, getting back on track with Project Ava: before we launched into any coding or playing around, we wanted to understand what had been done before. That's always a good step in research — you want to see who's done what, and you don't want to duplicate effort. So we looked across academia and industry at how successful previous attempts had been. We wanted to know whether we were being too ambitious with just 400 days of effort assigned to the project, who the main players were, how long it had taken them, and whether we could reuse anything open source in this domain. In short, not much came out of it, but three things stood out as being of interest. Isao Takaesu, a researcher in Japan, has published a lot and been quite open about his work on using AI and machine learning, not just in web app testing but in penetration testing more broadly; since we did this research he's published a lot about combining AI with Metasploit to do intelligent scanning and exploitation — really interesting stuff, and he seems to be one of the main leaders in this space. CloudSEK had a very interesting approach using intelligent crawling: they trained models to crawl a website intelligently, work out when something is a login page or a search page, and then launch the attacks you would typically run against that type of page or function — for search, you'd usually look at cross-site scripting in reflected search terms, or SQL injection, that sort of thing. And there was a commercial offering from High-Tech Bridge, their ImmuniWeb AI, which we haven't played with, but they have collateral on their website claiming that capability.

Most of what we saw were fairly good crawlers — people using AI or machine learning to crawl a website intelligently, with small bits of AI or machine learning for the actual attack side, but not much. Academia was really good for specifics: there were some very good papers on, say, doing cross-site scripting — just one type of attack — with machine learning, but not much in the way of complete systems. Patents: there are loads of patents out there at the moment; everybody is trying to get ahead of the curve with their ideas about machine learning in penetration testing. Reading them, though, there isn't much information in them, so they're quite dubious, but you can see people attempting a bit of a land grab on what might become a good idea soon. So there have been some gains — the crawlers I mentioned, and some solutions focused on specific vulnerability classes like SQL injection and cross-site scripting — but sadly we didn't see one particularly strong solution in this space. There was nothing massively usable from open source either. We did find a lot of projects, but as sometimes happens, they'd been abandoned, with nothing really happening for a few years. We found quite a few going back ten years — so even though machine learning and AI have only been a hot topic for the last two or three years, people were looking at this a decade ago, which was interesting — but most of those projects were abandoned. We also noticed that most people who had succeeded had only released some of their work, probably because of intellectual property, which makes sense. So even though we haven't highlighted anyone we think is a massive player in this space and a coming game changer, that doesn't mean it won't happen soon — somebody might be sitting on that type of intellectual property, and it could land any moment. But we didn't see anything obvious, so we knew we'd have to start pretty much from ground zero.

Before we did that, we set out to properly architect and design what we wanted to achieve. Naively, we thought: we're one of the world's largest security testing teams, we capture loads of data, and you need lots of data to do machine learning and AI — job done, this is going to be easy. But actually, within a large commercial organization there are barriers that make it not as easy as one might think.

Firstly, we have contractual obligations. Even though we have lots of clients, we have different terms and contracts with them. Sometimes the contract says the client will allow us to use their data as part of ongoing research, and that doesn't get queried and remains in the contract; some clients don't like that, and it gets written out during contractual negotiation. On top of that there is retention: some clients require us to delete their data at the end of testing. So it's not the case that we have an instantly accessible, massive database of data, as we naively assumed. Also, around the time we were doing this — about May last year — GDPR was literally about to come into force. It was obviously a hot topic then, and it focused our minds: if you're thinking about creating a database of all sorts of web app traffic from all your clients, think about what might be in a request/response pair. Usernames and passwords, possibly in clear text; session identifiers that are still valid; and, depending on the application context, sensitive data such as financial data. We instantly started to wince a bit: if you hold all of that in aggregate in one place, that's quite a minefield in terms of the risk you're running. So I think GDPR was a good thing in that sense — being a hot topic at the time, it helped focus our minds, rather than letting us give in to the temptation of "let's just get all the data and start training some models". And then there's storage location: we had to decide whether to do all of this on premise or embrace the cloud, and whether there were issues in uploading client data to the public cloud. All of these things had to be considered, and this is what we came up with in the end.

For any data we were contractually allowed to use, or data we could accumulate from open data sources, we could proceed; we also made heavy use of known-vulnerable training web apps, which let us generate the data we needed; and if we had any legacy data, we could consume that too, provided it didn't violate any contractual agreements. We produced a number of plug-ins for our Burp Suite arsenal — we're big users of Burp Suite for web app testing because it's very extensible. The idea was: have all our consultants run the extension, which lets them do the supervised-learning tagging of the data we need, and lets us consume anything legacy we've got. That all gets pumped into an Elasticsearch instance. We decided to keep it internal in the end — no cloud — just to minimize risk that much more. Then comes what we called the anonymization, or pseudonymization, boundary: wherever we could, we would anonymize the data to further reduce the risk of any exposure. That Elasticsearch instance gives us the pool from which we generate our training data. So it's essentially a way to amass lots of web request/response data and use the power of our large security testing team to tag it manually, in a supervised-learning way. We also generated a lot of synthetic data in the early part of the research, in the interest of getting up and running quickly, but with the massive caveat that synthetic data isn't ideal — for good results you really need representative, realistic data. The workflow was very simple: from the plug-in within Burp, every time a consultant confirmed a vulnerability through their manual testing, they could just right-click, "send to Project Ava", and categorize what the vulnerability was, whether it was SQL injection or cross-site scripting.
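As a rough illustration of what that right-click tagging might produce, here is a sketch of the kind of labelled document such a plug-in could index into Elasticsearch. The field names and class labels here are hypothetical, not Project Ava's actual schema:

```python
import hashlib
from datetime import datetime, timezone

# Illustrative label set; the talk mentions SQL injection and XSS first,
# with other bug classes added to the list over time.
VULN_CLASSES = {"sqli", "xss", "csrf", "insecure-cookie"}

def make_training_record(request, response, vuln_class, tester):
    """Build a labelled request/response document for the training pool."""
    if vuln_class not in VULN_CLASSES:
        raise ValueError(f"unknown vulnerability class: {vuln_class}")
    return {
        "request": request,
        "response": response,
        "label": vuln_class,
        # Pseudonymize the tagger rather than storing their identity,
        # in the spirit of the pseudonymization boundary described above.
        "tagged_by": hashlib.sha256(tester.encode()).hexdigest()[:12],
        "tagged_at": datetime.now(timezone.utc).isoformat(),
    }
```

The key point is that each confirmed finding arrives already labelled by a human expert, which is exactly the supervised-learning signal the models downstream need.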

Eventually we got a few other bug classes into that list, but that was the approach we took. For the first prototype, given that we now had some corpus of data, we first looked at text processing and semantic relationships, on the assumption that, for a web request and response pair, there might be some relationship between what was sent to the server and what came back — between what you typed in and what the vulnerability might be, based on what you observed. For example, you type some stuff into a field or a form, it goes to the web app, something happens, and you maybe get a 500 error back containing a verbose SQL error. So you may be able to learn or identify some correlation between what the request was and what the response was, and from that infer a specific type of vulnerability. This shows an example where we fed a load of data into word2vec, a very simple neural network that learns semantic relationships between words and pairs. It shows that if you give it the term "set-cookie", it has learned with high accuracy that other, semantically related words are the Secure flag, the Http

Only flag, the path and — for whatever reason in this example — a 301. It was building on the premise that, rather than having to write regular expressions to pull out what we're interested in, we basically have a machine learn and infer those relationships itself. To exemplify further, from the synthetic data we generated: when you search for words related to "insecure", you get back all sorts of things related to crypto and OpenSSL, and words like "error" and "obsolete"; similarly for "sql" you get words like "injection", "statements" and "query". Like I said, this was new stuff to us, and there are a number of different models and algorithms you can use within machine learning and AI. We tried pretty much as many as we could, very much in a "let's just throw lots of data at it and see" fashion. This is from our Jupyter notebooks, where we tried all sorts — Bayesian classifiers, support vector machines, different convolutional neural networks — just to see how different algorithms performed on the synthetic data we'd used. If anyone here is a data scientist (I'm not), there are some specifics about the parameters we used. We did get some results back. It wasn't very good at identifying insecure cookies — we suspect because of the high entropy associated with a random cookie value, which confused it — but we could infer, with good accuracy, missing cross-site request forgery protections and missing CORS controls, and SQL injection and cross-site scripting were in the 70–80% range. But again, massive caveats: a small, synthetic data set, and a lot more testing needed. We didn't raise our glasses and say we'd cracked it at this point — it was still very early days. So the next thing we looked at was

the second prototype. Rather than attempting a catch-all — in the first one we were trying to identify cross-site scripting, SQL injection and a number of other vulnerability classes all at once — we focused on one type of vulnerability: SQL injection. And further, on one type of vulnerability against one type of database, with MySQL as the database management system, thinking we might get better results by focusing on a smaller target set. For this we also removed the synthetic data. We used — well, I say real vulnerable data — data we captured by crawling and scanning known-vulnerable web apps and training apps, such as the Damn Vulnerable Web App (DVWA) utility and a vulnerable WordPress, all of which had MySQL as a back end. That gave us our data set, and this was the process. Much like the first prototype, we trained the model — we found a multi-class support vector machine gave the best results — then tested a new MySQL application using that system. We'd then select all the false positives — we had an additional Burp plug-in that let us select the false positives and tag them as non-SQL-injection — which would retrain the model for us. We'd go back and retest the same application to see whether the number of false positives had reduced, as a process of refinement, and then continue testing with new MySQL-based applications. With that we got about 99% accuracy after a few iterations of retraining. The way the plug-in worked is that as we tested, the model would run in the background and then, via Burp Intruder — this is the output from Intruder — highlight in red the request/response pairs which, from what it had learnt, were much more likely to be confirmed SQL injection vulnerabilities. So it felt like there was some merit there, but again, we didn't throw our glasses up and shout "yay, 99%!". There is a temptation to do that, and we're aware it happens in the industry: you train a model, do some testing, the numbers look good, and you release your product. We still need to do a lot more testing in this space with bigger data sets, but it showed promise and was an improvement on the first prototype. We then stopped and looked back

at the various issues we'd experienced with the first two attempts. One issue was that in the previous prototypes we were essentially trying to read HTML the way we read a book, but that's not typically how we do testing — there are other aspects involved in inferring that a vulnerability exists. There's also an issue with different locales: you could train a model on English-language web apps, but if you come across a web app with error messages in Chinese or Japanese or whatever, that's going to completely fail. As we said, the request/response pairs — the data itself — might be sensitive in some way, which is another issue. And the supervised-learning aspect requires human effort for the tagging and labelling. Pen testers — I count myself here — are quite lazy: ask a pen tester to do something extra that's quite boring and they probably won't want to, or they'll forget. So we realized we couldn't rely on that as a way of constantly getting our data. We then looked at flipping the idea on its head and doing anomaly detection instead. Rather than trying to learn and detect known-bad, why not learn what we think is known-good — I hesitate to say known-secure, because you can never completely confirm that, but what you think is roughly good in terms of security — and then do anomaly detection, such that anything that later pops up as an outlier from what you trained on might be the vulnerable thing. In doing this, rather than using the data types we'd used so far, we decided to try some proper feature engineering: pull out lots of metadata around what we capture, rather than the data itself, and train our models on that.
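A minimal sketch of that metadata-only idea, assuming scikit-learn is available: train an anomaly detector on features derived from traffic believed to be benign, then flag outliers. The particular features, numbers and the synthetic "benign" baseline here are purely illustrative:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

def metadata_features(status, size, delay_ms, body):
    """Numeric features only; the raw body never enters the model."""
    symbols = sum(not c.isalnum() and not c.isspace() for c in body)
    return [status, size, delay_ms, symbols / max(len(body), 1)]

# Train on responses believed to be benign (status 200, ~5 KB, ~80 ms).
rng = np.random.default_rng(0)
baseline = [
    metadata_features(200, int(rng.normal(5000, 300)),
                      int(rng.normal(80, 10)), "ok " * 50)
    for _ in range(200)
]
model = IsolationForest(random_state=0).fit(baseline)

# A slow 500 with a tiny, symbol-heavy error body stands out from the
# baseline (IsolationForest returns -1 for outliers, 1 for inliers).
suspect = metadata_features(500, 120, 4000, "SQL syntax error near '\"'")
print(model.predict([suspect]))
```

Because nothing content- or locale-specific is stored, this approach sidesteps both the foreign-language error-message problem and the data-sensitivity problem mentioned earlier.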

so for example for feature selection when you've got a lot of web data requests and responses you don't necessarily have to inspect the traffic itself if you're trying to possibly infer identify vulnerability you've got lots of numerical data that you already have or can uh generate so you've got http response codes quite simply and http 500 as an error could be a strong indicator just on its own you've got the differences between response codes in terms of what was the previous request to the current one you've got the response size and the differences in response sizes so again depending on what your attack string was or what you sent to a server if a lot of

data comes back that could be indicative of some sort of sql injection that returned a lot of data you have like response delay or time lag so it could be some time-based type issue that you you can infer if you capture that sort of data um and also all sorts of just interesting stats about the response body rather than the data itself so you might capture the number of vowels or consonants or symbols uh and the more data that you have in that space the more that you can capture and train might then ultimately lead to a model that gives you some interesting insight as and when you come across something vulnerable which stands out as an outlier so

this one again i know there's a theme here we didn't have any strong conclusions it feels like there's some merit but we need to do a lot more experimentation and training with a lot more data sets but this at least gives a few approaches that take us away from the data protection and privacy issue that we had from previous attempts for the fourth approach we looked at reinforcement learning and genetic algorithms so this is i guess the more interesting but much more complex ai type stuff where you're making a system properly learn how to do stuff and through genetic algorithms it mutates itself

through that sort of prize reward type system again rather than try and uncover every type of vulnerability class we focused on cross-site scripting and we did borrow the approach from takayasu the japanese researcher i mentioned towards the start because he published a lot about this and he'd had some successes using long short-term memory recurrent neural networks with this though we did have some challenges the researcher who was working on this piece wasn't sure about how or when to tweak the mutation rate the training rate within the algorithm and the different evaluation functions that you use throughout the process and also as you start going into

recurrent neural networks and you're learning to do stuff the compute power is pretty insane you need a lot of power and a gpu at a minimum so he ended up using a dockerized selenium hub solution to do a lot of simultaneous multi-threaded stuff because initially he was having a run time of 10 hours just to learn to do a very simple cross-site scripting he managed to get that down to 10 minutes which is a bit more usable but still isn't ideal if you have a high turnaround or a need to do a lot of testing quickly this is just one of the example outputs from the algorithm trying to learn to do

cross-site scripting on a particular input field apologies it's a bit small but it's just showing the different scoring of the evaluation function each time it iterates different things and i'll zoom in down here this one was quite funny because one day thomas my colleague who was doing this sent me this email saying i've just been trolled by my own creation because it was sending lots of lols back to the screen while trying to learn how to do cross-site scripting so there's still a lot more work to do there like i said takayasu has had some gains there but we're convinced that we can get better improvement here the issue that we had here and the lesson learned it was

and thomas is open about this is that he wrote everything from scratch in terms of doing this and that comes with a big overhead it comes with a lot of questions in terms of how you're validating your approach and what you did given that takayasu had published a lot of results that looked promising and had published some code our next step is to actually use what he's done or see if we can get some better performance from what someone else has done already and then lastly we tried expert-based systems so expert systems i guess they're a type of artificial intelligence but you know they're quite old they're a

specific i guess subclass of ai basically it's a program that emulates decision making and how we as humans think and follow actions to come to an inference or a decision it's very proven in medical diagnosis and other fields that have a strict requirement to follow a rigid methodology to understand why an outcome was the outcome that it was and at its heart it's simply a series of if then else rules it's a way of codifying i guess how we as humans would go through the process of trying to uncover a specific class of vulnerability and what the different facts we would need along that way and the inferences that we would make
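Since an expert system is at heart a series of if-then-else rules over asserted facts, the idea can be illustrated in a few lines of Python. This is a toy forward-chaining sketch, not the actual CLIPS rule base built for the Burp extension; the facts, thresholds, and payload here are invented, loosely echoing the stored cross-site scripting example the talk goes on to describe:

```python
# Toy illustration of the expert-system idea: assert facts, fire if-then
# rules in order, and keep the chain of fired rules as the explanation for
# the final inference. Rule content is invented; it is not the real CLIPS
# rule base from the Burp Suite extension.

PAYLOAD = "<script>alert(1)</script>"  # hypothetical attack string, 25 chars

facts = {
    "value_controllable": True,    # burp has already established these two
    "returned_in_response": True,
    "context": "plain_text",       # not inside an html tag or javascript
    "max_input_length": 30,        # field accepts up to 30 characters
}

rules = [
    ("field accepts a string at least as long as the payload",
     lambda f: f["max_input_length"] >= len(PAYLOAD)),
    ("attacker controls the value",
     lambda f: f["value_controllable"]),
    ("value is returned in a response",
     lambda f: f["returned_in_response"]),
    ("value lands in plain text where tags are interpreted",
     lambda f: f["context"] == "plain_text"),
]

for name, condition in rules:
    if not condition(facts):
        # unlike a blind scanner, the system reports exactly which
        # precondition failed, and therefore why xss will not work here
        print("FAIL:", name)
        break
    print("PASS:", name)
else:
    print("inference: likely vulnerable to stored cross-site scripting")
```

The payoff described in the talk, a repeatable codified methodology plus an explanation whenever it fails, falls naturally out of the rule chain.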

for this we developed a plugin for burp suite based off this open source java based expert system called clips it's very old this came out of nasa it's from the 1980s so it's old but it is still very applicable and the fact that it's java based made it quite accessible to burp suite as an extension so what we were going to do here was try and automate again focusing on cross-site scripting some of the monotony of trying to find or at least confirm cross-site scripting so for cross-site scripting you identify a potential parameter in a page whether it's reflected back that sort of thing and you confirm the

value is controllable those first two things burp does for you already you're pretty much mostly there it's this last step where we wanted to hook in the expert-based system to test the value or to go through the process of determining okay this value that's reflected back or stored is actually vulnerable to cross-site scripting so again we were using damn vulnerable web app here just for a very simple proof of concept for this example we used one of their stored cross-site scripting pages and here we have just a few facts that we assert so we have a controllable value that's stored in the back end and is returned in a response so it's like

a message board type thing the stored value occurs in plain text so it's not inside an html tag or javascript and the length of the input is seven characters so the way that the expert system works is you assert those as facts into the clips environment following a few different rules that you've set out so just to go into a bit more detail there this is the syntax for clips it's quite of the 80s for a start and it's quite hard to learn and i know richard our researcher who worked on this bit did battle with it for quite a while but managed to get it

working for a very simple example and here we're testing the length we're testing that it's less than 30 because the final attack string or payload that we send is 30 characters so that's something that has to hold true for what we're trying to test and then after the length it goes on to check aspects about the angle brackets it asserts the less-than/greater-than type properties whether it's in text whether it's stored that sort of thing but these are all rules and assertions that we as the researchers have to type out and codify as part of the expert system but in a very simple example

richard did manage to show that yes you can automate to a point this methodology or process observations there though i think i've already touched on them but there is a very steep learning curve to clips particularly i mean there may be other expert-based systems out there that have an easier syntax to learn but we just used this one and did find it slowed us down a bit converting the human knowledge into those rules is of course very laborious but the payoff is very positive because once you've got it working and rigid you've codified a repeatable methodology that the expert system can follow but if you think about it if you wanted to

cover all classes of web app or different types of cross-site scripting that massively explodes the number of rules that would need to be codified for something effective in this space and then a few people when we were doing this research asked well how is this better than burp's active scanner you know burp's active scan is pretty good you just point it at a few fields or input parameters and it tries a lot of different stuff the difference is that this has the potential to be a lot more efficient because rather than sending loads and loads of different attack payloads in a noisy manner this is going through methodically and also when this fails

this tells you why and when it failed so it gives you the reason why cross-site scripting can't or won't work in this example which burp's active scanner won't do you don't get that feedback so it's quite a nice approach in that sense so despite being hard we do feel that there is mileage with expert systems in web app testing but again the running theme we need to do a lot more testing in this space so some conclusions we've got from this just to wrap up and what the main points were we did find that concentrating on specific classes of vulnerability is much easier you know if you're interested in setting out

to do similar stuff in this domain if you try and do a one-shot type catch-all across web vulnerability classes it's going to be very hard so i think that the best approach is to focus on can i use machine learning or ai to detect and exploit cross-site scripting or sql injection that sort of thing the need for real-world data is of course paramount it sounds obvious but we learned that quite quickly and that could often be quite the challenge a lot of data gets generated all the time but there are various restrictions around how easily we can access that data the anomaly detection method did show some good

good potential especially with the feature engineering and it gets around the data protection issues so we're definitely going to be exploring that more and the reinforcement learning that's very exciting but it's hard it's exciting to have something actually learn how to do something and then get really good at it but it's hard because it's a very steep learning curve there's a lot of data science involved you need a lot of data you need a lot of compute power to get some good results there and i haven't really mentioned it but thinking ahead longer term if or when we get better results and capability here there are the issues around retraining or

understanding when is it a good time or when should we be retraining models a big concern that i've got with any application that uses machine learning or ai is what i think royal holloway wrote about in a paper called concept drift so if you think back to intrusion detection systems when they came out many years ago everybody was buying them in rolling them out and adding signatures and then they had their ids and it was doing stuff alerting blah blah blah but i don't know how many of you have worked in network security over the years or worked with people who use ids but we commonly would see or work with

clients who had an ids system and you'd ask them well when did you last update your signatures and they'd say well maybe three or four years ago and so they have a system that's trying to detect intrusions that probably don't exist anymore the attack landscape has moved on the techniques have moved on and so those systems are effectively redundant they're not doing anything and i think we're at the cusp of experiencing something very similar with machine learning and ai in that if you're training your models on what we are seeing now in terms of attack techniques and procedures that sort of thing if you're not retraining that to keep up to

date with the changing threat and attack landscape then i think the same thing will happen as we've seen with ids in the past that the models will become less and less effective and so this idea of concept drift that your initial model is drifting away and becoming less useful is very important and so we want to get on top of that sooner rather than later so that we don't end up repeating that same mistake with machine learning and ai the expert systems they showed good promise but like i said there's a steep learning curve in learning the syntax for some of the frameworks but it

does show good potential as a robust method of doing testing with that feedback loop and it is of course technology agnostic in some of the ways that it goes about trying to uncover flaws i didn't mention intelligent crawling in terms of what we were doing because we weren't looking at that but a lot of people are focusing ml and ai on the intelligent crawling aspect of web apps because if you can intelligently identify what the different functions are within a web app and infer you know that's the search field that's a login field that's a good step towards getting better automation and you could then think about hooking in your well-trained models that detect

cross-site scripting or sql injection after you've run that intelligent crawler to get to a point where you have a fairly decent automated web app testing capability we didn't focus on that but it is a relevant part of this type of research obviously though machines do lack contextual awareness so regardless of how successful we or other people might ever be in this domain it will never get to a point where the machines and this gets a bit philosophical but the machines don't know what it means to search or to log in even for us humans you know we might see a simple get request and see a

parameter called s we might know just from having seen the app and what that request is that the s actually means search or search string those sorts of things a machine won't easily know or understand contextually and so even though we can envisage a time where maybe we do have a level of machine-learned automation in web app testing i do think that there's always going to be a role for that human-based inspection to get that better inference and then lastly accuracy i mentioned it as a running theme throughout the talk but there is a big danger that you create a model do some tests the frameworks come back and give you a 90

plus accuracy and for you to think well hey i've cracked it here's my solution but for the reasons i've discussed things like needing lots of data and the issues of concept drift we are exercising a lot of caution before we make any claims about how good our models actually are so in terms of next steps we do have another research round to build on our lessons learned from project ava and we will look to home in in more detail on some of the approaches that showed most promise we have an internal conference every year called ncc con we were hoping maybe in 2020 but it's in

january so it's probably a bit too soon maybe by 2021 within that conference we're going to run a people versus ava competition where we'll give a never-before-seen web app to both our consultants and ava who will have been trained with her various techniques and we'll give them a time limit to find as many flaws as they can and we'll see who the winner is that's just a bit of a fun thing to do but it'll be interesting to see what comes of that that's certainly helping us focus our aims to get to that competition type setup and then just to summarize i guess in

terms of a couple of quotes my personal opinion from this research journey and what we'd observed was that i do think that machine learning techniques will definitely form part of a web app tester's toolkit within the next two years easily not necessarily just from us we may or may not be able to release something but i'm sure others will have this stuff in the wings about to drop and there will be techniques that we will be using however vivienne ming a very well respected ai researcher i really like her quote about how ai might be powerful technology but things won't get better simply by adding ai we certainly did learn that

there is a lot of hype about ai and machine learning it's not necessarily the best fit for a lot of things that we were trying to do or sometimes it will give some efficiency gains but it's not going to solve or automate us out of a job anytime soon that was me thank you very much diolch yn fawr i will take any questions [Applause]

[Music]

we didn't that's a good question yeah sorry so the question was that it seems it might be easier to do a lot of this with white box assessment if you have the source code so using various techniques over the source rather than this very black box approach great question yes i agree if you have source you can obviously do a lot more i don't know what others in this room who work in pen testing think but for whatever reason within the uk what we've noticed with ncc is that compared to our us colleagues and how they work there's still a bigger reluctance to hand over

source code in the uk a lot of the pen test engagements are much shorter always black box there's still quite an attitude of well the bad guys wouldn't get the source code so why should i give it to you sort of thing we often have to contend with that so i think that's mainly why we went down that avenue because most of the work we do within the uk sector is often quite black box but no i completely agree that exploring techniques across source code is equally a valid research domain yep

sorry you say that again

[Music] yeah so the question being a big issue with machine learning is that it doesn't necessarily give you that feedback as to why it made the decision it did it's that whole black box aspect yes it is a concern definitely we would not want to release this or use it for gaining client assurance if we didn't have trust in it yeah any other questions no okay thank you very much