
Hi everyone, my name is Schreiter. I work as a consultant, predominantly with clients, helping them deal with breaches that have occurred in their AWS environments. Today in this presentation I'm going to share some of the lessons I've learned while doing AWS incident response for different clients. In particular, I'm going to focus on some of the learnings I've gathered from doing data wrangling in AWS. A quick walkthrough of the agenda: initially I'm going to focus on some of the challenges associated with doing incident response in AWS, but not on why it's different from traditional incident response or how to look at logs; that's not the focus of this talk.
Instead, I'm going to highlight some of the challenges you might face as a professional once you have all the logs you need to conduct incident response, and how to use native AWS services to wrangle some of this data. Towards the second part of the presentation I'm going to switch gears a little, do a demo of some of these services, show you how to convert data, and then run through some metrics using Athena queries, comparing how much time it takes and how much data is scanned if you run the queries against raw CloudTrail logs versus running the same queries against a different data
format; we'll talk about that. Finally, I'm going to wrap up by showing you another AWS service to visualize all these logs. The reason for sticking to native AWS services is twofold: one, it saves us from having to move data around, and two, some of these AWS services work nicely together, so it makes our life a bit easier, or at least it definitely makes mine a bit easier. Right, let's get started. Before I do, I just wanted to ask a question: how many things do you think are needed to improve incident response in AWS? The reason I ask this question is that I
think it means different things to different people. Whenever I ask this question I get different responses, and it varies: do we have the right logs, do I have access to the logs in an easy-to-consume fashion, are all these logs in a SIEM solution, can I set up detection logic? These are the sorts of questions the security team typically asks, but the operations team might be asking questions like how much does it actually cost, or for regulatory reasons how long do we need to store these logs. So the focus can be different depending on who you're speaking to and where you are in the journey of doing
incident response in AWS. Well, my learning has been that I just needed one particular thing to improve my incident response skill set, and that was data wrangling. Okay, I am exaggerating; there are a lot of caveats associated with this, and it's not enough if you can just wrangle your data. But the reason I'm highlighting this, and I'll come to the caveats in a bit, is this: let's say you've attained nirvana with AWS logging, you've got everything, everyone's on the same page, and you can get access to logs easily. The next issue or problem that I faced is: I have terabytes of data, or maybe petabytes of data. How do I analyze this
quickly enough? How do I analyze this without incurring a lot of costs? Especially as a consultant I'm probably not using the client environment for doing this, so that's always a consideration. But mostly, if I run a query against this data I want to get my responses faster and I want to reduce the amount of data being scanned. That's when I realized that I wasn't doing this efficiently, because my queries take too long to run or I am scanning too much data, whereas what I'm interested in is only a couple of fields within a particular log source. That's why I highlighted data wrangling. So we can now talk about all the caveats.
I have put some of the AWS services on this slide to give a bit of guidance on the things we might be looking at in terms of logs. When I say caveats, it's typically always about logs. I have highlighted a few questions, and I'm going to go through each of them. The first is: do you have logs for incident response? It's as simple as this: if you don't have the right logs, or if you don't have logs at all, you can't do incident response. You can't answer your clients, you can't answer your customers, and you don't know what the threat actor did. So that's always a consideration,
but there is a cost associated with it. As a security team we obviously want access to everything, just in case this obscure incident happens where I need a particular log set that I only use once every six months or a year. I want to switch that around a little and ask: have I actually enabled the key logs? For AWS, the most important log source from an IR perspective is CloudTrail, and by CloudTrail I mean not just the management events but the data events as well. Then we can start thinking about things like: do we have VPC flow logs so I get an idea about the network
flows? If I'm using S3, do I have S3 access logs enabled? Am I using AWS Config to keep track of changes to my different assets? Am I using CloudFront, am I using ELB, and again, for all of these, do I have logs? It's always important to bear in mind that there is a cost associated with it and to think about which critical logs are absolutely required. In my case I always say it depends, but at a bare minimum have CloudTrail's management and data events.
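As a small sketch of that first check, you can verify whether a trail actually records data events (and not just management events) through the CloudTrail API; the trail name below is a placeholder I've assumed for illustration.

```python
import boto3

cloudtrail = boto3.client("cloudtrail")

# Placeholder trail name; cloudtrail.list_trails() will show what exists in the account
selectors = cloudtrail.get_event_selectors(TrailName="my-org-trail")

for sel in selectors.get("EventSelectors", []):
    print("Management events:", sel.get("IncludeManagementEvents"),
          "| Read/Write:", sel.get("ReadWriteType"))
    # DataResources stays empty unless S3 object-level / Lambda data events are enabled
    print("Data events:", sel.get("DataResources"))
```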
The second part of this is: we have all these logs, but how do we actually store them? If you send everything to the SIEM the costs are high, because licences for a SIEM are often linked to storage or data ingestion. Within AWS, if you're using S3 as the store for all your logs, it provides a lot of capability that lets us use different tiers of storage depending on the criticality of the logs, and I can use lifecycle policies so that I don't have to manage all of that manually. Let's say I only need quick access to logs for three months; after that it's very rare for me to go and look for them, so they can go into a different tier such as Infrequent Access, and maybe after a year I'm only going to keep them for the ad hoc situation where there's a regulatory question or this obscure incident that I've been talking about, so I need to go and fetch them just for those.
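As a rough illustration, a lifecycle configuration along those lines can be set with boto3; the bucket name, prefix, day thresholds, and storage classes here are assumptions to show the shape of the rule, not a recommendation.

```python
import boto3

s3 = boto3.client("s3")

# Hypothetical log bucket; transition days/classes are illustrative only
s3.put_bucket_lifecycle_configuration(
    Bucket="my-cloudtrail-log-bucket",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "tier-down-old-cloudtrail-logs",
                "Status": "Enabled",
                "Filter": {"Prefix": "AWSLogs/"},
                "Transitions": [
                    # After ~3 months, move to Standard-Infrequent Access
                    {"Days": 90, "StorageClass": "STANDARD_IA"},
                    # After ~1 year, archive for the ad hoc / regulatory requests
                    {"Days": 365, "StorageClass": "GLACIER"},
                ],
            }
        ]
    },
)
```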
So I can use native S3 features to manage all of this and keep my costs low. The other part of storage is the format itself. If I take CloudTrail as an example, and I'll use CloudTrail as an example quite a bit, everything is in JSON, compressed obviously, and stored in S3. However, when I query these logs I don't really run a query that needs to go through each individual row; I'm interested in particular fields within the table, let's say a particular IP address or a particular event type.
That raises questions about whether JSON, or row-based storage generally, is actually the best way to store CloudTrail, or whether, purely from an efficiency perspective, something like Parquet, a columnar data format, would be more suitable for our use cases. In this presentation I'm going to focus on columnar data formats and how they've improved query efficiency for me. At a very high level, and I'm not going to go into a lot of data science or analytics, columnar data storage is more efficient for this kind of query simply because we read data per column as opposed to per row, and there are efficiencies to be gained from that.
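To make that concrete, here is a minimal sketch (file names and field names are assumptions, not from the talk) that converts a small set of JSON-lines, CloudTrail-style records to Snappy-compressed Parquet with pandas/pyarrow and then reads back only the two columns a typical IR query cares about.

```python
import pandas as pd
import pyarrow.parquet as pq

# Hypothetical JSON-lines file of flattened CloudTrail-style events
df = pd.read_json("events.jsonl", lines=True)

# Write a Snappy-compressed Parquet copy (requires pyarrow)
df.to_parquet("events.parquet", compression="snappy")

# A columnar reader can pull just the fields we care about,
# instead of scanning every record in full
table = pq.read_table("events.parquet",
                      columns=["sourceipaddress", "eventname"])
print(table.to_pandas().head())
```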
The last question is visualization: senior management wants a dashboard, or even I want to be able to highlight anomalies, see what's happening, and visualize some of these logs easily. So do I really need to transfer logs off platform, and by platform I mean AWS, to create these dashboards, or can I keep my logs just in S3, query them using Athena, and then create my visualizations using some other AWS service? For this use case I'm going to talk a little bit about QuickSight, because that's what I've used, and it's pretty neat because it doesn't require me to create multiple copies of the same data. The fewer copies we have, the easier they are to maintain from an
operational perspective, but there is also less storage cost associated with it. So these are the caveats to bear in mind in terms of how we want to wrangle the data: (a) do we have the right logs, (b) how do I store them, which is where we'll talk about columnar formats, and (c) how do I visualize the logs without having to create another copy of the data set. All right, if I switch and focus purely on data wrangling, I've created a very simple pipeline, or flow, for how I would take logs. As I mentioned, I'll be focusing on CloudTrail logs and using just native AWS services to convert the data from the
JSON compressed format to Parquet, which is a columnar format, then use Athena to query these logs once they've been converted to Parquet, and also use QuickSight on top of Athena to create visualizations. Here on the left-hand side I'm assuming that all the logs are already stored in S3, and then I switch to steps two and three, where I'm using AWS Glue. Glue is effectively a serverless ETL engine that AWS provides. The first step is really to let Glue create a metadata table for our raw logs: you create what is known as a crawler, you point it to the S3 bucket, and it goes and reads all the data and creates the metadata table.
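For reference, creating a crawler like that can also be done with boto3 rather than the console; the crawler name, role ARN, and bucket path below are placeholders I've assumed for illustration.

```python
import boto3

glue = boto3.client("glue")

# Crawl the bucket holding the raw (JSON, gzip-compressed) CloudTrail logs
glue.create_crawler(
    Name="cloudtrail-json-crawler",                          # hypothetical name
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",   # placeholder role
    DatabaseName="default",
    Targets={"S3Targets": [{"Path": "s3://my-cloudtrail-log-bucket/AWSLogs/"}]},
)

glue.start_crawler(Name="cloudtrail-json-crawler")
```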
Pretty neat; all you need to do is create this crawler job. One thing I've learned through painful experience dealing with CloudTrail logs and Glue is the data types the crawler infers for the different CloudTrail fields: for two particular fields, requestparameters and responseelements, Glue creates them with struct as the type, whereas the data definition language that AWS provides for CloudTrail logs just uses string, so I had to modify them from struct to string to make this whole pipeline work. Great, so we now have our metadata table in AWS Glue.
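In the demo I make that change in the console, but the same fix can be sketched with boto3 by pulling the table definition the crawler produced, rewriting the two column types, and pushing it back; this is an assumed approach, and the database and table names are placeholders.

```python
import boto3

glue = boto3.client("glue")

# Placeholder database/table names from the crawler run
table = glue.get_table(DatabaseName="default", Name="cloudtrail_logs_json")["Table"]

# Change requestparameters/responseelements from struct to string,
# to match the CloudTrail DDL that AWS publishes for Athena
for col in table["StorageDescriptor"]["Columns"]:
    if col["Name"] in ("requestparameters", "responseelements"):
        col["Type"] = "string"

# update_table only accepts TableInput fields, so strip the read-only ones
read_only = {"DatabaseName", "CreateTime", "UpdateTime", "CreatedBy",
             "IsRegisteredWithLakeFormation", "CatalogId", "VersionId"}
table_input = {k: v for k, v in table.items() if k not in read_only}

glue.update_table(DatabaseName="default", TableInput=table_input)
```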
The next step is to convert all my data from the JSON compressed format into Parquet, Snappy compressed. I just use Glue ETL and again create a job that takes all this data, which is in JSON format, converts it into Parquet, the columnar format, and stores it in an S3 bucket that I've pointed it to. Then I again have to let Glue run a crawler against this S3 bucket with the Parquet data files and create the metadata table. Once that's created I can see the table in Athena, so I can run all my queries against this Parquet table in Athena, and finally I can use QuickSight to visualize all this data.
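When Glue proposes a script for a job like that, it is Python on top of the awsglue/PySpark libraries; a stripped-down sketch of what the conversion looks like is below, with the catalog table name and the output path being assumptions.

```python
import sys
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read the CloudTrail table the crawler registered in the Data Catalog
# (database/table names are placeholders)
dyf = glue_context.create_dynamic_frame.from_catalog(
    database="default", table_name="cloudtrail_logs_json")

# Write it back out to S3 as Parquet (Spark's default Parquet compression is Snappy)
glue_context.write_dynamic_frame.from_options(
    frame=dyf,
    connection_type="s3",
    connection_options={"path": "s3://my-cloudtrail-log-bucket/cloudtrail-parquet/"},
    format="parquet",
)

job.commit()
```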
So all I've really used, assuming my logs are in S3 (Glue works with other data stores as well), is three different services plus S3: AWS Glue to do the ETL and create the metadata tables, Athena to run my queries, and finally QuickSight to visualize all these logs. I'm going to do a short demo, but I just wanted to highlight the prerequisites before I switch to it. I'm assuming that you have an IAM role with the right permissions; as with everything in AWS, there's always a role and there's always a policy. In this case you need access to a role which has the AWS Glue service role policy attached to it.
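As a rough sketch of that prerequisite, the role can be created for the Glue service and given the AWSGlueServiceRole managed policy with boto3; the role name is a placeholder, and you would still add your own policy for the specific S3 buckets.

```python
import json
import boto3

iam = boto3.client("iam")

# Trust policy allowing the Glue service to assume the role
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"Service": "glue.amazonaws.com"},
        "Action": "sts:AssumeRole",
    }],
}

iam.create_role(
    RoleName="GlueCrawlerRole",  # hypothetical name
    AssumeRolePolicyDocument=json.dumps(trust_policy),
)

# Attach the AWS-managed Glue service role policy
iam.attach_role_policy(
    RoleName="GlueCrawlerRole",
    PolicyArn="arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole",
)
```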
The role will also need other things, like being able to read the S3 bucket where the logs are, write access to the S3 bucket it's copying the Parquet files to, and so on. I also mentioned on the previous slide that I have all the logs already in S3; I'm just mentioning this again. But most important, especially if you're doing it for the first time, is to be patient. It was definitely not easy for me, as someone with no data background, to run Glue, but just by going through the AWS tutorials it was fairly straightforward to replicate: I was able to create the crawlers, create the ETL job, and then create the table in Athena. So it's fairly straightforward and quite
good and easy to do. I'm now going to switch over to the demo. Here I have my AWS account; I have created an S3 bucket and added a sample CloudTrail file here. I'm going to duplicate this tab and go to Glue. I have all my prerequisites already set up, which is good; I just wanted to save a bit of time. These are all old tables that I've created, so I'm going to add a crawler. I'm just going to call it bsides-budapest-cloudtrail, say look at all the data stores, and crawl all folders within an S3 bucket. I'm going to select the bucket that I
showed you with one file, then next; I don't want to add anything else, next. I already have my IAM role, so I'm just attaching that to this particular crawler. You can run the crawler on demand or you can set up something like a cron schedule for it, so that it automatically picks up whenever a new file is added to the S3 bucket, which is what I would be doing when running this in a more production-like environment. I've configured it, I'm just going to create all these tables in the default database, and I have created it. So now I have to run this crawler; it refreshes and shows me a
status. Hopefully it doesn't take too long, but I'm going to let that run.
I'll probably come back to this by switching back and forth with the presentation so that I'm not holding you up. I wanted to share why I felt this was quite interesting and important for me. I've added a few queries here, and I'm comparing the runtime for each of these queries and the amount of data that was scanned, running each query against the JSON compressed files, unpartitioned, and against the Parquet files. In terms of data set, I used the Summit Route flaws CloudTrail data set; I thought that's available to anyone who wants to go and test this with Glue, so it's probably an easy enough data set
for everyone to use. I know that the data scanned here says 1.53 GB; the actual tar file with the different CloudTrail logs is usually not that big, I think around 240 MB or so. I just created multiple copies of it because I wanted to have roughly more than a gigabyte, at least, to do the comparison between running my queries against just the JSON data and against the Parquet data. The first query I have here is just running through some of the API errors: select a few fields from cloudtrail, and I'm looking for particular error codes which indicate that there was a failure, things like client invalid permission, client not permitted, and access denied.
Just for simplicity I restricted the results to 25. You can see that with JSON the runtime is around 17 seconds and it scanned all the data, whereas with Parquet it's around 5.37 seconds, and what it scanned is still not that great but still less than with JSON. Then I move to the second query, which is looking at activity from a particular IP address that I know is malicious. The query says: give me the event time, event source and name, and the user identity for a particular source IP address, so we're starting to narrow down. Again, with JSON it's around 19 seconds because it has to go through all the
data row by row and scan all of it, whereas with Parquet the runtime is just under three seconds, which is quite a big difference, and the data scanned is 9 MB. Similarly, another query, an EC2 instance enumerating an S3 bucket: again with JSON it's around 13 seconds and it scanned all the data, whereas with Parquet it's taking less than two seconds and it scanned 38 or 39 megabytes. So you can see that there is a saving not just in terms of how much time the query takes to run but also in terms of how much data is actually scanned by Athena when these different queries are run.
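For context, this is roughly the shape of those queries when run through the Athena API; the table name, error codes, and output location are assumptions standing in for the ones used in the talk.

```python
import boto3

athena = boto3.client("athena")

# Roughly the first query: recent API errors, limited to 25 rows
# (table name and error codes are illustrative)
query = """
SELECT eventtime, eventsource, eventname, errorcode, sourceipaddress
FROM cloudtrail_logs_parquet
WHERE errorcode IN ('Client.UnauthorizedOperation', 'AccessDenied')
LIMIT 25
"""

athena.start_query_execution(
    QueryString=query,
    QueryExecutionContext={"Database": "default"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results-bucket/"},
)
```

The runtime and bytes-scanned figures quoted above are the same numbers Athena reports per query, which you can also read back programmatically from get_query_execution under its Statistics block.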
That was just to give you an overview of how much saving we can have. Sorry, I'm going to switch back to Glue and see if this has run. Yes, as you can see, it has finished running, so now I can go back to my tables, and here I'm just going to sort this. Here's the table that was just created; I'm going to go into the table. As I mentioned earlier, the request parameters and the response elements are struct, so what I'll do is go to edit schema, click here, switch that to string, click response elements, switch that to string, and save. Great, so now we have our table for the JSON files. All right, just to quickly recap, this is what we
have done: I have my logs in S3, I've run the crawler, and I've created the metadata table. Now I'm going to create a job to convert the data. I'll call it bsides-budapest-json-parquet; I already have my role, and I'm just going to let AWS Glue propose a script to me. Here is where the script is going to be stored; I have some of these directories already, so it's just going to store it there, next. Here's the table that we just created, cloudtrail logs json1, and the classification is cloudtrail. I'm going to say I want to change the schema, my target format is Parquet, and I want to save all of that in S3,
and I'll quickly show you that this bucket, cloudtrail logs parquet2, is empty. Select it, next. This is what Glue is going to map from JSON to Parquet; I'm not changing anything, so I just click save job and edit script, close out of here, and now run this job. If I just click here it shows you the status, and all the logs go to CloudWatch, so if you're interested in finding out what it's doing you can go there and take a look. It also gives you an indication of how much time it's taking to run the script; so far it's taken 10 seconds. Just going back to my diagram
here, this is what we are currently doing: we have written an ETL job, we are running it, and at the end of the whole process the converted Parquet files should be in S3. Let's see if this has completed; I chose a very small file so that none of this takes minutes to run, but we can come back to it in a bit. Here are some metrics in terms of savings, quote unquote, not from a cost savings perspective per se, but in terms of the amount of data that is scanned and the effectiveness of running queries against Parquet data. My learning, using the flaws CloudTrail data set from Summit Route, has been that
roughly, running 15 queries, typical IR queries, against CloudTrail, there was roughly 74 percent less data scanned. Athena charges based on the amount of data it scans, so the less data scanned the better, and queries were roughly seventy-seven percent quicker as well, because with less data scanned it has a more effective way of getting to the table and the particular fields that you're after. Admittedly this is a very small data set, but I've used this approach on cases and it's been remarkable, because especially as data size keeps going up, you can obviously use other services like Redshift etc., but if you're using
Athena for doing IR, instead of taking 10 or 20 minutes or timing out, with Parquet I can actually run the queries much faster, and it helps me triage quicker as an incident response professional. So that's the savings part; let's go back and see if this has finished. This job has finished; it took roughly a minute. This is the timeout period; you can get the job to run quicker based on the parameters you use, but I just used the default settings. So if I go back and add a crawler again, bsides-budapest-parquet, data store, crawl all folders, my data should be in this bucket. Let's refresh this and see; here are the Parquet files that have been
created; Snappy is just a compression format. So I select it and click next. This is similar to what we did before: I don't want another data store, I already have my IAM role, I'm going to run this on demand, and create it in the default database. I'm just going to add bsides budapest as a prefix, next, finish. That's my crawler; it should take a similar amount of time, less than a minute, so I'm just going to let that run and switch to Athena.
These are some of the older tables that I have, but this is the particular data set that we asked Glue to crawl and create a table for; these are just all the tables I have. So let's go back and see, more or less in real time, how much time it's taking. Just to go back to the pipeline and refresh our memory: we took the raw logs from CloudTrail in JSON compressed format, created the metadata table using Glue crawlers, then we ran an ETL job to convert them from JSON to Parquet format, then I showed you the files in S3 in .snappy.parquet format, and I'm running the crawler again to
create the metadata table. So let's see, that's stopping, so we're almost there; just give it a couple of seconds.
Sorry about that; I just wanted to pause and not move to the next slide, because that moves into another AWS service.
Right, so this job has finished running. If I go back to Athena and click refresh, here's the table, and I've added the prefix bsides budapest, which is why it's showing up this way. If I preview the table, there you go, here's the information: you can see the user agent, the event ID, the user identity; again, with AWS it's complicated because you have a principal ID, and it has information on whether you used a session, whether you used an access key, etc., so all that information is there, along with event name, event source, recipient account ID, and region. I also quickly wanted to show you the properties; this is all based on what we selected
when we set up the crawler. We said to create it in the default database, so that's what it is, along with the create time. You can also look at the table's data definition language; remember I mentioned that we need to use string for request parameters and response elements, so that's reflected here, and here's the entire data definition. It also has information showing that Parquet is being used, and then properties like the number of records. So that was a very quick introduction to Glue and how to use it, just to show you that it's fairly simple to use. We've spoken about
generally getting logs, moving them into Parquet, querying them, and some of the efficiencies that we have seen in terms of the amount of data scanned but also how long queries take to run. Finally, I'm going to move to QuickSight. I have this open here, and I am going to create a new data set; here's Athena, and I'm just going to call it, again, bsides budapest, default database.
That's weird; apologies about that. I just realized what I've done wrong: I have used Ireland as the region, whereas when I set up QuickSight I used us-east-1 and not EU, which is why it can't see the data set. So unfortunately I can't show you how to create it, but effectively once you create the data set you just connect to Athena, and it looks very similar to Elastic or Tableau or one of the many visualization tools. Unfortunately these things happen during presentations. If I switch back, I do have an example dashboard that I created with QuickSight, so hopefully this helps give an overview. On the left-hand side, once you select
the table, so when I was doing my testing I used a table called flaws final test parquet analysis, you see this tag here which says SPICE. If you want QuickSight to be quicker you can use SPICE, which basically means it does all the analysis in memory rather than directly querying Athena and S3, so again it depends on your use case. All the types of visuals you can create are at the bottom. If you select the thunderbolt, QuickSight, sorry, not Athena, comes up with a suggested chart; you just need to choose what fields you want. Here I've just added a few; they're not indicative of anything as such, but
the first chart here is just showing the top source IP addresses and the number of records. On the right-hand side we see this huge uptick, almost 5 million, so probably that's normal, or a lot of people are connecting from the same address, whereas all these little lines here are probably the anomalies, so that's what I would go and look at if I were investigating. But I would also check with the client, especially if there are outliers like this, whether it's an IP address that belongs to them or not. I also created a bunch of pie charts showing different types of events, different
types of user identities, and user agents, and you can also create pivot tables. I wish I could have shown you QuickSight live, but I realized that I messed up with regions, so that's an important learning to keep in mind: it needs to be in the same region. But this is what QuickSight looks like. All you need to do is point it at the Athena table with your Parquet data and it exposes all the fields to you, and you can create the dashboard. It's pretty cool because I'm not creating a duplicate copy of the data, I'm not having to migrate the data into another data store, and I'm not having to
extend my pipeline. So I quite like using this, just because it's simple, it makes my life easier, and it provides the functionality that I'm after from a dashboarding perspective. Finally, going back to takeaways, this slide links back to the caveats slide that I was talking about. First is always: do you have the right logs? Whilst it would be ideal to have everything and all logs, unfortunately that's not always possible for various reasons; cost is a major driver of this. So think about which logs are important within AWS. We spoke quite a bit about CloudTrail; that's absolutely the most important log source, especially data events, to give you that level of detail. The
second aspect we spoke about as part of logs is how you store these logs. We spoke about using a columnar format and I showed you some metrics. You don't need to take my word for it; the log sources I used are open source, and the reason I used them was so that anyone can take the exact same log sources, use Glue, Athena, and QuickSight, and replicate whatever I've done. But roughly, let's call it 75 percent, in terms of faster-running queries and also less data scanned when using Parquet. I also showed you a demo to hopefully
give you an overview of how simple it is to use something like Glue to convert data, and it can actually be set up as a scheduled job so you don't need to do it every time. And then finally we spoke a little bit about visualizing logs. I wanted to show you QuickSight live; unfortunately I messed up, but hopefully the dashboard screenshot that I showed gives you an overview of how easy it is to use QuickSight, and that it provides the sort of visualizations that are available in other tools as well. Thank you for your time.