Building an Auto-Remediation Platform for the Cloud

BSides SLC · 2022 · 30:53 · 73 views · Published 2023-01

Speakers

Taylor Wilson

Tags

Category: Technical

Style: Talk

Mentioned in this talk

Tools used

AWS CloudFormation, AWS CloudTrail, AWS Config, EventBridge, Terraform

Platforms

AWS API Gateway, AWS IAM, AWS Lambda, AWS Step Functions

Frameworks

AWS CDK, Serverless Framework

Transcript [en]

Welcome, hello, thank you for coming today. My name is Taylor Wilson and I'll be talking a little bit about building an auto-remediation platform for the cloud, subtitled "Take advantage of all your CSPMs." Real quick, just a minute about who I am. Again, my name is Taylor Wilson. I got a degree from UVU in technology management. I think it's sort of a sleeper degree, right? It's kind of a hidden degree, and it's a great fit for a lot of us in information security: it's a cross between business and IT, with a little bit more project management than an information systems degree. Career-wise, I did sysadmin work for a few years, security focused. I was the new guy on the team at

the time, and they were like, "Hey, there's this new thing, it's called security. Why don't you go ahead and learn that for us so we don't have to?" And that's how I ended up on the security side of things. For the last six years I've been doing security full-time, and cloud security as well, as a cloud security engineer or architect, and now as director of engineering and architecture at NuSkin, just down the street here. Disclaimer: I'm definitely AWS biased, just because that's what I use the most and it's what lives in my brain. All of the terms, specific examples, and specific applications I'll use are AWS focused, but

the principles are the same everywhere, including any other cloud that you use that isn't listed here. Let's see: thesis, right, what I'm talking about. I say thesis because I think it makes me sound smarter than I am, but it's this: get value from all your CSPM tools by creating your own automated central remediation system. It's easier than it sounds, and that's what we're talking about today. It's pretty easy to extract value from all these sources of cloud security posture management that you probably already have today. By show of hands real quick, who has cloud workloads deployed, either at their company or personally? Anyone who doesn't is fooling themselves, because you do. Everyone does.

Real quick, infosec control types. I categorize a lot of these into just two simple buckets: there are preventative controls and detective controls. Preventative, of course, means you prevent (in this case I'll be talking about) a bad cloud configuration from being pushed out: through IAM, identity and access management, where people only have permission to do exactly what you want them to, the way you want them to do it; through service control policies, which are sort of settings you can set per cloud account to say, you know, we want things done this way; or through pre-deployment infrastructure-as-code scans, where in a CI job you scan the thing, it looks good, and it's good to go. And

then detective controls: you deploy it, then you assess it and ask, does this meet my policy or not? Those assessments come in as alerts or risks, which you then remediate either manually or, hopefully, automatically. Which would I rather have? Of course, preventative. In a perfect world we would only need preventative controls: every workload would be deployed with zero misconfigurations, and there'd be zero vulnerabilities when it's deployed or ever in the future. Of course, we all know that's not what really happens. But real quick, before we get into preventative versus detective: why do companies use cloud? It's like the thing everyone's doing it these

days, I know. I think it's important that we as information security professionals understand the business value the cloud brings. You're able to scale your costs with revenue: here's a service, it's pay-per-use, and as more people use it you're making more money. Your costs go up with your revenue, and they go down when revenue drops. You can quickly onboard new technology, and you can participate in more rapid innovation. For example, recently we had some devs say, "Hey, this Kafka thing sounds great, let's check it out for a little bit." In AWS (again, that's just my default mindset) you go in, there's a managed service for it, you deploy

it, and you don't have to worry about the operating system or the hardware. All of your access patterns into any new technology that's provided by a cloud provider are known access patterns: it's security groups or IAM. It just really lowers the barrier to entry. Two or three days later they're done figuring out what it's all about, they turn it off, and it costs like two or three dollars. That's real business value. And then developer efficiency: I think we often underestimate the value of having devs close to their deployments. Their deployment is a simple infrastructure-as-code template; you push it out, and it's there and deployed a few minutes later.

I think DevOps is a lot like zero trust, which in my opinion is a journey, not a final destination, at least not usually. There's always a give and take, a push and pull, a little bit of balance that goes into what DevOps means for you. So with that in mind, let's circle back to our two control types, preventative and detective. In the real world there will always need to be a balanced approach. You prevent the worst things from happening, but in order to get some of that value from the cloud for your company, you also need detective controls. There are some things you cannot prevent, and some things that you'll just have to, you

know, detect and respond to. So: prevent the worst things, respond to everything else. CSPM: I've mentioned it a few times already, so let's dig a little deeper into what that actually is. In infosec we love acronyms, right? Everything's abbreviated everywhere. What it is: cloud security posture management. It's a way to identify risk in cloud configuration. It's the configuration of your cloud control plane, the configuration of resources as seen by your cloud provider; it's anything that you could get to through the AWS web console, for example. What it isn't is vulnerability scanning of the operating systems you've deployed in the cloud. It's not EDR and

endpoint detection, and it's not application security, although there's a case to be made that application security and cloud security are extremely related, and I've got another talk about that. But CSPM, by definition, is cloud security tooling that will identify misconfigurations and let you know about them. I'm a very hands-on, practical learner, so this is what helps me. Here are some example findings that you might get out of a CSPM tool: AMIs that are shared with an account that isn't yours (an AMI being like a golden image of a virtual machine in the cloud); an IAM role with an overprivileged policy attached, where, you know, this role looks like it can do more than it should;

or an unencrypted RDS instance, which is just a database in the cloud. So let's look at how we would prevent or detect these, and the level of effort required to prevent versus detect them. Now, I did handpick these examples to prove my point, of course. But an AMI shared with an account that isn't yours is hard to prevent. In order to prevent that, you would have to maintain a list of all the approved accounts you can share with, in every single policy, in every single cloud account you have. Maybe it's doable if you have just a couple of accounts and just a few policies and roles, no big deal, but when you

start operating at any kind of scale, that won't work. I mean, we operate at a relatively small scale in AWS and have 50-plus accounts that constantly churn a little bit. In the IAM example, a role with too many permissions, you could say: centralize all IAM roles with one single team who knows exactly what they're doing and will always honor the principle of least privilege. But then you're slowing down your deployments and not getting value from the cloud. You could try service or permissions boundaries in AWS, for example, but that would require a change to every single deployment that goes into the cloud, and we have thousands and thousands of workloads running. It would slow things down. It'd

be hard to prevent, easy to detect. Unencrypted RDS instances, same thing: very hard to prevent through IAM conditions or service control policies, easy to just see that it's been done wrong and go in and fix it. So CSPM is the source of information that we can use to respond to security misconfigurations. Now, back in the olden days, when I first started doing cloud, which was just six years ago, we decided: hey, there's a new space, it's called cloud security posture management, and we're in the cloud, so let's see what's out there. You used to go out and look for CSPM solutions, CSPM tools, do a vendor selection, a proof of concept, et cetera, buy something, implement it, and be good to go.

Today, every single security-adjacent, cloud-adjacent tool that's out there will throw in some sort of CSPM-type findings for free. They're all over the place. I was really surprised when my APM, my application performance monitoring tool, was like, "Hey, plus here's all this stuff about your cloud configurations." And, well, there's some good data there, there are some good findings. EDR tools especially have these nowadays, and API security tools. Some of my network stuff is surprising: a network tool, like a NIDS tool, is like, hey, since we're here in your cloud looking at network stuff, we'll look around a little bit more and give you some good

CSPM findings. So what we're talking about is this: CSPMs really are a commodity. They're something that everything has. There's all this valuable data out there, and my thesis today is that we can take advantage of it and extract value from all of these solutions more easily than you might think. So how do we get value from all these things? Real quick, here's what a standard CSPM workflow might be. This is basically incident response in a lot of ways. You get an alert. You determine: is that resource allowed to be an exception to the policy? There will always be exceptions, as much as we might wish there weren't, but it does make sense that there are. For

example, we have some sandbox and lab AWS accounts. Back to the unencrypted RDS database alert from earlier: it costs more money to have encrypted RDS instances, and in our lab accounts we have pretty strong guarantees that there's no personal or sensitive data. So let's just save a few bucks and allow our sandbox and lab accounts to not adhere to that rule. So you'll always have an allow-list check. Then you'll do your response: that's the fix-the-problem step, whatever you define that to be. And then, I'm a huge advocate for training: identify who

made the configuration that was flagged as insecure, and let them know how they can avoid that problem in the future. I'll talk more about this in a moment. So, diving a little deeper into the respond box there, there are kind of four basic things that you do. You correct the resource's configuration, hopefully automatically. If I have an S3 bucket policy that allows unencrypted objects, I'll go in and change it to only allow encrypted objects. You can very often do that in the cloud without any outage, downtime, or incident. You could terminate the resource. Some resources in the cloud are difficult to

adjust once live, and kind of do require delete and redeploy. But you can terminate the resource only if you have good training and alerts that tell the person who deployed an S3 bucket that the reason it deleted itself after 10 seconds is that it wasn't secure and they need to change this. So that one only works if you have some good training involved, or an automatic response to the end user who deployed it. Another response type: add something to the backlog. If a human needs to make a decision, if it's more of a strategic direction, what do we do here, there's no established pattern to follow, send it to a backlog. We'll have like

a GRC team or something prioritize it, send it out to the right person, and have that conversation with them. And always log these alerts for the SOC to correlate with other threat intel and other alerts. It's always good for them to know, when they're looking at a resource, what other alerts have been associated with it in the past. Getting a little bit more into the training side of a CSPM response workflow: I'm a huge fan of just-in-time training, JIT. It's kind of a term from the manufacturing space that I think applies really well to a lot of infosec applications. Identify the violator; tell them what the problem was and what we did:

you know, this is your old bucket policy, we changed it, this is the new one. Tell them how they can avoid this problem in the future. We send out infrastructure as code (Terraform, Serverless Framework, CDK, CloudFormation, whatever): we send a snippet and say, if you use this configuration in your template, you can deploy as many buckets as you want and they'll always be compliant. And then lastly, we always link to a security standard where they can find out more, like why it's even important that we require encrypted objects in S3. And if they feel the need to petition for an exception

from this rule or policy for that resource, give them instructions on how to do that. We send these out just via email because it's the simplest way, but you could easily do it through your chat app or whatever. So we've talked a little bit about what a standard workflow is for CSPM; let's talk a little bit about that centralized remediation service. This is, in my experience, the best way to get value from all of those tools, which increasingly include cloud posture management and cloud configuration findings in their alerts. Collect all the sources and have them send events into one

single place, and do your response process from there. As a bonus, you can often handle a lot of the default findings from your cloud provider; in AWS, for example, that's Config or CloudTrail or Macie or whatever. The same system will often apply to those, so you can handle them from the same centralized remediation system. Do your exception policy management there, your response, and your JIT training. Now you might ask, why would I not do that from the CSPM itself? Some CSPM tools (not all, not even most, in my experience) have the ability to click a button, and it will go into your cloud provider and make the change on your behalf.

If you do that, you're missing out on a couple of key steps in what I believe a remediation workflow should be. You would need to maintain a list of exceptions in every single tool that has CSPM capabilities. Instead of one central place to say this resource is allowed to be exempt from this policy, you'd be stuck maintaining that allow list in every single tool, and that's expensive from an operational standpoint. Also, CSPM tools don't often (I've yet to see it) identify the user that made the change and send them a nice training email at the time their resource is remediated or changed. In general, it feels like you have a

lot less control if you're trying to leverage the built-in CSPM response functionality, and there's no centralized circuit breaker, on/off switch, or kill switch. I imagine a large machine running, and something's gone wrong: where's that big red button? You just slap it and it turns off. Maybe some resource that is production critical is being deleted right after it's deployed, every time, and it's causing an issue. Do you want to look through all 5, 10, or 20 of your CSPM sources to figure out which one's doing it, or just have a centralized "let's turn that off for now," figure it out, and go from there? So that's the advantage of a centralized

service. You might ask, how is what I'm proposing different from standard SOAR? It isn't. It's the same thing. So we'll talk a little bit about doing it yourself versus doing it in a low-code platform. If you're doing it yourself, you of course have the ability to tailor it to exactly what you want; you can customize it to your heart's content. I will say, if you do it yourself, the people writing the code often only need a low level of coding experience, and they don't need infosec experience as well. We as professionals can dictate what we want to happen, and they can be enabled to make it happen in code the way they want,

which lowers the barrier to entry. It makes it easier to find people who are capable of writing these simple scripts for a centralized remediation platform. Around here especially, we have a lot of code boot camps or development boot camps, and you can get junior devs out of there very affordably with excellent Node.js experience. At all the colleges and universities in the area, I feel like Python is on every curriculum for a lot of different majors, and Python is a great language for interacting with cloud APIs. Everyone has an SDK for it, and it's quite simple. I myself run an intern program where we

have a revolving door of two, sometimes three, interns come through, and what they get to work on is adding more rules and code to our centralized remediation service. If you're doing it yourself, it's all serverless, so extremely cheap, and event based, so you get very short response times between when the alert comes in from the CSPM tool and when you've acted on it. That costs less than a couple of pennies and usually happens within a few seconds. Doing it in a low-code platform is absolutely good enough: you can extract a lot of value from your CSPM by doing the same general workflow in your low-code platform. It could even be your SOAR,

which is tied to your SIEM, your security information and event management tool. Lower barrier to entry in most cases. I will say, some SOAR platforms I've seen out there don't integrate very easily, at least not in a secure manner, with your cloud providers. Most of them do; some of them have a hard time with that, just in credential management, where you'd have to fall back to some sort of legacy authentication method. But certainly, yes, it could be your SOAR. If you want to do it there, you could. A lot of those low-code platforms are gaining popularity, and there are certain advantages

to those. And I will summarize again: it's easier than it seems, and we'll get into that in just a minute here. I will say, learn from my mistakes and avoid this pitfall. I was that poor lady falling into the hole there. Six years ago, a CSPM tool was a dedicated thing, and there wasn't budget for it. So I thought, all right, I'll just do what I can with what I've got. I'm not a programmer, but I hacked together a few scripts and made this system, and it worked. But don't write your own evaluation logic. Use the evaluation logic built into all of your CSPMs, built into all of your

cloud providers. They'll give you security alerts. Don't try to say, I'm going to look at this S3 bucket and check all these things to make sure that it meets my standards. That's commodity; it's out there, everyone's got it. Don't write your own, because then you're stuck maintaining your own. The caveat there, of course, is if you have a very specific threat vector or you're using a cloud service in a very unconventional manner, which I've seen many times. Then sure, it is easy to just add your own evaluation logic to say: anytime whatever it is you're looking at is created, let me assess it and go through my own checklist to see whether it's compliant or

not. For example, in AWS that's what the Config service is for; it's quite easy to just add your own rule. But in general, don't write your own evaluation logic. You'll kick yourself later. Okay, high-level architecture: this is what a central remediation service looks like. Event sources are on the left: mostly CSPM tools and some cloud events, but really it could be anything. We're running some on-prem workloads in countries where that's the easiest way to be compliant with their local data tenancy laws, and you can send whatever kind of event you want, from wherever you want, into the system. We do it through webhooks; we'll talk about that more in a minute. But then within

that system there are again just the three steps: check the allow list, fix the problem (do all your response), and train the user. Talking more about event ingestion: webhooks. They're magical. Everything supports webhooks; you can send any event out of any system through webhooks. If you have a system that's security related in any way that doesn't support emitting events via webhook, I'd love to hear about it, because everything I can think of supports sending events out as a webhook. And they're super easy to collect: every single low-code platform, SIEM, SOAR tool, or just, you know, cloud service can ingest

webhooks. I will say, webhooks don't support complex authentication methods as a standard. I always choose OAuth 2 for API authentication, but you can't do that with webhooks, so you do have to fall back to just API keys. You will want unique API keys per event source, so that if you're looking at events coming in in the logs and wondering whether a key was lost because the events look strange, you would know which key you need to rotate with which tool. And I want to point out there's no code in this at all so far: receiving webhooks through Amazon API Gateway and sending those events straight into

EventBridge. There's no code; you just click, click, click, send it all through, good to go. It does the key checks as well; you don't have to write your own custom authorizer or anything. And this is what enables our event-based architecture: it's just that EventBridge. Next, check the allow list, the new purple box at the bottom there. This is the first time we're writing code. It's about 10 lines of code, but that's where you'll keep a database, and basically you just have sets of "this specific resource is allowed to be exempt from this specific rule," and that's it. It's just a little check. When an event gets to EventBridge, it's sent automatically to the allow-list

Lambda, which is again function as a service. It's serverless, and it costs us hardly anything per run. It checks against the database, which again I'm using managed services for, so there's really nothing to maintain there. If the resource does have an exception, that's where the story ends for that one: it has an exception, done. If there is no exception and it is a violation, the Lambda writes another event back into EventBridge. That's where you get your event-based architecture throughout the rest of these steps. Now for respond, the blue box. This is where you'll do whatever response you want. You can do anything. Your response could be,

you know, terminate the resource, modify the resource, or change some other mitigating control to accommodate the misconfigured resource; maybe add a WAF rule or whatever, which is something we do. You will need an IAM role for that function to be able to go into wherever the violating resource is and adjust it, or wherever your destination takes you. So that's still very simple: this is often just 30 to 60 lines of simple Python for us. And of course, report back to the event bus: I have done something to this resource, for this reason. Which brings us to the last step: train the user. Again, it's important. People will keep making the

same mistakes over and over if they don't know what they're doing wrong. So identify the user. We do it via CloudTrail, and we do that separately so we don't have to embed it in every one of those Lambdas: we have a separate service, just a Step Function, that searches CloudTrail for who touched this resource last. We'll email them and call it good. We keep all of our emails in Jinja templates: the Lambda sees the event come in, grabs the template, renders it out, sends the email, done. I will highlight, though, that for those three Lambdas you will need one set of three per CSPM alert that you want to auto-remediate. Those are all unique to the specific

alert. What I didn't show here (I simplified it quite a bit for the talk today) is that there's another listener on EventBridge that sends every event that goes through it to our SOC, so they can correlate against all their other alerts coming in. We have a few of what we call core services, things that live outside of this that these Lambdas can use for their own convenience. The user identity service, identifying users, is a core service, and there are a few others like that. We do write back to our CSPM tool using their APIs, which are super easy (they all have one), just to say, you know, you can dismiss

this alert, you can snooze it for two weeks, you can close it or retest, or whatever the tool wants. We do that as a sort of abstraction layer, so that as we push and pull CSPMs in and out of this whole process, we only have one set of code to adjust instead of all the Lambdas. The database the Lambdas check is actually a third-party, separately hosted service that we use just for simplicity's sake. And I will highlight as well that a lot of what these Lambdas do is the same; they're very similar functions. So we put all of that shared code, we call

it an SDK, in a Lambda layer, which is shared across the three of them, and then you just call functions that already exist in each of these Lambdas, which keeps them very short and very simple to write. I will also say that sources of cloud events that are already on EventBridge bypass the gateway; they just go straight into our EventBridge. That's built-in functionality of AWS. Looking back on six years of what it's been like to operate this tool: again, each Lambda is between 20 and, I guess, 80 lines of code. It's all Python. For me, I've had two interns write and maintain basically the whole thing.
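As a rough illustration of how small the allow-list step described in the walkthrough can be, here is a minimal Python sketch. It is not code from the talk: the in-memory `ALLOW_LIST` stands in for the managed database, and the field names (`resource_id`, `rule_id`), the source name, and the bus name `remediation-bus` are all hypothetical. The `put_events` argument is injected so the sketch runs without AWS; in a real Lambda it would presumably be `boto3.client("events").put_events`.

```python
import json

# Hypothetical stand-in for the exceptions database: each entry pairs a
# resource identifier with the one rule that resource is exempt from.
ALLOW_LIST = {
    ("arn:aws:rds:us-east-1:111111111111:db:lab-db", "rds-storage-encrypted"),
}


def is_exempt(resource_id, rule_id, allow_list=ALLOW_LIST):
    """Return True when this specific resource is exempt from this specific rule."""
    return (resource_id, rule_id) in allow_list


def build_violation_entry(finding):
    """Shape the follow-up event as an EventBridge PutEvents entry.

    The Source, DetailType, and EventBusName values are assumptions.
    """
    return {
        "Source": "remediation.allowlist",
        "DetailType": "ViolationConfirmed",
        "Detail": json.dumps(finding),
        "EventBusName": "remediation-bus",
    }


def handler(event, context=None, put_events=None):
    """Lambda-style entry point: stop on an exception, otherwise re-emit."""
    finding = event["detail"]  # EventBridge delivers the payload under "detail"
    if is_exempt(finding["resource_id"], finding["rule_id"]):
        return {"status": "exempt"}
    entry = build_violation_entry(finding)
    if put_events is not None:
        put_events(Entries=[entry])  # write the event back onto the bus
    return {"status": "violation", "entry": entry}
```

Keeping the lookup and the event shaping in separate functions is one way they could move into the shared layer, leaving each per-alert Lambda with only its own handler.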

We automatically respond to 85-ish (plus or minus) alerts from five different CSPM sources. It costs 52 dollars a month in AWS spend, plus the two interns' time. 50 of that is just the database cost (storage is a little expensive there), and two dollars of it is running the Lambdas. I would say our average mean time to detection, or really mean time to response, is 10 seconds: as soon as that alert comes in, it goes through the whole thing. The reason it's 10 seconds and isn't shorter is that we often have to wait for CloudTrail to catch up; there's sometimes a 15-minute delay there. But about 10 seconds is our average

run time from start to finish, to response. And the metric I would be most proud of is that the number of new misconfigurations identified and coming into the system is trending downward. Again, that's because we bother to train our users and tell them: hey, this is what you did, this is how we changed it, this is why, and this is how you can prevent that from happening in the future. So again, back to the thesis: you can get value from all your CSPM tools by creating your own centralized, automated remediation system, and it is easier than it sounds. Take value from all of these tools, which you already have, and start

realizing it. Thank you.
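The just-in-time training email described in the talk (what was wrong, what was changed, how to avoid it, where to read more) could be rendered along these lines. This is only a sketch under stated assumptions: the talk mentions Jinja templates, and Python's standard-library `string.Template` is swapped in here to keep the example dependency-free. The template wording, the event field names, and the example URL are all hypothetical.

```python
from string import Template

# Hypothetical template text. The talk describes Jinja templates; the standard
# library's string.Template is used here only to avoid a third-party dependency.
TRAINING_EMAIL = Template(
    "Hi $user,\n\n"
    "What we found: $problem\n"
    "What we changed: $action\n"
    "How to avoid it next time: $prevention\n"
    "Why it matters, and how to request an exception: $standard_url\n"
)


def render_training_email(event):
    """Fill the template from a remediation event (field names are assumed)."""
    detail = event["detail"]
    return TRAINING_EMAIL.substitute(
        user=detail["user"],
        problem=detail["problem"],
        action=detail["action"],
        prevention=detail["prevention"],
        standard_url=detail["standard_url"],
    )


example_event = {
    "detail": {
        "user": "alice",
        "problem": "S3 bucket policy allowed unencrypted objects",
        "action": "bucket policy updated to require encrypted objects",
        "prevention": "use the compliant bucket snippet in your template",
        "standard_url": "https://wiki.example.com/standards/s3-encryption",
    }
}
print(render_training_email(example_event))
```

Because `substitute` raises `KeyError` when a field is missing, an incomplete event fails loudly instead of sending a half-filled email.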
