BSides Iowa 2018: "Immutable Architecture and Ruthless Automation"

Name: BSides Iowa 2018: "Immutable Architecture and Ruthless Automation"
Uploaded: 2018-04-15
Duration: 26 min 55 s
Mentioned in this talk

Tools used

Chef, Falco, Nessus, Puppet, Splunk, Twistlock

Platforms

Alpine Linux, Amazon ECS, AWS, Docker, Kubernetes, VMware vSphere

Languages

PowerShell

Vendors

Red Hat

About this talk

BSides Iowa 2018 - Track 1 Speaker: Ben Schmitt. What the hell is immutable architecture and why does it matter? In cloud environments, it means treating your servers like cattle, not like pets. You deploy them and you don’t touch them until it is time for new ones. You knock them over, spawn new ones, and since all of this is load balanced and your services are stateless, you get fresh servers without downtime. What’s more, if you monitor them, any interactive use is rogue. Let’s discuss this in practice and cover the benefits related to security. Of course, this is enabled through ruthless automation, and this means that embracing Git, Python/Ruby/other, and APIs is the new normal. If we can automate things, software can become a force multiplier and we can monitor things differently.

All right, we're gonna get started. We're running a little short on time, I think, so I'll try and speed it up. I've only got about 30 slides, so I'm guessing 30, 40 minutes; we'll see. There's time for some questions at the end. So I'm Ben. I work at Dwolla, and I'm also on the board at SecDSM. Pretty happy to be here. Sorry about the weather, it kind of stinks, but I guess it's good we're inside. So I'm gonna talk about some properties of immutable architecture. You've heard the term immutable infrastructure; I want to take it up a notch and say, well, how can this be an architectural design pattern? Because it changes how we do security. So

I want to talk about some unique elements and properties, and they make us better as defenders, and automation is a key to that. So we'll go through some background today: properties of immutable architecture, some examples old and new, and maybe an application of how this could work. So I put Bruce Hornsby up there. My wife loves 80s music, and he's got a song where he says some things will never change. We're gonna talk about immutability today, and so when he has that line, "some things will never change," that's my little theme song for today, because if things can't change or shouldn't change, then if they do, that's rogue. So that's kind of the theme of what we're doing here.

She likes "Mandolin Rain" too; she liked that song. Hornsby. All right, so we're gonna talk about cattle versus pets. Anyone heard that analogy before? If not, we'll talk about it. We'll talk about the terms Phoenix server and snowflake server, immutability, security properties, and examples. So first of all, cattle versus pets. I'm originally from northern Wisconsin, so lots of cattle, in fact lots of dairy cows. So up on the right are dairy cows; that's a picture from the University of Wisconsin. That's cattle. Anyone know what type of cow that is? Probably hard to see, but it's black and white. Yes, yes, Holstein, which originally is from the Schleswig-Holstein area of northern Germany, which my family's from, so it's kind of interesting. All right, so that's

Holstein cows up there. Anyone know what's down there on the lower right? My daughter. And then what's the other thing? It's my dog; his name is Hank. So my colleagues who came here have met him. He's a big Great Dane and he's a pet. We treat cattle and pets very differently. Cattle are there to serve a purpose. Holstein cows have a very high rate of producing milk; they serve a purpose, and when they're gone we replace them. They're not pets. Whereas that little girl is attached to that pet like you wouldn't believe; they've grown up together. So his vet bills are huge, he eats table scraps, we treat him like a pet. We're gonna talk about how this

really pertains to servers and infrastructure shortly. His name's Hank, by the way. All right, so what is a Phoenix server? A Phoenix server is kind of hard to define. Martin Fowler is the gentleman who coined this term; he's a British architect and software developer. I thought it was a cool icon of a phoenix, but let's talk about what it is. I think his definition is a really good one: it should rise regularly from the ashes. The advantage is you avoid configuration drift, and as you work at scale, configurations do drift: ad hoc changes to configurations, unrecorded stuff happens. So a Phoenix server is one that can rise from the ashes. Treat this thing like cattle: you can knock it over and

bring it back. Root of trust: anyone heard that term, root of trust? You go back to a point where trust is no longer derived, like a certificate authority. Well, if you trust some base image, some base part of your infrastructure, that's a root of trust. You're going back to that, which is a very good thing. I talked about knocking things over like cattle, so how do we do this? Eric was talking about the cloud, AWS, virtualization, containers; we're going to touch on that too. Using virtualization tech, this can be done via code and APIs, with a little load balancing in there to make sure it's done almost seamlessly. You could do this stuff live. And then the term Phoenix server was actually

coined by one of Martin Fowler's colleagues, Kornelis Sietsma. What is a snowflake server? Anyone heard this one, the snowflake server? OK, it's the opposite. This thing is kind of like my pet Hank: you treat him very carefully. And again, to get a good definition was tough; I think Fowler's is the best one. So: it's a finicky business to keep a production server running. You have to ensure the operating system and other components are properly patched, kept up to date, upgraded regularly, watch for configuration drift, etc. So how do you do that? It's a form of command-line invocation, jumping between GUIs, and editing text files. When you're done, your uniformity starts to drift and you have a snowflake server: good for a ski resort, bad for a data center. The reason

that's bad is it's hard to change these things after they've been running over time. Anyone have a server you're afraid to touch? It's not fun to touch; that's a snowflake. And we have to have them; people have snowflake servers. But it's fun to talk about them in this way, because maybe there's another way in the future you could do it slightly differently. And it's hard to reproduce things when you work on these, because if they have drift or they're different, it's hard to make sure things are consistent. All right, so what is immutability? And I promise we're gonna get out of the definition, kind of boring part in a second here. It's the property of being unable to be changed, and

so if you go back to development languages, Java in particular is the one I think's good for today: an object is considered immutable if its state cannot change after it's constructed. And why is that a good thing? It creates simple, reliable code. So that applies to infrastructure as code, and that's why I think part of this immutable architecture discussion matters. If something can't change, it can resist corruption. So you heard Eric talk about corruption: Tom corrupted a container instance that was running WordPress. Can we stop corruption, or can corruption be detected? And it's also easier to enforce validation or policy against that. And by the way, I'm kind of conflating immutability, containerization, and immutable

infrastructure, but that's why I'm calling it immutable architecture. I think each of these pieces plays into what we can do in the future. So a little more on immutability: can you apply this to more than objects, types, and sequences inside of a language? Like, a tuple in Python is immutable, that's great, but how does this apply to infrastructure? So one example is Docker containers, and you heard Eric touch on that a little bit. They're small, purpose-built containers that are fungible and replaceable. I like the word fungible; I've wanted to use it ever since I was taught that one in high school. A good example is corn: you harvest a whole bunch of corn, you put it all together in a silo, and then you

pull your corn out on the other end; it doesn't matter if it's mixed. Well, these are fungible and replaceable. You can do this with instances too; it's not just containers. So virtual machines, right: if they're purpose-built, they can be launched, stopped, or destroyed and replaced. Yeah, sure, these things are gonna change a little bit: the PIDs are gonna change, the logs are gonna change, but that root of trust is still established when you do these things at scale. So immutability can also be a practice, not necessarily a distinct property. I think it's both. I mean the infrastructure: you can do software-defined networking now, so load balancers, routes, firewalls, they can all be code-driven, and you can do this

without manual intervention if you have an automation tool that can push those changes out. If you're using Git, you're doing commits, you have version history, reviews on that, etc., you can do immutability in infrastructure. Then you can also kill and replace parts of your infrastructure programmatically. All right, so how do you enable this stuff? How do you enable immutable architecture, as I'll call it? There has to be code; in most cases it's Python, that's just what we've been doing, but you have to have code. Root of trust: I touched on that already; you have to have a safe place to go back to. You need ruthless automation; we're gonna talk a lot more about that. That means getting humans out of the equation, quite frankly.
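[Editor's sketch] The language-level idea mentioned above (an object whose state cannot change after it is constructed, such as a Python tuple) can be carried over to a server description roughly as follows. This is a minimal illustration only, not anything shown in the talk: `ServerSpec`, `roll_forward`, and the image name are hypothetical.

```python
from dataclasses import dataclass, replace, FrozenInstanceError


@dataclass(frozen=True)
class ServerSpec:
    """Hypothetical description of a server built from a trusted base image."""
    image_id: str      # stands in for the root of trust (a known-good base image)
    app_version: str


def roll_forward(old: ServerSpec, new_version: str) -> ServerSpec:
    # Build a replacement spec; the old one is left untouched and can be discarded.
    return replace(old, app_version=new_version)


current = ServerSpec(image_id="base-image-001", app_version="1.0")

try:
    current.app_version = "1.1"   # an in-place change, the "pet" habit
    changed_in_place = True
except FrozenInstanceError:
    changed_in_place = False      # frozen dataclasses reject mutation

successor = roll_forward(current, "1.1")
print(changed_in_place, current.app_version, successor.app_version)
```

The point of the design is that a change never edits the running thing; it produces a successor built from the same base image, and the old one is thrown away.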

Turning everything into a damn honeypot. And then monitoring: you have to monitor this stuff. So verifying integrity becomes easier, interactive use and change become rogue, and the age of things in your environment also becomes an interesting indicator; we'll talk about that in a second. Anyone heard of the pink sombrero? There's an old post from 2011 called "Cowboy Coding and the Pink Sombrero." So I work with a really good engineering team (we're hiring, by the way, in many positions; I'll talk about that later), and they're the ones who introduced me to the pink sombrero. And this means inevitably you have to go into production occasionally, or you have to go someplace and unfortunately break immutability to fix something. Well, if

you do that, that's a big deal. We should avoid this at all costs. So if you have to do that, you put on the pink sombrero, and some people actually do this, and that means you are doing something that is dangerous, something that is an anti-pattern; you'd better be very careful. So when you're making changes and you're not following ruthless automation because you have an issue, think of the pink sombrero as a metaphor. In the references at the end I link to that post, but it's from 2011; it talks about scaling an infrastructure. So think of the pink sombrero as meaning you're breaking this immutable property. So some more properties that enable integrity: with a root of trust established, you can

cryptographically sign. If you're in Amazon, you can cryptographically sign and attest things, because they have a PKI available to you. If you have rapid, authorized change via code, it's repeatable; that's ruthless automation, and that is lower risk because you can move faster, you know what you did, and you can roll back. And you can do rapid vulnerability management: if you're rolling new instances, typically in a cloud environment, those instances should come up fully patched. That's pretty nice; we'll talk more about that. But you can also patch any type of supporting architecture, meaning if you have a bunch of stateless stuff with a load balancer in front (we're gonna show a diagram later), you should be able to roll things

in production in real time to keep your environment evergreen. And then rapid recovery: containers and other parts of your infrastructure would do this, but you have the ability to rapidly recover from an event. You're not gonna rack, stack, power, configure, and care and feed a server; you can recover more quickly. All right, so flux is an interesting property, flux meaning change. An adversary who breaks into an environment needs persistence. They might do a smash-and-grab, like Tom did with your tables, but if they're after data as an asset, they might hover for a while. Most advanced adversaries will have persistence over time. So this is an old stat from Netflix, but the average age of

an instance in their infrastructure in 2013 was 24 days, and I'm sure it's a lot lower now. I don't have a newer stat for you, but I think that's a fascinating statistic. Granted, they're cloud native, but it's really impressive, and if they're in AWS and their machines auto-patch, that means their patch cycle is less than 24 days at worst. That's pretty cool. So you've heard that NIST changed some of the password requirements, right? Like how often you change it, length, do it on an event, do it based on risk. But we have these 90- or 120-day things in our minds for when we change our credentials. Well, instance age is the new password age. I think that's

interesting, because if you can lower the instance age, I think you get some benefit from flux, because an adversary can't hang around; you're gonna kick them out just by rolling infrastructure. That's a real metric: Cloud Conformity has it as a metric they offer inside of their management platform; that's kind of neat. I think anything can be a honeypot if this stuff's ruthlessly automated, and if you sprinkle a little two-factor on top, you're doing pretty well. But if there are changes that are made with a pink sombrero on, you'd better know about it: you have alarms that go off and you watch them. Anything else that happens is rogue, and that is something, I mean, you can

detect pretty quickly. And I think you reduce your attack surface; in this case I'm talking about containers. So Alpine Linux is super minimal as a base for your containers; the attack surface is very much restricted. You layer just what you need on top, and that's the container that's presented. You don't have Bluetooth, you don't have a GUI, you don't have CUPS for printing, you don't have blah blah blah. So let's use an example. Let's do some artisanal infrastructure work. Let's just use Splunk in a small environment; let's show how you can do it manually and how you can automate this thing using some properties of immutability. I think it's a good way to bring this home. So if you're gonna do a

Splunk deployment (and I've done one, a long time ago, and it was artisanal, about ten years ago actually), what do you need? Well, you've gotta get storage, so you talk to the storage people to get, you know, access to the SAN or whatever else. Instances or physical servers, depending on your org; I had physical servers back then, so I had to rack, stack, power, etc. You need licenses. You need digital certificates, so you've got to talk to the certificate person or team. You need agents to be deployed all over the place to collect the logs and ship them back to Splunk. And you'd probably tie in LDAP for authentication. That seemed pretty reasonable for a Splunk rollout. And your server and

storage were probably tightly coupled back in the day. I'm not doing this stuff that much anymore, but you typically wanted really good disks presented to you via an expensive SAN that was just for you, or really expensive RAID on there, so they're pretty tightly coupled. All right, so what do you do next? You need a pretty performant host. Back then Red Hat was pretty much the standard: two CPUs, eight gigs of RAM, two terabytes of disk, and hopefully you don't have to rack, stack, and power this stuff. So now you have your server, right? You have to get SSH access or console access to your VM, where you're in that VMware vSphere manager thingy. You've gotta get sudo access, because you can't

install this as a regular user. You've got to install the Splunk RPMs and enable boot-start, and you've gotta apply all the licenses after that's done. Then you tweak your defaults, and you're probably following a checklist, right, doing this via a checklist. Maybe you have Puppet or Chef or something doing it for you, but you're probably following a checklist; I was, back in the day. Bind to an IP, secure stuff, deploy your agents, then you can do dashboard and query time. So that's a lot of different steps. Graphically, this is my best representation of what I think this could look like back in the day. If it's a manual process, it's grey; if it's partially automated, orange; if it was

fully automated, green. Well, I think your provisioning is probably somewhat automated, probably giving you a big VM. Console access and software are probably manual, as not everyone has access. The licenses are probably restricted. Installing your software, that's partially automated; the Splunk installer is kind of nice. Then config changes and deploying agents. So I think there are some opportunities there. The automation, really, in the orange, is provisioning your VM, the install script from Splunk, and agents via an orchestration tool. All right, so what are some opportunities? I think your uniformity, as a function of scale, is gonna drift if you're building a big Splunk environment. I haven't done this in a while, but say you had a bunch of these

servers: you can have a lot of checklists, a lot of good process, a lot of QA, but the chance you'll miss something is very real. Do you allow system changes? These are all mutable servers: you can SSH in, you can change them, so these are open for change. Authentication is obviously allowed, and do they have MFA enabled? Like, do you have Duo or something snapped into PAM, so that when you do authenticate, at least it's really well protected? Probably not. Speed to implement and update is a function of human resources, of people: if you want to speed this project up, you need more people to do it, because right

now we're not automating. You're gonna get configuration drift over time, there are gonna be patch windows, you're gonna have downtime. The configuration drifts over time; that's what's gonna happen. All right, so that's the artisanal rollout. How can we do this a little more automated? I've got some charts coming; I'm a little devoid of graphics here, we'll add some in. All right, so we're gonna use immutable architecture principles. If we do this, we're gonna have storage that's not as tightly coupled: we can use elastic block storage. We can use an m4 instance; those are pretty easy to get access to, and that'll fit that same performance that we talked about earlier. You can apply licenses as code, certificates as code;

agents can be applied via Chef, and you can apply LDAP integration, all via code. So there's a looser coupling of server and storage, which is nice. All right, so what can this look like? Well, if I'm gonna provision this instance, it's a custom AMI, an Amazon Machine Image, and it's got stuff baked into it; that saves you a ton of time and you have a root of trust. All right, I don't need console access to deploy this. I don't need to deploy software that's already baked in. I have to mount the elastic block storage, install the software, which is automated, apply the config changes, which is automated, and deploy the agents, which is automated. So we've

shortened a lot of time; we've done some pretty cool stuff here. So what's different? Speed: this can be done in under 10 minutes, and if you can do it once in under 10 minutes, you can upgrade in under 10 minutes. That's pretty cool. Rollback: if something's messed up, you're not waiting on a restoration or restoring something; you can roll back quickly. No SSH is required to do this; at least on this infrastructure, you can roll it out without it. So it makes monitoring for authentication events all the more meaningful. I'm not saying it's a honeypot, but if you don't have to SSH in, that means if someone is, they have a pink sombrero on or it's rogue. It's that simple.
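[Editor's sketch] The "pink sombrero or rogue" rule above can be written down as a tiny classifier over login events. This is a minimal sketch under stated assumptions, not the speaker's tooling: the `LoginEvent` shape, the host and user names, and the idea of an approved-exception set are all hypothetical, and a real deployment would read these events from a central log store.

```python
from dataclasses import dataclass
from typing import Set, Tuple


@dataclass(frozen=True)
class LoginEvent:
    """Hypothetical, simplified authentication event."""
    host: str
    user: str
    interactive: bool   # True for an SSH or console session


def classify(event: LoginEvent, approved: Set[Tuple[str, str]]) -> str:
    """Label a login on a host that should need no interactive access."""
    if not event.interactive:
        return "automation"
    if (event.host, event.user) in approved:
        return "pink-sombrero"   # documented, approved exception
    return "rogue"               # everything else should raise an alarm


# Hypothetical data: one approved break-glass session for the on-call engineer.
approved_exceptions = {("splunk-01", "oncall")}
events = [
    LoginEvent("splunk-01", "deploy-bot", interactive=False),
    LoginEvent("splunk-01", "oncall", interactive=True),
    LoginEvent("splunk-01", "mallory", interactive=True),
]
labels = [classify(e, approved_exceptions) for e in events]
print(labels)
```

Because the automated rollout never needs SSH, the rule can stay this blunt: the list of legitimate interactive sessions is short enough to enumerate by hand.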

OK, known good state: we talked about that. You can even run attestation with Amazon. File integrity monitoring: you can tune this stuff because you know what should change and what shouldn't change. So that's kind of a nice property, because file integrity monitoring is not necessarily easy. I talked about honeypots: this thing should never change, like the Bruce Hornsby song. And if you sprinkle in some MFA and behavioral monitoring with off-host log analysis, you have a party, meaning things are going really well. So can we do better than this? We took a Splunk rollout which is pretty manual, now we automated it, made it cool. Can we do better? The answer is we can.
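[Editor's sketch] The file integrity point above (you know what should and should not change, so any difference from the known good state is interesting) can be illustrated with a hash manifest. This is a minimal sketch using only the Python standard library; the file names and the temporary directory are made up for the demonstration, and commercial file integrity monitoring tools do considerably more.

```python
import hashlib
import tempfile
from pathlib import Path
from typing import Dict, List


def sha256_of(path: Path) -> str:
    return hashlib.sha256(path.read_bytes()).hexdigest()


def build_manifest(root: Path) -> Dict[str, str]:
    """Record a hash for every file under root (the known good state)."""
    return {
        str(p.relative_to(root)): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def drift(root: Path, baseline: Dict[str, str]) -> List[str]:
    """Return paths added, removed, or modified since the baseline."""
    current = build_manifest(root)
    return sorted(
        name
        for name in baseline.keys() | current.keys()
        if baseline.get(name) != current.get(name)
    )


# Demonstration on a throwaway directory with hypothetical file names.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "app.conf").write_text("port=8000\n")
    baseline = build_manifest(root)          # taken right after the build
    clean = drift(root, baseline)            # nothing has changed yet
    (root / "app.conf").write_text("port=9000\n")    # simulated rogue edit
    (root / "dropper.sh").write_text("#!/bin/sh\n")  # simulated new file
    changed = drift(root, baseline)

print(clean, changed)
```

On an image that is never supposed to change after it is built, the baseline can be taken once at build time, and every later difference is a candidate alert rather than noise to be tuned away.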

We can do better; there's more we can do here. So let's take it to the next level with load balancing, Docker containers, and an orchestration tool like either ECS or Kubernetes. So you heard about containers from Eric. Do I have to cover those, or are you guys cool? Little containers are nice, Dockerfile; OK, we're good. All right, so these things are optimized for containerized applications. Good examples of these, like I think the top 10 or 12 I found: Node, nginx, Java apps, Postgres, etc. Was yours WordPress? Any containers? Got it, cool. All right, stateless applications are really nice for this, because you can knock these containers over, you load balance them, and your end user doesn't know the

difference. You have a minimal attack surface via Alpine Linux or CoreOS. So have you heard the term living off the land at all? OK, so if you're an adversary and you break into something, you probably want to live off the land: use the tools available to you. So in a Windows environment you're gonna use PowerShell; it's there, right? Well, if I pop a shell in WordPress in Eric's container, I have a PHP shell; it's a PHP interpreter and the environment's there. It makes it harder: you have to live off the land, and you have to live off a narrow, minimal land. You can also mark containers as non-writable, so it's a copy-on-write model and the filesystem shouldn't be written to. So that saves

you a bit too. No shell, no RDP, no NetBIOS, no PowerShell: you can do better, in my opinion. So what does this actually mean, Ben? This is a very simplified example. I don't know, Eric, if it would work for what you're doing, but it would work for a simple service here. So auto scaling group number one on the left has six containers that are running. Let's say it's Amazon ECS or Kubernetes, it doesn't matter; they're running fine on that. The thing up top is a load balancer. The load balancer takes requests in and makes decisions about where in the back end to send them. Since these are Docker containers, we're all good. And then we decide we have to roll new containers, because these containers

you build have a shelf life. Once you commit these things and they're immutable, they have a shelf life: Java is gonna age quickly, OpenSSH is gonna age quickly, whatever you put in there (hoping you're not putting OpenSSH in), but they're gonna age. So you have to roll new ones. In this case, you fire up a new auto scaling group and start putting in the new containers that you have, and they're ready to go. The green traffic is coming in, and we're getting ready to push new traffic, in blue, to auto scaling group 2. The back-end data store is persistent, so that's the only persistent thing in this model. So we can do this all via code and say, hey, we want to roll

to the new stuff. We go ahead and take auto scaling group 1 and start knocking over those containers like cattle, because there's no data in those things; we don't need them. The load balancer is aware of it and is shifting traffic over to auto scaling group 2, and we start to put the load over there until we knock over and destroy all the containers over on the left. The only thing that persists here is the database in the back end. So that's a rolling deploy, which can be done with stateless services during production. It can be done quickly, it's repeatable, and you have immutability in the containers. Can we do better? I think you can do even better than that. So yes, what do you

deliver? Increased uptime, increased flux; you can do this thing all the time. You can scan your Docker registry. So that root of trust, where does this Docker image come from? You have a registry somewhere, and you can scan that registry to make sure your containers are good. There's a tool called Clair that does that, and there are other tools you can get. If you run Nessus against your environment, Nessus is pretty much not Docker-aware, at least not that I'm aware of, but running against your Docker registry with something like Clair will tell you if you're missing certain patches, updates, etc. We talked about reduced attack surface. And you can validate your container integrity: you can do a diff, or you can mark them as

read-only, so the ability to change the container is removed, so it's immutable. Can we do better? I'm taking it even further than this: it's not just containers; you can drive your infrastructure this way. So your security groups, your DNS entries. Like, how often do you change your DKIM or SPF settings? Probably not often, hopefully. So maybe that should be under code control: you should push that out with dual control and have code manage those settings, not a human managing them. Your routes, your databases, your IP addresses, your firewalls. ASDM is a pain in the butt to use for ASA management, but you can get from your ASDM what the changes or deltas are, and you can apply

those via code if you want, and not be inside the GUI wearing your pink sombrero. Changes which are not under dual approval and don't have tests are rogue. So if we do these things via code, meaning APIs: if you have APIs, they have pretty strong authentication, usually a key and a secret at a minimum. But you can do better: you can apply certificate-based authentication to APIs, so you need a certificate, not just an IP whitelist, plus your credentials to have an API do something. That's pretty powerful. So vendors can provide this stuff: APIs, strong authentication, and options for containerization. So when you're talking to your vendors, I would ask them, do you have an API? It's important to me. And

then I would ask them, do you support containers? And if they don't, that may be the case, but it's a good question to ask. You can do really advanced Docker monitoring: there's a tool called Falco from Sysdig, and there's a commercial one called Twistlock. They will monitor the containers, look at processes in the containers, look at the network traffic, and monitor what's not necessarily a blind spot: monitor an environment that's immutable to make sure it's immutable. And if you didn't listen to any of this, which is OK, at least use Docker at home. Try this stuff at home; it's a good way to do a home lab. Brandon at SecDSM gave a talk about something called Huginn,

which is an agent-based monitoring thing for social media, adversaries, etc. It is a pain to set up: I spent about an hour and got really frustrated, and then there's a Docker container I fired up and I was testing in minutes. So it helps for testing. So I went through this pretty quick to try to save a little time, but I'm pretty much down to the end. I think immutability is very powerful; it's a newish technique we should consider as a security defense. Ordinary things can become sensors or honeypots. If you're SSHing in and you have your pink sombrero on, multi-factor that; that would otherwise be rogue, but you can document why you needed to do it. Otherwise, if anyone else is SSHing in,

it's a pen test. Integrity is king: more and more, cryptographic signing to assure integrity has become popular in operating systems. And ruthless automation takes discipline. It's easy to just go ahead in a small environment and install the stuff, but take the discipline to automate that, get it under source code control with dual approval, get it under multi-factor, and watch your snowflakes really closely. So I put the references up; I want to make sure you had those. That's the "Cowboy Coding and the Pink Sombrero" link; the original post is down, so I had to find a repost of it, because it's 8 years old, almost. But I will stop for any questions, because I went through that really fast. Log monitoring? Yeah, it

depends on your environment, that's the simple answer. But if you're in AWS, they have APIs that allow you to get the logs, or you can do instrumentation to do more. So I think there are two kinds of logs that are important. You have all your authentication events; those should be centralized, and Duo is a good way to do that, because if you don't multi-factor and pass Duo, you don't pass muster. So that's one place I would have logs. The second is all those orchestration events that happen. Let's say it's through ECS; you have things done with CloudFormation or an orchestration tool. I would multi-factor that. You're gonna use some kind of CI tool to do that, CircleCI, Jenkins,

whatever else; those logs are really valuable. And then you have things that are running on your containers; you may want to instrument those, so there's Sysdig Falco and other ones, and you have those logs. Where would I put them all? Splunk or an ELK stack, whatever y'all are running in your environment. But that's a really environment-specific question; it's hard to say. The ones that are really important are your authentication events: if you multi-factor them, that becomes an easy way to detect rogue activity. So it's probably not the answer you want, but that's more like, let's go grab a coffee and talk at a whiteboard and we'll figure that out.

OK, thanks for coming. Stay dry.
