
BG - Zero Trust Networks: In Theory and in Practice - Doug Barth & Evan Gilman

BSides Las Vegas · 59:12 · 1.1K views · Published 2017-08
About this talk
BG - Zero Trust Networks: In Theory and in Practice - Doug Barth & Evan Gilman. Breaking Ground, BSidesLV 2017, Tuscany Hotel, July 25, 2017.
Transcript

[Music] Now we have our next talk, by Doug and Evan: Zero Trust Networks, in practice and in theory. Doug and Evan, the floor is yours. Thank you. Hello everybody, can you hear me OK in the back? Awesome. So yeah, as mentioned, my name is Evan Gilman, this is Doug Barth, and we're here to talk about zero trust networks in theory and practice. Doug and I are both SREs; we met at PagerDuty, actually. Doug hails primarily from a software engineering background... do we need to turn this down a little bit? Maybe, here comes some feedback. Doug hails from a software engineering background; I come from a computer networking background.

These days Doug is at Stripe, and I've been spending my time focused on an open source project called SPIFFE. We both joined PagerDuty within a couple months of each other. PagerDuty was still pretty young back then, with a fairly small infrastructure. I'm pretty certain that most people here know what PagerDuty does, so it goes without saying that availability is of the utmost importance for the PagerDuty platform. Availability is the requirement this entire story is based on, and it was the key business driver for implementing zero trust at PagerDuty: this thing must be available at all times.

Contrary to popular belief, a lot of people think we did it for security, but we actually did it for availability. So how did we get started on this whole zero trust thing? It's a little bit of an interesting story. PagerDuty is hosted in multiple cloud providers and multiple regions. In this diagram, each dotted line represents a geographically disparate cloud provider, and we ran active-active across these boundaries. So even though the infrastructure was kind of small at that time, it was still challenging because of this property, and we did it, of course, to meet those availability goals. We had Cassandra, ZooKeeper, and a few services striped across multiple data centers like this.

In doing that, there are lots of third-party networks between these systems, because a lot of this traffic traverses the commodity internet. That presents a pretty large security challenge for us: because of this topology, we want to provide access control and confidentiality on basically the majority of flows within our infrastructure. So this is a big problem we had to tackle, and we decided to tackle the access control problem first. Essentially, all we really wanted was security groups. But security groups are AWS-specific, and we're multi-provider; and even within AWS they're region-specific, and we're cross-region. So we basically wanted to reimplement security groups without actually using security groups.

We needed it to work cross-cloud, and we needed it to work cross-region. What we ended up doing was building some fairly heavyweight iptables automation into Chef, which provided security-group-like semantics based on the Chef role. We defined what's known in Chef land as an LWRP: a custom DSL resource that you would use to write the policy that powered these iptables rules. Chef would then crunch these declared policies, translate them, and generate the iptables rules necessary to realize them. This is a small example snippet of what that policy looks like.

This code would be executed on a web server to allow access from a load balancer, and this thing, basically as-is, provided most of the flexibility we needed in terms of access control. With that problem somewhat addressed, we wanted to turn to the privacy issue. Unlike the iptables side, all the encryption and privacy up until then was configured by hand, which can be pretty hard: for things like intra-cluster communication in Cassandra that don't natively support encryption, you have to jam it in there, and that was pretty painful. What we wanted was blanket encryption that was easy to use and just worked.
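The actual Chef snippet from the slide isn't captured in this transcript, but the idea mentioned a moment ago, a declarative allow-rule compiled into iptables commands with a default deny, can be sketched roughly like this. This is a hypothetical Python illustration, not PagerDuty's LWRP; the addresses and port are made up:

```python
# Hypothetical illustration of the declarative-policy idea: allow-rules
# are compiled into iptables commands, with a default deny at the end.
def compile_policy(rules):
    """Translate declarative allow-rules into iptables command strings."""
    commands = []
    for rule in rules:
        for src in rule["sources"]:
            commands.append(
                f"iptables -A INPUT -s {src} -p tcp "
                f"--dport {rule['port']} -j ACCEPT"
            )
    # Default deny: anything not explicitly allowed is dropped.
    commands.append("iptables -A INPUT -j DROP")
    return commands

# A web server allowing HTTPS from two (made-up) load-balancer addresses.
policy = [{"sources": ["10.0.1.5", "10.0.1.6"], "port": 443}]
for cmd in compile_policy(policy):
    print(cmd)
```

In the real system the source list would be resolved from Chef roles rather than written as literal addresses.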

We wouldn't have to go and turn all these knobs everywhere. Now, to solve this particular privacy problem, a VPN is, I would say, the most popular solution, whether you choose an IPsec VPN or an SSL VPN. But for PagerDuty's architecture, the typical VPN deployment brought a lot of challenges. A VPN is usually deployed as a site-to-site tunnel, and the active-active nature of PagerDuty's infrastructure meant there's a whole lot of cross-DC talk, so these tunnels would be very heavily utilized. That brings scalability concerns, and it also brings availability concerns: how do you run multiple VPN heads, equal-cost multipath routing between them, and so on? It gets a little messy.

Additionally, not all the providers we used gave us VPC-like functionality on all the machines; we couldn't really control their routing tables or carve out network subnets, things like that. The lack of that control forces providers that don't have it into this hub-and-spoke VPN model, and I have a little example down there of what that looks like: we'd have to have a VPN host which all the hosts dialed in to in order to route their traffic to other sites. All of these things combined add a lot of overhead.

For PagerDuty's architecture it didn't make a whole lot of sense, and the funny thing was, we didn't really even care about the routing of packets; we just wanted the security that the VPN brought. So what we did was decide to drop the VPN part and just keep the IPsec part. We ended up deploying raw IPsec in what's known as transport mode. It was configured as a full mesh between all the hosts and would opportunistically turn up when needed. With that in place, the network would go from looking like this to looking more like this, where mutual authentication occurs as soon as the first packet is sent, and encryption is applied completely transparently from then on.

The Linux kernel was configured to drop all packets which were not associated with an active IPsec relationship, so it was not possible to send a packet which was not encrypted. With this, we got all the benefits of VPN security without the VPN itself: without the overlay, without all this other stuff we didn't really want. So with this, confidentiality is solved with IPsec, and access control is mostly solved with iptables, and also a little bit with IPsec, because it was doing mutual cryptographic authentication between all the hosts. When you put these things together, they exhibit some really interesting properties. First, all flows in the network are authenticated and encrypted, hands down, without a doubt.
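A rough sketch of what that mesh might look like, assuming Linux `ip xfrm` transport-mode policies, where `level required` refuses traffic that has no matching IPsec state. This is a hypothetical generator with made-up addresses, not the automated IPsec tooling PagerDuty actually ran:

```python
# Hypothetical sketch: generate Linux "ip xfrm policy" commands for a
# transport-mode IPsec full mesh. "level required" means traffic without
# a matching IPsec state is not allowed to flow in the clear.
from itertools import permutations

def mesh_policies(hosts):
    cmds = []
    for src, dst in permutations(hosts, 2):
        cmds.append(
            f"ip xfrm policy add src {src}/32 dst {dst}/32 dir out "
            f"tmpl proto esp mode transport level required"
        )
    return cmds

policies = mesh_policies(["10.0.0.1", "10.0.0.2", "10.0.0.3"])
print(len(policies))  # 6: one outbound policy per ordered host pair
```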

We could prove it. All flows are asserted as authorized: if there's no policy, it doesn't go anywhere. Additionally, there is no inherent value in an IP address, because all these things are being dynamically calculated; if something changes, we just reconfigure everything and you're off to the races. But perhaps the most interesting set of properties is that this network has no centralized firewall. There's no network gateway that you have to pin traffic through in order to reach one part of the infrastructure from another part, and there's no private network either; there wasn't really a need for it at the time. Everything was secured without the use of a perimeter firewall.

So essentially, what we had done was succeed in building this secure, perimeterless network. It wasn't necessarily the goal when we set out, but that's kind of where we ended up. It was around that time that Google's BeyondCorp paper was published. That paper described Google's perimeterless corporate network; they had been working on it for some number of years, actually, by the time they got around to releasing it. Since that paper was released, a series of papers have come out describing in more detail how the BeyondCorp project functions within Google.

Those papers describe a giant, unmanageable perimeter which had grown too large and too permeable, and which was just not really effective at stopping the modern threat landscape. It was a really exciting paper for us to read, because there were a lot of similar principles there. We looked at the BeyondCorp plan, we looked at what we had built, and we thought, hey, these are very similar ideas, just different applications. That validated a lot of early PagerDuty security decisions for us, which we were obviously happy about, and like I said, it also demonstrated another use case: the client side of this whole thing.

It was essentially a broader application of the very similar philosophies we were using to secure data center traffic, and it really helped us understand the full scope and implications of this model, which we now call the zero trust model. Following that realization, Doug and I gave a pair of talks on the model and the reasoning behind it, and shortly thereafter we actually wrote a book about it. Writing is not something we particularly enjoy, but we were excited to do it, because there wasn't very much out there on this topic and it was very poorly understood by most people. With that, I'd like to hand it over to Doug.

He's going to talk a little bit about the zero trust model itself. Thank you. So let's look at this at a high level: what do we mean when we say zero trust? Probably the best way I've learned to explain it to people is that we assume the network is hostile, and we change our network architecture as a result of that. And what do we change about it? We want to remove trust from the network. So if we're going to remove trust from the network, what does that mean from a systems design perspective?

It means we can't say that your position in the network, or your address in the network, is sufficient for determining authentication or authorization; instead, we have to rethink how we actually handle those two concerns. The end result is that we end up doing deep authentication and authorization for everything, and in our discussions with people, it seems like there are really three things we're trying to authenticate and authorize: users, devices, and applications. Another key idea here is that when we put all these deep security controls in the network, we want least privilege throughout the network. In practice, that means that instead of having one centralized control system, or several centralized control systems, we're going to sprinkle a large number of control systems throughout the network.

Really, we're going to treat the network as a large distributed system, where we have enforcement as close as possible to what we'd call the workload, and we need control mechanisms which can reach out and control all of those enforcement mechanisms. In a zero trust network, I'd say one of the key concepts is that every flow is expected: somewhere we have a database that tells us these are the application flows we've categorized and that should exist, and our network checks that any communication on the network is allowed before it actually transmits it.

I want to make a really careful distinction here: every flow or communication is expected, but this is subtly different from saying every flow or communication is authenticated. You're going to have communication on your network that is unauthenticated, but the path it goes down varies a lot. The way I keep it straight in my head is to think about, say, an unauthenticated user scenario, or maybe an unauthenticated device. Both of these scenarios actually need network communication, but they're shuttled off to somewhere else. In the user's case, it's single sign-on or some sort of user authentication; in the device case, it's something like: we're going to bootstrap the device and reimage it.

Ultimately we let it back on the network with a proper image and credentials. So we're going to have this database, a literal whitelist of network communication. How do we want to capture it? One thing we definitely don't want to do is capture policy in terms of the physical manifestation of the network today. We want to capture it at a higher level, and we like to call this symbolic policy: you're describing your policy and your flows in terms of the logical components in your network, and purposely keeping them divorced from actual implementation details.

The reason we do that is because we expect that in the future our network will change, and if we have good logical descriptions of all communication in the network, we can calculate from that how it should be applied to the network we're actually running today. So when we're capturing this policy, one interesting question is: how do you capture it? Both PagerDuty and Google, among others, tend to capture it as code, in a domain-specific language that's versioned over time. That's really nice, because you have a language that fits well for writing code that deals with the physical implementation.
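As a hypothetical illustration of symbolic policy (not the actual PagerDuty or Google DSL), flows declared between logical roles can be recompiled against whatever inventory exists today; the role names and addresses below are made up:

```python
# Hypothetical sketch of "symbolic policy": flows declared between
# logical roles, compiled against today's inventory into host pairs.
POLICY = [
    ("load-balancer", "web", 443),
    ("web", "database", 5432),
]

INVENTORY = {  # role -> addresses; stands in for a real device inventory
    "load-balancer": ["10.0.1.5"],
    "web": ["10.0.2.5", "10.0.2.6"],
    "database": ["10.0.3.5"],
}

def realize(policy, inventory):
    """Expand role-level flows into concrete (src, dst, port) tuples."""
    flows = []
    for src_role, dst_role, port in policy:
        for src in inventory[src_role]:
            for dst in inventory[dst_role]:
                flows.append((src, dst, port))
    return flows

flows = realize(POLICY, INVENTORY)
print(len(flows))  # 1*2 + 2*1 = 4 concrete flows
```

When the network changes, only `INVENTORY` changes; the symbolic policy stays put and is simply recompiled.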

But it's not really mandatory that it's written as code; it's just that you separate policy from enforcement decisions. One thing that echoes this philosophy really well, and maybe not everyone realizes it, is Amazon security groups. When you define a security group, you're talking about a logical thing: the database security group allows access from the application security group, and you don't really care about the IP addressing. Under the covers, Amazon is obviously keeping track of which virtual machines are database machines and which ones are application machines, and enforcing it for you, but you've described it at a much higher level.

Another key concept as we worked through these ideas: we've said we need to authenticate users, devices, and applications in the system, but, crucially, while authentication can be handled orthogonally for each of those, separately, the authorization has to be the union of them. We want to make sure that when we make a policy decision, whether we allow or disallow communication, it's a combination of all three of these things. To help clarify our thoughts on this, we needed to give it a name, and the name we decided on was the network agent. It's meant to echo a user agent in a browser.

The name doesn't really matter; the idea is that every time we're making an authorization decision, we need to consider all three of these things and have policy that is defined based on their union. That network agent is what we want our policy engine to actually work on. Having this concept of the network agent ultimately lets us better express the policy we want, including the weird edge cases that don't fit well otherwise. An example I've run into myself: I have a corporate-provisioned laptop, and I also have my personal phone, which is allowed on the corporate network.

I have strictly less access on my personal phone than I do on my laptop. If I don't have this network agent in a policy engine to let me write that policy, I end up in these weird situations, like having to not allow the phone to talk to the thing at all. It would just be better if we said that what we're really authorizing is a network agent. Now, we're going to sprinkle these control systems throughout our system, and obviously that's a lot of different control systems to reconfigure as the network changes. So how does this become a thing that actually works? It's automation: pervasive automation throughout the system.
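Going back to the network-agent idea for a second, the laptop-versus-phone situation above could be sketched like this (hypothetical names and policy; a real engine would evaluate much richer attributes than exact-match tuples):

```python
# Hypothetical sketch of a "network agent": authorization considers the
# union of user, device, and application, never any one of them alone.
from dataclasses import dataclass

@dataclass(frozen=True)
class NetworkAgent:
    user: str
    device: str       # e.g. "corporate-laptop", "personal-phone"
    application: str

# Policy keyed on the whole agent: same user, different device,
# strictly different access (the laptop-vs-phone example above).
ALLOWED = {
    NetworkAgent("doug", "corporate-laptop", "admin-console"),
    NetworkAgent("doug", "personal-phone", "email"),
}

def authorize(agent: NetworkAgent) -> bool:
    return agent in ALLOWED

# Same user and application, but the phone is not granted admin access.
print(authorize(NetworkAgent("doug", "personal-phone", "admin-console")))  # False
```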

If we didn't have it, we probably wouldn't do this architecture. Automation really helps us with two things: it makes it possible to do this, but it also makes it easier to do the quote-unquote right thing, the thing that we would like to be true. Really, what we're doing, if you look at that security group example, is continually letting the automation take over enforcing the logical policy we gave it as the system changes. As new things come and old things go, we want a system where, when new virtual machines or new containers come into it, the system just reconfigures itself.

Because we've written and debugged this code, we trust it to do the job that we expect it to do. If we don't have automation, the reality is that, humans being humans, they're going to try to figure out how to cut corners, or not change things as often. So ultimately, we have to let the automation do its job, and we do our job of defining policy. Now, we've talked a lot about enforcement, and I think that's not the full story of zero trust networks. The more interesting, or potentially more interesting, story here is visibility: in a zero trust network it is greatly improved over a traditional network.

Logging of traffic, and analysis of that traffic, is, as you all know, critical for doing risk analysis and forensics of past incidents. The really nice thing is that all these control systems we've dropped throughout the network are really natural points for us to hook in and expose that visibility to the rest of the network. We can do really good logging, we can do eventing on anomalies in the network, and the fact that we've defined this deep policy for the communication we expect means that our alerting story can be driven off of it more aggressively: we can alert on things that aren't supposed to be there.

And if you think far into the future, maybe it eventually becomes a really great feedback system, where we can respond to threats that don't really make sense to us, or that are new within the network, because we have this deep visibility. So, zero trust networks are going to require a whitelist of policy, and our advice to people we've talked to is: definitely start early. I imagine all of you are very receptive to that suggestion, but a lot of people tend to think about it in terms of: well, I'm going to build the system that I have today.

Then, in the future, when the security risk of that system is great, I will go build up my security mechanisms. I think that's probably the wrong way to think about it. Building up a whitelist is a lot easier if you maintain it over time, and if you delay it, the effort just becomes harder, to the point at which it becomes impossible. So we definitely advise people: figure out ways that you can start capturing these flows early, even if they're not the best descriptions. At least it's somewhere to start, so you're not leaving someone with this archaeological effort of trying to figure out why things are the way they are in the future.

And if you're going to start early, you need to have some way to make sure that your whitelist reflects reality. So as soon as possible, once you have this whitelist, flip on enforcement and use that as a driving function for actually maintaining your whitelist over time. I don't think this problem gets any easier, and I imagine you all agree. So those are the broad ideas of what is in a zero trust network. It's hard to visualize that from just the principles, so we're going to dig into two systems designs and try to understand the components that are there, to actually better understand the idea.

So what are the key systems you need to actually build these networks? It's pretty clear you have a few different components. First, we're going to have a control plane and a data plane, and different services are going to exist in the control plane with different responsibilities. I think the two big control plane systems that drive a lot of these network designs are the inventory systems: a user inventory and a device inventory. These systems are meant to be the source of truth for what exists inside the network, and everything else is driven off of that source of truth.

In a lot of these networks, because we're doing automation, configuration management systems can be a very convenient, potentially general-purpose automation framework, and those systems will leverage the inventory systems to actually implement enforcement decisions in the data plane. Finally, the last piece is that we need to authenticate these users and devices, and we want that to be online and rotated frequently, so we're probably going to have some set of authentication services, again leveraging these sources of truth to decide whether or not this person or device is actually authenticated. In concrete terms, this is all familiar stuff: for a user, it's going to be single sign-on.

For a device, it might be some sort of certificate issuance system, but again, the idea is that it has to be online, and it has to be easy and cheap to rotate regularly. Once we have that, we can actually start building systems that can better respond to threats. This is very basic; as these systems grow, you can get way more advanced than this. You could start adding new components that further leverage the sources of truth and try to respond to novel threats. But these are the broad strokes. So let's dig a level deeper and start looking at particular implementations, to better understand how people have approached building these.

When we do this, it's important to realize that this is not the perfect systems design. This is the end result of an organization's goals, coupled with their business needs, resulting in their implementation. We can take lessons from that, but it's not going to be "implement this and everything will be perfect", or even necessarily make sense in your system. One of the things we've found as we talk to people is that there are two different concerns here we need to figure out how to deal with. The first is server-side zero trust networks: how do they look, and how do we build them?

We're going to focus on PagerDuty there, because we feel it's a good example of how PagerDuty evolved their zero trust network in a server-side implementation. Later, Evan's going to talk about the client side, which is what we see in the BeyondCorp design. But let's take on PagerDuty first. PagerDuty, being a startup, did things, like all startups, in the simplest way, the simplest-thing-that-could-possibly-work approach. So our original implementation was Chef cookbooks actually calculating and applying policy to every server in the infrastructure. In Chef, you have this really rich device inventory, because it indexes all sorts of facts about every single device it manages.

Running configuration management everywhere gives you this really nice opportunity: you can write configuration management that calculates policy knowing the set of every device in the system. But naturally, as this system grows, you start running into problems. The big problems we ran into were, first, problems with scaling it, particularly around the time it takes for changes in the infrastructure to roll out to the rest of the system. We also had frustrations with how isolated the system was.

We don't really want our security mechanisms sitting alongside, say, installing a library or something we're giving to another team to configure their servers with. Additionally, we had started working on re-envisioning how the system runs, using containers to dynamically assign workloads to infrastructure, and that kind of exacerbates the issue: you have workloads that can move dynamically between systems, and you certainly don't want to just turn off all the security because they might hop around on a fairly frequent basis. That said, the way we think about this is: we have a system that works, we have known behavior, and we're going to extract it out of this MVP environment and put it in its own new dedicated system.

So we did that, and that system is something we call Topology Manager. We ended up with several different components, some living in the data plane on the servers and some in the control plane, but together they took over responsibilities from Chef and codified something we were already doing, in a way that would support our future system design. So let's dig into that now, starting at the data plane. On the data plane, we ended up having an agent that runs on every server, and that agent had a fairly clear responsibility: manage the network configuration for that server, in response to device inventory changes coming from the control plane.

So anytime a device came into the system or left the system, these agents would be signaled, and they would automatically update each server's configuration in response. You've got to remember that in PagerDuty's case, the network environment we were operating in doesn't give you enough controls to manage even something as basic as security groups in every provider, so we have to do enforcement locally, and that ends up meaning iptables if you're talking about filtering. In a scenario where you have providers that give you better control, it is clearly better to push that enforcement slightly away from the workload.

One of the big things this agent was tasked with configuring was the IPsec policies we talked about before. Again, we have IPsec giving us our device-to-device encryption and authorization guarantees, so when a new server comes into the system, we need to update the other servers to allow it access. If you didn't have an IPsec relationship, all traffic was dropped. So in response to a new node coming online, these agents would reconfigure every server within a few seconds, and that gave us those guarantees.
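A toy sketch of that data-plane agent loop: on every device-inventory change it recomputes this host's peer set. This is hypothetical; the real agent rewrites iptables rules and IPsec policies rather than just tracking a set, and real policy would be narrower than allow-all-known-peers:

```python
# Toy sketch of the data-plane agent: on each device-inventory change it
# recomputes this host's allowed-peer set. (The real agent would rewrite
# iptables and IPsec configuration at this point.)
class Agent:
    def __init__(self, host):
        self.host = host
        self.allowed_peers = set()

    def on_inventory_change(self, inventory):
        # Allow every other known device; real policy would be narrower.
        self.allowed_peers = {d for d in inventory if d != self.host}

agent = Agent("10.0.2.5")
agent.on_inventory_change(["10.0.2.5", "10.0.2.6", "10.0.3.5"])
print(sorted(agent.allowed_peers))  # ['10.0.2.6', '10.0.3.5']
```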

That kind of solved the problem we were hitting with Chef: scaling challenges and its time to converge. But what about the workloads? We're putting things inside containers; how do we handle that? We did two things to help manage that responsibility. The first is that Topology Manager was configured to understand the workloads coming onto a system and, more importantly, the policies that went along with those workloads. If you look at it in today's view, at container schedulers, especially Kubernetes, you might see similar ideas: you have an API that understands what workload is running on this box.

You can leverage the knowledge the container scheduler has to reconfigure the box. This is a similar idea, just a few years prior, so we didn't have Kubernetes to do it, but it's the same idea: I'm going to run service foo, service foo should accept requests from service bar, and therefore reconfigure this server to allow that traffic. Topology Manager handled that case. Another system, which is separate and which we'll talk about a little later, is per-service credentials: we ended up using an open-source product called HashiCorp Vault to manage the generation and distribution of those certs.

But we're going to talk about that a little later, in the control plane, because it feels a bit more natural to talk about it there. So this is the basic data plane: a simple agent that sits there and has some source of truth to know what to do on each server. Let's go up a level and talk about the control plane. One of the interesting things when you think about server side versus client side is that the server side has some different characteristics. To my eyes, one of the big differences is that hosts can be cycled out more easily, and you can generally know what to expect on them.

That's unlike the client side, where people are installing software more or less ad hoc. So we get this great opportunity to make a system that's responsive to changes in that environment. The big thing we really have to guard against is modifications to the layout of the network, so we're going to look at host provisioning as an example of how we manage that. There's also the parallel example of how we manage which workloads run on which machines, but hopefully talking about host provisioning will help you see how you might extend it into workload assignment. This all comes back to an authorized user.

Any change in the control plane's data sources is obviously very sensitive, and it's something we want to manage access to, so we have to lean on normal authorization and trust of users. These users, on an authenticated device, are able to do mutual TLS into the control plane; they have their password, and they've provided a multi-factor token, so we generally trust that they should be able to make the change they want to make. But instead of letting them proactively make the infrastructure they want, they have to signal to the provisioning service: I would like to make some new infrastructure, make some change in the network.

Then that provisioning service takes over responsibility from them for doing all the actions they requested. In the case of provisioning a host, the provisioner is going to create the cloud instance; the user told it which image and which provider to use, but the provisioner is responsible for actually making the API calls and making the requested change a reality. Once that instance is created, the provisioner is in a good position to register it into the device inventory system and capture all the key data it needs, in a way that we trust because it's not self-reported from the device. So it can capture IP address, device type, and all that kind of information.
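The provisioning flow just described might be sketched like this. It is an in-memory stand-in with made-up names; the real provisioner calls cloud-provider APIs and a real inventory service:

```python
# In-memory stand-in for the provisioning flow: the user signals intent,
# the provisioner creates the instance, registers it in the device
# inventory itself (so the data is not self-reported), and the change
# is pushed down to the data-plane agents.
class Provisioner:
    def __init__(self, inventory, agents):
        self.inventory = inventory  # device inventory: the source of truth
        self.agents = agents        # callbacks standing in for agent pushes

    def provision(self, image, provider):
        # A real implementation would call the cloud provider's API here;
        # we fabricate an address for the created instance.
        instance = {"image": image, "provider": provider,
                    "ip": f"10.0.9.{len(self.inventory) + 1}"}
        self.inventory.append(instance)
        for notify in self.agents:
            notify(self.inventory)  # push the inventory change down
        return instance

seen = []
prov = Provisioner(inventory=[], agents=[lambda inv: seen.append(len(inv))])
prov.provision("ubuntu-16.04", "hypothetical-cloud")
print(seen)  # [1]: the agent saw the one-device inventory
```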

When we update that device inventory, that gets pushed down to the agents, and we know those agents will respond and reconfigure themselves to allow this new trusted instance to come into the network: change their IPsec configuration, change their iptables, to let that host communicate. The next key thing this provisioner does is talk to something we're calling the user inventory system, which is actually a HashiCorp Vault system. What this is responsible for: we're using credential rotation as a way to identify users, or really the applications and services running inside the network.

services their own unique credentials that we've provisioned for them in response to them coming online so when when an app that is set up to deal with this system comes up that contacts faults using a one-time credential that the provision are put on the device so it could boot and then when vault actually receives that request it ends up making accounts for it on any dependent Services so imagine you're talking to a database it would go and create like a per workload database user with the proper credentials and gives that to the app and that is only accessible for as long as that app is running but apps need to know how to talk to evolve to like kind of
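As an illustration of that flow, here is a toy model of the one-time boot credential and the per-workload account it is redeemed for. This is not PagerDuty's actual code and not the real Vault API; every name in it is invented for the sketch.

```python
import secrets

class CredentialBroker:
    """Toy model of the Vault-based flow described in the talk: the
    provisioner seeds a one-time boot credential, and the workload
    redeems it exactly once for its own per-workload database account."""

    def __init__(self):
        self._one_time_tokens = {}   # token -> workload name
        self._db_accounts = {}       # workload name -> (user, password)

    def issue_boot_token(self, workload: str) -> str:
        # Called by the provisioner when it creates the instance.
        token = secrets.token_urlsafe(16)
        self._one_time_tokens[token] = workload
        return token

    def redeem(self, token: str) -> tuple:
        # Called by the workload at boot.  The token is single-use:
        # a second redemption (e.g. a stolen token) raises KeyError.
        workload = self._one_time_tokens.pop(token)
        creds = (f"db-{workload}-{secrets.token_hex(4)}",
                 secrets.token_urlsafe(12))
        self._db_accounts[workload] = creds
        return creds

    def revoke(self, workload: str) -> None:
        # When the app stops, its unique account disappears with it.
        self._db_accounts.pop(workload, None)
```

The single-use property is the point: the boot credential is worthless after the workload has claimed its identity, so copying it off the disk later gains an attacker nothing.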

We do have instances where a legacy service isn't able to communicate with Vault on its own to get its credentials the right way, and in that case we end up leaning on configuration management to still provide credentials to those legacy services, backed by Vault but handled differently: instead of the app proactively making an API call, configuration management puts a credential on disk for the app to read. That's simply meant as a way to provide some compatibility. And that's the basic shape of the system. The thing I find interesting here is that configuration management is still something we want in our network; we're not trying to say don't use it, but we've removed some responsibilities from it and put them in something that's a bit more baked, a bit more well-scoped and defined, now that we have the experience of running a different system.

So what have we accomplished with all this? I would argue we've accomplished removing trust from the network. You don't see any discussion here of particular IP addresses in policy, or of carving workloads up into subnets; instead we're letting this device inventory help us create trusted communication between devices in the network.
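To make that inventory-driven reconfiguration concrete, here is a hedged sketch of how an agent might turn a pushed inventory update into host-level allow rules. The field names are hypothetical, and real agents would translate the computed delta into actual IPsec and iptables changes rather than returning sets.

```python
def allow_rules(inventory):
    """Derive the host-level allow list from the device inventory.
    Each trusted device contributes one rule; nothing is allowed
    that the inventory doesn't vouch for."""
    return {(d["ip"], d["role"]) for d in inventory if d["trusted"]}

def reconcile(current_rules, inventory):
    """What an agent does when a new inventory is pushed down:
    compute which rules to add and which to tear down."""
    desired = allow_rules(inventory)
    return desired - current_rules, current_rules - desired  # (add, remove)
```

The useful property is that the agent is purely reactive: it never decides on its own whom to trust, it only reconciles toward whatever the inventory, populated by the provisioner, says.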

As a result, we don't end up having perimeter security devices; instead we have this perimeter-less network, and that gave the company a really great benefit, which is that this multi-cloud deployment was much more tractable than it would be otherwise. We're not thinking about the problem in terms of inter-DC connectivity, we don't have to turn up VPN heads, we just treat it as encrypted packets, which the infrastructure can already route. That's really cool and useful because it gives us an agility we wouldn't have otherwise: we can turn data centers up and off rather quickly. In one case I know of, it took six weeks from saying "we're going to turn up a brand new provider" to running a third of our workload in that new provider, and that was only possible because we had all of these rich systems to support it. All we really had to do was teach the provisioner about the new data center and its API calls, and because we don't have a whole lot of requirements for the provider, we can run workloads there. And lastly, in these zero trust networks we want every single communication to have strong authentication and authorization, and the encryption and filtering

mechanisms we had in place allowed us to say with certainty that these communications are always secured. User management and authentication are automated with Vault, we're able to rotate credentials very quickly, and it all came together to feel like a very natural and really operable system. So that's the server side, and now I'm going to bring up Evan to talk about the client side, in particular the example of Google BeyondCorp.

Thank you, Doug. All right, so the client-side examples are very different from the server-side examples, for a handful of reasons. The first is just that clients are wild: they often act in unexpected ways. Another difficulty comes

in their mobility: they travel around a lot, and you often can't predict where a session is going to come from, which means that occasionally you have to open services to the Internet to handle that. And finally, clients almost always act like hybrid zero trust clients, in that they're almost certainly going to interact with non-zero-trust resources in addition to the ones you can protect under your corporate umbrella. Because of that, and also because device theft is easy, user authentication is particularly important in client-side implementations.

So we looked at Google BeyondCorp as kind of the prime example of a client-side zero trust implementation. It's relatively mature, and they've been very kind in publishing papers on it, but there's still some secrecy there, so it's hard to see all the implementation details; still, we get to know some things. At the beginning they describe their journey toward BeyondCorp: a very, very large corporate network, which of course requires an even larger perimeter. Tens of thousands of users accessing thousands of resources, maybe hundreds of thousands of users and tens of thousands of resources, we can't really be sure, maybe even larger than that. They have many, many remote employees, and not just

full-time employees, but people accessing from home after hours, and things of this nature. One thought exercise, too: how many visitors do you think a Google campus might see in a given day, worldwide? Probably a whole lot of visitors, right? And this all worked okay when the perimeter network surrounded a building that you had to badge into, where everything was secure and you had to go there to perform your work, but how true is that assumption today? Framed like this, it's pretty clear that the perimeter systems have not aged well, or scaled well for that matter. The perimeter is just too permeable, and once inside, elevated access can be enjoyed by most folks. Google recognized this, they saw the model falling apart, and that was the impetus to launch the BeyondCorp project. BeyondCorp aimed to move the entire corporate network to this perimeter-less zero trust model, removing the perimeters and removing trust from the corporate network, which would alleviate a lot of the remote-access issues they were facing. So I'm going to talk a little bit about the particular BeyondCorp implementation that Google did, and for that it's natural to start with the client. So

we have a client here, and the client is a user coupled with a device, which powers that network agent idea Doug talked about before. When a request is first made, the user gets sent to the IDP for authentication, the same way you would normally authenticate a user: username, password, TOTP, hardware token, whatever it is. It's basically regular SSO stuff, nothing fancy. After authenticating with the IDP, the user gets kicked back to this thing they call the access proxy, and when doing so, the client device presents a device certificate, which the access proxy uses to authenticate the device that's calling it. So the client negotiates a mutual TLS connection, and then the proxy authenticates not just the device but also the IDP information passed through from the client. It takes these two bits of information, does the authentication directly, and then kicks them up to the control plane to actually authorize the request: it passes the identifiers about the device, about the client, about the request itself, pretty much the entire context of the request being made. When the authorization decision comes back, the proxy enforces it, so you have that enforcement component, and then it forwards the request to a backend service over an additional mutual TLS connection. The difference I'd like to draw out here, though, is that the relationship between the access proxy and the backend is not what we would call a zero trust relationship, unlike the relationship from the corporate client to the access proxy. So there's a line of delineation between the zero trust, perimeter-less architecture and the traditional perimeter architecture, despite the fact that it uses mutual TLS. And that's pretty much it for the data plane in BeyondCorp, it's fairly simple, so let's have a look at the control plane.

We have a user inventory, which is similar to what people might have now; it can be backed by pretty much anything, but we record lots of rich metadata here: perhaps the department a user works for, their role in the company, which devices they've been issued, maybe public keys, things like this. This user inventory backs several other services, one of which is the SSO I talked about earlier: SSO uses the user inventory as its source of truth, and like I said, it's regular basic SSO, nothing fancy. Once the user gets kicked back to the access proxy, the access proxy authenticates the device. For devices, we also have the device inventory, the other source of truth here, which Doug also covered. Device inventory is particularly challenging in the physical world, particularly if you're tracking serial numbers or combinations of parts across different machines, because you end up swapping these things around, and the inventory can be challenging to maintain in terms of data accuracy. There are many different types of devices and it can be hard to keep track of them all, but it's still a really important piece, and we have to have it despite the challenges.

Here's where we have some divergence from the PagerDuty implementation: rather than the device inventory pushing configuration data out and notifying things to reconfigure themselves, the device inventory pushes data to something Google calls the access control engine, which we've called the policy engine when you heard Doug talk about it. The access control engine handles all the authorization within the network: it pulls data from both the user inventory and the device inventory, and then takes that information together with the context of the request being made in order to make an authorization decision. It considers all these things, and the access control engine can be loaded with both coarse-grained policy and fine-grained policy, and those can be layered. So while the access proxy can validate authentication, those identifiers still get passed up for the ultimate authorization decision to be made elsewhere, in this access control engine. And the access control engine is really flexible and can take a variety of inputs.
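A rough sketch of the kind of decision such an engine makes, with a coarse-grained layer stacked under a fine-grained one. The data shapes and field names are invented for illustration; Google's actual engine is far richer.

```python
def authorize(request, user_inventory, device_inventory):
    """Toy access-control-engine decision in the BeyondCorp style:
    the proxy has already *authenticated* the user and device, and this
    layer pulls both inventories plus the request context to make the
    *authorization* decision.  All names here are illustrative."""
    user = user_inventory.get(request["user"])
    device = device_inventory.get(request["device"])
    if user is None or device is None:
        return False                  # unknown principal or device
    # Coarse-grained layer: only issued, managed devices at all.
    if not device["managed"]:
        return False
    # Fine-grained layer: per-resource policy on top of the coarse one.
    return request["resource"] in user["allowed_resources"]
```

Note that the proxy's job is only enforcement; the decision itself lives here, next to the inventories, which is what lets policy evolve without touching every enforcement point.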

In fact, I think Google's implementation puts a pipeline in front of it, with this data hose of information flowing through it, so the systems stay fresh. You can see how you get the zero trust guarantees through that client and access proxy relationship, and even though this deployment is very, very different from the PagerDuty deployment, it accomplishes very similar things, many of the same outcomes. There's no trust in the corporate network in the BeyondCorp model; in other words, the corporate network is really no different from the Internet. The access you get sitting inside a Google building with BeyondCorp is the same access you get from a coffee shop or at home, no different, and as a result there aren't really any stringent requirements on the perimeter there, it's pretty wide open. It also freed users of VPN requirements: what are you dialing into, if the network you're dialing into is still untrusted? So now people can travel wherever they need to, which gives them a kind of Internet mobility: go anywhere and have the same level of access you would have in the office, without dealing with clunky VPN software. Visitors and intruders at corporate headquarters don't get access they shouldn't have, things like that. And finally, similar to the server-side deployment, all requests are strongly authenticated and authorized, every single one. Google's proxy has the advantage of operating at layer seven, so they can, and do, authorize on a request-by-request basis, and that's distinctly different from the data center deployment PagerDuty has done, where we authorize flows. You can see how different these two approaches are, yet they still achieve the same basic goals: the same guarantees Doug spoke about from the PagerDuty system are present here in BeyondCorp too. So it becomes apparent that these are two sides of the same coin, a similar philosophy applied to different applications and different business needs. All that said, we can actually do even better than

this. All the stuff we just discussed is what we'd consider the minimum bar for a zero trust network; you can get way fancier than this. Mature zero trust networks take a lot of other things into account, namely risk, and BeyondCorp has indeed taken steps into this domain. They've introduced what they call a trust inference engine, what we call a trust engine, and this service basically calculates the risk of authorizing a particular request based on a number of factors. It can examine the device making the request and ask, has this device been patched against such-and-such CVE, yes or no, and all those inputs we can introspect get added up to create a score, a confidence score. That score is used in addition to traditional binary policy, which helps the policy writer catch unknown unknowns: the policy can state traditional constraints plus a minimum score, and this helps keep suspicious requests out, ones which would otherwise just be accepted because we haven't written an explicit policy to check for them. As an additional concession, the access control engine actually passes the score back down to the access proxy, and the access proxy injects the score, along with a bunch of other information, as a layer-seven tag or header, depending on the backend protocol. This allows the backend to make fine-grained authorization decisions based on how risky the control plane has deemed the request. So you can say: this person is an admin, okay, but there's this extra-sensitive function on this admin page, and how risky is this particular request? Maybe we'll choose not to permit it. You can take this one step further and integrate behavioral heuristics, which I think is kind of cool. For device behavior, perhaps you use sFlow, or a similar network sampling protocol, and detect behavioral anomalies based on the flow data coming from different devices. For users, perhaps it's just regular user accounting, watching things like successful or failed logins over time and doing anomaly detection there. Both of these pieces of information can give really good confidence signals, or scores, and you take all these things and adjust the score appropriately, so things that are particularly risky may be denied. That's actually a pretty hard UX challenge, though: how do you explain why a request was denied? "Well, your score was too low." Okay, how do I fix that? So it's really, really important to pay attention to the UX.

The experience people have when interacting with these zero trust networks matters, and not just in daily operation of the network but also in the migration from your typical perimeter, old-school network to the zero trust paradigm. Exceptions will be normal, and you should definitely plan for that; this is particularly the case if you're doing scoring right out of the gate. But the good news is that zero trust generally improves the user experience, which is interesting, because security very rarely does that. Because authentication and authorization and all these layers are so automated, the only thing the user ever really sees is that typical authentication that comes through SSO or whatever; they don't have to fire up a VPN, they don't have to muck with all this other stuff, it all just happens under the covers for them. So in the end, these networks are actually easier to use than the networks most of us build today.

Now, the reality is that all this stuff is pretty new, and some of it isn't even realized yet. For instance, we're not aware of a client-side and a server-side implementation in the same organization. Furthermore, Doug and I have only built the server-side stuff, we haven't really built the client-side stuff, and as a result we've made dozens

of phone calls during the writing of the book and around the implementation of all this stuff, dozens and dozens of phone calls with practitioners, researchers, CSOs, you name it. We've spoken to them and picked their brains, because the reality is that as an industry we still don't really have all the answers for zero trust architecture. It's still pretty new, and it's mostly roll-your-own: the folks who are building these things have all built them in house. There are some good tools out there, some good building blocks you can leverage, but you still have to put them together and arrange them in these particular ways in order to get the guarantees you're looking for. The vision is really big, which means it's going to require multiple pieces of software working in concert to deliver those guarantees. I can say fairly confidently that I highly doubt there will be a single piece of software you can just install to get all of this, so there's plenty of room for lots and lots of tools and approaches here. It's still very, very early, there's lots of opportunity, and we think it's really promising.

So who's doing it, in addition to the two examples we just went through? We'll break it down again into client-side and server-side implementations, because we

don't really know anyone doing both. In terms of client-side implementations, companies like Coca-Cola and Mazda, and of course Google, which we spoke about, are all doing zero trust client-side implementations. Coca-Cola's driver is rapid provisioning of branch offices: when they want to turn up a branch office, they don't have to have an IT guy there to phone up and configure the VPN and this and that; they just get a regular public Internet hotspot and everyone's off to the races, really, really easy. Mazda is using it for fleet phone-homes: all their cars in the field are basically rolling computers, periodically phoning back to massive infrastructure, and how do you secure those connections over cellular, where you don't control the network at all? So they're taking a zero trust approach for the client-side phone-homes from their cars. And Google did it for raw security and scalability: they see a lot of challenges most of us don't see, they bump up against those limits sooner than other people do, and so they saw the merits of the architecture and ran with it.

On the server side we have companies like Lyft, Square, and of course PagerDuty, which we spoke about already. And if you've ever used Docker Swarm, under the covers the Docker Swarm network model is very

similar in principle: it doesn't really trust the network, everything is mutual TLS, and it's all either injected by humans or orchestrated in other ways. If you dig into the network stack on Docker Swarm, you'll see some of these principles coming up. Most of these companies adopted it for raw security; PagerDuty adopted it for that multi-cloud mobility, being able to turn up workloads anywhere, somewhat similar to the Coca-Cola case. But all of these companies have rolled their own zero trust semantics. There are, however, some commercial products and options available. Aporeto, SPIFFE, and Edgewise are working on zero trust problems in the data center, so server-to-server zero trust. A company called Cryptzone, which I think was just acquired, along with Vidder and Waverly Labs, are working on client-side zero trust problems, and they're doing it through a standard called SDP, software-defined perimeter, a new standard being shepherded by the Cloud Security Alliance for performing client access in a zero trust way. Then additionally we have ScaleFT and Duo, both also working on client-side zero trust problems, and they're doing it BeyondCorp-style, very much echoing the architectural deployment of BeyondCorp itself. So all of these providers, all these companies, offer zero trust components which can be used to build a

zero trust network, but as I mentioned, there's still no end-to-end solution, and I doubt there ever will be. None of these things are mutually exclusive; they can be mixed and matched and spread all over the place. So taking all this together, it's not hard to see that the perimeter is just toxic: it undermines system security in a really bad way. People hide behind it, they feel safe, and they don't implement the controls they should be implementing. The industry in general is moving toward really strong and deep authentication and authorization across the board, in order to compensate for the holes that keep popping up in the perimeter. We can't really trust the stuff behind the perimeter anymore, not like we used to, and we feel pretty strongly that the industry is converging on this model. You can already see it happening with a bunch of the movement going on; some products are more ambitious than others, but like I said, the scope is just way too large for a single product. The end result, the undersell of all this stuff, is simply a more secure and more operable system, one that is easier to operate and gets in the way a lot less, quite frankly. So I'd encourage you to keep an eye out:

things are happening every single week in this field, it's very rapidly evolving. If you're interested in talking more, we'll be around for a while, we'll be here tomorrow too; I'm happy to talk, go out for drinks, whatever you want. But that's all we've got for you today, thank you very much for coming and listening. [Applause] We'll get Doug back up, and we can do a Q&A if anyone has questions. Contacts? Yeah, sorry, I can give you our information afterwards if you want.

Q: A bit of a logistical question, but in the client model of zero trust, how do you manage endpoints, or do you manage endpoints, or do you even care to? You're talking about health checks, like "you must be this tall to ride this ride." How do you do that on an unmanaged system, or how do you manage a system in a zero trust client?

A: I think that client management is mandatory. It's really, really hard to assert the state of a device, or even collect any information about it, when there's no management, and forget about loading device keys onto the thing: you have to have some assurance that they can be stored securely, and all that. So endpoint management is required, and the "how tall are you to ride this ride" checks usually come from reporting coming out of that endpoint monitoring. The caveat being, it's endpoint monitoring, so you take it with a bit of a grain of salt: that rich information can certainly be used to your advantage, but it's not to be considered a strong source of truth the way the device inventory or user inventory might be. Does that answer your question?

Q: Just an FYI about a company that's attempting to do both client- and server-side: Palo Alto.

A: Mm-hmm, we have spoken a bit to John Kindervag, the field CTO there. Yeah, they're doing some interesting stuff, for sure. Do you know if anyone has deployed it like that? I'd like to talk to you afterwards if that's okay. That'd be great.

Q: Just a question about defense in depth. One of the things you lose in the client-side model is the perimeter, so the SSO is done over an untrusted medium. What are your thoughts on whether we can trust the security of that communication and that authentication server, and whether we lose some security by moving to this model?

A: Yeah, sure, that's a fair point. I can see a scenario where you say joining a VPN is a good additional layer of defense, but the problem is you end up burdening the UX: it's just really rough for users, and I'm not sure how you would add that second layer back without giving users more pain. For those publicly facing things it's not super great, but you do kind of just lean on mutual TLS, which means that the attack

surface is reduced to your TLS library. The SDP protocol I spoke about earlier, software-defined perimeter, gets around this by using what's called pre-authorization: they use a protocol called single packet authorization, which you might have heard of, where they sign a UDP packet and squeak it out onto the network. When the authentication service receives this UDP packet, it validates the signature and then pokes a very granular hole in a network ACL, saying "from this source IP to me, I'm going to allow TLS for the next 30 seconds," and then mutual TLS follows. So there are some ways you can layer things there without going whole-hog back to a perimeter.
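A minimal sketch of that sign-then-verify shape, using an HMAC over a timestamped payload for simplicity. Real SPA implementations such as fwknop carry more fields and can use asymmetric signatures; the packet layout and shared key here are invented for the example.

```python
import hmac, hashlib, struct, time

KEY = b"client-provisioned-shared-secret"   # placed on the device ahead of time

def spa_packet(client_id: int, key: bytes = KEY) -> bytes:
    """Build a simplified single-packet-authorization payload:
    a client id and a timestamp, signed with an HMAC."""
    body = struct.pack("!IQ", client_id, int(time.time()))
    return body + hmac.new(key, body, hashlib.sha256).digest()

def verify_spa(packet: bytes, key: bytes = KEY, window: int = 30) -> bool:
    """The gatekeeper's side: check the signature and freshness.  On
    success, the real system would poke a short-lived hole in the ACL
    for the sender's source IP and expect mutual TLS to follow."""
    body, mac = packet[:12], packet[12:]
    if not hmac.compare_digest(mac, hmac.new(key, body, hashlib.sha256).digest()):
        return False
    _, ts = struct.unpack("!IQ", body)
    return abs(time.time() - ts) <= window
```

Because the service never responds to an invalid packet, an unauthenticated scanner sees a closed port: the service is effectively dark until a valid signed packet arrives.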

The challenge, though, is that the clients move around and you don't necessarily know where they're going to be, unlike the server side, where you can poke that hole knowing exactly which IP address you gave the server. Does that make sense?

Q: Thanks for a great talk. My question is regarding race conditions and split-brain scenarios in server-to-server zero trust networks. Did you experience any issues with that at PagerDuty, and do you have any suggestions for solutions in those cases? Imagine, say, spinning up new instances that haven't yet been added to the ACLs, or data centers going down with centralized management.

A: The underlying libraries definitely experienced problems, probably protocol-level split-brain issues. Architecturally, though, there's a nice benefit: since you don't have data-center-to-data-center communication as a special case, it's just host to host, so in general you can treat it as any host failing to communicate on the network, and usually our response to those types of failures was a missing timeout somewhere that the application should have handled.

Q: Mainly on the provisioning side, I'm assuming you use Serf for the gossip communication, and the network protocol guarantees that eventually it will converge?

A: Yeah, we did run into

some of those challenges, especially when things were still all built into Chef: it took time for changes to roll out, things would come up and not work right, and you had to wait for all these Chef runs to complete. That was a lot of the justification for moving this out to a dedicated system; the topology manager ended up fully converging across the entire fleet in a couple hundred milliseconds. With that guarantee, especially with containerized workloads and such, we greased the wheels a whole lot and absolved ourselves of those issues of "hey, the workload came up but it can't talk to anyone because the network hasn't been reconfigured yet." In terms of split-brain, data center failure, things like that, those issues get pushed onto the inventories: you have to make sure the device inventory is HA. The way we did it was strong consistency within each data center and async replication from a designated master out to the rest of the data centers, so data-center-local workloads would look to their local device inventory, which received updates from that master, and we built some fairly clever tooling within Chef to be able to say "okay, now this one is the master," and everything would flip over. So those problems

kind of get pushed into the inventory systems rather than the zero trust architecture as a whole.

Q: You mentioned standing up a new data center really quickly. What kind of auth metrics and unique identifiers are you using when those machines check in, to convince the new data center that this is in fact one of the properly whitelisted machines?

A: Right, so you're asking about secure introduction and attestation. First of all, when we add a new data center, the way we do that is we basically teach the provisioner about the new data center and how to make API calls into it. When the provisioner turns up a new instance, information gets kicked back synchronously with that API call, usually an IP address, a name, which image it was booted with, that sort of metadata, and that's what we send into the device inventory. So when things come up, we can cross-reference them with what the provisioner got back from its API call: do these things match up? The project I'm working on right now, an open source project called SPIFFE, is aiming to solve this in a much stronger way, with secure introduction and platform-specific node attestation, so keep an eye out for that. Thank you, everybody.
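As a footnote to that last answer, the cross-referencing described there can be sketched in a few lines. This is an illustration of the idea only, with made-up field names, not PagerDuty's implementation or the SPIFFE attestation flow.

```python
def register_instance(api_response, device_inventory):
    """What the provisioner does: record the metadata the cloud API
    returned synchronously (IP, name, image) as the trusted record,
    since it did not come self-reported from the device."""
    device_inventory[api_response["name"]] = {
        "ip": api_response["ip"], "image": api_response["image"]}

def verify_checkin(claim, device_inventory):
    """When the instance checks in, cross-reference what it claims
    about itself against what the provisioner got back from the API.
    Only matching, pre-registered instances are trusted."""
    expected = device_inventory.get(claim["name"])
    return expected is not None and expected["ip"] == claim["ip"]
```

The key point is the direction of trust: the inventory record comes from the provider's API response, so an instance cannot talk its way into the network just by asserting an identity.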