
Welcome, everyone. Today I'm going to present on Cloudflare's serverless deployment of osquery and how we explore big datasets. My name is Geller Bedoya; I'm a security engineer at Cloudflare on the infrastructure security team. We focus on security challenges at the hardware, firmware, and Linux layers, and we're also a build-focused team, creating services where necessary.
What is Cloudflare? It's a globally distributed anycast network. It's in 200 cities in more than 90 countries, it serves 26 million Internet properties, and 10 percent of the Fortune 1000 are paying customers. The larger the network, the better the performance and security we can deliver to our customers. Cloudflare's mission is to help build a faster and more secure Internet. We cache content around the world in hundreds of data centers and determine the fastest delivery, and we provide unmetered distributed denial-of-service mitigation capabilities as well. Our engineers are implementing the latest web optimization techniques, for example TLS 1.3 with zero round-trip time, HTTP/2 plus QUIC, and a serverless platform called Workers. Cloudflare also provides the fastest public DNS resolver.

With so many services running in all these data centers, Cloudflare has a large attack surface. How can we gain visibility into malicious network activity to monitor bad actors, and how would we properly triage and respond to these security events? In other words, the team needed to know what was happening on the edge data centers. We needed a standardized way to collect information and analyze it in an isolated environment. osquery solves these concerns. osquery was originally developed by Facebook a few years ago to monitor their infrastructure; it's now a Linux Foundation open source project with hundreds of contributors and thousands of commits, making it one of the most successful security-focused projects on GitHub.
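As a taste of that interface, since osquery speaks plain SQL, here is a standard example query (a common osquery illustration, not one of our scheduled queries) that surfaces processes still running after their binary was deleted from disk:

```sql
-- Processes whose executable is no longer on disk: a classic
-- osquery hunt for malware that tries to live only in memory.
SELECT pid, name, path
FROM processes
WHERE on_disk = 0;
```
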
Specifically, it exposes an operating system as a high-performance database, and its design allows one to write SQL-based queries to explore an operating system efficiently and easily. In effect, it abstracts concepts such as running processes, kernel modules, and sockets through these queries, and by using scheduled queries you can create an infrastructure baseline and use that to detect anything anomalous.

To scale osquery, there are a couple of phases, one of them being endpoint configuration and inspection. Cloudflare engineers leverage SaltStack, which is built on Python, to create event-driven automation to deploy code. In other words, when a data center is created, SaltStack is actually what configures the BIOS, the firmware, the
encrypted file systems, and deploys the running services in what is called a highstate. There are thousands of states that are executed; in other words, SaltStack powers Cloudflare, and so we needed to create our own osquery highstate. As an aside, Adobe has released an osquery compliance framework for SaltStack-managed servers called HubbleStack, and I recommend looking at that project.

This is an osquery Salt state, and it's what orchestrates the installation of osquery on the edge servers. On the left side we have state information: on line 26 you can see that a package is being installed, you can define how the service runs, and you define the files that are necessary for its configuration. You can see that YAML is serialized to JSON, which is what osquery requires. You can also see that we install extensions; on line 50 this is a Prometheus metrics exporter, and then we configure the extensions as well. On the right side you have pillar data, which is YAML and easy to read. Both of these screenshots are just YAML and Python, so if you want to configure an orchestration, you do that through Python and YAML. These highstates get executed every 15, 30, or 45 minutes, and this deploys code to the edge. The SaltStack repository is Cloudflare's most popular project, averaging about a hundred commits per day.
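For anyone who hasn't written Salt states, here is a heavily simplified sketch of what such an osquery state can look like; the package name, paths, and pillar key are illustrative, not Cloudflare's actual state:

```yaml
# Illustrative osquery Salt state: install the package, serialize the
# YAML pillar config to the JSON that osquery requires, and keep the
# daemon running. All names and paths here are hypothetical.
osquery:
  pkg.installed: []

/etc/osquery/osquery.conf:
  file.serialize:
    - formatter: json
    - dataset_pillar: osquery:config
    - require:
      - pkg: osquery

osqueryd:
  service.running:
    - enable: True
    - watch:
      - file: /etc/osquery/osquery.conf
```
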
We need to be confident in, and ensure the stability of, these highstates. At the top you see a TeamCity build, which installs a Cloudflare-specific variant of osquery. Below it is a build of the Go Prometheus metrics exporter, which monitors the health of the daemon, and what we have captured below that are the operational checks that happen on Salt. Because Salt deploys code frequently, we need to ensure the services pass tests.

Phase two of scaling osquery is really about data transport and storage. We previously talked about endpoint configuration and inspection, such as the Prometheus exporter and the extensions, but the
most important parts are actually how you handle this data. To achieve this, we created Cenote. A cenote is a sinkhole that uncovers the groundwater underneath it, and the Mayans believed it to be a portal to another cosmos. Cenote is a GCP-based serverless osquery backend API developed in Go. Its serverless nature means it autoscales: when Cloudflare expands its anycast network, we automatically deploy osquery, and Cenote autoscales with whatever data it receives. It means we don't manage servers, which is great because, as security engineers, we can just focus on the actual events.

On the left you see the osquery logo, which represents Cloudflare's network. It sends telemetry to Cenote over a TLS 1.3 connection to a function, which performs the business logic. That function communicates with Redis, and it also sends messages to a topic. The topic allows multiple subscribers, and we have several subscribers that are specifically focused on detection: one is Cenote detection; another is Stackdriver, which is real-time log management; the third is BigQuery, which allows analysis of the data; and the fourth integration is Cloudflare's SIEM. I just want to mention that osquery is a single element of how we protect our edge servers; there are other strategies, such as secure boot. In practice, the publish/subscribe model of Cenote means we can provide many-to-many asynchronous messaging that decouples the senders and the receivers.
It creates highly available communication between these independent applications, or microservices. Cenote specifically leverages Google's BigQuery for both detection and performance queries. We also have cold storage, we use Memcache, we use Google Cloud Functions, and then we push this data to a SIEM as well; you can also talk to third-party APIs if necessary.

One of the advantages of serverless is its performance characteristics, so here we've defined a test case with 500, 1,000, 5,000, and 10,000 concurrent connections to the osquery log API. On the left you can see the number of invocations per second of these Cloud Functions, and on the right you can see the number of active instances. When 10,000 connections were made, Cenote invoked about 250 instances of the log function. What you see on the right is that Google invokes these functions but they remain active afterward: a serverless function has a concept of a cold start, during which it sets up networking, sets up the runtime, and caches global objects. What this shows is that about 100 to 150 active instances is plenty of compute power for 500 or even 10,000 connections. It also means you should make sure your business logic is well defined, so a function does one job very well and
efficiently. What I've captured here is both a pro and a con of serverless design: it's actually a memory leak in a Google Function, because I forgot to close a Redis connection. What's great is that even when this error condition occurs, the serverless platform just immediately invokes another function, and by using Pub/Sub you have high availability and an assurance that a message is actually received. So even when the error condition was met, everything operated fine. What we also see here is the buggy code and then the code that was fixed: initially the memory usage would just continually increase, and after the fix the memory profile of that function was around 50 or so megabytes.

One of the considerations you have to take into account is Google's constraints: Cloud Functions have their own limitations, and Stackdriver has its own. We encountered this particular RPC error message when we were writing over a thousand messages per minute, and what that means is that you just have to come up with creative ways to solve these problems. One constraint to be aware of is that a Google Function can only run for five minutes, so you want to make sure your business logic is, again, well defined, and there are tricks to optimizing network connections.
Another subscriber of this data is Stackdriver. Stackdriver is Google's observability tool; it allows you to monitor your infrastructure and improve application performance. In other words, it collects signals from the running gRPC services and then creates graphs, or pushes them into this real-time log management. This can be used for manual investigation. Here you can see a node enrollment, where I've redacted some sensitive information, and you can see an osquery status message and an osquery result message. Because Stackdriver understands JSON, it's very friendly in the browser, and you can just open up specific arrays if you want.

Another subscriber of the osquery data is BigQuery. BigQuery is a serverless, scalable data warehouse that allows you to make informed decisions, and it surfaces meaning in large datasets quickly. There are client libraries in Go and Python. What we have here is Go code that sets up a context, sets up a BigQuery client, creates a query, and then runs it with that context. This is great for both interactive and automated analysis. So far we have terabytes of data, and the queries are generally pretty quick. Here we have a real data point: when we were retrieving osquery performance data for a pack, it processed 43 gigabytes in 1.4 seconds. It's very efficient.
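The slide shows the Go client, but the query itself is ordinary BigQuery SQL. A hypothetical query of the same general shape as the pack-performance one just mentioned might look like this; the dataset and column names are illustrative, not our actual schema:

```sql
-- Hypothetical pack-performance rollup over osquery results stored
-- in BigQuery; table and column names are illustrative only.
SELECT
  name AS scheduled_query,
  COUNT(*) AS result_rows,
  MAX(unix_time) AS last_seen
FROM `cenote.osquery.results`
WHERE name LIKE 'pack_performance%'
GROUP BY name
ORDER BY result_rows DESC;
```
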
BigQuery also enables detection and response flows. What we have here is that same client library with a rule engine built on top of it. Here is a JSON object with a defined structure: the rule runs every 30 minutes, there's a severity associated with it, and there's a query that grabs crontab-related telemetry that was collected on the edge network. There's also a playbook URL, and if this query were to return data, it would trigger alert ID 2002.

The third phase of scaling osquery is the visualization aspect. We mentioned automated analysis using Google Functions, we mentioned manual analysis using Stackdriver, and we mentioned automated analysis with BigQuery.
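A rule object of the kind described above might look roughly like the following; every field name and value here is reconstructed for illustration and is not the exact schema:

```json
{
  "name": "crontab-telemetry-watch",
  "interval": "30m",
  "severity": "medium",
  "query": "SELECT * FROM `cenote.osquery.results` WHERE name = 'crontab' AND unix_time > UNIX_SECONDS(TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 MINUTE))",
  "playbook_url": "https://playbooks.example.internal/crontab",
  "alert_id": 2002
}
```
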
The fourth aspect is just visualizing the data. BigQuery, in addition to running queries, also allows you to export data, and that creates datasets. A dataset can cover a week, a day, or a month, and based on this static dataset you can then create graphs. In this case, one of our team members surfaced some of the collected osquery telemetry. I've redacted all of it, but you can see there's a section for kernel versions, and if a thumb drive were used in a data center, that would get logged. Again, this is all information that was collected by osquery on the anycast network, sent via TLS to Cenote, and then ultimately exported.

The Cenote platform is designed for data analysis and transport, and we can extend the platform to address specific security challenges. We are going to add file carving support; this would allow us to retrieve a file from a metal, and then we could potentially push it off to Joe Sandbox for live malware analysis, or push it to a container for manual analysis. We're also looking into identifying vulnerable components in libraries: it's common for servers to be misconfigured, so you want to ensure that everything running is what you believe it to be. We're also interested in simulating a metal and, for example, trying out a Docker escape technique or a Kubernetes escape technique, to ensure that osquery is collecting the relevant telemetry; so adversarial simulation for Linux is something we're looking into. And we're working with industry experts to help us with vulnerability detection through an osquery extension. This is some of the identified engineering work that we have planned, or that we're considering. Questions? [Applause]
So, the messages are coming through TLS, but besides that, no. I think there are definitely ways to extend osquery to support something like signed queries, though.
Any additional questions? Potentially, yes. The idea with the file carving support is that we can create a query to grab a file from a metal server. We would then retrieve it, and from the Cenote platform we could perform automated analysis via a third party such as VirusTotal or Joe Sandbox. But we could also just do manual analysis in a container: we can retrieve the file, put it in the container, and then do an isolated analysis there.

I'm not sure I see the connection with osquery and Cenote on that, but maybe we can talk further.
There are no more questions, so thank you very much. [Applause]