
Hello, my name is Sergey Smetienko. I am going to tell you about eBPF, a virtual machine inside the Linux kernel. To begin with, a few words about myself. I work on security architecture and DevOps at the NOC Service company. I have been speaking at security conferences for the last ten years, especially at BSides, of which I am a co-organizer. Today's topic is, in my opinion, a very important one, because eBPF is now a very hot area of Linux kernel development, and over time many subsystems will migrate to this platform, because it lets us build very flexible and very interesting solutions. First, let's talk about how an operating system works.
This architecture is dictated primarily by the processor manufacturers and the memory protection system they designed. There is always a division into user space and kernel space. User space is where user applications run, and they run with significant restrictions. Kernel space is where the kernel runs, without restrictions. When a user application wants to perform some operation that goes beyond its own scope, be it disk I/O, network operations, or terminal output, it asks the kernel through the system call mechanism. Look at the presentation: you can see a call to the write function here, which writes something out or displays it on the screen. For a developer writing in some high-level language it looks like a call to the function write; that call goes into the standard C library, the standard C library converts it into a system call, and the system call is handed to the kernel for execution. Such a mechanism, as I said, is baked into the architecture of microprocessors, and it carries a cost in terms of performance: in this kind of exchange, data is copied from user space into kernel space, and if the kernel returns data, there is a reverse copy as well.
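As a rough sketch of what this looks like from the programmer's side, here is the same write operation done through the libc wrapper and then directly by syscall number (the raw syscall(2) call is only there to make the user/kernel boundary visible):

```c
#include <unistd.h>
#include <sys/syscall.h>

int main(void) {
    const char msg[] = "hello\n";
    /* The usual path: libc's write() wrapper issues the write(2) system call. */
    write(STDOUT_FILENO, msg, sizeof msg - 1);
    /* The same operation, invoking the system call directly by number. */
    syscall(SYS_write, STDOUT_FILENO, msg, sizeof msg - 1);
    return 0;
}
```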
This copying is quite expensive, and it is a rather slow operation, which became even slower once the well-known problems with memory protection came to light, the processor bugs we have all been actively discussing over the last couple of years. The workarounds for those bugs made this operation of moving data from user space to the kernel and back even more costly in terms of performance. This is just to give an approximate picture of what happens and why we are talking about it. Now let's talk about where this virtual machine came from, what historical tasks it solved, and how it developed over time. Around the 1990s there was the problem of intercepting all the traffic from a network interface, that is, our usual tcpdump. Initially, so-called raw sockets were created, which allowed a user-level application to read directly all the data passing over the network, everything the network card sees. There were performance problems, because computers were very slow then, it was the 90s, and there was simply not enough performance to pump all the data passing through the kernel up to user space without losing packets, or to process it in any reasonable way. Then researchers at Berkeley created the first solution, which was called the Berkeley Packet Filter. What did they do?
They proposed moving the packet filtering mechanism into the kernel of the operating system. The idea: if there is a lot of traffic on the network and we are interested in only a small part of it, then at the kernel level we decide whether to pass a given packet up to user space or not. That solved the performance problem. At the same time there were many more protocols then than now: besides TCP/IP there was IPX and various other protocols that have since successfully died off. So the question arose of writing a filtering mechanism that could handle all the protocols. Writing a separate patch and separate code for each protocol would have been far too difficult. They thought about it and said: let's create a virtual machine. We will feed this virtual machine a certain bytecode; the bytecode will contain the filtering rules to apply to traffic in the kernel, and the solution will therefore be universal. And that is what they built. When you use the tcpdump utility and pass it a filtering rule, you are using this BPF machinery: a compiler turns the rules into bytecode, the bytecode is passed into the kernel, a compiler in the kernel translates it into native processor code, and that compiled code is attached to the packet reception function. It filters packets, well, relatively quickly, at the kernel level, and thus solves exactly this task. Later the solution was extended, though that took about 15 years. Initially the BPF virtual machine consisted of an accumulator register (plus an index register) and a small scratch memory of sixteen 32-bit words. Code compiled for BPF could use the accumulator or work with those memory words, and that was everything available to the JIT-compiled code. Clearly the task was not a difficult one: read some values from the packet, compare them with some parameters or counters, and make a decision.
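To make the accumulator model concrete, here is a minimal hand-written classic BPF filter in C, attached to a raw socket; it accepts IPv4 frames and drops everything else. This is only a sketch (it needs root or CAP_NET_RAW to run); tcpdump generates this same kind of program for you from a filter expression:

```c
#include <stdio.h>
#include <sys/socket.h>
#include <arpa/inet.h>
#include <linux/if_ether.h>
#include <linux/filter.h>

int main(void) {
    /* Classic BPF: every instruction works against the accumulator. */
    struct sock_filter code[] = {
        /* load the 16-bit EtherType field (offset 12) into the accumulator */
        BPF_STMT(BPF_LD | BPF_H | BPF_ABS, 12),
        /* accumulator == 0x0800 (IPv4)? fall through, else jump to the drop rule */
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, ETH_P_IP, 0, 1),
        /* accept: pass up to 64 KB of the packet to user space */
        BPF_STMT(BPF_RET | BPF_K, 0xFFFF),
        /* drop */
        BPF_STMT(BPF_RET | BPF_K, 0),
    };
    struct sock_fprog prog = { .len = 4, .filter = code };

    int sock = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
    if (sock < 0) { perror("socket (needs root)"); return 1; }
    if (setsockopt(sock, SOL_SOCKET, SO_ATTACH_FILTER, &prog, sizeof prog) < 0)
        perror("SO_ATTACH_FILTER");
    return 0;
}
```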
When the solution was later expanded to its modern form, we got a virtual machine that is much more convenient to use. Within the virtual machine we now have ten 64-bit registers available, plus a read-only frame pointer register. We have a stack of 512 bytes, so within the virtual machine we can make function calls. And we have the maps mechanism, which is very powerful. On the one hand, it lets the program running in the virtual machine work with these memory areas in key-value mode, including via hash tables. But what is really important is that maps allow this memory, which sits inside the kernel and which our virtual machine works with, to be mapped into the address space of the program that loaded the virtual machine's code. So we avoid that long mechanism of pumping data from user space to kernel space and back: thanks to virtual memory, the virtual machine inside the kernel and the application in user space address the same physical memory page, and it all happens very quickly and wonderfully.
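For a flavor of what a map looks like in a modern eBPF program, here is a minimal declaration sketch in the libbpf style (the map name counts and the key/value types are arbitrary choices for illustration; this is a fragment of a kernel-side program, compiled with clang -target bpf):

```c
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>   /* libbpf's SEC/__uint/__type macros */

/* A hash map shared between this kernel-side program and user space:
 * user space can read and update the same entries through the bpf()
 * syscall (and array-type maps can even be mmap'ed), with no copying
 * of data through read/write. */
struct {
    __uint(type, BPF_MAP_TYPE_HASH);
    __uint(max_entries, 1024);
    __type(key, __u32);    /* e.g. a PID */
    __type(value, __u64);  /* e.g. an event counter */
} counts SEC(".maps");
```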
Clearly, such a virtual machine lets you write fairly universal code, load it into the kernel, and execute it there. So it was decided: why not use this mechanism not only for filtering packets in tcpdump but for other possible applications too, because the solution really is very flexible. If you are familiar with how the WebAssembly virtual machine works in browsers, you will see much that is familiar; if you understand how a JavaScript virtual machine works, the concept of eBPF will also feel familiar. In simple terms, eBPF is something like a cross between WebAssembly and JavaScript, but for the kernel of the system.
So how does it all happen? I will stay with tcpdump, because that is a utility I am sure everyone has used and knows how it works, so everything should be transparent enough and require no additional explanation. The tcpdump application talks to libpcap, a user-space library that compiles the filtering rules you pass to tcpdump as parameters into bytecode. Then this bytecode is loaded into the kernel. In the kernel we have the verifier mechanism: an intelligent checker, let's say, that runs over this bytecode and decides whether the kernel will accept it for execution. There used to be a restriction that loops could not be used at all; by now bounded loops are allowed. The verifier checks memory accesses and, in general, the validity of this bytecode; we will not go deeper into that topic for now. If the verifier says the bytecode is acceptable, the kernel compiles it into a native instruction sequence, places it in a memory page that is allowed to hold executable code, and the compiled code is attached to the desired function inside the kernel. For tcpdump this happens at the network driver level, at a function that will invoke this code to filter packets; but there are other places inside the kernel where Linux allows such bytecode to attach
and thereby modify, adjust, or add new functionality to what we already have. Today I want to tell you how the eBPF virtual machine is used to filter syscalls and how that can improve the security of an application. Why is this useful and interesting? For that, you need to recall another historical element of the kernel, Secure Computing Mode, which appeared in the kernel in 2005. What was the story there? That was the time when exploits achieving remote execution of binary code were very widespread, and various solutions were being discussed for building a more secure computing environment. One of the solutions proposed for Linux was the so-called secure computing mode, and it is a pretty simple thing. An application could tell the kernel: from this moment on, I am entering secure computing mode. That meant that from this moment the kernel was willing to process only four syscalls from the application. Here they are: exit, sigreturn, read, and write. Any other syscall resulted in the application being terminated. The intended mechanics were that an application would read all the data necessary for its work, switch to Secure Computing Mode, and from that moment know for certain that it could not open any new file, create any new file, or run any external application. It would only work with the resources that were already open: there are the read and write operations, the exit syscall is obviously application termination, and sigreturn covers signal handling, though we will not go deeper into the internals of what happens there with signals. A superficial understanding is quite enough: we have four syscalls, and we have those restrictions. But as you can imagine, such a mechanism is very limiting and impractical, because there are few real applications that could actually work in secure computing mode.
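Here is a minimal sketch of that original strict mode in action (no eBPF involved yet):

```c
#include <unistd.h>
#include <sys/prctl.h>
#include <linux/seccomp.h>
#include <sys/syscall.h>

int main(void) {
    /* Enter the original (2005) strict mode: only read, write,
     * exit and sigreturn are permitted from here on. */
    prctl(PR_SET_SECCOMP, SECCOMP_MODE_STRICT);

    write(STDOUT_FILENO, "write still works\n", 18);
    syscall(SYS_getpid);                  /* anything else: killed with SIGKILL */
    write(STDOUT_FILENO, "never reached\n", 14);
    syscall(SYS_exit, 0);                 /* exit is on the whitelist */
}
```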
Over time, a variant using the eBPF machine to filter syscalls was proposed. With it we are not bound to those four hard-coded syscalls; instead we have the opportunity to write our own program. This program receives as input the information about which syscall the application is trying to execute, and our program, running inside the virtual machine, makes the decision whether this syscall is allowed to execute or not. From a programming point of view the code looks something like what is on the slide. Someone will probably say that there are other solutions: there is Security-Enhanced Linux, there is AppArmor, there are other options for filtering syscalls in other operating systems. Why build a new solution? Well, look: for SELinux, writing policies about which syscalls to allow and which not is a rather laborious task, and it lands on the shoulders of the system administrator who wants such a thing. Secondly, approaches like SELinux work on the premise that a policy is loaded once and the system then runs in that mode; if we want to change the policy, we get a very resource-intensive task from the system administration point of view. The eBPF approach makes it possible for any program to bring its own policy along and say: I am this program, and I know that for my work I need only this set of syscalls, and if the process executing this program submits a syscall to the kernel that is not contained in this list, then something has gone wrong. That is, we do not place the responsibility for security on the system administrator; we make it possible to place the responsibility for such a filter on the programmer or programmers who develop the system. The code on the screen now imposes on the program the same four restrictions that we had in the hard-wired Secure Computing Mode, that is, exit, sigreturn, read, and write. We can create exactly the same policy, but load it dynamically, from the program itself.
What is happening here: we use the libseccomp library, which lets us work with all of this without having to build such a virtual machine program by hand; everything is already prepared, and we simply say that we want to create such a policy for ourselves. We say that the action on a violation will be SCMP_ACT_KILL, that is, if the application attempts some wrong operation, it will be killed by the kernel. Further on in this example we say that we allow the syscalls read, write, sigreturn, and exit_group. (exit_group and exit are, for our purposes, the same thing; we will not go deep into syscall naming, so we can consider them identical.) And after all the rules are loaded into our context, there is a call to the getpid function. getpid returns the PID of the current process; inside the system it is implemented so that this function addresses the kernel with the corresponding syscall, a syscall named after the function, getpid. If you take this code, compile it, and run it, you will see that the second printf is never executed: your program is interrupted at exactly that point.
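A minimal reconstruction of such a program with libseccomp might look roughly like this (a sketch; the actual code on the slide may differ in details):

```c
#include <stdio.h>
#include <unistd.h>
#include <seccomp.h>   /* libseccomp; build with -lseccomp */

int main(void) {
    /* Default action: any syscall not explicitly allowed kills the process. */
    scmp_filter_ctx ctx = seccomp_init(SCMP_ACT_KILL);

    seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(read), 0);
    seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(write), 0);
    seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(rt_sigreturn), 0);
    seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(exit_group), 0);

    seccomp_load(ctx);   /* compile the policy and attach it to this process */

    write(STDOUT_FILENO, "filter active\n", 14);  /* write(2) is allowed */
    printf("pid: %d\n", getpid());   /* getpid(2) is not: the process dies here */
    return 0;
}
```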
Accordingly, you will get a corresponding record in the audit log. There you can see that UID 1000 ran an executable called seccomp, which was terminated with signal number 31, which is SIGSYS, at the moment of performing syscall number 39; and there is an instruction pointer that points at the code that issued the getpid syscall. There is a convenient utility, ausyscall, which is part of the auditd package, for resolving syscall numbers. It is very simple: rather than wandering through the kernel headers, you just give it the number, and for 39 it will tell you that syscall number 39 is getpid; that is, our application died on getpid. So we not only have a mechanism that stops the application from running, we also receive information in the audit log that this happened and that our application was stopped. Accordingly, if we have some kind of production environment with monitoring set up to analyze the audit data, then most likely we will have a rule somewhere saying that if our production application fell over with something like this, we most likely need to raise the alarm and figure out what is happening.
We live in a world of containers, on the one hand, and on the other hand we live in a world of programmers who do not much like doing extra work, and security, from a programmer's point of view, is often exactly that. But we have tools that give an application the same functionality ready-made, without the programmers' participation. For example, Docker, which everyone uses, includes the ability to apply the same kind of syscall filter, built on this same machinery. And I can tell you that all applications running inside Docker already run with a default filter: Docker disallows certain syscalls that, in its eyes, are not normal for a containerized application and could affect Docker itself. This is in the documentation, at this link, and there you can see the 19 syscalls that Docker blocks by default. There is also a default policy shipped with Docker, organized on the whitelist principle: in this policy, in the form of JSON, the allowed syscalls are listed. And for some syscalls you can filter not only the syscall itself but also the arguments passed to it. For example, in this default profile there is an entry for the syscall mkdir, that is, directory creation: it says that mkdir has the action "allow" with any arguments, so such a syscall will be let through.
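The shape of such an entry is roughly this (abridged; in the real default profile, many syscall names are grouped into a single rule):

```json
{
    "defaultAction": "SCMP_ACT_ERRNO",
    "syscalls": [
        {
            "names": ["mkdir"],
            "action": "SCMP_ACT_ALLOW"
        }
    ]
}
```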
Here we have an opportunity to experiment. We can take that default policy, cut out the permission for the syscall mkdir, and, using the parameter that specifies the security profile, run a Docker container, in this case running Alpine Linux, and just experiment. We do mkdir test inside the container, and as root we ought to be allowed to perform this operation, but we get the answer "operation not permitted": our policy worked and did not let the operation through to the kernel for execution.
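A session looks roughly like this (assuming the edited profile was saved as profile.json; the error text comes from busybox's mkdir in Alpine):

```
$ docker run --rm -it --security-opt seccomp=profile.json alpine sh
/ # mkdir test
mkdir: can't create directory 'test': Operation not permitted
```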
Besides Docker, there is also our wonderful systemd, the daemon that, in place of init, now launches everything else in modern Linux. It is probably interesting to learn that systemd can quietly launch a unit in a separate namespace, so that, from the kernel's point of view, the unit effectively runs in its own container. Remember that Docker is just a thing that manages namespaces: the mechanism that provides containerization in Linux is the namespace mechanism, and Docker merely drives it; nothing super-magical was invented inside Docker itself. systemd has similar functionality for the units it launches. systemd does not impose all the restrictions that Docker imposes when it launches its containers. The namespace mechanism allows a newly created namespace either to use the parent namespace or to create a separate virtual one, and thereby get a separate file system, a virtual network interface, and many other things. systemd launches it all in a separate namespace, but the created namespace inherits the namespaces of the root system, so to the process it looks as if it is running inside the root system and not in its own separate container. But I have drifted a little to the side. The main thing is that systemd also lets us impose on a launched process the same kinds of restrictions as
Docker does. What you see is a piece of a unit file that systemd can interpret; you can add these lines to any unit and see what you get. There are comments here about which restrictions each line imposes. NoNewPrivileges is a setting that prevents the process from raising its privileges in any of the available ways; for example, privilege escalation through launching suid binaries will be blocked. ProtectKernelModules prohibits loading kernel modules, that is, the syscalls related to module loading are blocked. You also have the ability to restrict address families: when creating a socket on a Unix system, you specify the address family for the socket you are creating. You can create a Unix domain socket; you can create IPv4, IPv6, or netlink sockets; you can create raw sockets; you can create sockets for any other protocol family the kernel supports. But ordinary applications will most likely need only Unix sockets and AF_INET / AF_INET6 sockets for their work, so why give them the ability to create anything else? RestrictRealtime removes the ability to switch the current process, or some other one, to real-time scheduling; this is protection against a local DoS attack where you raise the process's scheduler priority and starve the whole system of CPU. RestrictSUIDSGID is also an interesting restriction.
It ensures that the application cannot do a chmod on a file that would produce a setuid or setgid file, that is, a file that changes the privileges of the process created from it. MemoryDenyWriteExecute is another interesting restriction: it says that the running application will not be able to create a memory page with both write and execute permissions. That is, you cannot have a memory page in the system that you can write into while that same page has the bit set that allows executing code from it. When a regular ELF file is loaded by the system, the code stored in the ELF file is loaded into pages that get the "read" and "execute" flags, so the processor can read and execute that code, while all the data pages get the "read" and "write" bits with the "execute" bit removed. Thus, even if some process begins behaving very strangely and tries to write code into some place in memory and then pass control there, this setting denies the system the very possibility of having memory pages where you could write something and then execute what you wrote as code. A very small number of legitimate applications require such functionality for their work (JIT-based runtimes are the usual exception), so you can generally enable it safely.
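To make it concrete, this is the kind of allocation MemoryDenyWriteExecute is designed to refuse (a sketch; with the restriction active, the mmap call fails instead of returning a page):

```c
#include <stdio.h>
#include <sys/mman.h>

int main(void) {
    /* Ask for a page that is simultaneously writable and executable:
     * exactly what exploit payloads want and what W^X policies forbid. */
    void *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE | PROT_EXEC,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        perror("mmap(PROT_WRITE|PROT_EXEC)");   /* EPERM under the restriction */
    else
        puts("got a writable and executable page");
    return 0;
}
```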
And then we have the last parameter in our example, SystemCallFilter, which is exactly what we talked about: we can simply list here all the syscalls we want to allow for this application, and all the application's other syscalls will be blocked. All of this functionality is implemented thanks to the fact that we have our mechanism in the kernel, eBPF, or "e-B-P-F", as some say. In this case systemd can compile such a policy for us, start the parent process, and attach the policy to that parent process. So we, as the people who manage the system, have the ability to impose such restrictions on applications and control them.
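Put together, the unit-file fragment looks roughly like this (a reconstruction of the kind of snippet on the slide; @system-service is one of systemd's predefined syscall groups, used here in place of a long explicit list):

```ini
[Service]
# Forbid gaining privileges, e.g. via setuid binaries
NoNewPrivileges=yes
# Block the module-loading syscalls
ProtectKernelModules=yes
# Only the socket families an ordinary daemon needs
RestrictAddressFamilies=AF_UNIX AF_INET AF_INET6
# No switching to real-time scheduling
RestrictRealtime=yes
# No chmod that would produce setuid/setgid files
RestrictSUIDSGID=yes
# No pages that are writable and executable at the same time
MemoryDenyWriteExecute=yes
# Whitelist of permitted syscalls; everything else is blocked
SystemCallFilter=@system-service
```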
Where can this be especially applicable? If you remember, a couple of years ago an interesting fake was posted claiming that a remote code execution had supposedly been found in nginx. It caused quite a large stir, because nginx is a very popular web server, and it is a web server that always sticks out with its port exposed, that is, it is reachable from the Internet. So, in theory, this could have been a very serious problem at the level of the entire Internet as one large network. When that story came up, we took the problem seriously and simply, for all the applications we run on production systems, created seccomp filters at the systemd or Docker level, depending on which was in use. And that gave us a sense of confidence that if some big worm appeared that automatically infected systems, then, supposing there really were a remote code execution in nginx or some other popular daemon, for us it would end as a DoS: the application would fall over, the process would be killed by the operating system, but the remote code execution would not actually work for such a worm. And if you have log collection and audit analysis set up, then, again, the daemons running on your systems in production should never crash in a normal situation: no SIGSEGV, no SIGKILL, no other strange signals. Accordingly, if suddenly somewhere the audit
fires on the fact that one of our Internet-facing applications died with such a signal, and we know that our policy and our filters were applied there, then that is already a very loud alarm bell telling you to urgently run over and figure out what is happening, because this is definitely not a normal situation. Below there is a link to a very good post on how to conveniently assemble a policy for an unknown application whose developers have published no information about which syscalls it uses. It is often quite clear from the name alone what this or that syscall is responsible for, so you can observe how the application works in its normal mode and collect the set of syscalls it actually makes.
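One common way to collect that set (an assumption on my part, not necessarily the approach from the linked post; ./myapp is a placeholder binary name) is to run the application under strace:

```sh
# -f follows child processes, -c prints a per-syscall summary table
strace -f -c ./myapp
```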
From that you can write such a rule, load it into systemd, and feel much more confident and calm about your systems. That's all about eBPF for today. I hope to tell you about other eBPF applications at the next BSides; I just didn't want to pour out too much information at once, so I hope to see you again. My name is Sergey Smetienko, the NOC Service company. Bye everyone, thank you for watching.