seccomp is a kernel feature that restricts what syscalls can be used by a process.
Almost every container runs with seccomp enabled to restrict its access to syscalls.
The seccomp profile defined for a container is ultimately converted to a BPF program that the kernel runs on each syscall to decide whether to allow it and how to handle it.
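To make this concrete, here is a minimal Python sketch of the kind of classic BPF program that ends up installed. It is not what crun, runc or libseccomp generate: the filter itself (fail ptrace with EPERM on x86_64, allow everything else) and the tiny interpreter are illustrative assumptions. Only the opcodes, the seccomp_data offsets and the return values come from the Linux UAPI headers.

```python
import struct

# Classic BPF opcodes (values from <linux/bpf_common.h>).
LD_W_ABS = 0x20  # BPF_LD | BPF_W | BPF_ABS: load a 32-bit word from seccomp_data
JEQ_K = 0x15     # BPF_JMP | BPF_JEQ | BPF_K: compare the accumulator with a constant
RET_K = 0x06     # BPF_RET | BPF_K: return a constant verdict

# Verdicts and constants from <linux/seccomp.h>, <linux/audit.h> and errno.h.
SECCOMP_RET_ALLOW = 0x7FFF0000
SECCOMP_RET_ERRNO = 0x00050000
SECCOMP_RET_KILL_PROCESS = 0x80000000
AUDIT_ARCH_X86_64 = 0xC000003E
EPERM = 1
SYS_PTRACE = 101  # x86_64 syscall number, picked only for illustration

# Offsets inside struct seccomp_data: nr at 0, arch at 4, args[] from 16.
OFF_NR, OFF_ARCH = 0, 4


def insn(code, jt, jf, k):
    """Encode one struct sock_filter (u16 code, u8 jt, u8 jf, u32 k), little-endian."""
    return struct.pack("<HBBI", code, jt, jf, k)


FILTER = b"".join([
    insn(LD_W_ABS, 0, 0, OFF_ARCH),
    insn(JEQ_K, 1, 0, AUDIT_ARCH_X86_64),         # right arch: skip the kill
    insn(RET_K, 0, 0, SECCOMP_RET_KILL_PROCESS),
    insn(LD_W_ABS, 0, 0, OFF_NR),
    insn(JEQ_K, 0, 1, SYS_PTRACE),                # not ptrace: skip to ALLOW
    insn(RET_K, 0, 0, SECCOMP_RET_ERRNO | EPERM),
    insn(RET_K, 0, 0, SECCOMP_RET_ALLOW),
])


def seccomp_data(nr, arch=AUDIT_ARCH_X86_64, args=(0,) * 6):
    """Build the 64-byte struct seccomp_data the filter is evaluated against."""
    return struct.pack("<iIQ6Q", nr, arch, 0, *args)


def run(filter_bytes, data):
    """Interpret the three opcodes used above and return the filter's verdict."""
    insns = [struct.unpack_from("<HBBI", filter_bytes, off)
             for off in range(0, len(filter_bytes), 8)]
    acc, pc = 0, 0
    while True:
        code, jt, jf, k = insns[pc]
        pc += 1
        if code == LD_W_ABS:
            acc = struct.unpack_from("<I", data, k)[0]
        elif code == JEQ_K:
            pc += jt if acc == k else jf
        elif code == RET_K:
            return k
        else:
            raise ValueError(f"unsupported opcode {code:#x}")
```

Calling run(FILTER, seccomp_data(SYS_PTRACE)) walks the same seven instructions the kernel would execute for that syscall. A real profile has hundreds of such comparisons, which is why the shape of the generated program matters for performance.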
An OCI runtime, such as crun or runc, gets the seccomp configuration as part of the OCI JSON configuration file, then generates the BPF program using libseccomp.
OCI container engines, such as Podman, CRI-O or Moby, instead use a higher-level JSON file to define the seccomp profile. This profile is then converted to the configuration passed to the OCI runtime.
The higher-level configuration makes it possible to customize the seccomp profile according to the container configuration. For example, it is possible to allow or deny a syscall only when a specific capability is also granted to the container.
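As a rough sketch of that conversion step, the Python below filters a list of high-level rules by the capabilities granted to the container. The includes/excludes keys follow the shape used in the containers JSON profile, but the sample rules are made up and the matching logic is an assumption (a rule is kept only when all of its includes caps are granted and none of its excludes caps are). The real engines also consider other conditions, such as the architecture.

```python
def resolve_profile(rules, granted_caps):
    """Return the rules that apply to a container holding granted_caps.

    Assumed semantics: every capability listed under includes.caps must be
    granted, and no capability listed under excludes.caps may be granted.
    """
    granted = set(granted_caps)
    resolved = []
    for rule in rules:
        includes = set(rule.get("includes", {}).get("caps", []))
        excludes = set(rule.get("excludes", {}).get("caps", []))
        if not includes <= granted:
            continue  # a required capability is missing
        if excludes & granted:
            continue  # an excluded capability is present
        # The OCI-level rule no longer carries the conditions.
        resolved.append({key: value for key, value in rule.items()
                         if key not in ("includes", "excludes")})
    return resolved


# Hypothetical high-level rules, for illustration only.
RULES = [
    {"names": ["read", "write"], "action": "SCMP_ACT_ALLOW"},
    {"names": ["mount"], "action": "SCMP_ACT_ALLOW",
     "includes": {"caps": ["CAP_SYS_ADMIN"]}},
    {"names": ["socket"], "action": "SCMP_ACT_ERRNO",
     "excludes": {"caps": ["CAP_AUDIT_WRITE"]}},
]
```

With no capabilities granted, resolve_profile(RULES, []) keeps the read/write and socket rules and drops the mount rule. Granting CAP_SYS_ADMIN brings the mount rule back.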
Writing a seccomp profile in JSON is painful on its own, but there are also several limitations in the libseccomp API that make it even more difficult to write. As the seccomp_rule_add(3) man page says:
|
|
This behavior turns out to be quite difficult to handle if the syscall should be treated in different ways.
If you don’t believe it, look at what we had to do just to return EINVAL when the first argument to the socket syscall is equal to 16 and the third one to 9 (why it is done this way is a topic for another time):
This also doesn’t scale: if we want to add another condition, we would need to provide the configuration for each combination of values.
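The Python toy model below illustrates why. It is not libseccomp and not the JSON shown above: it assumes that a rule is a plain conjunction of per-argument comparisons and that unmatched calls fall through to a default action. Under those assumptions, "EINVAL when both arguments match, allow otherwise" already needs one rule for the match plus one allow rule per negated condition, and the whole set must be regenerated whenever a condition is added. The values 16 and 9 are AF_NETLINK and NETLINK_AUDIT on Linux.

```python
import itertools
import operator

OPS = {"eq": operator.eq, "ne": operator.ne}


def evaluate(rules, args, default="ERRNO(EPERM)"):
    """Return the action of the first rule whose comparisons all hold."""
    for action, comparisons in rules:
        if all(OPS[op](args[index], value) for index, op, value in comparisons):
            return action
    return default


def conjunction_rules(conditions, match_action, otherwise="ALLOW"):
    """Expand 'match_action if ALL conditions hold, else otherwise' into rules.

    conditions is a list of (argument index, value) pairs. Because a rule can
    only AND comparisons together, the 'else' branch becomes one rule per
    negated condition (De Morgan).
    """
    rules = [(match_action, [(index, "eq", value) for index, value in conditions])]
    rules += [(otherwise, [(index, "ne", value)]) for index, value in conditions]
    return rules


# socket(AF_NETLINK, *, NETLINK_AUDIT) fails with EINVAL, everything else is allowed.
SOCKET_RULES = conjunction_rules([(0, 16), (2, 9)], "ERRNO(EINVAL)")

# Brute-force check of the rule set against the intended decision.
for a0, a2 in itertools.product((0, 16, 17), (0, 9, 10)):
    expected = "ERRNO(EINVAL)" if (a0 == 16 and a2 == 9) else "ALLOW"
    assert evaluate(SOCKET_RULES, [a0, 0, a2]) == expected
```

In this simplified model the growth is one extra rule per condition. The point is only that a single "if A and B then EINVAL" has to be spelled out as several separate rules, each of which must be kept consistent with the others.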
So what to do?
Last week I started working on easyseccomp. It is still a PoC, but it already seems to work quite well.
The goal is to have an easier-to-use language to define a seccomp profile.
libseccomp is not used to generate the BPF bytecode (although it is still needed to look up the syscall numbers).
To give an example, the socket syscall example above would look like:
That’s it.
How to use it?
Since the seccomp configuration is passed to the OCI runtime as part
of the OCI configuration file and doesn’t allow any customization, we
need (at least for now) a side channel to pass it.
Annotations are a mechanism to pass arbitrary information to the OCI runtime. I’ve added a custom annotation to crun. When the annotation is present, crun ignores the seccomp configuration in the OCI file and loads the raw BPF bytecode from the specified file.
The PR is here: https://github.com/containers/crun/pull/578
Once the BPF filter is generated by easyseccomp, the raw result can be passed to crun using the new annotation. For example, from Podman it is possible to do:
|
|
The container engine has all the logic to convert the high-level JSON configuration to the OCI version, including checking which capabilities are granted to the container.
For now we need to take care of this step when the easyseccomp profile is generated.
easyseccomp supports customizations of the profile with a mechanism similar to the C preprocessor:
|
|
These definitions can be specified to easyseccomp:
|
|
If CAP_AUDIT_WRITE is not specified to easyseccomp then the code between the #ifndef directive and the #endif is ignored.
Conversely, #ifdef DIRECTIVE makes it possible to specify code that is included only when the specified DIRECTIVE is present.
The #if(n)def/#endif directive mechanism is a replacement for the excludes/includes rules used in the JSON file.
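For readers unfamiliar with this style of conditional inclusion, here is a minimal Python sketch of an #ifdef/#ifndef/#endif pass that follows the usual C preprocessor convention: #ifdef keeps its block when the name is defined, and #ifndef keeps it when the name is not. It is not easyseccomp's implementation, and easyseccomp's own handling of a given directive may differ. The function name, the support for nesting and the error handling are assumptions, and the profile lines are treated as opaque text.

```python
def preprocess(lines, defined):
    """Keep only the lines whose enclosing #ifdef/#ifndef blocks are active."""
    defined = set(defined)
    active = []  # one boolean per open block: is this block currently kept?
    output = []
    for line in lines:
        words = line.split()
        if words and words[0] in ("#ifdef", "#ifndef"):
            if len(words) != 2:
                raise ValueError(f"malformed directive: {line!r}")
            present = words[1] in defined
            active.append(present if words[0] == "#ifdef" else not present)
        elif words and words[0] == "#endif":
            if not active:
                raise ValueError("#endif without a matching #ifdef/#ifndef")
            active.pop()
        elif all(active):
            # Emitted only when every enclosing block is active.
            output.append(line)
    if active:
        raise ValueError("missing #endif")
    return output


# Placeholder profile text; the rule lines are not real easyseccomp syntax.
PROFILE = [
    "<unconditional rules>",
    "#ifndef CAP_AUDIT_WRITE",
    "<rules used only without CAP_AUDIT_WRITE>",
    "#endif",
    "#ifdef CAP_SYS_ADMIN",
    "<rules used only with CAP_SYS_ADMIN>",
    "#endif",
]
```

Calling preprocess(PROFILE, {"CAP_SYS_ADMIN"}) keeps all three placeholder lines, while preprocess(PROFILE, {"CAP_AUDIT_WRITE"}) keeps only the unconditional one. The set of defined names plays the role that the container's capability list plays for the includes/excludes rules in the JSON file.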
To facilitate the conversion between an existing JSON configuration file and the new language, I’ve added a Python script convert-from-containers-policy.py that can be used as:
|
|
The conversion is best-effort, but it is a good starting point.
Given the new profile, the BPF can be generated (assuming you are running on AMD64) as:
|
|
Generated BPF
Running with seccomp enabled has a runtime overhead on each syscall performed by a process. The overhead depends on the generated BPF.
The BPF generated by easyseccomp, at least the one created from the profile above, seems to perform better than what libseccomp does.
On my machine, using kernel 5.9.16-200.fc33.x86_64 and a crun version that supports loading the raw BPF filter, I’ve used this simple C program to benchmark the seccomp overhead:
|
|
and I get:
|
|
The first command disables seccomp, while the second one uses the version generated by libseccomp and the third one by easyseccomp.
EDIT:
Linux 5.11 has constant-action bitmaps for seccomp, so the performance in the example above is the same for both the libseccomp and easyseccomp versions. Since the constant-action kernel optimization works only for ALLOW rules, the smaller BPF generated by easyseccomp (using the containers default profile, it is down to 20% of the size of the libseccomp version) still performs better in all other cases.