Crun on scratch

The journey to speed up running OCI containers

Wed, 21 Sep 2022 16:30:00 +0200

When I started working on crun, I was looking for a faster way to start and stop containers by improving the OCI runtime, the component in the OCI stack that is responsible for talking to the kernel and setting up the environment where the container runs. Over roughly five years, a combination of kernel patches and userspace fixes reduced the time to start and stop a container from around 160 ms to just over 5 ms, nearly a 30x improvement, through targeted work on network namespace teardown, mqueue mount overhead, IPC namespace cleanup, and seccomp profile compilation.

An interesting issue handling the seccomp listener

Mon, 05 Sep 2022 21:59:12 +0200

A bug report filed against crun a few days ago exposed a deadlock: under certain seccomp profiles, the runtime would hang indefinitely before the container process ever started. The root cause is a subtle sequencing problem between installing a seccomp filter that intercepts a syscall and then immediately using that same syscall to hand off the resulting listener file descriptor to the userspace handler — the very handler that has not yet received the descriptor it needs to process the interception.

Seccomp made easy

Sat, 30 Jan 2021 21:10:14 +0200

Seccomp is a kernel feature that restricts what syscalls can be used by a process. The allowed syscalls are described as a BPF program that the kernel evaluates on every syscall entry. While effective, writing and maintaining seccomp profiles in the JSON format expected by OCI runtimes is tedious, and the underlying libseccomp API has surprising constraints — particularly around combining per-argument rules for the same syscall — that make complex policies difficult to express correctly.

Almost every container runs with seccomp enabled to restrict its access to syscalls.
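To give an idea of the JSON format mentioned above, here is a minimal sketch that builds an allow-list profile in the layout used by the OCI runtime specification (`defaultAction`, `architectures`, `syscalls`). The helper name `make_seccomp_profile`, the chosen syscalls, and the single x86_64 architecture are assumptions made for illustration; a real profile lists far more syscalls and usually several architectures.

```python
import json


def make_seccomp_profile(allowed, default_action="SCMP_ACT_ERRNO"):
    """Build a minimal OCI-style seccomp profile (hypothetical helper).

    Every syscall not listed in `allowed` gets `default_action`.
    """
    return {
        "defaultAction": default_action,
        # Assumption: only x86_64 is targeted in this sketch.
        "architectures": ["SCMP_ARCH_X86_64"],
        "syscalls": [
            {"names": sorted(allowed), "action": "SCMP_ACT_ALLOW"},
        ],
    }


profile = make_seccomp_profile({"read", "write", "exit_group"})
print(json.dumps(profile, indent=2))
```

Even this tiny allow-list needs a fair amount of boilerplate, which hints at why hand-maintaining larger profiles becomes tedious.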

Playing with seccomp notifications in the OCI runtime

Mon, 10 Aug 2020 10:40:19 +0200

A couple of weekends ago I played with seccomp user notifications and how they can be used in the OCI containers stack. Seccomp user notifications are a Linux kernel feature that lets a privileged monitor process intercept specific syscalls made by a less-privileged container, inspect the arguments, and either emulate the syscall or return an error. This opens up possibilities for safely expanding what unprivileged containers can do (for example, emulating mknod) without granting broad kernel capabilities to the container itself.

Seccomp user notifications are a powerful Linux kernel feature that delegates syscall handling to a userland program.

Avoid a memory page allocation on mount(2)

Fri, 27 Dec 2019 16:16:33 +0000

While working on crun, I was surprised by how much time the kernel spent in the copy_mount_options function. A container runtime issues a large number of mount(2) syscalls during startup (bind mounts, proc, sysfs, devtmpfs, and more), many of them with no extra options to pass. It turned out that passing an empty string instead of NULL for the data argument caused the kernel to allocate a full memory page and attempt a copy from user space on every one of those calls, adding measurable overhead.
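The idea can be sketched in a few lines of Python using ctypes. This is not crun's code (crun is written in C); the helper names `mount_data_arg` and `mount` are hypothetical and only illustrate how an empty option string can be turned into a NULL data pointer before calling mount(2).

```python
import ctypes
import ctypes.util
import os


def mount_data_arg(options):
    """Return the value to pass as the data argument of mount(2).

    Empty or missing options become None, which ctypes passes as a NULL
    pointer, so the kernel has nothing to copy from user space.
    """
    if not options:
        return None
    return options.encode()


def mount(source, target, fstype, flags=0, options=""):
    """Thin wrapper around mount(2); calling it requires privileges."""
    libc = ctypes.CDLL(ctypes.util.find_library("c"), use_errno=True)
    libc.mount.argtypes = (
        ctypes.c_char_p,  # source
        ctypes.c_char_p,  # target
        ctypes.c_char_p,  # filesystemtype
        ctypes.c_ulong,   # mountflags
        ctypes.c_char_p,  # data
    )
    libc.mount.restype = ctypes.c_int
    ret = libc.mount(source.encode(), target.encode(), fstype.encode(),
                     flags, mount_data_arg(options))
    if ret != 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))
```

With this wrapper, a call such as `mount("proc", "/proc", "proc")` passes NULL rather than an empty string as the last argument.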

Crun moved to github.com/containers

Mon, 12 Aug 2019 09:54:25 +0000

The giuseppe/crun GitHub project was moved under https://github.com/containers/crun. Moving to the containers organization means the project is no longer a personal experiment but a community-maintained component of the container stack, alongside tools like Podman, Buildah, and fuse-overlayfs. This makes it easier to coordinate changes across the ecosystem and signals that crun is a supported alternative OCI runtime for production use.

Similarly, libocispec, used internally by crun for parsing the OCI configuration file, was moved to https://github.com/containers/libocispec.

Rootless resources management with Podman on Fedora 30

Sun, 12 May 2019 20:36:59 +0000

I have finally opened some PRs for conmon and libpod that enable resource management for Podman rootless containers on Fedora 30 when using crun. This builds on the cgroups v2 delegation support added to crun earlier: Fedora 30 ships a kernel and systemd new enough to support the unified cgroup hierarchy, so with a single kernel command-line option and a small systemd drop-in, unprivileged users can now set memory and CPU limits on their containers without root access.

Resources management with rootless containers and cgroups v2

Tue, 26 Feb 2019 21:22:10 +0000

cgroups v2 will finally allow unprivileged users to manage a cgroup hierarchy in a safe manner without requiring any additional permission. In the cgroups v1 model, writing to cgroup control files requires root, which means rootless containers cannot enforce memory limits or CPU quotas. The unified cgroups v2 hierarchy introduces a delegation mechanism where systemd can hand ownership of a subtree to a user process, enabling the OCI runtime to configure resource limits directly without any privileged helper.
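Once a subtree has been delegated, configuring limits comes down to writing plain text into the cgroup v2 control files `memory.max` and `cpu.max`. The sketch below assumes a hypothetical helper `write_cgroup_limits` and runs against a temporary directory standing in for a delegated cgroup, so it can be tried without any cgroup setup; it is an illustration, not how crun implements it.

```python
import os
import tempfile


def write_cgroup_limits(cgroup_dir, memory_max=None, cpu_quota_us=None,
                        cpu_period_us=100_000):
    """Write memory and CPU limits into a cgroup v2 directory.

    memory.max takes a byte count; cpu.max takes "<quota> <period>" in
    microseconds. Returns the values written, keyed by file name.
    """
    written = {}
    if memory_max is not None:
        written["memory.max"] = str(memory_max)
    if cpu_quota_us is not None:
        written["cpu.max"] = f"{cpu_quota_us} {cpu_period_us}"
    for name, value in written.items():
        with open(os.path.join(cgroup_dir, name), "w") as f:
            f.write(value)
    return written


# Assumption: a temporary directory stands in for a delegated cgroup.
with tempfile.TemporaryDirectory() as fake_cgroup:
    print(write_cgroup_limits(fake_cgroup,
                              memory_max=256 * 1024 * 1024,
                              cpu_quota_us=50_000))
```

On a real system the target would be a directory under the delegated subtree, and the memory and cpu controllers must be enabled in the parent's `cgroup.subtree_control` for these files to exist.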

New COPR repository for crun

Wed, 15 Nov 2017 19:25:46 +0000

I made a new COPR repository for crun so that it can be easily tested on Fedora without having to build from source. crun is a lightweight OCI container runtime written in C, intended as a faster and lower-overhead alternative to runC. The COPR repository tracks the upstream development branch, making it straightforward to try out new features and report issues before they land in a distribution package.

https://copr.fedorainfracloud.org/coprs/gscrivano/crun/

To install crun on Fedora, it is enough to:

C is a better fit for tools like an OCI runtime

Mon, 23 Oct 2017 21:21:19 +0000

I have spent part of the last few weeks working on a replacement for runC, the most widely used and best-known OCI runtime for running containers. runC may not be a familiar name to most users, but it is a key component for running containers: every Docker container ultimately runs through it. The OCI runtime is the thin layer between the container engine and the kernel: it reads a JSON configuration file, creates the necessary namespaces and cgroups, sets up mounts and capabilities, and finally execs the container process. Because it runs for such a short time and its workload is almost entirely syscalls, the implementation language matters for startup latency.
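The sequence described above can be made concrete with a small sketch that reads an OCI-style configuration and lists the setup steps a runtime would have to perform. The embedded configuration is a heavily trimmed, made-up example, and `plan_container_setup` is a hypothetical helper; a real runtime performs these steps through syscalls rather than printing them.

```python
import json

# Assumption: a heavily trimmed config.json, for illustration only.
EXAMPLE_CONFIG = """
{
  "ociVersion": "1.0.0",
  "process": {
    "args": ["/bin/sh", "-c", "echo hello"],
    "capabilities": {"bounding": ["CAP_KILL"]}
  },
  "root": {"path": "rootfs"},
  "mounts": [
    {"destination": "/proc", "type": "proc", "source": "proc"}
  ],
  "linux": {
    "cgroupsPath": "/mycontainer",
    "namespaces": [{"type": "pid"}, {"type": "mount"}]
  }
}
"""


def plan_container_setup(config):
    """Return the ordered list of setup steps implied by the config."""
    steps = []
    linux = config.get("linux", {})
    for ns in linux.get("namespaces", []):
        steps.append(f"create {ns['type']} namespace")
    if "cgroupsPath" in linux:
        steps.append(f"join cgroup {linux['cgroupsPath']}")
    for m in config.get("mounts", []):
        steps.append(f"mount {m.get('type', 'bind')} on {m['destination']}")
    process = config.get("process", {})
    caps = process.get("capabilities", {}).get("bounding", [])
    if caps:
        steps.append("set capabilities " + ",".join(caps))
    steps.append("exec " + " ".join(process.get("args", [])))
    return steps


for step in plan_container_setup(json.loads(EXAMPLE_CONFIG)):
    print(step)
```

Almost every step in the resulting list corresponds to one or a handful of syscalls, which is why the runtime's own startup cost is a visible share of the total.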
