While working on crun, I was surprised by how much time the kernel spent in the copy_mount_options function. A container runtime issues a large number of mount(2) syscalls during startup (bind mounts, proc, sysfs, devtmpfs, and more), many of them with no extra options to pass. It turned out that passing an empty string instead of NULL for the data argument caused the kernel to allocate a full memory page and attempt a copy from user space on every one of those calls, adding measurable overhead.
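As a rough illustration of the fix (crun itself is written in C, so this is not its code), the sketch below calls mount(2) through ctypes and maps an empty option string to None, which ctypes passes as a NULL pointer. The helper names `mount_data_arg` and `mount` are hypothetical.

```python
import ctypes
import ctypes.util
import os


def mount_data_arg(options):
    """Return the value to pass as the data argument of mount(2).

    An empty or missing option string becomes None, which ctypes
    converts to a NULL pointer instead of a pointer to "".
    """
    if not options:
        return None
    return options.encode()


def mount(source, target, fstype, flags=0, options=None):
    """Thin wrapper around mount(2); needs the usual privileges to succeed."""
    libc = ctypes.CDLL(ctypes.util.find_library("c"), use_errno=True)
    libc.mount.argtypes = (
        ctypes.c_char_p,  # source
        ctypes.c_char_p,  # target
        ctypes.c_char_p,  # filesystemtype
        ctypes.c_ulong,   # mountflags
        ctypes.c_char_p,  # data (NULL when there are no options)
    )
    libc.mount.restype = ctypes.c_int
    ret = libc.mount(
        source.encode(),
        target.encode(),
        fstype.encode() if fstype else None,
        flags,
        mount_data_arg(options),
    )
    if ret != 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))
```

The only interesting line is the `if not options` check: callers can keep passing `""` for "no options" while the syscall still receives NULL.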
Run containers without pulling images
CRFS is a Google project that aims to run a container without pulling the image first. The key insight is that in practice a container process only accesses a small fraction of the files in its image, so fetching the entire image before startup wastes both time and disk space. CRFS achieves this through the stargz (Seekable tar.gz) format, which restructures each compressed layer so that individual files can be fetched on demand rather than requiring the entire tarball to be downloaded and extracted up front.
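The idea of a seekable compressed layer can be sketched in a few lines. This is a toy model, not the real stargz layout: it leaves out the tar headers and the on-disk index format, and the names `build_blob`, `read_file` and `fetch_range` are made up for the example. Each file is compressed as its own gzip member, and a separate index records where each member starts and how long it is.

```python
import gzip


def build_blob(files):
    """Compress each file as its own gzip member and record where it lands.

    Returns the concatenated blob and an index mapping each file name
    to (offset, length) inside the blob.
    """
    blob = bytearray()
    index = {}
    for name, data in files.items():
        member = gzip.compress(data)
        index[name] = (len(blob), len(member))
        blob += member
    return bytes(blob), index


def read_file(fetch_range, index, name):
    """Fetch and decompress a single file using only its byte range."""
    offset, length = index[name]
    return gzip.decompress(fetch_range(offset, length))
```

Here `fetch_range(offset, length)` stands in for whatever retrieves a byte range of the layer, for example a ranged read against a remote blob. Because concatenated gzip members still form a valid gzip stream, the whole blob can also be decompressed in one pass by a tool that knows nothing about the index.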
Crun moved to github.com/containers
The giuseppe/crun GitHub project has moved to https://github.com/containers/crun. Moving to the containers organization means the project is no longer a personal experiment but a community-maintained component of the container stack, alongside tools like Podman, Buildah, and fuse-overlayfs. This makes it easier to coordinate changes across the ecosystem and signals that crun is a supported alternative OCI runtime for production use.
Similarly, libocispec, which crun uses internally to parse the OCI configuration file, has moved to https://github.com/containers/libocispec.
Rootless resources management with Podman on Fedora 30
I have finally opened some PRs for conmon and libpod that enable resource management for rootless Podman containers on Fedora 30 when using crun. This builds on the cgroups v2 delegation support added to crun earlier: Fedora 30 ships a kernel and systemd new enough to support the unified cgroup hierarchy, so with a single kernel command-line option and a small systemd drop-in, unprivileged users can now set memory and CPU limits on their containers without root access.
Resources management with rootless containers and cgroups v2
cgroups v2 will finally allow unprivileged users to manage a cgroup hierarchy in a safe manner without requiring any additional permission. In the cgroups v1 model, writing to cgroup control files requires root, which means rootless containers cannot enforce memory limits or CPU quotas. The unified cgroups v2 hierarchy introduces a delegation mechanism where systemd can hand ownership of a subtree to a user process, enabling the OCI runtime to configure resource limits directly without any privileged helper.
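Once a subtree has been delegated, setting limits comes down to writing into the cgroup v2 interface files. The sketch below is not what crun does internally; it assumes `cgroup_dir` is a cgroup directory the current user already owns and that the memory and cpu controllers are enabled for it. The function name `apply_limits` is hypothetical; `memory.max` and `cpu.max` are the standard cgroup v2 files.

```python
import os


def apply_limits(cgroup_dir, memory_max_bytes=None,
                 cpu_quota_us=None, cpu_period_us=100000):
    """Write cgroup v2 limits into an already delegated cgroup directory."""
    if memory_max_bytes is not None:
        # memory.max takes a byte count (or the literal string "max").
        with open(os.path.join(cgroup_dir, "memory.max"), "w") as f:
            f.write(str(memory_max_bytes))
    if cpu_quota_us is not None:
        # cpu.max takes "<quota> <period>", both in microseconds.
        with open(os.path.join(cgroup_dir, "cpu.max"), "w") as f:
            f.write(f"{cpu_quota_us} {cpu_period_us}")
```

For example, `apply_limits(path, memory_max_bytes=256 * 1024 * 1024, cpu_quota_us=50000)` would cap the group at 256 MiB of memory and half a CPU, with no privileged helper involved as long as the directory is writable by the caller.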
Rootless containers @ devconf.cz
The video of the rootless containers talk from DevConf.cz 2019 is finally available on YouTube. The talk covers how user namespaces, fuse-overlayfs, and slirp4netns come together to allow running containers entirely as an unprivileged user, without any setuid helpers beyond newuidmap and newgidmap. It also discusses the challenges around cgroup resource management and overlay storage performance that still need to be addressed for rootless containers to reach full feature parity.
SUID binaries from a user namespace
Additional IDs that are allocated to a user through /etc/subuid and /etc/subgid must be considered as permanently allocated and never reused for any other user. The reason is that a setuid binary created inside a user namespace can retain access to any UID that was mapped in that namespace, even after the namespace is destroyed. If the same UID range is later assigned to a different user, that new user would inherit access to files owned by the old user’s containers.
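One related sanity check is to make sure that no two entries in /etc/subuid or /etc/subgid share any IDs. The sketch below parses the `name:start:count` format of those files and reports overlapping ranges; the function names are hypothetical. It cannot detect a range that was removed and later handed to someone else, since the file keeps no history.

```python
def parse_subid(text):
    """Parse the name:start:count lines of a subuid/subgid file."""
    ranges = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        name, start, count = line.split(":")
        ranges.append((name, int(start), int(count)))
    return ranges


def find_overlaps(ranges):
    """Return pairs of entry names whose ID ranges share at least one ID."""
    overlaps = []
    ordered = sorted(ranges, key=lambda r: r[1])
    for i, (name_a, start_a, count_a) in enumerate(ordered):
        end_a = start_a + count_a  # first ID past the end of range a
        for name_b, start_b, _ in ordered[i + 1:]:
            if start_b >= end_a:
                break  # sorted by start, so no later range can overlap a
            overlaps.append((name_a, name_b))
    return overlaps
```

A non-empty result means two entries can map the same host ID into their user namespaces, which is exactly the situation the post warns about.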
Disposable rootless sessions
It would be nice to have a way to “fork” the current session and be able to revert all the changes made, without leaving anything behind on the file system. With fuse-overlayfs, a user-space overlay filesystem that unprivileged users can mount, this turns out to be surprisingly straightforward: mount the entire root filesystem as the lower layer of an overlay and point the upper layer at a temporary directory. Every write is captured there and can be discarded at the end of the session, leaving the underlying system untouched.
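The mount itself could be assembled roughly as follows. This is a sketch, not the script from the post: the helper name `overlay_command` and the scratch directory layout are assumptions, and the function only builds the fuse-overlayfs command line (using its `lowerdir`, `upperdir` and `workdir` options) without running it or entering the resulting tree.

```python
import os
import tempfile


def overlay_command(mountpoint, lower="/", scratch=None):
    """Build a fuse-overlayfs command that captures all writes in scratch.

    The upper and work directories are created under scratch; deleting
    scratch after unmounting discards everything the session changed.
    """
    scratch = scratch or tempfile.mkdtemp(prefix="session-")
    upper = os.path.join(scratch, "upper")
    work = os.path.join(scratch, "work")
    os.makedirs(upper, exist_ok=True)
    os.makedirs(work, exist_ok=True)
    opts = f"lowerdir={lower},upperdir={upper},workdir={work}"
    return ["fuse-overlayfs", "-o", opts, mountpoint]
```

The returned list can be passed to `subprocess.run`. Keeping the upper and work directories under a single scratch directory means one recursive delete is enough to throw the session away.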
An Emacs mode for Rust
I was looking for an Emacs mode that could help me hack on Rust. The rust-mode package provides syntax highlighting and basic indentation, but for a language with a complex type system and borrow checker it is useful to have editor integration that can navigate to definitions, show type information, and offer completions. This post covers setting up racer-mode, which drives the racer code-completion engine to provide those features inside Emacs.
On its own, rust-mode does not have enough features to help me with a language I am not really proficient in yet.
Rootless Podman from upstream on CentOS 7
This is the recipe I use to build Podman from upstream on CentOS 7 and use rootless containers. We need an updated version of shadow-utils, as newuidmap and newgidmap are not present on CentOS 7. The shadow utilities are installed using “make install”, which is not a clean way to install packages and also overwrites the existing binaries, but it is fine on a development system. A Podman package is already available on CentOS 7, and in fact we install it first so that we don’t have to worry about conmon and the other dependencies.