Posts for: #Rootless

Why do I have two /sys/fs/cgroup in my container

Users have asked several times why they see two /sys/fs/cgroup mounts in their unprivileged container. With Podman, this happens when the container does not use a new network namespace. The duplication is not a bug but an intentional consequence of how the kernel handles bind mounts that cross user namespace boundaries, combined with the need to provide the container with a writable cgroup view that is scoped to its own slice.
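
One way to see the duplication is to look for repeated mount points in /proc/self/mountinfo. The following is a minimal sketch, assuming the standard mountinfo field layout; the helper name `mounts_at` and the sample lines are made up for illustration and are not taken from a real container.

```python
def mounts_at(mountinfo_text, mount_point):
    """Return (fstype, root, options) for every mount at mount_point."""
    found = []
    for line in mountinfo_text.splitlines():
        fields = line.split()
        try:
            # A lone "-" separates the optional fields from the fs type.
            sep = fields.index("-", 6)
        except ValueError:
            continue  # not a well-formed mountinfo line
        if len(fields) <= sep + 1:
            continue
        # fields[3] = root of the mount, fields[4] = mount point,
        # fields[5] = per-mount options, fields[sep + 1] = filesystem type
        if fields[4] == mount_point:
            found.append((fields[sep + 1], fields[3], fields[5]))
    return found


# Illustrative sample only: two mounts stacked on the same mount point.
SAMPLE = """\
100 90 0:30 / /sys/fs/cgroup ro,nosuid,nodev,noexec - cgroup2 cgroup rw
101 100 0:30 /user.slice /sys/fs/cgroup rw,nosuid,nodev,noexec - cgroup2 cgroup rw
102 90 0:5 / /proc rw,nosuid - proc proc rw
"""

for fstype, root, options in mounts_at(SAMPLE, "/sys/fs/cgroup"):
    print(fstype, root, options)
```

Inside a container, the same helper could be fed the real file with `mounts_at(open("/proc/self/mountinfo").read(), "/sys/fs/cgroup")`.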

[read more]

Rootless resources management with Podman on Fedora 30

I have finally opened some PRs for conmon and libpod that enable resource management for rootless Podman containers on Fedora 30 when using crun. This builds on the cgroups v2 delegation support added to crun earlier: Fedora 30 ships a kernel and systemd new enough to support the unified cgroup hierarchy, so with a single kernel command-line option and a small systemd drop-in, unprivileged users can now set memory and CPU limits on their containers without root access.

[read more]

Resources management with rootless containers and cgroups v2

cgroups v2 will finally allow unprivileged users to manage a cgroup hierarchy in a safe manner without requiring any additional permission. In the cgroups v1 model, writing to cgroup control files requires root, which means rootless containers cannot enforce memory limits or CPU quotas. The unified cgroups v2 hierarchy introduces a delegation mechanism where systemd can hand ownership of a subtree to a user process, enabling the OCI runtime to configure resource limits directly without any privileged helper.
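
As a rough illustration of what "configuring resource limits directly" means, the sketch below writes the cgroups v2 control files `memory.max` and `cpu.max` in a delegated directory. The function names and default period are assumptions for this example, and it is not how any particular OCI runtime is implemented; it only works on a cgroup directory the calling user already owns.

```python
import os


def cgroup_limits(memory_bytes=None, cpu_quota_us=None, cpu_period_us=100000):
    """Build the values for memory.max and cpu.max; None means unlimited."""
    memory = "max" if memory_bytes is None else str(memory_bytes)
    quota = "max" if cpu_quota_us is None else str(cpu_quota_us)
    return {
        "memory.max": memory,
        # cpu.max takes "<quota> <period>", both in microseconds.
        "cpu.max": f"{quota} {cpu_period_us}",
    }


def apply_limits(cgroup_dir, limits):
    """Write each limit to its control file inside a delegated cgroup."""
    for name, value in limits.items():
        with open(os.path.join(cgroup_dir, name), "w") as control_file:
            control_file.write(value)


# 256 MiB of memory and half of one CPU.
print(cgroup_limits(memory_bytes=256 * 1024 * 1024, cpu_quota_us=50000))
```

The directory passed to `apply_limits` is whatever subtree systemd delegated to the user; its path is deliberately left out here because it depends on the session.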

[read more]

Rootless containers @ devconf.cz

The video of the rootless containers talk from Devconf.cz 2019 is finally available on YouTube. The talk covers how user namespaces, fuse-overlayfs, and slirp4netns come together to allow running containers entirely as an unprivileged user, without any setuid helpers beyond newuidmap and newgidmap, and discusses the remaining challenges around cgroup resource management and overlay storage performance that still need to be addressed for rootless containers to reach full feature parity.

[read more]

SUID binaries from a user namespace

Additional IDs that are allocated to a user through /etc/subuid and /etc/subgid must be considered as permanently allocated and never reused for any other user. The reason is that a setuid binary created inside a user namespace can retain access to any UID that was mapped in that namespace, even after the namespace is destroyed. If the same UID range is later assigned to a different user, that new user would inherit access to files owned by the old user’s containers.
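
To make the "never reused" rule concrete, here is a small sketch that checks whether two users in a subuid-style file have been given overlapping ranges. It assumes the usual `name:start:count` line format; the helper names and the sample entries are hypothetical.

```python
def parse_subid(text):
    """Parse 'name:start:count' lines into (name, start, count) tuples."""
    ranges = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        name, start, count = line.split(":")
        ranges.append((name, int(start), int(count)))
    return ranges


def overlapping(ranges):
    """Return pairs of different users whose ID ranges intersect."""
    clashes = []
    ordered = sorted(ranges, key=lambda r: r[1])
    for i, (name1, start1, count1) in enumerate(ordered):
        for name2, start2, _ in ordered[i + 1:]:
            if start2 >= start1 + count1:
                break  # sorted by start, so later ranges cannot overlap either
            if name1 != name2:
                clashes.append((name1, name2))
    return clashes


# Hypothetical content: carol's range falls inside alice's.
SAMPLE = """\
alice:100000:65536
bob:165536:65536
carol:150000:1000
"""

print(overlapping(parse_subid(SAMPLE)))
```

A check like this only catches ranges that overlap at the same time; it cannot detect a range that was freed and later handed to someone else, which is the case the paragraph above warns about.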

[read more]

Disposable rootless sessions

It would be nice to have a way to “fork” the current session and be able to revert all the changes made, without leaving anything behind on the file system. With fuse-overlayfs, a user-space overlay filesystem that unprivileged users can mount, this turns out to be surprisingly straightforward: mount the entire root filesystem as the lower layer of an overlay, point the upper layer at a temporary directory, and every write is captured there and can be discarded at the end of the session, leaving the underlying system untouched.
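
The mount described above can be sketched as a single fuse-overlayfs invocation. The snippet below only builds the argument list, using the `lowerdir`, `upperdir` and `workdir` mount options; the helper name and the directory layout under the temporary directory are assumptions for illustration, not the exact commands from the full post.

```python
import os
import tempfile


def overlay_command(lower, upper, work, merged):
    """Build a fuse-overlayfs command line for one lower layer."""
    options = f"lowerdir={lower},upperdir={upper},workdir={work}"
    return ["fuse-overlayfs", "-o", options, merged]


# Scratch directories that receive every write made during the session.
scratch = tempfile.mkdtemp(prefix="session-")
upper = os.path.join(scratch, "upper")
work = os.path.join(scratch, "work")
merged = os.path.join(scratch, "merged")
for directory in (upper, work, merged):
    os.makedirs(directory)

# Only printed here; running it requires fuse-overlayfs to be installed.
print(" ".join(overlay_command("/", upper, work, merged)))
```

Discarding the session then amounts to unmounting the merged directory and deleting the scratch directory.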

[read more]

Rootless Podman from upstream on CentOS 7

This is the recipe I use to build Podman from upstream on CentOS 7 and use rootless containers. We need an updated version of the shadow utils, as newuidmap and newgidmap are not present on CentOS 7. The shadow utils are installed using “make install”, which is not the clean way to install packages and also overwrites the existing binaries, but that is fine on a development system. Podman is already packaged for CentOS 7, and in fact we install that package so we don’t have to worry about conmon and other dependencies.

[read more]

Network namespaces for unprivileged users

A couple of weekends ago I played with libslirp and put together slirp-forwarder. The challenge with network namespaces for unprivileged users is that creating TAP or TUN devices requires privileges in the host network namespace. SliRP sidesteps this by emulating a full TCP/IP stack entirely in user space, so the helper process can forward traffic to the outside world using only normal socket operations, without needing any elevated capability.

SliRP emulates a TCP/IP stack in user space. It can be used to work around the fact that an unprivileged user cannot create TAP/TUN devices in the host namespace. The program can run in the host namespace, receive messages from the network namespace where a TAP device is configured, and forward them to the outside world using unprivileged operations such as opening another connection to the destination host. Privileged operations are still not possible outside of the emulated network, as the helper program does not gain any privilege beyond those of the unprivileged user running it.

[read more]

Become-root in a user namespace

I’ve cleaned up some C files I was using locally for hacking with user namespaces and uploaded them to a new repository on GitHub: https://github.com/giuseppe/become-root. The tool creates a new user namespace and maps the caller to UID 0 inside it, while also mapping additional UIDs and GIDs from the ranges allocated in /etc/subuid and /etc/subgid. This is the foundation needed for rootless containers, which require a full UID/GID mapping, not just the single-UID mapping that unshare -r provides, to correctly represent file ownership inside container images.
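
The difference between the two mappings can be shown with a short sketch. It builds the argument list for newuidmap, which takes a PID followed by `inside outside count` triples; the function names, the PID and the ID values are made up for illustration, and this is not a description of how become-root itself is written.

```python
def single_id_map(own_id):
    """Only ID 0 is mapped, to the caller (the single-UID case)."""
    return [(0, own_id, 1)]


def full_id_map(own_id, sub_start, sub_count):
    """ID 0 maps to the caller, IDs 1..sub_count to the subordinate range."""
    return [(0, own_id, 1), (1, sub_start, sub_count)]


def newuidmap_args(pid, mapping):
    """Flatten a mapping into: newuidmap PID inside outside count ..."""
    args = ["newuidmap", str(pid)]
    for inside, outside, count in mapping:
        args += [str(inside), str(outside), str(count)]
    return args


# Hypothetical values: caller UID 1000, subordinate range 100000:65536.
print(newuidmap_args(4242, single_id_map(1000)))
print(newuidmap_args(4242, full_id_map(1000, 100000, 65536)))
```

With the single mapping, any file in an image owned by a UID other than 0 has no valid owner inside the namespace; the second mapping gives those UIDs somewhere to land.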

[read more]

Fuse-overlayfs moved to github.com/containers

The fuse-overlayfs project I have been working on over the last few weeks has moved under the github.com/containers umbrella. fuse-overlayfs is a user-space implementation of the overlay filesystem that can be mounted without root privileges, which is essential for rootless containers. With Linux 4.18 introducing the ability to mount FUSE filesystems inside user namespaces, this makes overlay-based storage finally usable by unprivileged container runtimes such as Podman.

With Linux 4.18 it will be possible to mount a FUSE file system in a user namespace. fuse-overlayfs is a user-space implementation of the overlay file system that is already present in the Linux kernel, where it can be mounted only by the root user. Union file systems have been around for a long time; they allow multiple layers to be stacked on top of each other, where usually only the last one is writable.

Overlay is a union file system widely used for mounting OCI images. Each OCI image is made up of different layers, and each layer can be shared by different images. A list of layers stacked on each other gives the final image that is used by a container. The last layer, which is writable, is specific to the container. This model enables different containers to use the same image, which is accessible read-only from the lower layers of the overlay file system.
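
The layering model can be illustrated with a toy in-memory version, where each layer is a dictionary from path to content. This is only a conceptual sketch with made-up names; it ignores directories, deletions and everything else a real overlay file system has to handle.

```python
def new_container(image_layers):
    """Stack a fresh writable layer on top of shared read-only image layers.

    image_layers is ordered bottom to top; the result is ordered top to bottom.
    """
    return [{}] + list(reversed(image_layers))


def read(container, path):
    """Return the content from the topmost layer that has the path."""
    for layer in container:
        if path in layer:
            return layer[path]
    return None


def write(container, path, data):
    """Writes always land in the container's own top layer."""
    container[0][path] = data


base = {"/etc/os-release": "base"}
app = {"/usr/bin/app": "v1"}

first = new_container([base, app])
second = new_container([base, app])
write(first, "/etc/os-release", "changed")

print(read(first, "/etc/os-release"))   # the first container sees its own copy
print(read(second, "/etc/os-release"))  # the second still reads the shared layer
```

Both containers reference the same `base` and `app` dictionaries, mirroring how the lower layers are shared while each container keeps its changes in a private upper layer.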

[read more]