Why do I have two /sys/fs/cgroup in my container
Users of unprivileged containers in Podman sometimes wonder why they see two /sys/fs/cgroup mounts inside the container. This happens when the container is not using a new network namespace. The duplication is not a bug: it is a consequence of how the kernel handles bind mounts that cross user namespace boundaries, combined with the need to give the container its own writable cgroup view.
The Limitation of Unprivileged Users
An unprivileged user, by definition, lacks certain permissions that
are available to the root user. One of these limitations is the
inability to mount a fresh /sys filesystem within a new user
namespace unless two conditions hold: a /sys filesystem is already
mounted and accessible in the current namespace, and the user
namespace owns the current network namespace.
When such conditions are not met, Podman uses a bind mount from the
/sys filesystem of the host to provide the container with a /sys
filesystem.
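The decision described above can be sketched as a small function. This is only an illustrative model of the two conditions, not Podman's actual code; the function name, parameter names, and return strings are all hypothetical.

```python
def choose_sys_strategy(owns_netns: bool, sysfs_visible: bool) -> str:
    """Pick how /sys could be provided to the container.

    Hypothetical helper: both parameters and both return values are
    made up for illustration and do not come from Podman.

    owns_netns    -- the container's user namespace owns its network namespace
    sysfs_visible -- a /sys filesystem is already mounted and accessible
    """
    if owns_netns and sysfs_visible:
        # Both conditions hold, so a fresh sysfs mount is permitted.
        return "mount-fresh-sysfs"
    # Otherwise fall back to bind mounting the host's /sys.
    return "bind-mount-host-sys"
```

Under this model, a container that shares the host network namespace gets `owns_netns=False`, so the function returns `"bind-mount-host-sys"`, which is the case the rest of this post discusses.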
Cross-Namespace Bind Mounts
A consequence of a bind mount that crosses two user namespaces is
that the kernel automatically ‘locks’ the new mount: /sys and the
mounts beneath it are treated as a single entity. This prevents the
container from unmounting the embedded /sys/fs/cgroup mount, as it is
considered part of the /sys mount itself.
New cgroup mount
The /sys/fs/cgroup mount, embedded within the /sys mount, refers to
the host environment’s cgroup mount. A fresh /sys/fs/cgroup mount is
needed for the container, which is then mounted on top of the existing
embedded mount.
The consequence of this approach is the appearance of two
/sys/fs/cgroup mounts within the container, as can be seen by
inspecting the container's mount table, for example in
/proc/self/mountinfo.
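One way to check for the stacked mounts is to parse /proc/self/mountinfo from inside the container and report every mount point that appears more than once. The following is a minimal sketch that assumes the standard mountinfo field layout (mount point in the fifth field, filesystem type right after the " - " separator); the function names are hypothetical.

```python
from collections import defaultdict


def parse_mountinfo(text):
    """Return (mount_point, fs_type, root) tuples from mountinfo-formatted text."""
    mounts = []
    for line in text.splitlines():
        if not line.strip():
            continue
        # Fields before " - " describe the mount; fields after it describe
        # the filesystem (type, source, superblock options).
        left, _, right = line.partition(" - ")
        left_fields = left.split()
        right_fields = right.split()
        if len(left_fields) < 6 or not right_fields:
            continue  # skip lines that do not look like mountinfo entries
        mounts.append((left_fields[4], right_fields[0], left_fields[3]))
    return mounts


def stacked_mounts(mounts):
    """Keep only the mount points that have more than one mount on them."""
    by_point = defaultdict(list)
    for mount_point, fs_type, root in mounts:
        by_point[mount_point].append((fs_type, root))
    return {point: entries for point, entries in by_point.items() if len(entries) > 1}


def read_mountinfo(path="/proc/self/mountinfo"):
    """Read the mount table of the current process (Linux only)."""
    with open(path) as f:
        return f.read()
```

Calling `stacked_mounts(parse_mountinfo(read_mountinfo()))` inside an affected container should list /sys/fs/cgroup with two entries: the host's cgroup mount embedded in the bind-mounted /sys, and the fresh one mounted on top of it.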