Resource management with rootless containers and cgroups v2


cgroups v2 will finally allow unprivileged users to manage a cgroup hierarchy in a safe manner without requiring any additional permission.

systemd has been mounting cgroups v2 under /sys/fs/cgroup/unified for a long time, although by default no controllers are enabled there and everything still works using cgroups v1.

It is also possible to use cgroups v2 only; this is known as the unified model. To enable it, specify systemd.unified_cgroup_hierarchy=1 on the kernel command line.
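A quick way to check which model a machine booted with is to look for the cgroup.controllers file, which exists only inside a cgroups v2 hierarchy. The helper below is a small sketch; the name is_unified is made up for this post, and the path to check is passed as a parameter, defaulting to /sys/fs/cgroup.

```shell
# is_unified [ROOT]: succeed if ROOT looks like the root of a cgroups v2 mount.
# cgroup.controllers is a cgroups v2 file, so it is absent when ROOT is the
# tmpfs that holds the per-controller cgroups v1 mounts.
is_unified() {
    [ -f "${1:-/sys/fs/cgroup}/cgroup.controllers" ]
}
```

For example, is_unified && echo "cgroups v2 only" prints the message only when the unified model is active.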

There is an issue in D-Bus when the user is running inside a user namespace. The D-Bus request includes the result of geteuid(), but since that value is relative to the namespace rather than to the user on the host, it won’t match and the request fails. If you are going to play with it and launch the container from within a user namespace, be sure to use this patch: https://github.com/systemd/systemd/pull/11785.

To get it working, I had to manually enable some of the controllers for the unprivileged users, as root:

echo +cpu +cpuset +io +memory +pids > /sys/fs/cgroup/user.slice/cgroup.subtree_control

You’ll need to propagate it down the hierarchy to the user service slice.
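Propagating the controllers means repeating the same write at each level, since a cgroup can only hand its children the controllers its own parent enabled for it. The sketch below assumes the usual systemd layout (user.slice/user-UID.slice/user@UID.service); the function name and its parameters are invented for illustration, and the paths may differ on your system.

```shell
# enable_controllers ROOT UID: enable the controllers in user.slice and in
# user-UID.slice, so that they become available to user@UID.service.
# The directory layout is an assumption based on systemd's defaults.
enable_controllers() {
    cg_root=$1
    cg_uid=$2
    for cg_dir in \
        "$cg_root/user.slice" \
        "$cg_root/user.slice/user-$cg_uid.slice"
    do
        echo "+cpu +cpuset +io +memory +pids" > "$cg_dir/cgroup.subtree_control" || return 1
    done
}
```

As root, it would be called as enable_controllers /sys/fs/cgroup 1000 for the user with UID 1000.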

Be sure there are no real-time processes running, or the cpu controller cannot be enabled. If you hit an error like error: Invalid argument when enabling the cgroups v2 controllers, you can try to fix it by disabling PulseAudio and rtkit-daemon. If it still doesn’t work, check whether other real-time processes are running; you can find them with:

ps ax -L -o 'pid tid cls rtprio comm' | grep RR

I’ve added some basic support for cgroups v2 to the crun OCI runtime (https://github.com/giuseppe/crun/pull/11). The implementation is not complete yet, but it already supports the cpu, io, memory and pids controllers. Other controllers must be implemented through eBPF. The freezer controller is still being worked on in the kernel. In the crun implementation, systemd, when present, is used only for the delegation of the hierarchy; all the configuration happens by writing directly to the cgroups files. This will enable crun to work with cgroups v2 even when systemd is not available.

Since the OCI runtime specification was designed with cgroups v1 in mind, I have tried to convert from the cgroups v1 configuration to cgroups v2. For instance, blkio.weight is converted linearly from the range 10-1000 to the range 1-10000 to accommodate what io.weight expects.
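To make the linear conversion concrete, here is one way to write it with integer arithmetic. This is a sketch of the mapping described above, not necessarily the exact expression or rounding crun uses; the function name is made up for the example.

```shell
# blkio_to_io_weight WEIGHT: map a cgroups v1 blkio.weight (10-1000) onto the
# cgroups v2 io.weight range (1-10000). The endpoints map exactly (10 -> 1,
# 1000 -> 10000); values in between are truncated by the integer division.
blkio_to_io_weight() {
    echo $(( 1 + ($1 - 10) * 9999 / 990 ))
}
```

With this formula, blkio_to_io_weight 500 prints 4950.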

With that in place, we can now ask systemd to delegate an entire cgroups v2 subtree to the container and manage it directly as an unprivileged user.

Using an OCI configuration that includes:

{
...
    "process": {
        "args": [
            "cat", "/sys/fs/cgroup/memory.max"
        ]
    },
...
    "linux": {
        "resources": {
            "memory": {
                "limit": 1000000000
            }
        }
    }
...
}

As an unprivileged user, we can then run:

$ crun --systemd run foo
 999997440
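The value printed is slightly lower than the configured limit because the kernel accounts memory in whole pages, so memory.max reports the limit rounded down to a multiple of the page size. The sketch below reproduces the arithmetic, assuming 4096-byte pages; the function name is only illustrative.

```shell
# round_to_page BYTES [PAGE_SIZE]: round BYTES down to a multiple of the page
# size. PAGE_SIZE defaults to 4096, which is an assumption about the host.
round_to_page() {
    page_size=${2:-4096}
    echo $(( $1 / page_size * page_size ))
}
```

Running round_to_page 1000000000 prints 999997440, matching the output above.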

Next steps:

  • support cgroups v2 in Podman and conmon. Since the OCI runtime configuration won’t change, there probably won’t be much to fix here.
  • add support to crun for more controllers using eBPF.