One annoying issue with setting a memory limit for a container is that the kernel's OOM killer can leave the container in an inconsistent state, with only some of its processes terminated.
When the system or a cgroup runs out of memory, the OOM killer is triggered and the kernel tries to free some memory.
It iterates over the candidate processes to terminate: either any process on the host, or only the processes in the cgroup when the OOM condition is local to the cgroup. For each candidate it calculates a badness score and then kills the process with the highest score.
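As a rough model of that selection: the kernel exposes the computed badness of each process in /proc/&lt;pid&gt;/oom_score, so we can sketch the victim-picking loop in a few lines of Go. This is only an illustration of the idea, not the kernel's actual code, and the candidate PIDs are placeholders:

```go
// Sketch of OOM victim selection: read the kernel-computed badness
// from /proc/<pid>/oom_score for each candidate and pick the highest.
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

func oomScore(pid int) (int, error) {
	data, err := os.ReadFile(fmt.Sprintf("/proc/%d/oom_score", pid))
	if err != nil {
		return 0, err
	}
	return strconv.Atoi(strings.TrimSpace(string(data)))
}

func pickVictim(pids []int) (victim, best int) {
	victim = -1
	for _, pid := range pids {
		score, err := oomScore(pid)
		if err != nil {
			continue // the process may have exited already
		}
		if victim == -1 || score > best {
			victim, best = pid, score
		}
	}
	return victim, best
}

func main() {
	victim, score := pickVictim([]int{1234, 5678}) // placeholder PIDs
	fmt.Printf("would kill pid %d (badness %d)\n", victim, score)
}
```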
The badness heuristic has changed a few times. In its current form it takes into account how much memory the process uses, whether the process is killable, and a per-process adjustment that can be set from user space through /proc/&lt;pid&gt;/oom_score_adj.
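The oom_score_adj file accepts values from -1000 (never kill this process) to 1000 (prefer to kill it), and the kernel adds the value to the computed score. A minimal sketch, assuming enough privileges to lower the value (the helper name is just for illustration):

```go
// Sketch: adjust a process's OOM badness from user space by writing
// to /proc/<pid>/oom_score_adj (-1000 exempts the process entirely).
package main

import (
	"fmt"
	"os"
)

// setOOMScoreAdj is an illustrative helper, not a standard API.
func setOOMScoreAdj(pid, adj int) error {
	path := fmt.Sprintf("/proc/%d/oom_score_adj", pid)
	return os.WriteFile(path, []byte(fmt.Sprintf("%d", adj)), 0o644)
}

func main() {
	// Protect the current process from the OOM killer; lowering the
	// value requires privileges (CAP_SYS_RESOURCE).
	if err := setOOMScoreAdj(os.Getpid(), -1000); err != nil {
		fmt.Fprintln(os.Stderr, err)
	}
}
```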
The OOM killer works in a similar way whether the entire system is running low on memory or a memory cgroup limit is being violated. The difference is in the set of processes considered for termination.
If the cgroup has reached its memory limit, only one process is terminated. In most cases this leaves the container in an inconsistent state, with the remaining processes still running.
A new knob was added for cgroup v2 with the patch:
commit 3d8b38eb81cac81395f6a823f6bf401b327268e6
Author: Roman Gushchin <[email protected]>
Date: Tue Aug 21 21:53:54 2018 -0700
mm, oom: introduce memory.oom.group
For some workloads an intervention from the OOM killer can be painful.
Killing a random task can bring the workload into an inconsistent state.
....
If memory.oom.group is set, the entire cgroup is killed as an indivisible unit.
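Outside of containers, enabling the behavior is just a write of "1" to the memory.oom.group file of a cgroup v2 group. A minimal sketch, assuming a hypothetical cgroup path and setting a 128 MiB limit via memory.max so the group-level OOM killer can actually trigger:

```go
// Sketch: opt a cgroup v2 group into whole-group OOM kills.
package main

import (
	"os"
	"path/filepath"
)

func enableOOMGroup(cgroup string) error {
	return os.WriteFile(filepath.Join(cgroup, "memory.oom.group"), []byte("1"), 0o644)
}

func main() {
	cg := "/sys/fs/cgroup/mycontainer" // hypothetical cgroup path

	// Set a memory limit of 128 MiB for the group.
	if err := os.WriteFile(filepath.Join(cg, "memory.max"), []byte("134217728"), 0o644); err != nil {
		panic(err)
	}
	// From now on, an OOM in this cgroup kills every process in it.
	if err := enableOOMGroup(cg); err != nil {
		panic(err)
	}
}
```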
Unfortunately OCI containers cannot take advantage of this feature yet, as there is no way to specify the setting in the current version of the OCI runtime specs.
OCI container adoption
The discussion about adding cgroup v2 support to the runtime specs is still under review: runtime-specs cgroup v2 support. Once that lands, we can extend the container runtimes to set this configuration when it is the desired behavior.
The memory.oom.group setting can be specified at any level in the cgroup hierarchy.
In the Kubernetes world, we could support both a per-container and a per-pod OOM group mode. In the per-container mode, only the processes of a single container are terminated on OOM. If instead the setting is configured at the pod level, an OOM event terminates the entire pod without leaving any process behind. A sketch of the two modes follows.
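This relies on the fact that memory.oom.group set on a non-leaf cgroup also covers its descendants. The cgroup paths below are illustrative only; the real layout depends on the cgroup driver in use:

```go
// Sketch of per-pod vs. per-container OOM group modes. Per-pod sets
// memory.oom.group once on the pod cgroup (covering all children);
// per-container sets it on each container cgroup individually.
package main

import (
	"os"
	"path/filepath"
)

func setOOMGroup(cgroup string, on bool) error {
	v := "0"
	if on {
		v = "1"
	}
	return os.WriteFile(filepath.Join(cgroup, "memory.oom.group"), []byte(v), 0o644)
}

func main() {
	pod := "/sys/fs/cgroup/kubepods/pod1234" // hypothetical pod cgroup
	containers := []string{"ctr-a", "ctr-b"} // hypothetical child cgroups

	perPod := true
	if perPod {
		// Per-pod mode: one switch at the pod level makes the whole
		// pod an indivisible kill unit.
		if err := setOOMGroup(pod, true); err != nil {
			panic(err)
		}
	} else {
		// Per-container mode: each container is its own kill unit.
		for _, c := range containers {
			if err := setOOMGroup(filepath.Join(pod, c), true); err != nil {
				panic(err)
			}
		}
	}
}
```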
The main difficulty with the per-pod mode is that the shim processes that usually run in the pod cgroup must be moved somewhere else, otherwise they too will be terminated when the OOM killer takes down the group.
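In cgroup v2, migrating a process means writing its PID to the destination cgroup's cgroup.procs file, so a runtime could move the shim out of the pod cgroup before enabling memory.oom.group. A sketch with a placeholder PID and destination path:

```go
// Sketch: move a process (e.g. a container shim) into a different
// cgroup so it survives a whole-group OOM kill of the pod cgroup.
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

func moveToCgroup(pid int, destCgroup string) error {
	// Writing a PID to cgroup.procs migrates that process into the
	// destination cgroup.
	procs := filepath.Join(destCgroup, "cgroup.procs")
	return os.WriteFile(procs, []byte(fmt.Sprintf("%d", pid)), 0o644)
}

func main() {
	shimPID := 4321 // hypothetical shim process
	if err := moveToCgroup(shimPID, "/sys/fs/cgroup/runtime-shims"); err != nil {
		panic(err)
	}
}
```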