An interesting issue was opened for crun a couple of days ago.
The issue reports that:
runc (v1.1.4) accepts the following .linux.seccomp configuration (sendmsg is in the SCMP_ACT_NOTIFY list), but crun (v1.5, also tested v0.19) just hangs.
"seccomp": {
    "defaultAction": "SCMP_ACT_ALLOW",
    "listenerPath": "/tmp/foo.sock",
    "syscalls": [
        {
            "names": [
                "sendmsg"
            ],
            "action": "SCMP_ACT_NOTIFY"
        }
    ]
}
seccomp has a feature, user-space notifications, that allows a process
to intercept syscalls and handle them in a custom way in userspace.
When a filter uses the SCMP_ACT_NOTIFY action and the flags argument
passed to the seccomp(2) syscall contains
SECCOMP_FILTER_FLAG_NEW_LISTENER, the kernel creates a file descriptor
and returns it to the caller. The file descriptor is used to receive
notifications from the kernel for the intercepted syscalls.
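To make the mechanism concrete, here is a minimal sketch of asking the kernel for a listener without going through libseccomp. The names install_notify_filter and probe_notify_listener are made up for this example. The filter skips the architecture check that a real filter needs, and the code assumes Linux 5.0 or newer with matching kernel headers. The probe runs in a forked child so the calling process is never filtered.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <stddef.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: make `syscall_nr` raise a user-space notification
   and allow every other syscall.  Returns the listener fd, or -1.  */
int
install_notify_filter (int syscall_nr)
{
  struct sock_filter insns[] = {
    /* Load the syscall number from struct seccomp_data.  */
    BPF_STMT (BPF_LD | BPF_W | BPF_ABS, offsetof (struct seccomp_data, nr)),
    /* On a match fall through to USER_NOTIF, otherwise skip to ALLOW.  */
    BPF_JUMP (BPF_JMP | BPF_JEQ | BPF_K, (unsigned int) syscall_nr, 0, 1),
    BPF_STMT (BPF_RET | BPF_K, SECCOMP_RET_USER_NOTIF),
    BPF_STMT (BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
  };
  struct sock_fprog prog = {
    .len = sizeof (insns) / sizeof (insns[0]),
    .filter = insns,
  };

  /* Needed so an unprivileged process may install a filter.  */
  if (prctl (PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) < 0)
    return -1;

  /* With NEW_LISTENER the return value is the notification fd.  */
  return (int) syscall (SYS_seccomp, SECCOMP_SET_MODE_FILTER,
                        SECCOMP_FILTER_FLAG_NEW_LISTENER, &prog);
}

/* Returns 1 if a listener fd was obtained in a child process, 0 if the
   kernel or sandbox refused, -1 if the child could not be created.  */
int
probe_notify_listener (void)
{
  int status = 0;
  pid_t pid = fork ();

  if (pid < 0)
    return -1;
  if (pid == 0)
    {
      int fd = install_notify_filter (SYS_sendmsg);
      /* The child must not call sendmsg from here on; it only exits.  */
      _exit (fd >= 0 ? 0 : 1);
    }
  if (waitpid (pid, &status, 0) < 0)
    return -1;
  return WIFEXITED (status) && WEXITSTATUS (status) == 0;
}
```

Whoever holds the returned fd is expected to read notifications from it and answer them. Until someone does, any thread that calls the filtered syscall stays blocked.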
The OCI runtime doesn't handle these notifications directly, so the file descriptor is passed to a different process.
In the OCI configuration file consumed by the OCI runtime, the
listenerPath
is the path to a UNIX socket that will receive the
seccomp listener file descriptor once crun has it.
What crun did, and what caused the hang, was to naively use
sendmsg(2) to send the listener fd to the specified socket right after
the seccomp filter was installed. The sendmsg call itself is therefore
intercepted, but no process has received the listener file descriptor
yet, so nothing can answer the notification and the call hangs.
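For readers who have not passed a file descriptor over a UNIX socket before, the following is a minimal sketch of the SCM_RIGHTS mechanism that sendmsg(2) provides. It is not crun's send_fd_to_socket_with_payload: the names send_fd, recv_fd and fd_passing_roundtrip are invented for this example, and a socketpair stands in for the listenerPath socket.

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <unistd.h>

/* Control buffer large enough for one fd, aligned for struct cmsghdr.  */
union fd_ctrl
{
  char buf[CMSG_SPACE (sizeof (int))];
  struct cmsghdr align;
};

/* Send `fd` over the UNIX socket `sock` with a one-byte payload.  */
int
send_fd (int sock, int fd)
{
  char byte = 0;
  struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
  union fd_ctrl ctrl;
  struct msghdr msg;
  struct cmsghdr *cmsg;

  memset (&ctrl, 0, sizeof (ctrl));
  memset (&msg, 0, sizeof (msg));
  msg.msg_iov = &iov;
  msg.msg_iovlen = 1;
  msg.msg_control = ctrl.buf;
  msg.msg_controllen = sizeof (ctrl.buf);

  cmsg = CMSG_FIRSTHDR (&msg);
  cmsg->cmsg_level = SOL_SOCKET;
  cmsg->cmsg_type = SCM_RIGHTS;
  cmsg->cmsg_len = CMSG_LEN (sizeof (int));
  memcpy (CMSG_DATA (cmsg), &fd, sizeof (int));

  return sendmsg (sock, &msg, 0) == 1 ? 0 : -1;
}

/* Receive one fd from `sock`.  Returns the new fd, or -1.  */
int
recv_fd (int sock)
{
  char byte = 0;
  struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
  union fd_ctrl ctrl;
  struct msghdr msg;
  struct cmsghdr *cmsg;
  int fd = -1;

  memset (&ctrl, 0, sizeof (ctrl));
  memset (&msg, 0, sizeof (msg));
  msg.msg_iov = &iov;
  msg.msg_iovlen = 1;
  msg.msg_control = ctrl.buf;
  msg.msg_controllen = sizeof (ctrl.buf);

  if (recvmsg (sock, &msg, 0) <= 0)
    return -1;
  cmsg = CMSG_FIRSTHDR (&msg);
  if (cmsg == NULL || cmsg->cmsg_level != SOL_SOCKET
      || cmsg->cmsg_type != SCM_RIGHTS
      || cmsg->cmsg_len != CMSG_LEN (sizeof (int)))
    return -1;
  memcpy (&fd, CMSG_DATA (cmsg), sizeof (int));
  return fd;
}

/* Pass the write end of a pipe through a socketpair and check that the
   received copy refers to the same pipe.  Returns 0 on success.  */
int
fd_passing_roundtrip (void)
{
  int sv[2], p[2], copy, ok;
  char c = 0;

  if (socketpair (AF_UNIX, SOCK_STREAM, 0, sv) < 0 || pipe (p) < 0)
    return -1;
  if (send_fd (sv[0], p[1]) < 0 || (copy = recv_fd (sv[1])) < 0)
    return -1;
  ok = write (copy, "k", 1) == 1 && read (p[0], &c, 1) == 1 && c == 'k';
  close (copy);
  close (p[0]);
  close (p[1]);
  close (sv[0]);
  close (sv[1]);
  return ok ? 0 : -1;
}
```

The important detail for this post is that the transfer is a sendmsg(2) call made by the sender, so it is subject to whatever seccomp filter the sender is running under.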
What to do?
The problem we need to solve is to send the file descriptor from an
environment where sendmsg is not blocked.
This is easily achieved with a helper process that is created just before the seccomp filter is installed. The helper process is responsible for sending the file descriptor to the specified socket.
From the issue report, it seems that runc has already solved the
problem by using a pipe to tell the helper process which fd
contains the seccomp listener, and then letting the helper process
retrieve the file descriptor with the pidfd_getfd(2) syscall.
Two issues with this approach are:
- it requires a new kernel feature, pidfd_getfd(2).
- it still expects write(2) to not be filtered by seccomp.
The first issue can be solved with a different approach: instead
of using pidfd_getfd(2), we can fork the helper process with the
CLONE_FILES flag, so the helper process shares the file descriptor
table with the parent process and sees any file descriptor the parent
opens later!
We still need to solve the second issue, but we can do that by using a shared memory region and letting the helper process poll the region in a busy loop until it contains the file descriptor number.
Shared memory
The shared memory region is backed by a memfd created as:
/* memfd_create(2) takes MFD_* flags, not open(2) flags.  */
memfd = memfd_create ("seccomp-helper-memfd", MFD_CLOEXEC);
if (UNLIKELY (memfd < 0))
return crun_make_error (err, errno, "memfd_create");
ret = ftruncate (memfd, sizeof (atomic_int));
if (UNLIKELY (ret < 0))
return crun_make_error (err, errno, "ftruncate seccomp memfd");
ret = libcrun_mmap (&mmap_region, NULL, sizeof (atomic_int),
PROT_WRITE | PROT_READ, MAP_SHARED, memfd, 0, err);
if (UNLIKELY (ret < 0))
return ret;
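The first block creates the memfd file, the second one resizes it to the size of an atomic int, and the third one maps it in memory. A freshly resized memfd is zero-filled, so the region must be set to -1 before the helper starts, because the helper below treats -1 as "not written yet".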
Helper process
Now that there is a way for the two processes to communicate without using any syscall, we can look at the helper process, which just does:
helper_proc = syscall_clone (CLONE_FILES | SIGCHLD, NULL);
if (UNLIKELY (helper_proc < 0))
return crun_make_error (err, errno, "clone seccomp listener helper process");
if (helper_proc == 0)
{
int fd;
prctl (PR_SET_PDEATHSIG, SIGKILL);
/* Poll the shared region until the parent stores the listener fd.  */
for (;;)
{
fd = *fd_received;
if (fd == -1)
{
usleep (1000);
continue;
}
break;
}
ret = send_fd_to_socket_with_payload (listener_receiver_fd, fd,
receiver_fd_payload,
receiver_fd_payload_len,
err);
if (UNLIKELY (ret < 0))
_exit (crun_error_get_errno (err));
_exit (0);
}
The prctl(2) call is used to make sure that the helper process won't
survive its parent process.
Once the fd is retrieved from the shared memory region, the
send_fd_to_socket_with_payload function sends it to the receiver
socket using the sendmsg(2) syscall.
Main process
The main process, the one that will eventually execve the container program, just does:
ret = syscall_seccomp (SECCOMP_SET_MODE_FILTER, flags, &seccomp_filter);
if (UNLIKELY (ret < 0))
return crun_make_error (err, errno, "seccomp (SECCOMP_SET_MODE_FILTER)");
if (listener_receiver_fd >= 0)
{
atomic_int *fd_to_send = mmap_region->addr;
int status = 0;
*fd_to_send = listener_fd = ret;
ret = waitpid (helper_proc, &status, 0);
...
}
The syscall_seccomp function is a wrapper around the seccomp(2)
syscall to install the seccomp filter and retrieve the listener fd.
The *fd_to_send = listener_fd = ret; assignment writes the listener
file descriptor to the shared memory, where the helper process will
pick it up. The main process then waits with waitpid(2) for the helper
to exit.
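The handshake between the two processes can be tried outside crun with a toy program. The sketch below is not crun's code: clone_files_handoff is a made-up name, an anonymous shared mapping stands in for the memfd, and a pipe stands in for the seccomp listener. It only shows that a helper created with CLONE_FILES can use a file descriptor the parent opened after the clone, having learned its number through shared memory alone. It assumes an architecture where flags is the first argument of the raw clone syscall, such as x86-64 or aarch64.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sched.h>
#include <signal.h>
#include <stdatomic.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Returns 0 if the helper managed to write to a pipe that the parent
   created after the clone, -1 otherwise.  */
int
clone_files_handoff (void)
{
  int p[2], status = 0, ok;
  char c = 0;
  pid_t helper;
  atomic_int *slot;

  /* Shared between the processes even though CLONE_VM is not used.  */
  slot = mmap (NULL, sizeof (atomic_int), PROT_READ | PROT_WRITE,
               MAP_SHARED | MAP_ANONYMOUS, -1, 0);
  if (slot == MAP_FAILED)
    return -1;
  atomic_store (slot, -1); /* -1 means "no fd published yet".  */

  /* Raw clone with a NULL stack behaves like fork, except that the fd
     table is shared.  Argument order is an x86-64/aarch64 assumption.  */
  helper = (pid_t) syscall (SYS_clone,
                            (unsigned long) (CLONE_FILES | SIGCHLD),
                            0UL, 0UL, 0UL, 0UL);
  if (helper < 0)
    return -1;
  if (helper == 0)
    {
      int fd;
      /* Wait for the fd number using only memory reads and sleeps.  */
      while ((fd = atomic_load (slot)) == -1)
        usleep (1000);
      _exit (write (fd, "x", 1) == 1 ? 0 : 1);
    }

  /* The fd is created only after the helper already exists.  */
  if (pipe (p) < 0)
    {
      kill (helper, SIGKILL);
      waitpid (helper, NULL, 0);
      return -1;
    }
  atomic_store (slot, p[1]);

  if (waitpid (helper, &status, 0) < 0)
    return -1;
  close (p[1]);
  ok = WIFEXITED (status) && WEXITSTATUS (status) == 0
       && read (p[0], &c, 1) == 1 && c == 'x';
  close (p[0]);
  munmap (slot, sizeof (atomic_int));
  return ok ? 0 : -1;
}
```

In the real runtime the parent has the seccomp filter installed at the point where it publishes the number, which is why the publication has to be a plain store to memory rather than a write(2) on a pipe.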
Conclusion
With all of this in place, crun accepts a seccomp profile with no
limitations on which syscalls can be intercepted with SCMP_ACT_NOTIFY.
The process that receives the seccomp listener must still ensure that
all syscalls made until the execve(2) syscall are allowed to proceed,
otherwise the OCI runtime will fail to start the container.