Okay, I've figured it out. This is really dumb. tl;dr: This is really an AppArmor bug (or even a design flaw if you prefer).
For context, the file we are trying to write to is /proc/sys/net/ipv4/ip_unprivileged_port_start. @stgraber figured out that the problematic AppArmor rules are the ones in the LXC profile that block writing to most /sys files. How is it possible that one affects the other?
Well, the problem is that runc now uses a detached mount of procfs to operate on (this avoids mount race attacks). Because a detached mount has not been attached anywhere in the filesystem tree, d_path (the kernel's facility for generating path names for dentries) just generates a name that looks like /foo if you try to open a file foo inside the detached procfs mount. AFAICS this is what AppArmor uses to determine what file you are trying to write to (because AppArmor is path-based, and d_path is the only way to get pathnames from dentries).
This means that when we try to write to /proc/sys/net/ipv4/ip_unprivileged_port_start through the detached mount, AppArmor sees this as us trying to write to /sys/net/ipv4/ip_unprivileged_port_start, which is forbidden by the /sys denial rules. I have attached a program that shows this behaviour using a detached tmpfs mount; it's trivial to trigger:
% ./aa-bug &
c1:~ # ./aa-bug &
fd: /proc/2061/fd/5
[1] 2061
c1:~ # mkdir /proc/2061/fd/5/sys
c1:~ # mkdir /proc/2061/fd/5/sys/foo
mkdir: cannot create directory ‘/proc/2061/fd/5/sys/foo’: Permission denied
aa-bug.go.txt
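To make the name confusion concrete, here is a toy model in Go. This is not the attached aa-bug program and it does not touch the kernel or AppArmor at all. apparentPath and deniedBySysRule are hypothetical helpers made up for illustration: the first assumes the generated name is simply the attachment point joined with the path inside the mount, and the second is a crude prefix check standing in for the real /sys denial rules.

```go
package main

import (
	"fmt"
	"path"
	"strings"
)

// apparentPath is a hypothetical model of the name a path-based check ends up
// with: the mount's attachment point (empty for a detached mount) joined with
// the path of the file inside that mount.
func apparentPath(attachPoint, inMount string) string {
	return path.Join("/", attachPoint, inMount)
}

// deniedBySysRule stands in for the profile's /sys write denials. It is a
// plain prefix check, far cruder than a real AppArmor rule.
func deniedBySysRule(name string) bool {
	return strings.HasPrefix(name, "/sys/")
}

func main() {
	const sysctl = "sys/net/ipv4/ip_unprivileged_port_start"

	// First the normal case (procfs attached at /proc), then the detached case.
	for _, attach := range []string{"/proc", ""} {
		name := apparentPath(attach, sysctl)
		fmt.Printf("attach=%q name=%s denied=%v\n", attach, name, deniedBySysRule(name))
	}
}
```

Under these assumptions the attached case yields a name under /proc that the /sys check ignores, while the detached case yields a name starting with /sys/ that the check rejects, even though both refer to the same sysctl file.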
There is a trivial workaround for this particular sysctl:
- deny /sys/[^fdck]*{,/**} wklx,
+ deny /sys/[^fdckn]*{,/**} wklx,
(In /etc/apparmor.d/abstractions/lxc/container-base.)
But this doesn't help in the general case for all sysctls. @stgraber has just submitted lxc/incus#2624 which just removes these rules entirely. I think AppArmor should not do this, because it's incredibly broken (literally any detached mount could match against a rule by accident), but this is unfortunately how AppArmor's design works.
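As a rough illustration of why the one-character workaround only covers this sysctl, the sketch below approximates the old and new deny rules with Go regular expressions. The translation is my own simplification (the leading class is kept from matching /, * is treated as "anything except /", and {,/**} as an optional /-prefixed tail); real AppArmor globbing has more corner cases, and the other deny rules in the profile are ignored here.

```go
package main

import (
	"fmt"
	"regexp"
)

// Approximate regex translations of the two deny rules (an assumption, not
// how AppArmor itself compiles them):
//
//	deny /sys/[^fdck]*{,/**} wklx,
//	deny /sys/[^fdckn]*{,/**} wklx,
var (
	oldRule = regexp.MustCompile(`^/sys/[^fdck/][^/]*(/.*)?$`)
	newRule = regexp.MustCompile(`^/sys/[^fdckn/][^/]*(/.*)?$`)
)

func main() {
	// Names as they would appear for files under /proc/sys when opened
	// through a detached procfs mount (the leading /proc is missing).
	paths := []string{
		"/sys/net/ipv4/ip_unprivileged_port_start",
		"/sys/vm/overcommit_memory",
		"/sys/user/max_user_namespaces",
		"/sys/kernel/hostname",
	}
	for _, p := range paths {
		fmt.Printf("%-45s old=%-5v new=%v\n", p, oldRule.MatchString(p), newRule.MatchString(p))
	}
}
```

With this approximation, the net sysctl stops matching once n is added to the character class, but sysctls under vm or user still match both versions of the rule, which is the general-case problem.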
From runc's side, we could in theory use this to our advantage -- if we created a tmpfs with a subpath like .go-away-apparmor and then attached our procfs mount to that path, we might be able to subvert AppArmor. However, this has a risk of causing lifetime issues that would require a rework of how we do lookups -- the tmpfs must not be closed after we attach to it because it will lazy-unmount the procfs...
Originally posted by @cyphar in #12484