For some time now the Linux kernel has been supporting a capabilities(7) based permission control model. In theory this allows assigning fine-grained permissions to processes so that processes that previously required UID 0/root permissions don’t need these any more. In practice though, uptake of this feature is relatively low, and actually trying to use it is hampered by confusing vocabulary and non-intuitive semantics.
So what’s the story?
All special access permission exemptions that were previously exclusively attached to UID 0 are now associated with a capability. Examples for these are: CAP_FOWNER (bypass file permission checks), CAP_KILL (bypass permission checks for sending signals), CAP_NET_RAW (use raw sockets), CAP_NET_BIND_SERVICE (bind a socket to Internet domain privileged ports).
Capabilities can be bestowed on execution (similar to how SUID operates) or be inherited from a parent process. So in theory it should be possible to, for example, start an Apache web server on port 80 as a normal user with no root access at all, if you can provide it with the CAP_NET_BIND_SERVICE capability. Another example: Wireshark only needs the CAP_NET_RAW and CAP_NET_ADMIN capabilities. It is highly undesirable to run the main UI and protocol parsers as root, and slightly less desirable to run dumpcap, which is the helper tool that Wireshark actually uses to sniff traffic, as root. Instead, the preferred installation method on Debian systems is to set the dumpcap binary up so that it automatically gains the required privileges on execution, and then limit execution of the binary to a certain group of users.
Gaining and giving capabilities
This is the most confusing part, because a) it doesn’t behave intuitively in the “just like suid-root” mental model, and b) uses the same words for completely different functions.
Conceptually capabilities are maintained in sets, which are represented as bit masks. For all running processes capability information is maintained per thread; for binaries in the file system it’s stored in extended attributes. Thread capability sets are copied on fork() and specially transformed on execve(), as discussed below.
Several different capability sets and related variables exist. In the documentation these are treated as somewhat symmetrical for files and threads, but in reality they are not, so I’ll describe them one by one:
- Thread permitted set
- This is a superset of capabilities that the thread may add to either the thread permitted or thread inheritable sets. The thread can use the capset() system call to manage capabilities: It may drop any capability from any set, but only add capabilities to its thread effective and inherited sets that are in its thread permitted set. Consequently it cannot add any capability to its thread permitted set, unless it has the CAP_SETPCAP capability in its thread effective set.
- Thread effective set
- This is the actual set of capabilities that the kernel uses for permission checks.
- Thread inheritable set
- This is a set that plays a role in bequeathing capabilities to other binaries. It would more properly be called ‘bequeathable’: a capability not in this set cannot be inherited by a different binary through the inheritance process. However, being in this set does also not automatically make a binary inherit the capability. Also note that ‘inheriting’ a capability does not necessarily automatically give any thread effective capabilities: ‘inherited’ capabilities only directly influence the new thread permitted set.
- File permitted set
- This is a set of capabilities that are added to the thread permitted set on binary execution (limited by cap_bset).
- File inheritable set
- This set plays a role in inheriting capabilities from another binary: the intersection (logical AND) of the thread inheritable and file inheritable sets are added to the thread permitted set after the execve() is successful.
- File effective flag
- This is actually just a flag: When the flag is set, then the thread effective set after execve() is set to the thread permitted set, otherwise it’s empty.
- cap_bset
- This is a bounding capability set which can mask out (by ANDing) file permitted capabilities, and some other stuff. I’ll not discuss it further and just assume that it contains everything.
Based on these definitions the documentation gives a concise algorithm for the transformation that is applied on execve() (new and old relate to the thread capability sets before and after the execve(), file refers to the binary file being executed):
- New thread permitted = (old thread inheritable AND file inheritable) OR (file permitted AND cap_bset)
- New thread effective = new thread permitted, if file effective flag set, 0 otherwise
- New thread inheritable = old thread inheritable
This simple definition has some surprising (to me) consequences:
- The ‘file inheritable set’ is not related to the ‘thread inheritable set’. Having a capability in the file inheritable set of a binary will not put that capability into the resulting processes thread inheritable set. In other words: A thread that wants to bequeath a capability to a different binary needs to explicitly add the capability to its thread inheritable set through setcap().
- Conversely the ‘thread inheritable set’ is not solely responsible for bequeathing a capability to a different binary. The binary also needs to be allowed to receive the capability by setting it in the file inheritable set.
- Bequeathing a capability to a different binary by default only gives it the theoretical ability to use the capability. To become effective, the target process must add the capability to its effective set using setcap(). Or the file effective flag must be set.
- A nice side effect of the simple copy operation used for the thread inheritable set: A capability can be passed in the thread inheritable set through multiple intermediate fork() and execve() calls to a target process at the end of a very long chain without becoming effective in the middle.
- The relevant file capability sets are those of the binary being executed. When trying to give permitted capabilities to an interpreted script, the capabilities must be in the file inheritable set of the interpreter binary. Additionally: If the script can’t/won’t call capset(), the file effective flag must be set on the interpreter binary.
Summary
I’ve tried to summarize all the possible paths that a capability can take within a Linux thread using capset() or execve(). (Note: fork() isn’t shown here, since all capability information is simply duplicated when forking.)
You describe the thread permitted as “This is a superset of capabilities that the thread may add to either the thread permitted or thread inheritable sets.” which seems circular. Shouldn’t this say “[…] either the thread effective or thread inheritable sets.”?
Oh, I forgot to say thanks for a really helpful article!
A new capability set has been added in Linux 4.3: the ambient capability set, which basically permits keeping capabilities during execve() on any non-setuid, non-setcap binary.
Except when calling any executable that has any capabilities defined, unfortunately… The invariant clears out pA set upon detecting anything in security.capability xattr’s, before processing any of the other evolution rules. There’s also a HUGE problem with bounding set now… looks like they had one too many at the bar before making these latest changes. But hey! They can hack and slash it some more, right?