Capability Model
Aegis's capability-based security model — no ambient authority, kernel-enforced per-process capability tables, Rust validation core
Aegis replaces traditional Unix permission checks with a capability-based security model. No process holds ambient authority. A process can only perform privileged operations if it holds the specific capability in its kernel-managed capability table. Even basic file access requires an explicit grant.
This page documents the capability data structures, the full set of capability kinds, the Rust/C FFI boundary, capability lifecycle through fork and exec, the syscall enforcement points, and comparisons with other capability systems.
Maturity warning. Aegis v1 is the first version deemed suitable for public release, not a battle-tested production system. The capability model described here is architecturally sound, but the surrounding C kernel code has not undergone the depth of adversarial security review that would justify strong security claims. All vulnerabilities identified in audits to date have been hypothetical; however, a codebase of this scale – predominantly written in C – almost certainly contains real, exploitable bugs that have not yet been found. Treat this documentation as a description of intended security properties, not a guarantee of their enforcement under all conditions. Contributions are welcome – file issues or propose changes at exec/aegis.
Design Principles
- No ambient authority. A freshly spawned process starts with an empty capability table. It receives capabilities only through explicit grants: baseline grants at exec time and policy-based grants from /etc/aegis/caps.d/.
- Unforgeable. Capability tables live in kernel memory (aegis_process_t.caps[]), embedded directly in the PCB. User space cannot read, write, or forge capability slots.
- Kernel-validated. Every privileged syscall calls cap_check() before proceeding. The check is implemented in Rust (kernel/cap/src/lib.rs) and linked via C FFI.
- Principle of least privilege. The security policy engine grants only the capabilities a specific binary needs. A web server gets NET_SOCKET; it never gets AUTH or DISK_ADMIN.
Data Structures
Capability Slot
Each slot in the capability table is a (kind, rights) pair, 8 bytes total:
/* kernel/cap/cap.h */
typedef struct {
    uint32_t kind;   /* CAP_KIND_* */
    uint32_t rights; /* CAP_RIGHTS_* bitfield */
} cap_slot_t;
The Rust side mirrors this layout exactly with #[repr(C)]:
/* kernel/cap/src/lib.rs */
#[repr(C)]
pub struct CapSlot {
    pub kind: u32,   /* CAP_KIND_*: 0 means empty */
    pub rights: u32, /* CAP_RIGHTS_* bitfield */
}
A slot with kind == CAP_KIND_NULL (0) is considered empty and available for grant.
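The layout agreement between the two definitions can be pinned down at compile time. The following is a minimal sketch, not the kernel's actual code: the `slot_is_empty` helper and the constant assertion are illustrative assumptions added here.

```rust
use core::mem::size_of;

/// Mirror of the 8-byte slot layout described above.
#[repr(C)]
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct CapSlot {
    pub kind: u32,
    pub rights: u32,
}

pub const CAP_KIND_NULL: u32 = 0;

// Compile-time layout check: the build fails if CapSlot is ever not 8 bytes,
// which would break the agreement with the C-side cap_slot_t.
const _: () = assert!(size_of::<CapSlot>() == 8);

/// Hypothetical helper (not part of the documented API):
/// a slot is free when its kind is CAP_KIND_NULL.
pub fn slot_is_empty(slot: &CapSlot) -> bool {
    slot.kind == CAP_KIND_NULL
}
```

A check of this kind costs nothing at run time and would catch an accidental field addition on the Rust side before it reaches the FFI boundary.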
Capability Table
Each process holds a fixed-size table of 64 capability slots, embedded directly in the process control block:
+------------------------------------------------------------------+
| aegis_process_t (PCB) |
| ... |
| caps[0]: { kind: VFS_OPEN, rights: READ } |
| caps[1]: { kind: VFS_WRITE, rights: WRITE } |
| caps[2]: { kind: VFS_READ, rights: READ } |
| caps[3]: { kind: IPC, rights: READ } |
| caps[4]: { kind: PROC_READ, rights: READ } |
| caps[5]: { kind: THREAD_CREATE, rights: READ } |
| caps[6]: { kind: NULL, rights: 0 } <- empty slot |
| ... |
| caps[63]: { kind: NULL, rights: 0 } <- empty slot |
| authenticated: 0 or 1 |
| ... |
+------------------------------------------------------------------+
The table size is defined as CAP_TABLE_SIZE = 64 (increased from the original 8-slot Phase 11 design). This is sufficient for all current capability kinds with room for future expansion.
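As an illustration of the fixed-size table, here is a small sketch that models `caps[]` as a Rust array and counts free slots. The `CapTable` alias, `empty_table`, and `free_slots` are hypothetical names introduced only for this example.

```rust
pub const CAP_TABLE_SIZE: usize = 64;

#[repr(C)]
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct CapSlot {
    pub kind: u32,
    pub rights: u32,
}

/// Hypothetical stand-in for the caps[] array embedded in the PCB.
pub type CapTable = [CapSlot; CAP_TABLE_SIZE];

/// A table in which every slot is CAP_KIND_NULL (0).
pub fn empty_table() -> CapTable {
    [CapSlot { kind: 0, rights: 0 }; CAP_TABLE_SIZE]
}

/// Number of slots still available for grant.
pub fn free_slots(table: &CapTable) -> usize {
    table.iter().filter(|s| s.kind == 0).count()
}
```

With 8-byte slots, a 64-entry table occupies 512 bytes per process under this layout.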
Capability Kinds
Each capability kind gates a specific class of kernel operations:
| Kind | Value | Description | Gated Operations |
|---|---|---|---|
| CAP_KIND_NULL | 0 | Empty slot (sentinel) | N/A |
| CAP_KIND_VFS_OPEN | 1 | File open | sys_open |
| CAP_KIND_VFS_WRITE | 2 | File write | sys_write, sys_writev |
| CAP_KIND_VFS_READ | 3 | File read | sys_read, sys_readv |
| CAP_KIND_AUTH | 4 | Authentication access | Open /etc/shadow, sys_auth_session |
| CAP_KIND_CAP_GRANT | 5 | Delegate caps to children | Reserved for future use |
| CAP_KIND_SETUID | 6 | Identity changes | sys_setuid, sys_setgid |
| CAP_KIND_NET_SOCKET | 7 | Network sockets | sys_socket (AF_INET) |
| CAP_KIND_NET_ADMIN | 8 | Network configuration | sys_netcfg (set IP/mask/gateway) |
| CAP_KIND_THREAD_CREATE | 9 | Thread creation | clone(CLONE_VM) |
| CAP_KIND_PROC_READ | 10 | Process inspection | Read /proc/[other-pid] |
| CAP_KIND_DISK_ADMIN | 11 | Raw block I/O | sys_disk_read, sys_disk_write |
| CAP_KIND_FB | 12 | Framebuffer access | Map framebuffer into userspace |
| CAP_KIND_CAP_DELEGATE | 13 | Restrict caps on spawn | cap_mask parameter in exec, sys_cap_grant |
| CAP_KIND_CAP_QUERY | 14 | Introspect cap tables | sys_cap_query on other processes |
| CAP_KIND_IPC | 15 | IPC primitives | AF_UNIX sockets, memfd_create |
| CAP_KIND_POWER | 16 | System power control | sys_reboot (shutdown/reboot) |
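The numeric values in the table can be modelled as a Rust enum. This is only a sketch: the kernel passes kinds as raw `u32` values across the FFI boundary, and the `CapKind` type and its `from_u32` conversion are assumptions made for illustration.

```rust
/// Hypothetical typed view of the CAP_KIND_* values listed above.
#[repr(u32)]
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum CapKind {
    Null = 0,
    VfsOpen = 1,
    VfsWrite = 2,
    VfsRead = 3,
    Auth = 4,
    CapGrant = 5,
    Setuid = 6,
    NetSocket = 7,
    NetAdmin = 8,
    ThreadCreate = 9,
    ProcRead = 10,
    DiskAdmin = 11,
    Fb = 12,
    CapDelegate = 13,
    CapQuery = 14,
    Ipc = 15,
    Power = 16,
}

impl CapKind {
    /// Convert a raw value into a kind, rejecting anything not in the table.
    pub fn from_u32(v: u32) -> Option<CapKind> {
        Some(match v {
            0 => CapKind::Null,
            1 => CapKind::VfsOpen,
            2 => CapKind::VfsWrite,
            3 => CapKind::VfsRead,
            4 => CapKind::Auth,
            5 => CapKind::CapGrant,
            6 => CapKind::Setuid,
            7 => CapKind::NetSocket,
            8 => CapKind::NetAdmin,
            9 => CapKind::ThreadCreate,
            10 => CapKind::ProcRead,
            11 => CapKind::DiskAdmin,
            12 => CapKind::Fb,
            13 => CapKind::CapDelegate,
            14 => CapKind::CapQuery,
            15 => CapKind::Ipc,
            16 => CapKind::Power,
            _ => return None,
        })
    }
}
```

A checked conversion like this would let syscall handlers reject unknown kind values from user space instead of storing them in a slot.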
Rights Bitfield
Rights are orthogonal to kind. Each capability slot carries a 32-bit rights field, of which three bits are currently defined:
| Right | Bit | Value | Meaning |
|---|---|---|---|
| CAP_RIGHTS_READ | 0 | 0x1 | Read access |
| CAP_RIGHTS_WRITE | 1 | 0x2 | Write access |
| CAP_RIGHTS_EXEC | 2 | 0x4 | Execute access |
A cap_check call succeeds only if some slot in the table both matches the requested kind and has all requested rights bits set. For example, checking (CAP_KIND_VFS_WRITE, CAP_RIGHTS_WRITE) requires that the process holds a VFS_WRITE slot with at least the WRITE bit.
Most capability kinds use READ as a simple “is present” check. The rights bitfield provides finer-grained control where needed — for instance, PROC_READ with WRITE rights permits kill() on other processes, while READ-only permits inspection via procfs.
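The "all requested bits set" rule is a subset test on the bitfield. A minimal sketch, where `rights_satisfy` is a hypothetical helper name and the constants follow the table above:

```rust
pub const CAP_RIGHTS_READ: u32 = 0x1;
pub const CAP_RIGHTS_WRITE: u32 = 0x2;
pub const CAP_RIGHTS_EXEC: u32 = 0x4;

/// True when every bit in `requested` is also set in `held`.
/// This is the same expression cap_check applies to each slot.
pub fn rights_satisfy(held: u32, requested: u32) -> bool {
    (held & requested) == requested
}
```

For the PROC_READ example, a slot holding READ | WRITE satisfies a request for WRITE (signal delivery), while a READ-only slot does not. Note that under this rule a request for zero rights is satisfied by any slot of the matching kind.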
The Rust/C FFI Boundary
The core capability logic is implemented in Rust (kernel/cap/src/lib.rs) and compiled as a staticlib (see kernel/cap/Cargo.toml). The Rust crate uses #![no_std] — no allocator, no standard library, no panicking infrastructure beyond an infinite-loop handler.
Build Configuration
# kernel/cap/Cargo.toml
[package]
name = "cap"
version = "0.1.0"
edition = "2021"
[lib]
crate-type = ["staticlib"]
[profile.dev]
panic = "abort"
[profile.release]
panic = "abort"
The Rust compiler produces libcap.a, which the kernel linker script pulls in alongside the C object files. All three exported functions use #[no_mangle] pub extern "C" to produce C-compatible symbols.
The capability module is the first kernel subsystem written in Rust and represents the beginning of a planned gradual migration of the Aegis kernel from C to Rust. The choice to start with the capability system is deliberate: this is the security-critical authorization path, and Rust’s memory safety guarantees (bounds checking, null safety, no undefined behavior in safe code) provide meaningful hardening for the code that every privileged operation depends on. Future kernel subsystems will follow the same pattern – Rust staticlib crates with extern "C" FFI boundaries.
Exported Functions
Three functions cross the FFI boundary:
cap_init() — Initialize the capability subsystem. Called from kernel_main() before sched_init(). Prints a status line to serial via FFI back into C (serial_write_string):
#[no_mangle]
pub extern "C" fn cap_init() {
    unsafe {
        serial_write_string(
            c"[CAP] OK: capability subsystem initialized\n".as_ptr() as *const u8
        );
    }
}
cap_grant(table, n, kind, rights): Write a capability into the first empty slot. Returns the slot index on success, or -ENOCAP if the table is full, table is NULL, or n is 0:
#[no_mangle]
pub extern "C" fn cap_grant(
    table: *mut CapSlot, n: u32, kind: u32, rights: u32,
) -> i32 {
    if table.is_null() || n == 0 {
        return -(ENOCAP as i32);
    }
    let n = n.min(CAP_TABLE_SIZE); // clamp to prevent OOB
    let slots = unsafe {
        core::slice::from_raw_parts_mut(table, n as usize)
    };
    for (i, slot) in slots.iter_mut().enumerate() {
        if slot.kind == 0 {
            slot.kind = kind;
            slot.rights = rights;
            return i as i32;
        }
    }
    -(ENOCAP as i32)
}
cap_check(table, n, kind, rights) — Check whether a capability table contains a slot matching the requested kind with at least the requested rights. Returns 0 on success, -ENOCAP on failure:
#[no_mangle]
pub extern "C" fn cap_check(
    table: *const CapSlot, n: u32, kind: u32, rights: u32,
) -> i32 {
    if table.is_null() || n == 0 {
        return -(ENOCAP as i32);
    }
    let n = n.min(CAP_TABLE_SIZE);
    let slots = unsafe {
        core::slice::from_raw_parts(table, n as usize)
    };
    for slot in slots {
        if slot.kind == kind && (slot.rights & rights) == rights {
            return 0;
        }
    }
    -(ENOCAP as i32)
}
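To see the return-value conventions in action without a kernel build, the same logic can be written over safe slices and exercised on a host. This is a sketch under that assumption: `grant` and `check` below are hypothetical host-side analogues, not the exported FFI functions.

```rust
pub const ENOCAP: i32 = 130;

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct CapSlot {
    pub kind: u32,
    pub rights: u32,
}

/// Safe-slice analogue of cap_grant: fill the first empty slot and
/// return its index, or -ENOCAP when no slot is free.
pub fn grant(slots: &mut [CapSlot], kind: u32, rights: u32) -> i32 {
    for (i, slot) in slots.iter_mut().enumerate() {
        if slot.kind == 0 {
            *slot = CapSlot { kind, rights };
            return i as i32;
        }
    }
    -ENOCAP
}

/// Safe-slice analogue of cap_check: 0 when some slot matches the kind
/// and carries all requested rights bits, -ENOCAP otherwise.
pub fn check(slots: &[CapSlot], kind: u32, rights: u32) -> i32 {
    let found = slots
        .iter()
        .any(|s| s.kind == kind && (s.rights & rights) == rights);
    if found { 0 } else { -ENOCAP }
}
```

Granting VFS_WRITE (2) with WRITE (0x2) into an empty table lands in slot 0; a later check for (2, 0x2) returns 0, while a check for (2, 0x1) returns -130 because the READ bit was never granted.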
Safety Invariants
The Rust code relies on the following invariants, enforced by the C callers:
- table points to valid memory: it is always proc->caps, which is embedded in a kernel-allocated PCB.
- n does not exceed the allocation: callers always pass CAP_TABLE_SIZE. The Rust code additionally clamps n to CAP_TABLE_SIZE as a defense-in-depth measure.
- No concurrent mutation during grant: cap_grant is called from proc_spawn (before the task enters the run queue) or from sys_exec (which runs in the calling process's context). No preemption hazard exists at these call sites.
- Panic handler: the #[panic_handler] is an infinite loop. This is acceptable because the Rust code contains no operations that can panic (no indexing, no unwrap, no arithmetic overflow in release mode).
C Header
The C side declares the types and functions in kernel/cap/cap.h:
typedef struct {
    uint32_t kind;
    uint32_t rights;
} cap_slot_t;
#define CAP_TABLE_SIZE 64u
void cap_init(void);
int cap_grant(cap_slot_t *table, uint32_t n,
              uint32_t kind, uint32_t rights);
int cap_check(const cap_slot_t *table, uint32_t n,
              uint32_t kind, uint32_t rights);
Only three C source files call cap_grant: proc.c (initial grants for PID 1), sys_exec.c (grants at exec and spawn), and sys_cap.c (capability syscalls). All other syscall files call only cap_check, through the same header.
Capability Lifecycle
Initialization (Boot)
kernel_main()
  |-- cap_init()          // Rust: print status, subsystem ready
  |-- ...
  |-- cap_policy_load()   // C: parse /etc/aegis/caps.d/ into memory
  +-- proc_spawn()        // Grant PID 1 (Vigil) its initial caps
PID 1 (the Vigil init process) receives a hardcoded set of capabilities:
| Slot | Kind | Rights | Purpose |
|---|---|---|---|
| 0 | VFS_OPEN | READ | Open files |
| 1 | VFS_WRITE | WRITE | Write to files/console |
| 2 | VFS_READ | READ | Read file contents |
| 3 | IPC | READ | Unix domain sockets |
| 4 | PROC_READ | READ \| WRITE | Inspect and signal children |
| 5 | THREAD_CREATE | READ | Create threads |
| 6 | POWER | READ | Shutdown/reboot |
Note that PID 1 gets PROC_READ with WRITE rights (for sending signals to children) and POWER (for shutdown/reboot). Normal exec’d processes do not receive these by default — they come from policy files.
Exec — The Capability Boundary
Exec is the primary capability boundary in Aegis. When a process calls sys_exec, its capability table is completely reset:
sys_exec(path, ...)
  1. Zero the entire capability table (64 slots)
  2. Grant baseline capabilities (6 caps every process gets)
  3. Look up policy for basename(path) in /etc/aegis/caps.d/
  4. Grant service-tier policy caps unconditionally
  5. Grant admin-tier policy caps only if proc->authenticated == 1
The baseline capabilities granted to every exec’d process:
| Kind | Rights | Purpose |
|---|---|---|
| VFS_OPEN | READ | Open files for reading |
| VFS_WRITE | WRITE | Write to open file descriptors |
| VFS_READ | READ | Read from open file descriptors |
| IPC | READ | AF_UNIX sockets and memfd |
| PROC_READ | READ | Read own /proc entries |
| THREAD_CREATE | READ | Create threads via clone |
This is a critical security property: capabilities do not survive exec. If login holds AUTH and SETUID, and then execs /bin/stsh (the shell), the shell starts with only the baseline plus whatever stsh’s policy file grants. The AUTH capability does not leak to the shell unless explicitly configured.
The same reset-and-grant logic applies in sys_spawn (Aegis’s combined fork+exec syscall), where the child process’s table is built fresh from baseline + policy.
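The reset-and-grant sequence can be sketched as follows. This is an illustrative model, not the code in sys_exec.c: the `Proc` and `Policy` types, the `exec_reset` and `holds` functions, and the assumption that policy entries are (kind, rights) pairs are all hypothetical.

```rust
pub const CAP_TABLE_SIZE: usize = 64;
pub const READ: u32 = 0x1;
pub const WRITE: u32 = 0x2;

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct CapSlot {
    pub kind: u32,
    pub rights: u32,
}

/// Hypothetical stand-in for the capability-related PCB fields.
pub struct Proc {
    pub caps: [CapSlot; CAP_TABLE_SIZE],
    pub authenticated: bool,
}

/// Hypothetical in-memory form of one binary's policy: (kind, rights) pairs per tier.
pub struct Policy<'a> {
    pub service: &'a [(u32, u32)],
    pub admin: &'a [(u32, u32)],
}

/// Baseline pairs: VFS_OPEN, VFS_WRITE, VFS_READ, IPC, PROC_READ, THREAD_CREATE.
pub const BASELINE: [(u32, u32); 6] =
    [(1, READ), (2, WRITE), (3, READ), (15, READ), (10, READ), (9, READ)];

fn grant(caps: &mut [CapSlot], kind: u32, rights: u32) {
    if let Some(slot) = caps.iter_mut().find(|s| s.kind == 0) {
        *slot = CapSlot { kind, rights };
    }
}

pub fn holds(p: &Proc, kind: u32, rights: u32) -> bool {
    p.caps.iter().any(|s| s.kind == kind && (s.rights & rights) == rights)
}

pub fn exec_reset(p: &mut Proc, policy: &Policy) {
    // Step 1: zero the table, so nothing held before exec survives.
    p.caps = [CapSlot { kind: 0, rights: 0 }; CAP_TABLE_SIZE];
    // Step 2: baseline capabilities.
    for &(k, r) in BASELINE.iter() {
        grant(&mut p.caps, k, r);
    }
    // Step 4: service-tier caps, unconditionally.
    for &(k, r) in policy.service {
        grant(&mut p.caps, k, r);
    }
    // Step 5: admin-tier caps, only for authenticated sessions.
    if p.authenticated {
        for &(k, r) in policy.admin {
            grant(&mut p.caps, k, r);
        }
    }
}
```

In this model a process that held AUTH before exec no longer holds it afterwards unless the new binary's policy lists it, and an admin-tier entry such as DISK_ADMIN appears only when `authenticated` is set.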
Fork — Full Inheritance
When a process calls fork(), the child receives a complete copy of the parent’s capability table and authenticated flag:
/* kernel/syscall/sys_process.c, sys_fork */
uint32_t ci;
for (ci = 0; ci < CAP_TABLE_SIZE; ci++)
    child->caps[ci] = parent->caps[ci];
child->authenticated = parent->authenticated;
This is safe because fork does not change the executable image. The child runs the same code with the same privilege level. The capability boundary is at exec, not fork.
The same inheritance applies to clone(CLONE_VM) (thread creation), which additionally requires CAP_KIND_THREAD_CREATE.
Capability Flow Diagram
               PID 1 (Vigil)
     [baseline + POWER + PROC_READ(W)]
                     |
          sys_spawn("/bin/login")
                     |
        +----- exec boundary -----+
        | 1. Zero cap table       |
        | 2. Grant baseline (6)   |
        | 3. Policy: login ->     |
        |    service AUTH SETUID  |
        +-------------------------+
                     |
               login process
        [baseline + AUTH + SETUID]
                     |
            sys_auth_session()
            (requires AUTH cap)
          proc->authenticated = 1
                     |
           sys_exec("/bin/stsh")
                     |
        +----- exec boundary -----+
        | 1. Zero cap table       |
        | 2. Grant baseline (6)   |
        | 3. Policy: stsh ->      |
        |    admin DISK_ADMIN ... |
        |    (proc->authenticated |
        |    == 1, so granted)    |
        +-------------------------+
                     |
               stsh (shell)
      [baseline + DISK_ADMIN + POWER
  + CAP_DELEGATE + CAP_QUERY + PROC_READ]
                     |
       fork() -> child inherits all
                     |
        sys_exec("/usr/bin/httpd")
                     |
        +----- exec boundary -----+
        | 1. Zero cap table       |
        | 2. Grant baseline (6)   |
        | 3. Policy: httpd ->     |
        |    service NET_SOCKET   |
        +-------------------------+
                     |
                   httpd
          [baseline + NET_SOCKET]
    (no AUTH, no DISK_ADMIN, no POWER)
Spawn with cap_mask — Restricting Child Capabilities
The sys_spawn syscall accepts an optional cap_mask parameter that allows a parent to restrict the capabilities granted to a child below what baseline + policy would normally provide. This requires CAP_KIND_CAP_DELEGATE.
The algorithm:
- Build the child’s capability table normally (baseline + policy).
- If cap_mask is provided, compute the intersection: only capabilities that appear in both the computed set AND the mask survive.
- The parent must hold every capability it includes in the mask (no escalation).
computed_caps = baseline | policy_caps
if cap_mask:
    final_caps = computed_caps & cap_mask
    for cap in cap_mask: parent must hold cap (prevents escalation)
This supports the principle of least privilege for service managers: Vigil can spawn httpd with only NET_SOCKET, even if httpd’s policy file grants additional capabilities.
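The intersection rule can be sketched with capability kinds modelled as bits in a `u32`. This representation is an assumption made for brevity (it ignores the rights field and is not how the kernel stores capabilities); `kind_bit`, `apply_cap_mask`, and `MaskError` are hypothetical names.

```rust
pub const CAP_KIND_NET_SOCKET: u32 = 7;
pub const CAP_KIND_NET_ADMIN: u32 = 8;
pub const CAP_KIND_CAP_DELEGATE: u32 = 13;

#[derive(Debug, PartialEq, Eq)]
pub enum MaskError {
    /// The parent does not hold CAP_DELEGATE.
    NoDelegate,
    /// The mask names a capability the parent does not hold.
    Escalation,
}

/// Bit for a capability kind. Valid only for kinds below 32,
/// which covers the 17 kinds currently defined.
pub fn kind_bit(kind: u32) -> u32 {
    1u32 << kind
}

/// Sets of kinds are u32 bitsets: `computed` is baseline | policy,
/// `parent` is what the spawning process holds.
pub fn apply_cap_mask(
    computed: u32,
    cap_mask: Option<u32>,
    parent: u32,
) -> Result<u32, MaskError> {
    let mask = match cap_mask {
        None => return Ok(computed), // no mask: child gets baseline + policy
        Some(m) => m,
    };
    if (parent & kind_bit(CAP_KIND_CAP_DELEGATE)) == 0 {
        return Err(MaskError::NoDelegate);
    }
    if (mask & !parent) != 0 {
        return Err(MaskError::Escalation);
    }
    // Intersection: the mask can only remove capabilities, never add them.
    Ok(computed & mask)
}
```

Because the result is `computed & mask`, a bit present in the mask but absent from baseline + policy has no effect, and baseline capabilities omitted from the mask are dropped as well.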
Runtime Delegation — sys_cap_grant (Syscall 363)
A process holding CAP_KIND_CAP_DELEGATE can grant capabilities to a running process by PID:
uint64_t sys_cap_grant_runtime(uint64_t target_pid,
                               uint64_t kind,
                               uint64_t rights)
Safety checks:
- Caller must hold CAP_DELEGATE.
- Caller must hold the specific (kind, rights) being granted, which prevents privilege escalation.
- Target process must exist.
- Target’s capability table must have an empty slot.
Returns the slot index on success, or a negative errno.
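The four checks can be sketched in order as below. This is a model, not the handler in sys_cap.c: the `runtime_grant` function, the `GrantError` enum, passing the target as an `Option` to stand for PID lookup, and checking CAP_DELEGATE with READ rights are all assumptions. The source does not say which errno each failure maps to, so the sketch returns an enum instead.

```rust
pub const CAP_KIND_CAP_DELEGATE: u32 = 13;
pub const CAP_RIGHTS_READ: u32 = 0x1;

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct CapSlot {
    pub kind: u32,
    pub rights: u32,
}

#[derive(Debug, PartialEq, Eq)]
pub enum GrantError {
    NoDelegate,
    CallerLacksCap,
    NoSuchProcess,
    TableFull,
}

fn holds(table: &[CapSlot], kind: u32, rights: u32) -> bool {
    table.iter().any(|s| s.kind == kind && (s.rights & rights) == rights)
}

/// `target` is None when the PID lookup found no process.
pub fn runtime_grant(
    caller: &[CapSlot],
    target: Option<&mut [CapSlot]>,
    kind: u32,
    rights: u32,
) -> Result<usize, GrantError> {
    // 1. Caller must hold CAP_DELEGATE (READ assumed as the presence check).
    if !holds(caller, CAP_KIND_CAP_DELEGATE, CAP_RIGHTS_READ) {
        return Err(GrantError::NoDelegate);
    }
    // 2. Caller must itself hold what it is granting.
    if !holds(caller, kind, rights) {
        return Err(GrantError::CallerLacksCap);
    }
    // 3. Target must exist.
    let target = target.ok_or(GrantError::NoSuchProcess)?;
    // 4. Target needs a free slot; the grant goes into the first one.
    for (i, slot) in target.iter_mut().enumerate() {
        if slot.kind == 0 {
            *slot = CapSlot { kind, rights };
            return Ok(i);
        }
    }
    Err(GrantError::TableFull)
}
```

Checking the caller's own holdings before touching the target means a failed delegation leaves the target table unmodified.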
Authentication — sys_auth_session (Syscall 364)
The authenticated flag on a process controls whether admin-tier policy capabilities are granted at exec time. Only a process holding CAP_KIND_AUTH can set this flag:
uint64_t sys_auth_session(void)
The flag is inherited across fork and clone, and survives exec. This means once a session is authenticated (e.g., by login after password verification), all descendant processes in that session inherit the authenticated state.
Capability Query — sys_cap_query (Syscall 362)
Any process can query its own capability table (pid == 0). Querying another process’s table requires CAP_KIND_CAP_QUERY:
uint64_t sys_cap_query(uint64_t pid, uint64_t buf_uptr,
                       uint64_t buflen)
Returns cap_slot_t entries copied to the user-space buffer.
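On the user-space side, a buffer filled by this call could be decoded roughly as follows. This sketch assumes the buffer holds consecutive 8-byte cap_slot_t entries in native byte order, and that empty slots may be present and should be skipped; neither detail is confirmed by the text above, and `decode_slots` is a hypothetical helper.

```rust
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct CapSlot {
    pub kind: u32,
    pub rights: u32,
}

/// Decode whole 8-byte entries from `buf`, ignoring any trailing partial
/// entry and dropping empty (kind == 0) slots.
pub fn decode_slots(buf: &[u8]) -> Vec<CapSlot> {
    buf.chunks_exact(8)
        .map(|c| CapSlot {
            kind: u32::from_ne_bytes([c[0], c[1], c[2], c[3]]),
            rights: u32::from_ne_bytes([c[4], c[5], c[6], c[7]]),
        })
        .filter(|s| s.kind != 0)
        .collect()
}
```

A caller would typically pass a buffer of 64 * 8 = 512 bytes so that a full table fits.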
Syscall Enforcement Points
Every privileged operation in the kernel checks capabilities before proceeding. The pattern is consistent:
aegis_process_t *proc = (aegis_process_t *)sched_current();
if (cap_check(proc->caps, CAP_TABLE_SIZE,
              CAP_KIND_*, CAP_RIGHTS_*) != 0)
    return (uint64_t)-(int64_t)ENOCAP;
The full set of enforcement points:
| Subsystem | Source File | Capability Required |
|---|---|---|
| File open | sys_file.c | VFS_OPEN + READ |
| File write | sys_io.c | VFS_WRITE + WRITE |
| File read | sys_io.c | VFS_READ + READ |
| Directory ops | sys_dir.c | VFS_OPEN + READ (mkdir, rmdir, unlink) |
| /etc/shadow open | vfs.c | AUTH + READ |
| Socket (AF_UNIX) | sys_socket.c | IPC + READ |
| Socket (AF_INET) | sys_socket.c | NET_SOCKET + READ |
| Network config | sys_socket.c | NET_ADMIN + READ |
| Thread create | sys_process.c | THREAD_CREATE + READ |
| Process signal | sys_signal.c | PROC_READ + WRITE |
| setuid/setgid | sys_identity.c | SETUID + READ |
| Block device I/O | sys_disk.c | DISK_ADMIN + READ |
| Framebuffer map | sys_memory.c | FB + READ |
| memfd_create | sys_memory.c | IPC + READ |
| Reboot/shutdown | sys_meta.c | POWER + READ |
| Process metadata | sys_meta.c | Various (PROC_READ, SETUID) |
| procfs cross-PID | procfs.c | PROC_READ + READ |
ENOCAP — The Capability Error
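When a capability check fails, the kernel returns ENOCAP (errno 130). The error is specific to Aegis, but the number itself is not unused elsewhere: Linux's generic errno table assigns 130 to EOWNERDEAD, so code shared with Linux should not rely on the numeric value alone to identify a capability denial: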
#define ENOCAP 130
The value is defined without the u suffix so that -ENOCAP produces a clean signed expression. User-space programs should check errno == 130 or use the Aegis-provided header to detect capability denials.
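The enforcement pattern returns `(uint64_t)-(int64_t)ENOCAP`, that is, the negated errno reinterpreted as an unsigned 64-bit value. A short sketch of that encoding and a matching decoder follows; `encode_errno` and `decode_ret` are hypothetical helpers, and treating only values in the range -4095..-1 as errors is borrowed from the Linux syscall convention as an assumption, not something the Aegis sources confirm.

```rust
pub const ENOCAP: i64 = 130;

/// Kernel side: turn a positive errno into the u64 syscall return value.
pub fn encode_errno(errno: i64) -> u64 {
    (-errno) as u64
}

/// User side: split a raw return value into success or a positive errno.
/// Assumes errors occupy -4095..-1 when the value is read as signed.
pub fn decode_ret(ret: u64) -> Result<u64, i64> {
    let signed = ret as i64;
    if (-4095..0).contains(&signed) {
        Err(-signed)
    } else {
        Ok(ret)
    }
}
```

With this encoding, ENOCAP travels as 0xFFFF_FFFF_FFFF_FF7E and decodes back to 130.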
Comparison with Other Capability Systems
vs. seL4
seL4 capabilities are kernel object references (endpoints, CNodes, frames) that are the sole mechanism for accessing any kernel resource. Aegis capabilities are simpler: they are (kind, rights) tags that gate syscall access. Aegis does not use capabilities as object references — file descriptors and PIDs remain the primary handles.
| Aspect | seL4 | Aegis |
|---|---|---|
| Capability granularity | Per-object | Per-operation-class |
| Object references | Via capabilities | Via fds/PIDs |
| CSpace management | User-level CNode trees | Flat 64-slot array |
| Delegation | Copy/mint between CSpaces | cap_mask at exec, sys_cap_grant at runtime |
| Revocation | Via retype/revoke | Slot zeroing (no cascading revoke) |
Aegis trades seL4’s fine-grained per-object capabilities for a simpler model that maps naturally onto Unix syscall patterns. This is intentional — Aegis aims to be a practical, auditable Unix-like OS, not a formally verified microkernel.
vs. Capsicum (FreeBSD)
Capsicum introduces “capability mode” where a process voluntarily gives up ambient authority and operates only through pre-opened file descriptors with cap_rights_t restrictions. Aegis enforces no-ambient-authority from the start — there is no “enter capability mode” transition.
| Aspect | Capsicum | Aegis |
|---|---|---|
| Opt-in vs mandatory | Opt-in (cap_enter) | Mandatory from boot |
| Scope | Per-fd rights | Per-process operation class |
| File access | Restricted fd + cap_rights_t | Capability kind + Unix permissions |
| Compatibility | Runs alongside POSIX | Replaces POSIX permission model |
vs. Linux Capabilities (POSIX 1003.1e draft)
Linux capabilities split root’s privileges into ~40 fine-grained bits (CAP_NET_BIND_SERVICE, CAP_SYS_ADMIN, etc.). These are primarily a mechanism for running daemons with partial root privilege. Aegis capabilities are fundamentally different: they are not a decomposition of root, but the only mechanism for authorizing privileged operations. There is no root user, no setuid bits, no CAP_SYS_ADMIN escape hatch.
| Aspect | Linux capabilities | Aegis capabilities |
|---|---|---|
| Purpose | Decompose root | Replace all authorization |
| Root bypass | CAP_SYS_ADMIN ~ root | No equivalent |
| Storage | Per-thread bitmask (effective/permitted/inheritable) | Per-process 64-slot table |
| File caps | xattr on executables | Policy files in /etc/aegis/caps.d/ |
| Ambient authority | Present (for non-cap-aware programs) | None |
Security Properties
The properties below describe the design intent of the capability model. As noted at the top of this page, Aegis v1 is young software. The architectural properties are sound, but the implementation – particularly the C code surrounding the Rust capability core – has not been subjected to the level of adversarial review needed to make strong assurance claims. Undiscovered memory safety bugs in the C kernel could potentially bypass these properties. The ongoing C-to-Rust migration is specifically motivated by closing this gap.
- No capability forgery. The capability table is in kernel memory. User space has no mechanism to write to it except through sys_cap_grant (which requires CAP_DELEGATE and the specific capability being granted).
- No privilege escalation through delegation. sys_cap_grant requires the caller to hold every capability it delegates. cap_mask in exec is an intersection: it can only remove capabilities, never add them.
- Exec is a hard boundary. Capabilities do not leak across exec. A process that holds AUTH cannot pass it to an exec’d child unless the child’s policy file explicitly grants it.
- Two-tier policy. Admin-tier capabilities require an authenticated session. Service-tier capabilities are granted unconditionally. This prevents unauthenticated processes from gaining admin privileges even if they exec a binary with admin-tier policy.
- Defense in depth. The Rust implementation clamps the table length n to CAP_TABLE_SIZE, preventing out-of-bounds access even if a caller passes an oversized length. NULL pointer checks prevent undefined behavior when no table is supplied. However, the C callers of these Rust functions are themselves subject to the usual C memory safety risks: a buffer overflow in an unrelated syscall handler could corrupt a process’s capability table in memory, bypassing the Rust validation entirely. This is the primary motivation for expanding the Rust boundary over time.
Future Work
The existing design documents outline several planned enhancements:
- Unforgeable tokens. When cross-process delegation matures, capability slots may carry a kernel-generated random id field to prevent guessing attacks against delegated capabilities.
- Cascading revocation. Currently, revoking a capability means zeroing its slot. Future work may add cascading revocation where revoking a parent capability automatically revokes all delegated copies.
- Per-file capabilities. The current model gates operation classes (open, read, write). Future iterations may support capabilities scoped to specific filesystem paths or inodes.
- Expanded Rust boundary. The capability module (kernel/cap/) is currently the only kernel subsystem in Rust. The long-term plan is a gradual, subsystem-by-subsystem migration of the kernel from C to Rust, starting with security-critical paths. The FFI pattern established here (#![no_std] staticlib crates with extern "C" exports) will serve as the template for future conversions. Expanding the Rust boundary reduces the attack surface for memory corruption bugs that could undermine the capability model’s security properties.