Capability Model

Aegis replaces traditional Unix permission checks with a capability-based security model. No process holds ambient authority. A process can only perform privileged operations if it holds the specific capability in its kernel-managed capability table. Even basic file access requires an explicit grant.

This page documents the capability data structures, the full set of capability kinds, the Rust/C FFI boundary, capability lifecycle through fork and exec, the syscall enforcement points, and comparisons with other capability systems.

Maturity warning. Aegis v1 is the first version deemed suitable for public release, not a battle-tested production system. The capability model described here is architecturally sound, but the surrounding C kernel code has not undergone the depth of adversarial security review that would justify strong security claims. All vulnerabilities identified in audits to date have been hypothetical; however, a codebase of this scale – predominantly written in C – almost certainly contains real, exploitable bugs that have not yet been found. Treat this documentation as a description of intended security properties, not a guarantee of their enforcement under all conditions. Contributions are welcome – file issues or propose changes at exec/aegis.

Design Principles

  1. No ambient authority. A freshly spawned process starts with an empty capability table. It receives capabilities only through explicit grants — baseline grants at exec time and policy-based grants from /etc/aegis/caps.d/.

  2. Unforgeable. Capability tables live in kernel memory (aegis_process_t.caps[]), embedded directly in the PCB. User space cannot directly read, write, or forge capability slots; all access is mediated by syscalls such as sys_cap_query and sys_cap_grant.

  3. Kernel-validated. Every privileged syscall calls cap_check() before proceeding. The check is implemented in Rust (kernel/cap/src/lib.rs) and linked via C FFI.

  4. Principle of least privilege. The security policy engine grants only the capabilities a specific binary needs. A web server gets NET_SOCKET; it never gets AUTH or DISK_ADMIN.

Data Structures

Capability Slot

Each slot in the capability table is a (kind, rights) pair, 8 bytes total:

/* kernel/cap/cap.h */
typedef struct {
    uint32_t kind;    /* CAP_KIND_* */
    uint32_t rights;  /* CAP_RIGHTS_* bitfield */
} cap_slot_t;

The Rust side mirrors this layout exactly with #[repr(C)]:

/* kernel/cap/src/lib.rs */
#[repr(C)]
pub struct CapSlot {
    pub kind:   u32,   /* CAP_KIND_* — 0 means empty */
    pub rights: u32,   /* CAP_RIGHTS_* bitfield */
}

A slot with kind == CAP_KIND_NULL (0) is considered empty and available for grant.

Capability Table

Each process holds a fixed-size table of 64 capability slots, embedded directly in the process control block:

+------------------------------------------------------------------+
| aegis_process_t (PCB)                                            |
|   ...                                                            |
|   caps[0]:  { kind: VFS_OPEN,  rights: READ }                    |
|   caps[1]:  { kind: VFS_WRITE, rights: WRITE }                   |
|   caps[2]:  { kind: VFS_READ,  rights: READ }                    |
|   caps[3]:  { kind: IPC,       rights: READ }                    |
|   caps[4]:  { kind: PROC_READ, rights: READ }                    |
|   caps[5]:  { kind: THREAD_CREATE, rights: READ }                |
|   caps[6]:  { kind: NULL, rights: 0 }      <- empty slot         |
|   ...                                                            |
|   caps[63]: { kind: NULL, rights: 0 }      <- empty slot         |
|   authenticated: 0 or 1                                          |
|   ...                                                            |
+------------------------------------------------------------------+

The table size is defined as CAP_TABLE_SIZE = 64 (increased from the original 8-slot Phase 11 design). This is sufficient for all current capability kinds with room for future expansion.
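The layout described above can be pinned down at compile time. The following is a minimal sketch, assuming a C11 compiler; the helper names cap_table_bytes and layout_example are hypothetical and are not taken from the Aegis sources.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Mirrors the cap_slot_t definition shown earlier on this page. */
typedef struct {
    uint32_t kind;    /* CAP_KIND_* */
    uint32_t rights;  /* CAP_RIGHTS_* bitfield */
} cap_slot_t;

#define CAP_TABLE_SIZE 64u

/* Compile-time layout checks: these must hold for a #[repr(C)] struct of
 * two u32 fields on the Rust side to describe the same memory. */
_Static_assert(sizeof(cap_slot_t) == 8, "slot must be 8 bytes");
_Static_assert(offsetof(cap_slot_t, kind) == 0, "kind at offset 0");
_Static_assert(offsetof(cap_slot_t, rights) == 4, "rights at offset 4");

/* Bytes that a full capability table occupies inside each PCB
 * (hypothetical helper, for illustration). */
size_t cap_table_bytes(void)
{
    return sizeof(cap_slot_t) * CAP_TABLE_SIZE;
}

void layout_example(void)
{
    /* 64 slots of 8 bytes each. */
    assert(cap_table_bytes() == 512);
}
```

Under these assumptions the table adds 512 bytes to every process control block, which is the cost of moving from the 8-slot design to 64 slots.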

Capability Kinds

Each capability kind gates a specific class of kernel operations:

Kind                    Value  Description                Gated Operations
CAP_KIND_NULL           0      Empty slot (sentinel)      N/A
CAP_KIND_VFS_OPEN       1      File open                  sys_open
CAP_KIND_VFS_WRITE      2      File write                 sys_write, sys_writev
CAP_KIND_VFS_READ       3      File read                  sys_read, sys_readv
CAP_KIND_AUTH           4      Authentication access      Open /etc/shadow, sys_auth_session
CAP_KIND_CAP_GRANT      5      Delegate caps to children  Reserved for future use
CAP_KIND_SETUID         6      Identity changes           sys_setuid, sys_setgid
CAP_KIND_NET_SOCKET     7      Network sockets            sys_socket (AF_INET)
CAP_KIND_NET_ADMIN      8      Network configuration      sys_netcfg (set IP/mask/gateway)
CAP_KIND_THREAD_CREATE  9      Thread creation            clone(CLONE_VM)
CAP_KIND_PROC_READ      10     Process inspection         Read /proc/[other-pid]
CAP_KIND_DISK_ADMIN     11     Raw block I/O              sys_disk_read, sys_disk_write
CAP_KIND_FB             12     Framebuffer access         Map framebuffer into userspace
CAP_KIND_CAP_DELEGATE   13     Restrict caps on spawn     cap_mask parameter in exec, sys_cap_grant
CAP_KIND_CAP_QUERY      14     Introspect cap tables      sys_cap_query on other processes
CAP_KIND_IPC            15     IPC primitives             AF_UNIX sockets, memfd_create
CAP_KIND_POWER          16     System power control       sys_reboot (shutdown/reboot)

Rights Bitfield

Rights are orthogonal to kind. Each capability slot carries a 32-bit rights field, of which three bits are currently defined:

Right             Bit  Value  Meaning
CAP_RIGHTS_READ   0    0x1    Read access
CAP_RIGHTS_WRITE  1    0x2    Write access
CAP_RIGHTS_EXEC   2    0x4    Execute access

A cap_check call succeeds only if some slot in the table both matches the requested kind and has all requested rights bits set. For example, checking (CAP_KIND_VFS_WRITE, CAP_RIGHTS_WRITE) requires that the process holds a VFS_WRITE slot with at least the WRITE bit.

Most capability kinds use READ as a simple “is present” check. The rights bitfield provides finer-grained control where needed — for instance, PROC_READ with WRITE rights permits kill() on other processes, while READ-only permits inspection via procfs.
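The rights test is a subset check on bitmasks. The sketch below isolates it in C, assuming the CAP_RIGHTS_* macros expand to the values in the table above; rights_satisfy and rights_examples are hypothetical names used only for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Values from the rights table; the macro spellings are an assumption. */
#define CAP_RIGHTS_READ  0x1u
#define CAP_RIGHTS_WRITE 0x2u
#define CAP_RIGHTS_EXEC  0x4u

/* Nonzero when every bit set in `wanted` is also set in `held`. */
int rights_satisfy(uint32_t held, uint32_t wanted)
{
    return (held & wanted) == wanted;
}

void rights_examples(void)
{
    /* PID 1 holds PROC_READ with READ | WRITE; baseline holds READ only. */
    uint32_t pid1_proc_read     = CAP_RIGHTS_READ | CAP_RIGHTS_WRITE;
    uint32_t baseline_proc_read = CAP_RIGHTS_READ;

    assert(rights_satisfy(pid1_proc_read, CAP_RIGHTS_WRITE));      /* may signal */
    assert(!rights_satisfy(baseline_proc_read, CAP_RIGHTS_WRITE)); /* may not    */
    assert(rights_satisfy(baseline_proc_read, CAP_RIGHTS_READ));   /* may inspect */

    /* Requesting no rights bits reduces the test to "kind is present". */
    assert(rights_satisfy(baseline_proc_read, 0u));
}
```

Extra bits in the held mask never cause a failure; only a missing requested bit does.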

The Rust/C FFI Boundary

The core capability logic is implemented in Rust (kernel/cap/src/lib.rs) and compiled as a staticlib (see kernel/cap/Cargo.toml). The Rust crate uses #![no_std] — no allocator, no standard library, no panicking infrastructure beyond an infinite-loop handler.

Build Configuration

# kernel/cap/Cargo.toml
[package]
name = "cap"
version = "0.1.0"
edition = "2021"

[lib]
crate-type = ["staticlib"]

[profile.dev]
panic = "abort"
[profile.release]
panic = "abort"

The Rust compiler produces libcap.a, which the kernel linker script pulls in alongside the C object files. All three exported functions use #[no_mangle] pub extern "C" to produce C-compatible symbols.

The capability module is the first kernel subsystem written in Rust and represents the beginning of a planned gradual migration of the Aegis kernel from C to Rust. The choice to start with the capability system is deliberate: this is the security-critical authorization path, and Rust’s memory safety guarantees (bounds checking, null safety, no undefined behavior in safe code) provide meaningful hardening for the code that every privileged operation depends on. Future kernel subsystems will follow the same pattern – Rust staticlib crates with extern "C" FFI boundaries.

Exported Functions

Three functions cross the FFI boundary:

cap_init() — Initialize the capability subsystem. Called from kernel_main() before sched_init(). Prints a status line to serial via FFI back into C (serial_write_string):

#[no_mangle]
pub extern "C" fn cap_init() {
    unsafe {
        serial_write_string(
            c"[CAP] OK: capability subsystem initialized\n".as_ptr() as *const u8
        );
    }
}

cap_grant(table, n, kind, rights) — Write a capability into the first empty slot. Returns the slot index on success, -ENOCAP if the table is full:

#[no_mangle]
pub extern "C" fn cap_grant(
    table: *mut CapSlot, n: u32, kind: u32, rights: u32,
) -> i32 {
    if table.is_null() || n == 0 {
        return -(ENOCAP as i32);
    }
    let n = n.min(CAP_TABLE_SIZE);  // clamp to prevent OOB
    let slots = unsafe {
        core::slice::from_raw_parts_mut(table, n as usize)
    };
    for (i, slot) in slots.iter_mut().enumerate() {
        if slot.kind == 0 {
            slot.kind   = kind;
            slot.rights = rights;
            return i as i32;
        }
    }
    -(ENOCAP as i32)
}

cap_check(table, n, kind, rights) — Check whether a capability table contains a slot matching the requested kind with at least the requested rights. Returns 0 on success, -ENOCAP on failure:

#[no_mangle]
pub extern "C" fn cap_check(
    table: *const CapSlot, n: u32, kind: u32, rights: u32,
) -> i32 {
    if table.is_null() || n == 0 {
        return -(ENOCAP as i32);
    }
    let n = n.min(CAP_TABLE_SIZE);
    let slots = unsafe {
        core::slice::from_raw_parts(table, n as usize)
    };
    for slot in slots {
        if slot.kind == kind && (slot.rights & rights) == rights {
            return 0;
        }
    }
    -(ENOCAP as i32)
}

Safety Invariants

The Rust code relies on the following invariants, enforced by the C callers:

  1. table points to valid memory — always proc->caps, which is embedded in a kernel-allocated PCB.
  2. n does not exceed the allocation — callers always pass CAP_TABLE_SIZE. The Rust code additionally clamps n to CAP_TABLE_SIZE as a defense-in-depth measure.
  3. No concurrent mutation during grant. cap_grant is called from proc_spawn (before the task enters the run queue) or from sys_exec (which runs in the calling process’s context). No preemption hazard exists at these call sites.
  4. Panic handler — The #[panic_handler] is an infinite loop. This is acceptable because the Rust code contains no operations that can panic (no indexing, no unwrap, no arithmetic overflow in release mode).

C Header

The C side declares the types and functions in kernel/cap/cap.h:

typedef struct {
    uint32_t kind;
    uint32_t rights;
} cap_slot_t;

#define CAP_TABLE_SIZE 64u

void cap_init(void);
int  cap_grant(cap_slot_t *table, uint32_t n,
               uint32_t kind, uint32_t rights);
int  cap_check(const cap_slot_t *table, uint32_t n,
               uint32_t kind, uint32_t rights);

Only three C source files call cap_grant: proc.c (initial grants for PID 1), sys_exec.c (grants at exec and spawn), and sys_cap.c (capability syscalls). All other syscall files call only cap_check, through the same header.
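To make the grant and check semantics concrete without linking the Rust library, the sketch below re-implements both functions as a plain C model and exercises them. The model_ prefix marks these as illustrative stand-ins, not the exported symbols, and the CAP_KIND_* and CAP_RIGHTS_* macro values are taken from the tables on this page.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct { uint32_t kind; uint32_t rights; } cap_slot_t;

#define CAP_TABLE_SIZE      64u
#define ENOCAP              130
#define CAP_KIND_VFS_WRITE  2u
#define CAP_KIND_NET_SOCKET 7u
#define CAP_RIGHTS_READ     0x1u
#define CAP_RIGHTS_WRITE    0x2u

/* C model of cap_grant: fill the first empty slot, return its index. */
int model_cap_grant(cap_slot_t *table, uint32_t n,
                    uint32_t kind, uint32_t rights)
{
    if (table == NULL || n == 0)
        return -ENOCAP;
    if (n > CAP_TABLE_SIZE)
        n = CAP_TABLE_SIZE;              /* same clamp as the Rust code */
    for (uint32_t i = 0; i < n; i++) {
        if (table[i].kind == 0) {
            table[i].kind = kind;
            table[i].rights = rights;
            return (int)i;
        }
    }
    return -ENOCAP;                      /* table full */
}

/* C model of cap_check: 0 if some slot has the kind and all rights bits. */
int model_cap_check(const cap_slot_t *table, uint32_t n,
                    uint32_t kind, uint32_t rights)
{
    if (table == NULL || n == 0)
        return -ENOCAP;
    if (n > CAP_TABLE_SIZE)
        n = CAP_TABLE_SIZE;
    for (uint32_t i = 0; i < n; i++) {
        if (table[i].kind == kind && (table[i].rights & rights) == rights)
            return 0;
    }
    return -ENOCAP;
}

void grant_then_check(void)
{
    cap_slot_t caps[CAP_TABLE_SIZE] = {{0, 0}};

    int slot = model_cap_grant(caps, CAP_TABLE_SIZE,
                               CAP_KIND_VFS_WRITE, CAP_RIGHTS_WRITE);
    assert(slot == 0);                   /* first empty slot */

    /* Matching kind and rights succeeds. */
    assert(model_cap_check(caps, CAP_TABLE_SIZE,
                           CAP_KIND_VFS_WRITE, CAP_RIGHTS_WRITE) == 0);
    /* Right kind, missing rights bit: denied. */
    assert(model_cap_check(caps, CAP_TABLE_SIZE,
                           CAP_KIND_VFS_WRITE, CAP_RIGHTS_READ) == -ENOCAP);
    /* Kind never granted: denied. */
    assert(model_cap_check(caps, CAP_TABLE_SIZE,
                           CAP_KIND_NET_SOCKET, CAP_RIGHTS_READ) == -ENOCAP);
}
```

Note that the model, like the Rust code shown earlier, never merges grants: granting the same kind twice occupies two slots, and a check succeeds if any single slot satisfies it.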

Capability Lifecycle

Initialization (Boot)

kernel_main()
    |-- cap_init()          // Rust: print status, subsystem ready
    |-- ...
    |-- cap_policy_load()   // C: parse /etc/aegis/caps.d/ into memory
    +-- proc_spawn()        // Grant PID 1 (Vigil) its initial caps

PID 1 (the Vigil init process) receives a hardcoded set of capabilities:

Slot  Kind           Rights        Purpose
0     VFS_OPEN       READ          Open files
1     VFS_WRITE      WRITE         Write to files/console
2     VFS_READ       READ          Read file contents
3     IPC            READ          Unix domain sockets
4     PROC_READ      READ | WRITE  Inspect and signal children
5     THREAD_CREATE  READ          Create threads
6     POWER          READ          Shutdown/reboot

Note that PID 1 gets PROC_READ with WRITE rights (for sending signals to children) and POWER (for shutdown/reboot). Normal exec’d processes do not receive these by default — they come from policy files.

Exec — The Capability Boundary

Exec is the primary capability boundary in Aegis. When a process calls sys_exec, its capability table is completely reset:

sys_exec(path, ...)
    1. Zero the entire capability table (64 slots)
    2. Grant baseline capabilities (6 caps every process gets)
    3. Look up policy for basename(path) in /etc/aegis/caps.d/
    4. Grant service-tier policy caps unconditionally
    5. Grant admin-tier policy caps only if proc->authenticated == 1

The baseline capabilities granted to every exec’d process:

Kind           Rights  Purpose
VFS_OPEN       READ    Open files for reading
VFS_WRITE      WRITE   Write to open file descriptors
VFS_READ       READ    Read from open file descriptors
IPC            READ    AF_UNIX sockets and memfd
PROC_READ      READ    Read own /proc entries
THREAD_CREATE  READ    Create threads via clone

This is a critical security property: capabilities do not survive exec. If login holds AUTH and SETUID, and then execs /bin/stsh (the shell), the shell starts with only the baseline plus whatever stsh’s policy file grants. The AUTH capability does not leak to the shell unless explicitly configured.

The same reset-and-grant logic applies in sys_spawn (Aegis’s combined fork+exec syscall), where the child process’s table is built fresh from baseline + policy.
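The reset-and-grant sequence can be sketched as a small C model. Everything here except the slot layout, the kind values, and the baseline set is an assumption made for illustration: proc_model_t stands in for the PCB, policy_model_t for a parsed caps.d entry, and the example policy (one admin-tier DISK_ADMIN grant) is invented. The real sys_exec also performs the policy lookup by basename, which this sketch skips.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

typedef struct { uint32_t kind; uint32_t rights; } cap_slot_t;

#define CAP_TABLE_SIZE   64u
#define CAP_RIGHTS_READ  0x1u
#define CAP_RIGHTS_WRITE 0x2u
enum {  /* values from the Capability Kinds table */
    CAP_KIND_VFS_OPEN = 1, CAP_KIND_VFS_WRITE = 2, CAP_KIND_VFS_READ = 3,
    CAP_KIND_AUTH = 4, CAP_KIND_THREAD_CREATE = 9, CAP_KIND_PROC_READ = 10,
    CAP_KIND_DISK_ADMIN = 11, CAP_KIND_IPC = 15
};

typedef struct {                 /* hypothetical stand-in for the PCB */
    cap_slot_t caps[CAP_TABLE_SIZE];
    int authenticated;
} proc_model_t;

typedef struct {                 /* hypothetical parsed policy entry */
    cap_slot_t service[4]; uint32_t n_service;  /* granted unconditionally */
    cap_slot_t admin[4];   uint32_t n_admin;    /* needs authenticated == 1 */
} policy_model_t;

static const cap_slot_t BASELINE[6] = {
    { CAP_KIND_VFS_OPEN,      CAP_RIGHTS_READ  },
    { CAP_KIND_VFS_WRITE,     CAP_RIGHTS_WRITE },
    { CAP_KIND_VFS_READ,      CAP_RIGHTS_READ  },
    { CAP_KIND_IPC,           CAP_RIGHTS_READ  },
    { CAP_KIND_PROC_READ,     CAP_RIGHTS_READ  },
    { CAP_KIND_THREAD_CREATE, CAP_RIGHTS_READ  },
};

/* Write c into the first empty slot; a full table drops the grant. */
void put_cap(proc_model_t *p, cap_slot_t c)
{
    for (uint32_t i = 0; i < CAP_TABLE_SIZE; i++) {
        if (p->caps[i].kind == 0) { p->caps[i] = c; return; }
    }
}

int holds_kind(const proc_model_t *p, uint32_t kind)
{
    for (uint32_t i = 0; i < CAP_TABLE_SIZE; i++)
        if (p->caps[i].kind == kind) return 1;
    return 0;
}

void exec_reset(proc_model_t *p, const policy_model_t *pol)
{
    memset(p->caps, 0, sizeof p->caps);                 /* 1. zero table  */
    for (uint32_t i = 0; i < 6; i++)                    /* 2. baseline    */
        put_cap(p, BASELINE[i]);
    for (uint32_t i = 0; i < pol->n_service; i++)       /* 4. service tier */
        put_cap(p, pol->service[i]);
    if (p->authenticated == 1)                          /* 5. admin tier  */
        for (uint32_t i = 0; i < pol->n_admin; i++)
            put_cap(p, pol->admin[i]);
}

void exec_reset_example(void)
{
    proc_model_t p;
    memset(&p, 0, sizeof p);
    put_cap(&p, (cap_slot_t){ CAP_KIND_AUTH, CAP_RIGHTS_READ });

    policy_model_t shell = {
        .admin = { { CAP_KIND_DISK_ADMIN, CAP_RIGHTS_READ } }, .n_admin = 1
    };

    exec_reset(&p, &shell);                  /* unauthenticated exec */
    assert(!holds_kind(&p, CAP_KIND_AUTH));        /* did not survive */
    assert(!holds_kind(&p, CAP_KIND_DISK_ADMIN));  /* admin tier withheld */
    assert(holds_kind(&p, CAP_KIND_VFS_OPEN));     /* baseline present */

    p.authenticated = 1;
    exec_reset(&p, &shell);                  /* authenticated exec */
    assert(holds_kind(&p, CAP_KIND_DISK_ADMIN));
}
```

The model leaves the authenticated field untouched across exec_reset, so only the capability table is rebuilt.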

Fork — Full Inheritance

When a process calls fork(), the child receives a complete copy of the parent’s capability table and authenticated flag:

/* kernel/syscall/sys_process.c — sys_fork */
uint32_t ci;
for (ci = 0; ci < CAP_TABLE_SIZE; ci++)
    child->caps[ci] = parent->caps[ci];
child->authenticated = parent->authenticated;

This is safe because fork does not change the executable image. The child runs the same code with the same privilege level. The capability boundary is at exec, not fork.

The same inheritance applies to clone(CLONE_VM) (thread creation), which additionally requires CAP_KIND_THREAD_CREATE.

Capability Flow Diagram

                      PID 1 (Vigil)
                    [baseline + POWER + PROC_READ(W)]
                           |
                      sys_spawn("/bin/login")
                           |
                    +----- exec boundary -----+
                    | 1. Zero cap table       |
                    | 2. Grant baseline (6)   |
                    | 3. Policy: login ->     |
                    |    service AUTH SETUID  |
                    +-------------------------+
                           |
                      login process
                    [baseline + AUTH + SETUID]
                           |
                      sys_auth_session()
                    (requires AUTH cap)
                      proc->authenticated = 1
                           |
                      sys_exec("/bin/stsh")
                           |
                    +----- exec boundary -----+
                    | 1. Zero cap table       |
                    | 2. Grant baseline (6)   |
                    | 3. Policy: stsh ->      |
                    |    admin DISK_ADMIN ... |
                    |    (proc->authenticated |
                    |     == 1, so granted)   |
                    +-------------------------+
                           |
                      stsh (shell)
                    [baseline + DISK_ADMIN + POWER
                     + CAP_DELEGATE + CAP_QUERY + PROC_READ]
                           |
                    fork() -> child inherits all
                           |
                      sys_exec("/usr/bin/httpd")
                           |
                    +----- exec boundary -----+
                    | 1. Zero cap table       |
                    | 2. Grant baseline (6)   |
                    | 3. Policy: httpd ->     |
                    |    service NET_SOCKET   |
                    +-------------------------+
                           |
                      httpd
                    [baseline + NET_SOCKET]
                    (no AUTH, no DISK_ADMIN, no POWER)

Spawn with cap_mask — Restricting Child Capabilities

The sys_spawn syscall accepts an optional cap_mask parameter that allows a parent to restrict the capabilities granted to a child below what baseline + policy would normally provide. This requires CAP_KIND_CAP_DELEGATE.

The algorithm:

  1. Build the child’s capability table normally (baseline + policy).
  2. If cap_mask is provided, compute the intersection: only capabilities that appear in both the computed set AND the mask survive.
  3. The parent must hold every capability it includes in the mask (no escalation).

computed_caps = baseline | policy_caps
if cap_mask:
    for cap in cap_mask: parent must hold cap  (prevents escalation)
    final_caps = computed_caps & cap_mask
else:
    final_caps = computed_caps

This supports the principle of least privilege for service managers: Vigil can spawn httpd with only NET_SOCKET, even if httpd’s policy file grants additional capabilities.
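The intersection rule can be illustrated with a bitset model in C. This is a sketch under a simplifying assumption: each capability set is a uint32_t with one bit per kind and rights are ignored. The actual encoding of cap_mask is not specified on this page, and apply_cap_mask, KIND_BIT, and the example parent set are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define KIND_BIT(k) (UINT32_C(1) << (k))

/* Kind values from the Capability Kinds table. */
#define CAP_KIND_NET_SOCKET   7u
#define CAP_KIND_NET_ADMIN    8u
#define CAP_KIND_DISK_ADMIN   11u
#define CAP_KIND_CAP_DELEGATE 13u

/* Baseline kinds 1, 2, 3, 9, 10, 15 as a bitset. */
#define BASELINE_SET (KIND_BIT(1u) | KIND_BIT(2u) | KIND_BIT(3u) | \
                      KIND_BIT(9u) | KIND_BIT(10u) | KIND_BIT(15u))

/* Returns 0 and stores the child's final set, or -1 if the parent may not
 * use this mask. */
int apply_cap_mask(uint32_t computed, uint32_t mask,
                   uint32_t parent_held, uint32_t *final_out)
{
    if (!(parent_held & KIND_BIT(CAP_KIND_CAP_DELEGATE)))
        return -1;                    /* masking requires CAP_DELEGATE */
    if (mask & ~parent_held)
        return -1;                    /* mask names a cap the parent lacks */
    *final_out = computed & mask;     /* intersection: can only remove */
    return 0;
}

void cap_mask_example(void)
{
    /* Hypothetical parent and child policy sets. */
    uint32_t parent   = BASELINE_SET | KIND_BIT(CAP_KIND_NET_SOCKET)
                                     | KIND_BIT(CAP_KIND_CAP_DELEGATE);
    uint32_t computed = BASELINE_SET | KIND_BIT(CAP_KIND_NET_SOCKET)
                                     | KIND_BIT(CAP_KIND_NET_ADMIN);
    uint32_t mask     = BASELINE_SET | KIND_BIT(CAP_KIND_NET_SOCKET);
    uint32_t final_caps = 0;

    /* NET_ADMIN comes from policy but is not in the mask, so it is dropped. */
    int rc = apply_cap_mask(computed, mask, parent, &final_caps);
    assert(rc == 0);
    assert(final_caps == (BASELINE_SET | KIND_BIT(CAP_KIND_NET_SOCKET)));

    /* A mask naming DISK_ADMIN is rejected: the parent does not hold it. */
    rc = apply_cap_mask(computed, mask | KIND_BIT(CAP_KIND_DISK_ADMIN),
                        parent, &final_caps);
    assert(rc == -1);

    /* A mask bit absent from the computed set adds nothing to the child. */
    rc = apply_cap_mask(computed, mask | KIND_BIT(CAP_KIND_CAP_DELEGATE),
                        parent, &final_caps);
    assert(rc == 0);
    assert(!(final_caps & KIND_BIT(CAP_KIND_CAP_DELEGATE)));
}
```

Because the final set is an intersection, the mask can never add a capability that baseline plus policy did not already provide.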

Runtime Delegation — sys_cap_grant (Syscall 363)

A process holding CAP_KIND_CAP_DELEGATE can grant capabilities to a running process by PID:

uint64_t sys_cap_grant_runtime(uint64_t target_pid,
                                uint64_t kind,
                                uint64_t rights)

Safety checks:

  1. Caller must hold CAP_DELEGATE.
  2. Caller must hold the specific (kind, rights) being granted — prevents privilege escalation.
  3. Target process must exist.
  4. Target’s capability table must have an empty slot.

Returns the slot index on success, or a negative errno.
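The four checks can be sketched in order as a C model. Several details are assumptions, not taken from the Aegis sources: the CAP_DELEGATE check uses READ rights as a presence test, a missing target is reported as -ESRCH, a NULL target table stands in for "no such PID", and all model_ names are hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

typedef struct { uint32_t kind; uint32_t rights; } cap_slot_t;

#define CAP_TABLE_SIZE        64u
#define ENOCAP                130
#define CAP_RIGHTS_READ       0x1u
#define CAP_KIND_NET_SOCKET   7u
#define CAP_KIND_DISK_ADMIN   11u
#define CAP_KIND_CAP_DELEGATE 13u

int model_check(const cap_slot_t *t, uint32_t kind, uint32_t rights)
{
    for (uint32_t i = 0; i < CAP_TABLE_SIZE; i++)
        if (t[i].kind == kind && (t[i].rights & rights) == rights)
            return 0;
    return -ENOCAP;
}

int model_grant(cap_slot_t *t, uint32_t kind, uint32_t rights)
{
    for (uint32_t i = 0; i < CAP_TABLE_SIZE; i++) {
        if (t[i].kind == 0) {
            t[i].kind = kind;
            t[i].rights = rights;
            return (int)i;
        }
    }
    return -ENOCAP;
}

/* target == NULL models "no process with that PID". */
int model_runtime_grant(const cap_slot_t *caller, cap_slot_t *target,
                        uint32_t kind, uint32_t rights)
{
    if (model_check(caller, CAP_KIND_CAP_DELEGATE, CAP_RIGHTS_READ) != 0)
        return -ENOCAP;                        /* 1. caller may delegate  */
    if (model_check(caller, kind, rights) != 0)
        return -ENOCAP;                        /* 2. caller holds the cap */
    if (target == NULL)
        return -ESRCH;                         /* 3. errno is an assumption */
    return model_grant(target, kind, rights);  /* 4. needs an empty slot  */
}

void runtime_grant_example(void)
{
    cap_slot_t manager[CAP_TABLE_SIZE] = {{0, 0}};
    cap_slot_t worker[CAP_TABLE_SIZE]  = {{0, 0}};
    model_grant(manager, CAP_KIND_CAP_DELEGATE, CAP_RIGHTS_READ);
    model_grant(manager, CAP_KIND_NET_SOCKET, CAP_RIGHTS_READ);

    int slot = model_runtime_grant(manager, worker,
                                   CAP_KIND_NET_SOCKET, CAP_RIGHTS_READ);
    assert(slot == 0);
    assert(model_check(worker, CAP_KIND_NET_SOCKET, CAP_RIGHTS_READ) == 0);

    /* The manager lacks DISK_ADMIN, so it cannot hand it out. */
    int denied = model_runtime_grant(manager, worker,
                                     CAP_KIND_DISK_ADMIN, CAP_RIGHTS_READ);
    assert(denied == -ENOCAP);

    int missing = model_runtime_grant(manager, NULL,
                                      CAP_KIND_NET_SOCKET, CAP_RIGHTS_READ);
    assert(missing == -ESRCH);
}
```

Checking the caller's own holdings before looking up the target means an unprivileged caller learns nothing about which PIDs exist; whether the real syscall orders its checks this way is not stated here.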

Authentication — sys_auth_session (Syscall 364)

The authenticated flag on a process controls whether admin-tier policy capabilities are granted at exec time. Only a process holding CAP_KIND_AUTH can set this flag:

uint64_t sys_auth_session(void)

The flag is inherited across fork and clone, and survives exec. This means once a session is authenticated (e.g., by login after password verification), all descendant processes in that session inherit the authenticated state.

Capability Query — sys_cap_query (Syscall 362)

Any process can query its own capability table (pid == 0). Querying another process’s table requires CAP_KIND_CAP_QUERY:

uint64_t sys_cap_query(uint64_t pid, uint64_t buf_uptr,
                        uint64_t buflen)

Returns cap_slot_t entries copied to the user-space buffer.

Syscall Enforcement Points

Every privileged operation in the kernel checks capabilities before proceeding. The pattern is consistent:

aegis_process_t *proc = (aegis_process_t *)sched_current();
if (cap_check(proc->caps, CAP_TABLE_SIZE,
              CAP_KIND_*, CAP_RIGHTS_*) != 0)
    return (uint64_t)-(int64_t)ENOCAP;

The full set of enforcement points:

Subsystem         Source File     Capability Required
File open         sys_file.c      VFS_OPEN + READ
File write        sys_io.c        VFS_WRITE + WRITE
File read         sys_io.c        VFS_READ + READ
Directory ops     sys_dir.c       VFS_OPEN + READ (mkdir, rmdir, unlink)
/etc/shadow open  vfs.c           AUTH + READ
Socket (AF_UNIX)  sys_socket.c    IPC + READ
Socket (AF_INET)  sys_socket.c    NET_SOCKET + READ
Network config    sys_socket.c    NET_ADMIN + READ
Thread create     sys_process.c   THREAD_CREATE + READ
Process signal    sys_signal.c    PROC_READ + WRITE
setuid/setgid     sys_identity.c  SETUID + READ
Block device I/O  sys_disk.c      DISK_ADMIN + READ
Framebuffer map   sys_memory.c    FB + READ
memfd_create      sys_memory.c    IPC + READ
Reboot/shutdown   sys_meta.c      POWER + READ
Process metadata  sys_meta.c      Various (PROC_READ, SETUID)
procfs cross-PID  procfs.c        PROC_READ + READ

ENOCAP — The Capability Error

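When a capability check fails, the kernel returns ENOCAP (errno 130). ENOCAP is specific to Aegis and is not a POSIX errno. The number itself is not unique across systems (Linux, for example, assigns 130 to EOWNERDEAD), so it carries this meaning only on Aegis: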

#define ENOCAP 130

The value is defined without the u suffix so that -ENOCAP produces a clean signed expression. User-space programs should check errno == 130 or use the Aegis-provided header to detect capability denials.
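The effect of the missing u suffix can be checked with a few lines of C. This sketch assumes a platform with 32-bit unsigned int and two's-complement conversion between int64_t and uint64_t, which holds on common compilers; deny and enocap_examples are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

#define ENOCAP 130

/* Encodes a capability denial the way the enforcement pattern does. */
uint64_t deny(void)
{
    return (uint64_t)-(int64_t)ENOCAP;
}

void enocap_examples(void)
{
    /* Signed literal: negation yields an ordinary negative int. */
    assert(-ENOCAP == -130);

    /* The syscall return value round-trips back to -130 when read as signed. */
    assert((int64_t)deny() == -130);
    assert(deny() == UINT64_C(0xFFFFFFFFFFFFFF7E));

    /* With a u suffix, negation would wrap within unsigned int instead,
     * and widening that to 64 bits would not produce a negative errno. */
    assert(-130u == 4294967166u);
    assert((uint64_t)(-130u) != deny());
}
```

A caller that interprets the 64-bit return value as signed therefore sees -130 directly, with no special handling for the capability error.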

Comparison with Other Capability Systems

vs. seL4

seL4 capabilities are kernel object references (endpoints, CNodes, frames) that are the sole mechanism for accessing any kernel resource. Aegis capabilities are simpler: they are (kind, rights) tags that gate syscall access. Aegis does not use capabilities as object references — file descriptors and PIDs remain the primary handles.

Aspect                  seL4                       Aegis
Capability granularity  Per-object                 Per-operation-class
Object references       Via capabilities           Via fds/PIDs
CSpace management       User-level CNode trees     Flat 64-slot array
Delegation              Copy/mint between CSpaces  cap_mask at exec, sys_cap_grant at runtime
Revocation              Via retype/revoke          Slot zeroing (no cascading revoke)

Aegis trades seL4’s fine-grained per-object capabilities for a simpler model that maps naturally onto Unix syscall patterns. This is intentional — Aegis aims to be a practical, auditable Unix-like OS, not a formally verified microkernel.

vs. Capsicum (FreeBSD)

Capsicum introduces “capability mode” where a process voluntarily gives up ambient authority and operates only through pre-opened file descriptors with cap_rights_t restrictions. Aegis enforces no-ambient-authority from the start — there is no “enter capability mode” transition.

Aspect               Capsicum                      Aegis
Opt-in vs mandatory  Opt-in (cap_enter)            Mandatory from boot
Scope                Per-fd rights                 Per-process operation class
File access          Restricted fd + cap_rights_t  Capability kind + Unix permissions
Compatibility        Runs alongside POSIX          Replaces POSIX permission model

vs. Linux Capabilities (POSIX 1003.1e draft)

Linux capabilities split root’s privileges into ~40 fine-grained bits (CAP_NET_BIND_SERVICE, CAP_SYS_ADMIN, etc.). These are primarily a mechanism for running daemons with partial root privilege. Aegis capabilities are fundamentally different: they are not a decomposition of root, but the only mechanism for authorizing privileged operations. There is no root user, no setuid bits, no CAP_SYS_ADMIN escape hatch.

Aspect             Linux capabilities                                    Aegis capabilities
Purpose            Decompose root                                        Replace all authorization
Root bypass        CAP_SYS_ADMIN ~ root                                  No equivalent
Storage            Per-thread bitmask (effective/permitted/inheritable)  Per-process 64-slot table
File caps          xattr on executables                                  Policy files in /etc/aegis/caps.d/
Ambient authority  Present (for non-cap-aware programs)                  None

Security Properties

The properties below describe the design intent of the capability model. As noted at the top of this page, Aegis v1 is young software. The architectural properties are sound, but the implementation – particularly the C code surrounding the Rust capability core – has not been subjected to the level of adversarial review needed to make strong assurance claims. Undiscovered memory safety bugs in the C kernel could potentially bypass these properties. The ongoing C-to-Rust migration is specifically motivated by closing this gap.

  1. No capability forgery. The capability table is in kernel memory. User space has no mechanism to write to it except through sys_cap_grant (which requires CAP_DELEGATE and the specific capability being granted).

  2. No privilege escalation through delegation. sys_cap_grant requires the caller to hold every capability it delegates. cap_mask in exec is an intersection – it can only remove capabilities, never add them.

  3. Exec is a hard boundary. Capabilities do not leak across exec. A process that holds AUTH cannot pass it to an exec’d child unless the child’s policy file explicitly grants it.

  4. Two-tier policy. Admin-tier capabilities require an authenticated session. Service-tier capabilities are granted unconditionally. This prevents unauthenticated processes from gaining admin privileges even if they exec a binary with admin-tier policy.

  5. Defense in depth. The Rust implementation clamps the caller-supplied table length to CAP_TABLE_SIZE, preventing out-of-bounds access even if a caller passes an oversized value. NULL pointer checks prevent undefined behavior when a caller passes a null table pointer. However, the C callers of these Rust functions are themselves subject to the usual C memory safety risks – a buffer overflow in an unrelated syscall handler could corrupt a process’s capability table in memory, bypassing the Rust validation entirely. This is the primary motivation for expanding the Rust boundary over time.

Future Work

The existing design documents outline several planned enhancements:

  • Unforgeable tokens. When cross-process delegation matures, capability slots may carry a kernel-generated random id field to prevent guessing attacks against delegated capabilities.
  • Cascading revocation. Currently, revoking a capability means zeroing its slot. Future work may add cascading revocation where revoking a parent capability automatically revokes all delegated copies.
  • Per-file capabilities. The current model gates operation classes (open, read, write). Future iterations may support capabilities scoped to specific filesystem paths or inodes.
  • Expanded Rust boundary. The capability module (kernel/cap/) is currently the only kernel subsystem in Rust. The long-term plan is a gradual, subsystem-by-subsystem migration of the kernel from C to Rust, starting with security-critical paths. The FFI pattern established here – #![no_std] staticlib crates with extern "C" exports – will serve as the template for future conversions. Expanding the Rust boundary reduces the attack surface for memory corruption bugs that could undermine the capability model’s security properties.