Processes & ELF Loading

Aegis processes are represented by aegis_process_t, an extension of the scheduler’s aegis_task_t. The kernel supports the full POSIX process lifecycle: fork, clone (threads), execve, waitpid, and exit. ELF64 binaries are loaded from either the initrd or the ext2 filesystem, with support for dynamically linked executables via a PT_INTERP interpreter.

v1 maturity note: Aegis is v1 software – the first version deemed ready for public release, not a mature or production-hardened system. The process subsystem is written in C and has not been subjected to adversarial testing. There are likely exploitable vulnerabilities in the ELF loader, fork/clone paths, and address space management, as would be expected in any from-scratch OS at this stage. Security audit findings to date have identified hypothetical threats, but real exploitable bugs almost certainly exist. A gradual migration from C to Rust is planned starting with the kernel; the capability system (kernel/cap/) is already in Rust. Contributions are welcome – file issues or propose changes at exec/aegis.

Process Control Block

aegis_process_t is defined in kernel/proc/proc.h. Its task field must be at offset 0 – the scheduler stores all tasks as aegis_task_t * and casts to aegis_process_t * when task.is_user == 1.

typedef struct aegis_process {
    aegis_task_t  task;                    /* offset 0 -- scheduler casts here */
    uint64_t      pml4_phys;              /* physical address of process PML4 */
    fd_table_t   *fd_table;               /* shared, refcounted fd table */
    cap_slot_t    caps[CAP_TABLE_SIZE];   /* capability table (64 slots) */
    uint32_t      authenticated;          /* 1 if session passed login auth */
    uint64_t      brk;                    /* current heap limit (user VA) */
    uint64_t      brk_base;               /* initial brk (ELF end); shrink floor */
    uint64_t      mmap_base;              /* next anonymous mmap VA; bump allocator */
    mmap_free_t   mmap_free[64];          /* VA freelist for munmap->mmap reuse */
    uint32_t      mmap_free_count;
    spinlock_t    mmap_free_lock;         /* guards mmap_free[] for CLONE_VM threads */
    vma_entry_t  *vma_table;              /* per-process VMA tracking */
    uint32_t      vma_count;
    uint32_t      vma_capacity;           /* max entries (170 per kva page) */
    uint32_t      vma_refcount;           /* 1 = sole owner; >1 = shared (CLONE_VM) */
    char          exe_path[256];          /* binary path, set at execve */
    uint32_t      pid;                    /* unique process ID; 1 = init */
    uint32_t      tgid;                   /* thread group ID (= leader PID) */
    uint32_t      thread_count;           /* live threads in this group */
    uint32_t      ppid;                   /* parent PID; 0 = no parent */
    uint32_t      uid, gid;              /* user/group ID; 0 = root */
    uint32_t      pgid;                   /* process group ID */
    uint32_t      sid;                    /* session ID */
    uint32_t      umask;                  /* file creation mask; default 022 */
    uint32_t      stop_signum;            /* signal that caused TASK_STOPPED */
    char          cwd[256];               /* current working directory */
    uint64_t      exit_status;            /* lower 8 bits = exit code */
    uint64_t      pending_signals;        /* bitmask; bit N = signal N pending */
    uint64_t      signal_mask;            /* blocked signals */
    k_sigaction_t sigactions[64];         /* per-signal handler/mask/flags */
} aegis_process_t;

Key Design Points

  • PID allocation is monotonically increasing under a spinlock (proc_alloc_pid). PID 1 is always init.
  • Thread group: tgid equals the leader’s PID. thread_count is tracked on the leader.
  • fd table is reference-counted and shared across CLONE_FILES threads. See VFS Layer for details.
  • Capability table: 64 slots of (capability kind, rights bitfield) pairs. Exec is a capability boundary – capabilities are reset to the baseline capabilities and then augmented by policy capabilities from /etc/aegis/caps.d/. See capability model.
  • VMA table: Dynamically allocated, reference-counted, supports sharing (CLONE_VM) and deep copy (fork).
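
The PID-allocation rule above can be sketched in a few lines. This is a user-space model, not the kernel code: proc_alloc_pid and the "PID 1 = init" convention come from the text, while the C11 atomic standing in for the kernel spinlock is an assumption.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Monotonic PID allocation, modeled with a C11 atomic in place of the
 * kernel's spinlock. The first allocation yields PID 1 (init). */
static atomic_uint next_pid = 1;

uint32_t proc_alloc_pid(void) {
    /* fetch_add hands each caller a unique, strictly increasing PID,
     * mirroring the increment-under-spinlock scheme described above. */
    return atomic_fetch_add(&next_pid, 1);
}
```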

User Address Space Layout

0x0000000000000000  +-----------------------+
                    | ELF text/data segments|  (loaded from binary)
                    +-----------------------+
                    | [heap: brk_base..brk] |  (grows up via sys_brk)
                    +-----------------------+
                    |         ...           |
0x0000000040000000  | Interpreter (ld.so)   |  (INTERP_BASE, if PT_INTERP)
                    +-----------------------+
                    |         ...           |
0x000007FFFFFFB000  | User stack (16 KB)    |  (4 pages, grows down)
0x000007FFFFFFF000  | (stack top)           |
                    +-----------------------+
                    |         ...           |
0x0000700000000000  | mmap region           |  (grows up, bump allocator)
                    +-----------------------+
                    |         ...           |
0xFFFF800000000000  | Kernel space          |  (shared via pd_hi)

Region        Address                          Size / growth
ELF segments  defined by ELF p_vaddr           variable
Heap          brk_base to brk                  grows via sys_brk
Interpreter   0x40000000 (INTERP_BASE)         if dynamically linked
User stack    0x07FFFFFFB000 - 0x07FFFFFFF000  16 KB (4 pages)
mmap arena    0x700000000000 upward            bump allocator

VMA Tracking

Each process maintains a sorted array of Virtual Memory Area descriptors:

typedef struct {
    uint64_t base;
    uint64_t len;
    uint32_t prot;    /* PROT_READ | PROT_WRITE | PROT_EXEC */
    uint8_t  type;    /* VMA_* constant */
} vma_entry_t;  /* 24 bytes, 170 per kva page */
VMA Type          Value  Description
VMA_ELF_TEXT      1      PT_LOAD with PROT_EXEC
VMA_ELF_DATA      2      PT_LOAD without PROT_EXEC
VMA_HEAP          3      brk region
VMA_STACK         4      User stack
VMA_MMAP          5      Anonymous mmap
VMA_THREAD_STACK  6      Thread stack via pthread_create
VMA_GUARD         7      Guard page (PROT_NONE)
VMA_SHARED        8      MAP_SHARED mapping (memfd)

Operations: vma_insert (merges adjacent entries), vma_remove (splits at boundaries), vma_clone (deep copy for fork), vma_share (refcount increment for CLONE_VM).
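
The sorted-insert-with-merge behavior of vma_insert can be illustrated with a self-contained sketch. The merge conditions (contiguous, same prot and type) and the bounds handling here are assumptions based on the description above, not the kernel's exact code.

```c
#include <stdint.h>
#include <string.h>

typedef struct {
    uint64_t base;
    uint64_t len;
    uint32_t prot;    /* PROT_READ | PROT_WRITE | PROT_EXEC */
    uint8_t  type;    /* VMA_* constant */
} vma_entry_t;

/* Keep the array sorted by base; merge the new entry with any neighbor
 * that is contiguous and shares prot/type. Returns 0, or -1 if full. */
int vma_insert(vma_entry_t *tab, uint32_t *count, uint32_t cap,
               vma_entry_t nv) {
    uint32_t i = 0;
    while (i < *count && tab[i].base < nv.base)
        i++;

    /* merge into the predecessor if it ends exactly where nv begins */
    if (i > 0 && tab[i-1].base + tab[i-1].len == nv.base &&
        tab[i-1].prot == nv.prot && tab[i-1].type == nv.type) {
        tab[i-1].len += nv.len;
        /* absorb the successor too if the gap just closed */
        if (i < *count && tab[i-1].base + tab[i-1].len == tab[i].base &&
            tab[i].prot == nv.prot && tab[i].type == nv.type) {
            tab[i-1].len += tab[i].len;
            memmove(&tab[i], &tab[i+1], (*count - i - 1) * sizeof *tab);
            (*count)--;
        }
        return 0;
    }
    /* merge into the successor if nv ends exactly where it begins */
    if (i < *count && nv.base + nv.len == tab[i].base &&
        tab[i].prot == nv.prot && tab[i].type == nv.type) {
        tab[i].base = nv.base;
        tab[i].len += nv.len;
        return 0;
    }
    if (*count == cap)
        return -1;  /* table full */
    memmove(&tab[i+1], &tab[i], (*count - i) * sizeof *tab);
    tab[i] = nv;
    (*count)++;
    return 0;
}
```

Merging keeps the table compact, which matters given the hard capacity of 170 entries per kva page.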

ELF Loading

ELF loading is implemented in kernel/proc/elf.c. The loader handles ELF64 executables (ET_EXEC) and position-independent executables (ET_DYN).

Note that the ELF parser is v1 C code that parses untrusted binary input. While basic sanity checks exist (magic validation, segment size caps, overflow guards), a crafted ELF binary could likely exploit parsing bugs to achieve kernel code execution. This is a primary target for the planned Rust migration.

elf_load()

int elf_load(uint64_t pml4_phys, const uint8_t *data,
             size_t len, uint64_t base, elf_load_result_t *out);

Parameters:

  • pml4_phys – physical address of the target process’s PML4
  • data – pointer to the ELF binary in kernel memory
  • base – added to all virtual addresses (0 for ET_EXEC, INTERP_BASE for interpreter)
  • out – result structure filled on success

Result structure:

typedef struct {
    uint64_t entry;       /* entry point VA */
    uint64_t brk;         /* first byte after last segment (page-aligned) */
    uint64_t phdr_va;     /* VA of program header table in loaded image */
    uint32_t phdr_count;  /* number of program headers */
    uint64_t base;        /* base address used for loading */
    char     interp[256]; /* PT_INTERP path (if present) */
} elf_load_result_t;

Loading Process

  1. Validate ELF magic (\x7FELF) and check for ELF64, correct architecture (EM_X86_64 or EM_AARCH64)
  2. Scan for PT_INTERP – extract interpreter path if present (max 255 bytes)
  3. For each PT_LOAD segment:
    • Page-align virtual base downward; compute sub-page offset
    • Guard against integer overflow (4 GB per-segment cap)
    • Allocate KVA pages for the segment
    • Zero bytes before p_vaddr within the first page
    • Copy file bytes (p_filesz) at the correct sub-page offset
    • Zero BSS (bytes from p_filesz to p_memsz)
    • Map pages into user PML4 with appropriate flags (VMM_FLAG_USER, VMM_FLAG_WRITABLE if PF_W)
    • Record VMA entry for the segment
  4. Compute results: entry = e_entry + base, brk = page-aligned end of last segment, phdr_va = first PT_LOAD vaddr + base + e_phoff
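
The per-segment arithmetic in step 3 (align down, carry the sub-page offset, cap the size) can be sketched as follows. elf_segment_pages is a hypothetical helper for illustration, not the loader's real interface.

```c
#include <stdint.h>

#define PAGE_SIZE 4096ULL
#define SEG_MAX   (4ULL << 30)   /* 4 GB per-segment cap from step 3 */

/* Given a PT_LOAD's p_vaddr and p_memsz, compute the page-aligned base,
 * the sub-page offset, and the number of pages to allocate.
 * Returns 0 for a rejected (empty, oversized, or overflowing) segment. */
uint64_t elf_segment_pages(uint64_t vaddr, uint64_t memsz,
                           uint64_t *page_base, uint64_t *sub_off) {
    if (memsz == 0 || memsz > SEG_MAX)
        return 0;                           /* per-segment size cap */
    *page_base = vaddr & ~(PAGE_SIZE - 1);  /* align virtual base down */
    *sub_off   = vaddr - *page_base;        /* offset within first page */
    uint64_t span = *sub_off + memsz;
    if (span < memsz)
        return 0;                           /* defensive overflow guard */
    return (span + PAGE_SIZE - 1) / PAGE_SIZE;
}
```

File bytes (p_filesz) are then copied at *sub_off into the first page, and everything from p_filesz to p_memsz is zeroed for BSS.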

Dynamic Linking Support

If the main binary has a PT_INTERP segment, proc_spawn loads the interpreter (typically ld-musl-x86_64.so.1) at INTERP_BASE (0x40000000). The entry point becomes the interpreter’s entry rather than the binary’s. The auxiliary vector provides the information the interpreter needs to find and relocate the main binary.

Process Creation

proc_spawn() – Initial Process

Called from kernel_main to create the init process before sched_start():

  1. Allocates PCB (2 KVA pages – the PCB exceeds 4 KB with capability and signal tables)
  2. Allocates 16 KB kernel stack (4 KVA pages)
  3. Creates per-process page tables via vmm_create_user_pml4
  4. Loads ELF via elf_load, including interpreter if PT_INTERP present
  5. Maps 4-page user stack at 0x07FFFFFFB000 - 0x07FFFFFFF000
  6. Builds SysV ABI initial stack (see below)
  7. Constructs kernel stack frame chaining through ctx_switch -> proc_enter_user -> iretq/ERET
  8. Grants 7 baseline capabilities
  9. Pre-opens fd 0 (stdin/keyboard), fd 1 (stdout/console), fd 2 (stderr/console)
  10. Initializes heap break to top of ELF segments, mmap base to 0x700000000000
  11. Adds to scheduler via sched_add()

SysV ABI Initial Stack

The initial user stack follows the System V AMD64 ABI:

high addresses
  +------------------+
  | argv[0] string   |  "- init" (login shell prefix)
  +------------------+
  | AT_RANDOM data   |  16 random bytes
  +------------------+
  | (alignment pad)  |
  +------------------+
  | AT_NULL pair     |
  | AT_RANDOM / va   |
  | AT_ENTRY / entry |
  | AT_PAGESZ / 4096 |
  | AT_PHNUM / count |
  | AT_PHDR / va     |
  | AT_BASE / interp |  (if dynamically linked)
  | AT_PHENT / 56    |  (if dynamically linked)
  +------------------+
  | envp NULL        |
  | argv NULL        |
  | argv[0] pointer  |
  | argc = 1         |  <- RSP (16-byte aligned + 8)
  +------------------+
low addresses

RSP is aligned so that RSP % 16 == 8 at _start. This mimics a normal function entry – a 16-byte-aligned stack minus the 8-byte return address pushed by call – so compiler-generated prologue code restores 16-byte alignment before the first call instruction, as the SysV ABI requires.
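
A sketch of the stack construction, working in an ordinary buffer rather than mapped user pages. It pushes only the argv[0] string plus the six terminator words, to show the alignment rule; build_initial_stack and this minimal word layout are illustrative, not the kernel's exact routine.

```c
#include <stdint.h>
#include <string.h>

/* Build a minimal SysV initial stack inside `stack` (indices 0..top).
 * Pushes the argv[0] string, then argc / argv[0] / argv NULL / envp NULL
 * / AT_NULL pair, padding so the returned "RSP" satisfies rsp % 16 == 8. */
uint64_t build_initial_stack(uint8_t *stack, uint64_t top,
                             const char *arg0) {
    uint64_t sp = top;

    /* argv[0] string lives at the highest addresses */
    size_t len = strlen(arg0) + 1;
    sp -= len;
    memcpy(stack + sp, arg0, len);
    uint64_t arg0_va = sp;

    sp &= ~7ULL;               /* 8-byte align before pushing words */

    /* words still to push: AT_NULL pair (2), envp NULL, argv NULL,
     * argv[0] pointer, argc = 6 words */
    uint64_t words = 6;
    if ((sp - words * 8) % 16 != 8)
        sp -= 8;               /* alignment pad, as in the diagram */

    uint64_t *w = (uint64_t *)(stack + sp) - words;
    w[0] = 1;                  /* argc */
    w[1] = arg0_va;            /* argv[0] pointer */
    w[2] = 0;                  /* argv NULL terminator */
    w[3] = 0;                  /* envp NULL terminator */
    w[4] = 0; w[5] = 0;        /* AT_NULL pair */
    return sp - words * 8;     /* final user RSP, points at argc */
}
```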

Auxiliary Vector Entries

Tag             Value        Description
AT_PHDR (3)     phdr_va      VA of program header table
AT_PHNUM (5)    phdr_count   Number of program headers
AT_PAGESZ (6)   4096         System page size
AT_ENTRY (9)    entry        Binary entry point (before interpreter redirect)
AT_RANDOM (25)  va           Pointer to 16 random bytes
AT_BASE (7)     INTERP_BASE  Interpreter load address (if PT_INTERP)
AT_PHENT (4)    56           Program header entry size (if PT_INTERP)

sys_fork() – Syscall 57

Duplicates the calling process with a full deep copy of the address space:

  1. Allocates child PCB (2 KVA pages)
  2. Copies fd table via fd_table_copy (new table, increments driver refs)
  3. Copies capability table and authenticated flag
  4. Copies all scalar fields (brk, mmap_base, cwd, uid, gid, pgid, sid, umask)
  5. Deep-copies VMA table via vma_clone
  6. Creates new PML4 and deep-copies all user pages via vmm_copy_user_pml4
  7. Allocates 16 KB child kernel stack
  8. Builds initial kernel stack frame so first schedule returns via isr_post_dispatch -> iretq to user space with rax = 0
  9. Signal state: inherits mask and dispositions, clears pending (Linux semantics)
  10. Adds child to scheduler run queue
  11. Returns child PID to parent

Process limit: Total processes capped at MAX_PROCESSES (64). Exceeding returns -EAGAIN.

Fork bomb protection: The s_fork_count counter is checked before allocating any resources. Note that this is a basic v1 safeguard – a hard global cap, not a per-user or cgroup-style limit. It prevents trivial fork bombs but does not constitute a robust resource isolation mechanism.
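
As a sketch, the guard amounts to a compare-and-bump before any allocation. fork_limit_check is a hypothetical stand-in for the inline check in sys_fork; the counter name and the MAX_PROCESSES cap come from the text.

```c
#include <stdint.h>

#define MAX_PROCESSES 64
#define EAGAIN 11

/* Global live-process count, checked before any resources are
 * allocated. Starts at 1 because init is already running. */
static uint32_t s_fork_count = 1;

int fork_limit_check(void) {
    if (s_fork_count >= MAX_PROCESSES)
        return -EAGAIN;        /* refuse before allocating anything */
    s_fork_count++;
    return 0;
}
```

A hard global cap like this stops a runaway fork loop, but as the note above says, it is not per-user resource isolation.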

sys_clone() – Syscall 56

Creates a new thread (CLONE_VM set) or delegates to sys_fork (no CLONE_VM):

Clone Flags (Linux ABI)

Flag                  Value      Effect
CLONE_VM              0x100      Share address space (same PML4)
CLONE_FS              0x200      Share filesystem info
CLONE_FILES           0x400      Share fd table (refcount increment)
CLONE_SIGHAND         0x800      Share signal handlers
CLONE_VFORK           0x4000     Block parent until child exits/execs
CLONE_THREAD          0x10000    Same thread group (tgid)
CLONE_SETTLS          0x80000    Set TLS pointer for child
CLONE_PARENT_SETTID   0x100000   Write child TID to parent’s *ptid
CLONE_CHILD_CLEARTID  0x200000   Clear TID + futex wake on exit
CLONE_CHILD_SETTID    0x1000000  Write child TID to child’s *ctid
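
The top-level dispatch – thread path when CLONE_VM is set, fork otherwise – might look like the following. clone_dispatch is illustrative only; the flag values are the Linux ABI constants from the table.

```c
#include <stdint.h>

#define CLONE_VM     0x100
#define CLONE_FILES  0x400
#define CLONE_THREAD 0x10000

typedef enum { PATH_FORK, PATH_THREAD } clone_path_t;

/* Decide between the fork and thread paths of sys_clone, and report
 * which resources the thread path would share. */
clone_path_t clone_dispatch(uint64_t flags, int *share_fds, int *same_tgid) {
    if (!(flags & CLONE_VM))
        return PATH_FORK;      /* no shared VM: degenerate to sys_fork */
    *share_fds = (flags & CLONE_FILES)  != 0;
    *same_tgid = (flags & CLONE_THREAD) != 0;
    return PATH_THREAD;
}
```

A pthread_create-style call passes CLONE_VM | CLONE_FILES | CLONE_THREAD (among others), so it takes the thread path with a shared fd table and the leader's tgid.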

Thread Creation Flow

  1. Validate TLS and CLEARTID pointers (reject kernel addresses)
  2. Check CAP_KIND_THREAD_CREATE capability
  3. Check process limit
  4. Allocate child PCB (2 KVA pages)
  5. Share address space: child->pml4_phys = parent->pml4_phys
  6. Share or copy fd table based on CLONE_FILES
  7. Copy capabilities and scalar fields
  8. Share VMA table via vma_share (refcount increment)
  9. Set up thread group membership (CLONE_THREAD -> same tgid)
  10. Configure TLS (CLONE_SETTLS), clear_child_tid (CLONE_CHILD_CLEARTID)
  11. Allocate 16 KB kernel stack
  12. Build initial kernel stack frame (same layout as fork)
  13. Add to scheduler, handle CLONE_PARENT_SETTID / CLONE_CHILD_SETTID
  14. If CLONE_VFORK: block parent until child exits or execs

Error Cleanup

If kernel stack allocation fails after partial setup, sys_clone carefully rolls back: decrements thread_count, calls vma_free (decrements refcount), unrefs fd table, and frees the PCB.

sys_execve() – Syscall 59

Replaces the calling process image in place:

  1. Copy path and argv from user space (max 64 args, each max 255 bytes). The 16 KB argv working buffer is allocated from KVA because it is too large for the kernel stack.
  2. Binary lookup: initrd first (trusted, no permission check), then ext2 with X_OK DAC permission check
  3. Load binary into KVA for ext2-backed files
  4. Point of no return: vmm_free_user_pages destroys the old image. After this, any failure is fatal (matches Linux behavior after flush_old_exec).
  5. Reset state: brk, mmap_base, mmap freelist, VMA table, fs_base, exe_path
  6. Reset capabilities: Clear capability table to the 6 baseline capabilities, then apply policy capabilities from /etc/aegis/caps.d/ based on binary name and authentication state. Exec is a capability boundary – the previous process’s capabilities do not propagate.
  7. Load new ELF via elf_load, including interpreter if PT_INTERP present
  8. Allocate fresh user stack (4 pages at 0x07FFFFFFB000)
  9. Build SysV ABI initial stack with argc, argv pointers, envp, and auxiliary vector
  10. Redirect SYSRET: modify the syscall frame to point RIP to the new entry and RSP to the new stack

sys_exit() / sys_exit_group() – Syscalls 60 / 231

sys_exit

  1. Store exit code in proc->exit_status (lower 8 bits)
  2. Handle clear_child_tid: write 0 and futex wake (for pthread_join)
  3. Session leader exit: send SIGHUP + SIGCONT to foreground process group, disassociate terminal
  4. PID 1 exit triggers system halt
  5. Reparent orphan children to init (PID 1)
  6. Call sched_exit() (never returns)

sys_exit_group

Same as sys_exit but also kills all other threads in the same thread group (tgid), performing clear_child_tid + futex wake for each killed thread.

sys_waitpid() – Syscall 61

Waits for a child process to change state:

  • pid > 0: wait for specific child
  • pid == -1 or pid == 0: wait for any child
  • WNOHANG flag: return immediately if no child has exited
  • WUNTRACED flag: also report stopped children

Walks the scheduler’s circular task list looking for zombie or stopped children. On finding a match:

  • Writes wstatus to user memory (exit code shifted left by 8)
  • For zombies: removes from task list, frees PCB, kernel stack, user page tables, decrements fork count
  • For stopped children: reports stop signal without reaping

If no matching child has exited and WNOHANG is not set, the parent blocks via sched_block() until woken by SIGCHLD from a child’s exit.
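
The wstatus layout ("exit code shifted left by 8") can be shown with small encode/decode helpers. The exited encoding comes from the text; the stopped-child encoding with the 0x7f marker in the low byte is an assumption based on the Linux convention that the WUNTRACED behavior above implies.

```c
/* Encode wstatus the way sys_waitpid writes it, and decode it the way
 * the standard W* macros would. */
static inline int make_exit_status(int code)   { return (code & 0xff) << 8; }
static inline int make_stop_status(int signum) { return ((signum & 0xff) << 8) | 0x7f; }

static inline int w_ifexited(int st)   { return (st & 0x7f) == 0; }
static inline int w_exitstatus(int st) { return (st >> 8) & 0xff; }
static inline int w_ifstopped(int st)  { return (st & 0xff) == 0x7f; }
static inline int w_stopsig(int st)    { return (st >> 8) & 0xff; }
```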

Process Lookup

aegis_process_t *proc_find_by_pid(uint32_t pid);

Walks the scheduler’s circular task list under sched_lock. Checks the current task first, then iterates. Returns NULL if no matching user process is found. Used by sys_kill, sys_waitpid, sys_cap_query, and other syscalls that operate on a target PID.
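
Because the task field sits at offset 0, the lookup is a cast-and-walk over the circular list. A reduced sketch, with task_t/process_t as stripped-down stand-ins for aegis_task_t/aegis_process_t and locking omitted:

```c
#include <stdint.h>
#include <stddef.h>

typedef struct task {
    struct task *next;   /* circular scheduler list */
    int          is_user;
} task_t;

typedef struct {
    task_t   task;       /* offset 0 -- enables the (process_t *) cast */
    uint32_t pid;
} process_t;

/* Walk the circular list starting at `current`; the offset-0 embedding
 * makes the task-to-process cast valid whenever is_user is set. */
process_t *find_by_pid(task_t *current, uint32_t pid) {
    task_t *t = current;
    do {
        if (t->is_user && ((process_t *)t)->pid == pid)
            return (process_t *)t;
        t = t->next;
    } while (t != current);
    return NULL;
}
```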

User-Mode Entry

x86-64

proc_enter_user is a bare iretq label in syscall_entry.asm. The initial kernel stack frame constructed by proc_spawn places the entry point, user CS/SS, RFLAGS, and user RSP in the iretq frame. Before iretq, the trampoline pops pml4_phys from the stack and loads it into CR3 to switch to the user address space.

ARM64

proc_enter_user is an ERET trampoline in proc_enter.S. It loads TTBR0 (user page table base), sets SP_EL0, ELR_EL1 (return address), and SPSR_EL1, then issues ERET to enter EL0.