Processes & ELF Loading

Aegis processes are represented by aegis_process_t, an extension of the scheduler’s aegis_task_t. The kernel supports the full POSIX process lifecycle: fork, clone (threads), execve, waitpid, and exit. ELF64 binaries are loaded from either the initrd or the ext2 filesystem, with support for dynamically linked executables via a PT_INTERP interpreter.

v1 maturity note: Aegis is v1 software – the first version deemed ready for public release, not a mature or production-hardened system. The process subsystem is written in C and has not been subjected to adversarial testing. There are likely exploitable vulnerabilities in the ELF loader, fork/clone paths, and address space management, as would be expected in any from-scratch OS at this stage. Security audit findings to date have identified hypothetical threats, but real exploitable bugs almost certainly exist. A gradual migration from C to Rust is planned starting with the kernel; the capability system (kernel/cap/) is already in Rust. Contributions are welcome – file issues or propose changes at exec/aegis.

Process Control Block

aegis_process_t is defined in kernel/proc/proc.h. Its task field must be at offset 0 – the scheduler stores all tasks as aegis_task_t * and casts to aegis_process_t * when task.is_user == 1.

typedef struct aegis_process {
    aegis_task_t  task;                    /* offset 0 -- scheduler casts here */
    uint64_t      pml4_phys;              /* physical address of process PML4 */
    fd_table_t   *fd_table;               /* shared, refcounted fd table */
    cap_slot_t    caps[CAP_TABLE_SIZE];   /* capability table (64 slots) */
    uint32_t      authenticated;          /* 1 if session passed login auth */
    uint64_t      brk;                    /* current heap limit (user VA) */
    uint64_t      brk_base;               /* initial brk (ELF end); shrink floor */
    uint64_t      mmap_base;              /* next anonymous mmap VA; bump allocator */
    mmap_free_t   mmap_free[64];          /* VA freelist for munmap->mmap reuse */
    uint32_t      mmap_free_count;
    spinlock_t    mmap_free_lock;         /* guards mmap_free[] for CLONE_VM threads */
    vma_entry_t  *vma_table;              /* per-process VMA tracking */
    uint32_t      vma_count;
    uint32_t      vma_capacity;           /* max entries (170 per kva page) */
    uint32_t      vma_refcount;           /* 1 = sole owner; >1 = shared (CLONE_VM) */
    char          exe_path[256];          /* binary path, set at execve */
    uint32_t      pid;                    /* unique process ID; 1 = init */
    uint32_t      tgid;                   /* thread group ID (= leader PID) */
    uint32_t      thread_count;           /* live threads in this group */
    uint32_t      ppid;                   /* parent PID; 0 = no parent */
    uint32_t      uid, gid;              /* user/group ID; 0 = root */
    uint32_t      pgid;                   /* process group ID */
    uint32_t      sid;                    /* session ID */
    uint32_t      umask;                  /* file creation mask; default 022 */
    uint32_t      stop_signum;            /* signal that caused TASK_STOPPED */
    char          cwd[256];               /* current working directory */
    uint64_t      exit_status;            /* lower 8 bits = exit code */
    uint64_t      pending_signals;        /* bitmask; bit N = signal N pending */
    uint64_t      signal_mask;            /* blocked signals */
    k_sigaction_t sigactions[64];         /* per-signal handler/mask/flags */
} aegis_process_t;
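
The offset-0 requirement on the task field can be enforced at compile time. The following is a minimal sketch, not the kernel's actual code: demo_task_t, demo_process_t, and task_to_process are hypothetical stand-ins for the real types and whatever helper the scheduler uses.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, cut-down stand-ins for aegis_task_t / aegis_process_t. */
typedef struct demo_task {
    uint32_t is_user;            /* 1 for user processes, 0 for kernel tasks */
} demo_task_t;

typedef struct demo_process {
    demo_task_t task;            /* must stay the first member */
    uint32_t    pid;
} demo_process_t;

/* Fails the build if `task` is ever moved away from offset 0. */
static_assert(offsetof(demo_process_t, task) == 0,
              "task must be the first member");

/* Downcast for code that only holds a task pointer.
 * Returns NULL for kernel tasks, which have no enclosing process. */
static demo_process_t *task_to_process(demo_task_t *t)
{
    if (t == NULL || t->is_user != 1)
        return NULL;
    return (demo_process_t *)t;  /* valid only because the offset is 0 */
}
```

A compile-time check of this kind turns an accidental field reordering into a build error instead of a silent miscast at run time.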

Key Design Points

PID allocation is monotonically increasing under a spinlock (proc_alloc_pid). PID 1 is always init.
Thread group: tgid equals the leader’s PID. thread_count is tracked on the leader.
fd table is reference-counted and shared across CLONE_FILES threads. See VFS Layer for details.
Capability table: 64 slots of (capability kind, rights bitfield) pairs. Exec is a capability boundary – capabilities are reset to the baseline capabilities and then augmented by policy capabilities from /etc/aegis/caps.d/. See capability model.
VMA table: Dynamically allocated, reference-counted, supports sharing (CLONE_VM) and deep copy (fork).
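
Monotonic PID allocation under a lock could look roughly like the sketch below. It assumes a C11 atomic_flag as a stand-in for the kernel spinlock; demo_alloc_pid is a hypothetical name, and the body of the real proc_alloc_pid is not shown in this document.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

static atomic_flag pid_lock = ATOMIC_FLAG_INIT; /* stand-in for a kernel spinlock */
static uint32_t    next_pid = 1;                /* the first caller (init) gets PID 1 */

/* Hand out strictly increasing PIDs; never reuses a value. */
static uint32_t demo_alloc_pid(void)
{
    while (atomic_flag_test_and_set_explicit(&pid_lock, memory_order_acquire))
        ;                                       /* spin until the lock is free */
    uint32_t pid = next_pid++;
    atomic_flag_clear_explicit(&pid_lock, memory_order_release);

    assert(pid != 0);   /* this sketch does not handle 32-bit wraparound */
    return pid;
}
```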

User Address Space Layout

0x0000000000000000  +-----------------------+
                    | ELF text/data segments|  (loaded from binary)
                    +-----------------------+
                    | [heap: brk_base..brk] |  (grows up via sys_brk)
                    +-----------------------+
                    |         ...           |
0x0000000040000000  | Interpreter (ld.so)   |  (INTERP_BASE = 1 GB, if PT_INTERP)
                    +-----------------------+
                    |         ...           |
0x000007FFFFFFB000  | User stack (16 KB)    |  (4 pages, grows down)
0x000007FFFFFFF000  | (stack top)           |
                    +-----------------------+
                    |         ...           |
0x0000700000000000  | mmap region           |  (grows up, bump allocator)
                    +-----------------------+
                    |         ...           |
0xFFFF800000000000  | Kernel space          |  (shared via pd_hi)

Region	Address	Size
ELF segments	Defined by ELF `p_vaddr`	Variable
Heap	`brk_base` to `brk`	Grows via `sys_brk`
mmap arena	`0x700000000000` upward	Bump allocator
Interpreter	`0x40000000` (INTERP_BASE)	If dynamically linked
User stack	`0x07FFFFFFB000` - `0x07FFFFFFF000`	16 KB (4 pages)

VMA Tracking

Each process maintains a sorted array of Virtual Memory Area descriptors:

typedef struct {
    uint64_t base;
    uint64_t len;
    uint32_t prot;    /* PROT_READ | PROT_WRITE | PROT_EXEC */
    uint8_t  type;    /* VMA_* constant */
} vma_entry_t;  /* 24 bytes, 170 per kva page */

VMA Type	Value	Description
`VMA_ELF_TEXT`	1	PT_LOAD with PROT_EXEC
`VMA_ELF_DATA`	2	PT_LOAD without PROT_EXEC
`VMA_HEAP`	3	brk region
`VMA_STACK`	4	User stack
`VMA_MMAP`	5	Anonymous mmap
`VMA_THREAD_STACK`	6	Thread stack via pthread_create
`VMA_GUARD`	7	Guard page (PROT_NONE)
`VMA_SHARED`	8	MAP_SHARED mapping (memfd)

Operations: vma_insert (merges adjacent entries), vma_remove (splits at boundaries), vma_clone (deep copy for fork), vma_share (refcount increment for CLONE_VM).
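
As an illustration of the merge behavior of vma_insert, here is a sketch of a sorted insert that coalesces adjacent entries. The rule that neighbours merge only when prot and type match is an assumption, as are the demo_vma_table_t wrapper and the demo_vma_insert name; overlap checking is omitted.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

typedef struct {
    uint64_t base;
    uint64_t len;
    uint32_t prot;
    uint8_t  type;
} vma_entry_t;

#define DEMO_VMA_CAP 170u                     /* entries per KVA page */

typedef struct {                              /* hypothetical fixed-size table */
    vma_entry_t e[DEMO_VMA_CAP];
    uint32_t    count;
} demo_vma_table_t;

/* True if `lo` ends exactly where `hi` begins and attributes match. */
static int can_merge(const vma_entry_t *lo, const vma_entry_t *hi)
{
    return lo->base + lo->len == hi->base &&
           lo->prot == hi->prot && lo->type == hi->type;
}

/* Insert [base, base+len), keeping the array sorted by base.
 * Returns 0 on success, -1 if the table is full. */
static int demo_vma_insert(demo_vma_table_t *t, uint64_t base, uint64_t len,
                           uint32_t prot, uint8_t type)
{
    assert(t != NULL && len > 0);
    vma_entry_t n = { base, len, prot, type };
    uint32_t i = 0;
    while (i < t->count && t->e[i].base < base)
        i++;                                  /* i = insertion index */

    if (i > 0 && can_merge(&t->e[i - 1], &n)) {
        t->e[i - 1].len += len;               /* grow the left neighbour */
        if (i < t->count && can_merge(&t->e[i - 1], &t->e[i])) {
            t->e[i - 1].len += t->e[i].len;   /* gap bridged: absorb the right one */
            memmove(&t->e[i], &t->e[i + 1],
                    (t->count - i - 1) * sizeof t->e[0]);
            t->count--;
        }
        return 0;
    }
    if (i < t->count && can_merge(&n, &t->e[i])) {
        t->e[i].base = base;                  /* grow the right neighbour downward */
        t->e[i].len += len;
        return 0;
    }
    if (t->count == DEMO_VMA_CAP)
        return -1;
    memmove(&t->e[i + 1], &t->e[i], (t->count - i) * sizeof t->e[0]);
    t->e[i] = n;
    t->count++;
    return 0;
}
```

Merging keeps the table small, which matters when the capacity is a single page of 170 entries.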

ELF Loading

ELF loading is implemented in kernel/proc/elf.c. The loader handles ELF64 executables (ET_EXEC) and position-independent executables (ET_DYN).

Note that the ELF parser is v1 C code that parses untrusted binary input. While basic sanity checks exist (magic validation, segment size caps, overflow guards), a crafted ELF binary could likely exploit parsing bugs to achieve kernel code execution. This is a primary target for the planned Rust migration.

`elf_load()`

int elf_load(uint64_t pml4_phys, const uint8_t *data,
             size_t len, uint64_t base, elf_load_result_t *out);

Parameters:

pml4_phys – physical address of the target process’s PML4
data – pointer to the ELF binary in kernel memory
len – size of the ELF binary in bytes
base – added to all virtual addresses (0 for ET_EXEC, INTERP_BASE for interpreter)
out – result structure filled on success

Result structure:

typedef struct {
    uint64_t entry;       /* entry point VA */
    uint64_t brk;         /* first byte after last segment (page-aligned) */
    uint64_t phdr_va;     /* VA of program header table in loaded image */
    uint32_t phdr_count;  /* number of program headers */
    uint64_t base;        /* base address used for loading */
    char     interp[256]; /* PT_INTERP path (if present) */
} elf_load_result_t;

Loading Process

Validate ELF magic (\x7FELF) and check for ELF64, correct architecture (EM_X86_64 or EM_AARCH64)
Scan for PT_INTERP – extract interpreter path if present (max 255 bytes)
For each PT_LOAD segment:
- Page-align virtual base downward; compute sub-page offset
- Guard against integer overflow (4 GB per-segment cap)
- Allocate KVA pages for the segment
- Zero bytes before p_vaddr within the first page
- Copy file bytes (p_filesz) at the correct sub-page offset
- Zero BSS (bytes from p_filesz to p_memsz)
- Map pages into user PML4 with appropriate flags (VMM_FLAG_USER, VMM_FLAG_WRITABLE if PF_W)
- Record VMA entry for the segment
Compute results: entry = e_entry + base, brk = page-aligned end of last segment, phdr_va = first PT_LOAD vaddr + base + e_phoff

Dynamic Linking Support

If the main binary has a PT_INTERP segment, proc_spawn loads the interpreter (typically ld-musl-x86_64.so.1) at INTERP_BASE (0x40000000). The entry point becomes the interpreter’s entry rather than the binary’s. The auxiliary vector provides the information the interpreter needs to find and relocate the main binary.

Process Creation

`proc_spawn()` – Initial Process

Called from kernel_main to create the init process before sched_start():

Allocates PCB (2 KVA pages – the PCB exceeds 4 KB with capability and signal tables)
Allocates 16 KB kernel stack (4 KVA pages)
Creates per-process page tables via vmm_create_user_pml4
Loads ELF via elf_load, including interpreter if PT_INTERP present
Maps 4-page user stack at 0x07FFFFFFB000 - 0x07FFFFFFF000
Builds SysV ABI initial stack (see below)
Constructs kernel stack frame chaining through ctx_switch -> proc_enter_user -> iretq/ERET
Grants 7 baseline capabilities
Pre-opens fd 0 (stdin/keyboard), fd 1 (stdout/console), fd 2 (stderr/console)
Initializes heap break to top of ELF segments, mmap base to 0x700000000000
Adds to scheduler via sched_add()

SysV ABI Initial Stack

The initial user stack follows the System V AMD64 ABI:

high addresses
  +------------------+
  | argv[0] string   |  "- init" (login shell prefix)
  +------------------+
  | AT_RANDOM data   |  16 random bytes
  +------------------+
  | (alignment pad)  |
  +------------------+
  | AT_NULL pair     |
  | AT_RANDOM / va   |
  | AT_ENTRY / entry |
  | AT_PAGESZ / 4096 |
  | AT_PHNUM / count |
  | AT_PHDR / va     |
  | AT_BASE / interp |  (if dynamically linked)
  | AT_PHENT / 56    |  (if dynamically linked)
  +------------------+
  | envp NULL        |
  | argv NULL        |
  | argv[0] pointer  |
  | argc = 1         |  <- RSP (16-byte aligned + 8)
  +------------------+
low addresses

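RSP is set so that RSP % 16 == 8 at _start, the same alignment a function sees immediately after a call instruction. Note that the System V AMD64 ABI itself specifies a 16-byte-aligned RSP (RSP % 16 == 0) at process entry; startup code that realigns the stack before its first call, as musl's _start does, works with either alignment.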

Auxiliary Vector Entries

Tag	Value	Description
`AT_PHDR` (3)	phdr_va	VA of program header table
`AT_PHNUM` (5)	phdr_count	Number of program headers
`AT_PAGESZ` (6)	4096	System page size
`AT_ENTRY` (9)	entry	Binary entry point (before interpreter redirect)
`AT_RANDOM` (25)	va	Pointer to 16 random bytes
`AT_BASE` (7)	INTERP_BASE	Interpreter load address (if PT_INTERP)
`AT_PHENT` (4)	56	Program header entry size (if PT_INTERP)
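
Assembling the auxiliary vector from the table above could look like the sketch below. The auxv_pair_t type, the build_auxv helper, and the convention that interp_base == 0 means "no PT_INTERP" are assumptions; the entries are emitted from low to high address in the order the stack diagram shows.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Tag values from the table above (standard ELF auxv numbering). */
enum { AT_NULL = 0, AT_PHDR = 3, AT_PHENT = 4, AT_PHNUM = 5, AT_PAGESZ = 6,
       AT_BASE = 7, AT_ENTRY = 9, AT_RANDOM = 25 };

typedef struct { uint64_t tag; uint64_t val; } auxv_pair_t;

/* Fill `out` (capacity `cap` pairs) and return the number of pairs written,
 * including the AT_NULL terminator, or 0 if `cap` is too small. */
static size_t build_auxv(auxv_pair_t *out, size_t cap,
                         uint64_t phdr_va, uint64_t phnum, uint64_t entry,
                         uint64_t random_va, uint64_t interp_base)
{
    assert(out != NULL);
    size_t need = interp_base ? 8 : 6;
    if (cap < need)
        return 0;

    size_t n = 0;
    if (interp_base) {                       /* only for dynamically linked binaries */
        out[n++] = (auxv_pair_t){ AT_PHENT, 56 };  /* sizeof(Elf64_Phdr) */
        out[n++] = (auxv_pair_t){ AT_BASE,  interp_base };
    }
    out[n++] = (auxv_pair_t){ AT_PHDR,   phdr_va };
    out[n++] = (auxv_pair_t){ AT_PHNUM,  phnum };
    out[n++] = (auxv_pair_t){ AT_PAGESZ, 4096 };
    out[n++] = (auxv_pair_t){ AT_ENTRY,  entry };
    out[n++] = (auxv_pair_t){ AT_RANDOM, random_va };
    out[n++] = (auxv_pair_t){ AT_NULL,   0 };      /* terminator */
    return n;
}
```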

`sys_fork()` – Syscall 57

Duplicates the calling process with a full deep copy of the address space:

Allocates child PCB (2 KVA pages)
Copies fd table via fd_table_copy (new table, increments driver refs)
Copies capability table and authenticated flag
Copies all scalar fields (brk, mmap_base, cwd, uid, gid, pgid, sid, umask)
Deep-copies VMA table via vma_clone
Creates new PML4 and deep-copies all user pages via vmm_copy_user_pml4
Allocates 16 KB child kernel stack
Builds initial kernel stack frame so first schedule returns via isr_post_dispatch -> iretq to user space with rax = 0
Signal state: inherits mask and dispositions, clears pending (Linux semantics)
Adds child to scheduler run queue
Returns child PID to parent

Process limit: Total processes capped at MAX_PROCESSES (64). Exceeding returns -EAGAIN.

Fork bomb protection: The s_fork_count counter is checked before allocating any resources. Note that this is a basic v1 safeguard – a hard global cap, not a per-user or cgroup-style limit. It prevents trivial fork bombs but does not constitute a robust resource isolation mechanism.

`sys_clone()` – Syscall 56

Creates a new thread (CLONE_VM set) or delegates to sys_fork (no CLONE_VM):

Clone Flags (Linux ABI)

Flag	Value	Effect
`CLONE_VM`	0x100	Share address space (same PML4)
`CLONE_FS`	0x200	Share filesystem info
`CLONE_FILES`	0x400	Share fd table (refcount increment)
`CLONE_SIGHAND`	0x800	Share signal handlers
`CLONE_THREAD`	0x10000	Same thread group (tgid)
`CLONE_SETTLS`	0x80000	Set TLS pointer for child
`CLONE_PARENT_SETTID`	0x100000	Write child TID to parent’s `*ptid`
`CLONE_CHILD_CLEARTID`	0x200000	Clear TID + futex wake on exit
`CLONE_CHILD_SETTID`	0x1000000	Write child TID to child’s `*ctid`
`CLONE_VFORK`	0x4000	Block parent until child exits/execs

Thread Creation Flow

Validate TLS and CLEARTID pointers (reject kernel addresses)
Check CAP_KIND_THREAD_CREATE capability
Check process limit
Allocate child PCB (2 KVA pages)
Share address space: child->pml4_phys = parent->pml4_phys
Share or copy fd table based on CLONE_FILES
Copy capabilities and scalar fields
Share VMA table via vma_share (refcount increment)
Set up thread group membership (CLONE_THREAD -> same tgid)
Configure TLS (CLONE_SETTLS), clear_child_tid (CLONE_CHILD_CLEARTID)
Allocate 16 KB kernel stack
Build initial kernel stack frame (same layout as fork)
Add to scheduler, handle CLONE_PARENT_SETTID / CLONE_CHILD_SETTID
If CLONE_VFORK: block parent until child exits or execs
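
The first decisions in this flow, pointer validation and the thread-versus-fork split, can be sketched as below. The names clone_classify and is_user_ptr are hypothetical, the ordering of validation before the CLONE_VM test is an assumption, and the pointer check is deliberately simplistic (it only rejects addresses at or above the kernel base).

```c
#include <stdint.h>

/* Flag values from the table above (Linux ABI). */
#define CLONE_VM             0x100u
#define CLONE_SETTLS         0x80000u
#define CLONE_CHILD_CLEARTID 0x200000u

#define DEMO_KERNEL_BASE 0xFFFF800000000000ull  /* start of kernel space */

typedef enum {
    CLONE_PATH_FORK,     /* no CLONE_VM: delegate to the fork path */
    CLONE_PATH_THREAD,   /* CLONE_VM: create a thread sharing the PML4 */
    CLONE_PATH_EINVAL    /* a user-supplied pointer targets kernel space */
} clone_path_t;

static int is_user_ptr(uint64_t va)
{
    return va < DEMO_KERNEL_BASE;
}

static clone_path_t clone_classify(uint64_t flags, uint64_t tls, uint64_t ctid)
{
    /* Pointers are only meaningful when their flag is set. */
    if ((flags & CLONE_SETTLS) && !is_user_ptr(tls))
        return CLONE_PATH_EINVAL;
    if ((flags & CLONE_CHILD_CLEARTID) && !is_user_ptr(ctid))
        return CLONE_PATH_EINVAL;

    return (flags & CLONE_VM) ? CLONE_PATH_THREAD : CLONE_PATH_FORK;
}
```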

Error Cleanup

If kernel stack allocation fails after partial setup, sys_clone rolls back the work already done: it decrements thread_count, calls vma_free (which drops the VMA table refcount), unrefs the fd table, and frees the PCB.
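
A common C idiom for this kind of rollback is a single cleanup label that undoes the earlier steps. The sketch below uses hypothetical types (demo_group_t, demo_thread_t) with plain counters in place of the real refcounted tables, and passing kstack_bytes == 0 is just a hook to force the failure path.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

typedef struct {
    uint32_t thread_count;   /* live threads in the group */
    uint32_t fd_refs;        /* fd table refcount */
    uint32_t vma_refs;       /* VMA table refcount */
} demo_group_t;

typedef struct { void *kstack; } demo_thread_t;   /* stand-in for the PCB */

/* Returns the new thread, or NULL with *g left exactly as it was. */
static demo_thread_t *demo_clone_thread(demo_group_t *g, size_t kstack_bytes)
{
    assert(g != NULL);
    demo_thread_t *t = calloc(1, sizeof *t);
    if (t == NULL)
        return NULL;                 /* nothing to undo yet */

    g->fd_refs++;                    /* share fd table */
    g->vma_refs++;                   /* share VMA table */
    g->thread_count++;               /* join the thread group */

    t->kstack = kstack_bytes ? malloc(kstack_bytes) : NULL;
    if (t->kstack == NULL)
        goto undo;
    return t;

undo:
    g->thread_count--;
    g->vma_refs--;
    g->fd_refs--;
    free(t);
    return NULL;
}
```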

`sys_execve()` – Syscall 59

Replaces the calling process image in place:

Copy path and argv from user space (max 64 args, each max 255 bytes). The argv buffer is allocated from KVA because its 16 KB working area (64 slots of 256 bytes) is too large for the kernel stack.
Binary lookup: initrd first (trusted, no permission check), then ext2 with X_OK DAC permission check
Load binary into KVA for ext2-backed files
Point of no return: vmm_free_user_pages destroys the old image. After this, any failure is fatal (matches Linux behavior after flush_old_exec).
Reset state: brk, mmap_base, mmap freelist, VMA table, fs_base, exe_path
Reset capabilities: Clear capability table to the 6 baseline capabilities, then apply policy capabilities from /etc/aegis/caps.d/ based on binary name and authentication state. Exec is a capability boundary – the previous process’s capabilities do not propagate.
Load new ELF via elf_load, including interpreter if PT_INTERP present
Allocate fresh user stack (4 pages at 0x07FFFFFFB000)
Build SysV ABI initial stack with argc, argv pointers, envp, and auxiliary vector
Redirect SYSRET: modify the syscall frame to point RIP to the new entry and RSP to the new stack

`sys_exit()` / `sys_exit_group()` – Syscalls 60 / 231

`sys_exit`

Store exit code in proc->exit_status (lower 8 bits)
Handle clear_child_tid: write 0 and futex wake (for pthread_join)
Session leader exit: send SIGHUP + SIGCONT to foreground process group, disassociate terminal
PID 1 exit triggers system halt
Reparent orphan children to init (PID 1)
Call sched_exit() (never returns)

`sys_exit_group`

Same as sys_exit but also kills all other threads in the same thread group (tgid), performing clear_child_tid + futex wake for each killed thread.

`sys_waitpid()` – Syscall 61

Waits for a child process to change state:

pid > 0: wait for specific child
pid == -1 or pid == 0: wait for any child
WNOHANG flag: return immediately if no child has exited
WUNTRACED flag: also report stopped children

Walks the scheduler’s circular task list looking for zombie or stopped children. On finding a match:

Writes wstatus to user memory (exit code shifted left by 8)
For zombies: removes from task list, frees PCB, kernel stack, user page tables, decrements fork count
For stopped children: reports stop signal without reaping

If no matching child has exited and WNOHANG is not set, the parent blocks via sched_block() until woken by SIGCHLD from a child’s exit.
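
The wstatus value written to user memory can be illustrated with two small helpers. The exited encoding follows the description above; the stopped encoding (signal in bits 8..15, low byte 0x7F) is the Linux convention and is an assumption about what Aegis writes. Both function names are hypothetical.

```c
#include <stdint.h>

/* Exited child: exit code in bits 8..15, low byte zero. */
static uint32_t wstatus_exited(uint64_t exit_status)
{
    return (uint32_t)(exit_status & 0xFF) << 8;
}

/* Stopped child (Linux convention, assumed here):
 * stop signal in bits 8..15, low byte 0x7F. */
static uint32_t wstatus_stopped(uint32_t stop_signum)
{
    return ((stop_signum & 0xFF) << 8) | 0x7F;
}
```

User space decodes these values with the standard WIFEXITED / WEXITSTATUS and WIFSTOPPED / WSTOPSIG macros from sys/wait.h.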

Process Lookup

aegis_process_t *proc_find_by_pid(uint32_t pid);

Walks the scheduler’s circular task list under sched_lock. Checks the current task first, then iterates. Returns NULL if no matching user process found. Used by sys_kill, sys_waitpid, sys_cap_query, and other syscalls that operate on a target PID.
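
A lookup over a circular task list might be structured as in the sketch below. The list layout (a singly linked ring through a next pointer), the demo_ types, and demo_find_by_pid are assumptions for illustration, and the sched_lock acquisition is omitted.

```c
#include <stddef.h>
#include <stdint.h>

typedef struct demo_task {
    struct demo_task *next;      /* circular singly linked list */
    uint32_t          is_user;   /* 1 if embedded in a demo_process_t */
} demo_task_t;

typedef struct {
    demo_task_t task;            /* offset 0, mirroring aegis_process_t */
    uint32_t    pid;
} demo_process_t;

/* Walk the ring exactly once, starting with `current` itself.
 * Kernel tasks (is_user == 0) are skipped, never cast. */
static demo_process_t *demo_find_by_pid(demo_task_t *current, uint32_t pid)
{
    if (current == NULL)
        return NULL;

    demo_task_t *t = current;
    do {
        if (t->is_user == 1) {
            demo_process_t *p = (demo_process_t *)t;
            if (p->pid == pid)
                return p;
        }
        t = t->next;
    } while (t != current);      /* back at the start: not found */

    return NULL;
}
```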

User-Mode Entry

x86-64

proc_enter_user is a bare iretq label in syscall_entry.asm. The initial kernel stack frame constructed by proc_spawn places the entry point, user CS/SS, RFLAGS, and user RSP in the iretq frame. Before iretq, the trampoline pops pml4_phys from the stack and loads it into CR3 to switch to the user address space.

ARM64

proc_enter_user is an ERET trampoline in proc_enter.S. It loads TTBR0 (user page table base), sets SP_EL0, ELR_EL1 (return address), and SPSR_EL1, then issues ERET to enter EL0.

Related Documentation

Scheduler – task states, context switching, run queue
Syscall Interface – complete syscall table and dispatch
Memory Management – page tables, KVA allocation, VMM
Capability model – capability table and capability kinds
Security policy engine – policy capabilities from /etc/aegis/caps.d/
