Processes & ELF Loading
Process control blocks, ELF64 binary loading, fork/clone/execve lifecycle, and user-mode entry
Aegis processes are represented by aegis_process_t, an extension of the scheduler’s aegis_task_t. The kernel supports the full POSIX process lifecycle: fork, clone (threads), execve, waitpid, and exit. ELF64 binaries are loaded from either the initrd or the ext2 filesystem, with support for dynamically linked executables via a PT_INTERP interpreter.
v1 maturity note: Aegis is v1 software – the first version deemed ready for public release, not a mature or production-hardened system. The process subsystem is written in C and has not been subjected to adversarial testing. There are likely exploitable vulnerabilities in the ELF loader, fork/clone paths, and address space management, as would be expected in any from-scratch OS at this stage. Security audit findings to date have identified hypothetical threats, but real exploitable bugs almost certainly exist. A gradual migration from C to Rust is planned starting with the kernel; the capability system (kernel/cap/) is already in Rust. Contributions are welcome – file issues or propose changes at exec/aegis.
Process Control Block
aegis_process_t is defined in kernel/proc/proc.h. Its task field must be at offset 0 – the scheduler stores all tasks as aegis_task_t * and casts to aegis_process_t * when task.is_user == 1.
typedef struct aegis_process {
    aegis_task_t task; /* offset 0 -- scheduler casts here */
    uint64_t pml4_phys; /* physical address of process PML4 */
    fd_table_t *fd_table; /* shared, refcounted fd table */
    cap_slot_t caps[CAP_TABLE_SIZE]; /* capability table (64 slots) */
    uint32_t authenticated; /* 1 if session passed login auth */
    uint64_t brk; /* current heap limit (user VA) */
    uint64_t brk_base; /* initial brk (ELF end); shrink floor */
    uint64_t mmap_base; /* next anonymous mmap VA; bump allocator */
    mmap_free_t mmap_free[64]; /* VA freelist for munmap->mmap reuse */
    uint32_t mmap_free_count;
    spinlock_t mmap_free_lock; /* guards mmap_free[] for CLONE_VM threads */
    vma_entry_t *vma_table; /* per-process VMA tracking */
    uint32_t vma_count;
    uint32_t vma_capacity; /* max entries (170 per kva page) */
    uint32_t vma_refcount; /* 1 = sole owner; >1 = shared (CLONE_VM) */
    char exe_path[256]; /* binary path, set at execve */
    uint32_t pid; /* unique process ID; 1 = init */
    uint32_t tgid; /* thread group ID (= leader PID) */
    uint32_t thread_count; /* live threads in this group */
    uint32_t ppid; /* parent PID; 0 = no parent */
    uint32_t uid, gid; /* user/group ID; 0 = root */
    uint32_t pgid; /* process group ID */
    uint32_t sid; /* session ID */
    uint32_t umask; /* file creation mask; default 022 */
    uint32_t stop_signum; /* signal that caused TASK_STOPPED */
    char cwd[256]; /* current working directory */
    uint64_t exit_status; /* lower 8 bits = exit code */
    uint64_t pending_signals; /* bitmask; bit N = signal N pending */
    uint64_t signal_mask; /* blocked signals */
    k_sigaction_t sigactions[64]; /* per-signal handler/mask/flags */
} aegis_process_t;
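The offset-0 requirement can be enforced at compile time. The following is a minimal sketch using trimmed-down stand-in types; demo_task_t, demo_process_t, and demo_task_to_process are hypothetical names for illustration, not the definitions in kernel/proc/proc.h.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, trimmed-down stand-ins for aegis_task_t / aegis_process_t. */
typedef struct demo_task {
    uint8_t is_user;            /* 1 = task is embedded in a process */
} demo_task_t;

typedef struct demo_process {
    demo_task_t task;           /* must stay the first member */
    uint32_t    pid;
} demo_process_t;

/* Compile-time guard: the downcast below is only valid at offset 0. */
static_assert(offsetof(demo_process_t, task) == 0,
              "task must be the first member");

/* Returns the enclosing process, or NULL for kernel-only tasks. */
static demo_process_t *demo_task_to_process(demo_task_t *t)
{
    if (t == NULL || t->is_user != 1)
        return NULL;
    return (demo_process_t *)t;
}
```

With a guard like this, moving the task member breaks the build rather than silently corrupting the cast.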
Key Design Points
- PID allocation is monotonically increasing under a spinlock (proc_alloc_pid). PID 1 is always init.
- Thread group: tgid equals the leader’s PID. thread_count is tracked on the leader.
- fd table is reference-counted and shared across CLONE_FILES threads. See VFS Layer for details.
- Capability table: 64 slots of (capability kind, rights bitfield) pairs. Exec is a capability boundary – capabilities are reset to the baseline capabilities and then augmented by policy capabilities from /etc/aegis/caps.d/. See capability model.
- VMA table: Dynamically allocated, reference-counted, supports sharing (CLONE_VM) and deep copy (fork).
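As an illustration of the PID allocation point, here is a minimal sketch of a monotonic counter guarded by a test-and-set spinlock. It uses C11 atomics in place of the kernel's spinlock_t, and demo_alloc_pid, demo_pid_lock, and demo_next_pid are hypothetical names; the real proc_alloc_pid may differ.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical stand-ins: a test-and-set spinlock and the PID counter. */
static atomic_flag demo_pid_lock = ATOMIC_FLAG_INIT;
static uint32_t    demo_next_pid = 1;   /* first caller (init) gets PID 1 */

static uint32_t demo_alloc_pid(void)
{
    /* Spin until the flag was previously clear, i.e. we own the lock. */
    while (atomic_flag_test_and_set_explicit(&demo_pid_lock,
                                             memory_order_acquire))
        ;
    uint32_t pid = demo_next_pid++;     /* monotonic: no reuse in this sketch */
    assert(pid != 0);                   /* the sketch ignores 32-bit wraparound */
    atomic_flag_clear_explicit(&demo_pid_lock, memory_order_release);
    return pid;
}
```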
User Address Space Layout
0x0000000000000000 +-----------------------+
                   | ELF text/data segments| (loaded from binary)
                   +-----------------------+
                   | [heap: brk_base..brk] | (grows up via sys_brk)
                   +-----------------------+
                   |          ...          |
0x0000000040000000 | Interpreter (ld.so)   | (INTERP_BASE = 1 GB, if PT_INTERP)
                   +-----------------------+
                   |          ...          |
0x000007FFFFFFB000 | User stack (16 KB)    | (4 pages, grows down)
0x000007FFFFFFF000 | (stack top)           |
                   +-----------------------+
                   |          ...          |
0x0000700000000000 | mmap region           | (grows up, bump allocator)
                   +-----------------------+
                   |          ...          |
0xFFFF800000000000 | Kernel space          | (shared via pd_hi)
| Region | Address | Size / notes |
|---|---|---|
| ELF segments | Defined by ELF p_vaddr | Variable |
| Heap | brk_base to brk | Grows via sys_brk |
| mmap arena | 0x700000000000 upward | Bump allocator |
| Interpreter | 0x40000000 (INTERP_BASE) | If dynamically linked |
| User stack | 0x07FFFFFFB000 - 0x07FFFFFFF000 | 16 KB (4 pages) |
VMA Tracking
Each process maintains a sorted array of Virtual Memory Area descriptors:
typedef struct {
    uint64_t base;
    uint64_t len;
    uint32_t prot; /* PROT_READ | PROT_WRITE | PROT_EXEC */
    uint8_t type; /* VMA_* constant */
} vma_entry_t; /* 24 bytes, 170 per kva page */
| VMA Type | Value | Description |
|---|---|---|
| VMA_ELF_TEXT | 1 | PT_LOAD with PROT_EXEC |
| VMA_ELF_DATA | 2 | PT_LOAD without PROT_EXEC |
| VMA_HEAP | 3 | brk region |
| VMA_STACK | 4 | User stack |
| VMA_MMAP | 5 | Anonymous mmap |
| VMA_THREAD_STACK | 6 | Thread stack via pthread_create |
| VMA_GUARD | 7 | Guard page (PROT_NONE) |
| VMA_SHARED | 8 | MAP_SHARED mapping (memfd) |
Operations: vma_insert (merges adjacent entries), vma_remove (splits at boundaries), vma_clone (deep copy for fork), vma_share (refcount increment for CLONE_VM).
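To make the merge behaviour of vma_insert concrete, the sketch below inserts into a sorted, fixed-size array and coalesces with the neighbouring entries when they are contiguous and have the same prot and type. The merge criteria, the fixed capacity, and all demo_* names are assumptions for illustration; overlap detection is omitted.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define DEMO_VMA_MAX 170   /* one KVA page worth of 24-byte entries */

typedef struct {
    uint64_t base;
    uint64_t len;
    uint32_t prot;
    uint8_t  type;
} demo_vma_t;

typedef struct {
    demo_vma_t v[DEMO_VMA_MAX];
    uint32_t   count;
} demo_vma_table_t;

/* Two entries merge if lo ends exactly where hi starts and attributes match. */
static int demo_can_merge(const demo_vma_t *lo, const demo_vma_t *hi)
{
    return lo->base + lo->len == hi->base &&
           lo->prot == hi->prot && lo->type == hi->type;
}

/* Inserts [base, base+len), keeping v[] sorted by base.
 * Returns 0 on success, -1 if the table is full. */
static int demo_vma_insert(demo_vma_table_t *t, uint64_t base, uint64_t len,
                           uint32_t prot, uint8_t type)
{
    demo_vma_t n = { base, len, prot, type };
    uint32_t i = 0;

    assert(t != NULL && len != 0);
    while (i < t->count && t->v[i].base < base)
        i++;                                   /* i = insertion index */

    if (i > 0 && demo_can_merge(&t->v[i - 1], &n)) {
        t->v[i - 1].len += len;                /* grow the predecessor */
        /* The grown entry may now touch its successor as well. */
        if (i < t->count && demo_can_merge(&t->v[i - 1], &t->v[i])) {
            t->v[i - 1].len += t->v[i].len;
            memmove(&t->v[i], &t->v[i + 1],
                    (t->count - i - 1) * sizeof(t->v[0]));
            t->count--;
        }
        return 0;
    }
    if (i < t->count && demo_can_merge(&n, &t->v[i])) {
        t->v[i].base = base;                   /* grow the successor downward */
        t->v[i].len += len;
        return 0;
    }
    if (t->count == DEMO_VMA_MAX)
        return -1;
    memmove(&t->v[i + 1], &t->v[i], (t->count - i) * sizeof(t->v[0]));
    t->v[i] = n;
    t->count++;
    return 0;
}
```

Merging keeps the entry count low, which matters when the whole table has to fit in a single page.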
ELF Loading
ELF loading is implemented in kernel/proc/elf.c. The loader handles ELF64 executables (ET_EXEC) and position-independent executables (ET_DYN).
Note that the ELF parser is v1 C code that parses untrusted binary input. While basic sanity checks exist (magic validation, segment size caps, overflow guards), a crafted ELF binary could likely exploit parsing bugs to achieve kernel code execution. This is a primary target for the planned Rust migration.
elf_load()
int elf_load(uint64_t pml4_phys, const uint8_t *data,
size_t len, uint64_t base, elf_load_result_t *out);
Parameters:
- pml4_phys – physical address of the target process’s PML4
- data – pointer to the ELF binary in kernel memory
- len – size of the ELF image at data, in bytes
- base – added to all virtual addresses (0 for ET_EXEC, INTERP_BASE for interpreter)
- out – result structure filled on success
Result structure:
typedef struct {
    uint64_t entry; /* entry point VA */
    uint64_t brk; /* first byte after last segment (page-aligned) */
    uint64_t phdr_va; /* VA of program header table in loaded image */
    uint32_t phdr_count; /* number of program headers */
    uint64_t base; /* base address used for loading */
    char interp[256]; /* PT_INTERP path (if present) */
} elf_load_result_t;
Loading Process
- Validate ELF magic (\x7FELF) and check for ELF64, correct architecture (EM_X86_64 or EM_AARCH64)
- Scan for PT_INTERP – extract interpreter path if present (max 255 bytes)
- For each PT_LOAD segment:
  - Page-align virtual base downward; compute sub-page offset
  - Guard against integer overflow (4 GB per-segment cap)
  - Allocate KVA pages for the segment
  - Zero bytes before p_vaddr within the first page
  - Copy file bytes (p_filesz) at the correct sub-page offset
  - Zero BSS (bytes from p_filesz to p_memsz)
  - Map pages into user PML4 with appropriate flags (VMM_FLAG_USER, VMM_FLAG_WRITABLE if PF_W)
  - Record VMA entry for the segment
- Compute results: entry = e_entry + base, brk = page-aligned end of last segment, phdr_va = first PT_LOAD vaddr + base + e_phoff
Dynamic Linking Support
If the main binary has a PT_INTERP segment, proc_spawn loads the interpreter (typically ld-musl-x86_64.so.1) at INTERP_BASE (0x40000000). The entry point becomes the interpreter’s entry rather than the binary’s. The auxiliary vector provides the information the interpreter needs to find and relocate the main binary.
Process Creation
proc_spawn() – Initial Process
Called from kernel_main to create the init process before sched_start():
- Allocates PCB (2 KVA pages – the PCB exceeds 4 KB with capability and signal tables)
- Allocates 16 KB kernel stack (4 KVA pages)
- Creates per-process page tables via vmm_create_user_pml4
- Loads ELF via elf_load, including interpreter if PT_INTERP present
- Maps 4-page user stack at 0x07FFFFFFB000-0x07FFFFFFF000
- Builds SysV ABI initial stack (see below)
- Constructs kernel stack frame chaining through ctx_switch -> proc_enter_user -> iretq/ERET
- Grants 7 baseline capabilities
- Pre-opens fd 0 (stdin/keyboard), fd 1 (stdout/console), fd 2 (stderr/console)
- Initializes heap break to top of ELF segments, mmap base to 0x700000000000
- Adds to scheduler via sched_add()
SysV ABI Initial Stack
The initial user stack follows the System V AMD64 ABI:
high addresses
+------------------+
| argv[0] string | "- init" (login shell prefix)
+------------------+
| AT_RANDOM data | 16 random bytes
+------------------+
| (alignment pad) |
+------------------+
| AT_NULL pair |
| AT_RANDOM / va |
| AT_ENTRY / entry |
| AT_PAGESZ / 4096 |
| AT_PHNUM / count |
| AT_PHDR / va |
| AT_BASE / interp | (if dynamically linked)
| AT_PHENT / 56 | (if dynamically linked)
+------------------+
| envp NULL |
| argv NULL |
| argv[0] pointer |
| argc = 1 | <- RSP (16-byte aligned + 8)
+------------------+
low addresses
RSP is aligned so that RSP % 16 == 8 at _start, per the SysV ABI requirement that the stack is 16-byte aligned before the first call instruction.
Auxiliary Vector Entries
| Tag | Value | Description |
|---|---|---|
| AT_PHDR (3) | phdr_va | VA of program header table |
| AT_PHNUM (5) | phdr_count | Number of program headers |
| AT_PAGESZ (6) | 4096 | System page size |
| AT_ENTRY (9) | entry | Binary entry point (before interpreter redirect) |
| AT_RANDOM (25) | va | Pointer to 16 random bytes |
| AT_BASE (7) | INTERP_BASE | Interpreter load address (if PT_INTERP) |
| AT_PHENT (4) | 56 | Program header entry size (if PT_INTERP) |
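The table above maps naturally onto an array of (tag, value) pairs terminated by AT_NULL. The sketch below builds such an array; the tag numbers come from the table, while demo_build_auxv, demo_aux_in_t, the use of interp_base == 0 to mean "statically linked", and the emission order are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Tag numbers as listed in the table above. */
enum {
    DEMO_AT_NULL = 0, DEMO_AT_PHDR = 3, DEMO_AT_PHENT = 4, DEMO_AT_PHNUM = 5,
    DEMO_AT_PAGESZ = 6, DEMO_AT_BASE = 7, DEMO_AT_ENTRY = 9, DEMO_AT_RANDOM = 25
};

typedef struct { uint64_t tag, val; } demo_auxv_t;

typedef struct {
    uint64_t phdr_va, phdr_count, entry, random_va;
    uint64_t interp_base;   /* 0 = statically linked (assumed convention) */
} demo_aux_in_t;

/* Fills av[] and returns the number of pairs written, AT_NULL included.
 * Returns 0 if cap is too small to hold the vector. */
static size_t demo_build_auxv(demo_auxv_t *av, size_t cap,
                              const demo_aux_in_t *in)
{
    assert(av != NULL && in != NULL);
    size_t need = in->interp_base ? 8 : 6, n = 0;
    if (cap < need)
        return 0;

    av[n++] = (demo_auxv_t){ DEMO_AT_PHDR,   in->phdr_va };
    av[n++] = (demo_auxv_t){ DEMO_AT_PHNUM,  in->phdr_count };
    av[n++] = (demo_auxv_t){ DEMO_AT_PAGESZ, 4096 };
    av[n++] = (demo_auxv_t){ DEMO_AT_ENTRY,  in->entry };
    av[n++] = (demo_auxv_t){ DEMO_AT_RANDOM, in->random_va };
    if (in->interp_base) {                 /* only for PT_INTERP binaries */
        av[n++] = (demo_auxv_t){ DEMO_AT_BASE,  in->interp_base };
        av[n++] = (demo_auxv_t){ DEMO_AT_PHENT, 56 }; /* sizeof(Elf64_Phdr) */
    }
    av[n++] = (demo_auxv_t){ DEMO_AT_NULL, 0 };       /* terminator */
    return n;
}
```

Consumers scan the vector until they reach AT_NULL, so the order of the other pairs does not matter.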
sys_fork() – Syscall 57
Duplicates the calling process with a full deep copy of the address space:
- Allocates child PCB (2 KVA pages)
- Copies fd table via fd_table_copy (new table, increments driver refs)
- Copies capability table and authenticated flag
- Copies all scalar fields (brk, mmap_base, cwd, uid, gid, pgid, sid, umask)
- Deep-copies VMA table via vma_clone
- Creates new PML4 and deep-copies all user pages via vmm_copy_user_pml4
- Allocates 16 KB child kernel stack
- Builds initial kernel stack frame so first schedule returns via isr_post_dispatch -> iretq to user space with rax = 0
- Signal state: inherits mask and dispositions, clears pending (Linux semantics)
- Adds child to scheduler run queue
- Returns child PID to parent
Process limit: Total processes capped at MAX_PROCESSES (64). Exceeding returns -EAGAIN.
Fork bomb protection: The s_fork_count counter is checked before allocating any resources. Note that this is a basic v1 safeguard – a hard global cap, not a per-user or cgroup-style limit. It prevents trivial fork bombs but does not constitute a robust resource isolation mechanism.
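A global cap of this kind amounts to a counter that is checked before any allocation happens. The sketch below shows the idea; demo_fork_reserve and demo_fork_release are hypothetical names, and the locking that a real kernel needs around the counter is left out.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

#define DEMO_MAX_PROCESSES 64

/* Stand-in for s_fork_count; a real kernel would guard it with a lock. */
static uint32_t demo_fork_count;

/* Reserve a process slot up front so nothing needs undoing on refusal. */
static int demo_fork_reserve(void)
{
    if (demo_fork_count >= DEMO_MAX_PROCESSES)
        return -EAGAIN;
    demo_fork_count++;
    return 0;
}

/* Called when a zombie is reaped (or when a later fork step fails). */
static void demo_fork_release(void)
{
    assert(demo_fork_count > 0);
    demo_fork_count--;
}
```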
sys_clone() – Syscall 56
Creates a new thread (CLONE_VM set) or delegates to sys_fork (no CLONE_VM):
Clone Flags (Linux ABI)
| Flag | Value | Effect |
|---|---|---|
| CLONE_VM | 0x100 | Share address space (same PML4) |
| CLONE_FS | 0x200 | Share filesystem info |
| CLONE_FILES | 0x400 | Share fd table (refcount increment) |
| CLONE_SIGHAND | 0x800 | Share signal handlers |
| CLONE_THREAD | 0x10000 | Same thread group (tgid) |
| CLONE_SETTLS | 0x80000 | Set TLS pointer for child |
| CLONE_PARENT_SETTID | 0x100000 | Write child TID to parent’s *ptid |
| CLONE_CHILD_CLEARTID | 0x200000 | Clear TID + futex wake on exit |
| CLONE_CHILD_SETTID | 0x1000000 | Write child TID to child’s *ctid |
| CLONE_VFORK | 0x4000 | Block parent until child exits/execs |
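As a rough illustration of how these flags steer sys_clone, the sketch below decodes a flag word into a small plan structure. The flag values come from the table; demo_clone_plan and its fields are hypothetical, and ignoring the remaining flags on the fork path is an assumption of this sketch.

```c
#include <assert.h>
#include <stdint.h>

/* Values from the table above (Linux ABI). */
#define DEMO_CLONE_VM     0x00000100u
#define DEMO_CLONE_FILES  0x00000400u
#define DEMO_CLONE_THREAD 0x00010000u
#define DEMO_CLONE_SETTLS 0x00080000u

typedef struct {
    int is_thread;   /* CLONE_VM set: share the PML4; otherwise fork path */
    int share_fds;   /* CLONE_FILES: bump the fd table refcount */
    int same_tgid;   /* CLONE_THREAD: join the caller's thread group */
    int set_tls;     /* CLONE_SETTLS: install the TLS pointer */
} demo_clone_plan_t;

static demo_clone_plan_t demo_clone_plan(uint64_t flags)
{
    demo_clone_plan_t p = { 0, 0, 0, 0 };

    p.is_thread = (flags & DEMO_CLONE_VM) != 0;
    if (!p.is_thread)
        return p;    /* delegated to fork; other flags unused in this sketch */

    p.share_fds = (flags & DEMO_CLONE_FILES) != 0;
    p.same_tgid = (flags & DEMO_CLONE_THREAD) != 0;
    p.set_tls   = (flags & DEMO_CLONE_SETTLS) != 0;
    return p;
}
```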
Thread Creation Flow
- Validate TLS and CLEARTID pointers (reject kernel addresses)
- Check CAP_KIND_THREAD_CREATE capability
- Check process limit
- Allocate child PCB (2 KVA pages)
- Share address space: child->pml4_phys = parent->pml4_phys
- Share or copy fd table based on CLONE_FILES
- Copy capabilities and scalar fields
- Share VMA table via vma_share (refcount increment)
- Set up thread group membership (CLONE_THREAD -> same tgid)
- Configure TLS (CLONE_SETTLS), clear_child_tid (CLONE_CHILD_CLEARTID)
- Allocate 16 KB kernel stack
- Build initial kernel stack frame (same layout as fork)
- Add to scheduler, handle CLONE_PARENT_SETTID / CLONE_CHILD_SETTID
- If CLONE_VFORK: block parent until child exits or execs
Error Cleanup
If kernel stack allocation fails after partial setup, sys_clone carefully rolls back: decrements thread_count, calls vma_free (decrements refcount), unrefs fd table, and frees the PCB.
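The rollback described here follows the usual "undo in one place" pattern. Below is a minimal user-space sketch of that pattern, with malloc standing in for KVA allocation and plain counters standing in for the shared structures; demo_group_t, demo_clone_thread, and the fail_stack switch are hypothetical and only simulate the failure.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/* Counters standing in for the state a new thread shares with its group. */
typedef struct {
    uint32_t thread_count, vma_refcount, fd_refcount;
} demo_group_t;

/* fail_stack != 0 simulates a failed kernel stack allocation. */
static int demo_clone_thread(demo_group_t *g, int fail_stack)
{
    assert(g != NULL);
    void *pcb = malloc(2 * 4096);          /* stand-in for the 2-page PCB */
    if (pcb == NULL)
        return -ENOMEM;
    g->fd_refcount++;                      /* share fd table */
    g->vma_refcount++;                     /* share VMA table */
    g->thread_count++;                     /* join thread group */

    void *kstack = fail_stack ? NULL : malloc(4 * 4096);
    if (kstack == NULL)
        goto rollback;

    /* A kernel would hand pcb/kstack to the scheduler here. The sketch
     * frees them but leaves the counters raised to model a live thread. */
    free(kstack);
    free(pcb);
    return 0;

rollback:                                  /* same order as the text above */
    g->thread_count--;
    g->vma_refcount--;
    g->fd_refcount--;
    free(pcb);
    return -ENOMEM;
}
```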
sys_execve() – Syscall 59
Replaces the calling process image in place:
- Copy path and argv from user space (max 64 args, each max 255 bytes). argv buffer allocated from KVA (16 KB working area too large for kernel stack).
- Binary lookup: initrd first (trusted, no permission check), then ext2 with X_OK DAC permission check
- Load binary into KVA for ext2-backed files
- Point of no return: vmm_free_user_pages destroys the old image. After this, any failure is fatal (matches Linux behavior after flush_old_exec).
- Reset state: brk, mmap_base, mmap freelist, VMA table, fs_base, exe_path
- Reset capabilities: Clear capability table to the 6 baseline capabilities, then apply policy capabilities from /etc/aegis/caps.d/ based on binary name and authentication state. Exec is a capability boundary – the previous process’s capabilities do not propagate.
- Load new ELF via elf_load, including interpreter if PT_INTERP present
- Allocate fresh user stack (4 pages at 0x07FFFFFFB000)
- Build SysV ABI initial stack with argc, argv pointers, envp, and auxiliary vector
- Redirect SYSRET: modify the syscall frame to point RIP to the new entry and RSP to the new stack
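The argv limits in the first step explain the 16 KB working area: 64 slots of 256 bytes (255 characters plus the terminator). The sketch below applies those limits to an ordinary in-memory argv; demo_copy_argv is a hypothetical name, returning -E2BIG is an assumption about the error code, and a kernel would read each byte through a checked user-copy routine rather than dereferencing user pointers directly.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

#define DEMO_MAX_ARGS    64
#define DEMO_MAX_ARG_LEN 255

/* 64 slots of 256 bytes is exactly the 16 KB working area. */
static_assert(DEMO_MAX_ARGS * (DEMO_MAX_ARG_LEN + 1) == 16 * 1024,
              "argv working area should be 16 KB");

/* Copies a NULL-terminated argv into fixed-size slots.
 * Returns argc, or -E2BIG if there are too many or too long arguments. */
static int demo_copy_argv(const char *const *uargv,
                          char kargv[][DEMO_MAX_ARG_LEN + 1])
{
    int argc = 0;

    for (; uargv[argc] != NULL; argc++) {
        if (argc == DEMO_MAX_ARGS)
            return -E2BIG;                 /* 65th argument */

        const char *s = uargv[argc];
        size_t n = 0;
        while (n <= DEMO_MAX_ARG_LEN && s[n] != '\0')
            n++;                           /* bounded strlen */
        if (n > DEMO_MAX_ARG_LEN)
            return -E2BIG;                 /* no terminator within 256 bytes */

        memcpy(kargv[argc], s, n + 1);     /* include the terminator */
    }
    return argc;
}
```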
sys_exit() / sys_exit_group() – Syscalls 60 / 231
sys_exit
- Store exit code in proc->exit_status (lower 8 bits)
- Handle clear_child_tid: write 0 and futex wake (for pthread_join)
- Session leader exit: send SIGHUP + SIGCONT to foreground process group, disassociate terminal
- PID 1 exit triggers system halt
- Reparent orphan children to init (PID 1)
- Call sched_exit() (never returns)
sys_exit_group
Same as sys_exit but also kills all other threads in the same thread group (tgid), performing clear_child_tid + futex wake for each killed thread.
sys_waitpid() – Syscall 61
Waits for a child process to change state:
- pid > 0: wait for specific child
- pid == -1 or pid == 0: wait for any child
- WNOHANG flag: return immediately if no child has exited
- WUNTRACED flag: also report stopped children
Walks the scheduler’s circular task list looking for zombie or stopped children. On finding a match:
- Writes wstatus to user memory (exit code shifted left by 8)
- For zombies: removes from task list, frees PCB, kernel stack, user page tables, decrements fork count
- For stopped children: reports stop signal without reaping
If no matching child has exited and WNOHANG is not set, the parent blocks via sched_block() until woken by SIGCHLD from a child’s exit.
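The pid matching rule and the exit-status encoding described in this section can be sketched as three small helpers. The names are hypothetical, and treating pid values below -1 as "no match" is an assumption, since the text does not cover that case.

```c
#include <assert.h>
#include <stdint.h>

/* Does a child with child_pid satisfy a waitpid(want, ...) request? */
static int demo_wait_matches(int32_t want, uint32_t child_pid)
{
    if (want > 0)
        return (uint32_t)want == child_pid;   /* one specific child */
    return want == -1 || want == 0;           /* any child */
}

/* Encode a normal exit: exit code in bits 8..15, low bits zero. */
static int demo_wstatus_exited(uint64_t exit_status)
{
    return (int)((exit_status & 0xFF) << 8);
}

/* Decode, mirroring what WEXITSTATUS does in user space. */
static int demo_wexitstatus(int wstatus)
{
    assert(wstatus >= 0);
    return (wstatus >> 8) & 0xFF;
}
```

Only the lower 8 bits of exit_status survive the round trip, which matches the field's comment in the process control block.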
Process Lookup
aegis_process_t *proc_find_by_pid(uint32_t pid);
Walks the scheduler’s circular task list under sched_lock. Checks the current task first, then iterates. Returns NULL if no matching user process found. Used by sys_kill, sys_waitpid, sys_cap_query, and other syscalls that operate on a target PID.
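A walk over a circular list needs a do/while loop so that the starting node is examined once and the loop ends when it comes back around. The sketch below shows that shape with a simplified task type; demo_task_t and demo_find_by_pid are hypothetical, the pid lives directly in the task for brevity, and the sched_lock acquisition is omitted.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified task node; in Aegis the pid lives in aegis_process_t. */
typedef struct demo_task {
    struct demo_task *next;   /* circular: the last task points to the first */
    uint8_t  is_user;
    uint32_t pid;
} demo_task_t;

/* Checks current first, then every other task exactly once. */
static demo_task_t *demo_find_by_pid(demo_task_t *current, uint32_t pid)
{
    if (current == NULL)
        return NULL;

    demo_task_t *t = current;
    do {
        if (t->is_user && t->pid == pid)
            return t;         /* only user processes carry a PID */
        t = t->next;
    } while (t != current);

    return NULL;
}
```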
User-Mode Entry
x86-64
proc_enter_user is a bare iretq label in syscall_entry.asm. The initial kernel stack frame constructed by proc_spawn places the entry point, user CS/SS, RFLAGS, and user RSP in the iretq frame. Before iretq, the trampoline pops pml4_phys from the stack and loads it into CR3 to switch to the user address space.
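In 64-bit mode iretq pops five 8-byte values in a fixed order: RIP, CS, RFLAGS, RSP, SS. The sketch below models that frame as a struct, lowest address first. demo_iretq_frame_t and demo_user_frame are hypothetical names, the selectors are passed in because their values depend on the GDT layout, and setting the interrupt flag on entry is an assumption of this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* iretq frame as laid out in memory, lowest address (top of stack) first. */
typedef struct {
    uint64_t rip;      /* user entry point */
    uint64_t cs;       /* user code selector */
    uint64_t rflags;
    uint64_t rsp;      /* initial user stack pointer */
    uint64_t ss;       /* user stack selector */
} demo_iretq_frame_t;

static_assert(sizeof(demo_iretq_frame_t) == 40, "five 8-byte slots");
static_assert(offsetof(demo_iretq_frame_t, rsp) == 24, "RSP is slot 3");

#define DEMO_RFLAGS_FIXED (1ULL << 1)   /* bit 1 always reads as 1 */
#define DEMO_RFLAGS_IF    (1ULL << 9)   /* interrupts enabled in user mode */

static demo_iretq_frame_t demo_user_frame(uint64_t entry, uint64_t user_rsp,
                                          uint16_t user_cs, uint16_t user_ss)
{
    /* Both selectors must request privilege level 3 to land in user mode. */
    assert((user_cs & 3) == 3 && (user_ss & 3) == 3);

    demo_iretq_frame_t f = {
        .rip    = entry,
        .cs     = user_cs,
        .rflags = DEMO_RFLAGS_FIXED | DEMO_RFLAGS_IF,
        .rsp    = user_rsp,
        .ss     = user_ss,
    };
    return f;
}
```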
ARM64
proc_enter_user is an ERET trampoline in proc_enter.S. It loads TTBR0 (user page table base), sets SP_EL0, ELR_EL1 (return address), and SPSR_EL1, then issues ERET to enter EL0.
Related Documentation
- Scheduler – task states, context switching, run queue
- Syscall Interface – complete syscall table and dispatch
- Memory Management – page tables, KVA allocation, VMM
- Capability model – capability table and capability kinds
- Security policy engine – policy capabilities from /etc/aegis/caps.d/