Memory Management

Aegis implements a three-layer memory management architecture: a physical page allocator (PMM), a 4-level virtual memory manager (VMM), and a kernel virtual address allocator (KVA). User processes additionally track their virtual memory areas (VMAs) for mmap/mprotect/munmap support.

v1 maturity notice. The memory management subsystem is written entirely in C and represents v1 quality – functional and tested, but not battle-hardened. The PMM uses a simple bitmap allocator (O(n) scan), the VMM has a single-slot mapped-window allocator protected by a global spinlock, and there is no fault recovery table (extable) for user-memory access. These are deliberate v1 trade-offs, not permanent design decisions. As a from-scratch C kernel, there are likely exploitable vulnerabilities in the memory management code, as would be expected at this stage. A gradual Rust migration is planned for safety-critical kernel subsystems; kernel/cap/ is already in Rust. Contributions are welcome – file issues or propose changes at exec/aegis.

Architecture Overview

+-------------------+
|  User processes   |  VMA tracking (mmap, brk, stack)
+-------------------+
|       VMM         |  4-level page tables (PML4 -> PDPT -> PD -> PT)
+-------------------+
|       KVA         |  Kernel virtual address bump allocator
+-------------------+
|       PMM         |  Bitmap physical page allocator
+-------------------+
|  Physical RAM     |  Reported by multiboot2, managed as 4KB frames
+-------------------+

Initialization order in kernel_main:

  1. arch_mm_init(mb_info) – Parse multiboot2 memory map
  2. pmm_init() – Build physical page bitmap
  3. vmm_init() – Build kernel page tables, activate paging
  4. kva_init() – Initialize kernel VA bump allocator

Physical Memory Manager (PMM)

Source: kernel/mm/pmm.c, kernel/mm/pmm.h

Design

The PMM uses a bitmap allocator covering 4GB of physical address space. Each bit represents one 4KB page:

Parameter         Value
Page size         4096 bytes (PAGE_SIZE)
Maximum pages     1,048,576 (4GB / 4KB)
Bitmap size       128KB (1M bits / 8)
Bitmap location   BSS (.bss section)
Bit semantics     0 = free, 1 = allocated

Initialization

pmm_init() follows a five-step process:

  1. Mark everything reserved – Fill bitmap with 0xFF (safe default)
  2. Free usable RAM – Walk arch_mm_get_regions() (multiboot2 type=1 entries), clear bits for each usable page
  3. Re-reserve platform ranges – Walk arch_mm_get_reserved_regions():
    • First 1MB (BIOS data, VGA hole, ISA ROMs)
    • Multiboot2 info structure (may be above 1MB)
    • GRUB modules (rootfs, ESP image)
  4. Reserve kernel image – [ARCH_KERNEL_PHYS_BASE, _kernel_end - KERN_VMA), covering .text through .bss (including the bitmap itself)
  5. Report – Print total usable MB across N regions

Allocation API

uint64_t pmm_alloc_page(void);   // Returns physical address, 0 on OOM
void     pmm_free_page(uint64_t addr);
void     pmm_ref_page(uint64_t addr);   // Increment refcount (COW fork)
uint64_t pmm_total_pages(void);
uint64_t pmm_free_pages(void);

pmm_alloc_page() performs a linear scan for the first byte != 0xFF, then finds the first clear bit. O(n) where n = bitmap size. Single-page allocation only – multi-page contiguous allocation is deferred to a future buddy-allocator upgrade.
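The scan can be illustrated with a minimal userspace sketch. Names (`bitmap`, `alloc_page`, `free_page`) and the tiny 64-page bitmap are illustrative, not the kernel's actual symbols; page 0 is pre-marked reserved to mirror the kernel's low-memory reservation (and to keep the 0 OOM sentinel unambiguous):

```c
#include <stdint.h>

#define PAGE_SIZE 4096ULL
#define MAX_PAGES 64            /* tiny bitmap for illustration; the kernel uses 1M pages */

/* 1 bit per page: 0 = free, 1 = allocated. Page 0 starts reserved. */
static uint8_t bitmap[MAX_PAGES / 8] = { 0x01 };

/* First-fit scan: find the first byte != 0xFF, then the first clear bit in it. */
static uint64_t alloc_page(void)
{
    for (uint64_t byte = 0; byte < MAX_PAGES / 8; byte++) {
        if (bitmap[byte] == 0xFF)
            continue;                            /* all 8 pages in this byte are used */
        for (int bit = 0; bit < 8; bit++) {
            if (!(bitmap[byte] & (1u << bit))) {
                bitmap[byte] |= (1u << bit);     /* mark allocated */
                return (byte * 8 + bit) * PAGE_SIZE;
            }
        }
    }
    return 0;                                    /* OOM sentinel, as in pmm_alloc_page */
}

static void free_page(uint64_t addr)
{
    uint64_t page = addr / PAGE_SIZE;
    bitmap[page / 8] &= ~(1u << (page % 8));
}
```

The byte-granularity skip is what keeps the O(n) scan tolerable in practice: most of the loop runs 8 pages per iteration over fully-allocated regions.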

All PMM operations are protected by pmm_lock (spinlock with IRQ save/restore).

Reference Counting

Every allocated page has an 8-bit reference count (pmm_refcount[], 1MB in BSS). This supports copy-on-write (COW) fork:

Operation          Refcount effect
pmm_alloc_page()   Set to 1
pmm_ref_page()     Increment (panics on overflow at 255)
pmm_free_page()    Decrement; free page only when refcount reaches 0

Pages outside the PMM-managed range (e.g., MMIO addresses like framebuffer physical addresses) are silently skipped by pmm_free_page().
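The refcount semantics from the table above can be sketched in isolation. This is a userspace model, not the kernel's code: `alloc_mark` stands in for the refcount side-effect of `pmm_alloc_page()`, and `freed_pages` counts actual releases back to the bitmap:

```c
#include <stdint.h>

#define PAGE_SIZE 4096ULL
#define MAX_PAGES 64

static uint8_t refcount[MAX_PAGES];   /* 8-bit per-page refcount, as in pmm_refcount[] */
static int     freed_pages;           /* how many frames were actually released */

static void alloc_mark(uint64_t addr)
{
    refcount[addr / PAGE_SIZE] = 1;   /* pmm_alloc_page() sets refcount to 1 */
}

static void ref_page(uint64_t addr)
{
    uint8_t *rc = &refcount[addr / PAGE_SIZE];
    if (*rc == 255)
        return;                       /* the kernel panics here; this sketch saturates */
    (*rc)++;
}

static void free_page(uint64_t addr)
{
    uint8_t *rc = &refcount[addr / PAGE_SIZE];
    if (*rc == 0)
        return;                       /* not allocated, or outside the managed range */
    if (--(*rc) == 0)
        freed_pages++;                /* refcount hit 0: really release the frame */
}
```

With a COW-shared page, the parent and child each call free; only the second call returns the frame to the bitmap.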

Virtual Memory Manager (VMM)

Source: kernel/mm/vmm.c, kernel/mm/vmm.h

x86-64 Page Table Structure

Aegis uses the standard 4-level x86-64 page table hierarchy:

Virtual Address (48-bit canonical):
+--------+--------+--------+--------+----------+
| PML4   | PDPT   | PD     | PT     | Offset   |
| [47:39]| [38:30]| [29:21]| [20:12]| [11:0]   |
| 9 bits | 9 bits | 9 bits | 9 bits | 12 bits  |
+--------+--------+--------+--------+----------+

PML4 [512 entries]
  |
  +--[0]----> pdpt_lo [512 entries]    (identity map, removed after boot)
  |             +--[0]----> pd_lo [512 entries]
  |                           +--[0..511] -> 512 x 2MB huge pages
  |
  +--[511]--> pdpt_hi [512 entries]    (higher-half kernel)
                +--[510]--> pd_hi [512 entries]
                              +--[0] -> 2MB huge: PA 0x000000 (kernel .text)
                              +--[1] -> 2MB huge: PA 0x200000 (kernel cont.)
                              +--[2] -> 2MB huge: PA 0x400000 (kernel BSS)
                              +--[3] -> PT: mapped-window allocator
                              +--[4+] -> KVA 4KB pages
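The index fields in the diagram are plain bit extractions. A minimal sketch (helper names are illustrative):

```c
#include <stdint.h>

/* Extract the four 9-bit page-table indices from a 48-bit canonical VA. */
static unsigned pml4_index(uint64_t va) { return (va >> 39) & 0x1FF; }
static unsigned pdpt_index(uint64_t va) { return (va >> 30) & 0x1FF; }
static unsigned pd_index(uint64_t va)   { return (va >> 21) & 0x1FF; }
static unsigned pt_index(uint64_t va)   { return (va >> 12) & 0x1FF; }
```

As a worked example, VMM_WINDOW_VA (0xFFFFFFFF80600000) decodes to PML4[511] -> PDPT[510] -> PD[3] -> PT[0], matching the pd_hi[3] window slot in the tree above.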

Page Table Entry Flags

Abstract flags defined in vmm.h (translated to hardware PTE bits by arch_pte_from_flags()):

Flag                Bit   Purpose
VMM_FLAG_PRESENT    0     Page is present in memory
VMM_FLAG_WRITABLE   1     Page is writable
VMM_FLAG_USER       2     Page is accessible from ring 3
VMM_FLAG_WC         3     Write-Combining cache (PWT bit, PAT entry 1)
VMM_FLAG_UCMINUS    4     Uncacheable-minus (PCD bit, PAT entry 2)
VMM_FLAG_COW        9     Copy-on-write marker (OS-available PTE bit)
VMM_FLAG_NX         63    No-execute (requires EFER.NXE)
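Note that the abstract bit positions happen to coincide with the x86-64 hardware PTE bits, so a plausible arch_pte_from_flags() is nearly a pass-through. The sketch below is an assumption about its shape, not the kernel's actual code; the indirection exists so a non-x86 port can remap each flag:

```c
#include <stdint.h>

/* Abstract flags from vmm.h (values as documented above). */
#define VMM_FLAG_PRESENT  (1ULL << 0)
#define VMM_FLAG_WRITABLE (1ULL << 1)
#define VMM_FLAG_USER     (1ULL << 2)
#define VMM_FLAG_WC       (1ULL << 3)    /* PWT: selects PAT entry 1 */
#define VMM_FLAG_UCMINUS  (1ULL << 4)    /* PCD: selects PAT entry 2 */
#define VMM_FLAG_COW      (1ULL << 9)    /* OS-available PTE bit */
#define VMM_FLAG_NX       (1ULL << 63)

/* Hypothetical translation: on x86-64 the abstract bits line up with the
 * hardware bits, so this degenerates to an identity mapping. */
static uint64_t pte_from_flags(uint64_t flags)
{
    return flags;
}
```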

Initialization (vmm_init)

vmm_init() builds a fresh set of page tables using the identity map, then switches CR3:

  1. Allocate 5 page-table pages via alloc_table_early() (uses identity map to zero pages)
  2. Identity map – PML4[0] -> pdpt_lo[0] -> pd_lo: 512 x 2MB huge pages covering [0..1GB)
  3. Higher-half map – PML4[511] -> pdpt_hi[510] -> pd_hi: 3 x 2MB huge pages for the kernel
  4. Install mapped-window PT into pd_hi[3] – a 4KB page table backing VMM_WINDOW_VA (0xFFFFFFFF80600000)
  5. Load CR3 with the new PML4 physical address

After this point, both identity and higher-half mappings are active. The identity map is torn down by vmm_teardown_identity() near the end of boot.

Mapped-Window Allocator

The VMM uses a “mapped window” to manipulate page tables without requiring an identity map. This is a fixed virtual address (VMM_WINDOW_VA) whose PTE can be pointed at any physical page:

void *vmm_window_map(uint64_t phys);   // Map phys at VMM_WINDOW_VA
void  vmm_window_unmap(void);          // Clear the mapping

// Two window slots available:
// Slot 0: s_window_pt[0] -> VMM_WINDOW_VA
// Slot 1: s_window_pt[1] -> VMM_WINDOW_VA + 4096

The window PTE pointer (s_window_pte) is declared volatile to ensure writes reach memory before the subsequent invlpg instruction.

Lock ordering: vmm_window_lock > pmm_lock > kva_lock. Code holding vmm_window_lock may acquire pmm_lock, but never the reverse.
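The typical edit pattern – map a table page into the window, patch one entry, unmap – can be modeled in userspace. Here "physical" pages are slots in a local array and mapping just returns a pointer; in the kernel, vmm_window_map() rewrites the window PTE and issues invlpg instead. All names are illustrative:

```c
#include <stdint.h>

#define NPAGES 8
static uint64_t fake_phys[NPAGES][512];   /* stand-ins for 4KB page-table pages */

static uint64_t *window_map(uint64_t phys_page)
{
    /* kernel: point s_window_pt[0] at phys_page, invlpg, return VMM_WINDOW_VA */
    return fake_phys[phys_page];
}

static void window_unmap(void)
{
    /* kernel: clear the window PTE and invalidate the TLB entry */
}

/* Map a table page through the window, patch one entry, unmap. */
static void set_entry(uint64_t table_page, unsigned idx, uint64_t value)
{
    uint64_t *t = window_map(table_page);  /* kernel: under vmm_window_lock */
    t[idx] = value;
    window_unmap();
}
```

The second window slot exists for operations that need two physical pages mapped at once, such as the page copy in the COW fault path.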

Page Mapping

void vmm_map_page(uint64_t virt, uint64_t phys, uint64_t flags);

Walks the 4-level page table using ensure_table_phys(), which:

  1. Maps the parent table via the window
  2. Checks if entry at idx is present
  3. If absent: allocates a new page-table page, installs it with PRESENT WRITABLE (plus USER for user tables)
  4. Returns the physical address of the child table

The leaf PTE is set to phys | arch_pte_from_flags(flags | PRESENT). Double-mapping (leaf already present) panics.
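The walk can be modeled in userspace with real pointers standing in for physical addresses (aligned_alloc keeps them 4KB-aligned so the address-mask arithmetic works). This is a sketch of the structure of ensure_table_phys() and vmm_map_page(), not the kernel's code – the real versions go through the mapped window and take locks:

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define PRESENT   1ULL
#define ADDR_MASK 0x000FFFFFFFFFF000ULL   /* bits 51:12 hold the table address */

/* 4KB-aligned zeroed table, standing in for a fresh PMM page. */
static uint64_t *new_table(void)
{
    uint64_t *t = aligned_alloc(4096, 4096);
    memset(t, 0, 4096);
    return t;
}

/* Model of ensure_table_phys(): return the child table at parent[idx],
 * allocating and installing it if absent. */
static uint64_t *ensure_table(uint64_t *parent, unsigned idx)
{
    if (!(parent[idx] & PRESENT))
        parent[idx] = (uint64_t)(uintptr_t)new_table() | PRESENT;
    return (uint64_t *)(uintptr_t)(parent[idx] & ADDR_MASK);
}

static void map_page(uint64_t *pml4, uint64_t virt, uint64_t phys, uint64_t flags)
{
    uint64_t *pdpt = ensure_table(pml4, (virt >> 39) & 0x1FF);
    uint64_t *pd   = ensure_table(pdpt, (virt >> 30) & 0x1FF);
    uint64_t *pt   = ensure_table(pd,   (virt >> 21) & 0x1FF);
    pt[(virt >> 12) & 0x1FF] = phys | flags | PRESENT;    /* leaf PTE */
}

static uint64_t translate(uint64_t *pml4, uint64_t virt)
{
    uint64_t *pdpt = (uint64_t *)(uintptr_t)(pml4[(virt >> 39) & 0x1FF] & ADDR_MASK);
    uint64_t *pd   = (uint64_t *)(uintptr_t)(pdpt[(virt >> 30) & 0x1FF] & ADDR_MASK);
    uint64_t *pt   = (uint64_t *)(uintptr_t)(pd[(virt >> 21) & 0x1FF] & ADDR_MASK);
    return (pt[(virt >> 12) & 0x1FF] & ADDR_MASK) | (virt & 0xFFF);
}
```

The sketch omits the double-mapping panic: the kernel checks the leaf PTE for PRESENT before writing it.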

User Address Spaces

Each user process has its own PML4:

uint64_t vmm_create_user_pml4(void);                // New PML4, copies kernel entries [256..511]
void vmm_map_user_page(pml4, virt, phys, flags);    // Map in user PML4
void vmm_switch_to(uint64_t pml4_phys);             // Load CR3
void vmm_free_user_pml4(uint64_t pml4_phys);        // Free user half + PT pages

User page table entries require VMM_FLAG_USER at every level of the walk (PML4e, PDPTe, PDe, PTe). The x86-64 MMU checks the USER bit at each level – a leaf with USER but an ancestor without causes a ring-3 #PF.
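Kernel-half sharing is the key invariant of vmm_create_user_pml4(): because PML4 entries [256..511] are copied by value, every address space points at the same kernel PDPTs, so kernel mappings added later through those shared tables are visible everywhere. A userspace model of just that copy (arrays stand in for 4KB PML4 pages; names are illustrative):

```c
#include <stdint.h>
#include <string.h>

/* Simulated kernel PML4 (in the kernel this is a 4KB physical page). */
static uint64_t kernel_pml4[512];

/* Sketch of vmm_create_user_pml4(): user half [0..255] starts empty,
 * kernel half [256..511] aliases the kernel's page tables. */
static void create_user_pml4(uint64_t dst[512])
{
    memset(dst, 0, 256 * sizeof(uint64_t));                   /* user half empty */
    memcpy(dst + 256, kernel_pml4 + 256, 256 * sizeof(uint64_t));
}
```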

Copy-on-Write (COW)

The VMM exposes COW infrastructure for fork(), but the active fork path currently uses eager copy (vmm_copy_user_pages). COW was measured to be a net regression on the current workload (Aegis’s primary fork caller immediately execves), so activation is deferred until a workload motivates re-enabling it:

int vmm_copy_user_pages(uint64_t src_pml4, uint64_t dst_pml4);   /* active */
int vmm_cow_user_pages(uint64_t src_pml4, uint64_t dst_pml4);    /* infrastructure */
int vmm_cow_fault_handle(uint64_t pml4_phys, uint64_t fault_va); /* infrastructure */

vmm_cow_user_pages() (available, not currently called from sys_fork):

  • Clears the W bit and sets VMM_FLAG_COW (PTE bit 9) on every writable user page in the parent
  • Installs the same RO+COW mapping in the child (read-only pages are shared as-is)
  • Skips MMIO pages (any PTE with VMM_FLAG_WC or VMM_FLAG_UCMINUS)
  • Increments per-page refcounts via pmm_ref_page()
  • Invalidates the parent’s TLB for each modified page

vmm_cow_fault_handle() (available but not wired into the page fault path): walks the PML4 to the leaf PTE, verifies VMM_FLAG_COW is set, allocates a fresh frame, copies the old contents via the two-window-slot mechanism, and updates the PTE with W set and COW cleared. Returns 0 (handled), -1 (not COW -> SIGSEGV), or -2 (OOM -> SIGBUS).

Because vmm_cow_fault_handle is not currently invoked from isr_dispatch, a write to a COW-marked page would panic the kernel in v1.
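The decision logic of the fault handler, reduced to the leaf PTE, looks roughly like this. The frame allocation and two-window copy are abstracted into an `alloc_ok` flag; bit names and the function name are illustrative, and the return codes follow the description above:

```c
#include <stdint.h>

#define PTE_PRESENT (1ULL << 0)
#define PTE_WRITE   (1ULL << 1)
#define PTE_COW     (1ULL << 9)   /* VMM_FLAG_COW: OS-available bit */

/* Sketch of the vmm_cow_fault_handle() decision on the leaf PTE. */
static int cow_fault(uint64_t *pte, int alloc_ok)
{
    if (!(*pte & PTE_PRESENT) || !(*pte & PTE_COW))
        return -1;                        /* not a COW page -> SIGSEGV */
    if (!alloc_ok)
        return -2;                        /* no frame for the copy -> SIGBUS */
    /* kernel: copy old frame to the new one via the two window slots, then: */
    *pte = (*pte | PTE_WRITE) & ~PTE_COW; /* writable again, marker cleared */
    return 0;                             /* handled: retry the faulting write */
}
```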

Identity Map Teardown

vmm_teardown_identity() is called after all kernel objects are allocated via KVA:

void vmm_teardown_identity(void);   // PML4[0] = 0, reload CR3

After this point, physical addresses below KERN_VMA are only accessible through the mapped-window allocator or KVA mappings.

Kernel Virtual Allocator (KVA)

Source: kernel/mm/kva.c, kernel/mm/kva.h

Design

KVA provides a bump allocator for kernel-mode virtual addresses starting at KVA_BASE (KERN_VMA + 0x800000, i.e., 0xFFFFFFFF80800000). Each allocation gets contiguous VA space backed by individually-allocated PMM pages.

void *kva_alloc_pages(uint64_t n);          // Allocate n 4KB pages
void *kva_map_phys_pages(uint64_t phys, uint32_t n);  // Map existing physical pages
void  kva_free_pages(void *va, uint64_t n); // Unmap and free
uint64_t kva_page_phys(void *va);           // VA -> PA lookup

Freelist

KVA maintains a fixed-size freelist (KVA_FREE_MAX = 128 entries) for recycling freed VA ranges. On allocation, the freelist is checked first (best-fit search); on miss, the bump cursor advances. On free, the range is inserted into the freelist with coalescing of adjacent entries.

Allocation path:
  1. Try freelist (best-fit) -> return VA if hit
  2. Bump s_kva_next forward by n * PAGE_SIZE
  3. For each page: pmm_alloc_page() + vmm_map_page()

Free path:
  1. For each page: vmm_phys_of() -> vmm_unmap_page() -> pmm_free_page()
  2. Insert VA range into freelist (coalesce with neighbors)
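The address-allocation policy (best-fit freelist, then bump) can be sketched without the mapping side. This model tracks VA ranges only – the real allocator also calls pmm_alloc_page()/vmm_map_page() per page – and simplifies coalescing to a single pass; if the freelist is full, this sketch silently drops the range, which may differ from the kernel's policy:

```c
#include <stdint.h>

#define PAGE_SIZE    4096ULL
#define KVA_FREE_MAX 128

static struct { uint64_t va, pages; } freelist[KVA_FREE_MAX];
static int      nfree;
static uint64_t next_va = 0xFFFFFFFF80800000ULL;   /* KVA_BASE */

static uint64_t kva_alloc(uint64_t n)
{
    /* Best fit: smallest free range that still holds n pages. */
    int best = -1;
    for (int i = 0; i < nfree; i++)
        if (freelist[i].pages >= n &&
            (best < 0 || freelist[i].pages < freelist[best].pages))
            best = i;
    if (best >= 0) {
        uint64_t va = freelist[best].va;
        freelist[best].va    += n * PAGE_SIZE;
        freelist[best].pages -= n;
        if (freelist[best].pages == 0)
            freelist[best] = freelist[--nfree];   /* drop emptied entry */
        return va;
    }
    uint64_t va = next_va;                        /* freelist miss: bump */
    next_va += n * PAGE_SIZE;
    return va;
}

static void kva_free(uint64_t va, uint64_t n)
{
    for (int i = 0; i < nfree; i++) {
        if (freelist[i].va + freelist[i].pages * PAGE_SIZE == va) {
            freelist[i].pages += n;               /* coalesce upward */
            return;
        }
        if (va + n * PAGE_SIZE == freelist[i].va) {
            freelist[i].va = va;                  /* coalesce downward */
            freelist[i].pages += n;
            return;
        }
    }
    if (nfree < KVA_FREE_MAX) {
        freelist[nfree].va    = va;
        freelist[nfree].pages = n;
        nfree++;
    }
}
```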

KVA pages are mapped without VMM_FLAG_USER, so the MMU denies ring-3 access to all kernel objects (TCBs, kernel stacks, driver buffers).

All KVA operations are protected by kva_lock (spinlock with IRQ save/restore).

Per-Process VMA Tracking

Source: kernel/mm/vma.c, kernel/mm/vma.h

Design

Each process has a sorted array of vma_entry_t structures tracking its virtual memory regions. The table is allocated as a single KVA page (4096 bytes / 24 bytes per entry = 170 entries max).

typedef struct {
    uint64_t base;     // Region start VA
    uint64_t len;      // Region length in bytes
    uint32_t prot;     // PROT_READ | PROT_WRITE | PROT_EXEC
    uint8_t  type;     // VMA type constant
    uint8_t  _pad[3];
} vma_entry_t;         // 24 bytes

VMA Types

Constant           Value   Description
VMA_NONE           0       Untyped
VMA_ELF_TEXT       1       ELF PT_LOAD with PROT_EXEC
VMA_ELF_DATA       2       ELF PT_LOAD without PROT_EXEC
VMA_HEAP           3       [brk_base..brk]
VMA_STACK          4       User stack
VMA_MMAP           5       Anonymous mmap
VMA_THREAD_STACK   6       Thread stack via pthread_create
VMA_GUARD          7       Guard page (PROT_NONE)
VMA_SHARED         8       MAP_SHARED mapping (phys pages owned by memfd)

Operations

void vma_init(struct aegis_process *proc);     // Allocate table page
void vma_insert(proc, base, len, prot, type);  // Insert with merge
void vma_remove(proc, base, len);              // Remove with split
void vma_update_prot(proc, base, len, prot);   // Change permissions with split
void vma_clear(struct aegis_process *proc);     // Clear all entries (execve)
void vma_clone(dst, src);                       // Deep copy (fork)
void vma_share(child, parent);                  // Share table (CLONE_VM threads)
void vma_free(struct aegis_process *proc);      // Decrement refcount, free if 0

Insert merges with adjacent entries if they have matching prot and type. Remove and update_prot split entries at region boundaries when partial overlap occurs.

The table supports reference counting for CLONE_VM threads: vma_share() increments the refcount and gives the child a pointer to the parent’s table; vma_free() only deallocates when the refcount reaches 0.
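The merge rule – extend an adjacent entry when prot and type match, otherwise add a new one – can be sketched as follows. This is a hypothetical simplification (`vma_insert_sk`, unsorted append) that demonstrates only the merge decision; the real table stays sorted and also handles overlap splitting:

```c
#include <stdint.h>

typedef struct {
    uint64_t base, len;
    uint32_t prot;
    uint8_t  type;
} vma_t;

#define VMA_MAX 170              /* one KVA page / 24 bytes per entry */
static vma_t vmas[VMA_MAX];
static int   nvmas;

/* Insert with merge: adjacent + same prot/type extends an entry. */
static void vma_insert_sk(uint64_t base, uint64_t len, uint32_t prot, uint8_t type)
{
    for (int i = 0; i < nvmas; i++) {
        if (vmas[i].prot != prot || vmas[i].type != type)
            continue;                               /* attributes differ: no merge */
        if (vmas[i].base + vmas[i].len == base) {
            vmas[i].len += len;                     /* extends upward */
            return;
        }
        if (base + len == vmas[i].base) {
            vmas[i].base = base;                    /* extends downward */
            vmas[i].len += len;
            return;
        }
    }
    vmas[nvmas].base = base; vmas[nvmas].len = len;
    vmas[nvmas].prot = prot; vmas[nvmas].type = type;
    nvmas++;
}
```

Merging keeps the 170-entry table from filling up under workloads that grow a region page by page, such as repeated brk extensions.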

User-Kernel Memory Access

Source: kernel/mm/uaccess.h

Aegis uses SMAP (Supervisor Mode Access Prevention) to prevent the kernel from accidentally accessing user memory. Controlled access is gated by STAC/CLAC instructions:

static inline void copy_from_user(void *dst, const void *src, uint64_t len) {
    arch_stac();                // Set RFLAGS.AC (permit user access)
    __builtin_memcpy(dst, src, len);
    arch_clac();                // Clear RFLAGS.AC (re-enable SMAP)
}

static inline void copy_to_user(void *dst, const void *src, uint64_t len) {
    arch_stac();                // Same pattern; here dst is the user pointer
    __builtin_memcpy(dst, src, len);
    arch_clac();
}

Callers must validate user pointers before calling these functions. There is no fault recovery table (Linux extable); the entire user range must be mapped. This is a known v1 limitation – a malformed user pointer that passes validation but crosses into an unmapped page will cause a kernel panic rather than returning -EFAULT. Real-world exploitation of this class of bug is plausible in the current C codebase.
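A typical range check a caller would perform before either copy looks like this. The sketch is an assumption about the validation shape: `USER_VA_MAX` here is the standard x86-64 lower-canonical-half limit, not necessarily Aegis's actual constant, and the check deliberately does not verify that the range is mapped – that is exactly the missing-extable gap described above:

```c
#include <stdint.h>

#define USER_VA_MAX 0x0000800000000000ULL   /* top of the lower canonical half */

/* Return 1 if [ptr, ptr+len) lies entirely in the user half without wrapping. */
static int user_range_ok(uint64_t ptr, uint64_t len)
{
    if (len == 0)
        return 1;
    if (ptr >= USER_VA_MAX)
        return 0;                /* starts in the kernel half */
    if (ptr + len < ptr)
        return 0;                /* arithmetic wrap */
    if (ptr + len > USER_VA_MAX)
        return 0;                /* crosses into the kernel half */
    return 1;
}
```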

Cache Control

The PAT (Page Attribute Table) MSR is programmed by arch_pat_init() during early boot:

PAT Entry   Index   Type                     PTE Encoding
PA0         0       Write-Back (WB)          PWT=0, PCD=0 (default)
PA1         1       Write-Combining (WC)     PWT=1, PCD=0 (VMM_FLAG_WC)
PA2         2       UC- (weak uncacheable)   PWT=0, PCD=1 (VMM_FLAG_UCMINUS)
PA3         3       Uncacheable (UC)         PWT=1, PCD=1 (PCIe ECAM)

The framebuffer is mapped with VMM_FLAG_WC for write-combining performance. PCIe ECAM configuration space uses strong UC (PA3) with PWT|PCD set in the PTE.

See Also