Memory Management
Physical memory allocator, virtual memory manager, kernel virtual allocator, and per-process VMA tracking
Aegis implements a three-layer memory management architecture: a physical page allocator (PMM), a 4-level virtual memory manager (VMM), and a kernel virtual address allocator (KVA). User processes additionally track their virtual memory areas (VMAs) for mmap/mprotect/munmap support.
v1 maturity notice. The memory management subsystem is written entirely in C and represents v1 quality – functional and tested, but not battle-hardened. The PMM uses a simple bitmap allocator (O(n) scan), the VMM has a two-slot mapped-window allocator protected by a global spinlock, and there is no fault recovery table (extable) for user-memory access. These are deliberate v1 trade-offs, not permanent design decisions. As a from-scratch C kernel, there are likely exploitable vulnerabilities in the memory management code, as would be expected at this stage. A gradual Rust migration is planned for safety-critical kernel subsystems; kernel/cap/ is already in Rust. Contributions are welcome – file issues or propose changes at exec/aegis.
Architecture Overview
+-------------------+
| User processes | VMA tracking (mmap, brk, stack)
+-------------------+
| VMM | 4-level page tables (PML4 -> PDPT -> PD -> PT)
+-------------------+
| KVA | Kernel virtual address bump allocator
+-------------------+
| PMM | Bitmap physical page allocator
+-------------------+
| Physical RAM | Reported by multiboot2, managed as 4KB frames
+-------------------+
Initialization order in kernel_main:
1. `arch_mm_init(mb_info)` – Parse multiboot2 memory map
2. `pmm_init()` – Build physical page bitmap
3. `vmm_init()` – Build kernel page tables, activate paging
4. `kva_init()` – Initialize kernel VA bump allocator
Physical Memory Manager (PMM)
Source: kernel/mm/pmm.c, kernel/mm/pmm.h
Design
The PMM uses a bitmap allocator covering 4GB of physical address space. Each bit represents one 4KB page:
| Parameter | Value |
|---|---|
| Page size | 4096 bytes (PAGE_SIZE) |
| Maximum pages | 1,048,576 (4GB / 4KB) |
| Bitmap size | 128KB (1M bits / 8) |
| Bitmap location | BSS (.bss section) |
| Bit semantics | 0 = free, 1 = allocated |
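The bitmap bookkeeping reduces to shift-and-mask bit arithmetic over page frame numbers. A minimal sketch (helper names are illustrative, not the actual pmm.c symbols):

```c
#include <stdint.h>

#define PAGE_SIZE 4096ULL
#define MAX_PAGES (1ULL << 20)            /* 4GB / 4KB */

static uint8_t bitmap[MAX_PAGES / 8];     /* 128KB, lives in .bss */

/* Mark the frame containing 'addr' as allocated (bit = 1). */
static void bm_set(uint64_t addr) {
    uint64_t pfn = addr / PAGE_SIZE;
    bitmap[pfn / 8] |= (uint8_t)(1u << (pfn % 8));
}

/* Mark the frame as free (bit = 0). */
static void bm_clear(uint64_t addr) {
    uint64_t pfn = addr / PAGE_SIZE;
    bitmap[pfn / 8] &= (uint8_t)~(1u << (pfn % 8));
}

/* Query the frame's bit: 1 = allocated, 0 = free. */
static int bm_test(uint64_t addr) {
    uint64_t pfn = addr / PAGE_SIZE;
    return (bitmap[pfn / 8] >> (pfn % 8)) & 1;
}
```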
Initialization
pmm_init() follows a five-step process:
- Mark everything reserved – Fill bitmap with `0xFF` (safe default)
- Free usable RAM – Walk `arch_mm_get_regions()` (multiboot2 type=1 entries), clear bits for each usable page
- Re-reserve platform ranges – Walk `arch_mm_get_reserved_regions()`:
  - First 1MB (BIOS data, VGA hole, ISA ROMs)
  - Multiboot2 info structure (may be above 1MB)
  - GRUB modules (rootfs, ESP image)
- Reserve kernel image – `[ARCH_KERNEL_PHYS_BASE, _kernel_end - KERN_VMA)`, covering `.text` through `.bss` (including the bitmap itself)
- Report – Print total usable MB across N regions
Allocation API
uint64_t pmm_alloc_page(void); // Returns physical address, 0 on OOM
void pmm_free_page(uint64_t addr);
void pmm_ref_page(uint64_t addr); // Increment refcount (COW fork)
uint64_t pmm_total_pages(void);
uint64_t pmm_free_pages(void);
pmm_alloc_page() performs a linear scan for the first byte != 0xFF, then finds the first clear bit. O(n) where n = bitmap size. Single-page allocation only – multi-page contiguous allocation is deferred to a future buddy-allocator upgrade.
All PMM operations are protected by pmm_lock (spinlock with IRQ save/restore).
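The scan above can be sketched as follows – simplified, with no locking and a much smaller map; the byte-skip fast path on `0xFF` is the only optimization (names are illustrative):

```c
#include <stdint.h>

#define PAGE_SIZE 4096ULL
#define MAX_PAGES 1024ULL                  /* tiny map for illustration */

static uint8_t bitmap[MAX_PAGES / 8];

/* Linear first-fit: skip fully-allocated bytes (0xFF), then take the
 * first clear bit. Returns a physical address, or 0 on OOM (frame 0
 * is never free in practice: the real PMM reserves the first 1MB). */
static uint64_t pmm_alloc_page_sketch(void) {
    for (uint64_t byte = 0; byte < MAX_PAGES / 8; byte++) {
        if (bitmap[byte] == 0xFF)
            continue;                      /* all 8 frames taken */
        for (int bit = 0; bit < 8; bit++) {
            if (!(bitmap[byte] & (1u << bit))) {
                bitmap[byte] |= (uint8_t)(1u << bit);
                return (byte * 8 + (uint64_t)bit) * PAGE_SIZE;
            }
        }
    }
    return 0;                              /* out of memory */
}
```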
Reference Counting
Every allocated page has an 8-bit reference count (pmm_refcount[], 1MB in BSS). This supports copy-on-write (COW) fork:
| Operation | Refcount effect |
|---|---|
| `pmm_alloc_page()` | Set to 1 |
| `pmm_ref_page()` | Increment (panics on overflow at 255) |
| `pmm_free_page()` | Decrement; free page only when refcount reaches 0 |
Pages outside the PMM-managed range (e.g., MMIO addresses like framebuffer physical addresses) are silently skipped by pmm_free_page().
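The refcount-aware free path can be sketched like this (illustrative names; the real code also takes `pmm_lock` and panics on refcount overflow in `ref_page`):

```c
#include <stdint.h>

#define PAGE_SIZE 4096ULL
#define MAX_PAGES 1024ULL

static uint8_t bitmap[MAX_PAGES / 8];
static uint8_t refcount[MAX_PAGES];        /* 8-bit counts, as in pmm_refcount[] */

static void ref_page(uint64_t addr) {
    refcount[addr / PAGE_SIZE]++;          /* real code panics at 255 */
}

/* Decrement the refcount; only clear the bitmap bit (truly free the
 * frame) once the count reaches zero. Out-of-range addresses (MMIO
 * such as framebuffer physical pages) are silently skipped. */
static void free_page(uint64_t addr) {
    uint64_t pfn = addr / PAGE_SIZE;
    if (pfn >= MAX_PAGES)
        return;
    if (refcount[pfn] > 0 && --refcount[pfn] == 0)
        bitmap[pfn / 8] &= (uint8_t)~(1u << (pfn % 8));
}
```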
Virtual Memory Manager (VMM)
Source: kernel/mm/vmm.c, kernel/mm/vmm.h
x86-64 Page Table Structure
Aegis uses the standard 4-level x86-64 page table hierarchy:
Virtual Address (48-bit canonical):
+--------+--------+--------+--------+----------+
| PML4 | PDPT | PD | PT | Offset |
| [47:39]| [38:30]| [29:21]| [20:12]| [11:0] |
| 9 bits | 9 bits | 9 bits | 9 bits | 12 bits |
+--------+--------+--------+--------+----------+
PML4 [512 entries]
|
+--[0]----> pdpt_lo [512 entries] (identity map, removed after boot)
| +--[0]----> pd_lo [512 entries]
| +--[0..511] -> 512 x 2MB huge pages
|
+--[511]--> pdpt_hi [512 entries] (higher-half kernel)
+--[510]--> pd_hi [512 entries]
+--[0] -> 2MB huge: PA 0x000000 (kernel .text)
+--[1] -> 2MB huge: PA 0x200000 (kernel cont.)
+--[2] -> 2MB huge: PA 0x400000 (kernel BSS)
+--[3] -> PT: mapped-window allocator
+--[4+] -> KVA 4KB pages
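The four 9-bit table indices and the 12-bit page offset fall out of shifts and masks on the virtual address. A minimal sketch (macro names are illustrative); note that `VMM_WINDOW_VA` (0xFFFFFFFF80600000) lands at PML4[511], PDPT[510], PD[3] – matching the pd_hi[3] slot in the diagram above:

```c
#include <stdint.h>

/* Extract each level's 9-bit index and the final 12-bit offset. */
#define PML4_IDX(va) (((va) >> 39) & 0x1FF)
#define PDPT_IDX(va) (((va) >> 30) & 0x1FF)
#define PD_IDX(va)   (((va) >> 21) & 0x1FF)
#define PT_IDX(va)   (((va) >> 12) & 0x1FF)
#define PAGE_OFF(va) ((va) & 0xFFFULL)
```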
Page Table Entry Flags
Abstract flags defined in vmm.h (translated to hardware PTE bits by arch_pte_from_flags()):
| Flag | Bit | Purpose |
|---|---|---|
| `VMM_FLAG_PRESENT` | 0 | Page is present in memory |
| `VMM_FLAG_WRITABLE` | 1 | Page is writable |
| `VMM_FLAG_USER` | 2 | Page is accessible from ring 3 |
| `VMM_FLAG_WC` | 3 | Write-Combining cache (PWT bit, PAT entry 1) |
| `VMM_FLAG_UCMINUS` | 4 | Uncacheable-minus (PCD bit, PAT entry 2) |
| `VMM_FLAG_COW` | 9 | Copy-on-write marker (OS-available PTE bit) |
| `VMM_FLAG_NX` | 63 | No-execute (requires EFER.NXE) |
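A sketch of how the abstract flags might translate to hardware PTE bits. The actual `arch_pte_from_flags()` lives in arch code; here the bit positions simply follow the table above, so the mapping is near-identity:

```c
#include <stdint.h>

/* Abstract flags (values per the table above). */
#define VMM_FLAG_PRESENT  (1ULL << 0)
#define VMM_FLAG_WRITABLE (1ULL << 1)
#define VMM_FLAG_USER     (1ULL << 2)
#define VMM_FLAG_WC       (1ULL << 3)
#define VMM_FLAG_UCMINUS  (1ULL << 4)
#define VMM_FLAG_COW      (1ULL << 9)
#define VMM_FLAG_NX       (1ULL << 63)

/* Hardware x86-64 PTE bits. */
#define PTE_P    (1ULL << 0)
#define PTE_W    (1ULL << 1)
#define PTE_U    (1ULL << 2)
#define PTE_PWT  (1ULL << 3)   /* alone: PAT entry 1 (WC) */
#define PTE_PCD  (1ULL << 4)   /* alone: PAT entry 2 (UC-) */
#define PTE_AVL9 (1ULL << 9)   /* OS-available: COW marker */
#define PTE_NX   (1ULL << 63)  /* requires EFER.NXE */

static uint64_t pte_from_flags_sketch(uint64_t f) {
    uint64_t pte = 0;
    if (f & VMM_FLAG_PRESENT)  pte |= PTE_P;
    if (f & VMM_FLAG_WRITABLE) pte |= PTE_W;
    if (f & VMM_FLAG_USER)     pte |= PTE_U;
    if (f & VMM_FLAG_WC)       pte |= PTE_PWT;
    if (f & VMM_FLAG_UCMINUS)  pte |= PTE_PCD;
    if (f & VMM_FLAG_COW)      pte |= PTE_AVL9;
    if (f & VMM_FLAG_NX)       pte |= PTE_NX;
    return pte;
}
```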
Initialization (vmm_init)
vmm_init() builds a fresh set of page tables using the identity map, then switches CR3:
- Allocate 5 page-table pages via `alloc_table_early()` (uses identity map to zero pages)
- Identity map – PML4[0] -> pdpt_lo[0] -> pd_lo: 512 x 2MB huge pages covering [0..1GB)
- Higher-half map – PML4[511] -> pdpt_hi[510] -> pd_hi: 3 x 2MB huge pages for the kernel
- Install mapped-window PT into pd_hi[3] – a 4KB page table backing `VMM_WINDOW_VA` (0xFFFFFFFF80600000)
- Load CR3 with the new PML4 physical address
After this point, both identity and higher-half mappings are active. The identity map is torn down by vmm_teardown_identity() near the end of boot.
Mapped-Window Allocator
The VMM uses a “mapped window” to manipulate page tables without requiring an identity map. This is a fixed virtual address (VMM_WINDOW_VA) whose PTE can be pointed at any physical page:
void *vmm_window_map(uint64_t phys); // Map phys at VMM_WINDOW_VA
void vmm_window_unmap(void); // Clear the mapping
// Two window slots available:
// Slot 0: s_window_pt[0] -> VMM_WINDOW_VA
// Slot 1: s_window_pt[1] -> VMM_WINDOW_VA + 4096
The window PTE pointer (s_window_pte) is declared volatile to ensure writes reach memory before the subsequent invlpg instruction.
Lock ordering: vmm_window_lock > pmm_lock > kva_lock. Code holding vmm_window_lock may acquire pmm_lock, but never the reverse.
Page Mapping
void vmm_map_page(uint64_t virt, uint64_t phys, uint64_t flags);
Walks the 4-level page table using `ensure_table_phys()`, which:
- Maps the parent table via the window
- Checks if the entry at `idx` is present
- If absent: allocates a new page-table page, installs it with PRESENT|WRITABLE (plus USER for user tables)
- Returns the physical address of the child table

The leaf PTE is set to `phys | arch_pte_from_flags(flags | PRESENT)`. Double-mapping (leaf already present) panics.
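The shape of the walk can be shown in miniature, using heap-allocated arrays in place of physical frames (purely illustrative: the real code accesses each table through `vmm_window_map()` and stores physical addresses, not pointers):

```c
#include <stdint.h>
#include <stdlib.h>

#define PTE_P (1ULL << 0)
/* 9-bit index for walk level 0..3 (PML4, PDPT, PD, PT). */
#define ENT(va, lvl) (((va) >> (39 - 9 * (lvl))) & 0x1FF)

/* If the entry at idx is absent, allocate a zeroed 512-entry child
 * table and install it with the present bit; return the child. */
static uint64_t *ensure_table(uint64_t *parent, int idx) {
    if (!(parent[idx] & PTE_P)) {
        uint64_t *child = calloc(512, sizeof(uint64_t));
        parent[idx] = (uint64_t)(uintptr_t)child | PTE_P;
    }
    return (uint64_t *)(uintptr_t)(parent[idx] & ~1ULL);
}

static void map_page(uint64_t *pml4, uint64_t va, uint64_t pa) {
    uint64_t *t = pml4;
    for (int lvl = 0; lvl < 3; lvl++)
        t = ensure_table(t, (int)ENT(va, lvl));
    t[ENT(va, 3)] = pa | PTE_P;            /* set the leaf PTE */
}

/* VA -> PA lookup; returns 0 if any level is absent. */
static uint64_t phys_of(uint64_t *pml4, uint64_t va) {
    uint64_t *t = pml4;
    for (int lvl = 0; lvl < 3; lvl++) {
        if (!(t[ENT(va, lvl)] & PTE_P))
            return 0;
        t = (uint64_t *)(uintptr_t)(t[ENT(va, lvl)] & ~1ULL);
    }
    uint64_t pte = t[ENT(va, 3)];
    return (pte & PTE_P) ? (pte & ~0xFFFULL) : 0;
}
```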
User Address Spaces
Each user process has its own PML4:
uint64_t vmm_create_user_pml4(void); // New PML4, copies kernel entries [256..511]
void vmm_map_user_page(pml4, virt, phys, flags); // Map in user PML4
void vmm_switch_to(uint64_t pml4_phys); // Load CR3
void vmm_free_user_pml4(uint64_t pml4_phys); // Free user half + PT pages
User page table entries require VMM_FLAG_USER at every level of the walk (PML4e, PDPTe, PDe, PTe). The x86-64 MMU checks the USER bit at each level – a leaf with USER but an ancestor without causes a ring-3 #PF.
Copy-on-Write (COW)
The VMM exposes COW infrastructure for fork(), but the active fork path currently uses eager copy (vmm_copy_user_pages). COW was measured to be a net regression on the current workload (Aegis’s primary fork caller immediately execves), so activation is deferred until a workload motivates re-enabling it:
int vmm_copy_user_pages(uint64_t src_pml4, uint64_t dst_pml4); /* active */
int vmm_cow_user_pages(uint64_t src_pml4, uint64_t dst_pml4); /* infrastructure */
int vmm_cow_fault_handle(uint64_t pml4_phys, uint64_t fault_va); /* infrastructure */
`vmm_cow_user_pages()` (available, not currently called from sys_fork):
- Clears the W bit and sets `VMM_FLAG_COW` (PTE bit 9) on every writable user page in the parent
- Installs the same RO+COW mapping in the child (read-only pages are shared as-is)
- Skips MMIO pages (any PTE with `VMM_FLAG_WC` or `VMM_FLAG_UCMINUS`)
- Increments per-page refcounts via `pmm_ref_page()`
- Invalidates the parent’s TLB for each modified page
vmm_cow_fault_handle() (available but not wired into the page fault path): walks the PML4 to the leaf PTE, verifies VMM_FLAG_COW is set, allocates a fresh frame, copies the old contents via the two-window-slot mechanism, and updates the PTE with W set and COW cleared. Returns 0 (handled), -1 (not COW -> SIGSEGV), or -2 (OOM -> SIGBUS).
Because vmm_cow_fault_handle is not currently invoked from isr_dispatch, a write to a COW-marked page would panic the kernel in v1.
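The decision logic of the (currently unwired) handler can be sketched on a single leaf PTE. The frame allocation is stubbed with a callback and the two-window-slot copy is elided; the return codes follow the description above:

```c
#include <stdint.h>

#define PTE_P   (1ULL << 0)
#define PTE_W   (1ULL << 1)
#define PTE_COW (1ULL << 9)

/* Returns 0 handled, -1 not a COW fault, -2 OOM.
 * 'alloc' stands in for pmm_alloc_page(); the data copy is elided. */
static int cow_fault_sketch(uint64_t *pte, uint64_t (*alloc)(void)) {
    if (!(*pte & PTE_P) || !(*pte & PTE_COW))
        return -1;                 /* not COW -> caller raises SIGSEGV */
    uint64_t new_pa = alloc();
    if (new_pa == 0)
        return -2;                 /* OOM -> caller raises SIGBUS */
    /* real code: copy old frame -> new frame via the two window
     * slots, then drop a reference on the old frame */
    uint64_t flags = *pte & 0xFFFULL;
    *pte = new_pa | ((flags | PTE_W) & ~PTE_COW);
    return 0;
}

/* Allocation stubs for illustration. */
static uint64_t fake_alloc_ok(void)  { return 0x9000; }
static uint64_t fake_alloc_oom(void) { return 0; }
```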
Identity Map Teardown
vmm_teardown_identity() is called after all kernel objects are allocated via KVA:
void vmm_teardown_identity(void); // PML4[0] = 0, reload CR3
After this point, physical addresses below KERN_VMA are only accessible through the mapped-window allocator or KVA mappings.
Kernel Virtual Allocator (KVA)
Source: kernel/mm/kva.c, kernel/mm/kva.h
Design
KVA provides a bump allocator for kernel-mode virtual addresses starting at KVA_BASE (KERN_VMA + 0x800000, i.e., 0xFFFFFFFF80800000). Each allocation gets contiguous VA space backed by individually-allocated PMM pages.
void *kva_alloc_pages(uint64_t n); // Allocate n 4KB pages
void *kva_map_phys_pages(uint64_t phys, uint32_t n); // Map existing physical pages
void kva_free_pages(void *va, uint64_t n); // Unmap and free
uint64_t kva_page_phys(void *va); // VA -> PA lookup
Freelist
KVA maintains a fixed-size freelist (KVA_FREE_MAX = 128 entries) for recycling freed VA ranges. On allocation, the freelist is checked first (best-fit search); on miss, the bump cursor advances. On free, the range is inserted into the freelist with coalescing of adjacent entries.
Allocation path:
1. Try freelist (best-fit) -> return VA if hit
2. Bump s_kva_next forward by n * PAGE_SIZE
3. For each page: pmm_alloc_page() + vmm_map_page()
Free path:
1. For each page: vmm_phys_of() -> vmm_unmap_page() -> pmm_free_page()
2. Insert VA range into freelist (coalesce with neighbors)
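The freelist lookup in step 1 can be sketched as a best-fit search over a small fixed array (illustrative; the real allocator also coalesces adjacent entries on free):

```c
#include <stdint.h>

#define PAGE_SIZE 4096ULL
#define KVA_FREE_MAX 128

typedef struct {
    uint64_t va;       /* range start, 0 = empty slot */
    uint64_t npages;   /* range length in pages */
} kva_free_t;

static kva_free_t s_free[KVA_FREE_MAX];

/* Best fit: pick the smallest range that still holds n pages.
 * The allocation is carved from the front; leftover pages stay
 * listed in the same slot. Returns 0 on miss (caller bumps s_kva_next). */
static uint64_t freelist_take(uint64_t n) {
    int best = -1;
    for (int i = 0; i < KVA_FREE_MAX; i++) {
        if (s_free[i].npages >= n &&
            (best < 0 || s_free[i].npages < s_free[best].npages))
            best = i;
    }
    if (best < 0)
        return 0;
    uint64_t va = s_free[best].va;
    s_free[best].va += n * PAGE_SIZE;
    s_free[best].npages -= n;
    return va;
}
```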
KVA pages are mapped without VMM_FLAG_USER, so the MMU denies ring-3 access to all kernel objects (TCBs, kernel stacks, driver buffers).
All KVA operations are protected by kva_lock (spinlock with IRQ save/restore).
Per-Process VMA Tracking
Source: kernel/mm/vma.c, kernel/mm/vma.h
Design
Each process has a sorted array of vma_entry_t structures tracking its virtual memory regions. The table is allocated as a single KVA page (4096 bytes / 24 bytes per entry = 170 entries max).
typedef struct {
uint64_t base; // Region start VA
uint64_t len; // Region length in bytes
uint32_t prot; // PROT_READ | PROT_WRITE | PROT_EXEC
uint8_t type; // VMA type constant
uint8_t _pad[3];
} vma_entry_t; // 24 bytes
VMA Types
| Constant | Value | Description |
|---|---|---|
| `VMA_NONE` | 0 | Untyped |
| `VMA_ELF_TEXT` | 1 | ELF PT_LOAD with PROT_EXEC |
| `VMA_ELF_DATA` | 2 | ELF PT_LOAD without PROT_EXEC |
| `VMA_HEAP` | 3 | [brk_base..brk] |
| `VMA_STACK` | 4 | User stack |
| `VMA_MMAP` | 5 | Anonymous mmap |
| `VMA_THREAD_STACK` | 6 | Thread stack via pthread_create |
| `VMA_GUARD` | 7 | Guard page (PROT_NONE) |
| `VMA_SHARED` | 8 | MAP_SHARED mapping (phys pages owned by memfd) |
Operations
void vma_init(struct aegis_process *proc); // Allocate table page
void vma_insert(proc, base, len, prot, type); // Insert with merge
void vma_remove(proc, base, len); // Remove with split
void vma_update_prot(proc, base, len, prot); // Change permissions with split
void vma_clear(struct aegis_process *proc); // Clear all entries (execve)
void vma_clone(dst, src); // Deep copy (fork)
void vma_share(child, parent); // Share table (CLONE_VM threads)
void vma_free(struct aegis_process *proc); // Decrement refcount, free if 0
Insert merges with adjacent entries if they have matching prot and type. Remove and update_prot split entries at region boundaries when partial overlap occurs.
The table supports reference counting for CLONE_VM threads: vma_share() increments the refcount and gives the child a pointer to the parent’s table; vma_free() only deallocates when the refcount reaches 0.
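The merge rule can be sketched with the adjacency test alone (illustrative helper names; the real `vma_insert()` additionally keeps the array sorted and handles boundary splits):

```c
#include <stdint.h>

typedef struct {
    uint64_t base;     /* region start VA */
    uint64_t len;      /* region length in bytes */
    uint32_t prot;     /* PROT_READ | PROT_WRITE | PROT_EXEC */
    uint8_t  type;     /* VMA type constant */
    uint8_t  _pad[3];
} vma_entry_t;         /* 24 bytes */

/* Two regions merge only if contiguous AND identical in prot/type. */
static int vma_can_merge(const vma_entry_t *a, const vma_entry_t *b) {
    return a->base + a->len == b->base &&
           a->prot == b->prot &&
           a->type == b->type;
}

/* Extend 'a' to absorb 'b' when mergeable; returns 1 on merge. */
static int vma_try_merge(vma_entry_t *a, const vma_entry_t *b) {
    if (!vma_can_merge(a, b))
        return 0;
    a->len += b->len;
    return 1;
}
```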
User-Kernel Memory Access
Source: kernel/mm/uaccess.h
Aegis uses SMAP (Supervisor Mode Access Prevention) to prevent the kernel from accidentally accessing user memory. Controlled access is gated by STAC/CLAC instructions:
static inline void copy_from_user(void *dst, const void *src, uint64_t len) {
arch_stac(); // Set RFLAGS.AC (permit user access)
__builtin_memcpy(dst, src, len);
arch_clac(); // Clear RFLAGS.AC (re-enable SMAP)
}
static inline void copy_to_user(void *dst, const void *src, uint64_t len) {
arch_stac();
__builtin_memcpy(dst, src, len);
arch_clac();
}
Callers must validate user pointers before calling these functions. There is no fault recovery table (Linux extable); the entire user range must be mapped. This is a known v1 limitation – a malformed user pointer that passes validation but crosses into an unmapped page will cause a kernel panic rather than returning -EFAULT. Real-world exploitation of this class of bug is plausible in the current C codebase.
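In practice, validating a user pointer means an overflow-safe range check against the user half of the address space before any STAC/CLAC section. A sketch under assumed conventions – the bound and the helper name are illustrative, not Aegis's actual check:

```c
#include <stdint.h>

/* Lower canonical half of x86-64: user VAs must end at or below this
 * (illustrative bound; the kernel's real limit may be tighter). */
#define USER_VA_MAX 0x0000800000000000ULL

/* Returns 1 if [ptr, ptr+len) lies entirely in user space.
 * The addition is checked for wraparound before the bound test. */
static int user_range_ok(uint64_t ptr, uint64_t len) {
    if (len == 0)
        return 1;
    uint64_t end;
    if (__builtin_add_overflow(ptr, len, &end))
        return 0;                          /* ptr + len wrapped around */
    return end <= USER_VA_MAX;
}
```

Note that this check only rejects kernel-half and wrapping pointers; without an extable, it cannot prove the range is actually mapped, which is exactly the v1 limitation described above.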
Cache Control
The PAT (Page Attribute Table) MSR is programmed by arch_pat_init() during early boot:
| PAT Entry | Index | Type | PTE Encoding |
|---|---|---|---|
| PA0 | 0 | Write-Back (WB) | PWT=0, PCD=0 (default) |
| PA1 | 1 | Write-Combining (WC) | PWT=1, PCD=0 (VMM_FLAG_WC) |
| PA2 | 2 | UC- (weak uncacheable) | PWT=0, PCD=1 (VMM_FLAG_UCMINUS) |
| PA3 | 3 | Uncacheable (UC) | PWT=1, PCD=1 (PCIe ECAM) |
The framebuffer is mapped with VMM_FLAG_WC for write-combining performance. PCIe ECAM configuration space uses strong UC (PA3) with PWT|PCD set in the PTE.
See Also
- Boot Process – Memory subsystem initialization order
- Interrupts & Exceptions – Page fault handling, CR3 switching in ISR
- Processes & ELF – Process address space layout, execve loading