Scheduler
Preemptive round-robin task scheduler with per-CPU run queues, wait queues, and cooperative blocking
Aegis uses a preemptive round-robin scheduler driven by the PIT timer at 100 Hz. All tasks – kernel threads and user processes alike – share the same scheduling infrastructure through the aegis_task_t control block. The scheduler is implemented in kernel/sched/sched.c with architecture-specific context switching in kernel/arch/x86_64/ctx_switch.asm (x86-64) and kernel/arch/arm64/ctx_switch.S (AArch64).
v1 maturity note: Aegis is v1 software – the first version deemed ready for public release, not a mature or production-hardened system. The scheduler has been tested under the workloads Aegis currently supports, but it has not been subjected to adversarial stress testing. There are likely exploitable race conditions or resource exhaustion paths in this code, as would be expected in any from-scratch C kernel at this stage. A gradual migration from C to Rust is planned, starting with the kernel; the capability system (kernel/cap/) is already in Rust and represents the beginning of this path. Contributions are welcome – file issues or propose changes at exec/aegis.
Task Control Block
Every schedulable entity in Aegis is represented by an aegis_task_t. User processes extend this with aegis_process_t (which must have task at offset 0 for safe casting).
typedef struct aegis_task_t {
    uint64_t sp;                    /* MUST be first -- ctx_switch reads [rdi+0] */
    uint8_t *stack_base;            /* bottom of kva-allocated stack */
    uint64_t kernel_stack_top;      /* RSP0 value for this task */
    uint32_t tid;                   /* task ID */
    uint8_t is_user;                /* 1 = user process, 0 = kernel task */
    uint64_t stack_pages;           /* kva pages allocated for kernel stack */
    uint32_t state;                 /* TASK_RUNNING / BLOCKED / ZOMBIE / STOPPED */
    uint32_t waiting_for;           /* PID this task waits for; 0 = any child */
    uint64_t fs_base;               /* per-thread TLS base (IA32_FS_BASE / TPIDR_EL0) */
    uint64_t clear_child_tid;       /* user VA: write 0 + futex_wake on exit */
    uint64_t sleep_deadline;        /* PIT tick when nanosleep expires; 0 = not sleeping */
    int read_nonblock;              /* 1 = current sys_read is O_NONBLOCK */
    struct aegis_task_t *next;      /* circular linked list (all tasks) */
    struct aegis_task_t *next_run;  /* RUNNING-only run queue (doubly-linked) */
    struct aegis_task_t *prev_run;
    struct aegis_task_t *wq_next;   /* wait-queue linkage */
} aegis_task_t;
The sp field must be at offset 0 – this is enforced by a _Static_assert and is a hard requirement of the context switch assembly which loads/stores the stack pointer at [rdi+0] / [rsi+0].
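The check can be sketched as follows (a minimal sketch with a trimmed-down mirror of the struct; the assert message is illustrative, not the kernel's exact wording):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Trimmed-down mirror of aegis_task_t: only the fields relevant here. */
typedef struct aegis_task_t {
    uint64_t sp;          /* MUST stay at offset 0 */
    uint8_t *stack_base;
    uint32_t tid;
} aegis_task_t;

/* Fails the build if sp ever moves, instead of corrupting state at runtime
 * when ctx_switch dereferences [rdi+0] / [rsi+0]. */
_Static_assert(offsetof(aegis_task_t, sp) == 0,
               "ctx_switch assumes sp at [rdi+0]");
```

Catching the violation at compile time is the point: a field reordering would otherwise only surface as a crash on the first context switch.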
Task States
                +-----------+
   spawn ------>|  RUNNING  |<------ sched_wake / sched_resume
                +-----------+
                 |    |    |
     sched_block |    |    | sched_stop
                 v    |    v
         +---------+  |  +---------+
         | BLOCKED |  |  | STOPPED |
         +---------+  |  +---------+
                      |
            sys_exit  |
                      v
                +---------+
                | ZOMBIE  |   (awaiting waitpid reap)
                +---------+
| State | Value | Description |
|---|---|---|
| TASK_RUNNING | 0 | Eligible for scheduling; in the run queue |
| TASK_BLOCKED | 1 | Waiting on I/O, sleep, or wait queue; removed from run queue |
| TASK_ZOMBIE | 2 | Exited but not yet reaped by parent via waitpid |
| TASK_STOPPED | 3 | Stopped by signal (SIGSTOP/SIGTSTP); resumed by SIGCONT |
Data Structures
Full Task List
All tasks (regardless of state) are linked in a circular singly-linked list through the next pointer. This list is used for:
- PID lookup (proc_find_by_pid)
- Sleep deadline scanning in sched_tick
- Shutdown detection (scanning for live user tasks)
- Orphan reparenting on process exit
Run Queue
A separate circular doubly-linked list threads through next_run / prev_run, containing only tasks with state == TASK_RUNNING. This is anchored by a static sentinel node:
static aegis_task_t s_run_sentinel = {
    .tid = 0xFFFFFFFF,
    .state = TASK_BLOCKED,          /* never picked by iteration */
    .next_run = &s_run_sentinel,
    .prev_run = &s_run_sentinel,
};
The sentinel is always in the list (never removed), so insertion and iteration never need null checks. sched_tick skips it by testing t == &s_run_sentinel.
Invariant: A task is in the run list if and only if state == TASK_RUNNING && next_run != NULL. next_run == NULL means the task is not in the run list.
This separation means sched_tick runs in O(R) time where R is the number of runnable tasks, rather than O(N) over all tasks including blocked and zombie ones.
Run Queue Operations
- run_list_insert_locked(task) – Insert at tail (before sentinel). Idempotent.
- run_list_remove_locked(task) – Unlink from list. Idempotent.
- run_list_next_locked(cur) – Return next RUNNING task after cur, skipping sentinel.
All three require sched_lock to be held.
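The three operations can be sketched as follows (a minimal sketch under the stated invariant: the node type is reduced to the run-list fields, and the locking that the real functions require is omitted):

```c
#include <stddef.h>
#include <stdint.h>

#define TASK_RUNNING 0
#define TASK_BLOCKED 1

typedef struct task {
    uint32_t state;
    struct task *next_run, *prev_run;
} task_t;

/* Sentinel is always linked, so no NULL checks are needed below. */
static task_t s_run_sentinel = {
    .state = TASK_BLOCKED,              /* never picked by iteration */
    .next_run = &s_run_sentinel,
    .prev_run = &s_run_sentinel,
};

/* Insert at tail (just before the sentinel). Idempotent. */
static void run_list_insert_locked(task_t *t) {
    if (t->next_run) return;            /* already in the list */
    t->prev_run = s_run_sentinel.prev_run;
    t->next_run = &s_run_sentinel;
    s_run_sentinel.prev_run->next_run = t;
    s_run_sentinel.prev_run = t;
}

/* Unlink. Idempotent: next_run == NULL means "not in the run list". */
static void run_list_remove_locked(task_t *t) {
    if (!t->next_run) return;
    t->prev_run->next_run = t->next_run;
    t->next_run->prev_run = t->prev_run;
    t->next_run = t->prev_run = NULL;
}

/* Next RUNNING task after cur, skipping the sentinel; may wrap back to cur. */
static task_t *run_list_next_locked(task_t *cur) {
    task_t *t = (cur && cur->next_run) ? cur->next_run
                                       : s_run_sentinel.next_run;
    for (task_t *stop = t;;) {
        if (t != &s_run_sentinel && t->state == TASK_RUNNING)
            return t;
        t = t->next_run;
        if (t == stop) return NULL;     /* full circle: nothing runnable */
    }
}
```

Because insertion is at the tail and iteration resumes after the current task, the sketch preserves round-robin order without any special cases for an empty list.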
Synchronization
The scheduler uses a single global ticket spinlock (sched_lock) that provides FIFO ordering and prevents starvation. While the locking discipline has been audited for correctness on the current single-core and limited-SMP configurations, this is v1 C code – there are likely subtle concurrency bugs that would surface under heavier SMP workloads or adversarial scheduling patterns. The planned Rust migration will allow the type system to enforce lock ordering and prevent data races at compile time.
spinlock_t sched_lock = SPINLOCK_INIT;
typedef struct {
    volatile uint16_t owner;
    volatile uint16_t next;
} spinlock_t;
All run queue mutations and state transitions acquire sched_lock with IRQ-save semantics (spin_lock_irqsave) to prevent deadlock from timer interrupts firing while the lock is held.
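The ticket discipline itself can be sketched as follows (a minimal sketch of the owner/next scheme above; the IRQ-save wrapper and the spin-wait hint are elided, and the function names here are illustrative rather than the kernel's exact API):

```c
#include <stdint.h>

typedef struct {
    volatile uint16_t owner;   /* ticket currently being served */
    volatile uint16_t next;    /* next ticket to hand out */
} spinlock_t;

#define SPINLOCK_INIT { 0, 0 }

/* Take a ticket, then spin until it is being served. FIFO by construction:
 * tickets are served in the order they were fetched, so no waiter starves. */
static void spin_lock(spinlock_t *l) {
    uint16_t ticket = __atomic_fetch_add(&l->next, 1, __ATOMIC_RELAXED);
    while (__atomic_load_n(&l->owner, __ATOMIC_ACQUIRE) != ticket)
        ;   /* real code would issue PAUSE/YIELD here to ease the spin */
}

/* Hand the lock to the next ticket holder. */
static void spin_unlock(spinlock_t *l) {
    __atomic_store_n(&l->owner, (uint16_t)(l->owner + 1), __ATOMIC_RELEASE);
}
```

The acquire load on owner pairs with the release store in unlock, so everything written inside the critical section is visible to the next holder.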
State transitions use __atomic_store_n with __ATOMIC_RELEASE ordering to ensure visibility across CPUs:
__atomic_store_n(&task->state, TASK_RUNNING, __ATOMIC_RELEASE);
Per-CPU State
Each CPU has a percpu_t structure accessible via the GS segment register:
typedef struct percpu {
    struct percpu *self;               /* gs:0  */
    uint8_t cpu_id;                    /* gs:8  */
    uint8_t lapic_id;                  /* gs:9  */
    uint8_t _pad[6];                   /* gs:10 */
    struct aegis_task_t *current_task; /* gs:16 */
    uint64_t kernel_stack;             /* gs:24 */
    uint64_t user_rsp_scratch;         /* gs:32 */
    uint64_t ticks;                    /* gs:40 */
    void *prev_dying_tcb;              /* gs:48 */
    void *prev_dying_stack;            /* gs:56 */
    uint64_t prev_dying_stack_pages;   /* gs:64 */
} percpu_t;
sched_current() reads gs:16 directly with inline assembly:
static inline aegis_task_t *sched_current(void) {
    return percpu_current();    /* movq %%gs:16, %0 */
}
Context Switch
The context switch is implemented in assembly and saves/restores only callee-saved registers. The compiler handles caller-saved registers at call sites.
x86-64 (ctx_switch.asm)
ctx_switch:
    ; Save outgoing task's callee-saved registers
    push rbx
    push rbp
    push r12
    push r13
    push r14
    push r15
    mov [rdi], rsp      ; current->sp = rsp (sp at offset 0)
    mov rsp, [rsi]      ; rsp = next->sp
    ; Restore incoming task's callee-saved registers
    pop r15
    pop r14
    pop r13
    pop r12
    pop rbp
    pop rbx
    ret                 ; pops saved RIP -> jumps to new task
ARM64 (ctx_switch.S)
The ARM64 version saves/restores 6 register pairs (x19-x30) via stp/ldp and uses x30 (link register) as the return address.
New Task Entry
sched_spawn constructs an initial stack frame that looks as if ctx_switch had already run:
x86-64 initial stack (low to high):
[r15=0][r14=0][r13=0][r12=0][rbp=0][rbx=0][fn]
                                           ^-- ret pops this into RIP
The first ctx_switch into the task pops zeros for all callee-saved registers, then ret jumps to the task’s entry function.
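Constructing that frame amounts to pushing seven quadwords onto the new stack. A minimal sketch (the function name build_initial_frame is illustrative; the real code lives inside sched_spawn):

```c
#include <stdint.h>

/* Lay out the 7 slots the first ctx_switch will pop, exactly as shown
 * above. 'top' points one past the highest usable stack address; the
 * returned pointer becomes task->sp. */
static uint64_t *build_initial_frame(uint64_t *top, void (*fn)(void)) {
    uint64_t *sp = top;
    *--sp = (uint64_t)fn;   /* ret pops this into RIP */
    for (int i = 0; i < 6; i++)
        *--sp = 0;          /* rbx, rbp, r12-r15 all start as zero */
    return sp;
}
```

When ctx_switch later loads this sp, its six pops consume the zeros and ret transfers control to fn, so a brand-new task is entered by the same code path as a resumed one.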
Scheduler Operations
sched_init()
Initializes the per-CPU current task pointer to NULL, resets TID and task counters, and initializes the run queue sentinel to point to itself.
sched_spawn(fn)
Creates a new kernel task:
- Allocates a TCB (1 KVA page)
- Allocates a stack: STACK_PAGES (4) usable pages plus one guard page at the bottom. The guard page is unmapped and its physical frame freed – stack overflow triggers a page fault instead of silent corruption
- Constructs the initial stack frame matching ctx_switch pop order
- Inserts the task into both the full circular list and the run queue
- Assigns a monotonically increasing TID
sched_start()
Begins scheduling. Sets s_sched_ready = 1 (which gates sched_tick) and performs a one-way context switch into the first task:
aegis_task_t dummy;
ctx_switch(&dummy, first);   /* never returns */
The dummy TCB receives the abandoned stack pointer. Each task enables interrupts on entry via the architecture-specific interrupt-enable primitive.
sched_tick()
Called by the PIT handler (IRQ0, vector 0x20) on every timer tick. This is the core preemption path:
- Guard check: Returns immediately if !s_sched_ready or no current task
- Sleep wake: Iterates the full task list to find blocked tasks whose sleep_deadline has passed, waking them by flipping state to RUNNING and inserting into the run queue
- Round-robin advance: Calls run_list_next_locked(old) to find the next runnable task in the run queue
- Context switch: Updates per-CPU state (current task, TSS RSP0, FS.base), then calls ctx_switch(old, next)
- Post-switch restore: After ctx_switch returns (in the resumed task’s context), restores FS.base for TLS correctness
CR3 policy: sched_tick does not switch CR3 to the incoming user task’s PML4. The switch happens via proc_enter_user (first entry) or isr_common_stub’s saved-CR3 restore (subsequent preemptions). This is critical because sched_tick runs on the outgoing task’s kernel stack – switching CR3 mid-function would use the wrong address space.
sched_block()
Voluntarily yields the CPU when a task must wait:
- Captures old->next_run before removal (preserves round-robin order)
- Sets state = TASK_BLOCKED and removes from run queue
- The task remains in the full circular list for PID lookup
- Switches to the next runnable task
- On resume: restores CR3 and FS.base
sched_wake(task) / sched_wake_locked(task)
Transitions a blocked task back to RUNNING:
__atomic_store_n(&task->state, TASK_RUNNING, __ATOMIC_RELEASE);
run_list_insert_locked(task);
sched_wake acquires sched_lock internally. sched_wake_locked is for callers that already hold the lock (e.g., sched_exit -> signal_send_pid_locked).
sched_stop(task) / sched_resume(task)
Handle SIGSTOP/SIGCONT semantics:
- sched_stop: If stopping self, mirrors sched_block with TASK_STOPPED. If stopping another task, just flips state and removes from run queue.
- sched_resume: Mirrors sched_wake – flips state to RUNNING and inserts into run queue. Works for both STOPPED and BLOCKED tasks (SIGCONT while blocked on a read returns EINTR).
sched_exit()
Handles task termination with two distinct paths:
User process exit:
- Performs deferred cleanup of the previous dying task (per-CPU prev_dying_tcb)
- Switches to master PML4
- Releases the shared fd table (must happen before zombie transition for pipe correctness)
- Marks self as ZOMBIE and removes from run queue
- Sends SIGCHLD to parent (using the _locked variant since sched_lock is held)
- Searches for a woken parent to switch to directly (avoids PIT dependency)
- If no woken parent, calls sched_yield_to_next
- Checks for system shutdown (no remaining live user tasks)

Kernel task exit:
- Removes from both circular list and run queue
- Records the dying task for deferred cleanup (TCB and stack freed at the next sched_exit call on this CPU)
- Context-switches to next task
The deferred cleanup pattern exists because ctx_switch writes to dying->sp – the TCB and stack must remain valid until the stack pointer has fully switched.
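The pattern can be sketched as follows (a minimal sketch: the per-CPU slots are plain globals here, and kva_free_stub stands in for the kernel's KVA allocator, which is an assumption of this sketch):

```c
#include <stddef.h>

/* Per-CPU deferred-cleanup slots (plain globals for illustration). */
static void *prev_dying_tcb;
static void *prev_dying_stack;

/* Hypothetical free routine standing in for the kernel's KVA allocator. */
static int freed_count;
static void kva_free_stub(void *p) { if (p) freed_count++; }

/* On task exit: free whatever the *previous* exit on this CPU left behind,
 * then record ourselves. Our own TCB and stack must stay live because the
 * upcoming ctx_switch still writes dying->sp and runs on this stack. */
static void defer_cleanup(void *tcb, void *stack) {
    kva_free_stub(prev_dying_tcb);
    kva_free_stub(prev_dying_stack);
    prev_dying_tcb = tcb;
    prev_dying_stack = stack;
    /* ...ctx_switch to the next task happens after this point... */
}
```

Each exit therefore reclaims the resources of the exit before it, trading one task's worth of delayed reclamation for never freeing memory that is still in use.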
Wait Queues
The waitq_t abstraction (kernel/sched/waitq.h, kernel/sched/waitq.c) provides a reusable blocking primitive. Internally it is a LIFO (push/pop at head) — wake_one returns the newest waiter, and wake_all drains in arrival order by reversing the list before walking it. For current workloads (single waiter per point, plus epoll which wakes everyone) LIFO-vs-FIFO is irrelevant; if a future caller needs FIFO the struct layout supports adding a cached tail pointer.
typedef struct waitq {
    struct aegis_task_t *head;
    spinlock_t lock;
} waitq_t;

#define WAITQ_INIT { .head = NULL, .lock = SPINLOCK_INIT }
Operations
| Function | Description |
|---|---|
| waitq_wait(wq) | Enqueue current task, then sched_block() |
| waitq_wake_one(wq) | Dequeue the newest waiter (LIFO) and sched_wake() it |
| waitq_wake_all(wq) | Wake all waiters in arrival (FIFO) order |
Canonical Usage Pattern
/* Consumer side */
for (;;) {
    if (condition_is_met(state))
        break;
    waitq_wait(&state->wq);
}

/* Producer side */
arrange_condition(state);
waitq_wake_one(&state->wq);
Implementation Details
- Internally a singly-linked list through aegis_task_t::wq_next
- waitq_wait pushes to head (LIFO within the list)
- wake_one pops from head (newest waiter – strictly LIFO, not FIFO)
- wake_all drains the list in arrival order by first reversing it
- Lock hierarchy: sched_lock is above wq->lock – sched_wake is called outside wq->lock to respect this
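The list mechanics can be sketched as follows (a minimal sketch: the task is reduced to its wq_next linkage, and the lock plus the sched_block/sched_wake calls are omitted; the wq_* helper names are illustrative):

```c
#include <stddef.h>
#include <stdint.h>

typedef struct task {
    uint32_t tid;
    struct task *wq_next;
} task_t;

typedef struct waitq { task_t *head; } waitq_t;

/* waitq_wait side: push at head -- O(1), which makes the list LIFO. */
static void wq_push(waitq_t *wq, task_t *t) {
    t->wq_next = wq->head;
    wq->head = t;
}

/* wake_one side: pop the head -- the *newest* waiter. */
static task_t *wq_pop(waitq_t *wq) {
    task_t *t = wq->head;
    if (t) { wq->head = t->wq_next; t->wq_next = NULL; }
    return t;
}

/* wake_all side: reverse in place first, so waiters come out in
 * arrival (FIFO) order. Returns the oldest waiter; caller walks the
 * chain and wakes each task. */
static task_t *wq_drain_fifo(waitq_t *wq) {
    task_t *rev = NULL;
    while (wq->head) {
        task_t *t = wq->head;
        wq->head = t->wq_next;
        t->wq_next = rev;
        rev = t;
    }
    return rev;
}
```

The reversal costs O(n) only on wake_all, keeping the common wait and wake_one paths at O(1) without a tail pointer.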
Migration Status
New blocking I/O paths should use waitq_t. Existing subsystems (pipe, tty, socket, unix_socket, epoll, futex, usb_mouse) retain their single-waiter patterns pending individual audit and migration.
Stack Layout
All tasks (kernel and user) get a 16 KB kernel stack (4 pages). This size accommodates deep call chains through the PIT ISR into virtio-net, IP, and TCP receive processing.
Kernel task stacks include an unmapped guard page at the bottom. Stack overflow triggers a page fault rather than silently corrupting adjacent KVA allocations.
User Process Initial Kernel Stack
For user processes, proc_spawn constructs an initial kernel stack frame that chains through ctx_switch into proc_enter_user (which performs CR3 switch + iretq/ERET to ring 3):
x86-64 user process initial stack (low to high):
[r15][r14][r13][r12][rbp][rbx] <- ctx_switch callee-saves
[proc_enter_user] <- ret target
[pml4_phys] <- popped by proc_enter_user
[entry_rip][CS=0x2B][RFLAGS=0x202][user_RSP][SS=0x23] <- iretq frame
Timer Integration
The scheduler is driven by the Programmable Interval Timer (PIT) configured at 100 Hz (10 ms tick interval):
- PIT fires IRQ0 (vector 0x20)
- pit_handler() increments the tick counter
- pit_handler() calls sched_tick()
- sched_tick() performs round-robin preemption
The s_sched_ready flag prevents sched_tick from performing context switches before sched_start() has been called, since PIT interrupts may fire during early boot before any tasks exist.
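The 100 Hz tick also defines sleep granularity: a nanosleep deadline is expressed in ticks, as in the sketch below (the helper name sleep_deadline_ticks is illustrative; the rounding-up behavior is an assumption consistent with nanosleep never returning early):

```c
#include <stdint.h>

#define PIT_HZ      100ULL
#define NS_PER_TICK (1000000000ULL / PIT_HZ)   /* 10 ms per tick */

/* Convert a requested sleep into an absolute tick deadline, rounding up
 * so the task never wakes before the requested time has fully elapsed.
 * The result is what sched_tick compares against sleep_deadline. */
static uint64_t sleep_deadline_ticks(uint64_t now_ticks, uint64_t ns) {
    uint64_t ticks = (ns + NS_PER_TICK - 1) / NS_PER_TICK;
    return now_ticks + ticks;
}
```

At 100 Hz the shortest observable sleep is therefore one full tick (10 ms), which bounds timer resolution for everything built on nanosleep.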
Related Documentation
- Processes & ELF Loading – process creation and ELF binary loading
- Interrupts & Exceptions – PIT timer and ISR dispatch
- Memory Management – KVA page allocation for stacks and TCBs
- Syscall Interface – sys_exit, sys_clone, sys_waitpid