Scheduler

Aegis uses a preemptive round-robin scheduler driven by the PIT at 100 Hz. All tasks – kernel threads and user processes alike – share the same scheduling infrastructure through the aegis_task_t control block. The scheduler is implemented in kernel/sched/sched.c with architecture-specific context switching in kernel/arch/x86_64/ctx_switch.asm (x86-64) and kernel/arch/arm64/ctx_switch.S (AArch64).

v1 maturity note: Aegis is v1 software – the first version deemed ready for public release, not a mature or production-hardened system. The scheduler has been tested under the workloads Aegis currently supports, but it has not been subjected to adversarial stress testing. There are likely exploitable race conditions or resource exhaustion paths in this code, as would be expected in any from-scratch C kernel at this stage. A gradual migration from C to Rust is planned, starting with the kernel; the capability system (kernel/cap/) is already in Rust and represents the beginning of this path. Contributions are welcome – file issues or propose changes at exec/aegis.

Task Control Block

Every schedulable entity in Aegis is represented by an aegis_task_t. User processes extend this with aegis_process_t (which must have task at offset 0 for safe casting).

typedef struct aegis_task_t {
    uint64_t             sp;               /* MUST be first -- ctx_switch reads [rdi+0] */
    uint8_t             *stack_base;       /* bottom of kva-allocated stack */
    uint64_t             kernel_stack_top; /* RSP0 value for this task */
    uint32_t             tid;              /* task ID */
    uint8_t              is_user;          /* 1 = user process, 0 = kernel task */
    uint64_t             stack_pages;      /* kva pages allocated for kernel stack */
    uint32_t             state;            /* TASK_RUNNING / BLOCKED / ZOMBIE / STOPPED */
    uint32_t             waiting_for;      /* PID this task waits for; 0 = any child */
    uint64_t             fs_base;          /* per-thread TLS base (IA32_FS_BASE / TPIDR_EL0) */
    uint64_t             clear_child_tid;  /* user VA: write 0 + futex_wake on exit */
    uint64_t             sleep_deadline;   /* PIT tick when nanosleep expires; 0 = not sleeping */
    int                  read_nonblock;    /* 1 = current sys_read is O_NONBLOCK */
    struct aegis_task_t *next;             /* circular linked list (all tasks) */
    struct aegis_task_t *next_run;         /* RUNNING-only run queue (doubly-linked) */
    struct aegis_task_t *prev_run;
    struct aegis_task_t *wq_next;          /* wait-queue linkage */
} aegis_task_t;

The sp field must be at offset 0 – this is enforced by a _Static_assert and is a hard requirement of the context switch assembly which loads/stores the stack pointer at [rdi+0] / [rsi+0].
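A minimal sketch of how such a compile-time guarantee is typically expressed (the exact assertion text in kernel/sched/sched.c may differ):

```c
#include <assert.h>   /* static_assert (C11) */
#include <stddef.h>   /* offsetof */
#include <stdint.h>

/* Illustrative subset of aegis_task_t -- field order as documented. */
typedef struct aegis_task_t {
    uint64_t  sp;          /* MUST be first -- ctx_switch reads [rdi+0] */
    uint8_t  *stack_base;
    /* ... remaining fields ... */
} aegis_task_t;

/* Fails the build if sp ever moves away from offset 0. */
static_assert(offsetof(aegis_task_t, sp) == 0,
              "ctx_switch assumes sp at [rdi+0]");
```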

Task States

                    +-----------+
           spawn -->| RUNNING   |<-- sched_wake / sched_resume
                    +-----------+
                     |    |    |
          sched_block|    |    |sched_stop
                     v    |    v
              +---------+ | +---------+
              | BLOCKED | | | STOPPED |
              +---------+ | +---------+
                          |
                 sys_exit |
                          v
                    +---------+
                    | ZOMBIE  |  (awaiting waitpid reap)
                    +---------+
State         Value  Description
TASK_RUNNING  0      Eligible for scheduling; in the run queue
TASK_BLOCKED  1      Waiting on I/O, sleep, or wait queue; removed from run queue
TASK_ZOMBIE   2      Exited but not yet reaped by parent via waitpid
TASK_STOPPED  3      Stopped by signal (SIGSTOP/SIGTSTP); resumed by SIGCONT

Data Structures

Full Task List

All tasks (regardless of state) are linked in a circular singly-linked list through the next pointer. This list is used for:

  • PID lookup (proc_find_by_pid)
  • Sleep deadline scanning in sched_tick
  • Shutdown detection (scanning for live user tasks)
  • Orphan reparenting on process exit

Run Queue

A separate circular doubly-linked list threads through next_run / prev_run, containing only tasks with state == TASK_RUNNING. This is anchored by a static sentinel node:

static aegis_task_t s_run_sentinel = {
    .tid      = 0xFFFFFFFF,
    .state    = TASK_BLOCKED,    /* never picked by iteration */
    .next_run = &s_run_sentinel,
    .prev_run = &s_run_sentinel,
};

The sentinel is always in the list (never removed), so insertion and iteration never need null checks. sched_tick skips it by testing t == &s_run_sentinel.

Invariant: A task is in the run list if and only if state == TASK_RUNNING && next_run != NULL. next_run == NULL means the task is not in the run list.

This separation means sched_tick runs in O(R) time where R is the number of runnable tasks, rather than O(N) over all tasks including blocked and zombie ones.

Run Queue Operations

run_list_insert_locked(task)   -- Insert at tail (before sentinel). Idempotent.
run_list_remove_locked(task)   -- Unlink from list. Idempotent.
run_list_next_locked(cur)      -- Return next RUNNING task after cur, skipping sentinel.

All three require sched_lock to be held.
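A userspace sketch of these sentinel-based operations, following the invariant above (locking elided; field and function names follow the text, but the real bodies in kernel/sched/sched.c may differ):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TASK_RUNNING 0
#define TASK_BLOCKED 1

typedef struct aegis_task_t {
    uint32_t state;
    struct aegis_task_t *next_run;
    struct aegis_task_t *prev_run;
} aegis_task_t;

static aegis_task_t s_run_sentinel = {
    .state    = TASK_BLOCKED,    /* never picked by iteration */
    .next_run = &s_run_sentinel,
    .prev_run = &s_run_sentinel,
};

/* Insert at tail (before sentinel). Idempotent: next_run != NULL means
 * the task is already in the list. */
static void run_list_insert_locked(aegis_task_t *t) {
    if (t->next_run)
        return;
    t->next_run = &s_run_sentinel;
    t->prev_run = s_run_sentinel.prev_run;
    s_run_sentinel.prev_run->next_run = t;
    s_run_sentinel.prev_run = t;
}

/* Unlink. Idempotent: next_run == NULL marks "not in list". */
static void run_list_remove_locked(aegis_task_t *t) {
    if (!t->next_run)
        return;
    t->prev_run->next_run = t->next_run;
    t->next_run->prev_run = t->prev_run;
    t->next_run = t->prev_run = NULL;
}

/* Next RUNNING task after cur, skipping the sentinel; NULL if none. */
static aegis_task_t *run_list_next_locked(aegis_task_t *cur) {
    aegis_task_t *start = cur->next_run ? cur->next_run
                                        : s_run_sentinel.next_run;
    aegis_task_t *t = start;
    do {
        if (t != &s_run_sentinel && t->state == TASK_RUNNING)
            return t;
        t = t->next_run;
    } while (t != start);
    return NULL;                 /* run queue empty */
}
```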

Synchronization

The scheduler uses a single global ticket spinlock (sched_lock) that provides FIFO ordering and prevents starvation. While the locking discipline has been audited for correctness on the current single-core and limited-SMP configurations, this is v1 C code – there are likely subtle concurrency bugs that would surface under heavier SMP workloads or adversarial scheduling patterns. The planned Rust migration will allow the type system to enforce lock ordering and prevent data races at compile time.

spinlock_t sched_lock = SPINLOCK_INIT;

typedef struct {
    volatile uint16_t owner;
    volatile uint16_t next;
} spinlock_t;

All run queue mutations and state transitions acquire sched_lock with IRQ-save semantics (spin_lock_irqsave) to prevent deadlock from timer interrupts firing while the lock is held.
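A userspace sketch of a ticket lock matching the struct above (the IRQ save/restore half of spin_lock_irqsave is architecture-specific and shown only as comments):

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
    volatile uint16_t owner;
    volatile uint16_t next;
} spinlock_t;

#define SPINLOCK_INIT { 0, 0 }

/* Take a ticket, then spin until owner reaches it -- FIFO by construction,
 * which is what prevents starvation. In the kernel the caller would first
 * save RFLAGS and cli (spin_lock_irqsave). */
static void spin_lock(spinlock_t *l) {
    uint16_t ticket = __atomic_fetch_add(&l->next, 1, __ATOMIC_RELAXED);
    while (__atomic_load_n(&l->owner, __ATOMIC_ACQUIRE) != ticket)
        ;   /* in the kernel: pause / cpu_relax() */
}

/* Hand the lock to the next ticket holder. Caller restores RFLAGS after. */
static void spin_unlock(spinlock_t *l) {
    __atomic_store_n(&l->owner, (uint16_t)(l->owner + 1), __ATOMIC_RELEASE);
}
```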

State transitions use __atomic_store_n with __ATOMIC_RELEASE ordering to ensure visibility across CPUs:

__atomic_store_n(&task->state, TASK_RUNNING, __ATOMIC_RELEASE);

Per-CPU State

Each CPU has a percpu_t structure accessible via the GS segment register:

typedef struct percpu {
    struct percpu       *self;              /* gs:0  */
    uint8_t              cpu_id;            /* gs:8  */
    uint8_t              lapic_id;          /* gs:9  */
    uint8_t              _pad[6];           /* gs:10 */
    struct aegis_task_t *current_task;      /* gs:16 */
    uint64_t             kernel_stack;      /* gs:24 */
    uint64_t             user_rsp_scratch;  /* gs:32 */
    uint64_t             ticks;             /* gs:40 */
    void                *prev_dying_tcb;    /* gs:48 */
    void                *prev_dying_stack;  /* gs:56 */
    uint64_t             prev_dying_stack_pages; /* gs:64 */
} percpu_t;

sched_current() reads gs:16 directly with inline assembly:

static inline aegis_task_t *sched_current(void) {
    return percpu_current();  /* movq %%gs:16, %0 */
}

Context Switch

The context switch is implemented in assembly and saves/restores only callee-saved registers. The compiler handles caller-saved registers at call sites.

x86-64 (ctx_switch.asm)

ctx_switch:
    ; Save outgoing task's callee-saved registers
    push rbx
    push rbp
    push r12
    push r13
    push r14
    push r15

    mov [rdi], rsp      ; current->sp = rsp (sp at offset 0)
    mov rsp, [rsi]      ; rsp = next->sp

    ; Restore incoming task's callee-saved registers
    pop r15
    pop r14
    pop r13
    pop r12
    pop rbp
    pop rbx
    ret                  ; pops saved RIP -> jumps to new task

ARM64 (ctx_switch.S)

The ARM64 version saves/restores 6 register pairs (x19-x30) via stp/ldp and uses x30 (link register) as the return address.

New Task Entry

sched_spawn constructs an initial stack frame that looks as if ctx_switch had already run:

x86-64 initial stack (low to high):
  [r15=0][r14=0][r13=0][r12=0][rbp=0][rbx=0][fn]
                                               ^-- ret pops this into RIP

The first ctx_switch into the task pops zeros for all callee-saved registers, then ret jumps to the task’s entry function.

Scheduler Operations

sched_init()

Initializes the per-CPU current task pointer to NULL, resets TID and task counters, and initializes the run queue sentinel to point to itself.

sched_spawn(fn)

Creates a new kernel task:

  1. Allocates a TCB (1 KVA page)
  2. Allocates a stack: STACK_PAGES (4) usable pages plus one guard page at the bottom. The guard page is unmapped and its physical frame freed – stack overflow triggers a page fault instead of silent corruption
  3. Constructs the initial stack frame matching ctx_switch pop order
  4. Inserts the task into both the full circular list and the run queue
  5. Assigns a monotonically increasing TID

sched_start()

Begins scheduling. Sets s_sched_ready = 1 (which gates sched_tick) and performs a one-way context switch into the first task:

aegis_task_t dummy;
ctx_switch(&dummy, first);  /* never returns */

The dummy TCB receives the abandoned stack pointer. Each task enables interrupts on entry via the architecture-specific interrupt-enable primitive.

sched_tick()

Called by the PIT handler (IRQ0, vector 0x20) on every timer tick. This is the core preemption path:

  1. Guard check: Returns immediately if !s_sched_ready or no current task
  2. Sleep wake: Iterates the full task list to find blocked tasks whose sleep_deadline has passed, waking them by flipping state to RUNNING and inserting into the run queue
  3. Round-robin advance: Calls run_list_next_locked(old) to find the next runnable task in the run queue
  4. Context switch: Updates per-CPU state (current task, TSS RSP0, FS.base), then calls ctx_switch(old, next)
  5. Post-switch restore: After ctx_switch returns (in the resumed task’s context), restores FS.base for TLS correctness
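The sleep-wake scan in step 2 amounts to one pass over the full circular task list. A simplified userspace sketch (locking and the run-queue insert elided; field names follow the TCB definition):

```c
#include <assert.h>
#include <stdint.h>

#define TASK_RUNNING 0
#define TASK_BLOCKED 1

typedef struct aegis_task_t {
    uint32_t state;
    uint64_t sleep_deadline;    /* PIT tick; 0 = not sleeping */
    struct aegis_task_t *next;  /* circular full-task list */
} aegis_task_t;

/* Wake every blocked sleeper whose deadline has passed.
 * Caller holds sched_lock. */
static void wake_expired_sleepers(aegis_task_t *head, uint64_t now) {
    aegis_task_t *t = head;
    do {
        if (t->state == TASK_BLOCKED && t->sleep_deadline &&
            t->sleep_deadline <= now) {
            t->sleep_deadline = 0;
            __atomic_store_n(&t->state, TASK_RUNNING, __ATOMIC_RELEASE);
            /* real code: run_list_insert_locked(t); */
        }
        t = t->next;
    } while (t != head);
}
```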

CR3 policy: sched_tick does not switch CR3 to the incoming user task’s PML4. The switch happens via proc_enter_user (first entry) or isr_common_stub’s saved-CR3 restore (subsequent preemptions). This is critical because sched_tick runs on the outgoing task’s kernel stack – switching CR3 mid-function would use the wrong address space.

sched_block()

Voluntarily yields the CPU when a task must wait:

  1. Captures old->next_run before removal (preserves round-robin order)
  2. Sets state = TASK_BLOCKED and removes from run queue
  3. The task remains in the full circular list for PID lookup
  4. Switches to the next runnable task
  5. On resume: restores CR3 and FS.base

sched_wake(task) / sched_wake_locked(task)

Transitions a blocked task back to RUNNING:

__atomic_store_n(&task->state, TASK_RUNNING, __ATOMIC_RELEASE);
run_list_insert_locked(task);

sched_wake acquires sched_lock internally. sched_wake_locked is for callers that already hold the lock (e.g., sched_exit -> signal_send_pid_locked).

sched_stop(task) / sched_resume(task)

Handle SIGSTOP/SIGCONT semantics:

  • sched_stop: If stopping self, mirrors sched_block with TASK_STOPPED. If stopping another task, just flips state and removes from run queue.
  • sched_resume: Mirrors sched_wake – flips state to RUNNING and inserts into run queue. Works for both STOPPED and BLOCKED tasks (SIGCONT while blocked on a read returns EINTR).

sched_exit()

Handles task termination with two distinct paths:

User process exit:

  1. Performs deferred cleanup of the previous dying task (per-CPU prev_dying_tcb)
  2. Switches to master PML4
  3. Releases the shared fd table (must happen before zombie transition for pipe correctness)
  4. Marks self as ZOMBIE and removes from run queue
  5. Sends SIGCHLD to parent (using _locked variant since sched_lock is held)
  6. Searches for a woken parent to switch to directly (avoids PIT dependency)
  7. If no woken parent, calls sched_yield_to_next
  8. Checks for system shutdown (no remaining live user tasks)

Kernel task exit:

  1. Removes from both circular list and run queue
  2. Records the dying task for deferred cleanup (TCB and stack freed at the next sched_exit call on this CPU)
  3. Context-switches to next task

The deferred cleanup pattern exists because ctx_switch writes to dying->sp – the TCB and stack must remain valid until the stack pointer has fully switched.

Wait Queues

The waitq_t abstraction (kernel/sched/waitq.h, kernel/sched/waitq.c) provides a reusable blocking primitive. Internally it is a LIFO (push/pop at head) — wake_one returns the newest waiter, and wake_all drains in arrival order by reversing the list before walking it. For current workloads (single waiter per point, plus epoll which wakes everyone) LIFO-vs-FIFO is irrelevant; if a future caller needs FIFO the struct layout supports adding a cached tail pointer.

typedef struct waitq {
    struct aegis_task_t *head;
    spinlock_t           lock;
} waitq_t;

#define WAITQ_INIT { .head = NULL, .lock = SPINLOCK_INIT }

Operations

Function            Description
waitq_wait(wq)      Enqueue current task, then sched_block()
waitq_wake_one(wq)  Dequeue the newest waiter (LIFO) and sched_wake() it
waitq_wake_all(wq)  Wake all waiters in arrival (FIFO) order

Canonical Usage Pattern

/* Consumer side */
for (;;) {
    if (condition_is_met(state))
        break;
    waitq_wait(&state->wq);
}

/* Producer side */
arrange_condition(state);
waitq_wake_one(&state->wq);

Implementation Details

  • Internally a singly-linked list through aegis_task_t::wq_next
  • waitq_wait pushes to head (LIFO within the list)
  • wake_one pops from head (newest waiter – strictly LIFO, not FIFO)
  • wake_all drains the list in arrival order by first reversing
  • Lock hierarchy: sched_lock is above wq->lock – sched_wake is called after dropping wq->lock to respect this ordering
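The wake_all drain described above can be sketched as follows (simplified: no locking, and sched_wake is reduced to recording a wake sequence number):

```c
#include <assert.h>
#include <stddef.h>

typedef struct aegis_task_t {
    struct aegis_task_t *wq_next;
    int woken;                  /* wake sequence number; 0 = not woken */
} aegis_task_t;

typedef struct waitq { aegis_task_t *head; } waitq_t;

static int s_wake_seq;          /* test-only stand-in for sched_wake */

/* waitq_wait pushes at head, so the list holds waiters newest-first. */
static void waitq_push(waitq_t *wq, aegis_task_t *t) {
    t->wq_next = wq->head;
    wq->head = t;
}

/* Detach the whole list, reverse it to arrival (FIFO) order, wake each. */
static void waitq_wake_all(waitq_t *wq) {
    aegis_task_t *rev = NULL, *t = wq->head;
    wq->head = NULL;
    while (t) {                 /* reverse: oldest waiter ends up first */
        aegis_task_t *n = t->wq_next;
        t->wq_next = rev;
        rev = t;
        t = n;
    }
    for (t = rev; t; t = t->wq_next)
        t->woken = ++s_wake_seq;   /* real code: sched_wake(t) */
}
```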

Migration Status

New blocking I/O paths should use waitq_t. Existing subsystems (pipe, tty, socket, unix_socket, epoll, futex, usb_mouse) retain their single-waiter patterns pending individual audit and migration.

Stack Layout

All tasks (kernel and user) get a 16 KB kernel stack (4 pages). This size accommodates deep call chains through the PIT ISR into virtio-net, IP, and TCP receive processing.

Kernel task stacks include an unmapped guard page at the bottom. Stack overflow triggers a page fault rather than silently corrupting adjacent KVA allocations.

User Process Initial Kernel Stack

For user processes, proc_spawn constructs an initial kernel stack frame that chains through ctx_switch into proc_enter_user (which performs CR3 switch + iretq/ERET to ring 3):

x86-64 user process initial stack (low to high):
  [r15][r14][r13][r12][rbp][rbx]         <- ctx_switch callee-saves
  [proc_enter_user]                      <- ret target
  [pml4_phys]                            <- popped by proc_enter_user
  [entry_rip][CS=0x2B][RFLAGS=0x202][user_RSP][SS=0x23]  <- iretq frame

Timer Integration

The scheduler is driven by the Programmable Interval Timer (PIT) configured at 100 Hz (10 ms tick interval):

  1. PIT fires IRQ0 (vector 0x20)
  2. pit_handler() increments the tick counter
  3. pit_handler() calls sched_tick()
  4. sched_tick() performs round-robin preemption

The s_sched_ready flag prevents sched_tick from performing context switches before sched_start() has been called, since PIT interrupts may fire during early boot before any tasks exist.
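For reference, the 100 Hz rate corresponds to a divisor programmed into PIT channel 0, derived from the standard 1.193182 MHz input clock of the 8253/8254 (a sketch; the actual register programming lives in the PIT driver):

```c
#include <assert.h>
#include <stdint.h>

#define PIT_BASE_HZ 1193182u   /* canonical 8253/8254 input clock */

/* Divisor the PIT driver would program into channel 0 for a given rate. */
static uint16_t pit_divisor(uint32_t hz) {
    return (uint16_t)(PIT_BASE_HZ / hz);
}
/* 100 Hz -> divisor 11931, i.e. a ~10 ms tick. */
```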