Syscall Interface

Aegis implements a Linux-compatible syscall interface using the SYSCALL/SYSRET instructions on x86-64 and SVC/ERET on ARM64. The kernel provides over 100 syscalls covering file I/O, process management, memory management, networking, signals, and Aegis-specific extensions. User-space code (compiled against musl libc) issues standard Linux syscall numbers.

v1 maturity note: Aegis is v1 software – the first version deemed ready for public release, not a mature or production-hardened system. The syscall interface is the kernel’s primary attack surface: every syscall handler parses untrusted user input in C. While user pointer validation and kernel staging buffers are used throughout, there should be no expectation of real, battle-tested security. There are likely exploitable vulnerabilities in individual syscall implementations, as would be expected in any from-scratch C kernel of this scale. A gradual migration from C to Rust is planned starting with the kernel; the capability system (kernel/cap/) is already in Rust and represents the beginning of this path. Contributions are welcome – file issues or propose changes at exec/aegis.

Entry Mechanism

x86-64: SYSCALL/SYSRET

The syscall entry point is configured during boot by arch_syscall_init() (kernel/arch/x86_64/arch_syscall.c):

wrmsr(IA32_EFER, rdmsr(IA32_EFER) | 1UL);     /* Enable SCE bit */
wrmsr(IA32_STAR, (ARCH_KERNEL_DS << 48) | (ARCH_KERNEL_CS << 32));
wrmsr(IA32_LSTAR, (uint64_t)syscall_entry);     /* Entry point */
wrmsr(IA32_SFMASK, 0x700UL);                    /* Clear IF, TF, DF */

| MSR | Value | Purpose |
|---|---|---|
| IA32_EFER | SCE bit set | Enable SYSCALL/SYSRET |
| IA32_STAR | Kernel CS/DS selectors | Segment selectors for transitions |
| IA32_LSTAR | &syscall_entry | RIP on SYSCALL |
| IA32_SFMASK | 0x700 | Clear IF (interrupts), TF (single-step), DF (direction) |
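
The MSR values above can be sanity-checked with plain arithmetic. The following is a minimal sketch, not the Aegis source: the selector values 0x08 and 0x10 are assumptions (the real ARCH_KERNEL_CS and ARCH_KERNEL_DS are not given here), and all EX_/helper names are hypothetical. The selector derivation follows the architectural SYSCALL/SYSRET rules.

```c
#include <stdint.h>

/* Hypothetical selector values; the real ARCH_KERNEL_CS/ARCH_KERNEL_DS
 * come from the Aegis GDT and are not stated in this document. */
#define EX_KERNEL_CS 0x08u
#define EX_KERNEL_DS 0x10u

/* RFLAGS bits named in the IA32_SFMASK row. */
#define EX_RFLAGS_TF (1u << 8)
#define EX_RFLAGS_IF (1u << 9)
#define EX_RFLAGS_DF (1u << 10)

/* Compose IA32_STAR as arch_syscall_init() does, widening to 64 bits
 * first so the shift by 48 cannot overflow a narrower type. */
uint64_t star_value(uint16_t kernel_cs, uint16_t kernel_ds)
{
    return ((uint64_t)kernel_ds << 48) | ((uint64_t)kernel_cs << 32);
}

/* SYSCALL loads CS from STAR[47:32] and SS from that value + 8. */
uint16_t syscall_cs(uint64_t star) { return (uint16_t)(star >> 32); }
uint16_t syscall_ss(uint64_t star) { return (uint16_t)((star >> 32) + 8); }

/* 64-bit SYSRET loads CS from STAR[63:48] + 16 and SS from
 * STAR[63:48] + 8, both with RPL forced to 3. */
uint16_t sysret_cs64(uint64_t star) { return (uint16_t)(((star >> 48) + 16) | 3); }
uint16_t sysret_ss(uint64_t star)   { return (uint16_t)(((star >> 48) + 8) | 3); }

/* The mask written to IA32_SFMASK: every set bit is cleared in RFLAGS
 * on SYSCALL entry. */
uint64_t sfmask_value(void)
{
    return EX_RFLAGS_TF | EX_RFLAGS_IF | EX_RFLAGS_DF;
}
```

With these assumed selectors, placing the kernel DS in STAR[63:48] only works if the GDT orders its entries as kernel code, kernel data, user data, user code, because SYSRET derives the user selectors by fixed offsets.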

CPU state on SYSCALL entry:

  • RAX = syscall number
  • RDI = arg1, RSI = arg2, RDX = arg3, R10 = arg4, R8 = arg5, R9 = arg6
  • RCX = return RIP, R11 = saved RFLAGS
  • RSP = user stack (unchanged by CPU)
  • IF=0 (interrupts disabled by SFMASK)
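
From user space, the register convention above looks roughly like the wrapper below. This is a sketch for a Linux-compatible x86-64 target using GCC/Clang inline assembly, modelled on the pattern libc implementations use; raw_syscall6 and ex_getpid are hypothetical names, not part of Aegis or musl.

```c
#define _GNU_SOURCE
#include <sys/syscall.h>   /* SYS_getpid */
#include <unistd.h>        /* syscall(), used only by the fallback path */

/* Number in RAX; arguments in RDI, RSI, RDX, R10, R8, R9.
 * SYSCALL overwrites RCX (return RIP) and R11 (saved RFLAGS), so both
 * are declared as clobbers. The kernel's result comes back in RAX. */
static inline long raw_syscall6(long num, long a1, long a2, long a3,
                                long a4, long a5, long a6)
{
#if defined(__x86_64__)
    long ret;
    register long r10 __asm__("r10") = a4;   /* arg4 uses R10, not RCX */
    register long r8  __asm__("r8")  = a5;
    register long r9  __asm__("r9")  = a6;
    __asm__ volatile("syscall"
                     : "=a"(ret)
                     : "a"(num), "D"(a1), "S"(a2), "d"(a3),
                       "r"(r10), "r"(r8), "r"(r9)
                     : "rcx", "r11", "memory");
    return ret;
#else
    /* Non-x86-64 hosts: defer to libc. Note that libc's syscall()
     * reports errors as -1 plus errno rather than a negative errno. */
    return syscall(num, a1, a2, a3, a4, a5, a6);
#endif
}

/* Example call with no arguments. */
long ex_getpid(void)
{
    return raw_syscall6(SYS_getpid, 0, 0, 0, 0, 0, 0);
}
```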

Syscall Entry Assembly (syscall_entry.asm)

1. SWAPGS                          -- switch GS to percpu_t
2. Save user RSP to gs:32          -- percpu.user_rsp_scratch
3. Load kernel RSP from gs:24      -- percpu.kernel_stack
4. Push restore frame:             -- user_rsp, rcx (RIP), r11 (RFLAGS)
5. Push callee-saved regs:         -- r15, r14, r13, r12, rbp, rbx
6. Push r8, r9, r10                -- forms syscall_frame_t
7. Shuffle to SysV 8-arg call:
     rdi=frame, rsi=num, rdx=arg1, rcx=arg2,
     r8=arg3, r9=arg4, [rsp+8]=arg5, [rsp+16]=arg6
8. call syscall_dispatch
9. Check for pending signals
10. Restore registers, SWAPGS, SYSRET

Syscall Frame

The registers pushed onto the kernel stack form a syscall_frame_t, used by sys_fork, sys_execve, and signal delivery to read or modify the user register state:

/* x86-64 */
typedef struct syscall_frame {
    uint64_t r10;       /* +0:  arg4 */
    uint64_t r9;        /* +8:  arg6 */
    uint64_t r8;        /* +16: arg5 */
    uint64_t rbx;       /* +24: callee-saved */
    uint64_t rbp;       /* +32: callee-saved */
    uint64_t r12;       /* +40: callee-saved */
    uint64_t r13;       /* +48: callee-saved */
    uint64_t r14;       /* +56: callee-saved */
    uint64_t r15;       /* +64: callee-saved */
    uint64_t rflags;    /* +72: saved RFLAGS */
    uint64_t rip;       /* +80: return RIP */
    uint64_t user_rsp;  /* +88: saved user RSP */
} syscall_frame_t;
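
The offsets in the comments follow directly from the push order in the entry path: the last register pushed ends up at the lowest address. A small sketch that checks this, assuming the struct exactly as shown above; model_entry_pushes is a hypothetical helper that simulates the pushes on a descending stack.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef struct syscall_frame {
    uint64_t r10, r9, r8;
    uint64_t rbx, rbp, r12, r13, r14, r15;
    uint64_t rflags, rip, user_rsp;
} syscall_frame_t;

/* Compile-time checks that the layout matches the documented offsets. */
_Static_assert(offsetof(syscall_frame_t, r10) == 0, "r10 at +0");
_Static_assert(offsetof(syscall_frame_t, rbx) == 24, "rbx at +24");
_Static_assert(offsetof(syscall_frame_t, rflags) == 72, "rflags at +72");
_Static_assert(offsetof(syscall_frame_t, user_rsp) == 88, "user_rsp at +88");
_Static_assert(sizeof(syscall_frame_t) == 96, "12 slots of 8 bytes");

/* Simulate the entry path: values arrive in push order
 * (user_rsp, rcx, r11, r15, r14, r13, r12, rbp, rbx, r8, r9, r10)
 * and the stack grows downward. Returns the resulting frame. */
syscall_frame_t model_entry_pushes(const uint64_t in_push_order[12])
{
    uint64_t stack[12];
    uint64_t *sp = stack + 12;            /* empty descending stack */
    for (int i = 0; i < 12; i++)
        *--sp = in_push_order[i];         /* push */

    syscall_frame_t f;
    memcpy(&f, sp, sizeof f);             /* frame starts at final RSP */
    return f;
}
```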

ARM64: SVC/ERET

On ARM64, user-space issues SVC #0. The exception vector in vectors.S saves all 31 general-purpose registers plus SP_EL0, ELR_EL1, and SPSR_EL1 (34 slots), then calls syscall_dispatch. ARM64 uses different syscall numbers than x86-64; the dispatch function translates them before the main switch table.
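
For illustration, a 34-slot save area could be modelled as below. The slot order and every name here are assumptions, since the actual layout in vectors.S is not shown. The register roles (number in x8, arguments in x0-x5, result in x0) are the standard Linux ARM64 syscall convention that musl targets.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical trap frame: 31 GPRs plus three system registers. */
typedef struct ex_arm64_frame {
    uint64_t x[31];      /* x0..x30 */
    uint64_t sp_el0;     /* user stack pointer */
    uint64_t elr_el1;    /* return address for ERET */
    uint64_t spsr_el1;   /* saved processor state */
} ex_arm64_frame_t;

_Static_assert(sizeof(ex_arm64_frame_t) == 34 * sizeof(uint64_t),
               "31 GPRs + SP_EL0 + ELR_EL1 + SPSR_EL1 = 34 slots");

/* Syscall number is passed in x8. */
uint64_t ex_arm64_syscall_nr(const ex_arm64_frame_t *f)
{
    return f->x[8];
}

/* Arguments 0..5 are passed in x0..x5; out-of-range indexes yield 0. */
uint64_t ex_arm64_syscall_arg(const ex_arm64_frame_t *f, unsigned i)
{
    return i < 6 ? f->x[i] : 0;
}

/* The result is returned to user space in x0. */
void ex_arm64_set_return(ex_arm64_frame_t *f, int64_t ret)
{
    f->x[0] = (uint64_t)ret;
}
```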

CR3 Policy

The SYSCALL path does not switch CR3. The user PML4 remains loaded throughout syscall_dispatch so syscalls can directly dereference user virtual addresses. This is safe because:

  1. The user PML4 shares the kernel higher-half (PML4[511]) for all kernel code and KVA stacks
  2. Timer interrupts via isr_common_stub switch to master PML4 before ISR dispatch
  3. sched_exit switches to master PML4 before context-switching away
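
Point 1 depends on which PML4 slot a virtual address selects. With 4-level paging the PML4 index is bits 47:39 of the address, so slot 511 covers the top 512 GiB of the address space. A short sketch (helper names are hypothetical; the example addresses are illustrative, apart from the mmap_base value used elsewhere in this document):

```c
#include <stdint.h>

/* PML4 index = bits 47:39 of the virtual address (9 bits, 0..511). */
unsigned ex_pml4_index(uint64_t va)
{
    return (unsigned)((va >> 39) & 0x1FF);
}

/* True if the address is translated through the shared kernel slot. */
int ex_in_kernel_slot(uint64_t va)
{
    return ex_pml4_index(va) == 511;
}
```

Any address at or above 0xFFFFFF8000000000 falls in slot 511, so it resolves identically under the user PML4 and the master PML4 as long as both point that slot at the same page tables.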

Dispatch Table

syscall_dispatch() in kernel/syscall/syscall.c is the central dispatch function. On ARM64, it first translates ARM64 syscall numbers to x86-64 equivalents, then dispatches via a switch statement.
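
In outline, a switch-based dispatcher has the shape below. This is a reduced sketch with two stub handlers, not the real syscall_dispatch(): the ex_ names and stub return values are hypothetical, and the frame argument is omitted.

```c
#include <stdint.h>

#define EX_ENOSYS 38

/* Stub handlers standing in for the real sys_* implementations. */
static int64_t ex_sys_close(uint64_t fd) { (void)fd; return 0; }
static int64_t ex_sys_getpid(void)       { return 1; }

/* Route a syscall number to its handler; anything not listed falls
 * through to -ENOSYS. */
int64_t ex_syscall_dispatch(uint64_t num, uint64_t a1, uint64_t a2,
                            uint64_t a3)
{
    (void)a2; (void)a3;                 /* unused by these two stubs */
    switch (num) {
    case 3:  return ex_sys_close(a1);   /* close  */
    case 39: return ex_sys_getpid();    /* getpid */
    default: return -EX_ENOSYS;
    }
}
```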

ARM64 Translation Layer

ARM64 musl emits ARM64-native syscall numbers. The dispatch function maps them to x86-64 numbers before the main switch. Examples:

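| ARM64 number | x86-64 number | Syscall |
|---|---|---|
| 63 | 0 | read |
| 64 | 1 | write |
| 57 | 3 | close |
| 93 | 60 | exit |
| 220 | 56 | clone |
| 221 | 59 | execve |
| 222 | 9 | mmap |
| 214 | 12 | brk |
| 172 | 39 | getpid |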

The *at variants (openat, mkdirat, renameat2, etc.) are translated to their non-at equivalents by stripping the dirfd argument (Aegis uses absolute/CWD-relative paths).
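
A translation step of this kind might look as follows. The sketch covers only the numbers listed above plus openat (56 on ARM64), and shows the dirfd being dropped by shifting the remaining arguments left. The ex_call_t type and function name are hypothetical; the real code in syscall.c may be structured differently.

```c
#include <stdint.h>

typedef struct {
    uint64_t num;       /* syscall number */
    uint64_t args[6];   /* arg1..arg6 */
} ex_call_t;

/* Rewrite an ARM64 call into its x86-64 form in place.
 * Returns 0 on success, -1 if the number is not in this sample table. */
int ex_arm64_translate(ex_call_t *c)
{
    switch (c->num) {
    case 63:  c->num = 0;  return 0;   /* read   */
    case 64:  c->num = 1;  return 0;   /* write  */
    case 57:  c->num = 3;  return 0;   /* close  */
    case 93:  c->num = 60; return 0;   /* exit   */
    case 220: c->num = 56; return 0;   /* clone  */
    case 221: c->num = 59; return 0;   /* execve */
    case 222: c->num = 9;  return 0;   /* mmap   */
    case 214: c->num = 12; return 0;   /* brk    */
    case 172: c->num = 39; return 0;   /* getpid */
    case 56:
        /* openat(dirfd, path, flags, mode) -> open(path, flags, mode):
         * drop dirfd by shifting every later argument down one slot. */
        for (int i = 0; i < 5; i++)
            c->args[i] = c->args[i + 1];
        c->args[5] = 0;
        c->num = 2;
        return 0;
    default:
        return -1;
    }
}
```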

User Pointer Validation

All syscalls validate user pointers before access:

static inline int user_ptr_valid(uint64_t addr, uint64_t len) {
    return len <= USER_ADDR_MAX && addr <= USER_ADDR_MAX - len;
}

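The check is overflow-safe: it compares addr against USER_ADDR_MAX - len instead of computing addr + len, which could wrap around for large values. Data is copied between user and kernel space via the copy_from_user / copy_to_user helpers. Writes use a kernel staging buffer to prevent TOCTOU attacks, where user code modifies or unmaps the buffer between validation and use.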

These are v1 mitigations – necessary but not sufficient for a hardened kernel. The validation confirms the pointer is in the canonical user address range but does not verify the page is actually mapped or that the access won’t fault. Individual syscalls may have additional validation gaps. Security audit findings to date have been hypothetical, but real exploitable bugs in argument handling almost certainly exist across the 100+ syscall implementations.
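
To see why the form of the range check matters, compare it with a naive addr + len test. The sketch below assumes USER_ADDR_MAX is 0x00007FFFFFFFFFFF (the top of the canonical lower half; the actual Aegis constant is not given here), and naive_ptr_valid is a deliberately flawed counter-example, not Aegis code.

```c
#include <stdint.h>

/* Assumed value: top of the canonical user half on x86-64. */
#define USER_ADDR_MAX 0x00007FFFFFFFFFFFULL

/* The check from the text: no addition, so nothing can wrap. */
static inline int user_ptr_valid(uint64_t addr, uint64_t len)
{
    return len <= USER_ADDR_MAX && addr <= USER_ADDR_MAX - len;
}

/* Flawed variant: addr + len wraps past 2^64 for huge addr, and the
 * wrapped sum then compares as a small, "valid" value. */
static inline int naive_ptr_valid(uint64_t addr, uint64_t len)
{
    return addr + len <= USER_ADDR_MAX;
}
```

For addr = 0xFFFFFFFFFFFFFFF8 and len = 16, the naive sum wraps to 8 and passes, while user_ptr_valid rejects the range.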

Capability Gates

Many syscalls check the calling process’s capability table before proceeding. Each entry in the capability table pairs a capability kind (CAP_KIND_*) with a rights bitfield (READ/WRITE/EXEC). A failed capability check returns -ENOCAP (errno 130). Capability checks appear at the entry of each guarded syscall, before any resource allocation.

The capability subsystem itself (kernel/cap/) is implemented in Rust – the first kernel component to be migrated from C. This makes the capability enforcement logic memory-safe by construction, even though the syscall handlers that call into it are still C. The planned kernel-wide Rust migration will extend this safety guarantee to the syscall dispatch and argument parsing layers.

| Capability Kind | Required For |
|---|---|
| CAP_KIND_VFS_OPEN | sys_open, sys_openat |
| CAP_KIND_VFS_WRITE | sys_write, sys_writev |
| CAP_KIND_VFS_READ | sys_read |
| CAP_KIND_NET_SOCKET | sys_socket and all socket syscalls |
| CAP_KIND_NET_ADMIN | sys_netcfg |
| CAP_KIND_THREAD_CREATE | sys_clone (CLONE_VM) |
| CAP_KIND_PROC_READ | Reading /proc/[other-pid] |
| CAP_KIND_SETUID | sys_setuid, sys_setgid |
| CAP_KIND_DISK_ADMIN | sys_blkdev_io, sys_gpt_rescan |
| CAP_KIND_FB | sys_fb_map |
| CAP_KIND_AUTH | Opening /etc/shadow |
| CAP_KIND_POWER | sys_reboot |
| CAP_KIND_IPC | AF_UNIX sockets, memfd_create |
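
The gate pattern can be sketched in C as below. The real capability subsystem is written in Rust and its interface is not shown in this document, so the entry layout, the numeric kind and rights values, and all ex_ names are assumptions made for illustration.

```c
#include <stddef.h>
#include <stdint.h>

#define EX_ENOCAP 130

/* Hypothetical bit values for the rights bitfield. */
enum { EX_RIGHT_READ = 1, EX_RIGHT_WRITE = 2, EX_RIGHT_EXEC = 4 };

/* Hypothetical numeric values for two capability kinds. */
enum { EX_CAP_KIND_VFS_OPEN = 1, EX_CAP_KIND_NET_SOCKET = 2 };

typedef struct {
    uint32_t kind;     /* CAP_KIND_* */
    uint32_t rights;   /* READ/WRITE/EXEC bits */
} ex_cap_entry_t;

/* Return 0 if some entry of the given kind carries every needed
 * right, otherwise -ENOCAP. */
int ex_cap_check(const ex_cap_entry_t *table, size_t n,
                 uint32_t kind, uint32_t need)
{
    for (size_t i = 0; i < n; i++)
        if (table[i].kind == kind && (table[i].rights & need) == need)
            return 0;
    return -EX_ENOCAP;
}

/* A guarded syscall checks first and allocates only afterwards. */
int64_t ex_sys_socket(const ex_cap_entry_t *caps, size_t n)
{
    int rc = ex_cap_check(caps, n, EX_CAP_KIND_NET_SOCKET,
                          EX_RIGHT_READ | EX_RIGHT_WRITE);
    if (rc < 0)
        return rc;      /* nothing has been allocated yet */
    return 3;           /* placeholder for a newly allocated fd */
}
```

Checking before allocating means the failure path has nothing to unwind.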

Complete Syscall Reference

File I/O (sys_io.c, sys_file.c)

| Number | Name | Signature | Description |
|---|---|---|---|
| 0 | read | (fd, buf, count) | Read from fd into user buffer |
| 1 | write | (fd, buf, count) | Write user buffer to fd |
| 2 | open | (path, flags, mode) | Open file by path |
| 3 | close | (fd) | Close file descriptor |
| 4 | stat | (path, statbuf) | Get file status by path |
| 5 | fstat | (fd, statbuf) | Get file status by fd |
| 6 | lstat | (path, statbuf) | Get symlink status by path |
| 8 | lseek | (fd, offset, whence) | Reposition read/write offset |
| 16 | ioctl | (fd, request, arg) | Device control (TIOCGPGRP, TIOCSPGRP, TIOCGWINSZ, TIOCNOTTY) |
| 20 | writev | (fd, iov, iovcnt) | Gather write from multiple buffers |
| 21 | access | (path, mode) | Check file accessibility |
| 22 | pipe | (pipefd) | Create unidirectional pipe (pipe2 with flags=0) |
| 32 | dup | (oldfd) | Duplicate file descriptor |
| 33 | dup2 | (oldfd, newfd) | Duplicate fd to specific number |
| 72 | fcntl | (fd, cmd, arg) | File descriptor operations (F_GETFL, F_SETFL, F_DUPFD) |
| 77 | ftruncate | (fd, length) | Truncate file to specified length |
| 257 | openat | (dirfd, path, flags, mode) | Open relative to directory fd |
| 293 | pipe2 | (pipefd, flags) | Create pipe with flags (O_CLOEXEC, O_NONBLOCK) |

Directory Operations (sys_dir.c, sys_meta.c)

| Number | Name | Signature | Description |
|---|---|---|---|
| 79 | getcwd | (buf, size) | Get current working directory |
| 80 | chdir | (path) | Change working directory |
| 82 | rename | (oldpath, newpath) | Rename file |
| 83 | mkdir | (path, mode) | Create directory |
| 87 | unlink | (path) | Remove file |
| 88 | symlink | (target, linkpath) | Create symbolic link |
| 89 | readlink | (path, buf, bufsiz) | Read symbolic link target |
| 90 | chmod | (path, mode) | Change file permissions |
| 91 | fchmod | (fd, mode) | Change permissions by fd |
| 92 | chown | (path, owner, group) | Change file ownership |
| 93 | fchown | (fd, owner, group) | Change ownership by fd |
| 94 | lchown | (path, owner, group) | Change symlink ownership |
| 162 | sync | () | Flush all dirty blocks to disk |
| 217 | getdents64 | (fd, dirp, count) | Read directory entries |

Memory Management (sys_memory.c)

| Number | Name | Signature | Description |
|---|---|---|---|
| 9 | mmap | (addr, len, prot, flags, fd, offset) | Map pages into address space |
| 10 | mprotect | (addr, len, prot) | Change page protections |
| 11 | munmap | (addr, len) | Unmap pages |
| 12 | brk | (addr) | Set/query heap break (0 = query) |
| 319 | memfd_create | (name, flags) | Create anonymous shared memory fd |

sys_brk details: Page-aligns upward. Zeroes new pages (Linux guarantee). Rejects shrink below brk_base to prevent freeing ELF segments. On OOM, returns current break unchanged.
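
The brk rules above can be modelled without any paging code. In this sketch the page size, the ex_heap_t type, the oom flag, and the choice to return the current break when a request is rejected are assumptions; mapping and zeroing of pages is left as a comment.

```c
#include <stdint.h>

#define EX_PAGE_SIZE 4096ULL

typedef struct {
    uint64_t brk_base;   /* lowest break allowed (end of ELF image) */
    uint64_t brk;        /* current break, page-aligned */
} ex_heap_t;

static uint64_t ex_page_align_up(uint64_t a)
{
    return (a + EX_PAGE_SIZE - 1) & ~(EX_PAGE_SIZE - 1);
}

/* Model of the documented behaviour. `oom` simulates a failed page
 * allocation while growing the heap. */
uint64_t ex_sys_brk(ex_heap_t *h, uint64_t addr, int oom)
{
    if (addr == 0)
        return h->brk;                     /* query */
    if (addr < h->brk_base)
        return h->brk;                     /* never shrink below base */
    if (addr > UINT64_MAX - EX_PAGE_SIZE)
        return h->brk;                     /* alignment would wrap */

    uint64_t new_brk = ex_page_align_up(addr);
    if (new_brk > h->brk && oom)
        return h->brk;                     /* OOM: break unchanged */

    /* A real kernel would map and zero [brk, new_brk) when growing,
     * or unmap [new_brk, brk) when shrinking. */
    h->brk = new_brk;
    return h->brk;
}
```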

sys_mmap details: Supports MAP_ANONYMOUS | MAP_PRIVATE (bump allocator from mmap_base at 0x700000000000), MAP_FIXED, and file-backed MAP_SHARED (memfd). Recycles VAs from munmap freelist before bumping.
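
A bump allocator with a recycle list can be sketched as follows. The fixed-size freelist and the exact-fit reuse policy are assumptions chosen to keep the example short, and the ex_ names are hypothetical; only the base address comes from the text above.

```c
#include <stdint.h>

#define EX_MMAP_BASE  0x700000000000ULL
#define EX_PAGE_SIZE  4096ULL
#define EX_FREE_SLOTS 8

typedef struct { uint64_t addr, len; } ex_range_t;

typedef struct {
    uint64_t   next;                        /* bump pointer */
    ex_range_t free_list[EX_FREE_SLOTS];    /* ranges released by munmap */
    int        nfree;
} ex_mmap_state_t;

static uint64_t ex_round_up(uint64_t len)
{
    return (len + EX_PAGE_SIZE - 1) & ~(EX_PAGE_SIZE - 1);
}

/* Pick a virtual address: reuse an exact-fit freed range if one
 * exists, otherwise advance the bump pointer. */
uint64_t ex_mmap_alloc(ex_mmap_state_t *s, uint64_t len)
{
    len = ex_round_up(len);
    for (int i = 0; i < s->nfree; i++) {
        if (s->free_list[i].len == len) {
            uint64_t addr = s->free_list[i].addr;
            s->nfree--;
            s->free_list[i] = s->free_list[s->nfree];  /* swap-remove */
            return addr;
        }
    }
    uint64_t addr = s->next;
    s->next += len;
    return addr;
}

/* Record a released range; silently dropped if the list is full. */
void ex_mmap_free(ex_mmap_state_t *s, uint64_t addr, uint64_t len)
{
    if (s->nfree < EX_FREE_SLOTS) {
        s->free_list[s->nfree].addr = addr;
        s->free_list[s->nfree].len  = ex_round_up(len);
        s->nfree++;
    }
}
```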

Process Management (sys_process.c, sys_exec.c)

| Number | Name | Signature | Description |
|---|---|---|---|
| 56 | clone | (frame, flags, stack, ptid, ctid, tls) | Create thread or process |
| 57 | fork | (frame) | Duplicate process (deep copy) |
| 59 | execve | (frame, path, argv, envp) | Replace process image |
| 60 | exit | (status) | Terminate calling thread |
| 61 | waitpid | (pid, wstatus, options) | Wait for child state change |
| 231 | exit_group | (status) | Terminate all threads in group |
| 514 | spawn | (path, argv, envp, stdio_fd, cap_mask) | Aegis-specific spawn with capability restriction |

See Processes & ELF Loading for detailed fork/clone/execve documentation.

Identity & Session (sys_identity.c)

| Number | Name | Signature | Description |
|---|---|---|---|
| 39 | getpid | () | Returns TGID (thread group ID) |
| 63 | uname | (buf) | Get system identification |
| 95 | umask | (mask) | Set file creation mask |
| 97 | getrlimit | (resource, rlim) | Get resource limits |
| 102 | getuid | () | Get real user ID |
| 104 | getgid | () | Get real group ID |
| 105 | setuid | (uid) | Set user ID (requires CAP_KIND_SETUID) |
| 106 | setgid | (gid) | Set group ID (requires CAP_KIND_SETUID) |
| 107 | geteuid | () | Get effective user ID (= uid, no SUID) |
| 108 | getegid | () | Get effective group ID (= gid) |
| 109 | setpgid | (pid, pgid) | Set process group ID |
| 110 | getppid | () | Get parent PID |
| 111 | getpgrp | () | Get process group (= getpgid(0)) |
| 112 | setsid | () | Create new session |
| 121 | getpgid | (pid) | Get process group of specific PID |
| 158 | arch_prctl | (code, addr) | Set/get FS.base (TLS pointer) |
| 186 | gettid | () | Get thread ID (= PID, unique per thread) |
| 218 | set_tid_address | (tidptr) | Set clear_child_tid pointer |
| 273 | set_robust_list | (head, len) | Set robust futex list (stub, returns 0) |

sys_arch_prctl details:

  • ARCH_SET_FS (0x1002): Set FS.base for TLS. Rejects kernel addresses. Writes IA32_FS_BASE MSR and saves to task->fs_base.
  • ARCH_GET_FS (0x1003): Write current fs_base to user pointer.
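
The two operations can be modelled as below. This is a sketch under several assumptions: the user address limit, the error codes returned for a kernel address (-EPERM) and for an unknown code (-EINVAL), and the ex_ names are not taken from the Aegis source. The MSR write is represented by a comment, and a plain pointer stands in for the copy to user space.

```c
#include <stdint.h>

#define ARCH_SET_FS 0x1002
#define ARCH_GET_FS 0x1003

#define EX_USER_ADDR_MAX 0x00007FFFFFFFFFFFULL   /* assumed limit */
#define EX_EPERM  1
#define EX_EINVAL 22

typedef struct ex_task {
    uint64_t fs_base;   /* saved so it can be restored on context switch */
} ex_task_t;

int64_t ex_arch_prctl(ex_task_t *task, uint64_t code, uint64_t addr,
                      uint64_t *user_out)
{
    switch (code) {
    case ARCH_SET_FS:
        if (addr > EX_USER_ADDR_MAX)
            return -EX_EPERM;            /* reject kernel addresses */
        /* real kernel: wrmsr(IA32_FS_BASE, addr) happens here */
        task->fs_base = addr;
        return 0;
    case ARCH_GET_FS:
        *user_out = task->fs_base;       /* stands in for copy_to_user */
        return 0;
    default:
        return -EX_EINVAL;
    }
}
```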

Signal Handling (sys_signal.c)

| Number | Name | Signature | Description |
|---|---|---|---|
| 13 | rt_sigaction | (signum, act, oldact, sigsetsize) | Install signal handler |
| 14 | rt_sigprocmask | (how, set, oldset, sigsetsize) | Modify signal mask |
| 15 | rt_sigreturn | (frame) | Return from signal handler |
| 62 | kill | (pid, sig) | Send signal to process |
| 130 | rt_sigsuspend | (mask, sigsetsize) | Wait for signal with temporary mask |

Signal delivery is checked at syscall return (SYSRET path). If a signal is pending and not masked, the kernel builds a signal frame on the user stack and redirects SYSRET to the handler. rt_sigreturn restores the original user context.

SIGKILL, SIGSTOP, and SIGCONT cannot be caught or ignored. This is stricter than POSIX, which reserves that treatment for SIGKILL and SIGSTOP only.
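
The "pending and not masked" test and the uncatchable-signal rule can be expressed with bitmasks. In this sketch the one-bit-per-signal encoding and the lowest-number-first delivery order are assumptions, and the ex_ names are hypothetical; the signal numbers are the standard Linux x86-64 values.

```c
#include <stdint.h>

#define EX_SIGKILL 9
#define EX_SIGCONT 18
#define EX_SIGSTOP 19

/* Signal n is represented by bit n-1 of a 64-bit set. */
static inline uint64_t ex_sigbit(int sig)
{
    return 1ULL << (sig - 1);
}

/* rt_sigaction gate: the three signals named above may not have a
 * handler installed or be set to ignore. */
int ex_sigaction_allowed(int sig)
{
    if (sig < 1 || sig > 64)
        return 0;
    return sig != EX_SIGKILL && sig != EX_SIGSTOP && sig != EX_SIGCONT;
}

/* Return the lowest-numbered signal that is pending and not blocked,
 * or 0 if there is nothing to deliver at syscall return. */
int ex_next_signal(uint64_t pending, uint64_t blocked)
{
    uint64_t ready = pending & ~blocked;
    for (int sig = 1; sig <= 64; sig++)
        if (ready & ex_sigbit(sig))
            return sig;
    return 0;
}
```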

Time (sys_time.c)

| Number | Name | Signature | Description |
|---|---|---|---|
| 35 | nanosleep | (req, rem) | Sleep for specified duration |
| 227 | clock_settime | (clk_id, tp) | Set clock |
| 228 | clock_gettime | (clk_id, tp) | Get clock time (CLOCK_REALTIME, CLOCK_MONOTONIC) |
sys_nanosleep: Computes a sleep_deadline from the PIT tick counter, sets task->sleep_deadline, and calls sched_block(). sched_tick wakes the task when the deadline passes.
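
The deadline arithmetic can be sketched as follows. The 100 Hz tick rate and rounding up to whole ticks are assumptions (the actual PIT frequency used by Aegis is not stated here), and the ex_ names are hypothetical.

```c
#include <stdint.h>

#define EX_TICK_HZ     100ULL                       /* assumed PIT rate */
#define EX_NS_PER_SEC  1000000000ULL
#define EX_NS_PER_TICK (EX_NS_PER_SEC / EX_TICK_HZ)

/* Convert a (sec, nsec) request into an absolute tick deadline,
 * rounding up so the task never sleeps for less than requested. */
uint64_t ex_sleep_deadline(uint64_t now_ticks, uint64_t sec, uint64_t nsec)
{
    uint64_t ns    = sec * EX_NS_PER_SEC + nsec;
    uint64_t ticks = (ns + EX_NS_PER_TICK - 1) / EX_NS_PER_TICK;
    return now_ticks + ticks;
}

/* The per-tick check a scheduler would run for a sleeping task. */
int ex_should_wake(uint64_t now_ticks, uint64_t deadline)
{
    return now_ticks >= deadline;
}
```

At this assumed rate a 25 ms request becomes 3 ticks, so the effective sleep resolution is one tick (10 ms).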

Networking (sys_socket.c)

| Number | Name | Signature | Description |
|---|---|---|---|
| 7 | poll | (fds, nfds, timeout) | Wait for events on fds |
| 23 | select | (nfds, rfds, wfds, efds, timeout) | Synchronous I/O multiplexing |
| 41 | socket | (domain, type, protocol) | Create socket (AF_INET, AF_UNIX) |
| 42 | connect | (fd, addr, addrlen) | Connect to remote address |
| 43 | accept | (fd, addr, addrlen) | Accept incoming connection |
| 44 | sendto | (fd, buf, len, flags, addr, addrlen) | Send datagram |
| 45 | recvfrom | (fd, buf, len, flags, addr, addrlen) | Receive datagram |
| 46 | sendmsg | (fd, msg, flags) | Send message with ancillary data |
| 47 | recvmsg | (fd, msg, flags) | Receive message with ancillary data |
| 48 | shutdown | (fd, how) | Shut down socket |
| 49 | bind | (fd, addr, addrlen) | Bind socket to address |
| 50 | listen | (fd, backlog) | Mark socket as passive |
| 51 | getsockname | (fd, addr, addrlen) | Get socket local address |
| 52 | getpeername | (fd, addr, addrlen) | Get socket peer address |
| 53 | socketpair | (domain, type, proto, sv) | Create connected socket pair |
| 54 | setsockopt | (fd, level, optname, optval, optlen) | Set socket option |
| 55 | getsockopt | (fd, level, optname, optval, optlen) | Get socket option |
| 232 | epoll_wait | (epfd, events, maxevents, timeout) | Wait for epoll events |
| 233 | epoll_ctl | (epfd, op, fd, event) | Control epoll instance |
| 291 | epoll_create1 | (flags) | Create epoll instance |

See Network Stack and Socket API for details.

Synchronization (futex.c)

| Number | Name | Signature | Description |
|---|---|---|---|
| 202 | futex | (addr, op, val, timeout, addr2, val3) | Fast userspace mutex |

Supports FUTEX_WAIT (block if *addr == val) and FUTEX_WAKE (wake up to val waiters). The FUTEX_PRIVATE_FLAG (128) is accepted and ignored (single address space per process). The kernel also uses the futex wake path internally on the clear_child_tid address at thread exit, which is what pthread_join waits on.
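
The three rules in this paragraph reduce to a few lines of logic. The sketch below is a single-threaded model rather than a working futex: the ex_ helpers are hypothetical, and a real implementation would perform the value check atomically with respect to enqueueing the waiter.

```c
#include <stdint.h>

#define FUTEX_WAIT         0
#define FUTEX_WAKE         1
#define FUTEX_PRIVATE_FLAG 128

/* Strip the private flag so FUTEX_WAIT_PRIVATE and FUTEX_WAIT take
 * the same path. */
int ex_futex_cmd(int op)
{
    return op & ~FUTEX_PRIVATE_FLAG;
}

/* FUTEX_WAIT blocks only if the word still holds the expected value;
 * otherwise the caller raced with a wake-up and must not sleep. */
int ex_futex_wait_blocks(const volatile uint32_t *addr, uint32_t val)
{
    return *addr == val;
}

/* FUTEX_WAKE wakes at most `val` of the tasks currently waiting. */
uint32_t ex_futex_wake_count(uint32_t waiters, uint32_t val)
{
    return waiters < val ? waiters : val;
}
```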

Random (sys_random.c)

| Number | Name | Signature | Description |
|---|---|---|---|
| 318 | getrandom | (buf, buflen, flags) | Get random bytes |

Block Device / Framebuffer (sys_disk.c)

| Number | Name | Signature | Description |
|---|---|---|---|
| 510 | blkdev_list | (buf, buflen) | List block devices |
| 511 | blkdev_io | (dev, op, lba, buf, count) | Raw block device I/O |
| 512 | gpt_rescan | (dev) | Rescan GPT partition table |
| 513 | fb_map | (info_buf) | Map framebuffer into user space |

Capability Management (sys_cap.c)

| Number | Name | Signature | Description |
|---|---|---|---|
| 362 | cap_query | (pid, buf, buflen) | Query process capabilities |
| 363 | cap_grant_runtime | (target_pid, kind, rights) | Grant capability at runtime |
| 364 | auth_session | () | Mark session as authenticated |

Aegis Extensions

| Number | Name | Signature | Description |
|---|---|---|---|
| 169 | reboot | (cmd) | Shutdown/reboot system (requires CAP_KIND_POWER) |
| 360 | setfg | (pgrp) | Set terminal foreground process group |
| 500 | netcfg | (op, arg1, arg2, arg3) | Network configuration (requires CAP_KIND_NET_ADMIN) |

Syscall numbers 360-364 and 500-514 are Aegis-specific extensions and do not correspond to native Linux x86-64 syscalls. reboot (169) keeps its standard Linux number.

Error Handling

Syscalls return negative errno values on failure (Linux convention):

| Error | Return value | Common cause |
|---|---|---|
| EPERM | -1 | Operation not permitted |
| ENOENT | -2 | File not found |
| ESRCH | -3 | No such process |
| EIO | -5 | I/O error |
| ENOEXEC | -8 | Not an executable |
| EBADF | -9 | Bad file descriptor |
| EAGAIN | -11 | Process limit reached / would block |
| ENOMEM | -12 | Out of memory |
| EACCES | -13 | Permission denied |
| EFAULT | -14 | Bad user pointer |
| ENOSYS | -38 | Syscall not implemented |
| ENOCAP | -130 | Capability check failed (Aegis-specific) |

All unrecognized syscall numbers return -ENOSYS.
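
On the user side, a libc wrapper turns the negative-errno return into the familiar -1 plus errno. The sketch below follows the usual Linux libc pattern of treating the top 4095 values as errors, so large valid results such as mmap addresses pass through unchanged. ex_syscall_ret and EX_ENOCAP are hypothetical names used for illustration, and a 64-bit long is assumed.

```c
#include <errno.h>

#define EX_ENOCAP 130   /* Aegis-specific value from the table above */

/* Convert a raw kernel return value into the C library convention.
 * Values in [-4095, -1] are errors; everything else is a result. */
long ex_syscall_ret(long r)
{
    if ((unsigned long)r > -4096UL) {
        errno = (int)-r;
        return -1;
    }
    return r;
}
```

A caller therefore sees a failed capability check as a return of -1 with errno set to 130.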

Implementation Structure

The syscall implementation is split across multiple translation units for organization:

| File | Category |
|---|---|
| syscall.c | Dispatch table and ARM64 translation |
| sys_io.c | read, write, writev, close |
| sys_file.c | open, stat, fstat, ioctl, fcntl, lseek, pipe, dup |
| sys_dir.c | getdents64, mkdir, unlink, rename |
| sys_meta.c | lstat, symlink, readlink, chmod, chown |
| sys_memory.c | brk, mmap, munmap, mprotect, memfd_create, ftruncate |
| sys_process.c | exit, exit_group, clone, fork, waitpid |
| sys_exec.c | execve, spawn |
| sys_identity.c | getpid, setuid, setpgid, setsid, arch_prctl, uname |
| sys_signal.c | rt_sigaction, rt_sigprocmask, rt_sigreturn, kill, rt_sigsuspend |
| sys_time.c | nanosleep, clock_gettime, clock_settime |
| sys_socket.c | All socket/networking syscalls, poll, select, epoll |
| sys_cap.c | cap_query, cap_grant_runtime, auth_session |
| sys_random.c | getrandom |
| sys_disk.c | blkdev_list, blkdev_io, gpt_rescan, fb_map |
| futex.c | futex wait/wake |

All files include sys_impl.h which provides shared types, constants, and forward declarations.