Syscall Interface
System call entry mechanism, dispatch table, and complete reference for all implemented syscalls with numbers and signatures
Aegis implements a Linux-compatible syscall interface using the SYSCALL/SYSRET instructions on x86-64 and SVC/ERET on ARM64. The kernel provides over 100 syscalls covering file I/O, process management, memory management, networking, signals, and Aegis-specific extensions. User-space code (compiled against musl libc) issues standard Linux syscall numbers.
v1 maturity note: Aegis is v1 software – the first version deemed ready for public release, not a mature or production-hardened system. The syscall interface is the kernel’s primary attack surface: every syscall handler parses untrusted user input in C. User pointer validation and kernel staging buffers are used throughout, but the interface has not been battle-tested and should not be relied on for real security. Individual syscall implementations likely contain exploitable vulnerabilities, as would be expected in any from-scratch C kernel of this scale. A gradual migration from C to Rust is planned, starting with the kernel; the capability system (kernel/cap/) is already in Rust and represents the beginning of this path. Contributions are welcome – file issues or propose changes at exec/aegis.
Entry Mechanism
x86-64: SYSCALL/SYSRET
The syscall entry point is configured during boot by arch_syscall_init() (kernel/arch/x86_64/arch_syscall.c):
wrmsr(IA32_EFER, rdmsr(IA32_EFER) | 1UL); /* Enable SCE bit */
wrmsr(IA32_STAR, (ARCH_KERNEL_DS << 48) | (ARCH_KERNEL_CS << 32));
wrmsr(IA32_LSTAR, (uint64_t)syscall_entry); /* Entry point */
wrmsr(IA32_SFMASK, 0x700UL); /* Clear IF, TF, DF */
| MSR | Value | Purpose |
|---|---|---|
| IA32_EFER | SCE bit set | Enable SYSCALL/SYSRET |
| IA32_STAR | Kernel CS/DS selectors | Segment selectors for transitions |
| IA32_LSTAR | &syscall_entry | RIP on SYSCALL |
| IA32_SFMASK | 0x700 | Clear IF (interrupts), TF (single-step), DF (direction) |
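To make the IA32_SFMASK value concrete: on SYSCALL the CPU saves the user RFLAGS in R11 and clears every live RFLAGS bit that is set in the mask. The following is a minimal model in plain C. The RFLAGS_* macro names and the helper are illustrative and are not taken from the Aegis source; the bit positions are the architectural ones.

```c
#include <stdint.h>

/* Architectural RFLAGS bit positions on x86-64. */
#define RFLAGS_TF (UINT64_C(1) << 8)   /* trap flag (single-step) */
#define RFLAGS_IF (UINT64_C(1) << 9)   /* interrupt enable */
#define RFLAGS_DF (UINT64_C(1) << 10)  /* direction flag */

/* Hypothetical helper: models the RFLAGS value the kernel starts with
 * after SYSCALL. Every bit set in the mask is cleared; all other bits
 * are carried over from user mode. */
static inline uint64_t rflags_after_syscall(uint64_t user_rflags,
                                            uint64_t sfmask)
{
    return user_rflags & ~sfmask;
}
```

With the mask 0x700, a user RFLAGS value of 0x246 (IF set) becomes 0x46 in the kernel, which is why the entry stub runs with interrupts disabled.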
CPU state on SYSCALL entry:
- RAX = syscall number
- RDI = arg1, RSI = arg2, RDX = arg3, R10 = arg4, R8 = arg5, R9 = arg6
- RCX = return RIP, R11 = saved RFLAGS
- RSP = user stack (unchanged by CPU)
- IF=0 (interrupts disabled by SFMASK)
Syscall Entry Assembly (syscall_entry.asm)
1. SWAPGS -- switch GS to percpu_t
2. Save user RSP to gs:32 -- percpu.user_rsp_scratch
3. Load kernel RSP from gs:24 -- percpu.kernel_stack
4. Push restore frame: -- user_rsp, rcx (RIP), r11 (RFLAGS)
5. Push callee-saved regs: -- r15, r14, r13, r12, rbp, rbx
6. Push r8, r9, r10 -- forms syscall_frame_t
7. Shuffle to SysV 8-arg call:
       rdi=frame, rsi=num, rdx=arg1, rcx=arg2,
       r8=arg3, r9=arg4, [rsp+8]=arg5, [rsp+16]=arg6
8. call syscall_dispatch
9. Check for pending signals
10. Restore registers, SWAPGS, SYSRET
Syscall Frame
The registers pushed onto the kernel stack form a syscall_frame_t, used by sys_fork, sys_execve, and signal delivery to read or modify the user register state:
/* x86-64 */
typedef struct syscall_frame {
uint64_t r10; /* +0: arg4 */
uint64_t r9; /* +8: arg6 */
uint64_t r8; /* +16: arg5 */
uint64_t rbx; /* +24: callee-saved */
uint64_t rbp; /* +32: callee-saved */
uint64_t r12; /* +40: callee-saved */
uint64_t r13; /* +48: callee-saved */
uint64_t r14; /* +56: callee-saved */
uint64_t r15; /* +64: callee-saved */
uint64_t rflags; /* +72: saved RFLAGS */
uint64_t rip; /* +80: return RIP */
uint64_t user_rsp; /* +88: saved user RSP */
} syscall_frame_t;
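Because the assembly stub and the C code must agree on this layout, the offsets in the comments can be checked at compile time. The sketch below repeats the struct from the text and adds static assertions. The helper frame_redirect is hypothetical; it only illustrates how code such as signal delivery could change where SYSRET returns, and is not claimed to match the Aegis implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct syscall_frame {
    uint64_t r10, r9, r8;                  /* arg4, arg6, arg5 */
    uint64_t rbx, rbp, r12, r13, r14, r15; /* callee-saved */
    uint64_t rflags, rip, user_rsp;        /* restore frame */
} syscall_frame_t;

/* The push order in the entry stub fixes these offsets. */
static_assert(offsetof(syscall_frame_t, r10) == 0, "r10 at +0");
static_assert(offsetof(syscall_frame_t, rbx) == 24, "rbx at +24");
static_assert(offsetof(syscall_frame_t, r15) == 64, "r15 at +64");
static_assert(offsetof(syscall_frame_t, rflags) == 72, "rflags at +72");
static_assert(offsetof(syscall_frame_t, rip) == 80, "rip at +80");
static_assert(offsetof(syscall_frame_t, user_rsp) == 88, "user_rsp at +88");
static_assert(sizeof(syscall_frame_t) == 96, "12 slots of 8 bytes");

/* Hypothetical helper: make the return to user space land at a new
 * instruction pointer and stack pointer. */
static inline void frame_redirect(syscall_frame_t *f,
                                  uint64_t rip, uint64_t rsp)
{
    f->rip = rip;
    f->user_rsp = rsp;
}
```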
ARM64: SVC/ERET
On ARM64, user-space issues SVC #0. The exception vector in vectors.S saves all 31 general-purpose registers plus SP_EL0, ELR_EL1, and SPSR_EL1 (34 slots), then calls syscall_dispatch. ARM64 uses different syscall numbers than x86-64; the dispatch function translates them before the main switch table.
CR3 Policy
The SYSCALL path does not switch CR3. The user PML4 remains loaded throughout syscall_dispatch so syscalls can directly dereference user virtual addresses. This is safe because:
- The user PML4 shares the kernel higher-half (PML4[511]) for all kernel code and KVA stacks
- Timer interrupts via isr_common_stub switch to master PML4 before ISR dispatch
- sched_exit switches to master PML4 before context-switching away
Dispatch Table
syscall_dispatch() in kernel/syscall/syscall.c is the central dispatch function. On ARM64, it first translates ARM64 syscall numbers to x86-64 equivalents, then dispatches via a switch statement.
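The shape of such a dispatcher can be sketched as follows. The argument order follows the 8-argument call described in the entry assembly (frame, number, then arg1 to arg6). The function name dispatch_sketch and the stub handler are illustrative only, and a single case stands in for the full switch.

```c
#include <errno.h>
#include <stdint.h>

struct syscall_frame;  /* opaque here; the layout is given under "Syscall Frame" */

/* Stand-in handler so the sketch is self-contained. */
static long sys_getpid_stub(void) { return 1; }

/* Hypothetical dispatcher: one switch on the (already translated)
 * x86-64 syscall number; unknown numbers fall through to -ENOSYS. */
static long dispatch_sketch(struct syscall_frame *frame, uint64_t num,
                            uint64_t a1, uint64_t a2, uint64_t a3,
                            uint64_t a4, uint64_t a5, uint64_t a6)
{
    (void)frame; (void)a1; (void)a2; (void)a3;
    (void)a4; (void)a5; (void)a6;

    switch (num) {
    case 39: /* getpid */
        return sys_getpid_stub();
    default:
        return -ENOSYS;
    }
}
```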
ARM64 Translation Layer
ARM64 musl emits ARM64-native syscall numbers. The dispatch function maps them to x86-64 numbers before the main switch. Examples:
| ARM64 | x86-64 | Syscall |
|---|---|---|
| 63 | 0 | read |
| 64 | 1 | write |
| 57 | 3 | close |
| 93 | 60 | exit |
| 220 | 56 | clone |
| 221 | 59 | execve |
| 222 | 9 | mmap |
| 214 | 12 | brk |
| 172 | 39 | getpid |
The *at variants (openat, mkdirat, renameat2, etc.) are translated to their non-at equivalents by stripping the dirfd argument (Aegis uses absolute/CWD-relative paths).
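A translation step of this kind might look like the sketch below. The number pairs come from the table above; the openat case uses ARM64 number 56 from the generic Linux syscall table, which is not listed in this document. The function name, the return convention, and the way dirfd is dropped by shifting arguments are assumptions made for illustration.

```c
#include <stdint.h>

/* Hypothetical helper: rewrite an ARM64 syscall number (and, for *at
 * calls, its arguments) into the x86-64 form used by the main switch.
 * Returns 0 on success, -1 if the number is not in this partial table. */
static int arm64_translate(uint64_t *num, uint64_t args[6])
{
    switch (*num) {
    case 63:  *num = 0;  return 0;  /* read */
    case 64:  *num = 1;  return 0;  /* write */
    case 57:  *num = 3;  return 0;  /* close */
    case 93:  *num = 60; return 0;  /* exit */
    case 220: *num = 56; return 0;  /* clone */
    case 221: *num = 59; return 0;  /* execve */
    case 222: *num = 9;  return 0;  /* mmap */
    case 214: *num = 12; return 0;  /* brk */
    case 172: *num = 39; return 0;  /* getpid */
    case 56:
        /* openat(dirfd, path, flags, mode) -> open(path, flags, mode):
         * drop dirfd by shifting the remaining arguments down. */
        args[0] = args[1];
        args[1] = args[2];
        args[2] = args[3];
        args[3] = 0;
        *num = 2;
        return 0;
    default:
        return -1;
    }
}
```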
User Pointer Validation
All syscalls validate user pointers before access:
static inline int user_ptr_valid(uint64_t addr, uint64_t len) {
return len <= USER_ADDR_MAX && addr <= USER_ADDR_MAX - len;
}
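The check is overflow-safe: it compares addr against USER_ADDR_MAX - len instead of computing addr + len, so a large len cannot wrap around. Data is copied between user and kernel space via the copy_from_user / copy_to_user helpers. Writes use a kernel staging buffer to prevent TOCTOU attacks, in which user code modifies or unmaps the buffer between validation and use.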
These are v1 mitigations – necessary but not sufficient for a hardened kernel. The validation confirms the pointer is in the canonical user address range but does not verify the page is actually mapped or that the access won’t fault. Individual syscalls may have additional validation gaps. Security audit findings to date have been hypothetical, but real exploitable bugs in argument handling almost certainly exist across the 100+ syscall implementations.
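The overflow property of the range check is easiest to see next to a naive variant. In the sketch below, user_ptr_valid is copied from the text; the value of USER_ADDR_MAX is an assumption (top of the lower canonical half with 48-bit virtual addresses), and user_ptr_valid_naive is a deliberately wrong version shown only for contrast.

```c
#include <stdint.h>

/* Assumed value; the real constant is defined in the Aegis headers. */
#define USER_ADDR_MAX UINT64_C(0x00007FFFFFFFFFFF)

/* From the text: never computes addr + len, so it cannot wrap. */
static inline int user_ptr_valid(uint64_t addr, uint64_t len)
{
    return len <= USER_ADDR_MAX && addr <= USER_ADDR_MAX - len;
}

/* Naive version for contrast only: addr + len can wrap past 2^64 and
 * make a kernel-half address look like a small user address. */
static inline int user_ptr_valid_naive(uint64_t addr, uint64_t len)
{
    return addr + len <= USER_ADDR_MAX;
}
```

For addr = 0xFFFFFFFFFFFFF000 and len = 0x2000 the naive sum wraps to 0x1000 and is accepted, while user_ptr_valid rejects the range.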
Capability Gates
Many syscalls check the calling process’s capability table before proceeding. Each entry in the capability table pairs a capability kind (CAP_KIND_*) with a rights bitfield (READ/WRITE/EXEC). Failing a capability check returns ENOCAP (errno 130). Capability checks appear at the entry of each guarded syscall, before any resource allocation.
The capability subsystem itself (kernel/cap/) is implemented in Rust – the first kernel component to be migrated from C. This makes the capability enforcement logic memory-safe by construction, even though the syscall handlers that call into it are still C. The planned kernel-wide Rust migration will extend this safety guarantee to the syscall dispatch and argument parsing layers.
| Capability Kind | Required For |
|---|---|
| CAP_KIND_VFS_OPEN | sys_open, sys_openat |
| CAP_KIND_VFS_WRITE | sys_write, sys_writev |
| CAP_KIND_VFS_READ | sys_read |
| CAP_KIND_NET_SOCKET | sys_socket and all socket syscalls |
| CAP_KIND_NET_ADMIN | sys_netcfg |
| CAP_KIND_THREAD_CREATE | sys_clone (CLONE_VM) |
| CAP_KIND_PROC_READ | Reading /proc/[other-pid] |
| CAP_KIND_SETUID | sys_setuid, sys_setgid |
| CAP_KIND_DISK_ADMIN | sys_blkdev_io, sys_gpt_rescan |
| CAP_KIND_FB | sys_fb_map |
| CAP_KIND_AUTH | Opening /etc/shadow |
| CAP_KIND_POWER | sys_reboot |
| CAP_KIND_IPC | AF_UNIX sockets, memfd_create |
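The gate logic described above (a table of kind and rights pairs, checked before any work is done) can be modelled in a few lines. This is a C model for illustration only: the real capability subsystem is written in Rust, and the numeric values of the kinds and rights bits, the struct layout, and the function name below are all assumptions. Only ENOCAP = 130 and the READ/WRITE/EXEC rights come from the text.

```c
#include <stddef.h>
#include <stdint.h>

#define ENOCAP 130  /* Aegis-specific errno, as stated in the text */

/* Rights bits: names from the text, values assumed. */
enum { CAP_RIGHT_READ = 1u << 0, CAP_RIGHT_WRITE = 1u << 1, CAP_RIGHT_EXEC = 1u << 2 };

/* A few kinds from the table; numeric values are placeholders. */
enum { CAP_KIND_VFS_OPEN = 1, CAP_KIND_VFS_WRITE, CAP_KIND_VFS_READ };

struct cap_entry {
    uint32_t kind;
    uint32_t rights;
};

/* Hypothetical check: returns 0 if some entry of the requested kind
 * grants every right in `need`, otherwise -ENOCAP. */
static int cap_check_sketch(const struct cap_entry *tbl, size_t n,
                            uint32_t kind, uint32_t need)
{
    for (size_t i = 0; i < n; i++) {
        if (tbl[i].kind == kind && (tbl[i].rights & need) == need)
            return 0;
    }
    return -ENOCAP;
}
```

A guarded handler would call such a check first and return its result immediately on failure, before allocating anything.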
Complete Syscall Reference
File I/O (sys_io.c, sys_file.c)
| Number | Name | Signature | Description |
|---|---|---|---|
| 0 | read | (fd, buf, count) | Read from fd into user buffer |
| 1 | write | (fd, buf, count) | Write user buffer to fd |
| 2 | open | (path, flags, mode) | Open file by path |
| 3 | close | (fd) | Close file descriptor |
| 4 | stat | (path, statbuf) | Get file status by path |
| 5 | fstat | (fd, statbuf) | Get file status by fd |
| 6 | lstat | (path, statbuf) | Get symlink status by path |
| 8 | lseek | (fd, offset, whence) | Reposition read/write offset |
| 16 | ioctl | (fd, request, arg) | Device control (TIOCGPGRP, TIOCSPGRP, TIOCGWINSZ, TIOCNOTTY) |
| 20 | writev | (fd, iov, iovcnt) | Scatter/gather write |
| 21 | access | (path, mode) | Check file accessibility |
| 22 | pipe | (pipefd) | Create unidirectional pipe (pipe2 with flags=0) |
| 32 | dup | (oldfd) | Duplicate file descriptor |
| 33 | dup2 | (oldfd, newfd) | Duplicate fd to specific number |
| 72 | fcntl | (fd, cmd, arg) | File descriptor operations (F_GETFL, F_SETFL, F_DUPFD) |
| 77 | ftruncate | (fd, length) | Truncate file to specified length |
| 257 | openat | (dirfd, path, flags, mode) | Open relative to directory fd |
| 293 | pipe2 | (pipefd, flags) | Create pipe with flags (O_CLOEXEC, O_NONBLOCK) |
Directory Operations (sys_dir.c, sys_meta.c)
| Number | Name | Signature | Description |
|---|---|---|---|
| 79 | getcwd | (buf, size) | Get current working directory |
| 80 | chdir | (path) | Change working directory |
| 82 | rename | (oldpath, newpath) | Rename file |
| 83 | mkdir | (path, mode) | Create directory |
| 87 | unlink | (path) | Remove file |
| 88 | symlink | (target, linkpath) | Create symbolic link |
| 89 | readlink | (path, buf, bufsiz) | Read symbolic link target |
| 90 | chmod | (path, mode) | Change file permissions |
| 91 | fchmod | (fd, mode) | Change permissions by fd |
| 92 | chown | (path, owner, group) | Change file ownership |
| 93 | fchown | (fd, owner, group) | Change ownership by fd |
| 94 | lchown | (path, owner, group) | Change symlink ownership |
| 162 | sync | () | Flush all dirty blocks to disk |
| 217 | getdents64 | (fd, dirp, count) | Read directory entries |
Memory Management (sys_memory.c)
| Number | Name | Signature | Description |
|---|---|---|---|
| 9 | mmap | (addr, len, prot, flags, fd, offset) | Map pages into address space |
| 10 | mprotect | (addr, len, prot) | Change page protections |
| 11 | munmap | (addr, len) | Unmap pages |
| 12 | brk | (addr) | Set/query heap break (0 = query) |
| 319 | memfd_create | (name, flags) | Create anonymous shared memory fd |
sys_brk details: Page-aligns upward. Zeroes new pages (Linux guarantee). Rejects shrink below brk_base to prevent freeing ELF segments. On OOM, returns current break unchanged.
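The break-adjustment rules just listed can be sketched as a pure function over a small state struct. Everything here is a model: struct brk_state, the limit field (a stand-in for the point where page allocation would fail), and the function name are hypothetical, and the mapping and zeroing of pages is only indicated by a comment.

```c
#include <stdint.h>

#define PAGE_SIZE UINT64_C(4096)

/* Hypothetical per-process heap state. */
struct brk_state {
    uint64_t brk_base;  /* lowest legal break (end of the ELF segments) */
    uint64_t brk;       /* current break, page-aligned */
    uint64_t limit;     /* page-aligned stand-in for "out of memory" */
};

/* Returns the break after the request, Linux-style: on any refusal
 * the current break is returned unchanged. */
static uint64_t brk_sketch(struct brk_state *s, uint64_t addr)
{
    if (addr == 0)
        return s->brk;               /* query */
    if (addr < s->brk_base)
        return s->brk;               /* would free ELF segments */
    if (addr > s->limit)
        return s->brk;               /* models OOM */

    /* Page-align upward. A real kernel would map and zero the pages
     * in [s->brk, aligned) here when the heap grows. */
    uint64_t aligned = (addr + PAGE_SIZE - 1) & ~(PAGE_SIZE - 1);
    s->brk = aligned;
    return s->brk;
}
```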
sys_mmap details: Supports MAP_ANONYMOUS | MAP_PRIVATE (bump allocator from mmap_base at 0x700000000000), MAP_FIXED, and file-backed MAP_SHARED (memfd). Recycles VAs from munmap freelist before bumping.
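The allocation order for anonymous mappings (recycle from the munmap freelist first, bump otherwise) could be modelled like this. MMAP_BASE is the address given in the text; the freelist capacity, the first-fit policy, and all type and function names are assumptions made for the sketch, not details of the Aegis allocator.

```c
#include <stddef.h>
#include <stdint.h>

#define PAGE_SIZE    UINT64_C(4096)
#define MMAP_BASE    UINT64_C(0x700000000000)  /* from the text */
#define FREELIST_MAX 16                        /* arbitrary for the sketch */

struct va_range { uint64_t addr, len; };

struct mmap_state {
    uint64_t next;                       /* bump pointer */
    struct va_range free[FREELIST_MAX];  /* ranges returned by munmap */
    size_t nfree;
};

static uint64_t page_round_up(uint64_t n)
{
    return (n + PAGE_SIZE - 1) & ~(PAGE_SIZE - 1);
}

/* Pick a virtual address range: first fit from the freelist, else bump. */
static uint64_t va_alloc(struct mmap_state *m, uint64_t len)
{
    len = page_round_up(len);
    for (size_t i = 0; i < m->nfree; i++) {
        if (m->free[i].len >= len) {
            uint64_t addr = m->free[i].addr;
            m->free[i].addr += len;
            m->free[i].len -= len;
            if (m->free[i].len == 0)     /* range used up: drop the entry */
                m->free[i] = m->free[--m->nfree];
            return addr;
        }
    }
    uint64_t addr = m->next;
    m->next += len;
    return addr;
}

/* Record an unmapped range for reuse; silently dropped if the list is full. */
static void va_free(struct mmap_state *m, uint64_t addr, uint64_t len)
{
    if (m->nfree < FREELIST_MAX)
        m->free[m->nfree++] = (struct va_range){ addr, page_round_up(len) };
}
```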
Process Management (sys_process.c, sys_exec.c)
| Number | Name | Signature | Description |
|---|---|---|---|
| 56 | clone | (frame, flags, stack, ptid, ctid, tls) | Create thread or process |
| 57 | fork | (frame) | Duplicate process (deep copy) |
| 59 | execve | (frame, path, argv, envp) | Replace process image |
| 60 | exit | (status) | Terminate calling thread |
| 61 | waitpid | (pid, wstatus, options) | Wait for child state change |
| 231 | exit_group | (status) | Terminate all threads in group |
| 514 | spawn | (path, argv, envp, stdio_fd, cap_mask) | Aegis-specific spawn with capability restriction |
See Processes & ELF Loading for detailed fork/clone/execve documentation.
Identity & Session (sys_identity.c)
| Number | Name | Signature | Description |
|---|---|---|---|
| 39 | getpid | () | Returns TGID (thread group ID) |
| 63 | uname | (buf) | Get system identification |
| 95 | umask | (mask) | Set file creation mask |
| 97 | getrlimit | (resource, rlim) | Get resource limits |
| 102 | getuid | () | Get real user ID |
| 104 | getgid | () | Get real group ID |
| 105 | setuid | (uid) | Set user ID (requires CAP_KIND_SETUID) |
| 106 | setgid | (gid) | Set group ID (requires CAP_KIND_SETUID) |
| 107 | geteuid | () | Get effective user ID (= uid, no SUID) |
| 108 | getegid | () | Get effective group ID (= gid) |
| 109 | setpgid | (pid, pgid) | Set process group ID |
| 110 | getppid | () | Get parent PID |
| 111 | getpgrp | () | Get process group (= getpgid(0)) |
| 112 | setsid | () | Create new session |
| 121 | getpgid | (pid) | Get process group of specific PID |
| 158 | arch_prctl | (code, addr) | Set/get FS.base (TLS pointer) |
| 186 | gettid | () | Get thread ID (= PID, unique per thread) |
| 218 | set_tid_address | (tidptr) | Set clear_child_tid pointer |
| 273 | set_robust_list | (head, len) | Set robust futex list (stub, returns 0) |
sys_arch_prctl details:
- ARCH_SET_FS (0x1002): Set FS.base for TLS. Rejects kernel addresses. Writes IA32_FS_BASE MSR and saves to task->fs_base.
- ARCH_GET_FS (0x1003): Write current fs_base to user pointer.
Signal Handling (sys_signal.c)
| Number | Name | Signature | Description |
|---|---|---|---|
| 13 | rt_sigaction | (signum, act, oldact, sigsetsize) | Install signal handler |
| 14 | rt_sigprocmask | (how, set, oldset, sigsetsize) | Modify signal mask |
| 15 | rt_sigreturn | (frame) | Return from signal handler |
| 62 | kill | (pid, sig) | Send signal to process |
| 130 | rt_sigsuspend | (mask, sigsetsize) | Wait for signal with temporary mask |
Signal delivery is checked at syscall return (SYSRET path). If a signal is pending and not masked, the kernel builds a signal frame on the user stack and redirects SYSRET to the handler. rt_sigreturn restores the original user context.
SIGKILL, SIGSTOP, and SIGCONT cannot be caught or ignored.
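The "pending and not masked" test on the return path reduces to a bitmask operation. The sketch below assumes the Linux convention that signal N is bit N-1 of a 64-bit set and that the lowest-numbered ready signal is delivered first; neither detail is stated in this document, and the function name is illustrative.

```c
#include <stdint.h>

/* Hypothetical helper: returns the lowest-numbered signal that is
 * pending and not blocked, or 0 if nothing is deliverable.
 * Assumes signal N is represented by bit N-1. */
static int next_deliverable_signal(uint64_t pending, uint64_t blocked)
{
    uint64_t ready = pending & ~blocked;

    for (int sig = 1; sig <= 64; sig++) {
        if (ready & (UINT64_C(1) << (sig - 1)))
            return sig;
    }
    return 0;
}
```

A nonzero result is the point at which the kernel would build the signal frame on the user stack and rewrite the return RIP in the saved syscall frame.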
Time (sys_time.c)
| Number | Name | Signature | Description |
|---|---|---|---|
| 35 | nanosleep | (req, rem) | Sleep for specified duration |
| 227 | clock_settime | (clk_id, tp) | Set clock |
| 228 | clock_gettime | (clk_id, tp) | Get clock time (CLOCK_REALTIME, CLOCK_MONOTONIC) |
sys_nanosleep: Computes a sleep_deadline from the PIT tick counter, sets task->sleep_deadline, and calls sched_block(). sched_tick wakes the task when the deadline passes.
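The deadline arithmetic can be sketched as follows. The tick rate (TICK_HZ = 100) is an assumption, since the document does not state the PIT frequency Aegis programs, and both helper names are hypothetical. Rounding the sub-second part up ensures the task never sleeps for fewer ticks than requested.

```c
#include <stdint.h>

#define TICK_HZ      UINT64_C(100)         /* assumed tick rate */
#define NSEC_PER_SEC UINT64_C(1000000000)

/* Hypothetical helper: convert a (sec, nsec) request into an absolute
 * tick deadline, rounding partial ticks up. */
static uint64_t sleep_deadline(uint64_t now_ticks, uint64_t sec, uint64_t nsec)
{
    uint64_t nsec_per_tick = NSEC_PER_SEC / TICK_HZ;
    uint64_t ticks = sec * TICK_HZ
                   + (nsec + nsec_per_tick - 1) / nsec_per_tick;
    return now_ticks + ticks;
}

/* The per-tick wake-up test a scheduler tick handler would apply. */
static int deadline_passed(uint64_t now_ticks, uint64_t deadline)
{
    return now_ticks >= deadline;
}
```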
Networking (sys_socket.c)
| Number | Name | Signature | Description |
|---|---|---|---|
| 7 | poll | (fds, nfds, timeout) | Wait for events on fds |
| 23 | select | (nfds, rfds, wfds, efds, timeout) | Synchronous I/O multiplexing |
| 41 | socket | (domain, type, protocol) | Create socket (AF_INET, AF_UNIX) |
| 42 | connect | (fd, addr, addrlen) | Connect to remote address |
| 43 | accept | (fd, addr, addrlen) | Accept incoming connection |
| 44 | sendto | (fd, buf, len, flags, addr, addrlen) | Send datagram |
| 45 | recvfrom | (fd, buf, len, flags, addr, addrlen) | Receive datagram |
| 46 | sendmsg | (fd, msg, flags) | Send message with ancillary data |
| 47 | recvmsg | (fd, msg, flags) | Receive message with ancillary data |
| 48 | shutdown | (fd, how) | Shut down socket |
| 49 | bind | (fd, addr, addrlen) | Bind socket to address |
| 50 | listen | (fd, backlog) | Mark socket as passive |
| 51 | getsockname | (fd, addr, addrlen) | Get socket local address |
| 52 | getpeername | (fd, addr, addrlen) | Get socket peer address |
| 53 | socketpair | (domain, type, proto, sv) | Create connected socket pair |
| 54 | setsockopt | (fd, level, optname, optval, optlen) | Set socket option |
| 55 | getsockopt | (fd, level, optname, optval, optlen) | Get socket option |
| 291 | epoll_create1 | (flags) | Create epoll instance |
| 232 | epoll_wait | (epfd, events, maxevents, timeout) | Wait for epoll events |
| 233 | epoll_ctl | (epfd, op, fd, event) | Control epoll instance |
See Network Stack and Socket API for details.
Synchronization (futex.c)
| Number | Name | Signature | Description |
|---|---|---|---|
| 202 | futex | (addr, op, val, timeout, addr2, val3) | Fast userspace mutex |
Supports FUTEX_WAIT (block if *addr == val) and FUTEX_WAKE (wake up to val waiters). The FUTEX_PRIVATE_FLAG (128) is accepted and ignored (single address space per process). Used internally for clear_child_tid on thread exit (pthread_join).
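The decision logic for the two supported operations can be modelled without a scheduler. In this sketch the op constants are the standard Linux values; returning -EAGAIN when the value no longer matches follows Linux behaviour and is assumed, not confirmed, for Aegis. The function name and the convention of returning 1 for "caller would block" are purely illustrative.

```c
#include <errno.h>
#include <stdint.h>

#define FUTEX_WAIT         0
#define FUTEX_WAKE         1
#define FUTEX_PRIVATE_FLAG 128

/* Hypothetical model of the futex decision:
 *   WAIT: 1 if the caller would block (*addr == val), else -EAGAIN
 *   WAKE: number of waiters that would be woken (at most val)
 *   anything else: -ENOSYS */
static long futex_decide(const uint32_t *addr, int op, uint32_t val,
                         uint32_t waiters)
{
    switch (op & ~FUTEX_PRIVATE_FLAG) {  /* private flag accepted, ignored */
    case FUTEX_WAIT:
        return (*addr == val) ? 1 : -EAGAIN;
    case FUTEX_WAKE:
        return (val < waiters) ? (long)val : (long)waiters;
    default:
        return -ENOSYS;
    }
}
```

The value comparison in FUTEX_WAIT is what closes the race between a user-space check of the lock word and the call into the kernel.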
Random (sys_random.c)
| Number | Name | Signature | Description |
|---|---|---|---|
| 318 | getrandom | (buf, buflen, flags) | Get random bytes |
Block Device / Framebuffer (sys_disk.c)
| Number | Name | Signature | Description |
|---|---|---|---|
| 510 | blkdev_list | (buf, buflen) | List block devices |
| 511 | blkdev_io | (dev, op, lba, buf, count) | Raw block device I/O |
| 512 | gpt_rescan | (dev) | Rescan GPT partition table |
| 513 | fb_map | (info_buf) | Map framebuffer into user space |
Capability Management (sys_cap.c)
| Number | Name | Signature | Description |
|---|---|---|---|
| 362 | cap_query | (pid, buf, buflen) | Query process capabilities |
| 363 | cap_grant_runtime | (target_pid, kind, rights) | Grant capability at runtime |
| 364 | auth_session | () | Mark session as authenticated |
Aegis Extensions
| Number | Name | Signature | Description |
|---|---|---|---|
| 169 | reboot | (cmd) | Shutdown/reboot system (requires CAP_KIND_POWER) |
| 360 | setfg | (pgrp) | Set terminal foreground process group |
| 500 | netcfg | (op, arg1, arg2, arg3) | Network configuration (requires CAP_KIND_NET_ADMIN) |
Syscall numbers 360-364 and 500-514 are Aegis-specific extensions with no Linux counterpart; reboot (169) keeps its Linux number.
Error Handling
Syscalls return negative errno values on failure (Linux convention):
| Error | Value | Common Cause |
|---|---|---|
| EPERM | -1 | Operation not permitted |
| ENOENT | -2 | File not found |
| ESRCH | -3 | No such process |
| EIO | -5 | I/O error |
| ENOEXEC | -8 | Not an executable |
| EBADF | -9 | Bad file descriptor |
| EAGAIN | -11 | Process limit reached / would block |
| ENOMEM | -12 | Out of memory |
| EACCES | -13 | Permission denied |
| EFAULT | -14 | Bad user pointer |
| ENOSYS | -38 | Syscall not implemented |
| ENOCAP | -130 | Capability check failed (Aegis-specific) |
All unrecognized syscall numbers return ENOSYS.
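On the user-space side, a libc wrapper turns this raw return value into the familiar "-1 and errno" form. The sketch below follows the common pattern of treating the top 4095 values of the unsigned range as error codes; the function name is illustrative, and ENOCAP is defined locally because it is not a standard errno constant.

```c
#include <errno.h>

#define ENOCAP 130  /* Aegis-specific, from the table above */

/* Hypothetical wrapper tail: convert a raw kernel return value into
 * the C library convention. Values in [-4095, -1] are errors; anything
 * else (including high mmap addresses) is a successful result. */
static long syscall_ret(unsigned long raw)
{
    if (raw > -4096UL) {
        errno = (int)-raw;
        return -1;
    }
    return (long)raw;
}
```

A failed capability check therefore surfaces to the caller as a return of -1 with errno set to 130.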
Implementation Structure
The syscall implementation is split across multiple translation units for organization:
| File | Category |
|---|---|
| syscall.c | Dispatch table and ARM64 translation |
| sys_io.c | read, write, writev, close |
| sys_file.c | open, stat, fstat, ioctl, fcntl, lseek, pipe, dup |
| sys_dir.c | getdents64, mkdir, unlink, rename |
| sys_meta.c | lstat, symlink, readlink, chmod, chown |
| sys_memory.c | brk, mmap, munmap, mprotect, memfd_create, ftruncate |
| sys_process.c | exit, exit_group, clone, fork, waitpid |
| sys_exec.c | execve, spawn |
| sys_identity.c | getpid, setuid, setpgid, setsid, arch_prctl, uname |
| sys_signal.c | rt_sigaction, rt_sigprocmask, rt_sigreturn, kill, sigsuspend |
| sys_time.c | nanosleep, clock_gettime, clock_settime |
| sys_socket.c | All socket/networking syscalls, poll, select, epoll |
| sys_cap.c | cap_query, cap_grant_runtime, auth_session |
| sys_random.c | getrandom |
| sys_disk.c | blkdev_list, blkdev_io, gpt_rescan, fb_map |
| futex.c | futex wait/wake |
All files include sys_impl.h which provides shared types, constants, and forward declarations.
Related Documentation
- Processes & ELF Loading – fork, clone, execve details
- Scheduler – task states and blocking
- Interrupts & Exceptions – ISR entry path (compared with SYSCALL path)
- Capability model – capability table and capability kinds
- Security policy engine – policy capabilities and baseline capabilities
- Network Stack – socket implementation details