Virtual File System (VFS)
Architecture and implementation of the Aegis VFS layer, including the operations vtable, file descriptor model, path resolution, and backend dispatch.
The Aegis VFS layer provides a unified interface for file operations across eight distinct filesystem backends (initrd, ext2, ramfs, procfs, pipe, memfd, console, PTY). Unlike monolithic VFS designs (Linux, BSD) that use dcache/inode-cache hierarchies, Aegis employs a flat, prefix-based dispatch model: vfs_open matches the path prefix to a backend and delegates directly. There is no mount table, no dentry tree, and no inode cache outside of the ext2 block cache.
Source: kernel/fs/vfs.c, kernel/fs/vfs.h
v1 note: The VFS layer is v1 software – functional and tested, but not hardened against adversarial input. Contributions are welcome – file issues or propose changes at exec/aegis.
Core Data Structures
vfs_ops_t – Operations Vtable
Every open file carries a pointer to a vfs_ops_t structure that defines its driver’s behavior. This is the central abstraction of the VFS layer.
typedef struct {
int (*read)(void *priv, void *buf, uint64_t off, uint64_t len);
int (*write)(void *priv, const void *buf, uint64_t len);
void (*close)(void *priv);
int (*readdir)(void *priv, uint64_t index, char *name_out, uint8_t *type_out);
void (*dup)(void *priv);
int (*stat)(void *priv, k_stat_t *st);
uint16_t (*poll)(void *priv);
} vfs_ops_t;
| Callback | Semantics |
|---|---|
| read | Copy up to len bytes starting at off into buf (kernel buffer). Returns bytes copied, 0 for EOF, negative for error. |
| write | Copy len bytes from user-space buf to the backend. Returns bytes written or negative errno. NULL = not writable. |
| close | Release driver-side resources. Called when the last reference to this fd is dropped. |
| readdir | Fill name_out (>= 256 bytes) and type_out with the entry at index. Returns 0 on success, -1 past end. DT_REG=8, DT_DIR=4. |
| dup | Called on dup/dup2/fork to increment driver-side reference counts. NULL = stateless driver. |
| stat | Fill *st with file metadata. NULL = sys_fstat synthesizes a minimal stat. |
| poll | Return current readiness bitmask (POLLIN/POLLOUT/POLLHUP/POLLERR). NULL = caller assumes POLLIN\|POLLOUT. |
Note that write does not take an offset parameter. Backends that support positional writes (ext2) maintain a write_offset in their private state. This is a deliberate simplification: Aegis does not support pwrite(2).
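To make the vtable contract concrete, here is a minimal user-space sketch of a backend that fills in only read and write and leaves the other callbacks NULL. Everything prefixed demo_ and the helper vfs_poll_or_default are hypothetical names, k_stat_t is reduced to a stand-in, and memcpy takes the place of the kernel's user-copy routine. It shows the write offset living in the private state rather than in the call signature.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Stand-in for the kernel's k_stat_t; only one field is modeled here. */
typedef struct { int64_t st_size; } k_stat_t;

/* Same callback signatures as the vfs_ops_t listing above. */
typedef struct {
    int      (*read)(void *priv, void *buf, uint64_t off, uint64_t len);
    int      (*write)(void *priv, const void *buf, uint64_t len);
    void     (*close)(void *priv);
    int      (*readdir)(void *priv, uint64_t index, char *name_out, uint8_t *type_out);
    void     (*dup)(void *priv);
    int      (*stat)(void *priv, k_stat_t *st);
    uint16_t (*poll)(void *priv);
} vfs_ops_t;

#define POLLIN  0x001
#define POLLOUT 0x004

/* Hypothetical backend state: a fixed buffer plus its own write offset. */
typedef struct {
    uint8_t  data[64];
    uint64_t size;
    uint64_t write_offset;
} demo_priv_t;

static int demo_read(void *priv, void *buf, uint64_t off, uint64_t len)
{
    demo_priv_t *p = priv;
    assert(p != NULL && buf != NULL);
    if (off >= p->size) return 0;                  /* EOF */
    if (len > p->size - off) len = p->size - off;  /* clamp to available bytes */
    memcpy(buf, p->data + off, len);
    return (int)len;
}

static int demo_write(void *priv, const void *buf, uint64_t len)
{
    demo_priv_t *p = priv;
    uint64_t room = sizeof(p->data) - p->write_offset;
    if (len > room) len = room;                    /* short write when full */
    memcpy(p->data + p->write_offset, buf, len);   /* kernel code would copy from user space */
    p->write_offset += len;
    if (p->write_offset > p->size) p->size = p->write_offset;
    return (int)len;
}

/* close, readdir, dup, stat and poll are left NULL. */
static const vfs_ops_t demo_ops = {
    .read  = demo_read,
    .write = demo_write,
};

/* Caller-side fallback for a NULL poll callback, as the table describes. */
static uint16_t vfs_poll_or_default(const vfs_ops_t *ops, void *priv)
{
    return ops->poll ? ops->poll(priv) : (uint16_t)(POLLIN | POLLOUT);
}
```

Two consecutive writes through demo_ops append to each other because the offset advances inside demo_priv_t, while reads stay positional through the off argument.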
vfs_file_t – Open File Descriptor
typedef struct {
const vfs_ops_t *ops; /* NULL = free slot */
void *priv; /* driver-private data */
uint64_t offset; /* current read position */
uint64_t size; /* file size in bytes; 0 for devices/directories */
uint32_t flags; /* O_RDONLY(0)/O_WRONLY(1)/O_RDWR(2)/O_NONBLOCK */
uint32_t _pad;
} vfs_file_t;
/* Size: 40 bytes (enforced by _Static_assert) */
The ops pointer doubles as an “in-use” flag: a slot with ops == NULL is free. Each process has a fixed-size fd table of PROC_MAX_FDS (16) vfs_file_t entries.
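The free-slot convention can be sketched as a linear scan over the table. The helpers fd_alloc and fd_free are hypothetical, the vfs_ops struct is a placeholder so the example compiles on its own, and the -1 return for a full table is an assumption, since the real error code is not stated here. The size check assumes a 64-bit target.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PROC_MAX_FDS 16

/* Placeholder vtable type; the real vfs_ops_t holds the callbacks. */
typedef struct vfs_ops { int unused; } vfs_ops_t;

typedef struct {
    const vfs_ops_t *ops;    /* NULL = free slot */
    void            *priv;
    uint64_t         offset;
    uint64_t         size;
    uint32_t         flags;
    uint32_t         _pad;
} vfs_file_t;

static_assert(sizeof(vfs_file_t) == 40, "vfs_file_t must be 40 bytes");

/* Hypothetical helper: claim the lowest free slot, or return -1 if none. */
static int fd_alloc(vfs_file_t table[PROC_MAX_FDS], const vfs_ops_t *ops,
                    void *priv, uint32_t flags)
{
    assert(ops != NULL);  /* a NULL ops would leave the slot marked free */
    for (int fd = 0; fd < PROC_MAX_FDS; fd++) {
        if (table[fd].ops == NULL) {
            table[fd] = (vfs_file_t){ .ops = ops, .priv = priv, .flags = flags };
            return fd;
        }
    }
    return -1;
}

/* Hypothetical helper: releasing a slot only needs to clear ops. */
static void fd_free(vfs_file_t table[PROC_MAX_FDS], int fd)
{
    assert(fd >= 0 && fd < PROC_MAX_FDS);
    table[fd].ops = NULL;
}
```

With this scheme a freed descriptor number is the first one handed out again, which matches the usual lowest-available-fd behavior of POSIX systems.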
k_stat_t – Kernel Stat Structure
typedef struct {
uint64_t st_dev; /* 0 */
uint64_t st_ino; /* 8 */
uint64_t st_nlink; /* 16 */
uint32_t st_mode; /* 24 */
uint32_t st_uid; /* 28 */
uint32_t st_gid; /* 32 */
uint32_t __pad0; /* 36 */
uint64_t st_rdev; /* 40 */
int64_t st_size; /* 48 */
int64_t st_blksize; /* 56 */
int64_t st_blocks; /* 64 */
int64_t st_atime; /* 72 */
/* ... timestamps and padding to 144 bytes total */
} k_stat_t;
Layout is binary-compatible with Linux x86-64 struct stat (musl libc). This is enforced by a _Static_assert(sizeof(k_stat_t) == 144).
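A layout like this can be pinned down at compile time with offset checks in addition to the size check. The sketch below copies only the fields shown in the listing and stands in an opaque tail array for the elided remainder, so k_stat_sketch_t and tail are illustrative names, not the kernel's definitions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint64_t st_dev;
    uint64_t st_ino;
    uint64_t st_nlink;
    uint32_t st_mode;
    uint32_t st_uid;
    uint32_t st_gid;
    uint32_t __pad0;
    uint64_t st_rdev;
    int64_t  st_size;
    int64_t  st_blksize;
    int64_t  st_blocks;
    int64_t  st_atime;
    uint8_t  tail[64];   /* placeholder for the fields elided in the listing */
} k_stat_sketch_t;

/* Offsets taken from the comments in the listing above. */
static_assert(offsetof(k_stat_sketch_t, st_nlink) == 16, "st_nlink offset");
static_assert(offsetof(k_stat_sketch_t, st_mode)  == 24, "st_mode offset");
static_assert(offsetof(k_stat_sketch_t, st_rdev)  == 40, "st_rdev offset");
static_assert(offsetof(k_stat_sketch_t, st_size)  == 48, "st_size offset");
static_assert(offsetof(k_stat_sketch_t, st_atime) == 72, "st_atime offset");
static_assert(sizeof(k_stat_sketch_t) == 144, "total size");
```

Per-field checks catch a reordering that happens to preserve the total size, which a lone sizeof assertion would miss.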
Open Flags
The VFS defines these flag constants, matching Linux x86-64 values:
| Flag | Value | Description |
|---|---|---|
| VFS_O_CREAT | 0x40 | Create file if it does not exist (ext2 only) |
| VFS_O_TRUNC | 0x200 | Truncate file to zero length on open |
| VFS_O_APPEND | 0x400 | Writes start at end of file |
| VFS_O_NONBLOCK | 0x800 | Non-blocking I/O (pipes, PTY) |
| VFS_O_CLOEXEC | 0x80000 | Close fd on exec (same bit as VFS_FD_CLOEXEC) |
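As an illustration of how these bits combine with the access mode stored in the low two bits (O_RDONLY=0, O_WRONLY=1, O_RDWR=2), the following sketch decodes a flags word into separate booleans. The mask name VFS_O_ACCMODE, the open_intent_t struct and decode_open_flags are hypothetical and exist only for this example.

```c
#include <assert.h>
#include <stdint.h>

/* Flag values from the table above. */
#define VFS_O_CREAT    0x40u
#define VFS_O_TRUNC    0x200u
#define VFS_O_APPEND   0x400u
#define VFS_O_NONBLOCK 0x800u
#define VFS_O_CLOEXEC  0x80000u

/* Hypothetical name for the access-mode mask. */
#define VFS_O_ACCMODE  0x3u

typedef struct {
    int readable, writable, create, truncate, append, nonblock, cloexec;
} open_intent_t;

static open_intent_t decode_open_flags(uint32_t flags)
{
    uint32_t mode = flags & VFS_O_ACCMODE;
    assert(mode != 3);  /* 3 is not one of the documented access modes */
    open_intent_t in = {
        .readable = (mode == 0 || mode == 2),
        .writable = (mode == 1 || mode == 2),
        .create   = (flags & VFS_O_CREAT)    != 0,
        .truncate = (flags & VFS_O_TRUNC)    != 0,
        .append   = (flags & VFS_O_APPEND)   != 0,
        .nonblock = (flags & VFS_O_NONBLOCK) != 0,
        .cloexec  = (flags & VFS_O_CLOEXEC)  != 0,
    };
    return in;
}
```

For example, 0x241 is O_WRONLY combined with VFS_O_CREAT and VFS_O_TRUNC, the combination a shell typically passes for output redirection.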
Path Resolution: vfs_open
vfs_open(path, flags, *out) is the central path resolution function, called by sys_open. It uses prefix matching to dispatch to the correct backend, checking each in strict priority order:
┌─────────────────────────────────────────────────────────┐
│ vfs_open(path) │
│ │
│ 1. /dev/ptmx ──→ ptmx_open() [PTY] │
│ 2. /dev/pts/N ──→ pts_open(N) [PTY] │
│ 3. /proc/... ──→ procfs_open() [procfs] │
│ 4. /dev/... ──→ initrd_open() [devices] │
│ 5. /tmp/... ──→ ramfs (tmp) [volatile] │
│ 6. /run/... ──→ ramfs (run) [volatile] │
│ 7. ext2 primary ──→ ext2_open() [disk] │
│ 8. initrd fallback ──→ initrd_open() [boot ROM] │
│ │
│ Returns: 0 success, -2 ENOENT, -12 ENOMEM, -13 EACCES │
└─────────────────────────────────────────────────────────┘
Path prefix matching is done byte-by-byte with inline comparisons (no strcmp). This avoids external dependencies – the kernel has no libc.
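The priority order in the diagram can be sketched as a chain of prefix checks. This is a simplified user-space model, not the code in vfs.c: backend_t, has_prefix, is_exact and classify_path are hypothetical names, and the last case only records that ext2 is tried before the initrd fallback.

```c
#include <assert.h>
#include <stddef.h>

typedef enum {
    BACKEND_PTMX,
    BACKEND_PTS,
    BACKEND_PROCFS,
    BACKEND_DEV,
    BACKEND_TMP,
    BACKEND_RUN,
    BACKEND_EXT2_THEN_INITRD
} backend_t;

/* Byte-by-byte prefix test; stops at the first mismatch, so it never
 * reads past the terminating NUL of a short path. */
static int has_prefix(const char *path, const char *prefix)
{
    while (*prefix != '\0') {
        if (*path++ != *prefix++) return 0;
    }
    return 1;
}

/* Byte-by-byte equality test for whole-path matches. */
static int is_exact(const char *path, const char *s)
{
    while (*s != '\0' && *path == *s) { path++; s++; }
    return *s == '\0' && *path == '\0';
}

/* Checks run in the same order as the diagram, so the more specific
 * /dev/ptmx and /dev/pts/ cases win over the generic /dev/ case. */
static backend_t classify_path(const char *path)
{
    assert(path != NULL);
    if (is_exact(path, "/dev/ptmx"))   return BACKEND_PTMX;
    if (has_prefix(path, "/dev/pts/")) return BACKEND_PTS;
    if (has_prefix(path, "/proc/"))    return BACKEND_PROCFS;
    if (has_prefix(path, "/dev/"))     return BACKEND_DEV;
    if (has_prefix(path, "/tmp/"))     return BACKEND_TMP;
    if (has_prefix(path, "/run/"))     return BACKEND_RUN;
    return BACKEND_EXT2_THEN_INITRD;
}
```

Because the prefixes include the trailing slash, a path such as /device falls through to the ext2 lookup rather than being claimed by the /dev/ case.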
Priority Order Details
- PTY master (/dev/ptmx): Allocates a new pseudo-terminal pair. Returns a master fd; the slave is opened via /dev/pts/N.
- PTY slave (/dev/pts/N): Opens the slave end of PTY index N. The index is parsed as a decimal integer from the path.
- procfs (/proc/...): Delegated to procfs_open() with the prefix stripped. See procfs documentation.
- Device files (/dev/...): The initrd handles /dev/tty, /dev/urandom, /dev/random, /dev/mouse, and directory listings. /dev/null is handled only by vfs_stat_path as a synthetic chardev (1:3) – there is no open(2) backing for it.
- Tmp ramfs (/tmp/...): In-memory volatile storage backed by a static ramfs_t instance (s_tmp_ramfs).
- Run ramfs (/run/...): Same as tmp, separate instance (s_run_ramfs). Used for runtime state (PID files, sockets).
- ext2 primary: The writable root filesystem on NVMe. If the file exists, performs DAC permission checks and capability gating before returning an fd. If the file does not exist and O_CREAT is set, creates it.
- initrd fallback: Read-only boot files compiled into the kernel image. Only reached if ext2 lookup fails.
Permission Checks on ext2 Open
When ext2 resolves a file for a user process (sched_current()->is_user), two layers of access control are applied:
DAC (Discretionary Access Control):
int want = 4; /* R_OK by default (O_RDONLY=0) */
if (flags & 1) want = 2; /* O_WRONLY */
if (flags & 2) want = 4 | 2; /* O_RDWR */
int perm = ext2_check_perm(ino, pr->uid, pr->gid, want);
if (perm != 0) return -13; /* EACCES */
ext2_check_perm implements standard POSIX owner/group/other permission bit matching with no root bypass – uid 0 gets no special treatment. This is a deliberate design choice aligned with the capability model: privilege is granted through capabilities, not uid.
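The owner/group/other matching described above can be sketched as follows. The real ext2_check_perm takes an inode number and reads the mode and ownership from disk; this hypothetical check_perm_sketch takes them as arguments so it can run standalone, and it does not model supplementary groups.

```c
#include <assert.h>
#include <stdint.h>

/* Same mapping from open flags to the wanted permission bits as above. */
static int want_from_flags(uint32_t flags)
{
    int want = 4;                    /* read, for O_RDONLY (0) */
    if (flags & 1) want = 2;         /* O_WRONLY */
    if (flags & 2) want = 4 | 2;     /* O_RDWR */
    return want;
}

/* Returns 0 if every bit in want is granted, -1 otherwise.
 * Exactly one permission class applies, and uid 0 is not special. */
static int check_perm_sketch(uint16_t mode,
                             uint32_t file_uid, uint32_t file_gid,
                             uint32_t uid, uint32_t gid, int want)
{
    int granted;

    assert((want & ~7) == 0);
    if (uid == file_uid)
        granted = (mode >> 6) & 7;   /* owner class */
    else if (gid == file_gid)
        granted = (mode >> 3) & 7;   /* group class */
    else
        granted = mode & 7;          /* other class */

    return ((granted & want) == want) ? 0 : -1;
}
```

With a file of mode 0600 owned by uid 1000, a read request from uid 0 lands in the other class and is refused, which is the no-root-bypass behavior in miniature.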
Capability gating (post-symlink resolution):
uint32_t shadow_ino = ext2_get_shadow_ino();
if (shadow_ino != 0 && ino == shadow_ino && sched_current()->is_user) {
if (cap_check(pr->caps, CAP_TABLE_SIZE, CAP_KIND_AUTH, CAP_RIGHTS_READ) != 0)
return -13; /* EACCES */
}
/etc/shadow requires the CAP_KIND_AUTH capability kind with the CAP_RIGHTS_READ rights bitfield even for uid 0. The cap_check function validates the process’s capability table (the per-process caps array) for the required capability kind and rights. The check compares the resolved inode number against the shadow inode recorded at mount time, so symlink-based bypasses (ln -s /etc/shadow /tmp/x; open("/tmp/x")) are ineffective.
O_CREAT Path
When a file is not found on ext2 and O_CREAT is set:
- ext2_lookup_parent() resolves the parent directory
- DAC check requires write + execute on the parent directory
- ext2_create(path, 0644) allocates an inode and adds a directory entry
- The new file is opened and returned
Stat Resolution: vfs_stat_path
vfs_stat_path(path, *out) fills a k_stat_t for the given path. It follows a similar priority order to vfs_open but with additional synthetic entries:
| Path Pattern | Source | st_dev |
|---|---|---|
| /proc/... | procfs_stat | 5 |
| /dev/console, /dev/tty, etc. | Synthetic chardev | 1 |
| /dev/urandom, /dev/random | Synthetic chardev (1:9) | 1 |
| /dev/null | Synthetic chardev (1:3) | 1 |
| /dev/mouse | Synthetic chardev (13:0) | 1 |
| /dev/ptmx | Synthetic chardev (5:2) | 1 |
| /dev/pts/N | Synthetic chardev (136:0) | 1 |
| /dev, /proc, /tmp, /run | Synthetic directory (S_IFDIR\|0555) | 1 |
| /tmp/... | ramfs_stat (tmp) | 3 |
| /run/... | ramfs_stat (run) | 3 |
| ext2 files | Disk inode | 2 |
| initrd files | initrd_stat_entry | 1 |
The st_dev values are synthetic device numbers assigned per-backend: 1 for initrd/devices, 2 for ext2 (NVMe), 3 for ramfs, 5 for procfs.
Symlink-Aware Stat: vfs_stat_path_ex
int vfs_stat_path_ex(const char *path, k_stat_t *out, int follow);
- follow=1: Follow symlinks on the final component (stat behavior)
- follow=0: Do not follow (lstat behavior)
Non-ext2 paths delegate to vfs_stat_path directly since those filesystems have no symlinks. For ext2 paths, ext2_open_ex is used with the follow parameter.
Metadata Operations
vfs_fchmod
int vfs_fchmod(vfs_file_t *f, uint16_t mode);
Changes permission bits on an open ext2 fd. Preserves file type bits (upper 4 bits of i_mode), replaces the lower 12 permission bits. Returns -1 for non-ext2 fds.
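The bit manipulation described here amounts to one mask-and-merge. A minimal sketch, where apply_chmod is a hypothetical helper and not the kernel function:

```c
#include <assert.h>
#include <stdint.h>

#define MODE_TYPE_MASK 0xF000u  /* upper 4 bits: file type */
#define MODE_PERM_MASK 0x0FFFu  /* lower 12 bits: permissions incl. setuid/setgid/sticky */

/* Keep the type bits of the existing i_mode, take permissions from mode. */
static uint16_t apply_chmod(uint16_t i_mode, uint16_t mode)
{
    uint16_t merged = (uint16_t)((i_mode & MODE_TYPE_MASK) | (mode & MODE_PERM_MASK));
    assert((merged & MODE_TYPE_MASK) == (i_mode & MODE_TYPE_MASK));
    return merged;
}
```

For a regular file with i_mode 0100644, applying mode 0755 yields 0100755; any type bits a caller passes in mode are discarded rather than written to the inode.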
vfs_fchown
int vfs_fchown(vfs_file_t *f, uint16_t uid, uint16_t gid);
Changes owner and group on an open ext2 fd. Writes directly to the on-disk inode via ext2_write_inode. Returns -1 for non-ext2 fds.
Filesystem Backends
The VFS dispatches to these backends, each implementing vfs_ops_t:
┌─────────────────────────────────────────────────────────────┐
│ VFS Layer (vfs.c) │
├──────────┬──────────┬──────────┬──────────┬─────────────────┤
│ initrd │ ext2 │ ramfs │ procfs │ special devs │
│ (boot │ (NVMe │ (/tmp, │ (/proc) │ (console, PTY, │
│ files) │ disk) │ /run) │ │ pipe, memfd) │
├──────────┼──────────┼──────────┼──────────┼─────────────────┤
│ R/O │ R/W │ R/W │ R/O │ varies │
│ static │ cached │ volatile │ dynamic │ per-device │
│ data │ block IO │ kva page │ gen-on- │ semantics │
│ │ │ │ open │ │
└──────────┴──────────┴──────────┴──────────┴─────────────────┘
initrd (Boot Image)
Source: kernel/fs/initrd.c
The initrd is a compile-time filesystem embedded in the kernel binary via objcopy --input binary. It provides:
- Static files: /etc/motd, /etc/passwd, /etc/shadow, /etc/profile, /etc/hosts
- Boot binaries: /bin/login, /bin/vigil, /bin/sh, /bin/echo, /bin/cat, /bin/ls
- Service configs: /etc/vigil/services/{getty,httpd,dhcp,chronos}/{run,policy,caps}
- Policy capability files: /etc/aegis/caps.d/{login,bastion,httpd,dhcp,stsh,lumen,installer}
- Device files: /dev/tty (keyboard), /dev/urandom, /dev/random, /dev/mouse
- Directory listings: Static dir_entry_t arrays for /, /etc, /dev, /root, etc.
Files are read-only with zero-copy reads directly from the kernel’s .rodata section. The /bin directory listing is not provided by initrd – it falls through to ext2, which shows all binaries on the disk image.
ramfs (In-Memory Volatile Storage)
Source: kernel/fs/ramfs.c
Two static instances: s_tmp_ramfs for /tmp and s_run_ramfs for /run. Each is a flat array of RAMFS_MAX_FILES entries with a spinlock for concurrent access.
Key limitations:
- Files are stored in a single kva page (4 KB max per file, RAMFS_MAX_SIZE)
- Flat namespace – no subdirectories
- No persistence across reboots
- Writes use copy_from_user for SMAP correctness
Console
Source: kernel/fs/console.c
A write-only character device for /dev/console, /dev/stdout, /dev/stderr. Outputs to serial, VGA, and framebuffer simultaneously. Uses a 256-byte kernel bounce buffer with copy_from_user for SMAP safety. Reports POLLOUT always.
Pipe
Source: kernel/fs/pipe.c
Anonymous pipes with a ring buffer of PIPE_BUF_SIZE bytes (fits in one kva page along with metadata, total sizeof(pipe_t) == 4096). Separate vfs_ops_t for read and write ends with independent reference counting.
- Read end: Blocks if empty and write end is open; returns 0 (EOF) if all writers are gone
- Write end: Blocks if full; delivers SIGPIPE and returns -EPIPE if all readers are gone
- Poll support: read end reports POLLIN when data available or POLLHUP when writers gone; write end reports POLLOUT when space available or POLLERR when readers gone
memfd (Anonymous Shared Memory)
Source: kernel/fs/memfd.c
Anonymous memory-backed file descriptors, primarily used with mmap. Backed by physical pages allocated via PMM. Pool of MEMFD_MAX entries with reference counting. ftruncate allocates or frees physical pages. Write via the fd returns -ENOSYS (mmap is the intended interface).
Initialization
vfs_init() is called from kernel_main before sched_init:
void vfs_init(void)
{
ramfs_init(&s_run_ramfs);
ramfs_init(&s_tmp_ramfs);
printk("[VFS] OK: initialized\n");
initrd_register();
procfs_init();
}
The ext2 filesystem is mounted separately by ext2_mount("nvme0") after the NVMe driver initializes the block device.
Design Decisions
No mount table. Filesystem backends are hardcoded by path prefix. This eliminates mount/unmount complexity and attack surface at the cost of flexibility. New filesystems require kernel code changes.
No inode cache. Each vfs_open re-resolves from the backend. The ext2 block cache (16-slot LRU) provides caching at the block level, but there is no VFS-level inode or dentry cache. This simplifies memory management and avoids cache coherency issues.
No root privilege bypass. DAC checks do not skip for uid 0. All elevated operations require explicit capabilities. This is a core Aegis design principle – see the capability model.
Write offset in priv, not in vfs_file_t. The write callback takes no offset parameter. ext2 maintains write_offset in ext2_fd_priv_t, ramfs overwrites from offset 0, and console/pipe are stream-oriented. This avoids burdening every backend with offset tracking they may not need.
Byte-by-byte path matching. Path prefixes are checked with inline character comparisons rather than strcmp. The kernel has no libc, so this avoids an external dependency. Because the prefixes are short compile-time constants, the compiler can unroll the checks and may be able to merge adjacent byte comparisons into wider ones.