Virtual File System (VFS)

The Aegis VFS layer provides a unified interface for file operations across eight distinct filesystem backends (initrd, ext2, ramfs, procfs, pipe, memfd, console, PTY). Unlike monolithic VFS designs (Linux, BSD) that use dcache/inode-cache hierarchies, Aegis employs a flat, prefix-based dispatch model: vfs_open matches the path prefix to a backend and delegates directly. There is no mount table, no dentry tree, and no inode cache; the only caching is the ext2 block cache.

Source: kernel/fs/vfs.c, kernel/fs/vfs.h

v1 note: The VFS layer is v1 software – functional and tested, but not hardened against adversarial input. Contributions are welcome – file issues or propose changes at exec/aegis.


Core Data Structures

vfs_ops_t – Operations Vtable

Every open file carries a pointer to a vfs_ops_t structure that defines its driver’s behavior. This is the central abstraction of the VFS layer.

typedef struct {
    int      (*read)(void *priv, void *buf, uint64_t off, uint64_t len);
    int      (*write)(void *priv, const void *buf, uint64_t len);
    void     (*close)(void *priv);
    int      (*readdir)(void *priv, uint64_t index, char *name_out, uint8_t *type_out);
    void     (*dup)(void *priv);
    int      (*stat)(void *priv, k_stat_t *st);
    uint16_t (*poll)(void *priv);
} vfs_ops_t;

Callback  Semantics
read      Copy up to len bytes starting at off into buf (kernel buffer). Returns bytes copied, 0 for EOF, negative for error.
write     Copy len bytes from user-space buf to the backend. Returns bytes written or negative errno. NULL = not writable.
close     Release driver-side resources. Called when the last reference to this fd is dropped.
readdir   Fill name_out (>= 256 bytes) and type_out with the entry at index. Returns 0 on success, -1 past end. DT_REG=8, DT_DIR=4.
dup       Called on dup/dup2/fork to increment driver-side reference counts. NULL = stateless driver.
stat      Fill *st with file metadata. NULL = sys_fstat synthesizes a minimal stat.
poll      Return current readiness bitmask (POLLIN/POLLOUT/POLLHUP/POLLERR). NULL = caller assumes POLLIN|POLLOUT.

Note that write does not take an offset parameter. Backends that support positional writes (ext2) maintain a write_offset in their private state. This is a deliberate simplification: Aegis does not support pwrite(2).
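
To make the callback contract concrete, here is a minimal sketch of a backend that fills in only read and write. The membuf_t type, its 64-byte capacity, and the membuf_* names are hypothetical and exist only for illustration; memcpy stands in for the copy_from_user call a real write path would need. It shows the write offset living in priv and the remaining callbacks left NULL.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef struct k_stat k_stat_t;   /* opaque in this sketch */

typedef struct {
    int      (*read)(void *priv, void *buf, uint64_t off, uint64_t len);
    int      (*write)(void *priv, const void *buf, uint64_t len);
    void     (*close)(void *priv);
    int      (*readdir)(void *priv, uint64_t index, char *name_out, uint8_t *type_out);
    void     (*dup)(void *priv);
    int      (*stat)(void *priv, k_stat_t *st);
    uint16_t (*poll)(void *priv);
} vfs_ops_t;

#define MEMBUF_CAP 64             /* hypothetical capacity */

/* Hypothetical driver-private state: a fixed buffer plus a write offset. */
typedef struct {
    uint8_t  data[MEMBUF_CAP];
    uint64_t size;                /* bytes currently stored */
    uint64_t write_offset;        /* kept here because write() takes no offset */
} membuf_t;

static int membuf_read(void *priv, void *buf, uint64_t off, uint64_t len)
{
    membuf_t *m = priv;
    assert(m != NULL && buf != NULL);
    if (off >= m->size)
        return 0;                 /* EOF */
    if (len > m->size - off)
        len = m->size - off;      /* short read at end of data */
    memcpy(buf, m->data + off, len);
    return (int)len;
}

static int membuf_write(void *priv, const void *buf, uint64_t len)
{
    membuf_t *m = priv;
    assert(m != NULL && buf != NULL);
    if (len > MEMBUF_CAP - m->write_offset)
        return -28;               /* -ENOSPC */
    /* A kernel backend would copy from user space here; memcpy stands in. */
    memcpy(m->data + m->write_offset, buf, len);
    m->write_offset += len;
    if (m->write_offset > m->size)
        m->size = m->write_offset;
    return (int)len;
}

/* Callbacks not listed are NULL and take the defaults described above. */
static const vfs_ops_t membuf_ops = {
    .read  = membuf_read,
    .write = membuf_write,
};
```

Two successive writes land back to back because the offset advances inside the driver, while reads remain positional through the off argument.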

vfs_file_t – Open File Descriptor

typedef struct {
    const vfs_ops_t *ops;    /* NULL = free slot */
    void            *priv;   /* driver-private data */
    uint64_t         offset; /* current read position */
    uint64_t         size;   /* file size in bytes; 0 for devices/directories */
    uint32_t         flags;  /* O_RDONLY(0)/O_WRONLY(1)/O_RDWR(2)/O_NONBLOCK */
    uint32_t         _pad;
} vfs_file_t;
/* Size: 40 bytes (enforced by _Static_assert) */

The ops pointer doubles as an “in-use” flag: a slot with ops == NULL is free. Each process has a fixed-size fd table of PROC_MAX_FDS (16) vfs_file_t entries.
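
Because a NULL ops pointer marks a free slot, fd allocation can be a linear scan of the table. The sketch below assumes a 64-bit target (so the struct is 40 bytes), and the fd_alloc helper and its EMFILE return value are hypothetical rather than taken from the Aegis source.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct vfs_ops vfs_ops_t;   /* opaque here; only the pointer matters */

typedef struct {
    const vfs_ops_t *ops;    /* NULL = free slot */
    void            *priv;
    uint64_t         offset;
    uint64_t         size;
    uint32_t         flags;
    uint32_t         _pad;
} vfs_file_t;

_Static_assert(sizeof(vfs_file_t) == 40, "vfs_file_t must be 40 bytes");

#define PROC_MAX_FDS 16

/* Hypothetical helper: lowest free fd, or -24 (-EMFILE) if the table is full. */
static int fd_alloc(const vfs_file_t table[PROC_MAX_FDS])
{
    assert(table != NULL);
    for (int fd = 0; fd < PROC_MAX_FDS; fd++) {
        if (table[fd].ops == NULL)
            return fd;
    }
    return -24;
}
```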

k_stat_t – Kernel Stat Structure

typedef struct {
    uint64_t st_dev;        /*   0 */
    uint64_t st_ino;        /*   8 */
    uint64_t st_nlink;      /*  16 */
    uint32_t st_mode;       /*  24 */
    uint32_t st_uid;        /*  28 */
    uint32_t st_gid;        /*  32 */
    uint32_t __pad0;        /*  36 */
    uint64_t st_rdev;       /*  40 */
    int64_t  st_size;       /*  48 */
    int64_t  st_blksize;    /*  56 */
    int64_t  st_blocks;     /*  64 */
    int64_t  st_atime;      /*  72 */
    /* ... timestamps and padding to 144 bytes total */
} k_stat_t;

Layout is binary-compatible with Linux x86-64 struct stat (musl libc). This is enforced by a _Static_assert(sizeof(k_stat_t) == 144).
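
The documented offsets can be checked at compile time. In this sketch the fields after st_atime are collapsed into an opaque 64-byte tail, since the source elides them; the _rest member name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint64_t st_dev;
    uint64_t st_ino;
    uint64_t st_nlink;
    uint32_t st_mode;
    uint32_t st_uid;
    uint32_t st_gid;
    uint32_t __pad0;
    uint64_t st_rdev;
    int64_t  st_size;
    int64_t  st_blksize;
    int64_t  st_blocks;
    int64_t  st_atime;
    uint8_t  _rest[64];   /* remaining timestamps and padding, elided in the source */
} k_stat_t;

/* Each check mirrors an offset comment from the structure listing. */
_Static_assert(offsetof(k_stat_t, st_nlink) == 16, "st_nlink offset");
_Static_assert(offsetof(k_stat_t, st_mode)  == 24, "st_mode offset");
_Static_assert(offsetof(k_stat_t, st_rdev)  == 40, "st_rdev offset");
_Static_assert(offsetof(k_stat_t, st_size)  == 48, "st_size offset");
_Static_assert(offsetof(k_stat_t, st_atime) == 72, "st_atime offset");
_Static_assert(sizeof(k_stat_t) == 144, "k_stat_t must be 144 bytes");
```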


Open Flags

The VFS defines these flag constants, matching Linux x86-64 values:

Flag            Value    Description
VFS_O_CREAT     0x40     Create file if it does not exist (ext2 only)
VFS_O_TRUNC     0x200    Truncate file to zero length on open
VFS_O_APPEND    0x400    Writes start at end of file
VFS_O_NONBLOCK  0x800    Non-blocking I/O (pipes, PTY)
VFS_O_CLOEXEC   0x80000  Close fd on exec (same bit as VFS_FD_CLOEXEC)
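
As a small illustration of how these bits combine with the access mode in the low two bits, the following sketch decodes a flags word. The open_flags_t type, decode_open_flags, and the VFS_O_ACCMODE name are hypothetical; only the flag values come from the table above.

```c
#include <assert.h>
#include <stdint.h>

#define VFS_O_CREAT    0x40u
#define VFS_O_TRUNC    0x200u
#define VFS_O_APPEND   0x400u
#define VFS_O_NONBLOCK 0x800u
#define VFS_O_CLOEXEC  0x80000u

#define VFS_O_ACCMODE  0x3u   /* hypothetical name for the access-mode mask */

typedef struct {
    int access;               /* 0 = O_RDONLY, 1 = O_WRONLY, 2 = O_RDWR */
    int creat, trunc, append, nonblock, cloexec;
} open_flags_t;

/* Split a raw flags word into its access mode and individual option bits. */
static open_flags_t decode_open_flags(uint32_t flags)
{
    open_flags_t d;
    d.access   = (int)(flags & VFS_O_ACCMODE);
    d.creat    = (flags & VFS_O_CREAT)    != 0;
    d.trunc    = (flags & VFS_O_TRUNC)    != 0;
    d.append   = (flags & VFS_O_APPEND)   != 0;
    d.nonblock = (flags & VFS_O_NONBLOCK) != 0;
    d.cloexec  = (flags & VFS_O_CLOEXEC)  != 0;
    return d;
}
```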

Path Resolution: vfs_open

vfs_open(path, flags, *out) is the central path resolution function, called by sys_open. It uses prefix matching to dispatch to the correct backend, checking each in strict priority order:

┌─────────────────────────────────────────────────────────┐
│                    vfs_open(path)                       │
│                                                         │
│  1. /dev/ptmx           ──→  ptmx_open()    [PTY]       │
│  2. /dev/pts/N          ──→  pts_open(N)    [PTY]       │
│  3. /proc/...           ──→  procfs_open()  [procfs]    │
│  4. /dev/...            ──→  initrd_open()  [devices]   │
│  5. /tmp/...            ──→  ramfs (tmp)    [volatile]  │
│  6. /run/...            ──→  ramfs (run)    [volatile]  │
│  7. ext2 primary        ──→  ext2_open()    [disk]      │
│  8. initrd fallback     ──→  initrd_open()  [boot ROM]  │
│                                                         │
│  Returns: 0 success, -2 ENOENT, -12 ENOMEM, -13 EACCES  │
└─────────────────────────────────────────────────────────┘

Path prefix matching is done byte-by-byte with inline comparisons (no strcmp). This avoids external dependencies – the kernel has no libc.
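
The dispatch order in the diagram can be sketched as a chain of byte-wise prefix tests. The backend_t enum and the has_prefix, is_exact, and classify_path helpers are hypothetical names; whether the real code matches /dev/ptmx exactly or by prefix is an assumption here.

```c
#include <assert.h>
#include <stddef.h>

typedef enum {
    BACKEND_PTMX,
    BACKEND_PTS,
    BACKEND_PROCFS,
    BACKEND_DEV,
    BACKEND_TMP,
    BACKEND_RUN,
    BACKEND_EXT2_THEN_INITRD   /* steps 7 and 8: try ext2, then initrd */
} backend_t;

/* Return 1 if path starts with prefix, comparing one byte at a time. */
static int has_prefix(const char *path, const char *prefix)
{
    while (*prefix != '\0') {
        if (*path != *prefix)
            return 0;          /* also stops at the end of path */
        path++;
        prefix++;
    }
    return 1;
}

/* Return 1 if path equals name exactly. */
static int is_exact(const char *path, const char *name)
{
    while (*name != '\0') {
        if (*path != *name)
            return 0;
        path++;
        name++;
    }
    return *path == '\0';
}

/* Checks run in the same priority order as the diagram. */
static backend_t classify_path(const char *path)
{
    assert(path != NULL);
    if (is_exact(path, "/dev/ptmx"))    return BACKEND_PTMX;
    if (has_prefix(path, "/dev/pts/"))  return BACKEND_PTS;
    if (has_prefix(path, "/proc/"))     return BACKEND_PROCFS;
    if (has_prefix(path, "/dev/"))      return BACKEND_DEV;
    if (has_prefix(path, "/tmp/"))      return BACKEND_TMP;
    if (has_prefix(path, "/run/"))      return BACKEND_RUN;
    return BACKEND_EXT2_THEN_INITRD;
}
```

Order matters: the PTY checks must precede the generic /dev/ test, otherwise /dev/pts/3 would be handed to the initrd device handler.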

Priority Order Details

  1. PTY master (/dev/ptmx): Allocates a new pseudo-terminal pair. Returns a master fd; the slave is opened via /dev/pts/N.

  2. PTY slave (/dev/pts/N): Opens the slave end of PTY index N. The index is parsed as a decimal integer from the path.

  3. procfs (/proc/...): Delegated to procfs_open() with the prefix stripped. See procfs documentation.

  4. Device files (/dev/...): The initrd handles /dev/tty, /dev/urandom, /dev/random, /dev/mouse, and directory listings. /dev/null is handled only by vfs_stat_path as a synthetic chardev (1:3) – there is no open(2) backing for it.

  5. Tmp ramfs (/tmp/...): In-memory volatile storage backed by a static ramfs_t instance (s_tmp_ramfs).

  6. Run ramfs (/run/...): Same as tmp, separate instance (s_run_ramfs). Used for runtime state (PID files, sockets).

  7. ext2 primary: The writable root filesystem on NVMe. If the file exists, performs DAC permission checks and capability gating before returning an fd. If the file does not exist and O_CREAT is set, creates it.

  8. initrd fallback: Read-only boot files compiled into the kernel image. Only reached if ext2 lookup fails.
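
Step 2 above parses the PTY index out of the path. A minimal sketch of that parse follows; pts_parse_index and the PTS_INDEX_LIMIT bound are hypothetical, and the real function may validate differently.

```c
#include <assert.h>
#include <stddef.h>

#define PTS_INDEX_LIMIT 9999   /* hypothetical bound; keeps the parse overflow-free */

/* Return N for "/dev/pts/N", or -1 if the path is malformed. */
static int pts_parse_index(const char *path)
{
    static const char prefix[] = "/dev/pts/";
    size_t i = 0;
    int n = 0;

    assert(path != NULL);
    for (; prefix[i] != '\0'; i++) {
        if (path[i] != prefix[i])
            return -1;          /* wrong prefix or path too short */
    }
    if (path[i] == '\0')
        return -1;              /* no digits after the prefix */
    for (; path[i] != '\0'; i++) {
        if (path[i] < '0' || path[i] > '9')
            return -1;          /* non-decimal character */
        n = n * 10 + (path[i] - '0');
        if (n > PTS_INDEX_LIMIT)
            return -1;
    }
    return n;
}
```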

Permission Checks on ext2 Open

When ext2 resolves a file for a user process (sched_current()->is_user), two layers of access control are applied:

DAC (Discretionary Access Control):

int want = 4;                        /* R_OK by default (O_RDONLY=0) */
if (flags & 1) want = 2;            /* O_WRONLY */
if (flags & 2) want = 4 | 2;        /* O_RDWR */
int perm = ext2_check_perm(ino, pr->uid, pr->gid, want);
if (perm != 0) return -13;          /* EACCES */

ext2_check_perm implements standard POSIX owner/group/other permission bit matching with no root bypass – uid 0 gets no special treatment. This is a deliberate design choice aligned with the capability model: privilege is granted through capabilities, not uid.
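
The owner/group/other matching described above can be sketched as follows. This is not the Aegis implementation: the real ext2_check_perm takes an inode number, whereas this illustration takes a hypothetical inode_meta_t holding the three fields it needs, and supplementary groups are ignored.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical subset of an ext2 inode, for illustration only. */
typedef struct {
    uint16_t i_mode;
    uint16_t i_uid;
    uint16_t i_gid;
} inode_meta_t;

/* Map open flags to the wanted permission bits (4 = read, 2 = write). */
static int open_flags_to_want(uint32_t flags)
{
    int want = 4;                     /* O_RDONLY */
    if (flags & 1) want = 2;          /* O_WRONLY */
    if (flags & 2) want = 4 | 2;      /* O_RDWR */
    return want;
}

/* Return 0 if every wanted bit is granted, -13 (-EACCES) otherwise. */
static int check_perm(const inode_meta_t *ino, uint16_t uid, uint16_t gid, int want)
{
    int shift;
    int granted;

    assert(ino != NULL);
    if (uid == ino->i_uid)
        shift = 6;                    /* owner bits */
    else if (gid == ino->i_gid)
        shift = 3;                    /* group bits */
    else
        shift = 0;                    /* other bits */
    /* No uid 0 shortcut: root is matched like any other uid. */
    granted = (ino->i_mode >> shift) & 7;
    return ((granted & want) == want) ? 0 : -13;
}
```

With mode 0640, owner 1000, and group 100, uid 0 falls into the "other" class and is refused read access just like any unrelated user.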

Capability gating (post-symlink resolution):

uint32_t shadow_ino = ext2_get_shadow_ino();
if (shadow_ino != 0 && ino == shadow_ino && sched_current()->is_user) {
    if (cap_check(pr->caps, CAP_TABLE_SIZE, CAP_KIND_AUTH, CAP_RIGHTS_READ) != 0)
        return -13;  /* EACCES */
}

/etc/shadow requires the CAP_KIND_AUTH capability kind with the CAP_RIGHTS_READ rights bitfield even for uid 0. The cap_check function validates the process’s capability table (the per-process caps array) for the required capability kind and rights. The check compares the resolved inode number against the shadow inode recorded at mount time, so symlink-based bypasses (ln -s /etc/shadow /tmp/x; open("/tmp/x")) are ineffective.
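
A rough sketch of the gate, under loudly stated assumptions: the cap_slot_t layout, the table size, the numeric values of CAP_KIND_AUTH and CAP_RIGHTS_READ, and the shadow_gate wrapper are all hypothetical. Only the call shape of cap_check and the inode comparison come from the snippet above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CAP_TABLE_SIZE  8u    /* hypothetical size */
#define CAP_KIND_AUTH   3u    /* hypothetical value */
#define CAP_RIGHTS_READ 0x1u  /* hypothetical value */

/* Hypothetical capability slot: a kind plus a rights bitfield. */
typedef struct {
    uint32_t kind;
    uint32_t rights;
} cap_slot_t;

/* Return 0 if some slot holds `kind` with at least `rights`, nonzero otherwise. */
static int cap_check(const cap_slot_t *caps, uint32_t n, uint32_t kind, uint32_t rights)
{
    assert(caps != NULL);
    for (uint32_t i = 0; i < n; i++) {
        if (caps[i].kind == kind && (caps[i].rights & rights) == rights)
            return 0;
    }
    return -1;
}

/* Gate keyed on the resolved inode, so the path used to reach it is irrelevant. */
static int shadow_gate(uint32_t ino, uint32_t shadow_ino, int is_user,
                       const cap_slot_t *caps)
{
    if (shadow_ino != 0 && ino == shadow_ino && is_user) {
        if (cap_check(caps, CAP_TABLE_SIZE, CAP_KIND_AUTH, CAP_RIGHTS_READ) != 0)
            return -13;       /* -EACCES */
    }
    return 0;
}
```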

O_CREAT Path

When a file is not found on ext2 and O_CREAT is set:

  1. ext2_lookup_parent() resolves the parent directory
  2. DAC check requires write + execute on the parent directory
  3. ext2_create(path, 0644) allocates an inode and adds a directory entry
  4. The new file is opened and returned
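
Step 1 needs the parent directory and the final path component. The sketch below shows one way to split them; split_parent and CREATE_PARENT_WANT are hypothetical names, and the real ext2_lookup_parent may work on inodes rather than strings. It uses libc string functions for brevity, which the kernel itself would not have.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define WANT_W 2
#define WANT_X 1
#define CREATE_PARENT_WANT (WANT_W | WANT_X)   /* write + execute on the parent */

/*
 * Split "/a/b/c" into parent "/a/b" and leaf "c".
 * Returns 0 on success, -1 if there is no slash, the path ends in a slash,
 * or the parent does not fit in the output buffer.
 */
static int split_parent(const char *path, char *parent, size_t cap, const char **leaf)
{
    const char *slash;
    size_t plen;

    assert(path != NULL && parent != NULL && leaf != NULL);
    slash = strrchr(path, '/');
    if (slash == NULL || slash[1] == '\0')
        return -1;
    plen = (slash == path) ? 1 : (size_t)(slash - path);   /* keep "/" for root */
    if (plen + 1 > cap)
        return -1;
    memcpy(parent, path, plen);
    parent[plen] = '\0';
    *leaf = slash + 1;
    return 0;
}
```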

Stat Resolution: vfs_stat_path

vfs_stat_path(path, *out) fills a k_stat_t for the given path. It follows a similar priority order to vfs_open but with additional synthetic entries:

Path Pattern                  Source                             st_dev
/proc/...                     procfs_stat                        5
/dev/console, /dev/tty, etc.  Synthetic chardev                  1
/dev/urandom, /dev/random     Synthetic chardev (1:9)            1
/dev/null                     Synthetic chardev (1:3)            1
/dev/mouse                    Synthetic chardev (13:0)           1
/dev/ptmx                     Synthetic chardev (5:2)            1
/dev/pts/N                    Synthetic chardev (136:0)          1
/dev, /proc, /tmp, /run       Synthetic directory (S_IFDIR|0555) 1
/tmp/...                      ramfs_stat (tmp)                   3
/run/...                      ramfs_stat (run)                   3
ext2 files                    Disk inode                         2
initrd files                  initrd_stat_entry                  1

The st_dev values are synthetic device numbers assigned per-backend: 1 for initrd/devices, 2 for ext2 (NVMe), 3 for ramfs, 5 for procfs.

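An extended variant takes a follow flag that controls symlink handling on the final path component:

int vfs_stat_path_ex(const char *path, k_stat_t *out, int follow);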
  • follow=1: Follow symlinks on the final component (stat behavior)
  • follow=0: Do not follow (lstat behavior)

Non-ext2 paths delegate to vfs_stat_path directly since those filesystems have no symlinks. For ext2 paths, ext2_open_ex is used with the follow parameter.


Metadata Operations

vfs_fchmod

int vfs_fchmod(vfs_file_t *f, uint16_t mode);

Changes permission bits on an open ext2 fd. Preserves file type bits (upper 4 bits of i_mode), replaces the lower 12 permission bits. Returns -1 for non-ext2 fds.
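
The bit manipulation amounts to a mask-and-merge on the 16-bit i_mode. A minimal sketch, with chmod_merge as a hypothetical helper name:

```c
#include <assert.h>
#include <stdint.h>

#define MODE_TYPE_MASK 0xF000u   /* upper 4 bits: file type */
#define MODE_PERM_MASK 0x0FFFu   /* lower 12 bits: permissions */

/* Keep the file type from i_mode and take the permission bits from mode. */
static uint16_t chmod_merge(uint16_t i_mode, uint16_t mode)
{
    return (uint16_t)((i_mode & MODE_TYPE_MASK) | (mode & MODE_PERM_MASK));
}
```

For a regular file with mode 0100644 (0x81A4), requesting 0755 yields 0100755 (0x81ED); any type bits in the requested mode are discarded.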

vfs_fchown

int vfs_fchown(vfs_file_t *f, uint16_t uid, uint16_t gid);

Changes owner and group on an open ext2 fd. Writes directly to the on-disk inode via ext2_write_inode. Returns -1 for non-ext2 fds.


Filesystem Backends

The VFS dispatches to these backends, each implementing vfs_ops_t:

┌─────────────────────────────────────────────────────────────┐
│                      VFS Layer (vfs.c)                      │
├──────────┬──────────┬──────────┬──────────┬─────────────────┤
│  initrd  │   ext2   │  ramfs   │  procfs  │  special devs   │
│ (boot    │ (NVMe    │ (/tmp,   │ (/proc)  │ (console, PTY,  │
│  files)  │  disk)   │  /run)   │          │  pipe, memfd)   │
├──────────┼──────────┼──────────┼──────────┼─────────────────┤
│ R/O      │ R/W      │ R/W      │ R/O      │ varies          │
│ static   │ cached   │ volatile │ dynamic  │ per-device      │
│ data     │ block IO │ kva page │ gen-on-  │ semantics       │
│          │          │          │ open     │                 │
└──────────┴──────────┴──────────┴──────────┴─────────────────┘

initrd (Boot Image)

Source: kernel/fs/initrd.c

The initrd is a compile-time filesystem embedded in the kernel binary via objcopy --input-target binary. It provides:

  • Static files: /etc/motd, /etc/passwd, /etc/shadow, /etc/profile, /etc/hosts
  • Boot binaries: /bin/login, /bin/vigil, /bin/sh, /bin/echo, /bin/cat, /bin/ls
  • Service configs: /etc/vigil/services/{getty,httpd,dhcp,chronos}/{run,policy,caps}
  • Policy capability files: /etc/aegis/caps.d/{login,bastion,httpd,dhcp,stsh,lumen,installer}
  • Device files: /dev/tty (keyboard), /dev/urandom, /dev/random, /dev/mouse
  • Directory listings: Static dir_entry_t arrays for /, /etc, /dev, /root, etc.

Files are read-only with zero-copy reads directly from the kernel’s .rodata section. The /bin directory listing is not provided by initrd – it falls through to ext2, which shows all binaries on the disk image.

ramfs (In-Memory Volatile Storage)

Source: kernel/fs/ramfs.c

Two static instances: s_tmp_ramfs for /tmp and s_run_ramfs for /run. Each is a flat array of RAMFS_MAX_FILES entries with a spinlock for concurrent access.

Key limitations:

  • Files are stored in a single kva page (4 KB max per file, RAMFS_MAX_SIZE)
  • Flat namespace – no subdirectories
  • No persistence across reboots
  • Writes use copy_from_user for SMAP correctness
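
These constraints can be sketched as a flat table of fixed-size slots. The values of RAMFS_MAX_FILES and RAMFS_NAME_MAX, the ramfs_file_t layout, and the ramfs_find and ramfs_store helpers are hypothetical; the inline data array stands in for the kva page, the spinlock is omitted, and memcpy stands in for copy_from_user.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define RAMFS_MAX_FILES 8      /* hypothetical; the real limit is not stated */
#define RAMFS_MAX_SIZE  4096   /* one page per file */
#define RAMFS_NAME_MAX  64     /* hypothetical */

typedef struct {
    char     name[RAMFS_NAME_MAX];   /* empty string = free slot */
    uint32_t size;
    uint8_t  data[RAMFS_MAX_SIZE];   /* stands in for the kva page */
} ramfs_file_t;

typedef struct {
    ramfs_file_t files[RAMFS_MAX_FILES];   /* spinlock omitted in this sketch */
} ramfs_t;

static void ramfs_init(ramfs_t *fs)
{
    memset(fs, 0, sizeof *fs);
}

/* Return the slot index for name, or -1 if it does not exist. */
static int ramfs_find(const ramfs_t *fs, const char *name)
{
    for (int i = 0; i < RAMFS_MAX_FILES; i++) {
        if (fs->files[i].name[0] != '\0' && strcmp(fs->files[i].name, name) == 0)
            return i;
    }
    return -1;
}

/* Create the file if needed, then overwrite its contents from offset 0. */
static int ramfs_store(ramfs_t *fs, const char *name, const void *buf, size_t len)
{
    size_t nlen;
    int idx;

    assert(fs != NULL && name != NULL && buf != NULL);
    nlen = strlen(name);
    if (nlen == 0 || nlen >= RAMFS_NAME_MAX || strchr(name, '/') != NULL)
        return -22;                  /* -EINVAL: flat namespace, no subdirectories */
    if (len > RAMFS_MAX_SIZE)
        return -27;                  /* -EFBIG: one page per file */
    idx = ramfs_find(fs, name);
    if (idx < 0) {
        for (int i = 0; i < RAMFS_MAX_FILES && idx < 0; i++) {
            if (fs->files[i].name[0] == '\0')
                idx = i;
        }
        if (idx < 0)
            return -28;              /* -ENOSPC: table full */
        memcpy(fs->files[idx].name, name, nlen + 1);
    }
    memcpy(fs->files[idx].data, buf, len);
    fs->files[idx].size = (uint32_t)len;
    return (int)len;
}
```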

Console

Source: kernel/fs/console.c

A write-only character device for /dev/console, /dev/stdout, /dev/stderr. Outputs to serial, VGA, and framebuffer simultaneously. Uses a 256-byte kernel bounce buffer with copy_from_user for SMAP safety. Reports POLLOUT always.

Pipe

Source: kernel/fs/pipe.c

Anonymous pipes with a ring buffer of PIPE_BUF_SIZE bytes (fits in one kva page along with metadata, total sizeof(pipe_t) == 4096). Separate vfs_ops_t for read and write ends with independent reference counting.

  • Read end: Blocks if empty and write end is open; returns 0 (EOF) if all writers are gone
  • Write end: Blocks if full; delivers SIGPIPE and returns -EPIPE if all readers are gone
  • Poll support: read end reports POLLIN when data available or POLLHUP when writers gone; write end reports POLLOUT when space available or POLLERR when readers gone
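
The end-of-pipe rules above can be illustrated with a small non-blocking ring buffer. This is a sketch, not the Aegis pipe: the 16-byte PIPE_BUF_SIZE, the pipe_t field names, and returning -EAGAIN where the real pipe would block are all assumptions, and SIGPIPE delivery is left out. The POLL* values follow Linux.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PIPE_BUF_SIZE 16u   /* hypothetical small size for illustration */

#define POLLIN  0x001
#define POLLOUT 0x004
#define POLLERR 0x008
#define POLLHUP 0x010

typedef struct {
    uint8_t  buf[PIPE_BUF_SIZE];
    uint32_t head;      /* index of the next byte to read */
    uint32_t count;     /* bytes currently stored */
    uint32_t readers;   /* open read ends */
    uint32_t writers;   /* open write ends */
} pipe_t;

/* Returns bytes read, 0 for EOF, or -11 (-EAGAIN) where the kernel would block. */
static int pipe_read(pipe_t *p, uint8_t *out, uint32_t len)
{
    uint32_t n;

    assert(p != NULL && out != NULL);
    if (p->count == 0)
        return p->writers ? -11 : 0;
    n = len < p->count ? len : p->count;
    for (uint32_t i = 0; i < n; i++) {
        out[i] = p->buf[p->head];
        p->head = (p->head + 1) % PIPE_BUF_SIZE;
    }
    p->count -= n;
    return (int)n;
}

/* Returns bytes written, -32 (-EPIPE) with no readers, -11 (-EAGAIN) when full. */
static int pipe_write(pipe_t *p, const uint8_t *in, uint32_t len)
{
    uint32_t space, n;

    assert(p != NULL && in != NULL);
    if (p->readers == 0)
        return -32;
    space = PIPE_BUF_SIZE - p->count;
    if (space == 0)
        return -11;
    n = len < space ? len : space;
    for (uint32_t i = 0; i < n; i++)
        p->buf[(p->head + p->count + i) % PIPE_BUF_SIZE] = in[i];
    p->count += n;
    return (int)n;
}

static uint16_t pipe_poll_read(const pipe_t *p)
{
    uint16_t mask = 0;
    if (p->count > 0)    mask |= POLLIN;
    if (p->writers == 0) mask |= POLLHUP;
    return mask;
}

static uint16_t pipe_poll_write(const pipe_t *p)
{
    uint16_t mask = 0;
    if (p->count < PIPE_BUF_SIZE) mask |= POLLOUT;
    if (p->readers == 0)          mask |= POLLERR;
    return mask;
}
```

A reader can still drain buffered data after the last writer closes; EOF is reported only once the buffer is empty.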

memfd (Anonymous Shared Memory)

Source: kernel/fs/memfd.c

Anonymous memory-backed file descriptors, primarily used with mmap. Backed by physical pages allocated via PMM. Pool of MEMFD_MAX entries with reference counting. ftruncate allocates or frees physical pages. Write via the fd returns -ENOSYS (mmap is the intended interface).


Initialization

vfs_init() is called from kernel_main before sched_init:

void vfs_init(void)
{
    ramfs_init(&s_run_ramfs);
    ramfs_init(&s_tmp_ramfs);
    printk("[VFS] OK: initialized\n");
    initrd_register();
    procfs_init();
}

The ext2 filesystem is mounted separately by ext2_mount("nvme0") after the NVMe driver initializes the block device.


Design Decisions

No mount table. Filesystem backends are hardcoded by path prefix. This eliminates mount/unmount complexity and attack surface at the cost of flexibility. New filesystems require kernel code changes.

No inode cache. Each vfs_open re-resolves from the backend. The ext2 block cache (16-slot LRU) provides caching at the block level, but there is no VFS-level inode or dentry cache. This simplifies memory management and avoids cache coherency issues.

No root privilege bypass. DAC checks do not skip for uid 0. All elevated operations require explicit capabilities. This is a core Aegis design principle – see the capability model.

Write offset in priv, not in vfs_file_t. The write callback takes no offset parameter. ext2 maintains write_offset in ext2_fd_priv_t, ramfs overwrites from offset 0, and console/pipe are stream-oriented. This avoids burdening every backend with offset tracking they may not need.

Byte-by-byte path matching. Path prefixes are checked with inline character comparisons rather than strcmp. The kernel has no libc, and comparisons against constant prefixes give the compiler room to merge adjacent byte checks into word-sized comparisons.