procfs and Special Filesystems

Beyond the primary ext2 filesystem, Aegis provides several special-purpose filesystems that serve as VFS backends. Each implements the vfs_ops_t interface documented in the VFS layer.

v1 note: These filesystem implementations are v1 software – functional and tested, but not hardened. Contributions are welcome – file issues or propose changes at exec/aegis.


procfs – Process Information Filesystem

Source: kernel/fs/procfs.c

The procfs implementation uses a generate-on-open design: opening a /proc file allocates a kernel virtual address (kva) page, generates the file content into it, and stores the buffer in the fd’s priv field. Subsequent reads copy from this snapshot; close frees the buffer. Content is never regenerated after open – it is a point-in-time snapshot.
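
The snapshot behaviour can be illustrated with a small user-space sketch. The snapshot_t type and snapshot_read function below are hypothetical stand-ins, not the kernel's actual code; they only show how reads at increasing offsets are served from a buffer that was filled once at open time.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for the per-fd snapshot described above. */
typedef struct {
    const char *buf;  /* content generated once, at open time */
    uint32_t    len;  /* number of valid bytes in buf */
} snapshot_t;

/* Copy up to `want` bytes starting at `off`; returns bytes copied, 0 at EOF. */
static size_t snapshot_read(const snapshot_t *s, uint64_t off,
                            void *dst, size_t want)
{
    if (off >= s->len)
        return 0;                        /* past the end of the snapshot */
    size_t avail = (size_t)(s->len - off);
    size_t n = want < avail ? want : avail;
    memcpy(dst, s->buf + off, n);        /* never regenerates content */
    return n;
}
```

Because the buffer is never refreshed, a reader that wants current data has to close and reopen the file.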

Namespace Layout

/proc/
├── self/             → symlink-like alias to /proc/<current_pid>/
│   ├── maps          → memory map (VMA table)
│   ├── status        → process status summary
│   ├── stat          → single-line stat (Linux /proc/<pid>/stat format)
│   ├── exe           → executable path
│   ├── cmdline       → NUL-terminated argv[0] (executable path)
│   └── fd/           → directory listing of open file descriptors
├── <pid>/            → per-process directory (same entries as self/)
├── meminfo           → system memory summary
├── version           → kernel version string
└── cmdline           → kernel command line

Capability Gating

Access to /proc/self/ is always permitted. Accessing /proc/<pid>/ for a different process requires the CAP_KIND_PROC_READ capability kind in the caller’s capability table:

static int procfs_check_access(uint32_t target_pid)
{
    aegis_task_t *cur = sched_current();
    if (!cur || !cur->is_user) return -1;
    aegis_process_t *caller = (aegis_process_t *)cur;
    if (target_pid == caller->pid)
        return 0;  /* self always OK */
    return cap_check(caller->caps, CAP_TABLE_SIZE,
                     CAP_KIND_PROC_READ, CAP_RIGHTS_READ);
}

The cap_check function scans the process’s capability table (the per-process caps array) for an entry matching the CAP_KIND_PROC_READ capability kind with the CAP_RIGHTS_READ rights bitfield. If the capability is not present, the call returns ENOCAP (errno 130), and procfs_open_pid denies access. This prevents unprivileged processes from inspecting other processes’ memory maps, file descriptors, or credentials. See the capability model for the full capability system.
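
As a rough illustration of such a table scan, the sketch below assumes a two-field entry layout, placeholder constant values, and a negative-errno return convention. None of these are taken from the Aegis sources; only the ENOCAP value of 130 comes from the text above.

```c
#include <stdint.h>

#define ENOCAP 130                     /* errno value quoted in the text */

/* Hypothetical entry layout and constant values, for illustration only. */
typedef struct {
    uint32_t kind;                     /* capability kind, 0 = empty slot */
    uint32_t rights;                   /* rights bitfield */
} cap_entry_t;

enum { CAP_KIND_PROC_READ = 7 };       /* placeholder value */
enum { CAP_RIGHTS_READ = 1 };          /* placeholder value */

/* Return 0 if some entry has the kind and all requested rights,
 * otherwise -ENOCAP (negative-errno convention is an assumption). */
static int cap_check_sketch(const cap_entry_t *caps, uint32_t n,
                            uint32_t kind, uint32_t rights)
{
    for (uint32_t i = 0; i < n; i++) {
        if (caps[i].kind == kind && (caps[i].rights & rights) == rights)
            return 0;
    }
    return -ENOCAP;
}
```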

Per-Process Files

/proc/<pid>/maps

Generated by gen_maps(). Outputs one line per VMA entry in the process’s VMA table:

00400000-00401000 r-xp 00000000 00:00 0         /bin/stsh
00600000-00601000 rw-p 00000000 00:00 0         /bin/stsh
01000000-01001000 rw-p 00000000 00:00 0         [heap]
7fff0000-80000000 rw-p 00000000 00:00 0         [stack]

VMA types are mapped to names:

VMA Type                      Label
----------------------------  ---------------
VMA_ELF_TEXT / VMA_ELF_DATA   Executable path
VMA_HEAP                      [heap]
VMA_STACK                     [stack]
VMA_GUARD                     [guard]
VMA_THREAD_STACK              [thread_stack]

Permission bits: r (PROT_READ=1), w (PROT_WRITE=2), x (PROT_EXEC=4), p (always private).
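
A minimal sketch of how one maps line could be formatted from these bits. The SK_PROT_* names and both helper functions are hypothetical; the bit values and the line layout follow the examples above.

```c
#include <stdint.h>
#include <stdio.h>

/* Bit values as quoted in the text; names are local to this sketch. */
enum { SK_PROT_READ = 1, SK_PROT_WRITE = 2, SK_PROT_EXEC = 4 };

/* Fill out[5] with e.g. "r-xp"; the last column is always 'p' (private). */
static void format_perms(uint32_t prot, char out[5])
{
    out[0] = (prot & SK_PROT_READ)  ? 'r' : '-';
    out[1] = (prot & SK_PROT_WRITE) ? 'w' : '-';
    out[2] = (prot & SK_PROT_EXEC)  ? 'x' : '-';
    out[3] = 'p';
    out[4] = '\0';
}

/* Format one maps line into buf; returns snprintf's result. */
static int format_maps_line(char *buf, size_t cap, uint64_t start,
                            uint64_t end, uint32_t prot, const char *label)
{
    char perms[5];
    format_perms(prot, perms);
    return snprintf(buf, cap,
                    "%08llx-%08llx %s 00000000 00:00 0         %s\n",
                    (unsigned long long)start, (unsigned long long)end,
                    perms, label);
}
```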

/proc/<pid>/status

Generated by gen_status(). Multi-line key-value format:

Name:   stsh
State:  R (running)
Tgid:   3
Pid:    3
PPid:   1
Uid:    0
Gid:    0
VmSize: 8192 kB

Task states:

State         Character  Description
------------  ---------  ----------------------------------
TASK_RUNNING  R          Currently runnable or on CPU
TASK_BLOCKED  S          Sleeping (waiting for event)
TASK_ZOMBIE   Z          Terminated, awaiting parent wait()
TASK_STOPPED  T          Stopped (signal)

/proc/<pid>/stat

Generated by gen_stat(). Single-line format compatible with Linux /proc/<pid>/stat:

3 (stsh) R 1 3 0 0 3 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

Fields: pid (comm) state ppid pgid sid tty_nr tpgid ... Most of the remaining fields are reported as 0 and exist only so that the line parses like its Linux counterpart.
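
From user space, the leading fields of a stat line can be pulled apart with sscanf. The helper below is a hypothetical sketch, not part of Aegis; it assumes the command name contains no ')' character.

```c
#include <stdio.h>

/* Hypothetical user-space helper type holding the first eight fields. */
typedef struct {
    int  pid;
    char comm[32];
    char state;
    int  ppid, pgid, sid, tty_nr, tpgid;
} stat_head_t;

/* Parse the leading fields; returns 0 on success, -1 on malformed input. */
static int parse_stat_head(const char *line, stat_head_t *out)
{
    int n = sscanf(line, "%d (%31[^)]) %c %d %d %d %d %d",
                   &out->pid, out->comm, &out->state,
                   &out->ppid, &out->pgid, &out->sid,
                   &out->tty_nr, &out->tpgid);
    return n == 8 ? 0 : -1;
}
```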

/proc/<pid>/exe

The process’s executable path followed by a newline.

/proc/<pid>/cmdline

The process’s executable path as a NUL-terminated string (argv[0] only).

/proc/<pid>/fd/

Directory listing of open file descriptors. Enumerates the process’s fd table (PROC_MAX_FDS = 16 slots), listing each fd number where fds[i].ops != NULL.

Global Files

/proc/meminfo

Generated by gen_meminfo(). Reports physical memory statistics from the PMM:

MemTotal:       131072 kB
MemFree:        98304 kB
MemAvailable:   98304 kB

Values are derived from pmm_total_pages() and pmm_free_pages() multiplied by 4 (4 KB pages to KB). MemAvailable is set equal to MemFree (no page cache accounting).
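
The conversion is simple arithmetic: 32768 total pages times 4 gives 131072 kB. A hedged sketch of the formatting step follows. The function name and signature are hypothetical, with page counts passed in rather than read from the PMM.

```c
#include <stdint.h>
#include <stdio.h>

/* Format the three meminfo lines from page counts (4 KB pages -> kB). */
static int format_meminfo(char *buf, size_t cap,
                          uint64_t total_pages, uint64_t free_pages)
{
    unsigned long long total_kb = (unsigned long long)total_pages * 4;
    unsigned long long free_kb  = (unsigned long long)free_pages * 4;
    /* MemAvailable mirrors MemFree: there is no page cache to account for. */
    return snprintf(buf, cap,
                    "MemTotal:       %llu kB\n"
                    "MemFree:        %llu kB\n"
                    "MemAvailable:   %llu kB\n",
                    total_kb, free_kb, free_kb);
}
```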

/proc/version

Static string: Aegis 0.31.0\n

/proc/cmdline

The kernel command line retrieved via arch_get_cmdline().

Internal Architecture

File priv structure:

typedef struct {
    char    *buf;     /* kva-allocated content buffer (1 page) */
    uint32_t len;     /* content length in bytes */
    uint32_t _pad;
} procfs_file_priv_t;

Directory priv structure:

typedef struct {
    uint32_t pid;     /* 0 = root /proc/ dir */
    uint8_t  is_fd;   /* 1 = /proc/[pid]/fd/ */
    uint8_t  _pad[3];
} procfs_dir_priv_t;

VFS ops tables:

Ops Table          Used For                 Operations
-----------------  -----------------------  --------------------
s_procfs_file_ops  Generated-content files  read, close, stat
s_procfs_dir_ops   Directory listings       readdir, close, stat

Memory management: Each file open allocates two kva pages – one for the procfs_file_priv_t struct and one for the content buffer. Both are freed on close. Directories allocate one kva page for the procfs_dir_priv_t.

Root directory enumeration: The /proc/ root directory lists: self, meminfo, version, cmdline, then all live user processes by iterating the circular task list.


ramfs – In-Memory Volatile Storage

Source: kernel/fs/ramfs.c, kernel/fs/ramfs.h

The ramfs provides simple in-memory file storage used for /tmp and /run. Two static instances are initialized by vfs_init():

static ramfs_t s_run_ramfs;   /* /run/ */
static ramfs_t s_tmp_ramfs;   /* /tmp/ */

Data Structure

typedef struct {
    char     name[64];    /* RAMFS_MAX_NAMELEN */
    uint8_t *data;        /* kva-allocated page; NULL until first write */
    uint32_t size;        /* current byte count */
    uint8_t  in_use;
} ramfs_file_t;

typedef struct {
    ramfs_file_t files[32];   /* RAMFS_MAX_FILES */
    spinlock_t   lock;
} ramfs_t;

Characteristics

Property                Value
----------------------  -------------------------------------------
Max files per instance  32 (RAMFS_MAX_FILES)
Max filename length     63 characters (RAMFS_MAX_NAMELEN - 1)
Max file size           4096 bytes (RAMFS_MAX_SIZE, one kva page)
Subdirectories          Not supported (flat namespace)
Persistence             None (volatile)
Concurrency             Per-instance spinlock with IRQ save/restore

Operations

  • ramfs_open: Finds an existing file by name or creates one if VFS_O_CREAT is set. The file data page is not allocated until the first write.
  • ramfs_stat: Returns synthetic stat with st_dev = 3, S_IFREG | 0644.
  • ramfs_opendir: Returns a directory handle that enumerates all in_use files.
  • ramfs_populate: Kernel-side write helper. Used to pre-populate files without going through the user-space write path. Copies directly from a kernel buffer using __builtin_memcpy.
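
The find-or-create behaviour of ramfs_open over a flat, fixed-size table can be sketched as follows. The sk_* names are hypothetical, the spinlock is omitted, and the data page is left out because it is only allocated on first write.

```c
#include <stdint.h>
#include <string.h>

#define SK_MAX_FILES   32     /* RAMFS_MAX_FILES in the text */
#define SK_MAX_NAMELEN 64     /* RAMFS_MAX_NAMELEN in the text */

typedef struct {
    char    name[SK_MAX_NAMELEN];
    uint8_t in_use;
} sk_file_t;

typedef struct { sk_file_t files[SK_MAX_FILES]; } sk_ramfs_t;

/* Return the slot index for `name`, creating it if `create` is set;
 * -1 if the name is too long, not found, or the table is full. */
static int sk_find_or_create(sk_ramfs_t *fs, const char *name, int create)
{
    int free_slot = -1;
    if (strlen(name) >= SK_MAX_NAMELEN)
        return -1;                          /* name would not fit */
    for (int i = 0; i < SK_MAX_FILES; i++) {
        if (fs->files[i].in_use) {
            if (strcmp(fs->files[i].name, name) == 0)
                return i;                   /* existing file */
        } else if (free_slot < 0) {
            free_slot = i;                  /* remember first free slot */
        }
    }
    if (!create || free_slot < 0)
        return -1;
    strcpy(fs->files[free_slot].name, name);
    fs->files[free_slot].in_use = 1;
    return free_slot;
}
```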

Write Path and SMAP

The ramfs write function accepts user-space pointers from sys_write. It uses copy_from_user (STAC/CLAC) with page-boundary clamping to avoid crossing unmapped pages:

while (done < len) {
    /* Clamp each copy so it ends at or before the next 4 KB page boundary. */
    uint64_t page_off = (uintptr_t)(buf + done) & 0xFFF;
    uint64_t to_end   = 0x1000 - page_off;
    uint64_t chunk    = min(len - done, to_end);
    copy_from_user(f->data + done, buf + done, chunk);
    done += chunk;
}

This is necessary because the kernel runs with SMAP (Supervisor Mode Access Prevention) enabled, and a single copy_from_user call must not span a page boundary into an unmapped region.
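
The clamping arithmetic can be checked in isolation. The two helpers below are hypothetical and run in user space; they compute chunk sizes only and perform no copying.

```c
#include <stddef.h>
#include <stdint.h>

#define CL_PAGE_SIZE 0x1000u

/* Largest byte count starting at `addr` that stays within one 4 KB page,
 * capped at `remaining`. */
static size_t chunk_within_page(uintptr_t addr, size_t remaining)
{
    size_t to_end = CL_PAGE_SIZE - (size_t)(addr & (CL_PAGE_SIZE - 1));
    return remaining < to_end ? remaining : to_end;
}

/* Count how many per-page copies the range [addr, addr+len) is split into. */
static unsigned count_chunks(uintptr_t addr, size_t len)
{
    unsigned chunks = 0;
    size_t done = 0;
    while (done < len) {
        done += chunk_within_page(addr + done, len - done);
        chunks++;
    }
    return chunks;
}
```

For example, a 100-byte write starting at address 0x1FF0 is split into a 16-byte copy up to the page boundary and an 84-byte copy after it.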


Pipes – Anonymous Inter-Process Communication

Source: kernel/fs/pipe.c, kernel/fs/pipe.h

Pipes provide unidirectional byte streams between processes. Each pipe is a single kva page (4096 bytes) containing a ring buffer and metadata:

typedef struct {
    uint8_t       buf[4056];        /* PIPE_BUF_SIZE */
    uint32_t      read_pos;         /* ring buffer read cursor */
    uint32_t      write_pos;        /* ring buffer write cursor */
    uint32_t      count;            /* bytes currently buffered */
    uint32_t      read_refs;        /* open read-end fd count */
    uint32_t      write_refs;       /* open write-end fd count */
    spinlock_t    lock;             /* per-pipe spinlock */
    aegis_task_t *reader_waiting;   /* blocked reader task */
    aegis_task_t *writer_waiting;   /* blocked writer task */
} pipe_t;
/* sizeof(pipe_t) == 4096, enforced by _Static_assert */

Ring Buffer Layout

                      PIPE_BUF_SIZE = 4056
  ┌────────────────────────────────────────────┐
  │  ......[data]............[free]......      │
  │        ^read_pos        ^write_pos         │
  └────────────────────────────────────────────┘
  count = bytes between read_pos and write_pos (with wrap)
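
A minimal ring buffer with the same three cursors might look like the sketch below. It is a user-space illustration with hypothetical sk_* names and no locking or blocking; the byte-at-a-time loops favour clarity over speed.

```c
#include <stdint.h>

#define SK_BUF_SIZE 4056u   /* PIPE_BUF_SIZE in the text */

typedef struct {
    uint8_t  buf[SK_BUF_SIZE];
    uint32_t read_pos, write_pos, count;
} sk_ring_t;

/* Append up to n bytes; returns how many fit. */
static uint32_t sk_ring_write(sk_ring_t *r, const uint8_t *src, uint32_t n)
{
    uint32_t space = SK_BUF_SIZE - r->count;
    if (n > space)
        n = space;
    for (uint32_t i = 0; i < n; i++) {
        r->buf[r->write_pos] = src[i];
        r->write_pos = (r->write_pos + 1) % SK_BUF_SIZE;  /* wrap */
    }
    r->count += n;
    return n;
}

/* Remove up to n bytes; returns how many were available. */
static uint32_t sk_ring_read(sk_ring_t *r, uint8_t *dst, uint32_t n)
{
    if (n > r->count)
        n = r->count;
    for (uint32_t i = 0; i < n; i++) {
        dst[i] = r->buf[r->read_pos];
        r->read_pos = (r->read_pos + 1) % SK_BUF_SIZE;    /* wrap */
    }
    r->count -= n;
    return n;
}
```

Tracking count separately from the two cursors distinguishes a full buffer from an empty one when read_pos equals write_pos.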

Blocking Semantics

Operation  Empty + Writers Open  Empty + Writers Closed  Full + Readers Open  Full + Readers Closed
---------  --------------------  ----------------------  -------------------  -----------------------
Read       Block (sleep)         Return 0 (EOF)          N/A                  N/A
Write      N/A                   N/A                     Block (sleep)        SIGPIPE + return -EPIPE

Blocking is implemented as a retry loop: the task stores itself in reader_waiting or writer_waiting, calls sched_block(), and re-evaluates conditions when woken. Defensive checks reset read_pos/write_pos to 0 if they exceed PIPE_BUF_SIZE (protects against kernel bugs).

Reference Counting

Read and write ends have separate reference counts (read_refs, write_refs). dup/fork increments the appropriate counter. close decrements it and:

  • Read close: wakes blocked writer (so it can observe read_refs == 0 and return -EPIPE)
  • Write close: wakes blocked reader (so it can observe write_refs == 0 and return EOF)
  • When both counts reach 0: the kva page is freed
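
The close rules above can be modelled with plain counters and flags. In this hypothetical sketch the wake-ups and the page free are recorded as booleans, and the peer is woken on every close rather than only on the last one. That is an assumption, harmless here because a woken task re-evaluates its condition.

```c
#include <stdint.h>

typedef struct {
    uint32_t read_refs, write_refs;
    int reader_woken, writer_woken;  /* stand-ins for waking a blocked task */
    int freed;                       /* stand-in for freeing the kva page */
} sk_pipe_t;

typedef enum { SK_READ_END, SK_WRITE_END } sk_end_t;

/* Drop one reference on the given end; the caller must hold that reference. */
static void sk_pipe_close(sk_pipe_t *p, sk_end_t end)
{
    if (end == SK_READ_END) {
        p->read_refs--;
        p->writer_woken = 1;   /* writer re-checks read_refs */
    } else {
        p->write_refs--;
        p->reader_woken = 1;   /* reader re-checks write_refs */
    }
    if (p->read_refs == 0 && p->write_refs == 0)
        p->freed = 1;          /* last reference on both ends is gone */
}
```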

Poll Support

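End    POLLIN                        POLLOUT                                  POLLHUP          POLLERR
-----  ----------------------------  ---------------------------------------  ---------------  --------------
Read   count > 0 or write_refs == 0  -                                        write_refs == 0  -
Write  -                             count < PIPE_BUF_SIZE or read_refs == 0  -                read_refs == 0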
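
A sketch of how the event masks could be computed, using the POSIX <poll.h> constants. It assumes the read end reports POLLHUP and the write end reports POLLERR when the peer is closed, which matches the usual Linux pipe convention; the function names are hypothetical.

```c
#include <poll.h>
#include <stdint.h>

#define SK_PIPE_BUF_SIZE 4056u   /* PIPE_BUF_SIZE in the text */

/* Events for the read end: readable when data is buffered or at EOF. */
static short sk_poll_read_end(uint32_t count, uint32_t write_refs)
{
    short ev = 0;
    if (count > 0 || write_refs == 0)
        ev |= POLLIN;
    if (write_refs == 0)
        ev |= POLLHUP;               /* assumed column, see lead-in */
    return ev;
}

/* Events for the write end: writable when there is space or no reader. */
static short sk_poll_write_end(uint32_t count, uint32_t read_refs)
{
    short ev = 0;
    if (count < SK_PIPE_BUF_SIZE || read_refs == 0)
        ev |= POLLOUT;
    if (read_refs == 0)
        ev |= POLLERR;               /* assumed column, see lead-in */
    return ev;
}
```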

SMAP Safety

The write path copies from user space via a stack-allocated staging buffer:

char staging[PIPE_BUF_SIZE];  /* 4056 bytes on kernel stack */
copy_from_user(staging, buf, n);
/* then memcpy from staging into ring buffer */

Stack budget: sys_write -> pipe_write_fn totals ~4400 bytes. Kernel stack is 4 pages (16 KB).


memfd – Anonymous Shared Memory

Source: kernel/fs/memfd.c, kernel/fs/memfd.h

memfd provides anonymous memory-backed file descriptors, primarily used with mmap for shared memory between processes (e.g., framebuffer sharing between the compositor and GUI applications).

Data Structure

typedef struct {
    uint8_t   in_use;
    uint32_t  refcount;
    char      name[32];                   /* debug name */
    uint64_t  phys_pages[2048];           /* MEMFD_PAGES_MAX */
    uint32_t  page_count;                 /* allocated pages */
    uint64_t  size;                       /* logical size in bytes */
} memfd_t;

static memfd_t s_memfds[16];             /* MEMFD_MAX */

Characteristics

Property               Value
---------------------  --------------------------------------------------------
Max concurrent memfds  16 (MEMFD_MAX)
Max size per memfd     8 MB (MEMFD_PAGES_MAX * 4096)
Backed by              Physical pages (PMM)
Write via fd           Not supported (-ENOSYS); use mmap
Read via fd            Supported (reads from physical pages via vmm_window_map)

Lifecycle

  1. memfd_alloc(name): Allocates a slot in s_memfds[], sets refcount = 1
  2. memfd_open_fd(id, proc): Installs a vfs_file_t in the process’s fd table
  3. memfd_truncate(id, size): Allocates or frees physical pages to match the requested size. Pages are allocated via pmm_alloc_page() and zeroed via vmm_window_map
  4. mmap(fd, ...): Maps the physical pages into the process’s virtual address space (handled by the mmap syscall, not memfd itself)
  5. Close: Decrements refcount; when it reaches 0, frees all physical pages via pmm_free_page
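
The sizing arithmetic behind step 3 can be sketched as below. Both functions are hypothetical; they only compute how many pages a truncate would have to allocate or free, without touching the PMM.

```c
#include <stdint.h>

#define MF_PAGE_SIZE 4096u
#define MF_PAGES_MAX 2048u   /* MEMFD_PAGES_MAX in the text: 8 MB total */

/* Pages needed to back `size` bytes, or -1 if it exceeds the limit. */
static int64_t mf_pages_for_size(uint64_t size)
{
    if (size > (uint64_t)MF_PAGES_MAX * MF_PAGE_SIZE)
        return -1;
    return (int64_t)((size + MF_PAGE_SIZE - 1) / MF_PAGE_SIZE);  /* round up */
}

/* Work out how many pages to allocate or free to reach new_size.
 * Returns 0 on success, -1 if new_size is too large. */
static int mf_truncate_plan(uint32_t cur_pages, uint64_t new_size,
                            uint32_t *to_alloc, uint32_t *to_free)
{
    int64_t want = mf_pages_for_size(new_size);
    if (want < 0)
        return -1;
    *to_alloc = want > cur_pages ? (uint32_t)(want - cur_pages) : 0;
    *to_free  = want < cur_pages ? (uint32_t)(cur_pages - want) : 0;
    return 0;
}
```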

Lock Ordering

The memfd_lock spinlock protects all memfd operations. To avoid lock inversion with vmm_window_lock, the read path:

  1. Acquires memfd_lock, snapshots phys_pages[i]
  2. Releases memfd_lock
  3. Calls vmm_window_map(phys) to access the page
  4. Re-acquires memfd_lock to continue

This interleaving is safe because the refcount prevents page deallocation while the fd is open.


initrd – Boot Image Filesystem

Source: kernel/fs/initrd.c

The initrd is a compile-time filesystem embedded directly in the kernel binary. Files are stored as static data in the kernel’s .rodata and .data sections. Binary executables are embedded via objcopy -I binary (long form: --input-target binary), producing link-time symbols like _binary_login_bin_start and _binary_login_bin_end.

File Table

typedef struct {
    const char          *name;    /* absolute path */
    const unsigned char *start;   /* data start */
    const unsigned char *end;     /* data end */
} initrd_entry_t;

static const initrd_entry_t s_files[] = {
    { "/etc/motd",    ..., ... },
    { "/bin/login",   _binary_login_bin_start, _binary_login_bin_end },
    { "/bin/vigil",   _binary_vigil_bin_start, _binary_vigil_bin_end },
    { "/bin/sh",      _binary_shell_bin_start, _binary_shell_bin_end },
    /* ... 32 entries total, NULL-terminated */
};
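
A lookup over a table of this shape might be sketched as follows. The sk_* names and the motd content are placeholders; the sketch shows an exact-path match over a NULL-terminated table and a file size derived from the start and end pointers.

```c
#include <stddef.h>
#include <string.h>

typedef struct {
    const char          *name;    /* absolute path */
    const unsigned char *start;   /* data start */
    const unsigned char *end;     /* data end (one past the last byte) */
} sk_entry_t;

static const unsigned char sk_motd[] = "Welcome\n";   /* placeholder content */

static const sk_entry_t sk_files[] = {
    { "/etc/motd", sk_motd, sk_motd + sizeof sk_motd - 1 },
    { NULL, NULL, NULL }                               /* terminator */
};

/* Exact-path lookup; returns NULL when the path is not in the table. */
static const sk_entry_t *sk_lookup(const char *path)
{
    for (const sk_entry_t *e = sk_files; e->name != NULL; e++)
        if (strcmp(e->name, path) == 0)
            return e;
    return NULL;
}

/* File size is the distance between the two pointers. */
static size_t sk_size(const sk_entry_t *e)
{
    return (size_t)(e->end - e->start);
}
```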

File Categories

Boot binaries (embedded ELF executables):

  • /bin/login – authentication program
  • /bin/vigil – init/service manager
  • /bin/sh – shell (stsh)
  • /bin/echo, /bin/cat, /bin/ls – core utilities

Configuration files (static strings):

  • /etc/motd – message of the day (ASCII banner)
  • /etc/banner, /etc/banner.net – login banners
  • /etc/passwd – user database (root:x:0:0:root:/root:/bin/stsh)
  • /etc/shadow – password hashes (SHA-512)
  • /etc/profile – shell profile (PS1, PATH)
  • /etc/hosts – static host table

Vigil service definitions (/etc/vigil/services/<service>/{run,policy,caps}):

  • getty – console login service
  • httpd – HTTP server
  • dhcp – DHCP client
  • chronos – NTP time sync

Policy capability files (/etc/aegis/caps.d/<binary>):

  • Per-binary policy capabilities read at execve time by the security policy engine
  • Format: tier CAP1 CAP2 ... per line
  • Two tiers: service (unconditional) and admin (requires authenticated session)
  • These policy capabilities are loaded into the process’s capability table in addition to the baseline capabilities that every exec’d process receives
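
Parsing one line of the "tier CAP1 CAP2 ..." format could look like the sketch below. The struct, its size limits, and the function name are hypothetical, and capability names are stored as opaque strings because their spelling is not specified here.

```c
#include <stdio.h>
#include <string.h>

#define SK_MAX_CAPS 8   /* arbitrary limit for this sketch */

typedef struct {
    char tier[16];
    char caps[SK_MAX_CAPS][32];
    int  ncaps;
} sk_policy_line_t;

/* Parse "tier CAP1 CAP2 ..." into out. Returns 0 on success, -1 if the
 * tier is missing or is neither "service" nor "admin". */
static int sk_parse_policy_line(const char *line, sk_policy_line_t *out)
{
    static const char ws[] = " \t\r\n";
    int ntok = 0;
    out->ncaps = 0;
    out->tier[0] = '\0';
    for (const char *p = line + strspn(line, ws); *p; p += strspn(p, ws)) {
        size_t len = strcspn(p, ws);            /* length of this token */
        if (ntok == 0)
            snprintf(out->tier, sizeof out->tier, "%.*s", (int)len, p);
        else if (out->ncaps < SK_MAX_CAPS)
            snprintf(out->caps[out->ncaps++], sizeof out->caps[0],
                     "%.*s", (int)len, p);
        ntok++;
        p += len;                               /* skip past the token */
    }
    if (strcmp(out->tier, "service") != 0 && strcmp(out->tier, "admin") != 0)
        return -1;
    return 0;
}
```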

Directory Listings

Directories are implemented as static dir_entry_t arrays:

typedef struct { const char *name; uint8_t type; } dir_entry_t;

static const dir_entry_t s_root_entries[] = {
    { "etc", 4 }, { "bin", 4 }, { "dev", 4 }, { "lib", 4 },
    { "root", 4 }, { "tmp", 4 }, { "run", 4 }, { "proc", 4 },
    { NULL, 0 }
};

Note: /bin directory listing is not provided by initrd. The ls /bin command falls through to ext2, which shows all binaries on the disk image. Individual initrd files (e.g., /bin/login) are still found by exact path match.

Device Files

The initrd also handles device file opens:

Path                       Backend                    Description
-------------------------  -------------------------  ---------------------------------------
/dev/tty                   kbd_vfs_open()             Keyboard input device
/dev/urandom, /dev/random  CSPRNG (random_get_bytes)  Random bytes (4 KB max per read)
/dev/mouse                 USB HID mouse              Event-based mouse input (mouse_event_t)

/dev/urandom and /dev/random share the same backing implementation (modern Linux semantics). Writes to /dev/urandom are accepted but do not seed the pool.

/dev/mouse returns mouse_event_t structs in non-blocking mode. Returns -EAGAIN if no events are available.

Security: /etc/shadow Protection

The initrd stat function assigns /etc/shadow mode 0640 (not world-readable), while all other files get 0555. The VFS layer enforces an additional capability gate requiring the CAP_KIND_AUTH capability kind in the process’s capability table for /etc/shadow access on the initrd path, using byte-by-byte path comparison (no symlinks in initrd, so the path is canonical). Without this capability kind, the open returns ENOCAP (errno 130).

Read Path

Reads need no intermediate buffer: the read callback copies directly from the file data embedded in the kernel image into the caller’s buffer via __builtin_memcpy. No per-open buffer is allocated and no cached duplicate of the file data is kept.


Console Device

Source: kernel/fs/console.c

The console is a write-only character device for /dev/console, used as stdout/stderr for user processes. It is a stateless singleton – all instances share the same ops table and priv pointer (NULL).

Output Sinks

Each console write is sent to up to three sinks:

  1. Serial port (serial_write_string) – always active
  2. VGA text mode (vga_write_string) – active when not in quiet mode and VGA is available
  3. Framebuffer (fb_putchar) – active when not in quiet mode and FB is available

The quiet mode check (printk_get_quiet()) suppresses screen output during graphical boot to prevent boot log flash before the compositor takes over.

SMAP Safety

Console write uses a 256-byte kernel bounce buffer:

char kbuf[256];
n = min(len, 256);
n = min(n, page_boundary_distance);
copy_from_user(kbuf, buf, n);

Characters are then written one at a time to each output sink to properly handle control characters (\b, \r, \n).

VFS Interface

static const vfs_ops_t s_console_ops = {
    .read    = console_read_fn,   /* returns -ENOSYS */
    .write   = console_write_fn,
    .close   = console_close_fn,  /* no-op */
    .readdir = NULL,
    .dup     = NULL,              /* stateless */
    .stat    = console_stat_fn,   /* S_IFCHR|0600, major=5 minor=1 */
    .poll    = console_poll_fn,   /* POLLOUT always */
};

Summary: Backend Comparison

Backend  Mount Point             Writable    Persistent  Max Size         Required Capability Kinds
-------  ----------------------  ----------  ----------  ---------------  -------------------------------------
procfs   /proc/                  No          N/A         ~4 KB/file       CAP_KIND_PROC_READ (cross-pid access)
ramfs    /tmp/, /run/            Yes         No          4 KB/file        None
Pipes    Anonymous               Yes         No          4056 bytes       None
memfd    Anonymous               mmap only   No          8 MB             None
initrd   /, /bin/, /etc/, /dev/  No          Yes (ROM)   Varies           CAP_KIND_AUTH (/etc/shadow)
Console  /dev/console            Write-only  N/A         N/A              None
ext2     Root filesystem         Yes         Yes (NVMe)  ~48 KB writable  CAP_KIND_AUTH (/etc/shadow) + DAC

All backends register their vfs_ops_t tables statically. There is no dynamic filesystem registration mechanism – adding a new filesystem requires kernel code changes and recompilation.