procfs and Special Filesystems
Documentation of the Aegis procfs virtual filesystem, ramfs volatile storage, anonymous pipes, memfd shared memory, console device, and initrd boot image.
Beyond the primary ext2 filesystem, Aegis provides several special-purpose filesystems that serve as VFS backends. Each implements the vfs_ops_t interface documented in the VFS layer.
v1 note: These filesystem implementations are v1 software – functional and tested, but not hardened. Contributions are welcome – file issues or propose changes at exec/aegis.
procfs – Process Information Filesystem
Source: kernel/fs/procfs.c
The procfs implementation uses a generate-on-open design: opening a /proc file allocates a kernel virtual address (kva) page, generates the file content into it, and stores the buffer in the fd’s priv field. Subsequent reads copy from this snapshot; close frees the buffer. Content is never regenerated after open – it is a point-in-time snapshot.
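The generate-on-open pattern can be sketched in user-space C. This is a minimal illustration, not the kernel code: malloc stands in for the kva page allocator, and the names snap_file_t, snap_open, snap_read, snap_close, and gen_version are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define SNAP_PAGE 4096u /* one page, standing in for a kva page */

/* Hypothetical per-fd snapshot state. */
typedef struct {
    char    *buf; /* content generated at open time */
    uint32_t len; /* number of valid bytes in buf */
} snap_file_t;

/* Generator callback: fills buf (capacity cap) and returns the length. */
typedef uint32_t (*snap_gen_fn)(char *buf, uint32_t cap);

/* Example generator producing the /proc/version string. */
static uint32_t gen_version(char *buf, uint32_t cap)
{
    static const char s[] = "Aegis 0.31.0\n";
    uint32_t n = (uint32_t)(sizeof s - 1);
    if (n > cap)
        n = cap;
    memcpy(buf, s, n);
    return n;
}

/* open: allocate the page and generate the content exactly once. */
static int snap_open(snap_file_t *f, snap_gen_fn gen)
{
    assert(f && gen);
    f->buf = malloc(SNAP_PAGE);
    if (!f->buf)
        return -1;
    f->len = gen(f->buf, SNAP_PAGE);
    return 0;
}

/* read: copy from the snapshot; the generator is never called again. */
static uint32_t snap_read(const snap_file_t *f, uint32_t off,
                          char *dst, uint32_t n)
{
    if (off >= f->len)
        return 0; /* EOF */
    if (n > f->len - off)
        n = f->len - off;
    memcpy(dst, f->buf + off, n);
    return n;
}

/* close: release the snapshot. */
static void snap_close(snap_file_t *f)
{
    free(f->buf);
    f->buf = NULL;
    f->len = 0;
}
```

Because the content is fixed at open time, a reader that needs fresh data must close and reopen the file.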
Namespace Layout
/proc/
├── self/ → symlink-like alias to /proc/<current_pid>/
│ ├── maps → memory map (VMA table)
│ ├── status → process status summary
│ ├── stat → single-line stat (Linux /proc/<pid>/stat format)
│ ├── exe → executable path
│ ├── cmdline → NUL-terminated argv[0] (executable path)
│ └── fd/ → directory listing of open file descriptors
├── <pid>/ → per-process directory (same entries as self/)
├── meminfo → system memory summary
├── version → kernel version string
└── cmdline → kernel command line
Capability Gating
Access to /proc/self/ is always permitted. Accessing /proc/<pid>/ for a different process requires the CAP_KIND_PROC_READ capability kind in the caller’s capability table:
static int procfs_check_access(uint32_t target_pid)
{
    aegis_task_t *cur = sched_current();
    if (!cur || !cur->is_user) return -1;
    aegis_process_t *caller = (aegis_process_t *)cur;
    if (target_pid == caller->pid)
        return 0; /* self always OK */
    return cap_check(caller->caps, CAP_TABLE_SIZE,
                     CAP_KIND_PROC_READ, CAP_RIGHTS_READ);
}
The cap_check function scans the process’s capability table (the per-process caps array) for an entry matching the CAP_KIND_PROC_READ capability kind with the CAP_RIGHTS_READ rights bitfield. If the capability is not present, the call returns ENOCAP (errno 130), and procfs_open_pid denies access. This prevents unprivileged processes from inspecting other processes’ memory maps, file descriptors, or credentials. See the capability model for the full capability system.
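The table scan can be illustrated with a small sketch. The slot layout, the constant values, and the negative return convention below are assumptions made for illustration; only the errno value 130 comes from the text above.

```c
#include <assert.h>
#include <stdint.h>

#define ENOCAP 130 /* errno value quoted in the text */

/* Hypothetical constants; the real kind and rights values may differ. */
#define SK_KIND_PROC_READ 7u
#define SK_RIGHTS_READ    0x1u
#define SK_RIGHTS_WRITE   0x2u

/* Hypothetical slot layout; kind 0 is assumed to mean "empty". */
typedef struct {
    uint32_t kind;   /* capability kind */
    uint32_t rights; /* rights bitfield */
} cap_slot_t;

/* Return 0 if some slot holds the kind with all requested rights,
 * otherwise -ENOCAP (sign convention assumed). */
static int cap_check_sketch(const cap_slot_t *caps, uint32_t n,
                            uint32_t kind, uint32_t rights)
{
    assert(caps);
    for (uint32_t i = 0; i < n; i++) {
        if (caps[i].kind == kind && (caps[i].rights & rights) == rights)
            return 0;
    }
    return -ENOCAP;
}
```

The rights comparison requires every requested bit to be present, so a slot granting only read access does not satisfy a read-and-write request.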
Per-Process Files
/proc/<pid>/maps
Generated by gen_maps(). Outputs one line per VMA entry in the process’s VMA table:
00400000-00401000 r-xp 00000000 00:00 0 /bin/stsh
00600000-00601000 rw-p 00000000 00:00 0 /bin/stsh
01000000-01001000 rw-p 00000000 00:00 0 [heap]
7fff0000-80000000 rw-p 00000000 00:00 0 [stack]
VMA types are mapped to names:
| VMA Type | Label |
|---|---|
| VMA_ELF_TEXT / VMA_ELF_DATA | Executable path |
| VMA_HEAP | [heap] |
| VMA_STACK | [stack] |
| VMA_GUARD | [guard] |
| VMA_THREAD_STACK | [thread_stack] |
Permission bits: r (PROT_READ=1), w (PROT_WRITE=2), x (PROT_EXEC=4), p (always private).
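As a sketch of how the permission column could be derived from the protection bits listed above (the function name format_perms and the SK_ prefixed constants are hypothetical; the bit values are the ones given in the text):

```c
#include <assert.h>
#include <stdint.h>

#define SK_PROT_READ  1u
#define SK_PROT_WRITE 2u
#define SK_PROT_EXEC  4u

/* Write a 4-character permission string plus NUL into out. */
static void format_perms(uint32_t prot, char out[5])
{
    assert(out);
    out[0] = (prot & SK_PROT_READ)  ? 'r' : '-';
    out[1] = (prot & SK_PROT_WRITE) ? 'w' : '-';
    out[2] = (prot & SK_PROT_EXEC)  ? 'x' : '-';
    out[3] = 'p'; /* mappings are always reported as private */
    out[4] = '\0';
}
```

With these values, a text segment (read and execute, prot = 5) formats as r-xp and a data segment (read and write, prot = 3) as rw-p, matching the sample output.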
/proc/<pid>/status
Generated by gen_status(). Multi-line key-value format:
Name: stsh
State: R (running)
Tgid: 3
Pid: 3
PPid: 1
Uid: 0
Gid: 0
VmSize: 8192 kB
Task states:
| State | Character | Description |
|---|---|---|
| TASK_RUNNING | R | Currently runnable or on CPU |
| TASK_BLOCKED | S | Sleeping (waiting for event) |
| TASK_ZOMBIE | Z | Terminated, awaiting parent wait() |
| TASK_STOPPED | T | Stopped (signal) |
/proc/<pid>/stat
Generated by gen_stat(). Single-line format compatible with Linux /proc/<pid>/stat:
3 (stsh) R 1 3 0 0 3 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Fields: pid (comm) state ppid pgid sid tty_nr tpgid ... Most of the remaining fields are emitted as 0 so that parsers expecting the Linux field count keep working.
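A consumer might read the leading fields with sscanf, as in the sketch below. The struct and function names are hypothetical, and the pattern assumes the comm field contains no closing parenthesis.

```c
#include <assert.h>
#include <stdio.h>

/* First eight fields of a stat line. */
typedef struct {
    int  pid;
    char comm[64];
    char state;
    int  ppid, pgid, sid, tty_nr, tpgid;
} stat_head_t;

/* Parse the first eight fields. Returns 0 on success, -1 otherwise. */
static int parse_stat_head(const char *line, stat_head_t *out)
{
    assert(line && out);
    int n = sscanf(line, "%d (%63[^)]) %c %d %d %d %d %d",
                   &out->pid, out->comm, &out->state, &out->ppid,
                   &out->pgid, &out->sid, &out->tty_nr, &out->tpgid);
    return n == 8 ? 0 : -1;
}
```

Applied to the sample line, this yields pid 3, comm stsh, state R, ppid 1, pgid 3, sid 0, tty_nr 0, and tpgid 3.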
/proc/<pid>/exe
The process’s executable path followed by a newline.
/proc/<pid>/cmdline
The process’s executable path as a NUL-terminated string (argv[0] only).
/proc/<pid>/fd/
Directory listing of open file descriptors. Enumerates the process’s fd table (PROC_MAX_FDS = 16 slots), listing each fd number where fds[i].ops != NULL.
Global Files
/proc/meminfo
Generated by gen_meminfo(). Reports physical memory statistics from the PMM:
MemTotal: 131072 kB
MemFree: 98304 kB
MemAvailable: 98304 kB
Values are derived from pmm_total_pages() and pmm_free_pages() multiplied by 4 (4 KB pages to KB). MemAvailable is set equal to MemFree (no page cache accounting).
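The page-to-kilobyte conversion is simple enough to check by hand: 32768 pages × 4 = 131072 kB. A sketch of the formatting step, assuming the page counts are passed in by the caller (pages_to_kb and format_meminfo are hypothetical names, and the exact column spacing of the real output is not reproduced):

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

#define PAGE_KB 4u /* 4 KB pages */

static uint64_t pages_to_kb(uint64_t pages)
{
    return pages * PAGE_KB;
}

/* Format the three meminfo lines; returns the snprintf result. */
static int format_meminfo(char *buf, size_t cap,
                          uint64_t total_pages, uint64_t free_pages)
{
    assert(buf && cap > 0);
    unsigned long long total_kb = pages_to_kb(total_pages);
    unsigned long long free_kb  = pages_to_kb(free_pages);
    /* MemAvailable mirrors MemFree: there is no page cache to reclaim. */
    return snprintf(buf, cap,
                    "MemTotal: %llu kB\nMemFree: %llu kB\nMemAvailable: %llu kB\n",
                    total_kb, free_kb, free_kb);
}
```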
/proc/version
Static string: Aegis 0.31.0\n
/proc/cmdline
The kernel command line retrieved via arch_get_cmdline().
Internal Architecture
File priv structure:
typedef struct {
    char *buf;      /* kva-allocated content buffer (1 page) */
    uint32_t len;   /* content length in bytes */
    uint32_t _pad;
} procfs_file_priv_t;
Directory priv structure:
typedef struct {
    uint32_t pid;    /* 0 = root /proc/ dir */
    uint8_t is_fd;   /* 1 = /proc/[pid]/fd/ */
    uint8_t _pad[3];
} procfs_dir_priv_t;
VFS ops tables:
| Ops Table | Used For | Operations |
|---|---|---|
| s_procfs_file_ops | Generated-content files | read, close, stat |
| s_procfs_dir_ops | Directory listings | readdir, close, stat |
Memory management: Each file open allocates two kva pages – one for the procfs_file_priv_t struct and one for the content buffer. Both are freed on close. Directories allocate one kva page for the procfs_dir_priv_t.
Root directory enumeration: The /proc/ root directory lists: self, meminfo, version, cmdline, then all live user processes by iterating the circular task list.
ramfs – In-Memory Volatile Storage
Source: kernel/fs/ramfs.c, kernel/fs/ramfs.h
The ramfs provides simple in-memory file storage used for /tmp and /run. Two static instances are initialized by vfs_init():
static ramfs_t s_run_ramfs; /* /run/ */
static ramfs_t s_tmp_ramfs; /* /tmp/ */
Data Structure
typedef struct {
    char name[64];   /* RAMFS_MAX_NAMELEN */
    uint8_t *data;   /* kva-allocated page; NULL until first write */
    uint32_t size;   /* current byte count */
    uint8_t in_use;
} ramfs_file_t;

typedef struct {
    ramfs_file_t files[32]; /* RAMFS_MAX_FILES */
    spinlock_t lock;
} ramfs_t;
Characteristics
| Property | Value |
|---|---|
| Max files per instance | 32 (RAMFS_MAX_FILES) |
| Max filename length | 63 characters (RAMFS_MAX_NAMELEN - 1) |
| Max file size | 4096 bytes (RAMFS_MAX_SIZE, one kva page) |
| Subdirectories | Not supported (flat namespace) |
| Persistence | None (volatile) |
| Concurrency | Per-instance spinlock with IRQ save/restore |
Operations
- ramfs_open: Finds an existing file by name or creates one if VFS_O_CREAT is set. The file data page is not allocated until the first write.
- ramfs_stat: Returns synthetic stat with st_dev = 3, S_IFREG | 0644.
- ramfs_opendir: Returns a directory handle that enumerates all in_use files.
- ramfs_populate: Kernel-side write helper. Used to pre-populate files without going through the user-space write path. Copies directly from a kernel buffer using __builtin_memcpy.
Write Path and SMAP
The ramfs write function accepts user-space pointers from sys_write. It uses copy_from_user (STAC/CLAC) with page-boundary clamping to avoid crossing unmapped pages:
while (done < len) {
    uint64_t page_off = (uintptr_t)(buf + done) & 0xFFF;
    uint64_t to_end = 0x1000 - page_off;
    chunk = min(len - done, to_end);
    copy_from_user(f->data + done, buf + done, chunk);
    done += chunk;
}
This is necessary because the kernel runs with SMAP (Supervisor Mode Access Prevention) enabled, and a single copy_from_user call must not span a page boundary into an unmapped region.
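The clamping arithmetic can be exercised in isolation. The sketch below only computes chunk sizes and performs no user-space access; chunk_to_page_end and count_chunks are hypothetical helper names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SK_PAGE_SIZE 0x1000u

/* Bytes that can be copied starting at addr without leaving its page,
 * limited to the number of bytes still remaining. */
static size_t chunk_to_page_end(uintptr_t addr, size_t remaining)
{
    size_t to_end = SK_PAGE_SIZE - (size_t)(addr & (SK_PAGE_SIZE - 1));
    return remaining < to_end ? remaining : to_end;
}

/* Number of chunks a copy of len bytes starting at addr is split into. */
static size_t count_chunks(uintptr_t addr, size_t len)
{
    size_t done = 0, chunks = 0;
    while (done < len) {
        size_t chunk = chunk_to_page_end(addr + done, len - done);
        assert(chunk > 0); /* the loop always makes progress */
        done += chunk;
        chunks++;
    }
    return chunks;
}
```

For example, a 100-byte write whose source starts 16 bytes before a page boundary is split into a 16-byte chunk and an 84-byte chunk, so each copy touches exactly one page.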
Pipes – Anonymous Inter-Process Communication
Source: kernel/fs/pipe.c, kernel/fs/pipe.h
Pipes provide unidirectional byte streams between processes. Each pipe is a single kva page (4096 bytes) containing a ring buffer and metadata:
typedef struct {
    uint8_t buf[4056];             /* PIPE_BUF_SIZE */
    uint32_t read_pos;             /* ring buffer read cursor */
    uint32_t write_pos;            /* ring buffer write cursor */
    uint32_t count;                /* bytes currently buffered */
    uint32_t read_refs;            /* open read-end fd count */
    uint32_t write_refs;           /* open write-end fd count */
    spinlock_t lock;               /* per-pipe spinlock */
    aegis_task_t *reader_waiting;  /* blocked reader task */
    aegis_task_t *writer_waiting;  /* blocked writer task */
} pipe_t;
/* sizeof(pipe_t) == 4096, enforced by _Static_assert */
Ring Buffer Layout
PIPE_BUF_SIZE = 4056
┌────────────────────────────────────────────┐
│ ......[data]............[free]...... │
│ ^read_pos ^write_pos │
└────────────────────────────────────────────┘
count = bytes between read_pos and write_pos (with wrap)
Blocking Semantics
| Operation | Empty + Writers Open | Empty + Writers Closed | Full + Readers Open | Full + Readers Closed |
|---|---|---|---|---|
| Read | Block (sleep) | Return 0 (EOF) | N/A | N/A |
| Write | N/A | N/A | Block (sleep) | SIGPIPE + return -EPIPE |
Blocking is implemented as a retry loop: the task stores itself in reader_waiting or writer_waiting, calls sched_block(), and re-evaluates conditions when woken. Defensive checks reset read_pos/write_pos to 0 if they exceed PIPE_BUF_SIZE (protects against kernel bugs).
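The cursor arithmetic of such a ring buffer can be sketched without the locking and blocking parts. This is an illustrative user-space model, not the kernel code; ring_t, ring_put, and ring_get are hypothetical names, and a short return value marks the point where the real pipe would block.

```c
#include <assert.h>
#include <stdint.h>

#define RB_SIZE 4056u /* same capacity as PIPE_BUF_SIZE */

typedef struct {
    uint8_t  buf[RB_SIZE];
    uint32_t read_pos;  /* next byte to read */
    uint32_t write_pos; /* next free slot */
    uint32_t count;     /* bytes currently buffered */
} ring_t;

/* Store up to n bytes; returns how many were stored (0 if full). */
static uint32_t ring_put(ring_t *r, const uint8_t *src, uint32_t n)
{
    assert(r->count <= RB_SIZE);
    uint32_t space = RB_SIZE - r->count;
    if (n > space)
        n = space;
    for (uint32_t i = 0; i < n; i++) {
        r->buf[r->write_pos] = src[i];
        r->write_pos = (r->write_pos + 1) % RB_SIZE; /* wrap */
    }
    r->count += n;
    return n;
}

/* Remove up to n bytes; returns how many were read (0 if empty). */
static uint32_t ring_get(ring_t *r, uint8_t *dst, uint32_t n)
{
    assert(r->count <= RB_SIZE);
    if (n > r->count)
        n = r->count;
    for (uint32_t i = 0; i < n; i++) {
        dst[i] = r->buf[r->read_pos];
        r->read_pos = (r->read_pos + 1) % RB_SIZE; /* wrap */
    }
    r->count -= n;
    return n;
}
```

Tracking count separately from the two cursors removes the usual ambiguity between a full and an empty buffer when read_pos equals write_pos.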
Reference Counting
Read and write ends have separate reference counts (read_refs, write_refs). dup/fork increments the appropriate counter. close decrements it and:
- Read close: wakes blocked writer (so it can observe read_refs == 0 and return EPIPE)
- Write close: wakes blocked reader (so it can observe write_refs == 0 and return EOF)
- When both counts reach 0: the kva page is freed
Poll Support
| End | POLLIN | POLLOUT | POLLHUP | POLLERR |
|---|---|---|---|---|
| Read | count > 0 or write_refs == 0 | – | write_refs == 0 | – |
| Write | – | count < PIPE_BUF_SIZE or read_refs == 0 | – | read_refs == 0 |
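The table translates directly into two mask computations. The sketch below uses the POSIX constants from poll.h for illustration (Aegis may define its own values), and pipe_state_t, pipe_poll_read_end, and pipe_poll_write_end are hypothetical names.

```c
#include <assert.h>
#include <poll.h>
#include <stdint.h>

#define SK_PIPE_BUF_SIZE 4056u

/* Only the fields the poll logic depends on. */
typedef struct {
    uint32_t count;
    uint32_t read_refs;
    uint32_t write_refs;
} pipe_state_t;

/* Events reported on the read end. */
static short pipe_poll_read_end(const pipe_state_t *p)
{
    assert(p);
    short ev = 0;
    if (p->count > 0 || p->write_refs == 0)
        ev |= POLLIN;  /* data available, or EOF can be read */
    if (p->write_refs == 0)
        ev |= POLLHUP; /* all writers are gone */
    return ev;
}

/* Events reported on the write end. */
static short pipe_poll_write_end(const pipe_state_t *p)
{
    assert(p);
    short ev = 0;
    if (p->count < SK_PIPE_BUF_SIZE || p->read_refs == 0)
        ev |= POLLOUT; /* a write would not block */
    if (p->read_refs == 0)
        ev |= POLLERR; /* a write would fail with EPIPE */
    return ev;
}
```

Reporting POLLIN when all writers have closed lets a poller wake up and observe EOF instead of waiting indefinitely.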
SMAP Safety
The write path copies from user space via a stack-allocated staging buffer:
char staging[PIPE_BUF_SIZE]; /* 4056 bytes on kernel stack */
copy_from_user(staging, buf, n);
/* then memcpy from staging into ring buffer */
Stack budget: sys_write -> pipe_write_fn totals ~4400 bytes. Kernel stack is 4 pages (16 KB).
memfd – Anonymous Shared Memory
Source: kernel/fs/memfd.c, kernel/fs/memfd.h
memfd provides anonymous memory-backed file descriptors, primarily used with mmap for shared memory between processes (e.g., framebuffer sharing between the compositor and GUI applications).
Data Structure
typedef struct {
    uint8_t in_use;
    uint32_t refcount;
    char name[32];             /* debug name */
    uint64_t phys_pages[2048]; /* MEMFD_PAGES_MAX */
    uint32_t page_count;       /* allocated pages */
    uint64_t size;             /* logical size in bytes */
} memfd_t;

static memfd_t s_memfds[16]; /* MEMFD_MAX */
Characteristics
| Property | Value |
|---|---|
| Max concurrent memfds | 16 (MEMFD_MAX) |
| Max size per memfd | 8 MB (MEMFD_PAGES_MAX * 4096) |
| Backed by | Physical pages (PMM) |
| Write via fd | Not supported (-ENOSYS); use mmap |
| Read via fd | Supported (reads from physical pages via vmm_window_map) |
Lifecycle
- memfd_alloc(name): Allocates a slot in s_memfds[], sets refcount = 1
- memfd_open_fd(id, proc): Installs a vfs_file_t in the process’s fd table
- memfd_truncate(id, size): Allocates or frees physical pages to match the requested size. Pages are allocated via pmm_alloc_page() and zeroed via vmm_window_map
- mmap(fd, ...): Maps the physical pages into the process’s virtual address space (handled by the mmap syscall, not memfd itself)
- Close: Decrements refcount; when it reaches 0, frees all physical pages via pmm_free_page
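The sizing step of a truncate can be sketched as pure arithmetic. The example assumes the requested size is rounded up to whole pages and rejected above the per-memfd limit; both are assumptions about the real memfd_truncate, and the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define SK_MEMFD_PAGE_SIZE 4096u
#define SK_MEMFD_PAGES_MAX 2048u /* 2048 * 4096 bytes = 8 MB */

/* Pages needed to hold size bytes, or -1 if over the limit. */
static int64_t memfd_pages_for(uint64_t size)
{
    if (size > (uint64_t)SK_MEMFD_PAGES_MAX * SK_MEMFD_PAGE_SIZE)
        return -1;
    return (int64_t)((size + SK_MEMFD_PAGE_SIZE - 1) / SK_MEMFD_PAGE_SIZE);
}

/* Compute how many pages a truncate would add (delta > 0) or
 * release (delta < 0). Returns 0 on success, -1 if too large. */
static int memfd_truncate_plan(uint32_t page_count, uint64_t new_size,
                               int64_t *delta)
{
    assert(delta && page_count <= SK_MEMFD_PAGES_MAX);
    int64_t want = memfd_pages_for(new_size);
    if (want < 0)
        return -1;
    *delta = want - (int64_t)page_count;
    return 0;
}
```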
Lock Ordering
The memfd_lock spinlock protects all memfd operations. To avoid lock inversion with vmm_window_lock, the read path:
- Acquires memfd_lock, snapshots phys_pages[i]
- Releases memfd_lock
- Calls vmm_window_map(phys) to access the page
- Re-acquires memfd_lock to continue
This interleaving is safe because the refcount prevents page deallocation while the fd is open.
initrd – Boot Image Filesystem
Source: kernel/fs/initrd.c
The initrd is a compile-time filesystem embedded directly in the kernel binary. Files are stored as static data in the kernel’s .rodata and .data sections. Binary executables are embedded via objcopy --input binary, producing link-time symbols like _binary_login_bin_start and _binary_login_bin_end.
File Table
typedef struct {
    const char *name;           /* absolute path */
    const unsigned char *start; /* data start */
    const unsigned char *end;   /* data end */
} initrd_entry_t;

static const initrd_entry_t s_files[] = {
    { "/etc/motd", ..., ... },
    { "/bin/login", _binary_login_bin_start, _binary_login_bin_end },
    { "/bin/vigil", _binary_vigil_bin_start, _binary_vigil_bin_end },
    { "/bin/sh", _binary_shell_bin_start, _binary_shell_bin_end },
    /* ... 32 entries total, NULL-terminated */
};
File Categories
Boot binaries (embedded ELF executables):
- /bin/login – authentication program
- /bin/vigil – init/service manager
- /bin/sh – shell (stsh)
- /bin/echo, /bin/cat, /bin/ls – core utilities
Configuration files (static strings):
- /etc/motd – message of the day (ASCII banner)
- /etc/banner, /etc/banner.net – login banners
- /etc/passwd – user database (root:x:0:0:root:/root:/bin/stsh)
- /etc/shadow – password hashes (SHA-512)
- /etc/profile – shell profile (PS1, PATH)
- /etc/hosts – static host table
Vigil service definitions (/etc/vigil/services/<service>/{run,policy,caps}):
- getty – console login service
- httpd – HTTP server
- dhcp – DHCP client
- chronos – NTP time sync
Policy capability files (/etc/aegis/caps.d/<binary>):
- Per-binary policy capabilities read at execve time by the security policy engine
- Format: tier CAP1 CAP2 ... per line
- Two tiers: service (unconditional) and admin (requires authenticated session)
- These policy capabilities are loaded into the process’s capability table in addition to the baseline capabilities that every exec’d process receives
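A parser for one line of this format might look like the sketch below. It is an illustration only: the type and function names are hypothetical, whitespace is assumed to be the separator, and the limits (8 capabilities per line, 31 characters per name) are arbitrary choices, not values from the policy engine.

```c
#include <assert.h>
#include <string.h>

#define POLICY_MAX_CAPS 8 /* arbitrary limit for this sketch */

typedef enum { TIER_INVALID = 0, TIER_SERVICE, TIER_ADMIN } policy_tier_t;

typedef struct {
    policy_tier_t tier;
    int           ncaps;
    char          caps[POLICY_MAX_CAPS][32];
} policy_line_t;

/* Parse "tier CAP1 CAP2 ...". Returns 0 on success, -1 on error. */
static int policy_parse_line(const char *line, policy_line_t *out)
{
    assert(line && out);
    char tmp[256];
    size_t len = strlen(line);
    if (len >= sizeof tmp)
        return -1;
    memcpy(tmp, line, len + 1); /* strtok modifies its input */

    out->tier = TIER_INVALID;
    out->ncaps = 0;

    char *tok = strtok(tmp, " \t\n");
    if (!tok)
        return -1; /* empty line */
    if (strcmp(tok, "service") == 0)
        out->tier = TIER_SERVICE;
    else if (strcmp(tok, "admin") == 0)
        out->tier = TIER_ADMIN;
    else
        return -1; /* unknown tier */

    while ((tok = strtok(NULL, " \t\n")) != NULL) {
        if (out->ncaps == POLICY_MAX_CAPS ||
            strlen(tok) >= sizeof out->caps[0])
            return -1;
        strcpy(out->caps[out->ncaps++], tok);
    }
    return 0;
}
```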
Directory Listings
Directories are implemented as static dir_entry_t arrays:
typedef struct { const char *name; uint8_t type; } dir_entry_t;

static const dir_entry_t s_root_entries[] = {
    { "etc", 4 }, { "bin", 4 }, { "dev", 4 }, { "lib", 4 },
    { "root", 4 }, { "tmp", 4 }, { "run", 4 }, { "proc", 4 },
    { NULL, 0 }
};
Note: /bin directory listing is not provided by initrd. The ls /bin command falls through to ext2, which shows all binaries on the disk image. Individual initrd files (e.g., /bin/login) are still found by exact path match.
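Exact-path lookup over a NULL-terminated table can be sketched as follows. The entry type mirrors initrd_entry_t, but the demo table, its placeholder content, and the function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef struct {
    const char *name;           /* absolute path */
    const unsigned char *start; /* data start */
    const unsigned char *end;   /* data end */
} entry_t;

/* Placeholder content standing in for an embedded file. */
static const unsigned char demo_motd[] = "hello\n";

static const entry_t demo_files[] = {
    { "/etc/motd", demo_motd, demo_motd + sizeof demo_motd - 1 },
    { NULL, NULL, NULL } /* terminator */
};

/* Return the entry whose name equals path exactly, or NULL. */
static const entry_t *initrd_lookup(const entry_t *tbl, const char *path)
{
    assert(tbl && path);
    for (const entry_t *e = tbl; e->name; e++) {
        if (strcmp(e->name, path) == 0)
            return e;
    }
    return NULL;
}

/* File size is the distance between the two link-time symbols. */
static size_t entry_size(const entry_t *e)
{
    return (size_t)(e->end - e->start);
}
```

Because the match is on the full path string, a lookup succeeds for /etc/motd even when no directory listing exists for its parent.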
Device Files
The initrd also handles device file opens:
| Path | Backend | Description |
|---|---|---|
| /dev/tty | kbd_vfs_open() | Keyboard input device |
| /dev/urandom, /dev/random | CSPRNG (random_get_bytes) | Random bytes (4 KB max per read) |
| /dev/mouse | USB HID mouse | Event-based mouse input (mouse_event_t) |
/dev/urandom and /dev/random share the same backing implementation (modern Linux semantics). Writes to /dev/urandom are accepted but do not seed the pool.
/dev/mouse returns mouse_event_t structs in non-blocking mode. Returns -EAGAIN if no events are available.
Security: /etc/shadow Protection
The initrd stat function assigns /etc/shadow mode 0640 (not world-readable), while all other files get 0555. The VFS layer additionally requires the CAP_KIND_AUTH capability kind in the process’s capability table to open /etc/shadow on the initrd path. The check compares the path byte by byte; because the initrd has no symlinks, the path is already canonical. Without this capability kind, the open returns ENOCAP (errno 130).
Read Path
Reads use no intermediate buffer: the read callback copies directly from the embedded file data in the kernel image to the destination via __builtin_memcpy. There is no buffer allocation or data duplication inside the initrd.
Console Device
Source: kernel/fs/console.c
The console is a write-only character device for /dev/console, used as stdout/stderr for user processes. It is a stateless singleton – all instances share the same ops table and priv pointer (NULL).
Output Sinks
Console output is written to three sinks simultaneously:
- Serial port (serial_write_string) – always active
- VGA text mode (vga_write_string) – active when not in quiet mode and VGA is available
- Framebuffer (fb_putchar) – active when not in quiet mode and FB is available
The quiet mode check (printk_get_quiet()) suppresses screen output during graphical boot to prevent boot log flash before the compositor takes over.
SMAP Safety
Console write uses a 256-byte kernel bounce buffer:
char kbuf[256];
n = min(len, 256);
n = min(n, page_boundary_distance);
copy_from_user(kbuf, buf, n);
Characters are then written one at a time to each output sink to properly handle control characters (\b, \r, \n).
VFS Interface
static const vfs_ops_t s_console_ops = {
    .read    = console_read_fn,  /* returns -ENOSYS */
    .write   = console_write_fn,
    .close   = console_close_fn, /* no-op */
    .readdir = NULL,
    .dup     = NULL,             /* stateless */
    .stat    = console_stat_fn,  /* S_IFCHR|0600, major=5 minor=1 */
    .poll    = console_poll_fn,  /* POLLOUT always */
};
Summary: Backend Comparison
| Backend | Mount Point | Writable | Persistent | Max Size | Required Capability Kinds |
|---|---|---|---|---|---|
| procfs | /proc/ | No | N/A | ~4 KB/file | CAP_KIND_PROC_READ (cross-pid access) |
| ramfs | /tmp/, /run/ | Yes | No | 4 KB/file | None |
| Pipes | Anonymous | Yes | No | 4056 bytes | None |
| memfd | Anonymous | mmap only | No | 8 MB | None |
| initrd | /, /bin/, /etc/, /dev/ | No | Yes (ROM) | Varies | CAP_KIND_AUTH (/etc/shadow) |
| Console | /dev/console | Write-only | N/A | N/A | None |
| ext2 | Root filesystem | Yes | Yes (NVMe) | ~48 KB writable | CAP_KIND_AUTH (/etc/shadow) + DAC |
All backends register their vfs_ops_t tables statically. There is no dynamic filesystem registration mechanism – adding a new filesystem requires kernel code changes and recompilation.