ext2 Filesystem Implementation

Aegis includes a read-write ext2 filesystem driver that serves as the primary writable root filesystem. The implementation supports the core ext2 feature set: direct and indirect block addressing, directory entry management, symbolic links (fast and slow), POSIX DAC permission checks, and file creation/deletion. It does not include journaling (ext3/4), extended attributes, or triple-indirect blocks.

v1 note: The ext2 driver is v1 software – functional and tested, but not hardened against crafted filesystem images or adversarial input. Contributions are welcome – file issues or propose changes at exec/aegis.

Source files:

  • kernel/fs/ext2.c – mount, inode I/O, block addressing, read/write, create/unlink/mkdir/rename, symlinks, chmod/chown
  • kernel/fs/ext2_cache.c – 16-slot LRU block cache, sync
  • kernel/fs/ext2_dir.c – directory entry helpers (lookup_parent, add_entry, remove_entry)
  • kernel/fs/ext2_vfs.c – VFS adapter (fd pool, vfs_ops_t callbacks)
  • kernel/fs/ext2.h – public API and on-disk structure definitions
  • kernel/fs/ext2_internal.h – internal shared state between TUs
  • kernel/fs/ext2_vfs.h – per-fd state and VFS ops table

On-Disk Structures

Superblock (ext2_superblock_t)

Located at byte offset 1024 from the partition start (LBA 2 in 512-byte sectors). Read into a static s_sb during mount.

typedef struct __attribute__((packed)) {
    uint32_t s_inodes_count;
    uint32_t s_blocks_count;
    uint32_t s_r_blocks_count;
    uint32_t s_free_blocks_count;
    uint32_t s_free_inodes_count;
    uint32_t s_first_data_block;
    uint32_t s_log_block_size;      /* block_size = 1024 << s_log_block_size */
    uint32_t s_log_frag_size;
    uint32_t s_blocks_per_group;
    uint32_t s_frags_per_group;
    uint32_t s_inodes_per_group;
    uint32_t s_mtime;
    uint32_t s_wtime;
    uint16_t s_mnt_count;
    uint16_t s_max_mnt_count;
    uint16_t s_magic;               /* must be 0xEF53 */
    uint16_t s_state;
    uint16_t s_errors;
    uint16_t s_minor_rev_level;
    uint32_t s_lastcheck;
    uint32_t s_checkinterval;
    uint32_t s_creator_os;
    uint32_t s_rev_level;
    uint16_t s_def_resuid;
    uint16_t s_def_resgid;
    /* EXT2_DYNAMIC_REV fields */
    uint32_t s_first_ino;
    uint16_t s_inode_size;          /* 128 for rev 0, variable for rev >= 1 */
    uint16_t s_block_group_nr;
    uint32_t s_feature_compat;
    uint32_t s_feature_incompat;
    uint32_t s_feature_ro_compat;
    uint8_t  s_uuid[16];
    char     s_volume_name[16];
    char     s_last_mounted[64];
    uint32_t s_algo_bitmap;
} ext2_superblock_t;

Key derived values:

  • s_block_size = 1024 << s_sb.s_log_block_size (validated: s_log_block_size <= 6)
  • s_num_groups = ceil(s_blocks_count / s_blocks_per_group) (capped at 32)
  • inode_size = s_inode_size for rev >= 1, else 128 bytes

Block Group Descriptor (ext2_bgd_t)

Offset  Size  Field
──────  ────  ─────────────────────
  0       4   bg_block_bitmap
  4       4   bg_inode_bitmap
  8       4   bg_inode_table
 12       2   bg_free_blocks_count
 14       2   bg_free_inodes_count
 16       2   bg_used_dirs_count
 18       2   bg_pad
 20      12   bg_reserved
──────  ────  ─────────────────────
Total: 32 bytes

The BGD table is stored at the block immediately after the superblock:

  • For 1024-byte blocks: superblock is in block 1, BGD at block 2
  • For larger blocks: superblock is in block 0, BGD at block 1

Up to 32 block groups are supported (static array s_bgd[32]).

Inode (ext2_inode_t)

Offset  Size  Field
──────  ────  ─────────────────────
  0       2   i_mode          (file type + permission bits)
  2       2   i_uid
  4       4   i_size          (file size in bytes, 32-bit)
  8       4   i_atime
 12       4   i_ctime
 16       4   i_mtime
 20       4   i_dtime         (deletion time; non-zero = deleted)
 24       2   i_gid
 26       2   i_links_count
 28       4   i_blocks        (512-byte sector count)
 32       4   i_flags
 36       4   i_osd1
 40      60   i_block[15]     (block pointers)
100       4   i_generation
104       4   i_file_acl
108       4   i_dir_acl
112       4   i_faddr
116      12   i_osd2
──────  ────  ─────────────────────
Total: 128 bytes

The i_block[15] array encodes block addresses:

┌──────────────────────────────────────────────────┐
│ i_block[0..11]  → 12 direct block pointers       │
│ i_block[12]     → single indirect block          │
│ i_block[13]     → double indirect block          │
│ i_block[14]     → triple indirect (NOT supported)│
└──────────────────────────────────────────────────┘

File type bits in i_mode:

Constant Value Description
EXT2_S_IFREG 0x8000 Regular file
EXT2_S_IFDIR 0x4000 Directory
EXT2_S_IFLNK 0xA000 Symbolic link

Directory Entry (ext2_dirent_t)

typedef struct __attribute__((packed)) {
    uint32_t inode;
    uint16_t rec_len;    /* total size of this entry (padded) */
    uint8_t  name_len;   /* actual name length */
    uint8_t  file_type;  /* EXT2_FT_REG_FILE=1, EXT2_FT_DIR=2, EXT2_FT_SYMLINK=7 */
    char     name[255];
} ext2_dirent_t;

Directory entries are variable-length records packed within directory data blocks. rec_len includes padding to maintain 4-byte alignment. An entry with inode == 0 is a deleted/free slot.


On-Disk Layout

Block 0      Block 1         Block 2+        Block N...
┌──────────┬──────────────┬──────────────┬─────────────────┐
│ Boot     │ Superblock   │ Block Group  │ Block Group 0   │
│ sector   │ (1024 bytes  │ Descriptor   │ ┌─────────────┐ │
│ (unused) │  at byte     │ Table        │ │Block Bitmap │ │
│          │  offset      │              │ │Inode Bitmap │ │
│          │  1024)       │              │ │Inode Table  │ │
│          │              │              │ │Data Blocks  │ │
│          │              │              │ └─────────────┘ │
└──────────┴──────────────┴──────────────┴─────────────────┘

For a typical 1024-byte block size (the default):

  • Block 0: boot block (unused by ext2)
  • Block 1: superblock
  • Block 2: BGD table
  • Blocks 3+: block bitmap, inode bitmap, inode table, data blocks (per group)

Block Cache

Source: kernel/fs/ext2_cache.c

The ext2 driver uses a 16-slot LRU block cache. Each slot holds one full block (up to 4096 bytes) plus metadata:

typedef struct {
    uint32_t block_num;   /* 0 + age=0 means unused */
    uint8_t  dirty;       /* 1 = modified, needs writeback */
    uint32_t age;         /* LRU timestamp (higher = more recent) */
    uint8_t  data[4096];  /* block data */
} cache_slot_t;

cache_slot_t s_cache[16];
uint32_t s_cache_age = 0;  /* monotonically increasing counter */

Cache Operations

cache_get_slot(block_num) – The primary access function. Returns a pointer to the cached block data, loading from disk if necessary.

cache_get_slot(block_num)
    │
    ├── cache_find(block_num)
    │   ├── Found: update age, return data
    │   └── Not found:
    │       ├── cache_evict() → pick LRU slot
    │       │   ├── Prefer clean slots over dirty
    │       │   ├── Within same cleanliness, pick lowest age
    │       │   └── Write back dirty slot before evicting
    │       ├── Read block from disk via blkdev
    │       └── Return data pointer
    └── NULL on I/O error

cache_mark_dirty(block_num) – Marks a cached block as dirty. Called after any in-place modification of cached data (inode writes, bitmap updates, directory entry changes).

ext2_sync() – Flushes all dirty cache slots to disk. Called from process exit and explicit sync. Acquires ext2_lock (inner lock in the lock ordering: sched_lock > ext2_lock).

Cache Design Notes

  • 16 slots is small but sufficient for the typical Aegis workload (embedded/teaching OS)
  • The eviction policy prefers clean slots to minimize writeback I/O
  • Block-to-LBA translation: lba = block_num * (block_size / 512)
  • No read-ahead or prefetching
  • All cache slots are allocated statically (no dynamic memory)

Mount Process

ext2_mount(devname) performs these steps under ext2_lock:

  1. Initialize all 16 cache slots to age 0 (unused)
  2. Get block device handle via blkdev_get(devname)
  3. Read superblock from LBA 2 (byte offset 1024) – 2 sectors of 512 bytes
  4. Validate magic number (0xEF53)
  5. Validate s_log_block_size <= 6 and inode size bounds (128..4096 for rev >= 1)
  6. Compute s_block_size, s_num_groups (capped at 32)
  7. Read BGD table from the block after the superblock
  8. Set s_mounted = 1
  9. Record /etc/shadow inode number for capability enforcement in vfs_open:
    uint32_t shadow_ino = 0;
    if (ext2_open_ex("/etc/shadow", &shadow_ino, 1) == 0)
        s_shadow_ino = shadow_ino;
    

The shadow inode lookup must occur after s_mounted = 1 because ext2_open_ex needs to walk the directory tree.


Inode Operations

Reading an Inode

ext2_read_inode(ino, *out) locates an inode on disk:

group       = (ino - 1) / s_inodes_per_group
index       = (ino - 1) % s_inodes_per_group
inode_size  = s_inode_size (rev >= 1) or 128
byte_offset = index * inode_size
block       = bgd[group].bg_inode_table + byte_offset / block_size
in_block    = byte_offset % block_size

The inode table block is loaded via cache_get_slot, and 128 bytes are copied out byte-by-byte. Inode 0 is rejected (-EIO).

Writing an Inode

ext2_write_inode(ino, *inode) performs the same address calculation but copies data into the cache and calls cache_mark_dirty on the containing block.


Block Addressing

ext2_block_num(inode, file_block) resolves a logical file block index to a physical block number:

ptrs_per_block = block_size / 4

file_block < 12:
    → inode->i_block[file_block]                    (direct)

file_block < 12 + ptrs_per_block:
    → read inode->i_block[12]                       (single indirect)
    → index into indirect block at (file_block - 12)

file_block < 12 + ptrs_per_block + ptrs_per_block^2:
    → read inode->i_block[13]                       (double indirect)
    → index into outer block, then inner block

Triple indirect (i_block[14]) is NOT supported.

For 4096-byte blocks, ptrs_per_block = 1024:

  • Direct: blocks 0..11 (up to 48 KB)
  • Single indirect: blocks 12..1035 (up to ~4 MB)
  • Double indirect: blocks 1036..1049611 (up to ~4 GB)

A return value of 0 indicates a sparse block (the file has a hole at that position).


ext2_open_ex(path, *inode_out, follow_final)

The primary path resolution function. Walks the directory tree component by component, handling symlinks:

ext2_open_ex("/usr/bin/foo", &ino, 1)
    │
    ├── Start at inode 2 (EXT2_ROOT_INODE)
    ├── For each path component:
    │   ├── Read current directory inode
    │   ├── Verify it is a directory (EXT2_S_IFDIR)
    │   ├── Scan directory entries for component name
    │   ├── If child is a symlink:
    │   │   ├── If final component and !follow_final → return symlink inode
    │   │   ├── Increment depth (max SYMLINK_MAX_DEPTH = 8)
    │   │   ├── Read symlink target
    │   │   ├── Construct new path: target + "/" + remaining
    │   │   └── goto restart_walk
    │   └── Advance to child inode
    └── Return resolved inode number

Special handling for ..: clamps to root inode (inode 2) rather than walking up. This prevents escaping the root.

  • Fast symlink (target <= 60 bytes): Target stored directly in i_block[] – no data block needed. Threshold is 60 because i_block[0..14] is 60 bytes.
  • Slow symlink (target > 60 bytes): Target stored in first data block (i_block[0]).

ext2_open(path, *inode_out)

Thin wrapper: ext2_open_ex(path, inode_out, 1) – always follows symlinks on the final component.


Read Path

ext2_read(inode_num, buf, offset, len):

  1. Acquire ext2_lock
  2. Read inode, validate size (256 MB cap against malicious images)
  3. Clamp read range to file size
  4. Loop over file blocks:
    • Compute file block index and offset within block
    • Resolve physical block via ext2_block_num
    • If sparse (block 0): fill zeros
    • Otherwise: load block via cache, copy bytes
  5. Release lock, return bytes read

The 256 MB size cap is a security measure against crafted ext2 images with corrupted i_size values that could cause excessive memory reads.


Write Path

ext2_write(inode_num, buf, offset, len):

  1. Acquire ext2_lock
  2. Read inode
  3. Loop over file blocks:
    • Compute file block index
    • Resolve or allocate physical block:
      • Existing block: use directly
      • Sparse/new block (direct only, file_block < 12): ext2_alloc_block(0), zero the new block, assign to i_block[file_block]
      • Indirect allocation is not supported for writes (breaks at block 12)
    • Copy data into cached block, mark dirty
  4. Update i_size and i_blocks if file grew
  5. Release lock, return bytes written

VFS Write Adapter

ext2_vfs_write_fn in ext2_vfs.c bridges the VFS write interface with ext2:

/* User buf → kernel bounce buffer → ext2_write */
uint8_t s_kbuf[4096];  /* stack-allocated bounce buffer */
while (done < len) {
    chunk = min(len - done, 4096);
    chunk = min(chunk, page_boundary_distance);  /* SMAP guard */
    copy_from_user(s_kbuf, buf + done, chunk);
    n = ext2_write(ino, s_kbuf, write_offset, chunk);
    write_offset += n;
    done += n;
}

The 4096-byte stack buffer limits kernel stack usage. The page-boundary clamping prevents copy_from_user from crossing an unmapped page.


Block and Inode Allocation

ext2_alloc_block(preferred_group)

Scans block bitmaps starting from preferred_group (locality hint), wrapping around all groups:

  1. Skip groups with bg_free_blocks_count == 0
  2. Load block bitmap via cache
  3. Scan bits for first free block
  4. Set bit, mark bitmap dirty, decrement free counts
  5. Return: group * blocks_per_group + bit_index + first_data_block

ext2_alloc_inode(preferred_group)

Same algorithm but for inode bitmaps. Returns 1-based inode number: group * inodes_per_group + bit_index + 1


Directory Operations

ext2_readdir(dir_inode, index, name_out, type_out)

Index-based directory iteration. Walks through all directory entry blocks, counting non-deleted entries (inode != 0) until reaching the requested index. Returns:

  • DT_DIR (4) for directories
  • DT_LNK (10) for symlinks
  • DT_REG (8) for everything else

ext2_dir_add_entry(dir_ino, child_ino, name, file_type)

Adds a new directory entry. Strategy:

  1. Scan existing blocks for:
    • A deleted entry (inode == 0) with enough rec_len
    • An entry with enough trailing free space to split
  2. If no space found, allocate a new directory block (direct blocks only)
  3. Write the new entry, mark block dirty, update directory inode

Entry splitting: if an existing entry’s rec_len exceeds its actual size by at least needed bytes (8 + name_len, rounded to 4-byte boundary), the entry is shrunk and the new entry is placed in the freed space.

ext2_dir_remove_entry(dir_ino, name)

Removes a directory entry by name. If the entry has a predecessor in the same block, merges rec_len into the predecessor. If it is the first entry in the block, sets inode = 0 (marks as deleted without merging).


File Operations

ext2_create(path, mode)

Creates a new regular file:

  1. ext2_lookup_parent to find parent directory inode and basename
  2. ext2_alloc_inode(0) to allocate a new inode
  3. Initialize inode: i_mode = EXT2_S_IFREG | (mode & 0x1FF), i_links_count = 1
  4. ext2_write_inode to persist inode to disk
  5. ext2_dir_add_entry to add the directory entry (type EXT2_FT_REG_FILE)

ext2_unlink(path)

Removes a file:

  1. Look up parent directory and resolve file inode
  2. Decrement i_links_count
  3. If links reach 0:
    • Free all direct data blocks (clear bitmap bits, increment free counts)
    • Free the inode (clear bitmap bit, increment free count)
    • Set i_dtime = 1 to mark as deleted
  4. Remove directory entry from parent

Note: only direct blocks (0..11) are freed. Indirect block chains are not deallocated.

ext2_mkdir(path, mode)

Creates a new directory:

  1. Look up parent, allocate inode and data block
  2. Initialize inode: EXT2_S_IFDIR | mode, i_links_count = 2, i_size = block_size
  3. Write . and .. entries into the new block:
    • . entry: rec_len = 12, points to self
    • .. entry: rec_len = block_size - 12, points to parent
  4. Increment parent’s i_links_count for the .. back-reference
  5. Update block group directory count (bg_used_dirs_count++)
  6. Add entry in parent directory (type EXT2_FT_DIR)

ext2_rename(old_path, new_path)

Atomic rename within the filesystem:

  1. Resolve the source inode
  2. Look up both old and new parent directories
  3. Determine file type (directory or regular file)
  4. Remove entry from old parent
  5. Add entry in new parent

Cross-directory renames are supported. The implementation does not update .. entries for directory renames (a known limitation).

ext2_symlink(linkpath, target)

Creates a symbolic link:

  1. Look up parent directory
  2. Allocate new inode: i_mode = EXT2_S_IFLNK | 0777
  3. If target length <= 60: store in i_block[] (fast symlink)
  4. If target length > 60: allocate data block, write target (slow symlink)
  5. Add directory entry (type EXT2_FT_SYMLINK)

Permission Checks

POSIX DAC: ext2_check_perm

int ext2_check_perm(uint32_t ino, uint16_t proc_uid, uint16_t proc_gid, int want)

Standard POSIX owner/group/other permission matching:

if (inode.i_uid == proc_uid)
    perm = (mode >> 6) & 7     /* owner bits */
else if (inode.i_gid == proc_gid)
    perm = (mode >> 3) & 7     /* group bits */
else
    perm = mode & 7            /* other bits */

return (perm & want == want) ? 0 : -EACCES

There is no root bypass. uid 0 is treated identically to any other uid. This is a deliberate design choice: privilege escalation requires capabilities, not uid. See the capability model.

Capability Gate: /etc/shadow

The VFS layer enforces the CAP_KIND_AUTH capability kind with CAP_RIGHTS_READ rights bitfield for /etc/shadow access. The check validates the process’s capability table against the required capability kind using the inode number recorded at mount time. This check occurs after symlink resolution, preventing symlink-based bypasses.


VFS Integration

ext2_fd_priv_t – Per-FD State

typedef struct {
    uint32_t ino;           /* ext2 inode number */
    uint32_t write_offset;  /* current sequential write position */
    uint32_t in_use;        /* 1 if slot is occupied */
    uint32_t ref_count;     /* number of open fds sharing this slot */
} ext2_fd_priv_t;

Allocated from a static pool of 32 slots (s_ext2_pool). Reference counted: dup/fork increments, close decrements, slot is freed when ref_count reaches 0.

VFS Operations Table

const vfs_ops_t s_ext2_ops = {
    .read    = ext2_vfs_read_fn,    /* → ext2_read(ino, buf, off, len) */
    .write   = ext2_vfs_write_fn,   /* → bounce buffer → ext2_write */
    .close   = ext2_vfs_close_fn,   /* → ext2_pool_free */
    .readdir = ext2_vfs_readdir_fn, /* → ext2_readdir */
    .dup     = ext2_vfs_dup_fn,     /* → ref_count++ */
    .stat    = ext2_vfs_stat_fn,    /* → ext2_read_inode → k_stat_t */
    .poll    = NULL,                /* ext2 files are always ready */
};

Concurrency

All ext2 operations acquire ext2_lock (a spinlock with IRQ save/restore). The lock ordering is sched_lock > ext2_lock, allowing ext2_sync to be called from sched_exit.

The block cache is protected by ext2_lock – there is no separate cache lock. This simplifies the implementation at the cost of coarser granularity (all ext2 operations are serialized).


Limitations

Limitation Impact
32 block groups max ~4 GB filesystem (with 4K blocks, 128 MB/group)
32-bit i_size 4 GB max file size
No triple indirect Max ~4 GB with double indirect (4K blocks)
Write only to direct blocks Max 48 KB writable per file (12 * 4K)
Indirect not freed on unlink Leaked blocks for large deleted files
No .. update on directory rename Stale parent reference after cross-directory rename
16-slot block cache High miss rate for large directory trees
32-slot fd pool Max 32 simultaneously open ext2 files
No extended attributes No ACLs, no security labels
No timestamps i_atime/i_mtime/i_ctime are not updated