ext2 Filesystem Implementation
Detailed documentation of the Aegis ext2 filesystem driver: on-disk layout, block cache, inode operations, directory management, symlink resolution, and write path.
ext2 Filesystem Implementation
Aegis includes a read-write ext2 filesystem driver that serves as the primary writable root filesystem. The implementation supports the core ext2 feature set: direct and indirect block addressing, directory entry management, symbolic links (fast and slow), POSIX DAC permission checks, and file creation/deletion. It does not include journaling (ext3/4), extended attributes, or triple-indirect blocks.
v1 note: The ext2 driver is v1 software – functional and tested, but not hardened against crafted filesystem images or adversarial input. Contributions are welcome – file issues or propose changes at exec/aegis.
Source files:
kernel/fs/ext2.c– mount, inode I/O, block addressing, read/write, create/unlink/mkdir/rename, symlinks, chmod/chownkernel/fs/ext2_cache.c– 16-slot LRU block cache, synckernel/fs/ext2_dir.c– directory entry helpers (lookup_parent, add_entry, remove_entry)kernel/fs/ext2_vfs.c– VFS adapter (fd pool,vfs_ops_tcallbacks)kernel/fs/ext2.h– public API and on-disk structure definitionskernel/fs/ext2_internal.h– internal shared state between TUskernel/fs/ext2_vfs.h– per-fd state and VFS ops table
On-Disk Structures
Superblock (ext2_superblock_t)
Located at byte offset 1024 from the partition start (LBA 2 in 512-byte sectors). Read into a static s_sb during mount.
typedef struct __attribute__((packed)) {
uint32_t s_inodes_count;
uint32_t s_blocks_count;
uint32_t s_r_blocks_count;
uint32_t s_free_blocks_count;
uint32_t s_free_inodes_count;
uint32_t s_first_data_block;
uint32_t s_log_block_size; /* block_size = 1024 << s_log_block_size */
uint32_t s_log_frag_size;
uint32_t s_blocks_per_group;
uint32_t s_frags_per_group;
uint32_t s_inodes_per_group;
uint32_t s_mtime;
uint32_t s_wtime;
uint16_t s_mnt_count;
uint16_t s_max_mnt_count;
uint16_t s_magic; /* must be 0xEF53 */
uint16_t s_state;
uint16_t s_errors;
uint16_t s_minor_rev_level;
uint32_t s_lastcheck;
uint32_t s_checkinterval;
uint32_t s_creator_os;
uint32_t s_rev_level;
uint16_t s_def_resuid;
uint16_t s_def_resgid;
/* EXT2_DYNAMIC_REV fields */
uint32_t s_first_ino;
uint16_t s_inode_size; /* 128 for rev 0, variable for rev >= 1 */
uint16_t s_block_group_nr;
uint32_t s_feature_compat;
uint32_t s_feature_incompat;
uint32_t s_feature_ro_compat;
uint8_t s_uuid[16];
char s_volume_name[16];
char s_last_mounted[64];
uint32_t s_algo_bitmap;
} ext2_superblock_t;
Key derived values:
s_block_size = 1024 << s_sb.s_log_block_size(validated:s_log_block_size <= 6)s_num_groups = ceil(s_blocks_count / s_blocks_per_group)(capped at 32)inode_size = s_inode_sizefor rev >= 1, else 128 bytes
Block Group Descriptor (ext2_bgd_t)
Offset Size Field
────── ──── ─────────────────────
0 4 bg_block_bitmap
4 4 bg_inode_bitmap
8 4 bg_inode_table
12 2 bg_free_blocks_count
14 2 bg_free_inodes_count
16 2 bg_used_dirs_count
18 2 bg_pad
20 12 bg_reserved
────── ──── ─────────────────────
Total: 32 bytes
The BGD table is stored at the block immediately after the superblock:
- For 1024-byte blocks: superblock is in block 1, BGD at block 2
- For larger blocks: superblock is in block 0, BGD at block 1
Up to 32 block groups are supported (static array s_bgd[32]).
Inode (ext2_inode_t)
Offset Size Field
────── ──── ─────────────────────
0 2 i_mode (file type + permission bits)
2 2 i_uid
4 4 i_size (file size in bytes, 32-bit)
8 4 i_atime
12 4 i_ctime
16 4 i_mtime
20 4 i_dtime (deletion time; non-zero = deleted)
24 2 i_gid
26 2 i_links_count
28 4 i_blocks (512-byte sector count)
32 4 i_flags
36 4 i_osd1
40 60 i_block[15] (block pointers)
100 4 i_generation
104 4 i_file_acl
108 4 i_dir_acl
112 4 i_faddr
116 12 i_osd2
────── ──── ─────────────────────
Total: 128 bytes
The i_block[15] array encodes block addresses:
┌──────────────────────────────────────────────────┐
│ i_block[0..11] → 12 direct block pointers │
│ i_block[12] → single indirect block │
│ i_block[13] → double indirect block │
│ i_block[14] → triple indirect (NOT supported)│
└──────────────────────────────────────────────────┘
File type bits in i_mode:
| Constant | Value | Description |
|---|---|---|
EXT2_S_IFREG |
0x8000 |
Regular file |
EXT2_S_IFDIR |
0x4000 |
Directory |
EXT2_S_IFLNK |
0xA000 |
Symbolic link |
Directory Entry (ext2_dirent_t)
typedef struct __attribute__((packed)) {
uint32_t inode;
uint16_t rec_len; /* total size of this entry (padded) */
uint8_t name_len; /* actual name length */
uint8_t file_type; /* EXT2_FT_REG_FILE=1, EXT2_FT_DIR=2, EXT2_FT_SYMLINK=7 */
char name[255];
} ext2_dirent_t;
Directory entries are variable-length records packed within directory data blocks. rec_len includes padding to maintain 4-byte alignment. An entry with inode == 0 is a deleted/free slot.
On-Disk Layout
Block 0 Block 1 Block 2+ Block N...
┌──────────┬──────────────┬──────────────┬─────────────────┐
│ Boot │ Superblock │ Block Group │ Block Group 0 │
│ sector │ (1024 bytes │ Descriptor │ ┌─────────────┐ │
│ (unused) │ at byte │ Table │ │Block Bitmap │ │
│ │ offset │ │ │Inode Bitmap │ │
│ │ 1024) │ │ │Inode Table │ │
│ │ │ │ │Data Blocks │ │
│ │ │ │ └─────────────┘ │
└──────────┴──────────────┴──────────────┴─────────────────┘
For a typical 1024-byte block size (the default):
- Block 0: boot block (unused by ext2)
- Block 1: superblock
- Block 2: BGD table
- Blocks 3+: block bitmap, inode bitmap, inode table, data blocks (per group)
Block Cache
Source: kernel/fs/ext2_cache.c
The ext2 driver uses a 16-slot LRU block cache. Each slot holds one full block (up to 4096 bytes) plus metadata:
typedef struct {
uint32_t block_num; /* 0 + age=0 means unused */
uint8_t dirty; /* 1 = modified, needs writeback */
uint32_t age; /* LRU timestamp (higher = more recent) */
uint8_t data[4096]; /* block data */
} cache_slot_t;
cache_slot_t s_cache[16];
uint32_t s_cache_age = 0; /* monotonically increasing counter */
Cache Operations
cache_get_slot(block_num) – The primary access function. Returns a pointer to the cached block data, loading from disk if necessary.
cache_get_slot(block_num)
│
├── cache_find(block_num)
│ ├── Found: update age, return data
│ └── Not found:
│ ├── cache_evict() → pick LRU slot
│ │ ├── Prefer clean slots over dirty
│ │ ├── Within same cleanliness, pick lowest age
│ │ └── Write back dirty slot before evicting
│ ├── Read block from disk via blkdev
│ └── Return data pointer
└── NULL on I/O error
cache_mark_dirty(block_num) – Marks a cached block as dirty. Called after any in-place modification of cached data (inode writes, bitmap updates, directory entry changes).
ext2_sync() – Flushes all dirty cache slots to disk. Called from process exit and explicit sync. Acquires ext2_lock (inner lock in the lock ordering: sched_lock > ext2_lock).
Cache Design Notes
- 16 slots is small but sufficient for the typical Aegis workload (embedded/teaching OS)
- The eviction policy prefers clean slots to minimize writeback I/O
- Block-to-LBA translation:
lba = block_num * (block_size / 512) - No read-ahead or prefetching
- All cache slots are allocated statically (no dynamic memory)
Mount Process
ext2_mount(devname) performs these steps under ext2_lock:
- Initialize all 16 cache slots to age 0 (unused)
- Get block device handle via
blkdev_get(devname) - Read superblock from LBA 2 (byte offset 1024) – 2 sectors of 512 bytes
- Validate magic number (
0xEF53) - Validate
s_log_block_size <= 6and inode size bounds (128..4096 for rev >= 1) - Compute
s_block_size,s_num_groups(capped at 32) - Read BGD table from the block after the superblock
- Set
s_mounted = 1 - Record
/etc/shadowinode number for capability enforcement invfs_open:uint32_t shadow_ino = 0; if (ext2_open_ex("/etc/shadow", &shadow_ino, 1) == 0) s_shadow_ino = shadow_ino;
The shadow inode lookup must occur after s_mounted = 1 because ext2_open_ex needs to walk the directory tree.
Inode Operations
Reading an Inode
ext2_read_inode(ino, *out) locates an inode on disk:
group = (ino - 1) / s_inodes_per_group
index = (ino - 1) % s_inodes_per_group
inode_size = s_inode_size (rev >= 1) or 128
byte_offset = index * inode_size
block = bgd[group].bg_inode_table + byte_offset / block_size
in_block = byte_offset % block_size
The inode table block is loaded via cache_get_slot, and 128 bytes are copied out byte-by-byte. Inode 0 is rejected (-EIO).
Writing an Inode
ext2_write_inode(ino, *inode) performs the same address calculation but copies data into the cache and calls cache_mark_dirty on the containing block.
Block Addressing
ext2_block_num(inode, file_block) resolves a logical file block index to a physical block number:
ptrs_per_block = block_size / 4
file_block < 12:
→ inode->i_block[file_block] (direct)
file_block < 12 + ptrs_per_block:
→ read inode->i_block[12] (single indirect)
→ index into indirect block at (file_block - 12)
file_block < 12 + ptrs_per_block + ptrs_per_block^2:
→ read inode->i_block[13] (double indirect)
→ index into outer block, then inner block
Triple indirect (i_block[14]) is NOT supported.
For 4096-byte blocks, ptrs_per_block = 1024:
- Direct: blocks 0..11 (up to 48 KB)
- Single indirect: blocks 12..1035 (up to ~4 MB)
- Double indirect: blocks 1036..1049611 (up to ~4 GB)
A return value of 0 indicates a sparse block (the file has a hole at that position).
Path Walk and Symlink Resolution
ext2_open_ex(path, *inode_out, follow_final)
The primary path resolution function. Walks the directory tree component by component, handling symlinks:
ext2_open_ex("/usr/bin/foo", &ino, 1)
│
├── Start at inode 2 (EXT2_ROOT_INODE)
├── For each path component:
│ ├── Read current directory inode
│ ├── Verify it is a directory (EXT2_S_IFDIR)
│ ├── Scan directory entries for component name
│ ├── If child is a symlink:
│ │ ├── If final component and !follow_final → return symlink inode
│ │ ├── Increment depth (max SYMLINK_MAX_DEPTH = 8)
│ │ ├── Read symlink target
│ │ ├── Construct new path: target + "/" + remaining
│ │ └── goto restart_walk
│ └── Advance to child inode
└── Return resolved inode number
Special handling for ..: clamps to root inode (inode 2) rather than walking up. This prevents escaping the root.
Symlink Types
- Fast symlink (target <= 60 bytes): Target stored directly in
i_block[]– no data block needed. Threshold is 60 becausei_block[0..14]is 60 bytes. - Slow symlink (target > 60 bytes): Target stored in first data block (
i_block[0]).
ext2_open(path, *inode_out)
Thin wrapper: ext2_open_ex(path, inode_out, 1) – always follows symlinks on the final component.
Read Path
ext2_read(inode_num, buf, offset, len):
- Acquire
ext2_lock - Read inode, validate size (256 MB cap against malicious images)
- Clamp read range to file size
- Loop over file blocks:
- Compute file block index and offset within block
- Resolve physical block via
ext2_block_num - If sparse (block 0): fill zeros
- Otherwise: load block via cache, copy bytes
- Release lock, return bytes read
The 256 MB size cap is a security measure against crafted ext2 images with corrupted i_size values that could cause excessive memory reads.
Write Path
ext2_write(inode_num, buf, offset, len):
- Acquire
ext2_lock - Read inode
- Loop over file blocks:
- Compute file block index
- Resolve or allocate physical block:
- Existing block: use directly
- Sparse/new block (direct only,
file_block < 12):ext2_alloc_block(0), zero the new block, assign toi_block[file_block] - Indirect allocation is not supported for writes (breaks at block 12)
- Copy data into cached block, mark dirty
- Update
i_sizeandi_blocksif file grew - Release lock, return bytes written
VFS Write Adapter
ext2_vfs_write_fn in ext2_vfs.c bridges the VFS write interface with ext2:
/* User buf → kernel bounce buffer → ext2_write */
uint8_t s_kbuf[4096]; /* stack-allocated bounce buffer */
while (done < len) {
chunk = min(len - done, 4096);
chunk = min(chunk, page_boundary_distance); /* SMAP guard */
copy_from_user(s_kbuf, buf + done, chunk);
n = ext2_write(ino, s_kbuf, write_offset, chunk);
write_offset += n;
done += n;
}
The 4096-byte stack buffer limits kernel stack usage. The page-boundary clamping prevents copy_from_user from crossing an unmapped page.
Block and Inode Allocation
ext2_alloc_block(preferred_group)
Scans block bitmaps starting from preferred_group (locality hint), wrapping around all groups:
- Skip groups with
bg_free_blocks_count == 0 - Load block bitmap via cache
- Scan bits for first free block
- Set bit, mark bitmap dirty, decrement free counts
- Return:
group * blocks_per_group + bit_index + first_data_block
ext2_alloc_inode(preferred_group)
Same algorithm but for inode bitmaps. Returns 1-based inode number:
group * inodes_per_group + bit_index + 1
Directory Operations
ext2_readdir(dir_inode, index, name_out, type_out)
Index-based directory iteration. Walks through all directory entry blocks, counting non-deleted entries (inode != 0) until reaching the requested index. Returns:
DT_DIR(4) for directoriesDT_LNK(10) for symlinksDT_REG(8) for everything else
ext2_dir_add_entry(dir_ino, child_ino, name, file_type)
Adds a new directory entry. Strategy:
- Scan existing blocks for:
- A deleted entry (
inode == 0) with enoughrec_len - An entry with enough trailing free space to split
- A deleted entry (
- If no space found, allocate a new directory block (direct blocks only)
- Write the new entry, mark block dirty, update directory inode
Entry splitting: if an existing entry’s rec_len exceeds its actual size by at least needed bytes (8 + name_len, rounded to 4-byte boundary), the entry is shrunk and the new entry is placed in the freed space.
ext2_dir_remove_entry(dir_ino, name)
Removes a directory entry by name. If the entry has a predecessor in the same block, merges rec_len into the predecessor. If it is the first entry in the block, sets inode = 0 (marks as deleted without merging).
File Operations
ext2_create(path, mode)
Creates a new regular file:
ext2_lookup_parentto find parent directory inode and basenameext2_alloc_inode(0)to allocate a new inode- Initialize inode:
i_mode = EXT2_S_IFREG | (mode & 0x1FF),i_links_count = 1 ext2_write_inodeto persist inode to diskext2_dir_add_entryto add the directory entry (typeEXT2_FT_REG_FILE)
ext2_unlink(path)
Removes a file:
- Look up parent directory and resolve file inode
- Decrement
i_links_count - If links reach 0:
- Free all direct data blocks (clear bitmap bits, increment free counts)
- Free the inode (clear bitmap bit, increment free count)
- Set
i_dtime = 1to mark as deleted
- Remove directory entry from parent
Note: only direct blocks (0..11) are freed. Indirect block chains are not deallocated.
ext2_mkdir(path, mode)
Creates a new directory:
- Look up parent, allocate inode and data block
- Initialize inode:
EXT2_S_IFDIR | mode,i_links_count = 2,i_size = block_size - Write
.and..entries into the new block:.entry:rec_len = 12, points to self..entry:rec_len = block_size - 12, points to parent
- Increment parent’s
i_links_countfor the..back-reference - Update block group directory count (
bg_used_dirs_count++) - Add entry in parent directory (type
EXT2_FT_DIR)
ext2_rename(old_path, new_path)
Atomic rename within the filesystem:
- Resolve the source inode
- Look up both old and new parent directories
- Determine file type (directory or regular file)
- Remove entry from old parent
- Add entry in new parent
Cross-directory renames are supported. The implementation does not update .. entries for directory renames (a known limitation).
ext2_symlink(linkpath, target)
Creates a symbolic link:
- Look up parent directory
- Allocate new inode:
i_mode = EXT2_S_IFLNK | 0777 - If target length <= 60: store in
i_block[](fast symlink) - If target length > 60: allocate data block, write target (slow symlink)
- Add directory entry (type
EXT2_FT_SYMLINK)
Permission Checks
POSIX DAC: ext2_check_perm
int ext2_check_perm(uint32_t ino, uint16_t proc_uid, uint16_t proc_gid, int want)
Standard POSIX owner/group/other permission matching:
if (inode.i_uid == proc_uid)
perm = (mode >> 6) & 7 /* owner bits */
else if (inode.i_gid == proc_gid)
perm = (mode >> 3) & 7 /* group bits */
else
perm = mode & 7 /* other bits */
return (perm & want == want) ? 0 : -EACCES
There is no root bypass. uid 0 is treated identically to any other uid. This is a deliberate design choice: privilege escalation requires capabilities, not uid. See the capability model.
Capability Gate: /etc/shadow
The VFS layer enforces the CAP_KIND_AUTH capability kind with CAP_RIGHTS_READ rights bitfield for /etc/shadow access. The check validates the process’s capability table against the required capability kind using the inode number recorded at mount time. This check occurs after symlink resolution, preventing symlink-based bypasses.
VFS Integration
ext2_fd_priv_t – Per-FD State
typedef struct {
uint32_t ino; /* ext2 inode number */
uint32_t write_offset; /* current sequential write position */
uint32_t in_use; /* 1 if slot is occupied */
uint32_t ref_count; /* number of open fds sharing this slot */
} ext2_fd_priv_t;
Allocated from a static pool of 32 slots (s_ext2_pool). Reference counted: dup/fork increments, close decrements, slot is freed when ref_count reaches 0.
VFS Operations Table
const vfs_ops_t s_ext2_ops = {
.read = ext2_vfs_read_fn, /* → ext2_read(ino, buf, off, len) */
.write = ext2_vfs_write_fn, /* → bounce buffer → ext2_write */
.close = ext2_vfs_close_fn, /* → ext2_pool_free */
.readdir = ext2_vfs_readdir_fn, /* → ext2_readdir */
.dup = ext2_vfs_dup_fn, /* → ref_count++ */
.stat = ext2_vfs_stat_fn, /* → ext2_read_inode → k_stat_t */
.poll = NULL, /* ext2 files are always ready */
};
Concurrency
All ext2 operations acquire ext2_lock (a spinlock with IRQ save/restore). The lock ordering is sched_lock > ext2_lock, allowing ext2_sync to be called from sched_exit.
The block cache is protected by ext2_lock – there is no separate cache lock. This simplifies the implementation at the cost of coarser granularity (all ext2 operations are serialized).
Limitations
| Limitation | Impact |
|---|---|
| 32 block groups max | ~4 GB filesystem (with 4K blocks, 128 MB/group) |
32-bit i_size |
4 GB max file size |
| No triple indirect | Max ~4 GB with double indirect (4K blocks) |
| Write only to direct blocks | Max 48 KB writable per file (12 * 4K) |
| Indirect not freed on unlink | Leaked blocks for large deleted files |
No .. update on directory rename |
Stale parent reference after cross-directory rename |
| 16-slot block cache | High miss rate for large directory trees |
| 32-slot fd pool | Max 32 simultaneously open ext2 files |
| No extended attributes | No ACLs, no security labels |
| No timestamps | i_atime/i_mtime/i_ctime are not updated |