Aegis 1.0.3 — Installed GRUB, Less Hostile; Installed Keyboard, Still Hostile

A small point release. Most of it is cleanup I should have caught before 1.0.1 ever shipped — the GRUB menu on the installed system has never rendered correctly, the installer will happily overwrite your existing install without so much as a shrug, and there’s one honest-to-god kernel regression I can’t reproduce and therefore haven’t fixed yet. Plus one actual new feature (see below).

Boot the live USB: you get a nice graphical GRUB menu with the Aegis wallpaper, box-drawing characters around the entries, a JetBrains Mono font. Boot the installed system: you get a black screen, @ in every slot where a box-drawing char should be, and no wallpaper. It’s been like that since the installer shipped in Phase 47. Nobody ever flagged it because nobody ever booted Aegis from anything but a USB stick.

Three distinct bugs, all stacked on top of each other in tools/grub-installed.cfg:

1. if … then true … fi silently halts. The live ISO’s GRUB is produced by grub-mkrescue and has every builtin including true. The installed system’s BOOTX64.EFI is produced by grub-mkimage with a hand-picked module list, and that list does not include a true builtin. The grub.cfg had if loadfont X; then true; else loadfont Y; fi in it — on EFI GRUB the true throws error: can't find command 'true', which halts the script mid-config. No font loaded, terminal_output gfxterm comes up glyph-less, and gfxterm fills every glyph slot with the default placeholder — @. Fixed by just calling loadfont twice unconditionally. || and && also don’t work in this GRUB build, which was a fun half hour.

2. search --set=root reassigns $root to the ext2 partition, and loadfont/background_image are relative to $root. Once search runs, $root = (hd0,gpt2) and the fonts and wallpaper (which live on the ESP at /EFI/BOOT/…) are no longer reachable. The previous config looked for /boot/grub/font.pf2 on the ext2 partition, which the rootfs build never copied there in the first place. Fixed by hardcoding the absolute ESP paths (/EFI/BOOT/unicode.pf2, /EFI/BOOT/font.pf2, /EFI/BOOT/wallpaper.png) and moving the loadfont / background_image calls BEFORE the search, so they run while $root still points at the ESP.

3. set gfxmode=1280x800x32,1024x768x32,800x600x32 doesn’t match any OVMF GOP mode. Most real UEFI firmware offers 1920x1080x32 or similar and picks the closest. QEMU + std-vga + OVMF offers only 640x480x24 and 800x600x24, which the kernel’s multiboot2 FB parser rejects (framebuffer_bpp == 32 only). Fix: added auto as the last fallback on the gfxmode list, so if none of the preferred modes are available GRUB falls back to “whatever the firmware exposes at native resolution.” On bare-metal UEFI this is almost always 32-bpp. On QEMU-with-std-vga it still isn’t, which took an hour of my life to realize — see Test harness below.

Net effect: the installed-system GRUB menu now looks identical to the live-USB one. Wallpaper, JetBrains Mono, proper box-drawing.

The installer will now warn you before eating your existing system

Running the installer on a disk that already has Aegis on it used to be completely silent about that fact until the “Copying root filesystem” step failed with block write failed or partition rescan failed (which one you got seems to depend on firmware and disk size). Two people told me about this after 1.0.2 shipped — one of them lost unrelated data because they picked the wrong disk in the list.

Two changes:

Disk selector highlights existing installs. On the “Select target disk” screen, any disk that contains an Aegis-typed GPT partition now renders in orange with [existing Aegis install] after its name. Detection is a new install_disk_has_aegis(devname) in user/lib/libinstall/copy.c that walks the kernel’s block-device list looking for <devname>pN children — the kernel’s gpt_scan only registers partitions carrying the Aegis GUID prefix, so any partition-child is definitionally an existing install. No disk reads required.

Confirm screen changes colour and wording. If the selected disk has an existing install, the confirm screen’s warning line switches from the generic orange “WARNING: all existing data on the target disk will be erased.” to a red two-line banner: “WARNING: this disk already contains an Aegis install. Pressing Install will ERASE that system permanently.” Same data; much harder to miss.
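The detection behind the disk-selector highlight can be sketched roughly as follows. This is a minimal illustration, not the code in user/lib/libinstall/copy.c: the real install_disk_has_aegis walks the kernel’s block-device list, whereas the hypothetical disk_has_partition_child below takes a plain array of device names, and is_partition_child is a helper invented for the example.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Returns 1 if `name` looks like `<disk>pN`, where N is one or more digits. */
static int is_partition_child(const char *disk, const char *name)
{
    size_t dlen = strlen(disk);

    if (strncmp(name, disk, dlen) != 0 || name[dlen] != 'p')
        return 0;

    const char *digits = name + dlen + 1;
    if (*digits == '\0')
        return 0;               /* "nvme0p" with no number is not a partition */
    for (; *digits; digits++)
        if (!isdigit((unsigned char)*digits))
            return 0;
    return 1;
}

/* Hypothetical stand-in for install_disk_has_aegis(): the device list is
 * assumed to be an array of names rather than a kernel structure. */
int disk_has_partition_child(const char *disk,
                             const char *const *names, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (is_partition_child(disk, names[i]))
            return 1;
    return 0;
}
```

Because only Aegis-typed partitions are registered in the first place, a name match is sufficient and no sector ever has to be read from the candidate disk.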

I briefly considered a modal confirmation dialog but the install flow doesn’t have modal infrastructure yet, and a red banner right above the Install button is probably enough. If this turns out to be insufficient I’ll promote to a typed-confirmation (“type ERASE to continue”) in a later release.

Keyboard on installed systems: open question

This one is a real regression, not a cosmetic bug, and I have not fixed it.

On some (all? we don’t yet know) real hardware installs, the keyboard does not work after boot — not in Bastion’s graphical login, not in the text-mode login prompt either. Framebuffer works, graphics work, mouse works (where the hardware has one), but keystrokes never reach userspace. Live-USB boots on the same hardware with the same kernel binary are fine.

Working hypothesis: the kernel’s keyboard subsystem relies on legacy PS/2 IRQ1 being delivered to the 8259A/IOAPIC. On laptops the “PS/2 keyboard” is actually a USB HID device that the BIOS exposes through legacy-USB SMM traps, and only while booting via legacy BIOS. When the installed system boots via UEFI/OVMF with no legacy support, the PS/2 port genuinely stops working and we fall through to the real xHCI USB HID path. That path is implemented in kernel/drivers/xhci.c + kernel/drivers/usb_hid.c and does work for the mouse on the same hardware, but something about the timing, enumeration, or HID report delivery is different enough that the keyboard ring buffer never gets populated.

What I’ve tried:

Added per-scancode printk in the PS/2 IRQ handler and per-report printk in usb_hid_process_report.
Can’t read them — the affected laptop exposes no standard UART, and writing a USB-C-serial driver just for debugging is a weekend I’m not willing to spend on a regression I hope to root-cause by staring at the right code.
QEMU cannot reproduce the bug at all; every emulated configuration I’ve tried either keeps PS/2 working through the install/reboot cycle or has a different bug (24-bpp GOP, different timing) unrelated to the keyboard path.

What I still need to try, and probably will for 1.0.4:

A proper i8042 re-init in kbd_init rather than the current “assume firmware left it in a good state” that dates to Phase 9.
More aggressive xHCI re-enumeration after the NVMe driver comes up, in case the order of the existing enumeration loses a Set Configuration.
Possibly a framebuffer-resident debug overlay so I can get diagnostic output off the affected machine without USB-serial.

If you’re running Aegis from USB, none of this affects you. If you’ve installed it to disk and can no longer type: I’m sorry, and I’m genuinely working on it. Please open an issue with your hardware details (especially the laptop model and whether it has an internal USB hub) so I can look for a pattern.

20 new coreutils

The reason 1.0.3 counts as a feature release. Aegis shipped 1.0.2 with 24 utilities under /bin, which was not quite enough to use the shell for anything beyond running pre-built binaries: cat, ls, grep, sort, wc were there but head and tail weren’t, so reading the first line of a config file required cat | grep '.' and squinting. 1.0.3 closes most of that gap. Each new utility is a tiny musl-static C binary in user/bin/<name>/main.c, registered in rootfs.manifest, and built by the SIMPLE_USER_PROGS rule in the top-level Makefile.

Twenty new ones, grouped by what they actually let you do:

Read pieces of files. head (-n flag), tail (-n flag, no -f yet — see follow-up note), cut (-d and -f, comma list, no ranges), uniq (adjacent-line collapse, no -c/-d), expand (tabs to spaces, -t).

Path manipulation. basename (with optional suffix strip), dirname, realpath (resolves through symlinks), which (walks $PATH colon-by-colon, falls back to /bin if PATH is unset).

Shell glue. tee (with -a), yes (the single most useless tool until you need it), sleep (integer seconds, nanosleep under the hood with EINTR retry), sync, test and [ (file tests -e -f -d -L -r -w -x -s, string =/!=/-n/-z, integer -eq/-ne/-lt/-le/-gt/-ge).

Inspection. stat (mode/uid/gid/size/inode in a single line), find (recursive walk with -name <glob> via fnmatch), tr (literal sets only, no [a-z]-style ranges), date (default Day Mon DD HH:MM:SS UTC YYYY, custom format via +FMT), env (print environ, or run a program with KEY=VAL set first), hostname (read or write).

That’s 20.
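As one example of how small these tools are, the “nanosleep under the hood with EINTR retry” behaviour of sleep can be sketched like this. It is an illustration under assumptions, not the contents of user/bin/sleep/main.c; the function name sleep_seconds is made up for the example.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <time.h>

/* Sleep for whole seconds. If a signal interrupts nanosleep, resume with
 * the remaining time it reports instead of returning early. */
int sleep_seconds(long seconds)
{
    struct timespec req = { .tv_sec = seconds, .tv_nsec = 0 };
    struct timespec rem;

    while (nanosleep(&req, &rem) != 0) {
        if (errno != EINTR)
            return -1;      /* real error, e.g. EINVAL for a negative value */
        req = rem;          /* interrupted: sleep only for what is left */
    }
    return 0;
}
```

Without the retry loop, any signal delivered to the process would cut the sleep short, which is easy to miss until a script depends on the delay.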

The two kernel patches behind env and hostname are the underwear of this release:

sys_execve now propagates envp (kernel/syscall/sys_exec.c). Used to be a (void)envp_uptr; /* not yet supported */ no-op, so every newly exec’d process started with an empty environment, which is why getenv("PATH") always returned NULL even though stsh sets PATH=/bin for itself. With envp propagation, env FOO=bar /bin/env actually shows FOO=bar in the child’s listing. The implementation mirrors the existing argv-copy + stack-build path bit-for-bit; the SysV ABI initial stack now contains argc, argv[…], NULL, envp[…], NULL, auxv[…] like the spec asks.
sys_sethostname is a real syscall now (kernel/syscall/sys_hostname.c, syscall number 170). A 65-byte BSS buffer initialized to "aegis", gated by CAP_KIND_POWER (same gate sys_reboot uses — if you’re trusted to power off the box you’re trusted to rename it), spinlock-protected so SMP doesn’t tear reads under writes. sys_uname reads from the same buffer so gethostname(3) and uname.nodename see whatever the last sethostname call set. There’s no on-disk persistence — userspace can read /etc/hostname at boot and call sethostname if it wants survival across reboots, which is what /bin/hostname running with admin caps will do once we wire that into vigil.
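For the envp change, the initial stack layout named above (argc, argv[…], NULL, envp[…], NULL, auxv[…]) can be illustrated with a small userspace sketch. This is not the code in kernel/syscall/sys_exec.c: build_initial_stack is a hypothetical helper that writes the words into a flat array, and it assumes the string data has already been copied elsewhere so that only pointers are laid out here.

```c
#include <stddef.h>
#include <stdint.h>

/* Lay out argc, argv[], NULL, envp[], NULL, auxv pairs, AT_NULL into `out`
 * in ascending address order (out[0] is where the stack pointer would
 * point at process entry). `auxc` counts (type, value) pairs.
 * Returns the number of words written, or 0 if `cap` is too small. */
size_t build_initial_stack(uintptr_t *out, size_t cap,
                           const uintptr_t *argv, size_t argc,
                           const uintptr_t *envp, size_t envc,
                           const uintptr_t *auxv, size_t auxc)
{
    size_t need = 1 + argc + 1 + envc + 1 + 2 * auxc + 2;
    size_t i = 0;

    if (cap < need)
        return 0;

    out[i++] = (uintptr_t)argc;
    for (size_t k = 0; k < argc; k++)
        out[i++] = argv[k];
    out[i++] = 0;                       /* argv terminator */
    for (size_t k = 0; k < envc; k++)
        out[i++] = envp[k];
    out[i++] = 0;                       /* envp terminator */
    for (size_t k = 0; k < 2 * auxc; k++)
        out[i++] = auxv[k];
    out[i++] = 0;                       /* AT_NULL type */
    out[i++] = 0;                       /* AT_NULL value */
    return i;
}
```

With an empty envp the layout still contains the envp terminator, so a libc that walks past argv to find the environment sees an empty list rather than reading into auxv.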

The whole batch is exercised end-to-end by tests/tests/coreutils_test.rs, which boots the text-mode test ISO, types root + the password into /bin/login, waits for stsh to print its new [STSH] ready marker, and then runs each utility with a sentinel-marker pattern that’s robust against stsh’s local-echo of the typed command line. (First version used substring-match on the sentinel and false-positive-fired on the typed-echo line, since the command itself contains the sentinel string. l.trim() == sentinel fixed that.) Eleven of the twenty utils are asserted there today. The harness isn’t wired into CI yet — runs on the dev box with cargo test --test coreutils_test.
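The sentinel fix is easy to get wrong, so here is the comparison spelled out. The harness itself is Rust and uses l.trim() == sentinel; the C version below only illustrates the same idea, and line_is_sentinel and the sample sentinel string are invented for the example.

```c
#include <ctype.h>
#include <string.h>

/* Returns 1 only if `line`, with leading and trailing whitespace ignored,
 * is exactly `sentinel`. A line that merely contains the sentinel (such as
 * the shell's echo of the typed command) does not match. */
int line_is_sentinel(const char *line, const char *sentinel)
{
    while (isspace((unsigned char)*line))
        line++;

    size_t len = strlen(line);
    while (len > 0 && isspace((unsigned char)line[len - 1]))
        len--;

    return len == strlen(sentinel) && memcmp(line, sentinel, len) == 0;
}
```

A substring search would accept a line like "# echo __DONE_42__", which is the false positive described above; the exact comparison accepts only the command’s own output line.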

The other nine (env, sync, stat, yes, test and [, find, which, uniq, plus expand) ship as binaries in /bin but are not asserted yet. Eight of them exec correctly in a fresh boot but hit a kernel-level ENOENT after roughly the eleventh sequential ext2-backed execve in a single session: the same binary that worked at exec #5 returns “No such file or directory” at exec #12. The instrumented stsh now prints strerror(errno) instead of the old generic not found, which is how I noticed. Suspected root cause: the ext2 16-slot LRU block cache evicts an indirect block needed for a later path lookup, and the cache-miss path in kernel/fs/ext2.c returns ENOENT instead of refilling. Real binaries; real exec failures; not great. Filed as a 1.0.4 follow-up.
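To make the suspected failure mode concrete, here is a toy LRU block cache whose miss path refills from the backing store. It is a sketch of the intended behaviour, not a copy of kernel/fs/ext2.c, and since the root cause is only a suspicion the real fix may look different. The struct layout, cache_get, demo_read_block, and the 4-slot, 16-byte sizes are all assumptions chosen to keep the example short.

```c
#include <stdint.h>
#include <string.h>

#define CACHE_SLOTS 4          /* the real cache reportedly has 16 slots */
#define BLOCK_SIZE  16         /* toy block size for the demo */

struct slot {
    int      valid;
    uint32_t blkno;
    uint64_t last_used;
    uint8_t  data[BLOCK_SIZE];
};

struct block_cache {
    struct slot slots[CACHE_SLOTS];
    uint64_t    tick;
    /* Hypothetical backing-store reader: fills buf, returns 0 on success. */
    int (*read_block)(uint32_t blkno, uint8_t *buf);
};

/* Demo reader: every byte of block N is N (truncated to 8 bits). */
int demo_read_block(uint32_t blkno, uint8_t *buf)
{
    memset(buf, (int)(blkno & 0xff), BLOCK_SIZE);
    return 0;
}

/* Return the cached block, refilling the least recently used slot on a
 * miss. NULL means the device read failed, never "block not cached". */
const uint8_t *cache_get(struct block_cache *c, uint32_t blkno)
{
    struct slot *victim = &c->slots[0];

    for (int i = 0; i < CACHE_SLOTS; i++) {
        struct slot *s = &c->slots[i];
        if (s->valid && s->blkno == blkno) {
            s->last_used = ++c->tick;           /* hit */
            return s->data;
        }
        /* Prefer an empty slot, otherwise the oldest valid one. */
        if (!s->valid || (victim->valid && s->last_used < victim->last_used))
            victim = s;
    }

    if (c->read_block(blkno, victim->data) != 0) {
        victim->valid = 0;                      /* contents are now garbage */
        return NULL;
    }
    victim->valid = 1;
    victim->blkno = blkno;
    victim->last_used = ++c->tick;
    return victim->data;
}
```

The point of the sketch is the contract: an evicted block is indistinguishable from a cached one to the caller, so a path lookup should never be able to turn a cache miss into ENOENT.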

expand is the ninth and behaves differently: it fails to exec even from a fresh boot, and only that one binary does. Same musl-gcc invocation as its neighbors, same .interp, MD5 of the rootfs copy matches the source, and yet /bin/expand /etc/passwd returns ENOENT immediately. Single-binary kernel exec mystery, also tracked for 1.0.4.

If you’re using Aegis interactively rather than scripting against it, you’re unlikely to issue twelve sequential exec’d commands in one stsh session in a way that surfaces the ext2 bug — cat, ls, echo, sh, login, vigil, and the rest of the boot-critical binaries are served from initrd, not ext2, and don’t count against the threshold. If you write a test that hammers ext2-resident utils, the failure mode is unmistakable.

(There’s also a small instrumentation win in stsh from this work: command-not-found errors now show the real errno via strerror, so the next person to chase a kernel-side ENOENT or EACCES at exec time won’t have to re-instrument from scratch.)

Test harness — don’t lie to me next time

The installed-NVMe post-boot test was helpful during 1.0.2 and pretty useless for this release, because QEMU + OVMF + default -vga std exposes only a 24-bpp GOP framebuffer, and the kernel’s multiboot2 FB tag parser only accepts bpp == 32. For months the test has been booting the installed system, hitting “bastion can’t map the framebuffer,” respawning Bastion forever, and never getting to the Bastion greeter. I happened to catch it while debugging something else this weekend.
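The filter that rejects the std-vga framebuffer is simple enough to sketch. The struct below follows the framebuffer tag layout from the Multiboot2 specification; fb_tag_usable is a hypothetical name, and the extra check for a direct-colour framebuffer type is an assumption on my part, since the only confirmed condition is bpp == 32.

```c
#include <stdint.h>

#define MB2_TAG_FRAMEBUFFER 8   /* framebuffer tag type in Multiboot2 */
#define MB2_FB_TYPE_RGB     1   /* direct-colour framebuffer */

/* Common part of the Multiboot2 framebuffer tag. */
struct mb2_tag_framebuffer {
    uint32_t type;
    uint32_t size;
    uint64_t framebuffer_addr;
    uint32_t framebuffer_pitch;
    uint32_t framebuffer_width;
    uint32_t framebuffer_height;
    uint8_t  framebuffer_bpp;
    uint8_t  framebuffer_type;
    uint16_t reserved;
};

/* Returns 1 if the kernel could use this framebuffer: it must be a
 * framebuffer tag, direct colour (assumed), and exactly 32 bits per pixel. */
int fb_tag_usable(const struct mb2_tag_framebuffer *t)
{
    return t->type == MB2_TAG_FRAMEBUFFER
        && t->framebuffer_type == MB2_FB_TYPE_RGB
        && t->framebuffer_bpp == 32;
}
```

An 800x600 mode at 24 bpp fails this check and a 1280x800 mode at 32 bpp passes it, which matches the difference between the two QEMU display devices discussed in this section.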

Two fixes:

tests/src/presets.rs — aegis_q35_installed_ovmf now specifies -device virtio-vga instead of relying on -vga std. virtio-vga under OVMF exposes a 32-bpp GOP at 1280x800 and the existing kernel filter accepts it.
tests/tests/gui_installer_test.rs — gui_install_and_boot_from_nvme now drives the full post-install path: [EXT2] OK: mounted nvme0p1 → [BASTION] greeter ready → send_keys("root\tforevervigilant\n") → [LUMEN] ready. If keystrokes don’t reach the Bastion read loop, the test fails loudly with the last 80 lines of serial. This is what would have caught the keyboard regression in CI — it didn’t, because the QEMU+OVMF keyboard path is PS/2 and works fine, and the broken path is xHCI USB HID which QEMU’s default NVMe+OVMF q35 preset doesn’t exercise. Noted.

Also: a new regression test gui_installer_overwrite_existing that does Boot 1 fresh install → Boot 2 install again on the same disk, verifying the partition-rescan path handles an already-installed target. It passes on every QEMU configuration I’ve tried. It should repro the bug one of you reported; it doesn’t, which is why the prompt-before-overwrite UX landed instead of a root-cause fix. If anyone wants to PR a QEMU config that reproduces partition rescan failed, I owe you a beer.

And one bonus build-system fix that the coreutils work surfaced. The generated rule for every binary in SIMPLE_USER_PROGS (cat, ls, echo, all of them, plus the new twenty) declared the .elf as depending only on $(MUSL_BUILT). Editing user/bin/<name>/main.c did not invalidate the binary. The rule has been wrong since the SIMPLE template was added; nobody ever noticed because most edits to those binaries also touched the manifest or the Makefile, which DID invalidate the rootfs.img and trigger a full rebuild. The new coreutils were the first time anyone iterated on a binary’s source file rapidly enough for the staleness to bite — first version of the [STSH] ready marker shipped a stale stsh.elf that never got rebuilt and the test silently used the old binary. Two-line fix: add $(wildcard user/bin/$(1)/*.c) $(wildcard user/bin/$(1)/*.h) to the dep list.

Get it

Download v1.0.3 ISO — same shape as 1.0.2 (live boot, graphical default, text-mode entry in GRUB, Aegis-wallpaper GRUB menu that now actually works on the installed system too). Default credentials are still root / forevervigilant.

Report keyboard-dead bugs to exec/aegis, and please include the laptop model and whether you booted via UEFI or legacy CSM. Security reports go to execxd@icloud.com.