Network Stack Overview
Architecture and design of the Aegis OS in-kernel IPv4 network stack
Aegis implements a minimal, polling-based IPv4 network stack entirely within the kernel. The stack supports Ethernet framing, ARP resolution, IPv4 routing, ICMP echo, UDP, and TCP (RFC 793 state machine with retransmit). All protocol processing runs either in PIT ISR context (RX path at 100 Hz) or in syscall/task context (TX path via the socket API).
v1 maturity notice: The network stack is written entirely in C and has not been audited for memory safety or protocol-level vulnerabilities. As a v1 from-scratch implementation, it should be assumed to contain exploitable bugs – buffer handling errors, integer overflows in length calculations, or state machine edge cases that could be triggered by crafted packets. This is expected for any network stack at this stage of development. The planned gradual migration from C to Rust (already underway in the capability model) will eventually cover the networking code, but that work has not yet begun for kernel/net/. Contributions are welcome – file issues or propose changes at exec/aegis.
Architecture
```
+--------------------------------------------------------------+
|                          User Space                          |
|    socket() / bind() / listen() / accept() / connect()       |
|    send() / recv() / sendto() / recvfrom() / epoll_wait()    |
+-------------------------------+------------------------------+
                                | syscall boundary
+-------------------------------+------------------------------+
| Socket Layer   socket.c | unix_socket.c | epoll.c            |
|   - sock_t table (64 slots) | AF_UNIX domain | epoll fds     |
|   - VFS integration (read/write/close/dup/stat)              |
+-------------------------------+------------------------------+
| Transport      tcp.c | udp.c                                 |
|   - TCP state machine (32 conns, 16KB rbuf, 8KB sbuf)        |
|   - UDP binding table (16 ports)                             |
+-------------------------------+------------------------------+
| Network        ip.c                                          |
|   - IPv4 send/receive, routing (gateway vs. on-link)         |
|   - ICMP echo request/reply                                  |
|   - Loopback queue (127.0.0.0/8 and self-addressed)          |
+-------------------------------+------------------------------+
| Data Link      eth.c                                         |
|   - Ethernet framing (14-byte header)                        |
|   - ARP table (16 entries, request/reply, anti-spoofing)     |
+-------------------------------+------------------------------+
| Device         netdev.c                                      |
|   - netdev_t registry (up to 4 NICs)                         |
|   - poll/send abstraction                                    |
+-------------------------------+------------------------------+
| Drivers        virtio_net.c | rtl8169.c                      |
|   - PCI device discovery via ECAM                            |
|   - DMA descriptor rings, MMIO register access               |
+--------------------------------------------------------------+
```
Source Files
| File | Purpose |
|---|---|
| `kernel/net/net.h` | Shared types: `ip4_addr_t`, `mac_addr_t`, byte-order macros, checksum API |
| `kernel/net/netdev.h` / `.c` | Network device abstraction layer |
| `kernel/net/eth.h` / `.c` | Ethernet framing, ARP table, ARP resolution |
| `kernel/net/ip.h` / `.c` | IPv4 send/receive, ICMP echo, loopback, `net_init()` |
| `kernel/net/tcp.h` / `.c` | TCP state machine, retransmit timer, socket-layer helpers |
| `kernel/net/udp.h` / `.c` | UDP send/receive, port binding table |
| `kernel/net/socket.h` / `.c` | Socket table, VFS ops for AF_INET sockets |
| `kernel/net/unix_socket.h` / `.c` | AF_UNIX domain sockets with fd passing |
| `kernel/net/epoll.h` / `.c` | epoll implementation (`epoll_create`, `epoll_ctl`, `epoll_wait`) |
| `kernel/drivers/virtio_net.c` | virtio 1.0 modern NIC driver (QEMU) |
| `kernel/drivers/rtl8169.c` | Realtek RTL8168/8169 PCIe gigabit Ethernet driver |
Initialization
The network stack is initialized during boot via net_init() (called from kernel_main()):
```c
void net_init(void)
{
    netdev_t *dev = netdev_get("eth0");
    if (!dev) {
        /* No NIC registered — silent return.
         * boot.txt must NOT contain any [NET] lines. */
        return;
    }
    eth_init();
    udp_init();
    tcp_init();
}
```
Prior to net_init(), the NIC driver (virtio_net_init() or rtl8169_init()) has already:
- Scanned the PCIe device table for its vendor/device ID
- Mapped BAR MMIO regions into kernel virtual address space
- Set up DMA descriptor rings (RX and TX)
- Read the MAC address from device registers
- Registered itself via `netdev_register()` as `"eth0"`
IP configuration is set statically via net_set_config():
```c
void net_set_config(ip4_addr_t ip, ip4_addr_t mask, ip4_addr_t gw);
```
The DHCP client (running in userspace) can update this at runtime.
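The mask and gateway configured here feed the routing decision made later in `ip_send()`. A minimal sketch of that on-link test, assuming addresses are stored in network byte order as elsewhere in the stack (the `ip4()` helper and `next_hop()` are illustrative names, not kernel functions):

```c
#include <stdint.h>

typedef uint32_t ip4_addr_t; /* network byte order, as in net.h */

/* Hypothetical helper: build a network-order address from dotted-quad parts. */
static ip4_addr_t ip4(uint8_t a, uint8_t b, uint8_t c, uint8_t d)
{
    ip4_addr_t v;
    uint8_t *p = (uint8_t *)&v;
    p[0] = a; p[1] = b; p[2] = c; p[3] = d;
    return v;
}

/* Sketch of the routing decision ip_send() makes: a destination on our
 * subnet is ARP-resolved directly; everything else goes via the gateway. */
static ip4_addr_t next_hop(ip4_addr_t dst, ip4_addr_t my_ip,
                           ip4_addr_t mask, ip4_addr_t gw)
{
    return ((dst & mask) == (my_ip & mask)) ? dst : gw;
}
```

Because the subnet comparison operates on whole network-order words, it behaves identically regardless of host endianness.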
Polling Model
Aegis uses a polling-based network model rather than interrupt-driven I/O. The PIT timer ISR fires at 100 Hz and calls three network functions:
- `netdev_poll_all()` – iterates registered NICs, calls `dev->poll()` to drain RX queues
- `tcp_tick()` – processes TCP retransmit timers and TIME_WAIT expiry
- `ip_loopback_poll()` – drains the loopback packet queue
The g_in_netdev_poll flag is set during netdev_poll_all() so that code called from the ISR RX path (e.g., arp_resolve()) knows it must not block:
```c
void netdev_poll_all(void)
{
    irqflags_t fl = spin_lock_irqsave(&netdev_lock);
    g_in_netdev_poll = 1;
    int i;
    for (i = 0; i < s_count; i++) {
        if (s_devices[i]->poll)
            s_devices[i]->poll(s_devices[i]);
    }
    g_in_netdev_poll = 0;
    spin_unlock_irqrestore(&netdev_lock, fl);
}
```
This design eliminates the complexity of interrupt-driven networking (MSI/MSI-X setup, deferred processing, NAPI-style budgeting) at the cost of latency (up to 10 ms between poll ticks).
Packet Flow
Receive Path (Inbound)
```
NIC hardware writes frame to DMA buffer
        |
        v
PIT ISR (100 Hz) → netdev_poll_all()
        |
        v
dev->poll(dev)                 [virtio_net_poll / rtl8169_poll]
  - Drains used ring / RX descriptors
  - Strips driver header (virtio: 12B, RTL: 4B FCS)
        |
        v
netdev_rx_deliver(dev, frame, len)
        |
        v
eth_rx(dev, frame, len)
  - Parse 14-byte Ethernet header
  - Dispatch on ethertype:
      0x0806 → arp_rx_pkt()    [ARP reply → update cache]
      0x0800 → ip_rx()         [IPv4 → next step]
        |
        v
ip_rx(dev, frame, ip_payload, len)
  - Validate: version=4, IHL=5, header checksum
  - Destination filter: my_ip, loopback, broadcast, subnet broadcast
  - Dispatch on protocol:
      1  → icmp_rx()           [echo reply: drop; echo request: reply]
      6  → tcp_rx()            [TCP segment → state machine]
      17 → udp_rx()            [UDP datagram → binding table]
        |
        v
tcp_rx() / udp_rx()
  - Validate transport checksum (pseudo-header)
  - Deliver to socket layer (ring buffer / accept queue)
  - Defer wake/epoll_notify outside lock
```
Transmit Path (Outbound)
```
User calls write(fd, ...) / send(fd, ...) / sendto(fd, ...)
        |
        v
sys_write → sock_vfs_write → tcp_conn_send / udp_send
        |
        v
tcp_send_segment / udp_send
  - Build transport header
  - Compute checksum with pseudo-header
        |
        v
ip_send(dev, dst_ip, proto, payload, len)
  - Build 20-byte IPv4 header
  - Loopback check (127.0.0.0/8 or self) → queue to ring
  - Routing: same subnet → dst_ip; else → gateway
  - ARP resolve next-hop MAC
        |
        v
eth_send(dev, dst_mac, ethertype, payload, len)
  - Build 14-byte Ethernet header in static TX buffer
        |
        v
dev->send(dev, frame, total_len)   [virtio_net_send / rtl8169_send]
  - Copy frame to DMA bounce buffer
  - Update descriptor ring, kick doorbell
  - Poll for TX completion
```
Shared Types and Byte Order
All network addresses are stored in network byte order (big-endian). Conversion macros use GCC builtins:
```c
typedef uint32_t ip4_addr_t;                 /* network byte order */
typedef struct { uint8_t b[6]; } mac_addr_t;

#define htons(x) __builtin_bswap16((uint16_t)(x))
#define ntohs(x) __builtin_bswap16((uint16_t)(x))
#define htonl(x) __builtin_bswap32((uint32_t)(x))
#define ntohl(x) __builtin_bswap32((uint32_t)(x))
```
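Because these swaps are unconditional rather than guarded by an endianness check, they are only correct on a little-endian host (the x86 targets implied by the PIT and ECAM references elsewhere in this document). A quick round-trip sanity check, repeating the macros so the snippet stands alone (`example_addr()` is an illustrative name):

```c
#include <stdint.h>

/* Unconditional byte swaps: correct for network order only on a
 * little-endian host; a portable stack would no-op these on big-endian. */
#define htons(x) __builtin_bswap16((uint16_t)(x))
#define ntohs(x) __builtin_bswap16((uint16_t)(x))
#define htonl(x) __builtin_bswap32((uint32_t)(x))
#define ntohl(x) __builtin_bswap32((uint32_t)(x))

typedef uint32_t ip4_addr_t;

/* 192.168.1.1 as a network-order address (0xC0A80101 is host order). */
static inline ip4_addr_t example_addr(void)
{
    return htonl(0xC0A80101u);
}
```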
Checksum Implementation
The one’s-complement checksum used by IP, TCP, UDP, and ICMP is implemented as a two-phase operation for non-contiguous regions:
```c
uint32_t net_checksum(const void *data, uint32_t len);
uint16_t net_checksum_finish(uint32_t sum);
```
Usage pattern for TCP (pseudo-header + segment):
```c
tcp_pseudo_hdr_t ph;
ph.src     = conn->local_ip;
ph.dst     = conn->remote_ip;
ph.zero    = 0;
ph.proto   = IP_PROTO_TCP;
ph.tcp_len = htons(tcp_len);

uint32_t sum = 0;
sum += net_checksum(&ph, sizeof(ph));
sum += net_checksum(s_tcp_buf, tcp_len);
hdr->checksum = net_checksum_finish(sum);
```
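The two-phase split is what makes the pseudo-header pattern above work: each region contributes to a running 32-bit sum, and the fold-and-complement happens once at the end. One plausible portable implementation (a sketch, not the kernel's actual code) sums 16-bit big-endian words:

```c
#include <stdint.h>

/* Phase 1: accumulate a 32-bit one's-complement sum over a region.
 * Words are read big-endian; a trailing odd byte is zero-padded. */
uint32_t net_checksum(const void *data, uint32_t len)
{
    const uint8_t *p = (const uint8_t *)data;
    uint32_t sum = 0;
    while (len > 1) {
        sum += ((uint32_t)p[0] << 8) | p[1];
        p += 2;
        len -= 2;
    }
    if (len)
        sum += (uint32_t)p[0] << 8;
    return sum;
}

/* Phase 2: fold carries back into 16 bits, then complement. */
uint16_t net_checksum_finish(uint32_t sum)
{
    while (sum >> 16)
        sum = (sum & 0xffffu) + (sum >> 16);
    return (uint16_t)~sum;
}
```

Note that summing regions separately is only equivalent to one pass when every region except the last has even length; the 12-byte TCP pseudo-header satisfies that.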
Lock Ordering
The network stack uses spinlocks with IRQ save/restore. The lock ordering is critical to avoid deadlocks, particularly because protocol processing can be re-entered from the PIT ISR:
```
sock_lock > tcp_lock > udp_lock > arp_lock > ip_lock > netdev_lock
```
Key lock ordering constraints:
- `tcp_lock` → `sock_lock`: TCP RX processing collects socket wake events under `tcp_lock`, then releases it before calling `sock_wake()` (which acquires `sock_lock`). This prevents a `tcp_lock` → `sock_lock` ordering inversion.
- `ip_lock` → `arp_lock`: `ip_send()` copies the assembled packet to a local buffer and snapshots the IP configuration under `ip_lock`, then releases it before calling `arp_resolve()` (which acquires `arp_lock`).
- `g_in_netdev_poll` flag: When set (inside the PIT ISR), `arp_resolve()` returns -1 immediately instead of blocking. Blocking inside the ISR while holding `netdev_lock` would deadlock the next PIT tick.
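The first constraint amounts to a collect-then-notify pattern. The sketch below uses illustrative names (`collect_wake`, `flush_wakes`, `example_wake`); the real code keeps a pending-wake list while holding tcp_lock and flushes it only after the unlock:

```c
#define MAX_WAKE 32

/* Pending wake list, filled while tcp_lock is (notionally) held. */
static int s_wake_list[MAX_WAKE];
static int s_wake_count;

/* Called under tcp_lock: just record which socket needs waking. */
static void collect_wake(int sock_idx)
{
    if (s_wake_count < MAX_WAKE)
        s_wake_list[s_wake_count++] = sock_idx;
}

/* Called after tcp_lock is released: now it is safe for wake_fn
 * (standing in for sock_wake()) to acquire sock_lock.
 * Returns the number of wakes delivered. */
static int flush_wakes(void (*wake_fn)(int))
{
    int n = s_wake_count;
    for (int i = 0; i < n; i++)
        wake_fn(s_wake_list[i]);
    s_wake_count = 0;
    return n;
}

/* Example wake target used below; a real sock_wake() would take sock_lock. */
static int s_wakes_delivered;
static void example_wake(int sock_idx)
{
    (void)sock_idx;
    s_wakes_delivered++;
}
```

The trade-off is a bounded wake list rather than immediate notification, which fits the stack's fixed-size-table design.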
Loopback
Packets destined for the local IP or 127.0.0.0/8 are queued in a ring buffer rather than sent to the NIC:
```c
#define LO_RING_SIZE 8
#define LO_PKT_MAX   1500

static uint8_t  s_lo_ring[LO_RING_SIZE][LO_PKT_MAX];
static uint16_t s_lo_len[LO_RING_SIZE];
static uint32_t s_lo_head, s_lo_tail;
```
ip_loopback_poll() drains this queue at 100 Hz, calling ip_rx() for each queued packet. This deferred delivery prevents re-entrant crashes from synchronous loopback (e.g., ip_send(SYN) -> ip_rx -> tcp_rx -> ip_send(SYN-ACK) -> ip_rx -> ... clobbering static buffers).
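A sketch of the enqueue/drain pair this implies (function names are illustrative, and in the kernel both sides would also run under the IP lock). With free-running head/tail counters, the full and empty tests are plain subtractions:

```c
#include <stdint.h>
#include <string.h>

#define LO_RING_SIZE 8
#define LO_PKT_MAX   1500

static uint8_t  s_lo_ring[LO_RING_SIZE][LO_PKT_MAX];
static uint16_t s_lo_len[LO_RING_SIZE];
static uint32_t s_lo_head, s_lo_tail; /* head = next write, tail = next read */

/* Hypothetical enqueue (ip_send side): 0 on success, -1 if full/oversized.
 * A full ring means the packet is dropped, as a polled stack must tolerate. */
int lo_enqueue(const void *pkt, uint16_t len)
{
    if (len > LO_PKT_MAX)
        return -1;
    if (s_lo_head - s_lo_tail >= LO_RING_SIZE)
        return -1;
    uint32_t slot = s_lo_head % LO_RING_SIZE;
    memcpy(s_lo_ring[slot], pkt, len);
    s_lo_len[slot] = len;
    s_lo_head++;
    return 0;
}

/* Hypothetical drain step (ip_loopback_poll side): length, or -1 if empty. */
int lo_dequeue(void *out)
{
    if (s_lo_tail == s_lo_head)
        return -1;
    uint32_t slot = s_lo_tail % LO_RING_SIZE;
    int len = s_lo_len[slot];
    memcpy(out, s_lo_ring[slot], (size_t)len);
    s_lo_tail++;
    return len;
}
```

Deferring delivery to the next poll tick is exactly what breaks the re-entrancy chain described above: `ip_rx()` never runs in the middle of an `ip_send()` that is still using the static buffers.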
Static Buffers
The network stack uses static, file-scoped buffers throughout instead of dynamic allocation:
| Buffer | Size | Location | Purpose |
|---|---|---|---|
| `s_tx_buf` | 1514 B | `eth.c` | Ethernet frame assembly |
| `s_ip_buf` | 1500 B | `ip.c` | IP packet assembly |
| `s_tcp_buf` | 1480 B | `tcp.c` | TCP segment assembly |
| `s_udp_buf` | 1480 B | `udp.c` | UDP packet assembly |
| `s_icmp_buf` | 1480 B | `ip.c` | ICMP reply assembly |
| `s_lo_ring` | 8 x 1500 B | `ip.c` | Loopback queue |
All TX paths are serialized by their respective spinlocks, so static buffers are safe.
Limitations
- No IP fragmentation: `ip_send()` rejects payloads larger than 1480 bytes (1500 MTU minus the 20-byte IP header). The maximum TCP segment is ~1460 bytes.
- No IPv6: The stack is IPv4-only.
- No multicast routing: Multicast frames are accepted at the Ethernet level but not routed.
- No IP options: Only the minimum 20-byte IP header (IHL=5) is supported.
- Polling latency: Up to 10 ms between packet arrival and processing (100 Hz PIT).
- Static table sizes: 4 netdevs, 16 ARP entries, 32 TCP connections, 16 UDP bindings, 64 sockets.
- No congestion control: TCP sends at line rate with no slow-start, congestion avoidance, or fast retransmit.
- No TLS/encryption: All traffic is plaintext. There is no in-kernel cryptographic support.
Security Considerations
The entire network stack is C code operating on untrusted data from the wire. As v1 software, it has not undergone formal security review and should be assumed to contain vulnerabilities typical of C network code at this maturity level:
- Buffer overflows: Static buffers are used throughout. While length checks exist at each layer, the absence of memory-safe language guarantees means any single missed check could be exploitable.
- Integer overflow: Length fields are `uint16_t` and combined across layers. Wrapping arithmetic on crafted packets could bypass bounds checks.
- ARP spoofing mitigation is partial: The anti-spoofing logic (reject unsolicited replies) helps but does not defend against a race where a legitimate request is pending.
- No rate limiting: There is no protection against SYN floods, ARP storms, or other denial-of-service traffic patterns.
- Static buffers as shared state: The TX path uses file-scoped static buffers protected by spinlocks, but lock ordering bugs could lead to data corruption under adversarial timing.
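As a concrete illustration of the first two bullets, a receive path has to chain length checks like the following sketch (illustrative names and logic, not the kernel's actual validation); missing any one of them admits the crafted-packet attacks described above:

```c
#include <stdint.h>

/* Sketch of the layered length validation an IPv4 receive path needs.
 * frame_len is what the NIC actually delivered; ip_total_len and ihl
 * come from the untrusted header. All comparisons widen to uint32_t
 * so attacker-chosen uint16_t values cannot wrap. */
static int ip_len_ok(uint32_t frame_len, uint16_t ip_total_len, uint8_t ihl)
{
    uint32_t hdr_len = (uint32_t)ihl * 4;
    if (ihl < 5)
        return 0;               /* header length below the legal minimum */
    if (ip_total_len < hdr_len)
        return 0;               /* total length shorter than the header */
    if (ip_total_len > frame_len)
        return 0;               /* header claims more bytes than received */
    return 1;
}
```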
These are not hypothetical concerns documented for completeness – they are the expected reality of a from-scratch C network stack that has not yet been hardened. The long-term plan is to gradually migrate the network stack to Rust as part of the broader kernel Rust migration (see Capability Model for the first component already in Rust).
Related Documentation
- TCP/IP Implementation – detailed TCP state machine and protocol handling
- Socket API – userspace socket interface and syscalls
- Device Drivers – NIC driver internals (virtio-net, RTL8169)