Network Stack Overview

Aegis implements a minimal, polling-based IPv4 network stack entirely within the kernel. The stack supports Ethernet framing, ARP resolution, IPv4 routing, ICMP echo, UDP, and TCP (RFC 793 state machine with retransmit). All protocol processing runs either in PIT ISR context (RX path at 100 Hz) or in syscall/task context (TX path via the socket API).

v1 maturity notice: The network stack is written entirely in C and has not been audited for memory safety or protocol-level vulnerabilities. As a v1 from-scratch implementation, it should be assumed to contain exploitable bugs – buffer handling errors, integer overflows in length calculations, or state machine edge cases that could be triggered by crafted packets. This is expected for any network stack at this stage of development. The planned gradual migration from C to Rust (already underway in the capability model) will eventually cover the networking code, but that work has not yet begun for kernel/net/. Contributions are welcome – file issues or propose changes at exec/aegis.

Architecture

+--------------------------------------------------------------+
|                     User Space                               |
|  socket() / bind() / listen() / accept() / connect()         |
|  send() / recv() / sendto() / recvfrom() / epoll_wait()      |
+-------------------------------+------------------------------+
                                |  syscall boundary
+-------------------------------+------------------------------+
|  Socket Layer        socket.c | unix_socket.c | epoll.c      |
|  - sock_t table (64 slots)   | AF_UNIX domain | epoll fds    |
|  - VFS integration (read/write/close/dup/stat)               |
+-------------------------------+------------------------------+
|  Transport           tcp.c   | udp.c                         |
|  - TCP state machine (32 conns, 16KB rbuf, 8KB sbuf)         |
|  - UDP binding table (16 ports)                              |
+-------------------------------+------------------------------+
|  Network             ip.c                                    |
|  - IPv4 send/receive, routing (gateway vs. on-link)          |
|  - ICMP echo request/reply                                   |
|  - Loopback queue (127.0.0.0/8 and self-addressed)           |
+-------------------------------+------------------------------+
|  Data Link           eth.c                                   |
|  - Ethernet framing (14-byte header)                         |
|  - ARP table (16 entries, request/reply, anti-spoofing)      |
+-------------------------------+------------------------------+
|  Device              netdev.c                                |
|  - netdev_t registry (up to 4 NICs)                          |
|  - poll/send abstraction                                     |
+-------------------------------+------------------------------+
|  Drivers             virtio_net.c | rtl8169.c                |
|  - PCI device discovery via ECAM                             |
|  - DMA descriptor rings, MMIO register access                |
+--------------------------------------------------------------+

Source Files

File                           Purpose
kernel/net/net.h               Shared types: ip4_addr_t, mac_addr_t, byte-order macros, checksum API
kernel/net/netdev.h / .c       Network device abstraction layer
kernel/net/eth.h / .c          Ethernet framing, ARP table, ARP resolution
kernel/net/ip.h / .c           IPv4 send/receive, ICMP echo, loopback, net_init()
kernel/net/tcp.h / .c          TCP state machine, retransmit timer, socket-layer helpers
kernel/net/udp.h / .c          UDP send/receive, port binding table
kernel/net/socket.h / .c       Socket table, VFS ops for AF_INET sockets
kernel/net/unix_socket.h / .c  AF_UNIX domain sockets with fd passing
kernel/net/epoll.h / .c        epoll implementation (epoll_create, epoll_ctl, epoll_wait)
kernel/drivers/virtio_net.c    virtio 1.0 modern NIC driver (QEMU)
kernel/drivers/rtl8169.c       Realtek RTL8168/8169 PCIe gigabit Ethernet driver

Initialization

The network stack is initialized during boot via net_init() (called from kernel_main()):

void net_init(void)
{
    netdev_t *dev = netdev_get("eth0");
    if (!dev) {
        /* No NIC registered — silent return.
         * boot.txt must NOT contain any [NET] lines. */
        return;
    }
    eth_init();
    udp_init();
    tcp_init();
}

Prior to net_init(), the NIC driver (virtio_net_init() or rtl8169_init()) has already:

  1. Scanned the PCIe device table for its vendor/device ID
  2. Mapped BAR MMIO regions into kernel virtual address space
  3. Set up DMA descriptor rings (RX and TX)
  4. Read the MAC address from device registers
  5. Registered itself via netdev_register() as "eth0"
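
The registration step can be pictured with a minimal sketch. Only netdev_t, netdev_register(), netdev_get(), the four-NIC limit, and the poll/send abstraction come from this page; the field names, the function-pointer signatures, and the 0 / -1 return convention are assumptions made for illustration and may not match kernel/net/netdev.c.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define NETDEV_MAX 4                      /* "up to 4 NICs" per the overview */

typedef struct { uint8_t b[6]; } mac_addr_t;

/* Hypothetical layout: the real netdev_t may differ. */
typedef struct netdev {
    const char *name;                     /* e.g. "eth0" */
    mac_addr_t  mac;
    void (*poll)(struct netdev *dev);     /* drain the RX queue */
    int  (*send)(struct netdev *dev, const void *frame, uint16_t len);
} netdev_t;

static netdev_t *s_devices[NETDEV_MAX];
static int       s_count;

/* Returns 0 on success, -1 if the registry is full (assumed convention). */
int netdev_register(netdev_t *dev)
{
    if (dev == NULL || s_count >= NETDEV_MAX)
        return -1;
    s_devices[s_count++] = dev;
    return 0;
}

/* Look a device up by name; NULL if no such device is registered. */
netdev_t *netdev_get(const char *name)
{
    for (int i = 0; i < s_count; i++) {
        if (strcmp(s_devices[i]->name, name) == 0)
            return s_devices[i];
    }
    return NULL;
}
```

With a registry of this shape, net_init() finding no "eth0" simply means no driver reached step 5.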

IP configuration is set statically via net_set_config():

void net_set_config(ip4_addr_t ip, ip4_addr_t mask, ip4_addr_t gw);

The DHCP client (running in userspace) can update this at runtime.

Polling Model

Aegis uses a polling-based network model rather than interrupt-driven I/O. The PIT timer ISR fires at 100 Hz and calls three network functions:

  1. netdev_poll_all() – iterates registered NICs, calls dev->poll() to drain RX queues
  2. tcp_tick() – processes TCP retransmit timers and TIME_WAIT expiry
  3. ip_loopback_poll() – drains the loopback packet queue

The g_in_netdev_poll flag is set during netdev_poll_all() so that code called from the ISR RX path (e.g., arp_resolve()) knows it must not block:

void netdev_poll_all(void)
{
    irqflags_t fl = spin_lock_irqsave(&netdev_lock);
    g_in_netdev_poll = 1;
    int i;
    for (i = 0; i < s_count; i++) {
        if (s_devices[i]->poll)
            s_devices[i]->poll(s_devices[i]);
    }
    g_in_netdev_poll = 0;
    spin_unlock_irqrestore(&netdev_lock, fl);
}

This design avoids the complexity of interrupt-driven networking (MSI/MSI-X setup, deferred processing, NAPI-style budgeting) at the cost of latency: a received frame can wait up to 10 ms, one full tick period at 100 Hz, before it is processed.
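
As a rough sketch of the tick sequence described above, the network part of the timer ISR could look like the following. The name pit_net_tick() and the two constants are hypothetical, and the three callees are stubs that only count calls so the example is self-contained.

```c
#include <stdint.h>

#define PIT_HZ           100u
#define POLL_INTERVAL_MS (1000u / PIT_HZ)   /* 10 ms between polls */

/* Stubs standing in for the real functions; they only count calls. */
static uint32_t n_netdev_poll, n_tcp_tick, n_lo_poll;
static void netdev_poll_all(void)  { n_netdev_poll++; }
static void tcp_tick(void)         { n_tcp_tick++; }
static void ip_loopback_poll(void) { n_lo_poll++; }

/* Hypothetical name for the network portion of the PIT ISR. */
void pit_net_tick(void)
{
    netdev_poll_all();     /* 1. drain NIC RX queues         */
    tcp_tick();            /* 2. retransmit and TIME_WAIT    */
    ip_loopback_poll();    /* 3. deliver queued loopback pkts */
}
```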

Packet Flow

Receive Path (Inbound)

NIC hardware writes frame to DMA buffer
        |
        v
PIT ISR (100 Hz) → netdev_poll_all()
        |
        v
dev->poll(dev)       [virtio_net_poll / rtl8169_poll]
  - Drains used ring / RX descriptors
  - Strips driver framing (virtio: 12 B header; RTL: 4 B trailing FCS)
        |
        v
netdev_rx_deliver(dev, frame, len)
        |
        v
eth_rx(dev, frame, len)
  - Parse 14-byte Ethernet header
  - Dispatch on ethertype:
    0x0806 → arp_rx_pkt()     [ARP reply → update cache]
    0x0800 → ip_rx()          [IPv4 → next step]
        |
        v
ip_rx(dev, frame, ip_payload, len)
  - Validate: version=4, IHL=5, header checksum
  - Destination filter: my_ip, loopback, broadcast, subnet broadcast
  - Dispatch on protocol:
    1  → icmp_rx()            [echo reply: drop; echo request: reply]
    6  → tcp_rx()             [TCP segment → state machine]
    17 → udp_rx()             [UDP datagram → binding table]
        |
        v
tcp_rx() / udp_rx()
  - Validate transport checksum (pseudo-header)
  - Deliver to socket layer (ring buffer / accept queue)
  - Defer wake/epoll_notify outside lock
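
The first two stages of the receive path can be sketched as follows. This is an illustration under assumptions, not the code in eth.c or ip.c: eth_classify(), ip_hdr_basic_ok(), and rx_class_t are hypothetical names, and the header checksum test that ip_rx() also performs is left out here.

```c
#include <stddef.h>
#include <stdint.h>

#define ETH_HDR_LEN     14u
#define ETHERTYPE_IPV4  0x0800u
#define ETHERTYPE_ARP   0x0806u

typedef enum { RX_DROP, RX_ARP, RX_IPV4 } rx_class_t;   /* hypothetical */

/* Classify a raw frame by ethertype and locate its payload. */
rx_class_t eth_classify(const uint8_t *frame, size_t len,
                        const uint8_t **payload, size_t *payload_len)
{
    if (len < ETH_HDR_LEN)
        return RX_DROP;                     /* runt frame: no full header */

    /* Bytes 12..13 hold the ethertype in network byte order. */
    uint16_t ethertype = (uint16_t)((frame[12] << 8) | frame[13]);

    *payload     = frame + ETH_HDR_LEN;
    *payload_len = len - ETH_HDR_LEN;

    switch (ethertype) {
    case ETHERTYPE_ARP:  return RX_ARP;
    case ETHERTYPE_IPV4: return RX_IPV4;
    default:             return RX_DROP;
    }
}

/* The version and IHL checks named above: version 4, IHL 5 (no options). */
int ip_hdr_basic_ok(const uint8_t *ip, size_t len)
{
    if (len < 20)
        return 0;
    return (ip[0] >> 4) == 4 && (ip[0] & 0x0F) == 5;
}
```

Checking the length before touching any header field is the kind of per-layer bounds check the Security Considerations section refers to.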

Transmit Path (Outbound)

User calls write(fd, ...) / send(fd, ...) / sendto(fd, ...)
        |
        v
sys_write → sock_vfs_write → tcp_conn_send / udp_send
        |
        v
tcp_send_segment / udp_send
  - Build transport header
  - Compute checksum with pseudo-header
        |
        v
ip_send(dev, dst_ip, proto, payload, len)
  - Build 20-byte IPv4 header
  - Loopback check (127.0.0.0/8 or self) → queue to ring
  - Routing: same subnet → dst_ip; else → gateway
  - ARP resolve next-hop MAC
        |
        v
eth_send(dev, dst_mac, ethertype, payload, len)
  - Build 14-byte Ethernet header in static TX buffer
        |
        v
dev->send(dev, frame, total_len)    [virtio_net_send / rtl8169_send]
  - Copy frame to DMA bounce buffer
  - Update descriptor ring, kick doorbell
  - Poll for TX completion
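
The loopback check and routing decision inside ip_send() can be sketched like this. The function ip_route(), the route_t enum, and the helper ip4_make() are hypothetical names introduced for the example; the logic follows the three cases listed in the diagram. Addresses are kept in network byte order, so the first octet is read from memory rather than by shifting.

```c
#include <stdint.h>
#include <string.h>

typedef uint32_t ip4_addr_t;          /* network byte order */

typedef enum { ROUTE_LOOPBACK, ROUTE_ONLINK, ROUTE_GATEWAY } route_t;

/* Hypothetical helper: build an address whose memory image is a.b.c.d. */
ip4_addr_t ip4_make(uint8_t a, uint8_t b, uint8_t c, uint8_t d)
{
    uint8_t bytes[4] = { a, b, c, d };
    ip4_addr_t ip;
    memcpy(&ip, bytes, sizeof ip);    /* memory order == wire order */
    return ip;
}

/* Decide where a packet for dst goes and which address to ARP for. */
route_t ip_route(ip4_addr_t dst, ip4_addr_t my_ip, ip4_addr_t mask,
                 ip4_addr_t gw, ip4_addr_t *next_hop)
{
    const uint8_t *octets = (const uint8_t *)&dst;

    if (octets[0] == 127 || dst == my_ip) {   /* 127.0.0.0/8 or self */
        *next_hop = dst;
        return ROUTE_LOOPBACK;
    }
    if ((dst & mask) == (my_ip & mask)) {     /* same subnet: on-link */
        *next_hop = dst;
        return ROUTE_ONLINK;
    }
    *next_hop = gw;                           /* everything else: gateway */
    return ROUTE_GATEWAY;
}
```

The subnet comparison works in any byte order as long as all three addresses use the same one, which is why no ntohl() call is needed there.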

Shared Types and Byte Order

All network addresses are stored in network byte order (big-endian). The conversion macros use GCC builtins and swap bytes unconditionally, which is correct only on a little-endian host:

typedef uint32_t ip4_addr_t;          /* network byte order */
typedef struct { uint8_t b[6]; } mac_addr_t;

#define htons(x)  __builtin_bswap16((uint16_t)(x))
#define ntohs(x)  __builtin_bswap16((uint16_t)(x))
#define htonl(x)  __builtin_bswap32((uint32_t)(x))
#define ntohl(x)  __builtin_bswap32((uint32_t)(x))
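
A short sketch of how these macros are typically used when moving a 16-bit field between host variables and a wire buffer. The helpers put_port() and get_port() are hypothetical and exist only for this example.

```c
#include <stdint.h>
#include <string.h>

#define htons(x)  __builtin_bswap16((uint16_t)(x))
#define ntohs(x)  __builtin_bswap16((uint16_t)(x))
#define htonl(x)  __builtin_bswap32((uint32_t)(x))
#define ntohl(x)  __builtin_bswap32((uint32_t)(x))

/* Hypothetical helper: store a host-order port into a wire buffer. */
void put_port(uint8_t *wire, uint16_t host_port)
{
    uint16_t be = htons(host_port);   /* big-endian image on an LE host */
    memcpy(wire, &be, sizeof be);     /* memcpy avoids unaligned stores */
}

/* Hypothetical helper: read a port back out of a wire buffer. */
uint16_t get_port(const uint8_t *wire)
{
    uint16_t be;
    memcpy(&be, wire, sizeof be);
    return ntohs(be);
}
```

On a little-endian host, put_port(buf, 8080) leaves the bytes 0x1F 0x90 in buf, which is the order they appear on the wire.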

Checksum Implementation

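The one’s-complement checksum used by IP, TCP, UDP, and ICMP is implemented as a two-phase operation so that a single checksum can cover non-contiguous regions: net_checksum() produces a partial sum for each region, and net_checksum_finish() turns the accumulated sum into the final 16-bit value: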

uint32_t net_checksum(const void *data, uint32_t len);
uint16_t net_checksum_finish(uint32_t sum);

Usage pattern for TCP (pseudo-header + segment):

tcp_pseudo_hdr_t ph;
ph.src     = conn->local_ip;
ph.dst     = conn->remote_ip;
ph.zero    = 0;
ph.proto   = IP_PROTO_TCP;
ph.tcp_len = htons(tcp_len);

uint32_t sum = 0;
sum += net_checksum(&ph, sizeof(ph));
sum += net_checksum(s_tcp_buf, tcp_len);
hdr->checksum = net_checksum_finish(sum);
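
For readers who want to see what the two phases do, here is a minimal sketch in the style of RFC 1071. It is not the code in kernel/net/; in particular, this version returns the checksum as a host-order number, whereas the real net_checksum_finish() may return it ready to store in network byte order. The pseudo-header field types are likewise assumed.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint32_t ip4_addr_t;

/* Assumed layout; 12 bytes, so the partial sum ends on a word boundary. */
typedef struct {
    ip4_addr_t src, dst;
    uint8_t    zero, proto;
    uint16_t   tcp_len;
} tcp_pseudo_hdr_t;
_Static_assert(sizeof(tcp_pseudo_hdr_t) == 12, "pseudo-header must be 12 bytes");

/* Phase 1: add up 16-bit big-endian words without folding carries. */
uint32_t net_checksum(const void *data, uint32_t len)
{
    const uint8_t *p = data;
    uint32_t sum = 0;

    while (len >= 2) {
        sum += ((uint32_t)p[0] << 8) | p[1];
        p   += 2;
        len -= 2;
    }
    if (len == 1)
        sum += (uint32_t)p[0] << 8;   /* pad a trailing odd byte with zero */
    return sum;
}

/* Phase 2: fold carries back into 16 bits, then take one's complement. */
uint16_t net_checksum_finish(uint32_t sum)
{
    while (sum >> 16)
        sum = (sum & 0xFFFFu) + (sum >> 16);
    return (uint16_t)~sum;
}
```

Adding partial sums is only valid when every region except the last has an even length; the 12-byte pseudo-header satisfies this. For the eight bytes 00 01 f2 03 f4 f5 f6 f7 used as the example in RFC 1071, this sketch yields 0x220d whether the bytes are summed in one call or as two 4-byte regions.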

Lock Ordering

The network stack uses spinlocks with IRQ save/restore. Lock ordering is critical to avoid deadlocks, particularly because protocol processing can be re-entered from the PIT ISR. When more than one lock is held, a lock further to the left in the chain below is acquired first:

sock_lock > tcp_lock > udp_lock > arp_lock > ip_lock > netdev_lock

Key lock ordering constraints (each label names the inverted acquisition order that the code is structured to avoid):

  • tcp_lock -> sock_lock: TCP RX processing collects socket wake events under tcp_lock, then releases it before calling sock_wake() (which acquires sock_lock). This prevents tcp_lock -> sock_lock ordering inversions.

  • ip_lock -> arp_lock: ip_send() copies the assembled packet to a local buffer and snapshots the IP configuration under ip_lock, then releases it before calling arp_resolve() (which acquires arp_lock).

  • g_in_netdev_poll flag: When set (inside PIT ISR), arp_resolve() returns -1 immediately instead of blocking. Blocking inside the ISR while holding netdev_lock would deadlock the next PIT tick.
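
The "collect under the lock, act after releasing it" pattern from the first constraint can be sketched in a few lines. Everything here is a toy model: the locks are plain flags so the example runs in a single thread, and tcp_rx_sketch(), WAKE_MAX, and the counters are hypothetical names, not part of tcp.c.

```c
#include <stdint.h>

#define WAKE_MAX 8

/* Toy lock: a flag is enough to show the ordering in one thread. */
static int tcp_lock_held;
static int inversions;        /* times sock_wake() ran under tcp_lock */
static int wakes_delivered;

/* Stand-in for the real sock_wake(), which would acquire sock_lock. */
static void sock_wake(int sock_id)
{
    if (tcp_lock_held)
        inversions++;         /* would violate sock_lock > tcp_lock */
    (void)sock_id;
    wakes_delivered++;
}

/* Record which sockets need waking under tcp_lock; wake them afterwards. */
void tcp_rx_sketch(const int *ready_socks, int n)
{
    int pending[WAKE_MAX];
    int count = 0;

    tcp_lock_held = 1;                         /* spin_lock_irqsave(...)      */
    for (int i = 0; i < n && count < WAKE_MAX; i++)
        pending[count++] = ready_socks[i];     /* state machine work goes here */
    tcp_lock_held = 0;                         /* spin_unlock_irqrestore(...) */

    for (int i = 0; i < count; i++)
        sock_wake(pending[i]);                 /* tcp_lock no longer held */
}
```

The same shape applies to ip_send(): snapshot what is needed under ip_lock, release it, then call arp_resolve().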

Loopback

Packets destined for the local IP or 127.0.0.0/8 are queued in a ring buffer rather than sent to the NIC:

#define LO_RING_SIZE 8
#define LO_PKT_MAX  1500
static uint8_t  s_lo_ring[LO_RING_SIZE][LO_PKT_MAX];
static uint16_t s_lo_len[LO_RING_SIZE];
static uint32_t s_lo_head, s_lo_tail;

ip_loopback_poll() drains this queue at 100 Hz, calling ip_rx() for each queued packet. This deferred delivery prevents re-entrant crashes from synchronous loopback (e.g., ip_send(SYN) -> ip_rx -> tcp_rx -> ip_send(SYN-ACK) -> ip_rx -> ... clobbering static buffers).
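
A minimal sketch of how a ring with these declarations can be filled and drained. The function names lo_enqueue() and lo_poll(), the free-running head/tail convention, and the drop-when-full policy are assumptions; locking is omitted, and count_bytes() is a stand-in for ip_rx().

```c
#include <stdint.h>
#include <string.h>

#define LO_RING_SIZE 8
#define LO_PKT_MAX  1500
static uint8_t  s_lo_ring[LO_RING_SIZE][LO_PKT_MAX];
static uint16_t s_lo_len[LO_RING_SIZE];
static uint32_t s_lo_head, s_lo_tail;   /* free-running; head = producer (assumed) */

/* Queue a copy of pkt. Returns 0, or -1 if it is oversized or the ring is full. */
int lo_enqueue(const void *pkt, uint16_t len)
{
    if (len > LO_PKT_MAX || s_lo_head - s_lo_tail >= LO_RING_SIZE)
        return -1;
    uint32_t slot = s_lo_head % LO_RING_SIZE;
    memcpy(s_lo_ring[slot], pkt, len);
    s_lo_len[slot] = len;
    s_lo_head++;
    return 0;
}

/* Deliver everything queued before this call; returns the packet count. */
int lo_poll(void (*deliver)(const uint8_t *pkt, uint16_t len))
{
    int n = 0;
    uint32_t stop = s_lo_head;   /* packets queued during delivery wait a tick */
    while (s_lo_tail != stop) {
        uint32_t slot = s_lo_tail % LO_RING_SIZE;
        deliver(s_lo_ring[slot], s_lo_len[slot]);
        s_lo_tail++;             /* free the slot only after delivery */
        n++;
    }
    return n;
}

/* Example consumer standing in for ip_rx(): it just totals the bytes seen. */
static uint32_t s_delivered_bytes;
void count_bytes(const uint8_t *pkt, uint16_t len)
{
    (void)pkt;
    s_delivered_bytes += len;
}
```

Snapshotting the head before draining means a reply generated while handling a packet (for example a SYN-ACK) is delivered on the next tick rather than recursively.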

Static Buffers

The network stack uses static, file-scoped buffers throughout instead of dynamic allocation:

Buffer      Size        Location  Purpose
s_tx_buf    1514 B      eth.c     Ethernet frame assembly
s_ip_buf    1500 B      ip.c      IP packet assembly
s_tcp_buf   1480 B      tcp.c     TCP segment assembly
s_udp_buf   1480 B      udp.c     UDP packet assembly
s_icmp_buf  1480 B      ip.c      ICMP reply assembly
s_lo_ring   8 x 1500 B  ip.c      Loopback queue

All TX paths are serialized by their respective spinlocks, so each static buffer has at most one writer at a time.

Limitations

  • No IP fragmentation: ip_send() rejects payloads larger than 1480 bytes (1500 MTU minus 20-byte IP header). A TCP segment can therefore carry at most 1460 bytes of data (1480 minus a 20-byte TCP header with no options).
  • No IPv6: The stack is IPv4-only.
  • No multicast routing: Multicast frames are accepted at the Ethernet level but not routed.
  • No IP options: Only the minimum 20-byte IP header (IHL=5) is supported.
  • Polling latency: Up to 10 ms between packet arrival and processing (100 Hz PIT).
  • Static table sizes: 4 netdevs, 16 ARP entries, 32 TCP connections, 16 UDP bindings, 64 sockets.
  • No congestion control: TCP sends at line rate with no slow-start, congestion avoidance, or fast retransmit.
  • No TLS/encryption: All traffic is plaintext. There is no in-kernel cryptographic support.

Security Considerations

The entire network stack is C code operating on untrusted data from the wire. As v1 software, it has not undergone formal security review and should be assumed to contain vulnerabilities typical of C network code at this maturity level:

  • Buffer overflows: Static buffers are used throughout. While length checks exist at each layer, the absence of memory-safe language guarantees means any single missed check could be exploitable.
  • Integer overflow: Length fields are uint16_t and combined across layers. Wrapping arithmetic on crafted packets could bypass bounds checks.
  • ARP spoofing mitigation is partial: The anti-spoofing logic (reject unsolicited replies) helps, but an attacker who answers while a legitimate request is pending can still win the race.
  • No rate limiting: There is no protection against SYN floods, ARP storms, or other denial-of-service traffic patterns.
  • Static buffers as shared state: The TX path uses file-scoped static buffers protected by spinlocks, but lock ordering bugs could lead to data corruption under adversarial timing.

These are not hypothetical concerns documented for completeness – they are the expected reality of a from-scratch C network stack that has not yet been hardened. The long-term plan is to gradually migrate the network stack to Rust as part of the broader kernel Rust migration (see Capability Model for the first component already in Rust).
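
To make the integer-overflow item concrete, the following sketch contrasts a bounds check that can be bypassed by 16-bit wraparound with one that cannot. It is a generic illustration, not code taken from the Aegis tree; fits_unsafe(), fits_safe(), and BUF_SIZE are hypothetical names.

```c
#include <stdint.h>

#define BUF_SIZE 1500u

/* BROKEN on purpose: the sum is truncated to 16 bits before the compare. */
int fits_unsafe(uint16_t hdr_len, uint16_t payload_len)
{
    uint16_t total = (uint16_t)(hdr_len + payload_len);   /* may wrap */
    return total <= BUF_SIZE;
}

/* Safer: compare against the remaining space, never forming the sum. */
int fits_safe(uint16_t hdr_len, uint16_t payload_len)
{
    return hdr_len <= BUF_SIZE && payload_len <= BUF_SIZE - hdr_len;
}
```

With hdr_len = 20 and a crafted payload_len = 65530, the truncated sum is 14, so fits_unsafe() accepts the packet while fits_safe() rejects it.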