TCP/IP Implementation

This document covers the transport and network layer protocols implemented in the Aegis kernel: TCP (RFC 793), UDP, IPv4, ICMP echo, and ARP.

v1 maturity notice: These protocol implementations are written in C and parse untrusted network data. As first-version code, they should be assumed to contain exploitable vulnerabilities – this is the reality of any from-scratch protocol stack at this stage, not a hypothetical concern. There has been no formal security audit of the packet parsing or state machine logic. The planned C-to-Rust migration will eventually cover the network protocols, but this work has not yet started for kernel/net/. Contributions are welcome – file issues or propose changes at exec/aegis.

IPv4 Layer (ip.c)

IP Configuration

The kernel maintains a single static IP configuration protected by ip_lock:

static ip4_addr_t s_my_ip;
static ip4_addr_t s_netmask;
static ip4_addr_t s_gateway;

Configuration is set via net_set_config() and queried via net_get_config(). Both acquire ip_lock with IRQ save/restore.

IP Header

The stack uses a fixed 20-byte IPv4 header with no options (IHL=5):

typedef struct __attribute__((packed)) {
    uint8_t    ver_ihl;     /* 0x45: version 4, IHL 5 */
    uint8_t    dscp_ecn;
    uint16_t   total_len;   /* network byte order */
    uint16_t   id;
    uint16_t   flags_frag;
    uint8_t    ttl;         /* always 64 */
    uint8_t    proto;       /* 1=ICMP, 6=TCP, 17=UDP */
    uint16_t   checksum;
    ip4_addr_t src;
    ip4_addr_t dst;
} ip_hdr_t;

Sending (ip_send)

ip_send() handles both loopback and real NIC transmission:

ip_send(dev, dst_ip, proto, payload, len)
    |
    +-- len > 1480? → return -1 (no fragmentation)
    |
    +-- dst == my_ip || 127.0.0.0/8?
    |       → Queue in loopback ring (deferred delivery)
    |       → return 0
    |
    +-- Build IP header in s_ip_buf
    |   - ver_ihl=0x45, ttl=64, proto=proto
    |   - Compute header checksum
    |
    +-- Copy packet to stack-local buffer
    |   Release ip_lock (avoid lock ordering inversion with arp_lock)
    |
    +-- Determine next-hop:
    |   - 255.255.255.255 → broadcast MAC (ff:ff:ff:ff:ff:ff)
    |   - Same subnet → ARP resolve dst_ip directly
    |   - Different subnet → ARP resolve gateway
    |
    +-- eth_send(dev, next_hop_mac, ETHERTYPE_IP, packet, total)

Lock ordering detail: ip_send() must release ip_lock before calling arp_resolve(). The lock order is arp_lock > ip_lock, so acquiring arp_lock while still holding ip_lock would invert it. The assembled packet and IP configuration are copied to local variables before the lock is released.
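
The next-hop decision in the flow above can be sketched as follows. This is a minimal illustration rather than the kernel's code: it assumes addresses are plain host-byte-order uint32_t values (the real ip4_addr_t may differ), and next_hop() and hop_kind_t are hypothetical names introduced here.

```c
#include <stdint.h>

/* Which kind of next hop was chosen (hypothetical enum for this sketch). */
typedef enum { HOP_BROADCAST, HOP_DIRECT, HOP_GATEWAY } hop_kind_t;

/* Picks the IP address whose MAC must be used for the Ethernet frame.
 * All addresses are assumed to be host-byte-order 32-bit values. */
static hop_kind_t next_hop(uint32_t my_ip, uint32_t netmask, uint32_t gateway,
                           uint32_t dst, uint32_t *hop_ip)
{
    if (dst == 0xFFFFFFFFu) {                     /* limited broadcast */
        *hop_ip = dst;                            /* no ARP: ff:ff:ff:ff:ff:ff */
        return HOP_BROADCAST;
    }
    if ((dst & netmask) == (my_ip & netmask)) {   /* same subnet */
        *hop_ip = dst;                            /* ARP for dst itself */
        return HOP_DIRECT;
    }
    *hop_ip = gateway;                            /* off-link: ARP for gateway */
    return HOP_GATEWAY;
}
```

With my_ip 10.0.0.2/24 and gateway 10.0.0.1, a packet to 10.0.0.99 resolves 10.0.0.99 directly, while a packet to 8.8.8.8 resolves the gateway.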

Receiving (ip_rx)

ip_rx() is called by eth_rx() for Ethernet frames with ethertype 0x0800:

  1. Validate header: version=4, IHL=5, checksum
  2. Destination filter: Accept if destination matches:
    • Local IP (s_my_ip)
    • Loopback range (127.0.0.0/8)
    • Limited broadcast (255.255.255.255)
    • Subnet broadcast (my_ip | ~netmask)
  3. Dispatch on protocol field:
    • IP_PROTO_ICMP (1) -> icmp_rx()
    • IP_PROTO_TCP (6) -> tcp_rx()
    • IP_PROTO_UDP (17) -> udp_rx()
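
The first two receive steps can be illustrated with a short sketch. It is not taken from ip.c: inet_checksum(), ip_header_ok(), and ip_accept_dst() are hypothetical helper names, and addresses are assumed to be host-byte-order uint32_t values. The checksum follows the standard one's-complement algorithm (RFC 1071); a header whose stored checksum is correct sums to zero.

```c
#include <stddef.h>
#include <stdint.h>

/* One's-complement Internet checksum over big-endian 16-bit words. */
static uint16_t inet_checksum(const uint8_t *data, size_t len)
{
    uint32_t sum = 0;
    size_t i;

    for (i = 0; i + 1 < len; i += 2)
        sum += (uint32_t)((data[i] << 8) | data[i + 1]);
    if (len & 1)
        sum += (uint32_t)(data[len - 1] << 8);   /* pad odd byte with zero */
    while (sum >> 16)                            /* fold carries back in */
        sum = (sum & 0xFFFFu) + (sum >> 16);
    return (uint16_t)~sum;
}

/* Step 1: version 4, IHL 5, and a header checksum that verifies to 0. */
static int ip_header_ok(const uint8_t *hdr, size_t len)
{
    if (len < 20)
        return 0;                 /* too short for a fixed header */
    if (hdr[0] != 0x45)
        return 0;                 /* wrong version or options present */
    return inet_checksum(hdr, 20) == 0;
}

/* Step 2: destination filter (addresses in host byte order). */
static int ip_accept_dst(uint32_t dst, uint32_t my_ip, uint32_t netmask)
{
    return dst == my_ip                    /* local IP */
        || (dst >> 24) == 127              /* 127.0.0.0/8 */
        || dst == 0xFFFFFFFFu              /* limited broadcast */
        || dst == (my_ip | ~netmask);      /* subnet broadcast */
}
```

Checking the length before touching any header byte is the important part: every later field access relies on it.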

Loopback

Loopback uses an 8-slot ring buffer to decouple delivery from the send path:

#define LO_RING_SIZE 8
#define LO_PKT_MAX  1500
static uint8_t  s_lo_ring[LO_RING_SIZE][LO_PKT_MAX];

ip_loopback_poll() drains the ring at 100 Hz from the PIT handler. Without this deferred delivery, a synchronous loopback TCP handshake would clobber the static packet buffers through recursive ip_send() calls.
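
A minimal sketch of the deferred-delivery idea follows. It is an illustration under assumptions, not the ip.c implementation: lo_enqueue(), lo_poll(), the head/tail counters, and the count_deliver() callback are hypothetical, and the locking a kernel would need is omitted.

```c
#include <stdint.h>
#include <string.h>

#define LO_RING_SIZE 8
#define LO_PKT_MAX   1500

static uint8_t  lo_ring[LO_RING_SIZE][LO_PKT_MAX];
static uint16_t lo_len[LO_RING_SIZE];
static unsigned lo_head, lo_tail;   /* free-running: head reads, tail writes */

/* Send path: copy the packet and return. Never calls the RX path. */
static int lo_enqueue(const void *pkt, uint16_t len)
{
    if (len > LO_PKT_MAX || lo_tail - lo_head == LO_RING_SIZE)
        return -1;                          /* oversize or ring full */
    unsigned slot = lo_tail % LO_RING_SIZE;
    memcpy(lo_ring[slot], pkt, len);
    lo_len[slot] = len;
    lo_tail++;
    return 0;
}

/* Tick path: deliver what was queued when the poll started. Packets
 * queued by the deliver callback itself wait for the next poll. */
static int lo_poll(void (*deliver)(const uint8_t *pkt, uint16_t len))
{
    int n = 0;
    unsigned end = lo_tail;                 /* snapshot of the write index */
    while (lo_head != end) {
        unsigned slot = lo_head % LO_RING_SIZE;
        deliver(lo_ring[slot], lo_len[slot]);
        lo_head++;                          /* free the slot after delivery */
        n++;
    }
    return n;
}

/* Example callback for demonstration: counts delivered bytes. */
static unsigned delivered_bytes;
static void count_deliver(const uint8_t *pkt, uint16_t len)
{
    (void)pkt;
    delivered_bytes += len;
}
```

Because lo_enqueue() only copies, a reply generated while a packet is being delivered lands in the ring instead of re-entering the send path on the same stack.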

ICMP (ip.c)

Only ICMP echo is implemented. The handler in icmp_rx():

  • Type 8 (Echo Request): Copies the request, changes type to 0, recomputes checksum, sends reply via ip_send().
  • Type 0 (Echo Reply): Silently dropped (no ping client in kernel).
  • All other types: Dropped.
typedef struct __attribute__((packed)) {
    uint8_t  type;
    uint8_t  code;
    uint16_t checksum;
    uint16_t id;
    uint16_t seq;
} icmp_hdr_t;
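
The echo-reply path described above (copy, set type to 0, recompute checksum) might look like the sketch below. It assumes the ICMP message is handled as a raw byte buffer; icmp_make_echo_reply() and inet_checksum() are hypothetical names, and validation of the request's own checksum is left out.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* One's-complement Internet checksum over big-endian 16-bit words. */
static uint16_t inet_checksum(const uint8_t *data, size_t len)
{
    uint32_t sum = 0;
    size_t i;

    for (i = 0; i + 1 < len; i += 2)
        sum += (uint32_t)((data[i] << 8) | data[i + 1]);
    if (len & 1)
        sum += (uint32_t)(data[len - 1] << 8);
    while (sum >> 16)
        sum = (sum & 0xFFFFu) + (sum >> 16);
    return (uint16_t)~sum;
}

/* Builds an echo reply in `out` from the request in `req`.
 * Returns the reply length, or -1 if the input is not usable. */
static int icmp_make_echo_reply(const uint8_t *req, size_t len,
                                uint8_t *out, size_t out_cap)
{
    if (len < 8 || len > out_cap)
        return -1;                 /* shorter than an ICMP echo header */
    if (req[0] != 8)
        return -1;                 /* not an Echo Request */

    memcpy(out, req, len);         /* id, seq, and payload are echoed back */
    out[0] = 0;                    /* type 0: Echo Reply */
    out[2] = 0;                    /* zero the checksum field before summing */
    out[3] = 0;
    uint16_t c = inet_checksum(out, len);
    out[2] = (uint8_t)(c >> 8);    /* store in network byte order */
    out[3] = (uint8_t)(c & 0xFF);
    return (int)len;
}
```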

Ethernet Layer (eth.c)

Frame Format

+--------+--------+----------+---------+
| Dst MAC| Src MAC| EtherType| Payload |
| 6 bytes| 6 bytes| 2 bytes  | 46-1500 |
+--------+--------+----------+---------+
         14-byte header
typedef struct __attribute__((packed)) {
    mac_addr_t  dst;
    mac_addr_t  src;
    uint16_t    ethertype;   /* network byte order */
} eth_hdr_t;

Sending (eth_send)

eth_send() assembles an Ethernet frame in a static 1514-byte buffer under arp_lock:

int eth_send(netdev_t *dev, const mac_addr_t *dst_mac,
             uint16_t ethertype, const void *payload, uint16_t len)
  • Returns -1 if dev is NULL or len > 1500
  • Copies the 6-byte device MAC from dev->mac as the source address
  • Calls dev->send() with the complete frame

Receiving (eth_rx)

eth_rx() dispatches on ethertype:

EtherType        Handler
0x0800 (IPv4)    ip_rx()
0x0806 (ARP)     arp_rx_pkt()
Other            Drop silently

ARP (eth.c)

ARP Table

The ARP cache is a static table of 16 entries:

typedef struct {
    ip4_addr_t ip;
    mac_addr_t mac;
    uint32_t   age;      /* PIT ticks since last use */
    uint8_t    valid;
    uint8_t    resolved; /* 0 = pending request, 1 = reply received */
} arp_entry_t;

static arp_entry_t s_arp_table[ARP_TABLE_SIZE];

Anti-Spoofing

The ARP implementation includes protection against unsolicited ARP replies (cache poisoning):

  1. arp_insert_pending(): Before sending an ARP request, a pending entry is created in the table with resolved=0.
  2. arp_rx_pkt(): Only updates entries that already exist in the table (i.e., entries we explicitly requested). Unsolicited replies are silently dropped.
static void arp_rx_pkt(const arp_pkt_t *pkt)
{
    /* ... validation ... */
    if (ntohs(pkt->oper) != 2) return;  /* only cache REPLY */
    
    arp_entry_t *e = arp_find(pkt->spa);
    if (!e) return;  /* no pending entry -- unsolicited reply, drop */
    e->mac      = pkt->sha;
    e->resolved = 1;
}
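
To show how the two steps interact, here is a self-contained sketch of a pending-entry table. The type layouts are assumptions (ip4_addr_t as a host-order uint32_t, mac_addr_t as a 6-byte struct), the function bodies are illustrative rather than copied from eth.c, and arp_on_reply() is a hypothetical stand-in for the caching part of arp_rx_pkt().

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define ARP_TABLE_SIZE 16

typedef uint32_t ip4_addr_t;                    /* assumed representation */
typedef struct { uint8_t b[6]; } mac_addr_t;    /* assumed representation */

typedef struct {
    ip4_addr_t ip;
    mac_addr_t mac;
    uint8_t    valid;
    uint8_t    resolved;   /* 0 = pending request, 1 = reply received */
} arp_entry_t;

static arp_entry_t table[ARP_TABLE_SIZE];

static arp_entry_t *arp_find(ip4_addr_t ip)
{
    for (int i = 0; i < ARP_TABLE_SIZE; i++)
        if (table[i].valid && table[i].ip == ip)
            return &table[i];
    return NULL;
}

/* Called before an ARP request goes out. Returns -1 if the table is
 * full; a real table would need an eviction policy here. */
static int arp_insert_pending(ip4_addr_t ip)
{
    if (arp_find(ip))
        return 0;                               /* already tracked */
    for (int i = 0; i < ARP_TABLE_SIZE; i++) {
        if (!table[i].valid) {
            memset(&table[i], 0, sizeof table[i]);
            table[i].ip    = ip;
            table[i].valid = 1;                 /* resolved stays 0 */
            return 0;
        }
    }
    return -1;
}

/* Reply handling: returns 1 if cached, 0 if dropped as unsolicited. */
static int arp_on_reply(ip4_addr_t spa, mac_addr_t sha)
{
    arp_entry_t *e = arp_find(spa);
    if (!e)
        return 0;                               /* we never asked for this IP */
    e->mac      = sha;
    e->resolved = 1;
    return 1;
}
```

A reply for an address that was never requested is dropped, while the same reply is accepted once arp_insert_pending() has created the entry.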

ARP Resolution (arp_resolve)

arp_resolve() is the core function for mapping an IP address to a MAC address:

arp_resolve(dev, ip, mac_out)
    |
    +-- Cache hit (resolved)? → return MAC immediately
    |
    +-- Send ARP request (creates pending entry)
    |
    +-- Called from ISR (g_in_netdev_poll)?
    |       → return -1 (caller retries on next tick)
    |
    +-- Syscall context: busy-poll loop
    |   - arch_wait_for_irq() yields to QEMU SLIRP
    |   - dev->poll() processes pending RX frames
    |   - Check cache for resolved entry
    |   - Up to 500 iterations (~5 seconds)
    |
    +-- Timeout → return -1

The ISR/syscall distinction is critical: blocking inside the PIT ISR while holding netdev_lock would deadlock. The g_in_netdev_poll volatile flag signals this context.

ARP Packet Format

typedef struct __attribute__((packed)) {
    uint16_t   htype;   /* 1 = Ethernet */
    uint16_t   ptype;   /* 0x0800 = IPv4 */
    uint8_t    hlen;    /* 6 */
    uint8_t    plen;    /* 4 */
    uint16_t   oper;    /* 1 = REQUEST, 2 = REPLY */
    mac_addr_t sha;     /* sender hardware address */
    ip4_addr_t spa;     /* sender protocol address */
    mac_addr_t tha;     /* target hardware address */
    ip4_addr_t tpa;     /* target protocol address */
} arp_pkt_t;

TCP (tcp.c)

Connection Table

TCP connections are stored in a static array of 32 slots:

#define TCP_MAX_CONNS  32
#define TCP_RBUF_SIZE  16384   /* 16 KB receive buffer per connection */
#define TCP_SBUF_SIZE  8192    /* 8 KB send buffer per connection */

typedef struct {
    tcp_state_t state;
    ip4_addr_t  local_ip,  remote_ip;
    uint16_t    local_port, remote_port;
    netdev_t   *dev;
    uint32_t    snd_nxt, snd_una;
    uint32_t    rcv_nxt;
    uint16_t    snd_wnd;
    uint8_t     rbuf[TCP_RBUF_SIZE];
    uint32_t    rbuf_head, rbuf_tail;
    uint8_t     sbuf[TCP_SBUF_SIZE];
    uint32_t    sbuf_head, sbuf_tail;
    uint32_t    retransmit_at;
    uint8_t     retransmit_count;
    uint32_t    timewait_at;
    uint32_t    sock_id;
    uint32_t    listener_id;
} tcp_conn_t;

State Machine

The TCP implementation follows the RFC 793 state machine:

                              +---------+
                              |  CLOSED |
                              +---------+
                  passive open /         \ active open
                  tcp_listen  /           \ tcp_connect
                             v             v
                      +---------+    +-----------+
                      | LISTEN  |    | SYN_SENT  |
                      +---------+    +-----------+
                 rcv SYN /                \ rcv SYN+ACK
            send SYN+ACK/                  \ send ACK
                       v                    v
                  +-----------+      +-------------+
                  | SYN_RCVD  |      | ESTABLISHED |
                  +-----------+      +-------------+
                 rcv ACK /            close /     \ rcv FIN
                        /         send FIN /       \ send ACK
                       v                  v         v
                +-------------+   +------------+ +-----------+
                | ESTABLISHED |   | FIN_WAIT_1 | | CLOSE_WAIT|
                +-------------+   +------------+ +-----------+
                                  rcv ACK / \ rcv FIN   \ close
                                         /   \ send ACK  \ send FIN
                                        v     v           v
                                 +----------+ +--------+ +----------+
                                 |FIN_WAIT_2| |CLOSING | | LAST_ACK |
                                 +----------+ +--------+ +----------+
                                  rcv FIN /    rcv ACK \    rcv ACK \
                                 send ACK/              \            \
                                        v                v            v
                                  +-----------+    +---------+  +---------+
                                  | TIME_WAIT |    |TIME_WAIT|  | CLOSED  |
                                  +-----------+    +---------+  +---------+
                                   2MSL timeout         |
                                        |               |
                                        v               v
                                     CLOSED           CLOSED

All states are defined in tcp.h:

typedef enum {
    TCP_CLOSED,
    TCP_LISTEN,
    TCP_SYN_RCVD,
    TCP_SYN_SENT,
    TCP_ESTABLISHED,
    TCP_FIN_WAIT_1,
    TCP_FIN_WAIT_2,
    TCP_CLOSING,
    TCP_CLOSE_WAIT,
    TCP_LAST_ACK,
    TCP_TIME_WAIT
} tcp_state_t;

Sequence Number Arithmetic

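TCP sequence numbers are 32-bit and wrap around at 2^32, so a plain unsigned comparison gives wrong results across the wrap. The implementation compares the signed 32-bit difference instead, which is correct modular arithmetic as long as the two values are less than 2^31 apart: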

static inline int seq_lt(uint32_t a, uint32_t b) {
    return (int32_t)(a - b) < 0;
}
static inline int seq_le(uint32_t a, uint32_t b) {
    return (int32_t)(a - b) <= 0;
}
static inline int seq_gt(uint32_t a, uint32_t b) {
    return (int32_t)(a - b) > 0;
}
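
A short worked example makes the difference visible. The three helpers are repeated so the block stands alone; ack_in_window() is a hypothetical helper added for illustration and is not claimed to exist in tcp.c.

```c
#include <stdint.h>

static inline int seq_lt(uint32_t a, uint32_t b) { return (int32_t)(a - b) <  0; }
static inline int seq_le(uint32_t a, uint32_t b) { return (int32_t)(a - b) <= 0; }
static inline int seq_gt(uint32_t a, uint32_t b) { return (int32_t)(a - b) >  0; }

/* Hypothetical helper: does `ack` acknowledge new data that was really
 * sent, i.e. snd_una < ack <= snd_nxt in sequence space? */
static int ack_in_window(uint32_t snd_una, uint32_t ack, uint32_t snd_nxt)
{
    return seq_gt(ack, snd_una) && seq_le(ack, snd_nxt);
}
```

Take a = 0xFFFFFFF0 and b = 0x00000010, which is 32 bytes later in sequence space. The plain test a < b is false, but a - b is 0xFFFFFFE0, which reads as -32 when cast to int32_t, so seq_lt(a, b) is true as intended.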

TCP Header

typedef struct __attribute__((packed)) {
    uint16_t src_port;
    uint16_t dst_port;
    uint32_t seq;
    uint32_t ack;
    uint8_t  data_off;   /* upper 4 bits = header length / 4 */
    uint8_t  flags;
    uint16_t window;
    uint16_t checksum;
    uint16_t urgent;
} tcp_hdr_t;

#define TCP_FIN 0x01
#define TCP_SYN 0x02
#define TCP_RST 0x04
#define TCP_PSH 0x08
#define TCP_ACK 0x10
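
Because data_off is attacker-controlled, it has to be checked against the segment length before the payload is located. The sketch below shows one way to do that on a raw byte buffer; tcp_payload_offset() and tcp_is_syn_only() are hypothetical helpers, not functions known to exist in tcp.c.

```c
#include <stddef.h>
#include <stdint.h>

#define TCP_FIN 0x01
#define TCP_SYN 0x02
#define TCP_RST 0x04
#define TCP_ACK 0x10

/* Returns the byte offset of the payload within the segment, or -1 if
 * the header length field is inconsistent with the segment length. */
static int tcp_payload_offset(const uint8_t *seg, size_t seg_len)
{
    if (seg_len < 20)
        return -1;                              /* shorter than a base header */
    unsigned hdr_len = (unsigned)(seg[12] >> 4) * 4u;   /* 32-bit words to bytes */
    if (hdr_len < 20 || hdr_len > seg_len)
        return -1;                              /* bogus or overlong data offset */
    return (int)hdr_len;
}

/* True for a pure SYN (no ACK, RST, or FIN), as sent by an active opener. */
static int tcp_is_syn_only(uint8_t flags)
{
    return (flags & (TCP_SYN | TCP_ACK | TCP_RST | TCP_FIN)) == TCP_SYN;
}
```

Byte 12 of the segment holds data_off and byte 13 holds the flags, matching the field order of tcp_hdr_t.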

Segment Sending (tcp_send_segment)

Builds a TCP segment in the static s_tcp_buf under tcp_lock:

  1. Fill header fields from tcp_conn_t
  2. Compute window advertisement based on available receive buffer space
  3. Compute checksum over pseudo-header + TCP segment
  4. Call ip_send()
int tcp_send_segment(netdev_t *dev, tcp_conn_t *conn,
                     uint8_t flags, const void *payload, uint16_t len);

The window advertisement reflects actual receive buffer availability:

uint32_t used  = (conn->rbuf_tail - conn->rbuf_head) & (TCP_RBUF_SIZE - 1);
uint32_t avail = TCP_RBUF_SIZE - used;
if (avail > 0xFFFFu) avail = 0xFFFFu;
hdr->window = htons((uint16_t)avail);
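
The same calculation, wrapped in a function so it can be checked with concrete numbers. tcp_adv_window() is a hypothetical name, and the sketch assumes rbuf_head and rbuf_tail are indices already reduced modulo the buffer size. The mask trick only works because TCP_RBUF_SIZE is a power of two.

```c
#include <stdint.h>

#define TCP_RBUF_SIZE 16384u   /* must be a power of two for the mask below */

/* Returns the window to advertise, in host byte order. */
static uint16_t tcp_adv_window(uint32_t rbuf_head, uint32_t rbuf_tail)
{
    /* Unsigned subtraction plus mask yields the fill level even when
     * tail has wrapped around and is numerically below head. */
    uint32_t used  = (rbuf_tail - rbuf_head) & (TCP_RBUF_SIZE - 1);
    uint32_t avail = TCP_RBUF_SIZE - used;
    if (avail > 0xFFFFu)
        avail = 0xFFFFu;       /* 16-bit window field, no window scaling */
    return (uint16_t)avail;
}
```

For head = 16000 and tail = 100 the buffer holds 384 + 100 = 484 bytes, so the advertised window is 16384 - 484 = 15900. With a 16 KB buffer the 0xFFFF clamp never triggers; it only matters if the buffer grows beyond 64 KB.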

Connection Lookup

Two lookup functions serve different purposes:

  • tcp_find(): Exact 4-tuple match (remote_ip, remote_port, local_ip, local_port) for established connections
  • tcp_find_listener(): Match on local_port only (local_ip=0 means INADDR_ANY) for LISTEN state connections
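
The two lookups can be sketched over a reduced connection record. This is an assumption-laden illustration: conn_t, find_exact(), and find_listener() are hypothetical stand-ins for tcp_conn_t, tcp_find(), and tcp_find_listener(), addresses are host-order uint32_t values, and the listener match also accepts a listener bound to the exact local IP, which the text above does not spell out.

```c
#include <stddef.h>
#include <stdint.h>

typedef enum { TCP_CLOSED, TCP_LISTEN, TCP_ESTABLISHED } tcp_state_t; /* subset */

typedef struct {
    tcp_state_t state;
    uint32_t    local_ip,  remote_ip;
    uint16_t    local_port, remote_port;
} conn_t;

#define TCP_MAX_CONNS 32
static conn_t conns[TCP_MAX_CONNS];

/* Exact 4-tuple match; skips free slots and listeners. */
static conn_t *find_exact(uint32_t rip, uint16_t rport,
                          uint32_t lip, uint16_t lport)
{
    for (int i = 0; i < TCP_MAX_CONNS; i++) {
        conn_t *c = &conns[i];
        if (c->state == TCP_CLOSED || c->state == TCP_LISTEN)
            continue;
        if (c->remote_ip == rip && c->remote_port == rport &&
            c->local_ip == lip && c->local_port == lport)
            return c;
    }
    return NULL;
}

/* Listener match on local port; local_ip == 0 acts as INADDR_ANY. */
static conn_t *find_listener(uint32_t lip, uint16_t lport)
{
    for (int i = 0; i < TCP_MAX_CONNS; i++) {
        conn_t *c = &conns[i];
        if (c->state != TCP_LISTEN || c->local_port != lport)
            continue;
        if (c->local_ip == 0 || c->local_ip == lip)
            return c;
    }
    return NULL;
}
```

An incoming segment is first tried against find_exact(); only a SYN that misses there falls through to find_listener().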

Receive Processing (tcp_rx)

tcp_rx() is the core state machine handler, called from ip_rx(). Key design:

  1. Checksum validation using TCP pseudo-header
  2. Connection lookup by 4-tuple
  3. New connection handling: If no match and SYN received, look for a listener. If no listener, send RST+ACK.
  4. State transitions: Per the RFC 793 state diagram
  5. Deferred socket operations: Wake events and accept queue pushes are collected under tcp_lock, then executed after release to prevent lock ordering inversions

The deferred wake pattern:

/* Collected under tcp_lock */
uint32_t wake_ids[TCP_RX_WAKE_MAX];
uint32_t wake_epoll_events[TCP_RX_WAKE_MAX];
uint32_t wake_count = 0;

/* ... state machine processing ... */

spin_unlock_irqrestore(&tcp_lock, fl);

/* Executed outside tcp_lock */
if (connect_sock_id != SOCK_NONE) {
    sock_t *sk = sock_get(connect_sock_id);
    if (sk) sk->state = SOCK_CONNECTED;
}
for (w = 0; w < wake_count; w++) {
    sock_wake(wake_ids[w]);
    if (wake_epoll_events[w] != 0)
        epoll_notify(wake_ids[w], wake_epoll_events[w]);
}

State Transitions in Detail

SYN_SENT (active open)

  • SYN+ACK received with matching ack == snd_nxt:
    • Transition to ESTABLISHED
    • Send ACK
    • Wake connect() caller with EPOLLOUT
    • Mark socket as SOCK_CONNECTED
  • RST received:
    • Transition to CLOSED
    • Wake connect() caller with EPOLLERR

SYN_RCVD (passive open)

  • ACK received with ack == snd_nxt:
    • Transition to ESTABLISHED
    • Push to listener’s accept queue
    • Wake listener with EPOLLIN

ESTABLISHED

  • Data received (payload_len > 0, seq == rcv_nxt):
    • Copy to receive ring buffer (if space available)
    • Advance rcv_nxt
    • Send ACK
    • Wake recv() caller with EPOLLIN
  • FIN received:
    • Advance rcv_nxt by 1
    • Transition to CLOSE_WAIT
    • Send ACK
    • Wake with EPOLLHUP
  • RST received:
    • Transition to CLOSED
    • Wake blocked callers

FIN_WAIT_1 / FIN_WAIT_2 / CLOSING / LAST_ACK

These states follow the standard RFC 793 transitions. TIME_WAIT uses a 4-second timer, far shorter than the 2MSL wait RFC 793 specifies (MSL is 2 minutes there), which is acceptable for non-production use:

#define TCP_TIMEWAIT_TICKS  400   /* 4 s at 100 Hz */

Retransmit Timer (tcp_tick)

Called at 100 Hz from the PIT handler. For each non-CLOSED connection:

  • TIME_WAIT: Check if timewait_at has elapsed; transition to CLOSED
  • Retransmit: If retransmit_at has elapsed:
    • Increment retransmit_count
    • Double the RTO (exponential backoff): rto = TCP_RTO_INITIAL << retransmit_count
    • Cap at TCP_RTO_MAX (8 seconds)
    • Retransmit the appropriate segment (SYN, SYN+ACK)
    • After TCP_RETRANSMIT_MAX (3) retries: send RST, close, wake blocked callers
#define TCP_RTO_INITIAL     100   /* 1 s at 100 Hz */
#define TCP_RTO_MAX         800   /* 8 s at 100 Hz */
#define TCP_RETRANSMIT_MAX  3
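
The backoff schedule implied by these constants can be worked out with a small sketch. tcp_rto_ticks() and tick_elapsed() are hypothetical helpers; in particular, the wrap-safe deadline test is an assumption about how a 32-bit tick counter would be compared, not a description of tcp_tick().

```c
#include <stdint.h>

#define TCP_RTO_INITIAL 100u   /* 1 s at 100 Hz */
#define TCP_RTO_MAX     800u   /* 8 s at 100 Hz */

/* RTO in ticks after `retransmit_count` retransmissions: doubles each
 * time and saturates at TCP_RTO_MAX. The loop avoids shifting by a
 * large count, which would be undefined behaviour in C. */
static uint32_t tcp_rto_ticks(uint8_t retransmit_count)
{
    uint32_t rto = TCP_RTO_INITIAL;
    while (retransmit_count-- > 0 && rto < TCP_RTO_MAX)
        rto <<= 1;
    return rto > TCP_RTO_MAX ? TCP_RTO_MAX : rto;
}

/* Wrap-safe "has the deadline passed?" test for 32-bit tick counters. */
static int tick_elapsed(uint32_t now, uint32_t deadline)
{
    return (int32_t)(now - deadline) >= 0;
}
```

The resulting schedule is 100, 200, 400, 800 ticks (1 s, 2 s, 4 s, 8 s), so the cap is first reached at the third retransmission.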

Active Open (tcp_connect)

int tcp_connect(uint32_t sock_id, ip4_addr_t dst_ip, uint16_t dst_port,
                uint32_t *conn_id_out)
  1. Allocate a free connection slot
  2. Set state to TCP_SYN_SENT
  3. Assign ephemeral port: 49152 + (arch_get_ticks() & 0x3FFF)
  4. Set initial sequence number from arch_get_ticks()
  5. Release tcp_lock before sending (to avoid lock ordering issues)
  6. Send SYN segment
  7. Increment snd_nxt after sending (SYN must go out with seq=ISN)

The order of snd_nxt increment matters:

/* Historical bug: the snd_nxt increment used to come before
 * tcp_send_segment(), so the SYN went out with seq=ISN+1 and the
 * remote's ack=ISN+2 never matched snd_nxt=ISN+1. */
tcp_send_segment(dev, &s_tcp[i], TCP_SYN, NULL, 0);
s_tcp[i].snd_nxt++;  /* Now ISN+1; SYN_SENT handler matches ack=ISN+1 */

Passive Open (tcp_listen)

int tcp_listen(uint16_t port, uint32_t sock_id)

Creates a connection slot in TCP_LISTEN state. Incoming SYN packets matching the port trigger SYN_RCVD -> SYN+ACK -> (await ACK) -> ESTABLISHED.

Receive Buffer

Each TCP connection has a 16 KB circular receive buffer:

uint8_t  rbuf[TCP_RBUF_SIZE];   /* 16384 bytes */
uint32_t rbuf_head, rbuf_tail;

tcp_conn_recv() reads from this buffer:

  • max_len=0: Peek mode, returns available byte count
  • Returns -11 (EAGAIN) if buffer empty and connection alive
  • Returns 0 if buffer empty and FIN received (EOF)
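
The three return conventions can be demonstrated with a reduced model of the receive ring. Everything here is a sketch under assumptions: rx_t, rx_push(), and rx_recv() are hypothetical, indices are kept reduced modulo the buffer size, and one slot is left unused so that a full buffer is distinguishable from an empty one.

```c
#include <stdint.h>

#define RBUF_SIZE 16384u   /* power of two, so masking replaces modulo */

typedef struct {
    uint8_t  rbuf[RBUF_SIZE];
    uint32_t head, tail;       /* head: next read, tail: next write */
    int      fin_received;     /* set once the peer's FIN was processed */
} rx_t;

/* Producer side (what the ESTABLISHED handler would do on data). */
static int rx_push(rx_t *c, const uint8_t *src, uint32_t len)
{
    uint32_t used = (c->tail - c->head) & (RBUF_SIZE - 1);
    if (len > RBUF_SIZE - 1 - used)
        return -1;                               /* not enough space */
    for (uint32_t i = 0; i < len; i++) {
        c->rbuf[c->tail] = src[i];
        c->tail = (c->tail + 1) & (RBUF_SIZE - 1);
    }
    return 0;
}

/* Consumer side, following the conventions listed above. */
static int rx_recv(rx_t *c, uint8_t *dst, uint32_t max_len)
{
    uint32_t avail = (c->tail - c->head) & (RBUF_SIZE - 1);
    if (max_len == 0)
        return (int)avail;                       /* peek: byte count only */
    if (avail == 0)
        return c->fin_received ? 0 : -11;        /* EOF, or -EAGAIN */
    uint32_t n = avail < max_len ? avail : max_len;
    for (uint32_t i = 0; i < n; i++) {
        dst[i] = c->rbuf[c->head];
        c->head = (c->head + 1) & (RBUF_SIZE - 1);
    }
    return (int)n;
}
```

Buffered data is always returned before EOF is reported, so a FIN that arrives right after the last data segment does not cause data loss on the reading side.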

Known Issue: Outbound TCP Race

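When tcp_connect() releases tcp_lock to send the SYN, there is a brief window in which tcp_tick() could fire and attempt a retransmit before the initial SYN has been sent. This is mitigated by pushing retransmit_at well into the future (TCP_RTO_INITIAL + 200 ticks, i.e. 3 seconds at 100 Hz) before releasing the lock, then resetting it to the normal RTO once the SYN is out: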

s_tcp[i].retransmit_at = (uint32_t)arch_get_ticks() + TCP_RTO_INITIAL + 200;
spin_unlock_irqrestore(&tcp_lock, fl);
tcp_send_segment(dev, &s_tcp[i], TCP_SYN, NULL, 0);
s_tcp[i].snd_nxt++;
s_tcp[i].retransmit_at = (uint32_t)arch_get_ticks() + TCP_RTO_INITIAL;

UDP (udp.c)

Binding Table

UDP uses a simple binding table mapping ports to socket IDs:

typedef struct {
    uint16_t port;      /* host byte order; 0 = free */
    uint32_t sock_id;   /* index into socket table */
} udp_binding_t;

static udp_binding_t s_udp[UDP_BINDINGS_MAX];  /* 16 slots */

Sending (udp_send)

int udp_send(netdev_t *dev, uint16_t src_port, ip4_addr_t dst_ip,
             uint16_t dst_port, const void *payload, uint16_t len)

Builds an 8-byte UDP header in s_udp_buf, sets checksum to 0 (optional per RFC 768 for IPv4), and calls ip_send().

Receiving (udp_rx)

  1. Validate minimum length (8 bytes)
  2. Validate UDP length field
  3. Validate checksum (if non-zero) using pseudo-header
  4. Look up destination port in binding table
  5. Copy payload into the socket’s UDP receive ring buffer
  6. Defer wake/epoll_notify outside udp_lock

UDP Checksum Validation

static uint16_t
udp_checksum_verify(uint32_t src_ip, uint32_t dst_ip,
                    const uint8_t *udp_pkt, uint16_t udp_len)

Computes the one’s-complement sum over the pseudo-header (source IP, destination IP, protocol=17, UDP length) plus the UDP header and payload. Returns 0 if valid.

Per RFC 768, a checksum field of 0 means the sender did not compute a checksum; the receiver skips validation.
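
One possible body for a function with this signature is sketched below. It is not the udp.c implementation: it assumes src_ip and dst_ip are passed as host-byte-order values (so 192.168.0.1 is 0xC0A80001) and that udp_pkt points at the UDP header; sum16() is a hypothetical helper. The caller is expected to skip the call when the checksum field is 0.

```c
#include <stddef.h>
#include <stdint.h>

/* Adds big-endian 16-bit words of `data` to a running sum. */
static uint32_t sum16(uint32_t sum, const uint8_t *data, size_t len)
{
    size_t i;
    for (i = 0; i + 1 < len; i += 2)
        sum += (uint32_t)((data[i] << 8) | data[i + 1]);
    if (len & 1)
        sum += (uint32_t)(data[len - 1] << 8);   /* pad odd byte with zero */
    return sum;
}

/* Returns 0 if the checksum stored in the UDP header is valid. */
static uint16_t
udp_checksum_verify(uint32_t src_ip, uint32_t dst_ip,
                    const uint8_t *udp_pkt, uint16_t udp_len)
{
    uint32_t sum = 0;

    /* Pseudo-header: source IP, destination IP, zero+protocol, UDP length */
    sum += (src_ip >> 16) + (src_ip & 0xFFFFu);
    sum += (dst_ip >> 16) + (dst_ip & 0xFFFFu);
    sum += 17u;                                  /* protocol = UDP */
    sum += udp_len;

    /* UDP header (including the stored checksum) and payload */
    sum = sum16(sum, udp_pkt, udp_len);

    while (sum >> 16)                            /* fold carries */
        sum = (sum & 0xFFFFu) + (sum >> 16);
    return (uint16_t)~sum;                       /* 0 when everything adds up */
}
```

For a 10-byte datagram from 192.168.0.1 port 1234 to 192.168.0.199 port 53 carrying "hi", the correct checksum field is 0x1051, and the function returns 0 for that packet.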

Port Unbinding

udp_unbind() is called from sock_vfs_close() to release port bindings when a UDP socket is closed. Without this, port bindings would leak and subsequent bind() calls to the same port would fail with EADDRINUSE. This bug was discovered through the DHCP client retry loop.

void udp_unbind(uint16_t port)
{
    if (port == 0) return;
    irqflags_t fl = spin_lock_irqsave(&udp_lock);
    for (int i = 0; i < UDP_BINDINGS_MAX; i++) {
        if (s_udp[i].port == port) {
            s_udp[i].port    = 0;
            s_udp[i].sock_id = 0;
        }
    }
    spin_unlock_irqrestore(&udp_lock, fl);
}

Network Device Layer (netdev.c)

Device Registry

Up to NETDEV_MAX (4) network devices can be registered:

static netdev_t *s_devices[NETDEV_MAX];
static int        s_count = 0;

netdev_t Structure

typedef struct netdev {
    char     name[16];      /* "eth0", "eth1", ... */
    uint8_t  mac[6];        /* hardware MAC address */
    uint16_t mtu;           /* 1500 for Ethernet */
    int    (*send)(struct netdev *dev, const void *pkt, uint16_t len);
    void   (*poll)(struct netdev *dev);
    void    *priv;          /* driver-private data */
} netdev_t;

The send callback transmits a complete Ethernet frame (including 14-byte header). The poll callback drains the RX ring and calls netdev_rx_deliver() for each received frame.

Device Lookup

netdev_get() performs a linear name comparison under netdev_lock:

netdev_t *netdev_get(const char *name)

This is used by net_init() to find "eth0" and by tcp_connect() to find the NIC for outbound connections.

Security Implications

The TCP and UDP implementations process raw packets from the network with no sandboxing or privilege separation. Several structural properties of the v1 code are worth highlighting:

  • All parsing is manual C: Header field extraction, checksum computation, and length validation are hand-written. There are no safe abstractions or bounds-checked accessors. A single off-by-one error in any parsing path is potentially exploitable.
  • Static connection table: The 32-slot TCP connection table is a fixed array. A SYN flood filling all slots denies service to legitimate connections, with no SYN cookies or backlog mechanism.
  • No sequence number randomization: Initial sequence numbers are derived from arch_get_ticks(), making them predictable to an attacker who can observe or estimate the system uptime.
  • Ephemeral port allocation: Source ports are 49152 + (ticks & 0x3FFF), which is similarly predictable.
  • Lock ordering as security boundary: The deferred-wake pattern prevents deadlocks but adds complexity. An incorrect lock ordering change could cause deadlocks under attacker-controlled packet timing.

These are not theoretical – they are the kinds of real, exploitable issues expected in any v1 C network stack. The Aegis project’s planned Rust migration will address the memory safety class of these bugs; protocol-level hardening (SYN cookies, ISN randomization, rate limiting) is separate future work.