TCP/IP Implementation
Detailed documentation of the Aegis OS TCP state machine, UDP demultiplexing, IPv4 routing, ARP resolution, and ICMP echo
This document covers the transport and network layer protocols implemented in the Aegis kernel: TCP (RFC 793), UDP, IPv4, ICMP echo, and ARP.
v1 maturity notice: These protocol implementations are written in C and parse untrusted network data. As first-version code, they should be assumed to contain exploitable vulnerabilities – this is the reality of any from-scratch protocol stack at this stage, not a hypothetical concern. There has been no formal security audit of the packet parsing or state machine logic. The planned C-to-Rust migration will eventually cover the network protocols, but this work has not yet started for
kernel/net/. Contributions are welcome – file issues or propose changes at exec/aegis.
IPv4 Layer (ip.c)
IP Configuration
The kernel maintains a single static IP configuration protected by ip_lock:
static ip4_addr_t s_my_ip;
static ip4_addr_t s_netmask;
static ip4_addr_t s_gateway;
Configuration is set via net_set_config() and queried via net_get_config(). Both acquire ip_lock with IRQ save/restore.
IP Header
The stack uses a fixed 20-byte IPv4 header with no options (IHL=5):
typedef struct __attribute__((packed)) {
    uint8_t  ver_ihl;     /* 0x45: version 4, IHL 5 */
    uint8_t  dscp_ecn;
    uint16_t total_len;   /* network byte order */
    uint16_t id;
    uint16_t flags_frag;
    uint8_t  ttl;         /* always 64 */
    uint8_t  proto;       /* 1=ICMP, 6=TCP, 17=UDP */
    uint16_t checksum;
    ip4_addr_t src;
    ip4_addr_t dst;
} ip_hdr_t;
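The checksum field holds the standard RFC 1071 one's-complement sum over the 20 header bytes. A standalone sketch of that computation (`ip_checksum` is an illustrative name, not necessarily the kernel's symbol):

```c
#include <stdint.h>
#include <stddef.h>

/* RFC 1071 one's-complement checksum. Sums the data as 16-bit
 * big-endian words, folds carries, and returns the complement. */
static uint16_t ip_checksum(const void *data, size_t len)
{
    const uint8_t *p = data;
    uint32_t sum = 0;
    while (len > 1) {
        sum += ((uint32_t)p[0] << 8) | p[1];  /* 16-bit big-endian word */
        p += 2;
        len -= 2;
    }
    if (len)                                  /* odd trailing byte */
        sum += (uint32_t)p[0] << 8;
    while (sum >> 16)                         /* fold carries back in */
        sum = (sum & 0xFFFF) + (sum >> 16);
    return (uint16_t)~sum;
}
```

A header with the computed checksum stored in its checksum field re-sums to zero, which is how `ip_rx()`-style validation works.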
Sending (ip_send)
ip_send() handles both loopback and real NIC transmission:
ip_send(dev, dst_ip, proto, payload, len)
|
+-- len > 1480? → return -1 (no fragmentation)
|
+-- dst == my_ip || 127.0.0.0/8?
| → Queue in loopback ring (deferred delivery)
| → return 0
|
+-- Build IP header in s_ip_buf
| - ver_ihl=0x45, ttl=64, proto=proto
| - Compute header checksum
|
+-- Copy packet to stack-local buffer
| Release ip_lock (avoid lock ordering inversion with arp_lock)
|
+-- Determine next-hop:
| - 255.255.255.255 → broadcast MAC (ff:ff:ff:ff:ff:ff)
| - Same subnet → ARP resolve dst_ip directly
| - Different subnet → ARP resolve gateway
|
+-- eth_send(dev, next_hop_mac, ETHERTYPE_IP, packet, total)
Lock ordering detail: ip_send() must release ip_lock before calling arp_resolve() because the lock ordering is arp_lock > ip_lock. The assembled packet and IP config are copied to local variables before releasing the lock.
Receiving (ip_rx)
ip_rx() is called by eth_rx() for Ethernet frames with ethertype 0x0800:
- Validate header: version=4, IHL=5, checksum
- Destination filter: Accept if destination matches:
  - Local IP (s_my_ip)
  - Loopback range (127.0.0.0/8)
  - Limited broadcast (255.255.255.255)
  - Subnet broadcast (my_ip | ~netmask)
- Dispatch on protocol field:
  - IP_PROTO_ICMP (1) -> icmp_rx()
  - IP_PROTO_TCP (6) -> tcp_rx()
  - IP_PROTO_UDP (17) -> udp_rx()
Loopback
Loopback uses an 8-slot ring buffer to decouple delivery from the send path:
#define LO_RING_SIZE 8
#define LO_PKT_MAX 1500
static uint8_t s_lo_ring[LO_RING_SIZE][LO_PKT_MAX];
ip_loopback_poll() drains the ring at 100 Hz from the PIT handler. Without this deferred delivery, a synchronous loopback TCP handshake would clobber the static packet buffers through recursive ip_send() calls.
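A minimal sketch of the enqueue/drain pair this ring implies. The length and index fields (`s_lo_len`, `s_lo_head`, `s_lo_tail`) and the helper names are assumptions; the kernel's actual bookkeeping may differ:

```c
#include <stdint.h>
#include <string.h>

#define LO_RING_SIZE 8
#define LO_PKT_MAX   1500

static uint8_t  s_lo_ring[LO_RING_SIZE][LO_PKT_MAX];
static uint16_t s_lo_len[LO_RING_SIZE];
static uint32_t s_lo_head, s_lo_tail;  /* free-running; wrapped via modulo */

/* Called from the ip_send() loopback path: copy the packet and return. */
static int lo_enqueue(const void *pkt, uint16_t len)
{
    if (len > LO_PKT_MAX) return -1;
    if (s_lo_tail - s_lo_head == LO_RING_SIZE) return -1;  /* full: drop */
    uint32_t i = s_lo_tail % LO_RING_SIZE;
    memcpy(s_lo_ring[i], pkt, len);
    s_lo_len[i] = len;
    s_lo_tail++;
    return 0;
}

/* Called from the 100 Hz poll: each drained packet re-enters ip_rx(). */
static int lo_dequeue(void *out, uint16_t *len_out)
{
    if (s_lo_head == s_lo_tail) return 0;  /* empty */
    uint32_t i = s_lo_head % LO_RING_SIZE;
    memcpy(out, s_lo_ring[i], s_lo_len[i]);
    *len_out = s_lo_len[i];
    s_lo_head++;
    return 1;
}
```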
ICMP (ip.c)
Only ICMP echo is implemented. The handler in icmp_rx():
- Type 8 (Echo Request): Copies the request, changes type to 0, recomputes checksum, sends reply via ip_send().
- Type 0 (Echo Reply): Silently dropped (no ping client in kernel).
- All other types: Dropped.
typedef struct __attribute__((packed)) {
    uint8_t  type;
    uint8_t  code;
    uint16_t checksum;
    uint16_t id;
    uint16_t seq;
} icmp_hdr_t;
Ethernet Layer (eth.c)
Frame Format
+--------+--------+----------+---------+
| Dst MAC| Src MAC| EtherType| Payload |
| 6 bytes| 6 bytes| 2 bytes | 46-1500 |
+--------+--------+----------+---------+
14-byte header
typedef struct __attribute__((packed)) {
    mac_addr_t dst;
    mac_addr_t src;
    uint16_t ethertype;  /* network byte order */
} eth_hdr_t;
Sending (eth_send)
eth_send() assembles an Ethernet frame in a static 1514-byte buffer under arp_lock:
int eth_send(netdev_t *dev, const mac_addr_t *dst_mac,
uint16_t ethertype, const void *payload, uint16_t len)
- Returns -1 if dev is NULL or len > 1500
- Copies the device MAC from dev->mac as source
- Calls dev->send() with the complete frame
Receiving (eth_rx)
eth_rx() dispatches on ethertype:
| EtherType | Handler |
|---|---|
| 0x0800 (IPv4) | ip_rx() |
| 0x0806 (ARP) | arp_rx_pkt() |
| Other | Drop silently |
ARP (eth.c)
ARP Table
The ARP cache is a static table of 16 entries:
typedef struct {
    ip4_addr_t ip;
    mac_addr_t mac;
    uint32_t age;      /* PIT ticks since last use */
    uint8_t  valid;
    uint8_t  resolved; /* 0 = pending request, 1 = reply received */
} arp_entry_t;

static arp_entry_t s_arp_table[ARP_TABLE_SIZE];
Anti-Spoofing
The ARP implementation includes protection against unsolicited ARP replies (cache poisoning):
- arp_insert_pending(): Before sending an ARP request, a pending entry is created in the table with resolved=0.
- arp_rx_pkt(): Only updates entries that already exist in the table (i.e., entries we explicitly requested). Unsolicited replies are silently dropped.
static void arp_rx_pkt(const arp_pkt_t *pkt)
{
    /* ... validation ... */
    if (ntohs(pkt->oper) != 2) return;  /* only cache REPLY */
    arp_entry_t *e = arp_find(pkt->spa);
    if (!e) return;  /* no pending entry -- unsolicited reply, drop */
    e->mac = pkt->sha;
    e->resolved = 1;
}
ARP Resolution (arp_resolve)
arp_resolve() is the core function for mapping an IP address to a MAC address:
arp_resolve(dev, ip, mac_out)
|
+-- Cache hit (resolved)? → return MAC immediately
|
+-- Send ARP request (creates pending entry)
|
+-- Called from ISR (g_in_netdev_poll)?
| → return -1 (caller retries on next tick)
|
+-- Syscall context: busy-poll loop
| - arch_wait_for_irq() yields to QEMU SLIRP
| - dev->poll() processes pending RX frames
| - Check cache for resolved entry
| - Up to 500 iterations (~5 seconds)
|
+-- Timeout → return -1
The ISR/syscall distinction is critical: blocking inside the PIT ISR while holding netdev_lock would deadlock. The g_in_netdev_poll volatile flag signals this context.
ARP Packet Format
typedef struct __attribute__((packed)) {
    uint16_t htype;    /* 1 = Ethernet */
    uint16_t ptype;    /* 0x0800 = IPv4 */
    uint8_t  hlen;     /* 6 */
    uint8_t  plen;     /* 4 */
    uint16_t oper;     /* 1 = REQUEST, 2 = REPLY */
    mac_addr_t sha;    /* sender hardware address */
    ip4_addr_t spa;    /* sender protocol address */
    mac_addr_t tha;    /* target hardware address */
    ip4_addr_t tpa;    /* target protocol address */
} arp_pkt_t;
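A request is built by filling this structure and broadcasting the frame. The following standalone sketch shows the field layout in action; the local type definitions and the `htons_` helper (which assumes a little-endian host, as on x86) are illustrative, not the kernel's actual code:

```c
#include <stdint.h>
#include <string.h>

typedef struct { uint8_t b[6]; } mac_addr_t;
typedef uint32_t ip4_addr_t;  /* stored in network byte order */

typedef struct __attribute__((packed)) {
    uint16_t htype, ptype;
    uint8_t  hlen, plen;
    uint16_t oper;
    mac_addr_t sha; ip4_addr_t spa;
    mac_addr_t tha; ip4_addr_t tpa;
} arp_pkt_t;

/* Byte-swap to network order; correct only on a little-endian host. */
static uint16_t htons_(uint16_t x) { return (uint16_t)((x << 8) | (x >> 8)); }

/* Fill an ARP REQUEST: tha stays all-zero, and the caller sends the
 * frame to the broadcast MAC ff:ff:ff:ff:ff:ff. */
static void arp_build_request(arp_pkt_t *p, const mac_addr_t *my_mac,
                              ip4_addr_t my_ip, ip4_addr_t target_ip)
{
    memset(p, 0, sizeof *p);
    p->htype = htons_(1);       /* Ethernet */
    p->ptype = htons_(0x0800);  /* IPv4 */
    p->hlen = 6;
    p->plen = 4;
    p->oper = htons_(1);        /* REQUEST */
    p->sha = *my_mac;
    p->spa = my_ip;
    p->tpa = target_ip;
}
```

Note that the packed struct is exactly 28 bytes, the wire size of an Ethernet/IPv4 ARP payload.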
TCP (tcp.c)
Connection Table
TCP connections are stored in a static array of 32 slots:
#define TCP_MAX_CONNS 32
#define TCP_RBUF_SIZE 16384 /* 16 KB receive buffer per connection */
#define TCP_SBUF_SIZE 8192 /* 8 KB send buffer per connection */
typedef struct {
    tcp_state_t state;
    ip4_addr_t  local_ip, remote_ip;
    uint16_t    local_port, remote_port;
    netdev_t   *dev;
    uint32_t    snd_nxt, snd_una;
    uint32_t    rcv_nxt;
    uint16_t    snd_wnd;
    uint8_t     rbuf[TCP_RBUF_SIZE];
    uint32_t    rbuf_head, rbuf_tail;
    uint8_t     sbuf[TCP_SBUF_SIZE];
    uint32_t    sbuf_head, sbuf_tail;
    uint32_t    retransmit_at;
    uint8_t     retransmit_count;
    uint32_t    timewait_at;
    uint32_t    sock_id;
    uint32_t    listener_id;
} tcp_conn_t;
State Machine
The TCP implementation follows the RFC 793 state machine:
+---------+
| CLOSED |
+---------+
passive open / \ active open
tcp_listen / \ tcp_connect
v v
+---------+ +-----------+
| LISTEN | | SYN_SENT |
+---------+ +-----------+
rcv SYN / \ rcv SYN+ACK
send SYN+ACK/ \ send ACK
v v
+-----------+ +-------------+
| SYN_RCVD | | ESTABLISHED |
+-----------+ +-------------+
rcv ACK / close / \ rcv FIN
/ send FIN / \ send ACK
v v v
+-------------+ +------------+ +-----------+
| ESTABLISHED | | FIN_WAIT_1 | | CLOSE_WAIT|
+-------------+ +------------+ +-----------+
rcv ACK / \ rcv FIN \ close
/ \ send ACK \ send FIN
v v v
+----------+ +--------+ +----------+
|FIN_WAIT_2| |CLOSING | | LAST_ACK |
+----------+ +--------+ +----------+
rcv FIN / rcv ACK \ rcv ACK \
send ACK/ \ \
v v v
+-----------+ +---------+ +---------+
| TIME_WAIT | |TIME_WAIT| | CLOSED |
+-----------+ +---------+ +---------+
2MSL timeout |
| |
v v
CLOSED CLOSED
All states are defined in tcp.h:
typedef enum {
    TCP_CLOSED,
    TCP_LISTEN,
    TCP_SYN_RCVD,
    TCP_SYN_SENT,
    TCP_ESTABLISHED,
    TCP_FIN_WAIT_1,
    TCP_FIN_WAIT_2,
    TCP_CLOSING,
    TCP_CLOSE_WAIT,
    TCP_LAST_ACK,
    TCP_TIME_WAIT
} tcp_state_t;
Sequence Number Arithmetic
TCP sequence numbers are 32-bit and wrap. Plain comparison gives wrong results near 2^31. The implementation uses signed difference for correct modular arithmetic:
static inline int seq_lt(uint32_t a, uint32_t b) {
    return (int32_t)(a - b) < 0;
}
static inline int seq_le(uint32_t a, uint32_t b) {
    return (int32_t)(a - b) <= 0;
}
static inline int seq_gt(uint32_t a, uint32_t b) {
    return (int32_t)(a - b) > 0;
}
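The effect is easy to demonstrate at the wrap point: a sequence number just below 2^32 must compare "less than" a small post-wrap value, which naive unsigned comparison gets backwards.

```c
#include <stdint.h>

static inline int seq_lt(uint32_t a, uint32_t b) { return (int32_t)(a - b) < 0; }
static inline int seq_gt(uint32_t a, uint32_t b) { return (int32_t)(a - b) > 0; }

/* Returns 1 when the signed-difference comparisons behave correctly
 * across the 32-bit wrap point. */
static int seq_wrap_demo(void)
{
    uint32_t before_wrap = 0xFFFFFFF0u;  /* 16 bytes below 2^32 */
    uint32_t after_wrap  = 0x00000010u;  /* 16 bytes past the wrap */
    return seq_lt(before_wrap, after_wrap)   /* modular compare: correct */
        && seq_gt(after_wrap, before_wrap)
        && !(before_wrap < after_wrap);      /* naive unsigned: wrong */
}
```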
TCP Header
typedef struct __attribute__((packed)) {
    uint16_t src_port;
    uint16_t dst_port;
    uint32_t seq;
    uint32_t ack;
    uint8_t  data_off;  /* upper 4 bits = header length / 4 */
    uint8_t  flags;
    uint16_t window;
    uint16_t checksum;
    uint16_t urgent;
} tcp_hdr_t;
#define TCP_FIN 0x01
#define TCP_SYN 0x02
#define TCP_RST 0x04
#define TCP_PSH 0x08
#define TCP_ACK 0x10
Segment Sending (tcp_send_segment)
Builds a TCP segment in the static s_tcp_buf under tcp_lock:
- Fill header fields from tcp_conn_t
- Compute window advertisement based on available receive buffer space
- Compute checksum over pseudo-header + TCP segment
- Call ip_send()
int tcp_send_segment(netdev_t *dev, tcp_conn_t *conn,
uint8_t flags, const void *payload, uint16_t len);
The window advertisement reflects actual receive buffer availability:
uint32_t used = (conn->rbuf_tail - conn->rbuf_head) & (TCP_RBUF_SIZE - 1);
uint32_t avail = TCP_RBUF_SIZE - used;
if (avail > 0xFFFFu) avail = 0xFFFFu;
hdr->window = htons((uint16_t)avail);
Connection Lookup
Two lookup functions serve different purposes:
- tcp_find(): Exact 4-tuple match (remote_ip, remote_port, local_ip, local_port) for established connections
- tcp_find_listener(): Match on local_port only (local_ip=0 means INADDR_ANY) for LISTEN-state connections
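A sketch of what these two lookups plausibly look like over the static table. The abridged `tcp_conn_t` and the exact match conditions are assumptions for illustration, not the kernel's code:

```c
#include <stdint.h>
#include <stddef.h>

typedef uint32_t ip4_addr_t;
typedef enum { TCP_CLOSED, TCP_LISTEN, TCP_ESTABLISHED } tcp_state_t; /* abridged */

typedef struct {                  /* abridged view of tcp_conn_t */
    tcp_state_t state;
    ip4_addr_t  local_ip, remote_ip;
    uint16_t    local_port, remote_port;
} tcp_conn_t;

#define TCP_MAX_CONNS 32
static tcp_conn_t s_tcp[TCP_MAX_CONNS];

/* Exact 4-tuple match; skips free slots and listeners. */
static tcp_conn_t *tcp_find(ip4_addr_t rip, uint16_t rport,
                            ip4_addr_t lip, uint16_t lport)
{
    for (int i = 0; i < TCP_MAX_CONNS; i++) {
        tcp_conn_t *c = &s_tcp[i];
        if (c->state != TCP_CLOSED && c->state != TCP_LISTEN &&
            c->remote_ip == rip && c->remote_port == rport &&
            c->local_ip == lip && c->local_port == lport)
            return c;
    }
    return NULL;
}

/* Port-only match for listeners; local_ip == 0 means INADDR_ANY. */
static tcp_conn_t *tcp_find_listener(ip4_addr_t lip, uint16_t lport)
{
    for (int i = 0; i < TCP_MAX_CONNS; i++) {
        tcp_conn_t *c = &s_tcp[i];
        if (c->state == TCP_LISTEN && c->local_port == lport &&
            (c->local_ip == 0 || c->local_ip == lip))
            return c;
    }
    return NULL;
}
```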
Receive Processing (tcp_rx)
tcp_rx() is the core state machine handler, called from ip_rx(). Key design:
- Checksum validation using TCP pseudo-header
- Connection lookup by 4-tuple
- New connection handling: If no match and a SYN is received, look for a listener. If no listener, send RST+ACK.
- State transitions: Per the RFC 793 state diagram
- Deferred socket operations: Wake events and accept queue pushes are collected under tcp_lock, then executed after release to prevent lock ordering inversions
The deferred wake pattern:
/* Collected under tcp_lock */
uint32_t wake_ids[TCP_RX_WAKE_MAX];
uint32_t wake_epoll_events[TCP_RX_WAKE_MAX];
uint32_t wake_count = 0;

/* ... state machine processing ... */

spin_unlock_irqrestore(&tcp_lock, fl);

/* Executed outside tcp_lock */
if (connect_sock_id != SOCK_NONE) {
    sock_t *sk = sock_get(connect_sock_id);
    if (sk) sk->state = SOCK_CONNECTED;
}
for (uint32_t w = 0; w < wake_count; w++) {
    sock_wake(wake_ids[w]);
    if (wake_epoll_events[w] != 0)
        epoll_notify(wake_ids[w], wake_epoll_events[w]);
}
State Transitions in Detail
SYN_SENT (active open)
- SYN+ACK received with matching ack == snd_nxt:
  - Transition to ESTABLISHED
  - Send ACK
  - Wake connect() caller with EPOLLOUT
  - Mark socket as SOCK_CONNECTED
- RST received:
  - Transition to CLOSED
  - Wake connect() caller with EPOLLERR

SYN_RCVD (passive open)
- ACK received with ack == snd_nxt:
  - Transition to ESTABLISHED
  - Push to listener's accept queue
  - Wake listener with EPOLLIN

ESTABLISHED
- Data received (payload_len > 0, seq == rcv_nxt):
  - Copy to receive ring buffer (if space available)
  - Advance rcv_nxt
  - Send ACK
  - Wake recv() caller with EPOLLIN
- FIN received:
  - Advance rcv_nxt by 1
  - Transition to CLOSE_WAIT
  - Send ACK
  - Wake with EPOLLHUP
- RST received:
  - Transition to CLOSED
  - Wake blocked callers

FIN_WAIT_1 / FIN_WAIT_2 / CLOSING / LAST_ACK
Standard RFC 793 transitions. TIME_WAIT uses a 4-second timer (shortened 2MSL, acceptable for non-production):
#define TCP_TIMEWAIT_TICKS 400 /* 4 s at 100 Hz */
Retransmit Timer (tcp_tick)
Called at 100 Hz from the PIT handler. For each non-CLOSED connection:
- TIME_WAIT: Check if timewait_at has elapsed; transition to CLOSED
- Retransmit: If retransmit_at has elapsed:
  - Increment retransmit_count
  - Double the RTO (exponential backoff): rto = TCP_RTO_INITIAL << retransmit_count
  - Cap at TCP_RTO_MAX (8 seconds)
  - Retransmit the appropriate segment (SYN, SYN+ACK)
  - After TCP_RETRANSMIT_MAX (3) retries: send RST, close, wake blocked callers
#define TCP_RTO_INITIAL 100 /* 1 s at 100 Hz */
#define TCP_RTO_MAX 800 /* 8 s at 100 Hz */
#define TCP_RETRANSMIT_MAX 3
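Putting the constants together, the backed-off RTO for a given retry count follows the doubling-with-cap rule above. The helper `tcp_rto_for` is illustrative, not a kernel symbol:

```c
#include <stdint.h>

#define TCP_RTO_INITIAL 100  /* 1 s at 100 Hz */
#define TCP_RTO_MAX     800  /* 8 s at 100 Hz */

/* RTO in PIT ticks after `retransmit_count` timeouts:
 * doubles each retry, capped at TCP_RTO_MAX. */
static uint32_t tcp_rto_for(uint8_t retransmit_count)
{
    uint32_t rto = (uint32_t)TCP_RTO_INITIAL << retransmit_count;
    if (rto > TCP_RTO_MAX)
        rto = TCP_RTO_MAX;
    return rto;
}
```

So the schedule in ticks is 100, 200, 400, 800, then pinned at 800.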
Active Open (tcp_connect)
int tcp_connect(uint32_t sock_id, ip4_addr_t dst_ip, uint16_t dst_port,
uint32_t *conn_id_out)
- Allocate a free connection slot
- Set state to TCP_SYN_SENT
- Assign ephemeral port: 49152 + (arch_get_ticks() & 0x3FFF)
- Set initial sequence number from arch_get_ticks()
- Release tcp_lock before sending (to avoid lock ordering issues)
- Send SYN segment
- Increment snd_nxt after sending (SYN must go out with seq=ISN)
The order of snd_nxt increment matters:
/* Bug fixed: the increment used to come before tcp_send_segment, so the SYN
 * went out with seq=ISN+1 and the remote's ack=ISN+2 never matched
 * snd_nxt=ISN+1. */
tcp_send_segment(dev, &s_tcp[i], TCP_SYN, NULL, 0);
s_tcp[i].snd_nxt++; /* Now ISN+1; SYN_SENT handler matches ack=ISN+1 */
Passive Open (tcp_listen)
int tcp_listen(uint16_t port, uint32_t sock_id)
Creates a connection slot in TCP_LISTEN state. Incoming SYN packets matching the port trigger SYN_RCVD -> SYN+ACK -> (await ACK) -> ESTABLISHED.
Receive Buffer
Each TCP connection has a 16 KB circular receive buffer:
uint8_t rbuf[TCP_RBUF_SIZE]; /* 16384 bytes */
uint32_t rbuf_head, rbuf_tail;
tcp_conn_recv() reads from this buffer:
- max_len=0: Peek mode, returns available byte count
- Returns -11 (EAGAIN) if buffer empty and connection alive
- Returns 0 if buffer empty and FIN received (EOF)
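These return conventions can be sketched as a standalone ring read. The struct and field names below are simplified assumptions (`fin_received` stands in for however the kernel records the FIN), and the masking convention follows the window-advertisement code shown earlier:

```c
#include <stdint.h>

#define TCP_RBUF_SIZE 16384  /* power of two, so indices can be masked */

typedef struct {
    uint8_t  rbuf[TCP_RBUF_SIZE];
    uint32_t rbuf_head, rbuf_tail;
    int      fin_received;   /* illustrative EOF flag */
} rbuf_view_t;

/* Sketch of tcp_conn_recv() semantics:
 *   max_len == 0  -> peek: report available bytes
 *   empty + alive -> -11 (EAGAIN)
 *   empty + FIN   -> 0 (EOF)
 *   otherwise     -> copy up to max_len bytes, advance head */
static int tcp_conn_recv_sketch(rbuf_view_t *c, void *dst, uint32_t max_len)
{
    uint32_t used = (c->rbuf_tail - c->rbuf_head) & (TCP_RBUF_SIZE - 1);
    if (max_len == 0)
        return (int)used;
    if (used == 0)
        return c->fin_received ? 0 : -11;
    uint32_t n = used < max_len ? used : max_len;
    for (uint32_t i = 0; i < n; i++)
        ((uint8_t *)dst)[i] = c->rbuf[(c->rbuf_head + i) & (TCP_RBUF_SIZE - 1)];
    c->rbuf_head += n;
    return (int)n;
}
```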
Known Issue: Outbound TCP Race
When tcp_connect() releases tcp_lock to send the SYN, there is a brief window where tcp_tick() could fire and attempt a retransmit before the initial SYN is sent. This is mitigated by setting retransmit_at far in the future before releasing the lock:
s_tcp[i].retransmit_at = (uint32_t)arch_get_ticks() + TCP_RTO_INITIAL + 200;
spin_unlock_irqrestore(&tcp_lock, fl);
tcp_send_segment(dev, &s_tcp[i], TCP_SYN, NULL, 0);
s_tcp[i].snd_nxt++;
s_tcp[i].retransmit_at = (uint32_t)arch_get_ticks() + TCP_RTO_INITIAL;
UDP (udp.c)
Binding Table
UDP uses a simple binding table mapping ports to socket IDs:
typedef struct {
    uint16_t port;     /* host byte order; 0 = free */
    uint32_t sock_id;  /* index into socket table */
} udp_binding_t;
static udp_binding_t s_udp[UDP_BINDINGS_MAX]; /* 16 slots */
Sending (udp_send)
int udp_send(netdev_t *dev, uint16_t src_port, ip4_addr_t dst_ip,
uint16_t dst_port, const void *payload, uint16_t len)
Builds an 8-byte UDP header in s_udp_buf, sets checksum to 0 (optional per RFC 768 for IPv4), and calls ip_send().
Receiving (udp_rx)
- Validate minimum length (8 bytes)
- Validate UDP length field
- Validate checksum (if non-zero) using pseudo-header
- Look up destination port in binding table
- Copy payload into the socket’s UDP receive ring buffer
- Defer wake/epoll_notify outside udp_lock
UDP Checksum Validation
static uint16_t
udp_checksum_verify(uint32_t src_ip, uint32_t dst_ip,
const uint8_t *udp_pkt, uint16_t udp_len)
Computes the one’s-complement sum over the pseudo-header (source IP, destination IP, protocol=17, UDP length) plus the UDP header and payload. Returns 0 if valid.
Per RFC 768, a checksum field of 0 means the sender did not compute a checksum; the receiver skips validation.
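A standalone sketch of this computation. Names are illustrative, and the IPs are passed as 4-byte network-order arrays to sidestep host endianness:

```c
#include <stdint.h>
#include <stddef.h>

/* Accumulate 16-bit big-endian words into a running sum. */
static uint32_t sum16be(const uint8_t *p, size_t len, uint32_t sum)
{
    while (len > 1) {
        sum += ((uint32_t)p[0] << 8) | p[1];
        p += 2;
        len -= 2;
    }
    if (len)
        sum += (uint32_t)p[0] << 8;
    return sum;
}

/* Returns 0 iff the checksum over pseudo-header + datagram is valid.
 * With the checksum field zeroed, the return value is the checksum to
 * store (a sender maps a computed 0 to 0xFFFF per RFC 768). */
static uint16_t udp_cksum_verify(const uint8_t src_ip[4], const uint8_t dst_ip[4],
                                 const uint8_t *udp, uint16_t udp_len)
{
    uint32_t sum = 0;
    sum = sum16be(src_ip, 4, sum);   /* pseudo-header: source IP */
    sum = sum16be(dst_ip, 4, sum);   /* pseudo-header: destination IP */
    sum += 17;                       /* zero byte + protocol (UDP) */
    sum += udp_len;                  /* UDP length */
    sum = sum16be(udp, udp_len, sum);
    while (sum >> 16)
        sum = (sum & 0xFFFF) + (sum >> 16);
    return (uint16_t)~sum;
}
```

The one's-complement identity makes verification simple: a datagram carrying the checksum computed over itself (with the field zeroed) sums to 0xFFFF, so the complement is 0.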
Port Unbinding
udp_unbind() is called from sock_vfs_close() to release port bindings when a UDP socket is closed. Without this, port bindings would leak and subsequent bind() calls to the same port would fail with EADDRINUSE. This bug was discovered through the DHCP client retry loop.
void udp_unbind(uint16_t port)
{
    if (port == 0) return;
    irqflags_t fl = spin_lock_irqsave(&udp_lock);
    for (int i = 0; i < UDP_BINDINGS_MAX; i++) {
        if (s_udp[i].port == port) {
            s_udp[i].port = 0;
            s_udp[i].sock_id = 0;
        }
    }
    spin_unlock_irqrestore(&udp_lock, fl);
}
Network Device Layer (netdev.c)
Device Registry
Up to NETDEV_MAX (4) network devices can be registered:
static netdev_t *s_devices[NETDEV_MAX];
static int s_count = 0;
netdev_t Structure
typedef struct netdev {
    char    name[16];  /* "eth0", "eth1", ... */
    uint8_t mac[6];    /* hardware MAC address */
    uint16_t mtu;      /* 1500 for Ethernet */
    int  (*send)(struct netdev *dev, const void *pkt, uint16_t len);
    void (*poll)(struct netdev *dev);
    void *priv;        /* driver-private data */
} netdev_t;
The send callback transmits a complete Ethernet frame (including 14-byte header). The poll callback drains the RX ring and calls netdev_rx_deliver() for each received frame.
Device Lookup
netdev_get() performs a linear name comparison under netdev_lock:
netdev_t *netdev_get(const char *name)
This is used by net_init() to find "eth0" and by tcp_connect() to find the NIC for outbound connections.
Security Implications
The TCP and UDP implementations process raw packets from the network with no sandboxing or privilege separation. Several structural properties of the v1 code are worth highlighting:
- All parsing is manual C: Header field extraction, checksum computation, and length validation are hand-written. There are no safe abstractions or bounds-checked accessors. A single off-by-one error in any parsing path is potentially exploitable.
- Static connection table: The 32-slot TCP connection table is a fixed array. A SYN flood filling all slots denies service to legitimate connections, with no SYN cookies or backlog mechanism.
- No sequence number randomization: Initial sequence numbers are derived from arch_get_ticks(), making them predictable to an attacker who can observe or estimate the system uptime.
- Ephemeral port allocation: Source ports are 49152 + (ticks & 0x3FFF), which is similarly predictable.
- Lock ordering as security boundary: The deferred-wake pattern prevents deadlocks but adds complexity. An incorrect lock ordering change could cause deadlocks under attacker-controlled packet timing.
These are not theoretical – they are the kinds of real, exploitable issues expected in any v1 C network stack. The Aegis project’s planned Rust migration will address the memory safety class of these bugs; protocol-level hardening (SYN cookies, ISN randomization, rate limiting) is separate future work.
Related Documentation
- Network Stack Overview – architecture and design
- Socket API – userspace socket interface
- Device Drivers – NIC driver internals