Socket API

The Aegis socket layer bridges userspace networking syscalls to the kernel’s TCP/UDP/Unix transport implementations. It provides three address families (AF_INET TCP, AF_INET UDP, AF_UNIX), an epoll I/O multiplexer, and VFS integration so sockets can be used with read(), write(), and close().

v1 maturity notice: The socket layer is the boundary between untrusted userspace and the kernel’s network stack. All code here is C, operating on user-supplied addresses and buffer pointers (with SMAP enforcement via copy_from_user/copy_to_user). As v1 software, this code has not been audited and likely contains exploitable bugs at the syscall boundary – privilege escalation via crafted sockaddr structures, fd table corruption through race conditions, or information leaks through uninitialized buffer contents. These are expected realities for a from-scratch OS at this stage, not hypothetical concerns. Contributions are welcome – file issues or propose changes at exec/aegis.

AF_INET Socket Table (socket.c)

Socket Structure

#define SOCK_TABLE_SIZE  64
#define SOCK_NONE        0xFFFFFFFFU

typedef struct {
    sock_state_t  state;
    uint8_t       type;          /* SOCK_TYPE_STREAM (1) or SOCK_TYPE_DGRAM (2) */
    uint8_t       nonblocking;
    ip4_addr_t    local_ip;
    uint16_t      local_port;
    ip4_addr_t    remote_ip;
    uint16_t      remote_port;
    uint32_t      tcp_conn_id;   /* index into tcp_conn table; SOCK_NONE if none */
    /* accept queue: ring of completed tcp_conn_id values */
    uint32_t      accept_queue[8];
    uint8_t       accept_head, accept_tail;
    /* UDP receive ring */
    udp_rx_slot_t udp_rx[UDP_RX_SLOTS];
    uint8_t       udp_rx_head, udp_rx_tail;
    /* blocking waiter */
    aegis_task_t *waiter_task;
    /* epoll back-reference */
    uint32_t      epoll_id;
    uint64_t      epoll_events;
    /* options */
    uint8_t       reuseaddr;
    uint8_t       broadcast;
    uint32_t      rcvtimeo_ticks;
    uint32_t      sndtimeo_ticks;
} sock_t;

Socket States

typedef enum {
    SOCK_FREE,         /* slot available */
    SOCK_CREATED,      /* allocated, not bound */
    SOCK_BOUND,        /* bound to local address */
    SOCK_LISTENING,    /* listening for connections (TCP) */
    SOCK_CONNECTING,   /* connect() in progress (TCP) */
    SOCK_CONNECTED,    /* established (TCP) */
    SOCK_CLOSED        /* connection terminated */
} sock_state_t;

Socket Lifecycle

socket()    → sock_alloc() → SOCK_CREATED
bind()      → SOCK_BOUND
listen()    → SOCK_LISTENING         [TCP server]
accept()    → new socket in SOCK_CONNECTED
connect()   → SOCK_CONNECTING → SOCK_CONNECTED  [TCP client]
close()     → sock_vfs_close() → sock_free()
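The allocation step can be sketched as a linear scan over the fixed table. This is a hypothetical rendering, not the actual socket.c code; the real sock_alloc() may track free slots differently:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define SOCK_TABLE_SIZE 64
#define SOCK_NONE       0xFFFFFFFFU

typedef enum { SOCK_FREE, SOCK_CREATED } sock_state_t;

/* Trimmed-down sock_t for illustration */
typedef struct {
    sock_state_t state;
    uint8_t      type;        /* SOCK_TYPE_STREAM or SOCK_TYPE_DGRAM */
    uint32_t     tcp_conn_id;
} sock_t;

static sock_t s_socks[SOCK_TABLE_SIZE];

static uint32_t sock_alloc(uint8_t type)
{
    for (uint32_t i = 0; i < SOCK_TABLE_SIZE; i++) {
        if (s_socks[i].state == SOCK_FREE) {
            memset(&s_socks[i], 0, sizeof(sock_t));
            s_socks[i].state       = SOCK_CREATED;
            s_socks[i].type        = type;
            s_socks[i].tcp_conn_id = SOCK_NONE;  /* no TCP conn yet */
            return i;
        }
    }
    return SOCK_NONE;  /* table exhausted */
}
```

With a 64-slot table, the 65th concurrent socket() call fails, which the syscall layer would surface as an errno such as -ENFILE.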

VFS Integration

Each AF_INET socket is backed by a VFS file descriptor through vfs_ops_t:

static const vfs_ops_t s_sock_ops = {
    .read    = sock_vfs_read,
    .write   = sock_vfs_write,
    .close   = sock_vfs_close,
    .readdir = NULL,
    .dup     = sock_vfs_dup,
    .stat    = sock_vfs_stat,
    .poll    = NULL,
};

| VFS Op | Behavior |
|--------|----------|
| read   | TCP: blocking recv from receive buffer. Returns data, 0 (EOF), or -EAGAIN (nonblocking). UDP: returns -ENOSYS (use recvfrom). |
| write  | TCP: send via tcp_conn_send(), chunked in 1460-byte MSS segments. Copies from userspace via copy_from_user() (SMAP). UDP: returns -ENOSYS (use sendto). |
| close  | UDP: calls udp_unbind() to release port. Then sock_free(). |
| dup    | No-op (sockets have no refcount). |
| stat   | Returns st_mode = S_IFSOCK \| 0666. |

TCP Read (Blocking)

The TCP read path implements blocking semantics with a careful wakeup protocol:

static int sock_vfs_read(void *priv, void *buf, uint64_t off, uint64_t len)
{
    sock_t *s = (sock_t *)priv;
    uint32_t want = (uint32_t)len;
    (void)off;  /* sockets have no file offset */

    for (;;) {
        /* Set waiter BEFORE checking data -- prevents lost wakeup */
        s->waiter_task = (aegis_task_t *)sched_current();

        int avail = tcp_conn_recv(s->tcp_conn_id, NULL, 0);  /* peek */
        if (avail > 0) {
            s->waiter_task = NULL;
            return tcp_conn_recv(s->tcp_conn_id, buf, want);
        }

        /* Check for EOF (FIN received) */
        tcp_conn_t *tc = tcp_conn_get(s->tcp_conn_id);
        if (!tc || tc->state == TCP_CLOSE_WAIT ||
            tc->state == TCP_CLOSED || tc->state == TCP_TIME_WAIT) {
            s->waiter_task = NULL;
            return 0;  /* EOF */
        }

        if (s->nonblocking) {
            s->waiter_task = NULL;
            return -11;  /* EAGAIN */
        }
        sched_block();
    }
}

The waiter_task is set before checking for data to prevent the lost-wakeup race: if sock_wake() fires between the peek and sched_block() while waiter_task is NULL, the wakeup would be silently lost.
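The wake side of this protocol can be modeled in isolation. The sketch below uses stand-in types (a plain task_t with a runnable flag instead of aegis_task_t / sched_unblock(), and rx_avail instead of the real TCP receive buffer) to show why registering the waiter before the peek closes the race:

```c
#include <assert.h>
#include <stddef.h>

/* Stand-ins for the scheduler types (hypothetical) */
typedef struct { int runnable; } task_t;

typedef struct {
    task_t *waiter_task;
    int     rx_avail;     /* bytes buffered by the TCP layer */
} sock_t;

/* Wake side: runs from the TCP RX path when data arrives. */
static void sock_wake(sock_t *s)
{
    task_t *t = s->waiter_task;
    if (t) {
        s->waiter_task = NULL;  /* consume the registration */
        t->runnable = 1;        /* sched_unblock(): reader re-peeks */
    }
}

/* Reader side, mirroring sock_vfs_read(): register first, then peek. */
static int reader_peek(sock_t *s, task_t *self)
{
    s->waiter_task = self;      /* BEFORE the peek -- no lost wakeup */
    if (s->rx_avail > 0) {
        s->waiter_task = NULL;
        return s->rx_avail;
    }
    return 0;                   /* caller would now sched_block() */
}
```

If the registration happened after the peek, a sock_wake() landing in that window would see waiter_task == NULL and the reader would block forever.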

TCP Write (SMAP-Safe)

The write path bounces through a 1460-byte kernel staging buffer to avoid SMAP faults when passing userspace pointers to tcp_conn_send():

static uint8_t s_sndbuf[1460];   /* one MSS */
uint32_t sent = 0;
while (sent < len) {
    uint32_t chunk = (uint32_t)(len - sent);
    if (chunk > 1460) chunk = 1460;
    copy_from_user(s_sndbuf, (const uint8_t *)buf + sent, chunk);
    int n = tcp_conn_send(s->tcp_conn_id, s_sndbuf, chunk);
    if (n <= 0) return sent > 0 ? (int)sent : -32;  /* EPIPE */
    sent += (uint32_t)n;
}
return (int)sent;

UDP Receive Ring

UDP datagrams are buffered in an 8-slot ring within each socket:

#define UDP_RX_SLOTS  8
#define UDP_RX_MAXBUF 1500

typedef struct {
    uint8_t    data[UDP_RX_MAXBUF];
    uint16_t   len;
    ip4_addr_t src_ip;
    uint16_t   src_port;
    uint8_t    in_use;
} udp_rx_slot_t;

The udp_rx() handler in udp.c writes datagrams into this ring. The recvfrom syscall reads from it.
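The enqueue/dequeue pair can be sketched as follows. This is an illustrative rendering, not the actual udp.c code; the real udp_rx() handler and recvfrom path may differ in detail (locking, wakeups):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define UDP_RX_SLOTS  8
#define UDP_RX_MAXBUF 1500

typedef uint32_t ip4_addr_t;

typedef struct {
    uint8_t    data[UDP_RX_MAXBUF];
    uint16_t   len;
    ip4_addr_t src_ip;
    uint16_t   src_port;
    uint8_t    in_use;
} udp_rx_slot_t;

typedef struct {
    udp_rx_slot_t udp_rx[UDP_RX_SLOTS];
    uint8_t       udp_rx_head, udp_rx_tail;
} sock_t;

/* Enqueue a datagram at head; drop if the ring is full (UDP is lossy). */
static int udp_rx_enqueue(sock_t *s, const void *pkt, uint16_t len,
                          ip4_addr_t src_ip, uint16_t src_port)
{
    udp_rx_slot_t *slot = &s->udp_rx[s->udp_rx_head];
    if (slot->in_use || len > UDP_RX_MAXBUF)
        return -1;                         /* full or oversized: drop */
    memcpy(slot->data, pkt, len);
    slot->len      = len;
    slot->src_ip   = src_ip;
    slot->src_port = src_port;
    slot->in_use   = 1;
    s->udp_rx_head = (uint8_t)((s->udp_rx_head + 1) % UDP_RX_SLOTS);
    return 0;
}

/* Dequeue at tail; recvfrom() copies out and frees the slot. */
static int udp_rx_dequeue(sock_t *s, void *buf, uint16_t buflen)
{
    udp_rx_slot_t *slot = &s->udp_rx[s->udp_rx_tail];
    if (!slot->in_use)
        return -1;                         /* ring empty */
    uint16_t n = slot->len < buflen ? slot->len : buflen;
    memcpy(buf, slot->data, n);
    slot->in_use   = 0;
    s->udp_rx_tail = (uint8_t)((s->udp_rx_tail + 1) % UDP_RX_SLOTS);
    return n;
}
```

Dropping on a full ring (rather than blocking the sender) matches UDP's best-effort semantics.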

Sockaddr Layout

The kernel uses a k_sockaddr_in_t matching the musl libc struct sockaddr_in layout:

typedef struct {
    uint16_t sin_family;    /* AF_INET */
    uint16_t sin_port;      /* network byte order */
    uint32_t sin_addr;      /* network byte order */
    uint8_t  sin_zero[8];
} k_sockaddr_in_t;

_Static_assert(sizeof(k_sockaddr_in_t) == 16, "...");
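Validating this struct at the syscall boundary might look like the sketch below. The helper name sockaddr_parse and the exact error codes are hypothetical; the point is the checks that must happen after copy_from_user() and the host/network byte-order conversion (the kernel targets little-endian x86-64, so both fields need a byte swap):

```c
#include <assert.h>
#include <stdint.h>

#define AF_INET        2
#define K_EINVAL       22
#define K_EAFNOSUPPORT 97

typedef struct {
    uint16_t sin_family;    /* AF_INET */
    uint16_t sin_port;      /* network byte order */
    uint32_t sin_addr;      /* network byte order */
    uint8_t  sin_zero[8];
} k_sockaddr_in_t;

_Static_assert(sizeof(k_sockaddr_in_t) == 16, "matches musl sockaddr_in");

/* Hypothetical bind()-side validation, run on the kernel copy of the
 * struct after copy_from_user(). */
static int sockaddr_parse(const k_sockaddr_in_t *sa, uint64_t copied_len,
                          uint32_t *ip_out, uint16_t *port_out)
{
    if (copied_len < sizeof(*sa))
        return -K_EINVAL;              /* short sockaddr from userspace */
    if (sa->sin_family != AF_INET)
        return -K_EAFNOSUPPORT;
    /* ntohs/ntohl: swap network (big-endian) fields to host order */
    *port_out = (uint16_t)((sa->sin_port >> 8) | (sa->sin_port << 8));
    *ip_out   = __builtin_bswap32(sa->sin_addr);
    return 0;
}
```

Rejecting short lengths before touching the fields is what prevents the crafted-sockaddr reads flagged in the v1 maturity notice.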

AF_UNIX Domain Sockets (unix_socket.c)

AF_UNIX sockets provide local IPC with connected, bidirectional, byte-stream semantics. They support fd passing (SCM_RIGHTS), peer credential retrieval, and VFS-backed file descriptors.

Socket Structure

#define UNIX_SOCK_MAX    32
#define UNIX_PATH_MAX    108
#define UNIX_BUF_SIZE    4056   /* fits in one kva page */

typedef struct {
    uint8_t        in_use;
    unix_state_t   state;
    uint8_t        nonblocking;
    char           path[UNIX_PATH_MAX];
    /* Ring buffer -- this socket's TX direction (peer reads from it) */
    uint8_t       *ring;           /* kva-allocated page */
    uint16_t       ring_head;      /* write position */
    uint16_t       ring_tail;      /* read position */
    /* Peer link */
    uint32_t       peer_id;
    /* Accept queue (listening sockets) */
    uint32_t       accept_queue[8];
    uint8_t        accept_head, accept_tail;
    /* Blocking */
    aegis_task_t  *waiter_task;
    /* Peer credentials */
    uint32_t       peer_pid, peer_uid, peer_gid;
    /* fd passing staging area */
    unix_passed_fd_t passed_fds[UNIX_PASSED_FD_MAX];  /* 16 slots */
    uint8_t          passed_fd_count;
    /* Refcount for dup/fork */
    uint32_t       refcount;
} unix_sock_t;

States

typedef enum {
    UNIX_FREE, UNIX_CREATED, UNIX_BOUND, UNIX_LISTENING,
    UNIX_CONNECTING, UNIX_CONNECTED, UNIX_CLOSED
} unix_state_t;

Name Table

Bound sockets are registered in a static name table mapping paths to socket IDs:

#define UNIX_NAME_MAX 32
typedef struct {
    char     path[UNIX_PATH_MAX];
    uint32_t sock_id;
    uint8_t  in_use;
} unix_name_t;

  • name_register(): Registers a path, returns -EADDRINUSE if already bound
  • name_unregister(): Removes the binding (called on socket close)
  • name_lookup(): Returns the socket ID for a path, or UNIX_NONE
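The three operations reduce to linear scans over the static table. A self-contained sketch (error-code values assumed; the real unix_socket.c may differ):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define UNIX_PATH_MAX 108
#define UNIX_NAME_MAX 32
#define UNIX_NONE     0xFFFFFFFFU
#define K_EADDRINUSE  98   /* assumed errno value */

typedef struct {
    char     path[UNIX_PATH_MAX];
    uint32_t sock_id;
    uint8_t  in_use;
} unix_name_t;

static unix_name_t s_names[UNIX_NAME_MAX];

static uint32_t name_lookup(const char *path)
{
    for (int i = 0; i < UNIX_NAME_MAX; i++)
        if (s_names[i].in_use && strcmp(s_names[i].path, path) == 0)
            return s_names[i].sock_id;
    return UNIX_NONE;
}

static int name_register(const char *path, uint32_t sock_id)
{
    if (name_lookup(path) != UNIX_NONE)
        return -K_EADDRINUSE;          /* path already bound */
    for (int i = 0; i < UNIX_NAME_MAX; i++) {
        if (!s_names[i].in_use) {
            strncpy(s_names[i].path, path, UNIX_PATH_MAX - 1);
            s_names[i].path[UNIX_PATH_MAX - 1] = '\0';
            s_names[i].sock_id = sock_id;
            s_names[i].in_use  = 1;
            return 0;
        }
    }
    return -1;                          /* name table full */
}

static void name_unregister(const char *path)
{
    for (int i = 0; i < UNIX_NAME_MAX; i++)
        if (s_names[i].in_use && strcmp(s_names[i].path, path) == 0)
            s_names[i].in_use = 0;
}
```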

Connection Flow

Client                          Server
------                          ------
socket(AF_UNIX, STREAM, 0)      socket(AF_UNIX, STREAM, 0)
                                bind("/run/service.sock")
                                listen()
connect("/run/service.sock")    
  |                             accept() [blocks]
  +-- name_lookup → listener_id
  +-- Allocate server-side socket
  +-- Allocate ring buffers (2 pages)
  +-- Cross-link peer_ids
  +-- Enqueue in listener accept queue
  +-- Wake listener
  |                               |
  v                               v
UNIX_CONNECTED                  UNIX_CONNECTED (new fd)

Ring Buffer Design

Each connected Unix socket has its own TX ring buffer (a single kva_alloc_pages(1) page, 4056 usable bytes). The peer reads from this ring:

Socket A                    Socket B
+---------+                +---------+
| ring_a  | --- read ----> | peer    |
| (A's TX)|                | reads   |
+---------+                +---------+
                           | ring_b  | --- read ----> Socket A reads
                           | (B's TX)|
                           +---------+
  • Write: A writes to its own ring at ring_head
  • Read: A reads from peer’s ring at peer’s ring_tail

Ring buffer functions:

static uint16_t ring_used(unix_sock_t *s) {
    /* UNIX_BUF_SIZE (4056) is not a power of two, so indices must be
     * reduced with modulo arithmetic rather than a bitmask */
    return (uint16_t)((s->ring_head + UNIX_BUF_SIZE - s->ring_tail)
                      % UNIX_BUF_SIZE);
}
static uint16_t ring_free(unix_sock_t *s) {
    return (uint16_t)(UNIX_BUF_SIZE - 1 - ring_used(s));  /* one slot kept empty */
}
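A self-contained sketch of the write/read pair built on these helpers. Because 4056 is not a power of two, all index arithmetic uses modulo; one slot stays empty so that head == tail unambiguously means "empty". The function names are illustrative, not the actual unix_socket.c identifiers:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define UNIX_BUF_SIZE 4056   /* not a power of two: use %, not a mask */

typedef struct {
    uint8_t  ring[UNIX_BUF_SIZE];
    uint16_t head;   /* write position */
    uint16_t tail;   /* read position */
} ring_t;

static uint16_t ring_used(const ring_t *r) {
    return (uint16_t)((r->head + UNIX_BUF_SIZE - r->tail) % UNIX_BUF_SIZE);
}
static uint16_t ring_free(const ring_t *r) {
    return (uint16_t)(UNIX_BUF_SIZE - 1 - ring_used(r));
}

/* Writer appends at head (a socket writes to its OWN ring). */
static int ring_write(ring_t *r, const void *src, uint16_t len)
{
    uint16_t n = ring_free(r) < len ? ring_free(r) : len;
    for (uint16_t i = 0; i < n; i++) {
        r->ring[r->head] = ((const uint8_t *)src)[i];
        r->head = (uint16_t)((r->head + 1) % UNIX_BUF_SIZE);
    }
    return n;   /* may be a short write if the peer hasn't drained */
}

/* Reader consumes at tail (a socket reads from its PEER's ring). */
static int ring_read(ring_t *r, void *dst, uint16_t len)
{
    uint16_t n = ring_used(r) < len ? ring_used(r) : len;
    for (uint16_t i = 0; i < n; i++) {
        ((uint8_t *)dst)[i] = r->ring[r->tail];
        r->tail = (uint16_t)((r->tail + 1) % UNIX_BUF_SIZE);
    }
    return n;
}
```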

Close and Ring Lifetime

Ring buffer ownership is carefully managed across close:

  • When socket A closes but peer B is still alive: A’s ring is not freed because B still reads from it. The ring pointer remains valid even after in_use=0.
  • When B subsequently closes: B frees both its own ring and A’s orphaned ring.
  • This prevents use-after-free while allowing the peer to drain remaining buffered data after the sender closes.
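The ownership rule can be modeled directly. The sketch below stands in malloc/free for kva_alloc_pages/kva_free_pages and uses a trimmed socket struct; the names are illustrative, but the branch logic mirrors the rule above:

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define UNIX_NONE 0xFFFFFFFFU

/* Minimal model of the socket table (stand-in for unix_sock_t) */
typedef struct {
    uint8_t   in_use;
    uint8_t  *ring;       /* stand-in for the kva-allocated page */
    uint32_t  peer_id;
} usock_t;

static usock_t s_unix[32];

static void unix_close(uint32_t id)
{
    usock_t *s = &s_unix[id];
    s->in_use = 0;
    if (s->peer_id != UNIX_NONE && s_unix[s->peer_id].in_use)
        return;              /* peer alive: it may still drain s->ring */
    free(s->ring);           /* stand-in for kva_free_pages(ring, 1) */
    s->ring = NULL;
    if (s->peer_id != UNIX_NONE && s_unix[s->peer_id].ring) {
        free(s_unix[s->peer_id].ring);   /* reap the orphaned peer ring */
        s_unix[s->peer_id].ring = NULL;
    }
}
```

The first close leaves its ring orphaned but reachable; the second close reaps both, so neither a use-after-free nor a leak occurs.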

Peer Credentials

Credentials are captured at connect() time:

/* Client credentials → server-side socket */
s_unix[server_id].peer_pid = proc->pid;
s_unix[server_id].peer_uid = proc->uid;
s_unix[server_id].peer_gid = proc->gid;

/* Server credentials → client socket (filled at accept time) */
client->peer_pid = accepting_proc->pid;

Retrieved via unix_sock_peercred():

int unix_sock_peercred(uint32_t id, uint32_t *pid, uint32_t *uid, uint32_t *gid);

fd Passing (SCM_RIGHTS)

Unix sockets support passing file descriptors between processes:

Staging (sender side):

int unix_sock_stage_fds(uint32_t peer_id, unix_passed_fd_t *fds, uint8_t count);

Copies VFS ops/priv/flags into the peer’s staging area (up to 16 fds).

Receiving (receiver side):

int unix_sock_recv_fds(uint32_t id, int *fd_out, int max_fds);

Installs staged fds into the receiving process’s fd table. Any fds that cannot be installed (fd table full) are closed via their ops->close callback.

On socket close, any unreceived staged fds are also cleaned up.
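The install-or-close loop on the receiver side can be sketched as follows. The types and helper names here are stand-ins (the real code works with vfs_ops_t entries and the process fd table), but the fallback behavior matches the description above:

```c
#include <assert.h>
#include <stdint.h>

#define FD_TABLE_SIZE 16

/* Stand-ins for the staged-fd and fd-table types (hypothetical) */
typedef struct { void (*close)(void *priv); void *priv; } passed_fd_t;
typedef struct { uint8_t in_use; passed_fd_t f; } fd_entry_t;

static fd_entry_t s_fdtab[FD_TABLE_SIZE];
static int        s_closed;          /* counts ops->close() fallbacks */

static void stub_close(void *priv) { (void)priv; s_closed++; }

/* Install staged fds into the receiver's table; close what won't fit. */
static int recv_fds(passed_fd_t *staged, int count, int *fd_out, int max_fds)
{
    int installed = 0;
    for (int i = 0; i < count; i++) {
        int slot = -1;
        if (installed < max_fds)
            for (int f = 0; f < FD_TABLE_SIZE; f++)
                if (!s_fdtab[f].in_use) { slot = f; break; }
        if (slot < 0) {
            staged[i].close(staged[i].priv);  /* can't install: release */
            continue;
        }
        s_fdtab[slot].in_use = 1;
        s_fdtab[slot].f      = staged[i];
        fd_out[installed++]  = slot;
    }
    return installed;
}
```

Closing uninstallable fds instead of silently dropping them is what keeps the underlying VFS objects from leaking when the receiver's table is full.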

VFS Integration

AF_UNIX sockets use SMAP-safe VFS ops that bounce through a 1024-byte kernel buffer:

static int unix_vfs_read(void *priv, void *buf, uint64_t off, uint64_t len) {
    uint32_t id = (uint32_t)(uintptr_t)priv;   /* priv carries the socket id */
    uint8_t kbuf[1024];
    (void)off;
    uint32_t want = len < sizeof(kbuf) ? (uint32_t)len : (uint32_t)sizeof(kbuf);
    int n = unix_sock_read(id, kbuf, want);
    if (n > 0) copy_to_user(buf, kbuf, (uint64_t)n);
    return n;
}

epoll (epoll.c)

Design

The epoll implementation provides scalable I/O event notification compatible with the Linux epoll API. It supports both level-triggered and edge-triggered modes.

Structures

#define EPOLL_MAX_INSTANCES  8
#define EPOLL_MAX_WATCHES    64

typedef struct {
    uint32_t fd;
    uint32_t events;
    uint64_t data;       /* user data */
    uint8_t  in_use;
} epoll_watch_t;

typedef struct {
    uint8_t        in_use;
    epoll_watch_t  watches[EPOLL_MAX_WATCHES];
    uint8_t        nwatches;
    aegis_task_t  *waiter_task;
    uint32_t       ready[EPOLL_MAX_WATCHES];
    uint8_t        nready;
} epoll_fd_t;

Event Flags

#define EPOLLIN   0x00000001U   /* readable */
#define EPOLLOUT  0x00000004U   /* writable */
#define EPOLLERR  0x00000008U   /* error */
#define EPOLLHUP  0x00000010U   /* hangup */
#define EPOLLET   0x80000000U   /* edge-triggered */

epoll_ctl

int epoll_ctl_impl(uint32_t epoll_id, int op, int fd, k_epoll_event_t *ev);

| Operation     | Effect |
|---------------|--------|
| EPOLL_CTL_ADD | Add a watch for fd. Returns -EEXIST if already watched. |
| EPOLL_CTL_DEL | Remove the watch. Returns -ENOENT if not found. |
| EPOLL_CTL_MOD | Update events/data for an existing watch. |

epoll_notify

Called from TCP and UDP when data or connection events occur:

void epoll_notify(uint32_t sock_id_as_fd, uint32_t events);

For each epoll instance, scans watches for matching fd and events. Adds to the ready list (with dedup) and wakes any blocked epoll_wait caller.

epoll_wait

int epoll_wait_impl(uint32_t epoll_id, uint64_t events_uptr,
                    int maxevents, uint32_t timeout_ticks);

The implementation:

  1. VFS poll sweep: For non-socket fds (pipes, console), calls the VFS poll op to check readiness. This enables epoll to monitor heterogeneous fd types.
  2. Ready check (under epoll_lock): If events are ready, copy them to userspace via copy_to_user().
  3. Edge-triggered: Entries with EPOLLET are removed from the ready list after delivery.
  4. Level-triggered: Entries remain in the ready list.
  5. Blocking: If no events and timeout > 0, set waiter_task and sched_block().

The atomicity fix (Bug C6): The nready check and waiter_task assignment are performed under epoll_lock to prevent a lost wakeup when epoll_notify fires between checking nready==0 and setting waiter_task.

epoll Event Structure

typedef struct __attribute__((packed)) {
    uint32_t events;
    uint64_t data;    /* user data (epoll_data_t union) */
} k_epoll_event_t;

_Static_assert(sizeof(k_epoll_event_t) == 12, "matches Linux ABI");

Socket API Summary

| Syscall      | AF_INET TCP        | AF_INET UDP       | AF_UNIX             |
|--------------|--------------------|-------------------|---------------------|
| socket()     | sock_alloc(STREAM) | sock_alloc(DGRAM) | unix_sock_alloc()   |
| bind()       | Set local_ip/port  | udp_bind()        | unix_sock_bind()    |
| listen()     | tcp_listen()       | N/A               | unix_sock_listen()  |
| accept()     | Pop accept queue   | N/A               | unix_sock_accept()  |
| connect()    | tcp_connect()      | Set remote addr   | unix_sock_connect() |
| send/write   | tcp_conn_send()    | N/A               | unix_sock_write()   |
| recv/read    | tcp_conn_recv()    | N/A               | unix_sock_read()    |
| sendto       | N/A                | udp_send()        | N/A                 |
| recvfrom     | N/A                | Pop UDP RX ring   | N/A                 |
| close()      | tcp_conn_close()   | udp_unbind()      | unix_sock_free()    |
| epoll_create | -                  | -                 | -                   |
| epoll_ctl    | epoll_ctl_impl()   | epoll_ctl_impl()  | N/A                 |
| epoll_wait   | epoll_wait_impl()  | epoll_wait_impl() | N/A                 |

Table Size Limits

| Resource            | Limit     | Defined In                      |
|---------------------|-----------|---------------------------------|
| AF_INET sockets     | 64        | socket.h SOCK_TABLE_SIZE        |
| TCP connections     | 32        | tcp.h TCP_MAX_CONNS             |
| TCP receive buffer  | 16 KB     | tcp.h TCP_RBUF_SIZE             |
| TCP send buffer     | 8 KB      | tcp.h TCP_SBUF_SIZE             |
| UDP bindings        | 16        | udp.h UDP_BINDINGS_MAX          |
| UDP RX ring slots   | 8         | socket.h UDP_RX_SLOTS           |
| UDP RX max datagram | 1500 B    | socket.h UDP_RX_MAXBUF          |
| AF_UNIX sockets     | 32        | unix_socket.h UNIX_SOCK_MAX     |
| AF_UNIX buffer size | 4056 B    | unix_socket.h UNIX_BUF_SIZE     |
| AF_UNIX path max    | 108 chars | unix_socket.h UNIX_PATH_MAX     |
| AF_UNIX passed fds  | 16        | unix_socket.h UNIX_PASSED_FD_MAX |
| epoll instances     | 8         | epoll.h EPOLL_MAX_INSTANCES     |
| epoll watches       | 64        | epoll.h EPOLL_MAX_WATCHES       |
| Network devices     | 4         | netdev.h NETDEV_MAX             |

Security Considerations

The socket layer is a critical attack surface – it processes attacker-controlled data (sockaddr structures, buffer lengths, fd numbers) in kernel context. Key v1 concerns:

  • AF_UNIX fd passing: unix_sock_stage_fds() copies VFS ops pointers between processes. A bug in this path could allow arbitrary kernel function pointer injection.
  • copy_from_user / copy_to_user: SMAP enforcement prevents direct kernel access to userspace memory, but the bounce-buffer pattern (1024 bytes for Unix, 1460 bytes for TCP) means buffer overflows in the staging buffers are a risk if length validation is incorrect.
  • Socket table as shared mutable state: 64-slot sock_t array accessed under sock_lock from both syscall and ISR context. Lock ordering bugs or missed lock acquisitions could corrupt socket state.
  • No capability checks on socket operations: The capability model does not yet gate network operations – any process can open sockets, bind to any port, or connect to any address. Integrating capability checks (CAP_KIND_NET) into the socket syscall path is planned future work.

The long-term plan includes migrating the socket layer to Rust as part of the broader kernel C-to-Rust migration. The capability system (kernel/cap/lib.rs) is the first kernel component already written in Rust, establishing the FFI patterns that will be extended to networking.