Socket API
Documentation of the Aegis OS socket layer: AF_INET, AF_UNIX, epoll, and VFS integration
The Aegis socket layer bridges userspace networking syscalls to the kernel’s TCP/UDP/Unix transport implementations. It provides three socket flavors across two address families (AF_INET TCP, AF_INET UDP, and AF_UNIX stream), an epoll I/O multiplexer, and VFS integration so sockets can be used with read(), write(), and close().
v1 maturity notice: The socket layer is the boundary between untrusted userspace and the kernel’s network stack. All code here is C, operating on user-supplied addresses and buffer pointers (with SMAP enforcement via
copy_from_user/copy_to_user). As v1 software, this code has not been audited and likely contains exploitable bugs at the syscall boundary – privilege escalation via crafted sockaddr structures, fd table corruption through race conditions, or information leaks through uninitialized buffer contents. These are expected realities for a from-scratch OS at this stage, not hypothetical concerns. Contributions are welcome – file issues or propose changes at exec/aegis.
AF_INET Socket Table (socket.c)
Socket Structure
#define SOCK_TABLE_SIZE 64
#define SOCK_NONE 0xFFFFFFFFU
typedef struct {
    sock_state_t state;
    uint8_t  type;            /* SOCK_TYPE_STREAM (1) or SOCK_TYPE_DGRAM (2) */
    uint8_t  nonblocking;
    ip4_addr_t local_ip;
    uint16_t local_port;
    ip4_addr_t remote_ip;
    uint16_t remote_port;
    uint32_t tcp_conn_id;     /* index into tcp_conn table; SOCK_NONE if none */
    /* accept queue: ring of completed tcp_conn_id values */
    uint32_t accept_queue[8];
    uint8_t  accept_head, accept_tail;
    /* UDP receive ring */
    udp_rx_slot_t udp_rx[UDP_RX_SLOTS];
    uint8_t  udp_rx_head, udp_rx_tail;
    /* blocking waiter */
    aegis_task_t *waiter_task;
    /* epoll back-reference */
    uint32_t epoll_id;
    uint64_t epoll_events;
    /* options */
    uint8_t  reuseaddr;
    uint8_t  broadcast;
    uint32_t rcvtimeo_ticks;
    uint32_t sndtimeo_ticks;
} sock_t;
Socket States
typedef enum {
    SOCK_FREE,        /* slot available */
    SOCK_CREATED,     /* allocated, not bound */
    SOCK_BOUND,       /* bound to local address */
    SOCK_LISTENING,   /* listening for connections (TCP) */
    SOCK_CONNECTING,  /* connect() in progress (TCP) */
    SOCK_CONNECTED,   /* established (TCP) */
    SOCK_CLOSED       /* connection terminated */
} sock_state_t;
Socket Lifecycle
socket() → sock_alloc() → SOCK_CREATED
bind() → SOCK_BOUND
listen() → SOCK_LISTENING [TCP server]
accept() → new socket in SOCK_CONNECTED
connect() → SOCK_CONNECTING → SOCK_CONNECTED [TCP client]
close() → sock_vfs_close() → sock_free()
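From userspace, the TCP half of this lifecycle maps onto the familiar POSIX calls. A hedged sketch, assuming Aegis’s musl-based libc exposes the standard API (`make_listener()` and the loopback/port choice are illustrative):

```c
/* Sketch: exercising the socket lifecycle from userspace.
 * Assumes the standard POSIX socket API; make_listener() is illustrative. */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int make_listener(uint16_t port)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);   /* sock_alloc() -> SOCK_CREATED */
    if (fd < 0) return -1;

    int one = 1;
    setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &one, sizeof(one));

    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_port = htons(port);
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);

    if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||  /* SOCK_BOUND */
        listen(fd, 8) < 0) {                                     /* SOCK_LISTENING */
        close(fd);
        return -1;
    }
    return fd;   /* accept() would now yield SOCK_CONNECTED sockets */
}
```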
VFS Integration
Each AF_INET socket is backed by a VFS file descriptor through vfs_ops_t:
static const vfs_ops_t s_sock_ops = {
    .read    = sock_vfs_read,
    .write   = sock_vfs_write,
    .close   = sock_vfs_close,
    .readdir = NULL,
    .dup     = sock_vfs_dup,
    .stat    = sock_vfs_stat,
    .poll    = NULL,
};
| VFS Op | Behavior |
|---|---|
| read | TCP: blocking recv from the receive buffer; returns data, 0 (EOF), or -EAGAIN (nonblocking). UDP: returns -ENOSYS (use recvfrom). |
| write | TCP: sends via tcp_conn_send(), chunked into 1460-byte MSS segments; copies from userspace via copy_from_user() (SMAP). UDP: returns -ENOSYS (use sendto). |
| close | UDP: calls udp_unbind() to release the port. Then sock_free(). |
| dup | No-op (sockets have no refcount). |
| stat | Returns st_mode = S_IFSOCK \| 0666. |
TCP Read (Blocking)
The TCP read path implements blocking semantics with a careful wakeup protocol:
static int sock_vfs_read(void *priv, void *buf, uint64_t off, uint64_t len)
{
    sock_t *s = (sock_t *)priv;
    uint32_t want = (uint32_t)len;
    (void)off;

    for (;;) {
        /* Set waiter BEFORE checking data -- prevents lost wakeup */
        s->waiter_task = (aegis_task_t *)sched_current();
        int avail = tcp_conn_recv(s->tcp_conn_id, NULL, 0); /* peek */
        if (avail > 0) {
            s->waiter_task = NULL;
            return tcp_conn_recv(s->tcp_conn_id, buf, want);
        }
        /* Check for EOF (FIN received) */
        tcp_conn_t *tc = tcp_conn_get(s->tcp_conn_id);
        if (!tc || tc->state == TCP_CLOSE_WAIT ||
            tc->state == TCP_CLOSED || tc->state == TCP_TIME_WAIT) {
            s->waiter_task = NULL;
            return 0; /* EOF */
        }
        if (s->nonblocking) {
            s->waiter_task = NULL;
            return -11; /* EAGAIN */
        }
        sched_block();
    }
}
The waiter_task is set before checking for data to prevent the lost-wakeup race: if sock_wake() fires between the peek and sched_block() while waiter_task is NULL, the wakeup would be silently lost.
TCP Write (SMAP-Safe)
The write path bounces through a 1460-byte kernel staging buffer to avoid SMAP faults when passing userspace pointers to tcp_conn_send():
static uint8_t s_sndbuf[1460];

uint64_t sent = 0;
while (sent < len) {
    uint32_t chunk = (uint32_t)(len - sent);
    if (chunk > 1460) chunk = 1460;
    copy_from_user(s_sndbuf, (const uint8_t *)buf + sent, chunk);
    int n = tcp_conn_send(s->tcp_conn_id, s_sndbuf, chunk);
    if (n <= 0) return sent > 0 ? (int)sent : -32; /* EPIPE */
    sent += (uint64_t)n;
}
UDP Receive Ring
UDP datagrams are buffered in an 8-slot ring within each socket:
#define UDP_RX_SLOTS 8
#define UDP_RX_MAXBUF 1500
typedef struct {
    uint8_t    data[UDP_RX_MAXBUF];
    uint16_t   len;
    ip4_addr_t src_ip;
    uint16_t   src_port;
    uint8_t    in_use;
} udp_rx_slot_t;
The udp_rx() handler in udp.c writes datagrams into this ring. The recvfrom syscall reads from it.
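The ring mechanics can be sketched as self-contained C. `udp_rx_enqueue()` and `udp_rx_dequeue()` are illustrative names, not the kernel’s actual symbols; the real logic lives in udp.c and the recvfrom path:

```c
/* Sketch: enqueue/dequeue on the 8-slot UDP receive ring.
 * Function names are illustrative; drop-on-full matches the doc's slot model. */
#include <stdint.h>
#include <string.h>

#define UDP_RX_SLOTS  8
#define UDP_RX_MAXBUF 1500

typedef uint32_t ip4_addr_t;

typedef struct {
    uint8_t    data[UDP_RX_MAXBUF];
    uint16_t   len;
    ip4_addr_t src_ip;
    uint16_t   src_port;
    uint8_t    in_use;
} udp_rx_slot_t;

typedef struct {
    udp_rx_slot_t udp_rx[UDP_RX_SLOTS];
    uint8_t udp_rx_head, udp_rx_tail;
} sock_udp_t;

/* RX path: drops the datagram if the ring is full or the packet oversized. */
static int udp_rx_enqueue(sock_udp_t *s, const void *pkt, uint16_t len,
                          ip4_addr_t src_ip, uint16_t src_port)
{
    udp_rx_slot_t *slot = &s->udp_rx[s->udp_rx_head];
    if (slot->in_use || len > UDP_RX_MAXBUF)
        return -1;
    memcpy(slot->data, pkt, len);
    slot->len = len;
    slot->src_ip = src_ip;
    slot->src_port = src_port;
    slot->in_use = 1;
    s->udp_rx_head = (uint8_t)((s->udp_rx_head + 1) % UDP_RX_SLOTS);
    return 0;
}

/* recvfrom path: returns the datagram length, or -1 if the ring is empty. */
static int udp_rx_dequeue(sock_udp_t *s, void *buf, uint16_t buflen,
                          ip4_addr_t *src_ip, uint16_t *src_port)
{
    udp_rx_slot_t *slot = &s->udp_rx[s->udp_rx_tail];
    if (!slot->in_use)
        return -1;
    uint16_t n = slot->len < buflen ? slot->len : buflen;
    memcpy(buf, slot->data, n);
    if (src_ip)   *src_ip = slot->src_ip;
    if (src_port) *src_port = slot->src_port;
    slot->in_use = 0;
    s->udp_rx_tail = (uint8_t)((s->udp_rx_tail + 1) % UDP_RX_SLOTS);
    return (int)n;
}
```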
Sockaddr Layout
The kernel uses a k_sockaddr_in_t matching the musl libc struct sockaddr_in layout:
typedef struct {
    uint16_t sin_family;   /* AF_INET */
    uint16_t sin_port;     /* network byte order */
    uint32_t sin_addr;     /* network byte order */
    uint8_t  sin_zero[8];
} k_sockaddr_in_t;

_Static_assert(sizeof(k_sockaddr_in_t) == 16, "...");
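Validation of this struct at the syscall boundary might look like the following self-contained sketch. `sockaddr_check()`, `net_ntohs()`, and the error constants are illustrative, not the kernel’s actual symbols; in the real path the struct arrives via copy_from_user():

```c
/* Sketch: validating a user-supplied sockaddr in the bind() path.
 * Helper and constant names are illustrative assumptions. */
#include <stdint.h>

#define K_AF_INET 2
#define K_EINVAL  22

typedef struct {
    uint16_t sin_family;   /* AF_INET */
    uint16_t sin_port;     /* network byte order */
    uint32_t sin_addr;     /* network byte order */
    uint8_t  sin_zero[8];
} k_sockaddr_in_t;

_Static_assert(sizeof(k_sockaddr_in_t) == 16, "musl sockaddr_in layout");

/* Byte swap for the 16-bit port field (little-endian host assumed). */
static uint16_t net_ntohs(uint16_t v) { return (uint16_t)((v >> 8) | (v << 8)); }

/* Returns the host-order port on success, -EINVAL on a bad address. */
static int sockaddr_check(const k_sockaddr_in_t *sa, uint64_t addrlen)
{
    if (addrlen < sizeof(*sa))        return -K_EINVAL;  /* short struct */
    if (sa->sin_family != K_AF_INET)  return -K_EINVAL;  /* wrong family */
    return (int)net_ntohs(sa->sin_port);
}
```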
AF_UNIX Domain Sockets (unix_socket.c)
AF_UNIX sockets provide local IPC with connected, bidirectional, byte-stream semantics. They support fd passing (SCM_RIGHTS), peer credential retrieval, and VFS-backed file descriptors.
Socket Structure
#define UNIX_SOCK_MAX 32
#define UNIX_PATH_MAX 108
#define UNIX_BUF_SIZE 4056 /* fits in one kva page */
typedef struct {
    uint8_t  in_use;
    unix_state_t state;
    uint8_t  nonblocking;
    char     path[UNIX_PATH_MAX];
    /* Ring buffer -- this socket's TX direction (peer reads from it) */
    uint8_t *ring;        /* kva-allocated page */
    uint16_t ring_head;   /* write position */
    uint16_t ring_tail;   /* read position */
    /* Peer link */
    uint32_t peer_id;
    /* Accept queue (listening sockets) */
    uint32_t accept_queue[8];
    uint8_t  accept_head, accept_tail;
    /* Blocking */
    aegis_task_t *waiter_task;
    /* Peer credentials */
    uint32_t peer_pid, peer_uid, peer_gid;
    /* fd passing staging area */
    unix_passed_fd_t passed_fds[UNIX_PASSED_FD_MAX]; /* 16 slots */
    uint8_t  passed_fd_count;
    /* Refcount for dup/fork */
    uint32_t refcount;
} unix_sock_t;
States
typedef enum {
    UNIX_FREE, UNIX_CREATED, UNIX_BOUND, UNIX_LISTENING,
    UNIX_CONNECTING, UNIX_CONNECTED, UNIX_CLOSED
} unix_state_t;
Name Table
Bound sockets are registered in a static name table mapping paths to socket IDs:
#define UNIX_NAME_MAX 32
typedef struct {
    char     path[UNIX_PATH_MAX];
    uint32_t sock_id;
    uint8_t  in_use;
} unix_name_t;
- `name_register()`: Registers a path; returns `-EADDRINUSE` if already bound
- `name_unregister()`: Removes the binding (called on socket close)
- `name_lookup()`: Returns the socket ID for a path, or `UNIX_NONE`
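A self-contained sketch of these three operations over the table above (the table-full return value and exact duplicate-path semantics are assumptions):

```c
/* Sketch of the name table operations described above.
 * Error values beyond -EADDRINUSE are illustrative. */
#include <stdint.h>
#include <string.h>

#define UNIX_PATH_MAX 108
#define UNIX_NAME_MAX 32
#define UNIX_NONE     0xFFFFFFFFU
#define K_EADDRINUSE  98

typedef struct {
    char     path[UNIX_PATH_MAX];
    uint32_t sock_id;
    uint8_t  in_use;
} unix_name_t;

static unix_name_t s_names[UNIX_NAME_MAX];

static int name_register(const char *path, uint32_t sock_id)
{
    int free_slot = -1;
    for (int i = 0; i < UNIX_NAME_MAX; i++) {
        if (s_names[i].in_use) {
            if (strncmp(s_names[i].path, path, UNIX_PATH_MAX) == 0)
                return -K_EADDRINUSE;        /* path already bound */
        } else if (free_slot < 0) {
            free_slot = i;
        }
    }
    if (free_slot < 0) return -1;            /* table full */
    strncpy(s_names[free_slot].path, path, UNIX_PATH_MAX - 1);
    s_names[free_slot].path[UNIX_PATH_MAX - 1] = '\0';
    s_names[free_slot].sock_id = sock_id;
    s_names[free_slot].in_use = 1;
    return 0;
}

static uint32_t name_lookup(const char *path)
{
    for (int i = 0; i < UNIX_NAME_MAX; i++)
        if (s_names[i].in_use &&
            strncmp(s_names[i].path, path, UNIX_PATH_MAX) == 0)
            return s_names[i].sock_id;
    return UNIX_NONE;
}

static void name_unregister(uint32_t sock_id)
{
    for (int i = 0; i < UNIX_NAME_MAX; i++)
        if (s_names[i].in_use && s_names[i].sock_id == sock_id)
            s_names[i].in_use = 0;
}
```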
Connection Flow
Client Server
------ ------
socket(AF_UNIX, STREAM, 0) socket(AF_UNIX, STREAM, 0)
bind("/run/service.sock")
listen()
connect("/run/service.sock")
| accept() [blocks]
+-- name_lookup → listener_id
+-- Allocate server-side socket
+-- Allocate ring buffers (2 pages)
+-- Cross-link peer_ids
+-- Enqueue in listener accept queue
+-- Wake listener
| |
v v
UNIX_CONNECTED UNIX_CONNECTED (new fd)
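The server half of this flow maps onto the standard AF_UNIX calls from userspace. A minimal sketch, assuming the usual POSIX API (`unix_listener()`, the path, and the backlog are illustrative):

```c
/* Sketch: creating the listening side of the connection flow above.
 * unix_listener() is an illustrative helper, not part of Aegis. */
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

int unix_listener(const char *path)
{
    int fd = socket(AF_UNIX, SOCK_STREAM, 0);
    if (fd < 0) return -1;

    struct sockaddr_un addr;
    memset(&addr, 0, sizeof(addr));
    addr.sun_family = AF_UNIX;
    strncpy(addr.sun_path, path, sizeof(addr.sun_path) - 1);

    unlink(path);                  /* drop a stale binding, if any */
    if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||  /* name_register */
        listen(fd, 8) < 0) {
        close(fd);
        return -1;
    }
    return fd;                     /* accept() now blocks for connect()s */
}
```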
Ring Buffer Design
Each connected Unix socket has its own TX ring buffer (a single kva_alloc_pages(1) page, 4056 usable bytes). The peer reads from this ring:
Socket A Socket B
+---------+ +---------+
| ring_a | --- read ----> | peer |
| (A's TX)| | reads |
+---------+ +---------+
| ring_b | --- read ----> Socket A reads
| (B's TX)|
+---------+
- Write: A writes to its own `ring` at `ring_head`
- Read: A reads from the peer’s `ring` at the peer’s `ring_tail`
Ring buffer functions:
static uint16_t ring_used(unix_sock_t *s) {
    /* UNIX_BUF_SIZE (4056) is not a power of two, so indices must wrap
     * with modulo arithmetic -- a bitmask of (UNIX_BUF_SIZE - 1) would
     * compute wrong offsets here. */
    return (uint16_t)((s->ring_head + UNIX_BUF_SIZE - s->ring_tail) % UNIX_BUF_SIZE);
}
static uint16_t ring_free(unix_sock_t *s) {
    /* One slot is kept empty to distinguish full from empty. */
    return (uint16_t)(UNIX_BUF_SIZE - 1 - ring_used(s));
}
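Assuming helpers of this shape, a byte-stream write/read pair might look like the following self-contained sketch. `ring_write()`/`ring_read()` are illustrative names; since UNIX_BUF_SIZE (4056) is not a power of two, positions wrap with modulo rather than a bitmask:

```c
/* Sketch: byte-stream transfer over the Unix ring. Names are illustrative. */
#include <stdint.h>

#define UNIX_BUF_SIZE 4056

typedef struct {
    uint8_t  ring[UNIX_BUF_SIZE];
    uint16_t ring_head;   /* write position */
    uint16_t ring_tail;   /* read position */
} ring_t;

static uint16_t ring_used(const ring_t *r) {
    return (uint16_t)((r->ring_head + UNIX_BUF_SIZE - r->ring_tail) % UNIX_BUF_SIZE);
}
static uint16_t ring_free(const ring_t *r) {
    return (uint16_t)(UNIX_BUF_SIZE - 1 - ring_used(r)); /* one slot kept empty */
}

/* Copies up to `len` bytes in; returns the number actually written. */
static uint16_t ring_write(ring_t *r, const void *buf, uint16_t len)
{
    uint16_t avail = ring_free(r);
    uint16_t n = avail < len ? avail : len;
    for (uint16_t i = 0; i < n; i++) {
        r->ring[r->ring_head] = ((const uint8_t *)buf)[i];
        r->ring_head = (uint16_t)((r->ring_head + 1) % UNIX_BUF_SIZE);
    }
    return n;
}

/* Copies up to `len` buffered bytes out; returns the number read. */
static uint16_t ring_read(ring_t *r, void *buf, uint16_t len)
{
    uint16_t have = ring_used(r);
    uint16_t n = have < len ? have : len;
    for (uint16_t i = 0; i < n; i++) {
        ((uint8_t *)buf)[i] = r->ring[r->ring_tail];
        r->ring_tail = (uint16_t)((r->ring_tail + 1) % UNIX_BUF_SIZE);
    }
    return n;
}
```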
Close and Ring Lifetime
Ring buffer ownership is carefully managed across close:
- When socket A closes but peer B is still alive: A’s ring is not freed, because B still reads from it. The ring pointer remains valid even after `in_use = 0`.
- When B subsequently closes: B frees both its own ring and A’s orphaned ring.
- This prevents use-after-free while allowing the peer to drain remaining buffered data after the sender closes.
Peer Credentials
Credentials are captured at connect() time:
/* Client credentials → server-side socket */
s_unix[server_id].peer_pid = proc->pid;
s_unix[server_id].peer_uid = proc->uid;
s_unix[server_id].peer_gid = proc->gid;
/* Server credentials → client socket (filled at accept time) */
client->peer_pid = accepting_proc->pid;
Retrieved via unix_sock_peercred():
int unix_sock_peercred(uint32_t id, uint32_t *pid, uint32_t *uid, uint32_t *gid);
fd Passing (SCM_RIGHTS)
Unix sockets support passing file descriptors between processes:
Staging (sender side):
int unix_sock_stage_fds(uint32_t peer_id, unix_passed_fd_t *fds, uint8_t count);
Copies VFS ops/priv/flags into the peer’s staging area (up to 16 fds).
Receiving (receiver side):
int unix_sock_recv_fds(uint32_t id, int *fd_out, int max_fds);
Installs staged fds into the receiving process’s fd table. Any fds that cannot be installed (fd table full) are closed via their ops->close callback.
On socket close, any unreceived staged fds are also cleaned up.
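From the application’s point of view, this is the standard SCM_RIGHTS control-message protocol. A hedged userspace sketch of both ends, passing a single fd over a connected Unix socket (`send_fd()`/`recv_fd()` are illustrative helpers; kernel-side, the send path would stage via unix_sock_stage_fds() and the receive path install via unix_sock_recv_fds()):

```c
/* Sketch: SCM_RIGHTS fd passing from userspace. Helper names are illustrative. */
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <unistd.h>

/* Send one fd across a connected Unix socket. Returns 0 on success. */
int send_fd(int sock, int fd_to_pass)
{
    char byte = 'F';
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union { char buf[CMSG_SPACE(sizeof(int))]; struct cmsghdr align; } u;
    memset(&u, 0, sizeof(u));

    struct msghdr msg = {0};
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = u.buf;
    msg.msg_controllen = sizeof(u.buf);

    struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &fd_to_pass, sizeof(int));

    return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Receive one fd; returns the installed fd, or -1 on error. */
int recv_fd(int sock)
{
    char byte;
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union { char buf[CMSG_SPACE(sizeof(int))]; struct cmsghdr align; } u;

    struct msghdr msg = {0};
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = u.buf;
    msg.msg_controllen = sizeof(u.buf);

    if (recvmsg(sock, &msg, 0) != 1) return -1;
    struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);
    if (!cmsg || cmsg->cmsg_type != SCM_RIGHTS) return -1;

    int fd;
    memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
    return fd;
}
```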
VFS Integration
AF_UNIX sockets use SMAP-safe VFS ops that bounce through a 1024-byte kernel buffer:
static int unix_vfs_read(void *priv, void *buf, uint64_t off, uint64_t len) {
    uint8_t kbuf[1024];
    uint32_t id = (uint32_t)(uintptr_t)priv;   /* socket id stored in priv */
    uint32_t want = len < sizeof(kbuf) ? (uint32_t)len : (uint32_t)sizeof(kbuf);
    (void)off;
    int n = unix_sock_read(id, kbuf, want);
    if (n > 0) copy_to_user(buf, kbuf, (uint64_t)n);
    return n;
}
epoll (epoll.c)
Design
The epoll implementation provides scalable I/O event notification compatible with the Linux epoll API. It supports both level-triggered and edge-triggered modes.
Structures
#define EPOLL_MAX_INSTANCES 8
#define EPOLL_MAX_WATCHES 64
typedef struct {
    uint32_t fd;
    uint32_t events;
    uint64_t data;     /* user data */
    uint8_t  in_use;
} epoll_watch_t;

typedef struct {
    uint8_t  in_use;
    epoll_watch_t watches[EPOLL_MAX_WATCHES];
    uint8_t  nwatches;
    aegis_task_t *waiter_task;
    uint32_t ready[EPOLL_MAX_WATCHES];
    uint8_t  nready;
} epoll_fd_t;
Event Flags
#define EPOLLIN 0x00000001U /* readable */
#define EPOLLOUT 0x00000004U /* writable */
#define EPOLLERR 0x00000008U /* error */
#define EPOLLHUP 0x00000010U /* hangup */
#define EPOLLET 0x80000000U /* edge-triggered */
epoll_ctl
int epoll_ctl_impl(uint32_t epoll_id, int op, int fd, k_epoll_event_t *ev);
| Operation | Effect |
|---|---|
| EPOLL_CTL_ADD | Add a watch for fd. Returns -EEXIST if already watched. |
| EPOLL_CTL_DEL | Remove the watch. Returns -ENOENT if not found. |
| EPOLL_CTL_MOD | Update events/data for an existing watch. |
epoll_notify
Called from TCP and UDP when data or connection events occur:
void epoll_notify(uint32_t sock_id_as_fd, uint32_t events);
For each epoll instance, scans watches for matching fd and events. Adds to the ready list (with dedup) and wakes any blocked epoll_wait caller.
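The scan-and-dedup step can be modeled as self-contained C (`epoll_mark_ready()` is an illustrative name; the epoll_lock acquisition and the wakeup of a blocked waiter are omitted):

```c
/* Sketch of the ready-list update inside epoll_notify(): find watches
 * matching the fd, then append to the ready list with dedup. */
#include <stdint.h>

#define EPOLL_MAX_WATCHES 64

typedef struct {
    uint32_t fd;
    uint32_t events;
    uint8_t  in_use;
} epoll_watch_t;

typedef struct {
    epoll_watch_t watches[EPOLL_MAX_WATCHES];
    uint32_t ready[EPOLL_MAX_WATCHES];
    uint8_t  nready;
} epoll_fd_t;

static void epoll_mark_ready(epoll_fd_t *ep, uint32_t fd, uint32_t events)
{
    for (uint32_t w = 0; w < EPOLL_MAX_WATCHES; w++) {
        if (!ep->watches[w].in_use || ep->watches[w].fd != fd)
            continue;
        if (!(ep->watches[w].events & events))
            continue;                       /* watcher not interested */
        /* dedup: skip if this watch is already queued */
        int queued = 0;
        for (uint8_t i = 0; i < ep->nready; i++)
            if (ep->ready[i] == w) { queued = 1; break; }
        if (!queued && ep->nready < EPOLL_MAX_WATCHES)
            ep->ready[ep->nready++] = w;
    }
}
```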
epoll_wait
int epoll_wait_impl(uint32_t epoll_id, uint64_t events_uptr,
int maxevents, uint32_t timeout_ticks);
The implementation:
- VFS poll sweep: For non-socket fds (pipes, console), calls the VFS `poll` op to check readiness. This enables epoll to monitor heterogeneous fd types.
- Ready check (under `epoll_lock`): If events are ready, copy them to userspace via `copy_to_user()`.
- Edge-triggered: Entries with `EPOLLET` are removed from the ready list after delivery.
- Level-triggered: Entries remain in the ready list.
- Blocking: If no events and timeout > 0, set `waiter_task` and `sched_block()`.
The atomicity fix (Bug C6): The nready check and waiter_task assignment are performed under epoll_lock to prevent a lost wakeup when epoll_notify fires between checking nready==0 and setting waiter_task.
epoll Event Structure
typedef struct __attribute__((packed)) {
    uint32_t events;
    uint64_t data;   /* user data (epoll_data_t union) */
} k_epoll_event_t;

_Static_assert(sizeof(k_epoll_event_t) == 12, "matches Linux ABI");
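From userspace, the Linux-compatible calls this implementation mirrors are driven in the usual create/ctl/wait sequence. A hedged sketch (`wait_readable()` is an illustrative helper; shown with level-triggered EPOLLIN):

```c
/* Sketch: driving the epoll API from userspace with the Linux-compatible
 * calls the Aegis implementation mirrors. wait_readable() is illustrative. */
#include <sys/epoll.h>
#include <unistd.h>

/* Wait until `fd` is readable or `timeout_ms` elapses.
 * Returns 1 if readable, 0 on timeout, -1 on error. */
int wait_readable(int fd, int timeout_ms)
{
    int ep = epoll_create1(0);
    if (ep < 0) return -1;

    struct epoll_event ev = { .events = EPOLLIN, .data.fd = fd };
    if (epoll_ctl(ep, EPOLL_CTL_ADD, fd, &ev) < 0) { close(ep); return -1; }

    struct epoll_event out;
    int n = epoll_wait(ep, &out, 1, timeout_ms);
    close(ep);
    return n < 0 ? -1 : n;
}
```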
Socket API Summary
| Syscall | AF_INET TCP | AF_INET UDP | AF_UNIX |
|---|---|---|---|
| socket() | sock_alloc(STREAM) | sock_alloc(DGRAM) | unix_sock_alloc() |
| bind() | Set local_ip/port | udp_bind() | unix_sock_bind() |
| listen() | tcp_listen() | N/A | unix_sock_listen() |
| accept() | Pop accept queue | N/A | unix_sock_accept() |
| connect() | tcp_connect() | Set remote addr | unix_sock_connect() |
| send/write | tcp_conn_send() | N/A | unix_sock_write() |
| recv/read | tcp_conn_recv() | N/A | unix_sock_read() |
| sendto | N/A | udp_send() | N/A |
| recvfrom | N/A | Pop UDP RX ring | N/A |
| close() | tcp_conn_close() | udp_unbind() | unix_sock_free() |
| epoll_create | - | - | - |
| epoll_ctl | epoll_ctl_impl() | epoll_ctl_impl() | N/A |
| epoll_wait | epoll_wait_impl() | epoll_wait_impl() | N/A |
Table Size Limits
| Resource | Limit | Defined In |
|---|---|---|
| AF_INET sockets | 64 | socket.h SOCK_TABLE_SIZE |
| TCP connections | 32 | tcp.h TCP_MAX_CONNS |
| TCP receive buffer | 16 KB | tcp.h TCP_RBUF_SIZE |
| TCP send buffer | 8 KB | tcp.h TCP_SBUF_SIZE |
| UDP bindings | 16 | udp.h UDP_BINDINGS_MAX |
| UDP RX ring slots | 8 | socket.h UDP_RX_SLOTS |
| UDP RX max datagram | 1500 B | socket.h UDP_RX_MAXBUF |
| AF_UNIX sockets | 32 | unix_socket.h UNIX_SOCK_MAX |
| AF_UNIX buffer size | 4056 B | unix_socket.h UNIX_BUF_SIZE |
| AF_UNIX path max | 108 chars | unix_socket.h UNIX_PATH_MAX |
| AF_UNIX passed fds | 16 | unix_socket.h UNIX_PASSED_FD_MAX |
| epoll instances | 8 | epoll.h EPOLL_MAX_INSTANCES |
| epoll watches | 64 | epoll.h EPOLL_MAX_WATCHES |
| Network devices | 4 | netdev.h NETDEV_MAX |
Security Considerations
The socket layer is a critical attack surface – it processes attacker-controlled data (sockaddr structures, buffer lengths, fd numbers) in kernel context. Key v1 concerns:
- AF_UNIX fd passing: `unix_sock_stage_fds()` copies VFS ops pointers between processes. A bug in this path could allow arbitrary kernel function pointer injection.
- copy_from_user/copy_to_user: SMAP enforcement prevents direct kernel access to userspace memory, but the bounce-buffer pattern (1024 bytes for Unix, 1460 bytes for TCP) means buffer overflows in the staging buffers are a risk if length validation is incorrect.
- Socket table as shared mutable state: The 64-slot `sock_t` array is accessed under `sock_lock` from both syscall and ISR context. Lock-ordering bugs or missed lock acquisitions could corrupt socket state.
- No capability checks on socket operations: The capability model does not yet gate network operations; any process can open sockets, bind to any port, or connect to any address. Integrating capability checks (`CAP_KIND_NET`) into the socket syscall path is planned future work.
The long-term plan includes migrating the socket layer to Rust as part of the broader kernel C-to-Rust migration. The capability system (kernel/cap/lib.rs) is the first kernel component already written in Rust, establishing the FFI patterns that will be extended to networking.
Related Documentation
- Network Stack Overview – architecture and packet flow
- TCP/IP Implementation – protocol internals
- Device Drivers – NIC driver details
- Syscall Interface – syscall numbers and signatures