Socket API
Documentation of the Aegis OS socket layer: AF_INET, AF_UNIX, epoll, and VFS integration
The Aegis socket layer bridges userspace networking syscalls to the kernel’s TCP/UDP/Unix transport implementations. It provides three socket flavors across two address families (AF_INET TCP, AF_INET UDP, and AF_UNIX stream), an epoll I/O multiplexer, and VFS integration so sockets can be used with read(), write(), and close().
v1 maturity notice: The socket layer is the boundary between untrusted userspace and the kernel’s network stack. All code here is C, operating on user-supplied addresses and buffer pointers (with SMAP enforcement via
copy_from_user/copy_to_user). As v1 software, this code has not been audited and likely contains exploitable bugs at the syscall boundary – privilege escalation via crafted sockaddr structures, fd table corruption through race conditions, or information leaks through uninitialized buffer contents. These are expected realities for a from-scratch OS at this stage, not hypothetical concerns. Contributions are welcome – file issues or propose changes at exec/aegis.
AF_INET Socket Table (socket.c)
Socket Structure
#define SOCK_TABLE_SIZE 64
#define SOCK_NONE 0xFFFFFFFFU
typedef struct {
    sock_state_t state;
    uint8_t  type;            /* SOCK_TYPE_STREAM (1) or SOCK_TYPE_DGRAM (2) */
    uint8_t  nonblocking;
    ip4_addr_t local_ip;
    uint16_t local_port;
    ip4_addr_t remote_ip;
    uint16_t remote_port;
    uint32_t tcp_conn_id;     /* index into tcp_conn table; SOCK_NONE if none */
    /* accept queue: ring of completed tcp_conn_id values */
    uint32_t accept_queue[8];
    uint8_t  accept_head, accept_tail;
    /* UDP receive ring */
    udp_rx_slot_t udp_rx[UDP_RX_SLOTS];
    uint8_t  udp_rx_head, udp_rx_tail;
    /* blocking waiter */
    aegis_task_t *waiter_task;
    /* epoll back-reference */
    uint32_t epoll_id;
    uint64_t epoll_events;
    /* options */
    uint8_t  reuseaddr;
    uint8_t  broadcast;
    uint32_t rcvtimeo_ticks;
    uint32_t sndtimeo_ticks;
} sock_t;
Socket States
typedef enum {
    SOCK_FREE,        /* slot available */
    SOCK_CREATED,     /* allocated, not bound */
    SOCK_BOUND,       /* bound to local address */
    SOCK_LISTENING,   /* listening for connections (TCP) */
    SOCK_CONNECTING,  /* connect() in progress (TCP) */
    SOCK_CONNECTED,   /* established (TCP) */
    SOCK_CLOSED       /* connection terminated */
} sock_state_t;
Socket Lifecycle
socket() → sock_alloc() → SOCK_CREATED
bind() → SOCK_BOUND
listen() → SOCK_LISTENING [TCP server]
accept() → new socket in SOCK_CONNECTED
connect() → SOCK_CONNECTING → SOCK_CONNECTED [TCP client]
close() → sock_vfs_close() → sock_free()
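From userspace, the TCP half of this lifecycle maps onto the familiar POSIX calls. A hedged sketch, assuming Aegis’s musl-based libc exposes the standard API (`make_listener()` and the loopback/port choice are illustrative):

```c
/* Sketch: exercising the socket lifecycle from userspace.
 * Assumes the standard POSIX socket API; make_listener() is illustrative. */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int make_listener(uint16_t port)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);   /* sock_alloc() -> SOCK_CREATED */
    if (fd < 0) return -1;

    int one = 1;
    setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &one, sizeof(one));

    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_port = htons(port);
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);

    if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||  /* SOCK_BOUND */
        listen(fd, 8) < 0) {                                     /* SOCK_LISTENING */
        close(fd);
        return -1;
    }
    return fd;   /* accept() would now yield SOCK_CONNECTED sockets */
}
```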
VFS Integration
Each AF_INET socket is backed by a VFS file descriptor through vfs_ops_t:
static const vfs_ops_t s_sock_ops = {
    .read    = sock_vfs_read,
    .write   = sock_vfs_write,
    .close   = sock_vfs_close,
    .readdir = NULL,
    .dup     = sock_vfs_dup,
    .stat    = sock_vfs_stat,
    .poll    = NULL,
};
| VFS Op | Behavior |
|---|---|
| read | TCP: blocking recv from the receive buffer; returns data, 0 (EOF), or -EAGAIN (nonblocking). UDP: returns -ENOSYS (use recvfrom). |
| write | TCP: sends via tcp_conn_send(), chunked into 1460-byte MSS segments; copies from userspace via copy_from_user() (SMAP). UDP: returns -ENOSYS (use sendto). |
| close | UDP: calls udp_unbind() to release the port. Then sock_free(). |
| dup | No-op (sockets have no refcount). |
| stat | Returns st_mode = S_IFSOCK \| 0666. |
TCP Read (Blocking)
The TCP read path implements blocking semantics with a careful wakeup protocol:
static int sock_vfs_read(void *priv, void *buf, uint64_t off, uint64_t len)
{
    sock_t *s = (sock_t *)priv;
    uint32_t want = (uint32_t)len;
    (void)off;

    for (;;) {
        /* Set waiter BEFORE checking data -- prevents lost wakeup */
        s->waiter_task = (aegis_task_t *)sched_current();
        int avail = tcp_conn_recv(s->tcp_conn_id, NULL, 0); /* peek */
        if (avail > 0) {
            s->waiter_task = NULL;
            return tcp_conn_recv(s->tcp_conn_id, buf, want);
        }
        /* Check for EOF (FIN received) */
        tcp_conn_t *tc = tcp_conn_get(s->tcp_conn_id);
        if (!tc || tc->state == TCP_CLOSE_WAIT ||
            tc->state == TCP_CLOSED || tc->state == TCP_TIME_WAIT) {
            s->waiter_task = NULL;
            return 0; /* EOF */
        }
        if (s->nonblocking) {
            s->waiter_task = NULL;
            return -11; /* EAGAIN */
        }
        sched_block();
    }
}
The waiter_task is set before checking for data to prevent the lost-wakeup race: if sock_wake() fires between the peek and sched_block() while waiter_task is NULL, the wakeup would be silently lost.
TCP Write (SMAP-Safe)
The write path bounces through a 1460-byte kernel staging buffer to avoid SMAP faults when passing userspace pointers to tcp_conn_send():
static uint8_t s_sndbuf[1460];

uint64_t sent = 0;
while (sent < len) {
    uint32_t chunk = (uint32_t)(len - sent);
    if (chunk > 1460) chunk = 1460;
    copy_from_user(s_sndbuf, (const uint8_t *)buf + sent, chunk);
    int n = tcp_conn_send(s->tcp_conn_id, s_sndbuf, chunk);
    if (n <= 0) return sent > 0 ? (int)sent : -32; /* EPIPE */
    sent += (uint64_t)n;
}
UDP Receive Ring
UDP datagrams are buffered in an 8-slot ring within each socket:
#define UDP_RX_SLOTS 8
#define UDP_RX_MAXBUF 1500
typedef struct {
    uint8_t    data[UDP_RX_MAXBUF];
    uint16_t   len;
    ip4_addr_t src_ip;
    uint16_t   src_port;
    uint8_t    in_use;
} udp_rx_slot_t;
The udp_rx() handler in udp.c writes datagrams into this ring. The recvfrom syscall reads from it.
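The ring mechanics can be sketched as self-contained C. `udp_rx_enqueue()` and `udp_rx_dequeue()` are illustrative names, not the kernel’s actual symbols; the real logic lives in udp.c and the recvfrom path:

```c
/* Sketch: enqueue/dequeue on the 8-slot UDP receive ring.
 * Function names are illustrative; drop-on-full matches the doc's slot model. */
#include <stdint.h>
#include <string.h>

#define UDP_RX_SLOTS  8
#define UDP_RX_MAXBUF 1500

typedef uint32_t ip4_addr_t;

typedef struct {
    uint8_t    data[UDP_RX_MAXBUF];
    uint16_t   len;
    ip4_addr_t src_ip;
    uint16_t   src_port;
    uint8_t    in_use;
} udp_rx_slot_t;

typedef struct {
    udp_rx_slot_t udp_rx[UDP_RX_SLOTS];
    uint8_t udp_rx_head, udp_rx_tail;
} sock_udp_t;

/* RX path: drops the datagram if the ring is full or the packet oversized. */
static int udp_rx_enqueue(sock_udp_t *s, const void *pkt, uint16_t len,
                          ip4_addr_t src_ip, uint16_t src_port)
{
    udp_rx_slot_t *slot = &s->udp_rx[s->udp_rx_head];
    if (slot->in_use || len > UDP_RX_MAXBUF)
        return -1;
    memcpy(slot->data, pkt, len);
    slot->len = len;
    slot->src_ip = src_ip;
    slot->src_port = src_port;
    slot->in_use = 1;
    s->udp_rx_head = (uint8_t)((s->udp_rx_head + 1) % UDP_RX_SLOTS);
    return 0;
}

/* recvfrom path: returns the datagram length, or -1 if the ring is empty. */
static int udp_rx_dequeue(sock_udp_t *s, void *buf, uint16_t buflen,
                          ip4_addr_t *src_ip, uint16_t *src_port)
{
    udp_rx_slot_t *slot = &s->udp_rx[s->udp_rx_tail];
    if (!slot->in_use)
        return -1;
    uint16_t n = slot->len < buflen ? slot->len : buflen;
    memcpy(buf, slot->data, n);
    if (src_ip)   *src_ip = slot->src_ip;
    if (src_port) *src_port = slot->src_port;
    slot->in_use = 0;
    s->udp_rx_tail = (uint8_t)((s->udp_rx_tail + 1) % UDP_RX_SLOTS);
    return (int)n;
}
```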
Sockaddr Layout
The kernel uses a k_sockaddr_in_t matching the musl libc struct sockaddr_in layout:
typedef struct {
    uint16_t sin_family;   /* AF_INET */
    uint16_t sin_port;     /* network byte order */
    uint32_t sin_addr;     /* network byte order */
    uint8_t  sin_zero[8];
} k_sockaddr_in_t;

_Static_assert(sizeof(k_sockaddr_in_t) == 16, "...");
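Validation of this struct at the syscall boundary might look like the following self-contained sketch. `sockaddr_check()`, `net_ntohs()`, and the error constants are illustrative, not the kernel’s actual symbols; in the real path the struct arrives via copy_from_user():

```c
/* Sketch: validating a user-supplied sockaddr in the bind() path.
 * Helper and constant names are illustrative assumptions. */
#include <stdint.h>

#define K_AF_INET 2
#define K_EINVAL  22

typedef struct {
    uint16_t sin_family;   /* AF_INET */
    uint16_t sin_port;     /* network byte order */
    uint32_t sin_addr;     /* network byte order */
    uint8_t  sin_zero[8];
} k_sockaddr_in_t;

_Static_assert(sizeof(k_sockaddr_in_t) == 16, "musl sockaddr_in layout");

/* Byte swap for the 16-bit port field (little-endian host assumed). */
static uint16_t net_ntohs(uint16_t v) { return (uint16_t)((v >> 8) | (v << 8)); }

/* Returns the host-order port on success, -EINVAL on a bad address. */
static int sockaddr_check(const k_sockaddr_in_t *sa, uint64_t addrlen)
{
    if (addrlen < sizeof(*sa))        return -K_EINVAL;  /* short struct */
    if (sa->sin_family != K_AF_INET)  return -K_EINVAL;  /* wrong family */
    return (int)net_ntohs(sa->sin_port);
}
```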
AF_UNIX Domain Sockets (unix_socket.c)
AF_UNIX sockets provide local IPC with connected, bidirectional, byte-stream semantics. They support fd passing (SCM_RIGHTS), peer credential retrieval, and VFS-backed file descriptors.
Socket Structure
#define UNIX_SOCK_MAX 32
#define UNIX_PATH_MAX 108
#define UNIX_BUF_SIZE 4056 /* fits in one kva page */
typedef struct {
    uint8_t  in_use;
    unix_state_t state;
    uint8_t  nonblocking;
    char     path[UNIX_PATH_MAX];
    /* Ring buffer -- this socket's TX direction (peer reads from it) */
    uint8_t *ring;        /* kva-allocated page */
    uint16_t ring_head;   /* write position */
    uint16_t ring_tail;   /* read position */
    /* Peer link */
    uint32_t peer_id;
    /* Accept queue (listening sockets) */
    uint32_t accept_queue[8];
    uint8_t  accept_head, accept_tail;
    /* Blocking */
    aegis_task_t *waiter_task;
    /* Peer credentials */
    uint32_t peer_pid, peer_uid, peer_gid;
    /* fd passing staging area */
    unix_passed_fd_t passed_fds[UNIX_PASSED_FD_MAX]; /* 16 slots */
    uint8_t  passed_fd_count;
    /* Refcount for dup/fork */
    uint32_t refcount;
} unix_sock_t;
States
typedef enum {
    UNIX_FREE, UNIX_CREATED, UNIX_BOUND, UNIX_LISTENING,
    UNIX_CONNECTING, UNIX_CONNECTED, UNIX_CLOSED
} unix_state_t;
Name Table
Bound sockets are registered in a static name table mapping paths to socket IDs:
#define UNIX_NAME_MAX 32
typedef struct {
    char     path[UNIX_PATH_MAX];
    uint32_t sock_id;
    uint8_t  in_use;
} unix_name_t;
- `name_register()`: Registers a path; returns `-EADDRINUSE` if already bound
- `name_unregister()`: Removes the binding (called on socket close)
- `name_lookup()`: Returns the socket ID for a path, or `UNIX_NONE`
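A self-contained sketch of these three operations over the table above (the table-full return value and exact duplicate-path semantics are assumptions):

```c
/* Sketch of the name table operations described above.
 * Error values beyond -EADDRINUSE are illustrative. */
#include <stdint.h>
#include <string.h>

#define UNIX_PATH_MAX 108
#define UNIX_NAME_MAX 32
#define UNIX_NONE     0xFFFFFFFFU
#define K_EADDRINUSE  98

typedef struct {
    char     path[UNIX_PATH_MAX];
    uint32_t sock_id;
    uint8_t  in_use;
} unix_name_t;

static unix_name_t s_names[UNIX_NAME_MAX];

static int name_register(const char *path, uint32_t sock_id)
{
    int free_slot = -1;
    for (int i = 0; i < UNIX_NAME_MAX; i++) {
        if (s_names[i].in_use) {
            if (strncmp(s_names[i].path, path, UNIX_PATH_MAX) == 0)
                return -K_EADDRINUSE;        /* path already bound */
        } else if (free_slot < 0) {
            free_slot = i;
        }
    }
    if (free_slot < 0) return -1;            /* table full */
    strncpy(s_names[free_slot].path, path, UNIX_PATH_MAX - 1);
    s_names[free_slot].path[UNIX_PATH_MAX - 1] = '\0';
    s_names[free_slot].sock_id = sock_id;
    s_names[free_slot].in_use = 1;
    return 0;
}

static uint32_t name_lookup(const char *path)
{
    for (int i = 0; i < UNIX_NAME_MAX; i++)
        if (s_names[i].in_use &&
            strncmp(s_names[i].path, path, UNIX_PATH_MAX) == 0)
            return s_names[i].sock_id;
    return UNIX_NONE;
}

static void name_unregister(uint32_t sock_id)
{
    for (int i = 0; i < UNIX_NAME_MAX; i++)
        if (s_names[i].in_use && s_names[i].sock_id == sock_id)
            s_names[i].in_use = 0;
}
```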
Connection Flow
Client Server
------ ------
socket(AF_UNIX, STREAM, 0) socket(AF_UNIX, STREAM, 0)
bind("/run/service.sock")
listen()
connect("/run/service.sock")
| accept() [blocks]
+-- name_lookup → listener_id
+-- Allocate server-side socket
+-- Allocate ring buffers (2 pages)
+-- Cross-link peer_ids
+-- Enqueue in listener accept queue
+-- Wake listener
| |
v v
UNIX_CONNECTED UNIX_CONNECTED (new fd)
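The server half of this flow maps onto the standard AF_UNIX calls from userspace. A minimal sketch, assuming the usual POSIX API (`unix_listener()`, the path, and the backlog are illustrative):

```c
/* Sketch: creating the listening side of the connection flow above.
 * unix_listener() is an illustrative helper, not part of Aegis. */
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

int unix_listener(const char *path)
{
    int fd = socket(AF_UNIX, SOCK_STREAM, 0);
    if (fd < 0) return -1;

    struct sockaddr_un addr;
    memset(&addr, 0, sizeof(addr));
    addr.sun_family = AF_UNIX;
    strncpy(addr.sun_path, path, sizeof(addr.sun_path) - 1);

    unlink(path);                  /* drop a stale binding, if any */
    if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||  /* name_register */
        listen(fd, 8) < 0) {
        close(fd);
        return -1;
    }
    return fd;                     /* accept() now blocks for connect()s */
}
```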
Ring Buffer Design
Each connected Unix socket has its own TX ring buffer (a single kva_alloc_pages(1) page, 4056 usable bytes). The peer reads from this ring:
Socket A Socket B
+---------+ +---------+
| ring_a | --- read ----> | peer |
| (A's TX)| | reads |
+---------+ +---------+
| ring_b | --- read ----> Socket A reads
| (B's TX)|
+---------+
- Write: A writes to its own `ring` at `ring_head`
- Read: A reads from the peer’s `ring` at the peer’s `ring_tail`
Ring buffer functions:
static uint16_t ring_used(unix_sock_t *s) {
    /* UNIX_BUF_SIZE (4056) is not a power of two, so indices must wrap
     * with modulo arithmetic -- a bitmask of (UNIX_BUF_SIZE - 1) would
     * compute wrong offsets here. */
    return (uint16_t)((s->ring_head + UNIX_BUF_SIZE - s->ring_tail) % UNIX_BUF_SIZE);
}
static uint16_t ring_free(unix_sock_t *s) {
    /* One slot is kept empty to distinguish full from empty. */
    return (uint16_t)(UNIX_BUF_SIZE - 1 - ring_used(s));
}
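Assuming helpers of this shape, a byte-stream write/read pair might look like the following self-contained sketch. `ring_write()`/`ring_read()` are illustrative names; since UNIX_BUF_SIZE (4056) is not a power of two, positions wrap with modulo rather than a bitmask:

```c
/* Sketch: byte-stream transfer over the Unix ring. Names are illustrative. */
#include <stdint.h>

#define UNIX_BUF_SIZE 4056

typedef struct {
    uint8_t  ring[UNIX_BUF_SIZE];
    uint16_t ring_head;   /* write position */
    uint16_t ring_tail;   /* read position */
} ring_t;

static uint16_t ring_used(const ring_t *r) {
    return (uint16_t)((r->ring_head + UNIX_BUF_SIZE - r->ring_tail) % UNIX_BUF_SIZE);
}
static uint16_t ring_free(const ring_t *r) {
    return (uint16_t)(UNIX_BUF_SIZE - 1 - ring_used(r)); /* one slot kept empty */
}

/* Copies up to `len` bytes in; returns the number actually written. */
static uint16_t ring_write(ring_t *r, const void *buf, uint16_t len)
{
    uint16_t avail = ring_free(r);
    uint16_t n = avail < len ? avail : len;
    for (uint16_t i = 0; i < n; i++) {
        r->ring[r->ring_head] = ((const uint8_t *)buf)[i];
        r->ring_head = (uint16_t)((r->ring_head + 1) % UNIX_BUF_SIZE);
    }
    return n;
}

/* Copies up to `len` buffered bytes out; returns the number read. */
static uint16_t ring_read(ring_t *r, void *buf, uint16_t len)
{
    uint16_t have = ring_used(r);
    uint16_t n = have < len ? have : len;
    for (uint16_t i = 0; i < n; i++) {
        ((uint8_t *)buf)[i] = r->ring[r->ring_tail];
        r->ring_tail = (uint16_t)((r->ring_tail + 1) % UNIX_BUF_SIZE);
    }
    return n;
}
```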
Close and Ring Lifetime
Ring buffer ownership is carefully managed across close:
- When socket A closes but peer B is still alive: A’s ring is not freed, because B still reads from it. The ring pointer remains valid even after `in_use = 0`.
- When B subsequently closes: B frees both its own ring and A’s orphaned ring.
- This prevents use-after-free while allowing the peer to drain remaining buffered data after the sender closes.
Peer Credentials
Credentials are captured at connect() time:
/* Client credentials → server-side socket */
s_unix[server_id].peer_pid = proc->pid;
s_unix[server_id].peer_uid = proc->uid;
s_unix[server_id].peer_gid = proc->gid;
/* Server credentials → client socket (filled at accept time) */
client->peer_pid = accepting_proc->pid;
Retrieved via unix_sock_peercred():
int unix_sock_peercred(uint32_t id, uint32_t *pid, uint32_t *uid, uint32_t *gid);
fd Passing (SCM_RIGHTS)
Unix sockets support passing file descriptors between processes:
Staging (sender side):
int unix_sock_stage_fds(uint32_t peer_id, unix_passed_fd_t *fds, uint8_t count);
Copies VFS ops/priv/flags into the peer’s staging area (up to 16 fds).
Receiving (receiver side):
int unix_sock_recv_fds(uint32_t id, int *fd_out, int max_fds);
Installs staged fds into the receiving process’s fd table. Any fds that cannot be installed (fd table full) are closed via their ops->close callback.
On socket close, any unreceived staged fds are also cleaned up.
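From the application’s point of view, this is the standard SCM_RIGHTS control-message protocol. A hedged userspace sketch of both ends, passing a single fd over a connected Unix socket (`send_fd()`/`recv_fd()` are illustrative helpers; kernel-side, the send path would stage via unix_sock_stage_fds() and the receive path install via unix_sock_recv_fds()):

```c
/* Sketch: SCM_RIGHTS fd passing from userspace. Helper names are illustrative. */
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <unistd.h>

/* Send one fd across a connected Unix socket. Returns 0 on success. */
int send_fd(int sock, int fd_to_pass)
{
    char byte = 'F';
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union { char buf[CMSG_SPACE(sizeof(int))]; struct cmsghdr align; } u;
    memset(&u, 0, sizeof(u));

    struct msghdr msg = {0};
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = u.buf;
    msg.msg_controllen = sizeof(u.buf);

    struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &fd_to_pass, sizeof(int));

    return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Receive one fd; returns the installed fd, or -1 on error. */
int recv_fd(int sock)
{
    char byte;
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union { char buf[CMSG_SPACE(sizeof(int))]; struct cmsghdr align; } u;

    struct msghdr msg = {0};
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = u.buf;
    msg.msg_controllen = sizeof(u.buf);

    if (recvmsg(sock, &msg, 0) != 1) return -1;
    struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);
    if (!cmsg || cmsg->cmsg_type != SCM_RIGHTS) return -1;

    int fd;
    memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
    return fd;
}
```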
VFS Integration
AF_UNIX sockets use SMAP-safe VFS ops that bounce through a 1024-byte kernel buffer:
static int unix_vfs_read(void *priv, void *buf, uint64_t off, uint64_t len) {
    uint8_t kbuf[1024];
    uint32_t id = (uint32_t)(uintptr_t)priv;   /* socket id stored in priv */
    uint32_t want = len < sizeof(kbuf) ? (uint32_t)len : (uint32_t)sizeof(kbuf);
    (void)off;
    int n = unix_sock_read(id, kbuf, want);
    if (n > 0) copy_to_user(buf, kbuf, (uint64_t)n);
    return n;
}
epoll (epoll.c)
Design
The epoll implementation provides scalable I/O event notification compatible with the Linux epoll API. It supports both level-triggered and edge-triggered modes.
Structures
#define EPOLL_MAX_INSTANCES 8
#define EPOLL_MAX_WATCHES 64
typedef struct {
    uint32_t fd;
    uint32_t events;
    uint64_t data;     /* user data */
    uint8_t  in_use;
} epoll_watch_t;

typedef struct {
    uint8_t  in_use;
    epoll_watch_t watches[EPOLL_MAX_WATCHES];
    uint8_t  nwatches;
    aegis_task_t *waiter_task;
    uint32_t ready[EPOLL_MAX_WATCHES];
    uint8_t  nready;
} epoll_fd_t;
Event Flags
#define EPOLLIN 0x00000001U /* readable */
#define EPOLLOUT 0x00000004U /* writable */
#define EPOLLERR 0x00000008U /* error */
#define EPOLLHUP 0x00000010U /* hangup */
#define EPOLLET 0x80000000U /* edge-triggered */
epoll_ctl
int epoll_ctl_impl(uint32_t epoll_id, int op, int fd, k_epoll_event_t *ev);
| Operation | Effect |
|---|---|
| EPOLL_CTL_ADD | Add a watch for fd. Returns -EEXIST if already watched. |
| EPOLL_CTL_DEL | Remove the watch. Returns -ENOENT if not found. |
| EPOLL_CTL_MOD | Update events/data for an existing watch. |
epoll_notify
Called from TCP and UDP when data or connection events occur:
void epoll_notify(uint32_t sock_id_as_fd, uint32_t events);
For each epoll instance, scans watches for matching fd and events. Adds to the ready list (with dedup) and wakes any blocked epoll_wait caller.
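The scan-and-dedup step can be modeled as self-contained C (`epoll_mark_ready()` is an illustrative name; the epoll_lock acquisition and the wakeup of a blocked waiter are omitted):

```c
/* Sketch of the ready-list update inside epoll_notify(): find watches
 * matching the fd, then append to the ready list with dedup. */
#include <stdint.h>

#define EPOLL_MAX_WATCHES 64

typedef struct {
    uint32_t fd;
    uint32_t events;
    uint8_t  in_use;
} epoll_watch_t;

typedef struct {
    epoll_watch_t watches[EPOLL_MAX_WATCHES];
    uint32_t ready[EPOLL_MAX_WATCHES];
    uint8_t  nready;
} epoll_fd_t;

static void epoll_mark_ready(epoll_fd_t *ep, uint32_t fd, uint32_t events)
{
    for (uint32_t w = 0; w < EPOLL_MAX_WATCHES; w++) {
        if (!ep->watches[w].in_use || ep->watches[w].fd != fd)
            continue;
        if (!(ep->watches[w].events & events))
            continue;                       /* watcher not interested */
        /* dedup: skip if this watch is already queued */
        int queued = 0;
        for (uint8_t i = 0; i < ep->nready; i++)
            if (ep->ready[i] == w) { queued = 1; break; }
        if (!queued && ep->nready < EPOLL_MAX_WATCHES)
            ep->ready[ep->nready++] = w;
    }
}
```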
epoll_wait
int epoll_wait_impl(uint32_t epoll_id, uint64_t events_uptr,
int maxevents, uint32_t timeout_ticks);
The implementation:
- VFS poll sweep: For non-socket fds (pipes, console), calls the VFS `poll` op to check readiness. This enables epoll to monitor heterogeneous fd types.
- Ready check (under `epoll_lock`): If events are ready, copy them to userspace via `copy_to_user()`.
- Edge-triggered: Entries with `EPOLLET` are removed from the ready list after delivery.
- Level-triggered: Entries remain in the ready list.
- Blocking: If no events and timeout > 0, set `waiter_task` and `sched_block()`.
The atomicity fix (Bug C6): The nready check and waiter_task assignment are performed under epoll_lock to prevent a lost wakeup when epoll_notify fires between checking nready==0 and setting waiter_task.
epoll Event Structure
typedef struct __attribute__((packed)) {
    uint32_t events;
    uint64_t data;   /* user data (epoll_data_t union) */
} k_epoll_event_t;

_Static_assert(sizeof(k_epoll_event_t) == 12, "matches Linux ABI");
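From userspace, the Linux-compatible calls this implementation mirrors are driven in the usual create/ctl/wait sequence. A hedged sketch (`wait_readable()` is an illustrative helper; shown with level-triggered EPOLLIN):

```c
/* Sketch: driving the epoll API from userspace with the Linux-compatible
 * calls the Aegis implementation mirrors. wait_readable() is illustrative. */
#include <sys/epoll.h>
#include <unistd.h>

/* Wait until `fd` is readable or `timeout_ms` elapses.
 * Returns 1 if readable, 0 on timeout, -1 on error. */
int wait_readable(int fd, int timeout_ms)
{
    int ep = epoll_create1(0);
    if (ep < 0) return -1;

    struct epoll_event ev = { .events = EPOLLIN, .data.fd = fd };
    if (epoll_ctl(ep, EPOLL_CTL_ADD, fd, &ev) < 0) { close(ep); return -1; }

    struct epoll_event out;
    int n = epoll_wait(ep, &out, 1, timeout_ms);
    close(ep);
    return n < 0 ? -1 : n;
}
```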
Socket API Summary
| Syscall | AF_INET TCP | AF_INET UDP | AF_UNIX |
|---|---|---|---|
| socket() | sock_alloc(STREAM) | sock_alloc(DGRAM) | unix_sock_alloc() |
| bind() | Set local_ip/port | udp_bind() | unix_sock_bind() |
| listen() | tcp_listen() | N/A | unix_sock_listen() |
| accept() | Pop accept queue | N/A | unix_sock_accept() |
| connect() | tcp_connect() | Set remote addr | unix_sock_connect() |
| send/write | tcp_conn_send() | N/A | unix_sock_write() |
| recv/read | tcp_conn_recv() | N/A | unix_sock_read() |
| sendto | N/A | udp_send() | N/A |
| recvfrom | N/A | Pop UDP RX ring | N/A |
| close() | tcp_conn_close() | udp_unbind() | unix_sock_free() |
| epoll_create | - | - | - |
| epoll_ctl | epoll_ctl_impl() | epoll_ctl_impl() | N/A |
| epoll_wait | epoll_wait_impl() | epoll_wait_impl() | N/A |
Table Size Limits
| Resource | Limit | Defined In |
|---|---|---|
| AF_INET sockets | 64 | socket.h SOCK_TABLE_SIZE |
| TCP connections | 32 | tcp.h TCP_MAX_CONNS |
| TCP receive buffer | 16 KB | tcp.h TCP_RBUF_SIZE |
| TCP send buffer | 8 KB | tcp.h TCP_SBUF_SIZE |
| UDP bindings | 16 | udp.h UDP_BINDINGS_MAX |
| UDP RX ring slots | 8 | socket.h UDP_RX_SLOTS |
| UDP RX max datagram | 1500 B | socket.h UDP_RX_MAXBUF |
| AF_UNIX sockets | 32 | unix_socket.h UNIX_SOCK_MAX |
| AF_UNIX buffer size | 4056 B | unix_socket.h UNIX_BUF_SIZE |
| AF_UNIX path max | 108 chars | unix_socket.h UNIX_PATH_MAX |
| AF_UNIX passed fds | 16 | unix_socket.h UNIX_PASSED_FD_MAX |
| epoll instances | 8 | epoll.h EPOLL_MAX_INSTANCES |
| epoll watches | 64 | epoll.h EPOLL_MAX_WATCHES |
| Network devices | 4 | netdev.h NETDEV_MAX |
Security Considerations
The socket layer is a critical attack surface – it processes attacker-controlled data (sockaddr structures, buffer lengths, fd numbers) in kernel context. Key v1 concerns:
- AF_UNIX fd passing: `unix_sock_stage_fds()` copies VFS ops pointers between processes. A bug in this path could allow arbitrary kernel function pointer injection.
- copy_from_user/copy_to_user: SMAP enforcement prevents direct kernel access to userspace memory, but the bounce-buffer pattern (1024 bytes for Unix, 1460 bytes for TCP) means buffer overflows in the staging buffers are a risk if length validation is incorrect.
- Socket table as shared mutable state: The 64-slot `sock_t` array is accessed under `sock_lock` from both syscall and ISR context. Lock-ordering bugs or missed lock acquisitions could corrupt socket state.
- No capability checks on socket operations: The capability model does not yet gate network operations; any process can open sockets, bind to any port, or connect to any address. Integrating capability checks (`CAP_KIND_NET`) into the socket syscall path is planned future work.
The long-term plan includes migrating the socket layer to Rust as part of the broader kernel C-to-Rust migration. The capability system (kernel/cap/lib.rs) is the first kernel component already written in Rust, establishing the FFI patterns that will be extended to networking.
Related Documentation
- Network Stack Overview – architecture and packet flow
- TCP/IP Implementation – protocol internals
- Device Drivers – NIC driver details
- Syscall Interface – syscall numbers and signatures