Chapter 16: Networking

Socket layer, NetBuf, routing, TCP stack, congestion control, kTLS, overlays/tunnels, netlink, packet filtering, interface naming, network service provider


The network stack is built around NetBuf — a zero-copy packet buffer that eliminates sk_buff allocation overhead on hot paths. The TCP/IP stack, routing (FIB), congestion control, kTLS, and packet filtering (BPF-based) are all specified. MPTCP is a first-class transport. Queue disciplines follow the same replaceable-policy pattern as the I/O scheduler.

16.1 TCP Stack Extensibility

Linux problem: MPTCP took many years to reach mainline because it required deep changes to the TCP stack. The monolithic TCP implementation made it hard to add new transport protocols. Congestion control algorithms are pluggable, but the socket layer itself is tightly coupled to TCP internals — adding a fundamentally new transport like QUIC kernel offload requires invasive surgery across net/ipv4/, net/ipv6/, and the socket layer. TCP socket options are implemented as a single sprawling do_tcp_setsockopt() switch with no modularity boundary. Each new option (TFO, TCP_REPAIR, kTLS) requires extending that switch and threading new state through the tcp_sock structure.

UmkaOS design: The TCP stack is factored into composable subsystems with well-defined internal interfaces:

  • Connection state machine: RFC 793 states, timers, sequence tracking — specified in Section 16.8.
  • Congestion control: Pluggable via CongestionOps trait — Section 16.10.
  • Socket options: Modular dispatch with per-option validation, specified in this document.
  • Upper layer protocols: TCP_ULP extension point for kTLS — Section 16.15.
  • MPTCP: First-class subflow architecture designed from the start, not patched on top — Section 16.11.

The setsockopt / getsockopt dispatch for TCP uses a match table indexed by option number, not a monolithic switch. Each option is implemented as an independent handler function with its own validation logic, making the TCP option surface area modular and auditable.

16.1.1 TCP Socket Options (SOL_TCP = IPPROTO_TCP = 6)

Every TCP socket option exposed to userspace via setsockopt() / getsockopt() is listed below. Values and types match Linux's include/uapi/linux/tcp.h for binary compatibility — unmodified applications calling setsockopt(SOL_TCP, TCP_NODELAY, ...) work without recompilation.

Option Value Type Get Set Description
TCP_NODELAY 1 i32 (bool) yes yes Disable Nagle's algorithm (RFC 896). When set, segments are sent immediately without waiting for more data or an ACK of previously sent data.
TCP_MAXSEG 2 i32 yes yes User-requested maximum segment size. Clamped to [88, 65535 - 40]. The lower bound of 88 prevents SACK-based resource amplification (CVE-2019-11479: a peer forcing a tiny MSS causes excessive memory and CPU consumption via SACK processing of many tiny segments). Effective MSS is min(user_mss, peer_mss, path_mtu - headers).
TCP_CORK 3 i32 (bool) yes yes Cork output: accumulate small writes into full MSS segments. Complementary to TCP_NODELAY — when both set, TCP_NODELAY wins (data sent immediately).
TCP_KEEPIDLE 4 i32 yes yes Seconds of idle time before the first keepalive probe. Overrides net.ipv4.tcp_keepalive_time for this socket. Requires SO_KEEPALIVE enabled.
TCP_KEEPINTVL 5 i32 yes yes Seconds between successive keepalive probes. Overrides net.ipv4.tcp_keepalive_intvl.
TCP_KEEPCNT 6 i32 yes yes Number of keepalive probes before declaring the connection dead. Overrides net.ipv4.tcp_keepalive_probes.
TCP_SYNCNT 7 i32 yes yes Maximum number of SYN retransmits for this socket. Overrides net.ipv4.tcp_syn_retries. Range: [1, 255].
TCP_LINGER2 8 i32 yes yes Per-socket FIN_WAIT2 timeout in seconds. Overrides net.ipv4.tcp_fin_timeout. Value -1 means use the system default.
TCP_DEFER_ACCEPT 9 i32 yes yes Wake accept() only after data arrives (not on the bare handshake-completing ACK). Value is timeout in seconds; kernel converts to SYN-ACK retransmit count internally.
TCP_WINDOW_CLAMP 10 i32 yes yes Clamp the advertised receive window to this value. Used by proxies that need to limit peer send rate. Minimum: SOCK_MIN_RCVBUF / 2.
TCP_INFO 11 TcpInfo yes no Read-only. Returns comprehensive connection statistics (see TcpInfo struct below).
TCP_QUICKACK 12 i32 (bool) yes yes Enable/disable quick-ACK mode. When enabled, ACKs are sent immediately instead of delayed. Resets after each ACK; must be re-set per-receive-cycle for persistent effect.
TCP_CONGESTION 13 [u8; 16] yes yes Get/set congestion control algorithm name (null-terminated ASCII). See Section 16.10.
TCP_MD5SIG 14 TcpMd5Sig no yes TCP-MD5 signature keys for BGP session protection (RFC 2385). See TCP-MD5 section below.
TCP_THIN_LINEAR_TIMEOUTS 16 i32 (bool) yes yes Use linear (not exponential) RTO backoff for thin streams (fewer than 4 packets in flight).
TCP_THIN_DUPACK 17 i32 (bool) yes yes Trigger fast retransmit after 1 dupACK (instead of 3) for thin streams.
TCP_USER_TIMEOUT 18 u32 yes yes Milliseconds before aborting on unacknowledged data. 0 = system default. When set, connection is aborted if data remains unacked for this duration regardless of retry count.
TCP_REPAIR 19 i32 yes yes Enter/exit TCP repair mode (CRIU checkpoint/restore). See TCP_REPAIR section below.
TCP_REPAIR_QUEUE 20 i32 yes yes Select queue for repair-mode I/O: 0 = no queue, 1 = receive queue, 2 = send queue.
TCP_QUEUE_SEQ 21 u32 yes yes Set sequence number for the selected repair queue.
TCP_REPAIR_OPTIONS 22 [TcpRepairOpt] no yes Set TCP options (MSS, window scale, timestamps, SACK) in repair mode.
TCP_FASTOPEN 23 i32 yes yes Enable TCP Fast Open on a listening socket. Value is the maximum pending TFO connection queue length. See TFO section below.
TCP_TIMESTAMP 24 u32 yes yes Get current TCP timestamp value. Set is only valid in repair mode.
TCP_NOTSENT_LOWAT 25 u32 yes yes Unsent data low watermark for epoll/poll writability. POLLOUT is set only when unsent data is below this threshold. Default 0 = legacy behavior (unlimited). Used by HTTP/2 servers to avoid write stalls.
TCP_CC_INFO 26 varies yes no Read-only. Congestion-control-specific info struct. Layout depends on active algorithm (see CongestionOps::get_info() in Section 16.10).
TCP_SAVE_SYN 27 i32 (bool) yes yes Save the SYN packet headers for inspection by the accepting process.
TCP_SAVED_SYN 28 [u8] yes no Read-only. Returns saved SYN packet headers (requires TCP_SAVE_SYN set before accept).
TCP_REPAIR_WINDOW 29 TcpRepairWindow yes yes Get/set window parameters (snd_wl1, snd_wnd, max_window, rcv_wnd, rcv_wup) in repair mode. CRIU reads them at checkpoint time and restores them at restore time.
TCP_FASTOPEN_CONNECT 30 i32 (bool) yes yes Enable TFO on connect() — send data in SYN (client-side). See TFO section below.
TCP_ULP 31 [u8; 16] yes yes Attach an Upper Layer Protocol (set) or read the attached ULP name (get). Currently only "tls" is supported (Phase 2). See Section 16.15.
TCP_MD5SIG_EXT 32 TcpMd5Sig no yes Extended MD5 signature (adds flags field for prefix-based matching).
TCP_FASTOPEN_KEY 33 [u8; 16] yes yes TFO server cookie key (128-bit). Writable for cluster-consistent TFO across load-balanced servers.
TCP_FASTOPEN_NO_COOKIE 34 i32 (bool) yes yes Enable TFO without cookie verification (for trusted networks).
TCP_ZEROCOPY_RECEIVE 35 TcpZeroCopyReceive yes no Zero-copy receive via mmap. getsockopt-only (matching Linux): the argument struct is passed in and updated in place. See Section 16.8.
TCP_INQ 36 i32 (bool) yes yes Report in-queue bytes via TCP_CM_INQ control message on recvmsg(). Used by gRPC for message framing decisions.
TCP_TX_DELAY 37 u32 yes yes Microseconds to delay before TX (pacing assist). Allows deliberate throttling without modifying congestion control. Range: [0, 4_000_000] (max 4 seconds).
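The TCP_MAXSEG clamping rule can be sketched as a small standalone helper. This is a minimal sketch: the constant names and the IPv4-only header overhead are illustrative, not the kernel's actual symbols.

```rust
/// Lower clamp bound: prevents tiny-MSS resource amplification (CVE-2019-11479).
const TCP_MIN_MSS: u32 = 88;
/// Upper clamp bound: 65535 minus 40 bytes of IP + TCP headers.
const TCP_MAX_MSS: u32 = 65_535 - 40;

/// Clamp a user-requested MSS into the permitted range.
pub fn clamp_user_mss(requested: u32) -> u32 {
    requested.clamp(TCP_MIN_MSS, TCP_MAX_MSS)
}

/// Effective MSS = min(user_mss, peer_mss, path_mtu - headers).
pub fn effective_mss(user_mss: u32, peer_mss: u32, path_mtu: u32) -> u32 {
    let header_overhead = 40; // IPv4 (20) + TCP (20), no options
    clamp_user_mss(user_mss)
        .min(peer_mss)
        .min(path_mtu.saturating_sub(header_overhead))
}
```

The saturating_sub guards against a path MTU smaller than the header overhead, which would otherwise underflow.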

Option dispatch implementation: Each option is registered in a compile-time table that maps (option_value, direction) to a handler function. The setsockopt/getsockopt path is a single array index (O(1)) followed by a function call — no linear scan or match cascade:

/// TCP socket option handler. One per option, registered at compile time.
struct TcpSockoptHandler {
    /// Option number (TCP_NODELAY = 1, TCP_MAXSEG = 2, etc.).
    opt: i32,
    /// Set handler. None for read-only options (TCP_INFO, TCP_CC_INFO).
    set_fn: Option<fn(tcb: &TcpCb, val: &[u8]) -> Result<(), KernelError>>,
    /// Get handler. None for write-only options (TCP_REPAIR_OPTIONS).
    get_fn: Option<fn(tcb: &TcpCb, buf: &mut [u8]) -> Result<usize, KernelError>>,
    /// Minimum capability required for set (None = unprivileged).
    set_cap: Option<Capability>,
}

/// Compile-time option table. Indexed by option value (0..=37).
/// Sparse: unused slots contain None. Lookup is O(1).
static TCP_SOCKOPT_TABLE: [Option<TcpSockoptHandler>; 38] = [/* ... */];
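A minimal standalone model of this dispatch, with simplified stand-ins for TcpCb and KernelError and only TCP_NODELAY registered (the error names and handler signature are illustrative):

```rust
/// Simplified stand-ins for the kernel's TcpCb and KernelError.
pub struct TcpCb {
    pub nodelay: bool,
}

#[derive(Debug, PartialEq)]
pub enum SockErr {
    NoProtoOpt, // unknown option: plays the role of ENOPROTOOPT
    Inval,      // bad argument: plays the role of EINVAL
}

type SetFn = fn(&mut TcpCb, &[u8]) -> Result<(), SockErr>;

/// Handler for TCP_NODELAY (option 1): an i32 boolean, validated per-option.
fn set_nodelay(tcb: &mut TcpCb, val: &[u8]) -> Result<(), SockErr> {
    let v = i32::from_ne_bytes(val.try_into().map_err(|_| SockErr::Inval)?);
    tcb.nodelay = v != 0;
    Ok(())
}

/// Sparse table indexed by option value; only slot 1 is populated here.
fn set_table() -> [Option<SetFn>; 38] {
    let mut t: [Option<SetFn>; 38] = [None; 38];
    t[1] = Some(set_nodelay);
    t
}

/// Dispatch: one bounds-checked index plus an indirect call, no match cascade.
pub fn tcp_setsockopt(tcb: &mut TcpCb, opt: usize, val: &[u8]) -> Result<(), SockErr> {
    match set_table().get(opt).and_then(|h| *h) {
        Some(f) => f(tcb, val),
        None => Err(SockErr::NoProtoOpt),
    }
}
```

Unknown option numbers fall through to NoProtoOpt, and each handler owns its own argument validation, mirroring the per-option modularity described above.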

16.1.2 SO-Level Socket Options (SOL_SOCKET = 1)

Socket-level options are processed by the generic socket layer (Section 16.3) before reaching TCP. The options below are the subset most relevant to TCP socket behavior. Values match Linux's include/uapi/asm-generic/socket.h.

Option Value Type Description
SO_REUSEADDR 2 i32 (bool) Allow bind to an address in TIME_WAIT state. Does not allow two active listeners on the same port (use SO_REUSEPORT for that).
SO_SNDBUF 7 i32 Send buffer size in bytes. Kernel doubles the value internally (matching Linux behavior) to account for bookkeeping overhead. Range: [2048, sysctl_wmem_max].
SO_RCVBUF 8 i32 Receive buffer size in bytes. Kernel doubles the value. Range: [256, sysctl_rmem_max].
SO_KEEPALIVE 9 i32 (bool) Enable TCP keepalive probes. Probe timing controlled by TCP_KEEPIDLE, TCP_KEEPINTVL, TCP_KEEPCNT.
SO_PRIORITY 12 u32 Socket priority for QoS classification. Values > 6 require CAP_NET_ADMIN (Section 16.3).
SO_LINGER 13 Linger Control behavior on close(). If enabled with timeout > 0: close() blocks until pending data is sent or timeout expires. If timeout = 0: close() sends RST (abort).
SO_REUSEPORT 15 i32 (bool) Multiple sockets bind to the same (IP, port) tuple. See SO_REUSEPORT section below.
SO_RCVTIMEO_OLD 20 Timeval (old) Receive timeout (old ABI, long-based). Blocking recvmsg() returns EAGAIN after this duration.
SO_SNDTIMEO_OLD 21 Timeval (old) Send timeout (old ABI, long-based). Blocking sendmsg() returns EAGAIN after this duration.
SO_BINDTODEVICE 25 [u8; IFNAMSIZ] Bind socket to a specific network interface. No capability required (Linux 5.7+, commit c427bfec18f2).
SO_MARK 36 u32 Packet mark for policy routing and netfilter matching (Section 16.21). Requires CAP_NET_ADMIN.
SO_BUSY_POLL 46 u32 Busy-poll time in microseconds. Socket-level override of net.core.busy_poll sysctl (Section 16.14).
SO_INCOMING_CPU 49 i32 Read-only. CPU that last received data for this socket (for affinity hints).
SO_ATTACH_BPF 50 i32 Attach a BPF program (by fd) to this socket for packet filtering (Section 16.18).
SO_ZEROCOPY 60 i32 (bool) Enable zero-copy sendmsg() via page pinning. Beneficial above ~10 KB per send. See Section 16.8.
SO_TXTIME 61 SockTxtime Transmit time scheduling. Allows applications to specify per-packet TX timestamps for precise pacing (Section 16.21).
SO_RCVTIMEO_NEW 66 __kernel_sock_timeval Receive timeout (Y2038-safe, i64-based). Preferred on both 32- and 64-bit ABIs.
SO_SNDTIMEO_NEW 67 __kernel_sock_timeval Send timeout (Y2038-safe, i64-based). Preferred on both 32- and 64-bit ABIs.
SO_PREFER_BUSY_POLL 69 i32 (bool) Prefer busy-polling over interrupt-driven receive. Used with SO_BUSY_POLL for latency-sensitive workloads.

/// SO_LINGER parameter. Layout matches Linux `struct linger`.
#[repr(C)]
pub struct Linger {
    /// 0 = linger disabled (default), nonzero = linger enabled.
    pub l_onoff: i32,
    /// Linger timeout in seconds. Only meaningful when l_onoff != 0.
    /// 0 = send RST on close (hard abort).
    pub l_linger: i32,
}
// UAPI ABI: l_onoff(4) + l_linger(4) = 8 bytes.
const_assert!(core::mem::size_of::<Linger>() == 8);
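The three close() behaviors in the SO_LINGER table row can be made explicit with a small decision helper. The enum and function are illustrative, not part of the spec; Linger is redeclared from above so the sketch is self-contained:

```rust
/// Redeclared from the Linger struct above for self-containment.
#[repr(C)]
pub struct Linger {
    pub l_onoff: i32,
    pub l_linger: i32,
}

#[derive(Debug, PartialEq)]
pub enum CloseBehavior {
    /// Linger off: close() returns immediately, kernel drains in background.
    Graceful,
    /// Linger on, timeout > 0: close() blocks up to this many seconds.
    BlockUpTo(i32),
    /// Linger on, timeout == 0: close() sends RST (hard abort).
    AbortRst,
}

pub fn close_behavior(l: &Linger) -> CloseBehavior {
    match (l.l_onoff != 0, l.l_linger) {
        (false, _) => CloseBehavior::Graceful,
        (true, 0) => CloseBehavior::AbortRst,
        (true, secs) => CloseBehavior::BlockUpTo(secs),
    }
}
```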

/// SO_TXTIME parameter. Layout matches Linux `struct sock_txtime`.
#[repr(C)]
pub struct SockTxtime {
    /// Clock ID for timestamps (CLOCK_TAI for time-sensitive networking).
    pub clockid: i32,
    /// Flags: `SOF_TXTIME_DEADLINE_MODE` (1) = enable deadline mode,
    /// `SOF_TXTIME_REPORT_ERRORS` (2) = report missed-deadline errors via `MSG_ERRQUEUE`.
    pub flags: u32,
}
// UAPI ABI: clockid(4) + flags(4) = 8 bytes.
const_assert!(core::mem::size_of::<SockTxtime>() == 8);

16.1.3 TcpInfo Structure

TCP_INFO returns a comprehensive snapshot of TCP connection state. This structure is read by monitoring tools (ss -i, Prometheus node_exporter), load balancers (Envoy, HAProxy), and profiling frameworks (BPF TCP tracing). Binary layout must match Linux's struct tcp_info from include/uapi/linux/tcp.h exactly.

/// TCP connection info — returned by getsockopt(TCP_INFO).
/// Binary layout matches Linux's struct tcp_info for ABI compatibility.
/// Applications and tools (ss, envoy, prometheus) depend on field offsets.
///
/// Fields are populated from the TcpCb at getsockopt() time under the
/// per-socket spinlock to ensure a consistent snapshot.
// kernel-internal, not KABI
#[repr(C)]
pub struct TcpInfo {
    /// TCP state (matches TcpState enum: ESTABLISHED=1, SYN_SENT=2, etc.).
    pub tcpi_state: u8,
    /// Congestion avoidance state (0=Open, 1=Disorder, 2=CWR, 3=Recovery, 4=Loss).
    pub tcpi_ca_state: u8,
    /// Current retransmit count for the oldest unacked segment.
    pub tcpi_retransmits: u8,
    /// Keepalive/zero-window probes sent.
    pub tcpi_probes: u8,
    /// Exponential backoff counter (doubled on each RTO).
    pub tcpi_backoff: u8,
    /// Negotiated TCP options bitmask.
    /// Bit 0: TCPI_OPT_TIMESTAMPS, Bit 1: TCPI_OPT_SACK,
    /// Bit 2: TCPI_OPT_WSCALE, Bit 3: TCPI_OPT_ECN,
    /// Bit 4: TCPI_OPT_ECN_SEEN, Bit 5: TCPI_OPT_SYN_DATA (TFO).
    pub tcpi_options: u8,
    /// Packed window scale factors: bits [0:3] = snd_wscale, bits [4:7] = rcv_wscale.
    /// Rust has no C bitfields; use pack/unpack helpers:
    /// ```rust
    /// fn pack_wscale(snd: u8, rcv: u8) -> u8 { (snd & 0xF) | ((rcv & 0xF) << 4) }
    /// fn unpack_wscale(v: u8) -> (u8, u8) { (v & 0xF, (v >> 4) & 0xF) }
    /// ```
    pub tcpi_snd_wscale_rcv_wscale: u8,
    /// Packed delivery rate + TFO client failure (matches Linux bitfield):
    ///   bit 0:    delivery_rate_app_limited (1 = app-limited delivery rate)
    ///   bits 1-2: fastopen_client_fail (0 = TFO_STATUS_UNSPEC,
    ///             1 = TFO_COOKIE_UNAVAILABLE, 2 = TFO_DATA_NOT_ACKED,
    ///             3 = TFO_SYN_RETRANSMITTED). Read by `ss -i` for TFO diagnostics.
    ///   bits 3-7: reserved (zero)
    /// Pack/unpack helpers:
    /// ```rust
    /// fn pack_delivery_tfo(app_limited: bool, tfo_fail: u8) -> u8 {
    ///     (app_limited as u8) | ((tfo_fail & 0x3) << 1)
    /// }
    /// fn unpack_delivery_tfo(v: u8) -> (bool, u8) {
    ///     ((v & 1) != 0, (v >> 1) & 0x3)
    /// }
    /// ```
    pub tcpi_delivery_rate_app_limited: u8,

    /// Retransmission timeout in microseconds.
    pub tcpi_rto: u32,
    /// Predicted ACK timeout in microseconds (delayed ACK timer).
    pub tcpi_ato: u32,
    /// Send MSS (effective, after negotiation and PMTUD).
    pub tcpi_snd_mss: u32,
    /// Receive MSS (advertised by peer).
    pub tcpi_rcv_mss: u32,

    /// Segments sent but not yet acknowledged.
    pub tcpi_unacked: u32,
    /// SACK'd segments above snd_una.
    pub tcpi_sacked: u32,
    /// Segments considered lost by the loss detection algorithm.
    pub tcpi_lost: u32,
    /// Currently retransmitted (in-flight retransmit) segments.
    pub tcpi_retrans: u32,
    /// Deprecated (was FACK-specific). Always 0 in UmkaOS. Preserved for layout compat.
    pub tcpi_fackets: u32,

    // --- Timestamps (milliseconds since an unspecified epoch) ---
    /// Time since last data segment was sent.
    pub tcpi_last_data_sent: u32,
    /// Time since last ACK was sent. Always 0 in Linux; preserved for compat.
    pub tcpi_last_ack_sent: u32,
    /// Time since last data segment was received.
    pub tcpi_last_data_recv: u32,
    /// Time since last ACK was received.
    pub tcpi_last_ack_recv: u32,

    // --- Path metrics ---
    /// Path MTU (from PMTUD or route entry).
    pub tcpi_pmtu: u32,
    /// Receiver-side slow-start threshold (controls auto-tuning upper bound).
    pub tcpi_rcv_ssthresh: u32,
    /// Smoothed RTT in microseconds (SRTT from Jacobson/Karels).
    pub tcpi_rtt: u32,
    /// RTT variance in microseconds.
    pub tcpi_rttvar: u32,
    /// Sender slow-start threshold (bytes).
    pub tcpi_snd_ssthresh: u32,
    /// Sender congestion window (segments). Note: u32 for Linux ABI compat;
    /// internal TcpCb.cwnd is u64 for high-BDP paths — clamped to u32::MAX here.
    pub tcpi_snd_cwnd: u32,
    /// Advertised MSS (before clamping by peer/PMTUD).
    pub tcpi_advmss: u32,
    /// Reordering metric (max observed reorder distance in segments).
    pub tcpi_reordering: u32,

    /// Receiver RTT estimate in microseconds (used for auto-tuning rcvbuf).
    pub tcpi_rcv_rtt: u32,
    /// Receiver buffer auto-tuning target (bytes).
    pub tcpi_rcv_space: u32,

    /// Total retransmitted segments over connection lifetime.
    pub tcpi_total_retrans: u32,

    // --- 64-bit counters (appended in Linux 4.6+) ---
    /// Current pacing rate in bytes/sec (from congestion control or SO_MAX_PACING_RATE).
    pub tcpi_pacing_rate: u64,
    /// Maximum pacing rate (SO_MAX_PACING_RATE or u64::MAX if unlimited).
    pub tcpi_max_pacing_rate: u64,
    /// Total bytes acknowledged by peer (cumulative ACK, not SACK).
    pub tcpi_bytes_acked: u64,
    /// Total bytes received (delivered to socket buffer).
    pub tcpi_bytes_received: u64,
    /// Total segments sent (including retransmits).
    pub tcpi_segs_out: u32,
    /// Total segments received.
    pub tcpi_segs_in: u32,

    /// Bytes in the send buffer not yet sent (queued but not transmitted).
    pub tcpi_notsent_bytes: u32,
    /// Minimum RTT observed over the connection lifetime (microseconds).
    pub tcpi_min_rtt: u32,
    /// Data segments received (excluding pure ACKs).
    pub tcpi_data_segs_in: u32,
    /// Data segments sent (excluding pure ACKs and retransmits).
    pub tcpi_data_segs_out: u32,

    /// Delivery rate estimate in bytes/sec (from the rate sampling engine).
    pub tcpi_delivery_rate: u64,

    /// Time the connection was send-buffer limited (microseconds).
    pub tcpi_busy_time: u64,
    /// Time the connection was receiver-window limited (microseconds).
    pub tcpi_rwnd_limited: u64,
    /// Time the connection was send-buffer limited (microseconds).
    pub tcpi_sndbuf_limited: u64,

    /// Total delivered packets (tracked for delivery rate estimation).
    pub tcpi_delivered: u32,
    /// Delivered packets marked with ECN CE (congestion experienced).
    pub tcpi_delivered_ce: u32,
    /// Total bytes sent (including retransmits, headers excluded).
    pub tcpi_bytes_sent: u64,
    /// Total bytes retransmitted.
    pub tcpi_bytes_retrans: u64,
    /// DSACK duplicates seen (spurious retransmit counter).
    pub tcpi_dsack_dups: u32,
    /// Reordering events observed.
    pub tcpi_reord_seen: u32,

    /// Out-of-order packets received.
    pub tcpi_rcv_ooopack: u32,
    /// Current send window (bytes, from last window update).
    pub tcpi_snd_wnd: u32,
    /// Current receive window (bytes, advertised to peer).
    pub tcpi_rcv_wnd: u32,
    /// Rehash count (number of times the connection was rehashed due to SYN retransmit).
    pub tcpi_rehash: u32,

    // --- Fields added in Linux 6.1+ (versioned field group) ---
    // getsockopt(TCP_INFO) uses copy_to_user(min(optlen, sizeof(struct tcp_info))),
    // so older tools requesting fewer bytes get a truncated (but correct) prefix.
    // Fields below are ordered exactly as Linux mainline tcp.h.

    /// Total RTO events over connection lifetime (Linux 6.1+).
    pub tcpi_total_rto: u16,
    /// Total RTO recoveries (successful, not spurious) over connection lifetime (Linux 6.1+).
    pub tcpi_total_rto_recoveries: u16,
    /// Total time spent in RTO (microseconds, Linux 6.1+).
    pub tcpi_total_rto_time: u32,

    // --- AccECN fields (Linux 6.12+, RFC 9332) ---
    // Populated when AccECN support is implemented (Phase 3+).
    // Until then, getsockopt returns zeroes for these fields.
    // Layout matches torvalds/linux master include/uapi/linux/tcp.h.

    /// Total received CE-marked packets (AccECN receiver counter).
    /// Reserved (zero) until AccECN implementation.
    pub tcpi_received_ce: u32,              // 4 bytes, offset 248
    /// Delivered bytes from ECT(1)-marked segments (AccECN sender).
    /// Reserved (zero) until AccECN implementation.
    pub tcpi_delivered_e1_bytes: u32,       // 4 bytes, offset 252
    /// Delivered bytes from ECT(0)-marked segments (AccECN sender).
    /// Reserved (zero) until AccECN implementation.
    pub tcpi_delivered_e0_bytes: u32,       // 4 bytes, offset 256
    /// Delivered bytes from CE-marked segments (AccECN sender).
    /// Reserved (zero) until AccECN implementation.
    pub tcpi_delivered_ce_bytes: u32,       // 4 bytes, offset 260
    /// Received bytes from ECT(1)-marked segments (AccECN receiver).
    /// Reserved (zero) until AccECN implementation.
    pub tcpi_received_e1_bytes: u32,        // 4 bytes, offset 264
    /// Received bytes from ECT(0)-marked segments (AccECN receiver).
    /// Reserved (zero) until AccECN implementation.
    pub tcpi_received_e0_bytes: u32,        // 4 bytes, offset 268
    /// Received bytes from CE-marked segments (AccECN receiver).
    /// Reserved (zero) until AccECN implementation.
    pub tcpi_received_ce_bytes: u32,        // 4 bytes, offset 272
    /// Packed bitfield (matches Linux u32 bitfield layout):
    ///   bits [0:1]:  tcpi_ecn_mode (0=off, 1=classic ECN, 2=AccECN)
    ///   bits [2:3]:  tcpi_accecn_opt_seen (0=none, 1=SYN, 2=established)
    ///   bits [4:7]:  tcpi_accecn_fail_mode (AccECN negotiation failure mode)
    ///   bits [8:31]: tcpi_options2 (reserved, zero)
    /// Reserved (zero) until AccECN implementation.
    pub tcpi_ecn_accecn_options2: u32,      // 4 bytes, offset 276
}

// ABI size assertion: TcpInfo is userspace-visible via getsockopt(TCP_INFO).
// ss -i, Envoy, Prometheus node_exporter all depend on exact layout.
// Size breakdown:
//   8 × u8 = 8 bytes               (offset 0..8)
//   24 × u32 = 96 bytes            (offset 8..104: rto through total_retrans)
//   4 × u64 = 32 bytes             (offset 104..136: pacing_rate through bytes_received)
//   6 × u32 = 24 bytes             (offset 136..160: segs_out through data_segs_out)
//   1 × u64 = 8 bytes              (offset 160..168: delivery_rate)
//   3 × u64 = 24 bytes             (offset 168..192: busy_time through sndbuf_limited)
//   2 × u32 = 8 bytes              (offset 192..200: delivered, delivered_ce)
//   2 × u64 = 16 bytes             (offset 200..216: bytes_sent, bytes_retrans)
//   2 × u32 = 8 bytes              (offset 216..224: dsack_dups, reord_seen)
//   4 × u32 = 16 bytes             (offset 224..240: rcv_ooopack through rehash)
//   2 × u16 + 1 × u32 = 8 bytes   (offset 240..248: total_rto fields)
//   8 × u32 = 32 bytes             (offset 248..280: AccECN fields, Linux 6.12+)
// Total: 248 bytes (through tcpi_total_rto_time) + 32 bytes (AccECN) = 280 bytes.
// Linux mainline (torvalds/linux master) defines the reference layout. AccECN
// fields are populated when AccECN (RFC 9332) is implemented (Phase 3+).
// getsockopt forward-compat: copy_to_user(min(optlen, sizeof)) handles older tools.
const_assert!(core::mem::size_of::<TcpInfo>() == 280);

Population path: getsockopt(SOL_TCP, TCP_INFO) acquires the per-socket spinlock (TcpCb.lock), snapshots all fields from TcpCb into a stack-allocated TcpInfo, releases the lock, and copies the struct to the user buffer. The lock hold time is bounded (pure reads, no I/O) — approximately 100-200 cycles for the field copy.
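A sketch of that population path, using a simplified two-field TcpCb and showing the u64-to-u32 cwnd clamp noted for tcpi_snd_cwnd. std's Mutex stands in for the kernel spinlock; the names are illustrative.

```rust
use std::sync::Mutex;

/// Simplified stand-in for the kernel's TcpCb (two fields only).
pub struct TcpCb {
    pub cwnd: u64,    // internal cwnd is u64 for high-BDP paths
    pub srtt_us: u32, // smoothed RTT, microseconds
}

/// Simplified stand-in for the full TcpInfo snapshot.
pub struct TcpInfoSnap {
    pub tcpi_snd_cwnd: u32,
    pub tcpi_rtt: u32,
}

/// Snapshot under the per-socket lock: pure field copies, bounded hold time.
pub fn tcp_get_info(cb: &Mutex<TcpCb>) -> TcpInfoSnap {
    let g = cb.lock().unwrap();
    TcpInfoSnap {
        // ABI clamp: tcpi_snd_cwnd is u32 for Linux compatibility.
        tcpi_snd_cwnd: g.cwnd.min(u32::MAX as u64) as u32,
        tcpi_rtt: g.srtt_us,
    }
    // Lock is released when `g` drops; the copy to the user buffer happens
    // outside the critical section.
}
```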

16.1.4 TCP Fast Open (TFO)

TCP Fast Open (RFC 7413) eliminates one RTT from the connection establishment latency by allowing data in the SYN packet. This is critical for short-lived HTTP/1.1 connections and DNS-over-TCP where the handshake dominates total latency.

16.1.4.1 Server Side

A listening socket enables TFO via setsockopt(SOL_TCP, TCP_FASTOPEN, &qlen, 4) where qlen is the maximum number of pending TFO connections (separate from the regular SYN backlog). The server generates cookies to validate returning clients:

/// TFO cookie: 8-byte authenticator binding client IP to server key.
///
/// Generated as: SipHash-2-4(client_ip_bytes, tfo_key) truncated to 8 bytes.
/// SipHash is chosen for its resistance to hash-flooding attacks while being
/// fast enough for per-SYN computation (~15 cycles on modern hardware).
///
/// The key is stored per-listener (TCP_FASTOPEN_KEY sockopt) or per-namespace
/// (net.ipv4.tcp_fastopen_key sysctl). Cluster deployments set the same key
/// across all load-balanced servers so TFO cookies are portable.
pub struct TfoCookie {
    pub val: [u8; 8],
}

/// Per-listener TFO state.
pub struct TfoListenerState {
    /// TFO cookie key. Set by TCP_FASTOPEN_KEY or auto-generated at listen() time.
    /// 128-bit key for SipHash-2-4.
    pub key: [u8; 16],
    /// Maximum pending TFO connections (from TCP_FASTOPEN setsockopt value).
    pub max_qlen: u32,
    /// Current number of pending TFO connections.
    pub cur_qlen: AtomicU32,
}

TFO server handshake:

  1. Client sends SYN with TFO cookie option (from a previous connection) + payload data.
  2. Server validates the cookie: SipHash(client_ip, key)[..8] == cookie.val.
  3. If valid and cur_qlen < max_qlen: accept the SYN+data, deliver data to the application immediately (before the 3-way handshake completes), send SYN+ACK.
  4. If invalid or queue full: fall back to standard 3-way handshake (SYN+data is ignored; only the SYN is processed). This is transparent to the client.
  5. On first connection (no cookie): server includes a new cookie in the SYN+ACK options. Client caches the cookie for subsequent connections.

16.1.4.2 Client Side

setsockopt(SOL_TCP, TCP_FASTOPEN_CONNECT, &1, 4) enables TFO for the next connect(). When set, sendmsg() after connect() (or sendto() on an unconnected socket) sends data in the SYN:

  1. If a cached cookie exists for the destination: send SYN + cookie + data.
  2. If no cached cookie: send SYN with an empty TFO option (requesting a cookie). The connection proceeds normally; data is sent after the handshake completes. The received cookie is cached for future connections.

The TFO cookie cache is per-network-namespace, keyed by destination IP address. Cache entries expire after 1 hour (matching Linux's TCP_FASTOPEN_COOKIE_EXPIRY).
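A sketch of that client-side cache, assuming IPv4 keys and std's Instant for expiry (the type and method names are illustrative):

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Per-namespace client cookie cache: destination IP -> (cookie, time stored).
pub struct TfoClientCache {
    map: HashMap<[u8; 4], ([u8; 8], Instant)>,
    ttl: Duration,
}

impl TfoClientCache {
    pub fn new() -> Self {
        Self { map: HashMap::new(), ttl: Duration::from_secs(3600) } // 1 hour
    }

    /// Called when a SYN+ACK carries a fresh cookie (client step 2).
    pub fn store(&mut self, dst: [u8; 4], cookie: [u8; 8]) {
        self.map.insert(dst, (cookie, Instant::now()));
    }

    /// Some(cookie): send SYN + cookie + data (step 1).
    /// None: send an empty TFO option to request a cookie (step 2).
    pub fn lookup(&self, dst: [u8; 4]) -> Option<[u8; 8]> {
        self.map
            .get(&dst)
            .filter(|(_, stored)| stored.elapsed() < self.ttl)
            .map(|(cookie, _)| *cookie)
    }
}
```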

16.1.4.3 Cookieless TFO

TCP_FASTOPEN_NO_COOKIE (per-socket) or net.ipv4.tcp_fastopen bit 9 (0x200, per-namespace) enables TFO without cookie verification. This is appropriate for trusted networks (datacenter east-west traffic) where the SYN+data amplification risk is acceptable. Without cookies, any client can send data in the SYN without prior authentication.

16.1.4.4 TFO Sysctl

net.ipv4.tcp_fastopen is a bitmask controlling TFO behavior per-namespace:

Bit Value Meaning
0 0x1 Enable TFO client (send data in SYN when cookie available)
1 0x2 Enable TFO server (accept data in SYN with valid cookie)
2 0x4 Client: send data in SYN even without a cached cookie (implies bit 0)
9 0x200 Server: accept data in SYN without cookie validation

Default: 0x1 (client-only). Kubernetes and Docker typically set 0x3 (client + server).
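The bitmask checks reduce to simple helpers. The constant names follow the table rows above; they are not necessarily the kernel's symbols.

```rust
/// net.ipv4.tcp_fastopen bit definitions, per the table above.
pub const TFO_CLIENT_ENABLE: u32 = 0x1;
pub const TFO_SERVER_ENABLE: u32 = 0x2;
pub const TFO_CLIENT_NO_COOKIE: u32 = 0x4;

/// Bit 2 implies bit 0: sending SYN data without a cookie presumes client TFO.
pub fn tfo_client_enabled(sysctl: u32) -> bool {
    sysctl & (TFO_CLIENT_ENABLE | TFO_CLIENT_NO_COOKIE) != 0
}

pub fn tfo_server_enabled(sysctl: u32) -> bool {
    sysctl & TFO_SERVER_ENABLE != 0
}
```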

16.1.4.5 Security Considerations

  • SYN+data amplification: Without cookies, an attacker can forge SYN+data to amplify traffic to a victim. Cookies bind the TFO authorization to the client's IP address. Spoofed-IP SYNs with invalid cookies fall back to the normal handshake (no amplification).
  • Maximum outstanding TFO connections: The per-listener max_qlen bounds resource consumption. When exceeded, new TFO SYNs are handled as regular SYNs.
  • Key rotation: Servers should rotate TCP_FASTOPEN_KEY periodically. After rotation, cookies generated with the old key are rejected; clients fall back to a regular handshake and receive a new cookie. No connection failures occur — only one RTT of extra latency.

16.1.5 TCP_REPAIR (CRIU Checkpoint/Restore)

TCP_REPAIR enables transparent checkpoint and restore of established TCP connections, primarily for CRIU (Checkpoint/Restore In Userspace) live migration of containers. When a connection is checkpointed on one host and restored on another, the peer (remote server or client) is unaware that the connection was migrated.

16.1.5.1 Repair Mode Protocol

1. Enter repair mode:
   setsockopt(fd, SOL_TCP, TCP_REPAIR, &1, sizeof(int))
   — Requires CAP_NET_ADMIN.
   — Socket must be in ESTABLISHED, CLOSE_WAIT, or SYN_SENT state, or be
     a freshly created (CLOSED) socket being restored (steps 6-11).
   — Suppresses all TCP protocol processing (no ACKs, no retransmits,
     no keepalives) while in repair mode.

2. Read connection state:
   getsockopt(TCP_INFO)        → connection metrics, state
   getsockopt(TCP_QUEUE_SEQ)   → sequence numbers (after selecting queue)
   getsockopt(TCP_TIMESTAMP)   → current TCP timestamp value
   read(fd, buf, len)          → drain receive queue data

3. Checkpoint send queue:
   setsockopt(TCP_REPAIR_QUEUE, &TCP_SEND_QUEUE)
   read(fd, buf, len)          → extract pending send data
   getsockopt(TCP_QUEUE_SEQ)   → read snd_una

4. Checkpoint receive queue:
   setsockopt(TCP_REPAIR_QUEUE, &TCP_RECV_QUEUE)
   read(fd, buf, len)          → extract received-but-unread data
   getsockopt(TCP_QUEUE_SEQ)   → read rcv_nxt

5. Close the socket (on the source host).

--- On the destination host ---

6. Create a new socket, enter repair mode:
   socket(AF_INET, SOCK_STREAM, 0)
   setsockopt(TCP_REPAIR, &1)

7. Restore addressing:
   bind(source_addr)
   connect(dest_addr)          → in repair mode, does NOT send SYN

8. Restore sequence numbers:
   setsockopt(TCP_REPAIR_QUEUE, &TCP_SEND_QUEUE)
   setsockopt(TCP_QUEUE_SEQ, &snd_una)
   write(fd, send_data, len)   → inject into send queue (no TX)

   setsockopt(TCP_REPAIR_QUEUE, &TCP_RECV_QUEUE)
   setsockopt(TCP_QUEUE_SEQ, &rcv_nxt)
   write(fd, recv_data, len)   → inject into receive queue

9. Restore TCP options:
   setsockopt(TCP_REPAIR_OPTIONS, &opts)
   — opts is an array of TcpRepairOpt structs:
     MSS, window scale (snd + rcv), SACK permitted, timestamps

10. Restore window parameters:
    setsockopt(TCP_REPAIR_WINDOW, &window_params)

11. Exit repair mode:
    setsockopt(TCP_REPAIR, &0)
    — TCP protocol processing resumes.
    — ACKs are sent, retransmit timer is armed.
    — The peer sees no interruption.

16.1.5.2 Repair Mode Types

/// TCP_REPAIR_OPTIONS entry. Each struct sets one TCP option in repair mode.
/// Layout matches Linux `struct tcp_repair_opt`.
#[repr(C)]
pub struct TcpRepairOpt {
    /// Option code: TCPOPT_MSS (2), TCPOPT_WINDOW (3), TCPOPT_SACK_PERM (4),
    /// TCPOPT_TIMESTAMP (8).
    pub opt_code: u32,
    /// Option value. Interpretation depends on opt_code:
    ///   MSS: effective MSS value (u32).
    ///   WINDOW: packed snd_wscale | (rcv_wscale << 16).
    ///   SACK_PERM: 0 = disabled, 1 = enabled.
    ///   TIMESTAMP: not used (timestamp restored via TCP_TIMESTAMP sockopt).
    pub opt_val: u32,
}
// UAPI ABI: opt_code(4) + opt_val(4) = 8 bytes.
const_assert!(core::mem::size_of::<TcpRepairOpt>() == 8);

/// TCP_REPAIR_WINDOW parameters. Layout matches Linux `struct tcp_repair_window`.
#[repr(C)]
pub struct TcpRepairWindow {
    /// Sequence number of last window update (snd_wl1).
    pub snd_wl1: u32,
    /// Current send window size.
    pub snd_wnd: u32,
    /// Maximum window size ever seen from peer.
    pub max_window: u32,
    /// Current receive window size.
    pub rcv_wnd: u32,
    /// Receive window update sequence (rcv_wup).
    pub rcv_wup: u32,
}
// UAPI ABI: 5 × u32(4) = 20 bytes.
const_assert!(core::mem::size_of::<TcpRepairWindow>() == 20);

/// Repair queue selection constants.
pub const TCP_NO_QUEUE: i32 = 0;
pub const TCP_RECV_QUEUE: i32 = 1;
pub const TCP_SEND_QUEUE: i32 = 2;

16.1.5.3 Security

  • CAP_NET_ADMIN is required to enter repair mode. This is checked via sock_ns_capable(sock, CAP_NET_ADMIN) (Section 16.3), scoped to the socket's network namespace. A container with CAP_NET_ADMIN can repair its own connections but not the host's.
  • Repair mode is only valid for connections in ESTABLISHED, CLOSE_WAIT, or SYN_SENT state. Attempting to enter repair mode in other states returns EPERM.
  • While in repair mode, all TCP processing is suppressed. The connection may time out on the peer side if repair mode is held too long. CRIU typically completes the checkpoint in under 100ms.

16.1.6 SO_REUSEPORT

SO_REUSEPORT allows multiple sockets (typically in different threads or processes) to bind to the same (IP, port) tuple. The kernel distributes incoming connections and datagrams across all sockets in the reuseport group, eliminating the accept() thundering herd problem and enabling efficient multi-worker server architectures.

16.1.6.1 Distribution Mechanism

Incoming connections (TCP SYNs) and datagrams (UDP packets) are distributed across the reuseport group by hashing the 4-tuple:

socket_index = jhash(src_ip, src_port, dst_ip, dst_port, group_random) % group_size

group_random is a random u32 generated when the reuseport group is created (on the first bind() with SO_REUSEPORT set). It prevents an external attacker from predicting which socket receives a given connection.

Flow stability: Established TCP connections are unaffected by membership changes — each accepted connection is already bound to its socket. For hash-selected traffic (new SYNs, unconnected UDP flows), changing the group size changes the modulus, so a fraction of flows are remapped; applications needing stickier behavior during rolling restarts can attach a BPF selector (Section 16.1.6.2).
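The selection formula above can be modeled in userspace. In this sketch `DefaultHasher` stands in for the kernel's jhash — the point is the selection logic (salted flow hash reduced modulo group size), not the hash function:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Userspace model of reuseport socket selection: hash the 4-tuple plus the
/// group's random salt, then reduce modulo the group size.
/// DefaultHasher is a stand-in for jhash; names are illustrative.
fn reuseport_select(
    src_ip: u32, src_port: u16, dst_ip: u32, dst_port: u16,
    group_random: u32, group_size: usize,
) -> usize {
    let mut h = DefaultHasher::new();
    (src_ip, src_port, dst_ip, dst_port, group_random).hash(&mut h);
    (h.finish() as usize) % group_size
}

fn main() {
    // Same flow + same salt → same socket index (flow affinity).
    let a = reuseport_select(0x0a00_0001, 40000, 0x0a00_0002, 80, 0xdead_beef, 8);
    let b = reuseport_select(0x0a00_0001, 40000, 0x0a00_0002, 80, 0xdead_beef, 8);
    assert_eq!(a, b);
    assert!(a < 8); // index is always within the group
}
```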

16.1.6.2 BPF Override

Applications may attach a BPF program to override the kernel's default distribution:

// Classic BPF (cBPF):
setsockopt(fd, SOL_SOCKET, SO_ATTACH_REUSEPORT_CBPF, &prog, sizeof(prog));

// Extended BPF (eBPF):
setsockopt(fd, SOL_SOCKET, SO_ATTACH_REUSEPORT_EBPF, &bpf_fd, sizeof(bpf_fd));

The BPF program receives the incoming packet's headers and returns a socket index (0-based into the reuseport group). This enables application-specific routing: consistent hashing by HTTP/2 stream ID, sticky sessions by client IP, or weighted distribution across workers with different capacities.

See Section 16.18 for the BPF program verification and attachment lifecycle.

16.1.6.3 Reuseport Group

/// A reuseport group: all sockets bound to the same (protocol, local_addr, local_port)
/// within the same network namespace.
///
/// The group is created when the first socket with SO_REUSEPORT calls bind().
/// Subsequent sockets joining the group must have the same effective UID as the
/// creator (security: prevents port hijacking by other users).
pub struct ReuseportGroup {
    /// Group key: (protocol, local address, local port, network namespace pointer).
    pub key: ReuseportKey,
    /// Sockets in the group. Order determines index for BPF selection.
    /// ArrayVec bounded by MAX_REUSEPORT_GROUP_SIZE (256).
    ///
    /// **Concurrency model**: Membership changes (socket join/leave) are rare
    /// (process start/stop). The SYN receive path reads the socket array at
    /// high frequency (up to 1M SYN/sec for reuseport nginx). To avoid
    /// SpinLock contention on the hot read path:
    /// - **Reads**: use `sockets_rcu` snapshot under `rcu_read_lock()` (zero
    ///   lock acquisition, zero contention).
    /// - **Writes**: acquire `sockets_write` SpinLock, modify the ArrayVec,
    ///   publish a new snapshot to `sockets_rcu` via `RcuCell::update()`.
    ///
    /// **Vtable liveness**: `Weak<dyn SocketOps>` references are Tier 1
    /// internal ([Section 16.4](#socket-operation-dispatch)). Live evolution of the
    /// umka-net module replaces all `SocketOps` vtables atomically; the
    /// general evolution mechanism ([Section 13.18](13-device-classes.md#live-kernel-evolution)) ensures
    /// old vtable code remains valid until all references are dropped.
    sockets_write: SpinLock<ArrayVec<Weak<dyn SocketOps>, 256>>,
    /// RCU-protected snapshot of the socket array for lock-free reads on
    /// the SYN receive hot path. Updated (clone-and-swap) under
    /// `sockets_write` lock. Readers call `sockets_rcu.read()` under
    /// `rcu_read_lock()`.
    pub sockets_rcu: RcuCell<ArrayVec<Weak<dyn SocketOps>, 256>>,
    /// Random value for hash-based distribution (generated at group creation).
    pub hash_random: u32,
    /// UID of the socket that created the group (for ownership check on join).
    pub owner_uid: u32,
    /// Optional attached eBPF program for custom distribution.
    pub bpf_prog: Option<Arc<BpfProg>>,
}

/// Maximum sockets in a single reuseport group.
/// 256 is sufficient for any practical multi-worker configuration (e.g., nginx
/// with one worker per CPU core; 256 cores is the upper bound for Phase 2).
pub const MAX_REUSEPORT_GROUP_SIZE: usize = 256;

16.1.6.4 Requirements

  • Every socket in the group must set SO_REUSEPORT before calling bind(). Setting it after bind() has no effect.
  • All sockets must have the same effective UID as the group creator. This prevents unprivileged port hijacking (a non-root process cannot join a root-created reuseport group). This matches Linux's behavior since kernel 4.6.
  • The group is keyed by (protocol, local_addr, local_port, net_namespace). Different network namespaces maintain independent reuseport groups (Section 17.1).
  • When a socket is closed, it is removed from the group. When the last socket leaves, the group is deallocated.
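The group-creation and UID-gated join rules above can be sketched as a toy userspace model. `Key`, `Group`, and `bind_reuseport` are illustrative names, not spec types:

```rust
use std::collections::hash_map::Entry;
use std::collections::HashMap;

/// Toy model of reuseport group membership: groups are keyed by
/// (protocol, local_addr, local_port, netns); joining requires the
/// creator's effective UID (prevents port hijacking).
#[derive(Hash, PartialEq, Eq, Clone)]
struct Key { proto: u8, addr: u32, port: u16, netns: u64 }

struct Group { owner_uid: u32, sockets: Vec<u64> }

struct Groups(HashMap<Key, Group>);

impl Groups {
    fn bind_reuseport(&mut self, key: Key, uid: u32, sock: u64) -> Result<(), &'static str> {
        match self.0.entry(key) {
            // First SO_REUSEPORT bind creates the group; binder becomes owner.
            Entry::Vacant(e) => {
                e.insert(Group { owner_uid: uid, sockets: vec![sock] });
                Ok(())
            }
            Entry::Occupied(mut e) => {
                let g = e.get_mut();
                if g.owner_uid != uid {
                    return Err("EACCES: UID mismatch"); // port hijacking prevented
                }
                g.sockets.push(sock);
                Ok(())
            }
        }
    }
}

fn main() {
    let mut gs = Groups(HashMap::new());
    let k = Key { proto: 6, addr: 0, port: 80, netns: 1 };
    assert!(gs.bind_reuseport(k.clone(), 1000, 1).is_ok()); // creator
    assert!(gs.bind_reuseport(k.clone(), 1000, 2).is_ok()); // same UID joins
    assert!(gs.bind_reuseport(k, 0, 3).is_err());           // different UID rejected
}
```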

16.1.7 TCP-MD5 Signature (RFC 2385)

TCP-MD5 provides per-segment authentication for BGP sessions. Each outgoing segment includes an MD5 digest computed over the TCP pseudo-header, segment data, and a shared secret key. UmkaOS supports TCP-MD5 for Linux ABI compatibility, though TCP-AO (RFC 5925) is the recommended successor.

/// TCP-MD5 signature key configuration.
/// Layout matches Linux `struct tcp_md5sig` for setsockopt compatibility.
#[repr(C)]
pub struct TcpMd5Sig {
    /// Peer address (IPv4 or IPv6) to associate the key with.
    pub tcpm_addr: SockAddrStorage,
    /// Flags (TCP_MD5SIG_EXT): TCP_MD5SIG_FLAG_PREFIX marks tcpm_prefixlen as
    /// valid; TCP_MD5SIG_FLAG_IFINDEX marks tcpm_ifindex as valid.
    /// Zero for the base TCP_MD5SIG option.
    pub tcpm_flags: u8,
    /// Prefix length for address matching (TCP_MD5SIG_EXT only; 0 = exact match).
    pub tcpm_prefixlen: u8,
    /// Key length in bytes (max 80).
    pub tcpm_keylen: u16,
    /// Interface index for VRF binding (TCP_MD5SIG_EXT only; 0 = any interface).
    pub tcpm_ifindex: i32,
    /// Shared secret key (up to 80 bytes, matching Linux's TCP_MD5SIG_MAXKEYLEN).
    pub tcpm_key: [u8; 80],
}
// UAPI ABI: SockAddrStorage(128)+flags(1)+prefixlen(1)+keylen(2)+ifindex(4)+key(80) = 216 bytes.
const_assert!(core::mem::size_of::<TcpMd5Sig>() == 216);

Per-connection MD5 keys are stored in an ArrayVec<TcpMd5Key, 8> on the TcpCb (maximum 8 peer-specific keys per socket, matching typical BGP deployments). The MD5 digest is computed inline on the TX path and verified on the RX path — segments failing verification are silently dropped (RFC 2385 Section 3).

16.1.8 TCP_ULP (Upper Layer Protocols)

TCP_ULP is the extension point for attaching protocol layers above TCP but below the application. Once attached, the ULP intercepts sendmsg() / recvmsg() and performs transparent processing (encryption, compression, etc.).

// Attach kTLS:
setsockopt(fd, SOL_TCP, TCP_ULP, "tls", 3);

Phase 2 scope: Only kTLS is implemented (Section 16.15). The ULP registration mechanism is designed to accommodate future ULPs (e.g., SMC-R for RDMA-accelerated sockets) but no others are specified for Phase 2.

ULP attachment is one-way: once a ULP is attached, it cannot be detached or replaced. The socket must be in ESTABLISHED state (post-handshake) at attachment time.

16.1.9 TCP Zero-Copy

TCP zero-copy reduces CPU usage for bulk data transfers by avoiding memory copies between userspace and kernel buffers.

16.1.9.1 Zero-Copy Send (SO_ZEROCOPY)

When SO_ZEROCOPY is enabled and sendmsg() includes the MSG_ZEROCOPY flag:

  1. The kernel pins the userspace pages containing the send data (instead of copying).
  2. Pages are DMA'd directly from userspace memory to the NIC.
  3. After transmission, a completion notification is posted to MSG_ERRQUEUE. The notification carries SO_EE_ORIGIN_ZEROCOPY with a notification ID range (ee_info = start ID, ee_data = end ID inclusive) that the application uses to identify which send buffers are now safe to reuse. Each MSG_ZEROCOPY sendmsg() is assigned a monotonically increasing per-socket notification ID (starting at 0). Multiple completions may be batched into a single notification (start ID < end ID). See Section 16.8 for the full sock_extended_err layout.
    recvmsg(fd, &msg, MSG_ERRQUEUE);
    // msg.msg_control → sock_extended_err with ee_info/ee_data = notification ID range
    
  4. After draining the notification, the application may reuse/free the buffer.

Trade-off: Page pinning + notification overhead is ~2 microseconds. Below approximately 10 KB per send, the overhead exceeds the copy cost. The kernel does NOT silently fall back — applications should check the buffer size before using MSG_ZEROCOPY.
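The completion-notification bookkeeping in step 3 can be sketched from the application's side: each MSG_ZEROCOPY send is assigned the next notification ID, and each MSG_ERRQUEUE notification reports an inclusive [ee_info, ee_data] range after which those buffers are reusable. This sketch assumes in-order range delivery for simplicity; `ZcTracker` is an illustrative name, not a spec type:

```rust
/// Userspace-side bookkeeping sketch for MSG_ZEROCOPY completions.
struct ZcTracker {
    next_id: u32,          // ID assigned to the next MSG_ZEROCOPY sendmsg()
    completed_below: u32,  // all IDs < this value have completed
}

impl ZcTracker {
    fn new() -> Self {
        Self { next_id: 0, completed_below: 0 }
    }

    /// Called per MSG_ZEROCOPY sendmsg(); returns the ID assigned to this send.
    fn on_send(&mut self) -> u32 {
        let id = self.next_id;
        self.next_id += 1;
        id
    }

    /// Called with the ee_info..=ee_data range from a SO_EE_ORIGIN_ZEROCOPY
    /// notification. Assumes ranges arrive in order (simplification).
    fn on_completion(&mut self, lo: u32, hi: u32) {
        assert_eq!(lo, self.completed_below); // in-order model
        self.completed_below = hi + 1;
    }

    /// A send buffer may be reused once its notification has been drained.
    fn buffer_reusable(&self, id: u32) -> bool {
        id < self.completed_below
    }
}

fn main() {
    let mut t = ZcTracker::new();
    let a = t.on_send();
    let b = t.on_send();
    let c = t.on_send();
    t.on_completion(a, b); // one batched notification covers sends a..=b
    assert!(t.buffer_reusable(a) && t.buffer_reusable(b));
    assert!(!t.buffer_reusable(c));
}
```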

16.1.9.2 Zero-Copy Receive (TCP_ZEROCOPY_RECEIVE)

For high-throughput bulk receives, the kernel maps incoming packet pages directly into userspace via mmap(). See Section 16.8 for the full specification including completion notifications and page reclaim policy.

/// TCP_ZEROCOPY_RECEIVE parameter struct.
/// Layout matches Linux `struct tcp_zerocopy_receive`.
#[repr(C)]
pub struct TcpZeroCopyReceive {
    /// Virtual address of the mmap'd receive region.
    pub address: u64,
    /// In: length of the receive region (must be page-aligned).
    /// Out: number of bytes actually mapped.
    pub length: u32,
    /// Output: bytes at the head of the receive queue that could not be mapped
    /// and must be read via ordinary recvmsg().
    pub recv_skip_hint: u32,
    /// Output: bytes remaining in the receive queue after this operation.
    pub inq: u32,
    /// Error code (0 on success, EAGAIN if no data, EINVAL on bad params).
    pub err: i32,
    /// Copy-hint: if nonzero, kernel recommends copy for this range (data too small).
    pub copybuf_address: u64,
    /// Length of the copy-hint buffer.
    pub copybuf_len: u32,
    /// Flags (reserved, must be 0).
    pub flags: u32,
}
// UAPI ABI: address(8)+length(4)+recv_skip_hint(4)+inq(4)+err(4)+copybuf_address(8)+copybuf_len(4)+flags(4) = 40 bytes.
const_assert!(core::mem::size_of::<TcpZeroCopyReceive>() == 40);

16.1.10 SYN Cookies

When the SYN backlog (tcp_max_syn_backlog) for a listening socket is exhausted, UmkaOS activates SYN cookies to accept new connections without allocating TcpRequest state. This defends against SYN flood attacks while maintaining service availability.

Encoding: The SYN-ACK ISN (initial sequence number) encodes connection parameters into a 32-bit cookie:

/// SYN cookie format (32-bit ISN sent in SYN-ACK):
///
/// Bits [31:8]  = truncated HMAC-SHA256(server_secret, client_ip, client_port,
///                server_ip, server_port, timestamp_top5) — 24-bit authenticator.
/// Bits [7:3]   = timestamp counter (top 5 bits of seconds/64, wraps every ~34 min).
/// Bits [2:0]   = MSS index (0-7, encoding the negotiated MSS tier).
///
/// The server_secret is a 256-bit key generated from the kernel CSPRNG at boot
/// and rotated every 60 seconds (two keys active during rotation — current and
/// previous — to handle in-flight SYN-ACKs). Rotation uses AtomicU64 index swap,
/// no lock.
pub struct SynCookie;

/// SYN cookie key state. Two 256-bit keys allow seamless rotation: the current
/// key generates new cookies, the previous key validates cookies generated
/// before the last rotation. This ensures in-flight SYN-ACKs (which may take
/// up to 2 × RTO ≈ 6s to be acknowledged) are never invalidated by rotation.
///
/// Allocated once per network namespace and stored in the namespace's TCP
/// configuration (`TcpNsConfig`). Accessed under `rcu_read_lock()` on the
/// validation path (hot) and under a Mutex on the rotation path (cold, every
/// 60 seconds).
pub struct SynCookieSecret {
    /// Current HMAC key (256 bits). All new SYN cookies are generated with
    /// this key. Populated from the kernel CSPRNG at boot and on each rotation.
    pub key: [u8; 32],
    /// Previous HMAC key (256 bits). Retained for one rotation cycle to validate
    /// cookies generated before the last key swap. After two rotations, a cookie
    /// signed with this key will fail timestamp validation anyway (>2 minutes old).
    pub prev_key: [u8; 32],
    /// Monotonic timestamp (jiffies) of the most recent key rotation. Used to
    /// determine when the next rotation is due (every 60 seconds). The rotation
    /// timer runs as a deferred work item in the per-namespace TCP timer context.
    pub rotation_time: AtomicU64,
}

/// MSS encoding table. The 3-bit index maps to common MSS values.
/// On ACK receipt, the server reconstructs the negotiated MSS from the index.
///
/// **Linux baseline** (verified against `torvalds/linux` master
/// `net/ipv4/syncookies.c` `msstab[]`): Linux uses exactly 4 entries
/// `{536, 1300, 1440, 1460}` with a 2-bit index (COOKIEBITS = 24, MSS
/// index mixed into the low 2 bits of the 24-bit hash).
///
/// **UmkaOS extension**: UmkaOS expands to 8 entries with a 3-bit index
/// to cover jumbo, loopback, and TSO MSS values. This is ABI-compatible
/// because SYN cookies are kernel-internal state — the MSS value is
/// reconstructed server-side and never exposed in the wire protocol.
/// Each side (client, server) uses its own encoding independently.
///
/// **Bit layout difference**: Linux uses upper 8 bits = count/timestamp,
/// lower 24 bits = hash with MSS index in low 2 bits. UmkaOS uses
/// bits [31:8] = HMAC, bits [7:3] = timestamp (5 bits), bits [2:0] = MSS
/// index (3 bits). The UmkaOS layout is self-consistent and provides
/// 8 MSS entries at the cost of 5-bit timestamp granularity (32 steps
/// vs Linux's 8-bit / 256 steps — still sufficient for SYN cookie
/// expiry detection within the ~2-minute validity window (±2 steps
/// of 64 seconds; see the ACK validation procedure below).
pub const SYN_COOKIE_MSS_TABLE: [u16; 8] = [
    536,   // 0: minimum (RFC 879)
    1300,  // 1: PPPoE typical
    1440,  // 2: PPPoE over Ethernet
    1460,  // 3: Ethernet (default)
    4312,  // 4: jumbo-lite
    8960,  // 5: jumbo (9K frame - 40B headers)
    16384, // 6: loopback
    32768, // 7: TSO segment
];

Activation:

| Condition | Action |
| --- | --- |
| `syn_backlog < tcp_max_syn_backlog` | Normal SYN processing: allocate TcpRequest, send SYN-ACK |
| `syn_backlog >= tcp_max_syn_backlog` | SYN cookie mode: compute cookie ISN, send SYN-ACK, do NOT allocate TcpRequest |
| sysctl `net.ipv4.tcp_syncookies = 0` | SYN cookies disabled: drop SYN when backlog full |
| sysctl `net.ipv4.tcp_syncookies = 1` (default) | Activate SYN cookies only when backlog full |
| sysctl `net.ipv4.tcp_syncookies = 2` | Always use SYN cookies (for testing; not recommended for production) |

ACK validation (on receiving the 3-way handshake ACK):

validate_syn_cookie(ack_seq, client_addr, server_addr):
  1. cookie = ack_seq - 1  (ISN was ack_seq - 1)
  2. Extract timestamp_index = (cookie >> 3) & 0x1F
  3. Verify timestamp is within ±2 of current (rejects cookies older than ~2 min)
  4. Recompute HMAC with current and previous server_secret
  5. Compare upper 24 bits — if neither key matches, drop (forged cookie)
  6. Extract mss_index = cookie & 0x7 → reconstruct MSS from table
  7. Allocate full connection state (TcpCb), initialize with reconstructed MSS
  8. Connection proceeds normally
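The encode/validate procedure above can be modeled end to end. The following is a userspace sketch, not the kernel implementation: a keyed integer mixer stands in for the truncated HMAC-SHA256, and only the bit layout, MSS table, and validation order follow the spec text:

```rust
/// Runnable model of the 32-bit SYN cookie layout:
/// bits [31:8] MAC, bits [7:3] timestamp counter, bits [2:0] MSS index.
const MSS_TABLE: [u16; 8] = [536, 1300, 1440, 1460, 4312, 8960, 16384, 32768];

/// NOT cryptographic — a toy keyed mixer standing in for truncated HMAC-SHA256.
fn mac24(key: u64, flow: u64, ts5: u32) -> u32 {
    let mut x = key ^ flow ^ (ts5 as u64).wrapping_mul(0x9e37_79b9_7f4a_7c15);
    x ^= x >> 33;
    x = x.wrapping_mul(0xff51_afd7_ed55_8ccd);
    x ^= x >> 33;
    (x as u32) & 0x00ff_ffff
}

fn make_cookie(key: u64, flow: u64, now_secs: u64, mss_idx: u32) -> u32 {
    let ts5 = ((now_secs / 64) & 0x1f) as u32; // top-of-seconds/64 counter, 5 bits
    (mac24(key, flow, ts5) << 8) | (ts5 << 3) | (mss_idx & 0x7)
}

/// keys = [current, previous] server secrets (rotation support).
fn validate(keys: [u64; 2], flow: u64, now_secs: u64, cookie: u32) -> Option<u16> {
    let ts5 = (cookie >> 3) & 0x1f;
    let cur = ((now_secs / 64) & 0x1f) as u32;
    // Accept timestamps within ±2 steps (wrapping 5-bit arithmetic, ~2 min).
    if (cur.wrapping_sub(ts5) & 0x1f).min(ts5.wrapping_sub(cur) & 0x1f) > 2 {
        return None;
    }
    // Try the current key, then the previous key.
    if !keys.iter().any(|&k| mac24(k, flow, ts5) == cookie >> 8) {
        return None; // forged cookie
    }
    Some(MSS_TABLE[(cookie & 0x7) as usize]) // reconstructed MSS
}

fn main() {
    let (cur, prev, flow) = (0x1111u64, 0x2222u64, 0xab_cdefu64);
    let c = make_cookie(cur, flow, 1_000_000, 3);
    assert_eq!(validate([cur, prev], flow, 1_000_000, c), Some(1460));
    // Cookie older than the validity window is rejected on timestamp alone.
    assert_eq!(validate([cur, prev], flow, 1_000_000 + 64 * 10, c), None);
}
```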

Limitations: SYN cookies cannot encode TCP options beyond MSS. When SYN cookies are active: window scaling is disabled (the scale factor from the SYN is lost), SACK is disabled, and ECN is disabled. These limitations are acceptable during an active SYN flood — they degrade performance but maintain availability.

Timestamp-based option recovery (Linux 3.x+): If the client included TCP timestamps in the SYN (TSopt), the server encodes window scale (4 bits) and SACK-permitted (1 bit) into the low bits of the SYN-ACK timestamp value. On ACK receipt, these options are recovered from the echoed timestamp. This partially mitigates the option loss.
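The TSopt recovery above can be sketched in a few lines: window scale (4 bits) and SACK-permitted (1 bit) are folded into the low bits of the SYN-ACK timestamp value and read back from the echoed TSecr on the final ACK. The exact bit positions chosen here are illustrative, not a wire-format commitment of the spec:

```rust
/// Fold window scale (4 bits) and SACK-permitted (1 bit) into the low bits
/// of the SYN-ACK TSval; bit positions are illustrative.
fn encode_tsval(real_ts: u32, snd_wscale: u8, sack_ok: bool) -> u32 {
    debug_assert!(snd_wscale <= 14); // RFC 7323 cap
    (real_ts & !0x1f) | ((snd_wscale as u32) & 0xf) | ((sack_ok as u32) << 4)
}

/// Recover the options from the echoed TSecr on the handshake ACK.
fn decode_tsecr(tsecr: u32) -> (u8, bool) {
    ((tsecr & 0xf) as u8, tsecr & 0x10 != 0)
}

fn main() {
    let tsval = encode_tsval(0x1234_5678, 7, true);
    assert_eq!(decode_tsecr(tsval), (7, true));
}
```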

16.1.11 Cross-References

  • Section 16.10 — congestion control algorithms, CongestionOps trait, per-socket selection lifecycle, sysctl entries
  • Section 16.3 — SockCommon struct, SocketOps trait, setsockopt dispatch, namespace-scoped capability checks
  • Section 16.15 — kTLS upper layer protocol, cipher suites, NIC offload
  • Section 16.2 — TCP packet processing path (RX/TX), KABI ring protocol for Tier 1 data transfer
  • Section 16.8 — TcpCb struct, TCP state machine, timer specifications, zero-copy receive
  • Section 16.11 — MPTCP subflow architecture, scheduler trait
  • Section 16.14 — busy-poll integration (SO_BUSY_POLL, SO_PREFER_BUSY_POLL)
  • Section 16.18 — SO_ATTACH_BPF, SO_ATTACH_REUSEPORT_EBPF, BPF program verification
  • Section 17.1 — per-namespace TCP sysctls, reuseport group isolation, capability scoping
  • Section 16.21 — SO_PRIORITY, SO_TXTIME, QoS classification
  • Section 16.5 — NetBuf used for zero-copy page management

16.2 Network Stack Architecture

umka-net is the Tier 1 network stack. It runs in its own isolation domain, separate from umka-core. The kernel never executes protocol processing directly — all network I/O crosses the domain boundary via ring buffers (~23 cycles per crossing, Section 11.2).

kernel_services binding: umka-net obtains its KernelServicesVTable handle during the Hello protocol at module load time (Section 12.8). The handle is stored as a module-level static:

/// Handle to Tier 0 kernel services, obtained during umka-net's Hello
/// protocol. Used for cross-domain callbacks: `wake_socket()`,
/// `timer_arm()`, `alloc_page()`, etc. Resolved at bind time to either
/// a direct vtable pointer (if umka-net runs as Tier 0 during boot) or
/// a ring handle (normal Tier 1 operation).
///
/// Initialized once at module load; never changes during umka-net's lifetime.
/// A domain crash + restart re-runs the Hello protocol and re-binds.
static KERNEL_SERVICES: OnceCell<KabiHandle> = OnceCell::new();

All references to kernel_services in the TCP/UDP/socket code resolve to KERNEL_SERVICES.get(). This is a single atomic load (OnceCell fast path) per call site — negligible overhead.
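The binding pattern has a direct userspace analogue in `std::sync::OnceLock`. The sketch below models `KabiHandle` as a plain u64; the function names are illustrative:

```rust
use std::sync::OnceLock;

/// Userspace analogue of the KERNEL_SERVICES binding: initialized exactly
/// once at "module load", then read via a cheap fast path.
static KERNEL_SERVICES: OnceLock<u64> = OnceLock::new();

/// Bind once during the Hello protocol; returns false if already bound
/// (first bind wins).
fn hello_protocol_bind(handle: u64) -> bool {
    KERNEL_SERVICES.set(handle).is_ok()
}

/// Fast-path accessor: a single atomic load once initialized.
fn kernel_services() -> u64 {
    *KERNEL_SERVICES.get().expect("Hello protocol not completed")
}

fn main() {
    assert!(hello_protocol_bind(0x4242));
    assert_eq!(kernel_services(), 0x4242);
}
```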

The stack is layered, with each layer communicating through well-defined internal interfaces:

Application (userspace)
    |  syscall (socket, bind, listen, accept, read, write, sendmsg, recvmsg)
    v
umka-core: socket dispatch (translates fd ops to umka-net ring commands)
    |  domain ring buffer (~23 cycles)
    v
umka-net (Tier 1):
    Socket layer (protocol-agnostic)
    |
    Transport layer (TCP, UDP, SCTP, MPTCP)
    |
    Network layer (IPv4, IPv6, routing, netfilter)
    |
    Link layer (ARP/NDP, bridge, VLAN)
    |  domain ring buffer (~23 cycles)
    v
NIC driver (Tier 1): device-specific TX/RX

The four domain switches (two domain entries — NIC driver and umka-net — each requiring an enter and an exit) add ~92 cycles total to the packet delivery path from NIC hardware to the socket receive buffer (4 × ~23 cycles; see Section 16.12 for detailed analysis). The recvmsg() syscall adds 2 additional switches, for 6 total on the complete NIC-to-userspace path. For comparison, a single sendmsg() syscall in Linux costs ~700-1800 cycles in syscall transition overhead on modern hardware with Spectre/Meltdown mitigations enabled (~200-400 cycles pre-mitigation). That figure is the SYSCALL/SYSRET ring crossing alone — a full Linux process context switch costs 5,000-20,000 cycles due to TLB flushes, cache pollution, and scheduler overhead. The domain boundary is thus cheaper than even a bare syscall transition.

Linux comparison: Linux's network stack is monolithic — TCP, IP, Netfilter, and the socket layer all execute in the same address space with no isolation, so a buffer overflow in a Netfilter module can corrupt TCP connection state. In UmkaOS, a bug in the VXLAN tunnel parser cannot corrupt the TCP congestion window of an unrelated connection: hardware domain isolation enforces the boundaries between umka-net and the NIC driver and between umka-net and umka-core, while within umka-net itself, Rust's ownership model and memory safety provide the intra-module separation.

16.2.1 RX Packet Delivery Path (L2 → L3 → L4)

The following sequence describes the complete receive path from NIC hardware through protocol dispatch and routing to socket delivery. Each step identifies the responsible subsystem and the exact trigger point for routing lookup.

1. NIC hardware → DMA completion → NapiContext.poll()
       ([Section 16.14](#napi-new-api-for-packet-polling): NAPI poll drains RX ring descriptors)

2. poll() calls napi_receive_buf(napi, buf) for each packet
       (driver builds NetBuf from RX descriptor, accumulates into rx_batch)

3. napi_complete_done() → napi_deliver_batch(napi)
       **DOMAIN SWITCH: Tier 0 → Tier 1 (umka-net)**
       One domain switch for the entire batch (up to 64 packets per poll cycle).
       NAPI (Tier 0) passes the batch of NetBufHandles to umka-net via shared
       memory. The domain switch cost (~23 cycles, [Section 16.12](#domain-switch-overhead-analysis))
       is amortized across the entire batch.

4. umka-net batch receive → NetRxContext::receive_batch()
       GRO coalescing runs here: flow-matched packets are merged into
       coalesced super-packets. GRO is protocol-aware (TCP/UDP header parsing)
       and belongs in the protocol stack (Tier 1), not in NAPI (Tier 0).
       GRO state (hash tables, flow tracking, flush timeout) lives in
       NetRxContext — see §NetRxContext below.

5. netif_receive_buf() (inside umka-net, Tier 1):
   (UmkaOS naming: `netif_receive_buf()`, not Linux's `netif_receive_skb()`.
   UmkaOS uses `NetBuf`, not `sk_buff`, throughout the stack. The function
   is semantically equivalent but operates on `&mut NetBuf`.)
   a. VLAN processing (Section 16.22): strip 802.1Q tag if present,
      populate NetBuf.vlan_tci, dispatch to VlanDev if registered
   b. Parse EtherType from L2 header
   c. Dispatch to protocol handler by EtherType:
      - 0x0800 (IPv4) → ip_rcv()
      - 0x86DD (IPv6) → ipv6_rcv()
      - 0x0806 (ARP)  → arp_rcv() (Section 16.5, Neighbor Subsystem)

6. ip_rcv() / ipv6_rcv():
   a. Validate IP header (checksum, TTL/hop-limit, version, total length)
   b. **NF_INET_PRE_ROUTING hook** ([Section 16.18](#packet-filtering-bpf-based)):
      Invoke netfilter/BPF hooks before routing. Conntrack NEW/ESTABLISHED
      state lookup occurs here (connection tracking — ConntrackTuple,
      ConntrackEntry, hash table design — is fully specified in
      [Section 16.18](#packet-filtering-bpf-based) §Conntrack Subsystem).
      BPF_PROG_TYPE_CGROUP_SKB programs attached
      to the ingress cgroup are also invoked at this point.
      If any hook returns NF_DROP, drop packet (count in per-interface stats).
   c. Set NetBuf.l3_offset, NetBuf.protocol, NetBuf.addr_family
   d. **Routing lookup** (TRIGGER POINT):
      RouteTable::lookup(dst, src, mark, ifindex, protocol, sport, dport, uid, flow_hash)
      → RouteLookupResult
      (Section 16.5: FIB lookup using the receiving interface's network namespace,
       NetDevice.net_ns.routes; evaluated under rcu_read_lock())
   e. Cache result: netbuf.route_ext = Some(slab_alloc(route))
   f. Route decision based on RouteLookupResult.route_type:
      - RTN_LOCAL     → set NetBufFlags::LOCAL_IN, deliver to socket layer (step 7)
      - RTN_UNICAST   → set NetBufFlags::FORWARDED, decrement TTL, TX on egress interface
      - RTN_BROADCAST → deliver to raw sockets + local delivery
      - RTN_BLACKHOLE → drop silently
      - RTN_UNREACHABLE → drop + send ICMP Destination Unreachable
      - RTN_PROHIBIT  → drop + send ICMP Administratively Prohibited

6g. **NF_INET_LOCAL_IN hook** (for RTN_LOCAL only):
      Invoke netfilter LOCAL_IN hooks after routing confirmed local delivery.
      Conntrack ESTABLISHED confirmation occurs here
      (see [Section 16.18](#packet-filtering-bpf-based) for the full conntrack specification).

6h. **NF_INET_FORWARD hook** (for RTN_UNICAST only):
      Invoke netfilter FORWARD hooks for routed/forwarded packets.

7. L4 dispatch (for RTN_LOCAL):
   - protocol 6   (TCP)    → tcp_v4_rcv() / tcp_v6_rcv()
   - protocol 17  (UDP)    → udp_rcv()
   - protocol 1   (ICMP)   → icmp_rcv()
   - protocol 58  (ICMPv6) → icmpv6_rcv()
   - protocol 132 (SCTP)   → sctp_rcv()

Routing lookup is performed once per packet in ip_rcv() / ipv6_rcv() (step 6d). The result is cached via NetBuf.route_ext (Section 16.5) for use by subsequent processing stages (conntrack (Section 16.18), netfilter, socket delivery). The FIB lookup uses the packet's destination address and the receiving interface's network namespace (NetDevice.net_ns, Section 16.13; Section 17.1 for namespace routing table ownership).

Network namespace cleanup: when the last task exits a network namespace, cleanup runs in order: (1) close all sockets, (2) remove all network devices (triggers RTM_DELLINK), (3) flush routing tables, (4) flush conntrack entries, (5) release the PortAllocator (Section 17.1). Cleanup is deferred via RCU to allow in-flight packets referencing the namespace to drain before resources are freed.

The cached RouteLookupResult (Section 16.6) includes the resolved next-hop, output interface, effective MTU, and route type. The cache is valid for the lifetime of the NetBuf — routing table changes (RCU-swapped) do not invalidate in-flight packets because the old table remains accessible until the RCU grace period completes, and no NetBuf metadata struct outlives a single NAPI poll cycle (Section 16.14). (The packet data may outlive NAPI poll when copied or refcounted into socket receive buffers; it is the NetBuf descriptor itself — headers, offsets, routing cache — that is recycled within the poll cycle.)

16.2.2 NetRxContext: GRO State

NetRxContext is the per-NAPI-instance receive context inside umka-net (Tier 1). It owns all GRO (Generic Receive Offload) state and the batch processing logic. One NetRxContext exists per registered NapiContext — they are paired 1:1, but live in different isolation domains (NapiContext in Tier 0, NetRxContext in Tier 1).

/// Per-NAPI receive context inside umka-net (Tier 1). Owns GRO state
/// and batch receive processing. Paired 1:1 with a NapiContext in Tier 0.
///
/// Created when a NAPI instance registers with umka-net (at `napi_register()`
/// time). Destroyed when the NAPI instance is unregistered.
pub struct NetRxContext {
    /// NAPI instance ID (matches NapiContext.napi_id in Tier 0).
    /// Used to correlate batch deliveries with the correct GRO state.
    pub napi_id: u32,

    /// The NetDevice this context is associated with (for stats, namespace).
    /// This is the Tier 1 representation of `NapiContext.dev` (Tier 0):
    /// cross-domain safety requires integer handles, not Arc<NetDevice> pointers.
    /// Tier 1 umka-net uses the u32 interface index for route lookups and stats;
    /// Tier 0 NAPI holds the full `Arc<NetDevice>` for hardware operations.
    pub dev_index: u32,

    /// GRO (Generic Receive Offload) hash table. Packets are matched by
    /// flow key (src/dst IP, src/dst port, protocol) and coalesced into
    /// super-packets when headers match and payload is contiguous (TCP
    /// sequence numbers, UDP-GRO segment boundaries).
    ///
    /// 8 buckets matching Linux GRO_HASH_BUCKETS. Flow key hashed via
    /// jhash2 (same as Linux) to select the bucket.
    pub gro_hash: [GroList; GRO_HASH_BUCKETS],

    /// GRO bitmask: tracks which hash buckets have pending packets.
    /// Bit N is set if `gro_hash[N]` has at least one pending flow.
    /// Used by `gro_flush_all()` to skip empty buckets.
    pub gro_bitmask: u64,

    /// Number of packets currently held in GRO hash buckets (across
    /// all buckets). Used for accounting and backpressure.
    pub gro_count: u32,

    /// GRO flush timeout (nanoseconds). After this timeout, pending
    /// GRO packets are flushed even if the hash bucket is not full.
    /// Default: 0 (flush at end of batch only). Configurable via sysfs
    /// `/sys/class/net/<dev>/gro_flush_timeout`.
    pub gro_flush_timeout: u64,
}

/// GRO hash buckets (8 buckets, matching Linux GRO_HASH_BUCKETS).
pub const GRO_HASH_BUCKETS: usize = 8;

/// Per-bucket GRO flow list. Each bucket holds a linked list of
/// in-progress GRO flows. When a new packet arrives, the bucket is
/// searched for a matching flow. If found, the packet is coalesced;
/// otherwise a new flow entry is started (or the oldest flow is flushed
/// to make room, bounded by MAX_GRO_HELD = 8 per bucket).
///
/// **Bucket overflow policy**: If a bucket reaches `MAX_GRO_HELD` (8)
/// entries and a new non-matching flow arrives, the oldest incomplete
/// flow is force-flushed (delivered as-is to the IP layer). This bounds
/// memory usage per NAPI poll cycle. The flushed packet loses potential
/// coalescing but is not dropped.
pub struct GroList {
    /// Head of the linked list of in-progress GRO flows in this bucket.
    /// Each `NetBuf` in the chain has `gro_next: Option<NonNull<NetBuf>>`
    /// linking to the next flow entry. `None` if the bucket is empty.
    ///
    /// **Intrusive linked list exception** (see [Section 3.13](03-concurrency.md#collection-usage-policy)):
    /// GRO flow chaining via `NetBuf.gro_next` is an intrusive linked list,
    /// which normally violates UmkaOS policy. Justified because:
    /// 1. **Transient lifetime**: chain exists only within one NAPI poll
    ///    cycle (~10-100 us); flushed at `napi_complete_done()`.
    /// 2. **Bounded size**: max `MAX_GRO_HELD` (8) entries per bucket,
    ///    64 total across 8 buckets.
    /// 3. **Hot-path allocation constraint**: GRO runs in NAPI softirq
    ///    context at 100 Mpps; heap allocation is forbidden.
    /// Precedent: neighbor hash-bucket chains use the same pattern.
    pub head: Option<NonNull<NetBuf>>,
    /// Number of active flows in this bucket (0..=MAX_GRO_HELD).
    pub count: u32,
}

/// Maximum GRO flows held per hash bucket before force-flushing.
pub const MAX_GRO_HELD: u32 = 8;

/// GRO coalescing result for a single packet.
pub enum GroResult {
    /// Packet was merged into an existing flow. Caller does nothing.
    Merged,
    /// New flow started in the GRO hash. Packet held until flush.
    Held,
    /// Packet cannot be coalesced. Deliver to protocol stack immediately.
    Normal(NetBuf),
}

impl NetRxContext {
    /// Attempt to coalesce a received NetBuf with an existing GRO flow.
    ///
    /// # Algorithm
    /// 1. Parse the packet's L3/L4 headers to extract the flow key:
    ///    `(src_ip, dst_ip, src_port, dst_port, protocol)`.
    /// 2. If the packet is not GRO-eligible (non-TCP/non-UDP, or fragmented
    ///    IPv4 with MF=1, or no flow key parseable): return `GroResult::Normal`.
    /// 3. Hash the flow key via `jhash2()` to select a bucket index (0..7).
    /// 4. Search the bucket's flow list for a matching entry:
    ///    a. **Match found** (same 5-tuple AND contiguous — see GRO-TCP
    ///       coalescing invariant below):
    ///       - Append this packet's payload to the flow's coalesced NetBuf.
    ///       - Update `flow.expected_next_seq += payload_len` (TCP).
    ///       - Check combined size: if `flow.total_len >= GRO_MAX_SIZE` (65536),
    ///         flush the flow immediately via `gro_flush_flow()`.
    ///       - Return `GroResult::Merged`.
    ///    b. **No match** and bucket count < `MAX_GRO_HELD`:
    ///       - Start a new flow entry: set `expected_next_seq = seq + payload_len`.
    ///       - Insert at bucket head.
    ///       - Set `gro_bitmask |= (1 << bucket_index)`.
    ///       - Return `GroResult::Held`.
    ///    c. **No match** and bucket count == `MAX_GRO_HELD`:
    ///       - Force-flush the oldest flow in the bucket (tail of the list).
    ///       - Deliver the flushed flow to `netif_receive_buf()`.
    ///       - Insert the new packet as a new flow entry.
    ///       - Return `GroResult::Held`.
    ///
    /// # Performance
    /// Bucket search is O(MAX_GRO_HELD) = O(8) — bounded constant.
    /// Hash computation is ~15-20 cycles (jhash2). Total GRO per-packet
    /// overhead: ~30-50 cycles on cache-warm path.
    fn gro_receive(&mut self, buf: NetBuf) -> GroResult {
        // Parse flow key from L3/L4 headers.
        let flow_key = match parse_gro_flow_key(&buf) {
            Some(k) => k,
            None => return GroResult::Normal(buf),
        };

        let bucket_idx = (jhash2_flow(&flow_key) as usize) % GRO_HASH_BUCKETS;
        let bucket = &mut self.gro_hash[bucket_idx];

        // Search for matching flow.
        if let Some(flow) = bucket.find_matching(&flow_key, &buf) {
            flow.coalesce(buf);
            if flow.total_len >= GRO_MAX_SIZE {
                let flushed = bucket.remove_flow(flow);
                self.gro_count -= 1;
                netif_receive_buf(flushed);
            }
            return GroResult::Merged;
        }

        // No match — start new flow or flush oldest.
        if bucket.count >= MAX_GRO_HELD {
            let oldest = bucket.pop_oldest();
            self.gro_count -= 1;
            netif_receive_buf(oldest);
        }

        bucket.insert_head(buf, &flow_key);
        self.gro_bitmask |= 1 << bucket_idx;
        self.gro_count += 1;
        GroResult::Held
    }
}

/// GRO maximum coalesced packet size (bytes). Matching Linux `gro_max_size`.
pub const GRO_MAX_SIZE: u32 = 65536;
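
The flow-key extraction and bucket selection in steps 1 and 3 can be sketched as follows. This is a minimal model: `DefaultHasher` stands in for the kernel's `jhash2()`, and `GRO_HASH_BUCKETS = 8` is assumed from the 0..=7 bucket index range above.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Assumed bucket count, matching the 0..=7 bucket index range.
const GRO_HASH_BUCKETS: usize = 8;

/// 5-tuple flow key extracted from the L3/L4 headers (IPv4 shown).
#[derive(Hash, PartialEq, Eq, Clone, Copy)]
struct GroFlowKey {
    src_ip: u32,
    dst_ip: u32,
    src_port: u16,
    dst_port: u16,
    protocol: u8,
}

/// Map a flow key to a GRO bucket index; DefaultHasher stands in
/// for the kernel's jhash2().
fn gro_bucket_index(key: &GroFlowKey) -> usize {
    let mut h = DefaultHasher::new();
    key.hash(&mut h);
    (h.finish() as usize) % GRO_HASH_BUCKETS
}

fn main() {
    let key = GroFlowKey {
        src_ip: 0x0a00_0001, dst_ip: 0x0a00_0002,
        src_port: 443, dst_port: 50_000, protocol: 6, // TCP
    };
    let idx = gro_bucket_index(&key);
    assert!(idx < GRO_HASH_BUCKETS);
    // The same key always selects the same bucket.
    assert_eq!(idx, gro_bucket_index(&key));
}
```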

Batch receive protocol: When napi_deliver_batch() performs the Tier 0 → Tier 1 domain switch, the batch of NetBufHandle tokens (accumulated in Tier 0's NapiContext.rx_batch: ArrayVec<NetBufHandle, 64>) is passed via the KABI shared argument buffer, NOT via the socket KABI command/completion ring. The T1CommandEntry.arg_offset and T1CommandEntry.arg_len fields in the KABI ring entry point to the contiguous NetBufHandle array in shared memory. This is a direct batch transfer via shared memory accessible to both Tier 0 and Tier 1 — the 64 × 16 = 1024-byte handle array does not need to fit in a single 64-byte T1CommandEntry ring slot.

The handles are 16-byte tokens (pool-id + slot-index + generation + IOVA) that reference DMA-mapped packet data in the shared DMA pool (PKEY 14 / PKEY_SHARED on x86-64). Tier 1 code can read the packet data pages at the IOVA addresses carried in the handles without any additional mapping step — the DMA pool is pre-mapped into both Tier 0 and Tier 1 address spaces at boot time.
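
A plausible layout for the 16-byte handle token is sketched below; the field widths are illustrative assumptions here, and the normative definition lives in Section 16.5. The size arithmetic matches the 64 × 16 = 1024-byte batch described above.

```rust
/// Illustrative layout of the 16-byte NetBuf handle token.
/// Field widths are assumptions; see Section 16.5 for the real layout.
#[repr(C)]
#[derive(Clone, Copy)]
#[allow(dead_code)]
struct NetBufHandle {
    pool_id: u16,    // NetBufPool in the shared pool registry
    slot_index: u16, // slab slot within that pool
    generation: u32, // bumped on pool reset; stale handles are dropped
    iova: u64,       // DMA address of the packet data pages
}

fn main() {
    // 64 handles fill exactly the 1024-byte batch array.
    assert_eq!(std::mem::size_of::<NetBufHandle>(), 16);
    assert_eq!(64 * std::mem::size_of::<NetBufHandle>(), 1024);
}
```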

Domain switch count clarification: The four domain switches (two domain entries — NIC driver and umka-net — each requiring an enter and exit switch) add ~92 cycles total to the packet delivery path (NIC hardware to the socket receive buffer). The recvmsg() syscall adds 2 additional switches (Tier 0 → Tier 1 for RecvMsg command, Tier 1 → Tier 0 for response), totaling 6 for the complete NIC-to-userspace path.

umka-net's entry point is NetRxContext::receive_batch():

NetRxContext::receive_batch(handles: &[NetBufHandle]):
    for handle in handles:
        // Reconstruct NetBuf from handle. This is a pool-based lookup:
        //   1. Look up NetBufPool by handle.pool_id in the shared pool registry
        //      (pre-mapped into Tier 1 at domain init time).
        //   2. Validate handle.generation against the pool slot's current generation
        //      (stale handle = crash recovery in progress; drop silently).
        //   3. Compute NetBuf pointer via pool.claim(handle) — returns &mut NetBuf
        //      for the slab slot. Data pages are at handle.iova, already accessible
        //      via PKEY_SHARED.
        //   4. Initialize warm/cold fields: route_ext = None, gro_next = None,
        //      next = None, frag_ext = None.
        // On stale handle (generation mismatch): skip this handle, continue.
        // See [Section 16.5](#netbuf-packet-buffer--deserialization) for full specification.
        buf = match NetBufPool::claim(handle) {
            Some(b) => b,
            None => continue, // stale handle after driver crash/reload
        }

        // Attempt GRO coalescing.
        result = self.gro_receive(buf)
        match result:
            GroResult::Merged:
                // Packet absorbed into an existing GRO flow. No further
                // action until the flow is flushed.
                continue
            GroResult::Held:
                // New GRO flow started; gro_receive() already updated
                // gro_count and gro_bitmask. Packet held in gro_hash.
                continue
            GroResult::Normal(buf):
                // Packet cannot be coalesced (non-TCP/UDP, or flow mismatch
                // with full bucket). Deliver immediately.
                netif_receive_buf(buf)

    // After processing the entire batch, flush all GRO flows.
    gro_flush_all()
        → for each non-empty bucket (via gro_bitmask):
            flush held packets as coalesced super-packets
            → netif_receive_buf(coalesced_buf) for each flushed flow

GRO-TCP contiguous coalescing invariant: Two TCP segments are coalesceable iff all of the following hold:

  1. Same 5-tuple (src_ip, dst_ip, src_port, dst_port, protocol).
  2. seg.seq == flow.expected_next_seq (contiguous in sequence space).
  3. Same TCP flags (no FIN, RST, SYN, URG, or ECE/CWR changes mid-flow).
  4. Same IP TOS/DSCP and TTL.
  5. Identical TCP option sets (timestamp values may differ, but the set of options must match; segments with differing SACK blocks are not coalesced).
  6. Combined payload size does not exceed GRO_MAX_SIZE (65536 bytes, matching the Linux gro_max_size default).
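
The per-segment check can be sketched as a predicate over simplified segment metadata. Conditions 1 and 5 (5-tuple and option-set comparison) are elided, and the field set shown is illustrative, not the real NetBuf layout.

```rust
/// Illustrative subset of the per-segment metadata GRO consults.
#[derive(Clone, Copy)]
struct SegMeta {
    seq: u32,
    payload_len: u32,
    tcp_flags: u8, // FIN/RST/SYN/URG/ECE/CWR bits
    tos: u8,
    ttl: u8,
}

/// Illustrative subset of the per-flow GRO state.
struct GroFlow {
    expected_next_seq: u32,
    total_len: u32,
    tcp_flags: u8,
    tos: u8,
    ttl: u8,
}

const GRO_MAX_SIZE: u32 = 65536;

/// Conditions 2, 3, 4 and 6 of the invariant (the 5-tuple match and
/// option-set comparison happen before this check and are elided).
fn coalesceable(flow: &GroFlow, seg: &SegMeta) -> bool {
    seg.seq == flow.expected_next_seq
        && seg.tcp_flags == flow.tcp_flags
        && seg.tos == flow.tos
        && seg.ttl == flow.ttl
        && flow.total_len + seg.payload_len <= GRO_MAX_SIZE
}

fn main() {
    let flow = GroFlow { expected_next_seq: 1000, total_len: 2920, tcp_flags: 0x10, tos: 0, ttl: 64 };
    let contiguous = SegMeta { seq: 1000, payload_len: 1460, tcp_flags: 0x10, tos: 0, ttl: 64 };
    let gapped = SegMeta { seq: 2460, ..contiguous }; // skips ahead: flush, start a new flow
    assert!(coalesceable(&flow, &contiguous));
    assert!(!coalesceable(&flow, &gapped));
}
```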

ifindex inheritance: All segments in a GRO flow originate from the same NAPI instance (and thus the same NIC). The coalesced super-packet inherits ifindex from the first segment in the flow. GRO does not explicitly check ifindex (it is implicit from the per-NAPI-per-NIC structure), but the invariant holds: downstream routing lookup (RouteTable::lookup(dst, src, mark, ifindex, ...)) requires a valid ifindex, which the coalesced packet carries.

RSS hash non-reuse: The NIC RSS hash (Toeplitz algorithm) and TCP connection hash (Jenkins, matching Linux inet_ehashfn) use different algorithms. The RSS flow_hash carried in NetBuf cannot be substituted for the TCP hash table lookup — a separate Jenkins hash is computed per packet during tcp_established_hash lookup in tcp_rcv_established().

If any of the six coalescing conditions above fails, the existing flow is flushed as a coalesced super-packet and the new segment starts a fresh flow. This invariant ensures that the coalesced NetBuf presented to the TCP stack is indistinguishable from a single large segment — tcp_rcv_established() processes it as one contiguous receive.

GRO and route cache: GRO runs before the routing lookup (step 6d) — route_ext is None on all segments at GRO time. The coalesced super-packet has route_ext = None; the routing lookup in ip_rcv() (step 6d) populates it after GRO completes. This is correct: GRO only coalesces packets from the same flow (same src/dst/protocol), and the routing lookup is deferred to a single call on the coalesced super-packet rather than per-segment.

NetRxContext lookup (Tier 0 → Tier 1 dispatch):

When napi_deliver_batch() performs the domain switch from Tier 0 to Tier 1, umka-net must locate the correct NetRxContext for the originating NAPI instance. The lookup mechanism is an XArray keyed by napi_id: u32:

/// Global NetRxContext registry inside umka-net (Tier 1).
/// Keyed by napi_id (u32), matching NapiContext.napi_id in Tier 0.
/// XArray provides O(1) lookup on the hot RX path with RCU-safe reads.
static NET_RX_CONTEXTS: XArray<NetRxContext> = XArray::new();
  • Registration: When a NIC driver calls napi_register() (Tier 0), umka-core forwards the registration to umka-net via a kabi_call! (warm-path KABI ring command, distinct from the hot-path batch delivery which uses the KABI shared argument buffer). umka-net creates a new NetRxContext and inserts it into NET_RX_CONTEXTS at key napi_id. This is a warm path (device probe time only).
  • Lookup (hot path): napi_deliver_batch() passes napi_id as part of the KABI batch delivery message. On the Tier 1 side, umka-net performs NET_RX_CONTEXTS.load(napi_id) — an O(1) XArray lookup — to obtain the NetRxContext. The lookup is lock-free (RCU read-side) and costs ~5-10 cycles.
  • Unregistration: When a NIC driver calls napi_unregister(), umka-net removes the NetRxContext from the XArray and drops it after an RCU grace period (in-flight batches referencing the old context complete before the memory is freed).
  • Missing napi_id: If NET_RX_CONTEXTS.load(napi_id) returns None (driver unregistered between batch delivery and Tier 1 dispatch — possible during hot-unplug), the entire batch is silently dropped and a WARN_ONCE diagnostic is logged. This is safe because the NIC device is being torn down.
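
The registration/lookup/unregistration lifecycle can be sketched with a `HashMap` standing in for the RCU-protected `XArray`; the real hot-path lookup is lock-free, and the real unregister defers the drop past an RCU grace period.

```rust
use std::collections::HashMap;

/// Per-NAPI receive context (GRO hash, wakeup accumulator, ... elided).
#[allow(dead_code)]
struct NetRxContext { napi_id: u32 }

/// HashMap stands in for the RCU-protected XArray keyed by napi_id.
struct NetRxRegistry { contexts: HashMap<u32, NetRxContext> }

impl NetRxRegistry {
    /// Warm path: a forwarded napi_register() creates the context.
    fn register(&mut self, napi_id: u32) {
        self.contexts.insert(napi_id, NetRxContext { napi_id });
    }
    /// Hot path: per-batch lookup. None means the driver unregistered
    /// between delivery and dispatch; the caller drops the whole batch.
    fn lookup(&self, napi_id: u32) -> Option<&NetRxContext> {
        self.contexts.get(&napi_id)
    }
    /// napi_unregister(): remove the context. The real version defers
    /// the actual drop past an RCU grace period.
    fn unregister(&mut self, napi_id: u32) {
        self.contexts.remove(&napi_id);
    }
}

fn main() {
    let mut reg = NetRxRegistry { contexts: HashMap::new() };
    reg.register(3);
    assert!(reg.lookup(3).is_some());
    reg.unregister(3);
    assert!(reg.lookup(3).is_none()); // a batch for napi_id 3 would be dropped
}
```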

Why GRO state lives in umka-net, not in NapiContext:

  • GRO is protocol-aware: it parses TCP sequence numbers, UDP segment boundaries, and IP headers to determine coalesceability. Protocol parsing logic belongs in the network stack (Tier 1), not in Tier 0.
  • NapiContext is Tier 0 Evolvable and runs in the NIC driver's scheduling context. Placing protocol-specific state in Tier 0 would leak umka-net internals across the isolation boundary.
  • The batch delivery model (one domain switch per poll cycle, not per packet) means there is no performance penalty for separating GRO from NAPI.

16.2.3 TCP Receive Path (L4 → Socket Buffer → Userspace)

The RX delivery path above (steps 1-7) delivers TCP segments to tcp_v4_rcv() / tcp_v6_rcv(). The following sequence describes the complete path from TCP segment receipt through socket buffer delivery to the recvmsg() syscall return.

 8. tcp_v4_rcv(packet):
    a. Extract 4-tuple: (src_ip, src_port, dst_ip, dst_port) from IP + TCP headers.
    b. Lookup TcpCb in the established connection hash table:
       tcb = netns.tcp_ehash.lookup(src_ip, src_port, dst_ip, dst_port)
       — `NetNamespace.tcp_ehash: RcuHashMap<FourTuple, Arc<TcpCb>>`
         Per-namespace established connection hash table. RCU-protected
         for lock-free lookup on the per-packet path. Keyed by 4-tuple
         with Jenkins hash (same as Linux `inet_ehashfn`). Per-namespace
         (not global) to maintain network namespace isolation — Linux uses
         `net->ipv4.tcp_ehash` for the same reason.
       — If no match in established table:
         Check SYN_RECV / TIME_WAIT tables.
         If still no match: send RST (RFC 793 §3.4), drop packet.
    c. Acquire per-socket spinlock (TcpCb.lock) for state mutation.

 9. tcp_rcv_established(tcb, packet):
    (Fast path for ESTABLISHED state — most common case)
    a. Validate sequence number:
       — Segment must overlap with the receive window [rcv_nxt, rcv_nxt + rcv_wnd).
       — Out-of-window segments: send ACK (to resynchronize), drop segment.
    b. Process ACK field:
       — If ACK advances snd_una: update snd_una, free acknowledged segments
         from retransmit queue, cancel/restart retransmit timer,
         call cong_ops.cong_avoid() or cong_ops.cong_control().
       — Window update: if wnd > snd_wnd, update snd_wnd (enables sender
         to transmit more).
       — Duplicate ACK handling: fast retransmit / fast recovery
         (see [Section 16.8](#tcp-control-block--tcp-state-machine)).
    c. Process TCP timestamps (RFC 7323): update RTT estimate via
       Jacobson/Karels algorithm, PAWS check.
    d. Deliver payload to socket receive buffer:
       — If segment.seq == rcv_nxt (in-order delivery, common case):
           Append payload to recv_queue.
           Advance rcv_nxt by payload length.
           Deliver any previously queued out-of-order segments that are
           now contiguous (walk reorder_head linked list).
       — If segment.seq > rcv_nxt (out-of-order):
           Insert into reorder queue (sorted by sequence number).
           Generate SACK block for the received range.
           Send duplicate ACK immediately (triggers fast retransmit at sender).
    e. Advertise receive window (RFC 7323 §2.2):
       rcv_wnd = (common.rcvbuf as u32).saturating_sub(recv_queue.bytes)
       The full-precision `rcv_wnd` is stored in `TcpMutableState.rcv_wnd: u32`
       (bytes, up to SO_RCVBUF which is capped at `net.core.rmem_max`).
       The TCP header's 16-bit Window field carries `rcv_wnd >> rcv_wscale`
       (right-shifted by the negotiated window scale factor, 0-14).
       The peer reconstructs the full window as `hdr.window << snd_wscale`.
       At `rcv_wscale = 14`, the maximum advertisable window is 1 GiB.
       Included in the next outgoing ACK.
    f. Schedule delayed ACK (if not piggybacking on outgoing data):
       — Arm delack_timer (40ms or RTT/2, whichever is smaller).
       — If 2 full-size segments received since last ACK: send ACK immediately
         (RFC 1122 §4.2.3.2 "ACK every other segment").

10. sk_data_ready(tcb):
    a. Check if any task is blocked waiting on this socket's wait queue
       (i.e., a thread called recvmsg() and blocked because the receive
       buffer was empty).
    b. If a waiter exists: wake it (mark task RUNNABLE, enqueue on scheduler).
    c. If the socket is registered with epoll/poll: set EPOLLIN/POLLIN
       readiness bit on the epoll interest list entry, wake the epoll
       waiter if edge-triggered or level-triggered threshold met.
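
The window-scaling arithmetic in step 9e can be checked with a small sketch; the function names here are illustrative, not part of the specified API.

```rust
/// Value placed in the 16-bit TCP header Window field (step 9e):
/// the full-precision window right-shifted by the negotiated scale.
fn advertise(rcv_wnd: u32, rcv_wscale: u8) -> u16 {
    (rcv_wnd >> rcv_wscale).min(u16::MAX as u32) as u16
}

/// Peer-side reconstruction of the full window.
fn reconstruct(hdr_window: u16, snd_wscale: u8) -> u32 {
    (hdr_window as u32) << snd_wscale
}

fn main() {
    // At the maximum scale factor of 14, a full 16-bit field advertises
    // 65535 << 14 bytes, just under 1 GiB.
    assert_eq!(reconstruct(u16::MAX, 14), 1_073_725_440);

    // The round trip loses only the low rcv_wscale bits of precision.
    let rcv_wnd = 1_500_000u32;
    let seen_by_peer = reconstruct(advertise(rcv_wnd, 7), 7);
    assert!(seen_by_peer <= rcv_wnd);
    assert!(rcv_wnd - seen_by_peer < 1 << 7);
}
```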

Cross-tier sk_data_ready() notification: umka-net (Tier 1) calls sk_data_ready() via the KABI KernelServicesVTable wake_socket entry. This dispatches to umka-core (Tier 0), which marks the blocked task RUNNABLE and enqueues it on the scheduler. The KABI dispatch cost is amortized by batching socket wakeups at the end of each NAPI poll cycle: napi_complete_done() flushes all accumulated wake_socket requests in a single KABI completion ring post, reducing the number of Tier 1 → Tier 0 domain crossings from one-per-socket to one-per-poll-cycle.

ReadinessRing vs WakeupAccumulator posting timing: Both notification mechanisms are batched at napi_complete_done() time:

  • SocketWakeEvent::DataReady events (for threads blocked in recv/send) are posted to the WakeupAccumulator during packet processing.
  • SocketReadinessEvent { sock_handle, EPOLLIN } events (for epoll waiters) are also posted to the WakeupAccumulator, NOT directly to the ReadinessRing. The WakeupAccumulator is flushed at napi_complete_done() time, which writes both wakeup and readiness events in a single batch.
  • The doorbell coalescer ensures that multiple sockets becoming ready in the same poll cycle produce a single doorbell notification.
  • Data IS committed to recv_queue (step 9d) before sk_data_ready() fires (step 10), so there is no race where an early doorbell arrives before data is available.

sk_data_ready() failure path: If the cross-domain wake_socket KABI call fails (ring full, EAGAIN), the packet remains in the socket receive queue. The socket is marked POLLIN — the next epoll_wait() or recv() will process it. No packet loss occurs; delivery is deferred until the ring has capacity.

Cross-tier sk_write_space_ready() notification (TX wakeup): When ACK processing inside umka-net frees send buffer space (snd_una advances and sk_wmem_queued drops below sndbuf), umka-net signals Tier 0 to wake any task blocked on sendmsg(). The callback is:

sk_write_space_ready(sock_handle: u64)

Dispatched via the same KernelServicesVTable.wake_socket KABI entry used by sk_data_ready(), with a WakeType::WritableSpace discriminator in the SocketReadinessEvent. Called from tcp_rcv_established() after ACK processing.

Batched in the per-poll-cycle WakeupAccumulator (same as RX wakeups): napi_complete_done() flushes all accumulated write-space wakeups alongside data-ready wakeups in a single KABI completion ring post. Tier 0 checks the wake type and wakes either the read_wait or write_wait queue accordingly.

WakeupAccumulator: An ArrayVec<SocketWakeEvent, 64> in NetRxContext (per-NAPI-instance). sk_data_ready() and sk_write_space_ready() each append a SocketWakeEvent { sock_handle, wake_type } entry. napi_complete_done() flushes the accumulator via a single batched KABI call, reducing domain crossings from one-per-socket to one-per-poll-cycle.

/// Accumulated socket wakeup events batched per NAPI poll cycle.
struct SocketWakeEvent {
    /// Socket handle identifying the target socket in Tier 0.
    pub sock_handle: u64,
    /// Discriminator: RX data ready vs TX buffer space available.
    pub wake_type: WakeType,
}

#[repr(u8)]
enum WakeType {
    /// Data arrived — wake recv() waiters and set EPOLLIN.
    DataReady = 0,
    /// Send buffer space freed — wake send() waiters and set EPOLLOUT.
    WritableSpace = 1,
}
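
A minimal model of the accumulator's append/flush cycle is sketched below. `Vec` stands in for the fixed-capacity `ArrayVec`, and the duplicate suppression shown is an illustrative assumption (the specified coalescing happens at the doorbell level).

```rust
const WAKE_BATCH_CAP: usize = 64;

#[derive(Clone, Copy, PartialEq)]
enum WakeType { DataReady, WritableSpace }

#[derive(Clone, Copy)]
struct SocketWakeEvent { sock_handle: u64, wake_type: WakeType }

/// Vec stands in for the fixed-capacity ArrayVec in NetRxContext.
struct WakeupAccumulator { events: Vec<SocketWakeEvent> }

impl WakeupAccumulator {
    fn new() -> Self { Self { events: Vec::with_capacity(WAKE_BATCH_CAP) } }

    /// Append a wake event. Suppressing exact duplicates (an assumption
    /// made here) keeps a socket that receives many segments in one poll
    /// cycle down to a single event. Returns false when the batch is full.
    fn push(&mut self, ev: SocketWakeEvent) -> bool {
        if self.events.iter().any(|e| {
            e.sock_handle == ev.sock_handle && e.wake_type == ev.wake_type
        }) {
            return true; // already queued this poll cycle
        }
        if self.events.len() == WAKE_BATCH_CAP {
            return false; // caller must flush early
        }
        self.events.push(ev);
        true
    }

    /// napi_complete_done() drains the batch in one KABI post.
    fn flush(&mut self) -> Vec<SocketWakeEvent> {
        std::mem::take(&mut self.events)
    }
}

fn main() {
    let mut acc = WakeupAccumulator::new();
    acc.push(SocketWakeEvent { sock_handle: 7, wake_type: WakeType::DataReady });
    acc.push(SocketWakeEvent { sock_handle: 7, wake_type: WakeType::DataReady });
    acc.push(SocketWakeEvent { sock_handle: 7, wake_type: WakeType::WritableSpace });
    assert_eq!(acc.flush().len(), 2); // duplicate DataReady was suppressed
    assert_eq!(acc.flush().len(), 0); // empty after the batched post
}
```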

recv_queue bounds: The receive queue is bounded by SO_RCVBUF (per-socket, default net.core.rmem_default = 212992 bytes, max net.core.rmem_max = 212992, overridable via setsockopt(SO_RCVBUF)). When the queue is full (total queued bytes >= rcvbuf), the TCP stack advertises a zero receive window (rcv_wnd = 0), causing the sender to enter persist mode (zero-window probing). No incoming data is dropped at the TCP level — the sender stops transmitting until the window opens.

16.2.3.1 Socket Operation Dispatch

The complete socket operation dispatch protocol -- ring wire format, per-CPU ring extension, epoll cross-domain integration, zero-copy paths, crash recovery, ML hooks, and performance analysis -- is specified in Section 16.4.

That section defines SocketRingCmd (128-byte fixed-size request entries with SocketOpcode discriminant), SocketRingResp (64-byte response entries), SocketRingSet (per-namespace N-ring topology following the VFS per-CPU ring pattern), and ReadinessRing (lightweight epoll notification channel).

Key design properties:

  • Batch amortization: io_uring coalesces N SQEs into one ring transaction. At N=16, per-op domain switch cost = 80/16 = 5 cycles (vs Linux's ~15-20 cycles of indirect call overhead). UmkaOS wins at batch_size >= 3 (x86) or >= 11 (AArch64).
  • sendmmsg: N messages in one ring entry. QUIC-typical N=44 costs one domain switch for all 44 datagrams.
  • epoll EPOLLET: readiness ring delivers edge events with zero re-polling domain crossings.
  • Zero-copy: MSG_ZEROCOPY, sendfile, and splice carry page references through the ring (128-byte command only, no payload copy).
  • Tier 1 non-blocking invariant preserved: umka-net never blocks; WOULD_BLOCK responses return immediately to Tier 0.

Tier 0 posts a SocketRingCmd to umka-net's command ring (per-CPU ring selected via select_socket_ring()). umka-net processes the command and posts a SocketRingResp to the response ring. For blocking operations that return SOCK_RESP_WOULD_BLOCK, Tier 0 blocks the calling task on the socket's wait queue; when umka-net signals data readiness (sk_data_ready() via the readiness ring), Tier 0 wakes the task and re-issues the command.

recvmsg() path (userspace reads from socket):

11. recvmsg(fd, &msg, flags):
    a. Syscall dispatch (umka-sysapi, Tier 0):
       — Look up fd → socket reference.
       — Acquire per-socket read lock (RwLock shared, see socket concurrency model).
    b. Dispatch to SocketOps::recvmsg() via KABI ring:
       — Tier 0 posts RecvRequest { fd, buf_handle, max_len, flags } to
         umka-net's KABI command ring.
       — Domain switch: Tier 0 → Tier 1 (umka-net).
    c. umka-net (Tier 1) processes the request (NON-BLOCKING):
       — Dequeue data from TcpCb.recv_queue.
       — Copy payload into the KABI shared buffer (a pre-allocated shared-memory
         region accessible to both Tier 0 and Tier 1 via the shared PKEY).
       — If queue was empty:
           Post RecvResponse { bytes_read: 0, status: SOCK_RESP_WOULD_BLOCK }
           to KABI completion ring.
           Domain switch: Tier 1 → Tier 0. KABI ring slot is released.
           (Tier 1 code NEVER blocks — see "Tier 1 Non-Blocking Invariant" below.)
       — If data was available:
           Update rcv_wnd: opening the window may resume a sender that was
           in zero-window persist mode.
           Post RecvResponse { bytes_read, src_addr, flags } to KABI
           completion ring.
           Domain switch: Tier 1 → Tier 0.
    d. Tier 0 reads RecvResponse from KABI completion ring:
       — If SOCK_RESP_WOULD_BLOCK and flags does not include MSG_DONTWAIT:
           Tier 0 blocks the calling task on the socket's wait queue.
           Woken by sk_data_ready() when new data arrives (step 10).
           On wake: go to step (b) — re-enter Tier 1 to dequeue data.
       — If SOCK_RESP_WOULD_BLOCK and flags includes MSG_DONTWAIT:
           Return EAGAIN to the syscall caller.
       — Otherwise (data received):
           copy_to_user(): copy data from KABI shared buffer to the
           userspace buffer (msg.msg_iov).
           Return bytes_read to the syscall caller.

This path is identical in structure to the VFS read() path for Tier 1 filesystems: the Tier 1 component dequeues data into a KABI shared buffer, and Tier 0 performs the final copy_to_user() because only Tier 0 (PKEY 0) has write access to userspace memory. See Section 11.8 for the general KABI ring protocol.

16.2.3.2 Tier 1 Non-Blocking Invariant

Critical design rule: Tier 1 code (umka-net) never blocks on a wait queue. All blocking/sleeping for socket operations is performed by Tier 0 (umka-core). This invariant exists for two reasons:

  1. KABI ring slot lifetime: A blocking recvmsg() that sleeps inside Tier 1 holds a KABI ring slot for an unbounded duration (potentially hours for a blocking TCP receive with no incoming data). The KABI command ring has a fixed number of slots (typically 256). If enough threads block inside Tier 1, the ring fills up and all network operations stall — including unrelated sockets and even control-plane operations like bind() and listen().

  2. Tier 1 fault isolation: If umka-net crashes or hangs while tasks are sleeping inside it, those tasks cannot be woken or cancelled by Tier 0. The Tier 1 recovery path (reload umka-net, ~50-150 ms) would need to reconstruct the wait queues, which is fragile and error-prone. With sleeping in Tier 0, a Tier 1 crash simply wakes all waiters with an error (the wait queue and task state are fully owned by Tier 0).

The protocol: Tier 1 socket operations are always non-blocking. When Tier 1 has no data to return, it posts a SOCK_RESP_WOULD_BLOCK response and releases the ring slot. Tier 0 then decides whether to sleep (blocking syscall) or return EAGAIN (non-blocking syscall). On wake (data arrived via sk_data_ready()), Tier 0 re-enters Tier 1 to complete the receive. This adds one extra domain crossing for the blocking-with-empty-queue case, but that is the slow path — the fast path (data already queued) completes in a single Tier 0 → Tier 1 → Tier 0 round trip.

Applies to all Tier 1 socket operations: recvmsg(), accept(), connect() (waiting for SYN+ACK). In each case, Tier 1 returns immediately with a status indicating whether the operation completed or would block, and Tier 0 handles the sleep/wake lifecycle.
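
The Tier 0 side of this sleep/retry protocol can be sketched as a loop. `enter_tier1` and `sleep_until_ready` are placeholder closures for the KABI ring round trip and the wait-queue sleep; the real code operates on ring entries, not closures.

```rust
/// Outcome of one non-blocking Tier 1 dequeue attempt.
enum SockResp { Data(usize), WouldBlock }

/// Tier 0 side of the protocol. Tier 1 never sleeps, so the entire
/// sleep/retry loop lives here.
fn blocking_recv(
    mut enter_tier1: impl FnMut() -> SockResp,
    mut sleep_until_ready: impl FnMut(),
    nonblocking: bool,
) -> Result<usize, &'static str> {
    loop {
        match enter_tier1() {
            SockResp::Data(n) => return Ok(n),
            SockResp::WouldBlock if nonblocking => return Err("EAGAIN"),
            // Blocking socket: sleep in Tier 0; sk_data_ready() wakes us,
            // then we re-enter Tier 1 to dequeue the newly arrived data.
            SockResp::WouldBlock => sleep_until_ready(),
        }
    }
}

fn main() {
    // Queue empty on the first attempt; data arrives while "sleeping".
    let mut attempts = 0;
    let got = blocking_recv(
        || {
            attempts += 1;
            if attempts < 2 { SockResp::WouldBlock } else { SockResp::Data(1460) }
        },
        || {},
        false,
    );
    assert_eq!(got, Ok(1460));
    assert_eq!(attempts, 2); // one extra domain crossing: the slow path

    // Non-blocking socket: EAGAIN immediately, no sleep.
    assert_eq!(blocking_recv(|| SockResp::WouldBlock, || {}, true), Err("EAGAIN"));
}
```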

16.2.4 Tier 1 recvmsg() Cross-Domain Data Path

umka-net (Tier 1, isolated via MPK/POE with its own PKEY) cannot directly call copy_to_user() because userspace page tables are mapped with PKEY 0 (Tier 0 kernel-only). The data path for recvmsg() — and symmetrically for sendmsg(), VFS read()/write(), and all other Tier 1 → userspace transfers — follows a three-stage pipeline:

Stage 1: umka-net (Tier 1) dequeues from internal buffer (NON-BLOCKING)
  ┌──────────────────────────────────────────────────────────────┐
  │ Check recv_queue:                                            │
  │   If data available:                                         │
  │     recv_queue → memcpy into KABI shared buffer              │
  │     (KABI shared buffer: PKEY_SHARED, readable by Tier 0+1)  │
  │     Post to KABI completion ring:                            │
  │       { fd, buf_offset, len, msg_flags, src_addr, addr_len } │
  │   If queue empty:                                            │
  │     Post to KABI completion ring:                            │
  │       { fd, bytes_read: 0, status: SOCK_RESP_WOULD_BLOCK }   │
  │   Tier 1 ALWAYS returns immediately — never sleeps.          │
  └──────────────────────────────────────────────────────────────┘
                    domain switch
Stage 2: umka-core (Tier 0) reads completion ring
  ┌──────────────────────────────────────────────────────────────┐
  │ Read RecvResponse from KABI completion ring                  │
  │ If SOCK_RESP_WOULD_BLOCK:                                    │
  │   If MSG_DONTWAIT: return EAGAIN to userspace                │
  │   Else: sleep on socket wait queue (Tier 0 manages sleep)    │
  │     → woken by sk_data_ready() → re-enter Tier 1 (Stage 1)   │
  │ Else (data received):                                        │
  │   Validate: buf_offset + len within shared buffer bounds     │
  │   copy_to_user(user_iov, &shared_buf[buf_offset..][..len])   │
  │   (PKEY 0 has write access to userspace mappings)            │
  └──────────────────────────────────────────────────────────────┘
                    syscall return
Stage 3: userspace receives data in msg.msg_iov
  ┌──────────────────────────────────────────────────────────────┐
  │ recvmsg() returns bytes_read                                 │
  │ Data is in userspace buffer — application processes it       │
  └──────────────────────────────────────────────────────────────┘

KABI shared buffer management: The shared buffer is a pre-allocated region (default 2 MB per umka-net instance, configurable) mapped with the shared PKEY that both Tier 0 and Tier 1 can read. Tier 1 writes data into the buffer; Tier 0 reads it and copies to userspace. The buffer is managed as a ring of fixed-size slots (4 KB each, matching page size) with producer-consumer indices. Slot allocation is lock-free (AtomicU64 producer index, CAS-based).
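
The lock-free slot claim can be sketched as a CAS loop over the producer index. This is a simplification under stated assumptions: the real ring also tracks per-slot completion on the consumer side, not just a monotonic index.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

const SLOT_SIZE: u64 = 4096; // one page per slot
const NUM_SLOTS: u64 = 512;  // 2 MB / 4 KB

/// Producer/consumer indices over the shared-buffer slot ring.
/// Tier 0 advances `consumer` as its copy_to_user() calls complete.
struct SharedBufRing {
    producer: AtomicU64,
    consumer: AtomicU64,
}

impl SharedBufRing {
    /// Claim the next free slot and return its byte offset into the
    /// shared buffer, or None when all 512 slots are in flight
    /// (the RECV_SHARED_BUF_FULL backpressure case).
    fn claim_slot(&self) -> Option<u64> {
        loop {
            let p = self.producer.load(Ordering::Acquire);
            let c = self.consumer.load(Ordering::Acquire);
            if p - c >= NUM_SLOTS {
                return None; // ring full: Tier 1 reports backpressure
            }
            if self
                .producer
                .compare_exchange(p, p + 1, Ordering::AcqRel, Ordering::Acquire)
                .is_ok()
            {
                return Some((p % NUM_SLOTS) * SLOT_SIZE);
            }
            // CAS lost to a concurrent producer: retry with fresh indices.
        }
    }
}

fn main() {
    let ring = SharedBufRing { producer: AtomicU64::new(0), consumer: AtomicU64::new(0) };
    assert_eq!(ring.claim_slot(), Some(0));
    assert_eq!(ring.claim_slot(), Some(4096));
    for _ in 2..NUM_SLOTS {
        ring.claim_slot();
    }
    assert_eq!(ring.claim_slot(), None); // backpressure after 512 in-flight slots
}
```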

Shared buffer full (backpressure): If all shared buffer slots are in use when Tier 1 attempts to dequeue from recv_queue, Tier 1 posts a RECV_SHARED_BUF_FULL response to the KABI completion ring instead of copying data. Tier 0 handles this by:

  1. Completing any pending copy_to_user() for previously dequeued slots (freeing them).
  2. Re-entering Tier 1 with a retry. This is bounded: at most 2 MB / 4 KB = 512 slots are in flight, so the retry succeeds after Tier 0 completes its outstanding copies.
  3. If the retry also fails (programming error or extreme contention), Tier 0 returns EAGAIN to userspace for non-blocking sockets or re-sleeps for blocking sockets. The data remains in recv_queue and is not lost — TCP flow control (zero window) prevents the sender from overrunning the receiver regardless of shared buffer pressure.

sendmsg() (reverse path): The reverse direction uses the same KABI shared buffer mechanism:

  1. Tier 0 receives the sendmsg() syscall and performs copy_from_user() into the KABI shared buffer.
  2. Tier 0 posts SendRequest { buf_offset, len, dst_addr, flags } to the KABI command ring.
  3. umka-net (Tier 1) reads from the KABI shared buffer, enqueues into the TCP send buffer, and initiates transmission.

Why Tier 1 cannot copy_to_user() directly: The hardware memory domain isolation (MPK on x86, POE on ARM) restricts which PKEYs can access userspace pages. Userspace pages are tagged with PKEY 0 (the default key). Only Tier 0 code running with PKEY 0 active can write to these pages. Tier 1 code runs with its own PKEY (e.g., PKEY 3 for umka-net) and would fault if it attempted to write to a PKEY 0 page. This is a deliberate security boundary: even if umka-net is compromised, it cannot write arbitrary data to userspace memory — all data must flow through the validated KABI ring, where Tier 0 performs bounds checking before copy_to_user().

16.2.5 IP Layer Implementation (IPv4 / IPv6)

The network layer handles packet validation, reassembly, routing, fragmentation, ICMP, and multicast group membership. Steps 6a–6h in the RX delivery path above invoke these functions. This section specifies the implementation-level data structures and algorithms.

16.2.5.1 IP Reassembly

IP datagrams may arrive fragmented. The reassembly engine collects fragments and reconstructs the original datagram before delivering to the transport layer.

/// Per-fragment-chain reassembly state. One entry per (src_ip, dst_ip, protocol, id)
/// tuple. Stored in `IpReassemblyTable`.
pub struct IpReassemblyQueue {
    /// Fragment chain: sorted by fragment offset. Each entry is a `NetBuf`
    /// containing one fragment. Gaps between fragments indicate missing data.
    pub fragments: BoundedVec<NetBuf, 64>,
    /// Total payload bytes received so far (sum of fragment payloads).
    pub bytes_received: u32,
    /// Total expected payload bytes (`None` until the last fragment is received,
    /// identified by MF=0 in the IP header).
    pub total_len: Option<u32>,
    /// Reassembly timeout timer. If all fragments do not arrive within
    /// `net.ipv4.ipfrag_time` (default 30s for IPv4) or
    /// `net.ipv6.ip6frag_time` (default 60s for IPv6), the chain is discarded
    /// and an ICMP Time Exceeded (Fragment Reassembly) is sent to the source.
    pub timer: HrTimerHandle,
    /// Timestamp of first fragment arrival (for timeout calculation).
    pub first_seen_ns: u64,
}

/// Global reassembly hash table. One per network namespace.
///
/// Keyed by (src_ip, dst_ip, protocol, fragment_id) — a 4-tuple hashed via
/// SipHash for DoS resistance. XArray with the hash as integer key.
/// Bounded by `net.ipv4.ipfrag_max_dist` (default 64 concurrent chains)
/// to limit memory consumption under fragment flood attacks.
pub struct IpReassemblyTable {
    pub chains: XArray<IpReassemblyQueue>,
    /// Current memory consumption (bytes). When this exceeds
    /// `net.ipv4.ipfrag_high_thresh` (default 4 MB), the oldest chains
    /// are evicted (LRU) until memory drops below `ipfrag_low_thresh`.
    pub mem_used: AtomicU64,
}
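
A chain is complete when the last fragment (MF=0) has fixed `total_len` and the offset-sorted fragment list covers [0, total_len) without gaps. A sketch of that check over (offset, len) pairs:

```rust
/// One received fragment: (offset, len) within the original datagram.
type Frag = (u32, u32);

/// Returns true when the offset-sorted fragments cover [0, total_len)
/// gap-free and the total length is known (last fragment seen).
fn chain_complete(fragments: &[Frag], total_len: Option<u32>) -> bool {
    let Some(total) = total_len else {
        return false; // last fragment (MF=0) not yet seen
    };
    let mut next = 0u32;
    for &(off, len) in fragments {
        if off > next {
            return false; // gap before this fragment
        }
        next = next.max(off + len); // tolerate overlapping fragments
    }
    next >= total
}

fn main() {
    // Middle fragment missing: keep waiting (or time out after ipfrag_time).
    assert!(!chain_complete(&[(0, 1480), (2960, 1040)], Some(4000)));
    // All fragments present: reassemble into a single NetBuf.
    assert!(chain_complete(&[(0, 1480), (1480, 1480), (2960, 1040)], Some(4000)));
    // Last fragment (MF=0) not seen: total length still unknown.
    assert!(!chain_complete(&[(0, 1480)], None));
}
```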

16.2.5.2 ip_rcv() / ip6_rcv() Processing

ip_rcv(netbuf: NetBuf):
    1. Validate IP header:
       - Version == 4, IHL >= 5, total_length <= netbuf.len.
       - Header checksum (IPv4 only; IPv6 has no header checksum).
       - TTL > 0 (drop and send ICMP Time Exceeded if TTL == 0).
    2. NF_INET_PRE_ROUTING hook ([Section 16.18](#packet-filtering-bpf-based)):
       - Invoke conntrack, BPF, and nftables PRE_ROUTING chains.
       - If NF_DROP: drop packet, return.
    3. Check for fragments (MF flag set or fragment_offset > 0):
       - If fragmented: insert into IpReassemblyTable.
         If chain is now complete (all fragments received, no gaps):
           reassemble into a single NetBuf, continue.
         Else: return (wait for remaining fragments).
    4. Route lookup:
       route = RouteTable::lookup(dst, src, mark, ifindex, ...)
       netbuf.route_ext = Some(slab_alloc(route))
    5. Dispatch by route type:
       - RTN_LOCAL: NF_INET_LOCAL_IN hook, then L4 dispatch (tcp_rcv/udp_rcv/icmp_rcv).
       - RTN_UNICAST: NF_INET_FORWARD hook, decrement TTL, ip_output() on egress.
       - RTN_BROADCAST: clone to raw sockets + local delivery.
       - RTN_BLACKHOLE: silent drop.
       - RTN_UNREACHABLE: drop + icmp_send(DEST_UNREACH, HOST_UNREACH).
       - RTN_PROHIBIT: drop + icmp_send(DEST_UNREACH, ADMIN_PROHIBIT).

ip6_rcv() follows the same structure with IPv6-specific differences: no header checksum, hop-by-hop extension header processing before routing, and ICMPv6 for error reporting.

TCP/IP RX error handling for unspecified paths:

  • Checksum validation failure in ip_rcv() (step 1, header checksum for IPv4): The packet is dropped silently. NetDevStats.rx_crc_errors is incremented. No notification is sent to the sender -- this is standard IP behavior (RFC 1122): corrupted packets are discarded at the receiver without feedback. The sender detects loss via higher-layer timeouts (TCP retransmit, application-level retry for UDP).

  • recv_queue allocation failure (socket's sk_rcvbuf limit reached): The incoming packet is dropped. NetDevStats.rx_dropped is incremented. For TCP: the sender detects the loss via retransmission timeout or SACK-based loss detection and retransmits the segment. For UDP: the packet is lost permanently (expected for UDP -- RFC 768 provides no reliability guarantee). The socket's sk_drops counter is also incremented, visible via getsockopt(SO_RXQ_OVFL) or /proc/net/udp.

  • KABI shared buffer exhaustion (DomainRingBuffer full): When the KABI ring between umka-net (Tier 1) and umka-core (Tier 0) is full, the producer (NIC driver or umka-net) receives EAGAIN from ring_push(). The packet is dropped at the ring boundary. The rx_ring_full_drops counter is incremented (per-ring, visible via FMA telemetry). NAPI poll returns with budget remaining, triggering backpressure: napi_complete_done() re-enables interrupts, and the poll loop slows down until the consumer drains the ring. This backpressure mechanism prevents unbounded memory growth under sustained overload.

16.2.5.3 ip_queue_xmit() — TCP-specific IP output entry point

/// TCP-specific IP output path. Stamps the per-connection cached route onto
/// the NetBuf and delegates to `ip_output()`. This avoids a full FIB trie
/// walk on every TCP segment — the route was cached at `connect()`/`accept()`
/// time in `SockCommon.cached_route`.
///
/// Called from `tcp_write_xmit()` for every TCP segment (including
/// retransmissions). The equivalent of Linux's `ip_queue_xmit()` in
/// `net/ipv4/ip_output.c`.
///
/// # Steps
///
/// 1. **Route validation**: `sk_dst_check(&tcb.common)` — validates the cached
///    route is still current (generation counter comparison). If stale, performs
///    a fresh FIB lookup and updates the cache.
/// 2. **Stamp route**: `netbuf.route_ext = Some(route.clone())`.
/// 3. **Build transport header offset**: set `netbuf.transport_header` offset
///    for the TCP header (IP header will be prepended by `ip_output()`).
/// 4. **Delegate**: `ip_output(netbuf)`.
///
/// Performance: Steps 1-3 are O(1) on the fast path (cache hit). The FIB
/// lookup (O(W) where W=32 for IPv4) only runs on cache miss (route table
/// change, PMTU update, interface state change — all cold events).

/// Validate the cached route for a connected socket. If the cached route
/// is still current (generation counter matches the global route generation),
/// returns it directly (O(1), ~5 cycles). If stale, performs a fresh FIB
/// lookup, updates `SockCommon.cached_route`, and returns the new route.
///
/// # Arguments
/// - `common`: The socket's `SockCommon` (contains `cached_route`).
///
/// # Returns
/// `Ok(RouteLookupResult)` — the validated (possibly refreshed) route.
/// `Err(IoError::EHOSTUNREACH)` — FIB lookup failed (no route to destination).
///
/// # Route generation counter
/// The global `ROUTE_GENERATION: AtomicU64` is incremented on every routing
/// table modification (netlink `RTM_NEWROUTE`/`RTM_DELROUTE`, interface
/// state change, PMTU update). Each `RouteLookupResult` carries the
/// generation at which it was computed. `sk_dst_check()` compares
/// `cached.generation` against `ROUTE_GENERATION.load(Acquire)`:
/// - Match → route is current, return cached copy.
/// - Mismatch → route may be stale, perform fresh FIB lookup.
///
/// # Lock requirement
/// Caller must hold the socket's protocol lock (`TcpCb.lock` for TCP,
/// `UdpCb.lock` for UDP) to protect `cached_route` from concurrent
/// modification by a route-change notification.
///
/// Linux equivalent: `sk_dst_check()` in `net/core/dst.c`.
fn sk_dst_check(common: &SockCommon) -> Result<RouteLookupResult, IoError> {
    if let Some(ref cached) = common.cached_route {
        let current_gen = ROUTE_GENERATION.load(Acquire);
        if cached.generation == current_gen {
            return Ok(cached.clone());
        }
    }
    // Cache miss or stale — perform fresh FIB lookup.
    let route = RouteTable::lookup(
        common.dst_addr, common.src_addr,
        common.mark, common.bound_dev_if,
    )?;
    // Update the socket's cached route. Shown as a plain assignment for
    // brevity; the field lives behind the caller-held protocol lock, which
    // serializes this write. Warm-path cost: ~20 cycles (clone + store).
    common.cached_route = Some(route.clone());
    Ok(route)
}
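
The generation-counter protocol above can be modeled as a self-contained sketch (illustrative names: `dst_check` and `route_tables_changed` are stand-ins, and the cache lives in a plain `Option` rather than `SockCommon`):

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Global generation counter — bumped on every routing table modification.
static ROUTE_GENERATION: AtomicU64 = AtomicU64::new(0);

#[derive(Clone, PartialEq, Debug)]
struct CachedRoute {
    generation: u64, // generation at which this route was computed
    ifindex: u32,    // stand-in for the rest of RouteLookupResult
}

/// Models a netlink RTM_NEWROUTE / RTM_DELROUTE or interface state change.
fn route_tables_changed() {
    ROUTE_GENERATION.fetch_add(1, Ordering::Release);
}

/// Return the cached route if its generation is current; otherwise run
/// `lookup` (the FIB walk) at the current generation and refresh the cache.
fn dst_check(cache: &mut Option<CachedRoute>,
             lookup: impl Fn(u64) -> CachedRoute) -> CachedRoute {
    let gen = ROUTE_GENERATION.load(Ordering::Acquire);
    if let Some(c) = cache {
        if c.generation == gen {
            return c.clone(); // fast path: one atomic load + compare
        }
    }
    let fresh = lookup(gen); // slow path: fresh FIB lookup
    *cache = Some(fresh.clone());
    fresh
}
```

The test below exercises the three interesting transitions: cold miss, warm hit, and invalidation after a table change.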

fn ip_queue_xmit(tcb: &TcpCb, mut netbuf: NetBuf) -> Result<(), IoError> {
    let route = sk_dst_check(&tcb.common)?;
    // Allocate a RouteLookupResult from the route_cache_slab and store
    // a NonNull pointer in netbuf.route_ext. The slab allocation is
    // warm-path (~20 cycles). The route is freed when the NetBuf is
    // returned to the pool (see netbuf_free() lifecycle note in
    // [Section 16.5](#netbuf-packet-buffer)).
    let route_ptr = route_cache_slab.alloc(route);
    netbuf.route_ext = Some(NonNull::new(route_ptr).unwrap());
    ip_output(netbuf)
}

16.2.5.4 ip_output() / ip6_output()

/// IP output path. Stamps IP headers, runs netfilter hooks, fragments
/// if needed, resolves neighbor, and hands off to `dev_queue_xmit()`.
///
/// # Arguments
/// - `netbuf`: The packet to transmit. Must have `route_ext` set (either
///   from `ip_queue_xmit()` cached route or from a fresh FIB lookup).
///   Transport header must already be built (TCP/UDP/ICMP).
///
/// # Returns
/// `Ok(())` if the packet was successfully enqueued for transmission.
/// `Err(IoError::EMSGSIZE)` if the packet exceeds PMTU and DF is set.
/// `Err(IoError::EHOSTUNREACH)` if neighbor resolution fails permanently.
///
/// # Equivalent
/// Linux: `ip_output()` in `net/ipv4/ip_output.c` +
///        `ip6_output()` in `net/ipv6/ip6_output.c`.
ip_output(netbuf: NetBuf) -> Result<(), IoError>:
    1. Resolve route (if not already cached in netbuf.route_ext):
       route = RouteTable::lookup(dst, src, mark, ...)
    2. NF_INET_LOCAL_OUT hook (for locally-generated packets).
    3. Set IP header fields: src_addr, dst_addr, TTL (from route or
       setsockopt IP_TTL), TOS (from socket or route), identification.
    4. Compute header checksum (IPv4 only).
    5. NF_INET_POST_ROUTING hook.
    6. Check MTU: if packet_len > route.mtu:
       a. If DF bit set (IPv4) or always (IPv6):
          - If locally generated: return EMSGSIZE to caller.
          - If forwarded: send ICMP Fragmentation Needed (IPv4) /
            ICMPv6 Packet Too Big, drop.
       b. Else (IPv4 with DF=0): call ip_fragment():
          Split payload into fragments of route.mtu - ip_hdr_len.
          Each fragment gets its own IP header with MF flag (except last).
          The original `netbuf` is consumed; ip_fragment() produces a
          linked list of fragment NetBufs. Continue to step 7 for each
          fragment independently:
          ```
          let fragments = ip_fragment(netbuf, route.mtu);
          for frag in fragments {
              // Each fragment goes through neighbor resolution and
              // dev_queue_xmit independently.
              neighbor_resolve_and_xmit(dev, frag, &route)?;
          }
          return Ok(());
          ```
    7. Neighbor resolution: ARP (IPv4) or NDP (IPv6) lookup for next-hop
       MAC address. If not cached: queue packet, send ARP/NDP solicitation.
       On neighbor miss (no ARP/NDP entry), the outgoing NetBuf is queued on
       `neigh.arp_queue` (bounded, default 3 entries). The ARP/NDP solicitation
       is sent; when the reply arrives, `neigh_resolve_output()` drains the
       queue: for each queued packet, it fills in the L2 header from the
       newly-resolved neighbor entry and calls `dev_queue_xmit()`. The drain
       runs in the softirq context of the ARP/NDP reply processing.
       Queue overflow drops the oldest entry.
    8. dev_queue_xmit(dev, netbuf):
       See [Section 16.21](#traffic-control-and-queue-disciplines--dev_queue_xmit---transmit-entry-point)
       for the full function. Summary: select TX queue, convert
       `NetBuf` → `NetBufHandle` via `pool.handle_for(buf)` (consuming),
       enqueue the handle into the qdisc, then `qdisc_run()` (validate_xmit
       + dispatch_xmit). The `NetBufHandle::Drop` impl ensures the slab
       slot is returned on all error/completion paths.
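
The step-6 MTU decision compresses to a pure function, sketched here with illustrative names (not kernel types):

```rust
/// Outcome of the ip_output() step-6 MTU check. Names are illustrative.
#[derive(Debug, PartialEq)]
enum MtuAction {
    Transmit,   // fits in the path MTU: continue to neighbor resolution
    Emsgsize,   // locally generated + DF (or IPv6): error to the caller
    IcmpTooBig, // forwarded + DF (or IPv6): send Frag Needed / Packet Too Big, drop
    Fragment,   // IPv4 with DF=0: hand off to ip_fragment()
}

fn mtu_check(pkt_len: u32, mtu: u32, df_or_ipv6: bool, locally_generated: bool) -> MtuAction {
    if pkt_len <= mtu {
        MtuAction::Transmit
    } else if df_or_ipv6 {
        if locally_generated { MtuAction::Emsgsize } else { MtuAction::IcmpTooBig }
    } else {
        MtuAction::Fragment
    }
}
```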

16.2.5.5 ICMP Handling

/// ICMP dispatch table. Maps ICMP type to handler function.
/// Used by `icmp_rcv()` for incoming ICMP messages.
pub struct IcmpHandler {
    /// ICMP Echo Request (type 8): generate Echo Reply (type 0).
    /// Rate-limited by `net.ipv4.icmp_ratelimit` (default 1000 ms).
    pub echo_request: fn(NetBuf),
    /// ICMP Destination Unreachable (type 3): propagate error to transport
    /// layer via `tcp_v4_err()` / `udp_err()` using the embedded original
    /// packet header to identify the affected socket.
    pub dest_unreachable: fn(NetBuf),
    /// ICMP Time Exceeded (type 11): same propagation as Dest Unreachable.
    pub time_exceeded: fn(NetBuf),
    /// ICMP Redirect (type 5): update route cache if source is the
    /// current gateway for the destination. Ignored if
    /// `net.ipv4.conf.*.accept_redirects = 0`.
    pub redirect: fn(NetBuf),
    /// ICMP Parameter Problem (type 12): propagate to transport layer.
    pub param_problem: fn(NetBuf),
}

ICMPv6 uses a parallel Icmpv6Handler with additional entries for Neighbor Solicitation/Advertisement (NDP), Router Solicitation/Advertisement, and MLD.
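
A minimal sketch of the `icmp_rcv()` dispatch implied by this table, mapping RFC 792 type numbers to handler slots (unknown types are silently ignored, per RFC 1122's robustness requirement):

```rust
/// Map an ICMP type number to the corresponding IcmpHandler field name.
/// Illustrative only — the real dispatch calls the fn pointers directly.
fn icmp_handler_for(icmp_type: u8) -> Option<&'static str> {
    match icmp_type {
        8  => Some("echo_request"),     // Echo Request → Echo Reply
        3  => Some("dest_unreachable"), // propagate to tcp_v4_err()/udp_err()
        11 => Some("time_exceeded"),    // same propagation path
        5  => Some("redirect"),         // route cache update (if accepted)
        12 => Some("param_problem"),    // propagate to transport layer
        _  => None,                     // unknown type: silently ignore
    }
}
```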

16.2.5.6 Path MTU Discovery

/// Cached PMTU entry for a destination. Stored in the per-namespace
/// PMTU cache (XArray keyed by destination IP address hash).
pub struct PathMtuEntry {
    /// Destination address.
    pub dst: IpAddr,
    /// Current path MTU estimate (bytes). Initialized to the outgoing
    /// interface MTU. Reduced when an ICMP Fragmentation Needed / Packet
    /// Too Big message is received. Increased by periodic probing.
    pub pmtu: AtomicU32,
    /// Timestamp of last PMTU reduction (nanoseconds since boot).
    /// PMTU is periodically increased (probed) after
    /// `net.ipv4.route.mtu_expires` (default 600s) to detect path changes.
    pub last_reduced_ns: AtomicU64,
    /// Lock timer: after a PMTU reduction, further reductions for the same
    /// destination are ignored for `net.ipv4.route.min_pmtu_interval`
    /// (default 10s) to prevent oscillation from reordered ICMP messages.
    pub lock_until_ns: u64,
}

/// Per-namespace PMTU cache.
pub struct PmtuCache {
    /// XArray keyed by hash of destination IP. O(1) lookup on TX path.
    pub entries: XArray<PathMtuEntry>,
    /// Periodic GC timer: evicts entries older than `mtu_expires` that
    /// have not been accessed since the last probe.
    pub gc_timer: HrTimerHandle,
}

PMTU processing on ICMP receipt:

  1. Receive ICMP Fragmentation Needed (type 3, code 4) or ICMPv6 Packet Too Big (type 2).
  2. Extract the next-hop MTU from the ICMP message. Validate: must be >= 68 (IPv4) or >= 1280 (IPv6).
  3. Look up PmtuCache for the destination. If an entry exists and now < lock_until_ns: ignore (anti-oscillation). Otherwise update pmtu and set last_reduced_ns.
  4. Propagate to the transport layer: call tcp_v4_mtu_reduced() / udp_err() so active connections can adjust their MSS / segment size.
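
The reduction logic in steps 2–3 can be sketched as follows (field and constant names simplified from `PathMtuEntry`; the 10 s lock interval is the `min_pmtu_interval` default):

```rust
const MIN_PMTU_V4: u32 = 68;
const MIN_PMTU_V6: u32 = 1280;
const LOCK_INTERVAL_NS: u64 = 10_000_000_000; // min_pmtu_interval default 10s

/// Simplified stand-in for PathMtuEntry (atomics dropped for clarity).
struct Entry {
    pmtu: u32,
    lock_until_ns: u64,
}

/// Apply an ICMP-reported next-hop MTU. Returns true if the entry was
/// reduced — i.e. the transport layer must be notified (tcp_v4_mtu_reduced).
fn pmtu_reduce(e: &mut Entry, reported: u32, now_ns: u64, ipv6: bool) -> bool {
    let floor = if ipv6 { MIN_PMTU_V6 } else { MIN_PMTU_V4 };
    if reported < floor {
        return false; // invalid / too-small report: ignore
    }
    if now_ns < e.lock_until_ns {
        return false; // inside the anti-oscillation window
    }
    if reported >= e.pmtu {
        return false; // not a reduction
    }
    e.pmtu = reported;
    e.lock_until_ns = now_ns + LOCK_INTERVAL_NS;
    true
}
```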

16.2.5.7 Multicast Group Membership

/// Per-interface multicast group membership state (IGMP for IPv4, MLD for IPv6).
pub struct MulticastState {
    /// Set of joined multicast groups on this interface.
    /// XArray keyed by multicast group address hash.
    pub groups: XArray<MulticastGroup>,
    /// IGMP/MLD protocol version (IGMPv2/v3 for IPv4, MLDv1/v2 for IPv6).
    pub version: u8,
    /// General query timer: responds to IGMP/MLD general queries from the
    /// querier (router). Random delay per RFC 3376 §7.2 to avoid report storms.
    pub query_timer: HrTimerHandle,
}

pub struct MulticastGroup {
    /// Multicast group address.
    pub addr: IpAddr,
    /// Source filter mode (IGMPv3/MLDv2): Include or Exclude.
    /// Include: accept only from listed sources. Exclude: accept all except listed.
    pub filter_mode: SourceFilterMode,
    /// Source list for IGMPv3/MLDv2 source-specific filtering.
    pub sources: ArrayVec<IpAddr, 64>,
    /// Number of sockets that have joined this group on this interface.
    /// When it drops to zero, send a Leave/Done message and remove the entry.
    pub ref_count: u32,
}

ip_mc_join_group() / ip_mc_leave_group() are called from setsockopt(IP_ADD_MEMBERSHIP). They update MulticastState, send IGMP/MLD membership reports, and configure the NIC's multicast filter (via NetDevice::set_rx_mode()).
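
The `ref_count` lifecycle can be sketched as a standalone model (a `HashMap` stands in for the per-interface XArray; the return value signals when a membership report or Leave/Done message must be sent):

```rust
use std::collections::HashMap;

/// Per-interface group refcounts, modelling MulticastGroup.ref_count.
type Groups = HashMap<u32, u32>;

/// Returns true on the *first* join for a group on this interface — the
/// moment an IGMP/MLD report is sent and the NIC multicast filter updated.
fn mc_join(groups: &mut Groups, group: u32) -> bool {
    let rc = groups.entry(group).or_insert(0);
    *rc += 1;
    *rc == 1
}

/// Returns true when the *last* socket leaves — send Leave/Done, remove entry.
fn mc_leave(groups: &mut Groups, group: u32) -> bool {
    match groups.get_mut(&group) {
        Some(rc) if *rc > 1 => { *rc -= 1; false }
        Some(_) => { groups.remove(&group); true }
        None => false, // not a member: nothing to do
    }
}
```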

16.2.6 UDP Subsystem

UDP (User Datagram Protocol) provides connectionless, unreliable datagram delivery. It is the second most-used transport protocol after TCP and the encapsulation substrate for VXLAN, WireGuard, QUIC, and DNS.

16.2.6.1 UDP Control Block

/// Per-socket UDP state. Embeds `SockCommon` for namespace, credentials,
/// and socket-level options. One `UdpCb` per `socket(AF_INET/AF_INET6, SOCK_DGRAM, 0)`.
pub struct UdpCb {
    /// Common socket state (namespace, type, family, cred, cgroup).
    pub common: SockCommon,
    /// Local port (host byte order). Assigned by `bind()` or auto-allocated
    /// from the ephemeral range on first `sendmsg()`.
    pub local_port: u16,
    /// Local address. `INADDR_ANY` / `in6addr_any` until explicitly bound.
    pub local_addr: IpAddr,
    /// Connected remote address (set by `connect(2)`, optional for UDP).
    /// When set, `sendmsg()` without a destination uses this address, and
    /// `recvmsg()` filters to only accept datagrams from this peer.
    pub remote_addr: Option<SockAddr>,
    /// Receive queue: datagrams waiting for `recvmsg()`. Each entry is a
    /// `NetBuf` containing one complete datagram (or a GRO-coalesced batch).
    /// Bounded by `common.rcvbuf` (default `net.core.rmem_default`).
    /// When full, incoming datagrams are silently dropped (UDP has no
    /// flow control — this is protocol-correct behavior per RFC 768).
    pub recv_queue: BoundedQueue<NetBuf>,
    /// Cork mode: when enabled via `setsockopt(UDP_CORK)`, `sendmsg()` calls
    /// accumulate payload into `cork_buf` instead of sending immediately.
    /// The corked payload is flushed as a single datagram when cork is
    /// disabled or when `cork_buf` reaches the PMTU.
    pub cork_active: bool,
    /// Cork accumulation buffer. Allocated lazily on first corked `sendmsg()`.
    pub cork_buf: Option<NetBuf>,
    /// Generic Segmentation Offload segment size (bytes). Set via
    /// `setsockopt(UDP_SEGMENT, segment_size)`. When non-zero, `sendmsg()`
    /// accepts payloads larger than the MTU and the stack (or NIC via
    /// NETIF_F_GSO_UDP_L4) segments them into chunks of `gso_size` bytes.
    /// 0 = GSO disabled (default).
    pub gso_size: u16,
    /// Skip UDP checksum on IPv6 TX. Set via `setsockopt(UDP_NO_CHECK6_TX, 1)`.
    /// RFC 6935/6936 permits zero-checksum UDP over IPv6 for specific tunnel
    /// protocols (VXLAN, GENEVE). Default false (checksum computed per RFC 8200).
    pub no_check6_tx: bool,
    /// Accept UDP datagrams with zero checksum on IPv6 RX.
    /// Set via `setsockopt(UDP_NO_CHECK6_RX, 1)`. Default false.
    pub no_check6_rx: bool,
    /// Per-socket multicast group membership list. Tracks groups joined via
    /// `setsockopt(IP_ADD_MEMBERSHIP)` / `setsockopt(IPV6_JOIN_GROUP)`.
    /// When the socket is closed, all memberships are automatically dropped
    /// (decrementing the per-interface `MulticastGroup.ref_count`).
    /// Bounded: max 20 groups per socket (matching Linux `IP_MAX_MEMBERSHIPS`).
    pub mc_list: ArrayVec<MulticastMembership, 20>,
    /// UDP encapsulation callback. Registered by tunnel protocols (VXLAN,
    /// WireGuard, GENEVE) via `udp_encap_rcv`. When set, `udp_rcv()` calls
    /// this callback instead of delivering to `recv_queue`, allowing
    /// the tunnel to strip the outer UDP header and re-inject the inner
    /// packet into the network stack.
    pub encap_rcv: Option<fn(&UdpCb, NetBuf) -> Result<(), NetBuf>>,
    /// GRO (Generic Receive Offload) state. Tracks in-progress GRO
    /// aggregation for this socket's flow. Enabled by default for
    /// UDP sockets; can be disabled via `setsockopt(UDP_GRO, 0)`.
    pub gro_enabled: bool,
}

/// Per-socket multicast group membership entry.
pub struct MulticastMembership {
    /// Multicast group address.
    pub group: IpAddr,
    /// Interface index on which this group was joined.
    pub ifindex: u32,
}

16.2.6.2 UDP Socket Table

/// Entry type for `UdpTable.by_port` slots.
///
/// Optimises for the common case of one socket per port (no heap allocation)
/// while supporting `SO_REUSEPORT` groups (slab-allocated overflow via
/// `SlabVec`). Using an explicit enum avoids introducing `SmallVec` (not a
/// defined UmkaOS type) and makes the single-socket fast path zero-allocation.
pub enum PortEntry {
    /// One socket bound to this port (no SO_REUSEPORT). Zero heap allocation.
    Single(Arc<UdpCb>),
    /// Multiple sockets sharing this port via SO_REUSEPORT.
    /// `SlabVec<Arc<UdpCb>, 1>`: 1 element inline, slab-allocated overflow
    /// for larger reuseport groups (typically 1–16 sockets).
    Multi(SlabVec<Arc<UdpCb>, 1>),
}

/// Global UDP socket lookup table. One per network namespace.
///
/// Keyed by local port (u16 → integer key → XArray per collection policy).
/// Each slot holds a `PortEntry` for sockets bound to that port.
///
/// Lookup on the receive path (`udp_rcv`) is O(1) by port, then O(n) in the
/// number of `SO_REUSEPORT` sockets on that port (typically 1–16). For
/// `SO_REUSEPORT` groups, selection uses a BPF program if attached, otherwise
/// consistent hashing on the 4-tuple.
pub struct UdpTable {
    /// Port-indexed socket array. XArray<u16, PortEntry>.
    /// `PortEntry::Single` for the common one-socket-per-port case.
    /// `PortEntry::Multi` for SO_REUSEPORT groups.
    pub by_port: XArray<PortEntry>,
    /// Ephemeral port allocator. Tracks the next candidate port for
    /// auto-assignment. Range: `net.ipv4.ip_local_port_range` (default
    /// 32768–60999). Uses a per-CPU rotating counter to reduce contention
    /// on high-connection-rate servers.
    /// PerCpu access is under PreemptGuard (single-writer per CPU); Cell<u16>
    /// avoids unnecessary atomic overhead. UDP port allocation is syscall
    /// context only (bind/sendto), never IRQ context.
    pub ephemeral_next: PerCpu<Cell<u16>>,
}
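
The rotating ephemeral allocation can be sketched as follows (a single-counter model — the real allocator is per-CPU; `in_use` stands in for a `by_port` occupancy probe):

```rust
/// Default net.ipv4.ip_local_port_range.
const PORT_LO: u16 = 32768;
const PORT_HI: u16 = 60999;

/// Allocate the next free ephemeral port, rotating through the range.
/// Returns None when every port is occupied (caller maps this to EADDRINUSE).
fn alloc_ephemeral(next: &mut u16, in_use: impl Fn(u16) -> bool) -> Option<u16> {
    if *next < PORT_LO || *next > PORT_HI {
        *next = PORT_LO; // counter out of range: reset
    }
    let span = (PORT_HI - PORT_LO) as u32 + 1;
    for _ in 0..span {
        let candidate = *next;
        *next = if candidate == PORT_HI { PORT_LO } else { candidate + 1 };
        if !in_use(candidate) {
            return Some(candidate);
        }
    }
    None // entire range exhausted
}
```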

16.2.6.3 UDP Receive Path

udp_rcv(netbuf: NetBuf):
    1. Extract (dst_port, src_port) from UDP header.
    2. Validate UDP length field: udp_len >= 8 and udp_len <= netbuf.len - ip_hdr_len.
       Drop if invalid (increment UdpMib::InErrors).
    3. Validate checksum:
       - IPv4: if checksum field is 0, skip (UDP checksum optional over IPv4).
       - IPv6: checksum is mandatory (RFC 8200). Drop if zero or invalid.
       - Hardware checksum offload: if NIC reports CHECKSUM_COMPLETE, verify
         the pseudo-header adjusted csum. Otherwise compute in software.
    4. Lookup socket: udp_table.by_port.get(dst_port).
       - If SO_REUSEPORT group: select socket via BPF or 4-tuple hash.
       - If connected socket: filter by remote_addr match.
       - If no match: send ICMP Port Unreachable (rate-limited), drop.
    5. Check encap_rcv: if socket has encapsulation callback, call it.
       The callback returns Ok(()) if it consumed the packet, or Err(netbuf)
       if the packet should be delivered normally.
    6. Enqueue to recv_queue. If queue is full (>= rcvbuf):
       drop datagram, increment UdpMib::RcvbufErrors.
    7. Wake any blocked recvmsg() waiter (sk_data_ready).
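
The consistent-hash selection in step 4 can be sketched with an illustrative FNV-1a hash (not the kernel's actual flow hash — the property that matters is that one 4-tuple always maps to the same group member):

```rust
/// Pick a socket index within an SO_REUSEPORT group of `group_len` sockets
/// by hashing the 4-tuple. Deterministic: a given flow always selects the
/// same member, preserving per-flow affinity.
fn reuseport_select(group_len: usize, saddr: u32, sport: u16, daddr: u32, dport: u16) -> usize {
    let mut h: u64 = 0xcbf2_9ce4_8422_2325; // FNV-1a offset basis
    let mut mix = |b: u8| {
        h ^= b as u64;
        h = h.wrapping_mul(0x1_0000_01b3); // FNV-1a prime
    };
    saddr.to_be_bytes().iter().for_each(|&b| mix(b));
    sport.to_be_bytes().iter().for_each(|&b| mix(b));
    daddr.to_be_bytes().iter().for_each(|&b| mix(b));
    dport.to_be_bytes().iter().for_each(|&b| mix(b));
    (h % group_len as u64) as usize
}
```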

16.2.6.4 UDP recvmsg Semantics

udp_recvmsg(ucb: &UdpCb, msg: &mut MsgHdr, flags: u32) -> Result<usize>:
    1. If recv_queue is empty:
       - If flags & MSG_DONTWAIT: return EAGAIN.
       - Else: block on ucb.common.read_wait until data arrives or timeout.
    2. Peek the head datagram from recv_queue.
    3. Copy payload to msg.msg_iov (up to total iov length).
       - If datagram is larger than the user buffer:
         Truncate to buffer size. If flags & MSG_TRUNC: return the
         *original* datagram length (not the truncated length), so the
         caller can detect truncation. Otherwise return truncated length.
    4. Copy source address to msg.msg_name (if non-null).
    5. If flags & MSG_PEEK: do NOT dequeue — leave the datagram in
       recv_queue for a subsequent recvmsg() call.
       Otherwise: dequeue the datagram, release the NetBuf back to the pool.
    6. Deliver ancillary data (cmsg) if requested:
       - IP_PKTINFO: destination address and interface index.
       - IP_TOS / IPV6_TCLASS: DSCP/ECN byte.
       - SO_TIMESTAMP / SO_TIMESTAMPNS: kernel RX timestamp.
       - UDP_GRO: original datagram boundaries for GRO-coalesced buffers.
    7. Return bytes copied (or original datagram length if MSG_TRUNC).
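
The return-length rule in steps 3 and 7 compresses to a small pure function, useful for seeing the MSG_TRUNC contract at a glance:

```rust
/// Return value of udp_recvmsg() for a datagram of `dgram_len` bytes read
/// into a `buf_len`-byte buffer. With MSG_TRUNC the caller receives the
/// original length and detects truncation by comparing against buf_len.
fn recvmsg_retval(dgram_len: usize, buf_len: usize, msg_trunc: bool) -> usize {
    let copied = dgram_len.min(buf_len);
    if msg_trunc { dgram_len } else { copied }
}
```

A 2048-byte datagram read into a 512-byte buffer returns 2048 with MSG_TRUNC set (truncation detectable) but 512 without it (silent truncation, matching Linux semantics).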

Flag summary (matches Linux recvmsg semantics):

| Flag | Value | Effect on UDP recvmsg |
|------|-------|-----------------------|
| MSG_PEEK | 0x02 | Return data without consuming it from the queue |
| MSG_TRUNC | 0x20 | Return the real datagram length even if it exceeded the buffer |
| MSG_DONTWAIT | 0x40 | Non-blocking: return EAGAIN instead of sleeping on an empty queue |
| MSG_ERRQUEUE | 0x2000 | Read from the error queue (ICMP errors, zerocopy notifications) |
| MSG_CMSG_CLOEXEC | 0x40000000 | Set close-on-exec on received fds (AF_UNIX; no-op for UDP) |

16.2.6.5 UDP Send Path

udp_sendmsg(ucb: &UdpCb, msg: &MsgHdr, flags: u32) -> Result<usize>:
    1. Determine destination:
       - If msg.msg_name is set: use it (connected or unconnected send).
       - Else if ucb.remote_addr is set: use connected address.
       - Else: return EDESTADDRREQ.
    2. If cork_active: append payload to cork_buf, return payload length.
       Flush cork_buf when cork is released or buf reaches PMTU.
    3. Route lookup: routes.lookup(dst_addr, src_addr, ...) → RouteLookupResult.
       Cache in netbuf.route_ext.
    4. Compute effective MTU from route (PMTU discovery).
    5. If payload + UDP header (8) + IP header > PMTU:
       - If DF bit is set (default for IPv6, optional for IPv4): return EMSGSIZE.
       - Else: fragment at IP layer (ip_fragment()).
    6. Build UDP header: src_port, dst_port, length, checksum.
       - Checksum: compute over pseudo-header + UDP header + payload.
       - Hardware TX checksum offload: if NIC supports NETIF_F_HW_CSUM,
         set netbuf.csum_offset and let the NIC compute it.
    7. Call ip_output(netbuf) for transmission.
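
The checksum in step 6 is the standard one's-complement Internet checksum (RFC 1071); here is a minimal sketch over pre-assembled 16-bit words (pseudo-header assembly omitted):

```rust
/// One's-complement Internet checksum (RFC 1071) over 16-bit words.
/// For UDP this would run over pseudo-header + UDP header + payload.
fn internet_checksum(words: &[u16]) -> u16 {
    let mut sum: u32 = 0;
    for &w in words {
        sum += w as u32;
    }
    // Fold carries back into the low 16 bits until none remain.
    while sum >> 16 != 0 {
        sum = (sum & 0xffff) + (sum >> 16);
    }
    !(sum as u16)
}
```

The defining property: summing the words together with their checksum and complementing yields zero, which is exactly how the receiver verifies.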

16.2.6.6 UDP-Lite (RFC 3828)

UDP-Lite extends UDP with partial checksum coverage: the checksum covers only the first N bytes of the payload (configurable via setsockopt(UDPLITE_SEND_CSCOV)). This allows applications that tolerate partial corruption (e.g., media codecs) to accept datagrams where only the header and critical payload prefix are verified.

UDP-Lite uses IP protocol 136 (distinct from UDP's protocol 17) and shares the same UdpCb struct with an additional field:

/// UDP-Lite partial checksum coverage (bytes). 0 = full coverage (standard UDP).
/// When non-zero, only the first `cscov` bytes of the datagram are checksummed.
pub cscov_send: u16,
pub cscov_recv: u16,  // Minimum coverage required on receive.
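
The receive-side validation implied by `cscov_recv` can be sketched per RFC 3828 (an illustrative function, not the spec's API):

```rust
/// UDP-Lite receive-side coverage check (RFC 3828): a coverage field of 0
/// means the whole datagram is covered; non-zero coverage must be at least
/// 8 (the header) and at most the datagram length. The datagram is also
/// rejected if effective coverage is below the socket's required minimum.
fn udplite_coverage_ok(pkt_cscov: u16, cscov_recv: u16, udp_len: u16) -> bool {
    if pkt_cscov != 0 && (pkt_cscov < 8 || pkt_cscov > udp_len) {
        return false; // malformed coverage field
    }
    let effective = if pkt_cscov == 0 { udp_len } else { pkt_cscov };
    effective >= cscov_recv
}
```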

16.2.6.7 UDP GRO/GSO

GRO (receive): The umka-net GRO layer (in NetRxContext) can coalesce multiple UDP datagrams destined for the same socket into a single large NetBuf, reducing per-packet processing overhead. GRO coalesces datagrams with identical (src_ip, src_port, dst_ip, dst_port) and contiguous payload. The socket reads the coalesced buffer via recvmsg() with UDP_GRO ancillary data indicating the original datagram boundaries.

GSO (transmit): sendmsg() can accept payloads larger than the MTU when setsockopt(UDP_SEGMENT, segment_size) is set. The stack builds a single large NetBuf and defers segmentation to the NIC (if it supports NETIF_F_GSO_UDP_L4) or to software GSO at the device queue. This reduces per-segment overhead for high-throughput UDP applications (QUIC, media streaming).

16.2.7 Neighbour Subsystem (ARP / NDP)

The neighbour subsystem resolves L3 addresses (IPv4, IPv6) to L2 addresses (Ethernet MAC). It implements ARP (RFC 826) for IPv4 and NDP (RFC 4861) for IPv6 behind a unified NeighborTable abstraction.

For the neighbor table entry definition (NeighborEntry), NUD states, and state machine, see Section 16.7. That section is the canonical definition.

/// Per-protocol neighbour table. One instance for ARP (IPv4), one for NDP (IPv6).
/// Canonical name: `NeighborTable` (matching [Section 16.7](#neighbor-subsystem)).
pub struct NeighborTable {
    /// Hash table: keyed by (L3 address, dev_index). RCU-protected reads.
    /// Resized when load factor exceeds 0.75 (RCU-safe resize: new table
    /// is populated, then swapped atomically).
    pub hash: RcuHashMap<(IpAddr, u32), NeighborEntry>,
    /// GC timer: runs every `gc_interval` (default 30s). Evicts FAILED entries
    /// and entries in STALE state older than `gc_stale_time` (default 60s).
    pub gc_interval_ms: u32,
    /// Base reachable time (default 30s for both ARP and NDP, per RFC 4861).
    /// Actual timeout is randomized: uniform [0.5 * base, 1.5 * base].
    pub base_reachable_time_ms: u32,
    /// Maximum number of unicast probes before declaring FAILED (default 3).
    pub max_unicast_probes: u32,
    /// Maximum number of multicast probes (ARP broadcast / NDP multicast) for
    /// initial resolution (default 3).
    pub max_multicast_probes: u32,
}
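
The randomized timeout can be sketched as a pure function (`uniform01` is a caller-supplied sample in [0, 1), standing in for the kernel RNG):

```rust
/// Randomized neighbour reachability timeout: uniform in
/// [0.5 * base, 1.5 * base], matching the randomization described above.
/// Randomizing per-entry prevents synchronized probe storms across a subnet.
fn reachable_time_ms(base_ms: u32, uniform01: f64) -> u32 {
    ((0.5 + uniform01) * base_ms as f64) as u32
}
```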

ARP packet processing (called from arp_rcv() in the L3 dispatch path, step 5c above):

  1. Validate ARP header: hrd=1 (Ethernet), pro=0x0800 (IPv4), hln=6, pln=4.
  2. ARP Request (op=1): If target_ip matches a local interface address, update or create a neighbour entry for (sender_ip, sender_mac) → REACHABLE, then send an ARP Reply with our MAC.
  3. ARP Reply (op=2): Update the existing neighbour entry for sender_ip to REACHABLE with the new MAC. Flush pending_queue (NeighborEntry.pending_queue: SpinLock<ArrayVec<NetBufHandle, 3>>, Section 16.7) — transmit all queued packets via dev_queue_xmit().
  4. Gratuitous ARP (sender_ip == target_ip): Update existing entry if present (for MAC migration detection); do not create a new entry (prevents cache poisoning).

Sysctl tunables (per-namespace, per-protocol):

| Sysctl | Default | Description |
|--------|---------|-------------|
| net.ipv4.neigh.default.gc_stale_time | 60s | STALE entries older than this are GC'd |
| net.ipv4.neigh.default.base_reachable_time_ms | 30000 | Base reachability timeout |
| net.ipv4.neigh.default.retrans_time_ms | 1000 | Retransmit interval for probes |
| net.ipv4.neigh.default.gc_thresh1 | 128 | GC starts when table exceeds this |
| net.ipv4.neigh.default.gc_thresh3 | 1024 | Hard limit; new entries rejected above this |

Cross-references:

  - NAPI poll lifecycle and budget accounting: Section 16.14
  - NetBuf struct and route_ext field: Section 16.5
  - RouteTable::lookup() and RouteLookupResult: Section 16.6
  - VLAN receive path (runs before L3 dispatch): Section 16.27
  - BPF packet filtering and connection tracking (ConntrackTuple, ConntrackEntry, hash table design, NAT): Section 16.18
  - Domain crossing protocol (NIC driver → umka-net): Section 16.5
  - XDP pre-stack processing: Section 19.2
  - TCP control block (TcpCb): Section 16.8
  - KABI ring protocol: Section 11.8
  - Isolation domain architecture: Section 11.2

16.3 Socket Abstraction

Every socket carries common state independent of its transport protocol.

Namespace resolution path: When a socket is created via the socket() syscall, the kernel resolves the calling task's network namespace through the task's NamespaceSet:

/// Resolve the network namespace for the current task.
///
/// The `NamespaceSet` (held in `Task.nsproxy`) contains a capability reference
/// to the task's network stack. `cap_resolve` validates the capability and
/// returns the `NetNamespace` that governs all of this task's network operations
/// (routing, interface enumeration, port binding, firewall rules).
///
/// This function is called once per `socket()` syscall; the result is stored
/// in `SockCommon.net_ns` and remains immutable for the socket's lifetime.
fn resolve_net_ns(ns_set: &NamespaceSet) -> Arc<NetNamespace> {
    ns_set.net_stack.cap_resolve()
}

The resolved NetNamespace is stored in SockCommon.net_ns (below) and governs all subsequent operations on that socket — routing lookups use net_ns.routes, interface enumeration uses net_ns.interfaces, and port binding checks net_ns.port_allocator. A socket cannot change its namespace after creation.

/// Per-socket common state, embedded in every protocol-specific socket struct
/// (TcpCb, UdpCb, SctpAssociation, etc.).
///
/// Captured at socket creation time (`socket()` syscall) from the calling task's
/// context. Immutable after creation — a socket cannot change its namespace.
pub struct SockCommon {
    /// Network namespace this socket belongs to. Captured from
    /// `resolve_net_ns(&current_task().nsproxy)` at `socket()` syscall time.
    /// All routing lookups, interface enumeration, and port binding for
    /// this socket use `net_ns.routes`, `net_ns.interfaces`, etc.
    /// ([Section 17.1](17-containers.md#namespace-architecture--capability-domain-mapping)).
    pub net_ns: Arc<NetNamespace>,
    /// Socket type (SOCK_STREAM, SOCK_DGRAM, SOCK_RAW, SOCK_SEQPACKET).
    pub sock_type: SockType,
    /// Protocol family (AF_INET, AF_INET6, AF_UNIX, AF_VSOCK, AF_PACKET).
    pub family: AddressFamily,
    /// Socket-level flags (SO_REUSEADDR, SO_KEEPALIVE, etc.).
    pub flags: AtomicU32,
    /// Receive buffer size limit (bytes). Set by SO_RCVBUF; default from
    /// `net.core.rmem_default` sysctl (per-namespace, default 212992).
    /// Maximum: `2 * net.core.rmem_max` (default rmem_max = 212992, so
    /// max = 425984). With `CAP_NET_ADMIN`, SO_RCVBUFFORCE bypasses the
    /// limit. Linux compatibility: identical sysctl names and semantics.
    // Stored as u32 (always non-negative after validation). setsockopt(SO_RCVBUF)
    // rejects negative values at the syscall boundary before this field is updated.
    pub rcvbuf: u32,
    /// Send buffer size limit (bytes). Set by SO_SNDBUF; default from
    /// `net.core.wmem_default` sysctl (per-namespace, default 212992).
    /// Maximum: `2 * net.core.wmem_max` (default wmem_max = 212992, so
    /// max = 425984). With `CAP_NET_ADMIN`, SO_SNDBUFFORCE bypasses the
    /// limit.
    pub sndbuf: u32,
    /// Total bytes currently in the send queue (across all queued segments).
    /// Used for send buffer backpressure: when `sk_wmem_queued >= sndbuf`,
    /// further writes block until ACKs free space. u64 for high-BDP paths
    /// (>4.3 GB at 400 Gbps / 100ms RTT). UDP also uses this for send accounting.
    pub sk_wmem_queued: u64,
    /// Wait queue for receive-side blocking. Woken by `sk_data_ready()` when
    /// new data arrives on the socket (TCP segment queued, UDP datagram
    /// enqueued, etc.). Tasks blocked in `recvmsg()` or `poll(EPOLLIN)`
    /// sleep on this queue.
    ///
    /// Linux equivalent: the read half of `sk->sk_wq`. UmkaOS splits
    /// read and write wait queues for clarity and to avoid false wakeups.
    pub read_wait: WaitQueue,
    /// Wait queue woken when ACK frees send buffer space (sk_wmem_queued drops
    /// below sndbuf threshold).
    pub write_wait: WaitQueue,
    /// Flag indicating that at least one task is blocked on `write_wait`.
    /// Checked by the TCP ACK processing path (Tier 1) to avoid unnecessary
    /// cross-domain wakeup signals when no task is waiting for write space.
    /// Set to `true` by `sendmsg()` before sleeping on `write_wait`; set to
    /// `false` by the wakeup path after waking the waiter.
    /// `Relaxed` ordering suffices: the exact value is not ordering-dependent
    /// relative to other fields (a stale `true` causes one extra wakeup signal,
    /// which is harmless).
    pub write_wait_pending: AtomicBool,
    /// Owning user credentials at socket creation time (for LSM checks).
    pub cred: Arc<Credentials>,
    /// Bound cgroup for eBPF classification ([Section 16.21](#traffic-control-and-queue-disciplines--integration-with-cgroups-network-bandwidth-enforcement)).
    /// Captured from `current_task().cgroup` at socket creation. Used by
    /// cgroup-attached eBPF programs for packet classification — NOT via
    /// a kernel `net_cls_classid` field (cgroup v2 replaces net_cls with eBPF).
    pub cgroup: Arc<CgroupCss>,

    /// Cached route for connected sockets. Set at `connect()`/`accept()` time
    /// from the FIB lookup result. TCP's `ip_queue_xmit()` stamps each segment's
    /// `route_ext` from this cache, avoiding per-packet FIB trie walks.
    ///
    /// **Design rationale (inline vs pointer)**: `RouteLookupResult` (~88 bytes)
    /// is embedded inline because sockets are long-lived objects (~minutes to
    /// hours) where the one-time cost of the larger struct is amortized over
    /// millions of packets. In contrast, `NetBuf.route_ext` uses
    /// `Option<NonNull<RouteLookupResult>>` (8 bytes, pointer to slab) because
    /// per-packet NetBufs are allocated/freed at line rate and every byte counts.
    ///
    /// **Invalidation mechanism**: Each `RouteTable` maintains a monotonically
    /// increasing generation counter (`route_gen: AtomicU64`). The cached route
    /// stores the generation at the time of the FIB lookup. On the TX hot path,
    /// `sk_dst_check()` compares the cached generation against the current
    /// `net_ns.routes.route_gen` (single atomic load, ~1 cycle). If they differ:
    /// 1. The cached route is invalidated (set to `None`).
    /// 2. A fresh FIB lookup is performed.
    /// 3. The new result (with current generation) replaces the cache.
    ///
    /// Invalidation triggers (each bumps `route_gen`):
    /// - Routing table changes (netlink `RTM_NEWROUTE`/`RTM_DELROUTE`)
    /// - PMTU changes (ICMP feedback updates the route's `mtu` field)
    /// - Interface state changes (link down → `rt_cache_flush()`)
    ///
    /// Protected by `TcpCb.lock` (for TCP) or the socket's protocol lock.
    /// Read on the TX hot path under that lock.
    ///
    /// Linux equivalent: `struct sock.sk_dst_cache` (a `dst_entry*` validated
    /// per-packet via `sk_dst_check()` using `dst.obsolete` + `dst_ops->check`).
    pub cached_route: Option<RouteLookupResult>,

    /// Opaque socket handle for the Tier 0/Tier 1 wakeup protocol.
    /// Allocated by the socket ring dispatcher at `socket()` time from a
    /// monotonic `AtomicU64` counter. Used in `SocketRingCmd.sock_handle`,
    /// `SocketWakeEvent.sock_handle`, and the `sk_data_ready()` /
    /// `sk_write_space_ready()` cross-domain wakeup signals.
    ///
    /// The handle is the fundamental socket identifier in the Tier 0/Tier 1
    /// protocol: Tier 1 network stack sends wakeup events to Tier 0 keyed
    /// by this handle, and Tier 0 dispatches the wake to the correct
    /// `read_wait`/`write_wait` queue. Without this field, the socket
    /// wakeup path cannot function.
    ///
    /// u64: at 10^9 sockets/sec, wraps in 584 years.
    pub sock_handle: u64,
}
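The generation-check protocol described in the `cached_route` doc comment can be sketched in user-space Rust. This is a simplified model, not the kernel code: `RouteTable` here holds only the generation counter, and `CachedRoute` carries a placeholder `next_hop` payload instead of the real ~88-byte `RouteLookupResult`.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Illustrative stand-in for the per-namespace route table: only the
/// generation counter matters for the validity check.
pub struct RouteTable {
    pub route_gen: AtomicU64,
}

#[derive(Clone)]
pub struct CachedRoute {
    pub gen: u64,      // route_gen observed at FIB lookup time
    pub next_hop: u32, // placeholder for the real route payload
}

/// Validate-or-refresh: return the cached route if its generation still
/// matches the table's current generation; otherwise re-run the FIB lookup
/// and replace the cache (steps 1-3 of the invalidation mechanism).
pub fn sk_dst_check(
    cached: &mut Option<CachedRoute>,
    table: &RouteTable,
    fib_lookup: impl FnOnce() -> CachedRoute,
) -> CachedRoute {
    let current_gen = table.route_gen.load(Ordering::Acquire);
    if let Some(r) = cached.as_ref() {
        if r.gen == current_gen {
            return r.clone(); // hot path: one atomic load, no FIB trie walk
        }
    }
    let fresh = fib_lookup(); // fresh lookup stamps the current generation
    *cached = Some(fresh.clone());
    fresh
}
```

Any writer that changes routing state only has to bump `route_gen`; every socket's cache then self-invalidates lazily on its next transmit.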

The socket layer is protocol-agnostic. Transport protocols register implementations of a common trait:

/// Protocol-agnostic socket operations.
/// Each transport protocol (TCP, UDP, SCTP, MPTCP) implements this trait.
/// Every implementor embeds a `SockCommon` struct and exposes it via `sock_common()`.
///
/// **Scope: Tier 1 internal (umka-net only).** This trait is used within the
/// umka-net domain for direct dispatch between protocol implementations. It is
/// NOT part of the KABI boundary. Cross-domain socket operations use the
/// `SocketRingCmd`/`SocketRingResp` ring protocol ([Section 16.4](#socket-operation-dispatch)),
/// which represents sockets as integer `sock_handle: u64` values — never as
/// trait object pointers. The `SlabRef<dyn SocketOps>` returned by `accept()`
/// is valid only within the umka-net Tier 1 domain; the corresponding
/// `sock_handle` is registered in the socket dispatch table for cross-domain
/// access.
pub trait SocketOps: Send + Sync {
    /// Bind the socket to a local address.
    fn bind(&self, addr: &SockAddr) -> Result<(), KernelError>;

    /// Mark the socket as a passive listener.
    fn listen(&self, backlog: u32) -> Result<(), KernelError>;

    /// Accept an incoming connection (blocking or non-blocking).
    fn accept(&self) -> Result<(SlabRef<dyn SocketOps>, SockAddr), KernelError>;

    /// Initiate an outgoing connection.
    fn connect(&self, addr: &SockAddr) -> Result<(), KernelError>;

    /// Send a message (scatter-gather, ancillary data, destination address).
    fn sendmsg(&self, msg: &MsgHdr, flags: u32) -> Result<usize, KernelError>;

    /// Receive a message (scatter-gather, ancillary data, source address).
    fn recvmsg(&self, msg: &mut MsgHdr, flags: u32) -> Result<usize, KernelError>;

    /// Set a socket option (protocol-specific behavior).
    fn setsockopt(&self, level: i32, name: i32, val: &[u8]) -> Result<(), KernelError>;

    /// Get a socket option value.
    fn getsockopt(&self, level: i32, name: i32, buf: &mut [u8]) -> Result<usize, KernelError>;

    /// Retrieve the local address of a bound socket.
    /// Writes the address into `addr` and returns its byte length.
    fn getsockname(&self, addr: &mut SockAddr) -> Result<usize, KernelError>;

    /// Retrieve the remote address of a connected socket.
    /// Writes the address into `addr` and returns its byte length.
    fn getpeername(&self, addr: &mut SockAddr) -> Result<usize, KernelError>;

    /// Poll for readiness events (POLLIN, POLLOUT, POLLERR, POLLHUP).
    fn poll(&self, events: PollEvents) -> PollEvents;

    /// Shut down part of a full-duplex connection.
    fn shutdown(&self, how: ShutdownHow) -> Result<(), KernelError>;

    /// Close the socket and release all resources.
    /// For TCP: initiates FIN handshake (or RST if SO_LINGER with timeout 0).
    /// For UDP: releases port binding and queued buffers.
    /// Called when the last file descriptor reference is dropped (VFS layer
    /// guarantees exactly one call per socket lifetime). Error recovery paths
    /// do NOT call close() directly — they mark the socket as errored and let
    /// the VFS drop path handle cleanup.
    ///
    /// Close is **best-effort**: if close() returns Err (e.g., TCP FIN
    /// handshake timeout), the error is logged and the socket is released
    /// regardless. The VFS layer always frees the socket resources after
    /// this call, matching Linux semantics where close(2) errors on
    /// sockets are not retryable (POSIX: "If close() is interrupted by a
    /// signal [...] the state of fildes is unspecified"; Linux: close always
    /// releases the fd regardless of error). Applications requiring durable
    /// delivery must use `shutdown(SHUT_WR)` + `read()` for EOF confirmation
    /// before calling close(), same as on Linux.
    fn close(&self) -> Result<(), KernelError>;
}
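The best-effort `close()` contract above can be sketched as a small user-space model. The `vfs_release_socket` name, the trimmed `KernelError`, and the string log are illustrative, not the real Tier 0 drop path:

```rust
#[derive(Debug, PartialEq)]
pub enum KernelError {
    TimedOut,
}

pub trait SocketOps {
    fn close(&self) -> Result<(), KernelError>;
}

/// Last-reference drop path: close() is invoked exactly once; an error is
/// logged and swallowed, and the socket is released unconditionally,
/// matching the rule that close(2) errors on sockets are not retryable.
pub fn vfs_release_socket(sock: Box<dyn SocketOps>, log: &mut Vec<String>) {
    if let Err(e) = sock.close() {
        log.push(format!("socket close failed (released anyway): {:?}", e));
    }
    drop(sock); // resources freed regardless of close() outcome
}
```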

16.3.1 Namespace-Scoped Network Privilege Checks

All network operations that require elevated privileges use ns_capable() scoped to the user namespace that owns the socket's network namespace, never the global capable() function. This ensures that a process with CAP_NET_ADMIN inside a container can administer its own network namespace without gaining any power over the host network namespace.

Rule: For any network privilege check, the capability is validated against sock.net_ns.user_ns (the user namespace that owns the socket's network namespace):

/// Check a network capability against the socket's owning namespace.
///
/// This is the ONLY correct way to check network capabilities.
/// Using `capable(cap)` instead of `ns_capable(...)` is a privilege
/// escalation bug — a container root could affect the host network.
fn sock_ns_capable(sock: &SockCommon, cap: Capability) -> bool {
    ns_capable(&sock.net_ns.user_ns, cap)
}

/// Check a network capability against the current task's network namespace.
/// Used when no socket is available (e.g., netlink operations, interface
/// configuration, route manipulation).
fn net_ns_capable(cap: Capability) -> bool {
    let net_ns = &current_task().nsproxy.net_ns;
    ns_capable(&net_ns.user_ns, cap)
}

Network operations and their capability checks:

| Operation | Capability | Scope |
|-----------|------------|-------|
| socket(AF_PACKET, SOCK_RAW, ...) | CAP_NET_RAW | ns_capable(task.nsproxy.net_ns.user_ns, CAP_NET_RAW) |
| socket(AF_INET, SOCK_RAW, ...) | CAP_NET_RAW | ns_capable(task.nsproxy.net_ns.user_ns, CAP_NET_RAW) |
| bind() to privileged port (<1024) | CAP_NET_BIND_SERVICE | ns_capable(sock.net_ns.user_ns, CAP_NET_BIND_SERVICE) |
| setsockopt(IP_TRANSPARENT) | CAP_NET_ADMIN | ns_capable(sock.net_ns.user_ns, CAP_NET_ADMIN) |
| Netlink RTM_NEWROUTE / RTM_DELROUTE | CAP_NET_ADMIN | ns_capable(task.nsproxy.net_ns.user_ns, CAP_NET_ADMIN) |
| Netlink RTM_NEWLINK (create interface) | CAP_NET_ADMIN | ns_capable(task.nsproxy.net_ns.user_ns, CAP_NET_ADMIN) |
| ioctl(SIOCSIFFLAGS) (set interface up/down) | CAP_NET_ADMIN | ns_capable(task.nsproxy.net_ns.user_ns, CAP_NET_ADMIN) |
| nftables/iptables rule manipulation | CAP_NET_ADMIN | ns_capable(task.nsproxy.net_ns.user_ns, CAP_NET_ADMIN) |
| setsockopt(SO_MARK) | CAP_NET_ADMIN | ns_capable(sock.net_ns.user_ns, CAP_NET_ADMIN) |
| setsockopt(SO_PRIORITY) to value > 6 | CAP_NET_ADMIN | ns_capable(sock.net_ns.user_ns, CAP_NET_ADMIN) |

Key invariant: A process with CAP_NET_ADMIN in user namespace U can only administer network namespaces whose owning user namespace is U or a descendant of U. It cannot affect network namespaces owned by ancestor or sibling user namespaces. This is enforced by the ns_capable() function, which walks the user namespace hierarchy (Section 17.1).
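A toy model of the hierarchy walk makes the invariant concrete. The real ns_capable() also checks the task's capability sets and UID mappings, which are omitted here; the type and function names are illustrative:

```rust
/// Simplified user-namespace node: just an id and a parent link.
#[derive(Clone)]
pub struct UserNs {
    pub id: u32,
    pub parent: Option<Box<UserNs>>,
}

/// A capability held in namespace `holder_id` is effective in that
/// namespace and all of its descendants. Walking from `target` toward the
/// root, the check succeeds iff the holder's namespace is encountered;
/// ancestors and siblings of the holder are never reachable this way.
pub fn cap_effective_in(target: &UserNs, holder_id: u32) -> bool {
    let mut cur = Some(target);
    while let Some(ns) = cur {
        if ns.id == holder_id {
            return true;
        }
        cur = ns.parent.as_deref();
    }
    false
}
```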

16.3.2 io_uring Socket Operations

All socket operations are accessible through the io_uring async submission interface (Section 19.3). The following opcodes dispatch to the same SocketOps methods as their syscall counterparts, but avoid the syscall entry/exit overhead and support batched submission:

| io_uring opcode | Equivalent syscall | Notes |
|-----------------|--------------------|-------|
| IORING_OP_SEND | send() / sendto() | Supports MSG_ZEROCOPY via registered buffers |
| IORING_OP_RECV | recv() / recvfrom() | Supports multishot (re-arms automatically) |
| IORING_OP_SENDMSG | sendmsg() | Scatter-gather via msghdr |
| IORING_OP_RECVMSG | recvmsg() | Supports multishot + provided buffer groups |
| IORING_OP_ACCEPT | accept4() | Supports multishot (accepts multiple connections) |
| IORING_OP_CONNECT | connect() | Async non-blocking connect |
| IORING_OP_SEND_ZC | send() + MSG_ZEROCOPY | Zero-copy send with completion notification |
| IORING_OP_SOCKET | socket() | Direct socket creation into the io_uring fd table |
| IORING_OP_SHUTDOWN | shutdown() | Async socket shutdown |

The io_uring submission path resolves the socket from the SQE's fd field (or from a registered fixed-file slot), acquires the same locks as the syscall path, and invokes SocketOps methods directly. Capability and LSM checks are identical to the syscall path — io_uring does not bypass security policy.

Credential freshness: io_uring captures the submitter's credentials at io_uring_setup() time. At SQE execution time, socket operations validate against those captured credentials, not the task's current credentials. If the task's credentials have changed between submission and execution (e.g., via setuid()), the operation still runs with the credentials that were active when the ring was created. This prevents privilege escalation via deferred execution after a credential change. The IORING_SETUP_CUR_CID setup flag opts out of the capture: when it is set, operations re-check the task's current credentials at execution time.

16.4 Socket Operation Dispatch

The socket operation dispatch subsystem bridges umka-core (Tier 0) and umka-net (Tier 1) for all socket-related syscalls. It is the networking analog of the VFS ring buffer protocol (Section 14.2), applying the same per-CPU ring, doorbell coalescing, and zero-copy principles to socket I/O.

Performance imperative: Linux's sys_sendmsg() to tcp_sendmsg() path costs ~50-100ns total (in-kernel function calls, no isolation boundary). UmkaOS adds a domain switch (~23-80 cycles depending on architecture). The design compensates via batch submission, per-CPU rings, zero-copy page sharing, and completion coalescing to achieve negative overhead on batched workloads. Per-architecture analysis:

- x86-64 (MPK, 23cy): at N=16, per-op cost = 23/16 ≈ 1.4cy. Linux indirect call chain with retpoline: ~15-20cy. Negative overhead from N >= 2.
- AArch64 (POE, 40-80cy): at N=16, per-op cost = 80/16 = 5cy. Linux indirect call on AArch64 without retpoline: ~5-8cy (ARM uses different Spectre mitigations — no retpoline penalty). Break-even at N=16; negative overhead requires N >= 20 or additional savings from ring prefetch/cache effects.
- ARMv7 (DACR, 30-40cy): at N=16, per-op cost = 40/16 = 2.5cy. Linux indirect call: ~5-8cy. Negative overhead from N >= 8.

The negative-overhead claim is strongest on x86-64 (where retpolines add significant indirect call cost) and weakest on AArch64 (where indirect branches are cheaper).
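The amortization arithmetic above reduces to two one-line formulas; a small sketch (cycle figures are the estimates quoted above, not measurements, and the function names are illustrative):

```rust
/// Per-operation share of the one-time domain-switch cost for a batch of N.
pub fn per_op_cycles(crossing_cycles: u32, batch: u32) -> f64 {
    crossing_cycles as f64 / batch as f64
}

/// Smallest batch size at which the amortized crossing cost drops to or
/// below the baseline per-call cost (the Linux indirect-call estimate).
pub fn break_even_batch(crossing_cycles: u32, baseline_cycles: f64) -> u32 {
    (crossing_cycles as f64 / baseline_cycles).ceil() as u32
}
```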

Tier assignment: Tier 0 (Nucleus). The ring wire format, message opcodes, and dispatch logic are non-replaceable — they define the ABI between umka-core and umka-net. Socket dispatch routing policy (which ring handles which socket) is Evolvable. TCP congestion control is Evolvable (Section 16.10).

Phase: Phase 1 (core ring protocol, single-ring-per-namespace). Phase 2 (per-CPU rings, epoll integration). Phase 3 (zero-copy MSG_ZEROCOPY, sendfile, splice, ML hooks).


16.4.1 Ring Protocol

16.4.1.1 SocketRingCmd: Request Message

Every socket syscall is serialized into a SocketRingCmd and posted to umka-net's command ring. The design follows the VFS ring protocol pattern (Section 14.2): fixed-size header + tagged-union payload.

/// Socket operation request posted to umka-net's command ring by Tier 0.
///
/// **Layout**: 128-byte fixed-size entries. All variants are padded to
/// 128 bytes for ring slot alignment (one L1 cache line pair). The 128-byte
/// size is chosen because the largest variant (`SendMsg`) carries inline
/// destination address + control message metadata; 64 bytes would truncate
/// common sendmsg payloads requiring a second ring entry or DMA indirection.
///
/// **ABI stability**: `#[repr(C)]` struct with explicit `opcode: SocketOpcode`
/// field (u32 discriminant). The opcode is in the fixed header, not embedded
/// in a tagged union. New opcodes are appended (never reordered).
///
/// **Placement**: Tier 0 (Nucleus). Non-replaceable wire format.
#[repr(C)]
pub struct SocketRingCmd {
    /// Unique request ID for response matching. Drawn from the per-ring-set
    /// global AtomicU64 counter (same pattern as VFS
    /// [Section 14.3](14-vfs.md#vfs-per-cpu-ring-extension--request-id-generation)).
    ///
    /// **Longevity**: At 100M socket ops/sec (10x beyond any realistic
    /// network stack throughput), u64 wraps after ~5,800 years. Safe.
    pub request_id: u64,

    /// Operation code identifying the socket syscall.
    pub opcode: SocketOpcode,

    /// Explicit padding after u32 opcode to align subsequent u64 fields.
    /// Must be zero. Prevents information disclosure from implicit padding.
    pub _pad_opcode: u32,

    /// Socket handle — the kernel-internal socket identifier (NOT the
    /// userspace fd). Tier 0 resolves fd -> socket handle before posting.
    /// u64: indexed in umka-net's socket XArray. Using fd directly would
    /// require umka-net to maintain an fd-to-socket mapping, duplicating
    /// Tier 0's FdTable.
    pub sock_handle: u64,

    /// Flags from the syscall (MSG_DONTWAIT, MSG_ZEROCOPY, etc.).
    /// Interpretation is opcode-dependent.
    pub flags: u32,

    /// Reserved for future use. Must be zero.
    pub _reserved: u32,

    /// Operation-specific arguments. The variant must match `opcode`.
    pub args: SocketRingArgs,
}
const_assert!(core::mem::size_of::<SocketRingCmd>() == 128);
// Verify header: request_id(8) + opcode(4) + pad(4) + sock_handle(8) +
//   flags(4) + reserved(4) = 32 bytes header, 96 bytes for args.
const_assert!(core::mem::offset_of!(SocketRingCmd, args) == 32);

/// Socket operation opcodes. Stable u32 discriminants.
/// New opcodes are appended; existing values never change.
///
/// **Group boundaries and reserved ranges**:
///
/// | Group              | Range  | Used  | Reserved |
/// |--------------------|--------|-------|----------|
/// | Connection lifecycle | 1-7  | 7     | —        |
/// | Paired sockets     | 8-8    | 1     | —        |
/// | Socket ioctls      | 9-9    | 1     | —        |
/// | Data transfer      | 10-19  | 4     | 14-19    |
/// | Socket options     | 20-29  | 4     | 24-29    |
/// | Readiness          | 30-39  | 4     | 34-39    |
/// | Zero-copy          | 40-49  | 3     | 43-49    |
/// | io_uring batched   | 50-59  | 1     | 51-59    |
///
/// Unknown opcodes return `status = -ENOSYS` in the completion ring.
#[repr(u32)]
pub enum SocketOpcode {
    // --- Connection lifecycle (1-7) ---
    /// Create a new socket. sock_handle is unused (set to 0).
    /// Response returns the new sock_handle.
    Socket          = 1,
    /// Bind socket to a local address.
    Bind            = 2,
    /// Mark socket as passive listener.
    Listen          = 3,
    /// Accept incoming connection.
    Accept          = 4,
    /// Initiate outgoing connection.
    Connect         = 5,
    /// Shut down part of full-duplex connection.
    Shutdown        = 6,
    /// Close socket and release all resources.
    Close           = 7,

    // --- Paired sockets ---
    /// Create a pair of connected sockets (socketpair syscall).
    /// sock_handle is unused (set to 0). Response returns two new
    /// sock_handles in aux[0..8] and aux[8..16].
    /// Required for AF_UNIX (used by pipe(), systemd, D-Bus, shell
    /// pipelines, and every program that calls socketpair(2)).
    SocketPair      = 8,

    // --- Socket ioctls ---
    /// Socket ioctl dispatch. Handles SIOCGIFADDR, SIOCGIFFLAGS,
    /// SIOCGIFCONF, SIOCADDRT, SIOCDELRT, SIOCSIFADDR, and all other
    /// socket-specific ioctls (SIOC* family). In Linux these are routed
    /// through the VFS ioctl path, but UmkaOS dispatches socket ioctls
    /// directly to umka-net to avoid an unnecessary VFS domain crossing.
    /// Tier 0 detects socket ioctls by checking the fd type (socket vs
    /// file) before dispatching: if fd refers to a socket, the ioctl
    /// bypasses VFS and goes through the socket ring with this opcode.
    /// Non-socket ioctls (terminal, block device, etc.) continue through
    /// the VFS ioctl path.
    Ioctl           = 9,

    // --- Data transfer (10-19, 6 reserved) ---
    /// Send data (sendmsg/sendto/send).
    SendMsg         = 10,
    /// Receive data (recvmsg/recvfrom/recv).
    RecvMsg         = 11,
    /// Batched send (sendmmsg). N messages in one ring entry.
    SendMmsg        = 12,
    /// Batched receive (recvmmsg). N messages in one ring entry.
    RecvMmsg        = 13,
    // Opcodes 14-19 reserved for future expansion. Unknown opcodes return
    // status = -ENOSYS in the completion ring.

    // --- Socket options (20-29, 6 reserved) ---
    /// Set socket option (setsockopt).
    SetSockOpt      = 20,
    /// Get socket option (getsockopt).
    GetSockOpt      = 21,
    /// Get local socket address (getsockname).
    GetSockName     = 22,
    /// Get peer socket address (getpeername).
    GetPeerName     = 23,

    // --- Readiness (30-39, 6 reserved) ---
    /// Poll socket for readiness events (used by epoll registration).
    Poll            = 30,
    /// Register socket with epoll interest list (Tier 1 side).
    EpollCtlAdd     = 31,
    /// Modify epoll interest list entry.
    EpollCtlMod     = 32,
    /// Remove socket from epoll interest list.
    EpollCtlDel     = 33,

    // --- Zero-copy (40-49, 7 reserved) ---
    /// sendfile() — transfer pages from file page cache to socket.
    Sendfile        = 40,
    /// splice() — pipe-to-socket zero-copy transfer.
    SpliceToSocket  = 41,
    /// splice() — socket-to-pipe zero-copy transfer.
    SpliceFromSocket = 42,

    // --- io_uring batched (50-59, 9 reserved) ---
    /// Batch of N socket operations coalesced by io_uring dispatch.
    /// args.batch contains the count and DMA buffer of sub-operations.
    IoUringBatch    = 50,
}
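On the consumer side, the raw u32 opcode read from the ring must be range-checked before dispatch. A sketch of that validation (handler bodies elided; `dispatch_status` is an illustrative name; ENOSYS = 38 per the Linux errno ABI):

```rust
/// Linux errno value for "function not implemented".
pub const ENOSYS: i32 = 38;

/// Validate a raw opcode from the ring. Known opcodes dispatch to their
/// handlers (elided here); reserved or unknown values complete with
/// -ENOSYS in the completion ring rather than faulting the domain.
pub fn dispatch_status(raw_opcode: u32) -> i32 {
    match raw_opcode {
        1..=13 | 20..=23 | 30..=33 | 40..=42 | 50 => 0, // assigned ranges
        _ => -ENOSYS, // 0, 14-19, 24-29, 34-39, 43-49, 51+: invalid/reserved
    }
}
```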

/// Per-opcode argument payload for a `SocketRingCmd`.
///
/// `#[repr(C)]` union of fixed-size structs, selected by `opcode`.
/// The union is exactly 96 bytes (128 - 32 byte header), enforced by
/// the `_size_pad` variant. Individual argument structs may be smaller
/// than 96 bytes; unused trailing bytes are zero-initialized by the
/// producer before posting to the ring.
///
/// Each variant is `#[repr(C)]` for cross-domain ABI stability.
/// All `SocketSlotHandle` fields are 8-byte aligned (they are `u64`
/// newtypes). Producers must insert explicit 4-byte padding after any
/// `u32` field that precedes a `SocketSlotHandle` to prevent implicit
/// compiler padding from creating information-disclosure channels.
/// (This is handled by the field ordering below — each struct places
/// `SocketSlotHandle` fields at naturally-aligned offsets.)
#[repr(C)]
pub union SocketRingArgs {
    pub socket:         SocketCreateArgs,
    pub bind:           BindArgs,
    pub listen:         ListenArgs,
    pub accept:         AcceptArgs,
    pub connect:        ConnectArgs,
    pub shutdown:       ShutdownArgs,
    pub close:          CloseArgs,
    pub send_msg:       SendMsgArgs,
    pub recv_msg:       RecvMsgArgs,
    pub send_mmsg:      SendMmsgArgs,
    pub recv_mmsg:      RecvMmsgArgs,
    pub setsockopt:     SetSockOptArgs,
    pub getsockopt:     GetSockOptArgs,
    pub getsockname:    GetSockNameArgs,
    pub getpeername:    GetPeerNameArgs,
    pub poll:           PollArgs,
    pub epoll_ctl:      EpollCtlArgs,
    pub sendfile:       SendfileArgs,
    pub splice:         SpliceArgs,
    pub socket_pair:    SocketPairArgs,
    pub sock_ioctl:     SockIoctlArgs,
    pub io_uring_batch: IoUringBatchArgs,
    /// Size guarantee: ensures the union is exactly 96 bytes regardless
    /// of which variant is active.
    _size_pad: [u8; 96],
}
const_assert!(core::mem::size_of::<SocketRingArgs>() == 96);
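A sketch of the producer-side encoding discipline: the whole 128-byte slot is zeroed before the header fields are written, so `_pad_opcode`, `_reserved`, and the unused tail of the union can never leak stale Tier 0 memory. Offsets follow the const_asserts above; little-endian byte order and the function name are assumptions of this sketch.

```rust
/// Serialize just the 32-byte SocketRingCmd header into a zeroed 128-byte
/// ring slot. The args union (offset 32..128) is written separately by the
/// per-opcode encoder and stays zero wherever unused.
pub fn encode_cmd_header(
    slot: &mut [u8; 128],
    request_id: u64,
    opcode: u32,
    sock_handle: u64,
    flags: u32,
) {
    slot.fill(0); // zero first: padding + unused union bytes must be 0
    slot[0..8].copy_from_slice(&request_id.to_le_bytes());
    slot[8..12].copy_from_slice(&opcode.to_le_bytes());
    // slot[12..16] is _pad_opcode: stays zero
    slot[16..24].copy_from_slice(&sock_handle.to_le_bytes());
    slot[24..28].copy_from_slice(&flags.to_le_bytes());
    // slot[28..32] is _reserved: stays zero; args begin at offset 32
}
```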

16.4.1.2 Argument Structs

/// socket() arguments. Size: 12 bytes (3 x u32).
#[repr(C)]
pub struct SocketCreateArgs {
    /// Address family (AF_INET=2, AF_INET6=10, AF_UNIX=1, AF_PACKET=17, etc.).
    pub domain: u32,
    /// Socket type (SOCK_STREAM=1, SOCK_DGRAM=2, SOCK_RAW=3, etc.).
    /// Upper bits may contain SOCK_NONBLOCK (0x800) and SOCK_CLOEXEC (0x80000).
    pub sock_type: u32,
    /// Protocol number (IPPROTO_TCP=6, IPPROTO_UDP=17, 0=default for type).
    pub protocol: u32,
}
const_assert!(core::mem::size_of::<SocketCreateArgs>() == 12);

/// bind() arguments. Size: 16 bytes.
/// Field ordering: u64 first, then u32, for natural alignment.
#[repr(C)]
pub struct BindArgs {
    /// Offset into the KABI shared buffer where the sockaddr is stored.
    pub addr_buf: SocketSlotHandle,
    /// Length of the sockaddr (sizeof(struct sockaddr_in) = 16,
    /// sizeof(struct sockaddr_in6) = 28, sizeof(struct sockaddr_un) = 110).
    pub addr_len: u32,
    /// Explicit padding to 16-byte alignment (prevent info disclosure).
    pub _pad: u32,
}
const_assert!(core::mem::size_of::<BindArgs>() == 16);

/// listen() arguments. Size: 4 bytes.
#[repr(C)]
pub struct ListenArgs {
    /// Maximum pending connection queue length. Clamped to
    /// `net.core.somaxconn` (default 4096).
    pub backlog: u32,
}
const_assert!(core::mem::size_of::<ListenArgs>() == 4);

/// accept() arguments. Size: 8 bytes.
/// Flags are in the header `flags` field (SOCK_NONBLOCK, SOCK_CLOEXEC
/// from accept4).
#[repr(C)]
pub struct AcceptArgs {
    /// DMA buffer handle for writing the peer address on completion.
    pub addr_buf: SocketSlotHandle,
}
const_assert!(core::mem::size_of::<AcceptArgs>() == 8);

/// connect() arguments. Size: 16 bytes.
#[repr(C)]
pub struct ConnectArgs {
    /// Offset into KABI shared buffer for the destination sockaddr.
    pub addr_buf: SocketSlotHandle,
    /// Length of the sockaddr.
    pub addr_len: u32,
    /// Explicit padding.
    pub _pad: u32,
}
const_assert!(core::mem::size_of::<ConnectArgs>() == 16);

/// shutdown() arguments. Size: 4 bytes.
#[repr(C)]
pub struct ShutdownArgs {
    /// SHUT_RD=0, SHUT_WR=1, SHUT_RDWR=2.
    pub how: u32,
}
const_assert!(core::mem::size_of::<ShutdownArgs>() == 4);

/// close() arguments. No extra fields needed (sock_handle is in header).
#[repr(C)]
pub struct CloseArgs {
    /// Unused. Close operates solely on the header's sock_handle.
    pub _unused: u32,
}
const_assert!(core::mem::size_of::<CloseArgs>() == 4);

/// socketpair() arguments. Size: 12 bytes.
///
/// Creates two connected sockets. Both sock_handles are returned in the
/// response: aux[0..8] = sock_handle_0, aux[8..16] = sock_handle_1.
/// The header's sock_handle field is unused (set to 0) because no socket
/// exists prior to this call.
#[repr(C)]
pub struct SocketPairArgs {
    /// Address family (AF_UNIX=1 is the primary use case; AF_INET and
    /// AF_INET6 are also valid per Linux socketpair(2)).
    pub domain: u32,
    /// Socket type (SOCK_STREAM=1, SOCK_DGRAM=2).
    /// Upper bits may contain SOCK_NONBLOCK and SOCK_CLOEXEC.
    pub sock_type: u32,
    /// Protocol (0 = default for the given domain + type).
    pub protocol: u32,
}
const_assert!(core::mem::size_of::<SocketPairArgs>() == 12);
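The two handles a SocketPair completion carries can be unpacked from the response's 16 aux bytes; a minimal sketch (little-endian layout and the function name are assumptions):

```rust
/// Decode the two sock_handles a SocketPair response returns in
/// aux[0..8] and aux[8..16].
pub fn decode_socketpair_aux(aux: &[u8; 16]) -> (u64, u64) {
    let h0 = u64::from_le_bytes(aux[0..8].try_into().unwrap());
    let h1 = u64::from_le_bytes(aux[8..16].try_into().unwrap());
    (h0, h1)
}
```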

/// Socket ioctl arguments. Size: 24 bytes.
///
/// Handles SIOC* ioctls that Linux routes through the VFS ioctl path
/// for socket fds. UmkaOS dispatches these directly to umka-net.
#[repr(C)]
pub struct SockIoctlArgs {
    /// ioctl request number (SIOCGIFADDR=0x8915, SIOCGIFFLAGS=0x8913,
    /// SIOCGIFCONF=0x8912, SIOCADDRT=0x890B, SIOCDELRT=0x890C, etc.).
    /// Linux ABI values — must match exactly.
    pub request: u32,
    /// Explicit padding after u32 for SocketSlotHandle alignment.
    pub _pad: u32,
    /// DMA buffer containing the ioctl argument (struct ifreq, struct
    /// rtentry, etc.). umka-net reads and/or writes this buffer depending
    /// on the ioctl direction (read/write/readwrite). The buffer must be
    /// large enough for the specific ioctl's argument structure.
    pub arg_buf: SocketSlotHandle,
    /// Length of the argument buffer in bytes.
    pub arg_len: u32,
    /// Explicit padding.
    pub _pad2: u32,
}
const_assert!(core::mem::size_of::<SockIoctlArgs>() == 24);

/// sendmsg() / sendto() / send() arguments. Size: 40 bytes.
///
/// Data payload is in the KABI shared buffer at `data_buf`. Tier 0
/// has already performed copy_from_user() into this buffer (or pinned
/// user pages for MSG_ZEROCOPY).
///
/// **Layout** (all u64 fields at 8-byte aligned offsets):
/// offset 0:  data_buf (8)
/// offset 8:  dst_addr_buf (8)
/// offset 16: cmsg_buf (8)
/// offset 24: data_len (4) + dst_addr_len (4)
/// offset 32: cmsg_len (4) + _pad (4)
/// Total: 40 bytes.
#[repr(C)]
pub struct SendMsgArgs {
    /// DMA buffer handle containing the payload to send.
    /// For MSG_ZEROCOPY: this is a scatter-gather descriptor referencing
    /// pinned user pages (see [Section 16.5](#netbuf-packet-buffer--zero-copy-domain-crossing-msgzerocopy)).
    pub data_buf: SocketSlotHandle,
    /// Destination address buffer (for sendto/sendmsg with msg_name).
    /// SOCKET_SLOT_HANDLE_NONE if connected socket (no destination needed).
    pub dst_addr_buf: SocketSlotHandle,
    /// Ancillary data (cmsg) buffer. SOCKET_SLOT_HANDLE_NONE if no cmsg.
    pub cmsg_buf: SocketSlotHandle,
    /// Total payload length in bytes.
    pub data_len: u32,
    /// Destination address length. 0 for connected sockets.
    pub dst_addr_len: u32,
    /// Ancillary data length.
    pub cmsg_len: u32,
    /// Explicit padding.
    pub _pad: u32,
}
const_assert!(core::mem::size_of::<SendMsgArgs>() == 40);

/// recvmsg() / recvfrom() / recv() arguments. Size: 32 bytes.
///
/// **Layout**:
/// offset 0:  data_buf (8)
/// offset 8:  src_addr_buf (8)
/// offset 16: cmsg_buf (8)
/// offset 24: data_max_len (4) + cmsg_max_len (4)
/// Total: 32 bytes.
#[repr(C)]
pub struct RecvMsgArgs {
    /// DMA buffer handle where umka-net writes received data.
    pub data_buf: SocketSlotHandle,
    /// DMA buffer for writing source address (recvfrom/recvmsg msg_name).
    /// SOCKET_SLOT_HANDLE_NONE if caller does not want source address.
    pub src_addr_buf: SocketSlotHandle,
    /// DMA buffer for ancillary data (cmsg) output.
    pub cmsg_buf: SocketSlotHandle,
    /// Maximum bytes to receive. Clamped by the Tier 0 dispatcher to
    /// `min(data_max_len, KABI_SHARED_SLOT_SIZE)` where `KABI_SHARED_SLOT_SIZE`
    /// is 4096 bytes (one page). This bounds the lock hold time in umka-net's
    /// `tcp_recvmsg()` to ~1 μs per RecvMsg command. For `recv()` calls
    /// requesting more than 4 KB, Tier 0 issues **multiple** RecvMsg KABI
    /// requests in a loop, each copying up to 4 KB. The per-iteration domain
    /// crossing overhead is ~23 cycles × 2 = ~46 cycles (~15 ns at 3 GHz).
    /// For a 1 MB recv: 256 iterations × (~1 μs lock + ~15 ns crossing) ≈ 260 μs.
    pub data_max_len: u32,
    /// Maximum cmsg buffer length.
    pub cmsg_max_len: u32,
}
const_assert!(core::mem::size_of::<RecvMsgArgs>() == 32);

/// Maximum data transfer per single RecvMsg/SendMsg KABI command.
/// One page (4096 bytes). Bounds the per-socket SpinLock hold time.
/// The Tier 0 socket dispatch layer clamps `data_max_len` and `data_len`
/// to this value and issues multiple commands for larger transfers.
pub const KABI_SHARED_SLOT_SIZE: u32 = 4096;
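The clamping loop described in `data_max_len`'s doc comment can be sketched as follows — a user-space model of how Tier 0 splits a large recv() into per-slot commands (`recv_chunk_lens` is an illustrative name):

```rust
/// One page per KABI command, as specified above.
pub const KABI_SHARED_SLOT_SIZE: u32 = 4096;

/// Per-command lengths Tier 0 issues for a recv() of `total` bytes: each
/// RecvMsg KABI command carries at most one slot (4 KB), bounding the
/// per-socket lock hold time in umka-net's tcp_recvmsg().
pub fn recv_chunk_lens(total: u32) -> Vec<u32> {
    let mut lens = Vec::new();
    let mut remaining = total;
    while remaining > 0 {
        let n = remaining.min(KABI_SHARED_SLOT_SIZE);
        lens.push(n);
        remaining -= n;
    }
    lens
}
```

For the 1 MB example in the doc comment, this yields 256 full-slot commands, matching the ~260 μs estimate (256 iterations of ~1 μs lock hold plus crossing overhead).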

/// sendmmsg() arguments. Size: 16 bytes.
#[repr(C)]
pub struct SendMmsgArgs {
    /// DMA buffer containing an array of `SendMmsgEntry` descriptors.
    /// Each entry describes one message (data_buf, len, dst_addr, etc.).
    pub entries_buf: SocketSlotHandle,
    /// Number of messages (1..=1024, matching Linux UIO_MAXIOV).
    pub vlen: u32,
    /// Explicit padding.
    pub _pad: u32,
}
const_assert!(core::mem::size_of::<SendMmsgArgs>() == 16);

/// recvmmsg() arguments. Size: 24 bytes.
#[repr(C)]
pub struct RecvMmsgArgs {
    /// DMA buffer for array of `RecvMmsgEntry` result descriptors.
    pub entries_buf: SocketSlotHandle,
    /// Timeout in nanoseconds (0 = no timeout). The timeout applies to
    /// the entire batch, not individual messages (matching Linux semantics).
    pub timeout_ns: u64,
    /// Maximum number of messages to receive.
    pub vlen: u32,
    /// Explicit padding.
    pub _pad: u32,
}
const_assert!(core::mem::size_of::<RecvMmsgArgs>() == 24);

/// setsockopt() arguments. Size: 24 bytes.
#[repr(C)]
pub struct SetSockOptArgs {
    /// DMA buffer containing the option value.
    pub optval_buf: SocketSlotHandle,
    /// Option level (SOL_SOCKET=1, SOL_TCP=6, SOL_UDP=17, etc.).
    pub level: u32,
    /// Option name (SO_REUSEADDR=2, TCP_NODELAY=1, etc.).
    pub optname: u32,
    /// Option value length.
    pub optlen: u32,
    /// Explicit padding.
    pub _pad: u32,
}
const_assert!(core::mem::size_of::<SetSockOptArgs>() == 24);

/// getsockopt() arguments. Size: 24 bytes.
#[repr(C)]
pub struct GetSockOptArgs {
    /// DMA buffer for writing the option value.
    pub optval_buf: SocketSlotHandle,
    /// Option level.
    pub level: u32,
    /// Option name.
    pub optname: u32,
    /// Maximum output length.
    pub optlen: u32,
    /// Explicit padding.
    pub _pad: u32,
}
const_assert!(core::mem::size_of::<GetSockOptArgs>() == 24);

/// getsockname() arguments. Size: 8 bytes.
#[repr(C)]
pub struct GetSockNameArgs {
    /// DMA buffer for writing the local address.
    pub addr_buf: SocketSlotHandle,
}
const_assert!(core::mem::size_of::<GetSockNameArgs>() == 8);

/// getpeername() arguments. Size: 8 bytes.
#[repr(C)]
pub struct GetPeerNameArgs {
    /// DMA buffer for writing the peer address.
    pub addr_buf: SocketSlotHandle,
}
const_assert!(core::mem::size_of::<GetPeerNameArgs>() == 8);

/// poll() arguments. Size: 4 bytes.
#[repr(C)]
pub struct PollArgs {
    /// Requested events bitmask (POLLIN=0x0001, POLLOUT=0x0004, etc.).
    pub events: u32,
}
const_assert!(core::mem::size_of::<PollArgs>() == 4);

/// epoll_ctl arguments (ADD/MOD/DEL). Size: 24 bytes.
///
/// **Layout**: u64 fields first for natural alignment.
/// offset 0:  epoll_entry_id (8)
/// offset 8:  user_data (8)
/// offset 16: events (4) + _pad (4)
/// Total: 24 bytes.
#[repr(C)]
pub struct EpollCtlArgs {
    /// Epoll interest list entry ID (Tier 0 assigned, unique per epoll fd).
    pub epoll_entry_id: u64,
    /// Opaque user data returned with epoll_wait events.
    pub user_data: u64,
    /// Events to monitor (EPOLLIN, EPOLLOUT, EPOLLET, EPOLLONESHOT, etc.).
    pub events: u32,
    /// Explicit padding.
    pub _pad: u32,
}
const_assert!(core::mem::size_of::<EpollCtlArgs>() == 24);

/// sendfile() arguments. Size: 24 bytes.
#[repr(C)]
pub struct SendfileArgs {
    /// Source file page cache token. Tier 0 resolves the input fd to a
    /// page cache reference and passes it as an opaque handle. umka-net
    /// reads directly from the page cache pages (shared read-only via
    /// PKEY_SHARED) without copying.
    pub page_cache_token: u64,
    /// Offset in source file (bytes).
    pub offset: u64,
    /// Number of bytes to transfer.
    pub count: u64,
}
const_assert!(core::mem::size_of::<SendfileArgs>() == 24);

/// splice() arguments (both directions). Size: 32 bytes.
#[repr(C)]
pub struct SpliceArgs {
    /// Pipe buffer handle (kernel-internal pipe identifier).
    pub pipe_handle: u64,
    /// Offset in the socket's stream (for lseek-capable sockets; 0 for TCP).
    pub offset: u64,
    /// Maximum bytes to transfer.
    pub len: u64,
    /// SPLICE_F_* flags (SPLICE_F_MOVE=1, SPLICE_F_NONBLOCK=2, etc.).
    pub splice_flags: u32,
    /// Explicit padding.
    pub _pad: u32,
}
// SpliceArgs: pipe_handle(8) + offset(8) + len(8) + splice_flags(4) + _pad(4) = 32 bytes.
const_assert!(core::mem::size_of::<SpliceArgs>() == 32);

/// io_uring batched operations. Size: 16 bytes.
#[repr(C)]
pub struct IoUringBatchArgs {
    /// DMA buffer containing an array of `SocketRingCmd` sub-entries.
    /// Each sub-entry is a complete 128-byte SocketRingCmd.
    pub batch_buf: SocketSlotHandle,
    /// Number of operations in the batch (1..=256).
    pub count: u32,
    /// Explicit padding.
    pub _pad: u32,
}
const_assert!(core::mem::size_of::<IoUringBatchArgs>() == 16);

/// Sentinel value: no slot handle (used for optional buffer fields).
pub const SOCKET_SLOT_HANDLE_NONE: SocketSlotHandle = SocketSlotHandle(u64::MAX);

16.4.1.3 sendmmsg / recvmmsg Batch Entry Structs

/// Per-message descriptor for sendmmsg(). Stored in the DMA buffer
/// referenced by `SendMmsgArgs.entries_buf`.
#[repr(C)]
pub struct SendMmsgEntry {
    /// DMA buffer containing this message's payload.
    pub data_buf: SocketSlotHandle,      // 8 bytes, offset 0
    /// Payload length.
    pub data_len: u32,                   // 4 bytes, offset 8
    /// Explicit alignment padding for dst_addr_buf (u64).
    pub _pad0: u32,                      // 4 bytes, offset 12
    /// Destination address buffer (per-message, for unconnected sockets).
    pub dst_addr_buf: SocketSlotHandle,  // 8 bytes, offset 16
    /// Destination address length.
    pub dst_addr_len: u32,               // 4 bytes, offset 24
    /// Per-message flags (e.g., MSG_DONTWAIT).
    pub flags: u32,                      // 4 bytes, offset 28
    /// Trailing padding to 8-byte struct alignment boundary.
    pub _pad1: [u8; 8],                  // 8 bytes, offset 32
    // Total: 8+4+4+8+4+4+8 = 40 bytes. No implicit padding.
}
const_assert!(core::mem::size_of::<SendMmsgEntry>() == 40);

/// Per-message result descriptor for recvmmsg(). Written by umka-net
/// into the DMA buffer referenced by `RecvMmsgArgs.entries_buf`.
#[repr(C)]
pub struct RecvMmsgEntry {
    /// DMA buffer where this message's payload was written.
    pub data_buf: SocketSlotHandle,      // 8 bytes, offset 0
    /// Bytes actually received. 0 if no data for this slot.
    pub data_len: u32,                   // 4 bytes, offset 8
    /// Explicit alignment padding for src_addr_buf (u64).
    pub _pad0: u32,                      // 4 bytes, offset 12
    /// Source address buffer (written by umka-net).
    pub src_addr_buf: SocketSlotHandle,  // 8 bytes, offset 16
    /// Source address length.
    pub src_addr_len: u32,               // 4 bytes, offset 24
    /// Result flags (MSG_TRUNC, MSG_CTRUNC, etc.).
    pub msg_flags: u32,                  // 4 bytes, offset 28
    /// Trailing padding to 8-byte struct alignment boundary.
    pub _pad1: [u8; 8],                  // 8 bytes, offset 32
    // Total: 8+4+4+8+4+4+8 = 40 bytes. No implicit padding.
}
const_assert!(core::mem::size_of::<RecvMmsgEntry>() == 40);
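The 40-byte size and 8-byte alignment claims above can be machine-checked with a standalone reproduction. This is an illustrative mirror, not the kernel definition; `SocketSlotHandle` is modeled as a bare `u64` newtype, matching its 8-byte KABI width:

```rust
use core::mem::{align_of, size_of};

// Standalone mirror of the KABI layout above.
#[repr(C)]
pub struct SocketSlotHandle(pub u64);

#[repr(C)]
pub struct SendMmsgEntry {
    pub data_buf: SocketSlotHandle,     // offset 0
    pub data_len: u32,                  // offset 8
    pub _pad0: u32,                     // offset 12
    pub dst_addr_buf: SocketSlotHandle, // offset 16
    pub dst_addr_len: u32,              // offset 24
    pub flags: u32,                     // offset 28
    pub _pad1: [u8; 8],                 // offset 32
}

/// Returns (size, alignment); expected (40, 8) with no implicit padding.
pub fn send_mmsg_entry_layout() -> (usize, usize) {
    (size_of::<SendMmsgEntry>(), align_of::<SendMmsgEntry>())
}
```

`RecvMmsgEntry` has identical field widths, so a single check covers both layouts.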

16.4.1.4 SocketRingResp: Response Message

/// Response from umka-net to Tier 0 for a socket operation.
///
/// 64-byte fixed-size entries (one cache line). Aligned to cache line
/// on the response ring for false-sharing avoidance.
///
/// **Status encoding**: Same convention as VFS
/// ([Section 14.2](14-vfs.md#vfs-ring-buffer-protocol--vfs-response-message)):
/// - `status >= 0`: success (bytes transferred, new sock_handle, etc.)
/// - `status == -4095..-1`: negated Linux errno
/// - `status == SOCK_RESP_WOULD_BLOCK`: operation would block
/// - `status == SOCK_RESP_SHARED_BUF_FULL`: KABI shared buffer exhausted
#[repr(C, align(64))]
pub struct SocketRingResp {
    /// Request ID this response completes.
    pub request_id: u64,

    /// Status code.
    pub status: i64,

    /// umka-net generation at response time. Tier 0 discards responses
    /// from pre-crash generations (see crash recovery below).
    pub net_generation: u64,

    /// Operation-specific auxiliary data.
    /// - Socket: new sock_handle in aux[0..8]
    /// - Accept: new sock_handle in aux[0..8]
    /// - RecvMsg: msg_flags in aux[0..4], src_addr_len in aux[4..8]
    /// - GetSockOpt: actual optlen in aux[0..4]
    /// - SendMmsg: number of messages actually sent in aux[0..4]
    /// - RecvMmsg: number of messages actually received in aux[0..4]
    /// - Poll: ready events bitmask in aux[0..4]
    /// - All others: zero
    pub aux: [u8; 32],

    /// Padding to 64-byte cache line.
    pub _pad: [u8; 8],
}
const_assert!(core::mem::size_of::<SocketRingResp>() == 64);

/// Status sentinel: operation would block. Tier 0 decides whether to
/// sleep the calling task or return EAGAIN based on socket flags.
pub const SOCK_RESP_WOULD_BLOCK: i64 = i64::MIN;

/// Status sentinel: KABI shared buffer full. Tier 0 must drain pending
/// copies and retry.
pub const SOCK_RESP_SHARED_BUF_FULL: i64 = i64::MIN + 1;
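As a minimal sketch of how Tier 0 might decode the status field, assuming the encoding and sentinels above (the `SockStatus` enum and `decode_status` helper are illustrative aids, not part of the KABI):

```rust
// Sentinels restated so the sketch is self-contained.
pub const SOCK_RESP_WOULD_BLOCK: i64 = i64::MIN;
pub const SOCK_RESP_SHARED_BUF_FULL: i64 = i64::MIN + 1;

/// Decoded form of SocketRingResp.status (illustrative helper).
#[derive(Debug, PartialEq)]
pub enum SockStatus {
    Ok(u64),       // bytes transferred, new sock_handle, etc.
    Errno(i32),    // positive Linux errno (the ring carries it negated)
    WouldBlock,    // Tier 0 decides: sleep the task or return EAGAIN
    SharedBufFull, // Tier 0 drains pending copies and retries
}

pub fn decode_status(status: i64) -> SockStatus {
    match status {
        SOCK_RESP_WOULD_BLOCK => SockStatus::WouldBlock,
        SOCK_RESP_SHARED_BUF_FULL => SockStatus::SharedBufFull,
        s if s >= 0 => SockStatus::Ok(s as u64),
        s if (-4095..0).contains(&s) => SockStatus::Errno(-s as i32),
        // Out-of-range encodings indicate a protocol bug; surface EINVAL.
        _ => SockStatus::Errno(22),
    }
}
```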

16.4.2 Per-CPU Socket Ring Extension

Following the VFS per-CPU ring pattern (Section 14.3), socket dispatch uses a SocketRingSet with N SPSC rings per network namespace. Under high concurrency (many CPUs doing sendmsg/recvmsg on different sockets), a single ring becomes a serialization bottleneck identical to the PostgreSQL checkpoint scenario described in Section 14.3.

16.4.2.1 SocketRingSet: Per-Namespace Ring Collection

/// Per-network-namespace collection of socket dispatch ring pairs.
///
/// Each ring pair is a full SPSC channel (request + response + doorbell +
/// completion wait queue). Rings are indexed by CPU group, identical to
/// the VFS pattern.
///
/// **Placement**: Tier 0 (Core, Nucleus). Owned by the NetNamespace.
/// Ring data regions are in shared memory (PKEY_SHARED, readable by
/// both Tier 0 and Tier 1).
///
/// **One SocketRingSet per network namespace**: sockets in different
/// namespaces use different ring sets. This prevents cross-namespace
/// information leakage via ring buffer side channels and allows
/// per-namespace ring count tuning.
/// **Kernel-internal, not KABI**: contains raw pointers and never
/// crosses the domain boundary.
#[repr(C)]
pub struct SocketRingSet {
    /// Array of ring pairs. Length is `ring_count`.
    /// Allocated from kernel slab at namespace creation time.
    /// Maximum N = MAX_SOCKET_RINGS (256).
    ///
    /// SAFETY: Slab-allocated at namespace creation. Valid for the
    /// lifetime of the NetNamespace. Freed during namespace teardown
    /// after all rings are drained. Raw pointer (rather than a slice)
    /// to keep the `#[repr(C)]` layout stable; the struct itself is
    /// kernel-internal and never crosses the domain boundary.
    /// `ring_count` is the element count.
    pub rings: *const SocketRingPair,

    /// Number of active ring pairs. Range: 1..=MAX_SOCKET_RINGS.
    /// Set at namespace creation time.
    pub ring_count: u16,

    /// Explicit alignment padding for cpu_to_ring (pointer, requires
    /// 8-byte alignment). ring_count(u16) at offset 8 leaves offset 10;
    /// this 6-byte pad brings us to offset 16.
    pub _pad0: [u8; 6],

    /// CPU-to-ring mapping table. Index: CPU ID (0..nr_cpu_ids).
    /// Value: ring index (0..ring_count).
    /// Same hot-path lookup pattern as VFS: one Relaxed atomic load.
    ///
    /// SAFETY: Same lifetime as `rings`. Raw pointer for `#[repr(C)]`.
    pub cpu_to_ring: *const AtomicU16,

    /// Number of entries in cpu_to_ring (== nr_cpu_ids).
    pub cpu_to_ring_len: u32,

    /// Ring allocation granularity.
    pub granularity: RingGranularity,

    /// Coalesced doorbell for cross-ring notification batching.
    pub coalesced_doorbell: CoalescedDoorbell,

    /// Global request_id generator. Cache-line padded to avoid
    /// false sharing with other fields.
    pub next_request_id: CacheLinePadded<AtomicU64>,

    /// Ring set lifecycle state.
    ///   0 = Active
    ///   1 = Recovering (umka-net crash recovery in progress)
    ///   2 = Draining (namespace teardown)
    pub state: AtomicU8,

    /// Trailing padding after `state` (AtomicU8). The implementing agent
    /// must verify this padding with a `const_assert!` after all field
    /// types are resolved.
    pub _pad1: [u8; 7],

    /// Shared buffer pool for socket data transfer. Both Tier 0 and
    /// Tier 1 (umka-net) can access this region (PKEY_SHARED).
    /// Used by `SocketSlotHandle::to_ptr()` and `from_ptr()` for
    /// pointer ↔ slot-index conversion on every sendmsg/recvmsg.
    /// Allocated at namespace creation alongside the ring pairs.
    pub shared_buf: SocketSharedBufPool,
}

/// Maximum socket rings per namespace. Matches VFS MAX_VFS_RINGS.
pub const MAX_SOCKET_RINGS: usize = 256;

/// Per-ring pair for socket dispatch.
pub struct SocketRingPair {
    /// Request ring: Tier 0 (umka-core) -> Tier 1 (umka-net).
    /// SPSC: one CPU is the producer (via cpu_to_ring mapping),
    /// umka-net consumer thread is the consumer.
    /// Ring depth: 512 entries (configurable via sysctl
    /// `net.core.socket_ring_depth`, default 512).
    pub request_ring: DomainRingBuffer,

    /// Response ring: Tier 1 (umka-net) -> Tier 0 (umka-core).
    /// SPSC: umka-net produces, Tier 0 consumes.
    pub response_ring: DomainRingBuffer,

    /// Doorbell register for this ring.
    pub doorbell: DoorbellRegister,

    /// Completion wait queue. Tier 0 tasks waiting for synchronous
    /// socket operations block here. Woken by response ring consumer.
    /// See [Section 3.6](03-concurrency.md#lock-free-data-structures--completion-one-shot-or-multi-shot-signaling-primitive)
    /// for the formal `Completion` primitive definition. Socket rings use
    /// `WaitQueue` directly (rather than `Completion`) because socket
    /// operations are multi-shot (multiple responses per ring lifecycle),
    /// not one-shot.
    pub completion: WaitQueue,

    /// Per-ring response consumer statistics.
    pub stats: SocketRingStats,
}

/// Per-ring statistics (cold path reads via /proc/net/socket_rings).
pub struct SocketRingStats {
    /// Total operations submitted on this ring.
    pub ops_submitted: AtomicU64,
    /// Total operations completed.
    pub ops_completed: AtomicU64,
    /// Operations that returned WOULD_BLOCK.
    pub ops_would_block: AtomicU64,
    /// Ring full drops (request_ring was full when producer tried to enqueue).
    pub ring_full_drops: AtomicU64,
}

16.4.2.2 Ring Selection Hot Path

/// Select the socket ring for the current CPU.
///
/// Called on every socket syscall that crosses the domain boundary.
/// Cost: 1 atomic load (Relaxed) + bounds check. ~1-3 cycles.
/// Identical pattern to VFS select_ring().
///
/// **Preemption**: Not disabled. If the task migrates between CPU ID
/// read and ring enqueue, the operation targets a non-local ring.
/// Performance-only issue (one cache line bounce), not correctness.
/// The per-ring SPSC invariant is maintained because each ring has
/// a producer lock (spinlock for the rare multi-producer case when
/// CPU migration occurs).
#[inline(always)]
fn select_socket_ring(ring_set: &SocketRingSet) -> &SocketRingPair {
    let cpu = arch::current::cpu::smp_processor_id();
    let ring_idx = if cpu < ring_set.cpu_to_ring_len as usize {
        // SAFETY: cpu_to_ring is valid for cpu_to_ring_len elements.
        unsafe { &*ring_set.cpu_to_ring.add(cpu) }
            .load(Ordering::Relaxed) as usize
    } else {
        0
    };
    // The mapping table must never point past ring_count; assert the
    // invariant in debug builds, clamp defensively in release builds.
    debug_assert!(ring_idx < ring_set.ring_count as usize);
    let idx = if ring_idx < ring_set.ring_count as usize {
        ring_idx
    } else {
        0
    };
    // SAFETY: rings is valid for ring_count elements.
    unsafe { &*ring_set.rings.add(idx) }
}

/// Generate a namespace-globally unique request ID.
/// Same pattern as VFS alloc_request_id.
#[inline]
fn alloc_socket_request_id(ring_set: &SocketRingSet) -> u64 {
    ring_set.next_request_id.fetch_add(1, Ordering::Relaxed)
}

16.4.2.3 Ring Negotiation at Namespace Creation

/// Negotiate socket ring count at network namespace creation time.
///
/// The ring count is determined by:
/// 1. The system's CPU topology (nr_cpu_ids, NUMA nodes, LLC groups).
/// 2. The `net.core.socket_ring_granularity` sysctl (PerCpu, PerNuma,
///    PerLlc, Fixed, Single — same enum as VFS RingGranularity).
/// 3. The `net.core.socket_ring_count` sysctl (for Fixed granularity).
///
/// Default: PerLlc granularity. This balances parallelism and memory.
/// On a 64-core AMD EPYC with 8 CCXs: 8 rings. On a 4-core desktop: 1 ring.
///
/// Memory: each SocketRingPair is ~128 KB (512 entries x 128 bytes request
/// + 512 entries x 64 bytes response + metadata). 8 rings = ~1 MB per
/// namespace. Acceptable for servers; Single mode for embedded.
fn negotiate_socket_rings(
    ns: &NetNamespace,
) -> SocketRingSet {
    let granularity = ns.sysctl_ring_granularity();
    let ring_count = compute_ring_count(granularity);
    let cpu_map = build_cpu_to_ring_map(ring_count, granularity);
    // ... allocate ring pairs, doorbells, etc.
}
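The elided ring-count computation can be sketched as follows. This is a simplified model: the `Topology` struct is hypothetical, and `Fixed` carries its count inline rather than reading the `net.core.socket_ring_count` sysctl:

```rust
pub const MAX_SOCKET_RINGS: usize = 256;

/// Same variants as the VFS RingGranularity enum; Fixed carries its
/// count inline here for a self-contained sketch.
pub enum RingGranularity {
    PerCpu,
    PerNuma,
    PerLlc,
    Fixed(usize),
    Single,
}

/// Hypothetical topology snapshot (the real kernel reads this from
/// the CPU topology subsystem).
pub struct Topology {
    pub nr_cpu_ids: usize,
    pub numa_nodes: usize,
    pub llc_groups: usize,
}

pub fn compute_ring_count(g: &RingGranularity, topo: &Topology) -> usize {
    let n = match g {
        RingGranularity::PerCpu => topo.nr_cpu_ids,
        RingGranularity::PerNuma => topo.numa_nodes,
        RingGranularity::PerLlc => topo.llc_groups,
        RingGranularity::Fixed(n) => *n,
        RingGranularity::Single => 1,
    };
    n.clamp(1, MAX_SOCKET_RINGS) // at least 1 ring, at most 256
}
```

With the doc-comment examples: a 64-core EPYC with 8 CCXs yields 8 rings under PerLlc; a 4-core desktop with one LLC yields 1.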

16.4.3 Dispatch Flow

16.4.3.1 sendmsg() Path (Tier 0 -> Tier 1 -> Tier 0)

1. Userspace calls sendmsg(fd, &msghdr, flags).

2. Syscall dispatch (umka-sysapi, Tier 0):
   a. Resolve fd -> SocketRef via FdTable.
   b. Validate capability ([Section 9.1](09-security.md#capability-based-foundation)).
   c. copy_from_user(msghdr.msg_iov) into KABI shared buffer.
      - If MSG_ZEROCOPY and SO_ZEROCOPY enabled and len >= 4KB:
        Pin user pages instead (get_user_pages_fast).
        Build scatter-gather SocketSlotHandle referencing pinned pages.
      - Otherwise: memcpy into shared buffer slot.

3. Select ring: select_socket_ring(ns.ring_set).
   Allocate request_id: alloc_socket_request_id(ns.ring_set).

4. Construct SocketRingCmd {
       request_id,
       opcode: SocketOpcode::SendMsg,
       sock_handle: socket.handle,
       flags: msg_flags,
       args: SendMsgArgs {
           data_buf: shared_buf_handle,
           data_len: total_iov_len,
           dst_addr_buf: addr_handle_or_NONE,
           dst_addr_len,
           cmsg_buf: cmsg_handle_or_NONE,
           cmsg_len,
       },
   }.

5. Enqueue on ring.request_ring. Ring doorbell
   (via coalesced_doorbell — may be deferred if io_uring batch
   has more SQEs pending).

6. DOMAIN SWITCH: Tier 0 -> Tier 1 (umka-net).
   umka-net consumer thread dequeues SocketRingCmd.

7. umka-net (Tier 1) processes SendMsg (NON-BLOCKING):
   a. Lookup socket: sock_table.get(sock_handle) -> &TcpCb or &UdpCb.
   b. Read payload from KABI shared buffer (or pinned pages for zerocopy).
   c. For TCP: tcp_sendmsg() — copy into send buffer, segment, initiate TX.
      For UDP: udp_sendmsg() — build datagram, route lookup, transmit.
   d. Post SocketRingResp {
          request_id,
          status: bytes_sent (>= 0) or -errno,
          net_generation,
      } to ring.response_ring.
   e. If send buffer full (TCP backpressure):
      Post SocketRingResp { status: SOCK_RESP_WOULD_BLOCK }.
      Tier 1 NEVER blocks (see Tier 1 Non-Blocking Invariant,
      [Section 16.2](#network-stack-architecture--tier-1-non-blocking-invariant)).

8. DOMAIN SWITCH: Tier 1 -> Tier 0.
   Tier 0 reads SocketRingResp from response ring.
   a. If success: return bytes_sent to userspace syscall.
   b. If WOULD_BLOCK and !(flags & MSG_DONTWAIT):
      Block calling task on socket write_wait queue.
      Woken by sk_write_space() when ACK frees buffer space.
      On wake: re-enter Tier 1 with retry (go to step 4).
   c. If WOULD_BLOCK and MSG_DONTWAIT: return EAGAIN.
   d. If error: return -errno.

16.4.3.2 io_uring Batch Path

1. io_uring submission thread detects N consecutive IORING_OP_SEND /
   IORING_OP_SENDMSG SQEs targeting sockets in the same namespace.

2. Tier 0 coalesces into a single SocketRingCmd with
   opcode = IoUringBatch:
   a. Allocates a DMA buffer large enough for N x 128-byte sub-entries.
   b. Copies each SQE's SocketRingCmd into the batch buffer.
   c. Posts ONE SocketRingCmd { opcode: IoUringBatch, count: N }.
   d. Rings doorbell ONCE.

3. ONE domain switch for the entire batch.

4. umka-net unpacks the batch:
   for i in 0..count:
       sub_cmd = batch_buf[i]
       result[i] = process_socket_cmd(sub_cmd)

5. Posts N SocketRingResp entries (one per sub-operation)
   to the response ring. Each has the sub-operation's request_id.

6. ONE domain switch back to Tier 0.

7. Tier 0 matches each response to the originating SQE and posts CQEs.

Performance (x86-64): N=16, the 23-cycle domain switch amortizes to
23/16 ≈ 1.4 cycles per op. Linux sock_sendmsg indirect call chain
(with retpoline): ~15-20 cycles per op. UmkaOS saves ~14-19 cycles
per op = NEGATIVE overhead.

Performance (AArch64): N=16, the 80-cycle domain switch amortizes to
80/16 = 5 cycles per op. Linux sock_sendmsg indirect call (no
retpoline): ~5-8 cycles per op. UmkaOS is break-even to slightly
faster (0 to -3 cycles per op). Negative overhead on AArch64 requires
N>=20 or cache/prefetch gains.
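The amortization arithmetic behind both estimates is simply the fixed switch cost divided by the batch length; a worked check of the numbers above (cycle costs are the document's estimates, not measurements):

```rust
/// Amortized per-op domain-switch cost for an N-op batch.
pub fn amortized_cycles(switch_cost_cycles: f64, batch_len: u32) -> f64 {
    switch_cost_cycles / batch_len as f64
}
```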

16.4.3.3 sendmmsg() Path

1. Userspace calls sendmmsg(fd, &mmsghdr[], vlen, flags).

2. Tier 0:
   a. Resolve fd -> SocketRef.
   b. Allocate DMA buffer for vlen SendMmsgEntry descriptors.
   c. For each message i in 0..vlen:
      copy_from_user(mmsghdr[i].msg_iov) into shared buffer slot.
      Fill entries_buf[i] = SendMmsgEntry { data_buf, data_len, ... }.
   d. Post single SocketRingCmd { opcode: SendMmsg, vlen }.
   e. Ring doorbell ONCE.

3. ONE domain switch. umka-net processes all vlen messages.
   Returns count of successfully sent messages in SocketRingResp.aux.

4. Tier 0 fills mmsghdr[i].msg_len for each sent message.
   Returns total messages sent to userspace.

Performance (x86-64): sendmmsg(vlen=44), a typical QUIC batch, costs
ONE domain switch. Per-message overhead: 23/44 = ~0.5 cycles. Linux
sendmmsg does 44 separate sock_sendmsg() calls = 44 x 15-20 = 660-880
cycles of indirect call overhead (retpoline). UmkaOS: ~23 + 44 x 5
(batch processing) = ~243 cycles. ~3x faster.

Performance (AArch64): Same batch. Domain switch: 80/44 = ~1.8 cycles.
Linux: 44 x 5-8 = 220-352 cycles (no retpoline). UmkaOS:
~80 + 44 x 5 = ~300 cycles, i.e. anywhere from ~15% faster (against
Linux's 352-cycle worst case) to ~36% slower (against its 220-cycle
best case). The sendmmsg batch advantage is smaller on AArch64 because
Linux's indirect call cost is lower. Net result depends on
cache/prefetch gains from the ring protocol (estimated ~2-3 cycles/msg
savings from sequential ring layout vs pointer-chasing), bringing
AArch64 to approximate parity.

16.4.4 epoll Cross-Domain Integration

16.4.4.1 Problem

epoll_wait() runs in Tier 0 (umka-core). Socket readiness state (POLLIN when data arrives, POLLOUT when send buffer drains) is known only by umka-net (Tier 1). The challenge: how does Tier 0 learn about readiness changes in Tier 1 without polling?

16.4.4.2 Design: Readiness Doorbell

umka-net signals readiness changes to Tier 0 via a dedicated readiness notification ring (separate from the socket command/response rings). This ring carries lightweight readiness events that Tier 0's epoll implementation consumes.

/// Readiness event from umka-net (Tier 1) -> Tier 0 (epoll subsystem).
///
/// Posted by umka-net whenever a socket's readiness state changes
/// (data arrives, send buffer drains, connection established, error).
///
/// 16 bytes per event. The readiness ring holds 4096 entries = 64 KB.
/// This is small enough to fit in L2 cache on the CPU that runs
/// epoll_wait().
#[repr(C)]
pub struct SocketReadinessEvent {
    /// Socket handle (matches SocketRingCmd.sock_handle).
    /// Tier 0 uses this to look up the epoll interest list entry.
    pub sock_handle: u64,
    /// Ready events bitmask (EPOLLIN=0x001, EPOLLOUT=0x004,
    /// EPOLLERR=0x008, EPOLLHUP=0x010, EPOLLRDHUP=0x2000).
    pub events: u32,
    /// Padding to 16-byte alignment.
    pub _pad: u32,
}
const_assert!(core::mem::size_of::<SocketReadinessEvent>() == 16);

/// Per-namespace readiness notification ring.
///
/// SPSC: umka-net is the sole producer; Tier 0 epoll subsystem is
/// the sole consumer. Separate from the socket command/response rings
/// to avoid head-of-line blocking (readiness events are tiny and
/// latency-critical; socket command processing may be slow).
///
/// When the ring is full, umka-net sets a per-socket "readiness pending"
/// flag (AtomicBool in the socket metadata, Tier 1 side). On the next
/// drain cycle, Tier 0 scans flagged sockets and sends Poll opcodes to
/// re-check readiness. To prevent an O(N) thundering herd when many
/// sockets have pending flags simultaneously:
///   1. Tier 0 scans at most MAX_PENDING_POLL_BATCH (256) flagged sockets
///      per drain cycle. Remaining flagged sockets are deferred to the
///      next cycle. This bounds the worst-case per-cycle Poll opcode
///      storm to 256 domain crossings.
///   2. Poll opcodes for flagged sockets are batched into a single
///      IoUringBatch-style compound ring entry (up to 64 per batch entry),
///      reducing the 256 worst-case polls to ~4 domain crossings.
///   3. The readiness ring depth (4096 entries) is sized to absorb typical
///      burst patterns without overflow. Overflow requires >4096 sockets
///      to change readiness state between two consecutive drain cycles —
///      a scenario that implies the epoll_wait consumer is severely
///      backlogged, and bounded fallback is the correct response.
pub struct ReadinessRing {
    pub ring: DomainRingBuffer,  // entry_size = 16, depth = 4096
    pub doorbell: DoorbellRegister,
}
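The bounded fallback scan described above can be sketched as follows; `flagged` stands in for the set of sockets whose readiness-pending flag is set (the real implementation scans per-socket AtomicBool flags, not a Vec):

```rust
pub const MAX_PENDING_POLL_BATCH: usize = 256;
pub const POLLS_PER_BATCH_ENTRY: usize = 64;

/// One drain cycle. Takes at most MAX_PENDING_POLL_BATCH flagged
/// sockets, leaving the rest for the next cycle, and reports how many
/// compound IoUringBatch-style ring entries the Poll opcodes need.
/// Returns (sockets polled, compound entries posted).
pub fn drain_pending_polls(flagged: &mut Vec<u64>) -> (usize, usize) {
    let take = flagged.len().min(MAX_PENDING_POLL_BATCH);
    let batch: Vec<u64> = flagged.drain(..take).collect();
    // Ceiling division: 256 polls pack into 4 compound entries.
    let entries = (batch.len() + POLLS_PER_BATCH_ENTRY - 1) / POLLS_PER_BATCH_ENTRY;
    (batch.len(), entries)
}
```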

16.4.4.3 epoll_wait() Flow

1. Task calls epoll_wait(epfd, events, maxevents, timeout).

2. Tier 0 (epoll subsystem):
   a. Check the readiness ring for pending SocketReadinessEvents.
   b. For each event, look up the socket in the epoll interest list
      (XArray keyed by sock_handle).
   c. If the interest entry matches (events & interest.events != 0):
      - Level-triggered (default): add to ready list.
      - Edge-triggered (EPOLLET): add to ready list ONLY if this is
        a new edge (events changed since last notification).
        The edge detection flag is a per-interest-entry AtomicU32
        that records the last reported events. A new notification
        matches an edge iff (new_events & ~last_events) != 0.
      - EPOLLONESHOT: add to ready list and disable the interest entry
        locally in Tier 0 by setting a per-entry `armed: AtomicBool` flag
        to `false`. While `armed == false`, Tier 0 suppresses readiness
        events for this entry (readiness ring events from umka-net for this
        sock_handle are checked against `armed` before adding to the ready
        list — if `!armed`, the event is silently discarded). umka-net
        continues to post readiness events normally (it has no knowledge of
        EPOLLONESHOT state — the suppression is entirely Tier 0 local). This
        avoids a domain crossing for the disable operation. Re-arming
        requires `EPOLL_CTL_MOD` from userspace, which sets `armed = true`
        and sends an `EpollCtlMod` opcode to umka-net to refresh the
        interest mask. The `armed` flag is stored in the Tier 0 epoll
        interest list entry (not in umka-net), so it survives umka-net crash.
   d. If ready list is non-empty: fill userspace events[], return.
   e. If ready list is empty and timeout > 0:
      Block on the readiness ring doorbell wait queue.
      Woken when umka-net rings the readiness doorbell.
      On wake: go to step (a).
   f. If timeout == 0: return 0 (no events).

3. Readiness ring producer (umka-net side):
   Whenever a socket's readiness changes (inside Tier 1), umka-net:
   a. Constructs SocketReadinessEvent { sock_handle, events }.
   b. Enqueues on the namespace's ReadinessRing.
   c. Rings the readiness doorbell.
   The doorbell coalescer batches notifications: if multiple sockets
   become ready in the same NAPI poll cycle, a single doorbell covers
   all of them.
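The per-interest-entry filtering in step 2c can be sketched as a single predicate; the `InterestEntry` field set here is a hypothetical minimal subset of the real Tier 0 interest-list entry:

```rust
use std::sync::atomic::{AtomicBool, AtomicU32, Ordering};

/// Minimal model of a Tier 0 epoll interest-list entry (illustrative).
pub struct InterestEntry {
    pub interest: u32,          // events requested via epoll_ctl
    pub edge_triggered: bool,   // EPOLLET
    pub oneshot: bool,          // EPOLLONESHOT
    pub armed: AtomicBool,      // oneshot arming state (Tier 0 local)
    pub last_events: AtomicU32, // last reported events (edge detection)
}

impl InterestEntry {
    pub fn new(interest: u32, edge_triggered: bool, oneshot: bool) -> Self {
        InterestEntry {
            interest,
            edge_triggered,
            oneshot,
            armed: AtomicBool::new(true),
            last_events: AtomicU32::new(0),
        }
    }
}

/// True if a readiness event should be added to the ready list.
pub fn should_report(entry: &InterestEntry, events: u32) -> bool {
    let matched = events & entry.interest;
    if matched == 0 {
        return false;
    }
    // EPOLLONESHOT suppression: silently discard while disarmed.
    if entry.oneshot && !entry.armed.load(Ordering::Acquire) {
        return false;
    }
    // EPOLLET: report only if a new edge appeared since the last report.
    if entry.edge_triggered {
        let last = entry.last_events.swap(matched, Ordering::AcqRel);
        if matched & !last == 0 {
            return false;
        }
    }
    // EPOLLONESHOT: disarm after reporting; EPOLL_CTL_MOD re-arms.
    if entry.oneshot {
        entry.armed.store(false, Ordering::Release);
    }
    true
}
```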

16.4.4.4 Level-Triggered vs Edge-Triggered Semantics

Level-triggered (default): Tier 0 re-checks the socket's readiness on every epoll_wait() call. If the socket still has data (POLLIN) after the application received some bytes, the socket remains on the ready list. Implementation: Tier 0 sends a Poll opcode to umka-net if the interest entry is level-triggered and the application has not consumed all pending events since the last epoll_wait() return. This costs one extra domain crossing per epoll_wait cycle for sockets with persistent readiness.

Edge-triggered (EPOLLET): Tier 0 reports the socket ready exactly once per edge transition. The readiness ring event IS the edge. No re-polling needed. This is the high-performance path: EPOLLET sockets generate exactly one readiness event per state change, with no additional domain crossings.

Performance comparison:

| Mode | Domain crossings per epoll_wait cycle | Typical use |
| --- | --- | --- |
| EPOLLET | 0 (readiness ring drain only) | nginx, HAProxy, high-perf servers |
| Level-triggered | 1 per still-ready socket (re-poll) | Legacy applications |

16.4.4.5 sk_data_ready() and sk_write_space() Cross-Domain Path

When umka-net (Tier 1) detects a readiness change, it posts to the readiness ring via the existing KABI KernelServicesVTable.wake_socket mechanism (Section 16.2). The mechanism is unified:

umka-net detects readiness change (data arrived, buffer drained, etc.)
    |
    v
post SocketReadinessEvent to ReadinessRing
    |
    v
ring readiness doorbell
    |
    v
Tier 0 epoll consumer drains readiness ring
    |
    v
wake tasks blocked on epoll_wait()

The readiness ring is separate from the per-socket wait queue wake (which is used for blocking recv/send). A single socket event may trigger both:

- a ReadinessRing event (for epoll waiters)
- a WaitQueue wake (for threads blocked in recv/send on that socket)

Both are batched at NAPI poll completion time: napi_complete_done() flushes accumulated readiness events and wake requests in a single KABI completion ring post.


16.4.5 Zero-Copy Paths

16.4.5.1 MSG_ZEROCOPY sendmsg()

Integrates with the existing zero-copy domain crossing protocol (Section 16.5). The socket dispatch layer's role is to carry the scatter-gather descriptor through the ring:

1. Tier 0 receives sendmsg(fd, &msg, MSG_ZEROCOPY):
   a. Verify SO_ZEROCOPY enabled on socket.
   b. Pin user pages: get_user_pages_fast(msg_iov).
   c. Build SocketSlotHandle as scatter-gather list of pinned pages.
      (NOT a copy — the handle references physical page frames.)
   d. Post SocketRingCmd { opcode: SendMsg, data_buf: sg_handle,
      flags: MSG_ZEROCOPY }.

2. umka-net reads from pinned pages directly (zero copy across
   the ring — only the 128-byte command was written to the ring,
   not the payload).

3. umka-net queues data for transmission. NIC DMA reads directly
   from the pinned user pages.

4. On TX completion:
   a. umka-net posts SocketRingResp with status = bytes_sent.
   b. Tier 0 posts SO_EE_ORIGIN_ZEROCOPY notification to the
      socket error queue.
   c. Application drains MSG_ERRQUEUE to acknowledge buffer reuse.
   d. Pages unpinned, DMA unmapped.
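The eligibility gate from step 1 (combined with the >= 4 KB threshold from the sendmsg dispatch path in 16.4.3.1) can be sketched as:

```rust
const MSG_ZEROCOPY: u32 = 0x4000000; // Linux flag value
pub const ZEROCOPY_MIN_LEN: usize = 4096; // below this, copying is cheaper

/// True if the send should pin user pages instead of copying the
/// payload into the KABI shared buffer.
pub fn use_zerocopy_path(msg_flags: u32, so_zerocopy: bool, total_len: usize) -> bool {
    msg_flags & MSG_ZEROCOPY != 0 && so_zerocopy && total_len >= ZEROCOPY_MIN_LEN
}
```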

16.4.5.2 sendfile() (File -> Socket)

1. Tier 0 receives sendfile(out_fd, in_fd, offset, count):
   a. Resolve in_fd -> file, out_fd -> socket.
   b. Look up pages in the page cache for the input file.
      On cache hit: pages are already in physical memory.
      On cache miss: trigger readahead, wait for pages.
   c. Build page_cache_token: opaque handle referencing the
      cached pages. Pages are read-only shared via PKEY_SHARED
      (same PKEY domain as the KABI shared buffer).
   d. Post SocketRingCmd { opcode: Sendfile, page_cache_token,
      offset, count }.

2. umka-net reads directly from page cache pages (no copy from
   file to socket buffer — the TCP send path references the
   page cache pages as scatter-gather fragments in the NetBuf).

3. TCP transmits from page cache pages. NIC DMA reads directly
   from page cache.

4. On completion: page refcounts decremented. If the file is
   modified after sendfile(), CoW semantics apply.

16.4.5.3 splice() (Pipe <-> Socket)

1. Tier 0 receives splice(in_fd=pipe, out_fd=socket, ...):
   a. Resolve pipe_fd -> PipeBuffer.
   b. The PipeBuffer contains page references (from a previous
      write/vmsplice to the pipe).
   c. Post SocketRingCmd { opcode: SpliceToSocket, pipe_handle,
      offset: 0, len, splice_flags }.

2. umka-net dequeues pipe pages and uses them directly as
   TCP send buffer pages (zero copy — page ownership transfers
   from pipe to socket send buffer).

3. Reverse direction (socket -> pipe): umka-net fills pipe
   pages from the socket receive buffer and posts completion.

16.4.6 Crash Recovery

When umka-net crashes and is reloaded (~50-150ms, Section 11.9):

1. Tier 0 detects umka-net domain crash (generation mismatch or
   domain fault signal).

2. Set ring_set.state = Recovering (atomic store, Release).

3. All pending SocketRingCmds are failed with EIO:
   a. Walk all ring pairs in the SocketRingSet.
   b. For each pending request_id: wake the blocked Tier 0 task
      with SocketRingResp { status: -EIO }.
   c. Sockets that cannot be restored from shadow state (step 5a)
      are marked with SO_ERROR = ECONNRESET (TCP) or EIO (UDP).
      ESTABLISHED TCP sockets with a valid shadow proceed to
      recovery instead.

4. umka-net reloads. New instance increments its generation counter.

5. Reconnection:
   a. TCP sockets in ESTABLISHED state: Tier 0 re-sends the
      TcpShadowState to the new umka-net instance via a special
      RESTORE_TCP_STATE opcode. The new umka-net reconstructs
      TcpCb from the shadow state (sequence numbers, RTT estimates,
      TCP options, congestion parameters). Send buffer data is NOT
      available (lost with the crashed Tier 1 instance). Recovery
      of in-flight data relies on the peer's retransmission: the
      peer will retransmit-timeout for unACKed segments, and the
      new umka-net accepts and ACKs them normally. cwnd is clamped
      to min(shadow_cwnd, IW10) to avoid post-recovery burst.
   b. UDP sockets: stateless. Re-registration is immediate.
   c. Listening sockets: the listen backlog is re-created.

6. Set ring_set.state = Active.

7. Pending syscalls that received EIO are retried by userspace
   (standard Unix error handling — applications retry EIO on
   transient failures). ESTABLISHED TCP connections are restored
   from shadow state (step 5a) and continue without reconnect; TCP
   sockets that cannot be restored see ECONNRESET and must
   reconnect (same behavior as a NIC reset in Linux).

TCP state preservation detail: TCP connections in ESTABLISHED state are preserved with best-effort connection continuity after umka-net crash. Tier 0 retains a shadow copy of critical TcpCb fields (sequence numbers, window sizes, RTT estimates, TCP options, connection state) in the socket metadata stored in umka-core. This shadow copy is updated on every SocketRingResp that carries updated TCP state (piggyback on ACK processing responses). The shadow is sufficient to reconstruct the connection's control state after umka-net reload, but send buffer data and receive buffer data are lost (they reside in Tier 1 memory that is destroyed on crash). Recovery relies on the peer's TCP retransmission mechanism to resend unacknowledged data. Applications may observe a stall of one peer-RTO (typically 200ms-2s) during recovery. Connections with no in-flight data recover transparently; connections with significant in-flight data recover after the peer retransmits. In all cases, RST is avoided — the connection continues without requiring application reconnect.

/// Shadow TCP state maintained by Tier 0 for crash recovery.
/// Updated piggyback on SocketRingResp for SendMsg/RecvMsg completions.
/// Stored in the Tier 0 socket metadata (not in the ring).
///
/// **Recovery guarantees and limitations**:
/// - Connections in ESTABLISHED state are recovered with **best-effort
///   connection preservation**, not seamless recovery. The peer may
///   observe a brief stall (one RTO) while umka-net reloads and the
///   shadow state is restored.
/// - **SRTT/RTTVAR**: Restored from shadow. Without these, the
///   retransmission timer would use RFC 6298 defaults (1s initial RTO),
///   causing either spurious retransmits or excessive latency. The shadow
///   values are stale by at most one RTT sample (last ACK before crash).
/// - **TCP options** (MSS, window scale, timestamps, SACK): Restored
///   from shadow. These are negotiated at SYN/SYN-ACK time and do not
///   change for the connection's lifetime. Without them, the reconstructed
///   connection would use wrong segment sizes or fail timestamp validation.
/// - **Send buffer data**: NOT recoverable. The TCP send buffer and
///   retransmission queue live in umka-net's Tier 1 memory, which is
///   destroyed on crash. Data between `snd_una` and `snd_nxt` (in-flight,
///   unacknowledged) is lost. The peer will retransmit-timeout and
///   retransmit, but the reconstructed umka-net has no copy to retransmit
///   from. The connection will recover via the peer's retransmission: the
///   peer retransmits unACKed data, the new umka-net ACKs it, and the
///   connection resumes. Applications sending large bursts may see a
///   brief stall equal to the peer's RTO (typically 200ms-2s).
/// - **Receive buffer data**: NOT recoverable. Data received by umka-net
///   but not yet delivered to the application is lost. The application
///   will see a gap and must handle it (TCP's byte-stream guarantee is
///   violated for the in-flight receive window at crash time). In practice,
///   applications using TCP already handle short reads and retries.
/// - **Congestion state**: cwnd and ssthresh are restored but may be
///   stale. The reconstructed connection enters conservative mode (cwnd
///   clamped to min(shadow_cwnd, IW10)) to avoid burst losses.
#[repr(C)]
pub struct TcpShadowState {
    /// Last known send unacknowledged sequence number.
    pub snd_una: u32,
    /// Last known next send sequence number.
    pub snd_nxt: u32,
    /// Last known receive next expected sequence number.
    pub rcv_nxt: u32,
    /// Last known receive window.
    pub rcv_wnd: u32,
    /// Last known congestion window (bytes).
    pub cwnd: u64,
    /// Last known slow-start threshold.
    pub ssthresh: u64,
    /// Smoothed RTT (microseconds). Required for retransmission timer
    /// computation (RFC 6298). Without this, the reconstructed connection
    /// would use the default 1-second initial RTO, causing either
    /// unnecessary retransmit timeouts (if real RTT << 1s) or delayed
    /// loss detection (if real RTT >> default).
    pub srtt_us: u32,
    /// RTT variance (microseconds). Used with srtt_us to compute RTO
    /// per RFC 6298: RTO = srtt + max(G, 4 * rttvar).
    pub rttvar_us: u32,
    /// Negotiated Maximum Segment Size (bytes). Set during SYN/SYN-ACK
    /// handshake, immutable for the connection lifetime.
    pub mss: u16,
    /// Window scale factor (0-14). Negotiated at SYN time, immutable.
    /// Applied to rcv_wnd: effective_wnd = rcv_wnd << wscale.
    pub wscale: u8,
    /// TCP options flags (bitfield):
    ///   bit 0: timestamps enabled (TCP_TIMESTAMP)
    ///   bit 1: SACK permitted (TCP_SACK_PERM)
    ///   bit 2: ECN negotiated (TCP_ECN)
    ///   bits 3-7: reserved (zero)
    pub tcp_options: u8,
    /// Last known TSval sent to peer (for timestamp echo validation).
    pub ts_recent: u32,
    /// Connection state (TcpState discriminant as u8).
    pub state: u8,
    /// Congestion algorithm identifier (index into registered algorithms).
    pub cong_alg_id: u8,
    /// Explicit padding to struct alignment boundary (8 bytes, from u64
    /// fields cwnd/ssthresh). Without this, repr(C) would add 6 bytes of
    /// implicit trailing padding. Making it explicit prevents information
    /// disclosure and documents the layout.
    pub _pad: [u8; 6],
}
const_assert!(core::mem::size_of::<TcpShadowState>() == 56);
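The congestion clamp in step 5a can be illustrated with a minimal sketch. ShadowCong and RestoredCong are simplified stand-ins for TcpShadowState and TcpCb (not the spec's layouts), and IW10 is taken as 10 x MSS per RFC 6928.

```rust
/// Sketch of step 5a: reconstruct congestion state from the shadow,
/// clamping cwnd to min(shadow_cwnd, IW10) to avoid a post-recovery
/// burst. Names are illustrative, not the spec's TcpCb layout.
const IW10_SEGMENTS: u64 = 10;

struct ShadowCong {
    cwnd: u64,     // bytes
    ssthresh: u64, // bytes
    mss: u16,      // negotiated MSS
}

struct RestoredCong {
    cwnd: u64,
    ssthresh: u64,
}

fn restore_congestion(shadow: &ShadowCong) -> RestoredCong {
    // IW10 (RFC 6928): initial window of 10 segments, in bytes.
    let iw10 = IW10_SEGMENTS * shadow.mss as u64;
    RestoredCong {
        cwnd: shadow.cwnd.min(iw10),
        ssthresh: shadow.ssthresh,
    }
}
```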

16.4.7 Evolvable Classification

Component Classification Rationale
SocketRingCmd / SocketRingResp wire format Nucleus (non-replaceable) Defines the ABI between Tier 0 and Tier 1. Changing it requires synchronized replacement of both sides.
SocketOpcode enum Nucleus Opcodes are the wire protocol vocabulary. Append-only evolution via new discriminants.
SocketRingSet / ring topology Nucleus (data structure) Ring metadata is the transport mechanism.
Ring selection policy (which granularity, how many rings) Evolvable The RingGranularity decision and ring count negotiation are tuneable policies. ML can optimize ring count based on workload.
Socket dispatch routing (which operations use which ring) Evolvable Currently CPU-based (cpu_to_ring). Could be evolved to socket-affinity-based routing (hot sockets get dedicated rings).
TCP congestion control Evolvable Per Section 16.10: &'static dyn CongestionOps with CongPriv inline state. Live-swappable.
epoll readiness delivery policy Evolvable The coalescing heuristic (how aggressively to batch readiness events) is tuneable.
sendfile/splice zero-copy path Nucleus (data path) Page sharing across domain boundaries is a correctness-critical invariant.

16.4.8 ML Policy Integration

16.4.8.1 Observation Points

The socket dispatch subsystem emits observations via observe_kernel! at key decision points. All observations use SubsystemId::TcpStack (or SubsystemId::NetworkDriver for NIC-level events).

/// Socket dispatch observation types.
/// Used as `obs_type` in KernelObservation for SubsystemId::TcpStack.
#[repr(u16)]
pub enum SocketObsType {
    /// Per-socket throughput sample (bytes/sec, sampled every 100ms).
    /// features[0] = throughput_kbps (scaled: bytes_sent / 100ms / 1024)
    /// features[1] = sock_handle (lower 32 bits)
    /// features[2] = cgroup_id (lower 32 bits)
    SocketThroughput    = 20,

    /// Per-socket RTT sample (from TCP ACK processing).
    /// features[0] = srtt_us (smoothed RTT, microseconds)
    /// features[1] = rttvar_us (RTT variance)
    /// features[2] = sock_handle (lower 32 bits)
    SocketRtt           = 21,

    /// Per-socket retransmit event.
    /// features[0] = retransmit_count (cumulative)
    /// features[1] = reason (0=timeout, 1=fast_retransmit, 2=TLP)
    /// features[2] = sock_handle (lower 32 bits)
    SocketRetransmit    = 22,

    /// Ring utilization sample (per-ring, every 1s).
    /// features[0] = ring_index
    /// features[1] = request_ring_fill_pct (0-100)
    /// features[2] = response_ring_fill_pct (0-100)
    /// features[3] = ops_per_sec (last 1s window)
    /// features[4] = avg_latency_us (ring post to response, last 1s)
    RingUtilization     = 23,

    /// Batch efficiency (per io_uring batch completion).
    /// features[0] = batch_size (number of ops in this batch)
    /// features[1] = total_bytes (sum of bytes across all ops)
    /// features[2] = domain_switch_cycles (measured via rdtsc/CNTVCT)
    BatchEfficiency     = 24,
}
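As an illustration of the documented feature layouts, a hypothetical packing helper for SocketThroughput. This is not the observe_kernel! implementation; it reads the "bytes_sent / 100ms / 1024" scaling literally as bytes-per-window divided by 1024, which is an assumption about the intended units.

```rust
/// Illustrative packing of a SocketThroughput observation's features
/// per the documented layout. Helper name and signature are assumptions.
const SOCKET_THROUGHPUT: u16 = 20;

fn pack_socket_throughput(
    bytes_sent_in_window: u64, // bytes sent in the 100 ms sample window
    sock_handle: u64,
    cgroup_id: u64,
) -> (u16, [u32; 3]) {
    // features[0] = throughput_kbps, read literally as
    // (bytes per 100 ms window) / 1024.
    let throughput_kbps = (bytes_sent_in_window / 1024) as u32;
    (
        SOCKET_THROUGHPUT,
        [
            throughput_kbps,
            sock_handle as u32, // lower 32 bits
            cgroup_id as u32,   // lower 32 bits
        ],
    )
}
```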

16.4.8.2 ML-Tunable Parameters

Registered with the ML policy framework (Section 23.1) via ParamId in the TcpStack subsystem group (0x02xx).

ParamId Name Default Min Max Scope Effect
0x0200 tcp_wmem_default 16384 4096 16777216 per-cgroup Default TCP send buffer size (bytes)
0x0201 tcp_rmem_default 87380 4096 16777216 per-cgroup Default TCP receive buffer size (bytes)
0x0202 tcp_congestion 0 (CUBIC) 0 15 per-cgroup Congestion algorithm index for new connections
0x0203 socket_ring_count auto 1 256 per-namespace Number of socket rings (0=auto)
0x0204 epoll_coalesce_us 0 0 1000 global Readiness doorbell coalescing delay (microseconds)
0x0205 zerocopy_threshold 4096 512 1048576 per-cgroup Minimum payload size for MSG_ZEROCOPY (bytes)
0x0206 batch_coalesce_max 16 1 256 per-namespace Maximum io_uring SQEs to coalesce per batch

ML feedback loop: The Tier 2 ML service reads SocketObsType observations from the per-CPU observation rings, trains/infers optimal parameter values, and writes them back via the ParamUpdate interface. Parameters are clamped to [min, max] at the kernel enforcement point. If the ML service crashes or becomes unreachable, parameters decay to their defaults after decay_period_ms (default 30000ms).
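The clamp-and-decay behavior at the kernel enforcement point can be sketched as follows. ParamSpec and enforce_param are illustrative names, not the ParamUpdate interface itself.

```rust
/// Sketch of the kernel enforcement point for an ML parameter update:
/// clamp the proposed value to [min, max], and fall back to the default
/// when the ML service has been silent longer than decay_period_ms.
struct ParamSpec {
    default: u64,
    min: u64,
    max: u64,
}

fn enforce_param(
    spec: &ParamSpec,
    proposed: u64,
    ms_since_last_update: u64,
    decay_period_ms: u64,
) -> u64 {
    if ms_since_last_update > decay_period_ms {
        // ML service crashed or unreachable: decay to the default.
        spec.default
    } else {
        proposed.clamp(spec.min, spec.max)
    }
}
```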

Interaction with EEVDF scheduler: Cgroups with cpu.guarantee (Section 7.1) and high network throughput may receive ML-recommended TCP buffer size increases to match their CPU allocation. The ML service performs this cross-subsystem correlation using:

- Observations: SocketObsType::SocketThroughput (obs_type 0x14, SubsystemId::TcpStack) provides per-socket throughput sampled every 100ms with cgroup_id in features[2]. The scheduler emits CbsObsType::CbsReplenish (obs_type 0x11, SubsystemId::Scheduler) with cgroup_id in features[0] and consumed_us in features[5]. The ML service joins these by cgroup_id to determine whether a cgroup's network throughput is constrained by its CPU allocation.
- Parameters: The ML service writes ParamId 0x0200 (tcp_wmem_default, SubsystemId::TcpStack, per-cgroup scope) to increase TCP send buffer size for cgroups where throughput is CPU-bound. It reads ParamId 0x0010 (cbs_steal_fraction_pct, SubsystemId::Scheduler) to observe the current steal heuristic and avoid recommending large buffers to cgroups that are already steal-constrained.
- Correlation rule: if throughput_kbps > threshold AND consumed_us / quota_us > 0.8: increase tcp_wmem_default. The threshold and consumed/quota ratio are themselves ML-tunable (via the policy model's internal state, not kernel parameters).
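The correlation rule reduces to a simple predicate. This sketch avoids floating point by cross-multiplying the 0.8 ratio; the function name and threshold plumbing are illustrative.

```rust
/// Sketch of the cross-subsystem correlation rule: given a cgroup's
/// joined throughput and CBS consumption samples, decide whether to
/// recommend raising tcp_wmem_default.
fn should_increase_wmem(
    throughput_kbps: u64,
    throughput_threshold_kbps: u64,
    consumed_us: u64,
    quota_us: u64,
) -> bool {
    // "throughput_kbps > threshold AND consumed_us / quota_us > 0.8",
    // with the ratio test done as consumed * 10 > quota * 8.
    throughput_kbps > throughput_threshold_kbps
        && consumed_us * 10 > quota_us * 8
}
```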


16.4.9 Shared Buffer Management

16.4.9.1 KABI Shared Buffer Pool

Socket dispatch reuses the same shared buffer infrastructure as VFS (Section 16.2). Each network namespace has a dedicated shared buffer pool:

/// Per-namespace shared buffer pool for socket data transfer.
///
/// Allocated at namespace creation. Sized based on expected concurrent
/// socket operations: default 4 MB (1024 x 4 KB slots).
/// Configurable via `net.core.socket_shared_buf_size` sysctl.
///
/// Both Tier 0 and Tier 1 can read the buffer (PKEY_SHARED / domain 2).
/// Tier 0 writes (copy_from_user for sendmsg, copy_to_user for recvmsg).
/// Tier 1 writes (dequeue from recv_queue for recvmsg, read from
/// page cache for sendfile).
pub struct SocketSharedBufPool {
    /// Base address of the shared region. Page-aligned.
    pub base: *mut u8,
    /// Total size in bytes. Must be a multiple of SLOT_SIZE.
    pub size: usize,
    /// Per-slot size (4 KB = PAGE_SIZE, matching page cache granularity).
    pub slot_size: u32,
    /// Number of slots (size / slot_size).
    pub slot_count: u32,
    /// Lock-free slot allocator. Producer (Tier 0 or Tier 1) claims
    /// a slot via atomic CAS on the bitmap, uses it, then releases.
    /// Bitmap: 1 bit per slot. At 1024 slots = 128 bytes = 2 cache lines.
    pub alloc_bitmap: AtomicBitmap,
    /// Free slot count (for backpressure detection).
    pub free_slots: AtomicU32,
}

/// Slot size for the socket shared buffer pool.
pub const SOCKET_SHARED_BUF_SLOT_SIZE: u32 = 4096;
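The lock-free slot claim described above (atomic CAS on the allocation bitmap) can be sketched as follows, modeling the AtomicBitmap as a slice of AtomicU64 words (1 bit per slot, 1 = allocated). The helper names are illustrative.

```rust
use core::sync::atomic::{AtomicU64, Ordering};

/// Claim the first free slot via CAS on the bitmap. Returns None when
/// the pool is exhausted (backpressure).
fn claim_slot(bitmap: &[AtomicU64], slot_count: u32) -> Option<u32> {
    for (word_idx, word) in bitmap.iter().enumerate() {
        let mut cur = word.load(Ordering::Relaxed);
        loop {
            let free = !cur;
            if free == 0 {
                break; // word is full, try the next one
            }
            let bit = free.trailing_zeros();
            let slot = word_idx as u32 * 64 + bit;
            if slot >= slot_count {
                return None; // past the last valid slot
            }
            match word.compare_exchange_weak(
                cur,
                cur | (1u64 << bit),
                Ordering::Acquire,
                Ordering::Relaxed,
            ) {
                Ok(_) => return Some(slot),
                Err(actual) => cur = actual, // lost the race, retry word
            }
        }
    }
    None
}

/// Release a previously claimed slot.
fn release_slot(bitmap: &[AtomicU64], slot: u32) {
    let word = &bitmap[(slot / 64) as usize];
    word.fetch_and(!(1u64 << (slot % 64)), Ordering::Release);
}
```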

16.4.9.2 Large Payload Handling

For sendmsg() / recvmsg() with payloads larger than one shared buffer slot (4 KB), the dispatcher allocates multiple contiguous slots. The SocketSlotHandle encodes (base_slot_index, slot_count):

/// Handle referencing a contiguous region in the shared buffer pool.
///
/// Encoding: bits [63:32] = base slot index, bits [31:0] = byte length.
/// Maximum byte length per handle: 4 GB (u32::MAX). This is a deliberate
/// design constraint: no single sendmsg() call may exceed 4 GB payload.
/// The Linux sendmsg() interface uses `size_t` for iov_len but practical
/// payloads are bounded by socket send buffer sizes (typically 64 KB -
/// 16 MB). A single MSG_ZEROCOPY scatter-gather total exceeding 4 GB
/// is not a supported use case. The dispatcher validates
/// `total_iov_len <= u32::MAX` before building the SocketSlotHandle and
/// returns `-EMSGSIZE` if exceeded.
/// Maximum slot index: 2^32 slots x 4 KB = 16 TB. Absurdly large;
/// practically bounded by pool size (4 MB default).
///
/// For MSG_ZEROCOPY: SocketSlotHandle instead references a scatter-gather
/// table in the shared region. The first 16 bytes of the referenced
/// region are an SgTableHeader { entry_count: u32, _pad: u32,
/// total_len: u64 }, followed by entry_count SgEntry { phys_addr: u64,
/// len: u32, _pad: u32 } records.
#[repr(transparent)]
pub struct SocketSlotHandle(pub u64);

impl SocketSlotHandle {
    pub fn slot_index(&self) -> u32 { (self.0 >> 32) as u32 }
    pub fn byte_len(&self) -> u32 { self.0 as u32 }
    pub fn is_none(&self) -> bool { self.0 == u64::MAX }

    /// Construct a `SocketSlotHandle` from a slot index and byte length.
    ///
    /// `slot_index` is the zero-based index into the shared buffer pool.
    /// `byte_len` is the total payload length in bytes (up to `u32::MAX`).
    pub fn new(slot_index: u32, byte_len: u32) -> Self {
        Self(((slot_index as u64) << 32) | byte_len as u64)
    }

    /// Convert a slot handle to a raw pointer into the shared buffer pool.
    /// The shared buffer pool is a contiguous memory region mapped into both
    /// Tier 0 and Tier 1 (umka-net) address spaces at domain init time.
    /// The base address is stored in `SocketRingSet.shared_buf.base` (a
    /// per-ring-set field set during socket ring creation).
    ///
    /// # Safety
    /// The caller must verify `slot_index < ring_set.shared_buf.slot_count`
    /// and `byte_len <= SOCKET_SHARED_BUF_SLOT_SIZE`. Invalid handles from
    /// crashed or malicious userspace are bounds-checked by the Tier 0
    /// dispatcher before passing to umka-net.
    pub fn to_ptr(&self, ring_set: &SocketRingSet) -> *const u8 {
        // Each slot is SOCKET_SHARED_BUF_SLOT_SIZE (4096) bytes.
        let offset = self.slot_index() as usize * SOCKET_SHARED_BUF_SLOT_SIZE as usize;
        // SAFETY: shared_buf.base is valid for the lifetime of the
        // SocketRingSet (allocated at namespace creation, freed at teardown).
        // The caller has validated slot_index < shared_buf.slot_count.
        unsafe { ring_set.shared_buf.base.add(offset) }
    }

    /// Reconstruct a `SocketSlotHandle` from a raw pointer and byte length.
    ///
    /// This is the reverse of `to_ptr()`. Given a pointer within the shared
    /// buffer region and the known data length, computes the slot index via
    /// pointer arithmetic:
    /// `slot_index = (ptr - shared_buf.base) / SOCKET_SHARED_BUF_SLOT_SIZE`.
    ///
    /// Returns `None` if the pointer is not aligned to a slot boundary or is
    /// outside the shared buffer region.
    ///
    /// **Use case**: When umka-net (Tier 1) needs to return a buffer reference
    /// to Tier 0 (e.g., in a `SocketRingResp` completion), it encodes the slot
    /// index back into a `SocketSlotHandle`. The caller knows the data length
    /// from the protocol operation that produced the buffer.
    pub fn from_ptr(ptr: *const u8, len: u32, ring_set: &SocketRingSet) -> Option<Self> {
        let base = ring_set.shared_buf.base as usize;
        let addr = ptr as usize;
        if addr < base {
            return None;
        }
        let offset = addr - base;
        if offset % SOCKET_SHARED_BUF_SLOT_SIZE as usize != 0 {
            return None;
        }
        let slot_index = (offset / SOCKET_SHARED_BUF_SLOT_SIZE as usize) as u32;
        if slot_index >= ring_set.shared_buf.slot_count {
            return None;
        }
        Some(Self::new(slot_index, len))
    }
}
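A standalone round-trip of the (slot_index, byte_len) packing, reproducing the bit layout above for illustration (the real methods live on SocketSlotHandle).

```rust
/// Encode bits [63:32] = slot index, bits [31:0] = byte length.
fn encode(slot_index: u32, byte_len: u32) -> u64 {
    ((slot_index as u64) << 32) | byte_len as u64
}

/// Decode back to (slot_index, byte_len).
fn decode(handle: u64) -> (u32, u32) {
    ((handle >> 32) as u32, handle as u32)
}
```

Note that the "none" sentinel (u64::MAX) decodes to the maximal slot index and length, both of which fail the dispatcher's bounds checks.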

16.4.10 Performance Analysis

16.4.10.1 Cycle Budget (x86-64, production Linux baseline)

Operation Linux x86-64 (cycles) UmkaOS x86-64 (cycles) Delta Notes
sendmsg (single) ~200-400 ~225-425 +25 MPK domain switch (23cy)
sendmsg via io_uring (N=16) ~200-400 per op ~155-330 per op -45 to -70 Batch amortization wins
sendmmsg (N=44, QUIC) ~15-20 per msg (indirect calls w/ retpoline) ~1-6 per msg (batch) -14 Massive batch win
epoll_wait (EPOLLET, 10 ready) ~50-100 per event ~30-60 per event -20 to -40 No poll re-check needed
epoll_wait (level-triggered, 10 ready) ~50-100 per event ~80-150 per event +30-50 Re-poll cost
recvmsg (data ready) ~200-400 ~225-425 +25 MPK domain switch
recvmsg via io_uring (N=8) ~200-400 per op ~180-350 per op -20 to -50 Batch amortization
sendfile (4 KB) ~150-300 ~175-325 +25 Domain switch
sendfile (1 MB, batched) ~15-30 per 4KB page ~5-10 per 4KB page -10 to -20 Page cache zero-copy

AArch64 adjustments: Replace MPK 23cy with POE 40-80cy. Replace Linux retpoline overhead (~15-20cy per indirect call) with ~5-8cy (AArch64 uses branch target identification, not retpolines). Consequence: single-operation overhead is higher (+40-80cy vs +23cy), and batched breakeven requires higher N (see formula below). AArch64 sendmmsg at N=44 is approximately break-even with Linux rather than 3x faster.

16.4.10.2 Amortization Formula

For any batched socket operation path:

per_op_overhead = (domain_switch_cycles * 2) / batch_size

Where domain_switch_cycles = 23 (x86 MPK) to 80 (AArch64 POE). linux_indirect_call_overhead varies by architecture:
- x86-64 with retpoline: ~15-20 cycles
- AArch64 without retpoline: ~5-8 cycles
- ARMv7: ~5-8 cycles
- RISC-V/s390x/LoongArch64: domain switch elided (Tier 0 fallback)

Breakeven (UmkaOS per-op cost == Linux per-op cost):

batch_size_breakeven = (2 * domain_switch_cycles) / linux_indirect_call_overhead

x86-64:  (2 * 23) / 15  = ~3 operations  (best case)
ARMv7:   (2 * 40) / 5   = ~16 operations
AArch64: (2 * 80) / 5   = ~32 operations (worst case)

At batch_size >= breakeven, UmkaOS is faster than Linux. io_uring default queue depth is 32-128; typical batching achieves N=8-32. UmkaOS wins on all batched workloads for x86-64 and ARMv7. AArch64 achieves negative overhead at io_uring depths >= 32, and is within ~5% at typical depths (N=16). Architectures without fast isolation (RISC-V, s390x, LoongArch64) elide the domain switch entirely and have near-zero overhead from the ring protocol (~5-10 cycles for ring enqueue/dequeue).
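The breakeven figures above follow directly from the formula; integer division matches the section's rounded values (~3, 16, 32).

```rust
/// Batch size at which the amortized 2x domain-switch cost matches
/// Linux's per-op indirect-call overhead:
///   batch_size_breakeven = (2 * domain_switch_cycles)
///                          / linux_indirect_call_overhead
fn batch_size_breakeven(domain_switch_cycles: u32, linux_indirect_call_cycles: u32) -> u32 {
    (2 * domain_switch_cycles) / linux_indirect_call_cycles
}
```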

16.4.10.3 Memory Overhead

Resource Per-namespace cost Notes
SocketRingSet metadata 256 bytes Fixed struct
Ring pairs (8 rings, PerLlc) ~1 MB 512 entries x 128B cmd + 512 x 64B resp per ring
cpu_to_ring table ~1 KB (512 CPUs) One AtomicU16 per CPU
Readiness ring 64 KB 4096 entries x 16 bytes
Shared buffer pool 4 MB (default) 1024 x 4 KB slots
Total ~5.3 MB Per network namespace

For the root namespace on a 64-core server, this is negligible (~0.004% of 128 GB). Per-container namespaces use fewer rings (Single or PerNuma granularity) and smaller shared buffers.


16.4.11 Architecture-Specific Notes

Architecture Domain switch Ring selection cost Notes
x86-64 MPK WRPKRU ~23cy 1cy (Relaxed load, TSO) Best case. LOCK XADD for request_id ~15cy.
AArch64 POE POR_EL0 ~40-80cy 1-3cy (LDAR) LSE LDADD for request_id ~10-30cy uncontended.
ARMv7 DACR+ISB ~30-40cy 3-5cy (LDREX) 64-bit atomics via LDREXD/STREXD ~10-20cy.
RISC-V 64 PT-only ~200-500cy (Tier 0 fallback) 1-3cy (LR/SC) No fast isolation; rings still used for IPC structure. Domain switch elided.
PPC32 Segment+isync ~20-40cy 3-5cy (lwarx/stwcx) 64-bit atomics via paired lwarx/stwcx.
PPC64LE Radix PID ~30-60cy 1-3cy (ldarx) Strong ordering reduces fence costs.
s390x Storage Keys (Tier 0 fallback) 1cy (native 64-bit atomics) Domain switch elided; ring protocol unchanged.
LoongArch64 None (Tier 0 fallback) 1-3cy (LL/SC) Domain switch elided; ring protocol unchanged.

On architectures without fast isolation (RISC-V, s390x, LoongArch64), the ring buffer protocol is retained for code structure uniformity, but domain register switches are elided. The ring becomes a simple shared-memory queue with no isolation overhead. Per-operation cost on these architectures approaches Linux baseline (function call) plus ring protocol overhead (~5-10 cycles for ring enqueue/dequeue) minus retpoline savings (~5-15 cycles on architectures with indirect branch prediction issues).


16.4.12 Cross-References

16.5 NetBuf: Packet Buffer

The NetBuf is UmkaOS's native packet data structure — the equivalent of Linux's sk_buff. It carries packet data and metadata through the entire network stack: from NIC driver RX through protocol processing, firewall evaluation, socket delivery, and back out through TX. Unlike Linux's sk_buff (~240 bytes, accumulated over 30 years of organic growth), NetBuf is designed from scratch for zero-copy domain crossings, scatter-gather I/O, and reference-counted sharing.

Zero-copy scope: "Zero-copy" refers to the KABI ring path between Tier 0 and umka-net (or NIC drivers): NetBuf data pages are passed by DMA handle, not copied. Standard sendmsg() still copies from userspace to kernel socket buffer (Linux-compatible behavior). True userspace→wire zero-copy is available via MSG_ZEROCOPY (sendmsg flag), which pins user pages as scatter-gather fragments.

KABI ring terminology: "KABI ring" in the networking context refers to the DomainRingBuffer shared between umka-net and the NIC driver (Section 11.7), distinct from NIC hardware TX/RX descriptor rings managed by the driver internally.

Design principles:

1. Separation of metadata and data: The NetBuf struct is a metadata header (~256 bytes, fits in 4 cache lines). Packet data lives in separately allocated DMA-eligible pages (via DmaBufferHandle, Section 12.3). When a NetBuf crosses an isolation domain boundary (umka-net to NIC driver or vice versa), only the metadata header is copied (~256 bytes); data pages are shared via the DMA buffer pool (shared isolation domain: PKEY 14 on x86-64 / domain 2 on AArch64; see Section 11.2).

Size note: The 256-byte figure is the full NetBuf struct size as allocated from the slab cache (4 cache lines). The routing result (RouteLookupResult, ~80 bytes) is stored in a separate slab-allocated extension pointed to by route_ext (8 bytes in the NetBuf), not inline. The "256 handles per page" figure (4096 / 16 = 256) found in NetBufPool documentation refers to the compact NetBufHandle token (16 bytes: pool-id + slot-index + generation), not the full NetBuf. Handles are stored in ring buffers and transmission queues; full NetBuf objects are stored separately in the slab pool.

  2. Per-CPU allocation, no global lock: NetBufs are allocated from per-CPU NetBufPool slabs (Section 4.1). The fast path (alloc/free) never touches a global lock or cross-CPU data structure.
  3. Reference-counted for zero-copy: Multiple NetBufs can reference the same underlying data pages (e.g., XDP_REDIRECT to multiple interfaces, TCP zero-copy receive delivering the same page fragment to multiple sockets). Cloning a NetBuf increments the data page refcount without copying data.
  4. Scatter-gather native: Large packets (GSO/GRO aggregates, jumbo frames) use a fragment list rather than requiring contiguous allocation. The fragment list is inline for small fragment counts (up to 6) and spills to a heap-allocated extension for larger counts.
  5. RDMA-eligible: Data pages allocated from the RDMA pool (Section 5.4) can be used directly for RDMA operations without re-registration. The flags field tracks RDMA eligibility.
/// A variable-length array backed by the kernel slab allocator.
/// Semantics similar to `SmallVec<[T; N]>`: up to N elements stored
/// inline without allocation; overflow spills to a slab-allocated
/// heap block.
///
/// Used for small, bounded collections on hot paths (e.g., routing next-hop
/// arrays, scatter-gather lists) where heap allocation must be avoided.
pub struct SlabVec<T, const N: usize> {
    /// Inline storage for the common case (N elements).
    inline: [MaybeUninit<T>; N],
    /// Pointer to slab-allocated overflow storage. NULL if len <= N.
    overflow: *mut T,
    /// Current element count.
    len: usize,
    /// Capacity of the overflow allocation (in elements). 0 if inline.
    overflow_cap: usize,
}
// umka-net/src/netbuf.rs

/// Opaque handle referencing a DMA-accessible buffer in the shared
/// NIC ↔ kernel buffer pool. The full handle (16 bytes) is used in
/// `NetBufRingEntry`; the compact reference (4 bytes) `DmaFragRef`
/// is used in continuation fragment entries.
#[repr(C)]
pub struct DmaBufferHandle {
    /// Pool identifier (selects which DMA buffer pool this buffer belongs to).
    pub pool_id: u16,
    /// Generation counter for use-after-free detection (defense-in-depth).
    ///
    /// **Longevity analysis**: u16 is safe because the data page refcount
    /// is the primary liveness mechanism — a buffer in the TCP retransmit
    /// queue or in cross-domain transit has `refcount >= 1`, preventing the
    /// pool from recycling that slot. The generation is checked only during
    /// deserialization from the NIC-to-kernel ring after a driver crash-
    /// restart, where stale handles from the crashed driver may reference
    /// recycled slots. During crash recovery (~50-150ms), the pool drains
    /// and re-allocates slots sequentially. At 100 Mpps with 1024 pool
    /// slots, generation increments ~100K/sec per slot; u16 wraps after
    /// ~0.65s per slot — well beyond crash recovery duration. Under normal
    /// operation, refcount prevents slot recycling before handle validation.
    pub generation: u16,
    /// Byte offset within the pool's DMA region.
    pub offset: u32,
    /// IOVA base address of the pool region (device-visible address).
    pub iova_base: u64,
}
// size: 16 bytes (2 + 2 + 4 + 8)
const_assert!(size_of::<DmaBufferHandle>() == 16);

impl DmaBufferHandle {
    /// Access the shared atomic refcount for the data pages.
    ///
    /// Each DMA buffer pool maintains a per-slot refcount array alongside
    /// the data region. The refcount is indexed by `self.offset / slot_size`,
    /// where `slot_size` is the pool's per-buffer allocation granularity.
    ///
    /// Starts at 1 on allocation. Incremented by `NetBuf::clone_shared()`.
    /// When `fetch_sub(1, Release)` returns 1 (was the last reference), the
    /// caller returns the buffer to the pool.
    ///
    /// **Relationship with generation**: Refcounting is the primary liveness
    /// mechanism that prevents use-after-free. The `generation` field is a
    /// secondary stale-detection aid for cross-domain handles: if a Tier 1
    /// driver holds a handle across a crash-restart cycle, the generation
    /// mismatch prevents it from accessing a reused slot. Refcounting alone
    /// is sufficient for single-domain safety; generation adds defense-in-depth
    /// for cross-domain handle passing.
    pub fn refcount(&self) -> &AtomicU32 {
        let pool = dma_buffer_pool(self.pool_id);
        let slot_idx = self.offset as usize / pool.slot_size;
        &pool.refcounts[slot_idx]
    }
}
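The clone/drop discipline described above can be sketched as follows. clone_ref and drop_ref are illustrative wrappers; the key points are the Relaxed increment (the cloner already holds a live reference) and the Release decrement paired with an Acquire fence before the buffer is recycled.

```rust
use core::sync::atomic::{fence, AtomicU32, Ordering};

/// Take an additional reference to the data pages.
fn clone_ref(refcount: &AtomicU32) {
    // Relaxed is sufficient: the caller holds a live reference, so the
    // count cannot concurrently reach zero.
    refcount.fetch_add(1, Ordering::Relaxed);
}

/// Drop a reference. Returns true when the caller held the last
/// reference and must return the buffer to the pool.
fn drop_ref(refcount: &AtomicU32) -> bool {
    // Release orders this domain's writes to the data pages before
    // the decrement becomes visible.
    if refcount.fetch_sub(1, Ordering::Release) == 1 {
        // Acquire fence pairs with the other droppers' Release
        // decrements before the slot is recycled.
        fence(Ordering::Acquire);
        return true;
    }
    false
}
```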

/// Compact fragment reference for continuation entries (4 bytes).
/// Used in `NetBufRingFrag` where space is constrained.
/// The pool_id and iova_base are inherited from the parent `NetBufRingEntry`.
#[repr(C)]
pub struct DmaFragRef {
    /// Byte offset within the parent entry's pool region.
    /// 16-bit offset supports up to 64 KB per fragment, which is sufficient
    /// for all standard MTU sizes (Ethernet 1500, jumbo 9000) and typical
    /// DMA pool region sizes. **Constraint**: DMA pool regions used with
    /// DmaFragRef MUST be <= 64 KB. Larger regions (e.g., for TSO segments
    /// >64 KB) must use the full `NetBufFrag` type with u32 offset instead
    /// of the compact `DmaFragRef` form.
    pub offset: u16,
    /// Generation counter. Incremented each time the fragment slot is
    /// recycled in the DMA pool. Used to detect stale `DmaFragRef` handles
    /// that point to reused pool entries — a stale handle with a mismatched
    /// generation is rejected before accessing the fragment data.
    ///
    /// **Sizing rationale**: u8 wraps at 256; at 10 Mpps with per-packet
    /// fragment recycling, wrap occurs in ~25 µs — too short to guarantee
    /// that all stale references have been retired. u16 wraps at 65536,
    /// giving ~6.5 ms at 10 Mpps — sufficient for RCU grace periods and
    /// NAPI poll cycles to complete, ensuring no live stale handle survives
    /// a full generation wrap.
    pub generation: u16,
}
// kernel-internal, not KABI: offset(2) + generation(2) = 4 bytes.
const_assert!(core::mem::size_of::<DmaFragRef>() == 4);
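The stale-handle check reduces to a generation comparison against the pool slot's current generation. FragSlot and recycle are illustrative pool-side names, not the spec's types.

```rust
/// Illustrative pool-side state for one fragment slot.
struct FragSlot {
    generation: u16,
}

/// Bump the generation when the slot is recycled; wrapping_add matches
/// the wraparound behavior analyzed in the sizing rationale above.
fn recycle(slot: &mut FragSlot) {
    slot.generation = slot.generation.wrapping_add(1);
}

/// A handle is accepted only if its generation matches the slot's
/// current generation; a mismatch means the slot was recycled after
/// the handle was issued.
fn validate_frag(slot: &FragSlot, handle_generation: u16) -> bool {
    slot.generation == handle_generation
}
```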

/// Packet buffer — carries packet data and metadata through the network stack.
///
/// `NetBuf` is the UmkaOS equivalent of Linux's `sk_buff`. It is a metadata header
/// (~256 bytes) that references separately allocated data pages. The struct itself
/// is allocated from a per-CPU `NetBufPool` (slab-backed, Section 4.1).
///
/// **Lifetime**: Allocated via `NetBufPool::alloc()`, freed via `NetBufPool::free()`
/// or when the last reference is dropped. Reference counting applies to the data
/// pages (`data_handle`), not the `NetBuf` struct itself — the struct is owned by
/// exactly one consumer at a time. Zero-copy sharing is achieved by cloning: the
/// clone gets a new `NetBuf` struct (from the local CPU's pool) pointing to the
/// same data pages with an incremented refcount.
///
/// **Domain crossing protocol** (see Section 16.9):
/// When a NetBuf crosses the umka-net / NIC driver isolation domain boundary:
/// 1. The sending domain allocates a new `NetBuf` struct in the receiving domain's
///    per-CPU pool (via a cross-domain slab allocation helper).
/// 2. The metadata fields are copied to the new struct (~256 bytes memcpy).
/// 3. The `data_handle` (DMA buffer reference) is shared — both domains can access
///    data pages through the shared DMA buffer pool (PKEY 14 / domain 2).
/// 4. The original `NetBuf` struct is freed in the sending domain's pool.
/// This ensures each domain operates on its own metadata (preventing TOCTOU attacks
/// on header offsets) while sharing the bulk data zero-copy.
///
/// **Cross-reference**: `DmaBufferHandle` (Section 12.1.5), `NetBufPool` (below),
/// `NetDeviceVTable` TX/RX paths (Section 16.10), XDP bounce buffer ([Section 19.2](19-sysapi.md#ebpf-subsystem)),
/// NAPI batching (Section 16.9), TCP zero-copy receive (Section 16.5.9).
/// **Cache-line layout** (`#[repr(C)]`, 64-byte cache lines on all supported architectures):
///
/// ```text
/// ┌─────────────────────────────────────────────────────────────────────────┐
/// │ CACHE LINE 0 (bytes 0-63) — HOT: RX/TX fast path, touched per-packet  │
/// │                                                                       │
/// │  [0..16)   data_handle: DmaBufferHandle      16 B  data page ref      │
/// │  [16..20)  head_offset: u32                   4 B  buffer start       │
/// │  [20..24)  data_offset: u32                   4 B  payload start      │
/// │  [24..28)  tail_offset: u32                   4 B  payload end        │
/// │  [28..32)  end_offset: u32                    4 B  buffer end         │
/// │  [32..34)  l2_offset: i16                     2 B  L2 header offset   │
/// │  [34..36)  l3_offset: u16                     2 B  L3 header offset   │
/// │  [36..38)  l4_offset: u16                     2 B  L4 header offset   │
/// │  [38..40)  inner_l3_offset: u16               2 B  tunnel inner L3    │
/// │  [40..42)  inner_l4_offset: u16               2 B  tunnel inner L4    │
/// │  [42..43)  checksum_status: ChecksumStatus    1 B  csum offload state │
/// │  [43..44)  (padding)                          1 B                     │
/// │  [44..46)  csum_start: u16                    2 B  csum range start   │
/// │  [46..48)  csum_offset: u16                   2 B  csum field offset  │
/// │  [48..52)  csum_value: u32                    4 B  raw HW checksum    │
/// │  [52..54)  vlan_tci: u16                      2 B  802.1Q TCI         │
/// │  [54..55)  vlan_present: u8                   1 B  VLAN tag valid     │
/// │  [55..56)  (padding)                          1 B                     │
/// │  [56..58)  protocol: u16                      2 B  IP protocol number │
/// │  [58..60)  addr_family: AddressFamily          2 B  AF_INET/AF_INET6  │
/// │  [60..64)  flags: NetBufFlags                 4 B  direction/state    │
/// │  Total: 64 bytes (exactly one cache line)                             │
/// ├─────────────────────────────────────────────────────────────────────────┤
/// │ CACHE LINE 1 (bytes 64-71) — WARM: routing, touched on               │
/// │   forwarding/output path but not on every RX demux                    │
/// │                                                                       │
/// │  [64..72)   route_ext: Option<NonNull<RouteLookupResult>>  8 B        │
/// │             Pointer to slab-allocated RouteLookupResult               │
/// │             (NextHop + mtu + prefsrc + route_type + table_id;         │
/// │              slab-backed for TCP retransmit safety — the slab         │
/// │              object survives RCU route table swaps).                   │
/// │             A `const_assert!(size_of::<NetBuf>() <= 256)` in the      │
/// │             implementation verifies the total size fits within 4       │
/// │             cache lines.                                              │
/// ├─────────────────────────────────────────────────────────────────────────┤
/// │ CACHE LINE 1 cont. (72-127) — WARM: GSO/GRO, touched on offload      │
/// │   paths (TSO, GRO coalescing) but not on simple forwarded packets     │
/// │                                                                       │
/// │  [~72)      gso_type: GsoType                 1 B  segmentation type  │
/// │             (padding)                         1 B                     │
/// │             gso_size: u16                     2 B  MSS for GSO        │
/// │             gso_segs: u16                     2 B  segment count      │
/// │             (padding to 8-byte align)         2 B                     │
/// │             next: Option<NonNull<NetBuf>>     8 B  TCP seg queue link │
/// │             gro_next: Option<NonNull<NetBuf>> 8 B  GRO chain link     │
/// ├─────────────────────────────────────────────────────────────────────────┤
/// │ CACHE LINES 2-3 (bytes 128-~255) — COLD: scatter-gather, touched      │
/// │   only for multi-fragment packets (GSO, sendfile, splice)             │
/// │                                                                       │
/// │             nr_frags: u8                      1 B  fragment count     │
/// │             (padding to 8-byte align)         7 B                     │
/// │             frags: [NetBufFrag; 6]          144 B  inline SG list     │
/// │             frag_ext: Option<SlabVec<..>>   ~16 B  overflow SG        │
/// ├─────────────────────────────────────────────────────────────────────────┤
/// │ CACHE LINE 3 cont. — WARM/COLD: flow hash, timestamps,               │
/// │   classification metadata                                             │
/// │                                                                       │
/// │             flow_hash: u32                    4 B  RSS/flow steering   │
/// │             hash_type: RssHashType            1 B  hash algorithm     │
/// │             (padding)                                                 │
/// │             timestamp_ns: u64                 8 B  packet timestamp   │
/// │             ifindex: u32                      4 B  interface index    │
/// │             alloc_numa_node: u16              2 B  NUMA origin        │
/// │             (padding)                                                 │
/// │             mark: u32                         4 B  netfilter mark     │
/// │             conntrack_idx: u32                4 B  conntrack ref      │
/// │             priority: u32                     4 B  QoS/tc class       │
/// └─────────────────────────────────────────────────────────────────────────┘
/// ```
///
/// **Design rationale**: Cache line 0 contains every field touched on the RX
/// fast path (data pointer arithmetic, L2/L3/L4 offset parsing, checksum
/// validation, protocol dispatch). A packet that is received, demultiplexed to
/// a socket, and delivered to userspace touches only cache line 0 of the NetBuf
/// metadata — all other cache lines remain cold. The routing cache (line 1)
/// is pulled in only when the packet enters the forwarding or output path.
/// Scatter-gather fragments (lines 2-3) are accessed only for multi-page
/// packets. This layout ensures the common single-segment RX-to-socket path
/// incurs exactly one cache line fill for NetBuf metadata access.
///
/// **Note**: Offsets beyond cache line 0 are approximate; exact positions
/// depend on alignment padding chosen by the implementation.
/// The routing result is stored as `Option<NonNull<RouteLookupResult>>` (8 bytes:
/// a nullable pointer, with `None` represented as null via Rust's niche
/// optimization). Total struct size is ~256 bytes (4 cache lines).
// kernel-internal, not KABI
#[repr(C)]
pub struct NetBuf {
    // ---- Cache line 0: HOT — RX/TX fast path (data pointers + protocol offsets) ----

    /// DMA buffer handle for the underlying data pages.
    ///
    /// Points to a DMA-mapped memory region allocated via
    /// `KernelServicesVTable::alloc_dma_buffer()` (Section 12.1.5). The handle is
    /// valid for the lifetime of this NetBuf (or until the data pages are
    /// explicitly released). The same handle may be shared across multiple cloned
    /// NetBufs (refcounted in the DMA buffer pool).
    ///
    /// For scatter-gather packets, this handle refers to the linear (header) portion
    /// only. Fragment data is in separate DMA handles within `frags`.
    pub data_handle: DmaBufferHandle,

    /// Offset from `data_handle` base to the start of the allocated buffer region.
    ///
    /// The region `[head_offset .. end_offset)` is the total allocated linear buffer.
    /// `head_offset` is typically 0 but may be non-zero if the buffer was carved from
    /// a larger DMA allocation (e.g., page fragment sub-allocation for small packets).
    pub head_offset: u32,

    /// Offset from `data_handle` base to the start of packet data.
    ///
    /// The region `[head_offset .. data_offset)` is headroom — available for
    /// prepending headers (e.g., tunnel encapsulation adds an outer IP/UDP header).
    /// `push()` decrements `data_offset` to claim headroom; if insufficient headroom
    /// remains, the caller must reallocate (or use `NetBuf::prepend_realloc()`).
    ///
    /// **Invariant**: `head_offset <= data_offset <= tail_offset <= end_offset`.
    pub data_offset: u32,

    /// Offset from `data_handle` base to the end of packet data.
    ///
    /// `tail_offset - data_offset` is the linear data length. `put()` increments
    /// `tail_offset` to append data; `pull()` increments `data_offset` to consume
    /// a header (advancing past it after parsing).
    pub tail_offset: u32,

    /// Offset from `data_handle` base to the end of the allocated buffer region.
    ///
    /// `end_offset - tail_offset` is tailroom — available for appending data
    /// (e.g., padding, FCS). The total linear buffer size is `end_offset - head_offset`.
    pub end_offset: u32,

    // ---- Protocol metadata (parsed by the stack) — still cache line 0 ----

    /// Byte offset from `data_offset` to the start of the L2 (link-layer) header (signed).
    ///
    /// For Ethernet frames, this is 0 (L2 header is at the start of data). For
    /// packets received after L2 processing (e.g., after bridge forwarding or XDP),
    /// this may be negative (L2 header was in the headroom and has been consumed).
    /// Set by the NIC driver or the L2 processing layer.
    /// `i16::MIN` (-32768) = sentinel meaning "L2 layer not present or not parsed".
    /// Valid range: -32767 to 32767. Typical values: 0 (L2 starts at data_offset).
    pub l2_offset: i16,

    /// Byte offset from `data_offset` to the start of the L3 (network-layer) header.
    ///
    /// For IPv4/IPv6. Set during L3 header parsing. Used by checksum offload,
    /// GSO segmentation, and BPF helpers (`bpf_skb_load_bytes()`). Value 0xFFFF
    /// means "not set" (packet has not been parsed to L3 yet).
    pub l3_offset: u16,

    /// Byte offset from `data_offset` to the start of the L4 (transport-layer) header.
    ///
    /// For TCP/UDP/SCTP. Set during L4 header parsing. Used by checksum offload
    /// (provides the checksum start offset to the NIC) and GRO coalescing.
    /// Value 0xFFFF means "not set".
    pub l4_offset: u16,

    /// Byte offset from `data_offset` to the start of the inner L3 header.
    ///
    /// Non-zero only for encapsulated packets (VXLAN, Geneve, GRE, IPIP).
    /// Used by GSO for tunnel segmentation offload and by XDP decap helpers.
    /// Value 0xFFFF means "not encapsulated".
    pub inner_l3_offset: u16,

    /// Byte offset from `data_offset` to the start of the inner L4 header.
    ///
    /// Non-zero only for encapsulated packets. Value 0xFFFF means "not encapsulated".
    pub inner_l4_offset: u16,

    // ---- Checksum state — still cache line 0 ----

    /// Checksum offload status. Determines whether software checksum verification
    /// or computation is needed.
    ///
    /// **RX path** (NIC to stack):
    /// - `None`: NIC did not verify checksum; software must verify.
    /// - `Unnecessary`: NIC verified the full L4 checksum; software can skip.
    /// - `Complete`: NIC computed a raw checksum over `[csum_start .. end]` and
    ///   stored it in `csum_value`. Software must fold and verify.
    ///
    /// **TX path** (stack to NIC):
    /// - `None`: Software computed the full checksum; NIC should not touch it.
    /// - `Partial`: Software filled the pseudo-header checksum; NIC must compute
    ///   the L4 checksum from `csum_start` for `csum_offset` bytes and write the
    ///   result at `csum_start + csum_offset`. This matches Linux's
    ///   `CHECKSUM_PARTIAL` semantics.
    pub checksum_status: ChecksumStatus,

    /// Byte offset from `data_offset` where checksum computation starts.
    ///
    /// Used with `ChecksumStatus::Partial` (TX) and `ChecksumStatus::Complete` (RX).
    /// For TX partial offload, this is the start of the L4 header.
    pub csum_start: u16,

    /// Byte offset from `csum_start` to the checksum field within the L4 header.
    ///
    /// Used with `ChecksumStatus::Partial` (TX). For TCP, this is 16 (offset of
    /// the checksum field in the TCP header). For UDP, this is 6.
    pub csum_offset: u16,

    /// Raw checksum value from hardware (RX `Complete` mode) or computed by
    /// software. Interpretation depends on `checksum_status`.
    pub csum_value: u32,

    // ---- VLAN — still cache line 0 ----

    /// 802.1Q VLAN tag. `vlan_present` indicates whether this field is valid.
    ///
    /// Format: bits [15:13] = PCP (priority), bit [12] = DEI, bits [11:0] = VID.
    /// This matches the on-wire 802.1Q TCI format.
    pub vlan_tci: u16,

    /// Whether `vlan_tci` contains a valid VLAN tag.
    ///
    /// Non-zero if: (a) the NIC extracted the VLAN tag via hardware offload (the tag
    /// was stripped from the frame and placed here), or (b) software VLAN processing
    /// parsed and extracted the tag. Zero for untagged frames.
    ///
    /// `u8` instead of `bool` because this field is memcpy'd across the Tier 1
    /// isolation boundary; a non-0/1 byte in a Rust `bool` is UB.
    pub vlan_present: u8, // 0 = absent, 1 = present

    // ---- Packet classification — end of cache line 0 ----

    /// IP protocol number from the (outer) L3 header. Set during L3 parsing.
    /// Values: 6 (TCP), 17 (UDP), 1 (ICMP), 58 (ICMPv6), 132 (SCTP), etc.
    /// 0 means "not yet parsed". u16 (not u8) to match `NetBufRingEntry`
    /// wire format — avoids narrowing conversion on ring serialization.
    /// IP protocol values fit in 0-255; the high byte is always zero.
    pub protocol: u16,

    /// Address family of the (outer) L3 header.
    ///
    /// `AddressFamily::Inet` for IPv4, `AddressFamily::Inet6` for IPv6.
    /// Set during L3 parsing. Used for routing table selection and BPF
    /// program dispatch.
    pub addr_family: AddressFamily,

    /// Packet direction and processing state flags.
    pub flags: NetBufFlags,

    // ---- Cache line 1: WARM — routing (forwarding/output path) ----

    /// Pointer to a slab-allocated routing lookup result. Populated by the first
    /// routing table lookup for this packet (L3 input or output path). Subsequent
    /// consumers (e.g., conntrack, firewall, forwarding) reuse the cached result
    /// without repeating the FIB lookup. `None` if routing has not been performed yet.
    ///
    /// **Cross-reference**: `RouteLookupResult`
    /// ([Section 16.6](#routing-table-fib-forwarding-information-base--route-lookup-algorithm)).
    /// The cached result includes the resolved next-hop, output interface, and MTU.
    ///
    /// **Extension pattern**: The result is stored as a pointer to a slab-allocated
    /// `RouteLookupResult` (8 bytes in the NetBuf) rather than by value (~88 bytes).
    /// This reduces NetBuf size from ~296 to ~256 bytes (4 cache lines instead of 5),
    /// improving cache utilization on the RX-to-socket fast path. The slab object is
    /// allocated from a dedicated `RouteLookupResult` slab cache on the first route
    /// lookup and freed when the NetBuf is freed.
    ///
    /// **TCP retransmit safety**: Cloned NetBufs in the retransmit queue can outlive
    /// RCU grace periods (retransmit timeout is seconds to minutes). The slab-allocated
    /// result is owned by the NetBuf (not reference-counted from the route table) — it
    /// is immune to route table RCU swaps. Clone cost is one slab alloc + ~80 bytes
    /// memcpy per TCP retransmit clone — acceptable because TCP clones are warm-path.
    ///
    /// **Staleness detection**: When a route changes (e.g., next-hop failover),
    /// packets in the retransmit queue still use the old route. This is correct
    /// because: (1) the old route was valid when the packet was first sent, (2)
    /// TCP retransmit will eventually fail if the route is truly dead (timeout),
    /// and (3) the socket's route is refreshed on the next `sendmsg()` path via
    /// `ip_route_output_flow()`. This matches Linux behavior.
    pub route_ext: Option<NonNull<RouteLookupResult>>,

    // ---- Cache line 1 cont.: WARM — GSO (offload paths) ----

    // **Lifecycle note (route_ext deallocation)**: When a NetBuf is freed via
    // `NetBufPool::return_slot()` (called by `NetBufHandle::Drop`) or
    // `NetBuf::free()` (direct free without handle), if `route_ext.is_some()`,
    // the `RouteLookupResult` is freed to the dedicated `route_cache_slab`
    // cache. `NetBuf` itself has no Rust `Drop` impl (slab-managed with
    // explicit lifecycle); the `NetBufHandle` wrapper provides automatic
    // cleanup via its `Drop` impl:
    //
    // ```
    // fn netbuf_free(buf: &mut NetBuf) {
    //     if let Some(route) = buf.route_ext.take() {
    //         // SAFETY: route was allocated from route_cache_slab.
    //         unsafe { slab_free(route_cache_slab, route.as_ptr()); }
    //     }
    //     // ... free data pages, return to per-CPU pool ...
    // }
    // ```

    /// GSO type. Non-zero if this NetBuf represents an aggregated super-packet
    /// that must be segmented before transmission (if the NIC does not support
    /// hardware TSO/USO) or was coalesced by GRO on the receive path.
    pub gso_type: GsoType,

    /// MSS (Maximum Segment Size) for GSO segmentation.
    ///
    /// When `gso_type != GsoType::None`, the packet must be split into segments
    /// of at most `gso_size` bytes of L4 payload each. The NIC (via TSO) or
    /// software GSO performs the segmentation. Value 0 when `gso_type == None`.
    pub gso_size: u16,

    /// Number of segments in this GSO packet.
    ///
    /// For GRO-coalesced packets, this is the count of original packets merged
    /// into this aggregate. Used for byte/packet accounting and for calculating
    /// the number of ACKs to expect. Value 0 when `gso_type == None`.
    pub gso_segs: u16,

    // ---- Segment queue link — still cache line 1 (warm) ----

    /// Intrusive linked-list pointer for TCP segment queues (`TcpSegQueue`).
    /// Used by the TCP out-of-order queue and retransmission queue to chain
    /// NetBufs in sequence-number order without heap allocation. `None` when
    /// the NetBuf is not in any segment queue.
    ///
    /// This field is separate from `gro_next` because GRO linking and TCP
    /// segment queuing operate in different lifecycle phases: GRO linking
    /// happens during NIC poll (before socket dispatch), while segment
    /// queuing happens after socket demux (inside the TCP state machine).
    /// A NetBuf is never in both a GRO chain and a segment queue simultaneously.
    pub next: Option<NonNull<NetBuf>>,

    // ---- GRO chain — still cache line 1 (warm) ----

    /// Pointer to the next NetBuf in a GRO (Generic Receive Offload) chain.
    ///
    /// During GRO coalescing in `NetRxContext`, packets belonging to the same
    /// flow are linked into a singly-linked list via this field. The GRO engine
    /// merges payload from chain members into the head NetBuf's data pages
    /// (or scatter-gather fragments) and updates `gso_segs` to reflect the
    /// total number of coalesced segments. After coalescing, `gro_next` is
    /// `None` on the delivered NetBuf (the chain is consumed). This field is
    /// only valid during GRO processing inside umka-net; it is always `None`
    /// for NetBufs outside the GRO path.
    pub gro_next: Option<NonNull<NetBuf>>,

    // ---- Cache lines 2-3: COLD — scatter-gather (multi-fragment packets only) ----

    /// Number of valid entries in `frags`. Range: 0 (linear-only packet) to
    /// `MAX_INLINE_FRAGS` for inline storage, or up to `frag_ext.len()` if
    /// the extension list is allocated.
    pub nr_frags: u8,

    /// Inline fragment storage for common cases (up to 6 fragments).
    ///
    /// Most packets have 0-3 fragments (linear header + 1-3 page fragments for
    /// payload). The inline array avoids a heap allocation for the common case.
    /// Fragments beyond `MAX_INLINE_FRAGS` (6) spill to `frag_ext`.
    pub frags: [NetBufFrag; MAX_INLINE_FRAGS],

    /// Extension fragment list for packets with more than `MAX_INLINE_FRAGS`
    /// fragments (e.g., large GSO aggregates with many page fragments).
    ///
    /// Heap-allocated via the slab allocator (Section 4.1) on demand. `None` for
    /// packets with 6 or fewer fragments. When present, `frags[0..MAX_INLINE_FRAGS]`
    /// holds the first 6 fragments and `frag_ext` holds the remainder.
    pub frag_ext: Option<SlabVec<NetBufFrag, MAX_SKB_FRAGS>>,

    // ---- Cache line 3 cont.: WARM — flow hash, timestamps, classification ----

    // Atomic reference count for the data pages (not a field of this struct,
    // hence a plain comment — a `///` block here would wrongly attach to
    // `flow_hash` in generated docs):
    //
    // Starts at 1 on allocation. Incremented by `NetBuf::clone_shared()` (zero-copy
    // clone). When it reaches 0, the data pages are returned to the DMA buffer pool.
    // The `NetBuf` struct itself is always singly-owned and freed to its CPU's pool
    // independently of the data refcount.
    //
    // **Implementation**: This is a shared atomic counter that lives in the
    // `DmaBufferHandle`'s metadata region. Multiple cloned NetBufs point to the
    // same counter; it is accessed via `data_handle.refcount()`.

    /// Hash value computed over the packet's flow key (src/dst IP, src/dst port,
    /// protocol). Used for:
    /// - Receive flow steering (RFS): selecting the CPU queue
    /// - Conntrack bucket selection
    /// - ECMP next-hop selection (consistent hashing)
    /// - Socket demultiplexing
    ///
    /// Computed once (by the NIC via RSS hardware hash, or by software during L3/L4
    /// parsing) and reused by all consumers. Value 0 means "not computed".
    pub flow_hash: u32,

    /// RSS hash type indicating which packet fields were hashed to produce
    /// `flow_hash`. Matches `NetBufRingEntry.hash_type` on the KABI wire format.
    pub hash_type: RssHashType,

    /// Timestamp of packet arrival (RX) or queuing (TX), in nanoseconds since boot
    /// (CLOCK_MONOTONIC_RAW). Set by the NIC driver from hardware timestamping if
    /// available, otherwise set by umka-net from the kernel clock at first touch.
    /// Used for RTT estimation, packet scheduling (pacing), and SO_TIMESTAMPNS.
    pub timestamp_ns: u64,

    /// Network interface index on which this packet was received (RX) or will be
    /// transmitted (TX). Indexes into the per-namespace interface table
    /// (`NetNamespace::interfaces`, Section 17.1.1). Set by the NIC driver on RX;
    /// set by routing on TX.
    pub ifindex: u32,

    /// NUMA node of the CPU that allocated this NetBuf. Used for NUMA-aware
    /// freeing: when a NetBuf is freed on a different NUMA node than where it was
    /// allocated, it is returned to a cross-node return magazine (Section 4.1) rather
    /// than the local CPU's pool, to avoid remote memory access on the next alloc.
    pub alloc_numa_node: u16,

    // NOTE: mark, conntrack_idx, priority are COLD classification metadata —
    // touched only by netfilter/tc paths, not on the common RX-to-socket fast path.
    // They are metadata tags with protocol-defined semantics (netfilter mark,
    // conntrack index, IP TOS/DSCP). They survive Evolvable evolution because their
    // values are set by protocol standards, not by replaceable policy algorithms.

    /// Mark value (equivalent to Linux `skb->mark`). Set by iptables/nftables MARK
    /// target (translated to BPF), policy routing rules, or `SO_MARK` socket option.
    /// Used for routing table selection (policy routing, Section 16.5) and traffic
    /// classification (tc, QoS).
    pub mark: u32,

    /// Connection tracking reference. Index into the conntrack hash table
    /// ([Section 16.18](#packet-filtering-bpf-based)). `CONNTRACK_UNTRACKED` (u32::MAX) means
    /// this packet is not tracked.
    /// Populated by the prerouting conntrack BPF hook. Used by NAT and stateful
    /// firewall rules.
    pub conntrack_idx: u32,

    /// Priority / traffic class. Used by the QoS/tc layer for queue selection.
    /// Initialized from the IP TOS/DSCP field or from `SO_PRIORITY`.
    pub priority: u32,
}

// NetBuf must fit in 4 cache lines (256 bytes) for cache-efficient packet processing.
// If the struct exceeds this limit, reconsider field layout or move cold fields to
// a slab-allocated extension.
const_assert!(core::mem::size_of::<NetBuf>() <= 256);
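
// The offset invariant above (`head_offset <= data_offset <= tail_offset <=
// end_offset`) and the push/pull/put operations described in the field docs
// can be modeled in isolation. Illustrative sketch only: `OffsetDemo` and its
// helpers are not part of the specified `NetBuf` API.
struct OffsetDemo { head: u32, data: u32, tail: u32, end: u32 }

impl OffsetDemo {
    fn headroom(&self) -> u32 { self.data - self.head }
    fn tailroom(&self) -> u32 { self.end - self.tail }
    fn len(&self) -> u32 { self.tail - self.data }

    /// Claim `n` bytes of headroom for a prepended header (e.g. tunnel encap).
    fn push(&mut self, n: u32) -> Result<(), ()> {
        if self.headroom() < n { return Err(()); } // caller must realloc
        self.data -= n;
        Ok(())
    }

    /// Consume an `n`-byte header after parsing it.
    fn pull(&mut self, n: u32) -> Result<(), ()> {
        if self.len() < n { return Err(()); }
        self.data += n;
        Ok(())
    }

    /// Append `n` bytes of payload into tailroom.
    fn put(&mut self, n: u32) -> Result<(), ()> {
        if self.tailroom() < n { return Err(()); }
        self.tail += n;
        Ok(())
    }
}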

impl NetBuf {
    /// Returns true if a valid VLAN tag is present.
    pub fn has_vlan(&self) -> bool { self.vlan_present != 0 }
}
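
// Decoding the 802.1Q TCI layout documented on `vlan_tci` (bits [15:13] PCP,
// bit [12] DEI, bits [11:0] VID). Standalone sketch; `split_vlan_tci` is an
// illustrative helper, not part of the specified API.
fn split_vlan_tci(tci: u16) -> (u8, bool, u16) {
    let pcp = (tci >> 13) as u8;    // priority code point
    let dei = (tci >> 12) & 1 == 1; // drop eligible indicator
    let vid = tci & 0x0FFF;         // VLAN identifier
    (pcp, dei, vid)
}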

/// Owning handle to a `NetBuf` slab slot. The handle represents exclusive
/// ownership of the underlying `NetBuf` metadata and its DMA data pages.
/// When the handle is dropped, the slab slot is returned to the pool and
/// the data page refcount is decremented (freeing the DMA buffer if this
/// was the last reference).
///
/// **Move-only**: `NetBufHandle` has no `Copy` or `Clone` impl. It follows
/// the same ownership model as `Box<T>`: exactly one owner at a time, drop
/// frees. This prevents double-free of slab slots. For refcounted sharing,
/// use `NetBuf::clone_shared()` to allocate a new slab slot with shared
/// data pages, then `NetBufPool::handle_for()` on the clone.
///
/// Used to pass buffer ownership across ring buffer boundaries without
/// copying the full ~256-byte `NetBuf` struct. Encodes the DMA pool index
/// and slot offset for O(1) pointer reconstruction without storing a raw
/// pointer (avoids KASLR leaks in ring buffers).
///
/// **Borrowing the underlying NetBuf**: Use `handle.peek()` for `&NetBuf`
/// or `handle.peek_mut()` for `&mut NetBuf` without consuming the handle.
/// These delegate to `NetBufPool::claim()`/`claim_mut()` with generation
/// validation.
///
/// Explicit 16-byte layout (with `#[repr(C)]`):
///   bytes 0-1:  pool_id (u16)
///   bytes 2-3:  _pad0 (explicit, aligns slot_idx to 4 bytes)
///   bytes 4-7:  slot_idx (u32)
///   bytes 8-11: generation (u32)
///   bytes 12-15: _pad1 (explicit, pads to 16 bytes for ring buffer alignment)
/// 16 bytes total → 256 handles per 4KB page.
#[derive(Debug)]
#[repr(C)]
pub struct NetBufHandle {
    /// DMA pool index (selects which pool this handle refers to).
    pub pool_id: u16,
    /// Explicit padding to align `slot_idx` to a 4-byte boundary.
    pub _pad0: [u8; 2],
    /// Slot index within the pool's backing slab.
    pub slot_idx: u32,
    /// Generation counter matching the pool slot's generation (prevents
    /// stale handle use after buffer recycle).
    ///
    /// **Longevity**: At 100 Gbps with 1500-byte packets (~8.3M packets/sec),
    /// the pool-wide generation counter wraps u32 in ~515 seconds; a specific
    /// `(slot_idx, generation)` pair recurs only after ~37 hours (the pool
    /// has ~65K slots at typical sizing). Safety relies on the invariant that
    /// NetBuf handles MUST NOT outlive an RCU grace period (~10 ms), which is
    /// 5+ orders of magnitude below either wrap time; any handle old enough
    /// to observe a recycled `(slot_idx, generation)` pair is a bug by
    /// construction, so u32 is sufficient.
    pub generation: u32,
    /// Explicit padding to bring the struct to exactly 16 bytes for
    /// ring buffer alignment (256 handles per 4KB page).
    pub _pad1: [u8; 4],
}
// kernel-internal, not KABI: pool_id(2)+_pad0(2)+slot_idx(4)+generation(4)+_pad1(4) = 16 bytes.
const_assert!(core::mem::size_of::<NetBufHandle>() == 16);
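
// The explicit 16-byte layout documented above can be exercised with a
// little-endian serialization round-trip. Sketch only: the actual ring
// serialization code is specified elsewhere; this merely demonstrates the
// documented byte offsets (pool_id at 0, slot_idx at 4, generation at 8,
// both padding regions zeroed).
fn encode_handle_bytes(pool_id: u16, slot_idx: u32, generation: u32) -> [u8; 16] {
    let mut b = [0u8; 16];
    b[0..2].copy_from_slice(&pool_id.to_le_bytes());     // bytes 0-1
    // bytes 2-3: _pad0 (left zeroed)
    b[4..8].copy_from_slice(&slot_idx.to_le_bytes());    // bytes 4-7
    b[8..12].copy_from_slice(&generation.to_le_bytes()); // bytes 8-11
    // bytes 12-15: _pad1 (left zeroed)
    b
}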

/// `NetBufHandle` is move-only (no `Copy`, no `Clone`). It represents exclusive
/// ownership of a `NetBuf` slab slot. When the handle is dropped, the underlying
/// slab slot (and its DMA data pages, if refcount reaches zero) is returned to
/// the per-CPU `NetBufPool`.
///
/// **Why not `Copy`**: A `Copy` handle would allow double-free — two handles to
/// the same slab slot, each returning it to the pool on drop. This is the same
/// semantic as `Box<T>`: move-only, drop frees.
///
/// **Why not `Clone`**: Duplicating a handle requires allocating a new `NetBuf`
/// slab slot and incrementing the data page refcount. Use `NetBuf::clone_shared()`
/// followed by `NetBufPool::handle_for()` for explicit refcounted duplication.
impl Drop for NetBufHandle {
    fn drop(&mut self) {
        // Return the underlying NetBuf metadata slot to its pool.
        // The pool is identified by self.pool_id; the slot by self.slot_idx.
        // The generation counter in the slot header is incremented on return,
        // invalidating any stale copies (defense-in-depth for ring entries
        // that were memcpy'd before this handle was freed).
        //
        // If the data page refcount (`DmaBufferHandle::refcount()`) reaches
        // zero after decrement, the DMA buffer is also freed. If the data
        // is shared (`SHARED_DATA` flag on the NetBuf), only the refcount
        // is decremented; the last holder frees the DMA pages.
        //
        // SAFETY: pool_id and slot_idx were set during handle_for() and are
        // valid for the lifetime of the handle. The generation counter is
        // checked inside pool_return() as defense-in-depth.
        //
        // Safe to call from any CPU — cross-CPU returns go through the
        // pool's cross-node return list (same as NetBuf::free()).
        if let Some(pool) = SHARED_NETBUF_POOLS.load(self.pool_id as u64) {
            pool.return_slot(self.slot_idx, self.generation);
        }
        // If the pool is not found (driver was unloaded), the slot is leaked.
        // This is a crash-recovery edge case: the pool is destroyed only after
        // all outstanding handles have been drained (NIC shutdown quiesces TX).
    }
}

impl NetBufHandle {
    /// Borrow the underlying `NetBuf` metadata without consuming the handle.
    ///
    /// Returns a shared reference to the `NetBuf` stored in the pool's slab
    /// slot. The handle retains ownership; the reference borrows the handle's
    /// lifetime. Used by `validate_xmit()` to read GSO fields before deciding
    /// whether to pass the handle through or segment.
    ///
    /// Returns `None` if the handle is stale (generation mismatch) or the pool
    /// is not found (driver unloaded during crash recovery).
    ///
    /// **Hot-path**: O(1) — XArray load (pool registry) + pointer arithmetic
    /// (slot index) + generation comparison. No locks, no allocation.
    pub fn peek(&self) -> Option<&NetBuf> {
        let pool = SHARED_NETBUF_POOLS.load(self.pool_id as u64)?;
        pool.claim(self)
    }

    /// Borrow the underlying `NetBuf` metadata mutably without consuming the
    /// handle.
    ///
    /// Returns a mutable reference to the `NetBuf` stored in the pool's slab
    /// slot. The caller must ensure exclusive access (no concurrent readers).
    /// Used by `gso_segment()` to read packet data and headers during
    /// software segmentation.
    ///
    /// Returns `None` if the handle is stale or the pool is not found.
    ///
    /// **Safety**: The caller must ensure no other handle to the same slot
    /// is concurrently accessing the NetBuf. In the TX path, each NetBuf is
    /// owned by exactly one handle at a time (move-only semantics), so this
    /// is guaranteed.
    pub fn peek_mut(&mut self) -> Option<&mut NetBuf> {
        let pool = SHARED_NETBUF_POOLS.load(self.pool_id as u64)?;
        pool.claim_mut(self)
    }
}
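
// A minimal model of the generation check behind `peek()`/`claim()`.
// Assumption: the pool is reduced to a flat slot array here; the real
// `NetBufPool` is slab-backed and per-CPU. The point is the O(1)
// stale-handle rejection.
struct DemoSlot { generation: u32, in_use: bool }
struct DemoPool { slots: Vec<DemoSlot> }

impl DemoPool {
    /// Index into the slot array, then compare generations. A stale handle
    /// (slot recycled since the handle was minted) yields `None`.
    fn claim(&self, slot_idx: u32, generation: u32) -> Option<&DemoSlot> {
        let slot = self.slots.get(slot_idx as usize)?;
        if !slot.in_use || slot.generation != generation {
            return None; // free slot or generation mismatch
        }
        Some(slot)
    }
}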

/// Maximum number of scatter-gather fragments stored inline in `NetBuf::frags`.
///
/// Most packets need few fragments: an MTU-sized packet uses 1-2 pages, and
/// typical GSO aggregates use 1 linear header region plus a handful of page
/// fragments. A maximal 64KB GSO packet on 4KB pages can need up to 16
/// payload fragments (ceil(64KB / 4KB) = 16), which spill to `frag_ext`.
/// 6 inline fragments covers >99% of packets while keeping `NetBuf` at
/// ~256 bytes (4 cache lines; competitive with Linux's `sk_buff` at ~240
/// bytes, with the route cache moved to a slab-allocated extension via
/// `route_ext`).
pub const MAX_INLINE_FRAGS: usize = 6;

/// Maximum total scatter-gather fragments per packet (inline + overflow).
/// Matches Linux's `MAX_SKB_FRAGS` (65536 / PAGE_SIZE + 1 = 17 on 4KB pages).
/// Used as the const generic capacity for `SlabVec` overflow storage.
pub const MAX_SKB_FRAGS: usize = 17;

/// Scatter-gather fragment: a reference to a contiguous region within a DMA buffer.
///
/// Each fragment represents a page (or page range) of packet data that is not
/// contiguous with the linear buffer. Fragments are used for:
/// - TCP zero-copy receive: userspace pages are directly referenced as fragments
/// - GRO coalescing: appended packet payloads become fragments
/// - sendfile()/splice(): file pages are attached as fragments without copying
#[repr(C)]
pub struct NetBufFrag {
    /// DMA buffer handle for this fragment's data pages.
    ///
    /// May be the same as `NetBuf::data_handle` (different region of the same
    /// DMA allocation) or a completely separate DMA buffer. The handle's refcount
    /// is incremented when the fragment is attached and decremented when removed.
    pub handle: DmaBufferHandle,

    /// Byte offset within the DMA buffer where this fragment's data begins.
    pub offset: u32,

    /// Length of this fragment's data in bytes.
    pub length: u32,
}
// kernel-internal, not KABI: handle(16) + offset(4) + length(4) = 24 bytes.
const_assert!(core::mem::size_of::<NetBufFrag>() == 24);

/// RSS hash type indicating which packet fields the NIC hashed.
/// Matches the `hash_type` field in `NetBufRingEntry` wire format.
#[repr(u8)]
pub enum RssHashType {
    /// No RSS hash computed.
    None    = 0,
    /// Hashed over L3 addresses only (src/dst IP).
    L3      = 1,
    /// Hashed over L3 addresses + L4 ports (src/dst IP + src/dst port).
    L4      = 2,
}

/// Checksum offload status (matches Linux CHECKSUM_* semantics).
#[repr(u8)]
pub enum ChecksumStatus {
    /// No checksum information. Software must compute/verify.
    None = 0,
    /// Hardware verified the checksum is correct (RX). Software may skip verification.
    /// Equivalent to Linux `CHECKSUM_UNNECESSARY`.
    Unnecessary = 1,
    /// Hardware computed a checksum over the entire packet data (RX). The raw
    /// value is in `NetBuf::csum_value`. Software must fold it and verify against
    /// the pseudo-header. Equivalent to Linux `CHECKSUM_COMPLETE`.
    Complete = 2,
    /// Software filled the pseudo-header checksum; hardware must complete the L4
    /// checksum computation (TX). `csum_start` and `csum_offset` specify the
    /// computation range. Equivalent to Linux `CHECKSUM_PARTIAL`.
    Partial = 3,
}
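
For TX, `Partial` means the stack has filled in the pseudo-header sum and `csum_start`/`csum_offset` tell the NIC where to finish. A minimal self-contained sketch of how those two fields would be derived for a plain Ethernet + IPv4 packet (the helper name and fixed header sizes are illustrative, not part of this spec):

```rust
/// Illustrative helper (not from the spec): derive the CHECKSUM_PARTIAL
/// fields for an Ethernet + IPv4 packet with no IP options.
///
/// `csum_start` is where hardware begins summing (the L4 header), and
/// `csum_offset` is where it stores the result within that range: the
/// checksum field sits at byte 16 of a TCP header and byte 6 of UDP.
fn partial_csum_fields(l4_offset: u16, is_tcp: bool) -> (u16, u16) {
    let csum_start = l4_offset;
    let csum_offset = if is_tcp { 16 } else { 6 };
    (csum_start, csum_offset)
}

fn main() {
    // Ethernet (14) + IPv4 without options (20) puts the L4 header at offset 34.
    let (start, off) = partial_csum_fields(14 + 20, true);
    println!("csum_start={start} csum_offset={off}"); // csum_start=34 csum_offset=16
}
```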

/// GSO (Generic Segmentation Offload) type.
///
/// Identifies the segmentation algorithm needed when software GSO must split
/// a super-packet into MTU-sized frames. Hardware TSO/USO supersedes software
/// GSO when the NIC reports the corresponding offload capability.
#[repr(u8)]
pub enum GsoType {
    /// Not a GSO packet. No segmentation needed.
    None = 0,
    /// TCP segmentation (TSO). Split at MSS boundaries, rewrite TCP sequence
    /// numbers and checksums per segment.
    TcpV4 = 1,
    /// TCP segmentation for IPv6.
    TcpV6 = 2,
    /// UDP fragmentation offload (UFO). Split at MSS boundaries, generate
    /// IP fragments (IPv4) or fragment extension headers (IPv6).
    Udp = 3,
    /// TCP segmentation for tunnel-encapsulated packets. Outer and inner
    /// headers are rewritten per segment.
    TcpTunnel = 4,
    /// UDP segmentation for tunnel-encapsulated packets.
    UdpTunnel = 5,
    /// GRO partial: a GRO-coalesced packet that was only partially merged
    /// (different IP IDs or non-contiguous sequence numbers). Must be
    /// re-segmented before delivery if the receiver cannot handle it.
    GroPartial = 6,
    /// GRE tunnel segmentation. Outer GRE header is replicated per segment;
    /// inner payload is split at MSS boundaries. Used by GRE tunnel offload
    /// when the NIC supports TSO over GRE or software GSO handles it.
    Gre = 7,
}
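
The `gso_segs` count carried alongside `gso_size` in the ring entry is just the L4 payload divided by the MSS, rounded up. A tiny sketch (the function name is ours):

```rust
/// Illustrative: number of segments a software GSO pass would emit for
/// `payload_len` bytes of L4 payload at the given MSS.
fn gso_seg_count(payload_len: u32, mss: u32) -> u32 {
    assert!(mss > 0);
    payload_len.div_ceil(mss) // ceil(payload / mss)
}

fn main() {
    // A 64KB TCP super-packet at MSS 1448 splits into 46 segments.
    println!("{}", gso_seg_count(65536, 1448)); // 46
}
```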

bitflags! {
    /// Packet processing flags.
    pub struct NetBufFlags: u32 {
        /// Packet is locally generated (TX), not forwarded.
        const LOCAL_OUT       = 1 << 0;
        /// Packet is destined for local delivery (RX), not forwarded.
        const LOCAL_IN        = 1 << 1;
        /// Packet is being forwarded (neither locally generated nor locally destined).
        const FORWARDED       = 1 << 2;
        /// Packet data pages are in the RDMA-eligible pool (Section 5.4.3).
        /// Can be used directly for RDMA operations without re-registration.
        const RDMA_ELIGIBLE   = 1 << 3;
        /// Packet was decapsulated (tunnel outer headers stripped).
        const DECAPPED        = 1 << 4;
        /// Packet requires encryption before transmission (IPsec or WireGuard).
        const NEEDS_ENCRYPT   = 1 << 5;
        /// Packet has been decrypted (IPsec or WireGuard).
        const DECRYPTED       = 1 << 6;
        /// XDP metadata area is valid (contains XDP metadata prepended by the driver).
        const XDP_META_VALID  = 1 << 7;
        /// Data pages are shared (refcount > 1). Write operations must
        /// copy-on-write to avoid corrupting other consumers.
        const SHARED_DATA     = 1 << 8;
        /// Packet is a clone created by `clone_shared()`. The data pages are
        /// shared with the original; metadata is independently owned.
        const CLONED          = 1 << 9;
        /// Software GSO segmentation is needed before TX (NIC lacks TSO support).
        const NEEDS_GSO       = 1 << 10;
    }
}

16.5.1 NetBufPool: Per-CPU Slab Pool

/// Per-CPU pool for `NetBuf` metadata struct allocation.
///
/// Each CPU maintains its own slab of pre-allocated `NetBuf` structs. The fast path
/// (alloc/free) is a single pointer swap with no locks, no atomics, and no cross-CPU
/// traffic — the pool pointer lives in the `CpuLocalBlock::slab_magazines` array
/// (Section 3.1.2).
///
/// **Capacity**: Each CPU's magazine holds a configurable number of NetBufs (default:
/// 256). When a magazine is exhausted, the CPU requests a new slab page from the
/// global slab allocator (Section 4.1, one atomic increment). The slab page holds
/// `PAGE_SIZE / size_of::<NetBufHandle>()` handles (e.g., 4096 / 16 = 256 handles per
/// 4KB page; `NetBufHandle` is 16 bytes: pool-id + slot-index + generation).
///
/// **NAPI batch integration**: During NAPI poll (Section 16.9), the driver allocates
/// a batch of up to 64 NetBufs at once via `NetBufPool::alloc_batch()`. This amortizes
/// any fallback to the global allocator across the entire batch. The batch is processed
/// within a single NAPI poll cycle — no NetBuf from the batch outlives the poll call.
///
/// **NUMA awareness**: NetBufs freed on a different NUMA node than their allocation
/// node are placed on a cross-node return list (per-CPU, per-source-node). The return
/// list is drained back to the origin node's pool in batches of 32 during idle time
/// or when the local pool is full. This prevents NUMA-remote memory from accumulating
/// on a local CPU's free list.
///
/// **Cross-reference**: Section 4.1 (slab allocator), Section 3.1.2 (CpuLocalBlock),
/// NAPI batching (Section 16.9).
pub struct NetBufPool {
    // Implementation is the standard slab magazine pattern from Section 4.1.
    // No additional fields are specified here — NetBufPool is a type alias for
    // `SlabPool<NetBuf>` with the NetBuf-specific size class.
}
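
The magazine fast path described above can be modeled in a few lines of safe Rust. This toy model (all names invented here, not the kernel implementation) shows the shape of the alloc/free path: pop from a local free list, and refill a whole batch from the global allocator only on exhaustion:

```rust
/// Toy model of the per-CPU magazine pattern: a local stack of free slot
/// indices, refilled in bulk from a stand-in "global" allocator so the
/// global trip is amortized over many allocations.
struct Magazine {
    local: Vec<u32>,    // free slots cached on this "CPU"
    global_next: u32,   // stand-in for the global slab allocator
    refill_batch: usize,
}

impl Magazine {
    fn alloc(&mut self) -> u32 {
        if self.local.is_empty() {
            // Slow path: one trip to the global allocator, amortized
            // over `refill_batch` future allocations.
            for _ in 0..self.refill_batch {
                self.local.push(self.global_next);
                self.global_next += 1;
            }
        }
        self.local.pop().unwrap() // fast path: a single pop, no sharing
    }

    fn free(&mut self, slot: u32) {
        self.local.push(slot) // fast path: a single push
    }
}

fn main() {
    let mut m = Magazine { local: Vec::new(), global_next: 0, refill_batch: 4 };
    let a = m.alloc(); // triggers one refill of 4 slots
    let b = m.alloc(); // served entirely from the local magazine
    m.free(a);
    m.free(b);
    println!("cached={} next={}", m.local.len(), m.global_next); // cached=4 next=4
}
```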

16.5.2 NetBufRingEntry: KABI Wire Format

When a NetBuf crosses the isolation domain boundary via the KABI ring buffer (Tier 1 NIC drivers), the full NetBuf struct cannot be passed directly — it contains pointers (SlabVec overflow, Option fields) that are only valid in the sending domain's address space. Instead, a flattened, fixed-size NetBufRingEntry is serialized to the ring.

/// Flattened, fixed-size representation of NetBuf metadata for the KABI ring.
///
/// Contains only the fields needed by the receiving domain. Pointer-based fields
/// (SlabVec overflow, private_data) are excluded — the receiver reconstructs
/// its own `NetBuf` from this entry plus the shared DMA data handle.
///
/// Size: 128 bytes (2 cache lines). Fits exactly in one ring slot.
///
/// Scatter-gather: 2 inline `NetBufRingFrag` entries cover the vast majority of
/// packets including jumbo frames (9KB MTU with 4KB pages = data_handle + 2
/// frags). Each frag carries `(DmaFragRef, offset, length)` — sufficient to
/// reconstruct the full scatter-gather list (pool_id and iova_base are
/// inherited from `data_handle`). Packets with >2 frags (GRO super-packets,
/// TSO segments) use continuation entries: the next ring slot carries
/// `NETBUF_FLAG_FRAG_CONTINUATION` and contains up to 7 additional
/// `NetBufRingFrag` entries. Producer writes all slots before advancing tail;
/// consumer detects continuation via the flag and collects all frags
/// before constructing the NetBuf.
// KABI: stable wire format, crosses the Tier 1 isolation boundary
#[repr(C, align(64))]
pub struct NetBufRingEntry {
    // ---- Data region ----
    pub data_handle: DmaBufferHandle,    // 16 bytes: pool-id + offset + gen
    pub head_offset: u32,                // 4
    pub data_offset: u32,                // 4
    pub tail_offset: u32,                // 4
    pub end_offset: u32,                 // 4
    // subtotal: 32 bytes

    // ---- Protocol metadata ----
    pub l2_offset: i16,                  // 2
    pub l3_offset: u16,                  // 2
    pub l4_offset: u16,                  // 2
    pub inner_l3_offset: u16,            // 2
    pub inner_l4_offset: u16,            // 2
    pub protocol: u16,                   // 2  IP protocol number (6=TCP, 17=UDP)
    pub addr_family: u16,                // 2  AF_INET=2, AF_INET6=10
    // subtotal: 14 bytes (running: 46)

    // ---- Checksum state ----
    pub csum_status: u8,                 // 1  ChecksumStatus enum
    pub _pad0: u8,                       // 1  explicit padding (repr(C): u8 before u16)
    pub csum_start: u16,                 // 2
    pub csum_offset: u16,                // 2
    pub csum_value: u32,                 // 4  full hardware checksum (no truncation)
    // subtotal: 10 bytes (running: 56)
    // Note: csum_value is naturally aligned at offset 52 (after 2×u16 = 4 bytes
    // from offset 48); no explicit padding needed between csum_offset and csum_value.

    // ---- Offload metadata ----
    pub hash: u32,                       // 4  RSS hash from NIC
    pub hash_type: u8,                   // 1  RssHashType
    pub vlan_present: u8,                // 1  (moved next to hash_type to avoid padding)
    pub vlan_tci: u16,                   // 2
    // subtotal: 8 bytes (running: 64)

    // ---- Cgroup metadata (stamped by Tier 0 before ring submission) ----
    pub cgroup_classid: u32,             // 4  cgroup v2 eBPF classification tag (set by
                                         //    BPF_PROG_TYPE_CGROUP_SKB programs attached to the
                                         //    task's cgroup; net_cls v1 controller is NOT
                                         //    implemented — see [Section 17.2](17-containers.md#control-groups))
    // subtotal: 4 bytes (running: 68)
    // Note: cgroup_classid is naturally aligned at offset 64 (after vlan_tci u16
    // ends at offset 64); no explicit padding needed.

    // ---- GSO (Generic Segmentation Offload) ----
    pub gso_type: u8,                    // 1  GsoType enum (TSO, UFO, GRE, etc.)
    pub _pad1: u8,                       // 1  explicit padding (repr(C): u8 before u16)
    pub gso_size: u16,                   // 2  MSS for segmentation
    pub gso_segs: u16,                   // 2  number of segments
    // subtotal: 6 bytes (running: 74)

    // ---- Scatter-gather (inline frags) ----
    pub nr_frags: u8,                    // 1  total frag count (including continuations)
    pub _pad2: u8,                       // 1  explicit padding (repr(C): u8 before u32-aligned frags)
    pub frags: [NetBufRingFrag; 2],      // 24 bytes (handle + offset + length per frag)
    // subtotal: 26 bytes (running: 100)

    // ---- Flags ----
    pub flags: u32,                      // 4  NetBufFlags bitmask
    // running: 104

    // ---- Timestamp ----
    pub timestamp_ns: u64,               // 8  hardware or software timestamp (ns)
    // running: 112

    // ---- Routing / classification ----
    /// Ingress/egress interface index. Identifies the NIC this packet was
    /// received from (RX) or will be transmitted on (TX). Required by
    /// umka-net for routing table lookup, netfilter interface matching,
    /// and per-interface statistics.
    pub ifindex: u32,                    // 4
    /// Firewall / QoS mark. Set by BPF programs (bpf_skb_set_mark),
    /// netfilter rules, or traffic classification. Propagated across the
    /// isolation boundary so umka-net can apply mark-based routing
    /// (`ip rule fwmark`) and qdisc classification.
    pub mark: u32,                       // 4
    // running: 120

    // ---- TX queue selection and priority ----
    /// Hardware TX queue index, pre-computed by umka-net's qdisc layer from
    /// the packet's priority→queue mapping before the NetBuf crosses the KABI
    /// ring to the NIC driver. The NIC driver uses this directly for hardware
    /// TX queue selection without needing qdisc configuration knowledge or
    /// re-doing the priority→queue mapping. Set to 0 for RX entries.
    ///
    /// **Design decision**: `queue_index` is the pre-computed result, not the
    /// raw priority. The NIC driver needs the queue number, not the policy
    /// input. However, `priority` is also carried (below) because:
    /// (1) `SO_PRIORITY` reporting via ethtool/netlink needs the original value,
    /// (2) DCB/802.1Qaz ETS requires mapping priority→traffic class on the NIC,
    /// (3) VLAN PCP insertion needs the raw priority, not the queue index.
    pub queue_index: u16,                // 2
    /// Raw packet priority (from `NetBuf.priority`). Carries the SO_PRIORITY /
    /// skb->priority value across the KABI ring. Used by DCB-capable NICs for
    /// 802.1p PCP mapping and by netlink stats for per-priority accounting.
    /// Range: 0-15 (TC_PRIO_MAX). Values >15 clamped to 15.
    pub priority: u16,                   // 2

    // ---- Padding to 128 bytes ----
    pub _reserved: [u8; 4],              // alignment padding to 128 bytes
    // Total: 128 bytes (2 cache lines, exact multiple of align(64))
}

const_assert!(core::mem::size_of::<NetBufRingEntry>() == 128);
const_assert!(core::mem::align_of::<NetBufRingEntry>() == 64);

/// A single scatter-gather fragment descriptor for KABI ring transport.
///
/// Unlike the in-kernel `NetBufFrag` which carries a full `DmaBufferHandle`
/// (16 bytes), the ring-entry variant uses a compact `DmaFragRef` (4 bytes)
/// — the pool_id and iova_base are inherited from the parent
/// `NetBufRingEntry.data_handle`. This is a (ref, offset, length) triple
/// sufficient to reconstruct the scatter-gather list on the receiving side
/// of the isolation boundary.
#[repr(C)]
pub struct NetBufRingFrag {
    pub handle: DmaFragRef,              // 4 bytes: compact DMA fragment reference
    pub offset: u32,                     // 4 bytes: byte offset within the DMA page
    pub length: u32,                     // 4 bytes: byte length of this fragment
}
// KABI ring entry fragment: handle(4) + offset(4) + length(4) = 12 bytes.
const_assert!(core::mem::size_of::<NetBufRingFrag>() == 12);

/// Continuation entry for packets with >2 scatter-gather fragments.
///
/// Carries up to 7 additional `NetBufRingFrag` entries (handle + offset + length
/// per fragment). Identified by `NETBUF_FLAG_FRAG_CONTINUATION` in flags.
/// Multiple continuations can be chained for extreme scatter-gather
/// (e.g., 64KB TSO with 4KB pages).
#[repr(C, align(64))]
pub struct NetBufRingContinuation {
    pub flags: u32,                      // 4  must include NETBUF_FLAG_FRAG_CONTINUATION
    pub nr_frags: u8,                    // 1  frags in THIS entry (1..=7)
    pub _pad: [u8; 11],                  // 11  padding
    pub frags: [NetBufRingFrag; 7],      // 84 bytes (7 × 12)
    // subtotal: 4 + 1 + 11 + 84 = 100 bytes
    pub _pad2: [u8; 28],                 // 28 bytes padding to 128
    // total: 128 bytes (2 cache lines, matches ring slot size)
}
// KABI: 4+1+11+84+28 = 128 bytes. align(64) satisfied (128 = 2 × 64).
const_assert!(core::mem::size_of::<NetBufRingContinuation>() == 128);

/// Flag bits for NetBufRingEntry.flags and NetBufRingContinuation.flags.
pub const NETBUF_FLAG_FRAG_CONTINUATION: u32 = 1 << 0;
pub const NETBUF_FLAG_VLAN_VALID: u32        = 1 << 1;
pub const NETBUF_FLAG_CSUM_VALID: u32        = 1 << 2;
pub const NETBUF_FLAG_GSO: u32               = 1 << 3;
pub const NETBUF_FLAG_GRO_COALESCED: u32     = 1 << 4;

Serialization: NetBuf::to_ring_entry() copies the relevant fields. If the packet has ≤2 scatter-gather fragments, a single 128-byte NetBufRingEntry suffices. If >2 frags, the producer writes the primary entry followed by one or more NetBufRingContinuation entries (each carrying up to 7 additional frag handles), then advances the ring tail atomically past all entries. The nr_frags field in the primary entry holds the total frag count (including those in continuations).
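
Under this layout (2 inline frags in the primary entry, up to 7 per continuation), the number of ring slots a packet occupies is a simple ceiling computation. A self-contained sketch (the helper name is ours):

```rust
const INLINE_FRAGS: u32 = 2;          // frags carried in the primary NetBufRingEntry
const FRAGS_PER_CONTINUATION: u32 = 7; // frags per NetBufRingContinuation

/// Illustrative: total ring slots (primary entry + continuation entries)
/// needed to serialize a packet with `nr_frags` scatter-gather fragments.
fn ring_slots_needed(nr_frags: u32) -> u32 {
    let overflow = nr_frags.saturating_sub(INLINE_FRAGS);
    1 + overflow.div_ceil(FRAGS_PER_CONTINUATION)
}

fn main() {
    // 0-2 frags fit the primary entry; MAX_SKB_FRAGS (17) needs 3 continuations.
    for n in [0, 2, 3, 9, 10, 17] {
        println!("{n} frags -> {} slots", ring_slots_needed(n));
    }
}
```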

Deserialization: NetBuf::from_ring_entry(entry: &NetBufRingEntry) -> NetBuf allocates a new NetBuf from the local CPU's pool and populates it from the entry. Per-field reconstruction steps:

1. Allocate a NetBuf from the local CPU's NetBufPool (pool.alloc()).
2. Copy hot fields from NetBufRingEntry to NetBuf: data_handle, data_offset, data_len, protocol, vlan_tci, flow_hash, csum_info, nr_frags, ifindex, flags.
3. Copy the first two scatter-gather fragments from entry.frags[0..2] into NetBuf.frags[0..2].
4. If nr_frags > 2: read subsequent NetBufRingContinuation entries (identified by NETBUF_FLAG_FRAG_CONTINUATION in flags) to collect the remaining frag handles into NetBuf.frags[2..nr_frags].
5. Validate data_handle.generation against the pool slot's current generation. On mismatch (stale handle from a crashed driver), return an error (or silently drop the entry).
6. Increment the data page refcount via data_handle.refcount.fetch_add(1, Relaxed). The data_handle is shared — same DMA pages, shared ownership.
7. Initialize warm/cold fields to defaults: route_ext = None, gro_next = None, next = None, frag_ext = None, timestamp = 0. These fields are not in the ring entry (they are set during protocol processing in umka-net, not by the NIC driver).
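
The fragment-collection part of reconstruction (steps 3-4) can be sketched with simplified entry types. These structs model only the fields needed for the loop; they are an illustration, not the wire format itself:

```rust
/// Simplified models of the ring entries, enough to show the
/// continuation-collection loop from the deserialization steps above.
#[derive(Clone, Copy, PartialEq, Debug)]
struct Frag { handle: u32, offset: u32, length: u32 }

struct Primary { nr_frags: u8, frags: [Frag; 2] }       // models NetBufRingEntry
struct Continuation { nr_frags: u8, frags: [Frag; 7] }  // models NetBufRingContinuation

/// Collect all `nr_frags` fragments: first the inline pair, then as many
/// continuation entries as needed. Returns None if the ring did not carry
/// enough continuation slots (malformed producer).
fn collect_frags(p: &Primary, conts: &[Continuation]) -> Option<Vec<Frag>> {
    let total = p.nr_frags as usize;
    let mut out: Vec<Frag> = p.frags[..total.min(2)].to_vec();
    let mut it = conts.iter();
    while out.len() < total {
        let c = it.next()?; // the next ring slot must be a continuation
        out.extend_from_slice(&c.frags[..c.nr_frags as usize]);
    }
    (out.len() == total).then_some(out)
}

fn main() {
    let f = |i| Frag { handle: i, offset: 0, length: 4096 };
    // 5 total frags: 2 inline in the primary, 3 in one continuation.
    let p = Primary { nr_frags: 5, frags: [f(0), f(1)] };
    let c = Continuation { nr_frags: 3, frags: [f(2), f(3), f(4), f(0), f(0), f(0), f(0)] };
    println!("{}", collect_frags(&p, &[c]).unwrap().len()); // 5
}
```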

16.5.3 NetBufQueue and NetBufList

Two intrusive linked-list types used throughout the networking stack:

/// FIFO queue of NetBufs, linked through `NetBuf.next: Option<NonNull<NetBuf>>`.
/// Used for TCP send queues and qdisc internal queues.
///
/// Invariants:
/// - All NetBufs in the queue belong to the same pool (pool_id matches).
/// - `len` is exact (incremented on push, decremented on pop).
/// - The queue does NOT own the NetBufs — they remain in their slab slots.
///   The queue holds `NonNull<NetBuf>` pointers that are valid as long as the
///   NetBuf has not been freed via `NetBuf::free()`.
pub struct NetBufQueue {
    pub head: Option<NonNull<NetBuf>>,
    pub tail: Option<NonNull<NetBuf>>,
    pub len: u32,
    pub byte_count: u64,
}

impl NetBufQueue {
    pub fn push_back(&mut self, buf: NonNull<NetBuf>);
    pub fn pop_front(&mut self) -> Option<NonNull<NetBuf>>;
    pub fn peek_front(&self) -> Option<&NetBuf>;
    pub fn is_empty(&self) -> bool { self.head.is_none() }
}

/// Singly-linked list of NetBufs (no tail pointer). Used for per-CPU deferred
/// packet lists in NOLOCK qdiscs (QdiscDeferState.per_cpu_lists) and for
/// temporary batch accumulation.
///
/// Operations: push_front (O(1)), drain (iterate and consume all, O(N)).
/// No pop_back or random access.
pub struct NetBufList {
    pub head: Option<NonNull<NetBuf>>,
    pub len: u32,
}

impl NetBufList {
    pub fn push_front(&mut self, buf: NonNull<NetBuf>);
    pub fn drain(&mut self) -> NetBufListDrain;
    pub fn is_empty(&self) -> bool { self.head.is_none() }
}
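
The push_front/drain discipline gives LIFO order, which is fine for deferred-free lists where ordering does not matter. A safe toy model (Box-based rather than intrusive NonNull links into slab slots, for illustration only):

```rust
/// Safe stand-in for the intrusive NetBufList: same push_front + drain
/// surface, but with owning Box links instead of NonNull into slab slots.
struct List {
    head: Option<Box<Node>>,
    len: u32,
}
struct Node { value: u32, next: Option<Box<Node>> }

impl List {
    fn new() -> Self { List { head: None, len: 0 } }

    /// O(1): new node becomes the head.
    fn push_front(&mut self, value: u32) {
        self.head = Some(Box::new(Node { value, next: self.head.take() }));
        self.len += 1;
    }

    /// O(N): consume the whole list, most-recently-pushed first.
    fn drain(&mut self) -> Vec<u32> {
        let mut out = Vec::with_capacity(self.len as usize);
        let mut cur = self.head.take();
        while let Some(n) = cur {
            out.push(n.value);
            cur = n.next;
        }
        self.len = 0;
        out
    }
}

fn main() {
    let mut l = List::new();
    for v in [1, 2, 3] { l.push_front(v); }
    println!("{:?}", l.drain()); // LIFO: [3, 2, 1]
}
```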

16.5.4 NetBufHandle ↔ NetBuf Conversion

impl NetBufPool {
    /// Claim a `NetBuf` from the pool given its handle.
    ///
    /// Validates the handle (generation counter check) and returns a reference
    /// to the `NetBuf` metadata. The handle must have been obtained from this
    /// pool via `alloc()` and not yet freed.
    ///
    /// Takes `&NetBufHandle` (borrow) because the caller retains ownership of
    /// the handle. The returned `&NetBuf` borrows the pool's slab slot, not
    /// the handle. `NetBufHandle` is move-only (no `Copy`), so taking by value
    /// here would consume the handle and trigger its `Drop` (returning the slot
    /// to the pool) — the opposite of what the caller wants.
    ///
    /// Returns `None` if the handle is stale (generation mismatch) or invalid
    /// (out-of-bounds slot index).
    ///
    /// **Hot-path**: O(1) — direct index into the slab page, one generation
    /// counter comparison. No locks.
    pub fn claim(&self, handle: &NetBufHandle) -> Option<&NetBuf>;

    /// Claim a mutable reference to a `NetBuf` from the pool given its handle.
    ///
    /// Same as `claim()` but returns `&mut NetBuf`. The caller must ensure
    /// exclusive ownership (no concurrent readers). This is guaranteed in the
    /// NAPI poll path because each NetBuf is owned by exactly one consumer,
    /// and in the TX path because `NetBufHandle` is move-only.
    pub fn claim_mut(&self, handle: &NetBufHandle) -> Option<&mut NetBuf>;

    /// Convert a `NetBuf` into its compact handle, consuming the `NetBuf`.
    ///
    /// The `NetBuf` must have been allocated from this pool. The slab slot
    /// remains allocated — ownership transfers from the `NetBuf` value to the
    /// returned `NetBufHandle`. When the handle is eventually dropped, its
    /// `Drop` impl returns the slot to the pool.
    ///
    /// Takes `buf` by value (consuming) to enforce the ownership transfer:
    /// after this call, the caller cannot access the `NetBuf` directly. All
    /// subsequent access goes through `handle.peek()` or
    /// `pool.claim(&handle)`.
    ///
    /// **Hot-path**: O(1) — computes slot index from pointer arithmetic
    /// within the slab page, reads the generation from the slot header.
    /// No allocation, no lock.
    pub fn handle_for(&self, buf: NetBuf) -> NetBufHandle;

    /// Return a slab slot to the pool (called by `NetBufHandle::Drop`).
    ///
    /// 1. Validates `generation` against the slot's current generation counter
    ///    (defense-in-depth: rejects stale handles from crashed drivers).
    /// 2. Frees the `RouteLookupResult` extension if `route_ext.is_some()`.
    /// 3. Decrements the DMA data page refcount. If it reaches zero (last
    ///    reference), the DMA buffer is returned to the DMA buffer pool.
    /// 4. Increments the slot's generation counter (invalidates any stale
    ///    copies of the handle's coordinates).
    /// 5. Returns the slot to the per-CPU free list. If the slot was allocated
    ///    on a different NUMA node, it goes to the cross-node return list
    ///    (same as `NetBuf::free()`).
    ///
    /// **Hot-path**: O(1) — no locks on the per-CPU fast path. Cross-node
    /// returns touch the cross-node return list (bounded contention).
    ///
    /// # Panics
    /// Panics in debug builds if `generation` does not match the slot's
    /// current generation counter (indicates double-free or stale handle).
    /// In release builds, the mismatch is logged and the slot is NOT freed
    /// (leak over crash is safer than double-free corruption).
    pub fn return_slot(&self, slot_idx: u32, generation: u32);
}
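
The generation check in steps 1 and 4 of `return_slot()` is what turns a double-free, or a stale handle surviving a driver crash, into a detectable no-op instead of silent corruption. A self-contained model of the counter discipline (types are ours, not the spec's):

```rust
/// Toy model of generation-validated slots: freeing a slot bumps its
/// generation counter, so any handle minted before the free no longer
/// matches and is rejected.
#[derive(Clone, Copy)]
struct Handle { slot: usize, generation: u32 }

struct Slots { generations: Vec<u32> }

impl Slots {
    /// A handle is valid only if its generation matches the slot's current one.
    fn claim(&self, h: Handle) -> bool {
        self.generations.get(h.slot).map_or(false, |&g| g == h.generation)
    }

    /// Free the slot. A stale handle is rejected: leaking the slot is
    /// safer than double-free corruption, mirroring the release-build
    /// policy described above.
    fn return_slot(&mut self, h: Handle) -> bool {
        if !self.claim(h) {
            return false; // stale handle: log and leak rather than corrupt
        }
        self.generations[h.slot] += 1; // invalidate all outstanding copies
        true
    }
}

fn main() {
    let mut pool = Slots { generations: vec![0; 4] };
    let h = Handle { slot: 1, generation: 0 };
    assert!(pool.return_slot(h));  // first free succeeds
    assert!(!pool.return_slot(h)); // second free (stale handle) is rejected
    assert!(!pool.claim(h));       // the stale handle can no longer claim
    println!("ok");
}
```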

16.5.4.1 NetBuf::from_handle() — Handle-to-NetBuf Reconstruction (Tier 1)

impl NetBuf {
    /// Reconstruct a `NetBuf` reference from a `NetBufHandle` inside a Tier 1
    /// domain (e.g., umka-net processing a NAPI batch delivery).
    ///
    /// This is a convenience wrapper around `NetBufPool::claim_mut()`:
    /// 1. Look up the `NetBufPool` by `handle.pool_id` in the shared pool
    ///    registry (`SHARED_NETBUF_POOLS: XArray<NetBufPool>`, accessible from
    ///    both Tier 0 and Tier 1 via the shared DMA region).
    /// 2. Call `pool.claim_mut(&handle)` — validates generation counter and
    ///    returns `&mut NetBuf` if valid.
    /// 3. Returns `None` if the pool_id is unknown, the slot index is
    ///    out-of-bounds, or the generation counter has been incremented
    ///    (stale handle from a crashed/reloaded driver).
    ///
    /// Takes `&NetBufHandle` (borrow): the caller retains ownership of the
    /// handle. This is essential because `NetBufHandle` is move-only with a
    /// `Drop` impl — passing by value would free the slot.
    ///
    /// **Hot-path**: O(1) — XArray load (pool registry) + pointer arithmetic
    /// (slot index) + generation comparison. No locks, no allocation.
    ///
    /// **Safety**: The caller must ensure exclusive ownership of the NetBuf
    /// (no concurrent `claim()` or `claim_mut()` for the same handle). In the
    /// NAPI batch delivery path, each handle is owned by exactly one consumer
    /// (umka-net), so this is guaranteed.
    pub fn from_handle(handle: &NetBufHandle) -> Option<&'static mut NetBuf> {
        let pool = SHARED_NETBUF_POOLS.load(handle.pool_id as u64)?;
        pool.claim_mut(handle)
    }
}

/// Global shared NetBufPool registry. Populated at NIC driver probe time
/// (each NIC registers its pool). Accessible from Tier 0 and Tier 1 via
/// the shared DMA region (PKEY_SHARED on x86-64).
static SHARED_NETBUF_POOLS: XArray<NetBufPool> = XArray::new();

16.5.5 NetBuf Operations

impl NetBuf {
    /// Allocate a new NetBuf with a linear buffer of at least `size` bytes.
    ///
    /// The buffer is allocated from the current CPU's `NetBufPool` (metadata) and
    /// `DmaBufferHandle` pool (data). `headroom` bytes are reserved before the data
    /// region for header prepend operations. Total allocation is `headroom + size`
    /// bytes, rounded up to the DMA allocator's alignment (typically cache-line, 64B).
    ///
    /// Returns `Err(KernelError::NoMem)` if the DMA pool is exhausted.
    ///
    /// **Default headroom**: `NET_BUF_DEFAULT_HEADROOM` (128 bytes) — sufficient for
    /// an outer Ethernet (14) + IP (20) + UDP (8) + VXLAN (8) header plus alignment
    /// padding. Callers that know they need less (e.g., loopback) may specify 0.
    ///
    /// # Preconditions
    /// - Must be called with preemption disabled (NAPI context or explicit
    ///   `PreemptGuard`), because the per-CPU pool requires CPU pinning.
    pub fn alloc(size: u32, headroom: u32) -> Result<NetBuf, KernelError>;

    /// Allocate a batch of `count` NetBufs, each with `size` bytes and `headroom`.
    ///
    /// More efficient than `count` individual `alloc()` calls because the slab
    /// magazine is checked once and the DMA pool may satisfy the entire batch from
    /// a single large allocation (if the DMA allocator supports bulk alloc).
    /// Used by NAPI poll to pre-allocate RX buffers for a batch of up to 64 packets.
    ///
    /// Returns the number of successfully allocated NetBufs (may be less than `count`
    /// if memory is low). Partial success is not an error — the caller processes
    /// however many buffers were obtained.
    pub fn alloc_batch(
        out: &mut [MaybeUninit<NetBuf>],
        count: usize,
        size: u32,
        headroom: u32,
    ) -> usize;

    /// Free this NetBuf, returning the metadata struct to the local CPU's pool.
    ///
    /// If the data page refcount reaches 0, the DMA buffer is also freed.
    /// If the data pages are shared (`SHARED_DATA` flag), only the refcount is
    /// decremented. Safe to call from any CPU — if the NetBuf was allocated on a
    /// different NUMA node, it is placed on the cross-node return list.
    pub fn free(self);

    /// Prepend `len` bytes of headroom, advancing `data_offset` backward.
    ///
    /// Used to prepend protocol headers (e.g., IP header before TCP payload,
    /// Ethernet header before IP). The caller writes the header into the newly
    /// exposed region `[new_data_offset .. old_data_offset)`.
    ///
    /// # Panics
    /// Panics if `data_offset - len < head_offset` (insufficient headroom).
    /// Callers must check headroom or use `prepend_realloc()` for untrusted sizes.
    pub fn push(&mut self, len: u32) -> &mut [u8];

    /// Consume `len` bytes from the front of the data region.
    ///
    /// Used after parsing a protocol header: the header is consumed (data_offset
    /// advances past it) so the next layer sees its own header at `data_offset`.
    /// Returns a slice to the consumed header bytes (valid until the NetBuf is freed).
    ///
    /// # Panics
    /// Panics if `data_offset + len > tail_offset` (consuming more than available).
    pub fn pull(&mut self, len: u32) -> &[u8];

    /// Append `len` bytes at the tail of the data region.
    ///
    /// Used to append data to the linear buffer (e.g., padding, trailer).
    /// Returns a mutable slice to the newly appended region.
    ///
    /// # Panics
    /// Panics if `tail_offset + len > end_offset` (insufficient tailroom).
    pub fn put(&mut self, len: u32) -> &mut [u8];

    /// Create a zero-copy clone of this NetBuf.
    ///
    /// Allocates a new `NetBuf` struct from the local CPU's pool. The new struct
    /// gets a copy of all metadata fields. The data pages are shared: the DMA
    /// buffer's refcount is incremented, and both the original and clone set the
    /// `SHARED_DATA` flag. Subsequent writes to either NetBuf's data region trigger
    /// copy-on-write (the writer allocates new data pages and copies before modifying).
    ///
    /// **Use cases**: XDP_REDIRECT to multiple interfaces, TCP retransmission queue
    /// (keeping a reference to sent data for potential retransmit), multicast forwarding.
    pub fn clone_shared(&self) -> Result<NetBuf, KernelError>;

    /// Linearize the packet: copy all scatter-gather fragments into the linear buffer.
    ///
    /// After linearization, the entire packet is in the contiguous region
    /// `[data_offset .. tail_offset)` and `nr_frags == 0`. This is required before
    /// passing the packet to consumers that do not support scatter-gather (e.g., some
    /// BPF helpers, legacy protocol parsers).
    ///
    /// If the linear buffer is too small to hold all fragment data, a new larger
    /// DMA buffer is allocated and all data (linear + fragments) is copied into it.
    ///
    /// Returns `Err(KernelError::NoMem)` if reallocation fails.
    pub fn linearize(&mut self) -> Result<(), KernelError>;

    /// Total packet length (linear data + all fragments).
    ///
    /// This is the logical packet size visible to protocols. For GSO packets,
    /// this is the aggregate size before segmentation.
    pub fn len(&self) -> u32 {
        let linear = self.tail_offset - self.data_offset;
        let mut frag_total: u32 = self.frags[..self.nr_frags as usize]
            .iter()
            .map(|f| f.length)
            .sum();
        // Include overflow fragments stored in frag_ext (fragments beyond
        // MAX_INLINE_FRAGS spill here). Without this, len() underreports
        // the packet size for large scatter-gather lists, causing protocol
        // length checks and GSO segmentation to produce corrupt packets.
        if let Some(ext) = &self.frag_ext {
            frag_total += ext.iter().map(|f| f.length).sum::<u32>();
        }
        linear + frag_total
    }

    /// Return a read-only slice to the linear data region.
    ///
    /// Does NOT include scatter-gather fragments. Use `linearize()` first if you
    /// need the entire packet as a contiguous slice.
    pub fn linear_data(&self) -> &[u8];

    /// Return a mutable slice to the linear data region.
    ///
    /// If the data pages are shared (`SHARED_DATA` flag), this triggers copy-on-write:
    /// new data pages are allocated, the linear data is copied, and the original
    /// pages' refcount is decremented.
    pub fn linear_data_mut(&mut self) -> Result<&mut [u8], KernelError>;

    /// Attach a page fragment to this NetBuf's scatter-gather list.
    ///
    /// The fragment's DMA buffer handle refcount is incremented. If `nr_frags`
    /// exceeds `MAX_INLINE_FRAGS` and `frag_ext` is `None`, a `SlabVec` is
    /// allocated for overflow storage.
    pub fn add_frag(&mut self, frag: NetBufFrag) -> Result<(), KernelError>;

    /// Adjust the data region by `delta` bytes (positive = grow, negative = shrink).
    ///
    /// This is the underlying operation for `bpf_skb_adjust_room()` BPF helper.
    /// Positive delta inserts space at the current `data_offset` (for encapsulation);
    /// negative delta removes space (for decapsulation). May trigger reallocation
    /// if the adjustment exceeds available headroom or tailroom.
    pub fn adjust_room(&mut self, delta: i32) -> Result<(), KernelError>;
}

/// Default headroom reserved in newly allocated NetBufs.
///
/// 128 bytes is sufficient for: Ethernet (14) + 802.1Q (4) + outer IPv6 (40) +
/// UDP (8) + VXLAN (8) + inner Ethernet (14) + alignment padding (40).
/// This covers the common tunnel encapsulation case without reallocation.
pub const NET_BUF_DEFAULT_HEADROOM: u32 = 128;

/// Sentinel value for conntrack index indicating the packet is not tracked.
pub const CONNTRACK_UNTRACKED: u32 = u32::MAX;

16.5.6 Domain Crossing Protocol

When a NetBuf crosses the isolation domain boundary between umka-net and a NIC driver (in either direction), the following protocol is followed. This protocol ensures that each domain operates on metadata it owns (preventing TOCTOU races on header offsets) while sharing bulk data zero-copy:

RX path (NIC driver → umka-net) — for Tier 1 drivers:

  1. NIC hardware completes DMA into a data page from the shared DMA buffer pool.
  2. Driver poll function (runs inside the Tier 1 driver domain, triggered by a KABI ring poll request from umka-core's NAPI handler):
     a. Reads RX ring descriptors from NIC hardware.
     b. Fills a NetBufRingEntry for each packet (128 bytes; continuation entries for >2 frags). Writes up to budget entries to the RX completion ring.
     c. Writes work_done (number of packets processed) to the poll response slot.
  3. umka-core NAPI handler (softirq context, Tier 0) reads the completion ring:
     a. For each NetBufRingEntry (plus any NetBufRingContinuation entries): calls NetBuf::from_ring_entry() to allocate a local NetBuf and populate it from the entry. Data pages are shared via DmaBufferHandle (refcount incremented, no data copy).
     b. Accumulates NetBufHandles into NapiContext.rx_batch.
  4. Batch delivery (napi_deliver_batch()): At napi_complete_done() time, NAPI performs one Tier 0 → Tier 1 domain switch and passes the entire rx_batch array to umka-net. This is a direct batch transfer via shared memory, NOT via the KABI command/completion ring used for socket operations.
  5. umka-net (Tier 1) receives the batch via NetRxContext::receive_batch(): GRO coalescing merges packets by flow hash into coalesced super-packets, reducing per-packet processing overhead by 10-50x for bulk traffic. Then L2 dispatch, IP routing, netfilter, L4 delivery.

Key design decisions:

- GRO runs inside umka-net (not in NAPI or the NIC driver), because GRO is protocol-aware (TCP coalescing, UDP-GRO) and belongs in the protocol stack. GRO state lives in NetRxContext, not NapiContext (Section 16.2).
- The driver only produces raw per-packet NetBufRingEntry records. umka-core's NAPI handler reconstructs NetBufs, collects them into a batch, and delivers the batch to umka-net in one domain switch.
- Two separate handoffs: NIC→NAPI (via NapiPollDispatch) and NAPI→umka-net (batch delivery via napi_deliver_batch()). Neither handoff uses the KABI ring that socket operations (recvmsg/sendmsg) use.
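The flow-hash coalescing performed in step 5 of the RX path can be sketched as a toy model. This is an illustration only: lengths are simply summed per flow hash, whereas real GRO is protocol-aware (TCP sequence checks, UDP-GRO) and merges NetBuf fragments. All names here are hypothetical.

```rust
use std::collections::HashMap;

/// Toy model of GRO coalescing: group received packets by flow hash and
/// merge each group into one super-packet (total length, segment count).
fn gro_coalesce(pkts: &[(u32, u32)]) -> HashMap<u32, (u32, u32)> {
    let mut flows: HashMap<u32, (u32, u32)> = HashMap::new();
    for &(flow_hash, len) in pkts {
        let entry = flows.entry(flow_hash).or_insert((0, 0));
        entry.0 += len; // merged super-packet length
        entry.1 += 1;   // number of coalesced segments
    }
    flows
}

fn main() {
    // Three segments of flow 0xAB and one of flow 0xCD.
    let rx = [(0xAB, 1448), (0xAB, 1448), (0xCD, 512), (0xAB, 1448)];
    let merged = gro_coalesce(&rx);
    assert_eq!(merged[&0xAB], (4344, 3)); // one super-packet instead of three
    assert_eq!(merged[&0xCD], (512, 1));
}
```

The per-flow grouping is what yields the 10-50x reduction in per-packet stack traversals for bulk traffic: the stack processes one coalesced super-packet per flow per batch instead of each wire segment.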

TX path (umka-net → NIC driver): The TX path mirrors the RX domain crossing protocol, with ownership flowing in the opposite direction. The complete sequence:

TX Domain Crossing Protocol:

1. Application calls sendmsg() → umka-core (Tier 0)
   - copy_from_user() into KABI shared buffer.
   - Post SendRequest to umka-net's KABI command ring.
   — Domain switch: Tier 0 → Tier 1 (umka-net).

2. umka-net (Tier 1): TCP/UDP processing
   a. Enqueue data into socket send buffer (TcpCb.send_queue).
   b. TCP segmentation: split into MSS-sized segments.
   c. For each segment: allocate NetBuf from per-CPU NetBufPool,
      populate L4/L3/L2 headers, compute checksums (or set
      CHECKSUM_PARTIAL for NIC offload).
   d. Routing lookup: resolve egress interface and next-hop.
   e. NF_INET_LOCAL_OUT + NF_INET_POST_ROUTING netfilter hooks.
   f. Qdisc processing: enqueue NetBuf into the egress interface's
      traffic control queue discipline (HTB, FQ, pfifo_fast, etc.).
      Qdisc runs inside umka-net because it requires socket/flow
      state (cgroup classification, flow hash, priority) that is
      only available in the network stack's isolation domain.
   g. Qdisc dequeue: when the qdisc releases a NetBuf for transmission,
      serialize its metadata to a NetBufRingEntry (128 bytes) on
      umka-net's TX output ring. Data pages are referenced via
      DmaBufferHandle (shared DMA pool, PKEY 14 / domain 2).
   — Domain switch: Tier 1 (umka-net) → Tier 0 (umka-core).

3. umka-core (Tier 0): TX ring relay
   - Read NetBufRingEntry from umka-net's TX output ring.
   - Write NetBufRingEntry to the NIC driver's TX command ring.
   - Ring doorbell (shared atomic flag).
   — Domain switch: Tier 0 → Tier 1 (NIC driver).

4. NIC driver (Tier 1): hardware TX
   a. Read NetBufRingEntry from TX command ring.
   b. Program NIC TX descriptor with DmaBufferHandle's physical
      address, length, checksum offload flags, VLAN tag.
   c. Ring NIC TX doorbell (MMIO write to NIC register).
   d. NIC DMA reads packet data from shared DMA pages and transmits.

5. TX completion (asynchronous):
   a. NIC signals TX completion via MSI-X interrupt → NAPI TX poll
      (or combined RX/TX NAPI instance).
   b. NIC driver posts TX completion entry to completion ring:
      NetBufTxCompletion { handle: NetBufHandle, status: TxStatus }.
   — Domain switch: Tier 1 (NIC driver) → Tier 0.
   c. umka-core reads TX completion and drops the `NetBufTxCompletion`.
      The `NetBufHandle::Drop` impl returns the slab slot to the per-CPU
      pool and decrements the DMA data page refcount (freeing the DMA
      buffer if this was the last reference).

Ownership transfer summary:
  sendmsg → [Tier 0: copy_from_user]
           → [Tier 1 umka-net: TCP/routing/qdisc, OWNS NetBuf]
           → [Tier 0: relay, does NOT own NetBuf — passes ring entry]
           → [Tier 1 NIC: programs DMA, borrows data pages]
           → [TX completion: Tier 0 frees NetBuf]
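The step-3 relay can be sketched with queues standing in for the rings. This is a toy model under stated assumptions: the real rings are lock-free shared-memory structures and the doorbell is a shared atomic flag; `RingEntry` and `relay_tx_batch` are illustrative names, not the kernel API.

```rust
use std::collections::VecDeque;

/// Stand-in for the 128-byte NetBufRingEntry (payload elided).
#[derive(Clone, PartialEq, Debug)]
struct RingEntry(u64);

/// Toy model of the Tier 0 relay: drain up to `budget` entries from
/// umka-net's TX output ring into the NIC driver's TX command ring,
/// then ring the doorbell once for the whole batch.
fn relay_tx_batch(
    net_out: &mut VecDeque<RingEntry>,
    nic_cmd: &mut VecDeque<RingEntry>,
    budget: usize,
) -> usize {
    let mut relayed = 0;
    while relayed < budget {
        match net_out.pop_front() {
            Some(e) => {
                nic_cmd.push_back(e);
                relayed += 1;
            }
            None => break,
        }
    }
    // One doorbell write per batch amortizes the MMIO cost.
    relayed
}

fn main() {
    let mut net_out: VecDeque<RingEntry> = (0..100).map(RingEntry).collect();
    let mut nic_cmd = VecDeque::new();
    assert_eq!(relay_tx_batch(&mut net_out, &mut nic_cmd, 64), 64);
    assert_eq!(net_out.len(), 36);
    assert_eq!(nic_cmd.len(), 64);
}
```

The 64-entry budget matches the per-doorbell batching limit described in the key design decisions below.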

TX completion types:

/// TX completion entry posted by the NIC driver to the completion ring.
/// Consumed by umka-core (Tier 0). When this struct is dropped, the
/// `NetBufHandle`'s `Drop` impl automatically returns the slab slot to the
/// per-CPU pool and frees DMA data pages (if refcount reaches zero).
/// No explicit `NetBuf::free()` call is needed.
/// 32 bytes per completion entry.
/// Layout: handle (NetBufHandle = 16 bytes) + status (TxStatus: u32 tag +
/// u32 payload for Error variant = 8 bytes) + timestamp_ns (u64 = 8 bytes)
/// = 32 bytes total.
#[repr(C)]
pub struct NetBufTxCompletion {
    /// Owning handle for the transmitted NetBuf. Dropping this handle returns
    /// the slab slot and frees DMA pages. The NIC driver transfers ownership
    /// of this handle to umka-core via the completion ring.
    pub handle: NetBufHandle,
    /// Outcome of the transmit operation.
    pub status: TxStatus,
    /// Hardware TX timestamp in nanoseconds (from NIC PTP clock if available,
    /// otherwise 0). Used for SO_TIMESTAMPING support — the timestamp is
    /// propagated to the socket error queue for applications that requested
    /// hardware TX timestamps.
    pub timestamp_ns: u64,
}

/// Outcome of a NIC transmit operation reported in the TX completion ring.
#[repr(C, u32)]
pub enum TxStatus {
    /// Packet transmitted successfully.
    Ok = 0,
    /// Packet dropped by the NIC (queue overflow, link down, etc.).
    Dropped = 1,
    /// Hardware error during transmission. The error code is NIC-specific;
    /// umka-core logs it via tracepoint and increments the interface error counter.
    Error(u32) = 2,
}
// KABI ring completion: handle(16) + status(8, repr(C,u32) tag+payload) + timestamp_ns(8) = 32 bytes.
const_assert!(core::mem::size_of::<NetBufTxCompletion>() == 32);
const_assert!(core::mem::size_of::<TxStatus>() == 8);
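The layout claim behind these asserts can be checked in plain Rust: a `repr(C, u32)` enum with a data-carrying variant is laid out as a u32 tag followed by the payload union. The sketch below mirrors the `TxStatus` declaration above in isolation.

```rust
/// Mirror of the TxStatus declaration above: repr(C, u32) lays the enum
/// out as a u32 discriminant followed by the payload union (u32 for the
/// Error variant), giving 8 bytes total with 4-byte alignment.
#[repr(C, u32)]
#[allow(dead_code)]
enum TxStatus {
    Ok = 0,
    Dropped = 1,
    Error(u32) = 2,
}

fn main() {
    // Tag (4 bytes) + payload (4 bytes) = 8 bytes.
    assert_eq!(core::mem::size_of::<TxStatus>(), 8);
    assert_eq!(core::mem::align_of::<TxStatus>(), 4);
}
```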

Key design decisions:

- Qdisc runs in umka-net (Tier 1): Qdisc requires per-flow classification (cgroup, socket priority, flow hash) which is only available inside the network stack. Running qdisc in Tier 0 would require exporting all flow metadata across the isolation boundary — more complex and more expensive than keeping it in umka-net.
- TX completion frees in Tier 0: The NetBuf and its DMA data pages are freed by umka-core (Tier 0) on TX completion, not by the NIC driver. This prevents a crashed Tier 1 NIC driver from leaking NetBuf handles — umka-core tracks all outstanding TX handles and reclaims them on driver crash recovery (Section 11.9).
- Batching: Multiple TX ring entries are batched before ringing the NIC doorbell (up to 64 entries per doorbell write), amortizing the MMIO write cost. The umka-net → umka-core → NIC driver relay also batches: umka-net writes a burst of entries before signaling umka-core, which forwards the batch in a single domain switch pair.

NAPI batching: Steps 3-5 are batched. The driver writes up to 64 ring entries before signaling umka-net (doorbell). umka-net processes the entire batch in a single NAPI poll iteration, amortizing the 4 domain switches across all 64 packets (Section 16.12).

16.5.6.1 Zero-Copy Domain Crossing (MSG_ZEROCOPY)

When sendmsg() is called with MSG_ZEROCOPY, the kernel pins user pages (get_user_pages_fast()) rather than copying data into a KABI shared buffer. The NetBuf references pinned pages via DMA-mapped scatter-gather entries (NetBufFrag handles), bypassing the copy_from_user() step in the normal TX path (step 1 above).

Driver domain crossing: The KABI ring entry carries NetBufRingFrag handles pointing to the pinned physical pages. These pages are mapped into the NIC driver's IOMMU domain (for Tier 2 drivers) or made accessible via the shared DMA page key (PKEY 14 / domain 2, for Tier 1 drivers). The driver reads directly from user memory — no intermediate copy into kernel buffers.

Page pin lifecycle:

  1. Pin: sendmsg() calls get_user_pages_fast() to pin the user pages and increments the per-socket zcopy_pinned_pages counter. Each pinned page gets a ZcopyPinRef entry in the socket's pin tracking list.
  2. DMA map: The pinned pages are DMA-mapped as scatter-gather entries. The NetBuf.frags[] array holds NetBufFrag handles referencing these physical pages.
  3. Transmit: The NIC driver reads packet data directly from the pinned user pages via DMA. No copy occurs on this path.
  4. Completion: When the NIC driver reports transmission complete via the TX completion ring entry (NetBufTxCompletion), umka-core (Tier 0) DMA-unmaps the pages and decrements the pin refcount.
  5. Notification: The kernel sends a SO_EE_ORIGIN_ZEROCOPY notification via the socket error queue (Section 16.8). The application drains MSG_ERRQUEUE to acknowledge buffer reuse.
  6. Unpin: After the completion notification is delivered (or on socket close), the pages are unpinned (put_page()) and the zcopy_pinned_pages counter is decremented.

Bounded pinning: The maximum number of pinned pages per socket is bounded by SO_SNDBUF / PAGE_SIZE. When the pinned page count reaches this limit, subsequent MSG_ZEROCOPY sends block (or return EAGAIN for non-blocking sockets) until outstanding completions free pinned pages. This prevents a single socket from pinning an unbounded amount of physical memory.
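The pin bound can be sketched as follows; a minimal model assuming a 4 KB page size, with illustrative function names (the real check lives in the sendmsg() path).

```rust
/// Page size assumed for the bound computation (illustrative).
const PAGE_SIZE: u64 = 4096;

/// Pin budget per socket: SO_SNDBUF / PAGE_SIZE, as described above.
fn zcopy_pin_limit(so_sndbuf: u64) -> u64 {
    so_sndbuf / PAGE_SIZE
}

/// May a send that needs `pages_needed` more pinned pages proceed?
/// If not, the socket blocks (or returns EAGAIN when non-blocking)
/// until outstanding completions release pinned pages.
fn zcopy_may_pin(pinned: u64, pages_needed: u64, so_sndbuf: u64) -> bool {
    pinned + pages_needed <= zcopy_pin_limit(so_sndbuf)
}

fn main() {
    // A 256 KB send buffer yields a 64-page pin budget.
    assert_eq!(zcopy_pin_limit(256 * 1024), 64);
    assert!(zcopy_may_pin(60, 4, 256 * 1024));
    assert!(!zcopy_may_pin(60, 5, 256 * 1024)); // over budget: block / EAGAIN
}
```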

/// Tracks a pinned user page for MSG_ZEROCOPY send.
///
/// Inserted into the socket's pin list at sendmsg() time. Removed when the
/// NIC driver reports TX completion and the completion notification is
/// delivered to the application via MSG_ERRQUEUE.
pub struct ZcopyPinRef {
    /// Physical page frame pinned from userspace.
    pub page: PageRef,
    /// DMA address mapped for the NIC (unmapped on completion).
    pub dma_addr: DmaAddr,
    /// Notification ID for correlating with MSG_ERRQUEUE delivery.
    pub notify_id: u32,
}

Interaction with copy-based TX path: The copy path (step 1 in the TX protocol above) and the zero-copy path are mutually exclusive per sendmsg() call. The kernel selects the zero-copy path when all of: (a) SO_ZEROCOPY is enabled on the socket, (b) MSG_ZEROCOPY flag is set in the sendmsg() call, and (c) the total send size exceeds the copy threshold (4 KB). Below the threshold, the copy path is used unconditionally because the page-pinning overhead exceeds the copy cost.
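The three-condition selection reads naturally as a predicate; a minimal sketch with hypothetical names follows.

```rust
/// Copy threshold from the text: below 4 KB, page-pinning overhead
/// exceeds the copy cost, so the copy path is used unconditionally.
const ZCOPY_THRESHOLD: usize = 4096;

/// Per-call path selection: zero-copy only when the socket opted in
/// (SO_ZEROCOPY), the call requested it (MSG_ZEROCOPY), and the send
/// is large enough to amortize page pinning.
fn use_zerocopy(so_zerocopy: bool, msg_zerocopy: bool, send_len: usize) -> bool {
    so_zerocopy && msg_zerocopy && send_len > ZCOPY_THRESHOLD
}

fn main() {
    assert!(use_zerocopy(true, true, 64 * 1024));   // large send: pin pages
    assert!(!use_zerocopy(true, true, 1024));       // small send: copy path
    assert!(!use_zerocopy(false, true, 64 * 1024)); // SO_ZEROCOPY not set
    assert!(!use_zerocopy(true, false, 64 * 1024)); // flag not passed
}
```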

XDP interaction: XDP programs run in a dedicated BPF isolation domain, separate from both the NIC driver's Tier 1 domain and umka-core (PKEY 0). Execution is triggered from the driver's RX handler (which runs in the driver's domain during NAPI poll), but the XDP program itself executes in the BPF domain after the driver copies the packet descriptor to a shared bounce buffer (Section 11.3). For zero-copy XDP, the data pages are mapped read-only into the BPF domain (Section 19.2). The XDP program receives an XdpContext pointer and returns an XdpAction value that determines packet fate. XdpAction::Pass delivers the packet to umka-net via the normal RX path above. XdpAction::Drop, XdpAction::Tx, and XdpAction::Redirect are handled entirely within the driver/BPF domain, never crossing to umka-net.

/// Context passed to an XDP BPF program attached to a network interface.
///
/// Read-only view of the packet: the program sees packet byte offsets into the
/// NIC's DMA buffer. The program's return value determines packet fate.
///
/// # Linux Compatibility
/// Layout-compatible with Linux's `struct xdp_md` (the BPF program ABI context).
/// Existing Linux XDP programs compile and run without modification. UmkaOS-specific
/// fields (if any) are appended after the Linux-compatible fields and are optional.
#[repr(C)]
pub struct XdpContext {
    /// Byte offset from DMA buffer start to the first byte of packet data.
    /// Offset 0 in xdp_md.
    pub data:            u32,
    /// Byte offset from DMA buffer start to the byte AFTER the last packet byte.
    /// Offset 4 in xdp_md.
    pub data_end:        u32,
    /// Byte offset to the start of XDP metadata (between `data_meta` and `data`).
    /// Zero if no metadata has been set. Set via `bpf_xdp_adjust_meta()`.
    /// Offset 8 in xdp_md.
    pub data_meta:       u32,
    /// Ingress network interface index (1-based). Zero if not applicable.
    /// Offset 12 in xdp_md.
    pub ingress_ifindex: u32,
    /// Receive queue index on the ingress interface (zero-based).
    /// Offset 16 in xdp_md.
    pub rx_queue_index:  u32,
    /// Egress interface index for `XdpAction::Redirect`. Set by `bpf_redirect()`.
    /// Zero if not redirecting. Offset 20 in xdp_md.
    pub egress_ifindex:  u32,
    /// UmkaOS extension (offset 24, after all Linux-compatible fields).
    /// Byte offset from DMA buffer start to the beginning of the headroom area.
    /// `bpf_xdp_adjust_head()` can expand `data` backwards into the headroom
    /// (between `data_hard_start` and `data`). Bounds check: new `data` must
    /// be >= `data_hard_start`.
    ///
    /// Linux BPF programs are bounds-checked by the verifier against
    /// sizeof(struct xdp_md) = 24 bytes (6 fields × 4 bytes), so this
    /// extension field at offset 24 is inaccessible to unmodified Linux XDP
    /// programs. UmkaOS-native XDP programs compiled with the extended struct
    /// can access it.
    pub data_hard_start: u32,
}
// BPF ABI: 7 × u32(4) = 28 bytes. Linux xdp_md is 24 bytes (6 fields); UmkaOS adds data_hard_start.
const_assert!(core::mem::size_of::<XdpContext>() == 28);

/// XDP program return codes.
///
/// Values MUST match Linux's `enum xdp_action` for BPF program binary portability.
/// Existing Linux XDP programs that return these values work without recompilation.
#[repr(u32)]
pub enum XdpAction {
    /// Unrecoverable error in the XDP program. Drop packet; bump `xdp_aborted` counter.
    Aborted  = 0,
    /// Discard the packet silently. Fastest drop path.
    Drop     = 1,
    /// Pass the packet up to the normal network stack for processing.
    Pass     = 2,
    /// Retransmit the packet out the same NIC queue it arrived on.
    Tx       = 3,
    /// Redirect the packet to another NIC or another CPU queue via `bpf_redirect()`.
    Redirect = 4,
}
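A minimal XDP-style program over the offset-based context might look like the following. This is a hypothetical userland sketch (simplified `Ctx` struct, numeric return codes matching the enum values above), not a loadable BPF program; it illustrates the bounds-check discipline the verifier enforces, where every access must be proven within [data, data_end).

```rust
/// Offset-based packet view, mirroring the XdpContext fields used here.
struct Ctx {
    data: u32,
    data_end: u32,
}

const ETH_HLEN: u32 = 14; // Ethernet header length

/// Hypothetical filter: drop packets too short to hold an Ethernet
/// header; pass everything else up the stack.
fn min_len_filter(ctx: &Ctx) -> u32 {
    let pkt_len = ctx.data_end.saturating_sub(ctx.data);
    if pkt_len < ETH_HLEN {
        1 // XdpAction::Drop
    } else {
        2 // XdpAction::Pass
    }
}

fn main() {
    assert_eq!(min_len_filter(&Ctx { data: 256, data_end: 256 + 10 }), 1);
    assert_eq!(min_len_filter(&Ctx { data: 256, data_end: 256 + 60 }), 2);
}
```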

NetBuf → XdpContext conversion: When an XDP program is attached to an interface, the NIC driver's RX handler constructs an XdpContext from the received NetBuf before invoking the BPF program:

/// Construct an XdpContext from a NetBuf for XDP program execution.
///
/// The XdpContext fields are byte offsets into the NetBuf's DMA buffer,
/// not pointers. This allows the BPF verifier to bounds-check all packet
/// accesses at load time (offsets are within [data, data_end)).
///
/// # Preconditions
/// - `nb` is a freshly received NetBuf with valid DMA mapping.
/// - The DMA buffer has at least `XDP_HEADROOM` (256) bytes of headroom
///   before `data` for `bpf_xdp_adjust_head()` to grow into.
fn netbuf_to_xdp_context(nb: &NetBuf, ifindex: u32, rxq: u32) -> XdpContext {
    XdpContext {
        // Start of packet data in the DMA buffer.
        data:            nb.data_offset,
        // One past the last linear byte. Fragments are reached via
        // bpf_xdp_load_bytes() in multi-buffer XDP (Phase 4).
        data_end:        nb.tail_offset,
        // Initially == data (no metadata set yet).
        data_meta:       nb.data_offset,
        ingress_ifindex: ifindex,
        rx_queue_index:  rxq,
        // Set by bpf_redirect() if the program redirects.
        egress_ifindex:  0,
        // UmkaOS extension: bottom of the headroom area.
        data_hard_start: nb.data_offset.saturating_sub(XDP_HEADROOM),
    }
}

After the XDP program returns, the driver reads back data and data_end from the XdpContext (the program may have adjusted them via bpf_xdp_adjust_head() or bpf_xdp_adjust_tail()) and updates the NetBuf offsets accordingly before passing the packet to the network stack (on XdpAction::Pass) or processing the redirect.

XDP metadata passing: XDP programs can store metadata in the data_meta area (between data_meta and data). After XdpAction::Pass, the TC layer reads the metadata through its context's data_meta field. Metadata format is program-defined; the kernel does not interpret it.

XDP multi-buffer support (Phase 4): For jumbo frames and GRO super-packets that span multiple pages, XDP programs access data via bpf_xdp_load_bytes() and bpf_xdp_store_bytes() helpers. Single-buffer XDP (Phase 3) requires linearization before XDP processing.

Post-XDP header offset resynchronization: After XDP program execution, if the action is XdpAction::Pass or XdpAction::Tx, the NetBuf metadata fields are re-synchronized from the XdpContext to reflect any header adjustments:

// Re-sync NetBuf from XdpContext after XDP program execution.
// Called by the NIC driver's RX handler when xdp_action is Pass or Tx.
netbuf.data_offset = xdp_ctx.data;  // data is already an offset from DMA buffer start
netbuf.l2_offset = 0;               // L2 header starts at data_offset+0 (relative)
netbuf.l3_offset = 0xFFFF;          // sentinel: "not parsed yet" — unknown after XDP modification
netbuf.l4_offset = 0xFFFF;          // sentinel: "not parsed yet" — unknown after XDP modification
netbuf.tail_offset = netbuf.data_offset + (xdp_ctx.data_end - xdp_ctx.data) as u32;
// len is computed from tail_offset - data_offset, not stored directly.

L3 and L4 offsets are reset to 0xFFFF (the "not parsed yet" sentinel) because the XDP program may have rewritten or shifted the L2 header, invalidating any previously parsed offsets. L2 offset is set to 0 (relative to data_offset) since the L2 header always starts at the beginning of the packet data. tail_offset is updated instead of len because len is a computed field (tail_offset - data_offset). The IP/TCP layer re-parses them during normal stack processing (ip_rcv() sets l3_offset, protocol dispatch sets l4_offset). This ensures that bpf_xdp_adjust_head() and bpf_xdp_adjust_tail() modifications are correctly reflected in the NetBuf before it enters the network stack. For XdpAction::Redirect, the same resync is performed by xdp_do_redirect() before enqueuing the NetBuf onto the target interface's bulk queue.

16.6 Routing Table (FIB — Forwarding Information Base)

The routing table provides longest-prefix-match (LPM) lookup for IPv4 and IPv6 destination addresses, supporting policy routing (multiple tables with rule-based selection), VRF (Virtual Routing and Forwarding, Section 16.16), and ECMP (Equal-Cost Multi-Path) with weighted next-hops.

Design principles:

1. RCU-protected: Route lookup is on the per-packet forwarding path. Readers (packet processing) access the routing table under rcu_read_lock() with zero lock acquisition. Writers (netlink RTM_NEWROUTE/RTM_DELROUTE, Section 16.17) modify individual FIB trie nodes using per-entry RCU publishing under NetNamespace.config_lock — O(log N) per route change, not O(total routes) clone-and-swap (Section 17.1).
2. Per-namespace: Each NetNamespace holds its own RouteTable (Section 17.1). VRFs within a namespace have separate tables, identified by table ID.
3. Unified data structure: IPv4 and IPv6 share the same trie implementation (operating on 128-bit addresses; IPv4 addresses are stored as IPv4-mapped-IPv6). This eliminates code duplication and simplifies policy routing rules that apply to both address families.
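The unified-key encoding from principle 3 is mechanical: an IPv4 address a.b.c.d becomes the 128-bit key ::ffff:a.b.c.d, and an IPv4 prefix length shifts by 96 bits. A small sketch (helper names are illustrative):

```rust
/// Encode an IPv4 address as its IPv4-mapped-IPv6 trie key (::ffff:a.b.c.d),
/// the form stored in the unified 128-bit FIB trie.
fn ipv4_to_trie_key(a: [u8; 4]) -> u128 {
    0x0000_0000_0000_0000_0000_ffff_0000_0000u128
        | ((a[0] as u128) << 24)
        | ((a[1] as u128) << 16)
        | ((a[2] as u128) << 8)
        | (a[3] as u128)
}

/// An IPv4 prefix length shifts by the 96-bit mapped prefix: /24 -> /120.
fn ipv4_prefix_len(v4_len: u8) -> u8 {
    96 + v4_len
}

fn main() {
    assert_eq!(ipv4_to_trie_key([10, 0, 0, 1]), 0xffff_0a00_0001u128);
    assert_eq!(ipv4_prefix_len(24), 120);
    assert_eq!(ipv4_prefix_len(0), 96); // IPv4 default route
}
```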

16.6.1 Data Structures

// umka-net/src/routing.rs

/// Forwarding Information Base — the routing table for a network namespace.
///
/// Contains one or more numbered routing tables (Linux supports 256 tables by
/// default; UmkaOS supports up to 4096). Table 253 (`RT_TABLE_DEFAULT`) and table
/// 254 (`RT_TABLE_MAIN`) are always present. Table 255 (`RT_TABLE_LOCAL`) holds
/// routes for local addresses (auto-populated when addresses are assigned).
///
/// **RCU integration**: `RouteTable` is stored directly in `NetNamespace`
/// ([Section 17.1](17-containers.md#namespace-architecture)) — no `RcuCell` wrapper. The FIB trie uses per-entry
/// RCU publishing internally: route adds/deletes modify individual trie nodes and
/// publish them via RCU, giving O(log N) per-route-change cost. The trie nodes
/// themselves are `Arc`-shared, so the per-entry path-copy cost is O(W) allocations
/// where W = key width (at most 128 for IPv6), independent of table size.
///
/// **Cross-reference**: `NetNamespace::routes` ([Section 17.1](17-containers.md#namespace-architecture--capability-domain-mapping)), `FibRule` (below),
/// `NetBuf::route_ext` (Section 16.4), `bpf_fib_lookup()` (Section 16.15),
/// VRF (Section 16.13), netlink RTM_* messages (Section 16.14).
pub struct RouteTable {
    /// Named routing tables, indexed by table ID.
    ///
    /// Standard table IDs (matching Linux `RT_TABLE_*` constants):
    /// - 0: `RT_TABLE_UNSPEC` (used in rules to mean "any table")
    /// - 253: `RT_TABLE_DEFAULT` (default routes)
    /// - 254: `RT_TABLE_MAIN` (main routing table, where `ip route add` goes)
    /// - 255: `RT_TABLE_LOCAL` (local and broadcast addresses, auto-managed)
    /// - 1-252, 256-4095: user-defined tables for policy routing and VRF
    ///
    /// Stored as an `XArray` for O(1) lookup by table ID and ordered iteration
    /// (netlink dump); the number of tables is typically 3-10.
    /// The per-table trie provides O(W) prefix lookup where W is the address width.
    pub tables: XArray<FibTrie>,

    /// Policy routing rules, evaluated in priority order.
    ///
    /// Rules select which routing table to consult based on packet attributes
    /// (source address, destination address, mark, incoming interface, IP protocol,
    /// source/destination port, UID). If no rule matches, the default rule chain
    /// applies: local table (255) first, main table (254), then default table (253).
    ///
    /// Sorted by `FibRule::priority` (ascending). Lower numeric priority = higher
    /// precedence (matching Linux semantics where priority 0 is highest).
    /// Bounded: max `MAX_FIB_RULES` (32768). Linux default is 32768.
    /// Modified only via `RTM_NEWRULE`/`RTM_DELRULE` netlink (cold path).
    /// The RTM_NEWRULE handler returns ENOSPC if `rules.len() >= MAX_FIB_RULES`.
    pub rules: Vec<FibRule>,

    /// Default rule chain (compiled from `rules`).
    ///
    /// For the common case where no custom policy rules are configured, this is
    /// `[255, 254, 253]` — check local, main, default, in that order. When custom
    /// rules exist, lookups evaluate `rules` first, falling through to this default
    /// chain only if no rule matches. Caching the default chain avoids allocating
    /// and iterating the rules list for the common no-policy-routing case.
    /// Typically 3 elements (local, main, default). ArrayVec avoids heap
    /// allocation on the common no-policy-routing fast path.
    pub default_chain: ArrayVec<u32, 8>,

    /// Route generation counter. Monotonically increasing, bumped on every
    /// route add/delete/change, PMTU update, or interface state change.
    /// Sockets cache this value in `SockCommon.cached_route` and compare
    /// on the TX path (`sk_dst_check()`) — a single atomic load (~1 cycle)
    /// to detect stale route caches without per-socket invalidation lists.
    ///
    /// u64: at 10^9 route changes/sec, wraps in 584 years.
    pub route_gen: AtomicU64,
}
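The `route_gen` fast-path check described above can be sketched in a few lines; `CachedRoute` and `cached_route_valid` are illustrative stand-ins for `SockCommon.cached_route` and `sk_dst_check()`.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Stand-in for the route a socket caches, alongside the generation
/// it observed (the real cache also holds nexthop, ifindex, etc.).
struct CachedRoute {
    gen_seen: u64,
}

/// The cached route is usable iff the table's generation has not moved
/// since the route was cached: one atomic load on the TX hot path,
/// with no per-socket invalidation lists.
fn cached_route_valid(cached: &CachedRoute, route_gen: &AtomicU64) -> bool {
    cached.gen_seen == route_gen.load(Ordering::Acquire)
}

fn main() {
    let route_gen = AtomicU64::new(7);
    let cached = CachedRoute { gen_seen: route_gen.load(Ordering::Acquire) };
    assert!(cached_route_valid(&cached, &route_gen));

    // Any route add/delete, PMTU update, or interface change bumps the
    // generation; the stale cache forces a fresh FIB lookup.
    route_gen.fetch_add(1, Ordering::Release);
    assert!(!cached_route_valid(&cached, &route_gen));
}
```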

/// Compressed radix trie for longest-prefix-match IP routing.
///
/// Implements a path-compressed (Patricia) trie over 128-bit keys (IPv4 addresses
/// are stored as IPv4-mapped-IPv6: `::ffff:a.b.c.d`). Path compression collapses
/// single-child internal nodes.
///
/// **Lookup complexity**: O(W) where W = 32 (IPv4) or W = 128 (IPv6). Lookup cost
/// is bounded by key width, independent of table size N — a 4M-entry trie takes the
/// same O(W) steps as a 1K-entry trie. For route lookup at 100 Mpps line rate,
/// W=128 means at most 128 bit comparisons per packet — negligible compared to DRAM
/// latency for the node reads. Path compression reduces actual node visits to 5-20
/// for typical routing tables (worst case bounded by W, not N).
///
/// **Why path-compressed Patricia trie (not LC-trie)**:
/// Linux uses an LC-trie (Level-Compressed trie) for IPv4 FIB, which provides
/// excellent lookup performance for dense, well-distributed prefix tables. However:
/// 1. LC-trie requires periodic rebalancing (level compression ratios change as
///    routes are added/removed), which conflicts with UmkaOS's per-entry RCU model.
///    Rebalancing an LC-trie under RCU requires publishing many nodes atomically,
///    adding complexity and latency spikes on the write path.
/// 2. A path-compressed Patricia trie supports **per-entry RCU publishing**:
///    inserting or deleting a route creates O(W) new nodes on the path from root
///    to the modified leaf (W = key width, bounded by 128 for IPv6), each published
///    individually via `rcu_assign_pointer()`. Old nodes are freed via `rcu_call()`
///    after a grace period. This is NOT clone-and-swap of the root — unchanged
///    subtrees are shared via `Arc` without copying, and there is no recursive
///    `Arc` drop of an old root. The per-entry model ensures O(W) allocations
///    and O(W) RCU callbacks per route change, independent of table size N.
///    For a 1M-prefix BGP table, adding one route creates ~20-25 new nodes
///    (not 1.5M), with proportional cleanup.
/// 3. For typical routing tables (10-1000 entries for host routing, up to ~1M entries
///    for full Internet BGP table), Patricia trie lookup is 5-20 memory accesses,
///    which is comparable to LC-trie (3-10 accesses) and well within the performance
///    budget (route lookup << TCP processing per packet).
///
/// **Algorithm dispatch**: The `FibTrieOps` trait abstracts the trie algorithm,
/// enabling future replacement (e.g., LC-trie for dense tables) without changing
/// the `RouteTable` structure. Phase 2 ships Patricia only; LC-trie is deferred
/// to Phase 4+ pending profiling of production workloads.
///
/// **Full BGP table**: For routers carrying a full Internet routing table (~1M IPv4
/// prefixes, ~200K IPv6 prefixes), the Patricia trie uses approximately 2-4 MB of
/// memory (each node ~40 bytes, ~1.5-2x the number of prefixes due to internal
/// branching nodes). Lookup is ~20-25 memory accesses worst case, ~500-1000 ns on
/// modern CPUs with L2/L3 cache. This is acceptable because full-BGP hosts are
/// routers where routing lookup is a small fraction of per-packet processing.
///
/// **Cache optimization**: Trie nodes are allocated from a dedicated slab pool
/// (not the general-purpose allocator) to improve spatial locality. Nodes along
/// hot paths (default route, /8 aggregates) are likely to remain in L2 cache.
pub struct FibTrie {
    /// Root node of the trie. `None` for an empty table.
    ///
    /// The root is `Arc`-shared: per-entry RCU updates create new nodes along
    /// the modified path and publish them individually via `rcu_assign_pointer()`.
    /// Unchanged subtrees remain shared via `Arc` with zero additional allocation.
    pub root: Option<Arc<FibTrieNode>>,

    /// Number of route entries (prefixes) in this trie.
    /// Used for netlink dump pagination and sysctl reporting.
    pub entry_count: u32,

    /// Table ID (for cross-referencing with `RouteTable::tables`).
    pub table_id: u32,
}

/// A node in the path-compressed Patricia trie.
///
/// Each node represents either:
/// - An **internal branching node**: has children but no route entry. The `prefix`
///   and `prefix_len` fields define the common prefix shared by all descendants.
/// - A **leaf node**: has a route entry (`route`) and possibly children (a prefix
///   that is both a route and a branching point, e.g., 10.0.0.0/8 with more
///   specific routes 10.1.0.0/16, 10.2.0.0/16 as children).
///
/// Path compression: internal nodes with a single child are collapsed. The
/// `prefix_len` may skip multiple bits between parent and child.
pub struct FibTrieNode {
    /// The prefix bits for this node (stored as a 128-bit value).
    ///
    /// Only the first `prefix_len` bits are significant. The remaining bits are zero.
    /// IPv4 routes use IPv4-mapped-IPv6 encoding: `::ffff:a.b.c.d` (prefix_len =
    /// 96 + IPv4 prefix length).
    pub prefix: u128,

    /// Number of significant bits in `prefix`. Range: 0 (default route) to 128.
    ///
    /// For IPv4 routes, this is 96 + the IPv4 prefix length (e.g., /24 becomes 120).
    /// For IPv6 routes, this is the native prefix length (e.g., /64 becomes 64).
    pub prefix_len: u8,

    /// Route entry at this prefix. `Some` if this prefix is a destination in the
    /// routing table. `None` if this is a pure branching node (exists only to
    /// connect more-specific child prefixes).
    pub route: Option<RouteEntry>,

    /// Left child (next bit after `prefix_len` is 0).
    ///
    /// `Arc`-shared: per-entry RCU updates share unchanged subtrees between
    /// old and new versions without copying.
    pub left: Option<Arc<FibTrieNode>>,

    /// Right child (next bit after `prefix_len` is 1).
    pub right: Option<Arc<FibTrieNode>>,
}

/// Algorithm-agnostic FIB trie operations trait.
///
/// Abstracts the trie algorithm to enable future replacement (e.g., LC-trie
/// for dense prefix tables, or hash-based exact-match for host routes).
/// Phase 2 ships only `PatriciaFibTrie`; additional implementations are
/// deferred to Phase 4+ pending production profiling.
///
/// All methods take `&self` (the trie is an Evolvable component). Write
/// operations are called under `NetNamespace.config_lock` (cold path).
/// `lookup()` is called under `rcu_read_lock()` on the hot forwarding path.
pub trait FibTrieOps: Send + Sync {
    /// Longest-prefix-match lookup. Returns the best-matching route entry
    /// and the matched prefix length. Called on the per-packet forwarding
    /// hot path under `rcu_read_lock()`.
    ///
    /// O(W) where W = key width (32 for IPv4, 128 for IPv6).
    fn lookup(&self, trie: &FibTrie, addr: u128) -> Option<(&RouteEntry, u8)>;

    /// Insert a route entry. Creates O(W) new trie nodes along the
    /// insertion path, publishing each via `rcu_assign_pointer()`.
    /// Old replaced nodes are freed via `rcu_call()` after a grace period.
    ///
    /// Called under `NetNamespace.config_lock`. Bumps `RouteTable.route_gen`.
    fn insert(&self, trie: &mut FibTrie, entry: RouteEntry) -> Result<(), KernelError>;

    /// Remove a route entry by prefix. Publishes modified nodes via RCU.
    /// Called under `NetNamespace.config_lock`. Bumps `RouteTable.route_gen`.
    fn remove(&self, trie: &mut FibTrie, prefix: u128, prefix_len: u8) -> Result<RouteEntry, KernelError>;

    /// Iterate all route entries in prefix order (for netlink dump).
    /// Called under `rcu_read_lock()`.
    fn for_each<F: FnMut(&RouteEntry)>(&self, trie: &FibTrie, f: F);
}

/// Algorithm dispatch: selects the FibTrieOps implementation.
/// Currently only Patricia; extensible for future algorithms.
pub enum FibAlgoDispatch {
    /// Path-compressed Patricia trie (Phase 2, default).
    Patricia,
    // Future: LcTrie, HashExact, etc.
}

/// A single route entry in the FIB.
///
/// Corresponds to a row in the output of `ip route show` or an `RTM_NEWROUTE`
/// netlink message.
/// kernel-internal, not KABI — contains NextHopGroup (heap-backed Vec/ArrayVec).
/// Serialized to netlink wire format for userspace.
#[repr(C)]
pub struct RouteEntry {
    /// Destination prefix (redundant with `FibTrieNode::prefix` but stored here
    /// for self-contained netlink serialization and `bpf_fib_lookup()` results
    /// without requiring a trie node reference).
    pub dst_prefix: u128,

    /// Destination prefix length (0-128). Same as `FibTrieNode::prefix_len`.
    pub dst_prefix_len: u8,

    /// Source prefix for source-specific routing (used with `ip route add ... src`).
    ///
    /// When non-zero, this route matches only if the packet's source address also
    /// falls within this prefix. `src_prefix_len == 0` means "any source" (the
    /// common case). Linux supports source-specific routing only for IPv6
    /// (`RT6_F_POLICY`); UmkaOS supports it for both address families.
    pub src_prefix: u128,

    /// Source prefix length (0 = match any source).
    pub src_prefix_len: u8,

    /// Next-hop(s) for this route.
    ///
    /// Single next-hop for simple routes. Multiple next-hops for ECMP (Equal-Cost
    /// Multi-Path). The `NextHopGroup` handles weighted distribution.
    pub next_hops: NextHopGroup,

    /// Route scope — defines the "reach" of this route.
    ///
    /// Values match Linux `RT_SCOPE_*`:
    /// - `Universe` (0): global routes (reachable via gateways)
    /// - `Site` (200): interior routes within a site
    /// - `Link` (253): directly attached (on-link, no gateway needed)
    /// - `Host` (254): local host route (loopback, local address)
    /// - `Nowhere` (255): destination is unreachable
    pub scope: RouteScope,

    /// Route type — determines packet handling action.
    ///
    /// Values match Linux `RTN_*`:
    /// - `Unicast` (1): normal forwarding
    /// - `Local` (2): local delivery (address on this host)
    /// - `Broadcast` (3): broadcast address
    /// - `Blackhole` (6): silently drop
    /// - `Unreachable` (7): drop and return ICMP host unreachable
    /// - `Prohibit` (8): drop and return ICMP administratively prohibited
    /// - `Throw` (9): policy routing: skip this table and try the next rule
    pub route_type: RouteType,

    /// Route protocol — identifies who installed this route.
    ///
    /// Values match Linux `RTPROT_*`:
    /// - `Kernel` (2): installed by the kernel (e.g., directly connected networks)
    /// - `Boot` (3): installed during boot (static routes from config)
    /// - `Static` (4): installed by administrator (`ip route add`)
    /// - `Zebra` (11) / `Bird` (12): installed by routing daemons
    /// - `Dhcp` (16): installed by DHCP client
    ///
    /// Used for route management (e.g., `ip route flush proto dhcp`).
    pub protocol: RouteProtocol,

    /// Route metric (preference). Lower metric = preferred route.
    ///
    /// When multiple routes match the same destination prefix, the route with
    /// the lowest metric is selected. Matches Linux `ip route add ... metric N`.
    /// Default: 0 for kernel routes, 1024 for DHCP, configurable for static routes.
    pub metric: u32,

    /// Preferred source address for packets originating from this host via this route.
    ///
    /// Set via `ip route add ... src <addr>`. When a locally-generated packet uses
    /// this route and has not yet chosen a source address, this address is used.
    /// `[0u8; 16]` means "no preference, use default address selection (RFC 6724)".
    pub prefsrc: [u8; 16],

    /// MTU override for this route.
    ///
    /// If non-zero, packets using this route are limited to this MTU instead of
    /// the output interface's MTU. Used for path MTU discovery caching and for
    /// tunnel routes with reduced MTU. Zero means "use interface MTU".
    pub mtu: u32,

    /// Route flags.
    pub flags: RouteFlags,

    /// Route expiry time (nanoseconds since boot, CLOCK_MONOTONIC_RAW).
    ///
    /// Zero means "no expiry" (permanent route). Non-zero means the route expires
    /// and is garbage-collected after this time. Used for: DHCP routes (expire when
    /// lease expires), redirected routes (expire after `net.ipv4.route.gc_timeout`),
    /// and path MTU entries (expire after `net.ipv4.route.mtu_expires`).
    pub expires_ns: u64,
}

/// Group of next-hops for a route, supporting ECMP.
///
/// Single-next-hop routes are the common case and are stored inline (no heap
/// allocation). Multi-path routes store their next-hops in a slab-allocated
/// vector.
pub enum NextHopGroup {
    /// Single next-hop (the common case for host routing tables).
    Single(NextHop),

    /// Multiple weighted next-hops for Equal-Cost Multi-Path routing.
    ///
    /// Traffic is distributed across next-hops proportionally to their weights.
    /// The selection is deterministic per flow: the `NetBuf::flow_hash`
    /// ([Section 16.5](#netbuf-packet-buffer)) is used to pick a next-hop, ensuring all packets of the same flow
    /// follow the same path (avoiding TCP reordering).
    ///
    /// **Selection algorithm**: `flow_hash % total_weight` determines which
    /// next-hop handles the packet. Each next-hop occupies a range of the weight
    /// space proportional to its weight. For example, with weights [3, 1, 1]
    /// (total 5): hash % 5 in [0,2] -> hop 0, [3,3] -> hop 1, [4,4] -> hop 2.
    ///
    /// **Resilient hashing**: When a next-hop goes down (link failure, neighbor
    /// unreachable), traffic is redistributed only among the remaining next-hops.
    /// Flows that were already using a surviving next-hop are not disrupted.
    /// This matches Linux's `nexthop` group resilient hashing (kernel 5.13+).
    Multipath {
        /// The next-hops and their weights.
        hops: SlabVec<NextHop, NEXTHOP_INLINE_CAP>,
        /// Sum of all weights. Cached to avoid recomputing on every packet.
        total_weight: u32,
    },
}
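The weight-range selection described for `Multipath` reduces to a short scan. Below is a minimal sketch over a bare weight slice (dead next-hops are assumed to have been filtered out beforehand, with their weight removed from the total; `select_next_hop` is an illustrative name, not spec API):

```rust
/// Pick a next-hop index from a weight table using the flow hash.
/// Each hop owns a contiguous range of the weight space proportional
/// to its weight, so all packets of one flow pick the same hop.
fn select_next_hop(weights: &[u32], flow_hash: u32) -> usize {
    let total: u32 = weights.iter().sum();
    let mut point = flow_hash % total; // position in [0, total)
    for (i, &w) in weights.iter().enumerate() {
        if point < w {
            return i;
        }
        point -= w;
    }
    unreachable!("point is always below the cumulative weight sum")
}

fn main() {
    // Weights [3, 1, 1] (total 5): hash % 5 in [0,2] -> hop 0,
    // [3,3] -> hop 1, [4,4] -> hop 2, as in the documentation above.
    assert_eq!(select_next_hop(&[3, 1, 1], 0), 0);
    assert_eq!(select_next_hop(&[3, 1, 1], 2), 0);
    assert_eq!(select_next_hop(&[3, 1, 1], 3), 1);
    assert_eq!(select_next_hop(&[3, 1, 1], 9), 2); // 9 % 5 == 4
}
```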

/// A single next-hop in the routing table.
/// kernel-internal, not KABI — serialized to netlink wire format for userspace.
#[repr(C)]
pub struct NextHop {
    /// Gateway IP address.
    ///
    /// The IP address of the next router to forward the packet to. For directly
    /// connected networks (on-link routes), this is `[0u8; 16]` and the packet is
    /// sent directly to the destination's link-layer address (resolved via ARP/NDP).
    ///
    /// Stored as 128-bit value: IPv4 gateways use IPv4-mapped-IPv6 encoding.
    pub gateway: [u8; 16],

    /// Output interface index. Indexes into `NetNamespace::interfaces` ([Section 17.1](17-containers.md#namespace-architecture--capability-domain-mapping)).
    ///
    /// The physical or virtual interface through which the packet is transmitted
    /// to reach the gateway (or the destination, for on-link routes).
    pub ifindex: u32,

    /// Weight for ECMP distribution. Range: 1-256 (matching Linux `ip route add ...
    /// nexthop ... weight N`). Higher weight = proportionally more traffic.
    /// Default: 1 (equal distribution). Ignored for `NextHopGroup::Single`.
    pub weight: u16,

    /// Next-hop flags.
    pub flags: NextHopFlags,

    /// MPLS label stack (for MPLS forwarding, if configured).
    ///
    /// `label_count == 0` for non-MPLS routes (the common case). When non-zero,
    /// `labels[0..label_count]` are pushed as an MPLS label stack before
    /// forwarding. Used by MPLS-based VPNs and segment routing.
    pub label_count: u8,

    /// MPLS labels to push (up to 4 labels deep, matching Linux `RTA_ENCAP`).
    pub labels: [u32; 4],

    /// Encapsulation type for tunnel routes.
    ///
    /// When `encap_type != EncapType::None`, this next-hop requires tunnel
    /// encapsulation before forwarding. The encapsulation parameters are carried
    /// in the route's encapsulation attributes (netlink `RTA_ENCAP`). Used for
    /// VXLAN, Geneve, MPLS, and BPF lightweight tunnels.
    pub encap_type: EncapType,
}

/// Route scope — defines how far a destination is reachable.
#[repr(u8)]
pub enum RouteScope {
    /// Global scope: reachable via gateways (Internet routes).
    Universe = 0,
    /// Site-internal scope: reachable within a site but not globally.
    Site = 200,
    /// Link-local scope: directly attached to this link (no gateway).
    Link = 253,
    /// Host scope: this host itself (loopback, local addresses).
    Host = 254,
    /// Nowhere: destination is unreachable.
    Nowhere = 255,
}

/// Route type — determines packet handling action.
#[repr(u8)]
pub enum RouteType {
    /// Unknown / unspecified.
    Unspec = 0,
    /// Unicast route: forward to next-hop.
    Unicast = 1,
    /// Local route: deliver locally (address on this host).
    Local = 2,
    /// Broadcast route: deliver as link-layer broadcast.
    Broadcast = 3,
    /// Anycast route: deliver to any of a set of local addresses.
    Anycast = 4,
    /// Multicast route: deliver via multicast.
    Multicast = 5,
    /// Blackhole: silently drop.
    Blackhole = 6,
    /// Unreachable: drop and send ICMP Destination Unreachable.
    Unreachable = 7,
    /// Prohibit: drop and send ICMP Administratively Prohibited.
    Prohibit = 8,
    /// Throw: skip this table, continue to next policy rule.
    Throw = 9,
}

/// Route protocol identifier — who installed this route.
#[repr(u8)]
pub enum RouteProtocol {
    /// Route origin is unknown.
    Unspec = 0,
    /// Installed by ICMP redirect.
    Redirect = 1,
    /// Installed by the kernel (directly connected networks, local addresses).
    Kernel = 2,
    /// Installed at boot from static configuration.
    Boot = 3,
    /// Installed by administrator (static route via `ip route add`).
    Static = 4,
    /// Installed by the OSPF routing daemon.
    Ospf = 8,
    /// Installed by the RIP routing daemon.
    Rip = 9,
    /// Installed by the BGP routing daemon (Zebra/Quagga/FRR).
    Zebra = 11,
    /// Installed by the BIRD routing daemon.
    Bird = 12,
    /// Installed by a DHCP client.
    Dhcp = 16,
    /// Installed by a BPF program (lightweight tunnel, custom routing).
    Bpf = 200,
}

bitflags! {
    /// Route flags.
    pub struct RouteFlags: u32 {
        /// Route was installed by an ICMP redirect and may expire.
        const REDIRECT    = 1 << 0;
        /// Notify userspace when this route is used (for route monitoring).
        const NOTIFY      = 1 << 1;
        /// Route uses a cached gateway (PMTU entry).
        const CACHE       = 1 << 2;
        /// Route is an on-link route (gateway is on the directly attached network).
        const ONLINK      = 1 << 3;
        /// Route is a link-prefixed route (prefix is assigned to a link, not a host).
        const PREFIX_RT   = 1 << 4;
    }
}

bitflags! {
    /// Next-hop flags.
    pub struct NextHopFlags: u16 {
        /// Next-hop is currently dead (link down or neighbor unreachable).
        /// Excluded from ECMP distribution until restored.
        const DEAD        = 1 << 0;
        /// Next-hop is on-link (no gateway resolution needed).
        const ONLINK      = 1 << 1;
        /// Next-hop is a tunnel (requires encapsulation via `encap_type`).
        const ENCAP       = 1 << 2;
    }
}

/// Encapsulation type for tunnel routes.
#[repr(u8)]
pub enum EncapType {
    /// No encapsulation.
    None = 0,
    /// MPLS encapsulation (push label stack).
    Mpls = 1,
    /// BPF lightweight tunnel (custom encapsulation via BPF program).
    Bpf = 2,
    /// SEG6 (Segment Routing over IPv6).
    Seg6 = 3,
}

/// Maximum number of next-hops per multipath route. Matches Linux's
/// `MULTIPATH_MAX_NEXTHOPS` limit (256). Enforced at the `RTM_NEWNEXTHOP`
/// handler. The `SlabVec` inline capacity is 8 (covers >99% of ECMP
/// routes); routes with >8 next-hops spill to slab-allocated overflow.
pub const MAX_NEXTHOPS: usize = 256;

/// SlabVec inline capacity for next-hop arrays. 8 next-hops inline
/// covers the vast majority of ECMP routes (typically 2-4 paths).
/// At ~44 bytes per NextHop, inline capacity is ~352 bytes per route.
pub const NEXTHOP_INLINE_CAP: usize = 8;

/// Standard routing table IDs (matching Linux RT_TABLE_* constants).
pub const RT_TABLE_UNSPEC: u32 = 0;
pub const RT_TABLE_DEFAULT: u32 = 253;
pub const RT_TABLE_MAIN: u32 = 254;
pub const RT_TABLE_LOCAL: u32 = 255;
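The IPv4-mapped key encoding used by `FibTrieNode` above (prefix_len = 96 + the IPv4 prefix length) can be sketched as a small standalone helper; `map_v4_prefix` is an illustrative name, not part of the specified API:

```rust
/// Encode an IPv4 prefix into the trie's 128-bit key space using the
/// IPv4-mapped-IPv6 form `::ffff:a.b.c.d` (illustrative helper).
fn map_v4_prefix(octets: [u8; 4], v4_prefix_len: u8) -> (u128, u8) {
    assert!(v4_prefix_len <= 32);
    let v4 = u32::from_be_bytes(octets) as u128;
    // ::ffff:0:0/96 base: bits 96..112 from the MSB are all-ones,
    // everything above is zero, the low 32 bits carry the IPv4 address.
    let prefix = (0xffffu128 << 32) | v4;
    (prefix, 96 + v4_prefix_len)
}

fn main() {
    // 10.0.0.0/8 becomes ::ffff:10.0.0.0 with prefix_len 104 (96 + 8).
    let (p, len) = map_v4_prefix([10, 0, 0, 0], 8);
    assert_eq!(len, 104);
    assert_eq!(p, (0xffffu128 << 32) | 0x0a00_0000);
    // A /24 becomes 120, matching the FibTrieNode documentation.
    assert_eq!(map_v4_prefix([192, 168, 1, 0], 24).1, 120);
}
```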

16.6.2 Policy Routing Rules

/// Policy routing rule (evaluated for every packet to select the routing table).
///
/// Rules are evaluated in priority order (ascending). The first matching rule
/// determines which routing table to consult. If no rule matches, the default
/// chain (local -> main -> default) is used.
///
/// Corresponds to `ip rule add` commands and `RTM_NEWRULE`/`RTM_DELRULE` netlink
/// messages (Section 16.14).
///
/// **Performance**: For hosts without custom policy rules (the common case), rule
/// evaluation is skipped entirely — `RouteTable::default_chain` is used directly
/// (a static array lookup, ~1 cache miss). For hosts with policy rules, rules are
/// stored in a sorted `Vec` and evaluated linearly. The typical rule count is 3-20
/// (Linux default: 3 rules), so linear scan is optimal (fits in a single cache line
/// for the common case).
///
/// **50+ rule scalability**: Deployments with many policy rules (uncommon;
/// typical in VPN concentrators and multi-tenant routers with per-tenant
/// tables) switch automatically to a radix trie keyed on
/// (src_prefix, dst_prefix, tos) tuples once
/// `rules.len() > POLICY_TRIE_THRESHOLD = 64`. Below the threshold, the
/// linear scan costs at most 64 comparison branches per packet. Above it,
/// the trie lookup is O(W) bit comparisons (W = 128 bits for IPv6), a fixed
/// constant independent of the rule count N, and touches ~3-4 cache lines
/// rather than one cache line per few scanned rules. The threshold is 64
/// rather than 256 because rule sets of 50-200 rules (common in Kubernetes
/// pod firewall policies) would otherwise stay on the per-packet linear scan.
pub struct FibRule {
    /// Rule priority. Lower = higher precedence.
    ///
    /// Linux default rules: 0 (local table), 32766 (main table), 32767 (default
    /// table). User rules are typically in the range 1-32765.
    pub priority: u32,

    /// Source address prefix to match. `src_len == 0` means "any source".
    pub src: u128,
    /// Source prefix length (0-128). 0 = match all sources.
    pub src_len: u8,

    /// Destination address prefix to match. `dst_len == 0` means "any destination".
    pub dst: u128,
    /// Destination prefix length (0-128). 0 = match all destinations.
    pub dst_len: u8,

    /// Incoming interface name to match. Empty string means "any interface".
    /// Matches Linux `ip rule add iif <name>`.
    pub iif: InterfaceName,

    /// Outgoing interface name to match. Empty string means "any interface".
    /// Matches Linux `ip rule add oif <name>`.
    pub oif: InterfaceName,

    /// Packet mark to match (`NetBuf::mark`). `mark_mask == 0` means "any mark".
    /// Matches Linux `ip rule add fwmark <value>/<mask>`.
    pub mark: u32,
    /// Mask applied to `NetBuf::mark` before comparing with `mark`.
    pub mark_mask: u32,

    /// IP protocol to match (e.g., 6 for TCP, 17 for UDP). 0 = any protocol.
    /// Matches Linux `ip rule add ipproto <proto>`.
    pub ip_proto: u8,

    /// Source port range to match. Both 0 = any port.
    /// Matches Linux `ip rule add sport <start>-<end>`.
    pub sport_start: u16,
    pub sport_end: u16,

    /// Destination port range to match. Both 0 = any port.
    pub dport_start: u16,
    pub dport_end: u16,

    /// UID range to match (originating process UID). Both 0 = any UID.
    /// Matches Linux `ip rule add uidrange <start>-<end>`.
    pub uid_start: u32,
    pub uid_end: u32,

    /// Action to take when this rule matches.
    pub action: FibRuleAction,

    /// Target routing table ID (for `FibRuleAction::Lookup`).
    ///
    /// When `action == Lookup`, the packet is looked up in this table.
    /// Ignored for other actions.
    pub table: u32,

    /// Whether this rule suppresses prefix lengths below a threshold.
    ///
    /// `suppress_prefixlen`: if the lookup in `table` returns a match with a
    /// prefix length <= this value, the match is suppressed and the next rule
    /// is tried. Used to implement "don't use the default route from this table"
    /// (set `suppress_prefixlen = 0` to suppress /0 default routes).
    /// Value 0xFFFF disables suppression.
    pub suppress_prefixlen: u16,
}

/// Policy routing rule action.
#[repr(u8)]
pub enum FibRuleAction {
    /// Look up the packet in the specified routing table (`FibRule::table`).
    Lookup = 1,
    /// Drop the packet (equivalent to blackhole route at rule level).
    Blackhole = 2,
    /// Drop and send ICMP Destination Unreachable.
    Unreachable = 3,
    /// Drop and send ICMP Administratively Prohibited.
    Prohibit = 4,
}
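The first-match evaluation described above can be sketched with a reduced rule shape; the real `FibRule` matches many more keys, but the control flow is identical. All names here are illustrative, and IPv4 sources are left-aligned into the u128 key for brevity rather than using the IPv4-mapped encoding:

```rust
/// Reduced policy rule: match on source prefix only (illustration).
struct MiniRule {
    priority: u32,
    src: u128,
    src_len: u8,
    table: u32,
}

/// True if `addr` falls within `prefix/len` (len == 0 matches anything).
fn prefix_match(addr: u128, prefix: u128, len: u8) -> bool {
    if len == 0 {
        return true;
    }
    let shift = 128 - len as u32;
    (addr >> shift) == (prefix >> shift)
}

/// Rules are kept sorted ascending by priority; the first matching rule
/// decides which routing table the packet is looked up in.
fn select_table(rules: &[MiniRule], src: u128) -> Option<u32> {
    rules
        .iter()
        .find(|r| prefix_match(src, r.src, r.src_len))
        .map(|r| r.table)
}

fn main() {
    let rules = [
        // Priority 100: sources in 10.0.0.0/8 use table 100.
        MiniRule { priority: 100, src: 0x0a000000u128 << 96, src_len: 8, table: 100 },
        // Priority 32766: catch-all rule pointing at the main table (254).
        MiniRule { priority: 32766, src: 0, src_len: 0, table: 254 },
    ];
    assert!(rules[0].priority < rules[1].priority); // sorted by priority
    assert_eq!(select_table(&rules, 0x0a010203u128 << 96), Some(100));
    assert_eq!(select_table(&rules, 0xc0a80001u128 << 96), Some(254));
}
```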

16.6.3 Route Lookup Algorithm

/// Result of a FIB lookup, cached via `NetBuf::route_ext`.
///
/// Contains all information needed for packet forwarding without re-consulting
/// the routing table. The result is valid for the lifetime of the NetBuf (see
/// RCU safety note in `NetBuf::route_ext`).
/// kernel-internal, not KABI — attached to NetBuf, never crosses isolation boundary.
#[repr(C)]
pub struct RouteLookupResult {
    /// The selected next-hop for this packet.
    ///
    /// For ECMP routes, this is the specific next-hop selected by
    /// `flow_hash % total_weight`. For single-hop routes, this is the sole next-hop.
    pub next_hop: NextHop,

    /// Effective MTU for this path.
    ///
    /// `min(route.mtu, output_interface.mtu)`. If `route.mtu == 0`, this is just
    /// the output interface's MTU. Used for IP fragmentation decisions and TCP MSS
    /// clamping.
    pub mtu: u32,

    /// Preferred source address for locally-originated packets.
    ///
    /// Copied from `RouteEntry::prefsrc` if set, otherwise determined by the
    /// source address selection algorithm (RFC 6724 for IPv6, longest-match for IPv4).
    pub prefsrc: [u8; 16],

    /// Route type (Unicast, Local, Broadcast, Blackhole, etc.).
    /// Determines the forwarding action.
    pub route_type: RouteType,

    /// Table ID that provided this result. Used for debugging and netlink reporting.
    pub table_id: u32,
}
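The effective-MTU rule stated for `RouteLookupResult::mtu` is, concretely (illustrative helper, not spec API):

```rust
/// Effective path MTU: route-level override if set, clamped by the
/// output interface MTU. A zero route MTU means "no override".
fn effective_mtu(route_mtu: u32, iface_mtu: u32) -> u32 {
    if route_mtu == 0 {
        iface_mtu
    } else {
        route_mtu.min(iface_mtu)
    }
}

fn main() {
    assert_eq!(effective_mtu(0, 1500), 1500);    // no override: interface MTU
    assert_eq!(effective_mtu(1400, 1500), 1400); // PMTU-discovered tunnel path
    assert_eq!(effective_mtu(9000, 1500), 1500); // override cannot exceed link MTU
}
```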

impl RouteTable {
    /// Perform a FIB lookup for the given destination address.
    ///
    /// This is the primary packet-path entry point. The algorithm:
    ///
    /// 1. **Rule evaluation**: If policy rules are configured, iterate `rules` in
    ///    priority order. For each matching rule:
    ///    a. If `action == Lookup`: look up the destination in the specified table.
    ///       If a route is found and not suppressed by `suppress_prefixlen`, return it.
    ///    b. If `action == Blackhole/Unreachable/Prohibit`: return immediately with
    ///       the corresponding `RouteType`.
    /// 2. **Default chain**: If no rule matched (or no custom rules exist), look up
    ///    the destination in each table in `default_chain` order (local, main, default).
    ///    Return the first match.
    /// 3. **No route**: If no table contains a matching route, return
    ///    `Err(KernelError::NetUnreachable)`.
    ///
    /// **ECMP selection**: When the matching `RouteEntry` has a `NextHopGroup::Multipath`,
    /// the specific next-hop is selected using `flow_hash % total_weight` (see
    /// `NextHopGroup::Multipath` documentation). Dead next-hops (with `DEAD` flag)
    /// are excluded and their weight is subtracted from `total_weight` for the
    /// selection computation.
    ///
    /// **Performance**: For the common case (no policy rules, single default route),
    /// lookup is: 1 array access (default_chain[0] = table 255) + 1 trie walk
    /// (local table miss, typically 1-2 nodes) + 1 array access (default_chain[1]
    /// = table 254) + 1 trie walk (main table hit). Total: ~4-6 memory accesses,
    /// ~200-400 ns with warm cache.
    ///
    /// # Preconditions
    /// - Caller holds `rcu_read_lock()` (packet processing context).
    /// - `dst` is a 128-bit address (IPv4 uses IPv4-mapped-IPv6 encoding).
    ///
    /// # Cross-reference
    /// - `bpf_fib_lookup()` BPF helper (Section 16.15): wraps this function,
    ///   requires `CAP_NET_ROUTE_READ` capability in the BPF domain.
    /// - `NetBuf::route_ext` (Section 16.4): the result is stored here (via
    ///   slab-allocated extension pointer) to avoid repeated lookups for the same packet.
    ///
    /// **Callsites**: TX path — `ip_route_output()` called from
    /// `ip_queue_xmit()` (TCP) / `ip_push_pending_frames()` (UDP).
    /// RX path — `ip_rcv()` after header validation. Route lookup always
    /// runs in the umka-net Tier 1 domain under `rcu_read_lock()`.
    pub fn lookup(
        &self,
        dst: u128,
        src: u128,
        mark: u32,
        ifindex: u32,
        protocol: u8,
        sport: u16,
        dport: u16,
        uid: u32,
        flow_hash: u32,
    ) -> Result<RouteLookupResult, KernelError>;
}

impl FibTrie {
    /// Longest-prefix-match lookup in this trie.
    ///
    /// Walks the trie from root to leaf, following the path determined by the
    /// destination address bits. At each node, if `node.route` is `Some`, it is
    /// recorded as the current best match. The walk continues into the child
    /// determined by the next bit of `dst` (left for 0, right for 1). When no
    /// more children exist (or the next bit's child is `None`), the most recent
    /// recorded match is returned.
    ///
    /// **Complexity**: O(W) where W is the prefix length of the matching route.
    /// For IPv4, W <= 32; for IPv6, W <= 128. In practice, path compression reduces
    /// the number of actual node visits to 5-20 for typical routing tables.
    ///
    /// Returns `None` if no prefix in this trie matches `dst`.
    pub fn longest_prefix_match(&self, dst: u128) -> Option<&RouteEntry>;

    /// Insert a route into this trie.
    ///
    /// Creates a new trie version by path-copying: only the nodes on the path from
    /// root to the inserted prefix are newly allocated; all other nodes are shared
    /// with the previous version via `Arc`. Returns the new trie.
    ///
    /// If a route with the same `(dst_prefix, dst_prefix_len)` already exists, it
    /// is replaced. The old route's memory is freed after the RCU grace period.
    ///
    /// # Usage pattern (RCU update)
    /// ```
    /// let mut new_table = route_table.clone();  // Arc-shared nodes, O(1)
    /// let new_trie = old_trie.insert(entry);    // path-copy, O(W) where W = key width
    /// new_table.tables.insert(table_id, new_trie);
    /// namespace.routes.update(new_table);        // RCU swap
    /// ```
    pub fn insert(&self, entry: RouteEntry) -> FibTrie;

    /// Remove a route from this trie.
    ///
    /// Same path-copying strategy as `insert()`. Returns the new trie, or an
    /// unchanged copy if no matching route was found.
    pub fn remove(&self, dst_prefix: u128, dst_prefix_len: u8) -> FibTrie;
}
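The walk described in `longest_prefix_match()` can be demonstrated on an uncompressed binary trie. This sketch omits path compression and `Arc` sharing, left-aligns IPv4 keys into the u128 for brevity, and reduces route entries to an id; all names are illustrative:

```rust
/// Uncompressed binary trie node (illustrates the LPM walk only;
/// the real FibTrieNode adds path compression and Arc sharing).
#[derive(Default)]
struct Node {
    route: Option<u32>, // route id at this prefix, if any
    left: Option<Box<Node>>,
    right: Option<Box<Node>>,
}

fn insert(root: &mut Node, prefix: u128, len: u8, route: u32) {
    let mut n = root;
    for i in 0..len as u32 {
        // Descend by the i-th bit from the MSB, creating nodes as needed.
        n = if (prefix >> (127 - i)) & 1 == 0 {
            n.left.get_or_insert_with(Box::default).as_mut()
        } else {
            n.right.get_or_insert_with(Box::default).as_mut()
        };
    }
    n.route = Some(route);
}

/// Walk toward the destination, remembering the deepest route seen.
fn longest_prefix_match(root: &Node, dst: u128) -> Option<u32> {
    let mut best = root.route; // a default route would live at the root
    let mut n = root;
    for i in 0..128u32 {
        let child = if (dst >> (127 - i)) & 1 == 0 {
            n.left.as_deref()
        } else {
            n.right.as_deref()
        };
        match child {
            Some(c) => {
                if c.route.is_some() {
                    best = c.route;
                }
                n = c;
            }
            None => break, // no more-specific branch: best match stands
        }
    }
    best
}

fn main() {
    let mut root = Node::default();
    insert(&mut root, 0x0a000000u128 << 96, 8, 1);  // 10.0.0.0/8  -> route 1
    insert(&mut root, 0x0a010000u128 << 96, 16, 2); // 10.1.0.0/16 -> route 2
    assert_eq!(longest_prefix_match(&root, 0x0a010203u128 << 96), Some(2));
    assert_eq!(longest_prefix_match(&root, 0x0a090000u128 << 96), Some(1));
    assert_eq!(longest_prefix_match(&root, 0xc0a80001u128 << 96), None);
}
```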

16.6.4 FIB Trie Construction: Level-Compressed Trie (LC-Trie) Reference Algorithm

Reference: Nilsson & Karlsson, "IP-Address Lookup Using LC-Tries" (IEEE J-SAC 1999).

UmkaOS uses a path-compressed Patricia trie (described above) rather than an LC-trie, because the LC-trie's level-compression step requires periodic rebalancing incompatible with per-entry RCU publishing (rebalancing would require atomically publishing many nodes). This section specifies the LC-trie algorithm for reference — it is used in the /proc/net/fib_trie compatibility dump (Section 16.5) and serves as the authoritative definition for the level-compression rationale cited in the FibTrie documentation.

LC-trie data structures:

/// A node in the level-compressed FIB trie.
pub enum FibNode {
    /// Internal branching node: branch on `stride` bits starting at `skip` bits from MSB.
    Branch {
        /// Number of bits to skip (path compression: skip nodes with one child).
        skip: u8,
        /// Number of bits in this level's branch key (level compression).
        stride: u8,
        /// 2^stride children. Index by extracting `stride` bits at position `skip`.
        children: Box<[FibNode]>,
    },
    /// Leaf node: matched prefix, points to nexthop table.
    Leaf {
        nexthop: NextHopId,
    },
}

pub struct LcFibTrie {
    pub root: FibNode,
    /// Total number of prefixes stored.
    pub prefix_count: u32,
}

Path compression: if an internal node has exactly one non-empty child, skip it (increment skip counter). This eliminates chains of single-child nodes common in sparse prefix distributions.

Level compression: if the next k levels below a node form a complete binary subtree (all 2^k branch paths populated), those k single-bit levels are replaced by a single node branching on k bits (`stride = k`). This merges multiple single-bit branch levels into one multi-bit lookup.

Construction algorithm:

1. Insert all prefixes as leaf nodes at their natural bit depth (no compression).
2. Bottom-up pass: for each internal node with a single non-empty child, apply path compression (absorb the node into the child's `skip`).
3. Bottom-up pass: for each internal node whose next k levels form a complete binary subtree, apply level compression (replace them with a single `Branch` of `stride = k`).
4. Repeat until no further compression is possible.

Lookup (O(W) steps, W = 32 or 128):

fn lookup(trie: &LcFibTrie, dest: IpAddr) -> NextHopId:
  node = &trie.root
  bit_pos = 0
  loop:
    match node:
      Leaf { nexthop } → return nexthop
      Branch { skip, stride, children }:
        bit_pos += skip as usize
        index = extract_bits(dest, bit_pos, stride as usize)
        node = &children[index]
        bit_pos += stride as usize

Why O(W) not O(log N): the trie walk terminates when a Leaf is reached or all W address bits are consumed. Each loop iteration advances bit_pos by at least skip + stride >= 1 bits (stride >= 1 for any branching node). Therefore the loop executes at most W iterations regardless of prefix count N. A 4M-entry trie takes the same O(W) steps as a 1K-entry trie.
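The `extract_bits` helper assumed by the pseudocode above can be written as follows (MSB-first bit numbering, 32-bit IPv4 keys for brevity; illustrative, not spec API):

```rust
/// Extract `n` bits of `addr` starting `pos` bits from the MSB,
/// as used by the LC-trie lookup loop (illustrative helper).
fn extract_bits(addr: u32, pos: u32, n: u32) -> usize {
    debug_assert!(n >= 1 && pos + n <= 32);
    // Widen to u64 so that n == 32 does not overflow the mask shift.
    (((addr as u64) >> (32 - pos - n)) & ((1u64 << n) - 1)) as usize
}

fn main() {
    let addr = u32::from_be_bytes([192, 168, 0, 1]);
    assert_eq!(extract_bits(addr, 0, 8), 192);  // first octet
    assert_eq!(extract_bits(addr, 8, 8), 168);  // second octet
    assert_eq!(extract_bits(addr, 0, 2), 0b11); // 192 = 0b1100_0000
}
```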

16.6.5 Batch Mutation API

Under BGP churn (route reconvergence), a router may receive hundreds of route updates in a single netlink batch. Without batching, each RTM_NEWROUTE / RTM_DELROUTE triggers a separate per-entry RCU publish on the trie: create O(depth) new nodes on the modified path, publish each via rcu_assign_pointer(), free old nodes via rcu_call(). With N updates in the batch, this produces N sets of O(depth) node allocations and N sets of O(depth) RCU callbacks.

The batch mutation API amortizes this cost:

/// Accumulates multiple FIB trie mutations into a single RCU update.
/// All N mutations in a batch share intermediate trie nodes, reducing
/// allocation from O(N * depth) to O(N + depth).
pub struct FibTrieBatchBuilder {
    /// Working copy of the trie root (cloned once at batch start).
    /// Shared nodes are copied on first write, so unchanged subtrees
    /// stay shared with the published version.
    root: Option<Arc<FibTrieNode>>,
    /// Number of mutations accumulated.
    count: u32,
}

impl FibTrieBatchBuilder {
    /// Begin a batch. Clones the current trie root under `config_lock`.
    pub fn new(fib: &FibTable) -> Self;

    /// Insert or replace a route prefix. Mutates the working copy in-place
    /// (no RCU publish yet). O(depth) node allocation, but nodes are shared
    /// across subsequent mutations in the same batch.
    pub fn insert(&mut self, prefix: IpPrefix, nexthop: NextHop) -> Result<(), FibError>;

    /// Remove a route prefix from the working copy.
    pub fn remove(&mut self, prefix: IpPrefix) -> Result<(), FibError>;

    /// Publish all accumulated mutations as a single atomic RCU update.
    /// Produces exactly ONE new trie version, regardless of mutation count.
    /// Returns the number of mutations applied.
    pub fn commit(self, fib: &FibTable) -> Result<u32, FibError>;
}

Performance: For N route updates in a single BGP batch, the non-batched path performs N RcuCell::update() calls, each triggering an RCU grace period for the old trie version. The batched path performs 1 RcuCell::update() call, with shared intermediate nodes reducing total allocation by approximately (N-1) * depth nodes. Under typical BGP reconvergence (N = 50-500, depth = 32 for IPv4), this reduces allocation pressure by 90-98%.
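The one-publish-per-batch property can be illustrated with an executable toy stand-in: a `BTreeMap` in place of the trie and an `Arc` swap in place of the RCU publish. All names below are illustrative, not spec API:

```rust
use std::collections::BTreeMap;
use std::sync::{Arc, RwLock};

/// Toy FIB: (prefix, len) -> next-hop id, published as an immutable snapshot.
type Table = BTreeMap<(u128, u8), u32>;

struct ToyFib {
    current: RwLock<Arc<Table>>, // stand-in for the RcuCell
    publishes: RwLock<u32>,      // counts "RCU swaps"
}

struct ToyBatch {
    working: Table, // working copy, cloned once at batch start
    count: u32,
}

impl ToyBatch {
    fn new(fib: &ToyFib) -> Self {
        Self { working: (**fib.current.read().unwrap()).clone(), count: 0 }
    }
    fn insert(&mut self, prefix: (u128, u8), nexthop: u32) {
        self.working.insert(prefix, nexthop);
        self.count += 1;
    }
    fn remove(&mut self, prefix: (u128, u8)) {
        self.working.remove(&prefix);
        self.count += 1;
    }
    /// One atomic publish, regardless of how many mutations accumulated.
    fn commit(self, fib: &ToyFib) -> u32 {
        *fib.current.write().unwrap() = Arc::new(self.working);
        *fib.publishes.write().unwrap() += 1;
        self.count
    }
}

fn main() {
    let fib = ToyFib {
        current: RwLock::new(Arc::new(Table::new())),
        publishes: RwLock::new(0),
    };
    let mut batch = ToyBatch::new(&fib);
    batch.insert((0x0a000000u128 << 96, 8), 1);
    batch.insert((0x0a010000u128 << 96, 16), 2);
    batch.remove((0x0a010000u128 << 96, 16));
    assert_eq!(batch.commit(&fib), 3);             // three mutations applied...
    assert_eq!(*fib.publishes.read().unwrap(), 1); // ...but a single publish
    assert_eq!(fib.current.read().unwrap().len(), 1);
}
```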

16.6.5.1 BGP Full-Table Convergence Performance

Write-side serialization tradeoff: All trie mutations are serialized by NetNamespace::config_lock. This is an explicit design tradeoff: the data plane (packet forwarding at 100Mpps) uses lockless RCU reads with zero contention, at the cost of serialized writes on the control plane (route updates from BGP/OSPF).

For a dedicated BGP router processing a full-table feed from one upstream, this is a single writer with no contention. For a router with multiple BGP sessions updating the same table, write contention occurs on config_lock.

BGP convergence estimate (1M IPv4 routes, 100K route withdrawals):

| Metric | Value |
|--------|-------|
| Typical netlink batch size (FRR/BIRD) | 50-500 routes |
| `FIB_BATCH_MAX` | 4096 mutations |
| Batches for 100K withdrawals | ~25-2000 (depending on daemon batch size) |
| RCU swaps | 1 per batch |
| Per-batch clone cost | O(1) root clone + O(N) path mutations (N = batch size) |
| Estimated convergence time | <1 second (25 batches at ~1ms each) |

Linux uses in-place trie mutation with finer-grained locking, allowing concurrent modifications to different parts of the trie within the same table. UmkaOS's per-entry RCU publishing serializes writes per-table but provides lockless reads. For BGP routers, the control plane (route updates) is orders of magnitude less frequent than the data plane (packet forwarding), making this the correct tradeoff.

16.6.6 VRF Integration

Each VRF (Virtual Routing and Forwarding) instance is a separate routing table. When a VRF is created (via ip link add vrf0 type vrf table 100), a new entry is added to RouteTable::tables with the specified table ID. A policy rule is also added that directs packets received on VRF-enslaved interfaces to the VRF's table:

ip rule add iif <vrf-interface> table <vrf-table-id>
ip rule add oif <vrf-interface> table <vrf-table-id>

This integrates naturally with the policy routing rule evaluation described above: when a packet arrives on a VRF-enslaved interface, the matching rule directs the lookup to the VRF's private routing table, providing L3 domain isolation.

Cross-reference: VRF (Section 16.16, line 742-744), NetNamespace (Section 17.1), policy routing rules (Section 16.17, RTM_NEWRULE).

16.6.7 Route Management via Netlink

Route management is performed via NETLINK_ROUTE (Section 16.17):

  • RTM_NEWROUTE: Insert or replace a route. Translates to FibTrie::insert().
  • RTM_DELROUTE: Remove a route. Translates to FibTrie::remove().
  • RTM_GETROUTE: Perform a FIB lookup and return the result (used by ip route get). Translates to RouteTable::lookup().
  • RTM_NEWRULE / RTM_DELRULE: Add or remove a policy routing rule. Translates to insertion/removal in RouteTable::rules (followed by re-sort).

All write operations are serialized by NetNamespace::config_lock (Section 17.1). After mutation, the entire RouteTable is published via RcuCell::update(). The old RouteTable's FibTrieNodes are freed after the RCU grace period, but most nodes are shared with the new version (via Arc) and are only freed when their reference count reaches zero.

16.6.8 bpf_fib_lookup() Integration

The bpf_fib_lookup() BPF helper (Section 16.18, capability: CAP_NET_ROUTE_READ) wraps RouteTable::lookup() for BPF programs. The helper:

  1. Reads the destination address and other lookup keys from the BPF program's packet context (NetBuf metadata accessible via the BPF domain's read-only mapping, or from function parameters for TC/XDP programs).
  2. Calls RouteTable::lookup() under rcu_read_lock() in the umka-net domain (cross-domain helper invocation, ~23 cycles for domain switch on x86-64).
  3. Writes the result (RouteLookupResult) to BPF-accessible memory.
  4. Returns BPF_FIB_LKUP_RET_SUCCESS (0) on match, or one of the following error codes:

  • BPF_FIB_LKUP_RET_BLACKHOLE (1): route is a blackhole (drop silently)
  • BPF_FIB_LKUP_RET_UNREACHABLE (2): destination is unreachable
  • BPF_FIB_LKUP_RET_PROHIBIT (3): destination is administratively prohibited
  • BPF_FIB_LKUP_RET_NOT_FWDED (4): lookup succeeded but packet is local
  • BPF_FIB_LKUP_RET_FWD_DISABLED (5): forwarding disabled on interface
  • BPF_FIB_LKUP_RET_UNSUPP_LWT (6): forwarding requires unsupported LWT encapsulation
  • BPF_FIB_LKUP_RET_NO_NEIGH (7): next-hop neighbor not resolved
  • BPF_FIB_LKUP_RET_FRAG_NEEDED (8): fragmentation needed (params->mtu_result contains MTU)

This matches the Linux bpf_fib_lookup() return value semantics, ensuring compatibility with existing XDP programs (e.g., Cilium, Katran) that use this helper for fast-path routing decisions.
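The return codes above can be captured as a Rust enum; a sketch (the numeric values mirror the Linux BPF_FIB_LKUP_RET_* UAPI constants, and `should_drop` encodes one common XDP policy — drop on the administrative codes — not a mandated behavior):

```rust
/// bpf_fib_lookup() return codes (values match Linux BPF_FIB_LKUP_RET_*).
#[repr(u32)]
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum FibLookupRet {
    Success = 0,
    Blackhole = 1,
    Unreachable = 2,
    Prohibit = 3,
    NotFwded = 4,
    FwdDisabled = 5,
    UnsuppLwt = 6,
    NoNeigh = 7,
    FragNeeded = 8,
}

impl FibLookupRet {
    /// Common XDP policy: drop on the administrative codes;
    /// SUCCESS means forward, the rest mean "pass to the kernel stack".
    pub fn should_drop(self) -> bool {
        matches!(self, Self::Blackhole | Self::Unreachable | Self::Prohibit)
    }
}
```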

16.7 Neighbor Subsystem (ARP/NDP)

Every IP packet ultimately requires L2 (link-layer) address resolution — mapping an IP next-hop address to a hardware (MAC) address. The neighbor subsystem manages this mapping for both IPv4 (ARP — RFC 826) and IPv6 (NDP — RFC 4861).

/// Neighbor cache entry — maps an L3 (IP) address to an L2 (MAC) address.
///
/// Each entry tracks the state of neighbor reachability and the resolved
/// hardware address. Entries transition through a state machine matching
/// RFC 4861 Section 8.3 (NDP) and RFC 826 (ARP).
pub struct NeighborEntry {
    /// L3 (network) address. IPv4 (4 bytes) or IPv6 (16 bytes).
    pub ip_addr: IpAddr,

    /// L2 (hardware) address. Variable length: 6 bytes for Ethernet,
    /// 20 bytes for InfiniBand GID, up to 32 bytes (MAX_ADDR_LEN,
    /// matching Linux). Valid only in REACHABLE, STALE, DELAY, PROBE states.
    pub hw_addr: [u8; 32],
    /// Valid length of hw_addr (6 for Ethernet, 20 for IB, etc.).
    pub hw_addr_len: u8,

    /// Current state in the neighbor reachability state machine.
    pub state: AtomicU8,  // NeighborState discriminant

    /// Output network interface index.
    pub ifindex: u32,

    /// Timestamp of last confirmed reachability (nanoseconds, monotonic).
    /// Used for REACHABLE→STALE timeout (default 30 seconds for IPv6,
    /// configurable via `base_reachable_time` sysctl).
    pub confirmed_ns: AtomicU64,

    /// Number of unanswered solicitations sent in PROBE state.
    pub probes_sent: AtomicU8,

    /// Queue of packets waiting for address resolution (INCOMPLETE state).
    /// Maximum 3 packets queued; excess are dropped (matching Linux behavior).
    pub pending_queue: SpinLock<ArrayVec<NetBufHandle, 3>>,

    /// Hash table linkage. Uses intrusive `HashListNode` for O(1) removal
    /// from the `RcuHashTable`. This is an accepted pattern for RCU-protected
    /// hash tables per [Section 3.13](03-concurrency.md#collection-usage-policy) — the intrusive node
    /// avoids the indirection of a separate hash map entry pointing to the
    /// NeighborEntry. NOT the banned "IntrusiveList as general container"
    /// pattern; this is hash-bucket chaining.
    pub hash_node: HashListNode,

    /// RCU head for deferred freeing. Embedded (not heap-allocated) to avoid
    /// allocation on the removal path. `call_rcu()` enqueues via this field;
    /// after the grace period, the callback frees the entire NeighborEntry.
    pub rcu_head: RcuHead,

    /// Reference count.
    pub refcount: AtomicU32,
}

/// Neighbor reachability states (RFC 4861 Section 8.3.2).
#[repr(u8)]
pub enum NeighborState {
    /// Address resolution in progress. Solicitations are being sent.
    /// Packets to this neighbor are queued (up to 3).
    Incomplete = 0,
    /// L2 address is known and recently confirmed reachable.
    /// Timeout: base_reachable_time (default 30s, randomized +/-50%).
    Reachable = 1,
    /// Reachable timeout expired. L2 address is probably still valid
    /// but has not been confirmed recently.
    Stale = 2,
    /// Traffic was sent to this neighbor from STALE state.
    /// Wait delay_first_probe_time (default 5s) before probing.
    Delay = 3,
    /// Actively probing. Unicast solicitations sent every retrans_timer
    /// (default 1s). Max ucast_solicit (default 3) probes before FAILED.
    Probe = 4,
    /// Address resolution failed. All queued packets dropped with
    /// EHOSTUNREACH. Entry may be garbage-collected.
    Failed = 5,
    /// Permanently configured (static ARP entry / `ip neigh add`).
    /// Never times out or transitions.
    Permanent = 6,
}

/// Per-namespace neighbor table.
pub struct NeighborTable {
    /// Hash table mapping IP addresses to neighbor entries.
    /// RCU-protected for lockless lookup on the packet forwarding path.
    /// Keyed by (ifindex, ip_addr) for interface-scoped lookups.
    pub entries: RcuHashTable<NeighborEntry>,

    /// Garbage collection timer. Runs periodically to remove FAILED and
    /// expired STALE entries.
    pub gc_timer: Timer,

    /// Configuration parameters (sysctl-configurable per-interface).
    pub config: NeighborConfig,
}

pub struct NeighborConfig {
    /// Time after which a REACHABLE entry transitions to STALE.
    /// Default: 30_000_000_000 ns (30 seconds). Randomized +/-50%.
    pub base_reachable_time_ns: u64,
    /// Delay before first probe in DELAY state.
    /// Default: 5_000_000_000 ns (5 seconds).
    pub delay_first_probe_ns: u64,
    /// Interval between unicast probes in PROBE state.
    /// Default: 1_000_000_000 ns (1 second).
    pub retrans_timer_ns: u64,
    /// Maximum unicast probes before transition to FAILED.
    pub ucast_solicit: u8,  // Default: 3
    /// Maximum multicast probes (INCOMPLETE state).
    pub mcast_solicit: u8,  // Default: 3
    /// Below gc_thresh1, no GC runs. Default: 128.
    pub gc_thresh1: u32,
    /// Above gc_thresh2, GC runs at accelerated intervals (5s). Default: 512.
    pub gc_thresh2: u32,
    /// Hard limit: at gc_thresh3, new entries are rejected with EHOSTUNREACH.
    /// Default: 1024. Exposed via sysctl `net.ipv4.neigh.default.gc_thresh{1,2,3}`.
    pub gc_thresh3: u32,
}
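The three thresholds compose into a simple admission/GC policy. A sketch of the decision logic (`GcAction` and `gc_action` are hypothetical helpers; defaults from NeighborConfig above):

```rust
#[derive(Debug, PartialEq, Eq)]
enum GcAction {
    /// At or below gc_thresh1: no garbage collection.
    None,
    /// Between gc_thresh1 and gc_thresh2: GC at the normal interval.
    Normal,
    /// Above gc_thresh2: GC at the accelerated 5-second interval.
    Accelerated,
    /// At or above gc_thresh3: reject new entries with EHOSTUNREACH.
    RejectNew,
}

fn gc_action(entries: u32, t1: u32, t2: u32, t3: u32) -> GcAction {
    if entries >= t3 {
        GcAction::RejectNew
    } else if entries > t2 {
        GcAction::Accelerated
    } else if entries > t1 {
        GcAction::Normal
    } else {
        GcAction::None
    }
}
```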

State machine transitions:

                   ┌─────────────┐
    ┌──────────────│ INCOMPLETE  │──── resolution timeout ────► FAILED
    │              └──────┬──────┘
    │                     │ reply received
    │                     ▼
    │              ┌─────────────┐
    │              │  REACHABLE  │
    │              └──────┬──────┘
    │                     │ reachable_time expires
    │                     ▼
    │              ┌─────────────┐
    │              │    STALE    │
    │              └──────┬──────┘
    │                     │ traffic sent to neighbor
    │                     ▼
    │              ┌─────────────┐
    │              │    DELAY    │
    │              └──────┬──────┘
    │                     │ delay_first_probe expires
    │                     ▼
    │              ┌─────────────┐
    │              │    PROBE    │──── max probes exceeded ────► FAILED
    │              └──────┬──────┘
    │                     │ reply received
    │                     ▼
    └─────────────── REACHABLE (confirmed)
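
The diagram reduces to a pure transition function; a sketch (`State` and `Event` are simplified stand-ins for NeighborState and the timer/packet events named in the diagram):

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum State { Incomplete, Reachable, Stale, Delay, Probe, Failed, Permanent }

#[derive(Debug, Clone, Copy)]
enum Event {
    /// ARP reply / Neighbor Advertisement confirming the L2 address.
    ReplyReceived,
    /// base_reachable_time expired without confirmation.
    ReachableTimeout,
    /// A packet was transmitted toward the neighbor.
    TrafficSent,
    /// delay_first_probe_time expired.
    DelayTimeout,
    /// mcast_solicit / ucast_solicit probes exhausted.
    ProbesExhausted,
}

fn transition(s: State, e: Event) -> State {
    use {Event::*, State::*};
    match (s, e) {
        (Permanent, _) => Permanent, // static entries never transition
        (Incomplete, ReplyReceived) => Reachable,
        (Incomplete, ProbesExhausted) => Failed,
        (Reachable, ReachableTimeout) => Stale,
        (Stale, TrafficSent) => Delay,
        (Delay, DelayTimeout) => Probe,
        (Probe, ReplyReceived) => Reachable,
        (Probe, ProbesExhausted) => Failed,
        // A confirmation in any other state also refreshes reachability.
        (_, ReplyReceived) => Reachable,
        (s, _) => s,
    }
}
```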

ARP operation (IPv4): On cache miss, the output path (after the route lookup in Section 16.6) calls neighbor_resolve(), which sends an ARP request (broadcast on the local link) and queues the packet. When the ARP reply arrives, the neighbor entry transitions to REACHABLE, the queued packets are transmitted, and subsequent packets use the cached MAC address directly.

16.7.1 IPv6 Neighbor Discovery Protocol (NDP)

NDP (RFC 4861) replaces ARP for IPv6 and additionally handles router discovery, prefix advertisement, address autoconfiguration (SLAAC), duplicate address detection (DAD), and redirect notification. All NDP messages are ICMPv6 packets with hop limit = 255 (used as a security check — packets with hop limit < 255 are silently discarded to prevent off-link spoofing).

16.7.1.1 NDP Message Types

/// ICMPv6 NDP message types (RFC 4861).
pub enum NdpMessageType {
    /// Type 133: Router Solicitation — host asks for router presence.
    RouterSolicitation = 133,
    /// Type 134: Router Advertisement — router announces prefix, MTU, flags.
    RouterAdvertisement = 134,
    /// Type 135: Neighbor Solicitation — resolve IPv6→MAC (like ARP request).
    NeighborSolicitation = 135,
    /// Type 136: Neighbor Advertisement — reply with MAC (like ARP reply).
    NeighborAdvertisement = 136,
    /// Type 137: Redirect — router informs host of a better next-hop.
    Redirect = 137,
}

Neighbor Solicitation (NS) is the IPv6 equivalent of an ARP request. To resolve address T, the host sends an NS to the solicited-node multicast address ff02::1:ffXX:XXXX (where XX:XXXX are the low-order 24 bits of T). The NS carries a Source Link-Layer Address (SLLA) option with the sender's MAC. The target host replies with a Neighbor Advertisement (NA) carrying a Target Link-Layer Address (TLLA) option. This maps directly to the INCOMPLETE → REACHABLE transition in the neighbor cache state machine above.
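The solicited-node multicast mapping described above is mechanical — the fixed ff02::1:ff00:0/104 prefix plus the target's low-order 24 bits. A sketch over raw 16-byte addresses:

```rust
/// Compute the solicited-node multicast address ff02::1:ffXX:XXXX
/// for an IPv6 target address (RFC 4291 §2.7.1): the low-order
/// 24 bits of the target are appended to the fixed /104 prefix.
fn solicited_node_multicast(target: [u8; 16]) -> [u8; 16] {
    let mut out = [0u8; 16];
    out[0] = 0xff;
    out[1] = 0x02; // ff02::/16, bytes 2..11 remain zero
    out[11] = 0x01;
    out[12] = 0xff; // ...:1:ffXX:XXXX
    out[13..16].copy_from_slice(&target[13..16]); // low 24 bits of target
    out
}
```

For example, resolving 2001:db8::1 sends the NS to ff02::1:ff00:1.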

Router Solicitation (RS) is sent by hosts at interface startup to discover routers without waiting for the next periodic Router Advertisement. Hosts send up to router_solicitations (default: 3) RS messages, spaced router_solicitation_interval (default: 4) seconds apart, after an initial router_solicitation_delay (default: 1) second delay.

Redirect messages are sent by routers to inform a host that a better first-hop exists for a particular destination. The redirect installs a host-specific route in the FIB pointing to the new next-hop. Redirect messages are only accepted from the current first-hop router for the destination.

16.7.1.2 Router Advertisement Structure

/// Router Advertisement message (ICMPv6 type 134).
/// Sent periodically by routers and in response to Router Solicitations.
/// Periodic interval: between MinRtrAdvInterval (default 200s) and
/// MaxRtrAdvInterval (default 600s), randomized within that range.
#[repr(C, packed)]
pub struct NdpRouterAdvertisement {
    /// Current hop limit for hosts to use (0 = unspecified).
    pub cur_hop_limit: u8,
    /// Flags: M (managed address via DHCPv6), O (other config via DHCPv6),
    /// H (home agent), Prf (default router preference: 00=medium, 01=high,
    /// 11=low). Bit layout: |M|O|H|Prf|Prf|0|0|0|.
    pub flags: u8,
    /// Router lifetime in seconds (0 = not a default router). Maximum 9000
    /// seconds (RFC 4861 §6.2.1). Governs how long this router remains in
    /// the default router list. Big-endian on wire (RFC 4861).
    pub router_lifetime: Be16,
    /// Reachable time in milliseconds (0 = unspecified). Copied into
    /// NeighborConfig::base_reachable_time_ns (converted to nanoseconds)
    /// for neighbor unreachability detection on this link. Big-endian on wire.
    pub reachable_time: Be32,
    /// Retransmit timer in milliseconds (0 = unspecified). Used as the
    /// interval between Neighbor Solicitation retransmissions during
    /// address resolution and NUD probing. Big-endian on wire.
    pub retrans_timer: Be32,
    // Followed by NDP options (variable length).
}
// Wire format (RFC 4861): cur_hop_limit(1)+flags(1)+router_lifetime(2)+reachable_time(4)+retrans_timer(4) = 12 bytes.
const_assert!(core::mem::size_of::<NdpRouterAdvertisement>() == 12);
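
The flags bit layout |M|O|H|Prf|Prf|0|0|0| can be decoded as follows; a sketch (`RaFlags` and `decode_ra_flags` are illustrative helpers, with the Prf encoding from RFC 4191):

```rust
/// Decoded Router Advertisement flags (bit layout |M|O|H|Prf|Prf|0|0|0|).
#[derive(Debug, PartialEq, Eq)]
struct RaFlags {
    managed: bool,      // M: addresses via DHCPv6
    other_config: bool, // O: other configuration via DHCPv6
    home_agent: bool,   // H: Mobile IPv6 home agent
    preference: i8,     // Prf: 0 = medium, 1 = high, -1 = low (RFC 4191)
}

fn decode_ra_flags(flags: u8) -> RaFlags {
    RaFlags {
        managed: flags & 0x80 != 0,
        other_config: flags & 0x40 != 0,
        home_agent: flags & 0x20 != 0,
        preference: match (flags >> 3) & 0b11 {
            0b01 => 1,  // high
            0b11 => -1, // low
            _ => 0,     // 00 = medium; reserved 10 is treated as medium
        },
    }
}
```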

16.7.1.3 NDP Options

NDP options follow a uniform Type-Length-Value encoding where the length field is in units of 8 bytes. All NDP messages (RS, RA, NS, NA, Redirect) can carry options. Unrecognized option types are silently ignored.

Type Name RFC Description
1 Source Link-Layer Address (SLLA) 4861 Sender's L2 address; included in RS, NS, RA
2 Target Link-Layer Address (TLLA) 4861 Target's L2 address; included in NA, Redirect
3 Prefix Information 4861 On-link prefix and SLAAC prefix; included in RA
4 Redirected Header 4861 Original packet header in Redirect messages
5 MTU 4861 Link MTU; included in RA (required on non-Ethernet links)
25 RDNSS (Recursive DNS Server) 8106 DNS server IPv6 addresses with lifetime
31 DNSSL (DNS Search List) 8106 DNS search domain suffixes with lifetime

/// Prefix Information option (NDP option type 3, RFC 4861 §4.6.2).
/// Carries prefix for SLAAC address configuration and on-link determination.
/// Included in Router Advertisements. Multiple Prefix Information options
/// can appear in a single RA (one per advertised prefix).
#[repr(C, packed)]
pub struct NdpPrefixInfo {
    /// Option type (3).
    pub option_type: u8,
    /// Option length in 8-byte units (4 = 32 bytes total).
    pub length: u8,
    /// Prefix length in bits (e.g., 64 for a /64).
    pub prefix_length: u8,
    /// Flags: L (on-link — hosts can reach destinations within this prefix
    /// directly without a router), A (autonomous address configuration —
    /// hosts should use this prefix for SLAAC), R (router address — the
    /// prefix contains the advertising router's complete address).
    pub flags: u8,
    /// Valid lifetime in seconds. Addresses derived from this prefix are
    /// valid for this duration. 0xFFFFFFFF = infinity. When this timer
    /// expires, derived addresses transition to INVALID state and are
    /// removed from the interface.
    pub valid_lifetime: Be32,
    /// Preferred lifetime in seconds. Must be <= valid_lifetime. When this
    /// timer expires, derived addresses transition from PREFERRED to
    /// DEPRECATED. 0xFFFFFFFF = infinity. Big-endian on wire.
    pub preferred_lifetime: Be32,
    /// Reserved (must be zero on transmit, ignored on receipt). Big-endian.
    pub reserved2: Be32,
    /// The prefix (128 bits). Only the first `prefix_length` bits are
    /// significant; remaining bits must be zero.
    pub prefix: Ipv6Addr,
}
// Wire format (RFC 4861): type(1)+length(1)+prefix_length(1)+flags(1)+valid_lt(4)+preferred_lt(4)+reserved2(4)+prefix(16) = 32 bytes.
const_assert!(core::mem::size_of::<NdpPrefixInfo>() == 32);

/// RDNSS option (NDP option type 25, RFC 8106 §5.1).
/// Carries one or more IPv6 addresses of recursive DNS servers.
#[repr(C, packed)]
pub struct NdpRdnss {
    /// Option type (25).
    pub option_type: u8,
    /// Option length in 8-byte units. Minimum 3 (one address).
    /// Formula: 1 + 2 * (number of addresses).
    pub length: u8,
    /// Reserved (must be zero). Big-endian on wire.
    pub reserved: Be16,
    /// Lifetime in seconds. After this time the RDNSS addresses should
    /// no longer be used. 0xFFFFFFFF = infinity. 0 = remove these servers. Big-endian.
    pub lifetime: Be32,
    // Followed by one or more Ipv6Addr (each 16 bytes).
    // Number of addresses = (length - 1) / 2.
}
// Wire format (RFC 8106): type(1)+length(1)+reserved(2)+lifetime(4) = 8 bytes (header only).
const_assert!(core::mem::size_of::<NdpRdnss>() == 8);

/// DNSSL option (NDP option type 31, RFC 8106 §5.2).
/// Carries DNS search domain suffixes.
#[repr(C, packed)]
pub struct NdpDnssl {
    /// Option type (31).
    pub option_type: u8,
    /// Option length in 8-byte units.
    pub length: u8,
    /// Reserved (must be zero). Big-endian on wire.
    pub reserved: Be16,
    /// Lifetime in seconds (same semantics as RDNSS lifetime). Big-endian.
    pub lifetime: Be32,
    // Followed by one or more domain names in DNS wire format
    // (length-prefixed labels, terminated by zero-length label).
    // Padded with zeros to 8-byte boundary.
}
// Wire format (RFC 8106): type(1)+length(1)+reserved(2)+lifetime(4) = 8 bytes (header only).
const_assert!(core::mem::size_of::<NdpDnssl>() == 8);
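
The uniform TLV encoding (length in 8-byte units, zero length invalid) makes option walking a short loop. A sketch, together with the RDNSS address-count formula from the struct comment above (`walk_ndp_options` is an illustrative helper, not a specified kernel API):

```rust
/// Walk NDP options in a message body. Each option is type(1) + length(1)
/// + payload, where length counts 8-byte units including the 2-byte header.
/// A zero length is invalid and aborts parsing (RFC 4861 §4.6).
fn walk_ndp_options(mut buf: &[u8]) -> Vec<(u8, &[u8])> {
    let mut opts = Vec::new();
    while buf.len() >= 2 {
        let (ty, len_units) = (buf[0], buf[1] as usize);
        let len = len_units * 8;
        if len_units == 0 || len > buf.len() {
            break; // malformed or truncated option
        }
        opts.push((ty, &buf[2..len]));
        buf = &buf[len..];
    }
    opts
}

/// Number of DNS server addresses in an RDNSS option of `length_units`
/// 8-byte units: (length - 1) / 2.
fn rdnss_addr_count(length_units: u8) -> u8 {
    length_units.saturating_sub(1) / 2
}
```

Unrecognized option types simply appear in the returned list and can be skipped by the caller, matching the "silently ignore" rule above.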

16.7.1.4 SLAAC (Stateless Address Autoconfiguration)

SLAAC (RFC 4862) allows IPv6 hosts to self-configure globally routable addresses without DHCPv6. The full sequence:

  1. Link-local address generation. On interface up, the host generates a link-local address: fe80::/10 prefix + 54 zero bits + 64-bit interface ID. The interface ID is derived from the MAC address via Modified EUI-64 (inserting 0xFFFE in the middle and flipping the Universal/Local bit), or generated randomly if privacy extensions are enabled (see below).

  2. DAD on link-local. The host performs Duplicate Address Detection (see DAD section below) on the tentative link-local address. If DAD fails, autoconfiguration halts and manual intervention is required (this indicates a duplicate MAC or misconfigured network).

  3. Router Solicitation. After the link-local address is confirmed, the host sends Router Solicitation messages to ff02::2 (all-routers multicast) to solicit Router Advertisements from on-link routers.

  4. Router Advertisement processing. When a Router Advertisement is received, the host processes each included option:

  • Prefix Information (A flag set): For each prefix with the Autonomous (A) flag, the host generates a global address by combining the advertised prefix with a locally generated interface ID.
  • Prefix Information (L flag set): The prefix is marked as on-link in the FIB (packets to destinations within this prefix are sent directly without routing through a gateway).
  • Default router: If router_lifetime > 0, the advertising router is added to the default router list with the specified preference (Prf bits).
  • MTU option: The link MTU is updated if the advertised value is ≥ 1280.
  • RDNSS/DNSSL options: DNS configuration is updated and propagated to userspace (via resolvconf mechanism or systemd-resolved notification).

  5. Global address configuration. For each prefix with the A flag:

  a. Generate interface ID (Modified EUI-64 or random for privacy extensions).
  b. Combine prefix + interface ID to form a tentative global address.
  c. Perform DAD on the tentative global address.
  d. If DAD succeeds, the address enters PREFERRED state.
  e. Start valid_lifetime and preferred_lifetime timers.
  f. When preferred_lifetime expires → address transitions to DEPRECATED.
  g. When valid_lifetime expires → address transitions to INVALID (removed).

  6. Lifetime refresh. Subsequent RAs with the same prefix update the lifetimes. Per RFC 4862 §5.5.3: the valid lifetime is only reset if the new value is > 2 hours, or if the remaining lifetime is ≤ 2 hours. This prevents a rogue RA from prematurely invalidating addresses.
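
Step 1's Modified EUI-64 derivation — split the MAC, insert 0xFFFE, flip the Universal/Local bit — is a few lines; a sketch:

```rust
/// Derive a Modified EUI-64 interface ID from a 48-bit MAC (RFC 4291
/// Appendix A): insert 0xFF,0xFE between the OUI and NIC halves, then
/// flip bit 1 (the Universal/Local bit) of the first byte.
fn modified_eui64(mac: [u8; 6]) -> [u8; 8] {
    [
        mac[0] ^ 0x02, // flip the U/L bit
        mac[1],
        mac[2],
        0xff,
        0xfe, // inserted in the middle
        mac[3],
        mac[4],
        mac[5],
    ]
}

/// Form the link-local address: fe80::/10 prefix, 54 zero bits,
/// then the 64-bit interface ID.
fn link_local(iid: [u8; 8]) -> [u8; 16] {
    let mut addr = [0u8; 16];
    addr[0] = 0xfe;
    addr[1] = 0x80;
    addr[8..16].copy_from_slice(&iid);
    addr
}
```

For example, MAC 00:1a:2b:3c:4d:5e yields interface ID 021a:2bff:fe3c:4d5e and link-local address fe80::21a:2bff:fe3c:4d5e.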

/// SLAAC address state machine (RFC 4862 §5.5).
pub enum Ipv6AddrState {
    /// Address is being probed for uniqueness (DAD in progress).
    /// No packets may be sent from this address. Incoming packets
    /// addressed to this address are accepted (for DAD purposes only).
    Tentative,
    /// Address is valid and preferred for new connections. Source
    /// address selection (RFC 6724) prefers PREFERRED addresses.
    Preferred,
    /// Address is valid but deprecated — existing connections can use it,
    /// new connections should prefer other addresses. This state is
    /// entered when preferred_lifetime expires.
    Deprecated,
    /// Address lifetime expired — removed from interface. No packets
    /// may be sent or received on this address.
    Invalid,
    /// Optimistic DAD (RFC 4429): address usable immediately but with
    /// restrictions (no responding to NS for this address, not used as
    /// source for non-ND packets to off-link destinations). Provides
    /// faster connectivity at slight risk of address collision.
    Optimistic,
}
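
The RFC 4862 §5.5.3 lifetime-refresh rule from the SLAAC sequence above reduces to a small predicate. A sketch (times in seconds; `refreshed_valid_lifetime` is an illustrative helper implementing the full three-case rule from the RFC):

```rust
const TWO_HOURS: u64 = 7200;

/// RFC 4862 §5.5.3(e): decide the new valid lifetime when an RA refreshes
/// an existing prefix. Accept the advertised value if it exceeds 2 hours
/// or the remaining lifetime; keep the remaining lifetime if it is
/// already <= 2 hours; otherwise clamp to 2 hours. This stops a rogue RA
/// from expiring addresses prematurely.
fn refreshed_valid_lifetime(remaining: u64, advertised: u64) -> u64 {
    if advertised > TWO_HOURS || advertised > remaining {
        advertised
    } else if remaining <= TWO_HOURS {
        remaining
    } else {
        TWO_HOURS
    }
}
```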

16.7.1.5 Duplicate Address Detection (DAD)

DAD (RFC 4862 §5.4) ensures address uniqueness on the local link before an address is assigned to an interface. The procedure:

  1. The address enters TENTATIVE state. It is joined to the solicited-node multicast group for the tentative address (ff02::1:ffXX:XXXX where XX:XXXX are the low-order 24 bits of the address).

  2. The host sends dad_transmits (default: 1) Neighbor Solicitation messages with:

  • Source address: :: (unspecified — the address is not yet confirmed)
  • Destination: solicited-node multicast address for the tentative address
  • Target: the tentative address
  • No SLLA option (source is unspecified)

  3. The host waits retrans_timer (from RA, default: 1 second) between each NS retransmission.

  4. DAD failure conditions:

  • A Neighbor Advertisement is received with the tentative address as target → another node already owns this address. DAD fails.
  • A Neighbor Solicitation is received from :: for the same tentative address → simultaneous DAD from another node. Both nodes detect the conflict. Per RFC 4862 §5.4, the address is not configured on either node (both fail). If this occurs on a link-local address, the node should log an error and require manual intervention (likely a MAC address conflict).

  5. If no conflicting response is received after all probes and the final wait period, DAD succeeds and the address transitions to PREFERRED (or OPTIMISTIC → PREFERRED if optimistic DAD was used).

Optimistic DAD (RFC 4429) allows an address to be used immediately upon generation, without waiting for DAD to complete. The address enters OPTIMISTIC state instead of TENTATIVE. Restrictions while OPTIMISTIC:

  • The address must not be used as a source for packets to off-link destinations (to avoid polluting remote neighbor caches with an address that might be a duplicate).
  • The node must not respond to Neighbor Solicitations for the OPTIMISTIC address (to avoid overriding the legitimate owner's cache entry).
  • The node must not send unsolicited Neighbor Advertisements for the OPTIMISTIC address.

Optimistic DAD is enabled per-interface via the optimistic_dad sysctl.

DAD for anycast addresses: DAD is not performed for anycast addresses (multiple nodes legitimately share the same address).

16.7.1.6 Privacy Extensions (RFC 4941)

Standard SLAAC addresses use a deterministic interface ID derived from the MAC address (Modified EUI-64), making host tracking trivial across networks. Privacy extensions generate random interface IDs to mitigate this:

Configuration:

Sysctl Default Description
use_tempaddr 0 0 = disabled, 1 = generate temporary addresses but prefer public, 2 = prefer temporary addresses for outgoing connections
temp_valid_lft 604800 Valid lifetime of temporary addresses (seconds, default 7 days)
temp_prefered_lft 86400 Preferred lifetime of temporary addresses (seconds, default 1 day)
max_desync_factor 600 Maximum random desynchronization value (seconds) subtracted from preferred lifetime to prevent thundering herd address regeneration

Address lifecycle:

  1. When a new public SLAAC address is configured (or a prefix is refreshed via RA), a corresponding temporary address is generated using a random 64-bit interface ID.
  2. The temporary address is assigned preferred_lifetime = min(prefix_preferred_lft, temp_prefered_lft) - random(0, max_desync_factor).
  3. The temporary address is assigned valid_lifetime = min(prefix_valid_lft, temp_valid_lft).
  4. When the current temporary address's preferred lifetime approaches expiration (specifically, when preferred_lifetime - now < max_desync_factor), a new temporary address is generated for the same prefix, ensuring continuity.
  5. The old temporary address transitions to DEPRECATED: existing connections continue to use it, but new connections prefer the replacement. It remains valid until valid_lifetime expires.
  6. Source address selection (RFC 6724) considers use_tempaddr:
  • use_tempaddr=1: temporary addresses are available but public addresses are preferred.
  • use_tempaddr=2: temporary addresses are preferred over public addresses for outgoing connections.
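The lifetime arithmetic above can be sketched directly (illustrative helpers; `desync` is assumed to be already drawn uniformly from [0, max_desync_factor), times in seconds):

```rust
/// Compute temporary-address lifetimes (RFC 4941 §3.3).
fn temp_lifetimes(
    prefix_preferred: u64,
    prefix_valid: u64,
    temp_prefered_lft: u64,
    temp_valid_lft: u64,
    desync: u64,
) -> (u64, u64) {
    let preferred = prefix_preferred.min(temp_prefered_lft).saturating_sub(desync);
    let valid = prefix_valid.min(temp_valid_lft);
    (preferred, valid)
}

/// Regeneration check (step 4): start a replacement temporary address
/// when the preferred lifetime is within max_desync_factor of expiry.
fn should_regenerate(preferred_expires_at: u64, now: u64, max_desync: u64) -> bool {
    preferred_expires_at.saturating_sub(now) < max_desync
}
```

With the defaults (prefix preferred 7 days, temp_prefered_lft 1 day, desync 600s), a temporary address is preferred for 85800 seconds.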

Random interface ID generation: A 64-bit random value is generated using the kernel CSPRNG. The value is checked to ensure it does not conflict with the subnet-router anycast address (all-zeros interface ID), the Modified EUI-64 of the local MAC (to avoid defeating the purpose), or any other address already present on the interface.

16.7.1.7 Per-Interface IPv6 Sysctls

Per-interface NDP and IPv6 configuration is exposed under /proc/sys/net/ipv6/conf/<iface>/, matching the Linux procfs layout for userspace compatibility. Each interface has its own configuration namespace; two pseudo-interfaces provide global controls:

  • conf/all/forwarding: Master forwarding switch. When set to 1, all interfaces act as routers (accept and forward packets, do not send RS, do not process RA for address configuration by default).
  • conf/default/*: Default values applied to newly created interfaces. Changing a default does not retroactively affect existing interfaces.

Sysctl Default Description
accept_ra 1 Accept Router Advertisements. 0 = never, 1 = accept only if forwarding=0 for this interface, 2 = accept always (even when forwarding). When forwarding is enabled, accept_ra=1 effectively disables RA processing.
autoconf 1 Perform SLAAC — generate global addresses from RA Prefix Information options with the A flag set. Requires accept_ra ≥ 1.
dad_transmits 1 Number of DAD Neighbor Solicitation probes to send. 0 = disable DAD entirely (not recommended — risks address conflicts).
forwarding 0 Enable IPv6 packet forwarding on this interface. When enabled, the interface acts as a router: forwards packets, does not send RS, generates RA (if radvd or equivalent is running).
hop_limit 64 Default hop limit for outgoing unicast packets. Overridden by cur_hop_limit from RA if non-zero.
mtu 1280 Interface IPv6 MTU. Minimum 1280 per RFC 2460 §5. Kernel enforces this lower bound. Upper bound = link MTU.
use_tempaddr 0 Privacy extensions (RFC 4941). 0 = disabled, 1 = generate temporary addresses (prefer public), 2 = generate and prefer temporary.
temp_valid_lft 604800 Temporary address valid lifetime in seconds (7 days).
temp_prefered_lft 86400 Temporary address preferred lifetime in seconds (1 day).
accept_ra_defrtr 1 Learn default routes from RA router_lifetime field. When 0, RAs are processed for prefix/MTU/DNS but the router is not added to the default route list.
accept_ra_pinfo 1 Accept Prefix Information options from RA. When 0, prefix options are ignored (no SLAAC, no on-link prefix entries).
router_solicitations 3 Number of Router Solicitation messages to send at interface startup (or after losing all routers). -1 = send RS indefinitely until a router responds.
router_solicitation_interval 4 Seconds between RS retransmissions (RFC 4861 §6.3.7 recommends 4 seconds).
router_solicitation_delay 1 Seconds to wait before sending the first RS after interface up (jitter to avoid RS storms when many hosts boot simultaneously).
optimistic_dad 0 Enable optimistic DAD (RFC 4429). When 1, tentative addresses enter OPTIMISTIC state and are usable immediately with restrictions.
accept_ra_rt_info_max_plen 0 Maximum prefix length accepted from Route Information options (RFC 4191) in RA. 0 = do not accept Route Information options. Set to 128 to accept all specific routes.
disable_ipv6 0 Disable IPv6 on this interface. When set to 1, all IPv6 addresses are removed, no IPv6 packets are processed, and the interface does not participate in NDP.
ndisc_notify 0 Send unsolicited Neighbor Advertisements when the interface address or link-layer address changes. Useful for fast failover in HA configurations (gratuitous NA, analogous to gratuitous ARP).
accept_ra_mtu 1 Accept MTU option from RA. When 0, MTU options are ignored.
max_desync_factor 600 Maximum random desynchronization factor for privacy extension address regeneration (seconds).

16.7.1.8 NDP Integration with Neighbor Cache

NDP message processing maps directly to the neighbor cache state machine defined above:

  • NS → INCOMPLETE→REACHABLE: When a Neighbor Solicitation triggers address resolution, the entry starts in INCOMPLETE state. Upon receiving the corresponding Neighbor Advertisement with the TLLA option, the entry transitions to REACHABLE and the target MAC address is recorded.

  • Unsolicited NA → STALE: An unsolicited Neighbor Advertisement (e.g., from a host announcing a new link-layer address) updates an existing neighbor cache entry and sets it to STALE. The Override (O) flag in the NA determines whether the update replaces an existing entry.

  • Router Advertisement → FIB updates: Each RA triggers:

  • Default route creation/update via Section 16.6 if router_lifetime > 0.
  • Connected route creation for each on-link prefix (L flag set).
  • SLAAC address configuration for each autonomous prefix (A flag set).

  • Redirect → host route: A Redirect message from the current first-hop router creates a host-specific route (/128) in the FIB pointing to the indicated better next-hop. The redirect also updates the neighbor cache entry for the destination (if the TLLA option is included).

  • RDNSS/DNSSL → DNS configuration: RDNSS and DNSSL options are parsed and the resulting DNS server addresses and search domains are stored per-interface. Userspace is notified via a netlink event so that resolvconf or systemd-resolved can update /etc/resolv.conf. Entries are expired based on the lifetime field in the option.

  • Neighbor Unreachability Detection (NUD): The REACHABLE → STALE → DELAY → PROBE → FAILED progression (shown in the state machine above) is driven by NDP Neighbor Solicitation/Advertisement exchanges. The reachable_time and retrans_timer values from the most recent RA on the link are used to parameterize NUD timers.

Cross-references:

  • Section 16.6 — FIB route updates from RA
  • Section 16.2 — IPv6 packet receive path and ICMPv6 processing
  • Section 16.13 — per-interface configuration and link-layer address discovery
  • Section 17.1 — NDP state is per-network-namespace; each namespace maintains independent neighbor caches, SLAAC state, and per-interface sysctls
  • Section 16.3 — AF_INET6 socket interface and source address selection (RFC 6724)

Integration with routing (Section 16.6): The routing table lookup returns a next-hop IP address. NetBuf::route_ext points to a slab-allocated RouteLookupResult which includes the next-hop. Before transmission, the output path calls neighbor_lookup(ifindex, next_hop) to resolve the L2 address.

/// Socket address structure. Matches Linux's `struct sockaddr_storage` (128 bytes)
/// to accommodate all address families (AF_INET, AF_INET6, AF_UNIX, etc.).
/// The `family` field discriminates the actual address type.
/// Alignment matches Linux `__kernel_sockaddr_storage`: pointer-aligned
/// (8 on LP64, 4 on ILP32). Per-arch cfg required for ABI correctness.
#[cfg(target_pointer_width = "64")]
#[repr(C, align(8))]
pub struct SockAddr {
    /// Address family (AF_INET, AF_INET6, AF_UNIX, etc.).
    pub family: u16,
    /// Address data. Interpretation depends on family:
    /// - AF_INET: bytes [2..8] = struct sockaddr_in (port, addr, padding)
    /// - AF_INET6: bytes [2..28] = struct sockaddr_in6 (port, flowinfo, addr, scope_id)
    /// - AF_UNIX: bytes [2..108] = struct sockaddr_un (path)
    pub data: [u8; 126],
}

#[cfg(target_pointer_width = "32")]
#[repr(C, align(4))]
pub struct SockAddr {
    pub family: u16,
    pub data: [u8; 126],
}

const_assert!(core::mem::size_of::<SockAddr>() == 128);
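As a worked example of the payload layout documented above, a hypothetical helper (not part of the specified API) decoding the AF_INET case — the struct's bytes [2..8] are `data[0..6]`, i.e. `sin_port` in network byte order followed by `sin_addr`:

```rust
/// Hypothetical decoding sketch for the AF_INET payload of `SockAddr.data`.
/// Offsets follow the layout comment above: port at data[0..2] (big-endian),
/// IPv4 address at data[2..6]. Helper name is illustrative only.
fn parse_inet(data: &[u8]) -> (u16, [u8; 4]) {
    let port = u16::from_be_bytes([data[0], data[1]]);
    let addr = [data[2], data[3], data[4], data[5]];
    (port, addr)
}
```

A caller would fill `data` exactly as Linux lays out `struct sockaddr_in` past the family field, so the same bytes are valid in both ABIs.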

/// Socket factory trait. Each protocol registers a factory during initialization.
/// The factory creates socket instances when `socket()` syscall is invoked.
pub trait SocketFactory: Send + Sync {
    /// Create a new socket instance.
    /// `family` is AF_INET or AF_INET6. `sock_type` is SOCK_STREAM, SOCK_DGRAM, etc.
    /// `protocol` is IPPROTO_TCP, IPPROTO_UDP, etc. (or 0 for default).
    fn create_socket(
        &self,
        family: AddressFamily,
        sock_type: SocketType,
        protocol: u16,
    ) -> Result<SlabRef<dyn SocketOps>, KernelError>;
}

/// Address family constants (matches Linux AF_* values).
#[repr(u16)]
pub enum AddressFamily {
    Unspec = 0,   // AF_UNSPEC
    Unix = 1,     // AF_UNIX / AF_LOCAL
    Inet = 2,     // AF_INET (IPv4)
    Inet6 = 10,   // AF_INET6 (IPv6)
    Netlink = 16, // AF_NETLINK
    // ... other families as needed
}

/// Socket type constants (matches Linux SOCK_* values).
#[repr(u32)]
pub enum SocketType {
    Stream = 1,    // SOCK_STREAM (TCP)
    Dgram = 2,     // SOCK_DGRAM (UDP)
    Raw = 3,       // SOCK_RAW
    Seqpacket = 5, // SOCK_SEQPACKET (SCTP)
    // ... other types as needed
}

// SCTP is supported as a transport protocol (SOCK_SEQPACKET, multihoming). The
// `SocketOps` trait and congestion control framework are designed to accommodate
// SCTP's multi-stream and multihoming semantics: `connect()` supports multiple
// addresses (SCTP associations), `send()`/`recv()` carry stream identifiers via
// ancillary data (cmsg), and the congestion controller interface
// ([Section 16.10](#congestion-control-framework)) supports per-path CWND
// (SCTP requires independent congestion state per destination address).
// Full SCTP specification: see [Section 16.23](#sctp-stream-control-transmission-protocol).

16.8 TCP Control Block and State Machine

TcpState enum (the 11 connection states of RFC 793, including LISTEN and the 2×MSL TIME_WAIT):

pub enum TcpState {
    Closed,
    Listen,
    SynSent,
    SynReceived,
    Established,
    FinWait1,
    FinWait2,
    CloseWait,
    Closing,
    LastAck,
    TimeWait,  // 2 × MSL = 120s (RFC 793); expiry tracked in TcpMutableState.timewait_timer
}
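The state enum drives simple predicates throughout the stack — for example, the sendmsg() entry check in Section 16.8.1.6 reduces to a two-state test. A minimal self-contained sketch (the enum is repeated from above; the helper name is illustrative, not part of the spec):

```rust
enum TcpState { Closed, Listen, SynSent, SynReceived, Established,
                FinWait1, FinWait2, CloseWait, Closing, LastAck, TimeWait }

/// Data may only be submitted in ESTABLISHED or CLOSE_WAIT — the peer's
/// FIN closes *their* send direction, not ours. Any other state yields
/// EPIPE in tcp_sendmsg().
fn can_send_data(s: &TcpState) -> bool {
    matches!(s, TcpState::Established | TcpState::CloseWait)
}
```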

CongPriv — inline congestion control private data (64 bytes):

/// 64 bytes of inline storage for congestion control algorithm state.
/// Algorithms with ≤64 bytes (Reno, CUBIC, Vegas) store directly here.
/// Algorithms needing more (BBR v2) store a heap pointer in the first 8 bytes
/// and free it in `CongestionOps::release()`. The engine zeroes this before init().
#[repr(C, align(8))]
pub struct CongPriv {
    pub data: [u8; 64],
}
const_assert!(core::mem::size_of::<CongPriv>() == 64);
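To illustrate the inline-storage contract, a sketch (the `RenoState` type and helper name are hypothetical) of how a small algorithm would overlay its per-connection state onto the 64-byte region:

```rust
use std::mem::{align_of, size_of};

#[repr(C, align(8))]
pub struct CongPriv { pub data: [u8; 64] }

/// Hypothetical Reno-style state, small enough to live inline in CongPriv.
#[repr(C)]
struct RenoState { cwnd: u64, ssthresh: u64, acked_since_loss: u64 }

/// Overlay the algorithm state onto the inline region. Sound because the
/// size/alignment checks pass, and the engine zeroes `data` before init()
/// (all-zero bytes are a valid RenoState).
fn reno_state(p: &mut CongPriv) -> &mut RenoState {
    assert!(size_of::<RenoState>() <= 64 && align_of::<RenoState>() <= 8);
    unsafe { &mut *(p.data.as_mut_ptr() as *mut RenoState) }
}
```

An algorithm needing more than 64 bytes would instead store a heap pointer in the first 8 bytes and release it in `CongestionOps::release()`, as the comment above specifies.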

TcpCb — TCP control block (per-socket, ~512-640 bytes including SockCommon, timer handles, congestion private state, and queue headers; actual size depends on SockCommon and TimerHandle definitions — use const_assert!(size_of::<TcpCb>() <= 768)):

pub struct TcpCb {
    /// Common socket state (namespace, credentials, cgroup, buffer limits).
    pub common: SockCommon,
    pub state: TcpState,
    /// Per-socket lock protecting all mutable TcpCb state. The lock wraps
    /// `TcpMutableState` which contains sequence numbers, congestion state,
    /// timers, and queues. Acquired by the protocol stack on RX (NAPI batch
    /// delivery from Tier 0) and by user-space syscalls (process context via
    /// socket ring). In Tier 1 umka-net, both RX and TX paths execute in
    /// umka-net's consumer thread(s) — NOT in softirq context. The lock
    /// serializes concurrent consumer threads (e.g., one processing NAPI RX
    /// batch delivery on CPU 0, another processing SendMsg SocketRingCmd on
    /// CPU 1). IRQ-disable is NOT used in Tier 1 context — Tier 1 code runs
    /// in a hardware isolation domain (MPK/POE) and cannot disable IRQs.
    /// The lock is a plain SpinLock (NOT IRQ-safe in Tier 1 context).
    ///
    /// **Borrow pattern**: `SocketOps::sendmsg(&self)` acquires the lock to
    /// obtain `&mut TcpMutableState` via interior mutability. All TCP data-path
    /// functions (`tcp_sendmsg`, `tcp_rcv_established`) operate on the
    /// `SpinLockGuard<TcpMutableState>`, not on `&mut TcpCb`.
    ///
    /// **Lock ordering**: `TcpCb.lock` (level 40) < `Qdisc.lock` (level 50) <
    /// `BuddyAllocator.lock` (level 60). The TX path holds `TcpCb.lock`
    /// while calling `ip_queue_xmit()` → `ip_output()` → `dev_queue_xmit()`,
    /// which may acquire `Qdisc.lock`. This ordering is the same as Linux's
    /// `sock.sk_lock` < `qdisc_lock`. Never hold `TcpCb.lock` and attempt
    /// to acquire another socket's lock (deadlock with concurrent send+recv
    /// on two connected sockets). See [Section 3.4](03-concurrency.md#cumulative-performance-budget)
    /// for the global lock ordering table.
    ///
    /// **Level spacing note**: The 10-level gap between socket lock (40)
    /// and qdisc lock (50) provides 9 insertion points (41-49) for any
    /// future TX path locks (traffic policing, flow classification, TSO
    /// segmentation). This matches the architecture's 10x spacing policy.
    ///
    /// **Contention profile**: In Tier 1 umka-net, the main contention
    /// scenario is RX batch delivery (NAPI consumer thread) vs TX send
    /// (SocketRingCmd consumer thread). Since both threads run in Tier 1
    /// and process different operations on the same connection, the hold
    /// time is bounded by one packet's processing (~100-500 cycles).
    pub lock: SpinLock<TcpMutableState>,
}

/// Mutable state protected by `TcpCb.lock`. All fields that are modified
/// during packet processing live here, ensuring the `SpinLock` guard grants
/// typed access (not `SpinLock<()>` which cannot provide `&mut` in safe Rust).
pub struct TcpMutableState {
    // === Send-side sequence variables (RFC 793 §3.2) ===
    pub snd_una: u32,       // oldest unacknowledged sequence number
    pub snd_nxt: u32,       // next sequence number to send (network-facing)
    pub write_seq: u32,     // next application write sequence number (app-facing);
                            // data between snd_nxt..write_seq is queued but unsent
    pub snd_wnd: u32,       // current send window (from remote receiver)
    pub snd_up: u32,        // urgent pointer (send side)
    pub snd_wl1: u32,       // sequence number of last window update
    pub snd_wl2: u32,       // ack number of last window update
    pub iss: u32,           // initial send sequence number

    // === Receive-side sequence variables ===
    pub rcv_nxt: u32,       // next expected receive sequence number
    pub rcv_wnd: u32,       // current receive window advertised to peer
    pub rcv_up: u32,        // urgent pointer (receive side)
    pub irs: u32,           // initial receive sequence number

    // === RTT estimation (RFC 6298 Jacobson/Karels) ===
    pub srtt_us: u32,       // smoothed RTT estimate (microseconds × 8)
    pub rttvar_us: u32,     // RTT variance (microseconds × 4)
    pub rto_us: u32,        // current RTO (microseconds), clamped [200ms, 120s]
    pub rtt_seq: u32,       // sequence number being timed

    // === Congestion control (via CongestionOps trait, Section 16.10) ===
    pub cwnd: u64,          // congestion window (bytes); u64 to support high-BDP paths (>4.3 GB at 400 Gbps/100ms RTT)
    pub ssthresh: u64,      // slow-start threshold; u64 matches cwnd width
    /// Stateless algorithm descriptor — a &'static reference to one of the registered
    /// CongestionOps implementations. No per-connection heap allocation; the ops
    /// pointer is 8 bytes. Per-connection state lives in `cong_priv` below.
    ///
    /// **Live evolution**: When a CongestionOps implementation is live-evolved,
    /// existing connections continue using the old implementation until they close.
    /// The old implementation's code is kept in memory (tracked by a global
    /// reference count decremented in `CongestionOps::release()`) until the last
    /// connection using it terminates. New connections established after the
    /// evolution event use the new implementation.
    pub cong_ops: &'static dyn CongestionOps,
    /// 64-byte inline per-connection state for the congestion algorithm.
    /// Algorithms with ≤64 bytes of state (Reno, CUBIC, Vegas) store directly here.
    /// Larger algorithms (BBR v2) store a heap box pointer in the first 8 bytes and
    /// free it in CongestionOps::release(). The engine zeroes cong_priv before init().
    pub cong_priv: CongPriv,

    // === Receive queue (in-order data ready for userspace) ===
    /// Ordered queue of in-sequence data segments ready for delivery to
    /// userspace via `recv()`/`read()`. Segments are appended by
    /// `tcp_data_queue()` after sequence number validation confirms they
    /// extend `rcv_nxt`. The `recv()` syscall copies payload from head,
    /// advancing head and decrementing `bytes`/`count`. When
    /// `recv_queue.bytes` exceeds `common.rcvbuf`, TCP advertises a zero
    /// window to the peer (rcv_wnd = 0), applying backpressure.
    pub recv_queue: TcpRecvQueue,

    // === SACK state (RFC 2018) ===
    pub sack_ok: bool,
    pub sack_scoreboard: SackScoreboard,
    /// Out-of-order receive queue. Sorted by sequence number (ascending).
    /// Segments are inserted when data arrives out of order and removed
    /// when the gap is filled (rcv_nxt advances past them).
    pub reorder_queue: TcpSegQueue,

    // === Send queue (unsent data) ===
    /// Queue of NetBufs awaiting transmission (unsent data from tcp_sendmsg).
    /// Segments move from send_queue to retrans_queue after being transmitted.
    /// This separates unsent data (send_queue) from transmitted-but-unacknowledged
    /// data (retrans_queue), matching the Linux TCP socket buffer split
    /// (sk_write_queue vs retransmit queue).
    pub send_queue: NetBufQueue,

    // === Retransmission queue ===
    /// Retransmission queue. Sorted by sequence number (ascending).
    /// Segments are removed when acknowledged (snd_una advances past them)
    /// or retransmitted when the retransmit timer fires.
    pub retrans_queue: TcpSegQueue,
    pub retrans_stamp: Instant,              // timestamp of last retransmission

    // === Timer handles (Section 7.5.4 timer wheel) ===
    pub retransmit_timer: TimerHandle,
    pub delack_timer: TimerHandle,
    pub keepalive_timer: TimerHandle,
    pub timewait_timer: TimerHandle,
    pub zwp_timer: TimerHandle,             // zero-window probe

    // === Nagle algorithm control ===
    /// Default true; cleared by setsockopt(TCP_NODELAY). When true, small
    /// segments are coalesced until an ACK arrives or the send buffer is full.
    pub nagle_enabled: bool,

    // === Options negotiated at connect time ===
    pub ts_ok: bool,        // TCP timestamps (RFC 7323)
    pub wscale_ok: bool,
    pub rcv_wscale: u8,     // our receive window scale
    pub snd_wscale: u8,     // peer's send window scale
    pub mss_clamp: u16,     // effective MSS (min of ours and peer's)
}

SACK scoreboard (RFC 2018 + RFC 6675):

pub struct SackScoreboard {
    /// Up to 4 SACK blocks per ACK (RFC 2018 §3 limit).
    /// Each block marks received bytes above snd_una.
    pub blocks: ArrayVec<SackBlock, 4>,
    pub pipe: u32,          // RFC 6675 "pipe" variable (bytes in flight estimate)
    pub recovery_point: u32, // sequence number where recovery ends
    pub in_recovery: bool,
}
pub struct SackBlock { pub start: u32, pub end: u32 }
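A typical scoreboard query during RFC 6675 recovery asks whether a retransmission candidate is already covered by a SACK block. A sketch using wrapping sequence arithmetic (helper names are illustrative, not part of the specified API):

```rust
pub struct SackBlock { pub start: u32, pub end: u32 }

/// Wrapping sequence-number compare (RFC 793 modular arithmetic).
fn seq_lt(a: u32, b: u32) -> bool { (a.wrapping_sub(b) as i32) < 0 }

/// True if the byte range [seq, seq+len) is fully covered by one SACK
/// block, i.e. the segment need not be retransmitted during recovery.
fn is_sacked(blocks: &[SackBlock], seq: u32, len: u32) -> bool {
    let end = seq.wrapping_add(len);
    blocks.iter().any(|b| !seq_lt(seq, b.start) && !seq_lt(b.end, end))
}
```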

TCP segment queue (used for OOO receive and retransmission queues):

/// Ordered queue of TCP segments. Backed by an intrusive linked list
/// of NetBuf entries, sorted by sequence number (ascending). Each entry
/// is linked via `NetBuf.next: Option<NonNull<NetBuf>>` (intrusive linked
/// list). The queue's head/tail use `NonNull<NetBuf>` (raw pointers)
/// matching the intrusive link type — NOT `NetBufHandle` (16-byte pool
/// tokens that require a pool lookup to dereference). This design enables
/// direct pointer-chasing via `NetBuf.next` without pool-lookup overhead
/// on every queue walk.
///
/// The `NetBufHandle` is stored inside each `NetBuf` struct (the `data_handle`
/// field) for reference counting and pool-return. Queue traversal uses raw
/// pointers; pool-return uses the handle stored inside the NetBuf.
///
/// The queue uses NetBuf's `next` pointer for O(1) insert-at-tail (common
/// case: in-order arrival) and O(n) insert-by-seqno (OOO arrival, typically
/// 1-3 entries to scan). Total segment count is bounded by the receive
/// window / MSS (typically ≤ 100-200 segments for a 256KB window).
pub struct TcpSegQueue {
    /// Head of the sorted linked list (lowest sequence number).
    pub head: Option<NonNull<NetBuf>>,
    /// Tail for O(1) append when data arrives in order.
    pub tail: Option<NonNull<NetBuf>>,
    /// Number of segments currently in the queue.
    pub count: u32,
    /// Total bytes of payload across all queued segments.
    pub bytes: u64,
}
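The insertion policy the comment describes can be modeled compactly. A simplified sketch using a `Vec` in place of the intrusive NetBuf list (wrapping seq comparison omitted for brevity; names are illustrative):

```rust
#[derive(Debug, PartialEq)]
struct Seg { seq: u32, len: u32 }

/// In-order arrival (the common case) appends at the tail in O(1);
/// out-of-order arrival scans forward — typically 1-3 entries — for
/// its sorted slot, mirroring TcpSegQueue's O(n) insert-by-seqno.
fn insert_by_seq(q: &mut Vec<Seg>, seg: Seg) {
    let in_order = q.last().map_or(true, |tail| tail.seq < seg.seq);
    if in_order {
        q.push(seg);
    } else {
        let pos = q.iter().position(|s| s.seq > seg.seq).unwrap_or(q.len());
        q.insert(pos, seg);
    }
}
```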

TCP receive queue (in-order data ready for userspace delivery):

/// TCP socket receive queue. Holds in-order data segments ready for delivery
/// to userspace via `recv()`/`read()`. Protected by the socket lock (`TcpCb.lock`).
///
/// **Lifecycle:**
///
/// 1. **Enqueue** (`tcp_data_queue`): When the TCP input path receives a segment
///    whose sequence number matches `rcv_nxt` (in-order arrival), the segment is
///    appended to `tail` and `rcv_nxt` is advanced by the segment's payload length.
///    If the `reorder_queue` contains segments that are now contiguous (gap filled),
///    those segments are dequeued from `reorder_queue` and appended here as well.
///
/// 2. **Dequeue** (`recv()` / `read()` syscall): Uses **peek-before-dequeue**
///    semantics to prevent data loss on `copy_to_user()` failure:
///
///    a. **Peek**: The syscall reads payload bytes from `head` *without advancing
///       the read pointer*. The data is copied into the KABI shared buffer (Tier 1)
///       or directly to the user buffer (Tier 0 `copy_to_user()`). At this point,
///       `head`, `bytes`, and `count` are unchanged — the data is still in the queue.
///
///    b. **Commit**: Only after a successful copy does the read pointer advance.
///       Fully consumed segments are released (returning the `NetBufHandle` to the
///       pool). `head` advances to the next segment; `bytes` and `count` are
///       decremented accordingly.
///
///    c. **Failure recovery**: If `copy_to_user()` fails (returns `-EFAULT` — e.g.,
///       the user buffer is unmapped or read-only), the read pointer has not moved.
///       No data is lost. The syscall returns `-EFAULT` and the application can retry
///       with a valid buffer. The queue remains in a consistent state.
///
///    This matches Linux's `tcp_recvmsg()` approach: `skb_peek()` examines the head
///    of `sk_receive_queue` without unlinking, `skb_copy_datagram_msg()` copies to
///    userspace, and only `sk_eat_skb()` / `tcp_eat_recv_skb()` advances the queue
///    on success. The two-phase peek+commit design ensures that TCP's reliable byte
///    stream guarantee is preserved even when the kernel-to-user copy fails.
///
///    **Tier 1 recvmsg copy-size bound**: When umka-net processes a `RecvMsg` KABI
///    request, it holds `tcb.lock` (SpinLock) during the peek+copy.
///    The copy size is bounded by `min(data_max_len, KABI_SHARED_SLOT_SIZE)` where
///    `KABI_SHARED_SLOT_SIZE` is 4096 bytes (one page). This bounds the lock hold
///    time to ~1 us per recvmsg operation (4 KB memcpy). For `recv()` calls
///    requesting more than 4 KB, Tier 0 issues multiple `RecvMsg` KABI requests,
///    each copying up to 4 KB under the lock. This prevents the latency concern
///    of holding a spinlock during a potentially large (64 KB+) memcpy.
///
///    **Large recv() cost analysis**: For a 1 MB recv():
///    - Iterations: 256 (1 MB / 4 KB).
///    - Lock overhead: 256 × ~1 μs = ~256 μs (memcpy under spinlock).
///    - Domain crossing overhead: 256 × 2 × ~23 cycles = ~11,776 cycles ≈ ~3.9 μs
///      at 3 GHz.
///    - Total: ~260 μs for 1 MB. This is comparable to Linux's monolithic recv()
///      path where `lock_sock()` + memcpy takes ~250 μs for the same data size.
///    - The per-4KB chunking design is correct: it bounds worst-case lock hold time
///      to ~1 μs while keeping total throughput competitive. The 256 domain crossings
///      add only ~4 μs total (<2% overhead). For bulk transfers, TCP window scaling
///      and GRO ensure that most recv() calls receive large windows of data
///      efficiently.
///
/// 3. **Flow control**: After each dequeue, the TCP stack re-calculates the
///    advertised receive window: `rcv_wnd = common.rcvbuf - recv_queue.bytes`.
///    When `recv_queue.bytes >= common.rcvbuf`, the stack advertises a zero window
///    (`rcv_wnd = 0`), applying backpressure to the sender. When userspace drains
///    the queue below the threshold, a window update is sent (either piggybacked
///    on an outgoing ACK or via a standalone window update segment).
///
/// 4. **Connection teardown**: On socket close or RST receipt, all queued segments
///    are drained via `NetBuf::free()` on each dequeued `NonNull<NetBuf>`.
///    `NetBuf::free()` handles both metadata slab return (via pointer arithmetic
///    within the pool's VA range) and data page refcount decrement. No conversion
///    to `NetBufHandle` is needed — the handle is only for cross-domain passing.
///    `bytes` and `count` are zeroed.
///
/// **Invariants:**
/// - `head.is_none() == (count == 0) == (bytes == 0)` — the queue is either
///   entirely empty or has at least one segment with non-zero payload.
/// - `tail.is_some()` whenever `head.is_some()` — tail tracks the last segment
///   for O(1) append.
/// - `bytes` is the exact sum of payload lengths across all segments. It must
///   never drift from the actual total (drift would cause incorrect window
///   advertisement, potentially stalling the connection indefinitely).
pub struct TcpRecvQueue {
    /// Head of the ordered segment chain (lowest sequence number).
    /// `recv()` copies from here and advances head on full consumption.
    /// Uses `NonNull<NetBuf>` (matching the intrusive `NetBuf.next` link type)
    /// for direct pointer-chasing without pool-lookup overhead.
    pub head: Option<NonNull<NetBuf>>,
    /// Tail pointer for O(1) append of newly-sequenced segments.
    /// Updated by `tcp_data_queue()` when in-order data arrives.
    pub tail: Option<NonNull<NetBuf>>,
    /// Total bytes of payload data across all segments in the queue.
    /// Used for receive window calculation: `rcv_wnd = rcvbuf - bytes`.
    pub bytes: u64,
    /// Number of segments in the queue.
    /// // Longevity: u32 — at 1M segments/sec, wraps in ~4295 seconds.
    /// // This is a queue depth counter, not a monotonic ID. The queue is
    /// // regularly drained by recv(), so it never approaches u32::MAX.
    pub count: u32,
}
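The peek-before-dequeue lifecycle above can be sketched with `Vec<u8>` standing in for NetBuf segments and a `VecDeque` for the chain (a simplified model, not the kernel implementation):

```rust
use std::collections::VecDeque;

/// The head is read without being unlinked; the queue only advances after
/// the copy into `out` succeeds, so a failed copy_to_user() loses no data.
/// Per the invariants above, every segment has non-zero payload, so each
/// iteration makes progress.
fn recv(q: &mut VecDeque<Vec<u8>>, out: &mut [u8]) -> usize {
    let mut copied = 0;
    while copied < out.len() {
        let Some(head) = q.front() else { break };           // peek, don't unlink
        let n = (out.len() - copied).min(head.len());
        out[copied..copied + n].copy_from_slice(&head[..n]); // the user copy
        copied += n;                                         // commit after success
        if n == head.len() {
            q.pop_front();                                   // fully consumed
        } else {
            q.front_mut().unwrap().drain(..n);               // partial consume
        }
    }
    copied
}
```

On a real `-EFAULT`, the loop would return before the commit step, leaving head, `bytes`, and `count` untouched.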

16.8.1 TCP State Machine

The TCP state machine follows RFC 793 with RFC 1122 corrections and TCP Extensions (RFC 7323). UmkaOS implements all 11 states. The state is stored in TcpCb.state: TcpState.

16.8.1.1 State Transition Table

| From State | Event | Guard | Actions | To State |
|---|---|---|---|---|
| CLOSED | passive open (listen) | — | allocate TCB, set backlog queue | LISTEN |
| CLOSED | active open (connect) | — | send SYN, start connect timer (75s) | SYN_SENT |
| LISTEN | recv SYN | backlog not full | send SYN+ACK, start SYN-ACK timer (1s×3) | SYN_RECEIVED |
| LISTEN | recv SYN | backlog full | drop or send SYN cookie | LISTEN |
| LISTEN | send (active data) | — | send SYN, become active | SYN_SENT |
| SYN_SENT | recv SYN+ACK | ack matches our SYN seq | send ACK, cancel connect timer | ESTABLISHED |
| SYN_SENT | recv SYN (simultaneous open) | — | send SYN+ACK | SYN_RECEIVED |
| SYN_SENT | connect timer expires | retries exhausted | delete TCB, notify app: ECONNREFUSED | CLOSED |
| SYN_RECEIVED | recv ACK | ack matches SYN+ACK seq | move to accept queue, notify app | ESTABLISHED |
| SYN_RECEIVED | SYN-ACK timer expires | retries < 3 | retransmit SYN+ACK | SYN_RECEIVED |
| SYN_RECEIVED | SYN-ACK timer expires | retries >= 3 | delete TCB | CLOSED |
| SYN_RECEIVED | recv RST | — | delete TCB | CLOSED |
| ESTABLISHED | app close / shutdown(WR) | — | send FIN, start FIN timer | FIN_WAIT_1 |
| ESTABLISHED | recv FIN | — | send ACK, notify app: EOF | CLOSE_WAIT |
| ESTABLISHED | recv RST | — | notify app: ECONNRESET, delete TCB | CLOSED |
| FIN_WAIT_1 | recv ACK (of our FIN) | — | cancel FIN timer | FIN_WAIT_2 |
| FIN_WAIT_1 | recv FIN | — | send ACK | CLOSING |
| FIN_WAIT_1 | recv FIN+ACK | ack covers our FIN | send ACK, start TIME_WAIT timer (2×MSL) | TIME_WAIT |
| FIN_WAIT_2 | recv FIN | — | send ACK, start TIME_WAIT timer (2×MSL) | TIME_WAIT |
| FIN_WAIT_2 | FIN_WAIT_2 timer expires (60s idle guard, RFC 1122 §4.2.2.20) | — | delete TCB | CLOSED |
| CLOSE_WAIT | app close | — | send FIN, start FIN timer | LAST_ACK |
| CLOSING | recv ACK (of our FIN) | — | start TIME_WAIT timer (2×MSL) | TIME_WAIT |
| LAST_ACK | recv ACK (of our FIN) | — | delete TCB | CLOSED |
| TIME_WAIT | TIME_WAIT timer expires | 2×MSL elapsed | delete TCB | CLOSED |
| TIME_WAIT | recv SYN | seq > last seq seen | recycle TCB, send SYN+ACK (RFC 1122 §4.2.2.13) | SYN_RECEIVED |

16.8.1.2 Timer Specifications

Each timer is stored as a TimerHandle in TcpCb (see struct definition above):

| Timer | Field in TcpCb | Duration | Action on expiry |
|---|---|---|---|
| retransmit | retransmit_timer | RTO (1s initial, exponential backoff, max 64s per RFC 6298) | Retransmit oldest unacked segment; double RTO |
| persist | zwp_timer | RTO-based (5s–60s per RFC 1122 §4.2.2.17) | Send zero-window probe segment |
| keepalive | keepalive_timer | tcp_keepalive_time sysctl (default 7200s); probe interval tcp_keepalive_intvl (default 75s) | Send keepalive probe; after tcp_keepalive_probes (default 9) failures → RST + ECONNRESET |
| time_wait | timewait_timer | 2×MSL (MSL default 60s → 2×MSL = 120s; configurable to min 1s via tcp_fin_timeout sysctl) | Delete TCB |
| syn_ack | (in SYN queue entry) | 1s, max 3 retries (RFC 1122 §4.2.2.13) | Retransmit SYN+ACK; on 3rd expiry → discard |
| fin_wait2 | (FIN_WAIT_2 state) | tcp_fin_timeout sysctl (default 60s) | Force close the connection |
| connect | (SYN_SENT state) | 75s total (RFC 1122 §4.2.3.5) | Abort with ETIMEDOUT |
| delack | delack_timer | min(40ms, RTT/2) (RFC 1122 §4.2.3.2) | Send delayed ACK; cancelled when ACK is piggybacked on outgoing data |

16.8.1.2.1 Timer Registration and Tier Boundary Crossing

TCP runs in Tier 1 (umka-net domain). The timer wheel and hrtimer subsystem run in Tier 0 (softirq context). Per the Unified Domain Model, timer callbacks cannot directly invoke Tier 1 code. TCP timers use the cross-domain timer registration API (Section 7.8) to bridge this boundary.

Registration: When a TCP connection arms a timer (e.g., retransmit after sending data), it calls timer_register_cross_domain() with:

/// Arm a TCP timer via cross-domain registration.
///
/// `timer_type_tag` encodes the timer kind:
///   0 = retransmit, 1 = delack, 2 = keepalive, 3 = timewait, 4 = zwp
///
/// The `timer_id` is packed as `(sock_handle << 16) | timer_type_tag`
/// so the consumer can dispatch to the correct connection and handler.
fn tcp_arm_timer(
    tcb: &TcpCb,
    timer_type_tag: u16,
    expiry_ns: u64,
) -> Result<CrossDomainTimerHandle, Error> {
    let timer_id = (tcb.common.sock_handle as u64) << 16
                 | timer_type_tag as u64;
    timer_register_cross_domain(
        timer_id,
        tcb.common.domain_id,  // umka-net's DomainId
        expiry_ns,
        TimerType::Wheel,      // TCP timers use coarse-grained wheel
    )
}

Expiry delivery: When the timer fires, the Tier 0 timer wheel calls timer_fire_to_domain() (Section 12.8), which enqueues a TimerExpiry event on umka-net's IRQ ring. The umka-net consumer loop calls DriverIrqHandler::handle_timer_expiry(timer_id, expiry_ns, timestamp). The TCP module unpacks timer_id to recover (sock_handle, timer_type_tag), looks up the TcpCb via sock_handle, acquires the socket lock, and invokes the appropriate handler:

| timer_type_tag | Handler function | Action |
|---|---|---|
| 0 (retransmit) | tcp_retransmit_timer() | Retransmit oldest unacked segment; double RTO |
| 1 (delack) | tcp_delack_timer() | Send delayed ACK |
| 2 (keepalive) | tcp_keepalive_timer() | Send keepalive probe |
| 3 (timewait) | tcp_timewait_timer() | Delete TCB |
| 4 (zwp) | tcp_zwp_timer() | Send zero-window probe |
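The consumer-side dispatch inverts the packing performed in `tcp_arm_timer()` — `(sock_handle << 16) | timer_type_tag`. A minimal sketch (helper name is illustrative):

```rust
/// Unpack a cross-domain TCP timer_id back into (sock_handle, timer kind).
/// The low 16 bits carry the type tag (0=retransmit … 4=zwp); the high
/// bits carry the socket handle used to look up the TcpCb.
fn unpack_timer_id(timer_id: u64) -> (u64, u16) {
    (timer_id >> 16, (timer_id & 0xFFFF) as u16)
}
```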

Stale expiry detection: Each timer handle stores the armed expiry_ns. When a timer is rearmed (e.g., retransmit timer reset after receiving an ACK), the old handle is cancelled and a new one created with the new expiry_ns. If a stale event from the old arming arrives (because it was already in the IRQ ring), the handler compares expiry_ns from the event against the timer's current armed value. Mismatches are discarded.

Retransmit timer algorithm (RFC 6298 Jacobson/Karels):

Initial RTO = 1s.
On new RTT sample M:
  if first measurement: SRTT = M; RTTVAR = M/2
  else: RTTVAR = (3/4) × RTTVAR + (1/4) × |SRTT - M|
        SRTT   = (7/8) × SRTT   + (1/8) × M
RTO = SRTT + max(G, 4 × RTTVAR)  where G = clock granularity (1ms)
Clamped to [200ms, 120s].
Exponential backoff on expiry: RTO ← RTO × 2, max retries = 15 (RFC 1122).
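The update rule above transcribes directly into code. A runnable sketch in plain microseconds (the fixed-point ×8 / ×4 scaling used by the `srtt_us` / `rttvar_us` fields is omitted for clarity; the struct and function names are illustrative):

```rust
struct Rtt { srtt: u32, rttvar: u32, rto: u32, have_sample: bool }

/// RFC 6298 Jacobson/Karels update. G = 1 ms clock granularity;
/// RTO clamped to [200 ms, 120 s] as this stack specifies.
fn rtt_update(r: &mut Rtt, m: u32) {
    if !r.have_sample {
        r.srtt = m;
        r.rttvar = m / 2;
        r.have_sample = true;
    } else {
        r.rttvar = (3 * r.rttvar + r.srtt.abs_diff(m)) / 4;
        r.srtt = (7 * r.srtt + m) / 8;
    }
    let g = 1_000u32; // clock granularity: 1 ms in µs
    r.rto = (r.srtt + g.max(4 * r.rttvar)).clamp(200_000, 120_000_000);
}
```

On retransmit-timer expiry the caller doubles `rto` (capped at 120 s) instead of calling this function, since Karn's algorithm forbids sampling retransmitted segments.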

16.8.1.3 TIME_WAIT Optimization (TW Recycling)

UmkaOS uses a hash-bucketed TIME_WAIT table (separate from the main TCB hash) to avoid holding a full TcpCb for each TIME_WAIT connection. A TwEntry stores:

pub struct TwEntry {
    pub local:   SocketAddr,  // local IP:port
    pub remote:  SocketAddr,  // remote IP:port
    pub ts_val:  u32,         // last timestamp value seen (for RFC 7323 PAWS)
    pub ts_ecr:  u32,         // last echoed timestamp (for PAWS)
    pub rcv_nxt: u32,         // expected sequence number (to detect stale SYNs)
    pub expiry:  Instant,     // when to delete this entry
}

TIME_WAIT entries are indexed in a global hash table keyed by 4-tuple (TwHashTable), in addition to a per-CPU TwBucket ring. The per-CPU ring provides fast local-CPU reap; the global hash provides cross-CPU lookup when RSS rebalancing or IRQ migration causes a SYN to arrive on a different CPU than the one that created the TIME_WAIT entry. TwHashTable is an RcuHashMap<FourTuple, TwEntry> — readers (SYN processing) use RCU; writers (TIME_WAIT creation/expiry) acquire a per-bucket spinlock. The per-CPU ring references entries in the global hash for O(1) local reap. Expired entries are reaped lazily on the next connection to the same 4-tuple, or by a per-CPU timer that fires every 125 ms (HZ/8) to reclaim slots.
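The `ts_val` field enables the PAWS-style staleness test (RFC 7323) that lets a TIME_WAIT entry safely accept a new SYN. A sketch (helper name is illustrative) using the same signed wrap-around comparison as sequence numbers:

```rust
/// A segment whose timestamp is strictly older than the last one seen on
/// the 4-tuple is treated as a stale duplicate and rejected.
fn paws_reject(ts_val: u32, last_ts_val: u32) -> bool {
    (ts_val.wrapping_sub(last_ts_val) as i32) < 0
}
```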

16.8.1.4 Simultaneous Open (RFC 793 §3.4)

Both sides send SYN without receiving one first → both enter SYN_SENT → both receive SYN → both transition to SYN_RECEIVED → both send SYN+ACK → receive the other's SYN+ACK → ESTABLISHED. This is rare (requires NAT traversal or carefully crafted sockets) but must be handled correctly.

16.8.1.5 RST Generation Rules (RFC 793 §3.4, RFC 5961)

  • Segment arrives on LISTEN socket with no SYN: send RST.
  • Segment arrives in SYN_SENT with an ACK that does not acknowledge our SYN: send RST (RFC 793).
  • Incoming RST (blind RST injection guard, RFC 5961): validate that RST.seq falls within the receive window before accepting.
  • RST on TIME_WAIT: silently ignore (prevents RST-based TIME_WAIT assassination).
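The RFC 5961 window-membership test named above, with wrapping sequence arithmetic (a sketch — a full implementation additionally requires seq == rcv_nxt for an immediate reset and answers other in-window RSTs with a challenge ACK):

```rust
/// True if `seq` lies in [rcv_nxt, rcv_nxt + rcv_wnd), modulo 2^32.
/// The single wrapping subtraction handles windows that span the
/// sequence-number wrap point.
fn rst_seq_in_window(seq: u32, rcv_nxt: u32, rcv_wnd: u32) -> bool {
    seq.wrapping_sub(rcv_nxt) < rcv_wnd
}
```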

ACK processing (fast retransmit / fast recovery):

On receiving ACK:
  if ACK advances snd_una:
    update snd_una, reset dupack counter
    call congestion.on_ack(bytes_acked)
    update RTT estimate if timestamp or RTT timer matches
    if SACK: update scoreboard, recompute pipe
  else if ACK == snd_una (duplicate ACK):
    dupacks++
    if dupacks == 3: fast retransmit (RFC 5681)
      enter fast recovery:
        ssthresh = max(cwnd/2, 2×mss)
        cwnd = ssthresh + 3×mss
        retransmit snd_una segment immediately
    else if SACK: run RFC 6675 loss recovery (SACK-based retransmit)
  on leaving fast recovery (new ACK): cwnd = ssthresh (deflate)
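The fast-retransmit entry arithmetic from the pseudocode above (RFC 5681 §3.2), in bytes. Function name is illustrative:

```rust
/// On the third duplicate ACK: halve cwnd into ssthresh (floored at
/// 2×MSS), then inflate cwnd by 3×MSS for the three segments the dup
/// ACKs prove have left the network. Returns (ssthresh, cwnd).
fn enter_fast_recovery(cwnd: u64, mss: u64) -> (u64, u64) {
    let ssthresh = (cwnd / 2).max(2 * mss);
    (ssthresh, ssthresh + 3 * mss)
}
```

On leaving recovery, cwnd deflates back to ssthresh, as the last pseudocode line shows.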

ACK-to-write-space linkage (send buffer backpressure): When an ACK advances snd_una, the retransmission queue is cleaned:

tcp_clean_retrans_queue(state, ack_seq):
    while let Some(seg) = state.retrans_queue.peek_front():
        if seg.seq + seg.data_len > ack_seq: break  // not yet ACKed
        let seg = state.retrans_queue.pop_front()
        state.common.sk_wmem_queued -= seg.data_len as u64
        NetBuf::free(seg)  // free metadata + decrement data page refcount
    // Check write-space wakeup threshold:
    // Wake the sender if wmem drops below half the send buffer.
    // This matches Linux's `tcp_check_space()` threshold.
    if state.common.sk_wmem_queued < state.common.sndbuf / 2 {
        if state.common.write_wait_pending.load(Relaxed) {
            // Signal Tier 0 via WakeupAccumulator (batched at napi_complete_done).
            kernel_services.wake_socket(
                state.common.sock_handle,
                SocketWakeEvent::WriteSpaceReady,
            );
        }
    }
The sk_wmem_queued decrement happens inside tcp_clean_retrans_queue(), called from the ACK processing path in tcp_rcv_established(). The wakeup threshold is sndbuf / 2 — matching Linux's sk_stream_write_space() which wakes when sk_stream_wspace(sk) >= sk_stream_min_wspace(sk).

Tail Loss Probe (TLP, RFC 8985 RACK-TLP):

  • After the last segment is sent, if no ACK arrives within max(2×SRTT, 10ms), send one new or retransmitted segment as a probe.
  • This allows faster loss recovery without triggering a full RTO.

16.8.1.6 tcp_sendmsg() — User Data Segmentation and Transmission

/// Copies user data into kernel buffers and submits TCP segments.
/// Called from sys_sendto() / sys_sendmsg() with the socket read lock held
/// (multiple threads may call sendmsg() concurrently; interior mutability via
/// TcpCb.lock serializes access to send-side state).
///
/// # Arguments
/// - `tcb`: The TCP control block for this connection. Must be in ESTABLISHED
///   or CLOSE_WAIT state (data may still be sent after receiving FIN).
/// - `msg`: User message header containing iovec scatter-gather entries and
///   optional ancillary data (e.g., TCP_CORK control messages).
/// - `flags`: Message flags: MSG_DONTWAIT (non-blocking), MSG_MORE (hint:
///   more data follows — defer push), MSG_NOSIGNAL (suppress SIGPIPE),
///   MSG_EOR (record boundary, ignored for TCP).
///
/// # Returns
/// - `Ok(n)`: Total number of bytes copied from user iovecs into the send queue.
///   May be less than the total iovec length if non-blocking and the send buffer
///   is full (`n > 0` partial write, or `Err(EAGAIN)` if zero bytes could be sent).
/// - `Err(EPIPE)`: Connection not in a sendable state (peer sent FIN and we already
///   closed, or RST received). SIGPIPE is delivered unless MSG_NOSIGNAL is set.
/// - `Err(EAGAIN)`: Non-blocking mode and the send buffer is full (no bytes copied).
/// - `Err(ENOMEM)`: NetBuf allocation failed (memory pressure).
/// - `Err(EINTR)`: A signal was received before any data was copied.
///   If some bytes were already copied, returns `Ok(n)` with the partial count.
///
/// # Algorithm
///
/// 1. **Connection state check**: Verify `tcb.state` is `Established` or `CloseWait`.
///    Any other state returns `EPIPE`. If `MSG_NOSIGNAL` is not set, deliver `SIGPIPE`
///    to the calling thread before returning `EPIPE`.
///
/// 2. **Acquire TcpCb lock** (`tcb.lock.lock()`) to serialize against the
///    receive-side NAPI path (ACK processing may update snd_wnd concurrently).
///    In Tier 1 umka-net, the lock serializes the sendmsg consumer thread
///    against the NAPI RX batch delivery thread (which may process ACKs that
///    update `snd_wnd`). No IRQ-disable is needed — Tier 1 code does not run
///    in softirq context; concurrent access comes from other consumer threads.
///
/// 3. **Iovec copy loop**: Iterate over `msg.msg_iov[0..msg.msg_iovlen]`:
///
///    a. **Coalesce check**: If the tail segment of `send_queue` has room (payload
///       length < `tcb.mss_clamp`), copy into the tail's remaining capacity. This
///       avoids a NetBuf allocation for small writes that fit in an existing segment.
///
///    b. **Allocate new NetBuf**: If coalescing is not possible (tail full or queue
///       empty), allocate a NetBuf from the per-CPU NetBuf slab
///       ([Section 16.5](#netbuf-packet-buffer)). If allocation fails: if `total_copied > 0`,
///       break the loop and return partial success; otherwise return `ENOMEM`.
///
///    c. **Copy data into NetBuf**: Copy up to `min(iov_len, MSS - current_segment_len)`
///       bytes into the NetBuf data region. **Tier isolation note**: umka-net runs
///       in Tier 1 and cannot access userspace pages directly. The actual
///       `copy_from_user()` is performed by the Tier 0 syscall dispatch layer,
///       which copies user data into the KABI shared buffer before posting a
///       SendRequest to umka-net ([Section 16.2](#network-stack-architecture--socket-operation-dispatch)).
///       `tcp_sendmsg()` reads from the KABI shared buffer (Tier 0/Tier 1 shared
///       memory), not from userspace. If the buffer read encounters an error
///       (e.g., truncated message): if `total_copied > 0`, break and return
///       partial success; otherwise return `EFAULT`.
///
///    d. **Update write sequence**: `tcb.write_seq += bytes_copied`.
///
///    e. **Send buffer accounting**: `tcb.common.sk_wmem_queued += bytes_copied`.
///       If `sk_wmem_queued >= tcb.common.sndbuf`, the send buffer is full:
///       - If `total_copied == 0`: return `SOCK_RESP_WOULD_BLOCK` immediately.
///         Tier 0 will either return `EAGAIN` (if `MSG_DONTWAIT`) or block the
///         calling task on the socket's `write_wait` queue (Tier 0 manages sleep).
///       - If `total_copied > 0`: break the loop, transmit what we have, and
///         return `Ok(total_copied)`.
///       **Tier 1 non-blocking invariant**: umka-net NEVER sleeps on a wait queue.
///       Blocking is always done by Tier 0 on the caller side. When ACK processing
///       frees send buffer space, umka-net signals `sk_write_space_ready()` via the
///       `KernelServicesVTable` KABI callback to wake the blocked sender in Tier 0.
///       See [Section 16.31](#network-service-provider--tier-1-nonblocking-invariant).
///
/// 4. **Nagle algorithm** (RFC 896 + RFC 1122 §4.2.3.4):
///    If `tcb.nagle_enabled` (default true, disabled by `TCP_NODELAY`) and
///    `unsent_data < tcb.mss_clamp` and `tcb.snd_nxt != tcb.snd_una` (unacked
///    data in flight): defer transmission. The segment will be sent when either:
///    - An ACK arrives (clearing the unacked condition), or
///    - More data accumulates to fill a full MSS segment, or
///    - The delayed-ACK timer expires (200ms worst case).
///
///    **MSG_MORE override**: When `MSG_MORE` is set, always defer push regardless
///    of Nagle state. The application is signaling that more data follows immediately.
///
///    **TCP_NODELAY override**: When `TCP_NODELAY` is set (Nagle disabled),
///    `tcp_write_xmit()` is called unconditionally — all data is pushed immediately.
///
/// 5. **Cork mode** (`TCP_CORK` socket option):
///    When corked, segments accumulate in `send_queue` without being transmitted
///    until one of:
///    - The cork is released (`setsockopt(TCP_CORK, 0)`), or
///    - The accumulated data reaches a full MSS, or
///    - The cork timer expires (200ms, matching Linux's `tcp_cork_timer`).
///    Cork and Nagle are independent: both can be active simultaneously (cork
///    takes precedence — data is held even if Nagle would allow sending).
///
/// 6. **PSH flag**: Set the PSH (push) flag on the last segment emitted by this
///    `sendmsg()` call, unless `MSG_MORE` is set. PSH tells the receiver to deliver
///    data to the application immediately rather than buffering. This matches the
///    Linux behavior where each `sendmsg()` boundary generates a PSH.
///
/// 7. **Transmit**: Call `tcp_write_xmit()` to move segments from `send_queue` to
///    `retrans_queue` and transmit via the IP output path. `tcp_write_xmit()`:
///    - Segments the data into MSS-sized chunks (respecting `tcb.mss_clamp`).
///    - Applies the congestion window: sends at most `min(cwnd, snd_wnd)` bytes
///      beyond `snd_nxt`.
///    - Attaches TCP headers (seq, ack, window, timestamps if `ts_ok`).
///    - Enqueues segments onto `retrans_queue` and starts `retransmit_timer` if
///      not already running.
///    - Passes each segment to `ip_queue_xmit()` for IP encapsulation and routing.
///
/// 8. **Release TcpCb lock** and return `Ok(total_copied)`.
///
/// # Concurrency
///
/// Multiple threads calling `sendmsg()` on the same socket are serialized by
/// `tcb.lock` (BH-safe spinlock). The socket dispatch layer's read lock allows
/// concurrent entry, but the TcpCb lock ensures sequential access to send-side
/// state (`write_seq`, `send_queue`, `snd_nxt`). This matches Linux's behavior
/// where `lock_sock()` serializes TCP socket operations.
///
/// The NAPI batch delivery path (ACK processing) also acquires `tcb.lock`.
/// In Tier 1 umka-net, both sendmsg and ACK processing run as consumer threads
/// — the lock serializes them. No IRQ-disable is needed (Tier 1 is not
/// softirq context). The lock prevents concurrent mutation when an ACK arrives
/// for data we just queued.
/// Takes `&TcpCb` (not `&mut TcpCb`) because `SocketOps::sendmsg` dispatches
/// via `&self` (interior mutability pattern). The function acquires `tcb.lock`
/// internally to get `SpinLockGuard<TcpMutableState>`, which provides `&mut`
/// access to the mutable fields (send_queue, write_seq, snd_nxt, etc.).
fn tcp_sendmsg(
    tcb: &TcpCb,
    msg: &MsgHdr,
    flags: MsgFlags,
) -> Result<usize, KernelError>
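The Nagle / MSG_MORE / TCP_NODELAY interaction in step 4 can be condensed into a pure predicate. This is an illustrative restatement (the hypothetical `should_defer_push` stands in for logic that actually reads TcpMutableState), not the kernel's function:

```rust
/// Decide whether to defer transmission, per step 4 above.
/// TCP_NODELAY corresponds to `nagle_enabled == false`.
fn should_defer_push(
    nagle_enabled: bool,
    msg_more: bool,
    unsent: u64,            // bytes queued but not yet transmitted
    mss: u64,               // tcb.mss_clamp
    unacked_in_flight: bool, // snd_nxt != snd_una
) -> bool {
    if msg_more {
        return true; // MSG_MORE: caller promises more data, always defer
    }
    // Nagle: hold a sub-MSS write while earlier data is still unacked.
    nagle_enabled && unsent < mss && unacked_in_flight
}

fn main() {
    // TCP_NODELAY pushes immediately even with unacked data in flight.
    assert!(!should_defer_push(false, false, 100, 1460, true));
    // Nagle holds a small write while data is unacked.
    assert!(should_defer_push(true, false, 100, 1460, true));
    // A full MSS of unsent data is always eligible to send.
    assert!(!should_defer_push(true, false, 2000, 1460, true));
    // MSG_MORE defers regardless of Nagle state.
    assert!(should_defer_push(false, true, 2000, 1460, false));
}
```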

tcp_write_xmit() — Send Queue to Wire:

/// Transmit segments from the send queue, subject to congestion and flow control.
/// Called from tcp_sendmsg() after data is queued, and from ACK processing when
/// the send window opens.
///
/// # Algorithm
///
/// While send_queue is non-empty and the congestion/flow window permits:
///   1. Peek the head segment of send_queue.
///   2. Check send window: if snd_nxt + segment.len > snd_una + min(cwnd, snd_wnd),
///      stop — the window is full. Start the zero-window probe timer if snd_wnd == 0.
///   3. If segment exceeds MSS: split into MSS-sized chunks (GSO / software TSO).
///   4. Build TCP header: seq = snd_nxt, ack = rcv_nxt, window = rcv_wnd >> snd_wscale.
///      Attach timestamp option if ts_ok. Set PSH flag if marked by tcp_sendmsg().
///   5. Dequeue from send_queue, enqueue original onto retrans_queue
///      (for retransmission).
///   5a. Clone: `tx_clone = NetBuf::clone_shared(&original)` — shares data
///       pages (increments DMA buffer refcount). The retrans_queue holds the
///       original; `ip_queue_xmit()` receives the clone.
///   6. Advance snd_nxt += segment.payload_len.
///   7. If retransmit_timer is not running, start it with current RTO.
///   8. Pass tx_clone to ip_queue_xmit(tcb, tx_clone) for IP output.
///      (`ip_queue_xmit` stamps the per-connection cached route from
///      `tcb.common.cached_route` onto the NetBuf before calling `ip_output`.)
///
/// Returns the number of segments transmitted (0 if window-limited).
///
/// Called from within `tcp_sendmsg()` which already holds `tcb.lock`.
/// Takes `&mut TcpMutableState` directly — the caller passes the
/// `SpinLockGuard`'s mutable borrow to give this function access to
/// send_queue, retrans_queue, snd_nxt, etc.
///
/// Note: takes `&TcpCb` for read-only fields (common, state) and
/// `&mut TcpMutableState` for the mutable state protected by the lock.
/// This satisfies Rust's borrow checker: the caller holds the
/// `SpinLockGuard<TcpMutableState>` and can pass `&mut *guard` while
/// retaining `&tcb` for the immutable fields.
///
/// **NetBuf clone for retransmit**: Before passing a segment to
/// `ip_queue_xmit()`, `tcp_write_xmit()` clones the NetBuf via
/// `NetBuf::clone_shared()`. The retrans_queue holds the original
/// (for potential retransmission); `ip_queue_xmit()` receives the clone.
/// On TX completion, the clone is freed (data page refcount decremented).
/// On ACK, the retrans_queue copy is freed. This prevents use-after-free
/// when the qdisc/NIC frees the transmitted NetBuf.
fn tcp_write_xmit(tcb: &TcpCb, state: &mut TcpMutableState) -> u32
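Step 2's window check is subtle because TCP sequence numbers wrap at 2^32. A minimal sketch, with a hypothetical `window_permits` helper, using wrapping arithmetic for the in-flight computation:

```rust
/// Step 2 of tcp_write_xmit(): may we send `seg_len` more bytes?
/// Sequence arithmetic uses wrapping_sub so it is correct across the
/// 2^32 sequence-space wraparound.
fn window_permits(snd_una: u32, snd_nxt: u32, seg_len: u32, cwnd: u32, snd_wnd: u32) -> bool {
    let window = cwnd.min(snd_wnd) as u64;
    // Bytes already in flight (sent but unacked), wrap-safe.
    let in_flight = snd_nxt.wrapping_sub(snd_una) as u64;
    in_flight + seg_len as u64 <= window
}

fn main() {
    // Nothing in flight: one MSS fits within cwnd = 10 * 1460.
    assert!(window_permits(1000, 1000, 1460, 14_600, 65_535));
    // cwnd fully occupied: the next segment must wait.
    assert!(!window_permits(1000, 15_600, 1460, 14_600, 65_535));
    // Sequence wraparound: in_flight is still computed correctly.
    assert!(window_permits(u32::MAX - 100, 400, 1000, 14_600, 65_535));
}
```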

TCP segmentation in tcp_write_xmit(): When the send buffer contains more data than MSS (step 3 above), the TCP layer splits it into MSS-sized segments. Each segment is a separate NetBuf (metadata header + reference to the data pages via scatter-gather list). Splitting is zero-copy: all segments share the same underlying data pages via NetBuf::clone_shared() (Section 16.5), each referencing a different (offset, len) range within the page list. TCP sequence numbers are assigned sequentially (snd_nxt += segment.payload_len per segment). The split point is always at MSS boundaries (never mid-byte); the final segment in a batch may be shorter than MSS (a partial segment).

For TSO/GSO-capable NICs, tcp_write_xmit() may produce a single large NetBuf with gso_size = tcb.mss_clamp and gso_segs set to the number of MSS-sized segments, deferring actual segmentation to the NIC hardware (TSO) or the GSO software fallback path (Section 16.13). This reduces per-packet overhead from O(N) NetBuf allocations to O(1), with the NIC or GSO layer performing the split at transmit time. The NetBuf.gso_type field (GSO_TCPV4 or GSO_TCPV6) signals to the output path that segmentation is deferred.
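The segmentation arithmetic in the two paragraphs above reduces to simple range math. A sketch under stated assumptions (the helper `split_ranges` is hypothetical; real segments are NetBufs referencing (offset, len) ranges, not tuples):

```rust
/// Zero-copy split: each segment references an (offset, len) range within
/// the shared data pages. Splits land on MSS boundaries; only the final
/// segment may be partial. The number of ranges equals gso_segs when
/// segmentation is deferred to TSO/GSO.
fn split_ranges(payload_len: u64, mss: u64) -> Vec<(u64, u64)> {
    (0..payload_len)
        .step_by(mss as usize)
        .map(|off| (off, mss.min(payload_len - off)))
        .collect()
}

fn main() {
    let segs = split_ranges(4000, 1460);
    assert_eq!(segs.len(), 3);
    assert_eq!(segs[0], (0, 1460));     // full MSS segment
    assert_eq!(segs[2], (2920, 1080));  // final partial segment
    // A 64 KB send at MSS 1460 → gso_segs = ceil(65536 / 1460) = 45.
    assert_eq!(split_ranges(65_536, 1460).len(), 45);
}
```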

Integration with existing UmkaOS components:
- Uses the CongestionOps trait (Section 16.9) — pluggable CUBIC/BBR/Reno/custom.
- Uses the timer wheel (Section 7.8) for all five timers.
- Uses NetBufHandle (Section 16.5) for the retransmit/reorder queues.
- Socket state lives in TcpSock, which embeds TcpCb plus the Socket from Section 16.3.

16.8.2 TCP Zero-Copy Receive (SO_ZEROCOPY)

Zero-copy TCP receive delivers incoming data directly into user-space pages without an intermediate kernel copy, reducing CPU usage for high-throughput bulk transfers (file serving, video streaming, bulk data pipelines).

Enable:

int one = 1;
setsockopt(fd, SOL_SOCKET, SO_ZEROCOPY, &one, sizeof(one));

Receive:

ssize_t n = recvmsg(fd, &msg, MSG_ZEROCOPY);

On the zero-copy path: kernel maps incoming NetBuf pages (Section 16.5) directly into the process's address space as read-only anonymous mappings. The user sees data in msg.msg_iov as usual, but the pages are shared with the kernel receive buffer (no memcpy).

UmkaOS advantage: NetBuf pages are pre-registered as zero-copy eligible at buffer pool creation time (no per-receive decision needed). All TCP receives are zero-copy capable; SO_ZEROCOPY only enables the user-mapping step.

Completion notification (mandatory):

After processing the received data, the application must notify the kernel that it is done with the pages, so they can be returned to the buffer pool. (Note: this MSG_ERRQUEUE-based page-return notification is a UmkaOS extension beyond Linux's current RX zero-copy API. Applications that do not use RX zero-copy are unaffected.)

// Read the completion notification (blocks if not yet ready)
ssize_t ret = recvmsg(fd, &notification_msg, MSG_ERRQUEUE);
// notification_msg contains a struct sock_extended_err:
//   ee_errno = 0, ee_origin = SO_EE_ORIGIN_ZEROCOPY
//   ee_data = highest sequence number of acknowledged zero-copy data

After this call returns, the user-mapped pages are unmapped from the process's address space and returned to the NetBuf pool.

Constraints:
- Minimum useful payload size: 4 KB. Below 4 KB, the mapping overhead exceeds the copy cost; the kernel silently falls back to a regular copy.
- User buffer pointers in msg_iov must be page-aligned (enforced; EFAULT is returned if not).
- Not compatible with in-kernel kTLS decryption (data must be decrypted before zero-copy mapping). Compatible with kTLS hardware offload (data arrives pre-decrypted from the NIC).
- On copy fallback (e.g., data < 4 KB): the MSG_ZEROCOPY flag has no effect; a standard copy is used. No error is returned, and the application does NOT need to drain MSG_ERRQUEUE for the copy path.
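The alignment and size constraints can be expressed as a small classifier. This is a sketch only — the enum and function names are hypothetical, and the ordering of the alignment check before the size check is an assumption:

```rust
/// Classify a receive buffer against the zero-copy constraints above.
/// Assumes a 4 KiB page size; names are illustrative.
#[derive(Debug, PartialEq)]
enum RxPath {
    ZeroCopy,     // pages mapped into userspace, no memcpy
    CopyFallback, // below 4 KB: silent copy, no error
    Efault,       // misaligned iov_base is rejected
}

fn classify_rx(iov_base: usize, iov_len: usize) -> RxPath {
    const PAGE: usize = 4096;
    if iov_base % PAGE != 0 {
        return RxPath::Efault; // page alignment is enforced
    }
    if iov_len < PAGE {
        return RxPath::CopyFallback; // mapping overhead exceeds copy cost
    }
    RxPath::ZeroCopy
}

fn main() {
    assert_eq!(classify_rx(0x1000, 8192), RxPath::ZeroCopy);
    assert_eq!(classify_rx(0x1001, 8192), RxPath::Efault);
    assert_eq!(classify_rx(0x2000, 1024), RxPath::CopyFallback);
}
```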

Zero-copy vs copy-fallback notification semantics:

The two code paths for a sendmsg() with MSG_ZEROCOPY are mutually exclusive per call and have different notification behavior:

  • Zero-copy path (kernel maps pages into NIC DMA directly): Notification is mandatory. The application MUST drain MSG_ERRQUEUE after the send to release the page pin and allow the buffer to be reused. Failure to drain causes page leaks. The notification carries SO_EE_ORIGIN_ZEROCOPY in the error ancillary data along with the byte range that was sent, so userspace knows exactly which buffer region can now be reused.

Zerocopy completion notification structure (Linux-compatible layout):

/// Zero-copy send completion notification.
/// Delivered via MSG_ERRQUEUE as ancillary data in a sock_extended_err.
/// Layout matches Linux's zerocopy completion for binary compatibility.
///
/// The notification identifies the sent byte range via `ee_info` (start
/// offset relative to the beginning of the send buffer) and `ee_data`
/// (notification sequence counter). The sequence counter is a per-socket
/// monotonically increasing u32 assigned to each MSG_ZEROCOPY send. The
/// application uses it to correlate completions with sends.
///
/// Userspace reads this via:
///   recvmsg(fd, &msg, MSG_ERRQUEUE)
///   → msg.msg_control contains:
///     struct sock_extended_err {
///         ee_errno:  0,
///         ee_origin: SO_EE_ORIGIN_ZEROCOPY (5),
///         ee_type:   0,
///         ee_code:   SO_EE_CODE_ZEROCOPY_COPIED (1) if copy-fallback,
///                    0 if true zero-copy,
///         ee_info:   start notification ID (u32),
///         ee_data:   end notification ID (u32, inclusive),
///     }
///
/// When ee_info == ee_data, exactly one send completed. When ee_info < ee_data,
/// multiple consecutive sends completed and are batched into a single notification
/// (Linux batches completions for efficiency).
///
/// The application correlates notification IDs with send calls: each
/// MSG_ZEROCOPY sendmsg() increments the socket's zerocopy notification
/// counter. The first send gets ID 0, the second ID 1, etc. The application
/// tracks which buffer region corresponds to each ID and frees the region
/// when the notification for that ID is received.
  • Copy-fallback path (kernel copies data, frees the original buffer immediately): No notification is sent. The kernel freed the buffer internally after copying. The application MUST NOT drain MSG_ERRQUEUE for this send — there is nothing to drain, and waiting would block. The send flags return value indicates MSG_ZEROCOPY_SKIPPED to signal the fallback occurred.

Applications can unconditionally drain MSG_ERRQUEUE with MSG_DONTWAIT and handle EAGAIN as "no notification pending" — this is safe for both paths.
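Since the ee_info..=ee_data range is inclusive and the per-socket notification counter is a u32 that eventually wraps, counting how many sends one batched notification completes needs wrapping arithmetic. A sketch (helper name is illustrative):

```rust
/// Number of MSG_ZEROCOPY sends completed by a single batched notification
/// carrying the inclusive ID range [ee_info, ee_data], wrap-safe across the
/// u32 rollover of the per-socket notification counter.
fn completions_in(ee_info: u32, ee_data: u32) -> u32 {
    ee_data.wrapping_sub(ee_info).wrapping_add(1)
}

fn main() {
    assert_eq!(completions_in(5, 5), 1);          // exactly one send completed
    assert_eq!(completions_in(3, 7), 5);          // IDs 3,4,5,6,7 batched
    assert_eq!(completions_in(u32::MAX, 1), 3);   // counter wrapped: MAX, 0, 1
}
```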

Page reclaim policy for leaked zero-copy pages:

If an application crashes (SIGKILL) or hangs before draining MSG_ERRQUEUE, zero-copy mapped pages would leak indefinitely without a reclaim mechanism. UmkaOS tracks every outstanding zero-copy page and enforces bounded lifetimes:

/// Tracks a single zero-copy page lent to userspace via MSG_ERRQUEUE.
///
/// Inserted into the per-netns timeout list when the page is mapped into the
/// process address space. Removed when the application drains MSG_ERRQUEUE or
/// when the reclaim worker expires it.
pub struct ZcopyPageRef {
    /// Socket that owns this zero-copy mapping.
    pub socket_id: u64,
    /// Deadline after which the page is forcibly reclaimed.
    /// Set to `Instant::now() + Duration::from_secs(60)` at delivery time.
    pub deadline: Instant,
    /// Physical page frame backing this mapping.
    pub page: PageRef,
}

Each network namespace maintains a zcopy_timeout_list: Mutex<Vec<ZcopyPageRef>> ordered by deadline. The reclaim rules are:

  • Deadline: 60 seconds from when the packet was delivered to MSG_ERRQUEUE. This is generous enough for any well-behaved application but prevents indefinite page leaks.
  • Reclaim worker: A kernel thread (zcopy_reclaim_worker) runs every 10 seconds, scanning the timeout list and reclaiming (unmapping from userspace + returning to the NetBuf pool) all pages past their deadline.
  • Socket close (normal): When a socket is closed, all pending zero-copy pages for that socket are immediately reclaimed — the deadline is set to Instant::now() and the reclaim worker is woken.
  • Application crash (SIGKILL path): The SIGKILL handler runs the file descriptor cleanup path, which closes all open sockets. Socket close triggers the immediate reclaim described above, so pages are recovered promptly even on abnormal exit.
  • Accounting: The per-netns count of outstanding zero-copy pages is exposed via umkafs at /ukfs/network/<netns>/zcopy_pages_outstanding for monitoring.

This design ensures that zero-copy pages are never permanently leaked, regardless of application behavior.
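A simplified model of one zcopy_reclaim_worker pass may clarify the mechanism. Assumptions are marked in comments: a plain Vec stands in for the per-netns timeout list, and the entry type is a stand-in for ZcopyPageRef:

```rust
use std::time::{Duration, Instant};

/// Stand-in for ZcopyPageRef: just the fields the reclaim pass needs.
struct Entry {
    socket_id: u64,
    deadline: Instant,
}

/// One reclaim pass: remove every entry past its deadline and report which
/// sockets had pages reclaimed. The real worker would also unmap the page
/// from userspace and return it to the NetBuf pool.
fn reclaim_expired(list: &mut Vec<Entry>, now: Instant) -> Vec<u64> {
    let mut reclaimed = Vec::new();
    list.retain(|e| {
        if e.deadline <= now {
            reclaimed.push(e.socket_id); // unmap + return page here
            false                        // drop from the timeout list
        } else {
            true
        }
    });
    reclaimed
}

fn main() {
    let now = Instant::now();
    let mut list = vec![
        Entry { socket_id: 1, deadline: now },                              // expired
        Entry { socket_id: 2, deadline: now + Duration::from_secs(60) },    // pending
    ];
    assert_eq!(reclaim_expired(&mut list, now), vec![1]);
    assert_eq!(list.len(), 1); // only the unexpired entry remains
}
```

Socket close maps onto the same pass: setting an entry's deadline to `now` and waking the worker reclaims it on the next scan.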

Performance: Zero-copy receive eliminates a memcpy of typically 16-64 KB per call, saving ~10-50 μs per receive on a modern CPU. Break-even vs. copy: ~4 KB on x86-64 (mmap overhead amortized over one or more pages).

/// Poll event bitflags (matches Linux POLL* values).
#[repr(transparent)]
pub struct PollEvents(u16);

impl PollEvents {
    pub const IN: Self = Self(0x0001);   // POLLIN
    pub const PRI: Self = Self(0x0002);  // POLLPRI
    pub const OUT: Self = Self(0x0004);  // POLLOUT
    pub const ERR: Self = Self(0x0008);  // POLLERR
    pub const HUP: Self = Self(0x0010);  // POLLHUP
    pub const NVAL: Self = Self(0x0020); // POLLNVAL
}

/// Shutdown direction (matches Linux SHUT_* values).
#[repr(i32)]
pub enum ShutdownHow {
    Rd = 0,   // SHUT_RD
    Wr = 1,   // SHUT_WR
    RdWr = 2, // SHUT_RDWR
}

/// Message header for sendmsg/recvmsg (matches Linux struct msghdr).
/// UAPI ABI — pointer-width-dependent (contains pointers and usize).
#[repr(C)]
pub struct MsgHdr {
    /// Optional destination address (for connectionless sockets).
    pub msg_name: *mut SockAddr,
    pub msg_namelen: u32,
    /// Scatter-gather array (iovec).
    pub msg_iov: *mut IoVec,
    pub msg_iovlen: usize,
    /// Ancillary data (control messages).
    pub msg_control: *mut u8,
    pub msg_controllen: usize,
    /// Flags.
    pub msg_flags: i32,
}
// UAPI ABI — pointer-width-dependent.
// 64-bit: msg_name(8)+msg_namelen(4)+pad(4)+msg_iov(8)+msg_iovlen(8)+msg_control(8)+msg_controllen(8)+msg_flags(4)+pad(4) = 56 bytes.

#[cfg(target_pointer_width = "64")]
const_assert!(core::mem::size_of::<MsgHdr>() == 56);

#[cfg(target_pointer_width = "32")]
const_assert!(core::mem::size_of::<MsgHdr>() == 28);

/// I/O vector for scatter-gather I/O (matches Linux struct iovec).
/// UAPI ABI — pointer-width-dependent (contains pointer and usize).
#[repr(C)]
pub struct IoVec {
    pub iov_base: *mut u8,
    pub iov_len: usize,
}
// 64-bit: iov_base(8) + iov_len(8) = 16 bytes.

#[cfg(target_pointer_width = "64")]
const_assert!(core::mem::size_of::<IoVec>() == 16);

#[cfg(target_pointer_width = "32")]
const_assert!(core::mem::size_of::<IoVec>() == 8);

// ConntrackState and NatType enums are defined in Section 16.18 alongside
// ConntrackEntry, where they are semantically home (protocol-agnostic
// connection tracking types).

**Socket concurrency model**: The socket dispatch layer wraps each `dyn SocketOps`
in a per-socket `RwLock<SlabRef<dyn SocketOps>>`. The `RwLock` serializes socket
lifecycle operations (`close`, which drops the `SlabRef`) against concurrent data
operations (`sendmsg`, `recvmsg`). Socket-internal state uses fine-grained interior
mutability (per-field atomics or internal locks), so the outer `RwLock` is never
contended on the data path -- readers acquire the read lock for all data operations,
and only `close` acquires the write lock. Slab allocation avoids per-socket
heap allocation; `SlabRef` provides stable references suitable for the `RwLock` wrapper,
and matches the return type of `SocketOps::accept()`. The dispatch layer — not the trait
implementation — acquires the appropriate lock before calling trait methods:
- Data-path operations (`sendmsg`, `recvmsg`, `poll`) acquire a **shared (read) lock**,
  allowing concurrent sends/receives from multiple threads (matching Linux's behavior
  where multiple threads can read/write the same socket simultaneously).
- Lifecycle operations (`close`, `shutdown`, `setsockopt`, `bind`, `listen`)
  acquire an **exclusive (write) lock**, ensuring they are serialized with respect to
  all other operations. Note that `connect()` acquires the write lock only to transition
  the socket state to `SYN_SENT`, then releases the lock and blocks on a wait queue,
  allowing concurrent `close()` or `shutdown()` to abort the connection attempt.

The `SocketOps` trait methods all take `&self` because the dispatch layer guarantees
the correct lock is held before invocation. Implementations use interior mutability
(per-field atomics or fine-grained locks) for their mutable state, as is standard for
Rust traits shared across threads.

This means:
- `close()` waits for any in-flight `recvmsg()` or `sendmsg()` to complete before
  proceeding. No use-after-free is possible. To prevent unbounded waits when
  `recvmsg()` is blocked in a long-polling receive, `close()` sets a `SOCK_DEAD`
  flag (visible to the socket's wait queue) before acquiring the write lock. This
  wakes any blocked readers, which check the flag and return `-EBADF`, releasing
  their read locks. This mirrors Linux's `sock_flag(sk, SOCK_DEAD)` mechanism.
- `shutdown()` + `recvmsg()`: `shutdown(SHUT_RD)` acquires the exclusive lock, sets a
  "read-shutdown" flag, and releases the lock. Subsequent `recvmsg()` calls see the
  flag and return 0 (EOF) without waiting for the exclusive lock.
- The `RwLock` is a per-CPU reader-optimized lock (the shared-lock fast path is a
  single atomic increment on the local CPU's counter, adding < 10 ns to data-path
  operations).

Socket objects returned by `accept()` are allocated from a per-CPU slab allocator (the
kernel slab allocator described in Section 4.1), not the general-purpose heap. `SlabRef<T>`
is a typed reference into a slab pool, providing O(1) allocation and deallocation without
contending on a global heap lock. This is critical for servers handling millions of
concurrent connections — `Box<dyn SocketOps>` would introduce a heap allocation on every
`accept()`, creating allocator contention under load. The slab is pre-sized per socket
type during protocol registration and grows in page-granularity chunks on demand.

**Static dispatch for the common path**: The `SocketOps` trait uses dynamic dispatch
(`dyn SocketOps`) to support runtime protocol registration and heterogeneous socket
collections. However, the common case — TCP sockets using the built-in CUBIC congestion
control — is monomorphized at compile time via generic specialization. The TCP
implementation calls its own concrete methods directly on the hot path (connect, send,
recv); `dyn` dispatch is only exercised when the socket layer must operate on a
protocol-agnostic socket handle (e.g., `epoll` readiness checks across mixed socket
types, or the `close()` path that iterates the fd table). This ensures the TCP fast path
has zero vtable overhead.

Protocol registration happens at umka-net initialization:

```rust
/// Register a transport protocol with the socket layer.
/// Called during umka-net init for built-in protocols (TCP, UDP, SCTP, MPTCP).
/// Can also be called at runtime to register dynamically loaded protocols.
pub fn register_protocol(
    family: AddressFamily,       // AF_INET, AF_INET6
    sock_type: SocketType,       // SOCK_STREAM, SOCK_DGRAM, SOCK_SEQPACKET
    protocol: u16,               // IPPROTO_TCP, IPPROTO_UDP, IPPROTO_SCTP, IPPROTO_MPTCP
    factory: &'static dyn SocketFactory,
) -> Result<(), KernelError>;
```

Adding a new transport (e.g., QUIC kernel offload) requires only: (1) implement SocketOps, (2) implement SocketFactory, (3) call register_protocol with the appropriate family/type/protocol tuple. No changes to the socket layer, syscall dispatch, or any other transport's code.

Linux comparison: Linux's struct proto_ops serves a similar role, but the implementation is entangled with struct sock internals. Adding MPTCP to Linux required modifying tcp_input.c, tcp_output.c, the socket layer, and the connection tracking subsystem. UmkaOS's trait boundary enforces that transports are self-contained.

TCP timer implementation (RTO computation, delayed ACK, zero-window probe, TIME_WAIT, keepalive) follows RFC 6298 (RTO), RFC 1122 (delayed ACK ≤ 500ms, keepalive ≥ 2h), RFC 7323 (timestamps/window scaling), and RFC 7413 (TFO). Timer integration uses the kernel's hierarchical timer wheel (Section 7.5.4, 06-scheduling.md) with O(1) insertion and cancellation. Key constants (matching Linux defaults for compatibility): TCP_RTO_MIN = 200ms, TCP_RTO_MAX = 120s, TCP_TIMEWAIT_LEN = 60s, TCP_DELACK_MAX = 200ms. Specific timer algorithms (SRTT smoothing, Karn's algorithm) are implementation details within these RFC-defined bounds and are not specified further in the architecture.

16.9 Congestion Control Framework

Congestion control is pluggable via a trait, selectable per-socket at runtime:

The congestion control interface is the CongestionOps trait (fully specified in Section 16.10). Each algorithm is a stateless descriptor (&'static dyn CongestionOps); per-connection state lives in TcpCb.cong_priv (64 bytes inline). All byte counters in the interface use u64 to support high-BDP networks: at 400 Gbps with 100ms RTT the bandwidth-delay product is ~5 GB, which would overflow u32 (max ~4.3 GB). Using u64 avoids silent overflow on datacenter and WAN paths.
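The overflow claim is easy to verify with the numbers given in the text (the helper name is illustrative):

```rust
/// Bandwidth-delay product in bytes: (gbps × 10^9 / 8) bytes/s × (rtt_ms / 1000) s.
fn bdp_bytes(gbps: u64, rtt_ms: u64) -> u64 {
    gbps * 1_000_000_000 / 8 * rtt_ms / 1_000
}

fn main() {
    let bdp = bdp_bytes(400, 100);
    assert_eq!(bdp, 5_000_000_000);  // ~5 GB in flight at 400 Gbps × 100 ms
    assert!(bdp > u32::MAX as u64);  // a u32 byte counter would silently overflow
}
```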

Built-in algorithms:

| Algorithm | Description | Default |
|-----------|-------------|---------|
| CUBIC | Linux default since 2.6.19. Cubic function for cwnd growth. | Yes |
| BBR | Google's bottleneck bandwidth and RTT-based CC. | No |
| BBRv3 | Revised BBR merging bandwidth and loss models into a single state machine; available from Google's BBR repository (not yet in Linux mainline as of 2026). | No |
| Reno | Classic AIMD (additive increase, multiplicative decrease). | No |

Per-socket selection via setsockopt(fd, IPPROTO_TCP, TCP_CONGESTION, "bbr") — same API as Linux. Applications that set congestion control on Linux work identically on UmkaOS.

eBPF struct_ops lifecycle: Custom congestion control algorithms can be loaded at runtime via eBPF, using the same struct_ops mechanism as Linux 5.6+.

  1. Registration: A BPF program of type BPF_PROG_TYPE_STRUCT_OPS provides implementations for the CongestionOps trait methods. The verifier checks that all required methods (cwnd_event, cong_control, ssthresh) are implemented and that the BPF programs are safe (bounded loops, valid memory access).

  2. Activation: bpf(BPF_MAP_UPDATE_ELEM) on a BPF_MAP_TYPE_STRUCT_OPS map registers the algorithm with the kernel CC registry under a user-chosen name (max 16 bytes, e.g., "bpf_bbr2_exp"). The name becomes available for setsockopt(TCP_CONGESTION).

  3. Per-socket attachment: When a socket selects the BPF CC via setsockopt(TCP_CONGESTION, "bpf_bbr2_exp"), the kernel stores a &'static dyn CongestionOps pointing to the BPF-backed vtable. Per-connection state lives in TcpCb.cong_priv (64 bytes inline), same as built-in CCs.

  4. Deregistration: bpf(BPF_MAP_DELETE_ELEM) removes the algorithm from the registry. Existing sockets using it continue with their current vtable reference (the BPF program is pinned in memory until all references are dropped). New sockets cannot select the deregistered name.

This enables production A/B testing of new algorithms without kernel rebuilds — the same workflow used at Meta and Google on Linux.

MPTCP per-subflow CC interaction: Each MPTCP subflow (MptcpSubflow.tcp: TcpCb) has its own independent congestion control instance. The MPTCP scheduler (Section 16.11) selects subflows based on their individual cwnd availability, and each subflow's CC algorithm operates on its own RTT samples and loss signals. MPTCP-specific CCs (e.g., LIA, OLIA, BALIA) are implemented as CongestionOps that read the MptcpConnection state via TcpCb.mptcp_conn backpointer to coordinate cwnd growth across subflows and achieve resource pooling.

CC algorithm registration/deregistration path: Built-in algorithms are registered at umka-net module init. The CC registry is a global RcuHashMap<&str, &'static dyn CongestionOps> keyed by algorithm name. Registration is cold-path (module init or BPF map update). Lookup is warm-path (per-socket setsockopt). The per-socket cc_ops pointer (Section 16.10) is resolved at setsockopt time and cached for the socket's lifetime.
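A sketch of the registry's register/lookup flow, with an ordinary HashMap standing in for the kernel's RcuHashMap and a trimmed descriptor type (both stand-ins; the real value type is `&'static dyn CongestionOps`):

```rust
use std::collections::HashMap;

/// Stand-in for a &'static dyn CongestionOps descriptor.
struct CcDescriptor {
    name: &'static str,
}

/// Name-keyed CC registry. Registration is cold-path; lookup happens at
/// setsockopt(TCP_CONGESTION) time and the result is cached per-socket.
struct CcRegistry {
    algos: HashMap<&'static str, &'static CcDescriptor>,
}

impl CcRegistry {
    fn register(&mut self, ops: &'static CcDescriptor) {
        self.algos.insert(ops.name, ops);
    }
    /// Resolve a name to a long-lived descriptor the socket can cache.
    fn lookup(&self, name: &str) -> Option<&'static CcDescriptor> {
        self.algos.get(name).copied()
    }
}

static CUBIC: CcDescriptor = CcDescriptor { name: "cubic" };

fn main() {
    let mut reg = CcRegistry { algos: HashMap::new() };
    reg.register(&CUBIC);
    assert_eq!(reg.lookup("cubic").unwrap().name, "cubic");
    assert!(reg.lookup("bbr").is_none()); // unregistered name → ENOENT at setsockopt
}
```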

16.10 Pluggable TCP Congestion Control

Linux parallel: Linux exposes tcp_congestion_ops as a loadable module API. UmkaOS provides the same extensibility through the CongestionOps trait registered in umka-net's congestion control registry. Section 16.6 introduced BBR and the trait outline; this section specifies the full interface, registration lifecycle, per-socket selection, and the data structures that congestion algorithms receive.

16.10.1 CongestionOps Trait (Full Specification)

/// Congestion control algorithm interface.
///
/// Each algorithm is a stateless descriptor (`&'static dyn CongestionOps`).
/// Per-connection algorithm state lives in `TcpCb.cong_priv` (64 bytes inline,
/// heap-allocated if larger).
///
/// Methods marked `optional` have a default no-op implementation.
/// The TCP engine calls every method from within umka-net's isolation domain;
/// no domain crossing is required.
pub trait CongestionOps: Send + Sync {
    /// Algorithm name (ASCII, null-terminated, max 16 bytes including NUL).
    /// Used by TCP_CONGESTION sockopt and /proc/sys/net/ipv4/tcp_congestion_control.
    fn name(&self) -> &'static str;

    /// Capability flags declared by this algorithm.
    fn flags(&self) -> CaFlags;

    /// Called when a new TCP connection is allocated and this algorithm is selected.
    /// Initialise per-connection state in `state.cong_priv`.
    ///
    /// `cb` provides read-only access to immutable socket fields (`common`, socket
    /// identity). `state` provides mutable access to congestion-relevant fields
    /// (`cwnd`, `ssthresh`, `cong_priv`, `srtt_us`) via the caller's
    /// `SpinLockGuard<TcpMutableState>`.
    fn init(&self, cb: &TcpCb, state: &mut TcpMutableState);

    /// Called when the connection is destroyed or a different algorithm is selected.
    /// Release per-connection resources allocated in `init`.
    fn release(&self, cb: &TcpCb, state: &mut TcpMutableState);

    /// Return the slow-start threshold for the current congestion event.
    /// Must return a value >= 2*MSS; never returns 0 (TCP engine enforces).
    fn ssthresh(&self, cb: &TcpCb, state: &mut TcpMutableState) -> u64;

    /// Called on each ACK in the congestion-avoidance phase.
    ///
    /// `ack` is the acknowledged sequence number.
    /// `acked` is the number of bytes newly acknowledged by this ACK.
    ///
    /// Typical CUBIC/Reno implementation: advance cwnd by acked/cwnd per ACK.
    fn cong_avoid(&self, cb: &TcpCb, state: &mut TcpMutableState, ack: u32, acked: u64);

    /// Optional full-ACK processing hook (used by BBR, not by Reno/CUBIC).
    ///
    /// Called instead of `cong_avoid` when `CaFlags::CA_FLAG_FULL_CONTROL` is set.
    /// Provides the full `TcpAck` structure for pacing-based algorithms.
    ///
    /// Default implementation: delegates to `cong_avoid(cb, state, ack.ack_seq, ack.bytes_acked)`.
    fn cong_control(&self, cb: &TcpCb, state: &mut TcpMutableState, ack: &TcpAck) {
        self.cong_avoid(cb, state, ack.ack_seq, ack.bytes_acked);
    }

    /// Notify algorithm of a TCP state transition.
    ///
    /// Called when the connection enters a new state (e.g., CongState::Loss
    /// on RTO or triple-duplicate ACK). Algorithms may reset cwnd or adjust
    /// internal state here.
    fn set_state(&self, cb: &TcpCb, state: &mut TcpMutableState, new_state: CongState);

    /// Notify algorithm of a discrete congestion event.
    ///
    /// Called for events that do not change TCP state but affect the congestion
    /// algorithm (e.g., the sender starts transmitting after an idle period).
    fn cwnd_event(&self, cb: &TcpCb, state: &mut TcpMutableState, ev: CaEvent);

    /// Optional: called after SACK processing with a per-ACK rate sample.
    ///
    /// Used by RTT-based algorithms (BBR) that need delivery-rate estimation.
    /// Only called when `CaFlags::CA_FLAG_RTT_BASED` is set in `flags()`.
    ///
    /// Default implementation: no-op.
    fn pkts_acked(&self, cb: &TcpCb, state: &mut TcpMutableState, sample: &RateSample) {
        let _ = (cb, state, sample);
    }

    /// Return the cwnd value to restore on an undo event (spurious RTO).
    ///
    /// Called when F-RTO or DSACK identifies a retransmit as spurious.
    /// Must return a value >= the current cwnd (undo cannot reduce cwnd).
    fn undo_cwnd(&self, cb: &TcpCb, state: &TcpMutableState) -> u64;

    /// Optional: fill `buf` with TCP_INFO algorithm-specific bytes (up to 16 bytes).
    ///
    /// The TCP engine calls this when `getsockopt(TCP_INFO)` is issued.
    /// The returned bytes are appended to the standard `tcp_info` struct
    /// as the `tcpi_opt_vals` extension field.
    ///
    /// Returns: number of bytes written (0 if algorithm has no info to export).
    /// Default implementation: writes 0 bytes.
    fn get_info(&self, cb: &TcpCb, state: &TcpMutableState, buf: &mut [u8]) -> usize {
        let _ = (cb, state, buf);
        0
    }
}

Inline private state: TcpCb.cong_priv is a CongPriv union — 64 bytes of inline storage. Algorithms with <= 64 bytes of per-connection state (Reno, CUBIC, Vegas) store directly there. Algorithms needing more (BBR v2 with bandwidth estimation tables) allocate a heap box and store its raw pointer in the first 8 bytes of cong_priv, freeing it in release(). The TCP engine zeroes cong_priv before calling init() so algorithms may rely on zero-initialisation.
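As a concrete illustration, a Reno-style algorithm keeps its entire per-connection state well within the inline region. The sketch below uses cut-down stand-ins for `TcpMutableState` and `cong_priv` (field subset only; the lock, `TcpCb`, and trait plumbing are elided), and is a simplified model rather than the specified implementation:

```rust
// Sketch: Reno-style state kept entirely in the 64-byte inline cong_priv
// region. TcpMutableState here is a cut-down stand-in (cwnd/ssthresh/cong_priv
// only); the lock, TcpCb, and trait plumbing are elided.

const MSS: u64 = 1460;

struct TcpMutableState {
    cwnd: u64,           // congestion window, bytes
    ssthresh: u64,       // slow-start threshold, bytes
    cong_priv: [u8; 64], // inline algorithm state, zeroed before init()
}

// Reno needs only a bytes-acked accumulator, far under the 64-byte budget.
struct RenoPriv {
    acked_bytes: u64,
}

impl RenoPriv {
    fn load(p: &[u8; 64]) -> Self {
        Self { acked_bytes: u64::from_ne_bytes(p[..8].try_into().unwrap()) }
    }
    fn store(&self, p: &mut [u8; 64]) {
        p[..8].copy_from_slice(&self.acked_bytes.to_ne_bytes());
    }
}

/// RFC 5681 multiplicative decrease: ssthresh = max(cwnd / 2, 2 * MSS).
fn reno_ssthresh(state: &TcpMutableState) -> u64 {
    (state.cwnd / 2).max(2 * MSS)
}

/// RFC 5681: exponential growth in slow start, then one MSS per
/// window's worth of acked bytes in congestion avoidance.
fn reno_cong_avoid(state: &mut TcpMutableState, acked: u64) {
    if state.cwnd < state.ssthresh {
        state.cwnd += acked; // slow start
        return;
    }
    let mut rp = RenoPriv::load(&state.cong_priv);
    rp.acked_bytes += acked;
    if rp.acked_bytes >= state.cwnd {
        rp.acked_bytes -= state.cwnd;
        state.cwnd += MSS; // additive increase
    }
    rp.store(&mut state.cong_priv);
}
```

Because the engine zeroes `cong_priv` before `init()`, the accumulator starts at zero without any explicit initialisation in `init`.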

16.10.2 Supporting Types

/// Discrete congestion events delivered via `cwnd_event()`.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum CaEvent {
    /// Sender began transmitting after an idle period (no in-flight data).
    TxStart,
    /// cwnd was reset to IW after an idle period (TCP RFC 5681 §4.1).
    CwndRestart,
    /// Quick-ack mode completed (returned to delayed-ACK).
    CompleteQuickAck,
    /// A loss event was detected (fast-retransmit or RTO).
    Loss,
}

/// TCP connection congestion states.
///
/// The TCP engine transitions between these states; algorithms receive
/// `set_state()` on every transition.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum CongState {
    /// Normal operation (slow-start or congestion avoidance).
    Open,
    /// Disorder: duplicate ACKs or SACK holes detected, no loss confirmed.
    Disorder,
    /// ECN congestion signal received; cwnd reduced without loss.
    Cwr,
    /// Confirmed loss; performing fast recovery (RFC 6675).
    Recovery,
    /// RTO fired; entering slow-start from ssthresh.
    Loss,
}

/// Algorithm capability flags.
bitflags! {
    pub struct CaFlags: u32 {
        /// Algorithm uses RTT-based control; receive `pkts_acked()` calls.
        const CA_FLAG_RTT_BASED    = 0x1;
        /// Algorithm is per-connection (not flow-aggregate); used by BBR.
        const CA_FLAG_CONN         = 0x2;
        /// Algorithm requires ECN; negotiation fails if peer does not support ECN.
        const CA_FLAG_NEEDS_ECN    = 0x4;
        /// Algorithm implements full ACK processing via `cong_control()`.
        const CA_FLAG_FULL_CONTROL = 0x8;
    }
}

/// Per-ACK delivery rate sample (delivery-rate estimation as used by BBR;
/// cf. the IETF draft draft-cheng-iccrg-delivery-rate-estimation).
///
/// Computed by the TCP engine's rate-sampling code after SACK processing.
/// Delivered to algorithms that set `CA_FLAG_RTT_BASED`.
#[derive(Debug, Clone, Copy)]
pub struct RateSample {
    /// Total bytes delivered since connection start at this ACK's arrival.
    pub delivered: u64,
    /// Bytes delivered that were CE-marked (ECN congestion experienced).
    pub delivered_ce: u64,
    /// Elapsed time (us) of the measurement interval.
    // Longevity: u32 us wraps at ~71 min. Structurally safe: RateSample
    // is ephemeral (per-ACK, not cumulative). Any single RTT exceeding
    // 71 min triggers TCP RTO/keepalive long before. Matches Linux
    // struct rate_sample.
    pub interval_us: u32,
    /// Sender-side measurement interval (us).
    pub snd_interval_us: u32,
    /// Receiver-side measurement interval (us).
    pub rcv_interval_us: u32,
    /// Latest RTT sample (us); 0 if unavailable.
    pub rtt_us: u32,
    /// Packets lost during this ACK's interval (SACK + RTO inference).
    pub losses: u32,
    /// Packets newly ACKed or SACKed.
    pub acked_sacked: u32,
    /// `delivered` counter value at the start of the measurement interval.
    pub prior_delivered: u64,
    /// True if the sender was app-limited during this interval.
    pub is_app_limited: bool,
}
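A `RateSample` feeds a BBR-style bandwidth estimate as bytes delivered over the measurement interval. A minimal sketch, using only the needed field subset and discarding app-limited samples outright (the real filter treats them as a lower bound, not as invalid):

```rust
// Sketch: turning a RateSample into a delivery-rate estimate, the core input
// to BBR's bandwidth filter. Field subset only; app-limited handling is
// deliberately simplified.

#[derive(Clone, Copy)]
struct RateSample {
    delivered: u64,       // bytes delivered at this ACK's arrival
    prior_delivered: u64, // delivered counter at interval start
    interval_us: u32,     // measurement interval, microseconds
    is_app_limited: bool, // sender was app-limited during the interval
}

/// Delivery rate in bytes/second, or None for unusable samples.
fn delivery_rate_bps(rs: &RateSample) -> Option<u64> {
    if rs.interval_us == 0 || rs.is_app_limited {
        return None;
    }
    let delivered = rs.delivered - rs.prior_delivered;
    Some(delivered * 1_000_000 / rs.interval_us as u64)
}
```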

/// Partial view of a received ACK, passed to `cong_control()`.
#[derive(Debug, Clone, Copy)]
pub struct TcpAck {
    /// Acknowledged sequence number (cumulative).
    pub ack_seq: u32,
    /// Bytes newly acknowledged by this ACK (excluding retransmits).
    /// u64 per congestion-control-framework.md: "All byte counters use u64."
    pub bytes_acked: u64,
    /// SACK score: bytes newly SACKed.
    /// u64 per byte counter policy.
    pub sack_newly_sacked: u64,
    /// Receiver-advertised window (in bytes, after scaling).
    pub win: u32,
    /// ECN congestion window reduction signal received with this ACK.
    pub ecn_cwr: bool,
}

16.10.3 Registration API

The congestion control registry is a module-global, fixed-size table inside umka-net. It is initialised at umka-net startup with the builtin algorithms and is thereafter modified only by explicit register/unregister calls.

/// Congestion control algorithm slot ID, assigned at registration.
/// Stored in `TcpCb` at `setsockopt()` time
/// so that the `connect()` hot path performs an O(1) array lookup — no string
/// comparison, no hash, no tree traversal.
pub type CongAlgorithmId = u8;

/// Entry in the congestion control registry. Contains only the algorithm
/// descriptor — reference counts live in the separate `CONG_CTL_REFCNTS`
/// array so that the RCU-protected registry can be copied atomically on
/// register/unregister without losing in-flight refcount updates from the
/// connect/close hot path.
#[derive(Clone, Copy)]
pub struct CongCtlEntry {
    pub ops: &'static dyn CongestionOps,
}

/// Maximum number of simultaneously registered congestion control algorithms.
/// Bounded at compile time: the registry is a fixed-size array with no heap use.
pub const MAX_CONG_CTLS: usize = 32;

/// Fixed-size congestion control registry. Slot index = `CongAlgorithmId`.
///
/// **Hot path** (`connect()`, 100K+ connections/sec): one RCU read guard
/// (~1-3 cycles) + one array index. No string comparison, no lock.
///
/// **Write path** (register/unregister, rare module load/unload): acquires
/// `CONG_CTL_WRITE_LOCK`, modifies a slot in a cloned array, publishes via
/// `RcuCell::update`. Old snapshot freed after RCU grace period.
static CONG_CTL_REGISTRY: RcuCell<[Option<CongCtlEntry>; MAX_CONG_CTLS]> =
    RcuCell::new_empty();

/// Name → slot ID lookup. Used only by `setsockopt(TCP_CONGESTION)` (warm path)
/// and register/unregister (cold path). Never touched on the `connect()` hot path.
/// Protected by `CONG_CTL_WRITE_LOCK` for writes; reads acquire the SpinLock
/// briefly (setsockopt frequency is negligible vs. connect frequency).
static CONG_CTL_NAME_TO_ID: SpinLock<ArrayMap<ArrayString<16>, CongAlgorithmId, MAX_CONG_CTLS>> =
    SpinLock::new(ArrayMap::new());

/// System default algorithm slot ID. Updated atomically when the default name
/// changes via `/proc/sys/net/ipv4/tcp_congestion_control`. Sockets that have
/// not called `setsockopt(TCP_CONGESTION)` use this at `connect()` time.
static DEFAULT_CONG_CTL_ID: AtomicU8 = AtomicU8::new(0); // reno = slot 0

/// Per-slot active socket reference counts. Indexed by `CongAlgorithmId`.
/// Incremented at `tcp_init_cong_control()` (connect), decremented at
/// `tcp_cleanup_cong_control()` (close / algorithm change).
/// Separate from the RCU-protected registry so that connect/close (hot path)
/// never triggers an RCU publish. Plain atomic increment: ~1 cycle.
static CONG_CTL_REFCNTS: [AtomicU32; MAX_CONG_CTLS] =
    [const { AtomicU32::new(0) }; MAX_CONG_CTLS];

/// Serializes register/unregister calls. Never held during TCP processing.
static CONG_CTL_WRITE_LOCK: Mutex<()> = Mutex::new(());

/// Register a congestion control algorithm and return its assigned slot ID.
///
/// # Errors
/// - `KernelError::AlreadyExists` if an algorithm with the same name is registered.
/// - `KernelError::InvalidArgument` if `ops.name()` is empty or longer than 15 bytes.
/// - `KernelError::ResourceExhausted` if all `MAX_CONG_CTLS` slots are occupied.
pub fn tcp_register_congestion_control(
    ops: &'static dyn CongestionOps,
) -> Result<CongAlgorithmId, KernelError> {
    let name = ops.name();
    if name.is_empty() || name.len() > 15 {
        return Err(KernelError::InvalidArgument);
    }
    let _guard = CONG_CTL_WRITE_LOCK.lock();

    // Check name uniqueness.
    let mut name_map = CONG_CTL_NAME_TO_ID.lock();
    if name_map.contains_key(name) {
        return Err(KernelError::AlreadyExists);
    }

    // Find the first empty slot in the registry.
    let snapshot = CONG_CTL_REGISTRY.read();
    let slot_id = snapshot.iter().position(|s| s.is_none())
        .ok_or(KernelError::ResourceExhausted)? as CongAlgorithmId;

    // Copy-on-write: clone the fixed-size array, insert new entry, RCU-publish.
    // CongCtlEntry is Copy (just a fat pointer), so *snapshot is a bitwise copy.
    let mut new_array = *snapshot;
    new_array[slot_id as usize] = Some(CongCtlEntry { ops });
    CONG_CTL_REGISTRY.update(new_array);

    // Record name → slot mapping.
    name_map.insert(ArrayString::from(name), slot_id);

    Ok(slot_id)
}

/// Unregister a congestion control algorithm by name.
///
/// # Errors
/// - `KernelError::NotFound` if the algorithm is not registered.
/// - `KernelError::Busy` if any socket is using this algorithm (refcnt > 0).
/// - `KernelError::PermissionDenied` if trying to unregister `"reno"`.
pub fn tcp_unregister_congestion_control(name: &str) -> Result<(), KernelError> {
    if name == "reno" {
        return Err(KernelError::PermissionDenied);
    }
    let _guard = CONG_CTL_WRITE_LOCK.lock();

    // Look up slot ID from name.
    let mut name_map = CONG_CTL_NAME_TO_ID.lock();
    let slot_id = *name_map.get(name).ok_or(KernelError::NotFound)?;

    // Clear the slot FIRST in a new array snapshot and RCU-publish.
    // This prevents new connections from acquiring a refcount on this slot
    // via tcp_init_cong_control() — they will see None and fall back to reno.
    // Note: during the window between slot clear and possible restore, new
    // connections may fall back to Reno. This is intentional — the
    // administrator is attempting to remove the algorithm.
    let mut new_array = *CONG_CTL_REGISTRY.read();
    let old_entry = new_array[slot_id as usize].take();
    CONG_CTL_REGISTRY.update(new_array);

    // Wait for an RCU grace period so all in-flight tcp_init_cong_control()
    // calls that obtained the ops pointer before the slot was cleared have
    // completed and incremented their refcount.
    synchronize_rcu();

    // Now check refcount — any in-flight readers have either completed (and
    // incremented the refcount) or seen the cleared slot (and skipped it).
    if CONG_CTL_REFCNTS[slot_id as usize].load(Acquire) > 0 {
        // Restore the slot — there are still active users.
        let mut restore_array = *CONG_CTL_REGISTRY.read();
        restore_array[slot_id as usize] = old_entry;
        CONG_CTL_REGISTRY.update(restore_array);
        return Err(KernelError::Busy);
    }

    // Remove from name map. Slot ID is now free for reuse by a future register.
    name_map.remove(name);

    Ok(())
}

/// Hot-path lookup: resolve a `CongAlgorithmId` to the algorithm's ops table.
///
/// O(1): one RCU read guard + one array index. Called from `tcp_init_cong_control()`
/// on every `connect()`. Returns `None` if the slot was cleared between
/// `setsockopt()` and `connect()` (the caller falls back to reno).
fn tcp_find_cong_by_id(id: CongAlgorithmId) -> Option<&'static dyn CongestionOps> {
    let guard = CONG_CTL_REGISTRY.read();
    guard.get(id as usize).and_then(|slot| slot.as_ref()).map(|e| e.ops)
}

/// Warm-path lookup: resolve a name to a `CongAlgorithmId`.
///
/// Used by `setsockopt(TCP_CONGESTION)` — called once per socket configuration,
/// not on the `connect()` hot path. Acquires `CONG_CTL_NAME_TO_ID` SpinLock
/// briefly (setsockopt frequency is negligible vs. connect frequency).
fn tcp_find_cong_by_name(name: &str) -> Option<CongAlgorithmId> {
    CONG_CTL_NAME_TO_ID.lock().get(name).copied()
}
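The slot-array/name-map split can be modelled in miniature. The sketch below substitutes plain `std` containers for `RcuCell`, `SpinLock`, and `ArrayMap` — the concurrency story is elided — but preserves the design point: id lookup is a single array index on the hot path, while name lookup happens only at configuration time:

```rust
// Miniature model of the registry: fixed slot array for O(1) id lookup
// (the connect() hot path) plus a name->id map used only by
// setsockopt/register (warm/cold paths).

use std::collections::HashMap;

const MAX_CONG_CTLS: usize = 32;

struct Entry {
    name: &'static str, // stands in for the &'static dyn CongestionOps
}

struct Registry {
    slots: [Option<Entry>; MAX_CONG_CTLS], // index = CongAlgorithmId
    name_to_id: HashMap<&'static str, u8>,
}

impl Registry {
    fn new() -> Self {
        Self {
            slots: std::array::from_fn(|_| None),
            name_to_id: HashMap::new(),
        }
    }

    /// Mirrors tcp_register_congestion_control: validate, reject duplicates,
    /// take the first free slot, record the name mapping.
    fn register(&mut self, name: &'static str) -> Option<u8> {
        if name.is_empty() || name.len() > 15 || self.name_to_id.contains_key(name) {
            return None;
        }
        let id = self.slots.iter().position(|s| s.is_none())? as u8;
        self.slots[id as usize] = Some(Entry { name });
        self.name_to_id.insert(name, id);
        Some(id)
    }

    /// Hot path: one array index, no string work.
    fn find_by_id(&self, id: u8) -> Option<&Entry> {
        self.slots.get(id as usize)?.as_ref()
    }

    /// Warm path: name resolution at setsockopt time only.
    fn find_by_name(&self, name: &str) -> Option<u8> {
        self.name_to_id.get(name).copied()
    }
}
```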

Builtin algorithms (registered at umka-net init, in this order):

Name Slot ID Default? Description
reno 0 fallback RFC 5681 Reno — always available, never unregistered
bbr 1 no BBR v2 (pacing + bandwidth estimation)
cubic 2 yes CUBIC (RFC 8312) — system default, matching Linux

reno occupies slot 0 (compile-time default for DEFAULT_CONG_CTL_ID). At umka-net startup, bbr and cubic are registered in slots 1 and 2, and DEFAULT_CONG_CTL_ID is updated to cubic's slot (2) — matching Linux and the tcp_congestion_control sysctl default in Section 16.10.6. If the system default is unregistered between socket() and connect(), sockets fall back to reno (slot 0).

16.10.4 Per-Socket Selection Lifecycle

At connect() time: The TCP engine calls tcp_init_cong_control(cb, state) where cb: &TcpCb and state: &mut TcpMutableState (from the caller's SpinLockGuard):

/// Attach the selected (or system-default) congestion control algorithm
/// to a newly connecting TCP socket.
///
/// Called once, from `tcp_connect()`, before the SYN is transmitted.
/// The caller holds `TcpCb.lock` (providing `&mut TcpMutableState` via
/// `SpinLockGuard<TcpMutableState>`). Uses `state.cong_id` (set by a
/// prior `setsockopt(TCP_CONGESTION)`) for an O(1) array lookup.
/// If no algorithm was explicitly selected, uses the system default
/// (`DEFAULT_CONG_CTL_ID`).
///
/// Fallback: if the selected algorithm was unregistered between `setsockopt`
/// and `connect`, falls back to reno (slot 0). This matches Linux behaviour
/// where algorithm removal does not fail in-flight connects.
pub fn tcp_init_cong_control(
    cb: &TcpCb,
    state: &mut TcpMutableState,
) -> Result<(), KernelError> {
    let requested_id = state.cong_id.unwrap_or(DEFAULT_CONG_CTL_ID.load(Acquire));
    let (id, ops) = match tcp_find_cong_by_id(requested_id) {
        Some(ops) => (requested_id, ops),
        None => (0, tcp_find_cong_by_id(0).unwrap()),  // slot 0 = reno, always present
    };
    state.cong_ops = ops;
    state.cong_id = Some(id);
    CONG_CTL_REFCNTS[id as usize].fetch_add(1, Relaxed);
    ops.init(cb, state);
    Ok(())
}

At socket close() / algorithm change: tcp_cleanup_cong_control(cb, state):

/// Detach the congestion control algorithm from a TCP socket.
///
/// Called on connection close or when TCP_CONGESTION setsockopt changes the
/// algorithm. The caller holds `TcpCb.lock`. Decrements the slot's reference
/// count and resets `state.cong_ops` to reno to prevent use-after-free if a
/// racing timer fires between `release()` and the next `init()`.
pub fn tcp_cleanup_cong_control(cb: &TcpCb, state: &mut TcpMutableState) {
    let ops = state.cong_ops;
    ops.release(cb, state);
    if let Some(id) = state.cong_id {
        CONG_CTL_REFCNTS[id as usize].fetch_sub(1, Relaxed);
    }
    state.cong_ops = &RENO_CONG_OPS;
    state.cong_id = None;
}

TCP_CONGESTION sockopt (setsockopt(IPPROTO_TCP, TCP_CONGESTION, "bbr\0", 4)):

  1. Verify caller holds Capability::NetAdmin OR socket is not yet connected (unprivileged processes may set the algorithm before connecting, matching Linux).
  2. Null-terminate and validate the name (max 15 bytes, ASCII printable).
  3. Resolve the name to a CongAlgorithmId via tcp_find_cong_by_name(). Return ENOENT if not found.
  4. Store the resolved ID in state.cong_id (under the socket lock).
  5. If the socket is already connected, call tcp_cleanup_cong_control(cb, state) then tcp_init_cong_control(cb, state) (which reads state.cong_id for O(1) slot lookup).
  6. If not yet connected, state.cong_id is already set — connect() will use it.

getsockopt(IPPROTO_TCP, TCP_CONGESTION, buf, len) copies state.cong_ops.name() into buf (null-terminated, matching Linux).
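Step 2's name validation might look like the following sketch. Error strings stand in for errno values; the capability check (step 1) and slot resolution and reattachment (steps 3-5) are elided:

```rust
// Sketch of TCP_CONGESTION name validation: NUL-terminate, bound the length
// at 15 bytes, require printable ASCII. Returns the validated name slice.

fn validate_cong_name(buf: &[u8]) -> Result<&str, &'static str> {
    // Treat the first NUL, if any, as the terminator.
    let end = buf.iter().position(|&b| b == 0).unwrap_or(buf.len());
    let name = &buf[..end];
    if name.is_empty() || name.len() > 15 {
        return Err("EINVAL");
    }
    if !name.iter().all(|&b| (0x20..0x7f).contains(&b)) {
        return Err("EINVAL");
    }
    // All-ASCII by the check above, so UTF-8 conversion cannot fail.
    std::str::from_utf8(name).map_err(|_| "EINVAL")
}
```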

16.10.5 System Default

The system default is stored as an atomic slot ID (DEFAULT_CONG_CTL_ID, declared above alongside the registry statics). At umka-net init, after all builtins are registered, the init code stores the BBR slot ID:

// umka-net init sequence (cold path, runs once at boot):
let reno_id = tcp_register_congestion_control(&RENO_CONG_OPS).unwrap();  // slot 0
let _       = tcp_register_congestion_control(&BBR_CONG_OPS).unwrap();   // slot 1
let cubic_id = tcp_register_congestion_control(&CUBIC_CONG_OPS).unwrap(); // slot 2
DEFAULT_CONG_CTL_ID.store(cubic_id, Release);  // CUBIC is the system default (matches Linux)

/// Return the name of the current system default algorithm (for sysfs/getsockopt).
/// One atomic load + one RCU read + one array index.
pub fn tcp_default_cong_control_name() -> &'static str {
    let id = DEFAULT_CONG_CTL_ID.load(Acquire);
    tcp_find_cong_by_id(id).map(|ops| ops.name()).unwrap_or("reno")
}

/proc/sys/net/ipv4/tcp_congestion_control: Reads return the current default algorithm's name (via tcp_default_cong_control_name()). Writes (requiring Capability::NetAdmin) resolve the name to a CongAlgorithmId via tcp_find_cong_by_name() and store it in DEFAULT_CONG_CTL_ID with Release ordering. Writing an unknown name returns ENOENT. This sysctl is per-network-namespace (Section 17.1), so each container may independently configure its default.

Fallback chain: If the selected or default algorithm is unavailable at connect() time (e.g., it was unregistered, or a Tier 1 module crashed and was not yet reloaded), tcp_init_cong_control() falls back directly to Reno (slot 0). Reno is built-in, always available, and can never be unregistered — tcp_find_cong_by_id(0) always returns it. This ensures no connection can ever be created without a congestion control algorithm.

16.10.6 TCP Sysctl Entries (/proc/sys/net/ipv4/tcp_*)

UmkaOS must implement the following /proc/sys/net/ipv4/tcp_* entries for Linux compatibility. These are required by Docker, Kubernetes, monitoring tools (Prometheus node_exporter, Datadog agent), and system tuning scripts (sysctl -w). All entries are per-network-namespace unless noted otherwise.

Sysctl Type Default Description
tcp_syn_retries u8 6 Max SYN retransmits before aborting a connect attempt.
tcp_synack_retries u8 5 Max SYN-ACK retransmits for a passive connection.
tcp_fin_timeout u32 (seconds) 60 Time a socket stays in FIN-WAIT-2 before being forcibly closed.
tcp_keepalive_time u32 (seconds) 7200 Idle time before the first keepalive probe is sent.
tcp_keepalive_intvl u32 (seconds) 75 Interval between successive keepalive probes.
tcp_keepalive_probes u8 9 Number of unacknowledged probes before declaring the connection dead.
tcp_max_syn_backlog u32 4096 Maximum length of the per-socket SYN backlog (incomplete connections).
tcp_max_tw_buckets u32 262144 Maximum number of TIME-WAIT entries. Excess entries are destroyed immediately (early expiry; no RST sent). TIME-WAIT entries use lightweight TwEntry structs (~64 bytes) stored in a separate TwHashTable (see Section 16.8).
tcp_tw_reuse u8 (0/1/2) 2 Allow reuse of TIME-WAIT sockets for new outgoing connections. 0 = disabled, 1 = global, 2 = loopback only.
tcp_window_scaling u8 (bool) 1 Enable RFC 1323 window scaling (required for windows > 64 KB).
tcp_sack u8 (bool) 1 Enable RFC 2018 Selective Acknowledgments.
tcp_timestamps u8 (bool) 1 Enable RFC 1323 timestamps (used for RTT measurement and PAWS).
tcp_ecn u8 (0/1/2) 2 Explicit Congestion Notification. 0 = disabled, 1 = enabled, 2 = server-only (negotiate if peer requests).
tcp_congestion_control string "cubic" Default congestion control algorithm (matches Linux; see Section 16.10).
tcp_available_congestion_control string (read-only) Space-separated list of registered algorithms. Not writable.
tcp_rmem 3 × u32 4096 131072 6291456 Min, default, max TCP receive buffer sizes (bytes). Auto-tuning operates within [min, max].
tcp_wmem 3 × u32 4096 16384 4194304 Min, default, max TCP send buffer sizes (bytes).
tcp_mem 3 × u64 (pages) auto Low, pressure, high watermarks for total TCP memory consumption (in pages). Below low: no pressure. Above high: new allocations may fail. Auto-computed at boot from total system memory.
tcp_slow_start_after_idle u8 (bool) 1 Reset cwnd to initial window after an idle period (RFC 2861). Set to 0 for long-lived connections with bursty traffic.
tcp_no_metrics_save u8 (bool) 0 If 1, do not cache TCP metrics (RTT, cwnd) in the route cache on connection close.
tcp_base_mss u32 1024 Starting MSS for Path MTU Discovery (PMTUD) search.
tcp_mtu_probing u8 (0/1/2) 0 Enable PMTUD probing. 0 = disabled, 1 = enabled when ICMP blackhole detected, 2 = always enabled.
tcp_fastopen u32 (bitmask) 0x1 TFO configuration. Bit 0 (0x1) = client enable, Bit 1 (0x2) = server enable, Bit 2 (0x4) = client send data in SYN without cookie, Bit 9 (0x200) = server accept data in SYN without cookie.
tcp_fastopen_key hex string random 128-bit server TFO cookie key (hex, e.g., "00112233-44556677-8899aabb-ccddeeff"). Writable for cluster-consistent TFO.

Implementation notes: - All entries are readable/writable via both /proc/sys/net/ipv4/ and the sysctl(2) system call. - Values are validated on write: out-of-range values return EINVAL. - Per-namespace scoping means Docker containers and Kubernetes pods see isolated sysctl namespaces (consistent with Linux net.ipv4.tcp_* namespace support). - tcp_mem auto-computation follows the Linux heuristic: low = total_pages/16, pressure = total_pages/8, high = total_pages/4, clamped to sane minimums.
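The tcp_mem heuristic in the notes above can be sketched directly; the clamping floor here is an assumed value for illustration, since the spec only requires "sane minimums":

```rust
// tcp_mem auto-computation: low/pressure/high watermarks at 1/16, 1/8, 1/4
// of total pages, clamped to a floor. FLOOR_PAGES is an assumption, not a
// specified constant.

fn tcp_mem_defaults(total_pages: u64) -> (u64, u64, u64) {
    const FLOOR_PAGES: u64 = 1024; // assumption: 4 MiB at 4 KiB pages
    let low = (total_pages / 16).max(FLOOR_PAGES);
    let pressure = (total_pages / 8).max(low);
    let high = (total_pages / 4).max(pressure);
    (low, pressure, high)
}
```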

16.10.7 /proc/net/ Filesystem Entries

UmkaOS must expose the following /proc/net/ entries for compatibility with Docker, Kubernetes, monitoring tools (Prometheus node_exporter, Datadog agent, ss, netstat, ip, sar), and container health checks. All entries must match Linux output byte-for-byte — tools parse these files with hardcoded field offsets, column positions, and sscanf/awk patterns.

Each entry is per-network-namespace (containers see only their own network state). Implementation uses umka's procfs layer (Section 20.5).

Path Format Description
/proc/net/dev Fixed-width columns, header on lines 1-2 Per-interface statistics: interface name, rx bytes, rx packets, rx errs, rx drop, rx fifo, rx frame, rx compressed, rx multicast, tx bytes, tx packets, tx errs, tx drop, tx fifo, tx colls, tx carrier, tx compressed. One row per interface. Used by ifconfig, Prometheus node_network_* metrics, sar -n DEV.
/proc/net/snmp <Protocol>: <field names>\n<Protocol>: <values>\n pairs SNMP MIB-II counters. Sections: Ip, Icmp, IcmpMsg, Tcp, Udp, UdpLite. Each section has a header line (field names) followed by a values line. Used by SNMP exporters, netstat -s, monitoring dashboards.
/proc/net/netstat Same format as /proc/net/snmp Extended TCP/IP statistics. Sections: TcpExt (SYN cookies, listen overflows, out-of-window drops, fast retransmits, etc.), IpExt (InOctets, OutOctets, InMcastPkts, etc.). Used by netstat -s, ss -s, TCP debugging.
/proc/net/tcp Fixed columns, header on line 1 TCP socket table. Columns: sl, local_address (hex IP:port), rem_address, st (state), tx_queue:rx_queue, tr:tm->when, retrnsmt, uid, timeout, inode, plus additional fields. Hex-encoded IPv4 addresses (little-endian on little-endian hosts). Used by ss, netstat -tnp, container health probes.
/proc/net/tcp6 Same format as tcp TCP6 socket table. IPv6 addresses as 32-hex-char strings. Required for IPv6-enabled containers and dual-stack Kubernetes.
/proc/net/udp Same format as tcp (fewer fields) UDP socket table. Columns match Linux's format. Used by ss -unp, netstat -unp.
/proc/net/udp6 Same format as udp UDP6 socket table.
/proc/net/unix Fixed columns, header on line 1 Unix domain socket table. Columns: Num, RefCount, Protocol, Flags, Type, St, Inode, Path. Used by ss -x, container debugging.
/proc/net/if_inet6 Space-separated, no header IPv6 interface addresses. Columns: address (32 hex chars, no colons), ifindex (hex), prefix_len (hex), scope (hex), flags (hex), ifname. Used by ip -6 addr, NetworkManager, container IPv6 setup.
/proc/net/route Tab-separated, header on line 1 IPv4 routing table (FIB). Columns: Iface, Destination, Gateway, Flags, RefCnt, Use, Metric, Mask, MTU, Window, IRTT. All addresses in hex (network byte order). Used by route -n, legacy routing tools. ip route uses netlink but some containers still parse this file.
/proc/net/arp Fixed columns, header on line 1 ARP cache. Columns: IP address, HW type, Flags, HW address, Mask, Device. Used by arp -n, container network debugging, ARP monitoring.
/proc/net/fib_trie Indented tree structure Routing trie dump. Shows the LC-trie structure of the IPv4 FIB. Used by ip route show table all internals and network diagnostic tools.
/proc/net/fib_triestat Key-value pairs FIB trie statistics: number of nodes, leaves, prefixes, null pointers, trie depth. Used by routing performance analysis tools.

Implementation requirements:

  • Atomicity: Each read() must return a consistent snapshot. Use seq_file-style iteration with RCU read-side protection for socket tables and routing tables, so readers never see partial updates.
  • Hex encoding: IPv4 addresses in /proc/net/tcp, /proc/net/route, and /proc/net/arp are encoded as 8-hex-character little-endian (on little-endian hosts) values — this matches Linux's %08X format and tools depend on it.
  • Performance: /proc/net/tcp can be large on busy servers (100K+ sockets). Use seq_file pagination to avoid allocating the entire output in kernel memory. Kubernetes liveness probes may read these files every few seconds.
  • Namespace isolation: Each entry shows only the network state visible within the reading process's network namespace. A container must not see the host's socket table.
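The hex-encoding rule can be illustrated concretely. This sketch assumes a little-endian host, where printing the network-byte-order address via %08X reverses the octets — the behaviour that ss/netstat parsers depend on:

```rust
// /proc/net/tcp address encoding: the IPv4 address is printed as the raw
// in-memory u32 (%08X), so a little-endian host shows the octets reversed;
// the port is printed as a plain number (%04X).

fn proc_net_tcp_addr(ip: [u8; 4], port: u16) -> String {
    let v = u32::from_le_bytes(ip); // little-endian host: octets reversed
    format!("{:08X}:{:04X}", v, port)
}
```

So 127.0.0.1:8080 renders as `0100007F:1F90`, matching Linux.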

16.11 MPTCP as First-Class Transport

MPTCP (RFC 8684) is designed into umka-net from the start, not retrofitted onto an existing TCP implementation. This avoids the years of integration pain that Linux experienced.

Architecture:

                    MPTCP Connection
                   /       |        \
              Subflow 0  Subflow 1  Subflow 2
              (WiFi)     (LTE)      (Ethernet)
                 |          |          |
              TCP stack  TCP stack  TCP stack
              (per-subflow congestion control)

Key design decisions:

  • Subflow management: A path manager component handles subflow creation and teardown. It monitors available network interfaces and creates subflows when new paths appear (e.g., WiFi connects). Subflow teardown is graceful (DATA_FIN) or abrupt (RST on path failure).

  • Packet scheduler: Distributes data segments across subflows. Built-in policies: round-robin, lowest-RTT (send on the subflow with the shortest current RTT estimate), and redundant (duplicate on all subflows for ultra-low-latency). Scheduler is pluggable via a trait, same pattern as congestion control.

  • Sequence number separation: Connection-level Data Sequence Numbers (DSN) are independent of per-subflow TCP sequence numbers. This is architecturally baked in — the MPTCP layer maintains a DSN-to-subflow-sequence mapping, and the per-subflow TCP machines operate with their own sequence spaces. In Linux, this separation was retrofitted and required careful locking; in UmkaOS, the type system enforces the distinction (DataSeqNum vs SubflowSeqNum are distinct newtypes).

  • Middlebox fallback: If a middlebox strips MPTCP options from the SYN/ACK, the connection falls back to single-path TCP transparently. The application sees a working connection regardless.

  • Use cases requiring MPTCP: iOS and macOS use MPTCP for seamless WiFi/cellular handoff. Multipath TCP proxies improve connection reliability for mobile clients. WireGuard multipath tunnels bond multiple network paths for increased throughput.
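The newtype distinction mentioned above can be sketched as follows (illustrative definitions; the `advance()` helpers and exact layout are assumptions, not the specified API):

```rust
// DataSeqNum / SubflowSeqNum as distinct newtypes: mixing connection-level
// DSNs with per-subflow TCP sequence numbers is a compile error. Widths
// follow RFC 8684 (64-bit DSN) and TCP (32-bit).

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct DataSeqNum(pub u64);    // connection-level Data Sequence Number

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct SubflowSeqNum(pub u32); // per-subflow TCP sequence number

impl DataSeqNum {
    pub fn advance(self, bytes: u64) -> Self {
        DataSeqNum(self.0.wrapping_add(bytes)) // sequence-space wraparound
    }
}

impl SubflowSeqNum {
    pub fn advance(self, bytes: u32) -> Self {
        SubflowSeqNum(self.0.wrapping_add(bytes))
    }
}
// A function taking DataSeqNum cannot be handed a SubflowSeqNum: any
// translation between the two spaces must go through the DSN map.
```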

MPTCP Scheduler Trait. The packet scheduler is pluggable via the MptcpSchedulerOps trait, following the same registration pattern as CongestionOps (Section 16.10.3):

/// Trait for MPTCP packet schedulers. Distributes data segments across
/// available subflows based on path quality, policy, or redundancy needs.
///
/// Registration: `mptcp_scheduler_register(&'static dyn MptcpSchedulerOps)`.
/// Selection: per-socket via `setsockopt(IPPROTO_MPTCP, MPTCP_SCHEDULER, name)`,
/// or system-wide via `/proc/sys/net/mptcp/scheduler`.
pub trait MptcpSchedulerOps: Send + Sync {
    /// Scheduler name (ASCII, max 16 bytes including NUL).
    fn name(&self) -> &'static str;

    /// Called when a new MPTCP connection selects this scheduler.
    /// Initialise per-connection scheduler state in `conn.sched_priv`.
    fn init(&self, conn: &mut MptcpConnection);

    /// Called when the connection is destroyed or scheduler is changed.
    /// Release per-connection resources allocated in `init`.
    fn release(&self, conn: &mut MptcpConnection);

    /// Select the subflow(s) on which to send the next data segment.
    ///
    /// `candidates` is the set of subflows currently in ESTABLISHED state
    /// with available cwnd. Returns one or more subflow indices. Returning
    /// multiple indices sends the segment redundantly (used by `redundant`).
    ///
    /// Called on the send path for every segment — must be O(n) in subflow
    /// count or better. Typical connection has 2-4 subflows.
    fn select_subflow(
        &self,
        conn: &MptcpConnection,
        candidates: &[&MptcpSubflow],
        segment_len: u32,
    ) -> ArrayVec<SubflowIndex, MAX_MPTCP_SUBFLOWS>;

    /// Notify scheduler of a subflow state change (new subflow added,
    /// subflow removed, RTT update, cwnd change). Schedulers that cache
    /// per-subflow metrics update them here.
    fn subflow_event(&self, conn: &mut MptcpConnection, event: MptcpSubflowEvent) {
        let _ = (conn, event);
    }
}

/// Upper bound for subflows per MPTCP connection. Linux default limit is 2
/// (configurable via `ip mptcp limits`). 8 covers all practical deployments.
pub const MAX_MPTCP_SUBFLOWS: usize = 8;

/// Events delivered to `subflow_event`.
pub enum MptcpSubflowEvent {
    /// New subflow reached ESTABLISHED state.
    SubflowAdded { idx: SubflowIndex },
    /// Subflow closed or failed.
    SubflowRemoved { idx: SubflowIndex },
    /// RTT estimate updated (smoothed RTT in microseconds, matching TcpCb.srtt_us).
    RttUpdated { idx: SubflowIndex, srtt_us: u32 },
    /// Congestion window changed (bytes). u64 to accommodate datacenter
    /// environments with cwnd > 4 GiB (100 Gbps × 400ms RTT).
    CwndChanged { idx: SubflowIndex, cwnd: u64 },
}

Core data structures:

/// Type alias for subflow array index. Values 0..MAX_MPTCP_SUBFLOWS-1.
pub type SubflowIndex = u8;

/// MPTCP connection-level state. One per MPTCP socket.
/// Contains the subflow list, connection-level sequence space, and
/// receive reassembly buffer. The scheduler and path manager operate
/// on this structure.
///
/// **Slab allocation**: `MptcpConnection` is ~6.5 KB (dominated by
/// `ArrayVec<DsnMapEntry, 256>` at 6144 bytes). Use a dedicated slab cache
/// (`mptcp_conn_cache`) instead of the generic kmalloc-8192 class to minimize
/// internal fragmentation (~1.5 KB waste per allocation in generic slab).
// Kernel-internal, not KABI.
pub struct MptcpConnection {
    /// Local connection key (64-bit random, from MP_CAPABLE handshake).
    pub local_key: u64,
    /// Remote connection key (64-bit, from peer's MP_CAPABLE).
    pub remote_key: u64,
    /// Connection-level Data Sequence Number (DSN) for the next byte to send.
    /// Independent of per-subflow TCP sequence numbers.
    pub snd_nxt: DataSeqNum,
    /// Highest DSN acknowledged by the receiver (DATA_ACK).
    pub snd_una: DataSeqNum,
    /// Next expected receive DSN.
    pub rcv_nxt: DataSeqNum,
    /// Active subflows. Each `MptcpSubflow` is individually slab-allocated
    /// (warm-path: MP_JOIN handshake) and accessed via `SlabBox`. This avoids
    /// embedding 8 full TcpCb instances (~512-768 bytes each) inline, which
    /// would make the MptcpConnection struct ~12 KB — causing slab fragmentation.
    /// With pointer indirection, the base struct stores 8 pointers (64 bytes)
    /// and individual subflows live in their own slab cache with better
    /// utilization. The common case (2 subflows) allocates only 2 slab objects.
    pub subflows: ArrayVec<SlabBox<MptcpSubflow>, MAX_MPTCP_SUBFLOWS>,
    /// DSN-to-subflow-sequence mapping for the send path.
    /// Tracks which subflow carries which DSN range, for retransmission.
    /// Fixed-capacity ring buffer (bounded by MAX_DSN_MAP_ENTRIES).
    pub dsn_map: ArrayVec<DsnMapEntry, MAX_DSN_MAP_ENTRIES>,
    /// Receive reassembly buffer. Out-of-order segments (received on
    /// different subflows in non-DSN order) are held here until the gap
    /// is filled.
    ///
    /// **BTreeMap justified**: the reassembly drain path uses `range(rcv_nxt..)`
    /// to iterate contiguous segments starting from the next expected DSN.
    /// This is a genuine range-query pattern that XArray does not support
    /// efficiently. Warm path (per-segment, not per-packet on the fast path).
    ///
    /// **Enforcement**: Reorder queue bytes count toward `SockCommon.rcvbuf`
    /// accounting. The MPTCP RX path checks before insertion:
    ///   `reorder_bytes + segment_len > sock.common.rcvbuf` → drop segment.
    /// Additionally, a maximum entry count is enforced:
    ///   `max_entries = sock.common.rcvbuf / mss` (where `mss` is the
    ///   effective MPTCP-level MSS, typically 1460 for Ethernet).
    /// This bounds BTreeMap node overhead:
    ///   `max_btree_overhead = rcvbuf / MSS * sizeof(BTreeNode)`
    ///   `= 256K / 1460 * ~48 = ~8.4 KB` (negligible vs rcvbuf).
    /// Without the entry count limit, an attacker sending 1-byte MPTCP
    /// data ranges could amplify a 256 KB rcvbuf into ~13.6 MB of
    /// BTreeMap node overhead.
    pub reorder_queue: BTreeMap<DataSeqNum, ReorderEntry>,
    /// Current total bytes in the reorder queue (for rcvbuf enforcement).
    /// Incremented on reorder_queue insertion, decremented when segments
    /// are drained via `range(rcv_nxt..)` into the socket receive buffer.
    pub reorder_bytes: u32,
    /// Opaque per-connection scheduler state. Allocated by
    /// `MptcpSchedulerOps::init()`, freed by `release()`.
    /// Size bounded by MAX_SCHED_PRIV_BYTES (256).
    pub sched_priv: [u8; MAX_SCHED_PRIV_BYTES],
}

/// Distinct newtype for connection-level Data Sequence Numbers.
/// Prevents confusion with per-subflow TCP sequence numbers at compile time.
#[derive(Copy, Clone, Ord, PartialOrd, Eq, PartialEq)]
pub struct DataSeqNum(pub u64);

/// Per-subflow state within an MPTCP connection.
pub struct MptcpSubflow {
    /// Index within `MptcpConnection.subflows`.
    pub index: SubflowIndex,
    /// The underlying TCP control block for this subflow.
    pub tcp: TcpCb,
    /// Subflow-level sequence number tracking (independent of DSN).
    pub ssn_offset: u32,
    /// Cached smoothed RTT for this subflow path (microseconds). Shadowed
    /// from `tcp.lock: SpinLock<TcpMutableState>` for lock-free scheduler access.
    ///
    /// **Cache update protocol**: After the TCP stack updates `srtt_us` in
    /// `tcp_rcv_established()` (under `TcpCb.lock`), it calls
    /// `mptcp_conn.scheduler.subflow_event(RttUpdated { idx, srtt_us })`.
    /// The scheduler's `subflow_event` implementation writes
    /// `subflow.srtt_us = srtt_us`. The stale read window between the TCP
    /// update and the scheduler callback is bounded by one RTT — acceptable
    /// because the scheduler's decision is itself based on sampled RTTs.
    pub srtt_us: u32,
    /// Cached congestion window (bytes). Same update protocol as srtt_us:
    /// `subflow_event(CwndChanged { idx, cwnd })`.
    pub cwnd: u64,
    /// Subflow state.
    pub state: MptcpSubflowState,
    /// Network interface this subflow is bound to.
    pub bound_ifindex: u32,
}

/// MPTCP subflow lifecycle state.
#[repr(u8)]
pub enum MptcpSubflowState {
    /// MP_JOIN handshake in progress.
    Joining   = 0,
    /// Subflow is established and carrying data.
    Active    = 1,
    /// Subflow is being gracefully closed (DATA_FIN sent).
    Closing   = 2,
    /// Subflow has failed (RST received, timeout, or path failure).
    Failed    = 3,
}

/// Entry in the DSN-to-subflow-sequence mapping table.
/// Tracks which subflow carries which DSN range for retransmission.
pub struct DsnMapEntry {
    pub dsn_start: DataSeqNum,
    pub dsn_end: DataSeqNum,
    pub subflow_idx: SubflowIndex,
    pub subflow_seq: u32,
}

/// Maximum DSN map entries per connection. Bounded to prevent unbounded
/// memory growth. Oldest entries are evicted when the limit is reached
/// (they are no longer needed once DATA_ACK advances past them).
const MAX_DSN_MAP_ENTRIES: usize = 256;

/// Maximum per-connection scheduler private state (bytes).
const MAX_SCHED_PRIV_BYTES: usize = 256;

/// Entry in the out-of-order receive reassembly queue.
pub struct ReorderEntry {
    pub dsn_end: DataSeqNum,
    pub data_offset: u32,
    pub data_len: u32,
}

MP_CAPABLE / MP_JOIN handshake: Connection establishment follows RFC 8684 Section 3.1. The initiator sends a SYN with MP_CAPABLE option containing local_key. The responder echoes with SYN+ACK containing remote_key. The third ACK confirms. For subflow addition (MP_JOIN, RFC 8684 Section 3.2): the initiator sends SYN with MP_JOIN containing a truncated HMAC of the connection keys; the responder validates via its own HMAC computation. The handshake state machine is managed by the path manager, not the scheduler.

Built-in schedulers (registered at boot):

| Name | Algorithm | Use Case |
|------|-----------|----------|
| `default` | Lowest-RTT: send on the subflow with the smallest `srtt_us` | General-purpose, minimises latency |
| `roundrobin` | Cycle through subflows in order, skipping those with zero cwnd | Maximum aggregate throughput |
| `redundant` | Send on ALL candidate subflows simultaneously | Ultra-low-latency (VoIP, gaming); tolerates subflow loss |
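A minimal sketch of the `default` scheduler's selection logic. The types here are simplified stand-ins: the real implementation operates on `MptcpSubflow` (using the cached `srtt_us`/`cwnd` fields) and returns an `ArrayVec` of indices, and the `inflight` field is an assumption for illustrating cwnd-headroom filtering:

```rust
#[derive(Clone, Copy)]
struct Subflow {
    index: u8,
    srtt_us: u32,  // cached smoothed RTT (microseconds)
    cwnd: u64,     // cached congestion window (bytes)
    inflight: u64, // bytes sent but not yet acknowledged (illustrative)
}

/// Lowest-RTT selection: among ESTABLISHED subflows with enough cwnd
/// headroom for `segment_len` bytes, pick the one with the smallest
/// smoothed RTT. O(n) in subflow count, as the trait contract requires.
fn select_lowest_rtt(candidates: &[Subflow], segment_len: u32) -> Option<u8> {
    candidates
        .iter()
        .filter(|sf| sf.inflight + segment_len as u64 <= sf.cwnd)
        .min_by_key(|sf| sf.srtt_us)
        .map(|sf| sf.index)
}
```

The `redundant` scheduler differs only in returning every index that passes the headroom filter instead of the minimum.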

16.12 Domain Switch Overhead Analysis

Clarification: umka-core mediates domain transitions (switching PKRU/POR_EL0 state) but does NOT copy packet data. The NIC driver and umka-net share a zero-copy ring buffer in shared memory (accessible to both domains via PKEY 1). Domain switches occur when the CPU transitions between executing umka-core code (for dispatch/scheduling), NIC driver code (for DMA completion processing), and umka-net code (for TCP/IP processing). The 4 switches represent: (1) umka-core→NIC driver for interrupt dispatch, (2) NIC driver→umka-core on return, (3) umka-core→umka-net for protocol processing, (4) umka-net→umka-core on return. Data flows through shared-memory ring buffers without additional copies.

The network stack (umka-net, Tier 1) runs in its own isolation domain. Every packet traverses two domain boundaries, each requiring an entry and exit switch (4 switches total):

  1. umka-core to NIC driver and back: domain switch to enter NIC driver domain for interrupt handling (~23 cycles), then domain switch to return to umka-core (~23 cycles)
  2. umka-core to umka-net and back: domain switch to enter umka-net domain for TCP processing (~23 cycles), then domain switch to return to umka-core for socket delivery (~23 cycles)

Cross-NUMA note: The ~23-cycle figure assumes intra-node execution (WRPKRU is a local register write). Cross-NUMA-node domain switches add ~100-200 cycles of additional latency if the NAPI poll thread and the protocol processing thread run on different NUMA nodes (due to remote memory access for shared ring buffer metadata). IRQ affinity and NAPI thread pinning (Section 16.14) mitigate this by ensuring all per-flow processing stays NUMA-local.

For high-throughput networking (100 Gbps), the overhead matters.

Per-packet cost analysis (1500-byte frames at 100 Gbps = ~8.3M packets/sec):

Domain switches per packet:      4 (2 domain entries x 2 switches each)
Cycles per switch:               ~23 (WRPKRU, per [Section 11.2](11-drivers.md#isolation-mechanisms-and-performance-modes))
Total domain switch overhead/packet:  ~92 cycles (~20ns at 4.5 GHz)
Time budget per packet:          ~120ns (at 8.3M pps)
Domain switch overhead fraction: ~17% (at 4.5 GHz) to ~26% (at 3 GHz)

This 17-26% overhead (depending on clock speed) is unacceptable for production networking. Four mitigations reduce it to a negligible fraction:

  • Batching + GRO amortization: Process packets in batches of up to 64. The NIC driver poll processes the entire batch in a single domain switch pair (2 switches: enter/exit Tier 1 driver). umka-core's NAPI handler (Tier 0) then reconstructs NetBufs and delivers them to umka-net (Tier 1), where GRO coalesces raw packets into super-packets before protocol processing. Each GRO-coalesced super-packet is delivered as a single unit, so the umka-net domain switch pair (2 switches) is amortized across the GRO coalescing ratio. For a typical GRO ratio of ~16:1, a batch of 64 raw packets yields ~4 GRO-coalesced super-packets. Total switches per batch: 2 (driver poll) + 2 (umka-net delivery) = 4 switches for 64 raw packets, or ~0.06 switches/packet. Without GRO (e.g., non-coalesceable UDP traffic), the cost is still 4 switches per batch (the entire batch is delivered to umka-net in one domain switch pair), yielding ~0.06 switches/packet at batch size 64.
  • NAPI-style polling: After the first interrupt, switch to polling mode. The NIC driver translates its hardware-specific completion events into standardized KABI completion descriptors, written to a shared isolation domain (the shared read-only PKEY per Section 10.2's domain allocation table) accessible to both umka-net and the NIC driver. umka-net reads the KABI descriptors directly from the shared domain without a per-packet domain switch. Placing the completion descriptors in this shared domain, rather than the NIC driver's private domain, follows the standard UmkaOS pattern for zero-copy data exchange between domains; write access to ring doorbell registers remains in the NIC driver's private domain, so umka-net can observe completions but cannot manipulate the hardware directly. No per-packet interrupt or domain switch occurs while in polling mode. The polling-to-interrupt transition uses an adaptive threshold based on packet rate.
  • XDP fast path: XDP programs run in a dedicated BPF isolation domain (triggered from the driver's RX handler via a bounce buffer copy). Packets that are dropped, redirected, or TX-bounced by XDP never incur the driver-to-umka-net domain switch. For workloads like DDoS mitigation where >90% of packets are dropped, this eliminates nearly all domain switches. The driver→BPF domain switch cost (~23 cycles on x86-64) is amortized by NAPI batching.
  • GRO (Generic Receive Offload): Coalesce multiple small packets into larger aggregates before delivery across domain boundaries, amortizing each domain crossing across multiple original packets.
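The amortization arithmetic above can be checked with a small model. The batch size, GRO ratio, switch cost, and clock figures are the ones quoted in this section; the function names are illustrative:

```rust
/// Domain switches per raw packet under NAPI batching: 2 switches for
/// the driver poll pair plus 2 for the umka-net delivery pair, amortized
/// over the whole batch (GRO coalescing does not change the switch count,
/// only the number of units delivered).
fn switches_per_packet(batch: u32) -> f64 {
    let total_switches = 2 + 2; // driver enter/exit + umka-net enter/exit
    total_switches as f64 / batch as f64
}

/// Fraction of the per-packet time budget consumed by domain switches.
fn overhead_fraction(batch: u32, cycles_per_switch: f64, ghz: f64, pps: f64) -> f64 {
    let ns_per_packet = switches_per_packet(batch) * cycles_per_switch / ghz;
    ns_per_packet / (1e9 / pps) // budget in ns = 1e9 / packets-per-second
}
```

At batch size 1 this reproduces the ~17% figure (4 switches × 23 cycles at 4.5 GHz against a ~120 ns budget at 8.3M pps); at batch size 64 it drops well below 1%.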

NIC Hardware Offloads

Modern NICs perform significant protocol processing in hardware, offloading work from the CPU. UmkaOS exposes these through the NIC driver's KABI and umka-net configuration:

| Offload | Direction | Description | Benefit |
|---------|-----------|-------------|---------|
| TSO (TCP Segmentation Offload) | TX | Application sends large (up to 64KB) TCP segments; NIC splits them into MTU-sized packets with correct TCP sequence numbers and checksums | Eliminates per-packet CPU segmentation; up to 5x throughput improvement for bulk transfers |
| GSO (Generic Segmentation Offload) | TX | Software fallback for TSO — umka-net segments just before the NIC driver if hardware TSO is unavailable. Also handles UDP (UFO) and tunnel-encapsulated packets (GSO_ENCAP) | Same API for applications regardless of NIC capability |
| GRO (Generic Receive Offload) | RX | Coalesce multiple received packets into larger aggregates before protocol processing | Reduces per-packet overhead; amortizes domain switch cost |
| TX Checksum Offload | TX | NIC computes TCP/UDP/IP checksums in hardware; umka-net marks the NetBuf with CHECKSUM_PARTIAL and provides the checksum start/offset | Saves ~50ns CPU per packet |
| RX Checksum Offload | RX | NIC verifies checksums and reports status; umka-net skips software verification for CHECKSUM_COMPLETE or CHECKSUM_UNNECESSARY packets | Saves ~50ns CPU per packet |
| Scatter-Gather I/O | TX | NIC can DMA from non-contiguous memory (multiple physical pages); umka-net passes a scatter-gather list instead of copying to a contiguous buffer | Eliminates linearization copy for large packets |

Offload capabilities are queried at driver bind time via NicDriver::query_offloads() and are individually toggleable at runtime via sysfs (/sys/class/net/<dev>/offload/{tso,gso,tx_csum,rx_csum,sg}), matching Linux's ethtool -K semantics. Offloads are enabled by default when the NIC reports support. GSO is always available as a software fallback for NICs without TSO.
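The runtime toggle path reduces to a mask-and-veto sequence: clamp the requested set to `hw_features`, then give the driver's `fix_features()` a chance to veto. A sketch with illustrative bit values (the real `NetDevFeatures` is a KABI type, and the TSO-requires-SG-and-checksum rule shown here is an assumed example of a common hardware constraint):

```rust
// Illustrative feature bits; the real NetDevFeatures layout is KABI-defined.
const TSO: u32 = 1 << 0;
const TX_CSUM: u32 = 1 << 1;
const SG: u32 = 1 << 2;

/// Resolve a feature-change request from sysfs: features the hardware
/// lacks are silently masked off, then the driver veto runs (modeled
/// here as: TSO cannot be enabled without scatter-gather and TX checksum
/// offload, since segmentation in hardware depends on both).
fn resolve_features(requested: u32, hw_features: u32) -> u32 {
    let mut f = requested & hw_features; // cannot enable what hw lacks
    if f & TSO != 0 && f & (SG | TX_CSUM) != (SG | TX_CSUM) {
        f &= !TSO; // driver fix_features() veto
    }
    f
}
```

The resolved set is then handed to `set_features()` to be applied to hardware.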

Receive Flow Steering (RFS / aRFS)

On multi-queue NICs, interrupt affinity determines which CPU processes each received packet. Without flow steering, a packet may arrive on CPU 0 (interrupt handler) while the consuming application runs on CPU 5 — the packet traverses the socket buffer cache-cold, adding ~2-5μs cross-CPU latency at high packet rates.

UmkaOS implements both software and hardware flow steering:

  • RFS (Receive Flow Steering) — software-based. When a socket performs recvmsg(), umka-net records the {flow_hash → cpu} mapping in a per-NIC flow table. On the next packet for that flow, the softirq handler checks the table and, if the target CPU differs from the current CPU, enqueues the packet to the target CPU's backlog via inter-processor interrupt (IPI). This steers subsequent packets to the CPU where the application is running, improving cache locality.
sysfs control:
/sys/class/net/<dev>/queues/rx-<N>/rps_flow_cnt  — entries per RX queue (default: 0 = disabled)
/proc/sys/net/core/rps_sock_flow_entries         — global flow table size (default: 0 = disabled)
  • aRFS (Accelerated RFS) — hardware-based. For NICs that support hardware flow steering (Intel i40e/ice, Mellanox mlx5, Broadcom bnxt), umka-net programs the NIC's flow director or n-tuple filter table to steer packets to the correct RX queue at the hardware level. This eliminates the software IPI redirect — the NIC delivers the packet directly to the correct CPU's RX queue via MSI-X.

The NIC driver implements the ndo_rx_flow_steer() KABI method. umka-net calls it when the flow table is updated. aRFS is preferred over RFS when the NIC supports it; umka-net falls back to software RFS automatically.
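On the RX softirq path, the RFS decision is a table lookup plus a CPU comparison. A sketch with a `HashMap` standing in for the per-NIC flow table (the real table is fixed-size and indexed by flow hash):

```rust
use std::collections::HashMap;

#[derive(Debug, PartialEq)]
enum Steer {
    /// Process on the current CPU (no table entry, or already local).
    Local,
    /// Enqueue to the target CPU's backlog via IPI.
    Ipi { target_cpu: u32 },
}

/// RFS lookup: recvmsg() previously recorded {flow_hash -> cpu}; if the
/// recorded CPU differs from the one running the softirq, the packet is
/// redirected so it is processed cache-hot next to the application.
fn rfs_steer(flow_table: &HashMap<u32, u32>, flow_hash: u32, current_cpu: u32) -> Steer {
    match flow_table.get(&flow_hash) {
        Some(&cpu) if cpu != current_cpu => Steer::Ipi { target_cpu: cpu },
        _ => Steer::Local,
    }
}
```

With aRFS, the same table update instead programs the NIC's flow director via `rx_flow_steer()`, so the `Ipi` arm is never taken for hardware-steered flows.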

Socket-Level Busy Polling (SO_BUSY_POLL)

For latency-critical applications (HFT, DPDK-adjacent workloads, <10μs requirement), interrupt-driven packet delivery has an inherent latency floor: the time from NIC DMA completion → MSI-X → interrupt handler → softirq → socket wakeup is typically 5-20μs even with well-tuned interrupt coalescing.

Busy polling eliminates this floor by having the application poll the NIC's completion queue directly from the recvmsg() / poll() / epoll_wait() syscall path:

Per-socket:
  setsockopt(fd, SOL_SOCKET, SO_BUSY_POLL, &timeout_us, sizeof(timeout_us));

Global default:
  /proc/sys/net/core/busy_poll = <microseconds>    — busy-poll timeout for poll()/select()
  /proc/sys/net/core/busy_read = <microseconds>    — busy-poll timeout for read()/recvmsg()

When busy polling is active, recvmsg() and epoll_wait() spin in a tight loop calling the NAPI poll function (napi_poll()) directly from process context, which checks the RX completion queue without waiting for an interrupt. (Note: Linux removed the per-driver ndo_busy_poll() callback in kernel 4.11; busy polling now goes through the NAPI subsystem uniformly.) The thread burns CPU cycles during the poll window but reduces receive latency to ~1-3μs (NIC DMA completion → next poll iteration).

Trade-offs: Busy polling trades CPU efficiency for latency. A thread busy-polling at 50μs timeout wastes those cycles if no packet arrives. This is appropriate for dedicated-CPU, latency-critical workloads (trading NICs, real-time control) but inappropriate for shared servers. The per-socket granularity ensures only opted-in sockets pay the CPU cost.
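The busy-poll receive path is essentially a deadline-bounded spin over the NAPI poll function. A sketch with a closure standing in for `napi_poll()` and the per-socket `SO_BUSY_POLL` timeout (function name and return convention are illustrative):

```rust
use std::time::{Duration, Instant};

/// Spin calling the NAPI poll function until packets arrive or the
/// per-socket busy-poll timeout expires. Returns the number of packets
/// made available, or 0 on timeout (the caller then falls back to
/// interrupt-driven blocking on the socket).
fn busy_poll_recv(mut napi_poll: impl FnMut() -> u32, timeout_us: u64) -> u32 {
    let deadline = Instant::now() + Duration::from_micros(timeout_us);
    loop {
        // Check the RX completion queue directly; no interrupt wait.
        let n = napi_poll();
        if n > 0 {
            return n;
        }
        if Instant::now() >= deadline {
            return 0; // timed out: sleep until the next interrupt
        }
        std::hint::spin_loop();
    }
}
```

The spin window is what converts CPU cycles into the ~1-3μs latency floor described above.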

Measured overhead target: With batching + NAPI polling active, the domain switch overhead for sustained 100 Gbps throughput is <2% of CPU time. This is comparable to Linux's combined interrupt + softirq overhead for the same workload, making domain isolation effectively free at high packet rates.

The stated overhead targets assume NAPI-64 batching (64 packets per poll cycle). Without batching, per-packet domain switch overhead dominates. NAPI batching is a required prerequisite for achieving the performance budget, not an optional optimization.

16.13 Network Device Interface (NetDevice)

Every network interface — physical NIC, veth, bridge, VLAN, tunnel — is represented as a NetDevice. This is the core struct for the network data path, equivalent to Linux's struct net_device.

/// Network device descriptor. One per interface (physical or virtual).
///
/// Allocated by the NIC driver (or virtual device creator) via
/// `alloc_netdev()`. Registered with umka-net via `register_netdev()`.
/// Freed when the last reference drops after `unregister_netdev()`.
pub struct NetDevice {
    /// Interface name ("eth0", "wlan0", "veth0", "br0").
    /// Userspace-visible. Max 15 bytes (IFNAMSIZ - 1).
    pub name: [u8; 16],

    /// Interface index (unique, assigned by umka-net on registration).
    /// Userspace-visible via `if_nametoindex()` / `SIOCGIFINDEX`.
    /// u32 with 0 as "no interface" sentinel (Linux uses `int` but valid
    /// indices are always positive; u32 unifies with NetBuf.ifindex).
    pub ifindex: u32,

    /// MAC address (6 bytes for Ethernet, variable for other L2).
    pub dev_addr: [u8; 6],

    /// Maximum transmission unit (bytes). Default 1500 for Ethernet.
    /// Changed via `NetDeviceOps::change_mtu()`.
    pub mtu: AtomicU32,

    /// Hardware feature flags (offloads the device supports).
    pub hw_features: NetDevFeatures,
    /// Active features (subset of hw_features, toggled at runtime).
    pub features: NetDevFeatures,

    /// Network namespace this device belongs to.
    pub net_ns: Arc<NetNamespace>,

    /// Driver-provided operations (open, stop, xmit, etc.).
    pub ops: &'static dyn NetDeviceOps,

    /// Per-queue TX state (one per hardware TX queue).
    /// Heap-allocated at `alloc_netdev()` time with length = `num_tx_queues`
    /// (runtime-discovered from driver). `Box<[TxQueue]>` is used instead of
    /// `ArrayVec` because: (1) TX queue count is runtime-discovered from NIC
    /// hardware, not a compile-time constant; (2) a stable pointer is better
    /// for live kernel evolution (ArrayVec bakes capacity into struct layout);
    /// (3) one allocation per NIC probe is trivially cheap (warm-path).
    pub tx_queues: Box<[TxQueue]>,
    /// Number of TX queues (set by driver at alloc time).
    pub num_tx_queues: u16,

    /// NAPI instances associated with this device (typically one per
    /// RX queue). Registered by the driver during `napi_enable()`.
    /// Heap-allocated at device open time with length = number of NAPI
    /// instances (runtime-discovered from driver). Same rationale as
    /// `tx_queues`: runtime-sized, stable pointer for live evolution,
    /// one allocation per device open (warm-path). SpinLock protects
    /// modifications during device open/close, not per-packet access.
    pub napi_list: SpinLock<Box<[Option<Arc<NapiContext>>]>>,

    /// Link carrier state (up/down). Set by driver via
    /// `netif_carrier_on()` / `netif_carrier_off()`.
    pub carrier: AtomicBool,

    /// Device flags (IFF_UP, IFF_BROADCAST, IFF_MULTICAST, etc.).
    pub flags: AtomicU32,

    /// Traffic control qdisc root (per-device, default: fq_codel).
    /// See [Section 16.21](#traffic-control-and-queue-disciplines).
    pub qdisc: RcuCell<Arc<Qdisc>>,

    /// TX dispatch strategy for tier-aware transmission. Determines how
    /// `start_xmit()` is invoked across isolation boundaries — direct call
    /// for Tier 0, KABI ring for Tier 1, IPC for Tier 2.
    /// Set at `register_netdev()` time based on the driver's isolation tier.
    /// See `TxDispatch` enum for variant descriptions.
    pub tx_dispatch: TxDispatch,

    /// Attached XDP program (set via `NetDeviceOps::bpf()` with `XDP_SETUP_PROG`).
    /// When `Some`, the NIC driver calls `bpf_prog_run_xdp()` on each received
    /// packet before GRO/protocol processing. The program returns XDP_PASS,
    /// XDP_DROP, XDP_TX, or XDP_REDIRECT. Cleared on program detach or device
    /// unregister. Protected by RTNL for updates; readers use RCU.
    pub xdp_prog: RcuCell<Option<Arc<BpfProg>>>,

    /// Per-CPU RX/TX statistics.
    pub stats: PerCpu<NetDevStats>,

    /// Ethtool operations (link settings, ring params, coalesce).
    pub ethtool_ops: Option<&'static dyn EthtoolOps>,

    /// Per-device NetBuf pool. Provides slab-managed packet buffers for
    /// this NIC's RX and TX paths. Created at `open()` time via
    /// `netbuf_pool_create(dev.numa_node, rx_ring_size + tx_ring_size)`.
    /// Destroyed at `stop()` / `unregister_netdev()` time.
    ///
    /// The pool is NUMA-local (allocated on the NIC's NUMA node for optimal
    /// DMA affinity). Each pool entry is a `NetBuf` slab slot (256 bytes
    /// metadata + data page references).
    ///
    /// Access: `dev.netbuf_pool()` returns `&NetBufPool`. Used by:
    /// - `napi_receive_buf()`: converts `NetBuf` → `NetBufHandle`.
    /// - `dev_queue_xmit()`: converts `NetBuf` → `NetBufHandle` before qdisc enqueue.
    /// - TX completion: `NetBufHandle::Drop` returns slot to this pool.
    ///
    /// See [Section 16.5](#netbuf-packet-buffer--netbufpool-per-cpu-slab-pool) for the
    /// full pool specification.
    pub pool: Option<Arc<NetBufPool>>,

    /// KABI ring handle for communication with umka-net (Tier 1).
    /// Used by `napi_deliver_batch()` to submit RX batches.
    /// Set at `register_netdev()` time. `None` for virtual devices
    /// that run inside umka-net (bridge, veth, VLAN).
    pub net_ring_handle: Option<KabiHandle>,

    /// Device-private data (driver state). Opaque to umka-net.
    ///
    /// **Lifecycle**: Set by the driver during `open()` via
    /// `driver_priv.store(ptr, Release)`. Nulled by umka-net during `stop()`
    /// or crash recovery (before the driver's isolation domain is released)
    /// via `driver_priv.store(null_mut(), Release)`.
    ///
    /// **Memory ordering**: `AtomicPtr` with `Release` on store and `Acquire`
    /// on load ensures that concurrent readers on other CPUs see a consistent
    /// null (after crash recovery) or a valid pointer (after open()). Without
    /// atomic ordering, a CPU observing a stale non-null pointer after crash
    /// recovery would dereference freed driver state.
    ///
    /// **Access discipline**: Tier 0 code must check the driver's domain
    /// generation before dereferencing. The pattern is:
    /// ```
    /// let ptr = dev.driver_priv.load(Acquire);
    /// if ptr.is_null() { return Err(EIO); }
    /// // SAFETY: ptr is non-null and domain generation matches.
    /// ```
    pub driver_priv: AtomicPtr<u8>,
}

impl NetDevice {
    /// Returns a reference to this device's NetBuf pool.
    ///
    /// # Panics
    /// Panics if the pool has not been created (device not open).
    /// This is a programming error — `netbuf_pool()` must only be called
    /// after `open()` and before `stop()`.
    #[inline]
    pub fn netbuf_pool(&self) -> &NetBufPool {
        self.pool.as_ref().expect("NetDevice::netbuf_pool() called on closed device")
    }
}

pub struct NetDevStats {
    pub rx_packets: u64,
    pub tx_packets: u64,
    pub rx_bytes: u64,
    pub tx_bytes: u64,
    pub rx_errors: u64,
    pub tx_errors: u64,
    pub rx_dropped: u64,
    pub tx_dropped: u64,
}

/// Per-TX-queue state. One per hardware transmit queue.
pub struct TxQueue {
    /// Queue index.
    pub index: u16,
    /// Queue is stopped (driver has no DMA descriptors available).
    /// Set by driver via `netif_tx_stop_queue()`, cleared via
    /// `netif_tx_wake_queue()` from TX completion interrupt.
    pub stopped: AtomicBool,
    /// NUMA node affinity for this queue.
    pub numa_node: u16,
    /// Per-queue qdisc (child qdisc). For multi-queue devices, each TX
    /// queue has its own qdisc instance (typically a child of the root
    /// qdisc). The root qdisc (`NetDevice.qdisc`) delegates to per-queue
    /// qdiscs via `qdisc_select_queue()`. For single-queue devices or
    /// classless qdiscs, all TX queues share the root qdisc (this field
    /// points to the same `Arc<Qdisc>` as `NetDevice.qdisc`).
    ///
    /// Linux equivalent: `struct netdev_queue.qdisc` (per-queue qdisc
    /// pointer, set by `dev_activate()` and `tc_modify_qdisc()`).
    pub qdisc: Arc<Qdisc>,
}
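Because `stats` is per-CPU, writers increment only their own CPU's slot with no contention; `get_stats64()` readers pay an O(ncpus) sum on demand. A sketch with a slice standing in for `PerCpu<NetDevStats>` (a subset of fields, for brevity):

```rust
#[derive(Default, Clone)]
struct NetDevStats {
    rx_packets: u64,
    tx_packets: u64,
    rx_bytes: u64,
    tx_bytes: u64,
}

/// Fold the per-CPU counters into one snapshot, as a get_stats64()
/// implementation would. The snapshot is not atomic across fields —
/// acceptable for statistics, which tolerate slight skew.
fn sum_stats(per_cpu: &[NetDevStats]) -> NetDevStats {
    per_cpu.iter().fold(NetDevStats::default(), |mut acc, s| {
        acc.rx_packets += s.rx_packets;
        acc.tx_packets += s.tx_packets;
        acc.rx_bytes += s.rx_bytes;
        acc.tx_bytes += s.tx_bytes;
        acc
    })
}
```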

NetDeviceOps trait — driver-provided callbacks for device lifecycle, packet transmission, and configuration. Equivalent to Linux's struct net_device_ops.

/// NIC driver operations. Implemented by every network device driver.
///
/// For Tier 1 drivers, these are called through the KABI vtable
/// (`NicDriverVTable`, [Section 12.1](12-kabi.md#kabi-overview)). For virtual devices (veth,
/// bridge, VLAN), these are direct trait implementations.
pub trait NetDeviceOps: Send + Sync {
    /// Bring the interface up. Allocate RX/TX rings, enable interrupts,
    /// start NAPI. Called when userspace runs `ip link set <dev> up`.
    fn open(&self, dev: &NetDevice) -> Result<(), IoError>;

    /// Bring the interface down. Stop NAPI, disable interrupts, free
    /// rings. Called when userspace runs `ip link set <dev> down`.
    fn stop(&self, dev: &NetDevice) -> Result<(), IoError>;

    /// Transmit a packet. The driver takes ownership of the NetBufHandle,
    /// sets up DMA descriptors, and returns. Completion is asynchronous
    /// (via TX completion interrupt or NAPI TX poll).
    ///
    /// Returns `Ok(())` on successful queuing.
    /// Returns `Err(IoError::BUSY)` if the TX ring is full; umka-net
    /// will stop the queue and retry after `netif_tx_wake_queue()`.
    fn start_xmit(&self, dev: &NetDevice, buf: NetBufHandle) -> Result<(), IoError>;

    /// Select TX queue for a packet. Called before `start_xmit()` for
    /// multi-queue devices. Default: hash-based queue selection.
    fn select_queue(&self, dev: &NetDevice, buf: &NetBufHandle) -> u16 {
        // Default: XPS (Transmit Packet Steering) or hash-based.
        0
    }

    /// Set interface MTU. Validate against hardware limits.
    fn change_mtu(&self, dev: &NetDevice, new_mtu: u32) -> Result<(), IoError>;

    /// Set MAC address.
    fn set_mac_address(&self, dev: &NetDevice, addr: &[u8; 6]) -> Result<(), IoError>;

    /// Set receive mode: promiscuous, all-multicast, or filtered.
    /// Called when multicast group membership or promisc mode changes.
    fn set_rx_mode(&self, dev: &NetDevice);

    /// Get 64-bit statistics. Returns extended stats including per-queue
    /// counters, error breakdowns, etc.
    fn get_stats64(&self, dev: &NetDevice) -> NetDevStats;

    /// Fix up feature flags after a change request. The driver can veto
    /// or force features based on hardware constraints.
    fn fix_features(&self, dev: &NetDevice, features: NetDevFeatures) -> NetDevFeatures {
        features
    }

    /// Apply changed feature flags to hardware.
    fn set_features(&self, dev: &NetDevice, features: NetDevFeatures) -> Result<(), IoError> {
        Ok(())
    }

    /// Hardware flow steering: program the NIC's flow director to steer
    /// a flow to a specific RX queue. Used by aRFS.
    fn rx_flow_steer(
        &self,
        _dev: &NetDevice,
        _buf: &NetBufHandle,
        _rxq_index: u16,
        _flow_id: u32,
    ) -> Result<(), IoError> {
        Err(IoError::ENOSYS)
    }
}

16.13.1 TX Dispatch (Tier-Aware Transmission)

The RX path uses NapiPollDispatch (Section 16.14) to dispatch NAPI poll across isolation boundaries. The TX path requires an analogous mechanism: when umka-net (Tier 1) or the traffic control layer finishes building a packet, it must invoke the NIC driver's start_xmit() through the appropriate isolation boundary. Without TxDispatch, a Tier 1 NIC driver's start_xmit would need a bare function pointer call, bypassing the isolation boundary — the same problem NapiPollDispatch solves for RX.

TxDispatch is set at register_netdev() time and validated against the driver's KabiTransport class (same invariant enforcement as NapiPollDispatch). Virtual devices (veth, bridge, VLAN) always use TxDispatch::Direct because they are Tier 0 in-kernel constructs.
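The register_netdev()-time invariant check reduces to a tier/variant match. A sketch with simplified enums (the real check inspects the driver's `KabiTransport` class; `Ipc` corresponds to the Tier 2 dispatch mode mentioned in `NetDevice.tx_dispatch`):

```rust
#[derive(Clone, Copy, PartialEq, Debug)]
enum Tier { Tier0, Tier1, Tier2 }

#[derive(Clone, Copy, PartialEq, Debug)]
enum TxDispatchKind { Direct, KabiRing, Ipc }

/// Reject a TxDispatch variant that does not match the driver's actual
/// isolation tier. A Tier 1 driver registered with Direct would bypass
/// the isolation boundary entirely, so registration must fail instead.
fn validate_tx_dispatch(tier: Tier, dispatch: TxDispatchKind) -> Result<(), &'static str> {
    match (tier, dispatch) {
        (Tier::Tier0, TxDispatchKind::Direct)
        | (Tier::Tier1, TxDispatchKind::KabiRing)
        | (Tier::Tier2, TxDispatchKind::Ipc) => Ok(()),
        _ => Err("TxDispatch variant does not match driver isolation tier"),
    }
}
```

Virtual devices (veth, bridge, VLAN) always register as Tier 0 with `Direct`, so they pass this check trivially.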

/// Determines how `start_xmit()` is dispatched across driver isolation
/// boundaries for the TX path.
///
/// This is the TX-side counterpart to `NapiPollDispatch` (which handles RX).
/// Each variant carries the transport-specific state needed to submit a packet
/// to the driver within its isolation tier. The network stack calls
/// `NetDevice::dispatch_xmit()` which matches on this enum.
///
/// **Invariant**: The variant MUST match the driver's actual isolation tier.
/// A Tier 1 driver registered with `TxDispatch::Direct` would bypass the
/// isolation boundary — this is prevented at `register_netdev()` time by
/// checking the driver's `KabiTransport` class.
pub enum TxDispatch {
    /// Tier 0 driver: direct `NetDeviceOps::start_xmit()` call (no domain
    /// crossing). Zero dispatch overhead. Used for Tier 0 NIC drivers and
    /// all virtual devices (veth, bridge, VLAN, tunnel).
    ///
    /// # Safety
    /// The `ops` trait object reference in `NetDevice.ops` must remain valid
    /// for the device's lifetime. Guaranteed because Tier 0 drivers are either
    /// never unloaded or outlive the NetDevice (unregister before unload).
    Direct,

    /// Tier 1 driver: KABI ring dispatch (hardware memory-domain isolated).
    /// The caller serializes the `NetBufHandle` into a `NetBufRingEntry`
    /// ([Section 16.5](#netbuf-packet-buffer--netbufringentry-kabi-wire-format)) and writes
    /// it to the driver's TX command ring. One domain switch pair per batch
    /// (~23 cycles on x86 MPK), amortized across the qdisc dequeue batch.
    ///
    /// The TX path follows a 3-hop relay for Tier 1 NIC drivers:
    /// umka-net (Tier 1) → umka-core (Tier 0 relay) → NIC driver (Tier 1).
    /// umka-net serializes `NetBufRingEntry` records to its TX output ring.
    /// umka-core (Tier 0) reads from this ring and forwards entries to the
    /// NIC driver's TX command ring, then rings the driver's doorbell.
    ///
    /// **Architectural note (3-hop relay justification)**: The design
    /// philosophy chapter prefers direct Tier 1↔Tier 1
    /// rings (one hop). The 3-hop TX relay is a deliberate exception:
    /// umka-net and the NIC driver run in DIFFERENT Tier 1 domains (network
    /// domain vs driver domain), and the TX path requires qdisc processing
    /// which runs in Tier 0. Direct cross-domain rings would skip qdisc,
    /// breaking traffic control (rate limiting, prioritization). The 3-hop
    /// cost (~46 cycles) is amortized across the qdisc batch (typically 64
    /// packets), yielding ~0.7 cycles/packet overhead — well within the
    /// negative overhead budget.
    /// See [Section 16.5](#netbuf-packet-buffer--domain-crossing-protocol) for the
    /// full 3-hop TX path diagram.
    ///
    /// TX completion follows the reverse path: NIC driver → umka-core
    /// (completion ring) → umka-net (TX completion notification). umka-core
    /// frees the NetBuf and DMA-unmaps data pages on completion.
    ///
    /// **Batching**: The traffic control layer (`qdisc_run`) dequeues up to
    /// `tx_weight` packets (default 64) per run. All dequeued packets are
    /// serialized to the relay ring before umka-core performs a single
    /// batch forward, amortizing the domain switch cost across the batch —
    /// identical to how `NapiPollDispatch::KabiRing` amortizes across the
    /// NAPI budget on RX.
    KabiRing {
        /// Reference to the umka-net → umka-core TX relay ring pair.
        /// Allocated at driver registration time. umka-core owns a
        /// corresponding reference to the NIC driver's TX command ring
        /// and forwards entries from this relay ring to the driver ring.
        ring: KabiRingRef,
        /// Driver identifier for domain switch targeting.
        driver_id: DriverId,
        /// Driver domain generation at registration time. Checked on every
        /// `dispatch_xmit()` batch (one `AtomicU64::load(Acquire)` per batch,
        /// not per packet). Stale generation causes the TX path to drop the
        /// packet and return `IoError::NODEV`, triggering queue stop and
        /// crash recovery ([Section 11.9](11-drivers.md#crash-recovery-and-state-preservation)).
        domain_generation: u64,
    },

    /// Tier 2 driver: IPC-based RPC (full process isolation).
    /// The caller sends the packet via an IPC message to the Tier 2 driver
    /// process. The driver process receives the message, programs DMA, and
    /// replies with a TX completion status.
    /// Full IPC round-trip per packet (~200-500 cycles).
    ///
    /// Packets are transferred via the shared NetBuf pool
    /// ([Section 16.5](#netbuf-packet-buffer--domain-crossing-protocol)). The IPC message
    /// contains a `DmaBufferHandle` referencing the packet data — no data
    /// copy, only metadata serialization.
    IpcRpc {
        /// IPC endpoint for communication with the Tier 2 driver process.
        endpoint: IpcEndpointRef,
        /// Driver identifier for packet ownership tracking.
        driver_id: DriverId,
    },
}

impl NetDevice {
    /// Dispatch a packet for transmission through the appropriate isolation
    /// tier. Called by the traffic control layer (`qdisc_dequeue` → `dev_xmit`)
    /// or by virtual devices that bypass qdisc (`PACKET_QDISC_BYPASS`).
    ///
    /// For `TxDispatch::Direct`, this calls `self.ops.start_xmit()` inline.
    /// For `TxDispatch::KabiRing`, the `NetBufHandle` is serialized into a
    /// `NetBufRingEntry` and written to the umka-net → umka-core relay ring.
    /// umka-core forwards the batch to the NIC driver's TX command ring
    /// (3-hop relay). The domain switch is deferred until the entire qdisc
    /// batch is written (the caller is responsible for calling
    /// `kabi_tx_doorbell()` after the batch to signal umka-core).
    /// For `TxDispatch::IpcRpc`, an IPC message is sent per packet.
    ///
    /// Returns `Ok(())` on successful queuing to the driver (or ring).
    /// Returns `Err(IoError::BUSY)` if the TX ring / KABI ring is full.
    /// Returns `Err(IoError::NODEV)` if the driver domain has crashed
    /// (stale generation).
    pub fn dispatch_xmit(&self, buf: NetBufHandle) -> Result<(), IoError> {
        match &self.tx_dispatch {
            TxDispatch::Direct => {
                // Ownership of `buf` transfers to the driver's start_xmit
                // (which takes NetBufHandle by value). The driver holds the
                // handle until TX completion, then drops it.
                self.ops.start_xmit(self, buf)
            }
            TxDispatch::KabiRing { ring, driver_id, domain_generation } => {
                // Check driver liveness (one atomic load per batch when
                // called from qdisc_run; callers may batch this check).
                if ring.domain_generation() != *domain_generation {
                    // `buf` is dropped here — NetBufHandle::Drop returns
                    // the slab slot to the pool. No leak on NODEV.
                    return Err(IoError::NODEV);
                }
                // Serialize NetBufHandle → NetBufRingEntry and enqueue to
                // the umka-net → umka-core relay ring. umka-core will forward
                // the entry to the NIC driver's TX command ring (3-hop relay).
                //
                // The ring entry is a 128-byte wire-format copy that encodes
                // the pool_id, slot_idx, generation, and DMA buffer coordinates.
                // Ownership of the slab slot transfers to the ring consumer
                // (umka-core), which reconstructs a NetBufHandle on dequeue.
                let entry = NetBufRingEntry::from_netbuf(&buf);
                let result = ring.producer_enqueue(entry)
                    .map_err(|_| IoError::BUSY);

                if result.is_ok() {
                    // Ownership successfully transferred to the ring.
                    // Suppress the Drop impl — the ring consumer will
                    // reconstruct a NetBufHandle and drop it on TX completion.
                    core::mem::forget(buf);
                }
                // If enqueue failed (BUSY), `buf` is dropped here —
                // NetBufHandle::Drop returns the slab slot. No leak.
                result
            }
            TxDispatch::IpcRpc { endpoint, driver_id } => {
                // Send TX request via IPC. The DmaBufferHandle in the message
                // transfers data ownership to the Tier 2 driver process.
                // `buf` is moved into IpcTxRequest — ownership transfers.
                let msg = IpcTxRequest::new(*driver_id, buf);
                endpoint.send(msg)
                    .map_err(|_| IoError::BUSY)
            }
        }
    }
}

16.13.1.1 GSO Software Segmentation (validate_xmit)

Generic Segmentation Offload (GSO) allows the TCP/UDP stack to build large packets (up to 64 KB) and defer segmentation to the NIC hardware (TSO/USO) or to a software fallback just before transmission. This amortizes per-packet overhead across many MSS-sized segments.

The GSO validation step runs after qdisc dequeue and before dispatch_xmit(). It is the last chance to segment an oversized packet before it reaches the NIC driver.

/// Result of GSO validation. Either the packet passes through unchanged
/// (hardware handles segmentation or packet is already MSS-sized) or it
/// is split into multiple MSS-sized segments.
pub enum GsoResult {
    /// Packet needs no segmentation — pass directly to dispatch_xmit().
    PassThrough(NetBufHandle),
    /// Packet was segmented into N smaller NetBufs. Each must be submitted
    /// individually via dispatch_xmit().
    Segmented(ArrayVec<NetBufHandle, GSO_MAX_SEGMENTS>),
}

/// Maximum segments from a single GSO split. 64 KB / 1460 (TCP MSS) ≈ 45;
/// 64 covers UDP/SCTP with smaller payloads and provides margin.
pub const GSO_MAX_SEGMENTS: usize = 64;
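The bound can be checked with the same ceiling-division arithmetic that gso_segment uses below — a standalone sketch, with the constant restated so it compiles on its own:

```rust
// Standalone check of the GSO segment arithmetic: a 64 KB TCP payload
// at MSS 1460 splits into ceil(65536 / 1460) = 45 segments, comfortably
// under GSO_MAX_SEGMENTS = 64.

const GSO_MAX_SEGMENTS: u32 = 64;

/// Ceiling division: number of MSS-sized segments for a payload.
fn segment_count(payload_len: u32, mss: u32) -> u32 {
    (payload_len + mss - 1) / mss
}

/// Length of the final (possibly partial) segment.
fn last_segment_len(payload_len: u32, mss: u32) -> u32 {
    let rem = payload_len % mss;
    if rem == 0 { mss } else { rem }
}

fn main() {
    let (payload, mss) = (65536u32, 1460u32);
    let n = segment_count(payload, mss);
    assert_eq!(n, 45);
    assert!(n <= GSO_MAX_SEGMENTS);
    // 44 full 1460-byte segments plus one 1296-byte tail.
    assert_eq!(last_segment_len(payload, mss), 1296);
}
```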

/// Validate a packet for transmission and perform software GSO if the NIC
/// cannot handle the requested offload.
///
/// Called by the TX path (qdisc_run → dev_xmit) for every dequeued packet
/// before `NetDevice::dispatch_xmit()`.
///
/// Takes ownership of `netbuf` (move-only `NetBufHandle`). On the
/// pass-through path (no segmentation needed), the handle is returned
/// inside `GsoResult::PassThrough`. On the segmentation path, the
/// original handle is consumed (dropped — `NetBufHandle::Drop` returns
/// its slab slot) and new handles are returned for each segment.
///
/// GSO fields (`gso_size`, `gso_type`, `data_len`, `transport_header_len`)
/// are accessed via `netbuf.peek()` — borrowing the underlying `NetBuf`
/// from the slab pool without consuming the handle. `NetBufHandle` itself
/// is a 16-byte pool token with no packet metadata fields.
///
/// If the packet has `gso_size > 0` (meaning the transport layer built an
/// oversized packet expecting segmentation), this function checks whether
/// the NIC supports the required hardware offload. If not, it performs
/// software segmentation and returns the individual segments.
pub fn validate_xmit(dev: &NetDevice, netbuf: NetBufHandle) -> GsoResult {
    // Borrow the underlying NetBuf to read GSO metadata.
    // SAFETY: we own the handle exclusively (move-only); no concurrent access.
    let nb = netbuf.peek().expect("validate_xmit: stale handle");

    // Non-GSO packet: no segmentation needed.
    if nb.gso_size == 0 {
        return GsoResult::PassThrough(netbuf);
    }

    // Check if the NIC supports the required offload for this GSO type.
    let required = match nb.gso_type {
        GsoType::TcpV4 => NetDevFeatures::TSO,
        GsoType::TcpV6 => NetDevFeatures::TSO6,
        GsoType::Udp   => NetDevFeatures::GSO,  // USO not yet standard
        GsoType::TcpTunnel => NetDevFeatures::GSO_UDP_TUNNEL | NetDevFeatures::TSO,
        GsoType::UdpTunnel => NetDevFeatures::GSO_UDP_TUNNEL,
        GsoType::GroPartial => {
            // GRO partial: always needs software re-segmentation.
            // `gso_segment` consumes `netbuf` (the original handle is
            // dropped after data is copied to new segment NetBufs).
            return GsoResult::Segmented(gso_segment(netbuf, dev.mtu.load(Relaxed)));
        }
        GsoType::Gre   => NetDevFeatures::GSO_GRE,
        GsoType::None  => unreachable!(), // guarded by gso_size == 0 check above
    };

    if dev.features.contains(required) {
        // NIC handles segmentation in hardware — pass through.
        // The NIC reads gso_size and gso_type from the NetBufRingEntry
        // descriptor and splits the packet into MSS-sized frames on the wire.
        return GsoResult::PassThrough(netbuf);
    }

    // Software GSO fallback: split the large packet into MSS-sized
    // segments. Each segment gets its own L3/L4 headers with correct
    // sequence numbers (TCP) or fragment offsets (UDP).
    // `gso_segment` consumes `netbuf` — the original slab slot is freed
    // after all segment data has been copied.
    let segments = gso_segment(netbuf, dev.mtu.load(Relaxed));
    GsoResult::Segmented(segments)
}

/// Perform software segmentation of a GSO packet.
///
/// Splits the packet referenced by `netbuf` into segments of at most `mtu`
/// bytes (payload). Each segment is a new `NetBufHandle` (allocated from
/// the same pool as the original) with:
/// - Correctly adjusted IP total length / UDP length
/// - TCP: incremented sequence number per segment; PSH/FIN only on last
/// - UDP: each segment is an independent UDP datagram (UFO) or IP fragment
/// - Correct L3/L4 checksums (computed in software since the NIC does not
///   support the required offload)
///
/// For kTLS packets (gso_type includes TLS flag): segmentation respects
/// TLS record boundaries — a segment never splits a TLS record mid-stream.
/// If a record does not fit in a single MSS, it occupies multiple segments
/// but each segment boundary aligns with a record boundary or the packet
/// boundary.
///
/// The original `netbuf` handle is consumed: after all segment data has
/// been copied to new segment `NetBuf`s, `netbuf` is dropped. The
/// `NetBufHandle::Drop` impl returns the original slab slot to the pool
/// and decrements the DMA data page refcount.
///
/// On partial failure (pool exhaustion mid-segmentation), all successfully
/// allocated segment handles are returned in the `ArrayVec`. The caller's
/// `dispatch_xmit` loop handles partial batches; remaining undispatched
/// segment handles are dropped (returning their slab slots to the pool).
fn gso_segment(
    netbuf: NetBufHandle,
    mtu: u32,
) -> ArrayVec<NetBufHandle, GSO_MAX_SEGMENTS> {
    // Borrow the underlying NetBuf to read GSO metadata and packet data.
    // SAFETY: we own the handle exclusively (moved in by value).
    let nb = netbuf.peek().expect("gso_segment: stale handle");

    let mss = nb.gso_size as u32;
    let payload_len = nb.data_len() - nb.transport_header_len();
    let num_segments = (payload_len + mss - 1) / mss; // ceiling division
    debug_assert!(
        num_segments as usize <= GSO_MAX_SEGMENTS,
        "transport layer must cap GSO packets at GSO_MAX_SEGMENTS * mss"
    );
    let gso_type = nb.gso_type;
    let tcp_seq_base = nb.tcp_seq();
    let pool_id = netbuf.pool_id;
    let mut segments = ArrayVec::new();

    for i in 0..num_segments {
        let seg_offset = i * mss;
        let seg_len = core::cmp::min(mss, payload_len - seg_offset);

        // Allocate a new NetBuf from the same pool as the original.
        // Returns Some(NetBuf) (slab-allocated); we access it mutably to
        // set up headers and payload. On pool exhaustion, stop and return
        // the segments built so far (the partial-failure behavior
        // documented above).
        let Some(mut seg) = netbuf_pool_alloc(pool_id) else {
            break;
        };

        // Re-borrow the original for the data copy (the previous borrow
        // ended at the allocation above; pool resizing never happens
        // during TX, but re-borrowing is the correct pattern regardless).
        let nb = netbuf.peek().expect("gso_segment: stale handle mid-segment");

        // Copy L2/L3/L4 headers from original.
        seg.copy_headers_from(nb);
        // Copy segment payload.
        seg.copy_payload(nb, seg_offset, seg_len);

        // Fix up headers for this segment.
        seg.set_ip_total_length(seg.header_len() + seg_len);
        match gso_type {
            GsoType::TcpV4 | GsoType::TcpV6 => {
                seg.set_tcp_seq(tcp_seq_base + seg_offset);
                // PSH and FIN only on the last segment.
                if i < num_segments - 1 {
                    seg.clear_tcp_flags(TcpFlags::PSH | TcpFlags::FIN);
                }
            }
            _ => {}
        }

        // Compute checksums in software (NIC lacks the offload).
        seg.compute_l3_checksum();
        seg.compute_l4_checksum();

        // Clear GSO fields — this segment is ready for direct transmission.
        seg.gso_size = 0;
        seg.gso_type = GsoType::None;

        // Convert the new NetBuf to a handle for return.
        let pool = SHARED_NETBUF_POOLS.load(pool_id as u64)
            .expect("gso_segment: pool disappeared");
        segments.push(pool.handle_for(seg));
    }

    // `netbuf` (the original handle) is dropped here at end of scope.
    // NetBufHandle::Drop returns the original slab slot to the pool.
    segments
}

TX path integration: The full TX flow with GSO validation is:

qdisc_run():
    let mut budget = dev.tx_weight;  // default 64 packets per run
    while budget > 0:
        netbuf = qdisc.dequeue()
        if netbuf.is_none(): break

        match validate_xmit(dev, netbuf):
            GsoResult::PassThrough(buf):
                dev.dispatch_xmit(buf)
            GsoResult::Segmented(segments):
                for seg in segments:
                    dev.dispatch_xmit(seg)
        budget -= 1

For TxDispatch::KabiRing, all segments from a single GSO split are written to the relay ring before the batch doorbell is rung — the domain switch cost is amortized across the entire GSO burst, not paid per-segment.
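The amortization claim is easy to verify with the spec's own cycle estimates (two ~23-cycle domain switch pairs per 3-hop relay, a 64-packet qdisc batch); the figures are estimates from this chapter, not measurements:

```rust
// Back-of-envelope check of the 3-hop relay amortization: the per-batch
// domain-crossing cost divided by the qdisc batch size gives the
// per-packet overhead quoted above (~0.7 cycles/packet).

fn per_packet_overhead(batch_cycles: f64, batch_size: u32) -> f64 {
    batch_cycles / batch_size as f64
}

fn main() {
    // umka-net → umka-core and umka-core → driver: two switch pairs.
    let relay_cycles = 2.0 * 23.0; // ~46 cycles per batch
    let overhead = per_packet_overhead(relay_cycles, 64);
    assert!((overhead - 0.71875).abs() < 1e-9); // ≈ 0.72 cycles/packet
}
```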

DMA buffer ownership on TX dispatch: When dispatch_xmit() succeeds, DMA buffer ownership transfers to the NIC driver regardless of tier. The caller (umka-net / qdisc) MUST NOT access the packet data after a successful return. The driver signals TX completion asynchronously: Tier 0 via netif_tx_wake_queue() from the TX interrupt handler, Tier 1 via the KABI TX completion ring, Tier 2 via an IPC completion message. On completion, the DmaBufferHandle is returned to the NetBuf pool for reuse. This ownership model mirrors the RX path's DmaBufferHandle transfer documented in Section 16.14.

Tier 0 TX completion: Tier 0 NIC drivers handle TX completion via one of two mechanisms: (1) a dedicated TX completion interrupt that fires when the NIC has finished DMA-reading the packet data, or (2) NAPI TX poll where completions are harvested in the same napi_poll() callback that handles RX (the poll function checks both RX and TX completion rings). In either case, the completion handler unmaps the DMA descriptor and drops the NetBufHandle — the Drop impl returns the slab slot to the pool and frees DMA pages (if refcount reaches zero). The handler then calls netif_tx_wake_queue() if the TX queue was previously stopped due to ring exhaustion. The NetBufHandle must not be dropped before the NIC confirms transmission — premature drop causes use-after-free on the DMA buffer.

Tier 1 TX completion: For Tier 1 NIC drivers, TX completion follows a ring-based protocol through the KABI domain runtime:

  1. NIC hardware signals TX completion (MSI-X interrupt). Tier 0 generic_irq_handler() posts an IrqNotification to the driver's IrqRing (for non-NAPI TX completion vectors) or directly calls napi_schedule() (for shared RX+TX NAPI vectors — see Section 16.14).

  2. Driver's consumer loop (running in the Tier 1 isolation domain) processes the TX completion: reads the NIC's TX completion queue to determine which command IDs (CIDs) have completed.

  3. Driver posts TX completion entries to its KABI TX completion ring (a RingBuffer<TxCompletionEntry> in the driver→Tier 0 direction):

    #[repr(C)]
    pub struct TxCompletionEntry {
        /// The NetBufHandle that was submitted for TX.
        pub handle: NetBufHandle,
        /// TX queue index (for netif_tx_wake_queue targeting).
        pub txq_index: u16,
        /// Completion status: 0 = success, nonzero = error code.
        pub status: u16,
        pub _pad: u32,
    }
    const_assert!(core::mem::size_of::<TxCompletionEntry>() == 24);
    

  4. Tier 0 completion ring consumer (runs in the domain runtime's completion processing path, NOT in the driver's domain) — for each entry:

     a. DMA-unmap the data pages referenced by handle (iommu_unmap if IOMMU isolation is active for this device).

     b. Drop the TxCompletionEntry — the NetBufHandle field's Drop impl returns the slab slot to the pool and decrements the data page refcount (freeing DMA pages if this was the last reference).

     c. If the TX queue was stopped (dev.txqs[txq_index].stopped == true): call netif_tx_wake_queue(dev, txq_index) to resume qdisc draining.

     d. Increment the NetDevice.tx_completions counter.
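The 24-byte size asserted above follows from repr(C) layout rules given the 16-byte NetBufHandle (Section 16.13.1.1 describes it as a 16-byte pool token). The sketch below verifies the arithmetic with a stand-in handle whose field names are assumptions, not the kernel's:

```rust
// Layout sanity check for the 24-byte TxCompletionEntry wire format.
// FakeNetBufHandle models a 16-byte, 8-byte-aligned pool token.

#[repr(C)]
struct FakeNetBufHandle {
    pool_id: u32,
    slot_idx: u32,
    generation: u64, // forces 8-byte alignment; total = 16 bytes
}

#[repr(C)]
struct TxCompletionEntry {
    handle: FakeNetBufHandle, // offset 0, 16 bytes
    txq_index: u16,           // offset 16
    status: u16,              // offset 18
    _pad: u32,                // offset 20, pads total to 24
}

fn main() {
    assert_eq!(core::mem::size_of::<FakeNetBufHandle>(), 16);
    // 16 + 2 + 2 + 4 = 24, with no implicit compiler padding.
    assert_eq!(core::mem::size_of::<TxCompletionEntry>(), 24);
    assert_eq!(core::mem::align_of::<TxCompletionEntry>(), 8);
}
```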

NIC hardware ring corruption detection: If the NIC driver reads an RX descriptor with an invalid magic/length/status pattern (vendor-defined), it increments rx_ring_corruption_count, logs an FMA event (HealthEventClass::Network), and initiates a NIC reset via the crash recovery path (Section 11.9). The corrupted descriptor is skipped. This handles cases such as DMA engine errors, firmware bugs, or memory corruption in the descriptor ring region.

Feature flags (equivalent to Linux netdev_features_t):

bitflags::bitflags! {
    #[repr(transparent)]
    pub struct NetDevFeatures: u64 {
        /// Scatter-gather I/O (DMA from non-contiguous buffers).
        const SG              = 1 << 0;
        /// IPv4 TCP/UDP TX checksum offload.
        const IP_CSUM         = 1 << 1;
        /// Hardware TX checksum for all protocols.
        const HW_CSUM         = 1 << 2;
        /// IPv6 TCP/UDP TX checksum offload.
        const IPV6_CSUM       = 1 << 3;
        /// DMA to/from high memory (above 4GB).
        const HIGHDMA         = 1 << 4;
        /// TCP Segmentation Offload (hardware).
        const TSO             = 1 << 5;
        /// TCP Segmentation Offload for IPv6.
        const TSO6            = 1 << 6;
        /// Generic Receive Offload.
        const GRO             = 1 << 7;
        /// Hardware GRO.
        const GRO_HW          = 1 << 8;
        /// Large Receive Offload (deprecated, driver-only).
        const LRO             = 1 << 9;
        /// RX checksum offload.
        const RXCSUM          = 1 << 10;
        /// RX flow hash computation (RSS).
        const RXHASH          = 1 << 11;
        /// Hardware VLAN TX insertion.
        const HW_VLAN_TX      = 1 << 12;
        /// Hardware VLAN RX stripping.
        const HW_VLAN_RX      = 1 << 13;
        /// Hardware VLAN filtering.
        const HW_VLAN_FILTER  = 1 << 14;
        /// Generic Segmentation Offload (software TSO fallback).
        const GSO             = 1 << 15;
        /// Hardware TC offload.
        const HW_TC           = 1 << 16;
        /// Hardware ESP (IPsec) offload.
        const HW_ESP          = 1 << 17;
        /// Kernel TLS TX offload.
        const HW_TLS_TX       = 1 << 18;
        /// Kernel TLS RX offload.
        const HW_TLS_RX       = 1 << 19;
        /// GSO for GRE tunnels.
        const GSO_GRE         = 1 << 20;
        /// GSO for UDP tunnels (VXLAN, Geneve).
        const GSO_UDP_TUNNEL  = 1 << 21;
    }
}
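The feature check that validate_xmit() performs reduces to mask containment: software GSO runs only when the device mask does not contain every required bit. A minimal sketch, with the bitflags crate replaced by plain u64 masks so it compiles standalone (flag values mirror the definitions above):

```rust
// Minimal model of the validate_xmit() offload decision.

const TSO: u64 = 1 << 5;
const GSO_UDP_TUNNEL: u64 = 1 << 21;

/// True when the device lacks at least one required feature bit,
/// i.e. when the software segmentation fallback must run.
fn needs_software_gso(dev_features: u64, required: u64) -> bool {
    dev_features & required != required
}

fn main() {
    let dev = TSO; // NIC supports plain TSO only
    // TcpV4 GSO: hardware handles segmentation.
    assert!(!needs_software_gso(dev, TSO));
    // TcpTunnel GSO needs GSO_UDP_TUNNEL | TSO: software fallback.
    assert!(needs_software_gso(dev, GSO_UDP_TUNNEL | TSO));
}
```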

Ethtool interface — link settings, ring parameters, coalescing, offloads:

/// Ethtool operations for NIC configuration and diagnostics.
/// Equivalent to Linux's `struct ethtool_ops`.
pub trait EthtoolOps: Send + Sync {
    /// Get link settings (speed, duplex, autoneg).
    fn get_link_ksettings(&self, dev: &NetDevice) -> Result<LinkKsettings, IoError>;
    /// Set link settings.
    fn set_link_ksettings(&self, dev: &NetDevice, settings: &LinkKsettings) -> Result<(), IoError>;

    /// Get ring buffer sizes (RX/TX descriptor count).
    fn get_ringparam(&self, dev: &NetDevice) -> RingParam;
    /// Set ring buffer sizes (may require interface restart).
    fn set_ringparam(&self, dev: &NetDevice, param: &RingParam) -> Result<(), IoError>;

    /// Get interrupt coalescing parameters.
    fn get_coalesce(&self, dev: &NetDevice) -> CoalesceParams;
    /// Set interrupt coalescing parameters.
    fn set_coalesce(&self, dev: &NetDevice, params: &CoalesceParams) -> Result<(), IoError>;

    /// Get channel count (combined RX+TX queues, separate RX/TX queues).
    fn get_channels(&self, dev: &NetDevice) -> ChannelInfo;
    /// Set channel count (may require interface restart).
    fn set_channels(&self, dev: &NetDevice, info: &ChannelInfo) -> Result<(), IoError>;

    /// Get driver name and version.
    fn get_drvinfo(&self, dev: &NetDevice) -> DrvInfo;

    /// Get link state (carrier up/down).
    fn get_link(&self, dev: &NetDevice) -> bool {
        dev.carrier.load(Relaxed)
    }
}

pub struct LinkKsettings {
    pub speed: u32,            // Mbps (1000 = 1Gbps, 100000 = 100Gbps)
    pub duplex: Duplex,        // Full, Half, Unknown
    pub autoneg: bool,
    pub port: PhyPort,         // TP, FIBRE, DA, OTHER
    pub supported_speeds: u64, // Bitmask of supported link speeds
}

pub struct RingParam {
    pub rx_max: u32,
    pub tx_max: u32,
    pub rx_current: u32,
    pub tx_current: u32,
}

pub struct CoalesceParams {
    pub rx_coalesce_usecs: u32,     // Interrupt delay (microseconds)
    pub rx_max_frames: u32,         // Max frames before interrupt
    pub tx_coalesce_usecs: u32,
    pub tx_max_frames: u32,
    pub use_adaptive_rx: bool,      // NIC adapts coalescing dynamically
    pub use_adaptive_tx: bool,
}

pub struct ChannelInfo {
    pub max_rx: u32,
    pub max_tx: u32,
    pub max_combined: u32,
    pub rx_count: u32,
    pub tx_count: u32,
    pub combined_count: u32,
}
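The CoalesceParams trade-off can be made concrete with rough arithmetic: at a given packet rate, the NIC interrupts on whichever trigger fires first — the usec timer or the frame-count threshold — and never more than once per packet. A purely illustrative model, not driver behavior:

```rust
// Rough interrupt-rate model for the coalescing parameters above.

fn interrupts_per_sec(pps: f64, coalesce_usecs: u32, max_frames: u32) -> f64 {
    let by_timer = 1_000_000.0 / coalesce_usecs as f64; // timer-expiry rate
    let by_frames = pps / max_frames as f64;            // frame-threshold rate
    // The earlier trigger dominates, capped at one interrupt per packet.
    pps.min(by_timer.max(by_frames))
}

fn main() {
    // 1 Mpps at 50 us / 64 frames: the 50 us timer fires before 64
    // frames accumulate — ~20k irq/s instead of 1M irq/s uncoalesced.
    assert_eq!(interrupts_per_sec(1_000_000.0, 50, 64), 20_000.0);
    // At 10 Mpps the 64-frame threshold fires first (~156k irq/s).
    assert_eq!(interrupts_per_sec(10_000_000.0, 50, 64), 156_250.0);
    // At trickle rates every packet gets its own (delayed) interrupt.
    assert_eq!(interrupts_per_sec(100.0, 50, 64), 100.0);
}
```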

16.13.2 NetDevice Registration Lifecycle

The driver must follow this exact sequence when bringing up a physical NIC. Ordering violations cause races (interrupt arrival before NAPI ready, DMA before IOMMU isolation, etc.).

1. alloc_netdev(sizeof(DriverPriv), "eth%d", ether_setup)
     → allocates NetDevice + driver-private area.

2. IOMMU setup: iommu_attach_device(dev.pci_dev, domain)
     → device DMA now confined to the driver's IOMMU domain.
     MUST complete before any DMA descriptors are programmed.

3. NetBuf pool creation: netbuf_pool_create(dev.numa_node, rx_ring_size)
     → allocates DMA-capable RX buffers for the pool.

4. NAPI registration: netif_napi_add(dev, napi, poll_fn, pool_id)
     → links NAPI context to the device. poll_fn is NOT called yet.
     Requires pool_id from step 3.

5. register_netdev(dev)
     → assigns ifindex, adds to global device list, notifies userspace.
     Precondition: dev.iommu_group.is_some() — returns EINVAL otherwise.
     Precondition: at least one NAPI context registered — returns EINVAL.
     After this call, the device is visible to userspace but not yet
     receiving packets (NAPI not enabled, interrupts not enabled).

6. open() callback (triggered by `ip link set dev up`):
     a. Allocate RX/TX descriptor rings.
     b. Program RX descriptors with NetBuf DMA addresses from pool.
     c. napi_enable(napi) — NAPI poll is now schedulable.
     d. Enable device interrupts.
     → First RX interrupt can now fire; NAPI and pool are ready.

Teardown is the reverse: stop() → disable interrupts → napi_disable()unregister_netdev()netbuf_pool_destroy()iommu_detach_device()free_netdev().
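The register_netdev() preconditions in step 5 can be modeled as a toy state check — the struct, fields, and error strings below are illustrative, not the kernel's actual registration bookkeeping:

```rust
// Toy model of the register_netdev() preconditions: registration fails
// unless IOMMU attach (step 2) and at least one NAPI context (step 4)
// happened first.

#[derive(Default)]
struct NetDeviceSetup {
    iommu_attached: bool,
    napi_contexts: u32,
    registered: bool,
}

impl NetDeviceSetup {
    fn register_netdev(&mut self) -> Result<(), &'static str> {
        if !self.iommu_attached {
            return Err("EINVAL: no IOMMU domain attached");
        }
        if self.napi_contexts == 0 {
            return Err("EINVAL: no NAPI context registered");
        }
        self.registered = true;
        Ok(())
    }
}

fn main() {
    let mut dev = NetDeviceSetup::default();
    assert!(dev.register_netdev().is_err()); // step 2 skipped
    dev.iommu_attached = true;
    assert!(dev.register_netdev().is_err()); // step 4 skipped
    dev.napi_contexts = 1;
    assert!(dev.register_netdev().is_ok()); // full sequence honored
    assert!(dev.registered);
}
```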

16.13.3 VETH (Virtual Ethernet Pair)

For the full veth specification — including VethDevice struct, TX path with XDP integration, cross-namespace metadata scrubbing, namespace move protocol, and performance characteristics — see Section 16.16.

16.13.4 Software Bridge (L2 Switch)

A software bridge implements L2 frame forwarding between multiple network interfaces (physical NICs, veth pairs, VLAN devices, tunnel endpoints). It is a virtual NetDevice that implements NetDeviceOps and participates in the normal packet path — frames received on any bridge port are forwarded based on a learned MAC address table (forwarding database, FDB).

/// Software L2 bridge state. Registered as a virtual NetDevice via
/// `register_netdev()`. Implements `NetDeviceOps` for frame forwarding.
///
/// The bridge learns source MAC addresses from incoming frames and builds
/// a forwarding database (FDB) mapping MAC → port. Unknown unicast,
/// broadcast, and multicast frames are flooded to all ports except the
/// ingress port.
pub struct BridgeState {
    /// Bridge ports (physical or virtual NICs added via `ip link set <dev> master <br>`).
    /// Protected by a spinlock; port add/remove is a warm-path operation
    /// (admin action, not per-packet). The array is sized for enterprise
    /// bridge configurations; typical deployments use 2-8 ports.
    pub ports: SpinLock<ArrayVec<Arc<BridgePort>, MAX_BRIDGE_PORTS>>,

    /// Forwarding database: (MAC, VLAN) → port mapping.
    /// Uses `RcuHashMap` keyed by `([u8; 6], u16)` — a composite non-integer
    /// key (MAC addresses are 6-byte arrays, not integers). XArray is
    /// inappropriate here because hash-as-index would cause silent data loss
    /// on collision. RcuHashMap provides RCU-protected lock-free reads on
    /// the forwarding hot path, matching Linux's `rhashtable` for bridge FDB.
    /// Writer (learning, static entry add/remove) acquires `fdb_lock`.
    pub fdb: RcuHashMap<([u8; 6], u16), BridgeFdbEntry>,

    /// Spinlock protecting FDB writes (learning, static entry management).
    /// Not held on the read (forwarding) path — RCU suffices for reads.
    pub fdb_lock: SpinLock<()>,

    /// STP (Spanning Tree Protocol) state for the bridge.
    /// IEEE 802.1D STP prevents L2 loops by selectively blocking ports.
    /// Default: `STP_DISABLED` (all ports forward). When enabled, ports
    /// transition through Blocking → Listening → Learning → Forwarding.
    pub stp_enabled: AtomicBool,

    /// Bridge-wide VLAN filtering. When `true`, the bridge inspects the
    /// 802.1Q VLAN tag on each frame and forwards only to ports that are
    /// members of the frame's VLAN. When `false`, all frames are forwarded
    /// regardless of VLAN tag (transparent bridging).
    /// Configured via `ip link set <br> type bridge vlan_filtering 1`.
    pub vlan_filtering: AtomicBool,

    /// FDB entry ageing time in seconds. Dynamically learned entries that
    /// have not been refreshed (source MAC seen again) within this interval
    /// are evicted. Default: 300 seconds (matches Linux and IEEE 802.1D).
    /// Static entries (added via `bridge fdb add`) never age.
    pub ageing_time: AtomicU32,

    /// Bridge MAC address. Used as source MAC for locally-originated frames
    /// (STP BPDUs, ARP replies for the bridge's own IP). Set to the lowest
    /// MAC among all ports, or explicitly via `ip link set <br> address`.
    pub bridge_addr: [u8; 6],
}

/// Maximum number of ports on a single bridge. 1024 is sufficient for
/// large-scale virtual switch deployments (e.g., a hypervisor with many VMs).
pub const MAX_BRIDGE_PORTS: usize = 1024;

/// Maximum VLANs per bridge port. 64 covers the vast majority of
/// production VLAN configurations; environments needing more should use
/// a dedicated L2 switch ASIC or VLAN-aware bridge with trunk ports.
pub const MAX_VLANS_PER_PORT: usize = 64;

/// Per-port state for a bridge member interface.
pub struct BridgePort {
    /// Underlying NetDevice (physical NIC, veth, VLAN sub-interface, etc.).
    /// The bridge holds an `Arc` reference, keeping the device alive as long
    /// as it is a bridge member.
    pub dev: Arc<NetDevice>,

    /// Port number (0-based index within this bridge). Stable for the
    /// lifetime of the port's membership. Used as the key in FDB entries.
    pub port_no: u16,

    /// STP port state. Determines whether this port forwards, learns, or
    /// blocks frames. Only meaningful when `BridgeState::stp_enabled` is true.
    pub stp_state: AtomicU8, // StpPortState discriminant

    /// Per-port VLAN membership table. Only consulted when
    /// `BridgeState::vlan_filtering` is true. Each entry specifies a VLAN ID
    /// that this port is a member of, plus whether frames should egress
    /// tagged or untagged.
    pub vlans: SpinLock<ArrayVec<BridgeVlanEntry, MAX_VLANS_PER_PORT>>,

    /// Whether this port is the designated PVID (Port VLAN ID) port.
    /// Untagged ingress frames on this port are assigned `pvid` VLAN.
    pub pvid: u16,
}

/// STP port states (IEEE 802.1D). Values match Linux BR_STATE_* constants.
#[repr(u8)]
pub enum StpPortState {
    /// Port is disabled (administratively down or STP-blocked permanently).
    Disabled    = 0,
    /// Port is transitioning. Processes BPDUs but does not learn or forward.
    Listening   = 1,
    /// Port learns source MACs but does not yet forward data frames.
    Learning    = 2,
    /// Port is fully operational: learns MACs and forwards data frames.
    Forwarding  = 3,
    /// Port is blocked by STP to prevent loops. Receives BPDUs only.
    Blocking    = 4,
}

/// VLAN membership entry for a bridge port.
pub struct BridgeVlanEntry {
    /// 802.1Q VLAN ID (1-4094).
    pub vid: u16,
    /// Whether frames egressing this port for this VLAN should retain
    /// the 802.1Q tag (`true`) or have it stripped (`false`, for access ports).
    pub tagged: bool,
}

/// Forwarding database entry. Maps a MAC address to the bridge port
/// where that MAC was last seen (or statically configured).
pub struct BridgeFdbEntry {
    /// Ethernet MAC address (6 bytes).
    pub mac: [u8; 6],
    /// Port number where this MAC was last seen (source MAC learning)
    /// or statically assigned.
    pub port_no: u16,
    /// Static entries are configured by the administrator and never age.
    /// Dynamic entries are learned from incoming frame source MACs.
    pub is_static: bool,
    /// Timestamp of last frame seen from this MAC (monotonic nanoseconds).
    /// Used by the ageing timer to evict stale entries. Updated atomically
    /// on the forwarding fast path (single `store(Relaxed)` — no lock).
    pub last_seen: AtomicU64,
    /// VLAN ID associated with this entry (0 if VLAN filtering is disabled).
    /// When VLAN filtering is enabled, FDB lookups match on (MAC, VLAN) pairs.
    pub vlan_id: u16,
}

Packet forwarding path (bridge_input):

When a frame arrives on a bridge port, netif_receive_buf() delivers it to the bridge's RX handler (registered when the port joined the bridge):

bridge_input(port: &BridgePort, buf: &NetBuf):
    // 0. Classify the frame's VLAN (0 when VLAN filtering is disabled).
    vlan_id = buf.vlan_id_or_zero()

    // 1. Source MAC learning: update or create FDB entry.
    src_mac = buf.eth_header().src_mac
    fdb_lock.lock()
    if fdb.lookup(src_mac, vlan_id).is_none() || fdb[(src_mac, vlan_id)].port_no != port.port_no:
        fdb.insert(BridgeFdbEntry {
            mac: src_mac,
            port_no: port.port_no,
            is_static: false,
            last_seen: monotonic_ns(),
            vlan_id: vlan_id,
        })
    else:
        fdb[(src_mac, vlan_id)].last_seen.store(monotonic_ns(), Relaxed)
    fdb_lock.unlock()

    // 2. Destination MAC lookup.
    dst_mac = buf.eth_header().dst_mac

    if dst_mac == bridge.bridge_addr:
        // Destination is the bridge itself → deliver to the bridge's own
        // IP stack (local_deliver). This handles ARP, DHCP, management traffic
        // destined for the bridge's own IP address.
        netif_receive_local(bridge.netdev, buf)
        return

    if dst_mac.is_broadcast() || dst_mac.is_multicast():
        // Broadcast/multicast: flood to all ports except ingress + local_deliver.
        bridge_flood(bridge, port.port_no, buf.clone())
        netif_receive_local(bridge.netdev, buf)
        return

    // 3. Unicast forwarding.
    match fdb.lookup(dst_mac, vlan_id):
        Some(entry) if entry.port_no != port.port_no:
            // Known unicast: forward to the single destination port.
            bridge_forward(bridge.ports[entry.port_no], buf)
        Some(_):
            // Source and destination on the same port: drop (hairpin disabled
            // by default). Hairpin mode is configurable per-port for
            // VM-to-VM traffic on the same physical NIC.
            drop(buf)
        None:
            // Unknown unicast: flood to all ports except ingress.
            bridge_flood(bridge, port.port_no, buf)

FDB ageing: A kernel timer fires every ageing_time / 4 seconds and scans the FDB. Entries where monotonic_ns() - last_seen > ageing_time (and !is_static) are evicted. The scan is performed under fdb_lock; on large FDBs (>10K entries), the scan is split across multiple timer invocations to avoid holding the lock for extended periods.
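
One ageing pass can be sketched as follows. This is a userspace approximation over a subset of `BridgeFdbEntry`'s fields; the in-kernel version runs under `fdb_lock` and splits large scans across timer invocations as described above.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

// Subset of BridgeFdbEntry sufficient for the ageing decision.
pub struct FdbEntry {
    pub is_static: bool,
    pub last_seen: AtomicU64, // monotonic nanoseconds
}

/// Evict dynamic entries not seen within `ageing_ns`; static entries never age.
pub fn fdb_age_pass(fdb: &mut Vec<FdbEntry>, now_ns: u64, ageing_ns: u64) {
    fdb.retain(|e| {
        e.is_static
            || now_ns.saturating_sub(e.last_seen.load(Ordering::Relaxed)) <= ageing_ns
    });
}
```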

Netlink interface: Bridge configuration and FDB management use the standard Linux netlink interface (RTM_NEWLINK, RTM_DELLINK for port add/remove; RTM_NEWNEIGH, RTM_DELNEIGH for FDB static entries). This ensures compatibility with iproute2, brctl, and container networking tools (Docker, CNI plugins).

Cross-references:

- NIC driver KABI vtable: Section 12.1 (NicDriverVTable)
- RX tier-aware dispatch (NapiPollDispatch): Section 16.14
- Domain switch overhead: Section 16.12
- TX/RX ring buffer protocol: Section 16.5
- NetBufRingEntry wire format: Section 16.5
- Traffic control (qdisc): Section 16.21
- Crash recovery for stale driver domains: Section 11.9
- VLAN device model: Section 16.27
- WiFi driver (extends NetDeviceOps): Section 13.15
- Netlink socket interface: Section 16.17
- Network namespace isolation: Section 17.1

16.14 NAPI — New API for Packet Polling

NAPI is the interrupt-coalescing and polling framework for high-performance packet receive. Instead of processing one packet per interrupt, NAPI batches packet processing: the first packet triggers an interrupt, which schedules a poll loop that drains up to budget packets from the RX ring before re-enabling interrupts. This reduces interrupt overhead from O(packets) to O(batches). NAPI poll functions execute in softirq context (Section 3.8) — specifically the NET_RX_SOFTIRQ vector (index 3) — or in dedicated NAPI kernel threads when threaded mode is enabled.

Isolation tier: NAPI is Tier 0, Evolvable (not Nucleus). It runs in the NIC driver's scheduling context — the NAPI softirq or NAPI thread executes the driver's poll function. NAPI itself is part of umka-core (Tier 0) because it manages interrupt coalescing and softirq scheduling, which are core kernel responsibilities. The poll dispatch mechanism (NapiPollDispatch) handles the isolation boundary between NAPI (Tier 0) and NIC drivers (Tier 0/1/2).

Separation from umka-net: NAPI collects raw packets from the NIC driver and delivers them to umka-net (Tier 1) in batches. GRO coalescing, protocol parsing, and all L2+ processing happen inside umka-net, not in NAPI. NapiContext does not contain GRO state — GRO hash tables belong to umka-net's NetRxContext (see Section 16.2).

/// Per-queue NAPI context. Each RX queue (and optionally each TX
/// completion queue) has its own NapiContext.
///
/// Registered by the NIC driver during `open()` via `napi_register()`.
/// Unregistered during `stop()` via `napi_unregister()`.
pub struct NapiContext {
    /// NAPI state machine (bit flags).
    pub state: AtomicU64,

    /// Maximum packets to process per poll cycle (default: 64).
    /// Set by the driver at registration. Can be tuned via sysfs.
    pub weight: u32,

    /// Driver-provided poll function. Called by the NAPI subsystem
    /// when this NAPI instance is scheduled. Returns the number of
    /// packets actually processed (0..=budget).
    ///
    /// If `processed < budget`: all packets drained, NAPI re-enables
    /// interrupts via `napi_complete_done()`.
    /// If `processed == budget`: more packets pending, NAPI will
    /// re-schedule this instance for another poll cycle.
    ///
    /// **Poll dispatch**: encapsulates both the poll implementation and the
    /// isolation-tier transport. The `NapiPollDispatch` enum eliminates the
    /// bare `fn` pointer (which is only valid for Tier 0) and ensures that
    /// Tier 1/Tier 2 drivers cannot be called via a direct function pointer
    /// that would violate the isolation boundary.
    pub poll: NapiPollDispatch,

    /// The NetDevice this NAPI instance belongs to.
    pub dev: Arc<NetDevice>,

    /// Batch accumulator: raw NetBufs collected during this poll cycle.
    /// Filled by `napi_receive_buf()` as the driver (or Tier 0 KABI
    /// trampoline) produces packets. The entire batch is delivered to
    /// umka-net in a single domain switch at `napi_complete_done()` time
    /// via `napi_deliver_batch()`.
    ///
    /// SAFETY: Exclusive access guaranteed by NAPI SCHED bit invariant —
    /// each NAPI instance is polled by exactly one CPU at a time
    /// (napi_schedule sets SCHED atomically; napi_poll checks and clears
    /// it). napi_receive_buf() is called only from within poll(), never
    /// from hardirq. No concurrent access. UnsafeCell avoids the ~5ns
    /// SpinLock overhead per access (~640ns total per 64-packet poll cycle).
    /// Implementation: private — accessed only through `napi_receive_buf()`
    /// and `napi_deliver_batch()` methods that assert the SCHED bit.
    /// The `pub(crate)` visibility restricts access to umka-core, preventing
    /// external code from bypassing the SCHED-bit safety invariant.
    pub(crate) rx_batch: UnsafeCell<ArrayVec<NetBufHandle, 64>>,
    /// Number of packets received in the current poll cycle.
    /// `Cell<u32>` for interior mutability: the poll function takes
    /// `&NapiContext` (shared reference) due to the NAPI SCHED bit
    /// single-writer guarantee. `Cell<u32>` is sound because NAPI
    /// poll is single-threaded per instance (enforced by SCHED bit).
    /// Private — accessed only via `napi_receive_buf()` and
    /// `napi_complete_done()`, which assert the SCHED bit.
    pub(crate) rx_count: Cell<u32>,

    /// NAPI instance ID (unique system-wide, for sysfs and busy polling).
    pub napi_id: u32,

    /// Threaded NAPI: dedicated kernel thread for this instance.
    /// When `Some`, the poll function runs in thread context instead
    /// of softirq context, enabling sleeping allocations and better
    /// CPU affinity control.
    pub thread: Option<KthreadHandle>,

    /// Interrupt affinity: which CPU should process this NAPI.
    /// Set via `/sys/class/net/<dev>/queues/rx-<N>/rps_cpus`.
    pub affinity_cpu: AtomicI32,

    /// Busy polling: allow process-context polling from recvmsg().
    pub busy_poll_enabled: bool,

    /// Deferred hard IRQs: number of poll cycles to defer hard
    /// interrupt re-enablement (software IRQ coalescing).
    pub defer_hard_irqs: u32,
}

/// Opaque reference to a KABI shared-memory ring pair (command + completion).
/// Allocated by the driver framework at registration time; valid for the
/// driver's lifetime. Defined in [Section 12.6](12-kabi.md#kabi-transport-classes).
pub type KabiRingRef = Arc<KabiRingPair>;

/// Opaque reference to an IPC endpoint for Tier 2 driver communication.
/// Wraps a kernel IPC channel handle. Defined in [Section 12.6](12-kabi.md#kabi-transport-classes).
pub type IpcEndpointRef = Arc<IpcEndpoint>;

/// Unique identifier for a registered driver instance. Monotonically
/// increasing u64 assigned by `DRIVER_REGISTRY` at driver registration.
/// Used to correlate NAPI instances, ring references, and crash recovery
/// state with their owning driver.
pub type DriverId = u64;

/// Global driver registry. XArray keyed by `DriverId` (u64). Stores
/// `Arc<DriverInstance>` for each registered driver. Provides
/// `generation(id) -> u64` to detect driver crash-and-reload (the
/// generation increments on each reload).
pub static DRIVER_REGISTRY: SpinLock<XArray<Arc<DriverInstance>>> =
    SpinLock::new(XArray::new());

/// Determines how the NAPI poll function is dispatched across driver
/// isolation boundaries.
///
/// This enum replaces the separate `poll: fn(...)` + `poll_mode: NapiPollMode`
/// pair. Each variant carries the transport-specific state needed to invoke the
/// driver's poll implementation within its isolation tier. The NAPI subsystem
/// calls `NapiContext::poll_driver(budget)` which matches on this enum.
///
/// **Invariant**: The variant MUST match the driver's actual isolation tier.
/// A Tier 1 driver registered with `NapiPollDispatch::Direct` would bypass the
/// isolation boundary — this is prevented at `napi_register()` time by checking
/// the driver's `KabiTransport` class.
pub enum NapiPollDispatch {
    /// Tier 0 driver: direct function call (no domain crossing).
    /// The function pointer is called in-line from the NAPI softirq or
    /// NAPI thread context. Zero dispatch overhead.
    ///
    /// # Safety
    /// The function pointer must remain valid for the lifetime of the
    /// NapiContext. Guaranteed because Tier 0 drivers are either never
    /// unloaded (`load_once: true`) or outlive the NapiContext (the NAPI
    /// instance is unregistered before the driver is unloaded).
    Direct(unsafe extern "C" fn(napi: &NapiContext, budget: i32) -> i32),

    /// Tier 1 driver: KABI ring dispatch (hardware memory-domain isolated).
    /// The NAPI subsystem posts `PollRequest { budget }` on the driver's
    /// command ring, performs one domain switch pair (Tier 0 → Tier 1 →
    /// Tier 0), and reads `PollResponse { work_done }` from the completion
    /// ring. The driver processes RX descriptors within its isolation domain
    /// and writes `NetBufRingEntry` records to the completion ring.
    ///
    /// One domain switch pair per poll cycle (~23 cycles on x86 MPK),
    /// amortized across up to `budget` packets.
    KabiRing {
        /// Reference to the shared KABI ring pair (command + completion)
        /// for this driver. Allocated at driver registration time and
        /// valid for the driver's lifetime.
        ring: KabiRingRef,
        /// Driver identifier for the ring dispatch (selects the correct
        /// isolation domain on the domain switch).
        driver_id: DriverId,
        /// Driver domain generation at NAPI registration time. Checked
        /// on every `napi_poll()` invocation (one `AtomicU64::load(Acquire)`
        /// per batch, not per packet — negligible overhead). If the driver
        /// crashes and reloads, `domain.generation` increments, and the
        /// stale generation here causes `napi_poll()` to return
        /// `NAPI_POLL_DONE` and disable this NAPI instance. Crash recovery
        /// re-registers a new NapiContext with the updated generation.
        /// This prevents use-after-free on the ring/domain after driver crash.
        domain_generation: u64,
    },

    /// Tier 2 driver: IPC-based RPC (full process isolation).
    /// The NAPI subsystem sends a poll request message via the IPC channel
    /// to the Tier 2 driver process. The driver replies with work_done.
    /// Packets are transferred via the shared NetBuf pool
    /// ([Section 16.5](#netbuf-packet-buffer--domain-crossing-protocol)).
    /// Full IPC round-trip per poll cycle (~200-500 cycles).
    IpcRpc {
        /// IPC endpoint for communication with the Tier 2 driver process.
        endpoint: IpcEndpointRef,
        /// Driver identifier for packet ownership tracking.
        driver_id: DriverId,
    },
}

impl NapiContext {
    /// Dispatch the poll function through the appropriate isolation tier.
    /// Called by the NAPI softirq or NAPI thread to drain RX descriptors.
    ///
    /// Returns the number of packets processed (0..=budget).
    pub fn poll_driver(&self, budget: i32) -> i32 {
        match &self.poll {
            NapiPollDispatch::Direct(poll_fn) => {
                // SAFETY: Tier 0 driver function pointer is valid for the
                // NapiContext lifetime (see Direct variant documentation).
                unsafe { poll_fn(self, budget) }
            }
            NapiPollDispatch::KabiRing { ring, driver_id, domain_generation } => {
                // Acquire a refcount on the driver domain BEFORE checking
                // generation or accessing the ring. The refcount prevents the
                // domain's memory (including the ring buffer) from being
                // reclaimed during the entire poll operation, closing the
                // TOCTOU window between generation check and ring access.
                //
                // If the domain has already been destroyed (refcount == 0),
                // try_acquire returns None and we skip the poll.
                let domain_ref = match DRIVER_REGISTRY.try_acquire(*driver_id) {
                    Some(r) => r,
                    None => {
                        log_warn!("NAPI: driver domain {:?} already destroyed, \
                                   skipping poll", driver_id);
                        return 0; // NAPI_POLL_DONE
                    }
                };

                // Verify the driver's domain generation matches the generation
                // recorded at NAPI registration time. If the driver has crashed
                // and reloaded, DRIVER_REGISTRY.generation(driver_id) will have
                // incremented, making this NAPI instance stale. The domain_ref
                // we hold prevents memory reclamation even if the domain is
                // being torn down concurrently.
                let current_gen = domain_ref.generation();
                if *domain_generation != current_gen {
                    log_warn!("NAPI: stale domain_generation for driver {:?} \
                               (registered={}, current={}), skipping poll",
                              driver_id, domain_generation, current_gen);
                    // domain_ref dropped here — releases refcount
                    return 0; // NAPI_POLL_DONE — caller will disable this instance
                }
                // Post PollRequest to command ring, switch domain, read
                // PollResponse from completion ring.
                // domain_ref is held for the entire poll duration — ring
                // memory cannot be freed until this refcount is released.
                let result = kabi_napi_poll(ring, *driver_id, budget);
                drop(domain_ref);  // explicit drop for clarity
                result
            }
            NapiPollDispatch::IpcRpc { endpoint, driver_id } => {
                // Send poll request via IPC, await response.
                ipc_napi_poll(endpoint, *driver_id, budget)
            }
        }
    }
}

/// Dispatch a NAPI poll request to a Tier 1 NIC driver via the KABI ring.
///
/// Posts `PollRequest { budget }` on the driver's command ring, performs
/// one domain switch pair (Tier 0 → Tier 1 → Tier 0), and reads
/// `PollResponse { work_done }` from the completion ring. The driver
/// processes RX descriptors within its isolation domain and writes
/// `NetBufRingEntry` records to the completion ring.
///
/// Returns the number of packets processed (0..=budget).
fn kabi_napi_poll(ring: &KabiRingRef, driver_id: DriverId, budget: i32) -> i32;

/// Dispatch a NAPI poll request to a Tier 2 NIC driver via IPC.
///
/// Sends a poll request message via the IPC channel to the Tier 2 driver
/// process. The driver replies with the number of packets processed.
/// Packets are transferred via the shared NetBuf pool
/// ([Section 16.5](#netbuf-packet-buffer--domain-crossing-protocol)).
/// Full IPC round-trip per poll cycle (~200-500 cycles).
///
/// Returns the number of packets processed (0..=budget).
fn ipc_napi_poll(endpoint: &IpcEndpointRef, driver_id: DriverId, budget: i32) -> i32;

/// Write an MMIO doorbell to a Tier 1 NIC driver via the KABI ring.
///
/// Used by the TX path to notify the NIC that new TX descriptors are
/// available. The doorbell write is a single MMIO store to a device
/// register mapped into the driver's isolation domain. The KABI ring
/// carries the doorbell address and value; the Tier 0 trampoline
/// performs the actual MMIO write within the driver's memory domain.
fn kabi_tx_doorbell(dev: &NetDevice);

/// NAPI state bits.
///
/// NOTE: This is the SOFTWARE scheduling state of the NAPI instance
/// (scheduled, disabled, busy-polling). It is distinct from the KABI
/// ring's hardware lifecycle state (ring_state in the DomainRingBuffer).
/// NAPI state controls when poll() is called; ring state controls
/// whether the ring buffer memory is valid for DMA.
pub mod NapiState {
    /// Poll is scheduled (softirq or thread will call poll()).
    pub const SCHED: u64         = 1 << 0;
    /// Reschedule needed (race between poll completion and new IRQ).
    pub const MISSED: u64        = 1 << 1;
    /// Disable pending (napi_disable() called, waiting for poll to finish).
    pub const DISABLE: u64       = 1 << 2;
    /// In NAPI hash table (busy polling can find this instance).
    pub const HASHED: u64        = 1 << 3;
    /// Do not add to NAPI hash (busy polling disabled for this instance).
    pub const NO_BUSY_POLL: u64  = 1 << 4;
    /// A busy-polling thread currently owns this NAPI.
    pub const IN_BUSY_POLL: u64  = 1 << 5;
    /// Prefer busy polling over interrupt-driven for this instance.
    pub const PREFER_BUSY: u64   = 1 << 6;
    /// Threaded mode: poll runs in kernel thread, not softirq.
    pub const THREADED: u64      = 1 << 7;
    /// Has preferred busy-poll CPU assigned (Linux 6.4+).
    pub const PREFER_BUSY_CPU: u64 = 1 << 8;
    /// Schedule-disable: temporarily prevents scheduling this NAPI
    /// (used during driver reconfiguration, ring resize). Unlike DISABLE,
    /// this does not wait for poll completion — it only prevents new
    /// scheduling. Set by `napi_disable_sched()`, cleared by `napi_enable_sched()`.
    pub const SCHED_DISABLE: u64 = 1 << 9;
}
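
The SCHED/MISSED handshake referenced throughout this section can be sketched as a compare-exchange loop over these bits. A simplified sketch under assumed semantics: the spec's `napi_schedule` additionally handles threaded-mode wakeup and poll_list insertion, which are omitted here.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

const SCHED: u64 = 1 << 0;
const MISSED: u64 = 1 << 1;
const DISABLE: u64 = 1 << 2;
const SCHED_DISABLE: u64 = 1 << 9;

/// Returns true when the caller won the right to enqueue this NAPI instance
/// on the per-CPU poll_list and raise NET_RX_SOFTIRQ. If the instance is
/// already scheduled, records MISSED so poll completion re-schedules it.
fn napi_schedule_prep(state: &AtomicU64) -> bool {
    let mut cur = state.load(Ordering::Relaxed);
    loop {
        if cur & (DISABLE | SCHED_DISABLE) != 0 {
            return false; // scheduling suppressed: disable or reconfig pending
        }
        let new = if cur & SCHED != 0 { cur | MISSED } else { cur | SCHED };
        match state.compare_exchange_weak(cur, new, Ordering::AcqRel, Ordering::Relaxed) {
            Ok(_) => return cur & SCHED == 0,
            Err(observed) => cur = observed,
        }
    }
}
```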

NAPI lifecycle:

Driver init (open):
  napi = NapiContext::new(dev, poll_dispatch, weight=64)
  napi_register(dev, napi)    // adds to dev.napi_list
  napi_enable(napi)           // clears DISABLE bit

Packet arrives:
  NIC DMA → RX ring descriptor filled → MSI-X interrupt
  → IRQ handler:
      napi_schedule(napi)
        → set SCHED bit (atomic test-and-set)
        → if was not already SCHED:
            add napi to per-CPU poll_list
            raise NET_RX_SOFTIRQ

Softirq (or NAPI thread):
  net_rx_action()
    → for each napi in this CPU's poll_list:
        budget = min(napi.weight, remaining_budget)
        work_done = napi.poll_driver(budget)
        → Driver poll function:
            while work_done < budget:
                desc = read_rx_descriptor(ring)
                if no more descriptors: break
                buf = build_netbuf_from_descriptor(desc)
                napi_receive_buf(napi, buf)  // accumulate into rx_batch
                work_done += 1
            return work_done
        if work_done < budget:
            napi_complete_done(napi, work_done)
              → napi_deliver_batch(napi)  // one domain switch for entire batch
              → clear SCHED bit
              → re-enable NIC interrupts for this queue
        else:
            // Budget exhausted — flush any accumulated packets before
            // rescheduling. Without this, packets in rx_batch would be
            // stranded until the next poll cycle completes under budget.
            napi_deliver_batch(napi)  // flush accumulated packets to umka-net
            // Stay on poll_list for next softirq cycle.
            // Prevents a single NIC from starving other NAPI instances.

Driver shutdown (stop):
  napi_disable(napi)        // set DISABLE, wait for active poll to finish
  napi_unregister(napi)     // remove from dev.napi_list and NAPI hash

Tier 1 (KabiRing) NAPI Poll Path

When the NIC driver runs as a Tier 1 driver (hardware memory-domain isolated via MPK/POE), it cannot be called directly from the NAPI softirq. Instead, the NAPI subsystem (Tier 0) communicates with the driver via a shared-memory ring buffer (KABI ring). The following pseudocode describes the Tier 0 side of the poll cycle:

Tier 1 (KabiRing) NAPI poll path:

Driver side (Tier 1, runs in driver's isolation domain):
  — Woken by PollRequest on KABI command ring.
  — For each completed RX descriptor (up to budget):
      1. Read RX descriptor from NIC hardware ring.
      2. Build a NetBufRingEntry (128-byte flattened metadata struct,
         [Section 16.5](#netbuf-packet-buffer--netbufringentry-kabi-wire-format)).
         Includes DmaBufferHandle referencing the DMA data page,
         offsets, checksum status, VLAN, RSS hash. If >2 frags,
         write continuation entries ([Section 16.5](#netbuf-packet-buffer--netbufringentry-kabi-wire-format)).
      3. Write the NetBufRingEntry (and any continuations) to the KABI
         completion ring (shared memory, writable by Tier 1, readable by Tier 0).
  — Write work_done count to the poll response slot.
  — Ring the doorbell (write to a shared atomic flag).

Tier 0 side (NAPI softirq context, napi_poll trampoline):
  1. Post PollRequest { budget } to the driver's KABI command ring.
     — Single domain switch: Tier 0 → Tier 1 (driver poll entry).
     — Driver processes up to `budget` RX descriptors, writes
       NetBufRingEntry records to the completion ring.
     — Domain switch: Tier 1 → Tier 0 (driver poll return).

  2. Reconstruct NetBufs from the completion ring (Tier 0 context):
     for i in 0..budget:
         entry = kabi_ring.consumer_dequeue::<NetBufRingEntry>()
         if entry.is_none():
             break   // ring empty — driver drained all descriptors
         // Allocate fresh NetBuf in Tier 0's pool and populate from
         // the serialized entry. Data pages shared via DmaBufferHandle.
         netbuf = NetBuf::from_ring_entry(entry)
         processed += 1

         // --- XDP execution point ---
         // If an XDP program is attached to this NIC's RX queue, run it
         // before GRO/netfilter/L4 delivery. The XDP program runs in the
         // eBPF domain ([Section 19.2](19-sysapi.md#ebpf-subsystem)) within Tier 0 context.
         if napi.dev.xdp_prog.is_some():
             let xdp_ctx = netbuf_to_xdp_context(&netbuf, netbuf.ifindex, netbuf.rxq_index);
             xdp_action = bpf_run_xdp(napi.dev.xdp_prog, &xdp_ctx)
             match xdp_action:
                 XDP_PASS   => { }                 // continue to stack
                 XDP_DROP   => { netbuf_free(netbuf); continue }
                 XDP_TX     => { xdp_do_tx(napi.dev, netbuf); continue }
                 XDP_REDIRECT => { xdp_do_redirect(netbuf); continue }
                 XDP_ABORTED => { trace_xdp_exception(napi.dev); netbuf_free(netbuf); continue }

         // Accumulate into the NAPI batch (no domain switch yet).
         napi_receive_buf(napi, netbuf)

  3. If budget exhausted (processed == budget):
         // Flush accumulated rx_batch to umka-net before returning.
         // Without this, packets already in rx_batch would be stranded
         // until the next poll cycle completes under budget.
         napi_deliver_batch(napi)  // flush partial batch to umka-net
         return budget   // stay in poll mode — NAPI re-schedules
  4. If ring drained (processed < budget):
         napi_complete_done(napi, processed)
           → napi_deliver_batch(napi)  // ONE domain switch for entire batch
             // Tier 0 → Tier 1 (umka-net): delivers all accumulated
             // NetBufs. umka-net performs GRO coalescing, L2 dispatch,
             // IP routing, netfilter, L4 delivery.
           → clear SCHED bit
           → re-enable NIC interrupts for this queue
         return processed

Key differences from Tier 0 poll path:

- Two domain switches per poll cycle (not per packet): one to enter the driver domain for RX descriptor processing, one to return. The batch of up to budget packets is processed within a single domain switch pair.
- Data zero-copy, metadata serialized: Packet data pages are shared via DmaBufferHandle (no data memcpy). Metadata is serialized as NetBufRingEntry (128 bytes per packet, plus continuation entries for >2 frags; Section 16.5). Each domain reconstructs its own NetBuf from the entry, preventing TOCTOU attacks on header offsets across the isolation boundary.
- Amortized cost: At batch size 64, the domain switch overhead is ~0.7 cycles per packet (2 × ~23 cycles / 64 packets). At low packet rates (batch size 1), per-packet overhead is ~46 cycles (one full domain switch pair per packet). The ~0.7 cycles/packet figure assumes batch=64, typical for line-rate operation. For low-rate interfaces, interrupt-driven mode avoids wasted poll cycles. See Section 16.12 for the full analysis.
- Crash recovery: If the Tier 1 driver crashes during poll, the KABI ring contains partially completed entries. The crash recovery handler (Section 11.9) drains any valid entries from the ring, then reloads the driver. In-flight packets already in the completion ring are delivered; packets not yet dequeued from the NIC hardware ring are lost (same as a NIC reset on Linux).
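
The amortization arithmetic above can be checked directly, using the ~23-cycle x86 MPK switch figure and two switches per poll cycle:

```rust
/// Per-packet domain-switch overhead for a given batch size
/// (two domain switches per poll cycle, amortized over the batch).
fn amortized_switch_cycles(switch_cycles: f64, batch_size: u32) -> f64 {
    assert!(batch_size > 0);
    2.0 * switch_cycles / batch_size as f64
}
```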

DMA buffer ownership transfer: When the driver writes a NetBufRingEntry to the completion ring, it transfers DMA buffer ownership to Tier 0. The driver MUST NOT access the DMA data page after ring submission (the entry is consumed by Tier 0 which calls NetBuf::from_ring_entry() and takes ownership of the DmaBufferHandle). Tier 0 is responsible for dma_unmap_single() after protocol processing completes (either in kfree_netbuf() or after the page is copied to userspace). The reverse applies for TX: start_xmit() receives a NetBufHandle with DMA ownership transferred to the driver; Tier 0 must not access the data until the TX completion ring signals that the NIC has finished DMA. This single-owner model prevents both double-free and use-after-DMA-unmap.
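
The single-owner rule maps naturally onto Rust move semantics: submission consumes the handle, so post-submission access is a compile error rather than a runtime hazard. A sketch with an illustrative field set (the real NetBufRingEntry is the 128-byte wire format of Section 16.5):

```rust
// Opaque stand-in for a DMA data page reference.
struct DmaBufferHandle(u64);

// Illustrative subset of a completion-ring entry.
struct RingEntrySketch {
    dma: DmaBufferHandle,
    len: u32,
}

/// Submitting consumes the handle: the producer can no longer touch the
/// DMA page after this call (use-after-submit fails to compile).
fn submit_to_completion_ring(dma: DmaBufferHandle, len: u32) -> RingEntrySketch {
    RingEntrySketch { dma, len }
}
```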

Budget accounting: The global softirq budget is netdev_budget (default 300 packets per softirq invocation, configurable via /proc/sys/net/core/netdev_budget). This budget is split across all NAPI instances on the poll_list. Each instance gets min(napi.weight, remaining_global_budget) per round. If any instance hits its budget, it stays on the poll_list and gets another turn in the next softirq cycle. The softirq itself yields after netdev_budget total packets or netdev_budget_usecs (default 2000μs) wall time, whichever comes first.
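
One round of this split can be sketched as follows (`split_budget` is an illustrative name; the real loop interleaves polling with the split and also enforces the wall-time limit):

```rust
/// One net_rx_action round: each NAPI instance on the poll_list receives
/// min(weight, remaining global budget), in list order.
fn split_budget(weights: &[u32], netdev_budget: i32) -> Vec<i32> {
    let mut remaining = netdev_budget;
    weights
        .iter()
        .map(|&w| {
            let b = (w as i32).min(remaining.max(0));
            remaining -= b;
            b
        })
        .collect()
}
```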

Byte weight consideration: The budget is packet-count-based, not byte-weighted. A jumbo frame (9 KB) consumes ~6x the memory of a standard frame (1.5 KB). For jumbo-frame NICs, the effective memory consumption per NAPI poll round can reach napi.weight * 9216 bytes (~576 KB at weight=64). This is bounded by the per-fragment page allocation: each NetBuf data fragment occupies one page (4 KB), so a standard frame fits in a single page while a 9 KB jumbo frame spans three fragment pages. Memory pressure is managed by the page allocator's watermark system, not by NAPI budget tuning.

napi_receive_buf() function body:

/// Accumulate a received packet into the NAPI batch for deferred delivery to umka-net.
/// Called by the driver's poll function (Tier 0 or Tier 1 trampoline) for each
/// received packet. The NetBuf is converted to a NetBufHandle for compact storage
/// in the batch array.
///
/// # Arguments
/// - `napi`: The NapiContext for this poll cycle (exclusive access via SCHED bit).
/// - `buf`: The received NetBuf (metadata + data page references).
///
/// # Steps
/// 1. Convert `NetBuf` to `NetBufHandle` via `NetBufPool::handle_for(buf)`
///    (consuming). This computes the slot index from pointer arithmetic within
///    the slab page and returns the 16-byte handle (pool_id + slot_idx +
///    generation). The `NetBuf` metadata remains in its slab slot; the handle
///    is the ownership token. The caller cannot access the `NetBuf` after this
///    call — all subsequent access goes through `handle.peek()`.
/// 2. Push the `NetBufHandle` into `rx_batch` (ArrayVec<NetBufHandle, 64>).
///    If the batch is full (64 entries), flush immediately via
///    `napi_deliver_batch(napi)` — this adds one extra domain crossing but
///    prevents data loss. After flush, push the handle into the now-empty batch.
/// 3. Increment `napi.rx_count`.
fn napi_receive_buf(napi: &NapiContext, buf: NetBuf) {
    let handle = napi.dev.netbuf_pool().handle_for(buf);
    // SAFETY: Exclusive access to rx_batch guaranteed by NAPI SCHED bit.
    // The raw pointer is re-derived after the potential flush so no `&mut`
    // borrow is held across `napi_deliver_batch()` (which takes its own
    // exclusive reference to rx_batch).
    if unsafe { (*napi.rx_batch.get()).is_full() } {
        // rx_batch overflow: flush mid-batch (one extra domain crossing).
        napi_deliver_batch(napi);
    }
    unsafe { (*napi.rx_batch.get()).push(handle) };
    napi.rx_count.set(napi.rx_count.get() + 1);
}
/// Deliver accumulated RX packets to umka-net (Tier 1) in a single batch.
/// Performs ONE Tier 0 → Tier 1 domain switch for the entire batch.
///
/// # Precondition
/// Called from NAPI poll context with the SCHED bit set (exclusive access
/// to `napi.rx_batch`). May be called multiple times per poll cycle (if
/// `rx_batch` overflows mid-poll, or at `napi_complete_done()`).
///
/// # Algorithm
/// 1. Take the current `rx_batch` array (swap with empty ArrayVec).
/// 2. If the batch is empty, return immediately (no domain switch).
/// 3. Also flush the `WakeupAccumulator` in umka-net's `NetRxContext`
///    (accumulated `SocketWakeEvent`s from `sk_data_ready()` and
///    `sk_write_space_ready()` calls during this poll cycle).
/// 4. Submit the batch via `kabi_call!` to umka-net's inbound ring:
///    - The `NetBufHandle` array (up to 64 × 16 = 1024 bytes) is placed
///      in the KABI shared argument buffer (not inline in the ring slot).
///    - `napi_id` is included in the ring command so umka-net can look up
///      the correct `NetRxContext` via `NET_RX_CONTEXTS.load(napi_id)`.
///    - Domain switch: Tier 0 → Tier 1. umka-net's consumer dequeues
///      the batch and calls `NetRxContext::receive_batch()`.
///    - Domain switch: Tier 1 → Tier 0 (return).
/// 5. Reset `napi.rx_count` to 0.
///
/// # Error handling
/// If the `kabi_call!` fails (umka-net crashed, ring full):
/// - Drop all `NetBufHandle`s in the batch. Each handle's `Drop` impl
///   returns the slab slot to the pool and decrements the DMA data page
///   refcount. No memory leak.
/// - Clear `rx_count`.
/// - Log a warning via FMA. The KABI domain runtime initiates umka-net
///   restart ([Section 12.8](12-kabi.md#kabi-domain-runtime--crash-behavior)).
///
/// # Performance
/// One domain switch pair (~46 cycles on x86 MPK) for up to 64 packets.
/// Amortized per-packet cost: ~0.7 cycles at batch size 64.
fn napi_deliver_batch(napi: &NapiContext) {
    // SAFETY: Exclusive access to rx_batch guaranteed by NAPI SCHED bit.
    let batch = unsafe { &mut *napi.rx_batch.get() };
    if batch.is_empty() { return; }

    let handles: ArrayVec<NetBufHandle, 64> = core::mem::take(batch);
    let napi_id = napi.napi_id;

    // Submit to umka-net via KABI shared argument buffer.
    let result = kabi_call!(
        napi.dev.net_ring_handle,
        deliver_rx_batch,
        napi_id,
        &handles[..]
    );

    if let Err(e) = result {
        // Domain fault or ring full: drop handles (returns slab slots).
        drop(handles);
        log_warn!("NAPI: napi_deliver_batch failed for napi_id {}: {:?}", napi_id, e);
    }

    napi.rx_count.set(0);
}

Batch delivery to umka-net: During the poll function, drivers call napi_receive_buf(napi, buf) which accumulates NetBufHandles into NapiContext.rx_batch. At napi_complete_done(), the entire batch is delivered to umka-net via napi_deliver_batch() — a single Tier 0 → Tier 1 domain switch for the whole batch (not per-packet). This amortizes the domain switch cost (~23 cycles on x86 MPK) across up to budget packets.

WakeupAccumulator overflow handling: The WakeupAccumulator is an ArrayVec<SocketWakeEvent, 64>. If more than 64 wakeup events accumulate in a single poll cycle (e.g., each packet going to a different socket plus ACK-triggered sk_write_space_ready events), napi_deliver_batch() flushes the accumulator mid-batch by performing an extra domain crossing to post the events. This adds ~23 cycles but prevents event loss. The default napi.weight of 64 matches the accumulator capacity exactly; overflow is rare.

umka-net crash recovery during batch delivery: If umka-net crashes during napi_deliver_batch() (domain fault detected by the KABI runtime):

1. Tier 0 detects the crash via a domain fault exception in kabi_call!.
2. Tier 0 clears rx_batch — dropping the ArrayVec drops all remaining NetBufHandles. Each handle's Drop impl returns the slab slot to the pool and decrements the DMA data page refcount (freeing the DMA buffer if the refcount reaches zero). No explicit NetBuf::free() loop is needed.
3. Tier 0 clears rx_count.
4. The KABI domain runtime initiates umka-net restart (Section 12.8).
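The Drop-based cleanup can be modeled with a toy pool. All names here are illustrative stand-ins for NetBufPool and NetBufHandle, not the real types; the point is that dropping the batch container returns every slot without an explicit free loop.

```rust
use std::cell::Cell;

/// Toy pool: counts free slots. The real NetBufPool additionally
/// decrements the DMA data page refcount (omitted here).
struct Pool {
    free_slots: Cell<usize>,
}

/// Toy handle: slot return happens in Drop, so clearing the batch
/// (dropping the container) frees every buffer with no explicit loop.
struct Handle<'a> {
    pool: &'a Pool,
}

impl<'a> Drop for Handle<'a> {
    fn drop(&mut self) {
        self.pool.free_slots.set(self.pool.free_slots.get() + 1);
    }
}

fn main() {
    let pool = Pool { free_slots: Cell::new(0) };
    // Simulate a full 64-entry rx_batch that umka-net never consumed.
    let batch: Vec<Handle> = (0..64).map(|_| Handle { pool: &pool }).collect();
    drop(batch); // crash-recovery step: drop the batch wholesale
    assert_eq!(pool.free_slots.get(), 64); // every slot returned, no leak
}
```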

Inside umka-net, the batch is received by NetRxContext::receive_batch(), which performs GRO coalescing, L2 dispatch, and protocol processing. GRO state (hash tables, flow tracking) lives entirely in umka-net's NetRxContext — NAPI (Tier 0) never touches GRO. See Section 16.2 for the GRO data structures and for the GRO-TCP contiguous coalescing invariant that governs when segments can be merged.

Two separate handoffs in the RX path:

1. NIC → NAPI (via NapiPollDispatch): The driver's poll function produces raw NetBufs. For Tier 0 drivers, this is a direct function call. For Tier 1 drivers, the KABI ring carries NetBufRingEntry records across the isolation boundary, and Tier 0 reconstructs NetBufs. This handoff is NOT via KABI ring in the umka-net sense — it is the NIC driver's own KABI ring.
2. NAPI → umka-net (batch delivery): napi_deliver_batch() submits the entire rx_batch array to umka-net's inbound ring via kabi_call!. NAPI runs in Tier 0 (domain 0), umka-net runs in Tier 1 (network domain) — this is a cross-domain ring dispatch. The batch is passed as a contiguous array of NetBufHandles in shared memory accessible to both domains. The implicit batching of the ring protocol means all packets from one NAPI poll cycle are delivered in a single ring submission.

Threaded NAPI: When enabled (via echo 1 > /sys/class/net/<dev>/threaded or per-device via netlink), each NAPI instance runs in its own kernel thread instead of softirq context. Benefits:

- CPU affinity: the NAPI thread can be pinned to a specific core.
- Sleeping allocations: the poll function can call GFP_KERNEL allocations.
- No softirq starvation: heavy NAPI processing doesn't delay other softirqs.
- Better accounting: CPU time is charged to the NAPI thread, visible in /proc.

Cross-references:

- Domain switch overhead and NAPI batching: §16.9
- NetBuf pool integration: Section 16.5
- Busy polling: §16.9 (SO_BUSY_POLL)
- GRO state in umka-net: Section 16.2
- CpuLocalBlock.napi_budget: Section 3.12
- Domain crossing protocol: Section 16.5
- Softirq subsystem (vector table, processing algorithm, ksoftirqd fallback): Section 3.8

16.15 Kernel TLS (kTLS)

Kernel TLS (kTLS): UmkaOS supports TCP_ULP with tls to offload TLS record-layer encryption/decryption to the kernel (TLS_TX, TLS_RX socket options). This enables sendfile() for HTTPS without userspace encryption (used by nginx, Envoy, HAProxy). The TLS record layer runs in umka-net (Tier 1); key material is confined to the connection's socket structure and wiped on close. Hardware TLS offload to capable NICs is supported via the standard NETIF_F_HW_TLS_TX / NETIF_F_HW_TLS_RX feature flags.

Offload Negotiation and Fallback

After the TLS handshake completes in userspace, the application calls setsockopt(SOL_TLS, TLS_TX, tls_crypto_info, ...) (and optionally TLS_RX) to hand off the record layer to the kernel. At this point the kernel decides whether to use NIC hardware offload or software kTLS:

  1. Capability discovery: NIC drivers expose TLS offload support via a TlsOffloadCaps bitfield advertised to umka-net during device registration. Capabilities are reported per direction (TX, RX) and per cipher suite so the stack can make per-connection decisions.

  2. Supported cipher suites for offload (NICs may support a subset):

     - TLS_CIPHER_AES_GCM_128 — most widely supported
     - TLS_CIPHER_AES_GCM_256
     - TLS_CIPHER_CHACHA20_POLY1305

Software kTLS mandatory cipher support (all must be implemented):

| Cipher Suite | TLS Identifier | Mandated By | Key/IV Size | Tag Size |
|---|---|---|---|---|
| AES-128-GCM | TLS_AES_128_GCM_SHA256 | RFC 8446 §B.4 (MUST) | 16 B / 12 B | 16 B |
| AES-256-GCM | TLS_AES_256_GCM_SHA384 | RFC 8446 §B.4 (SHOULD) | 32 B / 12 B | 16 B |
| ChaCha20-Poly1305 | TLS_CHACHA20_POLY1305_SHA256 | RFC 8446 §B.4 (SHOULD) | 32 B / 12 B | 16 B |

TLS 1.2 backward compatibility also supports:

| Cipher Suite | TLS ID | Standard | Key/IV Size |
|---|---|---|---|
| AES-128-GCM (TLS 1.2) | TLS_RSA_WITH_AES_128_GCM_SHA256 | RFC 5246 | 16 B / 4 B + 8 B |
| AES-256-GCM (TLS 1.2) | TLS_RSA_WITH_AES_256_GCM_SHA384 | RFC 5246 | 32 B / 4 B + 8 B |
  3. Negotiation sequence:

    userspace: setsockopt(SOL_TLS, TLS_TX, tls_crypto_info, ...)
    kernel:    check if NIC supports the negotiated cipher suite and direction
    kernel:    if yes → call driver's .tls_dev_add(), pass key material to NIC
               if no  → fall back silently to software kTLS (same API, app unchanged)
    
    The setsockopt() API is identical to Linux (SOL_TLS socket option) for full compatibility with existing TLS-aware applications.

  4. Transparent fallback: if the NIC rejects offload (key table full, unsupported cipher suite, or device error), the kernel falls back to software kTLS without surfacing the failure to the application — the setsockopt() call succeeds either way. Only the data path changes (NIC encrypt/decrypt vs. kernel encrypt/decrypt); the socket API and application behaviour are identical in both cases.

  5. Asymmetric offload: TX and RX offload are independently negotiated. A NIC may support TX offload but not RX (or vice versa). Each direction is offloaded if and only if the NIC supports it for the negotiated cipher suite; the other direction falls back to software kTLS. Hardware offload in one direction and software kTLS in the other may coexist on the same connection.

16.15.1 kTLS Mandatory Cipher Support

All three cipher suites below must be supported by the software kTLS implementation. NIC hardware offload for any subset of them is optional (driver declares support via the TlsOffloadCaps bitfield advertised during device registration).

Mandatory cipher suites:

| Cipher suite | TLS version | Linux kTLS since | RFC mandate | Notes |
|---|---|---|---|---|
| TLS_AES_128_GCM_SHA256 | TLS 1.3 | Linux 4.13 | RFC 8446 §11.1 MUST | Default TLS 1.3 cipher; most widely NIC-offloaded |
| TLS_AES_256_GCM_SHA384 | TLS 1.3 | Linux 5.1 | RFC 8446 §11.1 SHOULD | Required for high-security deployments |
| TLS_CHACHA20_POLY1305_SHA256 | TLS 1.3 | Linux 5.11 | RFC 8446 SHOULD | Required on platforms lacking AES hardware acceleration |

RFC 8446 §11.1 requirement: Implementations MUST implement TLS_AES_128_GCM_SHA256. TLS_AES_256_GCM_SHA384 and TLS_CHACHA20_POLY1305_SHA256 are recommended. UmkaOS implements all three.

Crypto info structs (passed via setsockopt(SOL_TLS, TLS_TX/TLS_RX, ...)). These are layout-compatible with Linux's tls_crypto_info family in include/uapi/linux/tls.h, ensuring unmodified applications work without recompilation:

/// Base TLS crypto info header. Passed as the first field of each cipher-specific
/// struct. Layout matches Linux `struct tls_crypto_info` for setsockopt compat.
#[repr(C)]
pub struct KtlsCryptoInfo {
    /// TLS protocol version: `0x0303` = TLS 1.2, `0x0304` = TLS 1.3.
    pub version: u16,
    /// Cipher type constant. Must match one of the `CIPHER_*` values below.
    pub cipher_type: u16,
}

/// Cipher type constants. Values match Linux `TLS_CIPHER_*` in `linux/tls.h`
/// for setsockopt compatibility with existing TLS-aware userspace applications.
pub mod cipher_type {
    /// AES-128-GCM. Value = 51 (`TLS_CIPHER_AES_GCM_128`). Linux 4.13+.
    pub const AES_GCM_128:       u16 = 51;
    /// AES-256-GCM. Value = 52 (`TLS_CIPHER_AES_GCM_256`). Linux 5.1+.
    pub const AES_GCM_256:       u16 = 52;
    /// ChaCha20-Poly1305. Value = 54 (`TLS_CIPHER_CHACHA20_POLY1305`). Linux 5.11+.
    pub const CHACHA20_POLY1305: u16 = 54;
}

/// AES-128-GCM crypto parameters for kTLS (TLS 1.2 or TLS 1.3).
/// Mandatory cipher (RFC 8446 §11.1 MUST).
/// Layout matches Linux `struct tls12_crypto_info_aes_gcm_128`.
#[repr(C)]
pub struct KtlsAes128GcmInfo {
    /// Base header (`version` + `cipher_type = cipher_type::AES_GCM_128`).
    pub info: KtlsCryptoInfo,
    /// Implicit nonce (IV) — 8 bytes. XOR'd with the sequence number to form
    /// the full 12-byte GCM nonce together with `salt`.
    pub iv:      [u8; 8],
    /// AES-128 symmetric key — 16 bytes.
    pub key:     [u8; 16],
    /// Fixed salt — 4 bytes. Prepended to `iv` to form the 12-byte GCM nonce.
    pub salt:    [u8; 4],
    /// TLS record sequence number — 8 bytes. Used for nonce reconstruction
    /// and AAD construction on the receive path.
    pub rec_seq: [u8; 8],
}
// UAPI ABI: KtlsCryptoInfo(4) + iv(8) + key(16) + salt(4) + rec_seq(8) = 40 bytes.
const_assert!(core::mem::size_of::<KtlsCryptoInfo>() == 4);
const_assert!(core::mem::size_of::<KtlsAes128GcmInfo>() == 40);

/// AES-256-GCM crypto parameters for kTLS (TLS 1.2 or TLS 1.3).
/// Layout matches Linux `struct tls12_crypto_info_aes_gcm_256`.
#[repr(C)]
pub struct KtlsAes256GcmInfo {
    /// Base header (`version` + `cipher_type = cipher_type::AES_GCM_256`).
    pub info: KtlsCryptoInfo,
    /// Implicit nonce (IV) — 8 bytes.
    pub iv:      [u8; 8],
    /// AES-256 symmetric key — 32 bytes.
    pub key:     [u8; 32],
    /// Fixed salt — 4 bytes.
    pub salt:    [u8; 4],
    /// TLS record sequence number — 8 bytes.
    pub rec_seq: [u8; 8],
}
// UAPI ABI: KtlsCryptoInfo(4) + iv(8) + key(32) + salt(4) + rec_seq(8) = 56 bytes.
const_assert!(core::mem::size_of::<KtlsAes256GcmInfo>() == 56);

/// ChaCha20-Poly1305 crypto parameters for kTLS (TLS 1.3 only).
/// ChaCha20 uses a 96-bit (12-byte) nonce directly; there is no separate salt.
/// Layout matches Linux `struct tls12_crypto_info_chacha20_poly1305`.
#[repr(C)]
pub struct KtlsChaCha20Poly1305Info {
    /// Base header (`version` + `cipher_type = cipher_type::CHACHA20_POLY1305`).
    pub info: KtlsCryptoInfo,
    /// Full 12-byte nonce (no salt split; nonce XOR'd with sequence number).
    pub iv:      [u8; 12],
    /// ChaCha20 symmetric key — 32 bytes.
    pub key:     [u8; 32],
    /// TLS record sequence number — 8 bytes.
    pub rec_seq: [u8; 8],
}
// UAPI ABI: KtlsCryptoInfo(4) + iv(12) + key(32) + rec_seq(8) = 56 bytes.
const_assert!(core::mem::size_of::<KtlsChaCha20Poly1305Info>() == 56);
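As a sketch of how the salt, iv, and rec_seq fields combine on the TLS 1.3 path (RFC 8446 §5.3: the per-record nonce is the 12-byte static IV XORed with the left-padded record sequence number), assuming the salt || iv layout described above. The function name is illustrative, not part of the spec:

```rust
/// Sketch (not normative): per-record GCM nonce for TLS 1.3 given the
/// salt(4) || iv(8) split above. The sequence number is left-padded to
/// 12 bytes before the XOR, so the pad zeros leave the salt untouched
/// and only the low 8 bytes of the static IV change per record.
fn gcm_nonce_tls13(salt: [u8; 4], iv: [u8; 8], rec_seq: u64) -> [u8; 12] {
    let mut nonce = [0u8; 12];
    nonce[..4].copy_from_slice(&salt);
    nonce[4..].copy_from_slice(&iv);
    for (n, s) in nonce[4..].iter_mut().zip(rec_seq.to_be_bytes()) {
        *n ^= s;
    }
    nonce
}

fn main() {
    // With a zero static IV, the nonce is just the left-padded sequence number.
    let n = gcm_nonce_tls13([0; 4], [0; 8], 1);
    assert_eq!(n, [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1]);
    // Consecutive records never reuse a nonce under the same key.
    assert_ne!(
        gcm_nonce_tls13([1; 4], [2; 8], 5),
        gcm_nonce_tls13([1; 4], [2; 8], 6)
    );
}
```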

NIC hardware offload: When a capable NIC driver is bound (e.g., a NIC advertising TlsOffloadCaps::TX_AES_GCM_128), the kernel calls the driver's .tls_dev_add() callback, passing the crypto parameters. The NIC encrypts outbound records and/or decrypts inbound records in hardware. Software kTLS is always available as a fallback — the setsockopt() call succeeds even when NIC offload is unavailable or when the key table is full.

TlsOffloadCaps bitfield (reported by driver via NetDeviceInfo during device registration):

bitflags! {
    /// NIC TLS offload capability flags. Reported by driver via NetDeviceInfo.
    /// Each flag indicates per-direction, per-cipher hardware offload capability.
    pub struct TlsOffloadCaps: u32 {
        /// TX offload: AES-128-GCM (TLS 1.3)
        const TX_AES_128_GCM          = 0x0001;
        /// RX offload: AES-128-GCM (TLS 1.3)
        const RX_AES_128_GCM          = 0x0002;
        /// TX offload: AES-256-GCM (TLS 1.3)
        const TX_AES_256_GCM          = 0x0004;
        /// RX offload: AES-256-GCM (TLS 1.3)
        const RX_AES_256_GCM          = 0x0008;
        /// TX offload: ChaCha20-Poly1305 (TLS 1.3)
        const TX_CHACHA20_POLY1305    = 0x0010;
        /// RX offload: ChaCha20-Poly1305 (TLS 1.3)
        const RX_CHACHA20_POLY1305    = 0x0020;
        /// Device supports software-assisted crypto (e.g., Intel QuickAssist)
        const CRYPTO_ENGINE_ASSIST    = 0x0040;
        /// TX/RX offload: both directions for all three TLS 1.3 ciphers
        const FULL_TLS13 = Self::TX_AES_128_GCM.bits()
                         | Self::RX_AES_128_GCM.bits()
                         | Self::TX_AES_256_GCM.bits()
                         | Self::RX_AES_256_GCM.bits()
                         | Self::TX_CHACHA20_POLY1305.bits()
                         | Self::RX_CHACHA20_POLY1305.bits();
    }
}
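The per-connection, per-direction offload decision can be sketched as follows. The helper names are illustrative, and a subset of the capability bits above is mirrored as plain constants so the example is self-contained without the bitflags! macro:

```rust
// Mirror of a subset of TlsOffloadCaps bits (values from the spec above).
const TX_AES_128_GCM: u32 = 0x0001;
const RX_AES_128_GCM: u32 = 0x0002;
const TX_CHACHA20_POLY1305: u32 = 0x0010;

// Cipher type constants from the `cipher_type` module above.
const CIPHER_AES_GCM_128: u16 = 51;
const CIPHER_CHACHA20_POLY1305: u16 = 54;

/// Capability bit required to offload (cipher, direction), or None if the
/// cipher is not hardware-offloadable at all (sketch covers a subset).
fn required_cap(cipher: u16, is_tx: bool) -> Option<u32> {
    match (cipher, is_tx) {
        (CIPHER_AES_GCM_128, true) => Some(TX_AES_128_GCM),
        (CIPHER_AES_GCM_128, false) => Some(RX_AES_128_GCM),
        (CIPHER_CHACHA20_POLY1305, true) => Some(TX_CHACHA20_POLY1305),
        _ => None,
    }
}

/// true → install keys in the NIC (.tls_dev_add()); false → software kTLS.
/// Either way the setsockopt() succeeds: fallback is transparent.
fn use_hw_offload(nic_caps: u32, cipher: u16, is_tx: bool) -> bool {
    required_cap(cipher, is_tx).map_or(false, |bit| nic_caps & bit != 0)
}

fn main() {
    let caps = TX_AES_128_GCM | RX_AES_128_GCM; // NIC offloads AES-128-GCM only
    assert!(use_hw_offload(caps, CIPHER_AES_GCM_128, true));
    assert!(use_hw_offload(caps, CIPHER_AES_GCM_128, false));
    // Unsupported cipher: silently falls back to software kTLS.
    assert!(!use_hw_offload(caps, CIPHER_CHACHA20_POLY1305, true));
}
```

Per the asymmetric-offload rule above, the function is evaluated once per direction, so TX may land in hardware while RX stays in software on the same socket.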

16.15.2 kTLS Position in the TX Pipeline

Software kTLS encryption executes in the socket layer, before TCP segmentation. The TLS ULP (Upper Layer Protocol) intercepts sendmsg(), forms TLS records from the plaintext, encrypts each record, then hands the ciphertext to the TCP layer for segmentation and transmission:

Application: send(fd, plaintext, len)
  → Socket layer: tcp_sendmsg() → TLS ULP intercept (tls_push_data)
    → **kTLS encryption point (software path):**
      Accumulate plaintext into TLS records (up to 16384 bytes per RFC 8446 §5.1):
        1. Construct TLS record header (content type, version, length)
        2. Construct AAD (Additional Authenticated Data) from record sequence number
        3. Encrypt payload: AES-GCM / ChaCha20-Poly1305
        4. Append authentication tag (16 bytes)
        5. Update record sequence number
      The buffer now contains complete, encrypted TLS records.
    → Hand encrypted TLS record data to TCP send path
    → TCP segmentation: split into MSS-sized segments
    → TCP output: tcp_transmit_skb()
      → IP layer: ip_queue_xmit()
        → Routing, netfilter hooks
        → GSO/TSO: if the NIC supports TSO AND kTLS is NOT offloaded,
          GSO splits at TLS record boundaries (not arbitrary MSS boundaries)
          to preserve record integrity.
        → Qdisc: traffic control scheduling
        → NIC driver: ndo_start_xmit()

Why before TCP segmentation: kTLS operates at the TLS record level, which is a higher abstraction than TCP segments. The TLS ULP produces self-contained encrypted records (up to 16384 bytes payload + header + tag). TCP then segments these records into MSS-sized segments for transmission. This matches the layering: TLS records are produced at the application-data boundary, TCP segments are a transport concern.

Why before GSO/TSO: GSO must respect TLS record boundaries when splitting large segments. If encryption happened after GSO, the GSO splitter would need to understand TLS framing -- a layering violation. By encrypting first, each record is self-contained and GSO can split at record boundaries.

Hardware offload path: When TlsOffloadCaps indicates NIC TLS TX support, the encryption step above is SKIPPED. Instead:

- The socket layer constructs TLS record headers and AAD but does NOT encrypt.
- The plaintext segment (with TLS record framing) is passed to the NIC.
- The NIC encrypts in hardware as part of TX DMA, using the key material installed via .tls_dev_add(). TSO operates normally — the NIC handles both segmentation and encryption atomically.
- The sequence number is managed by the NIC hardware (the kernel communicates the initial sequence number during .tls_dev_add()).

Retransmit queue: When kTLS encrypts a segment for transmission, the post-encryption ciphertext is stored in TCP's retransmit queue. Retransmissions use this already-encrypted copy directly — no re-encryption is performed. This avoids both the CPU cost of re-encrypting and the complexity of maintaining the correct TLS record sequence number for retransmitted segments (the ciphertext already contains the correct nonce/sequence).

GSO interaction: Generic Segmentation Offload (GSO) must not split TLS records across segment boundaries. When kTLS is active, the GSO engine inspects TLS record headers in the payload and segments only at record boundaries. If a GSO segment would split a TLS record, the segment boundary is moved to the next record header. This ensures that each GSO-produced segment contains one or more complete TLS records, which the receiver can decrypt independently. The gso_size on kTLS sockets is set to the TLS record payload size (typically 16384 - tag_size) to align segmentation with record boundaries naturally.
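A minimal sketch of record-boundary-respecting segmentation (illustrative, not the actual GSO engine): walk the 5-byte TLS record headers in the ciphertext and stop at the last record boundary that fits within the segment limit.

```rust
/// Sketch: choose a GSO split point that falls on a TLS record boundary.
/// Each TLS record on the wire = 5-byte header (content type, 2-byte
/// version, 16-bit big-endian length) + `length` bytes of ciphertext.
/// Returns the largest record-aligned prefix not exceeding `max_seg`.
fn split_at_record_boundary(buf: &[u8], max_seg: usize) -> usize {
    let mut end = 0;
    while end + 5 <= buf.len() {
        let len = u16::from_be_bytes([buf[end + 3], buf[end + 4]]) as usize;
        let rec_end = end + 5 + len;
        if rec_end > buf.len() || rec_end > max_seg {
            break; // the next record would be split: stop at current boundary
        }
        end = rec_end;
    }
    end
}

fn main() {
    // Two records with 10-byte ciphertext each: 15 bytes on the wire apiece.
    let mut buf = Vec::new();
    for _ in 0..2 {
        // content type 23 (application_data), legacy version 0x0303, len = 10
        buf.extend_from_slice(&[23, 0x03, 0x03, 0, 10]);
        buf.extend_from_slice(&[0u8; 10]);
    }
    assert_eq!(split_at_record_boundary(&buf, 20), 15); // record 2 won't fit
    assert_eq!(split_at_record_boundary(&buf, 30), 30); // both records fit
}
```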

Key rotation: When the TLS library rotates keys (TLS 1.3 key update message), it calls setsockopt(SOL_TLS, TLS_TX, new_crypto_info, ...) again with the new key. If the connection is offloaded to a NIC, UmkaOS calls the driver's .tls_dev_add() with a TLS_OFFLOAD_OP_UPDATE operation. The NIC must update its key atomically (no packet may be encrypted with an old key after the new key is installed). If the NIC cannot perform atomic key rotation, UmkaOS removes the offload and falls back to software kTLS for the remainder of the connection.

Crypto re-attestation on live evolution: When the underlying cipher algorithm implementation is live-evolved (Section 13.18), active kTLS connections using software encryption must re-attest the new implementation before continuing. The evolution framework increments the crypto algorithm's generation counter via the EvolvableComponent callback chain. No bulk iteration of connections occurs. Instead, each kTLS connection detects the version mismatch on its next check_attestation() call (invoked once per TLS record on both TX and RX paths) and re-attests that single connection by re-resolving the AeadTfm handle via crypto_alloc_aead() — this picks up the new algorithm implementation atomically. Re-attestation is therefore amortized across the natural traffic flow: idle connections pay zero cost until they next send or receive data. Connections using NIC hardware offload are unaffected (the NIC's crypto engine is not replaced by kernel evolution). If the new implementation fails allocation or self-test, the affected connection falls back to a full TLS renegotiation initiated by returning EKEYEXPIRED on the next sendmsg()/recvmsg(), prompting the userspace TLS library to renegotiate.
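The generation-counter scheme can be modeled as a toy (names illustrative; the real path re-resolves the AeadTfm handle via crypto_alloc_aead() where the counter increments below). Idle connections pay nothing; an active connection re-attests exactly once after a bump.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Toy model of amortized re-attestation: the evolution framework bumps
/// a global generation counter; each connection compares its cached
/// generation on the next record and re-resolves its transform only then.
/// No bulk iteration over connections occurs.
static CRYPTO_GEN: AtomicU64 = AtomicU64::new(1);

struct KtlsConn {
    attested_gen: u64,
    reattest_count: u32, // stands in for re-resolving the transform handle
}

impl KtlsConn {
    fn check_attestation(&mut self) {
        let g = CRYPTO_GEN.load(Ordering::Acquire);
        if self.attested_gen != g {
            self.reattest_count += 1; // re-resolve the AEAD transform here
            self.attested_gen = g;
        }
    }
}

fn main() {
    let mut conn = KtlsConn { attested_gen: 1, reattest_count: 0 };
    conn.check_attestation();
    assert_eq!(conn.reattest_count, 0); // no evolution: check is free
    CRYPTO_GEN.store(2, Ordering::Release); // live evolution bumps generation
    conn.check_attestation();
    conn.check_attestation();
    assert_eq!(conn.reattest_count, 1); // re-attested exactly once
}
```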

16.15.3 kTLS Session Teardown on Crypto Transform Death

When the kernel crypto API marks a transform as TfmState::Dead (all implementations of the algorithm have been removed — see Section 10.1), any kTLS session using that transform must be torn down gracefully. The transform becomes Dead when reattest_transform() finds no alternative implementation and returns Err(CryptoError::AlgorithmRemoved).

Detection: The kTLS TX and RX paths call check_attestation() before processing each TLS record. When the underlying transform is Dead, check_attestation() propagates CryptoError::AlgorithmRemoved to the kTLS record processing function.

Teardown sequence:

  1. Detect Dead state: On the next sendmsg() or recvmsg() call, the kTLS encryption/decryption path encounters CryptoError::AlgorithmRemoved.

  2. Initiate graceful TLS close_notify: If the TCP connection is still alive and the transform was functional for the last completed record, kTLS sends a TLS close_notify alert (AlertLevel::warning = 1, AlertDescription::close_notify = 0; plaintext alert content [0x01, 0x00] per RFC 8446 Section 6, encrypted by the kTLS encryption engine as a standard TLS record before transmission) using the last valid encryption state (the nonce/key from the most recently completed record). This notifies the remote peer that the TLS session is ending. The close_notify is best-effort — if the transform is already non-functional (key schedule corrupted), the alert is skipped.

  3. Attempt userspace TLS fallback: If the socket has SO_REUSEPORT or the application registered a TCP_ULP removal callback, kTLS removes the kernel TLS offload from the socket and returns EKEYEXPIRED on the current sendmsg()/recvmsg() call. The userspace TLS library (OpenSSL, GnuTLS, rustls) can then:

     - Renegotiate TLS with a different cipher suite.
     - Re-establish the TLS session entirely.
     - Close the connection gracefully.

  4. No fallback configured: If the application has no TLS library fallback (e.g., a simple sendfile() user that relies entirely on kTLS), the sendmsg()/recvmsg() call returns -EIO. The application sees a fatal I/O error on the socket and must close the connection.

  5. Release kernel TLS state: The kTLS context (KtlsTxCtx / KtlsRxCtx) is freed. Key material is zeroized before deallocation. The socket reverts to plain TCP mode (no encryption on the kernel data path). If NIC hardware offload was active, the driver's .tls_dev_del() callback is invoked to release the NIC-side key table entry.

Interaction with record boundaries: Teardown is deferred until the current TLS record is fully processed. The in_record_progress flag (Section 10.1) prevents mid-record teardown. Once the current record completes (successfully or with error), the Dead state is acted upon at the next record boundary.

FMA event: When a kTLS session is torn down due to a Dead transform, the kernel emits an FMA event: FaultEvent::KtlsCryptoRevoked { sock_cookie, cipher, reason: AlgorithmRemoved }. This provides operational visibility into security-driven connection teardowns.

Cross-references:

- Kernel Crypto API (algorithm registry, AeadTfm, AeadRequest IV management, bucket locking): Section 10.1
- Forced drain for dying algorithms (transform lifecycle): Section 10.1
- Live kernel evolution (crypto re-attestation): Section 13.18
- Network stack architecture: Section 16.2


16.16 Network Overlay and Tunneling

Linux problem: Overlay networking (VXLAN, Geneve) was bolted onto the stack over many years. Bridge/veth code is complex and poorly isolated — a bug in the bridge module can crash the kernel.

UmkaOS design:

Tunnel protocols as umka-net modules — Each tunnel type runs as a Tier 1 module and implements a TunnelDevice trait:

/// A network packet passing through the stack. Wraps a `NetBufHandle` with
/// parsed header offsets for L2/L3/L4 layers. Lightweight (pointer + offsets).
pub struct Packet {
    pub buf: NetBufHandle,
    /// Byte offsets from NetBuf data_offset to L2/L3/L4 headers.
    /// u16 matches the canonical NetBuf definition ([Section 16.5](#netbuf-packet-buffer)).
    /// Maximum offset = 65535, sufficient for any packet (jumbo frames ≤ 9KB).
    pub l2_offset: u16,
    pub l3_offset: u16,
    pub l4_offset: u16,
    pub payload_len: u32,
}

/// Per-protocol metadata carried alongside decapsulated packets.
/// Inline enum avoids heap allocation on the hot RX path.
pub struct VxlanMeta {
    pub vni: u32,          // 24-bit virtual network identifier
    pub gbp: u16,          // Group-Based Policy ID (VXLAN-GBP extension, 0 = none)
}

pub struct GeneveMeta {
    pub vni: u32,          // 24-bit virtual network identifier
    pub opt_len: u8,       // length of variable-length options (in 4-byte units)
    /// Geneve TLV options. Inline ArrayVec — 252-byte max per RFC 8926
    /// (63 × 4 bytes). On the RX decap hot path; no heap allocation.
    /// The 253-byte stack footprint (252 data + 1 len) is acceptable
    /// for a per-packet temporary that lives for one packet processing call.
    pub options: ArrayVec<u8, 252>,
}

pub struct GreMeta {
    pub key: u32,          // GRE key (0 = no key)
    pub seq: u32,          // GRE sequence number (0 = not present)
    pub has_key: bool,
    pub has_seq: bool,
}

pub struct IpIpMeta {
    // IP-in-IP carries no extra metadata beyond the outer header;
    // this variant exists for type completeness.
}

pub enum TunnelMeta {
    Vxlan(VxlanMeta),
    Geneve(GeneveMeta),
    Gre(GreMeta),
    IpIp(IpIpMeta),
}

impl TunnelMeta {
    /// Virtual network identifier (24-bit for VXLAN/Geneve, GRE key, 0 for IPIP).
    pub fn vni(&self) -> u32 {
        match self {
            Self::Vxlan(m) => m.vni,
            Self::Geneve(m) => m.vni,
            Self::Gre(m) => m.key,
            Self::IpIp(_) => 0,
        }
    }
}

/// Tunnel device operations. Extends `NetDeviceOps` (defined in
/// [Section 16.13](#network-device-interface-netdevice)) with encapsulation-specific methods.
/// `NetDevice` is a data struct, not a trait — driver operations are separated
/// into `NetDeviceOps` (base) and `TunnelDeviceOps` (tunnel extension).
pub trait TunnelDeviceOps: NetDeviceOps {
    /// Encapsulate an inner packet for transmission through the tunnel.
    fn encap(&self, dev: &NetDevice, inner: &Packet, metadata: &TunnelMeta)
        -> Result<Packet, IoError>;

    /// Decapsulate a received packet, returning the inner packet and metadata.
    /// Returns an inline `TunnelMeta` enum — no heap allocation on the RX path.
    fn decap(&self, dev: &NetDevice, outer: &Packet)
        -> Result<(Packet, TunnelMeta), DecapError>;

    /// Maximum overhead added by encapsulation (for MTU calculation).
    fn encap_overhead(&self) -> usize;
}

/// A tunnel device. Wraps a `NetDevice` (data struct) with tunnel-specific
/// operations and state. Registered via `register_tunnel()`.
pub struct TunnelDevice {
    /// The underlying network device (name, ifindex, queues, stats, etc.).
    pub net_dev: NetDevice,
    /// Tunnel-specific driver operations (encap/decap).
    pub tunnel_ops: &'static dyn TunnelDeviceOps,
}

Supported tunnel protocols:

| Protocol | Description | Use case |
|---|---|---|
| VXLAN | Virtual Extensible LAN (UDP port 8472, IANA 4789) | Cloud overlay, OpenStack |
| Geneve | Generic Network Virtualization Encap | OVN, next-gen cloud overlay |
| GRE/GRE6 | Generic Routing Encapsulation | Site-to-site tunnels |
| IPIP/SIT | IP-in-IP and IPv6-in-IPv4 | IPv6 transition |
| WireGuard | Modern VPN (ChaCha20-Poly1305) | Secure tunnels |

VXLAN default UDP destination port: 8472 (Linux compatibility). The IANA-assigned port is 4789 per RFC 7348, but Linux defaults to 8472 and Docker/Flannel/Calico/Cilium all use 8472. Configurable via dstport parameter at tunnel creation. Boot parameter umka.net.vxlan_default_port overrides the default for all new VXLAN interfaces.
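For MTU bookkeeping, the encap_overhead() contribution of VXLAN can be sketched as follows (function names illustrative; assumed figures: 14-byte outer Ethernet, 20/40-byte outer IPv4/IPv6, 8-byte UDP, 8-byte VXLAN header):

```rust
/// Sketch: VXLAN encapsulation overhead for MTU calculation.
/// Outer Ethernet (14) + outer IP (20 for v4, 40 for v6) + UDP (8)
/// + VXLAN header (8).
fn vxlan_encap_overhead(ipv6_underlay: bool) -> u32 {
    14 + if ipv6_underlay { 40 } else { 20 } + 8 + 8
}

/// MTU advertised on the VXLAN interface: underlay MTU minus overhead.
fn vxlan_inner_mtu(underlay_mtu: u32, ipv6_underlay: bool) -> u32 {
    underlay_mtu - vxlan_encap_overhead(ipv6_underlay)
}

fn main() {
    assert_eq!(vxlan_encap_overhead(false), 50);
    // The familiar 1450-byte MTU seen on Docker/Flannel overlay interfaces.
    assert_eq!(vxlan_inner_mtu(1500, false), 1450);
    assert_eq!(vxlan_inner_mtu(1500, true), 1430);
}
```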

WireGuard tunnel specification (Noise IK handshake, ChaCha20-Poly1305 AEAD, key rotation, roaming) follows the WireGuard protocol specification (wireguard.com/protocol). Kernel integration details are deferred to Phase 3 implementation (Phase 3 is when the full TCP/IP stack and network drivers land; see Section 24.2.3):

- Key storage: Private keys are stored in kernel memory (not accessible from userspace after configuration); keys are zeroized on interface teardown.
- Netlink interface: GENL_FAMILY "wireguard" with WG_CMD_SET_DEVICE, WG_CMD_GET_DEVICE — compatible with wg(8) and wg-quick.
- Namespace interaction: WireGuard interfaces are namespace-aware (can be moved between namespaces like any netdev).
- Rekeying timers: REKEY_AFTER_MESSAGES (2^60), REKEY_AFTER_TIME (120 s), REJECT_AFTER_TIME (180 s) — per upstream protocol spec.
- Deferral rationale: WireGuard is a self-contained protocol module that plugs into the register_tunnel() framework above. No architectural changes are needed — only implementation of the Noise IK state machine and ChaCha20-Poly1305 AEAD (both provided by the crypto subsystem, Section 10.1).

Isolation tier: WireGuard runs as a Tier 1 driver (ring 0, hardware memory-domain isolated via MPK/POE/equivalent). The rationale:

  • WireGuard requires direct access to the network stack's packet path (NetBuf interface) for performance. The ring-crossing overhead of Tier 2 (~5–15 μs per batch) is unacceptable for a cryptographic tunnel that is otherwise a ~1–5 μs per-packet operation; routing every packet through a full user-ring boundary would eliminate the performance advantage of in-kernel tunneling.
  • Tier 1 isolation (hardware memory domain) still confines a WireGuard crash to a driver reload without causing a kernel panic, providing meaningful fault containment at lower cost than a full Tier 2 process boundary.
  • The WireGuard cryptographic state (device private key, peer public keys, session symmetric keys, handshake state machine) lives entirely within the WireGuard Tier 1 isolation domain and is never readable by other Tier 1 drivers or by Tier 0 code. Key zeroization on interface teardown is enforced before the domain is released.

Configuration: WireGuard interfaces are configured via the standard wg(8) / wg-quick(8) userspace tools using Linux-compatible Generic Netlink (GENL_FAMILY "wireguard", WG_CMD_* commands). No API changes from Linux — existing WireGuard tooling works without modification.

Namespace Move Semantics

When a WireGuard interface is moved between network namespaces via RTM_SETLINK / ip link set wg0 netns <ns>:

  1. Cryptographic state preserved: The Noise IK handshake state, session symmetric keys, and replay counters move with the interface. No rekeying is triggered. Existing tunnels continue without interruption.

  2. Peer table preserved: All configured peers, their endpoints, allowed IPs, and persistent keepalive settings remain intact.

  3. Routing re-evaluation: Peer endpoint addresses are re-resolved in the target namespace's routing context. If a peer endpoint (e.g., 10.0.0.1) is unreachable in the new namespace's routing table, packets to that peer are queued until routing changes or the handshake times out (rekey after REKEY_TIMEOUT = 5 seconds).

  4. Firewall context change: Incoming/outgoing packets are processed against the target namespace's netfilter/nftables rules after the move.

  5. Isolation tier unchanged: WireGuard remains a Tier 1 driver (ring 0, hardware memory-domain isolated). Namespace move does not affect isolation boundaries — the driver continues to execute in its own Tier 1 domain regardless of which namespace owns the interface.

16.16.1 GRE (Generic Routing Encapsulation)

GRE encapsulates arbitrary protocols inside IP packets. UmkaOS implements GRE as a Tier 1 tunnel module conforming to TunnelDeviceOps. RFC 2784 defines the base protocol; RFC 2890 adds optional key and sequence number fields for demultiplexing and ordering.

/// GRE tunnel device — encapsulates arbitrary protocols in IP.
/// RFC 2784 (base), RFC 2890 (key + sequence).
pub struct GreTunnel {
    /// Tunnel network device.
    pub dev: Arc<NetDevice>,
    /// Tunnel parameters.
    pub params: GreTunnelParams,
    /// GRE flags (determines which optional fields are present).
    pub flags: GreFlags,
    /// Outgoing sequence number (if SEQUENCE flag set).
    pub o_seqno: AtomicU32,
    /// Expected incoming sequence number.
    pub i_seqno: AtomicU32,
    /// Statistics.
    pub stats: PerCpu<TunnelStats>,
}

pub struct GreTunnelParams {
    /// Local (outer) source IP address.
    pub local: Ipv4Addr,
    /// Remote (outer) destination IP address.
    pub remote: Ipv4Addr,
    /// GRE key (optional, for demultiplexing multiple tunnels to same endpoint).
    pub key: u32,
    /// TTL for outer IP header (0 = inherit from inner).
    pub ttl: u8,
    /// TOS for outer IP header (0 = inherit from inner).
    pub tos: u8,
    /// Link (output device index) for the tunnel.
    pub link: u32,
    /// Encapsulation limit (IPv6 tunnels).
    pub encap_limit: u8,
    /// Protocol (inner payload: ETH_P_IP, ETH_P_IPV6, ETH_P_TEB for GRETAP).
    pub protocol: u16,
}

bitflags! {
    /// GRE header flags (RFC 2784, RFC 2890).
    ///
    /// Values are big-endian bit positions matching the on-wire GRE header
    /// layout and Linux `include/uapi/linux/if_tunnel.h` definitions:
    ///   GRE_CSUM = __cpu_to_be16(0x8000)  // bit 0 (MSB) of flags+version word
    ///   GRE_KEY  = __cpu_to_be16(0x2000)  // bit 2
    ///   GRE_SEQ  = __cpu_to_be16(0x1000)  // bit 3
    ///
    /// The GRE flags+version field is a big-endian u16 on the wire. These
    /// constants represent the big-endian wire values directly. Parsing code
    /// reads the u16 in network byte order and tests against these masks.
    pub struct GreFlags: u16 {
        /// Checksum present in GRE header (bit 0 = MSB of flags word).
        const CHECKSUM = 0x8000;
        /// Key field present (bit 2).
        const KEY      = 0x2000;
        /// Sequence number present (bit 3).
        const SEQUENCE = 0x1000;
    }
}

16.16.1.1 GRE Header Format

 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|C| |K|S| Reserved0       | Ver |         Protocol Type         |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|      Checksum (optional)      |       Reserved1 (optional)    |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                         Key (optional)                        |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                    Sequence Number (optional)                 |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
  • Base header: 4 bytes (no optional fields)
  • With key: 8 bytes
  • With key + sequence: 12 bytes
  • With checksum + key + sequence: 16 bytes (each optional field adds a 4-byte word; the checksum word is 2 bytes checksum + 2 bytes Reserved1)
  • Protocol type: ETH_P_IP (0x0800), ETH_P_IPV6 (0x86DD), ETH_P_TEB (0x6558 for GRETAP)
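The variable header length follows directly from the flag bits. A standalone sketch (the masks mirror the big-endian GreFlags wire constants defined above; the function name is illustrative, not the kernel's):

```rust
// Masks mirror the GreFlags constants above (big-endian wire values).
const GRE_CSUM: u16 = 0x8000;
const GRE_KEY: u16 = 0x2000;
const GRE_SEQ: u16 = 0x1000;

/// GRE header length implied by the flags word: 4-byte fixed part
/// (flags/version + protocol type) plus one 4-byte word per option.
fn gre_header_len(flags: u16) -> usize {
    let mut len = 4;
    if flags & GRE_CSUM != 0 { len += 4; } // checksum + reserved1
    if flags & GRE_KEY != 0 { len += 4; }
    if flags & GRE_SEQ != 0 { len += 4; }
    len
}

fn main() {
    assert_eq!(gre_header_len(0), 4);
    assert_eq!(gre_header_len(GRE_KEY), 8);
    assert_eq!(gre_header_len(GRE_KEY | GRE_SEQ), 12);
    assert_eq!(gre_header_len(GRE_CSUM | GRE_KEY | GRE_SEQ), 16);
}
```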

16.16.1.2 GRE Variants

Variant Device type Inner payload Use case
GRE (ip_gre) gre0, greN IP packets (L3) Point-to-point IP tunnel
GRETAP (gretap) gretapN Ethernet frames (L2) L2 bridge over IP
IP6GRE ip6greN IP packets over IPv6 outer GRE with IPv6 transport
IP6GRETAP ip6gretapN Ethernet over IPv6 outer L2 over IPv6
ERSPAN erspanN Mirrored frames with ERSPAN header Network monitoring

16.16.1.3 ERSPAN (Encapsulated Remote SPAN)

ERSPAN extends GRE with session metadata for remote port mirroring. Used by network monitoring infrastructure to capture traffic from one switch/host and deliver it to a remote analyzer.

/// Parsed representation of an ERSPAN header. Wire encoding/decoding is
/// performed by `erspan_parse()`/`erspan_build()`. The actual wire format
/// is a packed bitfield (version:4 + VLAN:12 + COS:3 + Encap:2 + T:1 +
/// Session ID:10 = 32 bits), not individual struct fields.
pub struct ErspanHeader {
    /// Version: 1 (Type II) or 2 (Type III).
    pub version: u8,
    /// VLAN ID of original frame.
    pub vlan: u16,
    /// COS (Class of Service) from original frame.
    pub cos: u8,
    /// Truncated flag.
    pub truncated: u8, // 0 = false, 1 = true
    /// Session ID (0-1023).
    pub session_id: u16,
    /// Port index (ERSPAN Type II) or hardware ID + direction (Type III).
    pub index: u32,
    /// Timestamp (ERSPAN Type III only, 32-bit granularity).
    pub timestamp: u32,
}
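The packed-word layout noted in the struct comment can be sketched as a standalone encode/decode pair. Function names here are hypothetical; the real erspan_build()/erspan_parse() operate on NetBuf:

```rust
// Pack/unpack the 32-bit ERSPAN Type II field described above:
// version:4 | vlan:12 | cos:3 | encap:2 | truncated:1 | session_id:10.
fn erspan_pack(version: u8, vlan: u16, cos: u8, encap: u8,
               truncated: bool, session: u16) -> u32 {
    ((version as u32 & 0xF) << 28)
        | ((vlan as u32 & 0xFFF) << 16)
        | ((cos as u32 & 0x7) << 13)
        | ((encap as u32 & 0x3) << 11)
        | ((truncated as u32) << 10)
        | (session as u32 & 0x3FF)
}

fn erspan_unpack(w: u32) -> (u8, u16, u8, u8, bool, u16) {
    (
        (w >> 28) as u8,           // version
        ((w >> 16) & 0xFFF) as u16, // vlan
        ((w >> 13) & 0x7) as u8,    // cos
        ((w >> 11) & 0x3) as u8,    // encap
        ((w >> 10) & 1) == 1,       // truncated
        (w & 0x3FF) as u16,         // session_id
    )
}

fn main() {
    let w = erspan_pack(1, 100, 3, 0, false, 42);
    assert_eq!(erspan_unpack(w), (1, 100, 3, 0, false, 42));
}
```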

16.16.2 IPIP (IP-in-IP Encapsulation)

IPIP is the simplest tunneling protocol: an IP packet encapsulated directly inside another IP packet with no additional header. Protocol number 4 (IPPROTO_IPIP). RFC 2003.

/// IPIP tunnel — simplest IP tunnel: IP packet inside IP packet.
/// RFC 2003. Protocol number 4 (IPPROTO_IPIP).
pub struct IpIpTunnel {
    pub dev: Arc<NetDevice>,
    pub local: Ipv4Addr,
    pub remote: Ipv4Addr,
    pub ttl: u8,
    pub tos: u8,
    pub link: u32,
    pub stats: PerCpu<TunnelStats>,
}
  • 20 bytes overhead (outer IP header only, no encapsulation header)
  • Lowest overhead of all tunneling protocols
  • No key/demux: one tunnel per (local, remote) pair
  • Used by: Mobile IP (RFC 5944), IPVS/LVS tunneling mode (LVS-TUN)

16.16.3 SIT (Simple Internet Transition)

SIT tunnels carry IPv6 packets over an IPv4 transport, providing IPv6 transition mechanisms. RFC 4213 defines the base 6in4 encapsulation; RFC 3056 specifies automatic 6to4 tunneling; RFC 5969 defines ISP-managed 6rd (IPv6 Rapid Deployment).

/// SIT tunnel — IPv6-in-IPv4 for 6in4 transition.
/// RFC 4213 (6in4), RFC 3056 (6to4), RFC 5969 (6rd).
pub struct SitTunnel {
    pub dev: Arc<NetDevice>,
    pub local: Ipv4Addr,
    pub remote: Ipv4Addr,  // 0.0.0.0 for 6to4/6rd (derived from inner IPv6)
    pub ttl: u8,
    pub tos: u8,
    /// 6rd parameters (if 6rd mode).
    pub ip6rd: Option<Ip6rdParams>,
    pub stats: PerCpu<TunnelStats>,
}

/// 6rd (IPv6 Rapid Deployment) parameters.
pub struct Ip6rdParams {
    /// 6rd prefix.
    pub prefix: Ipv6Addr,
    /// 6rd prefix length.
    pub prefixlen: u32,
    /// IPv4 common prefix length (bits shared by all endpoints).
    pub relay_prefixlen: u32,
    /// IPv4 relay address.
    pub relay_prefix: Ipv4Addr,
}

SIT variants:

Mode Remote Description
6in4 Specific IPv4 Point-to-point IPv6 tunnel over IPv4 (RFC 4213)
6to4 0.0.0.0 (derived) Automatic: derive IPv4 from 2002::/16 prefix (RFC 3056)
6rd 0.0.0.0 (derived) ISP-assigned prefix + embedded IPv4 (RFC 5969)
ISATAP 0.0.0.0 Intra-Site Automatic Tunnel (RFC 5214)
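The 6to4 endpoint derivation in the table above follows RFC 3056: the peer's IPv4 address is embedded in bits 16..48 of a 2002::/16 destination. A standalone sketch (function name hypothetical):

```rust
// Derive the IPv4 tunnel endpoint from a 6to4 destination address.
// Returns None for non-2002::/16 addresses (fall back to configured remote).
fn six_to_four_endpoint(dst: [u8; 16]) -> Option<[u8; 4]> {
    if dst[0] == 0x20 && dst[1] == 0x02 {
        // Bytes 2..6 of a 2002::/16 address are the embedded IPv4 address.
        Some([dst[2], dst[3], dst[4], dst[5]])
    } else {
        None
    }
}

fn main() {
    // 2002:c000:0204:: embeds 192.0.2.4 (0xc0000204).
    let mut addr = [0u8; 16];
    addr[0] = 0x20; addr[1] = 0x02;
    addr[2] = 0xc0; addr[3] = 0x00; addr[4] = 0x02; addr[5] = 0x04;
    assert_eq!(six_to_four_endpoint(addr), Some([192, 0, 2, 4]));
    assert_eq!(six_to_four_endpoint([0u8; 16]), None);
}
```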

16.16.4 Common Tunnel Infrastructure

All tunnel types (GRE, IPIP, SIT, VXLAN, Geneve, WireGuard) share common infrastructure for statistics, management, and MTU handling.

/// Shared statistics for all tunnel types.
/// TunnelStats tracks per-tunnel encap/decap overhead (packets, bytes, errors).
/// `NetDevOps::get_stats64()` on a tunnel netdev aggregates TunnelStats into
/// `NetDevStats` for userspace visibility (`ip -s link show`, `/proc/net/dev`).
/// The aggregation sums per-CPU TunnelStats and copies to the corresponding
/// NetDevStats fields (tx_packets, tx_bytes, tx_errors, rx_packets, rx_bytes,
/// rx_errors). `rx_dropped` / `tx_dropped` in NetDevStats are computed from
/// tunnel-layer drops (e.g., decap failures, queue overflows) not tracked in
/// TunnelStats — these are maintained as separate per-CPU atomic counters in
/// the tunnel device.
pub struct TunnelStats {
    pub tx_packets: u64,
    pub tx_bytes: u64,
    pub tx_errors: u64,
    pub rx_packets: u64,
    pub rx_bytes: u64,
    pub rx_errors: u64,
}
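The get_stats64() aggregation described above amounts to a per-CPU fold. A minimal sketch, with a plain slice standing in for PerCpu iteration:

```rust
// Sum per-CPU TunnelStats snapshots into one totals struct. Only the six
// counters from the TunnelStats definition above are shown.
#[derive(Default, Clone, Copy)]
struct TunnelStats {
    tx_packets: u64, tx_bytes: u64, tx_errors: u64,
    rx_packets: u64, rx_bytes: u64, rx_errors: u64,
}

fn aggregate(per_cpu: &[TunnelStats]) -> TunnelStats {
    per_cpu.iter().fold(TunnelStats::default(), |mut t, s| {
        t.tx_packets += s.tx_packets; t.tx_bytes += s.tx_bytes;
        t.tx_errors += s.tx_errors;   t.rx_packets += s.rx_packets;
        t.rx_bytes += s.rx_bytes;     t.rx_errors += s.rx_errors;
        t
    })
}

fn main() {
    let cpu0 = TunnelStats { tx_packets: 10, tx_bytes: 1000, ..Default::default() };
    let cpu1 = TunnelStats { tx_packets: 5, tx_bytes: 500, rx_errors: 1, ..Default::default() };
    let total = aggregate(&[cpu0, cpu1]);
    assert_eq!(total.tx_packets, 15);
    assert_eq!(total.tx_bytes, 1500);
    assert_eq!(total.rx_errors, 1);
}
```

In the real path the per-CPU copies are read without locking (each CPU writes only its own slot), so the fold yields a consistent-enough snapshot for `ip -s link show`.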

16.16.4.1 Netlink Management

All tunnels are created and managed via netlink (Linux iproute2 compatible):

  • ip tunnel add gre1 mode gre remote 10.0.0.1 local 10.0.0.2 key 100
  • ip link add gretap1 type gretap remote 10.0.0.1 local 10.0.0.2
  • ip tunnel add sit1 mode sit remote 203.0.113.1 local 198.51.100.1
  • ip tunnel add ipip1 mode ipip remote 10.0.0.1 local 10.0.0.2

Netlink attributes: IFLA_GRE_LOCAL, IFLA_GRE_REMOTE, IFLA_GRE_KEY, IFLA_GRE_FLAGS, IFLA_IPTUN_LOCAL, IFLA_IPTUN_REMOTE, etc.

16.16.4.2 MTU Handling

  • GRE: outer_mtu - 20 (IP) - 4..16 (GRE header, depending on checksum/key/sequence fields) = inner MTU
  • IPIP: outer_mtu - 20 (IP) = inner MTU
  • SIT: outer_mtu - 20 (IP) = inner MTU
  • PMTUD: tunnel sets DF bit on outer header; ICMP "too big" received adjusts inner MTU
  • Fragmentation: if inner packet exceeds tunnel MTU and DF is not set on inner, fragment inner packet before encapsulation (never fragment outer — causes reassembly issues at the remote endpoint)

16.16.4.3 Tunnel Namespace Scoping

  • Each tunnel device belongs to a network namespace
  • Tunnel endpoints (local/remote) use the namespace's routing table
  • Cross-namespace tunnels: create in one namespace, move to another via RTM_SETLINK (same mechanism as veth namespace moves)

16.16.4.4 Tunnel Cross-References

16.16.5 Software L2 Switch (Bridge)

A Linux bridge equivalent in umka-net, supporting:

  • STP (Spanning Tree Protocol) for loop prevention
  • VLAN filtering (802.1Q tag-aware forwarding)
  • FDB (Forwarding Database) learning with configurable aging
  • Per-port traffic shaping
  • Hairpin mode for VM-behind-bridge scenarios

16.16.5.1 Bridge ↔ Namespace Interaction

The bridge is the primary mechanism for connecting containers (network namespaces) to each other and to the external network. Understanding the bridge-namespace relationship is essential for Docker bridge-mode networking and Kubernetes CNI plugin compatibility.

Bridge device ownership: A bridge device (br0, docker0, cni0) lives in exactly one network namespace — typically the host (root) namespace. The bridge itself is a NetDevice with BridgeOps as its driver operations. It cannot span multiple namespaces directly; cross-namespace connectivity is achieved through veth pairs whose ends are in different namespaces.

Bridge port model: Each port is a NetDevice attached to the bridge via RTM_SETLINK (ip link set dev veth0 master br0). Bridge ports can be:

  • Physical NICs in the same namespace as the bridge
  • Veth ends whose peers are in different namespaces (the common container case)
  • VLAN sub-interfaces
  • Tunnel device endpoints (VXLAN, Geneve)

The canonical BridgePort struct is defined in Section 16.13. It includes dev, port_no, stp_state, vlans, pvid, and additional fields. This section uses the same type; BridgePortFlags below extends the canonical definition with bitflags for hairpin mode, BPDU guard, etc.

/// Bridge port flags.
bitflags::bitflags! {
    pub struct BridgePortFlags: u32 {
        /// Hairpin mode: allow packet to exit the same port it entered.
        /// Required when VMs or containers behind a bridge need to
        /// communicate with each other via the bridge (the packet enters
        /// on veth-host and must exit on the same port to reach another
        /// VM behind the same veth).
        const HAIRPIN        = 1 << 0;
        /// Enable BPDU guard (drop BPDUs on this port).
        const BPDU_GUARD     = 1 << 1;
        /// Root guard: prevent this port from becoming root port.
        const ROOT_BLOCK     = 1 << 2;
        /// Flood unknown unicast to this port.
        const FLOOD          = 1 << 3;
        /// Learning enabled (update FDB on incoming frames).
        const LEARNING       = 1 << 4;
        /// Proxy ARP enabled.
        const PROXYARP       = 1 << 5;
    }
}

/// STP port state.
#[repr(u8)]
pub enum StpPortState {
    Disabled   = 0,
    Listening  = 1,
    Learning   = 2,
    Forwarding = 3,
    Blocking   = 4,
}

FDB (Forwarding Database) — MAC learning and lookup:

When a frame arrives on a bridge port, the bridge learns the source MAC → port association and stores it in the FDB. Subsequent frames destined for that MAC are forwarded directly to the learned port instead of being flooded to all ports.

The canonical BridgeFdbEntry struct is defined in Section 16.13. It includes mac, port_no, is_static, last_seen, and vlan_id. This section uses the same BridgeFdbEntry type via RcuHashMap<([u8; 6], u16), BridgeFdbEntry>.

The FDB is an RcuHashMap<([u8; 6], u16), BridgeFdbEntry> keyed by (MAC, VLAN ID). Readers (per-packet forwarding lookup) use RCU — no locks on the fast path. Writers (learning, aging, static configuration) acquire a per-bridge mutex.
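A minimal sketch of the learn/lookup flow, with a plain HashMap standing in for RcuHashMap (no RCU, aging, or static entries shown):

```rust
use std::collections::HashMap;

type Mac = [u8; 6];

/// FDB keyed by (MAC, VLAN ID), as in the text; the value here is just
/// the learned port number.
#[derive(Default)]
struct Fdb {
    map: HashMap<(Mac, u16), u32>,
}

impl Fdb {
    /// Learn: record source MAC -> ingress port on frame arrival.
    fn learn(&mut self, mac: Mac, vlan: u16, port: u32) {
        self.map.insert((mac, vlan), port);
    }
    /// Forwarding lookup; None means flood (unknown unicast).
    fn lookup(&self, mac: Mac, vlan: u16) -> Option<u32> {
        self.map.get(&(mac, vlan)).copied()
    }
}

fn main() {
    let mut fdb = Fdb::default();
    fdb.learn([0xde, 0xad, 0xbe, 0xef, 0, 1], 10, 3);
    assert_eq!(fdb.lookup([0xde, 0xad, 0xbe, 0xef, 0, 1], 10), Some(3));
    // Same MAC on a different VLAN is a distinct FDB entry.
    assert_eq!(fdb.lookup([0xde, 0xad, 0xbe, 0xef, 0, 1], 20), None);
}
```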

Cross-namespace forwarding path:

Frame arrives on bridge port (e.g., physical NIC eth0, host namespace):
  1. Learn: FDB.insert(src_mac → port_eth0)
  2. Lookup: dst_entry = FDB.lookup(dst_mac)
  3. If dst_entry found:
       forward to dst_entry.port (e.g., veth-host)
         → veth-host.ndo_start_xmit(frame)
         → peer = rcu_dereference(veth-host.peer)  // peer = eth0 in container ns
         → dev_forward_skb(peer, frame)
           → scrub metadata (mark, priority, tstamp)
           → netif_rx(frame) on peer's backlog
         → Frame is now in the container's namespace, delivered to its protocol stack
  4. If dst_entry not found (unknown unicast):
       flood to all ports in FORWARDING state (except ingress port)
  5. If dst_mac is broadcast/multicast:
       flood to all ports (including local delivery if bridge has an IP)

NAT/masquerade on bridge: Netfilter hooks (NF_BR_PRE_ROUTING, NF_BR_FORWARD, NF_BR_POST_ROUTING, NF_BR_LOCAL_IN, NF_BR_LOCAL_OUT) are invoked at each stage of bridge processing, matching Linux's br_netfilter behavior. This enables:

  • NF_BR_FORWARD (the ebtables FORWARD chain): filter/mangle packets traversing the bridge between ports.
  • iptables/nftables rules applied to bridged traffic via br_netfilter physdev matching (-m physdev --physdev-in veth0).
  • DNAT/SNAT rules for port forwarding from host → container (Docker -p flag): applied at NF_BR_PRE_ROUTING (DNAT) and NF_BR_POST_ROUTING (SNAT/masquerade).

Hairpin mode: When BridgePortFlags::HAIRPIN is set on a port, the bridge allows a frame to exit the same port it entered. Without hairpin mode, such frames are silently dropped (standard 802.1D behavior). Hairpin mode is required when multiple VMs or containers behind the same veth (via macvlan or similar) need to communicate through the bridge. Enabled per-port via netlink: ip link set dev veth-host type bridge_slave hairpin on.
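The per-port egress decision including hairpin can be sketched as a small predicate (names hypothetical, simplified port model):

```rust
/// May a frame that entered on `ingress_port` be sent out `egress_port`?
/// Standard 802.1D drops frames whose egress equals their ingress port;
/// the HAIRPIN flag lifts that restriction. The port must also be in the
/// STP Forwarding state.
fn may_egress(ingress_port: u32, egress_port: u32,
              egress_forwarding: bool, egress_hairpin: bool) -> bool {
    egress_forwarding && (egress_port != ingress_port || egress_hairpin)
}

fn main() {
    assert!(!may_egress(1, 1, true, false)); // hairpin off: silently dropped
    assert!(may_egress(1, 1, true, true));   // hairpin on: allowed back out
    assert!(may_egress(1, 2, true, false));  // normal forwarding
    assert!(!may_egress(1, 2, false, false)); // port not in Forwarding state
}
```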


16.16.6 Veth (Virtual Ethernet Pairs)

Veth LinkOps registration: Registered via register_link_ops("veth", VethLinkOps) during network stack init. The create() callback allocates two peer NetDevices, links them via peer: RcuCell<Arc<VethDevice>> fields. Namespace move updates dev.net_ns and re-inserts into the target namespace's InterfaceTable XArray.

/// Create a veth pair, one end in each namespace.
fn create_veth_pair(
    name_a: &str, name_b: &str,
    net_ns_a: &NetNamespace, net_ns_b: &NetNamespace,
) -> Result<(Arc<NetDevice>, Arc<NetDevice>)>;

Veth pairs provide bidirectional packet delivery between two network namespaces. Each end is a full NetDevice that can be assigned to any namespace, have IP addresses, participate in bridges, and run XDP programs. Required for Docker bridge-mode networking and Kubernetes pod networking.

/// Veth pair: two linked virtual Ethernet devices.
/// Creating a veth pair yields two NetDevice handles (peer_a, peer_b).
/// Packets transmitted on peer_a appear as received on peer_b, and vice versa.
///
/// Created via netlink: `ip link add veth0 type veth peer name veth1`
/// Namespace move: `ip link set veth1 netns <ns>`
pub struct VethDevice {
    /// Standard NetDevice fields (name, ifindex, MAC, namespace, etc.).
    pub dev: NetDevice,

    /// RCU-protected peer reference. The peer may be in a different
    /// namespace. `None` if the peer has been destroyed (orphaned veth).
    /// RCU protection: readers (TX path) dereference without locking;
    /// writers (namespace move, teardown) synchronize via rcu_synchronize.
    pub peer: RcuCell<Arc<VethDevice>>,

    /// Per-CPU RX statistics (packets received from peer's TX).
    pub rx_stats: PerCpu<VethStats>,
    /// Per-CPU TX statistics (packets sent to peer's RX).
    pub tx_stats: PerCpu<VethStats>,

    /// XDP program attached to this end (runs on "receive" from peer).
    pub xdp_prog: RcuCell<Option<Arc<BpfProg>>>,
}

pub struct VethStats {
    pub packets: u64,
    pub bytes: u64,
    pub drops: u64,
}

Packet delivery (TX → peer RX):

veth0.ndo_start_xmit(packet)
  → peer = rcu_dereference(veth0.peer)
  → if peer is None: drop packet (orphaned), return
  → if peer has XDP program:
      → run XDP program on packet (in TX softirq context)
      → XDP_PASS: continue delivery
      → XDP_DROP: drop packet, increment peer.rx_stats.drops
      → XDP_REDIRECT: redirect via xdp_do_redirect()
      → XDP_TX: "transmit" back = deliver to veth0's own RX
  → dev_forward_skb(peer.dev, packet)
      → scrub packet metadata (clear skb->mark, skb->priority,
        skb->tstamp) to prevent cross-namespace information leaks
      → set packet.dev = peer.dev
      → netif_rx(packet) → enqueue on peer's CPU backlog
  → increment veth0.tx_stats, peer.rx_stats

Cross-namespace isolation:

  • dev_forward_skb() scrubs all fields that could leak information between namespaces: skb->mark, skb->priority, skb->tstamp, skb->tc_index, connection tracking state.
  • Cgroup accounting re-evaluation: When a packet crosses namespace boundaries via dev_forward_skb() (veth, bridge forwarding, macvlan), the packet's cgroup association is re-evaluated at the receiving end. The callsite path is: dev_forward_skb() -> netif_receive_buf_internal() -> ip_rcv() -> transport lookup (tcp_v4_rcv()/udp_rcv()) -> socket found -> packet associated with sock.sk_cgroup. The cgroup transition happens at socket lookup, not at dev_forward_skb() itself. The receiving namespace performs socket lookup against its own socket table; the matched socket's cgroup determines the accounting target. Ingress bytes are charged to the receiving cgroup, not the sending cgroup. For packets with no matching socket (e.g., broadcast, ARP), ingress accounting falls to the receiving namespace's default network cgroup. This ensures that container-to-container traffic via veth is correctly attributed to each container's resource limits (Section 17.2).
  • Each end of the veth pair has its own NetNamespace reference. Routing, iptables rules, and socket lookups use the receiving end's namespace.
  • MAC addresses are independently assigned per end (random by default).

Namespace move: Either end can be moved to a different namespace via RTM_SETLINK netlink message (equivalent to ip link set veth1 netns <ns>). The move operation:

  1. Unregister the device from the source namespace's netdev list.
  2. Update dev.net_ns to point to the target namespace.
  3. Register in the target namespace's netdev list.
  4. The peer reference remains valid (it points to the device, not the namespace).
  5. If the target namespace is destroyed, veth devices in it are deleted, which triggers peer destruction (both ends go down).

Teardown: Destroying either end destroys both. When one end is unregistered (namespace teardown or ip link del), the peer detects it via the NETDEV_UNREGISTER notifier, sets its own peer = None, and unregisters itself.

Performance: Veth delivery is zero-copy for the packet buffer itself (the NetBufHandle is transferred, not copied). Header scrubbing is O(1) — just clearing a few fields. The per-CPU backlog enqueue is the main cost (~200-500ns per packet including the softirq netif_rx processing).

XDP on veth: XDP programs can be attached to either end. The XDP program runs in the peer's TX softirq context (not the receiving end's context), which means it executes before dev_forward_skb(). This enables early filtering/redirection without the cost of full protocol stack processing. XDP_REDIRECT on veth enables efficient chaining: veth0 → XDP redirect → physical NIC, bypassing the TCP/IP stack entirely (used by Cilium for Kubernetes pod networking).

macvlan/ipvlan: Lightweight container networking without bridges. macvlan assigns unique MACs per container; ipvlan shares the parent MAC and routes by IP.

VRF (Virtual Routing and Forwarding) — L3 domain isolation for multi-tenant routing. Each VRF has its own routing table and forwarding decisions, enabling multiple tenants to use overlapping IP ranges on the same host.

Hardware offload — Tunnel encap/decap can be offloaded to NIC hardware via KABI. This is the equivalent of Linux TC flower offload:

  • NIC firmware handles VXLAN/Geneve encap/decap in hardware
  • umka-net falls back to the software path transparently if the NIC lacks offload support
  • Offload rules are programmed via the same TunnelDevice trait (the NIC driver implements the trait with hardware acceleration)

XDP integration — XDP programs can inspect inner headers of tunneled packets via a "decap-before-XDP" mode:

  • Because XDP runs in the NIC driver before reaching umka-net, XDP programs that need to see inner headers must explicitly call a BPF helper (e.g., bpf_xdp_decap()).
  • This helper invokes the NIC's hardware offload or a fast-path software decapsulator to strip the tunnel headers.
  • The XDP program then sees the inner (original) packet headers.
  • This allows filtering/load-balancing decisions based on inner flow information.
  • This avoids the Linux problem where XDP programs must manually parse tunnel headers.

Container networking compatibility — Docker bridge network mode and Kubernetes CNI plugins (Calico, Cilium, Flannel) must work without modification. This requires:

  • veth pair creation via netlink
  • Bridge port management via netlink
  • VXLAN device creation via netlink
  • iptables/nftables rules for masquerade and port mapping

All of these are covered by the netlink subsystem (Section 16.17) and BPF-based packet filtering (Section 16.18).

RTNL lock serialization: Veth create, namespace-move, and bridge-attach operations acquire the RTNL lock (network configuration mutex). RTNL serializes all network device topology changes. Lock ordering: RTNL < per-device locks. RTNL is held during the entire create→move→attach sequence to ensure atomic container network setup.

MASQUERADE NAT: Translated to a BPF program that rewrites source IP to the outgoing interface's primary address. The BPF program is auto-generated at rule install time and attached to the TC egress hook on the masquerade interface. Dynamic IP changes (DHCP) trigger program regeneration.
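One cost the generated masquerade program must pay: rewriting the source IP invalidates the IPv4 header checksum. The sketch below shows a full recompute for clarity; the generated BPF would typically use an incremental update (RFC 1624) instead, and must also fix the TCP/UDP checksum, which covers a pseudo-header including the source address:

```rust
/// One's-complement sum over 16-bit big-endian words, folded and inverted
/// (the standard IPv4 header checksum). The checksum field itself must be
/// zeroed before recomputation.
fn ipv4_header_checksum(header: &[u8]) -> u16 {
    let mut sum: u32 = 0;
    for chunk in header.chunks(2) {
        let hi = chunk[0] as u32;
        let lo = *chunk.get(1).unwrap_or(&0) as u32;
        sum += (hi << 8) | lo;
    }
    // Fold carries back into the low 16 bits.
    while sum >> 16 != 0 {
        sum = (sum & 0xFFFF) + (sum >> 16);
    }
    !(sum as u16)
}

fn main() {
    // Well-known example header with the checksum field (offset 10..12) zeroed;
    // the correct checksum is 0xb1e6.
    let hdr: [u8; 20] = [
        0x45, 0x00, 0x00, 0x3c, 0x1c, 0x46, 0x40, 0x00,
        0x40, 0x06, 0x00, 0x00, 0xac, 0x10, 0x0a, 0x63,
        0xac, 0x10, 0x0a, 0x0c,
    ];
    assert_eq!(ipv4_header_checksum(&hdr), 0xb1e6);
}
```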

16.17 Netlink

Netlink is the primary kernel-userspace IPC mechanism for network configuration. Docker, Kubernetes CNI plugins (Calico, Cilium, Flannel), iproute2 (ip route, ip addr, ip link), and NetworkManager all depend on netlink. UmkaOS implements netlink as a socket family within umka-net.

Socket family: AF_NETLINK sockets are created via socket(AF_NETLINK, SOCK_DGRAM, protocol). Each protocol family controls a different subsystem:

Protocol Purpose Capability required
NETLINK_ROUTE Routes, addresses, links, neighbors, rules CAP_NET_ADMIN for writes; reads are unprivileged
NETLINK_AUDIT Audit event delivery (see Section 20.2.9) CAP_AUDIT_READ
NETLINK_KOBJECT_UEVENT Device hotplug events (udev) Unprivileged (receive only)
NETLINK_GENERIC Generic extensible netlink (genetlink) Per-family capability check
NETLINK_NETFILTER Conntrack entry dump/delete/event, nftables rule management. Required by Docker, Kubernetes kube-proxy, and conntrack-tools. CAP_NET_ADMIN
NETLINK_XFRM IPsec SA/SP management (xfrm_user). Used by strongSwan, libreswan, and iproute2 ip xfrm. See Section 16.22. CAP_NET_ADMIN

Message format: Every netlink message starts with an nlmsghdr (16 bytes):

/// Netlink message header (matches Linux struct nlmsghdr exactly).
#[repr(C)]
pub struct NlMsgHdr {
    /// Total message length including header.
    pub nlmsg_len: u32,
    /// Message type (RTM_NEWROUTE, RTM_DELADDR, etc.).
    pub nlmsg_type: u16,
    /// Flags (NLM_F_REQUEST, NLM_F_DUMP, NLM_F_ACK, etc.).
    pub nlmsg_flags: u16,
    /// Sequence number (for request/response matching).
    pub nlmsg_seq: u32,
    /// Sending process port ID (0 = kernel).
    pub nlmsg_pid: u32,
}
const_assert!(size_of::<NlMsgHdr>() == 16);

Messages are followed by type-specific payload structs and nested TLV attributes (rtattr). The payload structs match Linux UAPI exactly (#[repr(C)], same field order and sizes):

/// Link information message. Payload for RTM_NEWLINK/RTM_DELLINK/RTM_GETLINK.
/// Matches Linux `struct ifinfomsg` (include/uapi/linux/rtnetlink.h).
#[repr(C)]
pub struct IfInfoMsg {
    pub ifi_family: u8,     // AF_UNSPEC for most queries
    pub _pad: u8,
    pub ifi_type: u16,      // ARPHRD_* device type
    pub ifi_index: i32,     // interface index (0 = unspecified)
    pub ifi_flags: u32,     // IFF_* flags
    pub ifi_change: u32,    // IFF_* change mask
}
const_assert!(size_of::<IfInfoMsg>() == 16);

/// Address information message. Payload for RTM_NEWADDR/RTM_DELADDR/RTM_GETADDR.
/// Matches Linux `struct ifaddrmsg` (include/uapi/linux/if_addr.h).
#[repr(C)]
pub struct IfAddrMsg {
    pub ifa_family: u8,     // AF_INET or AF_INET6
    pub ifa_prefixlen: u8,  // prefix length (e.g., 24 for /24)
    pub ifa_flags: u8,      // IFA_F_* flags
    pub ifa_scope: u8,      // RT_SCOPE_* address scope
    pub ifa_index: u32,     // interface index
}
const_assert!(size_of::<IfAddrMsg>() == 8);

/// Route message. Payload for RTM_NEWROUTE/RTM_DELROUTE/RTM_GETROUTE.
/// Matches Linux `struct rtmsg` (include/uapi/linux/rtnetlink.h).
#[repr(C)]
pub struct RtMsg {
    pub rtm_family: u8,    // AF_INET or AF_INET6
    pub rtm_dst_len: u8,   // destination prefix length
    pub rtm_src_len: u8,   // source prefix length (policy routing)
    pub rtm_tos: u8,       // TOS filter
    pub rtm_table: u8,     // routing table ID (RT_TABLE_*)
    pub rtm_protocol: u8,  // routing protocol (RTPROT_*)
    pub rtm_scope: u8,     // route scope (RT_SCOPE_*)
    pub rtm_type: u8,      // route type (RTN_*)
    pub rtm_flags: u32,    // RTM_F_* flags
}
const_assert!(size_of::<RtMsg>() == 12);

/// Neighbor (ARP/NDP) message. Payload for RTM_NEWNEIGH/RTM_DELNEIGH/RTM_GETNEIGH.
/// Matches Linux `struct ndmsg` (include/uapi/linux/neighbour.h).
#[repr(C)]
pub struct NdMsg {
    pub ndm_family: u8,    // AF_INET or AF_INET6
    pub ndm_pad1: u8,
    pub ndm_pad2: u16,
    pub ndm_ifindex: i32,  // interface index
    pub ndm_state: u16,    // NUD_* neighbor state
    pub ndm_flags: u8,     // NTF_* flags
    pub ndm_type: u8,      // RTN_* type
}
const_assert!(size_of::<NdMsg>() == 12);

/// Policy routing rule header. Payload for RTM_NEWRULE/RTM_DELRULE.
/// Matches Linux `struct fib_rule_hdr` (include/uapi/linux/fib_rules.h).
#[repr(C)]
pub struct FibRuleHdr {
    pub family: u8,
    pub dst_len: u8,
    pub src_len: u8,
    pub tos: u8,
    pub table: u8,
    pub res1: u8,         // reserved, must be zero
    pub res2: u8,         // reserved, must be zero
    pub action: u8,       // FR_ACT_*
    pub flags: u32,       // FIB_RULE_*
}
const_assert!(size_of::<FibRuleHdr>() == 12);
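As an illustration of the wire layouts above, a userspace-style sketch that serializes an RTM_GETLINK dump request: a 16-byte nlmsghdr followed by a 16-byte ifinfomsg. The constants are the Linux UAPI values; the builder function is hypothetical:

```rust
const RTM_GETLINK: u16 = 18;
const NLM_F_REQUEST: u16 = 0x0001;
const NLM_F_DUMP: u16 = 0x0300; // NLM_F_ROOT | NLM_F_MATCH
const AF_UNSPEC: u8 = 0;

/// Serialize a "dump all links" request. Netlink headers use host byte
/// order on the socket, hence to_ne_bytes.
fn build_getlink_dump(seq: u32) -> Vec<u8> {
    let len: u32 = 16 + 16; // nlmsghdr + ifinfomsg, no attributes
    let mut buf = Vec::with_capacity(len as usize);
    // nlmsghdr
    buf.extend_from_slice(&len.to_ne_bytes());
    buf.extend_from_slice(&RTM_GETLINK.to_ne_bytes());
    buf.extend_from_slice(&(NLM_F_REQUEST | NLM_F_DUMP).to_ne_bytes());
    buf.extend_from_slice(&seq.to_ne_bytes());
    buf.extend_from_slice(&0u32.to_ne_bytes()); // nlmsg_pid (0 = kernel assigns)
    // ifinfomsg: family AF_UNSPEC, all other fields zero (dump every link)
    buf.push(AF_UNSPEC);
    buf.push(0); // _pad
    buf.extend_from_slice(&0u16.to_ne_bytes()); // ifi_type
    buf.extend_from_slice(&0i32.to_ne_bytes()); // ifi_index
    buf.extend_from_slice(&0u32.to_ne_bytes()); // ifi_flags
    buf.extend_from_slice(&0u32.to_ne_bytes()); // ifi_change
    buf
}

fn main() {
    let msg = build_getlink_dump(1);
    assert_eq!(msg.len(), 32);
    // nlmsg_len covers the whole message.
    assert_eq!(u32::from_ne_bytes(msg[0..4].try_into().unwrap()), 32);
}
```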

umka-net implements the full NETLINK_ROUTE message set required for container networking:

  • Link management: RTM_NEWLINK, RTM_DELLINK, RTM_GETLINK — create/destroy/query veth pairs, bridges, VXLAN devices, macvlan/ipvlan
  • Address management: RTM_NEWADDR, RTM_DELADDR, RTM_GETADDR — assign/remove IPv4/IPv6 addresses
  • Route management: RTM_NEWROUTE, RTM_DELROUTE, RTM_GETROUTE — manipulate routing tables (including per-VRF tables)
  • Neighbor management: RTM_NEWNEIGH, RTM_DELNEIGH, RTM_GETNEIGH — ARP/NDP neighbor table entries
  • Rule management: RTM_NEWRULE, RTM_DELRULE — policy routing rules

Note: NLM_F_BATCH (0x1000) is a UmkaOS-specific netlink flag extension, not present in upstream Linux. Bit 12 (0x1000) was chosen after exhaustive collision analysis of the complete Linux NLM_F_* namespace (all assigned bits: 0x001-0x020 common flags, 0x100-0x800 type-specific flags in include/uapi/linux/netlink.h). Bits 6-7 and 12-15 are unassigned in Linux mainline. Bit 12 is safely available.

Linux applications that do not set this flag are unaffected. UmkaOS tools and libraries that use batch route updates must set this flag explicitly.

Netlink messages with NLM_F_BATCH flag set on RTM_NEWROUTE / RTM_DELROUTE are accumulated into a FibTrieBatchBuilder (see Section 16.6). The batch is committed when: - A message without NLM_F_BATCH is received (end of batch) - The netlink socket buffer is drained (implicit batch end) - The batch reaches FIB_BATCH_MAX mutations (default: 4096, prevents unbounded memory use in the working copy)

All mutations within a batch are applied atomically — readers see either the pre-batch or post-batch trie, never an intermediate state. Errors within a batch abort the entire batch (all-or-nothing semantics); the original trie is preserved and an NLMSG_ERROR is returned for the failing message.
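The three commit triggers reduce to a small predicate; a sketch of that policy (function name hypothetical, FIB_BATCH_MAX from the text):

```rust
const FIB_BATCH_MAX: usize = 4096;

/// Should an in-progress route batch be committed before (or instead of)
/// accumulating the current message?
fn batch_should_commit(msg_has_batch_flag: bool,
                       pending_mutations: usize,
                       socket_drained: bool) -> bool {
    !msg_has_batch_flag                     // non-batch message ends the batch
        || socket_drained                   // implicit batch end
        || pending_mutations >= FIB_BATCH_MAX // bound working-copy memory
}

fn main() {
    assert!(batch_should_commit(false, 10, false));  // flag cleared
    assert!(!batch_should_commit(true, 10, false));  // keep accumulating
    assert!(batch_should_commit(true, 4096, false)); // size cap reached
    assert!(batch_should_commit(true, 10, true));    // buffer drained
}
```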

Capability gating: Netlink write operations require the appropriate capability in the caller's network namespace. Read operations and multicast group subscriptions are unprivileged, matching Linux semantics. This ensures that unprivileged containers can observe network state but cannot modify it without explicit capability grants.

Multicast groups: Processes subscribe to multicast groups (e.g., RTNLGRP_LINK, RTNLGRP_IPV4_ROUTE) to receive asynchronous notifications of network state changes. This is how ip monitor, container runtimes, and NetworkManager track link state.

16.18 Packet Filtering (BPF-Based)

UmkaOS does not implement a separate nftables or iptables subsystem. Packet filtering uses the BPF-based filtering infrastructure described in Section 19.2.

Architecture: All packet filtering hooks (prerouting, input, forward, output, postrouting) are BPF attachment points. BPF programs attached to these hooks perform the equivalent of iptables/nftables rules: matching on headers, NATing, dropping, marking, and logging.

nftables/iptables compatibility: The syscall interface (Section 19.1) translates legacy iptables and nftables rule manipulations (setsockopt for iptables, netlink NFT_MSG_* for nftables) into equivalent BPF programs that are compiled and attached to the appropriate hooks. This translation happens transparently:

  • iptables -t nat -A POSTROUTING -s 10.0.0.0/8 -j MASQUERADE is translated to a BPF program attached to the postrouting hook that performs source NAT
  • nft add rule ip filter input tcp dport 80 accept is translated to a BPF program attached to the input hook

This approach provides Docker/Kubernetes compatibility (which depend on iptables/nftables for port mapping, masquerade, and network policy) without maintaining a separate packet filtering subsystem. The BPF JIT ensures that translated rules execute at native speed.

Connection tracking (conntrack): Stateful NAT (MASQUERADE, DNAT, SNAT) requires tracking connection state to map return packets back to the original source. UmkaOS implements connection tracking as a BPF-accessible hash map maintained by umka-net:

/// 5-tuple identifying one direction of a connection.
/// For ICMP, `src_port` and `dst_port` are repurposed as type and code.
#[repr(C)]
pub struct ConntrackTuple {
    /// Source IP address (IPv4-mapped-IPv6 for v4, native for v6).
    pub src_addr: [u8; 16], // 16 bytes raw array to guarantee layout
    /// Destination IP address.
    pub dst_addr: [u8; 16], // 16 bytes
    /// Source port (or ICMP type).
    pub src_port: u16,
    /// Destination port (or ICMP code).
    pub dst_port: u16,
    /// IP protocol number (TCP=6, UDP=17, ICMP=1, ICMPv6=58, etc.).
    pub protocol: u8,
    /// Explicit padding byte to avoid implicit alignment padding between
    /// `protocol` (u8 at offset 36) and `zone` (u16 at offset 38).
    /// Without this field, `#[repr(C)]` inserts 1 byte of uninitialized
    /// padding that would corrupt Jenkins hash if hashed as raw bytes.
    /// This field is always zero and included in the hash. All fields
    /// including `_pad` must be zero-initialized at allocation. Slab
    /// allocations use `SlabFlags::ZERO` equivalent, or the caller uses
    /// `ConntrackTuple::default()` which zeros all fields.
    pub _pad: u8,
    /// Conntrack zone (u16, matches Linux `nf_conntrack` zones). Part of the
    /// hash key to allow overlapping IP ranges in different network namespaces
    /// to coexist without collisions. Two entries with identical 5-tuples but
    /// different zones are distinct connections.
    pub zone: u16,
}
const_assert!(size_of::<ConntrackTuple>() == 40);

/// IPv4-mapped-IPv6 prefix: [0,0,0,0, 0,0,0,0, 0,0,0xFF,0xFF].
const V4_MAPPED_PREFIX: [u8; 12] = [0,0,0,0, 0,0,0,0, 0,0,0xFF,0xFF];

/// Encode an IPv4 address as IPv4-mapped-IPv6 for conntrack tuple storage.
fn ipv4_to_tuple_addr(addr: Ipv4Addr) -> [u8; 16] {
    let mut out = [0u8; 16];
    out[..12].copy_from_slice(&V4_MAPPED_PREFIX);
    out[12..16].copy_from_slice(&addr.octets());
    out
}

/// Connection tracking state (matches Linux conntrack states).
/// Protocol-agnostic: applies to TCP, UDP, ICMP, SCTP connections.
#[repr(u8)]
pub enum ConntrackState {
    New = 0,
    Established = 1,
    Related = 2,
    Invalid = 3,
    Untracked = 4,
}

/// NAT type applied to a tracked connection.
/// Protocol-agnostic: applies to all conntrack-eligible protocols.
#[repr(u8)]
pub enum NatType {
    None = 0,
    Snat = 1,      // Source NAT
    Dnat = 2,      // Destination NAT
    Masquerade = 3, // Source NAT with auto-IP
}

/// Connection tracking entry.
/// Keyed by the original-direction ConntrackTuple (5-tuple + zone).
#[repr(C, align(64))]
pub struct ConntrackEntry {
    /// Original direction 5-tuple.
    pub original: ConntrackTuple,
    /// Reply direction 5-tuple (after NAT translation).
    pub reply: ConntrackTuple,
    /// Connection state (NEW, ESTABLISHED, RELATED, INVALID, UNTRACKED).
    pub state: ConntrackState,
    /// NAT type applied (SNAT, DNAT, MASQUERADE, or None).
    pub nat_type: NatType,
    /// Conntrack zone — duplicated from ConntrackTuple for fast access during
    /// NAT and accounting without dereferencing the tuple. Must always equal
    /// `original.zone`.
    pub zone: u16,
    /// Explicit alignment padding for net_ns_inum (u64, requires 8-byte alignment).
    /// Must be zeroed. ConntrackState(u8) + NatType(u8) + zone(u16) = 4 bytes at
    /// offset 80. This _pad1 brings us to offset 88 for the u64 field.
    pub _pad1: [u8; 4],
    /// Network namespace inode number of the process that created this connection.
    /// Set when the connection is first tracked; never changes afterwards.
    ///
    /// Used by `bpf_ct_lookup()` to enforce namespace isolation: a BPF program
    /// running in network namespace A must not observe conntrack entries from
    /// namespace B. The filter `entry.net_ns_inum == caller_ns.inum` enforces
    /// this boundary. Cross-namespace access requires `CAP_NET_ADMIN` in the
    /// initial network namespace AND the `BPF_F_CONNTRACK_GLOBAL` flag.
    pub net_ns_inum: u64,
    /// Connection mark (set by iptables CONNMARK target). Used by Kubernetes
    /// kube-proxy for service routing and by iptables CONNMARK save/restore.
    pub mark: u32,
    /// Explicit alignment padding for timeout_ns (u64). mark(u32) at offset 96,
    /// this _pad2 at offset 100 brings us to offset 104.
    pub _pad2: [u8; 4],
    /// Timeout (nanoseconds since boot). Entry is garbage-collected after expiry.
    pub timeout_ns: u64,
    /// Packet/byte counters (for accounting). AtomicU64 because counters are
    /// updated on the forwarding path under RCU (no per-entry lock held) and
    /// read by BPF programs and conntrack dumps concurrently.
    ///
    /// **32-bit architecture note (ARMv7, PPC32)**: `AtomicU64` is not natively
    /// lock-free on these architectures; `fetch_add` falls through to LLVM's
    /// `__atomic_fetch_add_8` (global spinlock hash table) or `portable-atomic`
    /// equivalent. On multi-core 32-bit systems with high packet rates, this
    /// serialization may become a scalability bottleneck. Mitigation: on 32-bit
    /// architectures, extend the per-bucket lock to cover counter updates on
    /// the write path (the lock is already held for new connection insertion),
    /// and batch RCU read-side counter updates per-CPU for periodic drain.
    /// This matches the ErrSeq and TTY ring buffer 32-bit adaptation pattern
    /// used elsewhere in the spec.
    pub packets_original: AtomicU64,
    pub packets_reply: AtomicU64,
    pub bytes_original: AtomicU64,
    pub bytes_reply: AtomicU64,
    // 48 bytes of tail padding to align(64) boundary (offset 144..192).
    // ConntrackState and NatType are #[repr(u8)] enums defined above.
}
const_assert!(size_of::<ConntrackEntry>() == 192);

The conntrack table is a concurrent hash map with per-bucket spinlocks and RCU-protected lookup, matching Linux's nf_conntrack design: a per-namespace hash table (one ConntrackTable per NetNamespace) with per-bucket locking rather than per-CPU sharding, because connection state must be visible across all CPUs for NAT reply-direction lookups. See Section 17.1 for the NetNamespace.conntrack field definition.

Conntrack hash table scalability design:

The hash table uses Jenkins hash (same as Linux nf_conntrack) over the 5-tuple + zone, distributing entries uniformly across buckets. The design separates the read path (hot, lockless) from the write path (rare, per-bucket locked):

Read path (packet lookup — hot, every packet):
  1. Compute hash(5-tuple, zone) → bucket index
  2. rcu_read_lock()
  3. Walk bucket chain (RCU-protected linked list), compare 5-tuples
  4. Return ConntrackEntry pointer (valid under RCU read-side)
  5. rcu_read_unlock()
  Cost: ~40-80 ns (hash + 1-2 pointer chases, no atomics, no locks)

Write path (new connection — rare, ~1 per 1000 packets for typical HTTP):
  1. Compute hash → bucket index
  2. spin_lock(&bucket[idx].lock)
  3. Allocate ConntrackEntry from per-CPU slab cache (no global lock)
  4. Insert at head of bucket chain (RCU publish: rcu_assign_pointer)
  5. spin_unlock(&bucket[idx].lock)
  Cost: ~200-400 ns (lock + slab alloc + RCU publish)
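The two paths above can be illustrated with a single-threaded model of the bucket structure. This is a sketch only: RCU, the per-bucket spinlocks, the slab cache, and the Jenkins hash are replaced by a plain vector and Rust's default hasher, and none of the names below are spec API.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Simplified key: the 5-tuple fields plus zone. The real table hashes
/// the raw 40-byte ConntrackTuple with Jenkins hash.
#[derive(Clone, PartialEq, Eq, Hash)]
struct Key { src: [u8; 16], dst: [u8; 16], sport: u16, dport: u16, proto: u8, zone: u16 }

struct Table { buckets: Vec<Vec<(Key, u64 /* entry id */)>> }

impl Table {
    fn new(nbuckets: usize) -> Self {
        assert!(nbuckets.is_power_of_two());
        Table { buckets: vec![Vec::new(); nbuckets] }
    }
    fn index(&self, k: &Key) -> usize {
        let mut h = DefaultHasher::new();
        k.hash(&mut h);
        (h.finish() as usize) & (self.buckets.len() - 1)
    }
    /// Read path: hash -> bucket -> walk the chain comparing full keys.
    fn lookup(&self, k: &Key) -> Option<u64> {
        self.buckets[self.index(k)].iter().find(|(q, _)| q == k).map(|(_, id)| *id)
    }
    /// Write path: insert at the chain head (the RCU-publish position).
    fn insert(&mut self, k: Key, id: u64) {
        let i = self.index(&k);
        self.buckets[i].insert(0, (k, id));
    }
}

fn main() {
    let mut t = Table::new(16);
    let k = Key { src: [0; 16], dst: [0; 16], sport: 40000, dport: 80, proto: 6, zone: 0 };
    t.insert(k.clone(), 7);
    assert_eq!(t.lookup(&k), Some(7));
    // Identical 5-tuple in a different zone is a distinct connection.
    let k2 = Key { zone: 1, ..k.clone() };
    assert_eq!(t.lookup(&k2), None);
    println!("ok");
}
```

The zone check at the end mirrors the ConntrackTuple invariant: two entries with identical 5-tuples but different zones never alias.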

Conntrack table sizing: UmkaOS's conntrack hash table size is determined at boot from available physical memory and is runtime-resizable:

initial_buckets = clamp(
    next_power_of_two(system_ram_bytes / 65536),
    65_536,       // minimum: 64K buckets
    16_777_216    // maximum: 16M buckets
)

Example: 4 GB RAM → 65536 buckets; 64 GB → 1048576; 256 GB → 4194304.
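The boot-time formula can be transcribed directly and checked against the worked examples. A sketch: `initial_buckets` here is a free function for illustration, not spec API.

```rust
/// Boot-time conntrack bucket count, per the formula above:
/// clamp(next_power_of_two(ram / 65536), 64K buckets, 16M buckets).
fn initial_buckets(system_ram_bytes: u64) -> u64 {
    let target = (system_ram_bytes / 65_536).next_power_of_two();
    target.clamp(65_536, 16_777_216)
}

fn main() {
    const GIB: u64 = 1 << 30;
    assert_eq!(initial_buckets(4 * GIB), 65_536);      // clamped to the minimum
    assert_eq!(initial_buckets(64 * GIB), 1_048_576);
    assert_eq!(initial_buckets(256 * GIB), 4_194_304);
    println!("ok");
}
```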

Runtime resize: The table doubles when the running average chain length exceeds 8 (monitored per-bucket via exponentially-weighted moving average), and halves when the average falls below 2 for more than 60 continuous seconds. Resize is RCU-safe: a new table is allocated, all entries rehashed with RCU-protected pointer update, and the old table freed after a grace period. No packet drops during resize.

The kernel tunable conntrack.max_buckets overrides the boot-time calculation (requires CAP_NET_ADMIN). For Linux compatibility, the boot parameter nf_conntrack_buckets=N is also accepted. The maximum connection count is capped by nf_conntrack_max (default: conntrack_buckets × 4).

Memory per bucket: ~24 bytes (spinlock + RCU list head + counter). Memory per entry: 192 bytes (ConntrackEntry) + slab metadata. Total memory at maximum fill is dominated by entries, not buckets.

Contention analysis for 256+ CPUs: The per-bucket spinlock is the only serialization point. Under uniform hash distribution, the probability of two CPUs contending on the same bucket during insertion is:

P(contention) = (insert_rate × lock_hold_time) / num_buckets

For 256 CPUs each creating 10K connections/sec (2.56M total inserts/sec), with ~200 ns lock hold time and 1048576 buckets (typical on a 64 GB system per the memory-based formula):

P(contention) = (2.56M × 200ns) / 1048576 = 0.51 / 1048576 ≈ 0.0000005 per insert

This means contention occurs approximately once per 2,000,000 inserts — negligible. At 10M inserts/sec (extreme load), contention rises to ~once per 500,000 inserts, still well within acceptable limits. Larger memory gives more buckets, so contention decreases further on memory-rich systems.
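The arithmetic behind these figures can be reproduced directly (a sketch of the stated model, not spec API):

```rust
/// Expected per-insert collision probability for the per-bucket lock,
/// per the model above: P = insert_rate * lock_hold_time / num_buckets.
fn contention_per_insert(inserts_per_sec: f64, hold_time_s: f64, buckets: f64) -> f64 {
    inserts_per_sec * hold_time_s / buckets
}

fn main() {
    // 2.56M inserts/sec, 200 ns hold time, 1048576 buckets.
    let p = contention_per_insert(2.56e6, 200e-9, 1_048_576.0);
    assert!((p - 4.88e-7).abs() < 1e-8);  // ~0.0000005 per insert
    assert!(1.0 / p > 2.0e6);             // one contended insert per ~2M
    // Extreme load: 10M inserts/sec.
    let p10 = contention_per_insert(1.0e7, 200e-9, 1_048_576.0);
    assert!(1.0 / p10 > 5.0e5);           // ~once per 500K inserts
    println!("ok");
}
```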

Scaling beyond 10M connections/sec: For extreme-scale deployments (512+ CPUs, 10M+ new connections/sec), two additional strategies are available:

  1. Per-namespace sharding: Each network namespace has its own conntrack table. Traffic sharded across N namespaces yields N independent hash tables, each with 1/N the contention. Kubernetes pod networking naturally provides this sharding (each pod has its own network namespace). The kernel creates a new conntrack table per namespace via struct net→ct (same as Linux), so no additional design is required — the sharding is automatic.

  2. Percpu insertion batching: For workloads with extremely high short-lived connection rates (SYN floods, UDP scanning), insertions can be batched per-CPU and flushed to the global table periodically. This trades insertion latency (~1ms batch interval) for reduced lock contention. Enabled via umka.net.conntrack.batch_insert=1 (default: disabled, as most workloads don't need it).

Table saturation policy: When the conntrack table reaches its maximum entry count (umka.net.conntrack.max, default: conntrack_buckets × 4, tunable), new connection attempts receive -ENOMEM from bpf_ct_insert(). The BPF program decides the policy: drop the packet (default for SYN flood protection) or allow it untracked (stateless fallback). Under sustained SYN flood conditions (10-100x normal rate), the percpu batching mode and early drop heuristic (evict the oldest unassured connection in the target bucket) prevent table-full drops for legitimate traffic.

Hash distribution under NAT pools: When a small NAT pool (e.g., 4 public IPs) serves a large private network, all reverse-flow lookups (external → internal) hash to the same ~4 buckets, creating a hot-bucket problem under high reverse traffic. Mitigation: the hash covers the full tuple, including the NAT-assigned source port, so translated connections scatter across as many buckets as there are distinct source ports. For aggressive scanning workloads that fix the source IP and port, the administrator can enable net.conntrack.nat_pool_scatter=1 which additionally hashes on a per-session nonce, breaking the hot-bucket skew at the cost of one extra memory access per lookup.

Garbage collection: Expired entries are reclaimed by a per-CPU GC thread that scans its local slab and removes entries whose timeout_ns has passed. The per-CPU GC thread scans slab objects allocated on its CPU, which may belong to any namespace.

Namespace safety: The net_ns_inum field on each entry identifies the owning namespace. The GC thread looks up the namespace via the global namespace XArray (NET_NS_XARRAY.xa_load(net_ns_inum)) under an RCU read guard. If the namespace is being torn down (lookup returns None or refcount increment fails via Arc::try_increment()), the entry is freed directly without acquiring the bucket lock — the bucket itself is being destroyed. Namespace teardown sequence: destroy_net_ns() first flushes all conntrack entries for the namespace (walking all buckets), THEN marks the namespace as destroyed and removes it from the XArray.

Removal acquires the per-namespace bucket lock for the entry's conntrack table. GC runs every umka.net.conntrack.gc_interval_ms (default: 1000ms). Removal holds the bucket lock briefly (~100 ns) and uses call_rcu() to defer freeing until all RCU readers have completed.

16.18.1.1 BPF Kfuncs for Conntrack

BPF programs at the prerouting and postrouting hooks query and update conntrack entries via BPF kfuncs (bpf_ct_lookup(), bpf_ct_insert(), bpf_ct_set_nat()). These are exposed via the kfunc mechanism (not classic BPF helpers), matching Linux 6.x where conntrack operations are registered as kfuncs by the conntrack subsystem. This integrates with the BPF-based packet filtering: a MASQUERADE BPF program creates a conntrack entry with SNAT on the outgoing path; the prerouting hook automatically reverses the NAT for return packets by looking up the conntrack entry.

BPF conntrack access: kfunc-only, no direct mapping.

BPF programs access conntrack state exclusively via the bpf_ct_lookup() kfunc. The conntrack hash table is NOT mapped read-only into the BPF address space.

The kfunc enforces namespace isolation automatically: a BPF program attached to a network interface in namespace N sees only conntrack entries belonging to namespace N. The attachment point determines the filtering context — no explicit namespace argument needed, and no bypass possible.

The ~50–100 ns per-call overhead of the kfunc is negligible relative to full TCP/IP stack processing (~2–5 μs per packet), which is the relevant budget, not the wire-rate inter-packet time. This is a deliberate UmkaOS design decision: the Linux optimization of mapping the full conntrack table read-only into BPF space creates a namespace isolation bypass — a BPF program can walk the raw hash table to enumerate all connections across all namespaces, violating container isolation in Kubernetes multi-tenant environments. UmkaOS eliminates this attack surface from day one.

BPF isolation domain specification: see Section 19.2.

BPF helper isolation model: The general isolation rules for BPF programs in the networking stack (and all other subsystems) are:

  1. Domain confinement: Each BPF program executes in a dedicated BPF isolation domain (Section 19.2), separate from both umka-core and the driver or subsystem that loaded it. An XDP program attached to a NIC driver does not run in the driver's domain — it runs in its own BPF domain and accesses driver or subsystem state only through verified BPF helpers, which perform cross-domain access on the program's behalf. This means a verifier bug in a BPF program cannot compromise the NIC driver's memory or umka-net's internal state. The map access control (rule 2) and capability-gated helpers (rule 3) are enforced by this domain boundary, not solely by the verifier's static analysis.

  2. Map access control: BPF maps are owned by the isolation domain that created them. A BPF program can only access maps owned by its own domain. Cross-domain map sharing is explicit: the owning domain grants a capability (with MAP_READ, MAP_WRITE, or both permission bits) to the target domain via the standard capability delegation mechanism (Section 9.1.1). The verifier rejects programs that reference map file descriptors for which the loading domain does not hold a valid capability.

  3. Capability-gated helpers: BPF helpers that access kernel state beyond the program's own domain require the BPF domain to hold the corresponding capability. For example: bpf_sk_lookup() (socket table lookup) requires CAP_NET_LOOKUP; bpf_fib_lookup() (route table lookup) requires CAP_NET_ROUTE_READ; bpf_ct_lookup() / bpf_ct_insert() require CAP_NET_CONNTRACK. Enforcement is dual: the verifier rejects programs at load time if the BPF domain does not hold the required capabilities (see rule 5), and the eBPF runtime re-checks the domain's capability set at helper invocation time. The runtime check is necessary because capabilities can be revoked after a program is loaded (Section 9.1.1) — without it, a revoked capability would remain effective until the program is explicitly unloaded.

  4. Cross-domain packet redirect: XDP redirect actions (XDP_REDIRECT, bpf_redirect_map()) that forward a packet to an interface in a different driver's isolation domain require the source domain to hold CAP_NET_REDIRECT for the target interface. Without this capability, the redirect returns -EACCES and the packet is dropped. This prevents a compromised NIC driver from injecting traffic into another driver's domain.

XDP Redirect Rate Limiting:

XDP programs can redirect frames to any network interface, including loopback and physical interfaces. Without rate limiting, a malicious or buggy XDP program can saturate links at line rate.

UmkaOS enforces:

  1. Redirect to the same interface (hairpin): always allowed.
  2. Redirect to another interface in the root network namespace: requires CAP_NET_ADMIN. Unrestricted redirect in the root ns is intentional (the root ns is trusted).
  3. Redirect to another interface in a non-root network namespace:
     • The target interface must have a configured TX rate limit (ip link set dev X rate <limit>bps via netlink).
     • If no rate limit is configured, the redirect is rejected with XDP_ABORTED and a log message: "XDP redirect blocked: no rate limit on [interface]".
     • The rate limit is enforced by the token-bucket scheduler already present in the UmkaOS Tier 1 network stack (Section 16.18).

Rationale: CAP_NET_ADMIN is required in the root ns because it controls physical hardware. In tenant namespaces, the rate limit is the safety valve — tenants can use XDP redirect but cannot monopolize shared physical links.

  5. Verifier enforcement: The verifier enforces constraints (2)–(4) at program load time by checking the BPF domain's capability set against the program's map references and helper calls. Programs that reference inaccessible maps or call helpers requiring capabilities the domain does not hold are rejected before JIT compilation. This is a static gate: it prevents unauthorized programs from being loaded in the first place. Runtime capability checks at helper invocation time (rule 3) serve a distinct purpose: they enforce capability revocation for already-loaded programs. Both mechanisms are primary for their respective concerns — load-time verification prevents unauthorized loading, and runtime checks ensure revocation takes immediate effect.
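The dual load-time/runtime enforcement can be modeled in a few lines. This is an illustrative sketch: `Domain`, `Cap`, and the helper-to-capability table are stand-ins for the spec's capability system, not its API.

```rust
use std::collections::HashSet;

#[derive(Clone, Copy, PartialEq, Eq, Hash, Debug)]
enum Cap { NetLookup, NetRouteRead, NetConntrack }

/// Capability each helper requires, per the table in rule 3.
fn required_cap(helper: &str) -> Option<Cap> {
    match helper {
        "bpf_sk_lookup" => Some(Cap::NetLookup),
        "bpf_fib_lookup" => Some(Cap::NetRouteRead),
        "bpf_ct_lookup" | "bpf_ct_insert" => Some(Cap::NetConntrack),
        _ => None,
    }
}

struct Domain { caps: HashSet<Cap> }

/// Load-time gate (rule 5): reject before JIT if any referenced helper
/// needs a capability the loading domain does not hold.
fn verify_load(d: &Domain, helpers: &[&str]) -> Result<(), &'static str> {
    for h in helpers {
        if let Some(c) = required_cap(h) {
            if !d.caps.contains(&c) { return Err("load rejected: missing capability"); }
        }
    }
    Ok(())
}

/// Runtime re-check (rule 3): makes revocation effective after load.
fn invoke(d: &Domain, helper: &str) -> Result<(), &'static str> {
    if let Some(c) = required_cap(helper) {
        if !d.caps.contains(&c) { return Err("EACCES: capability revoked"); }
    }
    Ok(())
}

fn main() {
    let mut dom = Domain { caps: [Cap::NetConntrack].into_iter().collect() };
    assert!(verify_load(&dom, &["bpf_ct_lookup"]).is_ok());   // loads fine
    assert!(verify_load(&dom, &["bpf_fib_lookup"]).is_err()); // rejected at load
    assert!(invoke(&dom, "bpf_ct_lookup").is_ok());
    dom.caps.remove(&Cap::NetConntrack);                      // revocation
    assert!(invoke(&dom, "bpf_ct_lookup").is_err());          // takes effect immediately
    println!("ok");
}
```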

Linux compatibility: The conntrack subsystem exposes /proc/net/nf_conntrack and the NETLINK_NETFILTER netlink family for userspace tools (conntrack -L, conntrack -D). Docker and Kubernetes depend on conntrack for NAT state visibility.

Advantages over separate subsystems: A single filtering mechanism (BPF) eliminates the complexity of maintaining iptables, ip6tables, ebtables, arptables, and nftables as separate subsystems — a major source of bugs and inconsistencies in Linux networking. Connection tracking is the sole stateful component, shared by all BPF-translated NAT rules regardless of their original iptables/nftables syntax.


16.19 Network Interface Naming

This section serves as a cross-reference index for network interface naming, which is specified across three subsystems. See the linked sections for implementation specifications.

Linux problem: Network interface naming was chaotic (eth0 could be different NICs each boot). systemd's "predictable names" (enp0s3, etc.) partially fixed this but introduced confusing names and edge cases.

UmkaOS design:

  • Deterministic, stable device naming in sysfs based on physical topology (bus/slot/function) from the first boot.
  • The device manager assigns stable names based on firmware (ACPI, Device Tree) hints first, then physical topology, then driver enumeration order as a last resort.
  • User-defined naming rules via a declarative config (similar to udev rules but simpler).
  • Network namespaces get their own independent naming scope (Section 17.1).

The interface naming specification is distributed across three subsystems:

  • Canonical device naming convention (PCI, USB, Platform, Virtio, ACPI, NVMe naming patterns, collision handling, sysfs alias symlinks): Section 20.5.
  • Persistent topology-based naming (bus identity + serial, per-bus naming stability, ACPI HID/UID, Device Tree aliases, USB hub-chain+port): Section 11.4.
  • NetDevice.name field (IFNAMSIZ, if_nametoindex(), driver registration): Section 16.13.

Naming priority order (first match wins):

  Priority  Source                      Example            Rationale
  1         ACPI _DSM device name       eno1               Firmware-assigned, stable across reboots
  2         Device Tree label property  eth0 (DT alias)    Embedded/ARM canonical naming
  3         PCI slot/function           enp3s0f1           Topology-based, survives driver changes
  4         USB hub-chain + port        enx001122334455    MAC-based for hot-pluggable devices
  5         Driver enumeration order    eth0, wlan0        Last resort, not guaranteed stable

User-defined overrides via /etc/umka/naming.d/*.conf take priority over all automatic naming. The format is a simple match PCI_SLOT=0000:03:00.0 name=lan0 declarative syntax (simpler than udev rules, no shell execution).
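A rule in this format is trivially machine-parseable. The sketch below parses the one-line form shown above; any grammar beyond that single `match … name=…` example is an assumption for illustration.

```rust
/// One rule from /etc/umka/naming.d/*.conf, e.g.:
///   match PCI_SLOT=0000:03:00.0 name=lan0
/// Grammar beyond this example is assumed for illustration.
#[derive(Debug, PartialEq)]
struct NamingRule { match_key: String, match_value: String, name: String }

fn parse_rule(line: &str) -> Option<NamingRule> {
    let mut toks = line.split_whitespace();
    if toks.next()? != "match" { return None; }
    let (key, value) = toks.next()?.split_once('=')?;
    let name = toks.next()?.strip_prefix("name=")?;
    Some(NamingRule {
        match_key: key.to_string(),
        match_value: value.to_string(),
        name: name.to_string(),
    })
}

fn main() {
    let r = parse_rule("match PCI_SLOT=0000:03:00.0 name=lan0").unwrap();
    assert_eq!(r.match_key, "PCI_SLOT");
    assert_eq!(r.match_value, "0000:03:00.0");
    assert_eq!(r.name, "lan0");
    // Unknown verbs are rejected outright: the format has no shell execution.
    assert!(parse_rule("exec rm -rf /").is_none());
    println!("ok");
}
```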

16.20 AF_UNIX Socket Specification

Unix domain sockets provide local inter-process communication with semantics that differ from network sockets. UmkaOS implements the full Linux AF_UNIX interface for compatibility with systemd, D-Bus, X11/Wayland, and container runtimes.

Socket types:

  Type            Semantics                                          Use case
  SOCK_STREAM     Byte stream, in-order, reliable                    D-Bus, systemd socket activation
  SOCK_DGRAM      Datagram, message-boundary-preserving; reliable    Logging, low-overhead IPC (AF_UNIX);
                  for AF_UNIX (in-kernel, no packet loss),           DNS/NTP (UDP)
                  unreliable for AF_INET/AF_INET6 UDP
  SOCK_SEQPACKET  Message-preserving stream, in-order, reliable      Protocol-framed IPC (e.g., varlink)

Address format:

/// Unix domain socket address (matches Linux struct sockaddr_un).
/// Path sockets start with a non-NUL byte; abstract sockets start with NUL.
#[repr(C)]
pub struct SockAddrUnix {
    /// Address family (AF_UNIX = 1).
    pub sun_family: u16,
    /// Path name or abstract name.
    /// - Path socket: null-terminated filesystem path (max 107 bytes including NUL)
    /// - Abstract socket: sun_path[0] = '\0', followed by abstract name (no filesystem entry)
    /// The Linux limit is 108 bytes total (sizeof(sockaddr_un) - 2 for sun_family).
    pub sun_path: [u8; 108],
}
const_assert!(size_of::<SockAddrUnix>() == 110);

Abstract namespace: Names starting with \0 (e.g., \0com.example.app) exist independently of the filesystem. Abstract sockets are destroyed when the last reference closes and are not affected by filesystem operations (unlink, rename). They are scoped to the network namespace (Section 17.1), providing isolation between containers.
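The two address forms differ only in the first byte of sun_path. The sketch below reproduces the struct from above and adds illustrative encoding helpers (`path_addr` and `abstract_addr` are not spec API):

```rust
/// Reproduced from above: Linux-compatible sockaddr_un layout.
#[repr(C)]
pub struct SockAddrUnix {
    pub sun_family: u16,
    pub sun_path: [u8; 108],
}

const AF_UNIX: u16 = 1;

/// Path socket: NUL-terminated filesystem path (max 107 bytes + NUL).
fn path_addr(path: &[u8]) -> Option<SockAddrUnix> {
    if path.is_empty() || path[0] == 0 || path.len() > 107 { return None; }
    let mut a = SockAddrUnix { sun_family: AF_UNIX, sun_path: [0; 108] };
    a.sun_path[..path.len()].copy_from_slice(path);
    Some(a) // sun_path[path.len()] is already the terminating NUL
}

/// Abstract socket: leading NUL, then the name; no filesystem entry.
fn abstract_addr(name: &[u8]) -> Option<SockAddrUnix> {
    if name.len() > 107 { return None; }
    let mut a = SockAddrUnix { sun_family: AF_UNIX, sun_path: [0; 108] };
    a.sun_path[1..1 + name.len()].copy_from_slice(name);
    Some(a)
}

fn main() {
    assert_eq!(std::mem::size_of::<SockAddrUnix>(), 110);
    let p = path_addr(b"/run/dbus/system_bus_socket").unwrap();
    assert_ne!(p.sun_path[0], 0);        // path form: non-NUL first byte
    let a = abstract_addr(b"com.example.app").unwrap();
    assert_eq!(a.sun_path[0], 0);        // abstract form: NUL first byte
    assert_eq!(&a.sun_path[1..16], b"com.example.app");
    println!("ok");
}
```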

Control messages (SCM_RIGHTS, SCM_CREDENTIALS, SCM_SECURITY):

/// Ancillary data types for AF_UNIX sockets.
pub enum UnixControlMsg {
    /// Pass file descriptors to the receiver.
    /// The sender's fd table entries are duplicated into the receiver's fd table.
    /// Fds are closed by the sender after sendmsg() returns (not transferred).
    /// Receives as: cmsg_level = SOL_SOCKET, cmsg_type = SCM_RIGHTS,
    ///              cmsg_data = [i32; N] (array of fds)
    ScmRights {
        /// File descriptors to duplicate (max 253 per message, matching Linux SCM_MAX_FD).
        fds: [i32; 253],
        /// Number of valid entries in fds.
        count: usize,
    },

    /// Send sender's credentials to the receiver.
    /// Works on all Unix socket types (SOCK_STREAM, SOCK_DGRAM, SOCK_SEQPACKET).
    /// On SOCK_STREAM, at least one byte of non-ancillary data must accompany the message.
    /// The receiver can validate the sender's identity.
    /// Receives as: cmsg_level = SOL_SOCKET, cmsg_type = SCM_CREDENTIALS,
    ///              cmsg_data = struct ucred
    ScmCredentials {
        /// Sender's PID in the receiver's PID namespace (translated if different).
        /// If the sender's PID is not visible in the receiver's PID namespace
        /// (different namespace hierarchy), the reported PID is 0 — matching
        /// Linux behavior (PID 0 = "not translatable in your namespace").
        pid: i32,
        /// Sender's UID in the receiver's user namespace.
        uid: u32,
        /// Sender's GID in the receiver's user namespace.
        gid: u32,
    },

    /// Send sender's LSM security label to the receiver.
    /// Requires `SO_PASSSEC` socket option to be enabled on the receiving socket.
    /// Receives as: cmsg_level = SOL_SOCKET, cmsg_type = SCM_SECURITY,
    ///              cmsg_data = NUL-terminated security label string (e.g.,
    ///              "unconfined_u:unconfined_r:unconfined_t:s0" for SELinux).
    /// The label is the sender's security context at sendmsg() time, as
    /// determined by the active LSM ([Section 9.8](09-security.md#linux-security-module-framework)).
    ScmSecurity {
        /// Security label string (NUL-terminated). Maximum length: PAGE_SIZE
        /// (matching the practical limit for LSM security labels in Linux's
        /// `security_socket_getpeersec_dgram()`). Box-allocated because
        /// SCM_SECURITY is cold-path (requires explicit SO_PASSSEC opt-in).
        /// This avoids inflating the enum size for the common ScmRights and
        /// ScmCredentials variants. SELinux labels are typically <100 bytes.
        label: Box<[u8]>,
        /// Invariant: label.len() <= PAGE_SIZE. Enforced at the recvmsg()
        /// boundary before copying to userspace.
        label_len: u16,
    },
}

/// Credential structure for SCM_CREDENTIALS (matches Linux struct ucred).
#[repr(C)]
pub struct UCred {
    pub pid: i32,
    pub uid: u32,
    pub gid: u32,
}
const_assert!(size_of::<UCred>() == 12);

SCM_CREDENTIALS outgoing validation: When a process sends SCM_CREDENTIALS via sendmsg(), the kernel validates the supplied fields before transmission (matching Linux's scm_check_creds()):

  • pid: Must equal the sender's real PID (in the sender's PID namespace). Spoofing a different PID requires CAP_SYS_ADMIN in the sender's user namespace.
  • uid: Must equal the sender's real, effective, or saved-set UID. Spoofing a different UID requires CAP_SETUID in the sender's user namespace.
  • gid: Must equal the sender's real, effective, or saved-set GID. Spoofing a different GID requires CAP_SETGID in the sender's user namespace.

Without this validation, any process could forge arbitrary credentials — a critical security hole for D-Bus, systemd, and all services relying on SO_PEERCRED.
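The validation rule can be modeled directly. This is an illustrative sketch of the check described above; the `TaskCreds` fields and capability flags are stand-ins, not spec API.

```rust
/// Sender's task credentials as seen by the kernel (illustrative model).
struct TaskCreds {
    pid: i32,
    ruid: u32, euid: u32, suid: u32,
    rgid: u32, egid: u32, sgid: u32,
    cap_sys_admin: bool, cap_setuid: bool, cap_setgid: bool,
}

/// Model of scm_check_creds(): accept the claimed ucred only if each
/// field matches the sender, or the matching capability licenses the spoof.
fn check_creds(t: &TaskCreds, pid: i32, uid: u32, gid: u32) -> Result<(), &'static str> {
    if pid != t.pid && !t.cap_sys_admin {
        return Err("EPERM: pid spoof needs CAP_SYS_ADMIN");
    }
    if ![t.ruid, t.euid, t.suid].contains(&uid) && !t.cap_setuid {
        return Err("EPERM: uid spoof needs CAP_SETUID");
    }
    if ![t.rgid, t.egid, t.sgid].contains(&gid) && !t.cap_setgid {
        return Err("EPERM: gid spoof needs CAP_SETGID");
    }
    Ok(())
}

fn main() {
    let t = TaskCreds { pid: 1234, ruid: 1000, euid: 1000, suid: 1000,
                        rgid: 1000, egid: 1000, sgid: 1000,
                        cap_sys_admin: false, cap_setuid: false, cap_setgid: false };
    assert!(check_creds(&t, 1234, 1000, 1000).is_ok()); // own identity: allowed
    assert!(check_creds(&t, 1234, 0, 1000).is_err());   // claiming uid 0: rejected
    println!("ok");
}
```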

SO_PEERCRED: The getsockopt(SOL_SOCKET, SO_PEERCRED, ...) call retrieves the credentials of the peer process at connect() time. This is the standard authentication mechanism for D-Bus and systemd. The credentials are snapshotted when the connection is established and do not change if the peer later calls setuid() or exits.

Socketpair: socketpair(AF_UNIX, type, 0, sv) creates a connected pair of unnamed sockets. Both ends are interchangeable (no client/server distinction). Used for pthreads IPC, async I/O notification pipes, and subprocess communication.

Autobind: Binding to an empty address (sun_path[0] = '\0' with length 2) triggers autobind, which assigns a unique abstract name \0<inode>. This is used for unnamed socket peers that need a bindable address for sockname().

16.20.1.1 SCM_RIGHTS Kernel-Side Mechanics

When a sender calls sendmsg() with cmsg_level = SOL_SOCKET, cmsg_type = SCM_RIGHTS:

  1. Send path: For each fd in cmsg_data, the kernel calls fget(sender_fd) to obtain a reference to the sender's File object. The File references (not fd numbers) are attached to the in-flight message (sk_buff ancillary data). The sender's fd table is not modified — the sender retains its fds after sendmsg() returns.

  2. Receive path: When the receiver calls recvmsg(), for each in-flight File reference the kernel calls fd_install() to allocate a new fd in the receiver's FdTable and install the File reference. The receiver sees new fd numbers in cmsg_data (these are receiver-local fd numbers, not the sender's original numbers).

  3. Security check: Before installing each fd, the kernel invokes security_file_receive() (LSM hook). SELinux, AppArmor, or other LSM modules can deny the receive based on the receiver's security context and the file's label. Denial returns EPERM and the fd is not installed.

Rollback on partial failure: If the LSM denies fd N, all previously-installed fds 0..N-1 must be rolled back — sys_close(installed_fds[k]) for each k < N — before returning EPERM to userspace. Without rollback, the receiver's fd table contains orphaned file references with no corresponding userspace awareness (the recvmsg() call returns an error, so the receiver never learns the fd numbers). These leaked fds consume fd table slots and hold references to the underlying File objects, preventing cleanup.

Preferred implementation: Two-pass validation. Pass 1: call security_file_receive() for all fds without installing any. Pass 2 (only if all LSM checks pass): call fd_install() for each fd. This avoids the rollback path entirely and matches Linux's implementation in scm_detach_fds() (v6.x).

  4. Limits: SCM_MAX_FD = 253 fds per sendmsg() call (matching Linux). Exceeding this limit returns EINVAL.

  5. Garbage collection: AF_UNIX fd passing can create reference cycles (socket A passes its own fd to socket B, and socket B passes its fd to socket A — both sockets hold a reference to the other via in-flight messages, preventing either from being freed). The kernel runs an SCC-based (Strongly Connected Components) cycle detector (unix_gc()), matching the Linux 6.10+ algorithm by Kuniyuki Iwashima. This replaces the older three-phase mark-and-sweep with O(V+E) incremental detection, eliminating the global gc_lock bottleneck for full graph scans.

Incremental edge tracking: Edges in the in-flight fd graph are tracked as fds are transmitted and received. Each AF_UNIX socket with in-flight fds has a unix_vertex (graph node) and each transmitted fd creates a unix_edge (directed edge from sender to receiver). Edges are added in unix_add_edges() during sendmsg() (under gc_lock) and removed in unix_del_edges() during recvmsg() or socket close.

Graph state machine: Each unix_vertex tracks a cyclic state:

  • NOT_CYCLIC — no cycle involving this socket has been detected.
  • MAYBE_CYCLIC — an edge change may have created a cycle (needs re-evaluation).
  • CYCLIC — this socket is part of a confirmed dead SCC (garbage).

SCC detection: When triggered, __unix_walk_scc() performs an iterative DFS (not recursive, to bound stack usage) starting from MAYBE_CYCLIC vertices. Tarjan's algorithm identifies SCCs. For each SCC found, if no vertex in the SCC has an external reference (fd table entry or non-SCC socket's receive queue), the entire SCC is dead. Dead SCC sockets are purged: all in-flight fds in their receive queues are closed (fput()), breaking the cycle.

Scope: GC runs globally across all network namespaces. AF_UNIX sockets in different namespaces can hold cross-namespace references via fd passing (if the fd was obtained before namespace separation), so per-namespace GC would miss cycles.

Locking: gc_lock is a global SpinLock protecting edge graph mutations (unix_add_edges(), unix_del_edges()). The lock scope is narrow — held only during edge add/remove, not during the full SCC traversal. Individual socket locks are NOT held during SCC detection to avoid lock ordering inversions. Socket receive queue locks are acquired only during the purge phase for fd extraction.

Trigger conditions:

  • A socket with in-flight fds is closed (unix_release()) and gc_inflight_count > GC_THRESHOLD (default: 16384 in-flight fds globally).
  • The GC runs on a workqueue (system_wq) to avoid blocking the close() syscall.
  • Re-entrancy guard: if GC is already running, the trigger is a no-op.

Complexity: O(V + E) where V = vertices (sockets with in-flight fds) and E = edges (in-flight fd references). The incremental edge tracking means only MAYBE_CYCLIC vertices are traversed, not the full graph. This is a significant improvement over the old O(N^2) worst-case mark-and-sweep for large fd-passing workloads (systemd environments with 10K+ sockets).
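The dead-SCC rule can be demonstrated on a small in-flight graph. This sketch uses a recursive Tarjan for brevity (the spec mandates an iterative DFS to bound stack usage) and reduces "external reference" to a per-vertex fd-table flag, ignoring reachability from live vertices outside the SCC.

```rust
/// Tarjan SCC over the in-flight fd graph (vertices = sockets with
/// in-flight fds; edges = sender -> receiver).
struct Scc<'g> {
    adj: &'g [Vec<usize>],
    idx: Vec<Option<usize>>,
    low: Vec<usize>,
    on_stack: Vec<bool>,
    stack: Vec<usize>,
    next: usize,
    out: Vec<Vec<usize>>,
}

impl<'g> Scc<'g> {
    fn dfs(&mut self, v: usize) {
        self.idx[v] = Some(self.next);
        self.low[v] = self.next;
        self.next += 1;
        self.stack.push(v);
        self.on_stack[v] = true;
        let neighbors = self.adj; // copy of the shared ref, frees `self`
        for &w in &neighbors[v] {
            if self.idx[w].is_none() {
                self.dfs(w);
                self.low[v] = self.low[v].min(self.low[w]);
            } else if self.on_stack[w] {
                self.low[v] = self.low[v].min(self.idx[w].unwrap());
            }
        }
        if Some(self.low[v]) == self.idx[v] {
            let mut scc = Vec::new();
            while let Some(w) = self.stack.pop() {
                self.on_stack[w] = false;
                scc.push(w);
                if w == v { break; }
            }
            self.out.push(scc);
        }
    }
}

/// An SCC is garbage when it forms a real cycle and no member is held by
/// a user fd table (simplified: the real algorithm also accounts for
/// references reaching the SCC from outside it).
fn dead_sccs(adj: &[Vec<usize>], user_ref: &[bool]) -> Vec<Vec<usize>> {
    let n = adj.len();
    let mut t = Scc { adj, idx: vec![None; n], low: vec![0; n],
                      on_stack: vec![false; n], stack: vec![], next: 0, out: vec![] };
    for v in 0..n { if t.idx[v].is_none() { t.dfs(v); } }
    t.out.into_iter()
        .filter(|scc| {
            let cyclic = scc.len() > 1 || adj[scc[0]].contains(&scc[0]);
            cyclic && scc.iter().all(|&v| !user_ref[v])
        })
        .collect()
}

fn main() {
    // Socket 0 passed its fd to 1 and vice versa: a 2-cycle.
    let adj = vec![vec![1], vec![0]];
    // No user fd table holds either socket: the SCC {0,1} is dead.
    assert_eq!(dead_sccs(&adj, &[false, false]).len(), 1);
    // A live fd-table entry on socket 0 keeps the cycle alive.
    assert!(dead_sccs(&adj, &[true, false]).is_empty());
    println!("ok");
}
```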

16.20.1.2 SO_PASSCRED

The SO_PASSCRED socket option (set via setsockopt(SOL_SOCKET, SO_PASSCRED, &1, 4)) enables automatic credential passing on a Unix socket. When enabled on the receiving socket, the kernel attaches an SCM_CREDENTIALS ancillary message to every incoming recvmsg(), containing the sender's kernel-verified pid, uid, and gid. The sender does not need to explicitly send credentials — the kernel fills them in from the sender's task credentials. This is the standard mechanism for D-Bus peer authentication.

16.20.1.3 SOCK_DGRAM Detailed Semantics

SOCK_DGRAM AF_UNIX sockets provide connectionless, message-boundary-preserving, reliable local IPC:

  • Reliability: Unlike UDP (AF_INET SOCK_DGRAM), AF_UNIX datagrams are in-kernel and never dropped due to network congestion. Delivery fails only if the receiver's receive buffer is full (EAGAIN / EWOULDBLOCK for non-blocking, blocks otherwise).
  • Message boundaries: Each sendmsg() / sendto() call produces exactly one message. Each recvmsg() / recvfrom() retrieves exactly one message. Messages are not coalesced or split. If the receive buffer is smaller than the message, the excess is discarded and the MSG_TRUNC flag is set in msghdr.msg_flags.
  • Addressing: The sender specifies the destination via sendto(fd, buf, len, 0, &dest_addr, addrlen) or by calling connect() to set a default destination. connect() on a SOCK_DGRAM socket is optional and sets the default peer for subsequent send() calls. A connected datagram socket can still use sendto() with a different address.
  • Max message size: Limited by SO_SNDBUF (default 212992 bytes, ~208 KB, matching Linux). Sending a message larger than SO_SNDBUF returns EMSGSIZE.
  • Ancillary data: SCM_RIGHTS and SCM_CREDENTIALS work on SOCK_DGRAM sockets, attached per-message.
  • Receive queue: Each socket has an independent receive queue. Unconnected sockets receive from any sender. Connected sockets filter to only the connected peer.
  • Unbound receivers: sendto() to an unbound address returns ECONNREFUSED. The receiver must bind() to a path or abstract address before datagrams can be delivered.
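These semantics are observable from userspace with std's `UnixDatagram` (here via a connected socketpair, so no filesystem bind is needed) — each recv() returns exactly one message, never a coalesced byte stream:

```rust
use std::os::unix::net::UnixDatagram;

// Two sends produce two distinct datagrams; each recv() drains exactly one
// message (contrast with SOCK_STREAM, where the bytes could arrive merged).
fn two_messages() -> std::io::Result<(Vec<u8>, Vec<u8>)> {
    let (tx, rx) = UnixDatagram::pair()?; // connected SOCK_DGRAM pair
    tx.send(b"hello")?;
    tx.send(b"world!")?;
    let mut buf = [0u8; 64];
    let n = rx.recv(&mut buf)?;
    let first = buf[..n].to_vec();
    let n = rx.recv(&mut buf)?;
    let second = buf[..n].to_vec();
    Ok((first, second))
}
```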

16.20.1.4 SOCK_SEQPACKET Detailed Semantics

SOCK_SEQPACKET AF_UNIX sockets combine connection-oriented reliability with message boundary preservation:

  • Connection model: Uses bind() / listen() / accept() / connect() like SOCK_STREAM. A connection must be established before data transfer.
  • Message boundaries: Each sendmsg() produces one message, each recvmsg() retrieves one message (unlike SOCK_STREAM where byte-stream semantics allow partial reads). MSG_TRUNC is set if the receive buffer was too small.
  • Zero-length messages: Supported and delivered as-is (unlike SOCK_STREAM where a zero-length read indicates EOF). Used by some protocols as keepalive or delimiter signals.
  • In-order delivery: Messages arrive in the order they were sent (no reordering).
  • Shutdown semantics: shutdown(SHUT_WR) sends an EOF indicator to the peer. Subsequent recvmsg() on the peer returns 0 (EOF). shutdown(SHUT_RD) discards incoming data.
  • Ancillary data: SCM_RIGHTS and SCM_CREDENTIALS work per-message.
  • Use cases: varlink (JSON-RPC over SOCK_SEQPACKET), some container runtimes (conmon), Bluetooth L2CAP emulation over local sockets.

16.21 Traffic Control and Queue Disciplines (tc/qdisc)

The Traffic Control subsystem schedules packets on each network device's transmit path. It sits between the socket layer (where ip_output() delivers a NetBuf, converted to NetBufHandle at dev_queue_xmit() entry) and the NIC driver's hardware transmit ring. Qdiscs operate on NetBufHandle tokens (16 bytes) for queue management; the full NetBuf metadata (~256 bytes) remains in the slab pool and is accessed via handle.peek() when needed (e.g., GSO validation). Qdiscs enable rate limiting, latency control, and hierarchical QoS without modifying NIC drivers.

Linux parallel: Linux implements tc through net/sched/ -- struct Qdisc, struct Qdisc_ops, and the RTM_NEWQDISC netlink interface. UmkaOS maps these concepts faithfully so that iproute2 tc and Kubernetes CNI plugins using tc (Cilium, Calico, bandwidth plugin) operate without modification.

16.21.1 Architecture

sendmsg() -> socket TX queue -> NetDev::transmit(buf)
                                    |
                            root qdisc enqueue(buf)
                                    |   [rate limiting / shaping wait here]
                            NIC driver poll / NAPI TX
                                    |
                            root qdisc dequeue()
                                    |
                            NIC hardware ring enqueue

Each NetDev has one root qdisc (TX path) and optionally one ingress qdisc (RX path, for filtering and policing before socket delivery). The root qdisc may be classful (HTB, HFSC) -- containing child qdiscs on leaf classes -- or classless (pfifo_fast, fq_codel).

16.21.2 TcHandle and the Handle Namespace

/// Traffic control handle (major:minor encoded as a u32).
///
/// Major identifies a qdisc; minor identifies a class within that qdisc.
/// Minor 0 refers to the qdisc itself (not any class).
///
/// Encoding: upper 16 bits = major, lower 16 bits = minor.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub struct TcHandle(pub u32);

impl TcHandle {
    /// Root qdisc of the device (attach point for the first qdisc).
    /// Linux TC_H_ROOT = 0xFFFFFFFF (impossible major:minor, used as sentinel).
    pub const ROOT: TcHandle = TcHandle(0xFFFF_FFFF);
    /// Ingress pseudo-qdisc handle.
    pub const INGRESS: TcHandle = TcHandle(0xFFFF_FFF1);
    /// Clsact pseudo-qdisc handle (used by BPF/Cilium for tc redirect).
    /// In Linux, TC_H_CLSACT == TC_H_INGRESS == 0xFFFF_FFF1.
    pub const CLSACT: TcHandle = TcHandle(0xFFFF_FFF1);

    pub fn new(major: u16, minor: u16) -> Self {
        TcHandle(((major as u32) << 16) | (minor as u32))
    }
    pub fn major(self) -> u16 { (self.0 >> 16) as u16 }
    pub fn minor(self) -> u16 { self.0 as u16 }
}
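The encoding round-trips with iproute2's hexadecimal `major:minor` notation (`tc class add ... classid 1:10` names class 0x10 of qdisc major 1). A self-contained sketch, re-declaring the struct and adding a hypothetical `parse_classid` helper that is not part of the spec:

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct TcHandle(u32);

impl TcHandle {
    fn new(major: u16, minor: u16) -> Self {
        TcHandle(((major as u32) << 16) | (minor as u32))
    }
    fn major(self) -> u16 { (self.0 >> 16) as u16 }
    fn minor(self) -> u16 { self.0 as u16 }
}

/// Parse iproute2 "major:minor" notation. Both halves are hexadecimal;
/// "1:" (empty minor) names the qdisc itself (minor 0).
fn parse_classid(s: &str) -> Option<TcHandle> {
    let (maj, min) = s.split_once(':')?;
    let major = u16::from_str_radix(maj, 16).ok()?;
    let minor = if min.is_empty() {
        0
    } else {
        u16::from_str_radix(min, 16).ok()?
    };
    Some(TcHandle::new(major, minor))
}
```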

16.21.3 QdiscOps Trait

/// Network device error codes.
pub enum NetDevError {
    /// Queue is full — caller should drop or requeue the packet.
    QueueFull,
    /// Device is down or not ready.
    DeviceDown,
    /// Transmission timeout.
    TxTimeout,
    /// General hardware error.
    HardwareError { code: i32 },
}

/// Qdisc algorithm interface.
///
/// Implementations are stateless descriptors. Per-device qdisc state lives in
/// `Qdisc.priv_data`. All methods execute in umka-net's isolation domain.
pub trait QdiscOps: Send + Sync {
    /// Algorithm name (ASCII, max 15 bytes + NUL, matches Linux IFNAMSIZ for qdiscs).
    fn name(&self) -> &'static str;

    /// Enqueue a packet.
    ///
    /// The qdisc takes ownership of `buf` (move-only `NetBufHandle`). If the
    /// queue is full, the qdisc drops `buf` — the `NetBufHandle::Drop` impl
    /// returns the slab slot to the pool automatically — and returns
    /// `Err(NetDevError::QueueFull)`. No explicit `netbuf_free()` call is
    /// needed; Rust's drop semantics handle cleanup.
    /// Returning `Ok(())` guarantees eventual dequeue.
    ///
    /// Takes `&Qdisc` (not `&mut Qdisc`) because the Qdisc is shared across
    /// concurrent TX paths (multiple CPUs call `qdisc_run()` concurrently,
    /// serialized by `Qdisc.lock`). The lock guards the algorithm-private
    /// state in `Qdisc.priv_data`; statistics fields (`bytes`, `packets`,
    /// `drops`, `qlen`) are atomics, readable without taking the lock.
    fn enqueue(&self, buf: NetBufHandle, qdisc: &Qdisc) -> Result<(), NetDevError>;

    /// Dequeue the next packet to transmit.
    ///
    /// Returns `None` if the qdisc has no packet to transmit right now
    /// (queue empty, or rate limited -- the qdisc will call `netdev_wake_queue()`
    /// when it is ready). The NIC driver calls this from its NAPI TX poll.
    fn dequeue(&self, qdisc: &Qdisc) -> Option<NetBufHandle>;

    /// Reset the qdisc to its initial (empty) state, dropping all queued packets.
    /// Called when the device is brought down (NETDEV_DOWN).
    fn reset(&self, qdisc: &Qdisc);

    /// Free all resources allocated by this qdisc instance.
    /// Called after `reset()` when the qdisc is detached or the device destroyed.
    fn destroy(&self, qdisc: &mut Qdisc);

    /// Reconfigure the qdisc from netlink attributes.
    ///
    /// Called for `RTM_NEWQDISC` with `NLM_F_REPLACE` on an existing qdisc,
    /// or after initial creation. Must validate `opts` before mutating state.
    /// On error, the existing configuration is unchanged.
    fn change(
        &self,
        qdisc: &mut Qdisc,
        opts: &NlAttrSet,
    ) -> Result<(), KernelError>;

    /// Serialise the qdisc's current configuration into `skb` as netlink attributes.
    /// Called for `RTM_GETQDISC` and `RTM_NEWQDISC` replies.
    fn dump(&self, qdisc: &Qdisc, skb: &mut NetBuf) -> Result<(), KernelError>;

    /// Return current statistics snapshot.
    fn stats(&self, qdisc: &Qdisc) -> QdiscStats;
}

16.21.4 Qdisc Struct

Multi-queue native Qdisc design:

Each Qdisc instance is scoped to a single TX queue. Multi-queue NICs (virtio-net, i40e, mlx5) create N Qdisc instances — one per hardware queue. Contention is per-queue, eliminating the single-lock bottleneck of a per-NIC design.

// Schematic sketch — the full definition appears later in this section.
struct Qdisc {
    lock: SpinLock<()>,    // per-queue lock, not per-NIC
    queue_index: u16,      // which TX queue this Qdisc serves
    // ... scheduler-specific state
}

Lock-free fast path (simple FIFO): For pfifo (a plain packet-limited FIFO with no classification or shaping), the enqueue path uses a lock-free ring buffer: the producer atomically advances the ring tail via AtomicUsize::fetch_add. No spinlock. The consumer (NIC driver NAPI poll) reads the head and advances it after DMA. Throughput is limited only by ring-buffer capacity and NIC hardware speed.
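A minimal sketch of the ring, simplified to a single producer and single consumer (the multi-producer `fetch_add` reservation protocol is elided), with `u64` payloads standing in for `NetBufHandle`:

```rust
use std::sync::atomic::{AtomicU64, AtomicUsize, Ordering};

/// SPSC ring: producer owns `tail`, consumer owns `head`. The only
/// synchronization is one Release store paired with one Acquire load —
/// no spinlock anywhere on the enqueue/dequeue path.
pub struct SpscRing {
    slots: Vec<AtomicU64>, // stands in for the 16-byte NetBufHandle
    head: AtomicUsize,     // consumer cursor (free-running)
    tail: AtomicUsize,     // producer cursor (free-running)
}

impl SpscRing {
    pub fn new(capacity: usize) -> Self {
        SpscRing {
            slots: (0..capacity).map(|_| AtomicU64::new(0)).collect(),
            head: AtomicUsize::new(0),
            tail: AtomicUsize::new(0),
        }
    }

    /// Enqueue; on a full ring the packet is tail-dropped (returned to caller).
    pub fn push(&self, v: u64) -> Result<(), u64> {
        let tail = self.tail.load(Ordering::Relaxed); // producer-owned
        let head = self.head.load(Ordering::Acquire); // see freed slots
        if tail - head == self.slots.len() {
            return Err(v); // full — maps to NetDevError::QueueFull
        }
        self.slots[tail % self.slots.len()].store(v, Ordering::Relaxed);
        self.tail.store(tail + 1, Ordering::Release); // publish the slot
        Ok(())
    }

    /// Dequeue from the NAPI TX poll side.
    pub fn pop(&self) -> Option<u64> {
        let head = self.head.load(Ordering::Relaxed); // consumer-owned
        let tail = self.tail.load(Ordering::Acquire); // see published slots
        if head == tail {
            return None; // empty
        }
        let v = self.slots[head % self.slots.len()].load(Ordering::Relaxed);
        self.head.store(head + 1, Ordering::Release); // free the slot
        Some(v)
    }
}
```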

Lock path (hierarchical schedulers): HTB, HFSC, and CBS schedulers require traversing the class hierarchy to find the target leaf queue. These take the queue_lock for the duration of the hierarchy traversal and enqueue. The lock is per-queue, so N queues can run simultaneously on N cores.

RX path: Symmetric per-queue design. Each RX queue has an associated NAPI instance; no shared state between queues on the RX path.

/// Qdisc configuration flags.
/// Values match Linux `TCQ_F_*` from `include/net/sch_generic.h`.
/// These are kernel-internal but propagate via netlink TCA attributes
/// (e.g., `tc qdisc show` reports offloaded status) and control critical
/// branching in `qdisc_create`, the NOLOCK fast path, and the bypass
/// optimization.
bitflags::bitflags! {
    pub struct QdiscFlags: u32 {
        /// Built-in qdisc (not user-created, cannot be deleted).
        const TCQ_F_BUILTIN       = 0x1;
        /// Qdisc accepts incoming traffic (ingress qdisc).
        const TCQ_F_INGRESS       = 0x2;
        /// Qdisc supports bypass (empty queue optimization).
        const TCQ_F_CAN_BYPASS    = 0x4;
        /// Multiqueue root qdisc (one child per TX queue).
        const TCQ_F_MQROOT        = 0x8;
        /// Qdisc is attached to exactly one TX queue.
        const TCQ_F_ONETXQUEUE    = 0x10;
        /// Use per-CPU statistics (avoids global atomic contention).
        const TCQ_F_CPUSTATS      = 0x20;
        /// Qdisc has no parent (root-level or ingress).
        const TCQ_F_NOPARENT      = 0x40;
        /// Qdisc is invisible to `tc` dump (internal implementation detail).
        const TCQ_F_INVISIBLE     = 0x80;
        /// Lockless qdisc: uses per-CPU enqueue paths instead of `Qdisc.lock`.
        const TCQ_F_NOLOCK        = 0x100;
        /// Hardware offload active for this qdisc.
        const TCQ_F_OFFLOADED     = 0x200;
        /// Count drops in dequeue path (for qdiscs that drop on dequeue).
        const TCQ_F_DEQUEUE_DROPS = 0x400;
        /// Warn if non-work-conserving (debugging flag).
        const TCQ_F_WARN_NONWC    = 1 << 16;
        /// Qdisc is being evolved (algorithm swap in progress).
        /// UmkaOS-internal flag (no Linux equivalent). While set, new
        /// enqueue/dequeue calls are rejected (return QueueFull / None).
        /// Set by `qdisc_evolve()` during the quiescence window; cleared
        /// after the algorithm swap completes. Analogous to the block
        /// layer's `QUEUE_FLAG_QUIESCING` pattern.
        const TCQ_F_EVOLVING      = 1 << 17;
    }
}

/// A qdisc instance attached to a single TX queue of a network device.
///
/// Scoped to one hardware TX queue (`queue_index`). Multi-queue NICs have one
/// `Qdisc` per queue. Statistics fields are atomics -- readable without locking.
pub struct Qdisc {
    /// Active qdisc algorithm. Plain field — not atomic, not RcuCell.
    /// Reads on the enqueue/dequeue hot path are zero-cost (no RCU read
    /// lock, no atomic load). Mutation happens only in `qdisc_evolve()`,
    /// which takes `&mut self` after quiescing the qdisc (all in-flight
    /// enqueue/dequeue operations have drained, `TCQ_F_EVOLVING` blocks
    /// new ones). Exclusive `&mut` access is safe because quiescence
    /// guarantees no concurrent readers.
    pub ops: &'static dyn QdiscOps,
    /// This qdisc's handle (major:0 = the qdisc itself).
    pub handle: TcHandle,
    /// Parent handle: `TcHandle::ROOT` for the device root qdisc,
    /// or the parent HTB class handle for leaf qdiscs.
    pub parent: TcHandle,
    /// Which TX hardware queue this Qdisc instance serves.
    pub queue_index: u16,
    /// Weak reference to the owning device (prevents retain cycle).
    pub dev: Weak<NetDev>,
    /// Algorithm-private state for this Qdisc instance (e.g., HTB class tree,
    /// TBF token bucket, FQ flow table). One allocation per Qdisc instance
    /// (one per TX queue) — this is a cold-path configuration object, not a
    /// per-packet hot-path structure.
    ///
    /// **Not related to TCP congestion control**: TCP CC uses `TcpCb.cong_priv`
    /// (a 64-byte inline `CongPriv` union, zero heap allocation per connection)
    /// with a `&'static dyn CongestionOps` ops pointer. See Section 16.6.
    /// `Box<dyn Any>` here is exclusively for Qdisc (traffic-shaping) algorithms.
    pub priv_data: Box<dyn Any + Send>,
    /// Bytes enqueued (cumulative; wraps on overflow).
    pub bytes: AtomicU64,
    /// Packets enqueued (cumulative).
    pub packets: AtomicU64,
    /// Packets dropped due to queue full or policing.
    pub drops: AtomicU64,
    /// Packets that exceeded the rate limit (overlimit / shaped).
    pub overlimits: AtomicU64,
    /// Queue length in packets (current, not cumulative — instantaneous gauge).
    /// u32 is correct: this is a bounded gauge (not a monotonic counter), capped
    /// by `limit` (default 1000) and physical memory. It increments on enqueue
    /// and decrements on dequeue; it does NOT accumulate over time. A u32 can
    /// represent up to ~4 billion packets in queue, which exceeds any physical
    /// memory constraint. 50-year longevity: N/A (gauge, not counter).
    pub qlen: AtomicU32,
    pub flags: QdiscFlags,
    /// Optional size table for overhead accounting (ATM cell padding, etc.).
    pub stab: Option<SizeTable>,
    /// Serialises enqueue/dequeue for hierarchical schedulers (HTB, HFSC, CBS).
    /// Not used by pfifo (lock-free ring buffer fast path).
    /// Lock level 50 — acquired after `TcpCb.lock` (level 40) on the TX path.
    /// 10-level gap provides 9 insertion points (41-49) for future TX locks.
    lock: SpinLock<()>,  // level 50
    /// Per-CPU lockless defer list for TX packets from non-owning CPUs.
    /// Each CPU appends to its own list (plain store, zero contention).
    /// The drainer reads `pending_cpus` bitmap and drains non-empty lists.
    /// Replaces the single shared TX contention lock (Linux's `busylock`)
    /// with per-CPU lists, so cross-CPU TX never serialises on one cache line.
    pub defer: QdiscDeferState,
    /// Re-entrancy guard. `true` = `qdisc_run()` in progress. Only one
    /// CPU runs `qdisc_run()` at a time; non-owners append to `defer`
    /// lists and return. Written under `Qdisc.lock` (for locked qdiscs)
    /// or with atomic swap (for NOLOCK qdiscs). Read via atomic load by
    /// concurrent `qdisc_run()` callers — if `true`, defer to
    /// NET_TX softirq or append to the defer list.
    pub running: AtomicBool,
}

/// Per-CPU lockless defer list for TX packets from non-owning CPUs.
/// Each CPU appends to its own list (plain store, zero contention).
/// The drainer reads `pending_cpus` bitmap and drains non-empty lists.
///
/// This replaces the single shared TX contention lock (Linux's `busylock`)
/// with per-CPU defer lists. The design ensures:
/// 1. Non-owner CPUs never touch the qdisc lock — they append to their
///    per-CPU defer list and set their bit in `pending_cpus`.
/// 2. The qdisc owner (the CPU running `qdisc_run()`) drains all defer
///    lists before returning, coalescing deferred packets into the qdisc.
/// 3. No false sharing: each CPU's `NetBufList` is on its own cache line.
pub struct QdiscDeferState {
    /// Per-CPU packet lists. Only the local CPU writes (no atomics needed
    /// for the list itself). The drainer reads under the invariant that the
    /// local CPU has moved on (its bit is set in `pending_cpus`, and the
    /// drainer clears it before reading the list).
    pub per_cpu_lists: PerCpu<UnsafeCell<NetBufList>>,
    /// Bitmap: which CPUs have pending packets. One AtomicU64 per 64 CPUs.
    /// Set via `fetch_or` by enqueueing CPU, read+clear by drainer.
    /// Allocated at Qdisc init time, length = (num_possible_cpus() + 63) / 64.
    /// Runtime-sized to match `per_cpu_lists` — UmkaOS has no compile-time
    /// `MAX_CPUS` for heap-allocated structures; the CPU count is discovered at boot.
    pub pending_cpus: Box<[AtomicU64]>,
    /// Number of deferred packets across all CPUs (approximate, for stats).
    pub defer_count: AtomicU32,
}

16.21.5 Builtin Qdiscs

16.21.5.1 pfifo_fast -- Default for Newly Created Devices

Three-band strict-priority FIFO. Band 0 is highest priority, band 2 lowest. Packet priority is determined by the IP DSCP field: DSCP bits [5:3] are mapped to a band via a static priority map (matching Linux's prio_map). SO_PRIORITY on the socket overrides the DSCP classification.

Parameters: maximum queue depth 1000 packets per band (fixed; not configurable via RTM_NEWQDISC for pfifo_fast). Total limit: 3000 packets.

Enqueue: append to tail of the packet's band. Drop if band is at its limit (tail drop). Dequeue: scan bands 0 to 2; return head of first non-empty band.
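The band decision reduces to a table lookup on the precedence bits. A sketch — the map values below are illustrative, not the spec's normative priority map:

```rust
/// Illustrative DSCP-precedence -> band map (band 0 = highest priority).
/// Indexed by DSCP bits [5:3], i.e. the legacy IP Precedence field.
/// These exact values are a sketch, not the normative prio_map.
const PRIO_MAP: [u8; 8] = [1, 2, 2, 2, 1, 1, 0, 0];

/// Select the pfifo_fast band from the IPv4 TOS byte.
/// DSCP occupies TOS bits [7:2], so DSCP bits [5:3] are TOS bits [7:5].
fn band_for_tos(tos: u8) -> u8 {
    let precedence = (tos >> 5) & 0x7;
    PRIO_MAP[precedence as usize]
}
```

SO_PRIORITY on the socket would override this classification, as noted above.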

16.21.5.2 fq_codel -- Fair Queue with Controlled Delay

FQ-CoDel (RFC 8290) combines per-flow FIFO queuing (fair queuing) with the CoDel AQM algorithm for delay control.

/// Maximum per-flow packet queue depth for fq_codel.
/// Limits worst-case memory per flow; typical flows stay well under this.
/// Memory per flow: 32 × size_of::<Option<NetBufHandle>>() = 32 × 16 = 512 bytes.
/// (NetBufHandle is 16 bytes with niche optimization for Option.)
pub const FQ_CODEL_FLOW_DEPTH: usize = 32;
const_assert!(core::mem::size_of::<Option<NetBufHandle>>() == 16);

/// Sentinel index meaning "end of intrusive list" (no next flow).
pub const FQ_FLOW_NONE: u32 = u32::MAX;

/// FQ-CoDel qdisc private state.
///
/// All per-packet structures are pre-allocated at qdisc creation — there is
/// no dynamic allocation on the TX fast path. `flows` is allocated once as a
/// `Box<[CodelFlow]>` with `num_flows` entries. Flow lists use intrusive links
/// embedded in `CodelFlow` rather than heap-allocated `LinkedList` nodes.
pub struct FqCodelPriv {
    /// Hash table of per-flow queues; indexed by 5-tuple hash mod `num_flows`.
    /// Allocated once at qdisc creation; never grown or shrunk at runtime.
    pub flows: Box<[CodelFlow]>,
    /// CoDel target delay (default: 5 ms). Packets sojourning longer than
    /// this in the queue are ECN-marked or dropped.
    pub target_us: u32,
    /// CoDel interval (default: 100 ms). Minimum time between consecutive drops.
    pub interval_us: u32,
    /// DRR quantum in bytes (default: 1514 = 1500-byte MTU + 14-byte Ethernet header).
    pub quantum: u32,
    /// Number of per-flow queues (default: 1024, must be power-of-two).
    pub num_flows: u32,
    /// Total packet limit across all flows (default: 10240).
    pub limit: u32,
    /// Number of packets currently queued across all flows.
    pub backlog: u32,
    /// Head of the new-flows intrusive list (index into `flows`; FQ_FLOW_NONE = empty).
    /// New flows (sparse, recently active after idle) are served before old flows.
    pub new_flows_head: u32,
    /// Head of the old-flows intrusive list (index into `flows`; FQ_FLOW_NONE = empty).
    pub old_flows_head: u32,
}

/// Per-flow state within fq_codel.
///
/// The packet queue is a fixed-capacity ring buffer embedded directly in this
/// struct (no heap allocation after flow initialization). The flow's position
/// in new_flows or old_flows is tracked via intrusive links (`next_active`),
/// eliminating `LinkedList` node allocation on flow transitions.
pub struct CodelFlow {
    /// Packet queue for this flow: fixed-capacity ring buffer, no allocation.
    /// Enqueue: write to queue_buf[tail % FQ_CODEL_FLOW_DEPTH], advance tail.
    /// Dequeue: read from queue_buf[head % FQ_CODEL_FLOW_DEPTH], advance head.
    /// Drop (tail drop): reject the arriving packet when (tail - head) == FQ_CODEL_FLOW_DEPTH.
    pub queue_buf: [Option<NetBufHandle>; FQ_CODEL_FLOW_DEPTH],
    pub queue_head: u32,
    pub queue_tail: u32,
    /// DRR deficit counter (credits accumulated for this flow).
    pub deficit: i32,
    /// CoDel state: whether the flow is in "dropping state".
    pub dropping: bool,
    /// Next scheduled CoDel drop time (us) for this flow while in dropping state.
    pub drop_next_us: u64,
    /// Number of packets dropped by CoDel on this flow.
    pub drop_count: u32,
    /// Number of packets ECN-marked (CE) instead of dropped.
    pub ecn_mark: u32,
    /// Intrusive list link: next flow index in the active list (new or old).
    /// `FQ_FLOW_NONE` means this flow is not in any active list or is the tail.
    pub next_active: u32,
    /// Which active list this flow is currently in.
    pub list_tag: FqFlowList,
}

/// Which active list a flow is currently in.
#[repr(u8)]
pub enum FqFlowList {
    None = 0,  // idle, not in any list
    New  = 1,  // in new_flows (sparse)
    Old  = 2,  // in old_flows (bulk)
}

Scheduling: Deficit Round Robin across the two lists (new then old). Each flow gets quantum bytes of credit per round. CoDel monitors the sojourn time of the head packet in each flow; if the sojourn exceeds target for longer than interval, it either ECN-marks (if the packet has the ECT bit) or drops, computing the next drop time as drop_next = drop_next + interval / sqrt(drop_count) (matching RFC 8290).

Sparse flow optimisation: flows that send a packet after an idle period are placed in the new-flow list with a full quantum, allowing latency-sensitive flows (DNS, SSH) to bypass the bulk-flow queue.
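The control law can be exercised numerically — successive drop intervals shrink as 1/sqrt(drop_count), so drops accelerate while the flow stays above target:

```rust
/// CoDel control law: advance the next drop time by interval / sqrt(count).
/// Matches the drop_next recurrence given above (RFC 8290 style); the f64
/// intermediate is a sketch, a kernel build would use fixed-point math.
fn control_law(drop_next_us: u64, interval_us: u64, drop_count: u32) -> u64 {
    drop_next_us + (interval_us as f64 / (drop_count as f64).sqrt()) as u64
}
```

With the default 100 ms interval, the first drop is followed 100 ms later by the second, but by the fourth drop the spacing has halved to 50 ms.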

16.21.5.3 htb -- Hierarchical Token Bucket

HTB enables guaranteed bandwidth allocation with optional bursting up to a configured ceiling. It is the standard QoS mechanism for Kubernetes network bandwidth enforcement.

/// Maximum depth of the HTB class hierarchy (root = level MAX_HTB_DEPTH-1, leaves = level 0).
/// Limits worst-case tree walk depth on dequeue. 8 levels is sufficient for all
/// practical QoS hierarchies (Linux's HTB implementation also uses 8 levels).
pub const MAX_HTB_DEPTH: usize = 8;

/// Maximum number of child classes per HTB inner class.
/// Bounds per-level scanning during dequeue. Typical deployments use 2-10 children;
/// 64 accommodates large multi-tenant configurations.
pub const MAX_HTB_CHILDREN: usize = 64;

/// HTB class state.
pub struct HtbClass {
    pub handle: TcHandle,
    pub parent: TcHandle,
    /// Guaranteed rate (bytes/second). Token bucket replenished at this rate.
    pub rate: u64,
    /// Ceiling rate (bytes/second). Class may borrow up to this rate if parent allows.
    pub ceil: u64,
    /// Token bucket tokens available (in bytes; negative = in deficit).
    pub tokens: i64,
    /// Ceiling token bucket.
    pub ctokens: i64,
    /// Last time tokens were updated (ktime_us).
    pub t_c: u64,
    /// Maximum burst size in bytes (rate * burst_us).
    pub burst: u32,
    /// Leaf qdisc (if this is a leaf class).
    pub leaf: Option<Box<Qdisc>>,
    /// Child classes (if inner class). Bounded to `MAX_HTB_CHILDREN` per class.
    pub children: ArrayVec<Arc<SpinLock<HtbClass>>, MAX_HTB_CHILDREN>,
    /// HTB level (0 = leaf, increases toward root).
    pub level: u32,
    /// Priority queue key for dequeue scheduling.
    pub pq_key: u64,
}

HTB maintains per-level priority queues (HtbLevel arrays) in the qdisc's private data. At each dequeue call, HTB walks from the root downward, selecting the highest-priority class that has tokens available. Borrowed bandwidth: a child class at its rate limit may borrow from its parent's excess capacity up to ceil. Token buckets are replenished lazily on each dequeue, computed from elapsed time since t_c.

Tree walk complexity: the HTB tree walk is O(depth x max_children_per_level). With MAX_HTB_DEPTH = 8 and up to MAX_HTB_CHILDREN = 64 children per level (typical deployments use 2-10), the worst case is 8 x 64 = 512 comparisons per dequeue — acceptable for traffic-shaped flows where per-packet scheduling overhead is expected. A proper priority heap (min-heap of eligible classes) would reduce dequeue to O(log N), but the added complexity is justified only for very deep or wide HTB hierarchies; it is documented as a future optimisation if profiling shows the linear scan to be a bottleneck.
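The per-class decision reduces to a three-way mode test over `tokens` and `ctokens`, plus the lazy replenishment from elapsed time since `t_c`. A sketch — the `HtbMode` enum, the zero thresholds, and the function names are illustrative simplifications, not spec types:

```rust
#[derive(Debug, PartialEq)]
enum HtbMode {
    CanSend,   // tokens available at the guaranteed rate
    MayBorrow, // over rate but under ceil: may use parent's excess capacity
    CantSend,  // over ceil: must wait for token replenishment
}

/// Classify a class by its two token buckets (guaranteed-rate and ceiling).
fn htb_mode(tokens: i64, ctokens: i64) -> HtbMode {
    if tokens > 0 {
        HtbMode::CanSend
    } else if ctokens > 0 {
        HtbMode::MayBorrow
    } else {
        HtbMode::CantSend
    }
}

/// Lazy replenishment on dequeue: earn tokens for the time since t_c,
/// clamped to the burst capacity. Bytes and microseconds throughout.
fn replenish(tokens: i64, rate_bps: u64, elapsed_us: u64, burst: i64) -> i64 {
    let earned = (rate_bps as i64 * elapsed_us as i64) / 1_000_000;
    (tokens + earned).min(burst)
}
```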

16.21.5.4 noqueue -- No Queuing

Used for loopback and virtual devices (veth, tun/tap) where the driver accepts packets immediately. Enqueue: calls the device's hard-start-xmit directly and returns. Dequeue: always returns None (nothing is buffered). If the device rejects the packet, enqueue() propagates the error to the caller.

16.21.5.5 tbf -- Token Bucket Filter

TBF is a classless rate-limiting qdisc: it controls the rate at which packets are dequeued from a single FIFO queue using a token bucket algorithm. Docker uses TBF for container bandwidth limiting (tc qdisc add dev veth... root tbf rate... burst... latency...), and the Kubernetes bandwidth CNI plugin relies on TBF. TBF is a Phase 2 deliverable (required for Docker and Kubernetes bandwidth limiting).

/// Token Bucket Filter private state. Stored in `Qdisc.priv_data`.
pub struct TbfPriv {
    /// Configured rate in bytes per second. Set by `tc qdisc add ... rate <X>`.
    pub rate_bps: u64,
    /// Maximum burst size in bytes (token bucket capacity).
    /// At least one MTU — otherwise no packet can ever be dequeued.
    pub burst_bytes: u64,
    /// Queue byte limit. If the internal queue grows beyond this, packets
    /// are tail-dropped. `limit` = `rate * latency + burst` (derived from
    /// the `latency` parameter in the `tc` command).
    pub limit_bytes: u32,
    /// Current token count (bytes). Tokens accumulate at `rate_bps` and are
    /// consumed when packets are dequeued. Clamped to [0, burst_bytes].
    /// Atomic for concurrent `enqueue()` + timer-driven `dequeue()`.
    pub tokens: AtomicI64,
    /// Timestamp (ns, monotonic) of the last token replenishment.
    pub last_refill_ns: AtomicU64,
    /// Internal FIFO queue. Bounded by `limit_bytes`. Queue takes
    /// ownership of enqueued packets via `NetBufHandle`.
    ///
    /// **Synchronization**: `enqueue()` acquires `queue` SpinLock to push
    /// and increments `queue_bytes` atomically (Relaxed). `dequeue()`
    /// acquires `queue` SpinLock to pop and decrements `queue_bytes`.
    /// Token operations (`tokens`, `last_refill_ns`) use atomics and are
    /// NOT protected by the SpinLock — they are read/written in
    /// `dequeue()` only (single consumer on the NAPI path).
    ///
    /// **`queue_bytes` TOCTOU note**: The atomic `queue_bytes` check in
    /// `enqueue()` is approximate limit enforcement — under concurrent
    /// enqueue from multiple CPUs, the queue may momentarily exceed
    /// `limit_bytes` by up to one MTU. This matches Linux TBF behavior.
    pub queue: SpinLock<ArrayDeque<NetBufHandle, TBF_QUEUE_LIMIT>>,
    /// Current queue byte count (for limit enforcement). Approximate —
    /// see TOCTOU note on `queue` above.
    pub queue_bytes: AtomicU32,
}
/// Maximum packets in TBF internal queue. Packets beyond this are dropped.
/// Configured via `limit_bytes` (byte-based); this is the slot count for
/// the fixed-capacity deque. 10000 slots * 1500 MTU = 15 MB max, sufficient
/// for 10 Gbps at 12 ms latency.
const TBF_QUEUE_LIMIT: usize = 10000;

Token bucket algorithm (executed at dequeue() time):

1. Compute elapsed = now_ns - last_refill_ns.
2. Refill: tokens += (elapsed * rate_bps) / 1_000_000_000. Clamp to burst_bytes.
3. Update last_refill_ns = now_ns.
4. Peek front packet. If tokens >= packet_len:
   a. Dequeue packet.
   b. tokens -= packet_len.
   c. Return packet.
5. Else: no packet is dequeued (rate-limited). Schedule a watchdog timer
   to re-run dequeue in (packet_len - tokens) * 1e9 / rate_bps nanoseconds.
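The steps above can be sketched as a pure function over the token state (the function name and tuple return are illustrative, not spec API):

```rust
/// One TBF dequeue decision (steps 1-5 above). Returns (new_tokens, send_now):
/// `send_now` is true when the head packet may be transmitted, in which case
/// `new_tokens` already has the packet length deducted. Small operands only —
/// a kernel build would guard the elapsed*rate product against overflow.
fn tbf_dequeue_step(
    tokens: i64, last_refill_ns: u64, now_ns: u64,
    rate_bps: u64, burst_bytes: i64, pkt_len: i64,
) -> (i64, bool) {
    // Steps 1-3: refill at rate_bps for the elapsed time, clamp to capacity.
    let elapsed_ns = now_ns - last_refill_ns;
    let refilled = (tokens + (elapsed_ns as i64 * rate_bps as i64) / 1_000_000_000)
        .min(burst_bytes);
    if refilled >= pkt_len {
        (refilled - pkt_len, true) // steps 4a-4c: dequeue and charge tokens
    } else {
        (refilled, false)          // step 5: rate-limited, arm the watchdog
    }
}
```

At 1000 B/s, half a second of idle earns 500 tokens: a 400-byte packet goes out (leaving 100 tokens), while a 600-byte packet must wait for the watchdog.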

QdiscOps implementation:

  • enqueue(): if queue_bytes + pkt_len > limit_bytes, drop and return NET_XMIT_DROP. Otherwise push to tail.
  • dequeue(): run the token bucket algorithm above.
  • init(): parse rate, burst, latency/limit from netlink TLV attributes. burst must be >= MTU. limit = rate * latency + burst if latency is provided.
  • destroy(): drain the queue, dropping all buffered NetBufHandles (the Drop impl returns each slab slot to the pool and frees DMA pages).

16.21.6 Classifiers (tc Filters)

Filters classify packets into qdisc classes. Each filter is attached to a qdisc (or a filter chain within it) and inspects the packet to return a class handle.

/// Classifier (tc filter) interface.
pub trait ClsOps: Send + Sync {
    fn name(&self) -> &'static str;

    /// Classify `buf` into a class.
    ///
    /// Borrows the handle: classification never consumes the packet.
    /// Ownership stays with the enqueue path, which acts on the result
    /// (enqueue to the class, drop, or redirect).
    ///
    /// Returns:
    /// - `ClsResult::Class(handle)`: packet goes to this class
    /// - `ClsResult::Drop`: packet is dropped immediately
    /// - `ClsResult::Ok`: no match; continue to next filter in chain
    /// - `ClsResult::Redir(ifindex)`: redirect to another device (tc redirect action)
    fn classify(&self, buf: &NetBufHandle, tp: &TcFilter) -> ClsResult;

    /// Install or update a filter from netlink attributes.
    fn change(&self, tp: &mut TcFilter, opts: &NlAttrSet) -> Result<(), KernelError>;

    /// Destroy the filter, releasing allocated resources.
    fn destroy(&self, tp: &mut TcFilter);

    /// Dump the filter configuration as netlink attributes.
    fn dump(&self, tp: &TcFilter, skb: &mut NetBuf) -> Result<(), KernelError>;
}

#[derive(Debug)]
pub enum ClsResult {
    /// Packet classified into the specified class.
    Class(TcHandle),
    /// Packet should be dropped.
    Drop,
    /// No match; fall through to next filter.
    Ok,
    /// Redirect packet to another network device (by ifindex).
    Redir(u32),
}

Builtin classifiers:

  • u32: Bitmask matching on arbitrary 32-bit words at fixed offsets in the packet header. Supports up to 128 keys per filter and optional hash tables for O(1) lookup on large rule sets. Used for IP address and port matching.

  • flower: Exact-match classifier on a set of header fields (Ethernet type, IP src/dst, L4 proto, TCP/UDP ports, VLAN id, MPLS label, etc.). Backed by a hash table; O(1) lookup regardless of rule count. Used by Kubernetes CNI plugins for policy enforcement and network overlays.

  • bpf: Attaches a verified eBPF program as the classifier. The program receives the NetBufHandle (via a BPF map lookup that translates the handle to the data pointer) and returns a class handle or TC_ACT_SHOT. This is the primary mechanism used by Cilium and Calico for Kubernetes network policy -- all policy logic is compiled to eBPF by the CNI plugin and loaded via RTM_NEWTFILTER. Requires Capability::NetAdmin.
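Each u32 key is a masked 32-bit compare at a fixed offset in the packet. A sketch of the match primitive — the `U32Key` struct and field names are illustrative, not spec types:

```rust
/// One u32 classifier key: match if (big-endian 32-bit word at `offset`)
/// ANDed with `mask` equals `value`. A real filter chains up to 128 keys,
/// all of which must match.
struct U32Key {
    offset: usize,
    mask: u32,
    value: u32,
}

fn matches(pkt: &[u8], key: &U32Key) -> bool {
    match pkt.get(key.offset..key.offset + 4) {
        Some(w) => {
            let word = u32::from_be_bytes([w[0], w[1], w[2], w[3]]);
            word & key.mask == key.value
        }
        None => false, // packet too short to contain this key
    }
}
```

For example, matching the 10.0.0.0/24 destination of an IPv4 header means offset 16 (dst address), mask 0xFFFFFF00, value 0x0A000000.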

16.21.7 Netlink Control Interface

The following RTM message types are handled by the rtnetlink processor (Section 16.17):

Message         Direction      Description
RTM_NEWQDISC    user->kernel   Create or replace a qdisc on a device
RTM_DELQDISC    user->kernel   Delete a qdisc; reverts to pfifo_fast
RTM_GETQDISC    user->kernel   Get one qdisc (dump all with NLM_F_DUMP); kernel replies RTM_NEWQDISC
RTM_NEWTFILTER  user->kernel   Attach a filter to a qdisc
RTM_DELTFILTER  user->kernel   Remove a filter
RTM_GETTFILTER  user->kernel   Get one filter (dump all on a qdisc with NLM_F_DUMP)
RTM_NEWCHAIN    user->kernel   Create a named filter chain on a qdisc
RTM_DELCHAIN    user->kernel   Delete a filter chain
RTM_GETCHAIN    user->kernel   Get/dump filter chains
All mutating operations require Capability::NetAdmin.

16.21.8 Integration with cgroups Network Bandwidth Enforcement

Cgroup v2 has no net_cls or net_prio controllers; packet classification is performed by eBPF programs attached to the cgroup instead. Each socket captures its owning cgroup at creation time (Socket.cgroup: Arc<CgroupCss>, see Section 16.3). A BPF_PROG_TYPE_CGROUP_SKB program attached to the cgroup inspects outgoing packets and returns a classid (TcHandle) that the tc classifier layer uses to route the packet to the correct HTB leaf class. This is how container runtimes and Kubernetes's bandwidth CNI plugin enforce per-container egress shaping. Per-cgroup priority (replacing the legacy net_prio controller) is handled the same way: the eBPF program sets skb_priority, which pfifo_fast uses for band selection.

Integration path: 1. Container runtime creates HTB qdisc on the host-side veth of the container's network namespace. 2. An HTB class is created with the container's rate/ceil limits. 3. A BPF_PROG_TYPE_CGROUP_SKB program is attached to the container's cgroup. The program returns the target classid for each packet, which the bpf tc classifier matches to the HTB class. 4. On the egress path, umka-net invokes the cgroup-attached eBPF program before calling QdiscOps::enqueue, stamping the packet with the returned classid.
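The stamping step (step 4) can be modeled in a few lines, with a HashMap standing in for the cgroup-attached eBPF program and a minimal Pkt standing in for NetBuf metadata; all names here are illustrative:

```rust
use std::collections::HashMap;

// Toy model: the map plays the role of the BPF_PROG_TYPE_CGROUP_SKB
// program (cgroup id -> HTB classid); Pkt plays the role of NetBuf.
struct Pkt {
    cgroup_id: u64,
    classid: Option<u32>, // stamped before QdiscOps::enqueue
}

fn stamp_egress(cgroup_prog: &HashMap<u64, u32>, pkt: &mut Pkt) {
    // Run the cgroup program on the egress path and stamp the returned
    // classid so the bpf tc classifier can route the packet into the
    // matching HTB leaf class.
    pkt.classid = cgroup_prog.get(&pkt.cgroup_id).copied();
}

fn main() {
    let prog = HashMap::from([(42u64, 0x0001_0010u32)]); // container -> class 1:10
    let mut pkt = Pkt { cgroup_id: 42, classid: None };
    stamp_egress(&prog, &mut pkt);
    assert_eq!(pkt.classid, Some(0x0001_0010));
}
```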

16.21.9 Ingress Path

The ingress pseudo-qdisc (handle TcHandle::INGRESS) and clsact (handle TcHandle::CLSACT) attach to the RX path rather than TX. Packets arrive from the NIC driver before protocol processing; classifiers may drop them, redirect them to another device (tc redirect), or pass them through for normal processing.

clsact supports two hook points: - egress (TC_EGRESS): applied after routing, before NIC TX -- same as the TX qdisc chain but without buffering/shaping. - ingress (TC_INGRESS): applied before IP routing -- used by Cilium for pre-routing network policy and XDP-equivalent packet manipulation without the full XDP driver port.

Both hooks execute eBPF classifiers attached via RTM_NEWTFILTER and are the primary mechanism for Kubernetes CNI plugin data planes.

16.21.9.1 Ingress Policing (Container Bandwidth Enforcement)

The ingress pseudo-qdisc supports policing — rate-limiting incoming traffic without buffering. Unlike egress shaping (HTB/TBF, which delays packets in a queue), ingress policing drops excess packets immediately. This is the mechanism used for container ingress bandwidth enforcement.

Use case: the Kubernetes bandwidth CNI plugin enforces per-container ingress rate limits. The container runtime attaches an ingress qdisc to the host-side end of the container's veth pair, with a tc filter ... action police rule that drops packets exceeding the configured rate.

Integration path (container ingress):

  1. Container runtime creates a veth pair: veth-host (host namespace) ↔ eth0 (container namespace).
  2. An ingress qdisc is attached to veth-host (the host-side end): tc qdisc add dev veth-host ingress
  3. A police action is attached via a flower/u32 filter: tc filter add dev veth-host parent ffff: protocol ip flower action police rate 100mbit burst 64k conform-exceed drop
  4. Kubernetes bandwidth CNI plugin performs steps 2-3 via netlink (RTM_NEWQDISC + RTM_NEWTFILTER with TCA_POLICE nested attribute), reading annotations kubernetes.io/ingress-bandwidth from the pod spec.

IngressPolice struct (per-filter policing state):

/// Token bucket policer for ingress rate limiting.
///
/// Attached to a tc filter via `TCA_POLICE` netlink attribute. Operates
/// on the ingress path (before protocol processing). Each `IngressPolice`
/// is per-filter — multiple filters on the same ingress qdisc can have
/// independent rate limits (e.g., different rates for different subnets).
///
/// Token refill is lazy: computed on each packet arrival from elapsed
/// time since `last_refill_ns`, avoiding periodic timer overhead.
///
/// **Atomicity model**: The timestamp and token count are packed into a
/// single `AtomicU64` (`token_state`) to avoid the TOCTOU race where two
/// CPUs load the same `old_tokens`, both compute a refill, and both
/// successfully CAS — effectively doubling the refill. Packing both
/// values into one atomic word ensures the refill calculation and token
/// deduction are a single atomic step.
pub struct IngressPolice {
    /// Sustained rate limit in bytes per second.
    /// Set via `TCA_POLICE_RATE` netlink attribute.
    /// Example: 100 Mbps = 12_500_000 bytes/sec.
    pub rate_bps: u64,

    /// Maximum burst size in bytes. Determines the peak burst that
    /// can pass through without drops after an idle period.
    /// Set via `TCA_POLICE_BURST` netlink attribute.
    /// Typical: 64 KB for general traffic, 256 KB for bursty workloads.
    pub burst_bytes: u64,

    /// Packed token state: high 32 bits = timestamp (milliseconds since
    /// policer creation, wraps after ~49 days — acceptable because only
    /// elapsed deltas are used and the max refill per packet is capped at
    /// `burst_bytes`); low 32 bits = current token count (bytes).
    ///
    /// Packing into a single AtomicU64 ensures that refill + deduct is
    /// an atomic CAS — no double-refill race between concurrent CPUs.
    /// The 32-bit token count limits burst_bytes to 4 GiB, which is
    /// sufficient (typical burst: 64 KB–256 KB; maximum practical: ~1 GiB).
    pub token_state: AtomicU64,

    /// Monotonic base timestamp (nanoseconds) captured at policer creation.
    /// `token_state` stores millisecond offsets from this base.
    ///
    /// **Longevity analysis**: The 32-bit millisecond timestamp wraps after
    /// ~49 days. `wrapping_sub` correctly computes deltas across the wrap.
    /// Worst case: after >49 days idle, the first packet gets a full
    /// `burst_bytes` refill. This is bounded and matches the configured
    /// burst, so the behavior is acceptable. The 49-day wrap is inherent
    /// to the 32-bit timestamp packing; widening would reduce
    /// token_count precision.
    pub base_ns: u64,

    /// Action for conforming packets (packets within the rate limit).
    /// Default: `PoliceAction::Ok` (pass through to protocol stack).
    pub conform_action: PoliceAction,

    /// Action for exceeding packets (packets that exceed the rate limit).
    /// Default: `PoliceAction::Drop`.
    pub exceed_action: PoliceAction,

    /// Cumulative count of packets dropped by this policer.
    pub drops: AtomicU64,

    /// Cumulative bytes dropped by this policer.
    pub drop_bytes: AtomicU64,
}

/// Action to take on a packet after policing evaluation.
#[repr(u8)]
pub enum PoliceAction {
    /// Pass the packet through (conforming traffic).
    Ok       = 0,
    /// Drop the packet (exceeding traffic).
    Drop     = 1,
    /// Reclassify the packet (re-run classifier chain).
    Reclassify = 2,
    /// Continue to next filter in chain (no police decision).
    Pipe     = 3,
}

Token bucket algorithm (per-packet, lock-free):

The timestamp and token count are packed into a single AtomicU64 to eliminate the double-refill race. A fetch_update loop atomically reads, computes the refill, deducts the packet cost, and writes back — if another CPU modified the state between read and write, the loop retries with the updated value.

/// Pack timestamp_ms (high 32 bits) and tokens (low 32 bits) into a u64.
fn pack(ts_ms: u32, tokens: u32) -> u64 {
    ((ts_ms as u64) << 32) | tokens as u64
}

/// Unpack into (ts_ms, tokens).
fn unpack(state: u64) -> (u32, u32) {
    ((state >> 32) as u32, state as u32)
}

pub fn police_check(police: &IngressPolice, pkt_len: u32) -> PoliceAction {
    let now_ms = ((ktime_get_ns() - police.base_ns) / 1_000_000) as u32;

    let result = police.token_state.fetch_update(
        Ordering::AcqRel,
        Ordering::Relaxed,
        |old_state| {
            let (prev_ms, old_tokens) = unpack(old_state);
            let elapsed_ms = now_ms.wrapping_sub(prev_ms);

            // Lazy refill: add tokens for elapsed time.
            // saturating_mul prevents u64 overflow at 100+ Gbps rates:
            // at 400 Gbps (50 GB/s), elapsed_ms = u32::MAX produces
            // 4.29e9 * 5e10 = 2.15e20, which exceeds u64::MAX (1.84e19).
            // Saturating to u64::MAX then /1000 yields ~1.84e16 bytes,
            // far exceeding any practical burst_bytes, so the subsequent
            // min(..., burst_bytes) clamp produces the correct result:
            // a full bucket after a long idle period. saturating_add
            // guards the (tokens + refill) sum the same way.
            let new_tokens =
                (elapsed_ms as u64).saturating_mul(police.rate_bps) / 1_000;
            let refilled = (old_tokens as u64)
                .saturating_add(new_tokens)
                .min(police.burst_bytes) as u32;

            if refilled >= pkt_len {
                Some(pack(now_ms, refilled - pkt_len))
            } else {
                None // Not enough tokens -- reject without modifying state
            }
        },
    );

    if result.is_ok() {
        return police.conform_action;
    }

    // Exceed: not enough tokens (fetch_update returned None).
    police.drops.fetch_add(1, Ordering::Relaxed);
    police.drop_bytes.fetch_add(pkt_len as u64, Ordering::Relaxed);
    police.exceed_action
}

The fetch_update loop ensures correctness when multiple CPUs process packets on the same ingress qdisc simultaneously (RSS may hash different flows to different CPUs, but all flows through the same veth share the policer). Because timestamp and tokens are packed into a single atomic word, there is no window where two CPUs can independently refill the same elapsed time — the CAS retry re-reads both values atomically. The loop body is cheap (one multiply, one compare), and contention is bounded by the number of CPUs sharing an ingress qdisc (typically 1–4 for a veth pair).
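The packed-word scheme can be exercised in isolation with std atomics. The following userspace model (function names and parameters are illustrative) reproduces the refill-and-deduct CAS, including correct behavior across a timestamp wrap:

```rust
use std::sync::atomic::{AtomicU64, Ordering};

fn pack(ts_ms: u32, tokens: u32) -> u64 {
    ((ts_ms as u64) << 32) | tokens as u64
}

/// One police decision against a packed (timestamp, tokens) word.
/// Returns true if the packet conforms (tokens were deducted atomically).
fn police(state: &AtomicU64, now_ms: u32, rate_bps: u64, burst: u64, pkt_len: u32) -> bool {
    state
        .fetch_update(Ordering::AcqRel, Ordering::Relaxed, |old| {
            let prev_ms = (old >> 32) as u32;
            let old_tokens = old as u32;
            // wrapping_sub handles the 49-day timestamp wrap correctly.
            let elapsed_ms = now_ms.wrapping_sub(prev_ms) as u64;
            let refill = elapsed_ms.saturating_mul(rate_bps) / 1_000;
            let refilled = (old_tokens as u64).saturating_add(refill).min(burst) as u32;
            if refilled >= pkt_len {
                Some(pack(now_ms, refilled - pkt_len))
            } else {
                None // reject without touching state
            }
        })
        .is_ok()
}

fn main() {
    // 10 KB bucket, 1 MB/s rate (1000 bytes/ms), starts full at t=0.
    let s = AtomicU64::new(pack(0, 10_000));
    assert!(police(&s, 0, 1_000_000, 10_000, 4_000));  // 10000 -> 6000
    assert!(police(&s, 0, 1_000_000, 10_000, 4_000));  // 6000 -> 2000
    assert!(!police(&s, 0, 1_000_000, 10_000, 4_000)); // 2000 < 4000: drop
    assert!(police(&s, 10, 1_000_000, 10_000, 4_000)); // 10 ms refill tops up

    // Across the wrap: prev_ms = u32::MAX, now_ms = 1 -> elapsed = 2 ms.
    let w = AtomicU64::new(pack(u32::MAX, 0));
    assert!(police(&w, 1, 1_000_000, 10_000, 1_500)); // 2 ms * 1000 B/ms = 2000 tokens
}
```

Note that a rejected packet leaves the state untouched: the elapsed time is re-derived from the old timestamp on the next attempt, so no refill is lost by dropping.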

Netlink configuration: The TCA_POLICE nested attribute (within RTM_NEWTFILTER) contains TCA_POLICE_RATE (u32, rate cell table), TCA_POLICE_BURST (u32, burst cell), and TCA_POLICE_RESULT (u32, exceed action). UmkaOS parses these identically to Linux's net/sched/act_police.c for iproute2 and CNI plugin compatibility.
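A minimal TLV walk over such a nested payload can be sketched as follows. The layout (u16 length including the 4-byte header, u16 type, payload padded to 4-byte alignment) follows struct nlattr; the attribute numbers used in the demo are illustrative, not the real TCA_POLICE_* values:

```rust
/// Walk a flat run of netlink attributes: each is a 4-byte header
/// (u16 len including header, u16 type) followed by the payload,
/// padded to a 4-byte boundary. Returns (type, payload) pairs.
fn parse_attrs(mut buf: &[u8]) -> Vec<(u16, Vec<u8>)> {
    let mut out = Vec::new();
    while buf.len() >= 4 {
        let len = u16::from_ne_bytes([buf[0], buf[1]]) as usize;
        let ty = u16::from_ne_bytes([buf[2], buf[3]]);
        if len < 4 || len > buf.len() {
            break; // malformed attribute: stop parsing
        }
        out.push((ty, buf[4..len].to_vec()));
        let aligned = (len + 3) & !3; // NLA_ALIGN
        buf = if aligned >= buf.len() { &[] } else { &buf[aligned..] };
    }
    out
}

fn main() {
    // Two u32 attributes (types 2 and 1 are placeholders).
    let mut msg = Vec::new();
    for (ty, val) in [(2u16, 65_536u32), (1u16, 12_500_000u32)] {
        msg.extend_from_slice(&8u16.to_ne_bytes()); // len = 4 hdr + 4 payload
        msg.extend_from_slice(&ty.to_ne_bytes());
        msg.extend_from_slice(&val.to_ne_bytes());
    }
    let attrs = parse_attrs(&msg);
    assert_eq!(attrs.len(), 2);
    assert_eq!(attrs[0].0, 2);
    assert_eq!(u32::from_ne_bytes(attrs[1].1.clone().try_into().unwrap()), 12_500_000);
}
```

The real TCA_POLICE parser additionally validates payload sizes per attribute type (e.g., rejecting a TCA_POLICE_BURST shorter than 4 bytes) before touching IngressPolice state.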

16.21.10 Qdisc Ownership and Domain Crossing

The Qdisc struct is allocated in Tier 0 (kernel core) and owned by the NetDevice. However, qdisc enqueue()/dequeue() operations are invoked from Tier 1 (umka-net) during packet transmission. This section specifies the ownership boundary and the mechanism by which umka-net operates on Tier 0-owned qdisc state.

Domain crossing mechanism: The QdiscOps vtable is registered by umka-net during network stack initialization. The kernel core holds a &'static dyn QdiscOps pointer in each Qdisc instance. When the TX path in umka-net needs to enqueue or dequeue, it calls through this vtable — the call originates within the Tier 1 domain (umka-net) and operates on the Qdisc struct which lives in Tier 0 memory. This is safe because the Qdisc struct (including its priv_data) is allocated in shared memory accessible to both domains (mapped into umka-net's address space with read-write permission via the KABI shared memory region, same mechanism used for NetBufRingEntry rings).

Data placement:

Component                  Domain                                           Rationale
Qdisc struct               Tier 0 (kernel core), shared-mapped into Tier 1  Survives NIC driver crash; accessible to umka-net for enqueue/dequeue
Qdisc.priv_data            Same allocation as Qdisc (Tier 0, shared)        Algorithm state (token buckets, flow tables) must be accessible to QdiscOps methods running in Tier 1
QdiscOps vtable            Tier 1 (umka-net static data)                    Algorithm logic lives in the network stack module
Qdisc.queue (NetBufQueue)  Tier 0 shared memory                             Packet queue accessible to both umka-net (producer) and umka-core (consumer for TX relay)

Locking: Three mechanisms protect qdisc state:

  1. Qdisc.lock (SpinLock): Serializes enqueue()/dequeue() for classful hierarchical qdiscs (HTB, HFSC, CBS). Not used by NOLOCK qdiscs (pfifo_fast, fq, fq_codel) which use per-CPU enqueue.

  2. Qdisc.running (AtomicBool): Re-entrancy guard. Only one CPU runs qdisc_run() at a time. Non-owner CPUs that attempt to run the qdisc observe running == true and instead append their packets to the per-CPU defer list (Qdisc.defer.per_cpu_lists[local_cpu]), set the corresponding bit in pending_cpus, and return. The owner CPU drains all defer lists before clearing running.

  3. Qdisc.defer (QdiscDeferState): Per-CPU lockless defer lists. Replaces the historical busylock (which was removed from Linux mainline). Each non-owner CPU appends to its own NetBufList (no contention) and sets its bit in pending_cpus via fetch_or. The qdisc owner reads the bitmap, clears set bits, and drains the corresponding lists into the qdisc.

Relationship: lock (if present) serializes algorithm state; running ensures single-owner qdisc execution; defer allows non-owner CPUs to contribute packets without contention. All live in Tier 0 shared memory but are operated on by umka-net (Tier 1) exclusively — umka-core (Tier 0) never calls enqueue()/dequeue(), it only reads from the TX output ring after umka-net has dequeued and serialized the NetBufRingEntry. This single-writer design means no lock crosses the domain boundary under contention.
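The defer bitmap handshake in items 2-3 can be modeled with a single AtomicU64; the type and method names below are stand-ins for QdiscDeferState, not the spec's API:

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Model of the pending_cpus bitmap. Non-owner CPUs announce a populated
/// defer list with a Release fetch_or; the drainer atomically snapshots
/// and clears the whole bitmap with an Acquire swap, then drains each
/// flagged per-CPU list.
struct PendingCpus(AtomicU64);

impl PendingCpus {
    fn mark(&self, cpu: u32) {
        // Release: list appends made before this call become visible to
        // the drainer that Acquire-reads the bitmap.
        self.0.fetch_or(1 << cpu, Ordering::Release);
    }

    fn snapshot_and_clear(&self) -> u64 {
        // Atomically take all pending bits; concurrent marks after the
        // swap land in the next drain iteration (the ABA re-check).
        self.0.swap(0, Ordering::Acquire)
    }
}

/// Iterate set bits low-to-high (the drain order).
fn set_bits(mut bm: u64) -> Vec<u32> {
    let mut v = Vec::new();
    while bm != 0 {
        v.push(bm.trailing_zeros());
        bm &= bm - 1; // clear the lowest set bit
    }
    v
}

fn main() {
    let pending = PendingCpus(AtomicU64::new(0));
    pending.mark(1);
    pending.mark(5);
    assert_eq!(set_bits(pending.snapshot_and_clear()), vec![1, 5]);
    assert_eq!(pending.snapshot_and_clear(), 0); // already drained
}
```

The atomic swap is what makes the snapshot-and-clear a single step: there is no window in which a freshly set bit can be cleared without its list being drained.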

Crash recovery: Because the Qdisc struct and all shaping state live in Tier 0 memory, a Tier 1 NIC driver crash does not destroy qdisc state. When the replacement driver loads, it finds the existing qdisc tree intact (Section 11.9). If umka-net itself crashes (Tier 1 network stack failure — a more severe event), the Qdisc structs survive in Tier 0 but the QdiscOps vtable pointers become invalid. On umka-net reload, the network stack re-registers its QdiscOps implementations and the kernel core updates each Qdisc.ops pointer via an atomic store.

16.21.10.1 dev_queue_xmit() — Transmit Entry Point

The primary function for submitting a packet to the traffic control layer. Called from ip_output() step 8 (after neighbor resolution) and from af_packet raw sockets. Selects the TX queue, enqueues the packet, and invokes qdisc_run() to drain.

/// Submit a packet for transmission via the traffic control layer.
///
/// # Arguments
/// - `dev`: The output NetDevice.
/// - `buf`: The NetBuf to transmit. Ownership transfers to a `NetBufHandle`
///   via `dev.netbuf_pool().handle_for(buf)` (consuming) before enqueue.
///   After conversion, the handle's `Drop` impl guarantees the slab slot
///   is returned to the pool on all paths (enqueue success, enqueue failure,
///   qdisc evolution rejection).
///
/// # Algorithm
/// 1. Select TX queue: `txq_index = select_queue(dev, &buf)` — uses XPS
///    (Transmit Packet Steering) if configured, else hash-based selection.
///    Must happen before `handle_for(buf)` consumes the `NetBuf`.
/// 2. Look up the Qdisc for this TX queue: `q = dev.txqs[txq_index].qdisc`.
/// 3. Convert NetBuf to NetBufHandle: `handle = dev.netbuf_pool().handle_for(buf)`.
///    This consumes `buf` — all subsequent access to packet metadata goes
///    through `handle.peek()` if needed. The handle now owns the slab slot.
/// 4. Check evolution gate: if the qdisc is being evolved, drop `handle`
///    (returning the slab slot to the pool) and return `Err(NOBUFS)`.
/// 5. Enqueue:
///    - **NOLOCK qdisc** (pfifo_fast, fq, fq_codel with NOLOCK flag):
///      `q.ops.enqueue(handle, q)` — lock-free enqueue. The qdisc takes
///      ownership of the handle.
///    - **Locked qdisc** (HTB, HFSC, TBF):
///      Acquire `q.lock`. `q.ops.enqueue(handle, q)`. Release after `qdisc_run()`.
/// 6. Call `qdisc_run(q)` to attempt immediate drain. For locked qdiscs,
///    `qdisc_run()` is called while holding `q.lock` (the lock provides the
///    single-owner exclusion instead of the `running` CAS).
/// 7. For locked qdiscs: release `q.lock`.
///
/// # Ownership and error handling
/// - **Enqueue succeeds**: qdisc owns the handle; eventual TX completion
///   or qdisc drain drops it (returning slab slot to pool).
/// - **Enqueue fails (queue full)**: `enqueue()` must drop the handle
///   before returning `Err(NetDevError::QueueFull)`, which returns the
///   slab slot. The `?` operator propagates the error; handle was moved
///   into `enqueue()` so no double-free.
/// - **Evolution gate**: handle is dropped (goes out of scope at the
///   early return), returning the slab slot to the pool. No leak.
///
/// # Returns
/// `Ok(())` on successful enqueue (even if the packet has not yet been
/// transmitted). `Err(IoError::NOBUFS)` if the qdisc rejects the packet
/// (queue full, rate-limited, or qdisc evolving).
pub fn dev_queue_xmit(dev: &NetDevice, buf: NetBuf) -> Result<(), IoError> {
    // select_queue() returns u16. Cast to usize for array indexing.
    // Bounds check: select_queue() clamps the result to
    // [0, dev.num_tx_queues). Indexing panics if the driver returns
    // an out-of-range value (programming error in the driver).
    let txq_index = dev.ops.select_queue(dev, &NetBufPool::peek_handle(&buf)) as usize;
    let q = &dev.txqs[txq_index].qdisc;

    // Convert NetBuf → NetBufHandle (consuming). Must happen after
    // select_queue (which borrows &buf) but before any error path,
    // so that the handle's Drop impl guarantees cleanup on all paths.
    let handle = dev.netbuf_pool().handle_for(buf);

    // Evolution gate: if the qdisc is being evolved, reject immediately.
    // `handle` is dropped here (goes out of scope), returning the slab
    // slot to the pool via NetBufHandle::Drop. No leak.
    if q.flags.contains(QdiscFlags::TCQ_F_EVOLVING) {
        return Err(IoError::NOBUFS);
    }

    if q.flags.contains(QdiscFlags::TCQ_F_NOLOCK) {
        q.ops.enqueue(handle, q)?;
        qdisc_run(q);
    } else {
        let _guard = q.lock.lock();
        q.ops.enqueue(handle, q)?;
        qdisc_run(q);
        // _guard dropped here — releases q.lock
    }
    Ok(())
}

16.21.10.2 netif_tx_wake_queue() — TX Queue Restart

Called by the NIC driver when its TX hardware ring has space available after previously returning Err(IoError::BUSY) from dispatch_xmit(). This triggers qdisc_run() to resume draining packets into the hardware.

/// Signal that a TX queue has space available for new transmissions.
///
/// # Tier 0 path
/// Direct call from the NIC driver's TX completion IRQ handler or NAPI poll:
/// clears `TxQueue.stopped`, then triggers `qdisc_run()` via NET_TX softirq.
///
/// # Tier 1 path
/// The Tier 1 NIC driver posts a `TxWakeNotification { txq_index }` to its
/// KABI completion ring. The Tier 0 completion ring consumer detects the
/// notification and calls `netif_tx_wake_queue()` in Tier 0 context.
///
/// The connection to `qdisc_run()`:
/// 1. Clear `dev.txqs[txq_index].stopped` (AtomicBool, store(false, Release)).
/// 2. Raise `NET_TX_SOFTIRQ` on the current CPU. The softirq handler calls
///    `qdisc_run(dev.txqs[txq_index].qdisc)` which resumes draining.
///
/// This is equivalent to Linux's `__netif_schedule()` → `NET_TX_SOFTIRQ` →
/// `qdisc_run()` path.
pub fn netif_tx_wake_queue(dev: &NetDevice, txq_index: u16) {
    dev.txqs[txq_index as usize].stopped.store(false, Release);
    raise_softirq(SoftirqVec::NET_TX);
}

16.21.10.3 qdisc_run() — Per-CPU Drain Loop

The central transmit drain function. Called from dev_queue_xmit() after enqueuing a packet. Only one CPU may run qdisc_run() at a time per qdisc (serialized by Qdisc.running). Non-owner CPUs append their packets to the per-CPU defer list and return (zero contention on the fast path).

/// Run the qdisc transmit drain loop.
///
/// Attempts to become the drainer via `running` CAS. If another CPU is
/// already draining, the caller's packet was already enqueued (either
/// directly by a NOLOCK qdisc or into the per-CPU defer list) and will be
/// transmitted by the current drainer before it releases ownership.
///
/// The drain loop has an ABA re-check: after dequeueing all available
/// packets, the drainer re-scans `pending_cpus` to catch packets that
/// arrived while it was transmitting. This eliminates the race where a
/// non-owner CPU sets its `pending_cpus` bit after the drainer's initial
/// snapshot but before the drainer clears `running`.
///
/// # Locking
/// - NOLOCK qdiscs: no `Qdisc.lock` held; `running` CAS provides
///   single-owner exclusion.
/// - Locked qdiscs (HTB, HFSC): caller holds `Qdisc.lock` around the
///   entire `qdisc_run()` invocation.
///
/// # Context
/// Called from NET_TX softirq or directly from `dev_queue_xmit()` (process
/// context with BH disabled). Must not sleep.
pub fn qdisc_run(q: &Qdisc) {
    // Try to become the drainer. Acquire ordering: if we succeed, we must
    // see all defer-list writes from non-owner CPUs that preceded their
    // pending_cpus bit-set.
    if q.running
        .compare_exchange(false, true, Ordering::Acquire, Ordering::Relaxed)
        .is_err()
    {
        return; // Another CPU is draining — our packet will be picked up.
    }

    loop {
        // 1. Drain the per-CPU defer lists. Snapshot pending_cpus and clear
        //    each bit before reading the corresponding list (the clearing
        //    acts as a happens-before for the list contents).
        let pending = q.defer.snapshot_and_clear_pending();
        for cpu in pending.iter_set_bits() {
            let list = q.defer.per_cpu_lists.get_cpu(cpu);
            // SAFETY: The `Acquire` ordering on `snapshot_and_clear_pending()`
            // synchronizes with the `Release` ordering on the non-owner's
            // `fetch_or` bit-set. This ensures all list mutations that
            // preceded the bit-set are visible to the drainer. An append
            // that has not yet been followed by a bit-set is NOT guaranteed
            // to be visible here — the ABA re-check loop (below) catches
            // those stragglers on the next iteration. The non-owner's
            // append is a store-release to the list node; the drainer's
            // read (after Acquire on the bitmap) sees completed appends.
            // No torn writes: the drainer only reads lists for CPUs whose
            // bits were set, so the Release has already happened.
            while let Some(nb) = unsafe { (*list.get()).pop() } {
                // Enqueue into the qdisc's main queue via the algorithm.
                let _ = q.ops.enqueue(nb, q);
            }
        }

        // 2. Dequeue, validate, and transmit. Budget-limited to
        //    `dev.tx_weight` packets per run (default 64) for CPU fairness.
        let dev = match q.dev.upgrade() {
            Some(d) => d,
            None => break, // device destroyed
        };
        let mut budget = dev.tx_weight; // default 64 packets per run
        while budget > 0 {
            let nb = match q.ops.dequeue(q) {
                Some(nb) => nb,
                None => break, // qdisc empty
            };
            // GSO software fallback: validate_xmit() performs software
            // segmentation if the NIC lacks TSO/GSO offload for this packet.
            // MUST run between dequeue and dispatch — an oversized GSO packet
            // sent directly to a non-TSO NIC would be dropped or corrupt.
            match validate_xmit(&dev, nb) {
                GsoResult::PassThrough(buf) => {
                    if dev.dispatch_xmit(buf).is_err() {
                        // Device queue full — stop draining. `buf` was moved
                        // into dispatch_xmit; on failure, dispatch_xmit drops
                        // it (NetBufHandle::Drop returns the slab slot). No
                        // leak (fixes TX-07 error path).
                        // The completion IRQ will call netif_tx_wake_queue()
                        // which triggers another qdisc_run().
                        break;
                    }
                }
                GsoResult::Segmented(segments) => {
                    for seg in segments {
                        if dev.dispatch_xmit(seg).is_err() {
                            // TX ring full. The failed `seg` was moved into
                            // dispatch_xmit and dropped there (NetBufHandle::Drop
                            // returns the slab slot). Remaining segments in the
                            // ArrayVec iterator are dropped when `segments` goes
                            // out of scope — each handle's Drop returns its slot.
                            // No segment leak (fixes TX-08).
                            break;
                        }
                    }
                }
            }
            budget -= 1;
        }

        // 3. ABA re-check: scan pending_cpus again. If new packets arrived
        //    while we were draining (a non-owner CPU set its bit between
        //    step 1's snapshot and now), loop back to drain them.
        if q.defer.any_pending() {
            continue;
        }

        break;
    }

    // Release ownership. Release ordering: all our writes (dequeue state,
    // defer-list drains) must be visible before another CPU can acquire.
    q.running.store(false, Ordering::Release);
}

16.21.11 State Ownership for Live Evolution

The qdisc subsystem follows the state spill avoidance pattern (see Section 13.18): per-queue shaping state (token buckets, DRR credits, flow tables) is owned by the NetDevice via the Qdisc struct, not by the qdisc algorithm. The QdiscOps trait methods are stateless functions that operate on the Qdisc's priv_data. This enables:

  1. Qdisc algorithm swap without traffic loss: replacing a qdisc algorithm (e.g., swapping HTB for HFSC, or updating fq_codel parameters) quiesces the qdisc, takes &mut Qdisc, swaps the ops pointer, and resumes. All queued packets, token bucket state, and flow tracking survive -- shaping state that Linux would lose on tc qdisc replace is preserved.
  2. Driver crash recovery: when a Tier 1 NIC driver crashes and reloads (Section 11.9), the Qdisc instances and their priv_data survive in NetDevice (Tier 0 kernel memory). The new driver sees the existing qdisc tree with all accumulated shaping state.
  3. Zero-overhead steady state: ops is a plain &'static dyn QdiscOps field -- one pointer dereference per enqueue/dequeue, no RCU read lock, no atomic load, no interior mutability overhead on the hot path. This is cheaper than the RcuCell approach (saves ~5-10 cycles per packet from eliminated RCU read-side critical section enter/exit).

Qdisc state already follows this pattern. The existing Qdisc struct (Section 16.10.3) stores all per-queue state in priv_data: Box<dyn Any + Send>, and QdiscOps methods receive &Qdisc (not &mut self). This means:

  • QdiscOps::enqueue(&self, buf, qdisc) -- the algorithm reads/writes qdisc.priv_data, not its own fields.
  • QdiscOps::dequeue(&self, qdisc) -- same: state is in qdisc, not in the QdiscOps implementor.
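This ownership split can be demonstrated end to end in safe userspace Rust: a toy FIFO qdisc keeps its queue in priv_data, and swapping the ops pointer (the same move qdisc_evolve performs under quiescence) leaves queued packets intact. All types below are simplified stand-ins:

```rust
use std::any::Any;
use std::collections::VecDeque;
use std::sync::Mutex;

// Simplified stand-ins for the spec's Qdisc/QdiscOps.
struct Qdisc {
    ops: &'static dyn QdiscOps,
    priv_data: Box<dyn Any + Send>, // all algorithm state lives here
}

trait QdiscOps: Sync {
    fn enqueue(&self, pkt: u32, q: &Qdisc);
    fn dequeue(&self, q: &Qdisc) -> Option<u32>;
}

// The ops implementor is a zero-sized, stateless singleton: every method
// reads/writes q.priv_data, never fields of its own.
struct FifoPriv {
    queue: Mutex<VecDeque<u32>>,
}

struct FifoOps;
impl QdiscOps for FifoOps {
    fn enqueue(&self, pkt: u32, q: &Qdisc) {
        let p = q.priv_data.downcast_ref::<FifoPriv>().unwrap();
        p.queue.lock().unwrap().push_back(pkt);
    }
    fn dequeue(&self, q: &Qdisc) -> Option<u32> {
        let p = q.priv_data.downcast_ref::<FifoPriv>().unwrap();
        p.queue.lock().unwrap().pop_front()
    }
}

static FIFO_V1: FifoOps = FifoOps;
static FIFO_V2: FifoOps = FifoOps; // "updated algorithm", same priv layout

fn main() {
    let mut q = Qdisc {
        ops: &FIFO_V1,
        priv_data: Box::new(FifoPriv { queue: Mutex::new(VecDeque::new()) }),
    };
    q.ops.enqueue(1, &q);
    q.ops.enqueue(2, &q);

    // Live swap: with exclusive access (&mut, after quiescence) this is a
    // plain pointer assignment. priv_data -- and the queued packets -- survive.
    q.ops = &FIFO_V2;

    assert_eq!(q.ops.dequeue(&q), Some(1));
    assert_eq!(q.ops.dequeue(&q), Some(2));
}
```

Because the implementor is stateless, the swap cannot strand state in the old algorithm; everything the new ops needs is already in priv_data.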

The only addition needed for live evolution is the quiescence-based swap:

/// Replace the qdisc algorithm on a live device via quiescence + &mut.
///
/// Evolution is a rare event (roughly once every few months in production). A brief TX
/// stall during quiescence is acceptable for this frequency. This matches
/// the block layer's QUEUE_FLAG_QUIESCING pattern.
///
/// Preserves all queued packets and shaping state. The new algorithm
/// must be compatible with the existing `priv_data` layout (same qdisc
/// type with updated logic) OR provide a `migrate_priv_data` function
/// to transform the state.
///
/// **Cross-type swap** (e.g., HTB -> HFSC): requires `priv_data` migration.
/// The migration function runs during the quiescence window with exclusive
/// `&mut Qdisc` access. If migration fails, the swap is aborted and traffic
/// resumes with the original algorithm.
///
/// **Same-type swap** (e.g., updated fq_codel with new CoDel parameters):
/// no migration needed -- the new `QdiscOps` reads the same `FqCodelPriv`
/// struct directly.
///
/// ## Quiescence Protocol
///
/// 1. Set `TCQ_F_EVOLVING` flag. New enqueue calls see this flag and
///    return `Err(NetDevError::QueueFull)` (caller retries via backpressure).
///    New dequeue calls return `None` (NIC polls again on next NAPI cycle).
/// 2. For locked qdiscs: acquire `qdisc.lock`, confirming no enqueue/dequeue
///    is in progress (they hold the same lock). For NOLOCK qdiscs: send IPI
///    to all CPUs that might be running `qdisc_run()` on this qdisc (read
///    from the `running` flag) and spin-wait until all `running` flags clear.
/// 3. At this point, no code is accessing `qdisc.ops` or `qdisc.priv_data`.
///    The caller has effectively exclusive access -- `&mut Qdisc` is sound.
/// 4. Run migration (if cross-type swap), swap `ops`, clear `TCQ_F_EVOLVING`.
/// 5. Resume: next enqueue/dequeue sees the new algorithm immediately.
///
/// **Stall duration**: <10 us for same-type swaps (flag set + lock/IPI round-trip
/// + pointer write). Cross-type swaps add migration cost (bounded by `priv_data`
/// size -- typically <1 ms for a 1024-flow fq_codel instance). At roughly
/// once every few months, this is negligible.
pub fn qdisc_evolve(
    qdisc: &mut Qdisc,
    new_ops: &'static dyn QdiscOps,
    migrate: Option<fn(&mut Qdisc) -> Result<(), KernelError>>,
) -> Result<(), KernelError> {
    // 1. Set EVOLVING flag to block new enqueue/dequeue.
    qdisc.flags.insert(QdiscFlags::TCQ_F_EVOLVING);

    // 2. Quiesce: drain in-flight operations.
    //    For locked qdiscs: acquire the qdisc lock. Any in-progress
    //    enqueue/dequeue holds this lock, so acquiring it proves they
    //    have finished. We release it immediately -- the EVOLVING flag
    //    prevents new entrants.
    //    For NOLOCK qdiscs: IPI all CPUs with `running == true` for this
    //    qdisc. The IPI handler is a no-op -- it merely forces the CPU
    //    to observe the EVOLVING flag. Then spin-wait until all `running`
    //    flags for this qdisc clear (bounded: NAPI budget ensures forward
    //    progress).
    if !qdisc.flags.contains(QdiscFlags::TCQ_F_NOLOCK) {
        let _drain = qdisc.lock.lock();
        // Lock acquired and released -- all in-flight locked operations
        // have completed. EVOLVING flag blocks new ones.
    } else {
        // NOLOCK path: IPI + spin-wait for running flags to clear.
        // send_ipi_to_running_cpus(qdisc);
        // while qdisc.running.load(Ordering::Acquire) { core::hint::spin_loop(); }
        //
        // Bounded spin: qdisc_run() drains at most `tx_weight` (64) packets
        // per invocation, so the wait is <64 * per-packet-cost (~10 us).
    }

    // 3. Quiesced: we have exclusive access. &mut Qdisc is sound.
    //    If cross-type swap, run migration on priv_data.
    if let Some(migrate_fn) = migrate {
        if let Err(e) = migrate_fn(qdisc) {
            // Migration failed -- abort. Clear flag, resume traffic.
            qdisc.flags.remove(QdiscFlags::TCQ_F_EVOLVING);
            return Err(e);
        }
    }

    // 4. Swap the ops pointer. Plain assignment -- no atomics needed
    //    because we have &mut (exclusive access after quiescence).
    qdisc.ops = new_ops;

    // 5. Clear EVOLVING flag. Resume traffic.
    qdisc.flags.remove(QdiscFlags::TCQ_F_EVOLVING);
    Ok(())
}

Stall budget: evolution is a rare event (roughly once every few months in production). The brief TX stall during quiescence (<10 us same-type, <1 ms cross-type) is acceptable at this frequency. Packets are not lost -- enqueue callers see QueueFull and retry via normal backpressure (socket send buffer, TCP congestion window). This matches the block layer's QUEUE_FLAG_QUIESCING pattern where I/O submission stalls briefly during queue reconfiguration.

Hot-path benefit: because ops is a plain &'static dyn QdiscOps (not RcuCell), every enqueue and dequeue saves ~5-10 cycles that would otherwise be spent on RCU read-side critical section enter/exit. At millions of packets per second, this is a measurable gain. The quiescence cost is amortized over months of zero-overhead reads.

Comparison with Linux tc qdisc replace: Linux's replacement path destroys the old qdisc, drops all queued packets, and creates a new one from scratch. All token bucket state, DRR credits, and CoDel drop history are lost. UmkaOS preserves this state across the swap, providing seamless QoS continuity.


16.22 IPsec and XFRM Framework

The XFRM (transform) framework provides the kernel infrastructure for IPsec (ESP and AH) and any other per-packet cryptographic transform. IKEv2 key exchange is handled in userspace (strongSwan, libreswan, or systemd-networkd's IKEv2 client); the kernel implements packet transformation and the SA/SP databases.

Linux parallel: Linux's XFRM lives in net/xfrm/. UmkaOS implements the same xfrm_user netlink interface so that strongSwan, ip xfrm, and NetworkManager's IKEv2 support work unmodified.

16.22.1 Security Association (SA) -- XfrmState

/// SA lifetime limits (matching Linux `struct xfrm_lifetime_cfg`).
/// When any limit is reached, the SA transitions to expired state and
/// IKEv2 is notified to negotiate a replacement.
pub struct XfrmLifetime {
    /// Maximum bytes encrypted before SA expires (0 = unlimited).
    pub soft_byte_limit: u64,
    pub hard_byte_limit: u64,
    /// Maximum packets encrypted before SA expires (0 = unlimited).
    pub soft_packet_limit: u64,
    pub hard_packet_limit: u64,
    /// Wall-clock time limits in seconds since SA creation (0 = unlimited).
    pub soft_add_expires_seconds: u64,
    pub hard_add_expires_seconds: u64,
    /// Wall-clock time limits since last use (0 = unlimited).
    pub soft_use_expires_seconds: u64,
    pub hard_use_expires_seconds: u64,
}
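
The soft/hard split drives the expiry flow: crossing a soft limit notifies IKEv2 (XFRM_MSG_EXPIRE, soft flag) so rekeying can begin while traffic continues; crossing a hard limit stops the SA. A minimal sketch of the byte/packet check — the trimmed struct, `LifetimeEvent`, and `check_byte_packet_limits` are illustrative names, not part of the spec:

```rust
/// Trimmed copy of the byte/packet fields of `XfrmLifetime` above,
/// so the sketch is self-contained.
pub struct ByteLimits {
    pub soft_byte_limit: u64,
    pub hard_byte_limit: u64,
    pub soft_packet_limit: u64,
    pub hard_packet_limit: u64,
}

#[derive(Debug, PartialEq, Eq)]
pub enum LifetimeEvent {
    /// Within all limits.
    Ok,
    /// Soft limit crossed: notify IKEv2 to rekey; SA keeps passing traffic.
    SoftExpired,
    /// Hard limit crossed: SA must stop passing traffic.
    HardExpired,
}

/// Evaluate byte/packet limits after accounting a transmitted packet.
/// A limit of 0 means "unlimited"; hard limits dominate soft limits.
pub fn check_byte_packet_limits(lt: &ByteLimits, bytes: u64, packets: u64) -> LifetimeEvent {
    let hit = |limit: u64, value: u64| limit != 0 && value >= limit;
    if hit(lt.hard_byte_limit, bytes) || hit(lt.hard_packet_limit, packets) {
        LifetimeEvent::HardExpired
    } else if hit(lt.soft_byte_limit, bytes) || hit(lt.soft_packet_limit, packets) {
        LifetimeEvent::SoftExpired
    } else {
        LifetimeEvent::Ok
    }
}
```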

/// An IPsec Security Association (SA).
///
/// An SA represents a one-directional security relationship between two endpoints.
/// It is identified by the triple (destination address, SPI, protocol) -- the `XfrmId`.
/// IKEv2 creates SAs in pairs (one for each direction).
// Kernel-internal, not KABI: contains Zeroizing, Option, SpinLock (no stable C layout).
pub struct XfrmState {
    /// SA identifier: (destination, SPI, protocol -- AH=51 or ESP=50).
    pub id: XfrmId,
    /// Source address of this SA (used to select the correct local interface).
    pub saddr: XfrmAddress,
    /// Traffic selector: which packets this SA covers (src/dst/proto/port ranges).
    /// For tunnel mode, this is the inner traffic; for transport, the endpoint pair.
    pub selector: XfrmSelector,
    /// Authenticated encryption algorithm (preferred: AES-GCM-128/256).
    /// Mutually exclusive with `auth` + `enc`.
    /// AeadTfm, ShashTfm, and SkcipherTfm are concrete types defined in
    /// [Section 10.1](10-security-extensions.md#kernel-crypto-api) (not trait objects). Box allocation uses the
    /// slab allocator for these sizes, ensuring zeroization-before-reuse.
    pub aead: Option<Box<AeadTfm>>,
    /// Authentication algorithm (HMAC-SHA256, etc.). Used with `enc` for CBC+HMAC.
    pub auth: Option<Box<ShashTfm>>,
    /// Encryption algorithm (AES-CBC, ChaCha20). Used with `auth`.
    pub enc: Option<Box<SkcipherTfm>>,
    /// SA lifetime limits (bytes transmitted, packets transmitted, wall-clock time).
    pub lifetime: XfrmLifetime,
    /// Counters for bytes, packets, and replay-window errors.
    pub stats: XfrmStats,
    /// ESP sequence number (low 32 bits); incremented atomically on each TX packet.
    /// When ESN (Extended Sequence Numbers, RFC 4303 Section 2.2.1) is negotiated,
    /// the full 64-bit sequence space is `(seq_hi << 32) | seq`. The high 32 bits
    /// are NOT transmitted on the wire but are included in the ICV computation.
    pub seq: AtomicU32,
    /// High 32 bits of the Extended Sequence Number (ESN, RFC 4303 Section 2.2.1).
    /// Only meaningful when `esn_enabled` is true. Incremented when `seq` wraps
    /// from 0xFFFF_FFFF to 0. The anti-replay window (Section 16.22.5) uses the
    /// full 64-bit sequence for ESN-enabled SAs.
    ///
    /// **TX lock**: The `seq`/`seq_hi` pair must be incremented atomically on the
    /// TX path. Since there is no native 64-bit atomic increment that spans two
    /// u32 fields, the TX path acquires `tx_seq_lock` when `seq` wraps: read
    /// `seq` via `fetch_add(1, AcqRel)`; if the result was `0xFFFF_FFFF` (wrap),
    /// acquire `tx_seq_lock` and increment `seq_hi`. The non-wrap fast path
    /// (all but one in every 2^32 packets) requires no lock.
    ///
    /// **Memory ordering**: `seq` uses `AcqRel` on `fetch_add` (not `Relaxed`)
    /// to ensure that any CPU observing a post-wrap `seq` value (e.g., 0 or 1)
    /// also sees the updated `seq_hi`. On weakly-ordered architectures (AArch64,
    /// RISC-V, PPC), `Relaxed` would allow a CPU to observe `seq = 1` without
    /// seeing the corresponding `seq_hi` increment, producing a wrong 64-bit
    /// sequence for the ICV computation. The performance impact is negligible:
    /// one `AcqRel` per packet in the ESN path, dominated by ESP encryption
    /// (~1-5 us).
    pub seq_hi: AtomicU32,
    /// Serializes `seq_hi` increments on the TX path (acquired only on u32 wrap).
    pub tx_seq_lock: SpinLock<()>,
    /// Whether Extended Sequence Numbers are enabled for this SA.
    /// Negotiated by IKEv2 via the ESN transform (RFC 7296 Section 3.3.2).
    /// When false, `seq_hi` is ignored and the 32-bit `seq` wraps with SA expiry.
    pub esn_enabled: bool,
    /// Anti-replay window (Section 16.22.5).
    pub replay_window: ReplayWindow,
    /// Serializes replay window updates across concurrent RX CPUs.
    /// Held for ~50-200ns per packet (check_and_record duration only).
    pub replay_lock: SpinLock<()>,
    /// IPsec mode: Transport (host-to-host) or Tunnel (gateway-to-gateway).
    pub mode: XfrmMode,
    /// Address family: AF_INET or AF_INET6.
    pub family: AddressFamily,
    pub flags: XfrmStateFlags,
    /// Outer header overhead added by this SA (used by PMTU).
    pub header_len: u16,
    /// Optional UDP encapsulation port (NAT traversal: ESP-in-UDP, RFC 3948).
    pub encap: Option<XfrmEncap>,
    /// RCU-protected: the SA is read-locked during packet processing.
    _rcu: RcuHead,
}
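
The `seq`/`seq_hi` TX scheme described in the field docs can be sketched in userspace terms — std `Mutex` and atomics stand in for the kernel `SpinLock`, and `next_esn` is a hypothetical helper. The packet that observes 0xFFFF_FFFF still belongs to the old epoch; the epoch bump applies from the next packet on:

```rust
use std::sync::atomic::{AtomicU32, Ordering};
use std::sync::Mutex;

pub struct TxSeq {
    pub seq: AtomicU32,
    pub seq_hi: AtomicU32,
    pub tx_seq_lock: Mutex<()>, // kernel: SpinLock<()>
}

impl TxSeq {
    /// Allocate the next 64-bit ESN for a TX packet: lock-free `fetch_add`
    /// fast path; the lock is taken only on the 1-in-2^32 wrap.
    pub fn next_esn(&self) -> u64 {
        // AcqRel so a CPU that sees a post-wrap `seq` also sees the new `seq_hi`.
        let lo = self.seq.fetch_add(1, Ordering::AcqRel);
        let hi = if lo == u32::MAX {
            // Wrap: bump the epoch under the lock. This packet itself still
            // carries the old high word.
            let _g = self.tx_seq_lock.lock().unwrap();
            let old = self.seq_hi.load(Ordering::Acquire);
            self.seq_hi.store(old.wrapping_add(1), Ordering::Release);
            old
        } else {
            self.seq_hi.load(Ordering::Acquire)
        };
        ((hi as u64) << 32) | (lo as u64)
    }
}
```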

/// **Key material zeroization**: On SA deletion (via `XFRM_MSG_DELSA`) or lifetime
/// expiry, all key material MUST be zeroized before deallocation:
/// - `XfrmState`'s `Drop` impl calls `Zeroize::zeroize()` on the `aead`, `auth`,
///   and `enc` crypto transform fields. Each transform type's `Zeroize` impl
///   overwrites its internal key buffer with zeros.
/// - RCU grace period: the `XfrmState` is removed from the XArray under RCU,
///   so existing packet processing holds a read-side reference. Zeroization occurs
///   in the RCU callback (after all readers have released), ensuring no in-flight
///   packet sees a zeroed key.
/// - After zeroization, the slab allocator reclaims the memory. The page is NOT
///   returned to the buddy allocator until it has been overwritten (slab reuse
///   provides this naturally; freed slabs are immediately available for new SAs).

/// SA identifier triple.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub struct XfrmId {
    /// Destination IP address (outer for tunnel, inner for transport).
    pub daddr: XfrmAddress,
    /// Security Parameters Index (host byte order). Converted from network
    /// byte order at the XFRM_MSG_NEWSA/XFRM_MSG_NEWPOLICY netlink handler
    /// boundary. Linux stores SPI as `__be32` (network byte order) throughout;
    /// UmkaOS stores as native u32 for XArray key efficiency. The netlink
    /// handler (`xfrm_user_newsa()`) byte-swaps before storing.
    pub spi: u32,
    /// Protocol: IPPROTO_ESP (50) or IPPROTO_AH (51).
    pub proto: u8,
}

16.22.2 Security Policy (SP) -- XfrmPolicy

/// An IPsec Security Policy.
///
/// Policies are checked on every packet before the SA lookup.
/// A policy may require one or more transforms (an SA bundle), allow the
/// packet without transformation, or block it entirely.
pub struct XfrmPolicy {
    /// Traffic selector: src/dst addresses, L4 protocol, port ranges.
    pub selector: XfrmSelector,
    /// Action: apply transforms (Ipsec), pass (Allow), or drop (Block).
    pub action: XfrmAction,
    /// Required SA template chain. Each entry specifies the mode, protocol
    /// (ESP or AH), and algorithm requirements. At most 4 chained SAs
    /// (e.g., AH transport + ESP tunnel -- unusual but valid per RFC 4301).
    pub xfrm_vec: ArrayVec<XfrmTmpl, 4>,
    /// Priority (lower = higher priority). Policies are searched in priority order.
    pub priority: u32,
    /// Direction: In (inbound), Out (outbound), or Fwd (forwarded packets).
    pub dir: XfrmDir,
    pub flags: XfrmPolicyFlags,
    /// Index assigned at creation (for RTM_GETPOLICY lookup by index).
    pub index: u32,
    _rcu: RcuHead,
}

#[derive(Debug, Clone, Copy)]
pub enum XfrmAction {
    Allow,
    Block,
    Ipsec,
}

#[derive(Debug, Clone, Copy)]
pub enum XfrmMode {
    Transport,
    Tunnel,
}

#[derive(Debug, Clone, Copy)]
pub enum XfrmDir {
    In,
    Out,
    Fwd,
}

16.22.3 SA and SP Databases

SAD (Security Association Database): XArray keyed by SPI (u32). The SPI (Security Parameter Index) is a randomly generated 32-bit integer, making it an ideal XArray key — O(1) lookup, RCU-compatible reads, and 64-way fanout. Since SPI collisions are possible (different SAs may share an SPI if they differ in destination address or protocol), each XArray slot holds a short ArrayVec of SAs; the per-packet lookup checks destination and protocol to select the correct entry (typically exactly one). The SAD uses a two-layer concurrency design to eliminate lock contention on the per-packet lookup path:

  • Read path (per-packet, hot path): XArray RCU lookup by SPI — zero lock contention. xfrm_state_lookup() calls sad.xa_load(spi) under an RCU read guard, then linearly scans the (typically single-element) entry list for the matching (dst, proto). Never acquires sad_lock.
  • Write path (SA install/delete, cold path): acquire sad_lock (a SpinLock), insert or remove the entry in the XArray, then release sad_lock. XArray node allocation uses slab; RCU protects concurrent readers during structural changes.

Individual XfrmState entries have their own RCU protection for in-place state updates (lifetime counters, replay windows). SA policy changes (SPD) are rare; SA lookups occur per packet.
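
The shape of the read path can be sketched with std containers standing in for the kernel structures — `HashMap` + `Vec` replace the XArray + ArrayVec, and `SaEntry`/`sad_lookup` are illustrative names; the real lookup runs under an RCU read guard and never touches `sad_lock`:

```rust
use std::collections::HashMap;

#[derive(Clone, Copy, PartialEq, Eq, Debug)]
pub enum IpsecProto { Esp, Ah }

pub struct SaEntry {
    pub dst: [u8; 16], // IPv6-sized; IPv4 addresses are mapped
    pub proto: IpsecProto,
    // crypto transforms, lifetime, replay window elided
}

/// Per-packet SAD lookup: index by SPI, then disambiguate a (rare)
/// collision by scanning the typically single-element slot for the
/// matching (dst, proto).
pub fn sad_lookup<'a>(
    sad: &'a HashMap<u32, Vec<SaEntry>>,
    spi: u32,
    dst: [u8; 16],
    proto: IpsecProto,
) -> Option<&'a SaEntry> {
    sad.get(&spi)?
        .iter()
        .find(|sa| sa.dst == dst && sa.proto == proto)
}
```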

Security Policy Database (SPD) lookup: UmkaOS uses a Patricia trie (radix tree) as the baseline SPD data structure — not a linear list at any policy count.

  • IP prefix matching: Patricia trie on source/destination IP prefixes. O(W) lookup where W = key width (32 bits for IPv4, 128 for IPv6). Policy specificity (more-specific prefixes take priority) is handled by longest-prefix match semantics native to the trie.

  • Port range selectors: When a policy has port-range selectors, a two-level lookup applies: first the Patricia trie on IP prefix, then an interval tree (augmented red-black tree) on port ranges within the prefix bucket. O(W + log P) where P = number of port-range policies matching the IP prefix.

  • Insert/delete: O(W) trie operations. Policy database updates are rare (usually at IPsec SA negotiation) and not on the fast path.

This design is correct for all policy set sizes; there is no threshold above which a different structure is used.

/// XFRM subsystem state (per network namespace).
pub struct XfrmNetns {
    /// Security Association Database (SAD).
    ///
    /// XArray keyed by SPI (`u32`). Each slot holds an `ArrayVec` of SAs that share
    /// the same SPI (collisions are rare — SPI is a random 32-bit value — but the
    /// design handles them correctly by scanning on `(dst, proto)`).
    ///
    /// **Read path** (per-packet, hot path): `sad.xa_load(spi)` under RCU read guard,
    /// then linear scan of the (typically single-element) entry list for matching
    /// `(dst, proto)`. Zero lock contention. Individual `XfrmState` entries have their
    /// own RCU protection for in-place state updates (lifetime counters, replay windows).
    ///
    /// **Write path** (SA install/delete, cold path): acquire `sad_lock`, insert or
    /// remove the entry in the XArray, release `sad_lock`. XArray uses slab-allocated
    /// nodes; RCU protects concurrent readers during structural changes.
    ///
    /// This two-layer design eliminates lock contention on the lookup path entirely.
    /// SA policy changes (SPD) are rare; SA lookups occur per packet.
    pub sad: XArray<ArrayVec<Arc<XfrmState>, 4>>,
    /// Serializes concurrent SAD write operations. NOT held during reads.
    pub sad_lock: SpinLock<()>,

    // Security Policy Database (SPD) — policy changes are infrequent; RwLock is fine.
    pub spd_in:   RwLock<PatriciaTrie<Arc<XfrmPolicy>>>,
    pub spd_out:  RwLock<PatriciaTrie<Arc<XfrmPolicy>>>,
    pub spd_fwd:  RwLock<PatriciaTrie<Arc<XfrmPolicy>>>,

    pub nlsk_group: NlMulticastGroup,
}

16.22.4 Packet Processing Hooks

Outbound (TX) -- xfrm_output(netns, buf):

  1. After routing decides the output interface, before QdiscOps::enqueue.
  2. Look up SPD (Out) with the packet's 5-tuple selector.
  3. If no matching policy or action == Allow: pass through.
  4. If action == Block: drop, return Err(KernelError::PermissionDenied).
  5. For action == Ipsec: find or trigger creation of the required SA chain.
  6. If SA exists in SAD: proceed.
  7. If SA missing: send XFRM_MSG_ACQUIRE to the IKEv2 daemon (via netlink); hold the packet in an xfrm_bundle_pending queue for up to 5 seconds. If no SA arrives in time, drop and return EHOSTUNREACH.
  8. For each SA in the bundle (in order): apply transform.
  9. Transport mode: insert ESP/AH header between IP header and payload.
  10. Tunnel mode: prepend new outer IP header and ESP/AH; original packet becomes payload.
  11. Update XfrmState.stats.bytes, stats.packets. Check lifetime limits; if exceeded, send XFRM_MSG_EXPIRE and mark SA as expiring.
  12. Increment XfrmState.seq atomically; embed in ESP header.
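
The decision structure of steps 2-7 reduces to a small dispatch. This sketch abstracts the SPD and SAD lookups to their results; `TxVerdict` and `outbound_verdict` are illustrative names:

```rust
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
pub enum XfrmAction { Allow, Block, Ipsec }

#[derive(Clone, Copy, PartialEq, Eq, Debug)]
pub enum TxVerdict {
    /// No matching policy, or action == Allow: send unmodified.
    PassThrough,
    /// action == Block: drop with PermissionDenied.
    Drop,
    /// SA bundle installed: apply transforms, then enqueue.
    Transform,
    /// Policy demands IPsec but no SA yet: park the packet and send
    /// XFRM_MSG_ACQUIRE to the IKEv2 daemon.
    HoldForAcquire,
}

/// Steps 2-7 of the outbound hook, with lookups abstracted away.
pub fn outbound_verdict(policy: Option<XfrmAction>, sa_installed: bool) -> TxVerdict {
    match policy {
        None | Some(XfrmAction::Allow) => TxVerdict::PassThrough,
        Some(XfrmAction::Block) => TxVerdict::Drop,
        Some(XfrmAction::Ipsec) if sa_installed => TxVerdict::Transform,
        Some(XfrmAction::Ipsec) => TxVerdict::HoldForAcquire,
    }
}
```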

Inbound (RX) -- xfrm_input(netns, buf):

  1. After IP receive, if protocol == ESP (50) or AH (51).
  2. Extract SPI from the ESP/AH header. Look up SAD by (dst, spi, proto).
  3. If no SA: drop, log INVALID_SPI error.
  4. Check anti-replay window (Section 16.22.5). If replayed: drop.
  5. Decrypt and authenticate using the SA's crypto transforms (AeadTfm for ESP-AEAD). On auth failure: drop, increment stats.integrity_failed.
  6. Strip the ESP/AH header. For tunnel mode: re-inject the inner packet at the IP receive path.
  7. Look up SPD (In) with the inner packet's 5-tuple; verify that a matching policy requiring this SA exists (inbound policy check -- prevents SA bypass by sending non-IPsec traffic to a port that should be protected).
  8. Deliver inner packet to the transport layer (TCP, UDP).

16.22.5 Anti-Replay Window

/// Configurable anti-replay sliding window.
///
/// Prevents replay attacks where an attacker re-injects captured ESP packets.
/// The window size is configurable per SA (default 4096 packets), supporting
/// modern multi-queue NICs where packets may arrive out of order across queues.
/// RFC 4303 §3.4.3 permits implementation-defined window sizes.
///
/// Internally the bitmap is stored as an array of u64 words; the number of
/// words is `window_size / 64` (rounded up). Common sizes:
/// - 64 packets:   1 word  (8 bytes)  — legacy compatibility
/// - 1024 packets: 16 words (128 bytes)
/// - 4096 packets: 64 words (512 bytes) — default
///
/// Maximum configurable window size: 65536 packets (1024 words, 8 KB).
/// Configured via XFRM_MSG_NEWSA `replay_window` attribute (netlink).
pub struct ReplayWindow {
    /// Sequence number of the right edge of the window (highest received).
    /// For ESN-enabled SAs, seq tracks the full 64-bit right edge.
    /// The low 32 bits match the on-wire ESP sequence; seq_hi is inferred
    /// per RFC 4303 Appendix A. For non-ESN SAs, only the low 32 bits
    /// are used.
    pub seq: u64,
    /// Bitmask array: bit N is set if seq-N has been received. Bit 0 = seq itself.
    /// Length = `(window_size + 63) / 64` words. Heap-allocated at SA creation
    /// (cold path — IKEv2 key exchange). The default 4096-packet window uses
    /// 64 words (512 bytes). SA creation is cold-path, so Box<[u64]> is
    /// acceptable per collection policy.
    pub bitmap: Box<[u64]>,
    /// Window size in packets (must be a multiple of 64, min 64, max 65536).
    pub window_size: u32,
}

/// Default anti-replay window size in packets.
pub const REPLAY_WINDOW_DEFAULT: u32 = 4096;
/// Maximum anti-replay window size in packets.
pub const REPLAY_WINDOW_MAX: u32 = 65536;

impl ReplayWindow {
    /// Maximum number of u64 words in the bitmap (65536 / 64).
    const MAX_WORDS: usize = (REPLAY_WINDOW_MAX as usize) / 64;

    /// Create a new replay window with the given size (in packets).
    /// `size` is clamped to [64, 65536] and rounded up to a multiple of 64.
    pub fn new(size: u32) -> Self {
        let size = size.clamp(64, REPLAY_WINDOW_MAX);
        let size = (size + 63) & !63; // round up to multiple of 64
        let n_words = (size as usize) / 64;
        let bitmap = vec![0u64; n_words].into_boxed_slice();
        Self { seq: 0, bitmap, window_size: size }
    }

    /// Check and record a received sequence number.
    ///
    /// Returns `Ok(())` if the sequence number is acceptable (in window and not seen).
    /// Returns `Err(ReplayError::TooOld)` if the sequence is before the window.
    /// Returns `Err(ReplayError::Duplicate)` if the sequence has been seen.
    ///
    /// On `Ok`, records the sequence number in the bitmap and advances the window
    /// if this is the new highest sequence number.
    ///
    /// **Synchronization**: This method takes `&mut self` and is called under the
    /// per-SA `replay_lock: SpinLock<()>` held by the caller (`esp_input`). ESP RX
    /// is multi-CPU (multiple cores may process packets for the same SA concurrently),
    /// so the lock serializes replay window updates per SA. The lock is held only
    /// for the duration of this check (~50-200ns), which is acceptable given that
    /// ESP decryption (~1-5μs) dominates per-packet cost. The `replay_lock` is
    /// defined on `XfrmState`, not on `ReplayWindow`, to keep this struct `Copy`-friendly.
    pub fn check_and_record(&mut self, new_seq_lo: u32, esn: bool, seq_hi_hint: u32) -> Result<(), ReplayError> {
        // Reconstruct the full 64-bit sequence number.
        // For non-ESN SAs, the high 32 bits are always 0.
        // For ESN SAs (RFC 4303 Appendix A), infer seq_hi from the 32-bit
        // on-wire value and the current window position.
        let new_seq: u64 = if esn {
            Self::reconstruct_esn(new_seq_lo, seq_hi_hint, self.seq, self.window_size)
        } else {
            new_seq_lo as u64
        };

        if new_seq == 0 {
            // Sequence 0 is invalid per RFC 4303 §3.3.3.
            return Err(ReplayError::TooOld);
        }
        if new_seq > self.seq {
            // New highest: advance window.
            let diff = (new_seq - self.seq) as usize;
            if diff < self.window_size as usize {
                // Shift bitmap left by `diff` bits across the word array.
                // `bitmap_shift_left(bitmap: &mut [u64], shift: usize)` shifts
                // all bits in the multi-word bitmap left by `shift` positions
                // (bit K moves to bit K+shift; lower bits are cleared). Algorithm:
                // iterate words from high to low index. For each word, the new
                // value is `(word[i - word_shift] << bit_shift) | (word[i - word_shift - 1] >> (64 - bit_shift))`.
                // Words below the shift range are zeroed. Same algorithm as Linux
                // `lib/bitmap.c __bitmap_shift_left()`.
                bitmap_shift_left(&mut self.bitmap, diff);
                self.bitmap[0] |= 1; // mark current position
            } else {
                // Entire window reset.
                for w in self.bitmap.iter_mut() { *w = 0; }
                self.bitmap[0] = 1;
            }
            self.seq = new_seq;
        } else {
            let diff = (self.seq - new_seq) as usize;
            if diff >= self.window_size as usize {
                return Err(ReplayError::TooOld);
            }
            let word_idx = diff / 64;
            let bit_idx = diff % 64;
            let mask = 1u64 << bit_idx;
            if self.bitmap[word_idx] & mask != 0 {
                return Err(ReplayError::Duplicate);
            }
            self.bitmap[word_idx] |= mask;
        }
        Ok(())
    }

    /// Reconstruct the full 64-bit ESN from the 32-bit on-wire sequence
    /// number per RFC 4303 Appendix A.
    ///
    /// Algorithm matches Linux `xfrm_replay_seqhi()` (net/xfrm/xfrm_replay.c):
    /// compute the bottom of the replay window in the low-32 subspace, then
    /// determine if the new sequence falls below that bottom (implying a
    /// high-word increment) or above it in the wrap case (implying a decrement).
    ///
    /// `seq_hi_hint` is the current high 32 bits from the SA's tx/rx state,
    /// i.e., `(window_top >> 32) as u32`. Passed explicitly to match the Linux
    /// `xfrm_replay_seqhi()` API signature. Callers must ensure
    /// `seq_hi_hint == (window_top >> 32) as u32`; any divergence is a bug.
    /// `window_top` is `self.seq` (full 64-bit right edge of the replay window).
    /// `window_size` is the configured replay window size in packets.
    fn reconstruct_esn(new_seq_lo: u32, seq_hi_hint: u32, window_top: u64,
                       window_size: u32) -> u64 {
        let seq_lo = window_top as u32; // low 32 bits of current window right edge
        let mut seq_hi = seq_hi_hint;
        // Bottom of the window in the low-32 subspace (wrapping subtraction).
        let bottom = seq_lo.wrapping_sub(window_size).wrapping_add(1);

        if seq_lo >= window_size.wrapping_sub(1) {
            // Case A: window does NOT span a u32 wraparound boundary.
            // If new_seq_lo is below the window bottom, it must belong to
            // the next high-word epoch (the low 32 bits wrapped around).
            if new_seq_lo < bottom {
                seq_hi = seq_hi.wrapping_add(1);
            }
        } else {
            // Case B: window SPANS the u32 wraparound boundary.
            // The bottom is a large value (near u32::MAX). If new_seq_lo
            // is >= bottom, it belongs to the previous epoch.
            if new_seq_lo >= bottom {
                seq_hi = seq_hi.wrapping_sub(1);
            }
        }

        ((seq_hi as u64) << 32) | (new_seq_lo as u64)
    }
}
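
The bitmap_shift_left helper referenced in check_and_record can be implemented exactly as the comment describes (same word-by-word algorithm as Linux's lib/bitmap.c __bitmap_shift_left()); a self-contained sketch:

```rust
/// Shift all bits in a multi-word bitmap left by `shift` positions:
/// bit K moves to bit K + shift; vacated low bits are cleared.
/// Word 0 holds bits 0..64, word 1 holds bits 64..128, and so on.
pub fn bitmap_shift_left(bitmap: &mut [u64], shift: usize) {
    let n = bitmap.len();
    let word_shift = shift / 64;
    let bit_shift = shift % 64;
    // Iterate from the highest word down so each source word is read
    // before it is overwritten.
    for i in (0..n).rev() {
        let mut val = 0u64;
        if i >= word_shift {
            val = bitmap[i - word_shift] << bit_shift;
            // Guard bit_shift > 0: `>> 64` would overflow in Rust.
            if bit_shift > 0 && i > word_shift {
                val |= bitmap[i - word_shift - 1] >> (64 - bit_shift);
            }
        }
        bitmap[i] = val; // words below the shift range end up zeroed
    }
}
```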

16.22.6 Netlink Interface

All XFRM management messages use the NETLINK_XFRM socket family. Messages require Capability::NetAdmin. Key message types:

Message             Description
XFRM_MSG_NEWSA      Create an SA (called by the IKEv2 daemon after key exchange)
XFRM_MSG_DELSA      Delete an SA by XfrmId
XFRM_MSG_GETSA      Get one SA; or dump all SAs (NLM_F_DUMP)
XFRM_MSG_UPDSA      Update an existing SA (rekey without connection teardown)
XFRM_MSG_NEWPOLICY  Create a policy
XFRM_MSG_DELPOLICY  Delete a policy by index or selector
XFRM_MSG_GETPOLICY  Get one policy; or dump all
XFRM_MSG_UPDPOLICY  Update a policy
XFRM_MSG_ACQUIRE    kernel->daemon: SA needed for a packet; carries the policy selector
XFRM_MSG_EXPIRE     kernel->daemon: SA lifetime exhausted; carries SA id + hard/soft flag
XFRM_MSG_NEWAE      Update SA sequence / replay state (for SA migration)
XFRM_MSG_REPORT     kernel->daemon: audit event (policy bypass, integrity failure)

ACQUIRE flow: When xfrm_output encounters a packet matching an Ipsec policy but no matching SA, it sends XFRM_MSG_ACQUIRE to all sockets subscribed to the XFRM_NLGRP_ACQUIRE multicast group. The IKEv2 daemon receives the acquire, negotiates keys with the peer, and installs the SA via XFRM_MSG_NEWSA. The pending packet is held in the kernel and transmitted once the SA is installed.

16.22.7 Crypto API Integration

All IPsec transforms use the Kernel Crypto API (Section 10.1):

IPsec Algorithm             Crypto API Request
AES-GCM-128 ESP             crypto_alloc_aead("gcm(aes)", 0, 0)
AES-GCM-256 ESP             crypto_alloc_aead("gcm(aes)", 0, 0) (256-bit key)
ChaCha20-Poly1305 ESP       crypto_alloc_aead("rfc7539(chacha20,poly1305)", 0, 0)
AES-CBC-128 + HMAC-SHA256   crypto_alloc_skcipher("cbc(aes)", ...) + crypto_alloc_ahash("hmac(sha256)", ...)
AH HMAC-SHA256              crypto_alloc_ahash("hmac(sha256)", 0, 0)

The algorithm name and key material are supplied by the IKEv2 daemon in XFRM_MSG_NEWSA. The kernel allocates the transform, validates the key length, and stores the handle in XfrmState.aead / .auth / .enc.


16.23 SCTP -- Stream Control Transmission Protocol

SCTP (RFC 4960) is a transport protocol providing multi-homing, multi-streaming, reliable ordered delivery, and message-boundary preservation. UmkaOS implements SCTP as a registered transport in umka-net's socket layer (Section 16.3), using the same SocketOps trait and NetBuf pipeline as TCP and UDP.

Use cases in the UmkaOS deployment context: Corosync cluster heartbeat (Section 15.15), telecom DIAMETER/SS7 gateways, and iSCSI login negotiation all require SCTP. The kernel-side SCTP implementation allows these to work over standard AF_INET/AF_INET6 sockets without any userspace SCTP library.

16.23.1 Association State Machine

SCTP connections are called associations. The state machine matches RFC 4960 Section 4:

Closed ──INIT──────────────────────────────► CookieWait
       ◄──INIT-ACK (with cookie)───────────── (peer)
CookieWait ──COOKIE-ECHO────────────────────► CookieEchoed
           ◄──COOKIE-ACK────────────────────── (peer)
CookieEchoed ────────────────────────────────► Established
Established ──close(data queued)──────────────► ShutdownPending
            ──recv SHUTDOWN───────────────────► ShutdownReceived
ShutdownPending ──all data acked, send SHUTDOWN► ShutdownSent
ShutdownSent ──recv SHUTDOWN-ACK─────────────► ShutdownAckSent
ShutdownReceived ──all data acked, send SHUTDOWN-ACK──► ShutdownAckSent
ShutdownAckSent ──recv SHUTDOWN-COMPLETE─────► Closed

Cookie mechanism: The INIT-ACK carries a State Cookie -- a MAC-protected (HMAC-SHA256) blob encoding the association parameters, timestamps, and a random nonce. The initiator echoes the cookie in COOKIE-ECHO without the responder storing any state between INIT and COOKIE-ECHO. This prevents memory exhaustion attacks (equivalent to TCP SYN cookies but specified by RFC 4960). The MAC key is rotated every 60 seconds; a grace period of one key-rotation interval accepts cookies from the previous key.
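
The rotation policy reduces to a two-key check on the receive side. In this sketch the `mac` function is a toy deterministic stand-in for HMAC-SHA256 (the kernel uses the Crypto API transform), and all names are illustrative:

```rust
/// Toy deterministic tag function standing in for HMAC-SHA256.
/// NOT cryptographically secure — illustration only.
pub fn mac(key: u64, data: &[u8]) -> u64 {
    data.iter()
        .fold(key ^ 0x9E37_79B9_7F4A_7C15, |acc, &b| acc.rotate_left(7) ^ b as u64)
}

pub struct CookieKeys {
    pub current: u64,  // rotated every 60 seconds
    pub previous: u64, // accepted for one rotation interval (grace period)
}

/// Accept a COOKIE-ECHO if its tag verifies under the current key, or under
/// the previous key (grace period of one rotation interval).
pub fn cookie_valid(keys: &CookieKeys, cookie: &[u8], tag: u64) -> bool {
    mac(keys.current, cookie) == tag || mac(keys.previous, cookie) == tag
}
```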

16.23.2 SctpAssoc Struct

/// Maximum local bind addresses per SCTP association.
/// Validated at `bind()`. Exceeding returns `EINVAL`.
pub const SCTP_MAX_BIND_ADDRS: usize = 256;

/// Maximum remote peer addresses per SCTP association.
/// Validated during `INIT`/`INIT-ACK` processing. Exceeding returns `EINVAL`.
pub const SCTP_MAX_PEER_ADDRS: usize = 256;

/// An SCTP association (the SCTP equivalent of a TCP connection).
pub struct SctpAssoc {
    /// Kernel-assigned association ID (exposed via SCTP_ASSOCINFO sockopt).
    /// `sctp_assoc_t` is `__s32` in Linux, so the ID is `i32` for ABI compatibility.
    /// Association IDs are recycled after teardown (not monotonic), so wrap
    /// is not a concern; the positive i32 range (~2.1B values) far exceeds
    /// any practical number of concurrent associations.
    pub assoc_id: i32,
    /// Current state machine state.
    pub state: SctpState,
    /// Local IP addresses (multi-homing: all addresses bound to the socket).
    /// Bounded: max `SCTP_MAX_BIND_ADDRS` (256). Heap-allocated on warm
    /// association setup path. Not integer-keyed (iterated for heartbeats).
    pub local_addrs: Vec<IpAddr>,
    /// Remote addresses (multi-homed peer; each has independent path state).
    /// Bounded: max `SCTP_MAX_PEER_ADDRS` (256). Heap-allocated on warm
    /// association setup path. SctpPeer too large for ArrayVec<_, 256>.
    /// Typical deployments: 1-4 peer addresses.
    pub peer_addrs: Vec<SctpPeer>,
    /// Index into `peer_addrs` for the active primary path.
    pub primary_path: usize,
    /// Per-stream send/receive state. XArray keyed by Stream ID (u16).
    /// Sparse: up to 65535 streams per RFC 9260 §5.1.1, but typical
    /// associations negotiate only 4-16 streams. XArray provides O(1) lookup.
    pub streams: XArray<SctpStream>,
    /// Next TSN (Transmission Sequence Number) to use on the next DATA chunk TX.
    /// **RFC 9260 §3.3.1**: TSN is u32 on the wire. Wrap-safe by design:
    /// SCTP uses serial number arithmetic (RFC 1982) for TSN comparison —
    /// `TSN_lt(a, b)` compares `(a - b) as i32 < 0`. Wrap is harmless as
    /// long as no more than 2^31 TSNs are outstanding (ensured by cwnd/rwnd).
    ///
    /// **Serialization**: The SCTP TX path is serialized under the association
    /// lock (matching Linux's `lock_sock()` in `sctp_sendmsg()`). The multi-step
    /// sequence (assign TSN, construct DATA chunk, insert into retransmit_queue)
    /// is NOT concurrency-safe on its own — two concurrent `sendmsg()` calls
    /// would assign TSN N and N+1 but could insert in wrong order. The
    /// `AtomicU32` is used for cross-thread visibility: concurrent SACK
    /// processing reads `tsn_next` to compute the outstanding TSN window
    /// without acquiring the TX lock. The `AtomicU32` does NOT replace
    /// association-level serialization of the TX path.
    pub tsn_next: AtomicU32,
    /// Cumulative TSN ACKed by peer (from last received SACK).
    pub cum_tsn_ack: u32,
    /// Receiver window advertised by peer (bytes).
    pub rwnd: u32,
    /// Effective MTU (minimum across all active paths).
    /// `u16` is sufficient: SCTP MTU cannot exceed 65535 bytes (IPv4/IPv6 packet
    /// limit), and all practical paths use ≤9000 bytes (jumbo Ethernet).
    /// Note: `RouteEntry::mtu` uses `u32` for forward compatibility with hypothetical
    /// future link types; the narrower `u16` here is intentional for the association
    /// aggregate (always bounded by the smallest path MTU, which is ≤65535).
    pub mtu: u16,
    /// Current RTO for the primary path.
    pub rto: Duration,
    /// Heartbeat interval (default: 30 seconds).
    pub hb_interval: Duration,
    /// Retransmit queue: DATA chunks awaiting SACK, indexed by TSN.
    /// Warm path: accessed per-SACK and per-retransmit-timer, not per-packet.
    /// XArray provides O(1) lookup by TSN and ordered iteration for gap-ack processing.
    pub retransmit_queue: XArray<NetBuf>,
    /// Out-of-order received DATA chunks awaiting gap fill.
    pub ooo_queue: XArray<NetBuf>,
    /// SCTP socket this association belongs to.
    pub sock: Weak<SctpSock>,
}
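
The serial-number comparison referenced in the `tsn_next` docs is small enough to show in full (`tsn_lt` is an illustrative name):

```rust
/// TSN "less than" via serial number arithmetic (RFC 1982): `a` precedes `b`
/// iff the wrapping difference, reinterpreted as i32, is negative. Valid as
/// long as fewer than 2^31 TSNs are outstanding (guaranteed by cwnd/rwnd).
pub fn tsn_lt(a: u32, b: u32) -> bool {
    (a.wrapping_sub(b) as i32) < 0
}
```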

16.23.3 Multi-Homing

/// Per-path (per-remote-address) state in an SCTP association.
pub struct SctpPeer {
    /// Remote IP address of this path.
    /// SockAddr used for Linux compat (sctp_paddrinfo); port field always
    /// equals the association's port.
    pub addr: SockAddr,
    /// Path reachability state.
    pub state: PathState,
    /// Per-path congestion window (bytes).
    /// u64 to support high-BDP paths (400 Gbps × 100ms RTT = ~5 GB BDP exceeds u32 max).
    ///
    /// **Congestion control algorithm**: RFC 4960 Section 7 mandates Reno-like
    /// behavior (slow start with `cwnd += MTU` per ACK, congestion avoidance
    /// with `cwnd += MTU * MTU / cwnd` per ACK, halving on loss). UmkaOS
    /// implements this as the default SCTP congestion control. Unlike TCP,
    /// SCTP does not support pluggable congestion control — the algorithm is
    /// fixed per RFC 4960.
    pub cwnd: u64,
    /// Per-path slow-start threshold (u64 to match cwnd — same high-BDP rationale).
    pub ssthresh: u64,
    /// Partial bytes ACKed on this path (for cwnd increment in congestion avoidance;
    /// per-path per RFC 4960 §6.2.1 — each destination address has its own cwnd).
    pub partial_bytes_acked: u64,
    /// Retransmission Timeout for this path (updated by RTTM: RFC 4960 Section 7.3).
    pub rto: Duration,
    /// RTO minimum (default: 1 second per RFC 4960; tunable via SCTP_RTOINFO).
    pub rto_min: Duration,
    /// RTO maximum (default: 60 seconds).
    pub rto_max: Duration,
    /// Smoothed RTT estimate (us).
    pub srtt_us: u32,
    /// RTT variance estimate (us).
    pub rttvar_us: u32,
    /// Heartbeat timer: fires if no data sent/received for `hb_interval`.
    pub hb_timer: TimerHandle,
    /// Consecutive retransmit timeouts on this path.
    pub error_count: u32,
    /// Threshold for declaring path failure (default: 5, tunable via SCTP_PADDRPARAMS).
    pub max_retrans: u32,
}
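The default congestion control described in the `cwnd` field comment can be sketched as follows. This is a minimal illustration of the RFC 4960 Section 7.2 rules as stated above (one MTU per ACK in slow start, ~one MTU per RTT in congestion avoidance, halving floored at 4 MTU on loss); the struct and method names are illustrative, not the real UmkaOS API.

```rust
/// Sketch of the fixed per-path SCTP congestion control described above.
pub struct CwndState {
    pub cwnd: u64,
    pub ssthresh: u64,
    pub mtu: u64,
}

impl CwndState {
    /// Called once per SACK that advances the cumulative TSN.
    pub fn on_sack_ack(&mut self) {
        if self.cwnd <= self.ssthresh {
            // Slow start: one MTU per ACK.
            self.cwnd += self.mtu;
        } else {
            // Congestion avoidance: approximately one MTU per RTT.
            self.cwnd += (self.mtu * self.mtu / self.cwnd).max(1);
        }
    }

    /// Called on loss detection via fast retransmit.
    pub fn on_loss(&mut self) {
        // Halve on loss, floored at 4 MTU (RFC 4960 Section 7.2.3).
        self.ssthresh = (self.cwnd / 2).max(4 * self.mtu);
        self.cwnd = self.ssthresh;
    }
}
```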

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum PathState {
    /// Path is reachable and active.
    Active,
    /// Path is unreachable (error_count exceeded max_retrans).
    Inactive,
    /// Path has been added but not yet confirmed by a HEARTBEAT-ACK.
    Unconfirmed,
}

Path failure and failover: When error_count exceeds max_retrans on the primary path, the association marks it Inactive and selects the next Active path as the new primary. Retransmits are sent on the new primary path. Heartbeats continue on inactive paths; a successful HEARTBEAT-ACK resets error_count and marks the path Active again. If all paths become Inactive, the association is aborted (ABORT chunk sent on the last active path before marking it Inactive).
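The primary-selection step of the failover logic above can be sketched as a scan for the next Active path. A local copy of `PathState` keeps the sketch self-contained; the function name is illustrative, not the real UmkaOS API.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum PathState { Active, Inactive, Unconfirmed }

/// Pick the new primary after the current primary fails: the first Active
/// path other than the failed one, in index order. Returns None when no
/// Active path remains (the association is then aborted).
pub fn select_new_primary(paths: &[PathState], failed_primary: usize) -> Option<usize> {
    paths
        .iter()
        .enumerate()
        .find(|&(i, &s)| i != failed_primary && s == PathState::Active)
        .map(|(i, _)| i)
}
```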

16.23.4 Multi-Streaming

/// Per-stream state in an SCTP association.
pub struct SctpStream {
    /// Stream ID (0 to num_streams - 1).
    pub sid: u16,
    /// Next Stream Sequence Number to assign to an outgoing ordered DATA chunk.
    /// **RFC 9260 §3.3.1**: SSN is u16 on the wire. Wrap-safe by design:
    /// SSN comparison uses serial number arithmetic (same as TSN but 16-bit).
    /// Wraps after 65536 messages per stream — normal and expected for
    /// long-lived associations. The reorder buffer holds at most `a_rwnd`
    /// bytes, preventing confusion across wrap boundaries.
    pub ssn_out: u16,
    /// Next expected SSN for inbound ordered delivery.
    pub ssn_in_expected: u16,
    /// True for ordered streams; false for unordered (no SSN tracking).
    pub ordered: bool,
    /// Reorder buffer for out-of-order ordered chunks: SSN -> NetBuf.
    /// Chunks are delivered to the socket in SSN order; gaps are held here.
    ///
    /// **Size bound**: The total bytes buffered across ALL per-stream reorder_buf
    /// entries for an association is bounded by the association's receive window
    /// (`a_rwnd`). When the aggregate buffered bytes reach `a_rwnd`, new
    /// out-of-order chunks are dropped with a FORWARD-TSN hint to the peer.
    /// This prevents memory exhaustion from adversarial peers sending chunks
    /// with large SSN gaps. Per-stream: soft limit of `a_rwnd / num_streams`.
    pub reorder_buf: XArray<NetBuf>,
    /// Fragment reassembly buffer: TSN -> partial DATA chunk.
    /// For multi-chunk messages (B-bit=1, E-bit=0 intermediate fragments).
    /// Same aggregate size bound as reorder_buf (shared a_rwnd budget).
    pub fragment_buf: XArray<NetBuf>,
}

Ordered delivery: When an ordered DATA chunk arrives with ssn != ssn_in_expected, it is placed in reorder_buf. When the gap is filled (the expected SSN arrives), all consecutive SSNs are delivered to the socket receive buffer in order.
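The gap-fill delivery loop can be sketched as draining consecutive SSNs from the reorder buffer. Types are simplified for the sketch: `Vec<u8>` stands in for `NetBuf` and `HashMap` for the `XArray` reorder_buf; the expected SSN advances with wrapping u16 arithmetic, matching the SSN wrap behavior noted in `SctpStream`.

```rust
use std::collections::HashMap;

/// Drain every buffered chunk whose SSN is consecutive from `ssn_expected`,
/// returning them in delivery order. Chunks past a gap stay buffered.
pub fn deliver_in_order(
    reorder_buf: &mut HashMap<u16, Vec<u8>>,
    ssn_expected: &mut u16,
) -> Vec<Vec<u8>> {
    let mut delivered = Vec::new();
    while let Some(chunk) = reorder_buf.remove(ssn_expected) {
        delivered.push(chunk);
        // Serial-number style wrap: 65535 is followed by 0.
        *ssn_expected = ssn_expected.wrapping_add(1);
    }
    delivered
}
```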

Unordered delivery (I-DATA with U-bit set, or DATA with UNORDERED flag): Delivered immediately to the receive buffer regardless of SSN. Fragment reassembly still uses TSN-based tracking.

Message fragmentation: When sendmsg() delivers a message larger than the path MTU, SCTP splits it into DATA chunks. The first chunk has B-bit=1, E-bit=0; middle chunks have both clear; the last has E-bit=1. I-DATA (RFC 8260) adds a Message Identifier and fragment offset for interleaved reassembly -- avoids head-of-line blocking when large messages are mixed with small real-time messages.
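The B/E flag assignment above can be sketched as a pure function over the fragment count: first fragment B=1 E=0, middle fragments B=0 E=0, last B=0 E=1, and a single unfragmented chunk B=1 E=1. The function name and signature are illustrative.

```rust
/// Return the (B-bit, E-bit) pair for each DATA chunk of a message of
/// `msg_len` bytes, given the per-chunk payload capacity.
pub fn fragment_flags(msg_len: usize, max_payload: usize) -> Vec<(bool, bool)> {
    assert!(max_payload > 0);
    // Ceiling division; even an empty message produces one chunk.
    let n = (msg_len.max(1) + max_payload - 1) / max_payload;
    (0..n).map(|i| (i == 0, i == n - 1)).collect()
}
```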

16.23.5 SCTP Chunk Types

Value Name Direction Purpose
0x00 DATA bidirectional User data payload
0x01 INIT -> peer Association setup request
0x02 INIT-ACK <- peer Setup response with cookie
0x03 SACK bidirectional Selective acknowledgement of TSNs
0x04 HEARTBEAT -> peer Path liveness probe
0x05 HEARTBEAT-ACK <- peer Path liveness response
0x06 ABORT bidirectional Immediate association teardown
0x07 SHUTDOWN -> peer Graceful shutdown initiation
0x08 SHUTDOWN-ACK <- peer Shutdown acknowledgement
0x09 ERROR bidirectional Error notification chunk
0x0a COOKIE-ECHO -> peer Echo cookie from INIT-ACK
0x0b COOKIE-ACK <- peer Cookie accepted; association open
0x0c ECNE bidirectional ECN Echo (reserved in RFC 4960 Section 3.2; semantics in Appendix A)
0x0d CWR bidirectional Congestion Window Reduced (reserved in RFC 4960 Section 3.2; semantics in Appendix A)
0x0e SHUTDOWN-COMPLETE <- peer Shutdown sequence complete
0x40 I-DATA bidirectional Interleaved data (RFC 8260)

Unknown chunk types: the highest-order two bits of the type byte select the handling (the RFC 4960 Section 3.2 convention). If 00, stop processing the packet and discard it silently. If 01, stop processing and discard, and report the unrecognized chunk in an ERROR chunk ("Unrecognized Chunk Type"). If 10, skip the chunk and continue processing the bundle. If 11, skip, continue, and report.
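The two-bit dispatch can be sketched as a match on the top bits of the chunk type, per RFC 4960 Section 3.2. Names are illustrative.

```rust
/// Action for an unrecognized chunk type, selected by the highest-order
/// two bits of the Chunk Type byte (RFC 4960 Section 3.2).
#[derive(Debug, PartialEq, Eq)]
pub enum UnknownChunkAction {
    /// 00: stop processing the packet, discard silently.
    StopAndDiscard,
    /// 01: stop, discard, and report via ERROR (Unrecognized Chunk Type).
    StopDiscardAndReport,
    /// 10: skip this chunk, keep processing the bundle.
    Skip,
    /// 11: skip, keep processing, and report.
    SkipAndReport,
}

pub fn unknown_chunk_action(chunk_type: u8) -> UnknownChunkAction {
    match chunk_type >> 6 {
        0b00 => UnknownChunkAction::StopAndDiscard,
        0b01 => UnknownChunkAction::StopDiscardAndReport,
        0b10 => UnknownChunkAction::Skip,
        _    => UnknownChunkAction::SkipAndReport,
    }
}
```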

16.23.6 Socket API Compatibility

SCTP is accessible via two socket styles:

One-to-one (SOCK_STREAM, one association per socket):

fd = socket(AF_INET6, SOCK_STREAM, IPPROTO_SCTP);
// bind, listen, accept, connect -- identical semantics to TCP
// sendmsg / recvmsg -- each sendmsg sends one SCTP message

One-to-many (SOCK_SEQPACKET, multiple associations multiplexed on one socket):

fd = socket(AF_INET, SOCK_SEQPACKET, IPPROTO_SCTP);
// bind; no connect needed -- associations created on first sendmsg to a new peer
// recvmsg returns SCTP notification events when associations change state

SCTP-specific sockopts (level IPPROTO_SCTP):

Sockopt Get Set Description
SCTP_NODELAY yes yes Disable Nagle-equivalent bundling delay
SCTP_MAXSEG yes yes Maximum message size (MTU override)
SCTP_STATUS yes no Association state, primary path, streams
SCTP_ASSOCINFO yes yes RTO params, max retransmits
SCTP_RTOINFO yes yes RTO.initial, RTO.min, RTO.max per assoc
SCTP_PADDRPARAMS yes yes Per-path heartbeat interval and max_retrans
SCTP_EVENTS yes yes Which SCTP notification events to receive
SCTP_INITMSG yes yes Number of streams, max retransmits for INIT
SCTP_PEER_ADDR_INFO yes no State/RTT/cwnd for a specific peer address

16.23.7 Integration with NetBuf

SCTP DATA chunks are carried in NetBuf segments. The SCTP TX path:

  1. sendmsg() delivers user data as a NetBuf (zero-copy from the socket send buffer using NetBuf::from_user_iov()).
  2. If msg_len <= mtu - sctp_header_overhead: wrap in a single DATA chunk, assign TSN, enqueue on the primary path's TX queue.
  3. If msg_len > mtu - sctp_header_overhead: fragment into N chunks. Each fragment is a separate NetBuf chained via the existing scatter-gather list (NetBuf.frags). Fragments share the data pages (reference-counted DmaBufferHandle) -- no copy.
  4. On SACK receipt: retire ACKed NetBufs from retransmit_queue, decrement refcounts.
  5. On retransmit: the retained NetBuf in retransmit_queue is retransmitted without allocating a new buffer.
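The step-2/step-3 decision above reduces to a chunk-count computation; a minimal sketch, with illustrative names (the real TX path operates on NetBufs, not lengths):

```rust
/// How many DATA chunks a sendmsg() payload becomes, given the path MTU
/// and per-chunk SCTP overhead. A result of 1 is the single-chunk fast
/// path (step 2); anything larger takes the fragmentation path (step 3).
pub fn data_chunk_count(msg_len: usize, mtu: usize, overhead: usize) -> usize {
    let max_payload = mtu.checked_sub(overhead).expect("MTU smaller than SCTP overhead");
    if msg_len <= max_payload {
        1
    } else {
        // Ceiling division over the per-chunk payload capacity.
        (msg_len + max_payload - 1) / max_payload
    }
}
```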

16.24 AF_VSOCK -- Virtual Machine Sockets

AF_VSOCK (address family 40) enables bidirectional socket communication between a VM guest and its hypervisor host without configuring a network interface. It is used by the QEMU guest agent, containerd's CRI-over-vsock path, systemd-vmspawn, and cloud-init datasource queries.

Linux parallel: Linux implements AF_VSOCK in net/vmw_vsock/. UmkaOS implements the same sockaddr_vm ABI and VMADDR_CID_* constants so that unmodified guest agents and container runtimes work without recompilation.

16.24.1 Address Space

/// Virtual socket address (matches Linux struct sockaddr_vm, 16 bytes).
#[repr(C)]
pub struct SockAddrVm {
    /// Address family: AF_VSOCK (40).
    pub svm_family: u16,
    pub svm_reserved1: u16,
    /// Port number. Ports below 1024 are privileged: `bind()` requires
    /// `CAP_NET_BIND_SERVICE` in the caller's user namespace (matching Linux
    /// AF_VSOCK semantics). `VMADDR_PORT_ANY` (u32::MAX) requests automatic
    /// port allocation from the ephemeral range.
    pub svm_port: u32,
    /// Context ID (CID) of the communicating endpoint.
    pub svm_cid: u32,
    /// Connection flags. Currently defined: `VMADDR_FLAG_TO_HOST` (0x01) —
    /// route the connection via the host for sibling-VM communication.
    /// Unknown flag bits must be rejected with `EINVAL` for forward compat.
    pub svm_flags: u8,
    /// Must be zero (reserved for future use, matching Linux padding).
    /// On receive: validated as zero. Non-zero values cause `bind()` /
    /// `connect()` to return `EINVAL` (forward compatibility — prevents
    /// applications from accidentally relying on garbage in reserved fields).
    pub svm_zero: [u8; 3],
}
const_assert!(size_of::<SockAddrVm>() == 16);

/// svm_flags values.
pub const VMADDR_FLAG_TO_HOST: u8 = 0x01;

/// Well-known CID values.
pub mod vmaddr_cid {
    /// Bind to all local CIDs (wildcard, for listen sockets).
    pub const ANY: u32 = 0xFFFF_FFFF;
    /// Hypervisor CID. Originally defined by VMware VMCI; in QEMU/KVM virtio-vsock,
    /// CID 0 is reserved but not typically used for communication (use CID 2 for host).
    pub const HYPERVISOR: u32 = 0;
    /// Local loopback within the same VM or host context.
    pub const LOCAL: u32 = 1;
    /// Host (hypervisor userspace, e.g., QEMU process) -- used from the guest.
    pub const HOST: u32 = 2;
}

Guest VMs receive their CID from the hypervisor at VM creation (Section 16.24.6). CIDs >= 3 are dynamically assigned.

16.24.2 VsockTransport Trait

The transport layer is abstracted so that different hypervisor back-ends (virtio-vsock, VMware VMCI, loopback) can be registered. Only one transport is active per boot.

/// Virtual socket transport back-end.
///
/// Implemented by virtio-vsock (Tier 2 guest driver + Tier 1 vhost back-end),
/// and by the loopback transport (for LOCAL-to-LOCAL communication).
///
/// All data-path methods take `sock: &VsockSock` (shared reference) and
/// acquire `sock.lock` internally for mutable access to `VsockMutableState`.
/// This matches the `TcpCb` pattern: the socket is stored behind a shared
/// reference in the socket table, and multiple concurrent accessors (syscall
/// path, vhost_vsock RX demux, poll/epoll) can reach the same `VsockSock`.
pub trait VsockTransport: Send + Sync {
    /// Transport name (for sysfs reporting).
    fn name(&self) -> &'static str;

    /// Initialise the transport at module load.
    fn init(&self) -> Result<(), KernelError>;

    /// Shut down the transport; called when the vsock module is removed.
    fn release(&self);

    /// Initiate a connection from `sock` to its `remote_addr`.
    ///
    /// Acquires `sock.lock` internally. Sends a REQUEST packet; returns
    /// `Ok(())` immediately (async connect). The caller blocks in
    /// `VsockMutableState.state == Connecting` until a RESPONSE or RST arrives.
    fn connect(&self, sock: &VsockSock) -> Result<(), KernelError>;

    /// Disconnect the socket (send RST or SHUTDOWN).
    /// Acquires `sock.lock` internally.
    fn disconnect(&self, sock: &VsockSock, flags: u32) -> Result<(), KernelError>;

    /// Send data from the socket's send buffer (called after a credit update
    /// increases the available send window).
    /// Acquires `sock.lock` internally.
    fn send(&self, sock: &VsockSock, msg: &MsgHdr, flags: i32)
        -> Result<usize, KernelError>;

    /// Receive data into the caller's buffer.
    /// Acquires `sock.lock` internally.
    fn recv(&self, sock: &VsockSock, msg: &mut MsgHdr, flags: i32)
        -> Result<usize, KernelError>;

    /// Returns true if the socket has incoming data (for poll/epoll).
    fn notify_poll_in(&self, sock: &VsockSock) -> bool;

    /// Returns true if the socket has space to send (for poll/epoll).
    fn notify_poll_out(&self, sock: &VsockSock) -> bool;
}

/// Global active transport.
static VSOCK_TRANSPORT: RwLock<Option<&'static dyn VsockTransport>> =
    RwLock::new(None);

/// Register the active vsock transport (called once at module init).
pub fn vsock_register_transport(t: &'static dyn VsockTransport) -> Result<(), KernelError> {
    let mut slot = VSOCK_TRANSPORT.write();
    if slot.is_some() {
        return Err(KernelError::AlreadyExists);
    }
    *slot = Some(t);
    t.init()
}

16.24.3 Virtio-Vsock Transport

The virtio-vsock transport uses the existing UmkaOS virtio device model (Section 11.3). It operates over two virtio queues: TX (guest->host) and RX (host->guest), plus an event queue for connection lifecycle notifications.

/// A single virtio-vsock packet (maps to struct virtio_vsock_pkt in Linux).
/// Transmitted between guest and host in virtio ring descriptors.
///
/// **Alignment note**: `#[repr(C, packed)]` means all fields are at their
/// natural byte offset with no padding, matching the virtio spec's wire
/// format. On architectures that do not support unaligned loads (ARMv7,
/// s390x), the compiler emits byte-by-byte loads for packed field access.
/// This is acceptable because VsockPacket is accessed on the warm path
/// (connection setup, not per-byte data transfer) and the packet header is
/// small (44 bytes). The data payload is accessed separately via a pointer
/// to the data region following the header, which is always naturally aligned
/// by the virtio ring descriptor layout.
#[repr(C, packed)]
pub struct VsockPacket {
    /// Source context ID (virtio CIDs are 64-bit, wire format: `__le64`).
    pub src_cid: Le64,
    /// Destination context ID (virtio CIDs are 64-bit, wire format: `__le64`).
    pub dst_cid: Le64,
    /// Source port.
    pub src_port: Le32,
    /// Destination port.
    pub dst_port: Le32,
    /// Payload length (bytes following this header).
    pub len: Le32,
    /// Socket type on the wire: VIRTIO_VSOCK_TYPE_STREAM (1) or
    /// VIRTIO_VSOCK_TYPE_SEQPACKET (2). Note these are virtio constants,
    /// not the socket-API values (SOCK_STREAM = 1, SOCK_SEQPACKET = 5).
    pub type_: Le16,
    /// Operation code (see `VsockOp` for values).
    /// `Le16` on the wire; convert via `VsockOp::try_from(self.op.to_ne())`.
    pub op: Le16,
    /// Operation-specific flags (e.g., SHUTDOWN_RCV, SHUTDOWN_SEND).
    pub flags: Le32,
    /// Receiver buffer allocation (bytes the sender is willing to buffer).
    pub buf_alloc: Le32,
    /// Bytes consumed by the receiver since last credit update.
    pub fwd_cnt: Le32,
}
// Wire layout: 8+8+4+4+4+2+2+4+4+4 = 44 bytes (matches Linux virtio_vsock_hdr).
const_assert!(size_of::<VsockPacket>() == 44);

/// Parsed form of the vsock operation code. On the wire, `VsockPacket.op`
/// is `Le16`; convert via `VsockOp::try_from(packet.op.to_ne())`.
#[repr(u16)]
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum VsockOp {
    Invalid       = 0,
    /// Connection request (CLIENT -> SERVER).
    Request       = 1,
    /// Connection accepted (SERVER -> CLIENT).
    Response      = 2,
    /// Connection reset (either direction).
    Rst           = 3,
    /// Half-close (SHUT_RD or SHUT_WR).
    Shutdown      = 4,
    /// Data payload (either direction).
    Rw            = 5,
    /// Credit update: `buf_alloc` and `fwd_cnt` updated, no payload.
    CreditUpdate  = 6,
    /// Request a credit update from the peer.
    CreditRequest = 7,
}
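The `VsockOp::try_from` conversion referenced above can be sketched with a fallible match. A local copy of the enum keeps the sketch self-contained; in this sketch, out-of-range opcodes map to an error rather than `Invalid`, so a malformed packet cannot masquerade as a defined operation.

```rust
#[repr(u16)]
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum VsockOp {
    Invalid       = 0,
    Request       = 1,
    Response      = 2,
    Rst           = 3,
    Shutdown      = 4,
    Rw            = 5,
    CreditUpdate  = 6,
    CreditRequest = 7,
}

impl TryFrom<u16> for VsockOp {
    type Error = ();
    fn try_from(v: u16) -> Result<Self, ()> {
        Ok(match v {
            0 => VsockOp::Invalid,
            1 => VsockOp::Request,
            2 => VsockOp::Response,
            3 => VsockOp::Rst,
            4 => VsockOp::Shutdown,
            5 => VsockOp::Rw,
            6 => VsockOp::CreditUpdate,
            7 => VsockOp::CreditRequest,
            // Undefined opcode: reject the packet.
            _ => return Err(()),
        })
    }
}
```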

The host-side vhost_vsock runs as a Tier 1 umka-net thread bound to a specific VM's CID. It processes the virtio vhost ring (Section 11.3) and demultiplexes incoming packets to the correct VsockSock by (dst_cid, dst_port).

16.24.4 VsockSock Struct

/// Minimum segment size for vsock credit accounting (bytes).
/// A single `NetBuf` always carries at least this many payload bytes.
pub const MIN_SEGMENT_SIZE: u32 = 64;

/// Maximum number of `NetBuf` segments that can be queued in `recv_queue`.
/// Derived from the advertised receive buffer: `local_buf_alloc / MIN_SEGMENT_SIZE`.
/// Credit-based flow control prevents more than `local_buf_alloc` bytes from
/// being in flight, so `local_buf_alloc / MIN_SEGMENT_SIZE` is an upper bound
/// on the number of segments. Default `local_buf_alloc` = 256 KiB → 4096 entries.
pub const VSOCK_RECV_QUEUE_CAP: usize = 4096;

/// A virtual socket instance.
///
/// Mutable per-connection state is protected by `lock: SpinLock<VsockMutableState>`.
/// Transport methods take `&VsockSock` and acquire `lock` internally, matching
/// the `TcpCb` pattern. Immutable fields (addresses, transport pointer) are
/// outside the lock for lock-free read access on the poll path.
pub struct VsockSock {
    /// Local CID and port. Immutable after bind/connect.
    pub local_addr: SockAddrVm,
    /// Remote CID and port. Immutable after connect/accept (set once).
    pub remote_addr: SockAddrVm,
    /// Our receive buffer allocation (advertised to peer in buf_alloc field).
    /// Set at socket creation, immutable thereafter.
    pub local_buf_alloc: u32,
    /// Active transport. Set at socket creation, immutable thereafter.
    pub transport: &'static dyn VsockTransport,
    /// Wait queue for blocked send/recv operations.
    pub waitq: WaitQueue,
    /// Per-socket lock protecting all mutable connection state.
    /// Acquired by transport methods (`connect`, `send`, `recv`, `disconnect`)
    /// and by the vhost_vsock RX demux path.
    pub lock: SpinLock<VsockMutableState>,
}

/// Mutable state for a vsock connection, protected by `VsockSock.lock`.
///
/// All fields that can change after connection setup live here. Transport
/// methods acquire the `SpinLock` to access these fields, providing safe
/// interior mutability without requiring `&mut VsockSock`.
pub struct VsockMutableState {
    /// Connection state.
    pub state: VsockState,
    /// TX buffer: data waiting to be sent, bounded by the peer's credit.
    /// Ring capacity is negotiated at connect time (default: 256 KiB).
    pub send_buf: RingBuffer<u8>,
    /// RX queue: received `NetBuf` segments waiting for `recv()`.
    ///
    /// Bounded ring buffer with capacity `local_buf_alloc / MIN_SEGMENT_SIZE`.
    /// Credit-based flow control guarantees the peer cannot send more than
    /// `local_buf_alloc` bytes before receiving a `CREDIT_UPDATE`, so the
    /// ring can never overflow under correct protocol operation.
    pub recv_queue: BoundedRing<NetBuf, VSOCK_RECV_QUEUE_CAP>,
    /// Bytes peer has allocated for receiving from us (updated on CREDIT_UPDATE).
    ///
    /// **u32 wrap-safety (applies to credit_peer_buf_alloc, credit_peer_fwd_cnt,
    /// bytes_sent, local_fwd_cnt)**: These fields are u32 to match the Linux
    /// `virtio_vsock_hdr` wire format (`struct virtio_vsock_hdr` fields
    /// `buf_alloc`, `fwd_cnt` are `__le32`). Changing to u64 would break
    /// wire compatibility with existing VMs.
    ///
    /// **Wrapping arithmetic is correct by design**: The send_window computation
    /// `send_window = credit_peer_buf_alloc - (bytes_sent - credit_peer_fwd_cnt)`
    /// uses wrapping u32 subtraction. The inner subtraction produces the correct
    /// in-flight byte count modulo 2^32 as long as `credit_peer_buf_alloc <
    /// u32::MAX` (always true: practical buffer sizes are <=256 KiB). This is
    /// the same wrapping pattern as Linux's `virtio_transport_get_credit()`.
    ///
    /// An implementing agent must NOT "fix" these to u64 — the wrapping is
    /// intentional and the u32 size is mandated by the virtio-vsock wire format.
    pub credit_peer_buf_alloc: u32,
    /// Bytes peer has consumed from us (fwd_cnt from peer's last CREDIT_UPDATE).
    /// Wire format mandated u32. See `credit_peer_buf_alloc` for wrap-safety.
    pub credit_peer_fwd_cnt: u32,
    /// Bytes we have sent to the peer (tracked locally).
    /// Wire format mandated u32. See `credit_peer_buf_alloc` for wrap-safety.
    pub bytes_sent: u32,
    /// Bytes we have consumed from our receive buffer (reported as fwd_cnt).
    /// Wire format mandated u32. See `credit_peer_buf_alloc` for wrap-safety.
    pub local_fwd_cnt: u32,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum VsockState {
    Unconnected,
    Connecting,
    Connected,
    Disconnecting,
    /// Socket has been shut down; no more sends are possible but RX may still arrive.
    Shutdown,
}

16.24.5 Flow Control

AF_VSOCK uses credit-based flow control equivalent to UmkaOS's ring buffer token model (Section 3.6):

Send permission: The sender tracks:

send_window = credit_peer_buf_alloc - (bytes_sent - credit_peer_fwd_cnt)

The sender may transmit at most send_window additional bytes before the peer must issue a CREDIT_UPDATE. If send_window == 0, the sender blocks (or returns EAGAIN for O_NONBLOCK sockets) until a CREDIT_UPDATE packet arrives.

Receive credit replenishment: After delivering N bytes of received data to userspace via recv(), the kernel sends a CREDIT_UPDATE packet with the updated fwd_cnt value, replenishing the peer's send window. The credit update is coalesced: it is sent when local_fwd_cnt increases by at least local_buf_alloc / 2 (50% watermark), preventing a credit update storm on small reads.

Initial negotiation: The REQUEST packet carries the local buf_alloc. The RESPONSE packet carries the peer's buf_alloc. Both sides initialise credit_peer_buf_alloc from the received value before data transfer begins.
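The wrapping send-window computation can be sketched with plain u32 values standing in for the `VsockMutableState` fields. The inner subtraction wraps mod 2^32, mirroring the pattern noted in `credit_peer_buf_alloc`'s doc comment; clamping the result at zero is a defensive choice in this sketch.

```rust
/// send_window = peer_buf_alloc - (bytes_sent - peer_fwd_cnt), with
/// wrapping u32 arithmetic. Correct even after bytes_sent wraps, as long
/// as peer_buf_alloc is far below u32::MAX (true for <=256 KiB buffers).
pub fn send_window(peer_buf_alloc: u32, bytes_sent: u32, peer_fwd_cnt: u32) -> u32 {
    // In-flight bytes, correct modulo 2^32.
    let in_flight = bytes_sent.wrapping_sub(peer_fwd_cnt);
    // Clamp at zero rather than underflow if the peer misbehaves.
    peer_buf_alloc.saturating_sub(in_flight)
}
```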

16.24.6 Integration with KVM

When a VM is created (Section 18.1):

/// Kernel-global CID allocator for vsock.
/// CIDs 0-2 are reserved (HYPERVISOR, LOCAL, HOST).
/// VMs receive CIDs starting from 3.
///
/// Uses `Idr` (built on XArray, §3.1.13): O(1) amortized allocation of the
/// lowest-available CID. The `Idr` internally tracks which indices are
/// occupied and finds the first empty slot without linear scanning.
/// Each allocated slot stores `()` — the CID itself is the index.
static VSOCK_CID_ALLOCATOR: SpinLock<Idr<()>> = SpinLock::new(Idr::new());

/// Allocate a CID for a new VM.
///
/// Returns a CID in the range [3, 0xFFFF_FFFE].
/// O(1) amortized — the Idr (XArray-based) maintains a next-free cursor
/// internally. No linear scan over allocated CIDs.
/// Returns `KernelError::ResourceExhausted` if no CIDs are available
/// (unlikely: ~4 billion VMs).
pub fn vsock_alloc_cid() -> Result<u32, KernelError> {
    let mut idr = VSOCK_CID_ALLOCATOR.lock();
    let cid = idr.alloc_range((), 3..=0xFFFF_FFFE)
        .ok_or(KernelError::ResourceExhausted)? as u32;
    Ok(cid)
}

/// Release a CID when the VM is destroyed.
///
/// Resets all sockets using this CID (sends RST to any connected peer
/// sockets) before returning the CID to the pool.
pub fn vsock_free_cid(cid: u32) {
    // Walk the global socket table; RST all sockets with local_cid == cid
    // or remote_cid == cid.
    vsock_reset_all_for_cid(cid);
    VSOCK_CID_ALLOCATOR.lock().remove(cid as u64);
}

The vhost_vsock file descriptor is created in the host context and linked to the VM struct via the KVM device model. When KVM_CREATE_VM is issued, the KVM subsystem calls vsock_alloc_cid() and stores the result in the VM struct. When the VM is destroyed (last kvm_fd closed), vsock_free_cid() is called from the VM teardown path.

16.24.7 sysfs Interface

/sys/kernel/umka/vsock/
|-- local_cid       (r--): This context's CID (guest: assigned CID; host: VMADDR_CID_HOST = 2)
|-- transport       (r--): Active transport name (e.g., "virtio-vsock", "loopback")
`-- connections/    (r--): Per-connection state (one subdir per active socket pair)
    `-- <local_cid>:<local_port>:<remote_cid>:<remote_port>/
        |-- state           (r--): VsockState as string
        |-- send_window     (r--): Current send window in bytes
        `-- recv_queued     (r--): Bytes queued in recv_queue

local_cid is written once at transport init and is read-only thereafter. The connections/ subtree uses the existing umka sysfs dynamic-attribute model (Section 20.5), with one entry per connected VsockSock. Entries appear on RESPONSE (connect) and disappear on RST or Shutdown completion.


16.25 AF_PACKET Raw Socket

AF_PACKET provides raw access to network device frames at the link layer (L2). It is required by tcpdump, wireshark, dhcpd, hostapd, DPDK fallback, and any tool that needs L2 packet capture or injection. UmkaOS implements the full Linux AF_PACKET ABI including PACKET_MMAP V3 for zero-copy ring-based operation.

Linux parallel: Linux implements AF_PACKET in net/packet/af_packet.c. UmkaOS provides the same sockaddr_ll ABI, TPACKET ring layout, and fanout semantics so that unmodified libpcap, tcpdump, wireshark, and DHCP servers work without recompilation.

16.25.1 Socket Creation

socket(AF_PACKET, type, htons(protocol)) creates a packet socket.

Type Semantics Use case
SOCK_RAW Full L2 frame including Ethernet header delivered to userspace; TX requires userspace to supply the complete Ethernet header tcpdump, wireshark, custom protocol stacks
SOCK_DGRAM L2 header stripped on RX (cooked mode); kernel prepends L2 header on TX based on sll_addr DHCP client/server, ARP tools, hostapd

The protocol argument selects which ethertypes the socket receives:

  - ETH_P_ALL (0x0003): receive all frames (promiscuous capture).
  - Specific ethertype (e.g., ETH_P_IP = 0x0800, ETH_P_ARP = 0x0806): filter at the protocol demux level before delivery to the socket.
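The demux decision can be sketched as a one-line predicate (values shown in host byte order for clarity; the socket() call passes them through htons()):

```rust
/// "Receive all frames" wildcard protocol.
pub const ETH_P_ALL: u16 = 0x0003;

/// Does a packet socket bound to `bound_protocol` receive a frame with
/// the given ethertype? ETH_P_ALL matches everything; otherwise the
/// bound protocol must equal the frame's ethertype.
pub fn socket_accepts(bound_protocol: u16, frame_ethertype: u16) -> bool {
    bound_protocol == ETH_P_ALL || bound_protocol == frame_ethertype
}
```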

Capability requirement: Creating an AF_PACKET socket requires CAP_NET_RAW in the caller's network namespace (Section 16.3). The check uses ns_capable(task.nsproxy.net_ns.user_ns, CAP_NET_RAW) — a container root with CAP_NET_RAW can capture packets on interfaces in its own network namespace but cannot observe host-namespace traffic.

16.25.2 Address Format

/// Link-layer socket address (matches Linux struct sockaddr_ll, 20 bytes).
///
/// Used with `bind()`, `sendto()`, and `recvfrom()` on AF_PACKET sockets.
/// On receive, the kernel fills all fields to describe the incoming frame.
/// On send (SOCK_DGRAM), the kernel uses `sll_addr` and `sll_halen` to
/// construct the L2 header; on send (SOCK_RAW), userspace provides the
/// complete L2 header and `sll_addr` is informational only.
#[repr(C)]
pub struct SockAddrLl {
    /// Address family: AF_PACKET (17).
    pub sll_family: u16,
    /// Ethertype in network byte order (e.g., 0x0800 for IPv4).
    /// Set by the kernel on RX. On TX, must match the socket's bound protocol
    /// or be zero (use the socket's protocol).
    pub sll_protocol: u16,
    /// Interface index (0 = any interface in the namespace).
    /// `bind()` with a specific ifindex restricts capture to that interface.
    /// Validated against `sock.net_ns.interfaces` at bind time.
    pub sll_ifindex: i32,
    /// ARP hardware type (ARPHRD_ETHER = 1 for Ethernet, ARPHRD_LOOPBACK = 772).
    pub sll_hatype: u16,
    /// Packet type classification (set by kernel on RX).
    /// PACKET_HOST (0): addressed to this host.
    /// PACKET_BROADCAST (1): link-layer broadcast.
    /// PACKET_MULTICAST (2): link-layer multicast.
    /// PACKET_OTHERHOST (3): addressed to another host (promiscuous capture).
    /// PACKET_OUTGOING (4): originated from this host (captured on TX path).
    pub sll_pkttype: u8,
    /// Hardware address length (6 for Ethernet MAC).
    pub sll_halen: u8,
    /// Hardware address, zero-padded to 8 bytes.
    /// For Ethernet: 6-byte MAC followed by 2 zero bytes.
    pub sll_addr: [u8; 8],
}
const_assert!(size_of::<SockAddrLl>() == 20);

16.25.3 PacketSocket Internal State

/// Per-CPU packet socket statistics.
/// All counters are u64 to satisfy the 50-year uptime requirement
/// (at 10 Mpps, a u32 wraps in ~430 seconds; u64 lasts ~58,000 years).
#[repr(C)]
pub struct PacketStats {
    /// Total packets delivered to userspace (ring or recv buffer).
    pub tp_packets: AtomicU64,
    /// Packets dropped due to full ring or recv buffer.
    pub tp_drops: AtomicU64,
    /// Packets that passed the BPF filter (if attached).
    pub tp_filter_passed: AtomicU64,
    /// Freeze count: number of times the ring was full when a packet arrived.
    pub tp_freeze_q_cnt: AtomicU64,
}
// kernel-internal, not KABI — uses AtomicU64 (UAPI wire format is tpacket_stats with u32).
const_assert!(core::mem::size_of::<PacketStats>() == 32);

// **Stats aggregation for `PACKET_STATISTICS` getsockopt**: Counters are
// per-socket AtomicU64 values (not per-CPU). `getsockopt(SOL_PACKET,
// PACKET_STATISTICS)` reads the counters with `Relaxed` ordering and
// atomically resets them to zero (matching Linux's "read and clear"
// semantics). The Linux wire format uses `struct tpacket_stats` with u32
// fields; the kernel truncates the internal u64 counters to u32 for the
// ABI response. No per-CPU split is needed because packet socket delivery
// is serialized by the socket's spinlock (recv_queue lock or ring lock).
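The read-and-clear semantics can be sketched per counter: swap the internal u64 to zero atomically, then truncate to the u32 `tpacket_stats` wire field. The function name is illustrative.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Atomically read a stats counter, reset it to zero, and truncate to the
/// u32 tpacket_stats ABI field (Relaxed ordering: stats are advisory).
pub fn read_and_clear(counter: &AtomicU64) -> u32 {
    counter.swap(0, Ordering::Relaxed) as u32
}
```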

/// Maximum multicast group subscriptions per packet socket.
/// Enforced at the `PACKET_ADD_MEMBERSHIP` setsockopt handler (warm-path).
/// 256 covers all practical multicast scenarios (IPTV, financial multicast).
/// Linux has no explicit cap (linked list), but 256 prevents resource exhaustion
/// from a malicious socket accumulating unbounded subscriptions.
pub const MAX_PACKET_MCLIST: usize = 256;

/// Multicast subscription entry for a packet socket.
/// Used by PACKET_ADD_MEMBERSHIP / PACKET_DROP_MEMBERSHIP.
#[repr(C)]
pub struct PacketMclist {
    /// Interface index.
    pub ifindex: i32,
    /// Multicast type (PACKET_MR_MULTICAST, PACKET_MR_PROMISC, PACKET_MR_ALLMULTI).
    pub mr_type: u16,
    /// Address length (6 for Ethernet).
    pub mr_alen: u16,
    /// Multicast address.
    pub mr_address: [u8; 8],
}
// UAPI ABI: ifindex(4)+mr_type(2)+mr_alen(2)+mr_address(8) = 16 bytes.
const_assert!(size_of::<PacketMclist>() == 16);

/// TPACKET version selector.
#[repr(u32)]
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum TpacketVersion {
    /// TPACKET_V1: fixed-size frames, per-frame status word.
    V1 = 0,
    /// TPACKET_V2: adds VLAN info and 64-bit timestamp.
    V2 = 1,
    /// TPACKET_V3: block-based layout with variable-length frames and
    /// retire-on-timeout. Preferred for high-speed capture.
    V3 = 2,
}

/// Internal packet socket state.
///
/// Implements `SocketOps` ([Section 16.3](#socket-abstraction)).
/// Allocated from the per-CPU slab cache at socket(AF_PACKET, ...) time.
pub struct PacketSocket {
    /// Protocol-agnostic common state (net_ns, cred, flags).
    pub common: SockCommon,
    /// Bound ethertype filter (network byte order). Set at socket creation
    /// from the `protocol` argument; refined by `bind()`.
    pub protocol: u16,
    /// Bound interface index (0 = all interfaces in namespace).
    /// Set by `bind()` with a `SockAddrLl`.
    pub ifindex: i32,
    /// Fanout group membership (None if not part of a group).
    pub fanout: Option<Arc<FanoutGroup>>,
    /// PACKET_MMAP RX ring (None if mmap not configured).
    pub rx_ring: Option<PacketMmapRing>,
    /// PACKET_MMAP TX ring (None if mmap not configured).
    pub tx_ring: Option<PacketMmapRing>,
    /// Active TPACKET version (V1, V2, or V3). Set via PACKET_VERSION
    /// setsockopt before ring allocation. Immutable after mmap.
    pub tp_version: TpacketVersion,
    /// PACKET_AUXDATA: when true, deliver auxiliary metadata (VLAN tag,
    /// original length, hash) via CMSG on recvmsg().
    /// `AtomicBool` because `setsockopt(PACKET_AUXDATA)` (syscall, process
    /// context) races with the RX delivery path (softirq context) reading
    /// this flag. `Relaxed` ordering suffices: the exact value is advisory
    /// and a stale read causes at most one packet with/without metadata.
    pub auxdata: AtomicBool,
    /// PACKET_ORIGDEV: when true, report the original ingress device
    /// index in the ancillary data (before bridge/bond forwarding).
    /// `AtomicBool` for the same concurrent-access reason as `auxdata`.
    pub origdev: AtomicBool,
    /// PACKET_QDISC_BYPASS: when true, TX bypasses the traffic control
    /// layer and sends directly to the NIC driver's TX ring.
    /// `AtomicBool` because `setsockopt(PACKET_QDISC_BYPASS)` (syscall)
    /// races with TX path reads (possibly different CPU).
    pub qdisc_bypass: AtomicBool,
    /// Packet statistics (tp_packets, tp_drops, etc.).
    pub stats: PacketStats,
    /// Attached BPF filter program (SO_ATTACH_FILTER or SO_ATTACH_BPF).
    /// Evaluated per-packet before ring delivery; packets failing the
    /// filter are dropped without consuming ring space.
    pub bpf_filter: Option<Arc<BpfProg>>,
    /// Multicast group subscriptions (PACKET_ADD_MEMBERSHIP).
    /// Bounded to `MAX_PACKET_MCLIST` entries (256) enforced at the
    /// `PACKET_ADD_MEMBERSHIP` setsockopt handler. Linux has no hard
    /// limit but uses a linked list; UmkaOS uses a bounded `Vec` for
    /// cache locality on the teardown path (each entry requires an
    /// interface-level multicast leave on socket close). The
    /// `PACKET_ADD_MEMBERSHIP` handler is warm-path (not per-packet),
    /// so bounded heap allocation is acceptable. Typical use (tcpdump,
    /// DHCP, LLDP) requires 1-4 entries; IPTV/multicast receivers may
    /// require hundreds.
    pub mclist: Vec<PacketMclist>,
    /// Wait queue for blocked recv/poll operations.
    pub waitq: WaitQueue,
    /// Socket-level receive buffer for non-mmap mode.
    /// Bounded by `common.rcvbuf`. Packets exceeding this limit are
    /// counted in `stats.tp_drops`.
    pub recv_queue: SpinLock<BoundedRing<NetBufHandle, 4096>>,
}

16.25.4 PACKET_MMAP V3 — Zero-Copy Ring Buffer

PACKET_MMAP provides a shared-memory ring buffer between kernel and userspace, eliminating the per-packet recvfrom() syscall overhead. V3 (block-based layout) is the preferred mode for high-speed capture.

16.25.4.1 Ring Configuration

/// TPACKET_V3 ring request (matches Linux struct tpacket_req3, 28 bytes).
/// Passed to `setsockopt(SOL_PACKET, PACKET_RX_RING)` or `PACKET_TX_RING`.
#[repr(C)]
pub struct TpacketReq3 {
    /// Size of each block in bytes (must be a power of two, page-aligned).
    pub tp_block_size: u32,
    /// Number of blocks in the ring (total ring size = block_size × block_nr).
    pub tp_block_nr: u32,
    /// Size of each frame within a block (must divide evenly into block_size
    /// minus the block header). V3 allows variable-length frames packed
    /// within a block, so this is the maximum frame size.
    pub tp_frame_size: u32,
    /// Total number of frames across all blocks (informational for V3;
    /// the kernel validates that block_nr × frames_per_block == frame_nr).
    pub tp_frame_nr: u32,
    /// Block retire timeout in milliseconds. If a partially-filled block
    /// has not received a new packet within this interval, it is retired
    /// to userspace. 0 = no timeout (block retired only when full).
    /// Typical: 10-100 ms for interactive capture, 0 for bulk capture.
    pub tp_retire_blk_tov: u32,
    /// Private data area size per block (for userspace-defined metadata).
    pub tp_sizeof_priv: u32,
    /// Feature request flags (TP_FT_REQ_FILL_RXHASH: populate rxhash field).
    pub tp_feature_req_word: u32,
}
const_assert!(size_of::<TpacketReq3>() == 28);
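A ring request must satisfy several geometric constraints before the kernel allocates blocks. The sketch below shows the shape of that validation; the names are illustrative, and it is simplified relative to the full V3 rules (which also account for the block descriptor and private data area):

```rust
/// Illustrative geometry derived from a TpacketReq3 (not the kernel struct).
pub struct RingGeometry {
    pub frames_per_block: u32,
    pub total_frames: u32,
}

/// Simplified validation sketch for PACKET_RX_RING parameters.
pub fn validate_req3(
    block_size: u32,
    block_nr: u32,
    frame_size: u32,
    frame_nr: u32,
) -> Result<RingGeometry, &'static str> {
    const PAGE_SIZE: u32 = 4096;
    const TPACKET_ALIGNMENT: u32 = 16;
    // Block size: power of two, page-aligned (per the field doc above).
    if !block_size.is_power_of_two() || block_size % PAGE_SIZE != 0 {
        return Err("tp_block_size must be a page-aligned power of two");
    }
    // Frame size: aligned to TPACKET_ALIGNMENT, and must fit in a block.
    if frame_size < TPACKET_ALIGNMENT || frame_size % TPACKET_ALIGNMENT != 0 {
        return Err("tp_frame_size must be a multiple of TPACKET_ALIGNMENT");
    }
    if frame_size > block_size {
        return Err("frame larger than block");
    }
    // Cross-check the caller-supplied total frame count.
    let frames_per_block = block_size / frame_size;
    let total = frames_per_block
        .checked_mul(block_nr)
        .ok_or("frame count overflow")?;
    if total != frame_nr {
        return Err("tp_frame_nr inconsistent with block geometry");
    }
    Ok(RingGeometry { frames_per_block, total_frames: total })
}
```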

16.25.4.2 Block and Frame Layout

/// TPACKET_V3 block descriptor (matches Linux struct tpacket_block_desc).
/// One per block in the mmap'd ring. The kernel writes block metadata
/// here; userspace reads it to find frames within the block.
#[repr(C)]
pub struct TpacketBlockDesc {
    /// Block version (1 for current TPACKET_V3).
    pub version: u32,
    /// Offset to private data area within the block (bytes from block start).
    pub offset_to_priv: u32,
    /// Block status and metadata (union — V1 header for current version).
    pub hdr: TpacketBdHeader,
}
// UAPI ABI: version(4)+offset_to_priv(4)+hdr(40) = 48 bytes.
const_assert!(size_of::<TpacketBlockDesc>() == 48);

/// Block header V1 (within TpacketBlockDesc, matches Linux struct tpacket_hdr_v1).
/// This is the UAPI struct shared with userspace — field types must match Linux
/// exactly. All fields are host-endian (same-machine shared memory, not cross-node).
#[repr(C)]
pub struct TpacketBdHeader {
    /// Block status (TP_STATUS_KERNEL or TP_STATUS_USER).
    /// The kernel sets TP_STATUS_USER when the block is retired;
    /// userspace sets TP_STATUS_KERNEL after processing all frames.
    pub block_status: AtomicU32,
    /// Number of packets in this block.
    pub num_pkts: u32,
    /// Offset from block start to the first packet in this block (bytes).
    pub offset_to_first_pkt: u32,
    /// Used length of this block in bytes (header + all packet data).
    pub blk_len: u32,
    /// Block sequence number (monotonically increasing per ring, u64 for
    /// 50-year uptime — at 10M blocks/sec, wraps in ~58K years).
    pub seq_num: u64,
    /// Timestamp of the first packet in the block.
    /// Matches Linux `struct tpacket_bd_ts` (two u32 fields, not u64).
    pub ts_first_pkt_sec: u32,
    pub ts_first_pkt_nsec: u32,
    /// Timestamp of the last packet in the block.
    pub ts_last_pkt_sec: u32,
    pub ts_last_pkt_nsec: u32,
}
const_assert!(size_of::<TpacketBdHeader>() == 40);
const_assert!(offset_of!(TpacketBdHeader, seq_num) == 16);

/// Per-frame header within a V3 block (variable-length, packed).
/// Frames are laid out consecutively within a block after the block
/// descriptor, each aligned to TPACKET_ALIGNMENT (16 bytes).
#[repr(C)]
pub struct Tpacket3Hdr {
    /// Offset to next frame (0 if last frame in block).
    pub tp_next_offset: u32,
    /// Seconds since epoch (packet timestamp).
    pub tp_sec: u32,
    /// Nanoseconds within the second.
    pub tp_nsec: u32,
    /// Captured length (bytes actually stored).
    pub tp_snaplen: u32,
    /// Original packet length on the wire.
    pub tp_len: u32,
    /// Frame status flags (TP_STATUS_COPY, TP_STATUS_LOSING, etc.).
    pub tp_status: u32,
    /// MAC header offset from the start of this frame header.
    pub tp_mac: u16,
    /// Network header offset from the start of this frame header.
    pub tp_net: u16,
    /// Union: VLAN TCI and TPID (when PACKET_AUXDATA or TP_STATUS_VLAN_VALID).
    pub hv1: TpacketHdrVariant1,
    /// Padding to TPACKET_ALIGNMENT boundary (16 bytes).
    pub tp_padding: [u8; 8],
}

/// VLAN metadata within the frame header.
#[repr(C)]
pub struct TpacketHdrVariant1 {
    /// RX hash (flow hash from NIC or software).
    pub tp_rxhash: u32,
    /// VLAN Tag Control Information (TCI).
    pub tp_vlan_tci: u32,
    /// VLAN Tag Protocol Identifier (TPID, e.g., 0x8100 for 802.1Q).
    pub tp_vlan_tpid: u16,
    /// Padding.
    pub tp_padding: u16,
}
const_assert!(size_of::<TpacketHdrVariant1>() == 12);
const_assert!(size_of::<Tpacket3Hdr>() == 48);
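Userspace consumes a retired block by following the `tp_next_offset` chain from the first frame. The sketch below models that walk over a synthetic block in which each frame is reduced to its 4-byte next-offset field; `demo_block` and `walk_frames` are illustrative helpers, not part of the ABI:

```rust
/// Read a little-endian u32 at a byte offset within the block.
fn read_u32(block: &[u8], off: usize) -> u32 {
    u32::from_le_bytes([block[off], block[off + 1], block[off + 2], block[off + 3]])
}

/// Collect the byte offset of every frame in a retired block, starting
/// at `first_off` and following tp_next_offset links until a frame with
/// tp_next_offset == 0 (last frame) or num_pkts frames have been seen.
pub fn walk_frames(block: &[u8], first_off: usize, num_pkts: u32) -> Vec<usize> {
    let mut offsets = Vec::new();
    let mut off = first_off;
    for _ in 0..num_pkts {
        offsets.push(off);
        let next = read_u32(block, off);
        if next == 0 {
            break; // last frame in block
        }
        off += next as usize; // tp_next_offset is relative to this frame
    }
    offsets
}

/// Build a tiny synthetic block with three frames at offsets 0, 16, 40.
pub fn demo_block() -> Vec<u8> {
    let mut b = vec![0u8; 64];
    b[0..4].copy_from_slice(&16u32.to_le_bytes());
    b[16..20].copy_from_slice(&24u32.to_le_bytes());
    // The frame at offset 40 keeps next_offset == 0 (last frame).
    b
}
```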

16.25.4.3 Ring State Machine

RX ring (kernel → userspace):

  ┌──────────────┐    packet arrives    ┌──────────────┐
  │ KERNEL-owned │ ──────────────────→  │  Filling...  │
  │ block_status │    (kernel writes    │   (kernel    │
  │ = TP_STATUS_ │     frame + hdr)     │   appends)   │
  │   KERNEL     │                      └──────┬───────┘
  └──────────────┘                             │
        ↑                         block full OR retire timeout
        │                                      │
        │                                      v
        │    userspace sets              ┌──────────────┐
        │    TP_STATUS_KERNEL            │  USER-owned  │
        └────────────────────────────────│ block_status │
                                         │ = TP_STATUS_ │
                                         │     USER     │
                                         └──────────────┘
                                         userspace reads
                                         all frames in
                                         block, then
                                         releases block

TX ring (userspace → kernel):

  1. Userspace writes frame data + sets TP_STATUS_SEND_REQUEST
  2. Userspace calls sendto(fd, NULL, 0, MSG_DONTWAIT, NULL, 0)
  3. Kernel reads frames from TX ring, transmits via NIC driver
  4. Kernel sets TP_STATUS_AVAILABLE on transmitted frames

The mmap region is established via:

mmap(NULL, ring_size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0)
where ring_size = tp_block_size × tp_block_nr. The RX ring occupies offset 0; the TX ring (if configured) occupies the region immediately after the RX ring.
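The block_status handoff above is a two-party ownership protocol over a shared atomic word. A minimal sketch, assuming the Release/Acquire pairing described for `TpacketBdHeader.block_status` (function names are ours, not the kernel's):

```rust
use std::sync::atomic::{AtomicU32, Ordering};

pub const TP_STATUS_KERNEL: u32 = 0;
pub const TP_STATUS_USER: u32 = 1;

/// Kernel side: retire a block to userspace. The Release store publishes
/// all frame writes performed before the ownership transfer.
pub fn retire_block(status: &AtomicU32) {
    status.store(TP_STATUS_USER, Ordering::Release);
}

/// Userspace side: check ownership. The Acquire load pairs with the
/// kernel's Release store, making the block's frames visible.
pub fn block_ready(status: &AtomicU32) -> bool {
    status.load(Ordering::Acquire) == TP_STATUS_USER
}

/// Userspace side: hand the block back to the kernel after processing.
pub fn release_block(status: &AtomicU32) {
    status.store(TP_STATUS_KERNEL, Ordering::Release);
}

/// Single-threaded demonstration of the full ownership round trip:
/// returns block readiness before retire, after retire, after release.
pub fn demo_round_trip() -> (bool, bool, bool) {
    let status = AtomicU32::new(TP_STATUS_KERNEL);
    let before = block_ready(&status);
    retire_block(&status);
    let after_retire = block_ready(&status);
    release_block(&status);
    (before, after_retire, block_ready(&status))
}
```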

16.25.4.4 Internal Ring Management

/// Kernel-side PACKET_MMAP ring state.
///
/// The ring pages are allocated via the page allocator (order-N pages
/// for block_size alignment), pinned in physical memory, and mapped
/// into the userspace process's address space via `vm_insert_page()`.
pub struct PacketMmapRing {
    /// Pointer to the contiguous mmap'd region (kernel virtual address).
    ///
    /// **Ownership**: Allocated at ring setup (`setsockopt(PACKET_RX_RING)`)
    /// via order-N page allocation, pinned in physical memory, and mapped
    /// into the userspace process's address space via `vm_insert_page()`.
    /// Freed at ring teardown (`close()` or `setsockopt` with size=0).
    /// The pointer is valid for the lifetime of the `PacketMmapRing`.
    ///
    /// `AtomicPtr` with `Release`/`Acquire` ordering ensures that concurrent
    /// readers (e.g., the retire timer callback) see a consistent null after
    /// ring teardown, preventing use-after-free.
    pub pg_vec: AtomicPtr<u8>,
    /// Total size of the mapped region in bytes.
    pub pg_vec_len: usize,
    /// Number of blocks.
    pub block_nr: u32,
    /// Size of each block in bytes.
    pub block_size: u32,
    /// Maximum frame size within a block.
    pub frame_size: u32,
    /// Current block index (kernel write cursor for RX; read cursor for TX).
    ///
    /// `AtomicU32` because the NAPI RX delivery path (softirq) and the block
    /// retire timer callback may access this concurrently. The retire timer
    /// fires when a partially-filled block has been idle; if it fires
    /// concurrently with NAPI delivery to the same block, both update the
    /// cursor. NAPI delivery and the retire timer for the same RX queue
    /// typically run on the same CPU (NAPI guarantees per-queue affinity),
    /// but the timer may migrate to another CPU under load balancing.
    /// `AtomicU32` with `Relaxed` ordering suffices: the cursor is
    /// monotonically increasing (modulo block_nr) and both paths advance it
    /// in the same direction.
    pub current_block: AtomicU32,
    /// Block retire timeout (nanoseconds, converted from ms at ring setup).
    pub retire_blk_tov_ns: u64,
    /// Timer for block retirement (fires when a partially-filled block
    /// has been idle for retire_blk_tov_ns).
    pub retire_timer: Option<TimerHandle>,
}

16.25.5 PACKET_FANOUT — Socket Load Distribution

PACKET_FANOUT distributes incoming packets across multiple sockets that share the same (ifindex, protocol) binding. This enables multi-threaded capture (one socket per thread) and multi-process load balancing.

/// Fanout group. All packet sockets joined to the same group share
/// incoming traffic according to the selected algorithm.
///
/// Indexed by fanout group ID (u16) in a global XArray. The XArray
/// key is the group ID; each entry holds an `Arc<FanoutGroup>`.
/// Integer-keyed lookup → XArray is the correct collection
/// ([Section 3.13](03-concurrency.md#collection-usage-policy)).
pub struct FanoutGroup {
    /// Group ID (0..65535). Set by the first socket's PACKET_FANOUT setsockopt.
    pub id: u16,
    /// Fanout algorithm.
    pub mode: FanoutMode,
    /// Member sockets. Fixed capacity of `PACKET_FANOUT_MAX` (256),
    /// matching Linux's compile-time constant in `net/packet/internal.h`.
    /// Linux does NOT expose a setsockopt to configure this limit.
    ///
    /// ArrayVec (not Vec): fixed-size, no heap allocation per join/leave.
    /// Protected by a SpinLock because membership changes are rare
    /// (socket open/close) while packet delivery on the RX path reads
    /// the member list under RCU.
    pub members: SpinLock<ArrayVec<Arc<PacketSocket>, PACKET_FANOUT_MAX>>,
    /// RCU-protected snapshot of the member list for the hot delivery
    /// path. Updated (copy-on-write) under `members` lock, read under
    /// `rcu_read_lock()` by the packet delivery path. `Box<[...]>` is
    /// allocated at each membership change (cold-path clone-and-swap).
    pub members_rcu: RcuCell<Box<[Arc<PacketSocket>]>>,
    /// Number of active members (for round-robin index).
    pub num_members: AtomicU32,
    /// Round-robin counter (for FANOUT_LB).
    pub rr_counter: AtomicU64,
    /// Bound interface index.
    pub ifindex: i32,
    /// Bound protocol.
    pub protocol: u16,
    /// BPF program for FANOUT_CBPF / FANOUT_EBPF modes.
    pub bpf_prog: Option<Arc<BpfProg>>,
}

/// Maximum sockets in a single fanout group. Matches Linux's compile-time
/// constant `PACKET_FANOUT_MAX = 256` in `net/packet/internal.h`.
/// This is NOT configurable at runtime (Linux has no setsockopt for it).
pub const PACKET_FANOUT_MAX: usize = 256;

/// Global fanout group registry. Keyed by group ID (u16).
/// XArray for integer-keyed O(1) lookup.
static FANOUT_GROUPS: SpinLock<XArray<Arc<FanoutGroup>>> =
    SpinLock::new(XArray::new());

/// Fanout distribution algorithm.
#[repr(u16)]
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum FanoutMode {
    /// Flow hash: symmetric hash of (src_ip, dst_ip, src_port, dst_port,
    /// protocol). Same flow always goes to the same socket.
    Hash      = 0,
    /// Round-robin: rotate through sockets sequentially.
    Lb        = 1,
    /// CPU affinity: packet delivered to the socket whose index matches
    /// the receiving CPU (smp_processor_id() % num_members).
    Cpu       = 2,
    /// Rollover: deliver to the current socket; if its receive buffer is
    /// full, advance to the next socket.
    Rollover  = 3,
    /// Random: uniform random selection among members.
    Rnd       = 4,
    /// Queue mapping: use the NIC's RX queue index to select the socket
    /// (queue_index % num_members).
    Qm        = 5,
    /// Classic BPF: run an attached cBPF program that returns the socket
    /// index (0..num_members-1).
    Cbpf      = 6,
    /// eBPF: run an attached eBPF program that returns the socket index.
    Ebpf      = 7,
}
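The per-packet selection step for a few of these modes can be sketched as follows. The helper names are illustrative, and the modulo reduction stands in for whatever hash-to-index mapping the kernel actually uses:

```rust
/// FANOUT_HASH: same flow hash always maps to the same member index.
pub fn fanout_hash(flow_hash: u32, n: u32) -> u32 {
    flow_hash % n
}

/// FANOUT_CPU: member index derived from the receiving CPU.
pub fn fanout_cpu(cpu_id: u32, n: u32) -> u32 {
    cpu_id % n
}

/// FANOUT_LB: rotate through members via a shared counter.
pub fn fanout_lb(rr_counter: &mut u64, n: u32) -> u32 {
    let idx = (*rr_counter % n as u64) as u32;
    *rr_counter = rr_counter.wrapping_add(1);
    idx
}

/// Demonstrate the round-robin rotation for `count` packets.
pub fn demo_lb_sequence(n: u32, count: usize) -> Vec<u32> {
    let mut ctr = 0u64;
    (0..count).map(|_| fanout_lb(&mut ctr, n)).collect()
}
```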

/// Fanout flag bits (OR'd into the mode value in setsockopt).
pub mod fanout_flags {
    /// If the selected socket's buffer is full, roll over to the next.
    pub const ROLLOVER: u32  = 0x1000;
    /// Request a unique group ID (kernel assigns one if the requested
    /// ID is already in use).
    pub const UNIQUEID: u32  = 0x2000;
    /// Defragment IP fragments before fanout hashing (ensures all
    /// fragments of a flow go to the same socket).
    pub const DEFRAG: u32    = 0x8000;
}

Fanout join protocol: A socket joins a fanout group via setsockopt(SOL_PACKET, PACKET_FANOUT, &val, sizeof(val)), where val carries the group ID in the low 16 bits and the mode and flag bits in the high 16 bits: val = group_id | ((mode | flags) << 16). The first socket to join a group ID creates the FanoutGroup entry in the global XArray. Subsequent sockets must specify the same mode and flags; mismatches return -EINVAL. When the last member socket closes, the FanoutGroup is removed from the XArray.
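The value packing can be sketched as an encode/decode pair. The split of the high half into a low algorithm byte and high flag byte follows the Linux convention; the function names are ours:

```rust
/// Build the PACKET_FANOUT setsockopt value: group id in the low 16 bits,
/// (mode | flags) in the high 16 bits.
pub fn fanout_encode(group_id: u16, mode: u16, flags: u16) -> u32 {
    (group_id as u32) | (((mode | flags) as u32) << 16)
}

/// Split a PACKET_FANOUT value back into (group_id, mode, flags).
/// The algorithm occupies the low byte of the high half; flag bits
/// (ROLLOVER = 0x1000, UNIQUEID = 0x2000, DEFRAG = 0x8000) occupy
/// the high byte.
pub fn fanout_decode(val: u32) -> (u16, u16, u16) {
    let hi = (val >> 16) as u16;
    ((val & 0xffff) as u16, hi & 0x00ff, hi & 0xff00)
}
```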

16.25.6 BPF Filter Attachment

Packet sockets support BPF program attachment for in-kernel packet filtering. The filter is evaluated for every incoming packet before ring delivery; packets that fail the filter (return value 0) are dropped without consuming ring space or waking userspace.

| Socket option | BPF type | Semantics |
|---|---|---|
| SO_ATTACH_FILTER | Classic BPF (struct sock_fprog) | Compiled to internal BPF bytecode at attach time. Legacy path for libpcap compatibility. |
| SO_ATTACH_BPF | eBPF (program fd) | Attached by fd obtained from bpf(BPF_PROG_LOAD, BPF_PROG_TYPE_SOCKET_FILTER, ...). Modern path with map access. |
| SO_DETACH_FILTER | (none) | Remove the attached filter. All subsequent packets are delivered unfiltered. |

Integration: BPF programs attached to packet sockets are subject to the same isolation rules as all BPF programs in the networking stack (Section 16.18). The program runs in a BPF isolation domain and accesses packet data through verified BPF helpers. Classic BPF programs are internally translated to eBPF before JIT compilation.

Filter evaluation point: The filter runs after protocol match and interface match but before ring delivery. This ordering ensures that the filter sees the complete frame (including any VLAN tags) and that dropped packets never consume ring space or trigger poll() wakeups.

16.25.7 Socket Options

| Option | Level | Type | Description |
|---|---|---|---|
| PACKET_ADD_MEMBERSHIP | SOL_PACKET | packet_mreq | Join a multicast group or enable promiscuous/all-multicast mode on an interface |
| PACKET_DROP_MEMBERSHIP | SOL_PACKET | packet_mreq | Leave a multicast group or disable promiscuous/all-multicast mode |
| PACKET_AUXDATA | SOL_PACKET | i32 | Enable auxiliary data delivery (VLAN TCI, original length, packet hash) via CMSG |
| PACKET_FANOUT | SOL_PACKET | i32 | Join a fanout group (group ID in the low 16 bits, mode and flags in the high 16 bits) |
| PACKET_STATISTICS | SOL_PACKET | tpacket_stats | Retrieve and atomically reset RX/TX packet and drop counters |
| PACKET_VERSION | SOL_PACKET | i32 | Set TPACKET version (0=V1, 1=V2, 2=V3). Must be set before PACKET_RX_RING/PACKET_TX_RING |
| PACKET_RX_RING | SOL_PACKET | tpacket_req3 | Configure the MMAP RX ring (block count, block size, frame size, retire timeout) |
| PACKET_TX_RING | SOL_PACKET | tpacket_req3 | Configure the MMAP TX ring |
| PACKET_QDISC_BYPASS | SOL_PACKET | i32 | Bypass the traffic control layer (Section 16.21) on TX; send directly to the NIC driver |
| PACKET_ORIGDEV | SOL_PACKET | i32 | Report the original ingress device index (before bridge/bond forwarding) in ancillary data |
| PACKET_LOSS | SOL_PACKET | i32 | Enable TX loss reporting: if the NIC TX ring is full, report the drop count in PACKET_STATISTICS |
| PACKET_TIMESTAMP | SOL_PACKET | i32 | Timestamp source: 0 = software (ktime), 1 = adapter hardware (NIC PTP clock), 2 = adapter raw |
| PACKET_RESERVE | SOL_PACKET | u32 | Reserve N bytes of headroom before the frame data in each ring frame (for userspace to prepend headers) |
| PACKET_VNET_HDR | SOL_PACKET | i32 | Prepend a virtio_net_hdr to each delivered frame (for QEMU tap devices; provides GSO/checksum offload metadata) |

16.25.8 RX Delivery Path

The receive path for AF_PACKET sockets is hooked into the network stack's protocol demux layer. Every incoming frame is checked against all registered packet sockets before (or in parallel with) normal L3 protocol dispatch.

NIC RX DMA completion
    v
NapiContext.poll() ([Section 16.14](#napi-new-api-for-packet-polling))
    v
napi_deliver_batch() → umka-net: GRO coalesce → netif_receive(netbuf)
    v
packet_type_demux():
    ├── Match registered packet sockets by (protocol, ifindex)
    │   For each matching PacketSocket:
    │   │
    │   ├── 1. Protocol match: netbuf.ethertype == sock.protocol
    │   │      (ETH_P_ALL matches everything)
    │   │
    │   ├── 2. Interface match: sock.ifindex == 0 (all) OR
    │   │      sock.ifindex == netbuf.dev.ifindex
    │   │
    │   ├── 3. Fanout dispatch: if sock.fanout.is_some(),
    │   │      select target socket from FanoutGroup.members_rcu
    │   │      using the configured FanoutMode algorithm
    │   │
    │   ├── 4. BPF filter: if sock.bpf_filter.is_some(),
    │   │      run filter; drop if result == 0
    │   │
    │   ├── 5a. MMAP ring delivery (if rx_ring configured):
    │   │       Find current block in ring
    │   │       If block full → retire block (set TP_STATUS_USER)
    │   │                     → advance to next block
    │   │       If next block still USER-owned → drop (tp_drops++)
    │   │       Copy frame + Tpacket3Hdr into block
    │   │       Update block.num_pkts, timestamps
    │   │       If retire timeout armed → reset timer
    │   │
    │   └── 5b. Queue delivery (if no MMAP ring):
    │           Clone NetBuf → push to sock.recv_queue
    │           If queue full → drop (tp_drops++)
    │           Wake sock.waitq (poll/epoll/select notification)
    └── Continue to normal L3 protocol processing (IPv4/IPv6)

Busy-poll integration: When SO_BUSY_POLL is set on the socket and the underlying NIC's NAPI instance supports busy polling (busy_poll_enabled = true), poll(POLLIN) on the packet socket drives the NAPI poll function directly from process context before falling back to sleep-wait. This reduces capture latency from the ~10-50 microsecond softirq scheduling delay to sub-microsecond for latency-sensitive monitoring (Section 16.14).

Packet cloning: When multiple packet sockets match the same frame (e.g., tcpdump on ETH_P_ALL + dhcpd on ETH_P_IP), the NetBuf is cloned for each socket. The clone shares the underlying data pages (reference count incremented) with a new metadata header. The original NetBuf continues through the normal protocol stack unaffected.

16.25.9 TX Path

| Mode | Behavior |
|---|---|
| SOCK_RAW | Userspace provides the complete L2 frame (Ethernet header + payload). The kernel validates the frame length (minimum 14 bytes for the Ethernet header) and transmits directly. |
| SOCK_DGRAM | Userspace provides only the payload. The kernel prepends the Ethernet header using sll_addr (destination MAC), the interface's own MAC (source), and sll_protocol (ethertype). |
| PACKET_QDISC_BYPASS | Skip the traffic control layer entirely. The frame is submitted directly to the NIC driver's TX ring via NetDevice.ops.ndo_start_xmit(). Used by packet injection tools that need deterministic TX timing (e.g., hostapd beacons, DHCP replies). |
| MMAP TX ring | Userspace writes frames to the TX ring with TP_STATUS_SEND_REQUEST. A sendto(fd, NULL, 0, 0, NULL, 0) call triggers the kernel to drain the TX ring and submit all pending frames to the NIC driver. After transmission, the kernel sets TP_STATUS_AVAILABLE on each frame slot. |

TX path flow:

sendto(fd, data, len, flags, &sll_addr, addrlen)
    v
PacketSocket.sendmsg():
    ├── Build or validate L2 header (SOCK_DGRAM vs SOCK_RAW)
    ├── Allocate NetBuf from per-CPU pool
    ├── Copy userspace data → NetBuf data pages
    ├── Set netbuf.dev = resolve_ifindex(sll_ifindex)
    ├── If qdisc_bypass:
    │       netbuf.dev.ops.ndo_start_xmit(netbuf)
    │       (direct to NIC driver, no TC processing)
    └── Else:
            dev_queue_xmit(netbuf)
            (enters TC layer → qdisc enqueue → NIC driver)

16.25.10 Namespace Isolation

PacketSocket inherits the network namespace from SockCommon.net_ns, captured at socket() time from the calling task's NamespaceSet (Section 16.3). Namespace isolation is enforced at three points:

  1. bind() validation: sll_ifindex is checked against sock.net_ns.interfaces. Binding to an interface that does not exist in the socket's namespace returns -ENODEV.

  2. RX delivery: The packet demux path only considers packet sockets whose net_ns matches the receiving interface's net_ns. A packet socket in container namespace N never receives frames from interfaces in the host namespace or other containers.

  3. TX validation: sendto() resolves sll_ifindex against sock.net_ns.interfaces. Transmitting to an interface outside the socket's namespace returns -ENXIO.

16.25.11 Tier Assignment

AF_PACKET runs in Tier 0 (in-kernel, statically linked). It does not cross any isolation domain boundary for packet delivery.

Rationale: Packet capture requires access to raw NIC DMA buffers at the earliest point in the RX path — inside the NAPI poll loop, before any protocol processing. The PACKET_MMAP ring writes frame data directly from the NIC's DMA buffer (or a copy thereof) into the mmap'd userspace-visible pages. Introducing a domain switch between the NAPI context and the packet socket delivery path would add ~23 cycles per packet on every captured frame. At 14.88 Mpps (10 GbE line rate, 64-byte frames), the per-packet budget is only ~67 ns, so ~23 cycles (roughly 8 ns at 3 GHz) consumes over 10% of it. Beyond the cycle cost, the mmap write path is latency-sensitive (the kernel must fill ring blocks before the retire timeout expires), and the cache working set of the packet delivery path (ring block headers + frame headers + packet data copy) benefits from staying in the same address space as the NAPI poll function.

AF_PACKET's attack surface is limited: it requires CAP_NET_RAW, operates on pre-validated NetBuf structures produced by the NIC driver, and the mmap ring is a simple producer-consumer protocol with status words. The BPF filter engine is already isolated in its own domain (Section 16.18). No driver-provided code executes in the AF_PACKET path — it is entirely kernel (umka-net) code.


16.26 AF_XDP — eXpress Data Path Socket

AF_XDP (address family 44, also spelled PF_XDP) is the kernel interface for high-performance userspace packet processing. Unlike AF_PACKET's copy-based or PACKET_MMAP model, AF_XDP provides true zero-copy: the NIC DMAs directly into userspace-registered memory (UMEM), and an XDP BPF program redirects packets to the AF_XDP socket. This is UmkaOS's recommended path for DPDK-class workloads without requiring userspace drivers or full kernel bypass.

Performance target: <100 ns per packet in zero-copy mode (vs. ~1-5 us for AF_PACKET MMAP, ~500 ns for DPDK PMD with VFIO overhead). Achieved by eliminating all kernel-to-userspace copies on the data path: the NIC writes packet data directly into UMEM chunks, and the kernel communicates packet metadata (offset + length) through lock-free shared-memory ring buffers.

Linux parallel: Linux implements AF_XDP in net/xdp/xsk.c and net/xdp/xsk_buff_pool.c. UmkaOS implements the same userspace ABI (struct xdp_umem_reg, struct xdp_desc, struct sockaddr_xdp, ring mmap offsets) so that unmodified libxdp and libbpf AF_XDP applications work without recompilation.

XDP core context: The XDP program model (XdpContext, XdpAction, bpf_redirect_map()) is specified in Section 16.5. This section specifies the AF_XDP socket interface that sits behind XdpAction::Redirect when the target is an XSKMAP entry.

16.26.1 Socket Creation and Binding

The userspace workflow follows the Linux AF_XDP API exactly:

1. fd = socket(AF_XDP, SOCK_RAW, 0)
2. setsockopt(fd, SOL_XDP, XDP_UMEM_REG, &umem_reg)       // register UMEM
3. setsockopt(fd, SOL_XDP, XDP_UMEM_FILL_RING, &ring_size) // create FILL ring
4. setsockopt(fd, SOL_XDP, XDP_UMEM_COMPLETION_RING, &ring_size) // create COMPLETION ring
5. setsockopt(fd, SOL_XDP, XDP_RX_RING, &ring_size)        // create RX ring
6. setsockopt(fd, SOL_XDP, XDP_TX_RING, &ring_size)        // create TX ring
7. mmap(fd, XDP_PGOFF_RX_RING, ...)                        // map RX ring
8. mmap(fd, XDP_PGOFF_TX_RING, ...)                        // map TX ring
9. mmap(fd, XDP_UMEM_PGOFF_FILL_RING, ...)                 // map FILL ring
10. mmap(fd, XDP_UMEM_PGOFF_COMPLETION_RING, ...)          // map COMPLETION ring
11. bind(fd, &SockAddrXdp { ifindex, queue_id, ... })      // bind to NIC queue
12. Attach XDP BPF program with XSKMAP redirect to the NIC

Capability requirement: Creating an AF_XDP socket requires CAP_NET_RAW or CAP_NET_ADMIN in the socket's network namespace (Section 16.3). UMEM registration additionally requires that the memory region is owned by the calling process (validated via VMA lookup during XDP_UMEM_REG).

16.26.2 UMEM (User Memory Region)

UMEM is a contiguous region of userspace memory divided into fixed-size chunks. Each chunk holds one packet frame. The kernel pins the UMEM pages and creates DMA mappings so the NIC can read/write directly.

/// UMEM registration parameters.
/// Passed to setsockopt(fd, SOL_XDP, XDP_UMEM_REG, ...).
/// Layout matches Linux struct xdp_umem_reg for binary compatibility.
#[repr(C)]
pub struct XdpUmemReg {
    /// Base virtual address of the UMEM region (userspace mmap'd).
    /// Must be page-aligned. The kernel calls get_user_pages_fast() to
    /// pin the backing pages and create DMA mappings.
    pub addr: u64,

    /// Total UMEM size in bytes. Must be a multiple of chunk_size.
    /// Maximum: 4 GiB (limited by u32 descriptor offsets within UMEM).
    /// The kernel validates addr + len does not overflow and falls
    /// entirely within a single VMA owned by the calling process.
    pub len: u64,

    /// Frame/chunk size in bytes. Must be a power of two.
    /// Typical values: 2048 (for MTU <= ~1500 with headroom) or 4096.
    /// Minimum: 2048. Maximum: PAGE_SIZE for aligned chunk mode (arch-dependent:
    /// 4096 on x86-64, up to 65536 on AArch64/PPC64LE with large page configs).
    /// In unaligned mode (XDP_UMEM_UNALIGNED_CHUNK_FLAG set): max = UMEM size / 2.
    /// When XDP_UMEM_UNALIGNED_CHUNK_FLAG is set, chunk_size is the
    /// maximum frame size rather than a strict alignment boundary.
    pub chunk_size: u32,

    /// Headroom reserved before packet data in each chunk (bytes).
    /// The NIC writes packet data starting at chunk_base + headroom.
    /// Provides space for userspace to prepend headers (e.g., tunnel
    /// encapsulation) without reallocating. Must satisfy:
    /// headroom + MTU <= chunk_size.
    pub headroom: u32,

    /// UMEM flags (bitmask).
    ///
    /// - `XDP_UMEM_UNALIGNED_CHUNK_FLAG` (1 << 0): Chunks need not be
    ///   aligned to chunk_size boundaries. Descriptor addr fields are
    ///   arbitrary offsets into UMEM (within [0, len - chunk_size]).
    ///   Enables variable-size frame support at the cost of more
    ///   complex buffer management in userspace.
    /// - `XDP_UMEM_TX_SW_CSUM` (1 << 1): Enable software TX checksum
    ///   offload via UMEM TX metadata. When set, the `tx_metadata_len`
    ///   area may contain checksum offload hints (start offset, offset
    ///   of checksum field). Linux 6.8+ feature. Rejected if
    ///   `tx_metadata_len == 0`.
    /// - `XDP_UMEM_TX_METADATA_LEN` (1 << 2): Acknowledge that
    ///   `tx_metadata_len > 0` is intentional. Linux rejects
    ///   `tx_metadata_len > 0` unless this flag is set. This prevents
    ///   legacy applications from accidentally interpreting the metadata
    ///   area as packet data.
    ///
    /// Unknown flag bits are rejected with `EINVAL` (matches Linux
    /// `xdp_umem_reg()` validation in `net/xdp/xdp_umem.c`).
    pub flags: u32,

    /// Per-chunk TX metadata area length (bytes). When non-zero,
    /// the first tx_metadata_len bytes of each TX chunk contain
    /// driver-interpreted metadata (e.g., launch time for time-based
    /// scheduling, checksum offload hints). Linux 6.6+ feature.
    pub tx_metadata_len: u32,
}
const_assert!(size_of::<XdpUmemReg>() == 32);

/// UMEM flags (for `XdpUmemReg.flags`). Matches Linux `include/uapi/linux/if_xdp.h`.
pub const XDP_UMEM_UNALIGNED_CHUNK_FLAG: u32 = 1 << 0;
pub const XDP_UMEM_TX_SW_CSUM: u32           = 1 << 1;
pub const XDP_UMEM_TX_METADATA_LEN: u32      = 1 << 2;

UMEM lifecycle:

  1. Userspace allocates a contiguous region via mmap(MAP_ANONYMOUS | MAP_PRIVATE), or mmap(MAP_HUGETLB) for large UMEM regions.

  2. setsockopt(SOL_XDP, XDP_UMEM_REG) pins the pages via get_user_pages_fast() and creates a struct XskUmem in the kernel.

  3. The kernel creates a DMA mapping for each pinned page via dma_map_page() (Section 4.14). For IOMMU-enabled systems, this programs the IOMMU page table so the NIC can DMA to/from the UMEM pages.

  4. Multiple AF_XDP sockets can share a single UMEM via the shared_umem_fd field in SockAddrXdp. The second socket inherits the first socket's UMEM registration (refcounted). This allows multiple queues on the same NIC to share a single memory pool.

  5. When the last socket referencing a UMEM closes, the kernel unpins the pages and tears down DMA mappings.
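In aligned chunk mode, a descriptor address identifies its chunk by rounding down to the chunk boundary. The sketch below is a simplified plausibility check under that model; the headroom and length conditions are illustrative, not the kernel's exact validation rules:

```rust
/// Chunk base for an aligned-mode UMEM: round the descriptor address
/// down to the chunk boundary (chunk_size is a power of two).
pub fn chunk_base_aligned(addr: u64, chunk_size: u64) -> u64 {
    addr & !(chunk_size - 1)
}

/// Illustrative aligned-mode descriptor check: the chunk must lie
/// entirely within the UMEM, the data must start at or after the
/// headroom, and the frame must fit within its chunk.
pub fn desc_valid_aligned(
    addr: u64,
    len: u32,
    umem_len: u64,
    chunk_size: u64,
    headroom: u64,
) -> bool {
    let base = chunk_base_aligned(addr, chunk_size);
    base + chunk_size <= umem_len                  // chunk inside UMEM
        && addr - base >= headroom                 // data starts after headroom
        && (addr - base) + len as u64 <= chunk_size // frame fits in chunk
}
```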

/// Kernel-internal UMEM descriptor. Created during XDP_UMEM_REG
/// and shared (refcounted) across all sockets bound to this UMEM.
pub struct XskUmem {
    /// Base kernel virtual address (pinned userspace pages).
    pub base_addr: *mut u8,

    /// Total UMEM size.
    pub size: u64,

    /// Chunk/frame size.
    pub chunk_size: u32,

    /// Headroom per chunk.
    pub headroom: u32,

    /// UMEM flags (copied from XdpUmemReg.flags).
    pub flags: u32,

    /// TX metadata length per chunk.
    pub tx_metadata_len: u32,

    /// Number of pinned pages (size / PAGE_SIZE).
    pub nr_pages: u64,

    /// DMA addresses for each pinned page. XArray keyed by page index
    /// (u64 page frame number -> dma_addr_t). O(1) lookup during
    /// NIC descriptor programming.
    pub dma_map: XArray<u64>,

    /// Reference count. Incremented when an XskSocket binds to this
    /// UMEM (including shared_umem_fd binds). Decremented on socket close.
    pub refcount: AtomicU64,
}

16.26.3 Ring Buffer Protocol

AF_XDP uses four shared-memory ring buffers per socket. All rings are single-producer / single-consumer (SPSC), mapped into both kernel and userspace address space via mmap().

/// Ring buffer layout as seen by both kernel and userspace after mmap().
///
/// The ring is a power-of-two array of descriptors. Producer and consumer
/// indices are monotonically increasing u32 values; the actual array index
/// is obtained via `idx & (size - 1)` (mask, since size is power-of-two).
///
/// This struct describes the mmap'd layout. The kernel initialises the
/// ring during setsockopt() and returns the offsets via getsockopt(XDP_MMAP_OFFSETS).
#[repr(C)]
pub struct XdpRingOffset {
    /// Byte offset from mmap base to the producer index (AtomicU32).
    pub producer: u64,
    /// Byte offset from mmap base to the consumer index (AtomicU32).
    pub consumer: u64,
    /// Byte offset from mmap base to the descriptor array.
    pub desc: u64,
    /// Byte offset from mmap base to the flags word (u32).
    pub flags: u64,
}
// UAPI ABI: matches Linux struct xdp_ring_offset from include/uapi/linux/if_xdp.h.
// 4 × u64 = 32 bytes. Ring size is communicated via the ring itself (producer/consumer
// indices and power-of-two mask), not through this offset struct.
const_assert!(size_of::<XdpRingOffset>() == 32);

/// Packet descriptor in RX and TX rings.
/// Layout matches Linux struct xdp_desc for binary compatibility.
#[repr(C)]
pub struct XdpDesc {
    /// Byte offset into UMEM where this packet's data starts.
    /// For aligned mode: the chunk base (a multiple of chunk_size) plus
    /// headroom; the descriptor must not cross a chunk boundary.
    /// For unaligned mode: arbitrary offset within [0, umem.size - chunk_size].
    pub addr: u64,

    /// Packet data length in bytes (not including headroom).
    pub len: u32,

    /// Per-descriptor options.
    /// XDP_PKT_CONTD (1 << 0): This descriptor is part of a multi-buffer
    ///   packet. The next descriptor in the ring continues the same packet.
    ///   The last descriptor in a multi-buffer chain has options = 0.
    /// Used for jumbo frames exceeding chunk_size.
    pub options: u32,
}
const_assert!(size_of::<XdpDesc>() == 16);

/// FILL ring descriptor: just a UMEM chunk address (u64).
/// Userspace submits empty chunk addresses; kernel fills them with data.
pub type XdpFillDesc = u64;

/// COMPLETION ring descriptor: just a UMEM chunk address (u64).
/// Kernel returns completed TX chunk addresses for userspace to reclaim.
pub type XdpCompletionDesc = u64;

Ring roles and ownership:

| Ring       | Producer  | Consumer  | Descriptor type | Purpose                         |
|------------|-----------|-----------|-----------------|---------------------------------|
| FILL       | Userspace | Kernel    | u64 (UMEM addr) | Supply empty chunks for RX      |
| RX         | Kernel    | Userspace | XdpDesc         | Deliver received packets        |
| TX         | Userspace | Kernel    | XdpDesc         | Submit packets for transmission |
| COMPLETION | Kernel    | Userspace | u64 (UMEM addr) | Return TX-completed chunks      |

Memory ordering protocol (critical for correctness on weakly-ordered architectures):

The producer writes descriptor data into the ring slot, then advances the producer index with store_release semantics. The consumer reads the producer index with load_acquire semantics, then reads the descriptor data. This ensures that the consumer always sees fully-written descriptors:

Producer side:
    ring.desc[prod_idx & mask] = descriptor;   // write data
    atomic_store_release(&ring.producer, prod_idx + 1);  // publish

Consumer side:
    let prod = atomic_load_acquire(&ring.producer);  // observe
    if prod != cons_idx:
        let desc = ring.desc[cons_idx & mask];  // read data (safe)
        atomic_store_release(&ring.consumer, cons_idx + 1);  // acknowledge

On x86-64 (TSO), store_release is a plain store and load_acquire is a plain load -- zero overhead. On AArch64, RISC-V, and PPC, the compiler emits the appropriate instructions (stlr/ldar on AArch64; fence rw,w before the release store and fence r,rw after the acquire load on RISC-V; lwsync on PPC64).
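The pseudocode above can be sketched as a safe-Rust SPSC ring with the same release/acquire index protocol. The type and method names are illustrative (this is not the kernel's XskRing); slots are AtomicU64 only to keep the sketch in safe Rust, whereas the real ring stores plain descriptors and relies on the index ordering alone.

```rust
use std::sync::atomic::{AtomicU32, AtomicU64, Ordering};

/// Minimal single-producer/single-consumer ring mirroring the AF_XDP
/// index protocol: write the slot, then publish with store-release;
/// observe with load-acquire, then read the slot.
pub struct Ring {
    producer: AtomicU32,
    consumer: AtomicU32,
    mask: u32,
    slots: Vec<AtomicU64>,
}

impl Ring {
    pub fn new(size: u32) -> Self {
        assert!(size.is_power_of_two());
        Ring {
            producer: AtomicU32::new(0),
            consumer: AtomicU32::new(0),
            mask: size - 1,
            slots: (0..size).map(|_| AtomicU64::new(0)).collect(),
        }
    }

    /// Producer side: write descriptor data, then publish the new
    /// producer index with release semantics.
    pub fn push(&self, desc: u64) -> bool {
        let prod = self.producer.load(Ordering::Relaxed);
        let cons = self.consumer.load(Ordering::Acquire);
        if prod.wrapping_sub(cons) > self.mask {
            return false; // ring full
        }
        self.slots[(prod & self.mask) as usize].store(desc, Ordering::Relaxed);
        self.producer.store(prod.wrapping_add(1), Ordering::Release);
        true
    }

    /// Consumer side: observe the producer index with acquire semantics,
    /// then the descriptor read is guaranteed to see fully-written data.
    pub fn pop(&self) -> Option<u64> {
        let cons = self.consumer.load(Ordering::Relaxed);
        let prod = self.producer.load(Ordering::Acquire);
        if prod == cons {
            return None; // ring empty
        }
        let desc = self.slots[(cons & self.mask) as usize].load(Ordering::Relaxed);
        self.consumer.store(cons.wrapping_add(1), Ordering::Release);
        Some(desc)
    }
}
```

Because the indices increase monotonically and are only masked at slot-lookup time, wraparound of the u32 index is harmless as long as the ring size is a power of two.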

mmap page offsets (matching Linux constants):

/// mmap offset for the RX ring region.
pub const XDP_PGOFF_RX_RING: u64 = 0;
/// mmap offset for the TX ring region.
pub const XDP_PGOFF_TX_RING: u64 = 0x8000_0000;
/// mmap offset for the FILL ring region.
pub const XDP_UMEM_PGOFF_FILL_RING: u64 = 0x1_0000_0000;
/// mmap offset for the COMPLETION ring region.
pub const XDP_UMEM_PGOFF_COMPLETION_RING: u64 = 0x1_8000_0000;

16.26.4 SockAddrXdp

/// AF_XDP socket address. Passed to bind() to attach the socket
/// to a specific NIC queue.
/// Layout matches Linux struct sockaddr_xdp for binary compatibility.
#[repr(C)]
pub struct SockAddrXdp {
    /// Address family: AF_XDP (44).
    pub sxdp_family: u16,

    /// Bind flags:
    /// - XDP_SHARED_UMEM (1 << 0): Share UMEM with the socket identified
    ///   by sxdp_shared_umem_fd. The sharing socket must already be bound.
    /// - XDP_COPY (1 << 1): Force copy mode even if zero-copy is available.
    /// - XDP_ZEROCOPY (1 << 2): Require zero-copy mode; fail bind() with
    ///   ENOTSUP if the driver does not support it.
    /// - XDP_USE_NEED_WAKEUP (1 << 3): Enable NEED_WAKEUP signaling
    ///   (reduces CPU usage in poll-based applications).
    /// - XDP_USE_SG (1 << 4): Enable multi-buffer (scatter-gather) mode.
    ///   Packets may span multiple UMEM frames via `XDP_PKT_CONTD`.
    pub sxdp_flags: u16,

    /// Network interface index (from if_nametoindex()).
    /// Must reference an interface in the socket's network namespace.
    pub sxdp_ifindex: u32,

    /// NIC hardware queue index. The AF_XDP socket receives packets
    /// from (and transmits to) this specific RX/TX queue.
    /// Must be < device's num_rx_queues / num_tx_queues.
    pub sxdp_queue_id: u32,

    /// File descriptor of another AF_XDP socket whose UMEM this socket
    /// shares. Only meaningful when XDP_SHARED_UMEM flag is set.
    /// Allows multiple sockets (one per queue) to share a single
    /// UMEM allocation, reducing memory overhead for multi-queue NICs.
    pub sxdp_shared_umem_fd: u32,
}
const_assert!(size_of::<SockAddrXdp>() == 16);

/// AF_XDP address family constant (matches Linux).
pub const AF_XDP: u16 = 44;

/// SockAddrXdp bind flags. Matches Linux `include/uapi/linux/if_xdp.h`.
pub const XDP_SHARED_UMEM: u16     = 1 << 0;
pub const XDP_COPY: u16            = 1 << 1;
pub const XDP_ZEROCOPY: u16        = 1 << 2;
pub const XDP_USE_NEED_WAKEUP: u16 = 1 << 3;
/// Enable multi-buffer (scatter-gather) XDP support. When set, the
/// `XDP_PKT_CONTD` descriptor flag is valid: a packet may span multiple
/// UMEM frames. Required for jumbo frames and TSO/LRO-fed receive paths.
/// Matches Linux `XDP_USE_SG` (added in Linux 6.6).
pub const XDP_USE_SG: u16         = 1 << 4;

/// Ring flag: kernel sets this when it needs userspace to call
/// poll()/sendto() to kick processing. Userspace polls this flag
/// before deciding whether a syscall is needed.
pub const XDP_RING_NEED_WAKEUP: u32 = 1 << 0;

16.26.5 XDP Program Integration

An XDP BPF program attached to the NIC acts as the steering layer for AF_XDP. The program inspects packet headers and decides whether to redirect the packet to an AF_XDP socket:

XDP program (BPF):
    struct {
        __uint(type, BPF_MAP_TYPE_XSKMAP);
        __uint(max_entries, 64);      // one entry per RX queue
        __type(key, __u32);
        __type(value, __u32);
    } xsks_map SEC(".maps");

    SEC("xdp")
    int xdp_redirect_to_xsk(struct xdp_md *ctx) {
        return bpf_redirect_map(&xsks_map, ctx->rx_queue_index, XDP_PASS);
    }

Kernel-side XSKMAP:

/// BPF map type for AF_XDP socket lookup.
/// Keyed by NIC queue index (u32) -> XskSocket reference.
///
/// XArray-backed (integer-keyed lookup). XArray provides RCU-safe reads
/// natively: readers call `xa_load()` under `rcu_read_lock()` with no
/// contention. Writers call `xa_store()` / `xa_erase()` under the map's
/// update lock; XArray publishes the new entry via internal RCU, giving
/// O(log₆₄ N) per-entry updates instead of O(N) clone-and-swap.
pub struct XskMap {
    /// XArray mapping queue_id (u32) -> Arc<XskSocket>.
    /// RCU-protected natively by XArray (no external RcuCell needed).
    /// Writers acquire `update_lock`, then call `xa_store()` / `xa_erase()`.
    /// Readers (XDP fast path) call `xa_load()` under `rcu_read_lock()`.
    pub map: XArray<Arc<XskSocket>>,

    /// Maximum number of entries (set at map creation).
    pub max_entries: u32,

    /// Write-side serialization lock for map updates (socket bind/close).
    pub update_lock: SpinLock<()>,
}

When the XDP program returns XdpAction::Redirect with an XSKMAP target (Section 16.5), the kernel:

1. Looks up the AF_XDP socket in XskMap by queue index (O(1) XArray lookup).
2. Consumes one entry from the socket's FILL ring (acquires an empty UMEM chunk).
3. In zero-copy mode: the packet is already in a UMEM chunk (the NIC DMAed it there directly). In copy mode: copies packet data from the NIC's NetBuf into the UMEM chunk.
4. Writes an XdpDesc (UMEM offset + length) to the socket's RX ring.
5. If the socket is in the poll() wait queue, wakes it via wake_up().

If the FILL ring is empty (no available UMEM chunks), the redirect fails and the fallback action (typically XDP_PASS or XDP_DROP) is taken. The xsk_rx_dropped counter is incremented.
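The delivery decision, including the FILL-empty and RX-full fallbacks, can be modeled compactly. This is a toy sketch: rings are modeled as VecDeques, and the names (XskModel, xsk_redirect, Verdict) are illustrative, not the kernel's types.

```rust
use std::collections::VecDeque;

pub struct XskModel {
    pub fill: VecDeque<u64>,      // userspace-supplied empty chunk addresses
    pub rx: VecDeque<(u64, u32)>, // delivered (UMEM addr, len) descriptors
    pub rx_capacity: usize,
    pub rx_fill_empty: u64,       // dropped: no chunk available
    pub rx_ring_full: u64,        // dropped: RX ring full
}

#[derive(Debug, PartialEq)]
pub enum Verdict {
    Delivered,
    Fallback, // caller applies the configured XDP_PASS / XDP_DROP action
}

pub fn xsk_redirect(sk: &mut XskModel, pkt_len: u32) -> Verdict {
    // Acquire an empty UMEM chunk from the FILL ring.
    let addr = match sk.fill.pop_front() {
        Some(a) => a,
        None => {
            sk.rx_fill_empty += 1;
            return Verdict::Fallback;
        }
    };
    // RX ring full: give the chunk back and fall back.
    if sk.rx.len() == sk.rx_capacity {
        sk.fill.push_front(addr);
        sk.rx_ring_full += 1;
        return Verdict::Fallback;
    }
    // (Copy mode would memcpy the packet into the chunk here.)
    // Publish the descriptor on the RX ring.
    sk.rx.push_back((addr, pkt_len));
    Verdict::Delivered
}
```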

16.26.6 Zero-Copy Mode

Zero-copy mode eliminates all data copies between the NIC and userspace. The NIC's DMA engine reads from and writes to UMEM chunks directly.

Requirements: The NIC driver must implement the XSK pool interface:

/// Extension to NetDeviceOps for AF_XDP zero-copy support.
/// Drivers that support zero-copy implement these methods.
/// Drivers that do not implement them fall back to copy mode.
pub trait NetDeviceXskOps: NetDeviceOps {
    /// Configure an XSK buffer pool on a specific RX/TX queue.
    ///
    /// Called during AF_XDP socket bind() when zero-copy is requested
    /// (or when the driver advertises zero-copy capability and copy mode
    /// was not explicitly forced).
    ///
    /// The driver must:
    /// 1. Reconfigure the hardware RX queue to use UMEM DMA addresses
    ///    as RX buffer descriptors (instead of kernel-allocated NetBuf pages).
    /// 2. Set up the TX queue to accept UMEM-based TX descriptors.
    /// 3. Store the XskPool reference for use in the NAPI poll function.
    ///
    /// Returns Ok(()) on success.
    /// Returns Err(IoError::ENOTSUP) if the hardware queue does not support
    /// zero-copy (e.g., queue is in use by another feature, firmware limitation).
    fn xsk_pool_setup(
        &self,
        dev: &NetDevice,
        pool: &Arc<XskPool>,
        queue_id: u16,
    ) -> Result<(), IoError>;

    /// Tear down the XSK buffer pool on a queue. Restore normal NetBuf-based
    /// RX/TX operation. Called when the last AF_XDP socket on this queue closes.
    fn xsk_pool_teardown(
        &self,
        dev: &NetDevice,
        queue_id: u16,
    );

    /// Wake the NIC's TX path for XSK transmission. Called when userspace
    /// submits new TX descriptors and the kernel needs the driver to
    /// process them. In NAPI mode, this typically schedules a NAPI poll.
    fn xsk_wakeup(
        &self,
        dev: &NetDevice,
        queue_id: u16,
        flags: u32,
    ) -> Result<(), IoError>;
}

RX zero-copy data flow:

1. Driver calls xsk_pool_alloc_frame() to consume a FILL ring entry (UMEM chunk address) and programs the NIC RX descriptor with its DMA address.
2. NIC receives a packet and DMAs it into the UMEM chunk.
3. NIC raises an RX completion interrupt; NAPI poll fires (Section 16.14).
4. Driver reads the RX completion descriptor, constructs an XdpDesc with the UMEM offset and packet length, and calls xsk_rx_deliver().
5. xsk_rx_deliver() writes the XdpDesc to the socket's RX ring and wakes any waiting poll() call.

TX zero-copy data flow:

1. Userspace writes XdpDesc entries to the TX ring (UMEM offset + length).
2. Userspace calls sendto(fd, NULL, 0, MSG_DONTWAIT), or the kernel detects new TX entries via the XDP_RING_NEED_WAKEUP mechanism.
3. Driver's xsk_wakeup() is called; the driver reads TX ring descriptors.
4. For each descriptor, the driver programs a NIC TX descriptor with the DMA address of the UMEM chunk (from XskUmem.dma_map).
5. NIC DMAs packet data from the UMEM chunk and transmits.
6. On TX completion, the driver writes the UMEM chunk address to the COMPLETION ring so userspace can reclaim the chunk.

Supported NIC drivers (UmkaOS Phase 2+): Intel E810 (ice), Mellanox ConnectX-5/6/7 (mlx5), Broadcom NetXtreme (bnxt_en), Amazon ENA. All other NIC drivers use copy mode transparently.

16.26.7 Copy Mode Fallback

When the NIC driver does not implement NetDeviceXskOps, or when the XDP_COPY flag is set in SockAddrXdp.sxdp_flags, AF_XDP operates in copy mode:

  • RX: The XDP redirect path copies packet data from the driver's NetBuf (Section 16.5) into a UMEM chunk consumed from the FILL ring. The copy uses memcpy with the length from the NetBuf's linear data region (plus fragments if scatter-gather).
  • TX: Userspace writes packets into UMEM chunks and submits TX descriptors. The kernel allocates a NetBuf, copies data from the UMEM chunk into the NetBuf's DMA buffer, and passes it to NetDeviceOps::start_xmit() (Section 16.13).

Copy mode is transparent to userspace -- the same ring buffer protocol and XdpDesc format apply. The performance penalty is ~2-3x compared to zero-copy (one memcpy per packet), but it works with any NIC driver that supports XDP (XdpAction::Redirect).

Automatic mode selection: If neither XDP_COPY nor XDP_ZEROCOPY is set in SockAddrXdp.sxdp_flags, bind() attempts zero-copy first by calling xsk_pool_setup(). If that returns ENOTSUP, the kernel silently falls back to copy mode. If XDP_ZEROCOPY is explicitly set and the driver does not support it, bind() fails with ENOTSUP.
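The bind-time selection logic can be sketched as a small decision function. Here `driver_zc` stands in for a successful xsk_pool_setup() probe; the function name and the EINVAL check for mutually exclusive flags are illustrative assumptions.

```rust
// Sketch of AF_XDP bind-time mode selection. Flag values match the
// constants defined earlier in this section.
const XDP_COPY: u16 = 1 << 1;
const XDP_ZEROCOPY: u16 = 1 << 2;

#[derive(Debug, PartialEq)]
pub enum Mode { ZeroCopy, Copy }

pub fn select_mode(bind_flags: u16, driver_zc: bool) -> Result<Mode, &'static str> {
    let want_copy = bind_flags & XDP_COPY != 0;
    let want_zc = bind_flags & XDP_ZEROCOPY != 0;
    if want_copy && want_zc {
        return Err("EINVAL: XDP_COPY and XDP_ZEROCOPY are mutually exclusive");
    }
    if want_copy {
        return Ok(Mode::Copy); // copy mode explicitly forced
    }
    if driver_zc {
        return Ok(Mode::ZeroCopy); // zero-copy preferred when available
    }
    if want_zc {
        // Explicit zero-copy request, driver cannot honor it.
        return Err("ENOTSUP: driver lacks zero-copy support");
    }
    Ok(Mode::Copy) // silent fallback
}
```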

16.26.8 NEED_WAKEUP Mechanism

When XDP_USE_NEED_WAKEUP is set in SockAddrXdp.sxdp_flags, the kernel and userspace cooperate to reduce CPU usage by avoiding unnecessary busy-polling:

FILL ring starvation: When the kernel exhausts all FILL ring entries (no empty UMEM chunks available for RX), it sets the XDP_RING_NEED_WAKEUP flag on the FILL ring's flags word. Userspace checks this flag after replenishing the FILL ring and, if set, calls poll(fd, POLLOUT) or sendto(fd, NULL, 0, MSG_DONTWAIT) to notify the kernel that new FILL entries are available.

TX ring processing: Similarly, when the kernel has finished processing all TX ring entries and enters idle, it sets XDP_RING_NEED_WAKEUP on the TX ring. Userspace checks the flag after submitting new TX entries and calls sendto() to kick the driver's TX path.

Without NEED_WAKEUP, userspace must either busy-poll continuously or use a fixed-interval poll() call. NEED_WAKEUP enables event-driven wakeup with minimal latency overhead: the flag check is a single load_acquire on a cache-line shared with the kernel.
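The userspace side of this protocol reduces to one flag check after each batch of ring submissions. A minimal sketch, with an AtomicU32 standing in for the mmap'd ring flags word (the helper name is illustrative):

```rust
use std::sync::atomic::{AtomicU32, Ordering};

const XDP_RING_NEED_WAKEUP: u32 = 1 << 0;

/// Returns true if the producer must kick the kernel (via poll() or
/// sendto(fd, NULL, 0, MSG_DONTWAIT)) after publishing new entries.
/// A single load_acquire on a cache line shared with the kernel.
pub fn needs_kick(ring_flags: &AtomicU32) -> bool {
    ring_flags.load(Ordering::Acquire) & XDP_RING_NEED_WAKEUP != 0
}
```

In the submit loop, userspace publishes a batch of FILL or TX entries, then calls `needs_kick()` and issues the syscall only when the flag is set; otherwise the kernel is already actively polling and no syscall is needed.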

16.26.9 Multi-Buffer Support

For jumbo frames or packets exceeding chunk_size, AF_XDP uses multi-buffer descriptors. Multiple XdpDesc entries in the RX or TX ring form a chain via the XDP_PKT_CONTD flag:

Descriptor N:     { addr: 0x1000, len: 4096, options: XDP_PKT_CONTD }
Descriptor N+1:   { addr: 0x2000, len: 4096, options: XDP_PKT_CONTD }
Descriptor N+2:   { addr: 0x3000, len: 1500, options: 0 }  // last fragment

The last descriptor in a chain has options = 0, signaling end of packet. Each descriptor references a separate UMEM chunk. The total packet length is the sum of all len fields in the chain.

FILL ring interaction: For a multi-buffer RX packet consuming N chunks, N entries are consumed from the FILL ring. Userspace must ensure the FILL ring contains enough entries for the maximum expected chain length.
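Walking a chain is a linear scan that stops at the first descriptor without XDP_PKT_CONTD. A sketch (the helper name `chain_total` is illustrative):

```rust
const XDP_PKT_CONTD: u32 = 1 << 0;

#[derive(Clone, Copy)]
pub struct XdpDesc {
    pub addr: u64,
    pub len: u32,
    pub options: u32,
}

/// Walk a multi-buffer chain from the start of `descs`.
/// Returns (descriptors consumed, total packet length), or None if the
/// final fragment has not yet been published to the ring.
pub fn chain_total(descs: &[XdpDesc]) -> Option<(usize, u64)> {
    let mut total = 0u64;
    for (i, d) in descs.iter().enumerate() {
        total += u64::from(d.len);
        if d.options & XDP_PKT_CONTD == 0 {
            return Some((i + 1, total)); // options == 0 ends the packet
        }
    }
    None // chain truncated: keep the consumer index where it is
}
```

Applied to the three-descriptor example above, this yields 3 descriptors and a total length of 4096 + 4096 + 1500 = 9692 bytes.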

16.26.10 XskSocket Kernel State

/// Kernel-internal AF_XDP socket state.
pub struct XskSocket {
    /// Common socket state (network namespace, credentials, flags).
    pub common: SockCommon,

    /// UMEM backing this socket (shared across sockets via refcount).
    pub umem: Arc<XskUmem>,

    /// RX ring: kernel produces, userspace consumes.
    /// Allocated during setsockopt(XDP_RX_RING).
    pub rx_ring: Option<XskRing<XdpDesc>>,

    /// TX ring: userspace produces, kernel consumes.
    /// Allocated during setsockopt(XDP_TX_RING).
    pub tx_ring: Option<XskRing<XdpDesc>>,

    /// FILL ring: userspace produces (empty chunks), kernel consumes.
    /// Allocated during setsockopt(XDP_UMEM_FILL_RING).
    pub fill_ring: Option<XskRing<u64>>,

    /// COMPLETION ring: kernel produces (TX-done chunks), userspace consumes.
    /// Allocated during setsockopt(XDP_UMEM_COMPLETION_RING).
    pub completion_ring: Option<XskRing<u64>>,

    /// Bound NIC interface index. 0 = not yet bound.
    pub ifindex: u32,

    /// Bound NIC queue index.
    pub queue_id: u32,

    /// Operating mode after bind().
    pub mode: XskMode,

    /// Bind flags from SockAddrXdp.
    pub bind_flags: u16,

    /// Wait queue for poll() / epoll() readiness notification.
    pub waitq: WaitQueue,

    /// Per-socket statistics (u64 counters for 50-year uptime safety).
    pub stats: XskStats,
}

/// Operating mode for an AF_XDP socket.
#[repr(u8)]
pub enum XskMode {
    /// Not yet bound.
    Unbound = 0,
    /// Zero-copy: NIC DMAs directly to/from UMEM.
    ZeroCopy = 1,
    /// Copy mode: kernel copies between NetBuf and UMEM.
    Copy = 2,
}

/// AF_XDP socket statistics. All counters use AtomicU64 for tear-free reads
/// on 32-bit architectures (ARMv7, PPC32) where NAPI RX path updates and
/// userspace `getsockopt(XDP_STATISTICS)` reads may race.
///
/// At 100 Gbps line rate with minimum-size (64-byte) packets:
/// ~148.8M packets/sec -> u64 wraps in ~124 billion seconds (~3,900 years).
/// Safe for 50+ year continuous operation.
///
/// Updates use Relaxed ordering (monotonic counters, no cross-field invariant).
/// `getsockopt` reads use Relaxed ordering (approximate snapshot is sufficient).
pub struct XskStats {
    /// Packets successfully delivered to userspace via RX ring.
    pub rx_packets: AtomicU64,
    /// Packets dropped because the RX ring was full.
    pub rx_ring_full: AtomicU64,
    /// Packets dropped because the FILL ring was empty (no available chunks).
    pub rx_fill_empty: AtomicU64,
    /// Packets successfully transmitted from TX ring.
    pub tx_packets: AtomicU64,
    /// TX submissions that failed (TX ring empty or NIC error).
    pub tx_errors: AtomicU64,
    /// Total bytes received (sum of all RX packet lengths).
    pub rx_bytes: AtomicU64,
    /// Total bytes transmitted (sum of all TX packet lengths).
    pub tx_bytes: AtomicU64,
}
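The counter-width claim in the XskStats comment can be checked directly. A sketch (the helper name is illustrative):

```rust
/// Seconds until a u64 packet counter wraps at a given packet rate.
pub fn wrap_seconds(pps: u64) -> u64 {
    u64::MAX / pps
}

// At ~148.8 Mpps (100 GbE, 64-byte frames):
//   2^64 / 148.8e6 ~= 1.24e11 seconds ~= 3,900 years,
// comfortably beyond any 50-year uptime requirement.
```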

/// Typed ring buffer wrapper for AF_XDP shared-memory rings.
///
/// Generic over the descriptor type: XdpDesc for RX/TX rings,
/// u64 for FILL/COMPLETION rings.
pub struct XskRing<T: Copy> {
    /// Pointer to the mmap'd producer index (AtomicU32).
    pub producer: *mut AtomicU32,
    /// Pointer to the mmap'd consumer index (AtomicU32).
    pub consumer: *mut AtomicU32,
    /// Pointer to the mmap'd flags word.
    pub flags: *mut u32,
    /// Pointer to the descriptor array.
    pub descs: *mut T,
    /// Number of entries (power of two).
    pub size: u32,
    /// Mask for index-to-slot conversion: size - 1.
    pub mask: u32,
}

16.26.11 Namespace Isolation

AF_XDP sockets inherit the network namespace of the creating process (Section 17.1). The sxdp_ifindex in SockAddrXdp must reference an interface in the socket's namespace. bind() validates this by looking up the interface in sock.common.net_ns.interfaces.

A process inside a network namespace can create AF_XDP sockets only for interfaces visible in that namespace. Container runtimes that move a NIC (or a VF via SR-IOV) into a container's network namespace enable the container to use AF_XDP on that interface without host namespace access.

16.26.12 setsockopt / getsockopt Interface

| Level   | Option                   | Direction | Type                       | Description                           |
|---------|--------------------------|-----------|----------------------------|---------------------------------------|
| SOL_XDP | XDP_UMEM_REG             | set       | XdpUmemReg                 | Register UMEM region                  |
| SOL_XDP | XDP_RX_RING              | set       | u32                        | Create RX ring with N entries         |
| SOL_XDP | XDP_TX_RING              | set       | u32                        | Create TX ring with N entries         |
| SOL_XDP | XDP_UMEM_FILL_RING       | set       | u32                        | Create FILL ring with N entries       |
| SOL_XDP | XDP_UMEM_COMPLETION_RING | set       | u32                        | Create COMPLETION ring with N entries |
| SOL_XDP | XDP_MMAP_OFFSETS         | get       | XdpMmapOffsets (128 bytes) | Ring layout offsets for mmap          |
| SOL_XDP | XDP_STATISTICS           | get       | XskStats                   | Per-socket statistics                 |
| SOL_XDP | XDP_OPTIONS              | get       | XdpOptions                 | Current socket options                |

Ring sizes must be powers of two. Maximum ring size: 32768 entries (sufficient for 100 Gbps line rate with batched processing).
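The size check shared by the four XDP_*_RING handlers is small enough to sketch in full (the function name is illustrative):

```rust
// Ring-size validation for the XDP_*_RING setsockopt handlers.
const XSK_MAX_RING_ENTRIES: u32 = 32_768;

/// Validates a requested ring size and returns the index mask
/// (size - 1) used for index-to-slot conversion.
pub fn validate_ring_size(entries: u32) -> Result<u32, &'static str> {
    if entries == 0 || !entries.is_power_of_two() {
        return Err("EINVAL: ring size must be a non-zero power of two");
    }
    if entries > XSK_MAX_RING_ENTRIES {
        return Err("EINVAL: ring size exceeds maximum");
    }
    Ok(entries - 1)
}
```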

/// Offsets returned by getsockopt(SOL_XDP, XDP_MMAP_OFFSETS).
/// Userspace uses these to locate ring fields after mmap().
/// Layout matches Linux struct xdp_mmap_offsets.
#[repr(C)]
pub struct XdpMmapOffsets {
    pub rx: XdpRingOffset,
    pub tx: XdpRingOffset,
    pub fill: XdpRingOffset,
    pub completion: XdpRingOffset,
}
// UAPI ABI: 4 × XdpRingOffset(32) = 128 bytes.
// Matches Linux struct xdp_mmap_offsets from include/uapi/linux/if_xdp.h.
const_assert!(size_of::<XdpMmapOffsets>() == 128);

16.26.13 Comparison with Alternatives

| Feature            | AF_XDP                      | AF_PACKET MMAP                 | DPDK PMD                       |
|--------------------|-----------------------------|--------------------------------|--------------------------------|
| Per-packet latency | <100 ns                     | ~1-5 us                        | ~50-80 ns                      |
| Data copy          | Zero-copy (zc mode)         | One copy (kernel to mmap ring) | Zero-copy (DMA to hugepage)    |
| Kernel bypass      | Partial (XDP in kernel)     | No                             | Full (UIO/VFIO)                |
| Container-safe     | Yes (net namespace)         | Yes (net namespace)            | No (requires VFIO passthrough) |
| BPF filtering      | Yes (XDP program)           | Yes (cBPF/eBPF SO_ATTACH)      | No (application-level only)    |
| Multi-tenant       | Yes (per-queue XSKMAP)      | Yes (PACKET_FANOUT)            | Difficult (SR-IOV VFs)         |
| Kernel visibility  | Full (XDP sees all packets) | Full (copies all packets)      | None (kernel bypassed)         |
| Driver requirement | XDP support (any mode)      | None                           | UIO or VFIO binding            |
| Fallback           | Copy mode (any NIC)         | Always works                   | None (driver-specific)         |

AF_XDP is the recommended choice for high-performance packet processing workloads that must coexist with the kernel network stack (routing, firewalling, monitoring). For workloads that require absolute minimum latency and can dedicate entire NICs, DPDK PMD via VFIO passthrough (Section 18.5) remains an option.

16.26.14 Shared UMEM

The XDP_SHARED_UMEM flag allows multiple AF_XDP sockets to share a single UMEM allocation. This is the standard deployment pattern for multi-queue NICs: one AF_XDP socket per RX queue, all backed by the same UMEM.

Sharing protocol:

1. The first socket (the "owner") registers the UMEM via setsockopt(SOL_XDP, XDP_UMEM_REG) and creates the FILL and COMPLETION rings. This socket owns the UMEM lifecycle.
2. Subsequent sockets set XDP_SHARED_UMEM in SockAddrXdp.sxdp_flags and provide the owner socket's file descriptor in sxdp_shared_umem_fd.
3. At bind() time, the kernel resolves sxdp_shared_umem_fd to the owner XskSocket, validates that the owner's UMEM is registered, and increments XskUmem.refcount.
4. Sharing sockets create their own RX and TX rings (per-socket) but do not create separate FILL/COMPLETION rings -- they share the owner's rings. Each socket receives packets only from its bound queue.
5. When a sharing socket closes, XskUmem.refcount is decremented. The UMEM pages are unpinned and DMA mappings torn down only when the last socket (including the owner) closes.

FILL/COMPLETION ring sharing: All sockets that share a UMEM share the same FILL and COMPLETION rings. This means userspace must coordinate chunk allocation across queues. In practice, the application thread pool assigns chunks from a shared free list and submits them to the FILL ring. Completed TX chunks appear on the shared COMPLETION ring regardless of which queue transmitted them.

Per-queue isolation: Despite shared UMEM, each socket has independent RX and TX rings. Packets arriving on queue N are delivered only to the socket bound to queue N. There is no cross-queue interference in the data path.

16.26.15 Per-Architecture Notes

AF_XDP ring buffer correctness depends on cache coherency between the kernel and userspace (for the shared ring indices and descriptors) and between the CPU and the NIC (for UMEM DMA). Each architecture requires specific handling:

| Architecture | Cache line | Ring alignment | DMA coherence | ZC support |
|--------------|------------|----------------|---------------|------------|
| x86-64 | 64 B | Producer/consumer indices on separate 64 B cache lines to prevent false sharing | Fully cache-coherent DMA (PCIe snooping). No explicit flush needed for UMEM pages. CLFLUSHOPT available but unnecessary for DMA. | Full zero-copy support (primary target). |
| AArch64 | 64 B or 128 B (micro-arch dependent) | Indices aligned to 64 B minimum; 128 B alignment on cores with 128 B cache lines (A710, V-series). | Non-coherent DMA on many SoCs. Kernel issues dc civac (clean and invalidate) on UMEM pages before programming RX descriptors and after TX completion. SMMU (IOMMU) required for zero-copy to provide DMA address translation. | Full zero-copy where SMMU is present. Copy mode on SoCs without SMMU. |
| ARMv7 | 32 B | Indices aligned to 32 B cache lines. | Non-coherent DMA. Kernel issues dma_sync_single_for_device() / dma_sync_single_for_cpu() on UMEM pages. | Copy mode only. Most ARMv7 SoCs lack IOMMU support, and the 32-bit address space limits practical UMEM size. Zero-copy is architecturally possible but not a priority target. |
| RISC-V 64 | 64 B (platform dependent) | Indices aligned to 64 B. | Cache coherence varies by platform. On non-coherent platforms (most current SoCs), explicit fence and cache management ops are required; fence iorw, iorw for device ordering. | Copy mode by default. Zero-copy available on platforms with an IOMMU (e.g., future RISC-V server SoCs implementing the RISC-V IOMMU spec). |
| PPC32 | 32 B | Indices aligned to 32 B. | dcbst + sync for DMA write-back to UMEM pages. Limited by 32-bit physical addressing. | Copy mode only. 32-bit address space and limited IOMMU support preclude practical zero-copy. |
| PPC64LE | 128 B | Indices aligned to 128 B cache lines (POWER9/10). | dcbst + sync for explicit cache writeback. eieio for load/store ordering where lwsync is insufficient. POWER9+ has coherent I/O but explicit sync is still required for non-cacheable MMIO regions. | Full zero-copy support where the PCIe IOMMU (PHB) is configured. |

Cache line alignment for ring indices: The producer and consumer indices of each ring (XskRing.producer and XskRing.consumer) are placed on separate cache lines to prevent false sharing between the kernel-side and userspace-side accessors. The mmap'd ring layout allocates each index at a cache-line-aligned offset. The XdpRingOffset offsets returned by getsockopt(XDP_MMAP_OFFSETS) reflect the target architecture's cache line size.

Memory ordering on weakly-ordered architectures: The ring protocol uses store_release / load_acquire pairs (see Ring Buffer Protocol above). On AArch64, these compile to stlr / ldar instructions. On RISC-V, the compiler emits fence rw,w (before the release store) and fence r,rw (after the acquire load). On PPC64, lwsync provides the required ordering. On PPC32 (e500), sync is used instead (lwsync causes an Illegal Instruction trap on e500v1/v2). On x86-64, the total store ordering (TSO) memory model makes these barriers free -- they compile to plain loads and stores.

16.26.16 Performance Budget

AF_XDP performance targets are specified per-packet for 64-byte minimum-size Ethernet frames, representing the worst-case packet rate (highest packets per second for a given link speed). Larger frames amortize per-packet overhead across more bytes and achieve higher throughput.

| Metric | Zero-copy mode | Copy mode | Budget source |
|--------|----------------|-----------|---------------|
| Per-packet latency (64 B frames) | ≤100 ns | ≤300 ns | Total path: XDP redirect + ring enqueue + userspace wakeup |
| Ring enqueue (kernel side) | ≤15 ns | ≤15 ns | One store_release + descriptor write (1 cache line) |
| UMEM chunk allocation (FILL ring consume) | ≤10 ns | ≤10 ns | One load_acquire + index increment |
| Data copy (copy mode only) | -- | ≤200 ns | memcpy of ~64 B payload + metadata (dominated by cache miss on UMEM chunk if cold) |
| XDP BPF program execution | ≤30 ns | ≤30 ns | Typical 5-20 instruction XDP redirect program (JIT'd) |
| Userspace wakeup (poll() notification) | ≤50 ns | ≤50 ns | Wake-up via epoll (amortised across batch) |
| Batch amortization (64 packets) | ≤2 ns/pkt overhead | ≤5 ns/pkt overhead | NAPI poll processes up to 64 packets per cycle |

Throughput at line rate:

| Link speed | Wire rate (64 B frames) | AF_XDP ZC target | AF_XDP copy target |
|------------|-------------------------|------------------|--------------------|
| 10 GbE | 14.88 Mpps | 14.88 Mpps (single core) | 10+ Mpps (single core) |
| 25 GbE | 37.2 Mpps | 25+ Mpps (single core, multi-queue) | 15+ Mpps (multi-core) |
| 100 GbE | 148.8 Mpps | 100+ Mpps (multi-queue, multi-core) | 50+ Mpps (multi-core) |

These targets assume NAPI batch sizes of 64, XDP_USE_NEED_WAKEUP enabled to avoid spurious syscalls, and userspace processing in batch mode (consuming multiple ring entries per poll() wakeup). Single-core figures assume dedicated CPU affinity for both the NAPI softirq and the userspace thread.
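The wire-rate column follows from Ethernet framing arithmetic: a 64-byte frame (FCS included) costs an extra 20 bytes on the wire (7 B preamble + 1 B SFD + 12 B interframe gap), i.e. 672 bits per minimum-size frame. A sketch reproducing the table's figures:

```rust
/// Ethernet wire rate in packets per second for a given link speed
/// and frame size (frame_bytes includes the 4-byte FCS).
pub fn wire_rate_pps(link_bps: u64, frame_bytes: u64) -> u64 {
    // 20 bytes of per-frame overhead: preamble + SFD + interframe gap.
    link_bps / ((frame_bytes + 20) * 8)
}
```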

16.26.17 DPDK Migration Path

AF_XDP provides a kernel-integrated alternative to DPDK poll-mode drivers (PMDs) that eliminates several operational pain points of full kernel bypass:

DPDK AF_XDP PMD: DPDK includes a built-in net_af_xdp driver that exposes an AF_XDP socket as a standard DPDK port. Applications using the DPDK rte_eth_* API can switch from a hardware PMD (e.g., net_ice, net_mlx5) to net_af_xdp with a single command-line change:

# DPDK with hardware PMD (full kernel bypass; net_ice binds the PCI device):
dpdk-app -a 0000:03:00.0 ...

# DPDK with AF_XDP PMD (kernel-integrated):
dpdk-app --vdev=net_af_xdp,iface=eth0,queue_count=4 ...

Migration benefits:

| Concern | DPDK PMD (kernel bypass) | AF_XDP | Improvement |
|---------|--------------------------|--------|-------------|
| Hugepage reservation | Requires pre-allocated hugepages (typically 1-8 GB) | Standard anonymous mmap for UMEM (hugepage optional) | Simpler deployment, no boot-time config |
| Kernel network stack | Bypassed entirely; no routing, firewall, monitoring | Coexists with full kernel stack (XDP steers specific flows) | Selective offload, not all-or-nothing |
| Container support | Requires VFIO passthrough or SR-IOV VF assignment | Works with network namespaces natively | Per-container AF_XDP without hardware partitioning |
| NIC sharing | One DPDK app owns the entire NIC (or VF) | Multiple AF_XDP sockets + kernel stack share the same NIC | Multi-tenant on shared infrastructure |
| Monitoring | No kernel visibility; custom PMD instrumentation only | Full kernel tracing (Section 20.2), XDP stats, standard ethtool counters | Unified observability |
| Security | Requires CAP_SYS_RAWIO or VFIO group access | Requires CAP_NET_RAW (namespace-scoped) | Namespace-aware privilege model |

When to stay with DPDK PMD: For workloads requiring absolute minimum latency (sub-50 ns per packet) where the NIC is fully dedicated to one application, DPDK hardware PMDs via VFIO passthrough (Section 18.5) still offer ~20-30% lower latency than AF_XDP zero-copy due to eliminating the XDP BPF evaluation and the kernel-side ring management.

16.26.18 Tier Assignment

AF_XDP runs in Tier 0 (in-kernel, statically linked), same as AF_PACKET (Section 16.25).

Rationale: The AF_XDP fast path sits inside the XDP redirect handler within the NAPI poll context (Section 16.14). On each packet redirect, the kernel must:

1. Look up the target socket in the XSKMAP (RCU read, 1 cache line).
2. Consume a FILL ring entry (1 atomic load + 1 store).
3. In copy mode: memcpy packet data from NetBuf to UMEM chunk.
4. Write an XdpDesc to the RX ring (1 cache line write).
5. Advance the RX ring producer index (1 atomic store).

These operations touch shared memory (ring indices, UMEM pages) that are mapped into both kernel and userspace. Introducing a domain switch between the NAPI context and the AF_XDP delivery path would add ~23 cycles per redirect on x86-64. At 14.88 Mpps (10 GbE line rate, 64-byte frames), this would consume ~341 million cycles/sec -- approximately 10% of a 3.5 GHz core, which would negate AF_XDP's latency advantage over AF_PACKET MMAP.
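The cost estimate above is simple arithmetic, which a sketch can make explicit (the helper name is illustrative; 23 cycles is the figure assumed in the text):

```rust
/// Fraction of one CPU core consumed by a fixed per-packet cost at a
/// given packet rate: (cycles/packet x packets/sec) / (cycles/sec).
pub fn core_fraction(cycles_per_pkt: u64, pps: u64, core_hz: u64) -> f64 {
    (cycles_per_pkt * pps) as f64 / core_hz as f64
}

// 23 cycles x 14.88 Mpps ~= 342M cycles/sec, ~10% of a 3.5 GHz core.
```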

AF_XDP's attack surface is contained: it requires CAP_NET_RAW, operates on pre-pinned userspace pages validated at bind() time, and the ring protocol is a simple SPSC pattern with bounded indices. The XDP BPF program that steers packets to AF_XDP is already isolated in its own domain (Section 16.18). No driver-provided code executes in the AF_XDP delivery path -- it is entirely umka-net code.

Cross-references:

  • XDP program model (XdpContext, XdpAction): Section 16.5
  • NetBuf (copy mode source/destination): Section 16.5
  • NAPI poll integration: Section 16.14
  • NetDeviceOps (TX path for copy mode): Section 16.13
  • DMA subsystem (UMEM page pinning): Section 4.14
  • BPF/eBPF subsystem (XDP program attachment): Section 16.18
  • VFIO passthrough (alternative for full bypass): Section 18.5
  • Network namespace isolation: Section 17.1
  • Socket common state: Section 16.3
  • Capability checks: Section 16.3
  • AF_PACKET comparison: Section 16.25


16.27 802.1Q VLAN Subsystem

IEEE 802.1Q Virtual LANs allow a single Ethernet link to carry traffic for multiple logical networks. Each frame carries a 4-byte tag inserted between the source MAC address and the EtherType field; the 12-bit VLAN ID (VID) partitions broadcast domains on shared infrastructure without rewiring. UmkaOS implements a full 802.1Q VLAN subsystem inside umka-net (Section 16.2), at the link layer, below the IPv4/IPv6 network layer.

Linux parallel: Linux implements 802.1Q in net/8021q/ with the 8021q kernel module. UmkaOS's VLAN subsystem provides full API and behavioral compatibility so that vconfig, ip link, and bridge vlan commands work on UmkaOS without modification.

16.27.1 Overview

An 802.1Q tag is inserted into an Ethernet frame immediately after the 6-byte source MAC address. The tag occupies exactly 4 bytes:

 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|         TPID (0x8100)         |PCP|D|        VID (12 bits)   |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
  • TPID (Tag Protocol Identifier, 16 bits): 0x8100 for 802.1Q; 0x88A8 for 802.1ad (QinQ outer tag). The TPID occupies the same position as the EtherType of an untagged frame: any value ≥ 0x0600 in the 2-byte field after the source MAC is either an ordinary EtherType (untagged frame) or a recognised TPID (tagged frame), which is how receivers distinguish the two.
  • PCP (Priority Code Point, 3 bits): 802.1p QoS priority, 0–7. Mapped to internal skb_priority by the ingress priority map.
  • DEI (Drop Eligible Indicator, 1 bit, formerly CFI): set by upstream equipment to indicate this frame may be dropped under congestion.
  • VID (VLAN Identifier, 12 bits): 1–4094 usable values. VID 0 is reserved (priority-only tag, no VLAN membership). VID 4095 is reserved by the standard.

QinQ (802.1ad): Service provider networks use double tagging to tunnel customer VLANs (C-VLAN) across provider networks. The outer tag uses TPID 0x88A8 (S-VLAN, "service tag"); the inner tag uses 0x8100 (C-VLAN, "customer tag"). UmkaOS models the outer tag as a VlanProto::Dot1AD VLAN device stacked on top of a VlanProto::Dot1Q device. The link-layer transmit path inserts both tags, outer first.
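The TCI bit layout above can be expressed as a pair of pack/unpack helpers. This is an illustrative sketch; the function names are not part of the specification:

```rust
/// Pack PCP (3 bits), DEI (1 bit), and VID (12 bits) into a 16-bit TCI.
/// Callers must pass pcp in 0..=7 and vid in 1..=4094 (see field table above).
fn tci_pack(pcp: u16, dei: bool, vid: u16) -> u16 {
    debug_assert!(pcp <= 7 && vid <= 4094);
    (pcp << 13) | ((dei as u16) << 12) | vid
}

/// Unpack a TCI into (pcp, dei, vid).
fn tci_unpack(tci: u16) -> (u16, bool, u16) {
    ((tci >> 13) & 0x7, (tci >> 12) & 1 == 1, tci & 0x0FFF)
}
```

The same masks reappear on the receive path (Section 16.27.4), where the VID and PCP are extracted from NetBuf.vlan_tci.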

16.27.2 VLAN Device Model

UmkaOS models each VLAN as a virtual NetDevice — a VlanDev — that sits on top of a real ("lower") NetDevice. Multiple VlanDev instances can share the same lower device, each with a distinct VID. The VLAN device has its own MAC address (inherited from the lower device by default, but overridable), its own ARP/NDP state, and its own routing table entries. Its MTU is lower.mtu - 4 to account for the tag bytes.

/// A virtual 802.1Q or 802.1ad VLAN network device.
///
/// Sits on top of a lower `NetDevice` and presents a logical interface
/// restricted to one VLAN ID. Transmit inserts (or requests hardware to
/// insert) the 802.1Q/802.1ad tag; receive strips the tag and demultiplexes.
pub struct VlanDev {
    /// The real ("lower") network device this VLAN rides on.
    pub lower: Arc<NetDevice>,
    /// VLAN ID in the range 1..=4094 (VID 0 and 4095 are reserved).
    pub vlan_id: u16,
    /// Tag protocol: 802.1Q (0x8100) or 802.1ad / QinQ outer tag (0x88A8).
    pub vlan_proto: VlanProto,
    /// Feature flags controlling VLAN device behaviour.
    pub flags: VlanDevFlags,
    /// Ingress priority map: PCP value (0–7) → internal skb_priority.
    /// Populated via IFLA_VLAN_INGRESS_QOS or ioctl SIOCSIFVLAN.
    pub ingress_priority_map: [u8; 8],
    /// Egress priority map: internal skb_priority → PCP value (0–7).
    /// XArray<u8> keyed by skb_priority (u32, zero-extended to usize).
    /// Value is PCP code (0-7).
    pub egress_priority_map: XArray<u8>,
    /// The VLAN's own NetDevice handle (MAC, stats, queue disciplines).
    /// **MTU**: The VLAN device's MTU automatically tracks the lower device's
    /// MTU minus 4 bytes (the 802.1Q tag overhead). When the lower device's
    /// MTU changes (via SIOCSIFMTU or netlink), the VLAN device's MTU is
    /// adjusted accordingly (NETDEV_CHANGEMTU event). The VLAN MTU can also
    /// be set independently to a value ≤ (lower_mtu - 4), but setting it
    /// higher than (lower_mtu - 4) returns `-ERANGE`.
    pub netdev: NetDevice,
}

/// Tag protocol identifier distinguishing 802.1Q from 802.1ad QinQ.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
#[repr(u16)]
pub enum VlanProto {
    /// IEEE 802.1Q customer VLAN tag (TPID = 0x8100).
    Dot1Q  = 0x8100,
    /// IEEE 802.1ad service provider outer tag (TPID = 0x88A8).
    Dot1AD = 0x88A8,
}

bitflags::bitflags! {
    /// Behavioural flags for a VLAN device.
    pub struct VlanDevFlags: u32 {
        /// Reorder Ethernet header on ingress for efficient higher-layer processing.
        const REORDER_HDR    = 1 << 0;
        /// Enable GVRP (GARP VLAN Registration Protocol) on this VLAN interface.
        const GVRP           = 1 << 1;
        /// Loose binding: VLAN device stays UP even if lower device is DOWN.
        const LOOSE_BINDING  = 1 << 2;
        /// Enable MVRP (Multiple VLAN Registration Protocol) on this interface.
        const MVRP           = 1 << 3;
        /// Bridge binding: VLAN follows bridge master state instead of lower device.
        const BRIDGE_BINDING = 1 << 4;
    }
}

VLAN devices are created and destroyed through the netlink RTM_NEWLINK / RTM_DELLINK path (Section 16.17). The VLAN subsystem registers a LinkOps implementation named "vlan" with the netlink link-type registry; the kernel dispatches RTM_NEWLINK with IFLA_INFO_KIND = "vlan" to this handler.

16.27.3 Transmit Path

When userspace writes to a socket bound through a VLAN device, the packet reaches the VLAN device's ndo_start_xmit entry point. The transmit path proceeds as follows:

  1. PCP selection: Look up the packet's skb_priority in egress_priority_map. If no entry exists, use PCP 0. Encode as the upper 3 bits of the TCI (Tag Control Information) field: tci = (pcp << 13) | (dei << 12) | vlan_id.

  2. Hardware offload check: Inspect the lower device's feature flags for NETIF_F_HW_VLAN_CTAG_TX (hardware 802.1Q TX offload) or NETIF_F_HW_VLAN_STAG_TX (hardware 802.1ad TX offload).

  3. Offload available: Store tci in NetBuf.vlan_tci and set NetBuf.vlan_present = true. Pass the unmodified frame to the lower device. The NIC inserts the 4-byte tag at the correct position in hardware, saving a memmove of the MAC header.

  4. No offload: Prepend the tag in software. Call netbuf_push_vlan_tag(buf, vlan_proto, tci) which expands the headroom by 4 bytes, memmoves the 12-byte MAC header (6 bytes DA + 6 bytes SA) 4 bytes toward the start of the buffer, and writes the 4-byte tag at byte offset 12. The frame is then handed to the lower device's transmit function.

  5. Lower device enqueue: The (possibly tag-inserted) NetBuf is passed to the lower NetDevice's transmit path, which applies traffic control (qdisc) and enqueues to the NIC ring.

QinQ transmit (outer tag 0x88A8, inner tag 0x8100) follows the same path twice: the inner VlanDev inserts or requests the C-VLAN tag, then the outer VlanDev inserts or requests the S-VLAN tag. Hardware NIC offload for double-tagging requires NETIF_F_HW_VLAN_STAG_TX; if absent, both tags are inserted in software in two sequential passes.
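Step 4's software insertion can be sketched on a plain byte buffer. This is a simplified stand-in for the netbuf_push_vlan_tag operation described above: real code expands NetBuf headroom and moves the MAC header in place rather than copying into a new buffer.

```rust
/// Insert a 4-byte 802.1Q/802.1ad tag into an untagged Ethernet frame.
/// `frame` holds DA(6) + SA(6) + EtherType(2) + payload; the tag lands
/// at byte offset 12, between the source MAC and the original EtherType.
fn push_vlan_tag(frame: &[u8], tpid: u16, tci: u16) -> Vec<u8> {
    debug_assert!(frame.len() >= 14);
    let mut out = Vec::with_capacity(frame.len() + 4);
    out.extend_from_slice(&frame[..12]);        // DA + SA unchanged
    out.extend_from_slice(&tpid.to_be_bytes()); // TPID at offset 12
    out.extend_from_slice(&tci.to_be_bytes());  // TCI at offset 14
    out.extend_from_slice(&frame[12..]);        // original EtherType + payload
    out
}
```

A QinQ frame is produced by applying this twice: first with TPID 0x8100 (C-VLAN), then with TPID 0x88A8 (S-VLAN), so the service tag ends up outermost.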

16.27.4 Receive Path

On receive, the lower NIC driver delivers frames to umka-net's generic receive entry point. The VLAN receive path runs before the frame is dispatched to the network layer:

  1. Hardware tag strip check: If the NIC reported NETIF_F_HW_VLAN_CTAG_RX (or NETIF_F_HW_VLAN_STAG_RX for QinQ), the tag has already been stripped by the NIC and its TCI value is stored in NetBuf.vlan_tci with vlan_present = true. Proceed to step 3.

  2. Software tag detection: Inspect the EtherType field at byte offset 12 of the frame. If it equals VlanProto::Dot1Q as u16 (0x8100) or VlanProto::Dot1AD as u16 (0x88A8), a tag is present. Extract the 4-byte TCI, store in NetBuf.vlan_tci, set vlan_present = true. Remove the 4 tag bytes: memmove the 12-byte MAC header 4 bytes toward the end of the buffer (restoring the original untagged layout), advance the frame data pointer.

  3. VID lookup: Extract vid = vlan_tci & 0x0FFF (lower 12 bits). Walk the lower device's VLAN device table (an XArray<Arc<VlanDev>> keyed on VID) to find the registered VlanDev for this VID.

  4. Ingress priority mapping: Extract pcp = (vlan_tci >> 13) & 0x7. Map via VlanDev.ingress_priority_map[pcp] to set NetBuf.priority for upstream QoS.

  5. Dispatch: If a matching VlanDev is found, deliver the frame to that device's receive queue. If no VlanDev is registered for this VID:
       • If a trunk port is configured on the lower device (bridge mode, Section 16.27.7), deliver to the bridge for forwarding.
       • Otherwise, drop the frame and increment the lower device's rx_dropped counter.
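Steps 2-3 of the receive path can be sketched as a standalone parse/strip helper. This is illustrative only (the real path mutates the NetBuf in place and stores the TCI in NetBuf.vlan_tci rather than returning it):

```rust
/// Detect and strip an 802.1Q/802.1ad tag from a received frame.
/// Returns Some((tpid, tci, untagged_frame)) if a tag was present at
/// byte offset 12, or None for an untagged (or too-short) frame.
fn strip_vlan_tag(frame: &[u8]) -> Option<(u16, u16, Vec<u8>)> {
    if frame.len() < 18 {
        return None; // too short to hold MAC header + tag + EtherType
    }
    let tpid = u16::from_be_bytes([frame[12], frame[13]]);
    if tpid != 0x8100 && tpid != 0x88A8 {
        return None; // ordinary EtherType, no tag present
    }
    let tci = u16::from_be_bytes([frame[14], frame[15]]);
    let mut out = Vec::with_capacity(frame.len() - 4);
    out.extend_from_slice(&frame[..12]); // MAC header (DA + SA)
    out.extend_from_slice(&frame[16..]); // EtherType + payload, tag removed
    Some((tpid, tci, out))
}
```

The caller then performs the VID lookup with `tci & 0x0FFF` and the priority mapping with `(tci >> 13) & 0x7`, as in steps 3-4.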

16.27.5 GARP and MRP

GARP (Generic Attribute Registration Protocol) is defined in IEEE 802.1D-2004 Annex 12. It provides a distributed mechanism for network nodes to register attributes (such as VLAN membership) with 802.1D-capable switches. Each GARP participant runs an applicant state machine and a registrar state machine per attribute:

  • Applicant: drives Join/Leave declaration transmission. States: VO (Very Anxious Observer), VP (Very Anxious Passive), VN (Very Anxious New), AN (Anxious New), AA (Anxious Active), QA (Quiet Active), LA (Leaving Active), AO (Anxious Observer), QO (Quiet Observer), AP (Anxious Passive), QP (Quiet Passive), LO (Leaving Observer). Transitions are triggered by application events (Join, Leave, New) and protocol timers (Join timer, Leave timer, LeaveAll timer).

  • Registrar: tracks whether an attribute has been declared by a remote participant. States: IN (registered), LV (leaving — Leave message received, awaiting Leave timer expiry), MT (empty — not registered).

GVRP (GARP VLAN Registration Protocol) is the application of GARP to VLAN membership. A GVRP-enabled VLAN device periodically advertises its VID to adjacent 802.1D bridges, which propagate the registration through the spanning tree so that VLANs are dynamically provisioned on trunk links.

MRP (Multiple Registration Protocol) (IEEE 802.1ak-2007, incorporated into 802.1Q beginning with the 2011 revision) is the successor to GARP. MRP improves scalability and adds support for multiple applications sharing a single PDU exchange:

  • MVRP (Multiple VLAN Registration Protocol): MRP application for VLAN membership. Replaces GVRP on 802.1Q-2011 and later switches.
  • MMRP (Multiple MAC Registration Protocol): MRP application for multicast group membership (Ethernet MAC addresses), enabling efficient multicast pruning in IEEE 802.1Q bridges.

UmkaOS implements MVRP as a Tier 1 kernel worker in umka-net. The MRP engine is structured around the MrpApplication trait:

/// A registered MRP application (e.g., MVRP or MMRP).
pub trait MrpApplication: Send + Sync {
    /// Application identifier (e.g., MVRP_APPLICATION_ID = 0x0021).
    fn application_id(&self) -> u16;

    /// Encode all locally declared attributes into a MRP PDU vector for transmission.
    fn encode_pdu(&self, buf: &mut NetBuf);

    /// Process a received MRP PDU and update local registrar state accordingly.
    fn process_pdu(&self, buf: &NetBuf, port: &MrpPort) -> Result<(), KernelError>;

    /// Called on LeaveAll timer expiry: re-declare all active attributes.
    fn on_leave_all(&self);
}

/// Per-port MRP state.
pub struct MrpPort {
    /// The network device this MRP port is attached to.
    pub netdev: Arc<NetDevice>,
    /// Periodic Join timer handle (default: 200 ms, IEEE 802.1Q-2018 Table 10-7).
    pub join_timer: TimerHandle,
    /// Leave timer handle (default: 600 ms).
    pub leave_timer: TimerHandle,
    /// LeaveAll timer handle (default: 10 s).
    pub leave_all_timer: TimerHandle,
    /// Registered applications sharing this port (MVRP, MMRP).
    /// Effectively constant-size: IEEE 802.1Q defines only 2 standard MRP
    /// applications (MVRP, MMRP). Max 4 including potential vendor extensions.
    pub applications: ArrayVec<Arc<dyn MrpApplication>, 4>,
}

/// Per-attribute applicant state machine state (IEEE 802.1Q-2018 Table 10-3).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum MrpApplicantState {
    /// Very Anxious Observer: no declaration, observing.
    VO,
    /// Very Anxious Passive: declaration pending (no active registrar known).
    VP,
    /// Very Anxious New: new declaration, needs immediate transmission.
    VN,
    /// Anxious New: declaration queued, waiting for join period.
    AN,
    /// Anxious Active: declaration active, waiting for join period.
    AA,
    /// Quiet Active: declaration active and acknowledged.
    QA,
    /// Leaving Active: Leave sent, awaiting Leave timer.
    LA,
    /// Anxious Observer: observing, join pending.
    AO,
    /// Quiet Observer: passively observing a remote declaration.
    QO,
    /// Anxious Passive: passive, join pending.
    AP,
    /// Quiet Passive: passive, idle.
    QP,
    /// Leaving Observer: Leave in progress, was observing.
    LO,
}

/// Per-attribute registrar state (IEEE 802.1Q-2018 Table 10-4).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum MrpRegistrarState {
    /// Attribute is registered (a remote Join has been received and is current).
    IN,
    /// Attribute is leaving (Leave received; leave_timer running).
    LV,
    /// Attribute is not registered.
    MT,
}

The MRP engine drives these state machines on two triggers:

  • Timer expiry (join_timer, leave_timer, leave_all_timer): The mrp_timer kernel worker fires and calls mrp_port_timer_work(), which iterates all registered applications on the port, advances state machines, and transmits pending PDUs via mrp_encode_and_send().

  • PDU receive: The link-layer receive path recognises MRP Ethernet frames by their destination multicast MAC (01:80:C2:00:00:21 for MVRP) and dispatches them to mrp_rcv(), which decodes the PDU and calls MrpApplication::process_pdu().

MVRP's process_pdu updates the VLAN membership database: a JoinIn or JoinMt vector attribute for VID v causes the VLAN subsystem to ensure VID v is admitted on the ingress port of the bridge; a LeaveIn or LeaveMt attribute triggers removal after the leave timer expires.
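The registrar side of this machinery (the three-state IN/LV/MT machine above) is small enough to sketch directly. Event names here are illustrative, and the administrative Fixed/Forbidden registrar controls from the standard are omitted:

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum RegState { In, Lv, Mt } // mirrors MrpRegistrarState above

#[derive(Debug, Clone, Copy)]
enum RegEvent { RxJoin, RxLeave, LeaveTimerExpired }

/// Advance a registrar state machine on one event (IEEE 802.1Q-2018
/// Table 10-4, simplified). A Join from the remote participant always
/// (re)registers the attribute, cancelling any in-progress leave.
fn registrar_step(state: RegState, ev: RegEvent) -> RegState {
    match (state, ev) {
        (_, RegEvent::RxJoin) => RegState::In,             // register / rejoin
        (RegState::In, RegEvent::RxLeave) => RegState::Lv, // start leave timer
        (RegState::Lv, RegEvent::LeaveTimerExpired) => RegState::Mt, // deregister
        (s, _) => s,                                       // other events ignored
    }
}
```

This is exactly the behaviour described for MVRP above: a Leave attribute moves the VID's registrar to LV, and actual removal happens only when the leave timer expires without an intervening Join.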

16.27.6 Userspace Interface

VLAN devices are managed through two interfaces:

Netlink (preferred, iproute2): ip link add link eth0 name eth0.100 type vlan id 100 protocol 802.1Q. Internally this sends RTM_NEWLINK with:

IFLA_LINK        → ifindex of lower device (eth0)
IFLA_IFNAME      → "eth0.100"
IFLA_LINKINFO:
  IFLA_INFO_KIND → "vlan"
  IFLA_INFO_DATA:
    IFLA_VLAN_ID         → 100
    IFLA_VLAN_PROTOCOL   → ETH_P_8021Q (0x8100) or ETH_P_8021AD (0x88A8)
    IFLA_VLAN_FLAGS      → vlan_flags (e.g., VLAN_FLAG_REORDER_HDR)
    IFLA_VLAN_INGRESS_QOS → list of { from: u32, to: u32 } mappings
    IFLA_VLAN_EGRESS_QOS  → list of { from: u32, to: u32 } mappings

Deletion: ip link del eth0.100RTM_DELLINK. Query: ip -d link show eth0.100RTM_GETLINK with IFLA_INFO_DATA in reply.

Legacy ioctl (vconfig compatibility): The ioctl(SIOCGIFVLAN, SIOCSIFVLAN) path is supported for the legacy vconfig tool. vconfig add eth0 100 translates to SIOCSIFVLAN with cmd = ADD_VLAN_CMD. vconfig rem eth0.100cmd = DEL_VLAN_CMD. vconfig set_flag eth0.100 1 1cmd = SET_VLAN_FLAG_CMD.

/proc/net/vlan/: Read-only informational interface:

/proc/net/vlan/
|-- config         (one line per VLAN device: name, VID, lower device)
`-- eth0.100       (per-device stats: rx/tx bytes, packets, dropped)

The /proc/net/vlan/ tree is implemented via umka's procfs (Section 20.5) dynamic file model. Reads are satisfied without locks by reading per-CPU stats counters and summing them.

16.27.7 Bridge Integration

A VLAN-aware bridge operates in one of two modes:

  • VLAN-unaware (default): The bridge forwards frames based on MAC address alone; 802.1Q tags are treated as opaque payload. This is the traditional Linux bridge behaviour.

  • VLAN-aware (vlan_filtering = 1): Each bridge port maintains a per-port VLAN filter table. Frames arriving on a port are subject to ingress VLAN filtering; frames leaving a port are subject to egress tagging rules.

The per-port VLAN filter table:

/// A single entry in a bridge port's VLAN filter table.
///
/// Parsed form; wire format is `IFLA_BRIDGE_VLAN_INFO` (u16 flags + u16 vid).
/// This struct is kernel-internal and does not cross KABI boundaries, so `bool`
/// fields are acceptable (no non-0/1 byte risk from external sources).
pub struct BridgeVlanEntry {
    /// VLAN ID this entry applies to (1..=4094).
    pub vid: u16,
    /// If true, this VID is the port's PVID: untagged ingress frames are
    /// assigned this VID for forwarding decisions.
    pub pvid: bool,
    /// If true, frames egressing on this port with this VID are sent untagged
    /// (the tag is stripped on egress).
    pub untagged: bool,
    /// Master flag: this entry was added by the bridge itself (not user-space).
    pub master: bool,
    /// BRFORWARD: this VID is forwarded (not filtered on ingress).
    pub brforward: bool,
}

Bridge port VLAN configuration is done via netlink RTM_SETLINK with IFLA_BRIDGE_VLAN_INFO nested attributes (one per VID, with BRIDGE_VLAN_INFO_PVID and BRIDGE_VLAN_INFO_UNTAGGED flags). The bridge vlan command (from iproute2) uses this interface:

bridge vlan add dev eth0 vid 100 pvid untagged
bridge vlan del dev eth0 vid 200
bridge vlan show

On ingress: if the frame arrives untagged, assign PVID. Perform ingress VLAN filter lookup; drop if the port has no entry for the VID. On egress: if the port's entry for the VID has untagged = true, strip the tag before transmitting.

The bridge's FDB (Forwarding Database) is keyed on (MAC, VID) in VLAN-aware mode, enabling separate MAC learning per VLAN domain.
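The ingress decision above can be sketched as follows. A list of (vid, pvid) pairs stands in for the port's BridgeVlanEntry table, and the helper name is illustrative:

```rust
/// Ingress: resolve the VID used for the forwarding decision, or None
/// to drop the frame. `tag_vid` is Some(vid) for tagged frames and
/// None for untagged frames (which are assigned the port's PVID).
fn bridge_ingress_vid(table: &[(u16, bool)], tag_vid: Option<u16>) -> Option<u16> {
    let vid = match tag_vid {
        Some(v) => v,
        // Untagged: assign the PVID; drop if the port has no PVID entry.
        None => table.iter().find(|e| e.1)?.0,
    };
    // Ingress filter: drop if the port has no entry for this VID.
    if table.iter().any(|e| e.0 == vid) { Some(vid) } else { None }
}
```

Egress is the mirror image: the forwarding decision produces a (port, VID) pair, and the port's entry for that VID determines whether the tag is stripped (untagged = true) before transmission.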

16.27.8 Linux Compatibility

UmkaOS's 802.1Q VLAN subsystem is fully compatible with Linux's 8021q, garp, mrp, and bridge VLAN subsystems:

  • vconfig (legacy): ADD_VLAN_CMD, DEL_VLAN_CMD, SET_VLAN_FLAG_CMD, SET_VLAN_NAME_TYPE_CMD ioctls all work.
  • ip link (iproute2): all IFLA_VLAN_* netlink attributes parsed and applied.
  • bridge vlan (iproute2): IFLA_BRIDGE_VLAN_INFO and the bridge netlink API work.
  • MVRP/MMRP MRP PDU format matches Linux's mrp module (IEEE 802.1ak-2007 wire format).
  • /proc/net/vlan/config and per-interface stat files present with identical format.
  • Ethtool VLAN offload feature flags (NETIF_F_HW_VLAN_CTAG_TX/RX, NETIF_F_HW_VLAN_STAG_TX/RX) are honoured identically to Linux.

16.28 NIC Bonding and Link Aggregation

Bonding combines multiple physical network interfaces into a single logical interface, providing link-level redundancy, increased throughput, and seamless failover. UmkaOS implements a full bonding subsystem inside umka-net (Section 16.2), at the link layer, below the IPv4/IPv6 network layer. The bond device is a virtual NetDevice (Section 16.13) that distributes traffic across its slave interfaces according to a configurable mode.

Linux parallel: Linux implements bonding in drivers/net/bonding/ with the bonding kernel module (modes 0-6). UmkaOS's bonding subsystem provides full API and behavioral compatibility: the same sysfs interface (/sys/class/net/bond0/bonding/*), the same netlink attributes (IFLA_BOND_*), and the same mode semantics. Unmodified iproute2, ifenslave, and NetworkManager bond configurations work on UmkaOS without modification.

Use cases: Server high availability (active-backup for switch-independent failover), datacenter link aggregation (802.3ad LACP for switch-negotiated multi-link bandwidth), load balancing across paths (modes 2/5/6), and broadcast fault tolerance (mode 3).

16.28.1 BondDevice Structure

/// Bond master device — a virtual network device aggregating multiple slaves.
///
/// The bond device registers as a `NetDevice` via `register_netdev()` and
/// implements `NetDeviceOps`. It does not own hardware resources; its TX path
/// selects a slave and delegates to that slave's `start_xmit()`. Its RX path
/// receives frames from slaves via the slave's RX handler hook.
pub struct BondDevice {
    /// The bond's network device (name, ifindex, stats, qdisc, etc.).
    pub dev: Arc<NetDevice>,

    /// Bond mode (determines TX distribution and failover behavior).
    /// Set at creation time; changing mode requires the bond to be down
    /// with no slaves attached (matching Linux semantics).
    pub mode: BondMode,

    /// Slave list — ordered by priority. Used for admin operations (slave
    /// add/remove), link monitoring, and failover scans. Protected by a
    /// spinlock because slave add/remove is a warm-path admin operation.
    /// The array is bounded at `MAX_BOND_SLAVES` (32), which covers all
    /// production configurations (Linux imposes no fixed per-bond slave
    /// limit; typical deployments use 2-8 slaves).
    ///
    /// **NOT used on the TX hot path.** The TX path reads `usable_slaves`
    /// (below) under RCU protection to avoid spinlock contention.
    pub slaves: SpinLock<ArrayVec<BondSlave, MAX_BOND_SLAVES>>,

    /// RCU-protected snapshot of usable (link-up, not being removed)
    /// slaves for the TX hot path. Rebuilt by `bond_update_slave_arr()`
    /// whenever a slave's link state changes or slaves are added/removed.
    ///
    /// The TX path reads this via `usable_slaves.rcu_read()` — no
    /// spinlock, no contention. At 10 Gbps (~800K pps), this saves
    /// ~35-55 cycles per packet vs the SpinLock on `slaves`.
    /// Matches Linux's `bond->usable_slaves` design.
    pub usable_slaves: RcuCell<ArrayVec<Arc<BondSlave>, MAX_BOND_SLAVES>>,

    /// Currently active slave (for active-backup, balance-tlb, balance-alb).
    /// RCU-protected: TX hot path reads without locking; link monitor or
    /// admin failover writes under RCU update. `None` when no slave is UP.
    pub active_slave: RcuCell<Option<Arc<BondSlave>>>,

    /// Bond parameters (configurable via sysfs/netlink).
    pub params: SpinLock<BondParams>,

    /// LACP state machine (only for mode 4 / 802.3ad). `None` for all
    /// other modes. Protected by its own spinlock because LACP timer
    /// callbacks and RX PDU processing are independent of the slave list.
    pub lacp: Option<SpinLock<LacpState>>,

    /// Link monitoring configuration and timer state.
    pub link_monitor: BondLinkMonitor,

    /// TX hash function for load-balancing modes (0, 2, 4).
    /// Read on the TX hot path (no lock — `Relaxed` atomic load of the
    /// discriminant). Changed via sysfs/netlink on the warm path.
    /// Setter validation: sysfs/netlink write path validates value against
    /// XmitHashPolicy discriminant range [0, 5] before store. TX hot path
    /// loads with Relaxed ordering and converts via match (not transmute).
    /// The match default arm uses a safe fallback (`XmitHashPolicy::Layer2`)
    /// instead of `unreachable_unchecked()`. The branch predictor never
    /// takes the default arm under normal operation, so the safe fallback
    /// has identical performance. This guards against DRAM bitflips, Tier 1
    /// crash-induced memory corruption, and future code changes that might
    /// forget to validate the setter path.
    pub xmit_hash_policy: AtomicU8, // XmitHashPolicy discriminant

    /// Per-CPU statistics: packets/bytes per direction, failover count.
    /// Aggregated by `get_stats64()` across all CPUs.
    pub stats: PerCpu<BondStats>,

    /// Round-robin TX counter (mode 0 only). Incremented atomically on
    /// each `start_xmit()` call; slave index = counter % active_slave_count.
    pub rr_tx_counter: AtomicU64,

    /// Number of gratuitous ARP/ND announcements to send after failover.
    /// Default: 1. Configurable via `num_grat_arp` / `num_unsol_na`.
    pub num_grat_arp: AtomicU8,
}

/// Maximum number of slaves in a single bond. 32 covers all production
/// server and datacenter configurations. Environments needing more
/// aggregated links should use switch-side LAG with fewer bond members.
pub const MAX_BOND_SLAVES: usize = 32;

/// Per-CPU bond statistics. Aggregated across CPUs by `get_stats64()`.
pub struct BondStats {
    pub tx_packets: u64,
    pub tx_bytes: u64,
    pub rx_packets: u64,
    pub rx_bytes: u64,
    pub tx_errors: u64,
    pub rx_errors: u64,
    /// Number of failover events observed on this CPU (incremented by
    /// the link monitor when it switches the active slave).
    pub failover_count: u64,
}

16.28.2 Bond Modes

/// Bond operating mode — determines TX path selection and failover behavior.
///
/// These discriminant values match Linux's `BOND_MODE_*` constants exactly
/// (linux/if_bonding.h) to ensure netlink and sysfs compatibility.
#[repr(u8)]
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum BondMode {
    /// Mode 0: Round-robin — packets distributed sequentially across slaves.
    /// Pro: maximum aggregate throughput for multi-flow workloads.
    /// Con: single TCP flows may see out-of-order delivery (mitigated by
    /// `packets_per_slave` parameter). Requires static LAG or no switch
    /// awareness.
    BalanceRr = 0,

    /// Mode 1: Active-backup — only one slave carries traffic; others are
    /// standby. On failure, the bond selects a new active slave.
    /// Pro: works with any switch (no LAG configuration required).
    /// Con: no TX load balancing; aggregate bandwidth = single link speed.
    ActiveBackup = 1,

    /// Mode 2: XOR — TX slave selected by hash of packet headers modulo
    /// slave count. Hash function is configurable via `xmit_hash_policy`.
    /// Pro: deterministic per-flow distribution across slaves.
    /// Con: requires static LAG on the switch side.
    BalanceXor = 2,

    /// Mode 3: Broadcast — every packet transmitted on all slaves.
    /// Pro: maximum fault tolerance (packet survives if any slave delivers).
    /// Con: wastes bandwidth (each frame sent N times); not for throughput.
    Broadcast = 3,

    /// Mode 4: IEEE 802.3ad dynamic link aggregation (LACP).
    /// Pro: standards-based, switch-negotiated; supports bandwidth and
    /// failover simultaneously.
    /// Con: requires LACP-capable switch partner.
    Lacp8023ad = 4,

    /// Mode 5: Adaptive transmit load balancing — TX distributed by
    /// current slave load (measured by byte count per interval). No switch
    /// support needed: each slave's MAC is used for TX, so the switch sees
    /// independent unicast sources.
    /// Con: RX arrives on a single slave only (the active slave's MAC).
    BalanceTlb = 5,

    /// Mode 6: Adaptive load balancing — TX balanced as in mode 5; RX
    /// balanced by ARP negotiation (bond intercepts ARP replies and
    /// rewrites the source MAC to distribute RX across slaves).
    /// Pro: TX+RX balancing without switch support.
    /// Con: ARP-based RX balancing is IPv4-only; IPv6 RX is not balanced.
    BalanceAlb = 6,
}
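Slave selection for the stateless TX modes can be sketched as follows. This is illustrative: the layer-2 XOR hash shown is only one of the configurable xmit_hash_policy options, and the helper name is not part of the spec. Mode discriminants match BondMode above.

```rust
/// Select a TX slave index from the usable-slaves array for modes 0
/// (round-robin) and 2 (XOR). `rr_counter` is the value fetched from
/// the atomically incremented rr_tx_counter. Other modes transmit on
/// the active slave, represented here by index 0.
fn select_slave(mode: u8, rr_counter: u64, src_mac: &[u8; 6],
                dst_mac: &[u8; 6], n_slaves: usize) -> usize {
    match mode {
        // BalanceRr: sequential distribution across usable slaves.
        0 => (rr_counter % n_slaves as u64) as usize,
        // BalanceXor with a layer-2 policy: hash of the MAC addresses,
        // so each (src, dst) flow sticks to one slave deterministically.
        2 => ((src_mac[5] ^ dst_mac[5]) as usize) % n_slaves,
        // ActiveBackup / other modes: the single active slave carries traffic.
        _ => 0,
    }
}
```

Because mode 2's selection is a pure function of the packet headers, a given flow never reorders across slaves, which is why it needs a static LAG on the switch while mode 0 tolerates reordering in exchange for per-packet spreading.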

16.28.3 BondSlave

/// Per-slave state within a bond. One `BondSlave` exists for each physical
/// NIC enslaved to the bond master.
pub struct BondSlave {
    /// The slave's physical network device.
    pub dev: Arc<NetDevice>,

    /// Link state as detected by the link monitor.
    /// Encoded as `BondLinkState` discriminant.
    pub link_state: AtomicU8,

    /// Slave priority (lower value = higher priority for active-backup
    /// selection). Default: 0. Configurable via netlink `IFLA_BOND_SLAVE_PRIO`.
    pub priority: i32,

    /// Link speed in Mbps (queried from the slave's `EthtoolOps::get_link_ksettings()`).
    /// Updated on enslave and on link-up events. Used by LACP aggregator
    /// selection (mode 4) and TLB load distribution (modes 5/6).
    pub speed: u32,

    /// Duplex mode (queried from ethtool). Slaves must have matching speed
    /// and duplex to form an 802.3ad aggregator.
    pub duplex: Duplex,

    /// Original MAC address (saved before the bond overwrites the slave's
    /// MAC). Restored on release from the bond.
    pub original_mac: [u8; 6],

    /// LACP port state (mode 4 only). `None` for non-LACP modes.
    pub lacp_port: Option<LacpPortState>,

    /// Timestamp of last confirmed link-up event (monotonic nanoseconds).
    /// Used by ARP monitoring to detect silent link failures.
    pub last_link_up: AtomicU64,

    /// Queue ID for per-slave traffic steering. Userspace can direct
    /// specific flows to a specific slave via `tc` or socket options.
    /// 0 means no queue steering (default hash-based selection).
    pub queue_id: u16,

    /// Milliseconds remaining until DOWN transition completes. Set when
    /// entering `Fail` state; decremented each `miimon` interval.
    /// 0 when not in a delay state.
    pub downdelay_remaining: u32,

    /// Milliseconds remaining until UP transition completes. Set when
    /// entering `Back` state; decremented each `miimon` interval.
    /// 0 when not in a delay state.
    pub updelay_remaining: u32,
}

/// Slave link state as detected by the bond's link monitor.
/// Discriminant values match Linux `BOND_LINK_*` from
/// `include/uapi/linux/if_bonding.h`.
#[repr(u8)]
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum BondLinkState {
    /// Link is up and carrying traffic (BOND_LINK_UP = 0).
    Up = 0,
    /// Link has just gone down — transitional state while `downdelay`
    /// timer runs (BOND_LINK_FAIL = 1). If carrier returns before
    /// downdelay expires, transitions back to `Up`.
    Fail = 1,
    /// Link has been down for longer than `downdelay` — confirmed dead
    /// (BOND_LINK_DOWN = 2). MII monitor promoted from `Fail` after
    /// downdelay expired.
    Down = 2,
    /// Link was down but carrier has been re-detected — transitional
    /// state while `updelay` timer runs (BOND_LINK_BACK = 3). If
    /// carrier is lost again before updelay expires, transitions back
    /// to `Down`.
    Back = 3,
}
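The updelay/downdelay handling can be sketched as a single MII-monitor tick. This simplified model counts miimon intervals rather than milliseconds, and omits the failover actions the real monitor triggers on the Fail to Down edge:

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum LinkState { Up, Fail, Down, Back } // mirrors BondLinkState above

/// One MII-monitor tick for a slave. `carrier` is the current PHY link
/// status; `delay` holds the remaining miimon intervals while in the
/// transitional Fail/Back states.
fn miimon_tick(state: LinkState, carrier: bool, delay: &mut u32,
               downdelay: u32, updelay: u32) -> LinkState {
    match (state, carrier) {
        (LinkState::Up, false) => { *delay = downdelay; LinkState::Fail } // start downdelay
        (LinkState::Fail, true) => LinkState::Up,                         // flap recovered
        (LinkState::Fail, false) => {
            if *delay == 0 { LinkState::Down } else { *delay -= 1; LinkState::Fail }
        }
        (LinkState::Down, true) => { *delay = updelay; LinkState::Back }  // start updelay
        (LinkState::Back, false) => LinkState::Down,                      // carrier lost again
        (LinkState::Back, true) => {
            if *delay == 0 { LinkState::Up } else { *delay -= 1; LinkState::Back }
        }
        (s, _) => s, // Up+carrier and Down+no-carrier are stable
    }
}
```

The two transitional states exist to debounce flapping links: a brief carrier loss bounces Up → Fail → Up without ever triggering a failover, and a brief carrier return bounces Down → Back → Down without prematurely re-enlisting the slave.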

16.28.4 LACP (802.3ad) Protocol

When the bond operates in mode 4 (Lacp8023ad), it runs the IEEE 802.3ad Link Aggregation Control Protocol to dynamically negotiate aggregation groups with the switch partner. LACP is mandatory for mode 4; no other mode uses it.

/// Bond-wide LACP state (mode 4 only).
pub struct LacpState {
    /// Actor (local) system information.
    pub actor: LacpSystemInfo,
    /// Partner (remote switch) system information, learned from received LACPDUs.
    pub partner: LacpSystemInfo,
    /// Selected aggregator: groups slaves by matching partner system ID + key.
    /// Only slaves in the selected aggregator carry traffic.
    pub aggregator: LacpAggregator,
    /// Aggregator selection policy.
    pub ad_select: AdSelect,
    /// LACPDU transmission rate: slow (30s) or fast (1s).
    pub lacp_rate: LacpRate,
}

/// System identification for LACP actor/partner.
pub struct LacpSystemInfo {
    /// System priority (lower = higher priority). Used to determine which
    /// system's port priorities govern aggregator selection. Default: 65535.
    pub system_priority: u16,
    /// System MAC address (6 bytes). Together with `system_priority`, forms
    /// the System Identifier (IEEE 802.3ad §5.2.4).
    pub system_mac: [u8; 6],
    /// Operational key. Ports with the same key can be aggregated. Derived
    /// from link speed and duplex (ports at different speeds get different keys).
    pub key: u16,
}

/// Per-slave LACP port state machine.
pub struct LacpPortState {
    /// Actor port state flags (transmitted in LACPDUs).
    pub actor_state: LacpStateFlags,
    /// Partner port state flags (received from partner LACPDUs).
    pub partner_state: LacpStateFlags,
    /// Actor port number (unique within the system, typically ifindex & 0xFFFF).
    pub actor_port_number: u16,
    /// Actor port priority (lower = higher priority). Default: 255.
    pub actor_port_priority: u16,
    /// Partner port number (learned from received LACPDUs).
    pub partner_port_number: u16,
    /// Partner port priority (learned from received LACPDUs).
    pub partner_port_priority: u16,
    /// Number of LACPDUs received since last reset.
    pub rx_lacpdu_count: u64,
    /// Number of LACPDUs transmitted since last reset.
    pub tx_lacpdu_count: u64,
}

bitflags::bitflags! {
    /// LACP state flags (IEEE 802.3ad §5.4.2). Carried in the Actor_State
    /// and Partner_State fields of each LACPDU.
    #[repr(transparent)]
    pub struct LacpStateFlags: u8 {
        /// LACP_Activity: port actively sends LACPDUs (vs. passive, only responds).
        const ACTIVITY       = 1 << 0;
        /// LACP_Timeout: short timeout (fast rate, 3s expiry) vs. long (90s).
        const TIMEOUT        = 1 << 1;
        /// Aggregation: port is aggregatable (vs. individual link).
        const AGGREGATION    = 1 << 2;
        /// Synchronization: port is in sync with the aggregator.
        const SYNCHRONIZATION = 1 << 3;
        /// Collecting: port is accepting inbound frames for the aggregator.
        const COLLECTING     = 1 << 4;
        /// Distributing: port is transmitting outbound frames for the aggregator.
        const DISTRIBUTING   = 1 << 5;
        /// Defaulted: partner information is using default values (no LACPDU
        /// received from partner yet, or partner timed out).
        const DEFAULTED      = 1 << 6;
        /// Expired: partner information has expired (awaiting next LACPDU).
        const EXPIRED        = 1 << 7;
    }
}
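To make the flag semantics concrete, here is a minimal sketch of composing the Actor_State byte for a healthy port, using plain constants in place of the `bitflags!` macro so the example stands alone. The function name is illustrative, not part of the specified API.

```rust
// Bit positions mirror LacpStateFlags above (IEEE 802.3ad §5.4.2).
const ACTIVITY: u8 = 1 << 0;        // active LACP (sends unprompted LACPDUs)
const TIMEOUT: u8 = 1 << 1;         // short timeout (fast rate)
const AGGREGATION: u8 = 1 << 2;     // port is aggregatable
const SYNCHRONIZATION: u8 = 1 << 3; // in sync with the aggregator
const COLLECTING: u8 = 1 << 4;      // accepting inbound frames
const DISTRIBUTING: u8 = 1 << 5;    // transmitting outbound frames

/// Actor_State for a fast-rate, active, aggregatable port that has
/// fully joined its aggregator. DEFAULTED and EXPIRED stay clear
/// because fresh partner information is present.
fn steady_state_actor_flags() -> u8 {
    ACTIVITY | TIMEOUT | AGGREGATION | SYNCHRONIZATION | COLLECTING | DISTRIBUTING
}

fn main() {
    // Six operational bits set: 0b0011_1111 = 0x3F.
    assert_eq!(steady_state_actor_flags(), 0x3F);
    println!("actor_state = {:#04x}", steady_state_actor_flags());
}
```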

LACPDU frame format: LACPDUs are transmitted as Slow Protocol frames (EtherType 0x8809, subtype 0x01). The frame layout is:

| Offset | Length | Field |
|--------|--------|-------|
| 0 | 1 | Subtype (0x01 = LACP) |
| 1 | 1 | Version (0x01) |
| 2 | 1 | Actor TLV type (0x01) |
| 3 | 1 | Actor TLV length (0x14 = 20 bytes) |
| 4 | 2 | Actor System Priority |
| 6 | 6 | Actor System MAC |
| 12 | 2 | Actor Key |
| 14 | 2 | Actor Port Priority |
| 16 | 2 | Actor Port Number |
| 18 | 1 | Actor State (LacpStateFlags) |
| 19 | 3 | Reserved |
| 22 | 1 | Partner TLV type (0x02) |
| 23 | 1 | Partner TLV length (0x14 = 20 bytes) |
| 24-41 | 18 | Partner fields (same layout as Actor) |
| 42 | 1 | Collector TLV type (0x03) |
| 43 | 1 | Collector TLV length (0x10 = 16 bytes) |
| 44 | 2 | Collector Max Delay |
| 46 | 12 | Reserved |
| 58 | 1 | Terminator TLV type (0x00) |
| 59 | 1 | Terminator TLV length (0x00) |
| 60 | 50 | Reserved (pad to 110 bytes total) |

/// On-wire LACPDU frame payload (IEEE 802.3ad / 802.1AX).
/// All multi-byte integer fields use network byte order (big-endian)
/// per IEEE specification. Kernel-internal structs (`LacpSystemInfo`,
/// `LacpPortState`) use host byte order; convert via `to_be()` when
/// constructing on-wire LACPDUs and via `from_be()` when parsing.
///
/// **Offset table** (each TLV includes its 2-byte type+length header):
///
/// | Offset | Size | Field |
/// |--------|------|-------|
/// | 0-1    | 2    | subtype + version |
/// | 2-3    | 2    | Actor TLV header (type=0x01, length=20) |
/// | 4-21   | 18   | Actor TLV payload (priority+MAC+key+port_prio+port+state+reserved) |
/// | 22-23  | 2    | Partner TLV header (type=0x02, length=20) |
/// | 24-41  | 18   | Partner TLV payload |
/// | 42-43  | 2    | Collector TLV header (type=0x03, length=16) |
/// | 44-57  | 14   | Collector TLV payload (max_delay + reserved) |
/// | 58-59  | 2    | Terminator TLV (type=0x00, length=0x00) |
/// | 60-109 | 50   | Reserved padding |
/// | **Total** | **110** | |
#[repr(C, packed)]
pub struct LacpduFrame {
    pub subtype:                 u8,        // offset 0: 0x01 = LACP
    pub version:                 u8,        // offset 1: 0x01

    // Actor TLV (offset 2-21: 2-byte header + 18-byte payload = 20 bytes)
    pub actor_type:              u8,        // offset 2: 0x01
    pub actor_length:            u8,        // offset 3: 20 (includes type+length)
    pub actor_system_priority:   Be16,      // offset 4
    pub actor_system:            [u8; 6],   // offset 6: MAC address
    pub actor_key:               Be16,      // offset 12
    pub actor_port_priority:     Be16,      // offset 14
    pub actor_port:              Be16,      // offset 16
    pub actor_state:             u8,        // offset 18: LacpStateFlags
    pub _actor_reserved:         [u8; 3],   // offset 19

    // Partner TLV (offset 22-41: 2-byte header + 18-byte payload = 20 bytes)
    pub partner_type:            u8,        // offset 22: 0x02
    pub partner_length:          u8,        // offset 23: 20
    pub partner_system_priority: Be16,      // offset 24
    pub partner_system:          [u8; 6],   // offset 26: MAC address
    pub partner_key:             Be16,      // offset 32
    pub partner_port_priority:   Be16,      // offset 34
    pub partner_port:            Be16,      // offset 36
    pub partner_state:           u8,        // offset 38: LacpStateFlags
    pub _partner_reserved:       [u8; 3],   // offset 39

    // Collector TLV (offset 42-57: 2-byte header + 14-byte payload = 16 bytes)
    pub collector_type:          u8,        // offset 42: 0x03
    pub collector_length:        u8,        // offset 43: 16
    pub collector_max_delay:     Be16,      // offset 44
    pub _collector_reserved:     [u8; 12],  // offset 46

    // Terminator TLV (offset 58-59)
    pub terminator_type:         u8,        // offset 58: 0x00
    pub terminator_length:       u8,        // offset 59: 0x00

    // Padding to 110-byte LACPDU payload (offset 60-109)
    pub _reserved:               [u8; 50],
}
const_assert!(core::mem::size_of::<LacpduFrame>() == 110);
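The 110-byte total in the `const_assert!` falls out of the TLV lengths directly; this small sketch reproduces the arithmetic (each TLV length field counts its own 2-byte type+length header):

```rust
// Sum of the LACPDU TLV sizes from the layout table above.
fn lacpdu_payload_len() -> usize {
    let header = 2;      // subtype + version
    let actor = 20;      // Actor TLV (type=0x01, length=0x14)
    let partner = 20;    // Partner TLV (type=0x02, length=0x14)
    let collector = 16;  // Collector TLV (type=0x03, length=0x10)
    let terminator = 2;  // Terminator TLV (type=0x00, length=0x00)
    let pad = 50;        // reserved padding
    header + actor + partner + collector + terminator + pad
}

fn main() {
    assert_eq!(lacpdu_payload_len(), 110);
}
```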

Periodic TX: The bond transmits LACPDUs at the rate determined by the partner's TIMEOUT flag:

  • Partner TIMEOUT = 0 (long): transmit every 30 seconds; partner times out after 90 seconds.
  • Partner TIMEOUT = 1 (short): transmit every 1 second; partner times out after 3 seconds.

The local lacp_rate parameter (slow or fast) controls the actor's own TIMEOUT flag, which tells the partner how frequently to send LACPDUs to us.

Aggregator selection: Ports are grouped into aggregators by matching (partner_system_priority, partner_system_mac, partner_key). Only one aggregator is active at a time. The ad_select parameter controls which aggregator is chosen when multiple candidates exist:

/// 802.3ad aggregator selection policy.
#[repr(u8)]
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum AdSelect {
    /// Stable: select the aggregator with the most ports. Once selected,
    /// do not reselect unless the active aggregator has no ports left.
    /// This is the default and minimizes disruption.
    Stable = 0,
    /// Bandwidth: select the aggregator with the highest aggregate bandwidth
    /// (sum of slave speeds). Reselect whenever a higher-bandwidth aggregator
    /// becomes available.
    Bandwidth = 1,
    /// Count: select the aggregator with the most ports. Reselect whenever
    /// a numerically larger aggregator becomes available.
    Count = 2,
}
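The following is a hedged sketch of how the three policies rank candidate aggregators. `CandidateAgg` and `pick_aggregator` are illustrative names, not part of the specified kernel API; note that a pure selection function cannot capture the one behavioral difference between Stable and Count, which is *when* reselection is triggered, not which aggregator wins.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum AdSelect { Stable, Bandwidth, Count }

/// Hypothetical summary of one candidate aggregator.
#[derive(Debug, Clone, Copy, PartialEq)]
struct CandidateAgg {
    id: u16,
    ports: usize,
    bandwidth_mbps: u64, // sum of member link speeds
}

fn pick_aggregator(policy: AdSelect, cands: &[CandidateAgg]) -> Option<CandidateAgg> {
    match policy {
        // Stable and Count both prefer the most ports; Stable simply
        // refuses to reselect while the active aggregator has ports.
        AdSelect::Stable | AdSelect::Count => {
            cands.iter().copied().max_by_key(|a| a.ports)
        }
        AdSelect::Bandwidth => cands.iter().copied().max_by_key(|a| a.bandwidth_mbps),
    }
}

fn main() {
    let cands = [
        CandidateAgg { id: 1, ports: 3, bandwidth_mbps: 3_000 },  // 3x 1G
        CandidateAgg { id: 2, ports: 2, bandwidth_mbps: 20_000 }, // 2x 10G
    ];
    // Count prefers more ports; Bandwidth prefers aggregate speed.
    assert_eq!(pick_aggregator(AdSelect::Count, &cands).unwrap().id, 1);
    assert_eq!(pick_aggregator(AdSelect::Bandwidth, &cands).unwrap().id, 2);
}
```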

16.28.5 Link Monitoring

The bond monitors slave link health through two independent mechanisms. Both can be enabled simultaneously (MII for fast carrier detection, ARP for end-to-end reachability verification).

16.28.5.1 MII Monitoring

MII monitoring polls each slave's carrier state via the slave's EthtoolOps::get_link() (Section 16.13) at a configurable interval.

/// MII link monitoring configuration.
pub struct MiiMonitor {
    /// Polling interval in milliseconds. 0 = disabled. Default: 0.
    /// Typical production value: 100 ms.
    pub miimon: u32,
    /// Time (ms) a slave must remain link-up before being declared UP.
    /// Must be a multiple of `miimon`. Default: 0.
    /// Prevents rapid flapping on unstable links.
    pub updelay: u32,
    /// Time (ms) a slave must remain link-down before being declared DOWN.
    /// Must be a multiple of `miimon`. Default: 0.
    /// Avoids premature failover on transient carrier drops.
    pub downdelay: u32,
    /// Use carrier state (`netif_carrier_ok()`) instead of MII/ethtool
    /// for link detection. Default: true (carrier is more reliable on
    /// modern NICs; MII register polling is a legacy fallback).
    pub use_carrier: bool,
}

MII monitor algorithm (runs every miimon ms on a dedicated kernel timer):

mii_monitor_tick(bond):
    for slave in bond.slaves.lock().iter_mut():
        link_ok = slave.dev.ethtool_ops.get_link(&slave.dev)
                  OR slave.dev.carrier.load(Relaxed)  // if use_carrier

        match (slave.link_state, link_ok):
            (Up, false):
                slave.link_state = Fail  // start downdelay
                slave.downdelay_remaining = bond.params.downdelay
            (Fail, false):
                slave.downdelay_remaining -= miimon
                if slave.downdelay_remaining <= 0:
                    slave.link_state = Down  // confirm link dead
                    bond_slave_link_down(bond, slave)
            (Fail, true):
                // Link recovered during downdelay — cancel transition.
                slave.link_state = Up
            (Down, true):
                slave.link_state = Back  // start updelay
                slave.updelay_remaining = bond.params.updelay
            (Back, true):
                slave.updelay_remaining -= miimon
                if slave.updelay_remaining <= 0:
                    slave.link_state = Up  // confirm link recovered
                    bond_slave_link_up(bond, slave)
            (Back, false):
                // Link dropped again during updelay — cancel transition.
                slave.link_state = Down
            (Up, true) | (Down, false):
                // Steady state — no action.
                pass
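The per-slave transition step above is a pure function of the current state, the observed carrier, and the remaining delay, so it can be sketched as runnable Rust. The tuple-returning `mii_step` signature is illustrative, not the specified kernel interface:

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum BondLinkState { Up, Fail, Down, Back }

/// One monitor tick for one slave. Returns the new state, the remaining
/// delay in ms, and whether a confirmed up/down event fired this tick.
fn mii_step(
    state: BondLinkState,
    link_ok: bool,
    remaining_ms: i64,
    miimon: i64,
    updelay: i64,
    downdelay: i64,
) -> (BondLinkState, i64, bool) {
    use BondLinkState::*;
    match (state, link_ok) {
        (Up, false) => (Fail, downdelay, false),   // start downdelay
        (Fail, false) => {
            let left = remaining_ms - miimon;
            if left <= 0 { (Down, 0, true) } else { (Fail, left, false) }
        }
        (Fail, true) => (Up, 0, false),            // recovered in time
        (Down, true) => (Back, updelay, false),    // start updelay
        (Back, true) => {
            let left = remaining_ms - miimon;
            if left <= 0 { (Up, 0, true) } else { (Back, left, false) }
        }
        (Back, false) => (Down, 0, false),         // dropped again
        (Up, true) | (Down, false) => (state, remaining_ms, false), // steady
    }
}

fn main() {
    // miimon = 100 ms, downdelay = 200 ms: a carrier loss needs two more
    // ticks after entering Fail before the slave is confirmed Down.
    let (s, r, _) = mii_step(BondLinkState::Up, false, 0, 100, 0, 200);
    assert_eq!((s, r), (BondLinkState::Fail, 200));
    let (s, r, fired) = mii_step(s, false, r, 100, 0, 200);
    assert_eq!((s, r, fired), (BondLinkState::Fail, 100, false));
    let (s, _, fired) = mii_step(s, false, r, 100, 0, 200);
    assert_eq!((s, fired), (BondLinkState::Down, true));
}
```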

16.28.5.2 ARP Monitoring

ARP monitoring verifies end-to-end reachability by sending ARP requests to configured target IP addresses and checking for replies. This detects failures invisible to MII (e.g., switch misconfiguration, upstream router failure, VLAN mismatch).

/// ARP link monitoring configuration.
pub struct ArpMonitor {
    /// ARP probe interval in milliseconds. 0 = disabled. Default: 0.
    pub arp_interval: u32,
    /// Target IPv4 addresses to probe. Up to 16 targets.
    /// At least one target must be configured when ARP monitoring is enabled.
    pub arp_ip_target: ArrayVec<Ipv4Addr, 16>,
    /// Which slaves must receive ARP replies to be considered UP.
    pub arp_validate: ArpValidate,
    /// Whether any single target reply suffices, or all targets must reply.
    pub arp_all_targets: ArpAllTargets,
}

/// ARP validation mode — controls which slaves must see ARP replies.
#[repr(u8)]
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ArpValidate {
    /// No validation: any ARP activity on the slave counts as proof of life.
    None = 0,
    /// Validate only on the active slave.
    Active = 1,
    /// Validate only on backup (inactive) slaves.
    Backup = 2,
    /// Validate on all slaves (active + backup).
    All = 3,
}

/// ARP target response requirement.
#[repr(u8)]
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ArpAllTargets {
    /// Any single target responding is sufficient.
    Any = 0,
    /// All configured targets must respond.
    All = 1,
}

A slave is declared DOWN if no validated ARP reply is received within arp_interval * 3 (three missed probe intervals), mirroring the missed-probe tolerance of Linux's ARP monitor.

16.28.6 TX Hash Policies

For the hashed load-balancing modes — 2 (balance-xor) and 4 (802.3ad) — the bond hashes packet headers to select a slave; round-robin (mode 0) cycles a per-bond counter instead. The hash function is configurable:

/// TX hash policy — determines how load-balancing modes distribute packets.
///
/// Discriminant values match Linux's `BOND_XMIT_POLICY_*` constants
/// (linux/if_bonding.h) for netlink/sysfs compatibility.
#[repr(u8)]
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum XmitHashPolicy {
    /// Layer 2: hash(src_mac, dst_mac). Works for all traffic including
    /// non-IP. Poor distribution if most traffic shares the same MAC pair.
    Layer2 = 0,
    /// Layer 3+4: hash(src_ip, dst_ip, src_port, dst_port). Best per-flow
    /// distribution for TCP/UDP. Falls back to Layer2 for non-IP traffic.
    Layer34 = 1,
    /// Layer 2+3: hash(src_mac, dst_mac, src_ip, dst_ip). Compromise
    /// between Layer2 and Layer34; useful for non-TCP/UDP IP traffic.
    Layer23 = 2,
    /// Encapsulated 2+3: hash inner L2+L3 headers for tunneled traffic
    /// (VXLAN, GRE, Geneve). Falls back to outer headers if inner headers
    /// are not parseable. Ensures tunnel flows distribute across slaves.
    Encap23 = 3,
    /// Encapsulated 3+4: hash inner L3+L4 headers for tunneled traffic.
    /// Best per-flow distribution for tunneled TCP/UDP workloads.
    Encap34 = 4,
    /// VLAN + source MAC: hash(vlan_id, src_mac). Designed for multi-VLAN
    /// trunk topologies where each VLAN should use a different slave.
    VlanSrcMac = 5,
}

TX path (bond_start_xmit):

All multi-slave modes read usable_slaves under RCU protection (no spinlock). bond_update_slave_arr() is called on every slave state change (link up/down, slave add/remove) to rebuild the RCU snapshot from the slaves list under the slaves SpinLock.

bond_start_xmit(bond, buf):
    let _rcu = rcu_read_lock();
    let usable = bond.usable_slaves.rcu_read();
    let active_count = usable.len();
    if active_count == 0: return Err(ENETDOWN)

    match bond.mode:
        BalanceRr:
            // Note: rr_tx_counter races with usable_slaves snapshot updates.
            // This matches Linux bonding mode 0 behavior -- best-effort
            // round-robin distribution with momentary unevenness on slave
            // count changes.
            idx = bond.rr_tx_counter.fetch_add(1, Relaxed) % active_count
            slave = &usable[idx]
        ActiveBackup:
            slave = bond.active_slave.rcu_read()  // RCU-protected read
        BalanceXor | Lacp8023ad:
            // Relaxed load of xmit_hash_policy: the policy is an AtomicU8
            // discriminant set by admin via sysfs/netlink (rare, cold-path).
            // During a policy change, some packets may hash with the old
            // policy while others use the new one. This is benign: traffic
            // distribution temporarily becomes uneven for in-flight packets.
            // Linux uses READ_ONCE (equivalent to Relaxed) and accepts the
            // same race window. No ordering guarantee is needed relative to
            // other fields — the policy value is self-contained.
            let policy_raw = bond.xmit_hash_policy.load(Relaxed);
            let policy = match XmitHashPolicy::try_from(policy_raw) {
                Ok(p) => p,
                Err(_) => XmitHashPolicy::Layer2, // safe fallback
            };
            hash = xmit_hash(buf, policy)
            slave = &usable[hash % active_count]
        Broadcast:
            // NetBufHandle::clone() increments the underlying NetBuf reference
            // count (shared data, COW semantics). All slaves' DMA engines read
            // from the same physical buffer. The DMA mapping is shared; it is
            // unmapped only when the last NetBufHandle referencing the buffer is
            // dropped. This matches Linux skb_clone() semantics.
            for slave in usable.iter():
                slave.dev.ops.start_xmit(&slave.dev, buf.clone())
            return Ok(())
        BalanceTlb | BalanceAlb:
            slave = tlb_select_slave(bond, buf)  // least-loaded slave

    slave.dev.ops.start_xmit(&slave.dev, buf)

TX hash function:

fn xmit_hash(buf: &NetBuf, policy: XmitHashPolicy) -> u32:
    match policy:
        XmitHashPolicy::Layer2:
            // hash(src_mac, dst_mac)
            jhash_2words(buf.src_mac_u32(), buf.dst_mac_u32(), 0)
        XmitHashPolicy::Layer34:
            // hash(src_ip, dst_ip, src_port, dst_port)
            // Falls back to Layer2 for non-IP traffic.
            if let Some((sip, dip, sp, dp)) = buf.l34_tuple():
                jhash_3words(sip, dip, (sp as u32) << 16 | dp as u32, 0)
            else:
                jhash_2words(buf.src_mac_u32(), buf.dst_mac_u32(), 0)
        XmitHashPolicy::Layer23:
            // hash(src_mac, dst_mac, src_ip, dst_ip)
            if let Some((sip, dip)) = buf.l3_tuple():
                jhash_3words(buf.src_mac_u32() ^ buf.dst_mac_u32(), sip, dip, 0)
            else:
                jhash_2words(buf.src_mac_u32(), buf.dst_mac_u32(), 0)
        XmitHashPolicy::Encap23:
            // Same as Layer23 but uses inner headers for tunnel traffic.
            let (sip, dip) = buf.inner_l3_tuple().or(buf.l3_tuple()).unwrap_or_default();
            jhash_3words(buf.src_mac_u32() ^ buf.dst_mac_u32(), sip, dip, 0)
        XmitHashPolicy::Encap34:
            // Same as Layer34 but uses inner headers for tunnel traffic.
            if let Some((sip, dip, sp, dp)) = buf.inner_l34_tuple().or(buf.l34_tuple()):
                jhash_3words(sip, dip, (sp as u32) << 16 | dp as u32, 0)
            else:
                jhash_2words(buf.src_mac_u32(), buf.dst_mac_u32(), 0)
        XmitHashPolicy::VlanSrcMac:
            // hash(vlan_id, src_mac) — for bridge+LACP setups.
            let vlan = buf.vlan_tci().unwrap_or(0) as u32;
            jhash_2words(vlan, buf.src_mac_u32(), 0)
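The Layer34 branch and the modulo-based slave selection can be sketched as runnable Rust. The `mix` function below is a simple stand-in for the kernel's `jhash_2words`/`jhash_3words` (only the shape of the tuple packing and the modulo step match the spec); `layer34_hash` and `select_slave` are illustrative names.

```rust
// Illustrative 3-word mixer standing in for jhash_3words.
fn mix(a: u32, b: u32, c: u32) -> u32 {
    let mut h = a ^ b.rotate_left(13) ^ c.rotate_left(26);
    h ^= h >> 16;
    h = h.wrapping_mul(0x7feb_352d);
    h ^= h >> 15;
    h
}

/// Layer 3+4 hash over (src_ip, dst_ip, src_port, dst_port), packing
/// the port pair into one word exactly as the pseudocode does.
fn layer34_hash(sip: u32, dip: u32, sport: u16, dport: u16) -> u32 {
    mix(sip, dip, (sport as u32) << 16 | dport as u32)
}

fn select_slave(hash: u32, active_count: usize) -> usize {
    (hash as usize) % active_count
}

fn main() {
    // Two flows between the same host pair may land on different slaves
    // because the port pair participates in the hash.
    let h1 = layer34_hash(0x0a00_0001, 0x0a00_0002, 40001, 443);
    let h2 = layer34_hash(0x0a00_0001, 0x0a00_0002, 40002, 443);
    let (i1, i2) = (select_slave(h1, 4), select_slave(h2, 4));
    assert!(i1 < 4 && i2 < 4);
    // A given flow always maps to the same slave (stable hashing).
    assert_eq!(select_slave(h1, 4), i1);
}
```

The per-flow stability is the key property: all packets of one TCP connection hash identically, so they traverse one slave and arrive in order, while distinct flows spread across the aggregate.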

16.28.7 Bond Parameters

All parameters are configurable via sysfs (/sys/class/net/bond0/bonding/*) and netlink (IFLA_BOND_* attributes). Values match Linux's defaults and semantics exactly for compatibility.

/// Runtime-configurable bond parameters.
pub struct BondParams {
    /// MII link monitoring configuration.
    pub mii_monitor: MiiMonitor,
    /// ARP link monitoring configuration.
    pub arp_monitor: ArpMonitor,
    /// Preferred active slave name (for active-backup). When this slave is
    /// UP, it becomes the active slave regardless of other slaves' state.
    /// Empty string means no preferred slave.
    pub primary: [u8; 16], // IFNAMSIZ
    /// When to reselect the primary slave after failover.
    pub primary_reselect: PrimaryReselect,
    /// MAC address handling on failover.
    pub fail_over_mac: FailOverMac,
    /// Number of gratuitous ARPs to send after failover. Default: 1.
    pub num_grat_arp: u8,
    /// Number of unsolicited IPv6 Neighbor Advertisements after failover.
    pub num_unsol_na: u8,
    /// Deliver duplicate frames on inactive slaves (for NIDS, packet capture).
    /// 0 = drop on inactive slaves (default). 1 = deliver on all slaves.
    pub all_slaves_active: bool,
    /// Minimum number of active slaves for the bond device to report carrier UP.
    /// Default: 0 (bond is UP even with no active slaves, for compatibility).
    pub min_links: u32,
    /// Round-robin packets per slave before switching to the next slave.
    /// Default: 1. Higher values reduce reordering but worsen distribution.
    pub packets_per_slave: u32,
    /// Interval (seconds) between learning-packet transmissions to each
    /// slave's peer switch in TLB/ALB modes. Default: 1.
    pub lp_interval: u32,
}

/// Primary reselection policy.
#[repr(u8)]
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum PrimaryReselect {
    /// Always: whenever the primary slave comes back up, switch to it.
    Always = 0,
    /// Better: switch to primary only if it is better than current active
    /// (higher speed or lower priority value).
    Better = 1,
    /// Failure: switch to primary only if the current active slave fails.
    Failure = 2,
}

/// Failover MAC address handling policy.
#[repr(u8)]
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum FailOverMac {
    /// None: bond sets all slaves to the bond's MAC; no MAC change on failover.
    /// This is the default and works with all switch configurations.
    None = 0,
    /// Active: the bond's MAC follows the active slave's MAC on failover.
    /// Avoids MAC reprogramming on slaves that don't support it.
    Active = 1,
    /// Follow: each slave keeps its original MAC; the bond's MAC changes
    /// to the new active slave's MAC on failover.
    Follow = 2,
}

Parameter summary table (sysfs/netlink equivalents):

| Parameter | sysfs name | Netlink attribute | Default | Description |
|---|---|---|---|---|
| mode | mode | IFLA_BOND_MODE | balance-rr (0) | Bond operating mode |
| miimon | miimon | IFLA_BOND_MIIMON | 0 (disabled) | MII monitoring interval (ms) |
| updelay | updelay | IFLA_BOND_UPDELAY | 0 | Link up delay (ms, multiple of miimon) |
| downdelay | downdelay | IFLA_BOND_DOWNDELAY | 0 | Link down delay (ms, multiple of miimon) |
| use_carrier | use_carrier | IFLA_BOND_USE_CARRIER | 1 | Use carrier state (1) or MII register (0) |
| arp_interval | arp_interval | IFLA_BOND_ARP_INTERVAL | 0 (disabled) | ARP monitoring interval (ms) |
| arp_ip_target | arp_ip_target | IFLA_BOND_ARP_IP_TARGET | (none) | Up to 16 ARP probe targets |
| arp_validate | arp_validate | IFLA_BOND_ARP_VALIDATE | none (0) | ARP validation scope |
| arp_all_targets | arp_all_targets | IFLA_BOND_ARP_ALL_TARGETS | any (0) | Require any or all ARP targets |
| xmit_hash_policy | xmit_hash_policy | IFLA_BOND_XMIT_HASH_POLICY | layer2 (0) | TX hash function |
| lacp_rate | lacp_rate | IFLA_BOND_AD_LACP_RATE | slow (0) | LACP PDU rate |
| ad_select | ad_select | IFLA_BOND_AD_SELECT | stable (0) | 802.3ad aggregator selection |
| primary | primary | IFLA_BOND_PRIMARY | (none) | Preferred active slave |
| primary_reselect | primary_reselect | IFLA_BOND_PRIMARY_RESELECT | always (0) | Primary reselection policy |
| fail_over_mac | fail_over_mac | IFLA_BOND_FAIL_OVER_MAC | none (0) | MAC handling on failover |
| num_grat_arp | num_grat_arp | IFLA_BOND_NUM_PEER_NOTIF | 1 | Gratuitous ARPs after failover |
| all_slaves_active | all_slaves_active | IFLA_BOND_ALL_SLAVES_ACTIVE | 0 | Deliver frames on all slaves |
| min_links | min_links | IFLA_BOND_MIN_LINKS | 0 | Minimum active links for bond UP |
| packets_per_slave | packets_per_slave | IFLA_BOND_PACKETS_PER_SLAVE | 1 | RR packets before switching slave |

16.28.8 Netlink and Sysfs Interface

Bond devices are managed through the standard Linux netlink and sysfs interfaces. UmkaOS implements both for full compatibility with iproute2, ifenslave, NetworkManager, and systemd-networkd.

Bond creation (netlink RTM_NEWLINK):

ip link add bond0 type bond mode 802.3ad miimon 100 xmit_hash_policy layer3+4

This sends RTM_NEWLINK with IFLA_INFO_KIND = "bond" and nested IFLA_INFO_DATA containing IFLA_BOND_MODE, IFLA_BOND_MIIMON, and IFLA_BOND_XMIT_HASH_POLICY attributes.

Slave enslave (netlink RTM_NEWLINK):

ip link set eth0 master bond0

Enslaving a device: the bond saves the slave's original MAC, sets the slave's MAC to the bond's MAC (unless fail_over_mac != None), registers an RX handler on the slave to intercept incoming frames, disables the slave's own IP addresses (the bond device carries all IP configuration), and adds the slave to the slaves array.

Slave release (netlink RTM_NEWLINK):

ip link set eth0 nomaster

Releasing a slave: restore original MAC, remove RX handler, re-enable the slave's independent operation.

Sysfs: All bond parameters are exposed at /sys/class/net/bond0/bonding/<param> as readable and writable files. Per-slave information is at /sys/class/net/bond0/bonding/slaves (list of slave names) and per-slave attributes at /sys/class/net/bond0/slave_<ethN>/<attr>.

16.28.9 Failover Behavior

Active-backup failover (mode 1) is the primary HA mechanism. The sequence when the link monitor detects a slave failure:

  1. Detection: MII monitor observes carrier loss (or the ARP monitor times out) on the currently active slave. After downdelay expires, the slave transitions to BondLinkState::Down.

  2. Selection: bond_select_active_slave() iterates the slave list and selects a new active slave using the following priority:
       • If a primary slave is configured and UP, select it (subject to the primary_reselect policy).
       • Otherwise, select the first UP slave with the lowest priority value.
       • Among slaves with equal priority, select the first in the slave list (insertion order).

  3. MAC update: Depending on fail_over_mac:
       • None: the new active slave already has the bond's MAC (all slaves share it).
       • Active: the bond's MAC is updated to the new active slave's MAC.
       • Follow: the new active slave's MAC becomes the bond's MAC; the old active slave's original MAC is restored.

  4. Gratuitous ARP/ND: The bond sends num_grat_arp gratuitous ARP packets (and num_unsol_na unsolicited IPv6 Neighbor Advertisements) on the new active slave. These update the switch's MAC forwarding table and the ARP caches of connected hosts to point the bond's MAC at the new physical port.

  5. In-flight packet loss: Packets queued in the old slave's TX ring at the time of failure are lost. The bond does not retransmit at the link layer; higher-layer protocols (TCP, application-level) handle retransmission.

  6. Event reporting: The failover is logged via Section 20.1 as a structured event with fields: bond_name, old_slave, new_slave, mode, timestamp. Userspace can subscribe to bond events via netlink IFLA_EVENT notifications.

  7. Traffic resumes: The new active slave begins carrying all TX/RX traffic. Typical failover time is bounded by downdelay + miimon (for MII monitoring) or arp_interval * 3 (for ARP monitoring).
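The failover-time bounds stated in the last step reduce to simple arithmetic; the following sketch (with illustrative function names) makes the worst cases explicit:

```rust
/// MII monitoring confirms a failure no later than one polling interval
/// after the carrier drop, plus the configured downdelay.
fn mii_worst_case_ms(miimon: u32, downdelay: u32) -> u32 {
    miimon + downdelay
}

/// ARP monitoring declares DOWN after three missed probe intervals.
fn arp_worst_case_ms(arp_interval: u32) -> u32 {
    arp_interval * 3
}

fn main() {
    // Typical production settings: miimon = 100 ms, downdelay = 200 ms
    // bound detection at 300 ms.
    assert_eq!(mii_worst_case_ms(100, 200), 300);
    // arp_interval = 1000 ms: up to 3 s before failover triggers.
    assert_eq!(arp_worst_case_ms(1000), 3000);
}
```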

16.28.10 Network Namespace Integration

  • Bond and slaves must reside in the same network namespace. Attempting to enslave a device from a different namespace returns -EINVAL.
  • A bond device can be moved between network namespaces via ip link set bond0 netns <ns>. All enslaved devices follow the bond into the new namespace (matching Linux behavior).
  • LACP PDUs (EtherType 0x8809) are processed within the bond's network namespace (Section 17.1).
  • When a bond is destroyed (or moved), all slaves are released first and their original namespace membership is restored.

16.28.11 Feature Negotiation

The bond device's feature flags (NetDevFeatures) are the intersection of all slaves' features. When a slave is added or removed, the bond recalculates its feature set:

bond_compute_features(bond):
    features = NetDevFeatures::all()
    for slave in bond.slaves:
        features &= slave.dev.features
    // The intersection automatically disables TSO/GSO when any slave
    // lacks it: a single slave without offload would otherwise force
    // software segmentation on that path, causing unpredictable latency.
    bond.dev.features = features
    bond.dev.hw_features = features

The bond's MTU is the minimum MTU across all slaves. Changing the bond's MTU propagates to all slaves via NetDeviceOps::change_mtu().
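Both recalculations above are folds over the slave list: a bitwise AND for features and a minimum for MTU. A minimal sketch with placeholder feature bits (the bit names are illustrative, not the specified NetDevFeatures layout):

```rust
// Hypothetical feature bits for illustration only.
const TSO: u32 = 1 << 0;
const RX_CSUM: u32 = 1 << 1;
const TX_CSUM: u32 = 1 << 2;

/// Bond features = intersection of all slave feature masks.
fn bond_compute_features(slave_features: &[u32]) -> u32 {
    slave_features.iter().fold(u32::MAX, |acc, f| acc & f)
}

/// Bond MTU = minimum slave MTU (None if the bond has no slaves).
fn bond_mtu(slave_mtus: &[u32]) -> Option<u32> {
    slave_mtus.iter().copied().min()
}

fn main() {
    // One slave without TSO disables TSO bond-wide.
    let feats = bond_compute_features(&[TSO | RX_CSUM | TX_CSUM, RX_CSUM | TX_CSUM]);
    assert_eq!(feats & TSO, 0);
    assert_ne!(feats & RX_CSUM, 0);
    // A jumbo-frame slave is capped by a standard-MTU slave.
    assert_eq!(bond_mtu(&[9000, 1500]), Some(1500));
}
```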

16.28.12 Linux Compatibility

UmkaOS's bonding subsystem is fully compatible with Linux's bonding module:

  • All 7 modes (0-6) implemented with identical semantics.
  • ifenslave (legacy): BOND_ENSLAVE_OLD / BOND_RELEASE_OLD ioctls supported.
  • ip link (iproute2): all IFLA_BOND_* netlink attributes parsed and applied.
  • /sys/class/net/bondN/bonding/*: all sysfs parameter files present with identical read/write format (matching Linux's bonding_show_*/bonding_store_*).
  • /proc/net/bonding/bond0: per-bond status file with identical format (mode, slaves, link state, LACP info).
  • LACP interoperability: LACPDUs conform to IEEE 802.3ad wire format; tested against Cisco, Arista, and Mellanox switch LACP implementations.
  • Bonding over VLAN (Section 16.27) and VLAN over bond topologies both work (bond device as lower device for VlanDev, or VlanDev as slave of bond).

Cross-references: - Network device model: Section 16.13 - Network stack integration: Section 16.2 - VLAN over bond: Section 16.27 - Network namespace scoping: Section 17.1 - Failover event reporting: Section 20.1 - NAPI on slave devices: Section 16.14 - Netlink interface: Section 16.17 - Traffic control on bond device: Section 16.21

16.29 Multicast Routing

Multicast routing forwards IP multicast packets between networks, enabling efficient one-to-many delivery without unicast duplication at the source. The subsystem has two components: group membership (IGMP for IPv4, MLD for IPv6) tracks which hosts on each interface want to receive a given group, and the multicast forwarding cache (MFC) routes multicast packets from their ingress interface to the correct set of egress interfaces.

Use cases: IPTV and live video streaming, cluster heartbeat (Corosync, Pacemaker), PIM routing daemons (pimd, pim6sd, FRR), multicast DNS (mDNS/Avahi), financial market data distribution, and any application using IP_ADD_MEMBERSHIP / IPV6_JOIN_GROUP.

Linux parallel: Linux implements multicast routing via the MRT (multicast routing table) raw socket API (net/ipv4/ipmr.c, net/ipv6/ip6mr.c). The kernel provides the forwarding plane; userspace daemons (mrouted, pimd, pim6sd) implement the routing protocol (PIM-SM, PIM-SSM, DVMRP) and install forwarding entries via setsockopt(). UmkaOS provides full MRT socket API compatibility so that unmodified Linux multicast routing daemons work without recompilation.

Isolation: Multicast routing runs inside umka-net (Section 16.2) as part of the network-layer forwarding path. VIF-to-NetDevice mappings reference devices in the same network namespace (Section 17.1).

16.29.1 IGMP (Internet Group Management Protocol)

IGMP operates between hosts and their immediately adjacent multicast routers on an IPv4 LAN segment. Hosts report group membership; routers query periodically to discover which groups have active listeners.

16.29.1.1 IGMPv2 (RFC 2236)

/// IGMP message types.
///
/// Values match the IANA "IGMP Type Numbers" registry. UmkaOS processes all
/// five types on receive; it generates only Query (as querier), ReportV2,
/// ReportV3, and LeaveGroup.
#[repr(u8)]
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum IgmpType {
    /// General or group-specific membership query.
    MembershipQuery = 0x11,
    /// IGMPv1 membership report (backward compatibility with RFC 1112 hosts).
    MembershipReportV1 = 0x12,
    /// IGMPv2 membership report.
    MembershipReportV2 = 0x16,
    /// Leave group notification (host departing a group).
    LeaveGroup = 0x17,
    /// IGMPv3 membership report (source filtering, RFC 3376).
    MembershipReportV3 = 0x22,
}

/// IGMP message header (8 bytes, RFC 2236 Section 2).
///
/// All IGMP messages share this fixed 8-byte header. For IGMPv3 reports the
/// header is followed by variable-length group records (see `Igmpv3Report`).
#[repr(C, packed)]
pub struct IgmpHeader {
    /// Message type (one of `IgmpType` values).
    pub msg_type: u8,
    /// Maximum response time in units of 1/10 second. For Membership Query
    /// messages, this is the maximum time before a host must respond. For
    /// other message types, this field is set to zero on transmit and ignored
    /// on receipt.
    pub max_resp_time: u8,
    /// Internet checksum (ones' complement sum of the IGMP message).
    /// Big-endian on-wire per RFC 2236.
    pub checksum: Be16,
    /// Multicast group address. For General Queries this is 0.0.0.0;
    /// for Group-Specific Queries and Reports it is the group address.
    pub group_addr: Ipv4Addr,
}
const_assert!(size_of::<IgmpHeader>() == 8);
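The checksum field is the standard Internet checksum: a well-formed message, summed as big-endian 16-bit words including the checksum itself, folds to 0xFFFF. A self-contained sketch over the 8-byte header:

```rust
/// Ones' complement Internet checksum over a byte slice (odd trailing
/// byte is padded with zero, per the usual convention).
fn internet_checksum(bytes: &[u8]) -> u16 {
    let mut sum: u32 = 0;
    for chunk in bytes.chunks(2) {
        let word = if chunk.len() == 2 {
            u16::from_be_bytes([chunk[0], chunk[1]])
        } else {
            u16::from_be_bytes([chunk[0], 0])
        };
        sum += word as u32;
    }
    // Fold carries back into the low 16 bits, then complement.
    while sum > 0xFFFF {
        sum = (sum & 0xFFFF) + (sum >> 16);
    }
    !(sum as u16)
}

fn main() {
    // IGMPv2 General Query: type 0x11, max_resp_time 100 (10 s),
    // checksum zeroed for computation, group 0.0.0.0.
    let mut query = [0x11u8, 0x64, 0x00, 0x00, 0, 0, 0, 0];
    let csum = internet_checksum(&query);
    assert_eq!(csum, 0xEE9B); // !(0x1164)
    query[2..4].copy_from_slice(&csum.to_be_bytes());
    // Verification path: summing with the checksum in place yields 0.
    assert_eq!(internet_checksum(&query), 0);
}
```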

Querier election: On each LAN segment, the router with the lowest IPv4 address becomes the designated querier. All other routers suppress their query timers for other_querier_present_interval (= robustness_variable * query_interval + query_response_interval / 2). If the current querier is silent for this interval, the next-lowest-IP router assumes the querier role.

Timers (all configurable via sysctl, defaults match RFC 2236 Section 8.4):

| Parameter | Default | Description |
|-----------|---------|-------------|
| query_interval | 125 s | Interval between General Queries |
| query_response_interval | 10 s | Max Response Time in General Queries |
| robustness_variable | 2 | Accounts for packet loss on the segment |
| group_membership_interval | 260 s | robustness * query_interval + query_response |
| last_member_query_interval | 1 s | Interval between Group-Specific Queries after Leave |
| last_member_query_count | 2 | Number of Group-Specific Queries after Leave (= robustness) |
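The two derived intervals are simple functions of the three base parameters. A minimal check against the defaults, with plain functions standing in for the sysctl plumbing:

```rust
/// RFC 2236 Section 8 derived intervals, computed from the base parameters
/// in the table above. All values in seconds.
fn group_membership_interval(robustness: u32, query_interval: u32, query_response: u32) -> u32 {
    robustness * query_interval + query_response
}

/// Time a non-querier waits before assuming the querier role.
fn other_querier_present_interval(robustness: u32, query_interval: u32, query_response: u32) -> u32 {
    robustness * query_interval + query_response / 2
}

fn main() {
    // Defaults: robustness_variable = 2, query_interval = 125 s,
    // query_response_interval = 10 s.
    assert_eq!(group_membership_interval(2, 125, 10), 260);
    assert_eq!(other_querier_present_interval(2, 125, 10), 255);
    println!("ok");
}
```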

16.29.1.2 IGMPv3 (RFC 3376)

IGMPv3 adds source filtering — a host can request traffic from a specific set of sources (INCLUDE mode) or from all sources except a specific set (EXCLUDE mode). This is the foundation for PIM-SSM (Source-Specific Multicast).

/// IGMPv3 membership report message.
///
/// Sent by hosts to report group membership with source filtering. Contains
/// one or more group records, each specifying a group address, filter mode,
/// and source list. The report is sent to the IGMPv3 all-routers group
/// 224.0.0.22 (not to the multicast group itself).
#[repr(C, packed)]
pub struct Igmpv3Report {
    /// Message type (always `IgmpType::MembershipReportV3`, 0x22).
    pub msg_type: u8,
    /// Reserved, set to zero.
    pub reserved: u8,
    /// Internet checksum over the entire report. Big-endian on-wire.
    pub checksum: Be16,
    /// Reserved, set to zero. Big-endian for wire format consistency.
    pub reserved2: Be16,
    /// Number of group records following this header. Big-endian on-wire.
    pub num_group_records: Be16,
    // Followed by `num_group_records` Igmpv3GroupRecord entries (variable length).
}

/// IGMPv3 group record — one record per group in an IGMPv3 report.
///
/// Variable-length: the source list follows the fixed fields.
#[repr(C, packed)]
pub struct Igmpv3GroupRecord {
    /// Record type (see table below).
    pub record_type: u8,
    /// Length of auxiliary data (in 32-bit words) after the source list.
    /// Currently always 0 — no auxiliary data is defined by any RFC.
    pub aux_data_len: u8,
    /// Number of source addresses in the source list. Big-endian on-wire.
    pub num_sources: Be16,
    /// Multicast group address this record applies to.
    pub group_addr: Ipv4Addr,
    // Followed by `num_sources` Ipv4Addr source addresses (4 bytes each).
}
const_assert!(size_of::<Igmpv3Report>() == 8);
const_assert!(size_of::<Igmpv3GroupRecord>() == 8);

IGMPv3 record types (RFC 3376 Section 4.2.12):

| Type | Value | Description |
|------|-------|-------------|
| MODE_IS_INCLUDE | 1 | Current-state: forward only from listed sources |
| MODE_IS_EXCLUDE | 2 | Current-state: forward from all sources except listed |
| CHANGE_TO_INCLUDE | 3 | Filter-mode-change: switching to include mode |
| CHANGE_TO_EXCLUDE | 4 | Filter-mode-change: switching to exclude mode |
| ALLOW_NEW_SOURCES | 5 | Source-list-change: adding sources to include list |
| BLOCK_OLD_SOURCES | 6 | Source-list-change: removing sources from include list |
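A hedged userspace sketch of report construction: serialize a single-record IGMPv3 report by hand and fill in the RFC 1071 Internet checksum. Raw byte pushes replace the struct types above; the layout follows Igmpv3Report and Igmpv3GroupRecord.

```rust
/// Internet checksum (RFC 1071): one's complement of the one's complement
/// sum of all big-endian 16-bit words.
fn internet_checksum(data: &[u8]) -> u16 {
    let mut sum: u32 = 0;
    for chunk in data.chunks(2) {
        let word = if chunk.len() == 2 {
            u16::from_be_bytes([chunk[0], chunk[1]])
        } else {
            u16::from_be_bytes([chunk[0], 0])
        };
        sum += word as u32;
    }
    while sum >> 16 != 0 {
        sum = (sum & 0xffff) + (sum >> 16);
    }
    !(sum as u16)
}

/// Build a minimal IGMPv3 report carrying one group record.
/// `record_type` is one of the table values (e.g. 3 = CHANGE_TO_INCLUDE).
fn build_igmpv3_report(record_type: u8, group: [u8; 4], sources: &[[u8; 4]]) -> Vec<u8> {
    let mut buf = Vec::new();
    buf.push(0x22);                                 // msg_type = MembershipReportV3
    buf.push(0);                                    // reserved
    buf.extend_from_slice(&[0, 0]);                 // checksum placeholder
    buf.extend_from_slice(&[0, 0]);                 // reserved2
    buf.extend_from_slice(&1u16.to_be_bytes());     // num_group_records
    buf.push(record_type);
    buf.push(0);                                    // aux_data_len
    buf.extend_from_slice(&(sources.len() as u16).to_be_bytes());
    buf.extend_from_slice(&group);
    for s in sources {
        buf.extend_from_slice(s);
    }
    let csum = internet_checksum(&buf);
    buf[2..4].copy_from_slice(&csum.to_be_bytes()); // fill in checksum
    buf
}

fn main() {
    let pkt = build_igmpv3_report(3, [239, 1, 1, 1], &[[192, 0, 2, 1]]);
    assert_eq!(pkt.len(), 8 + 8 + 4); // header + fixed record + 1 source
    // Receiver-side validation: the checksum over the whole message,
    // checksum field included, must fold to zero.
    assert_eq!(internet_checksum(&pkt), 0);
    println!("ok");
}
```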

16.29.1.3 Per-Interface IGMP State

/// Per-interface multicast group membership state.
///
/// One instance per (interface, group) pair. Tracks the merged membership
/// state from all sockets that have joined this group on this interface.
/// Used by the IGMP querier/reporter state machine and by the multicast
/// forwarding path to determine local interest.
pub struct McGroupState {
    /// Multicast group address.
    pub group: Ipv4Addr,
    /// Filter mode: INCLUDE (forward only from listed sources) or EXCLUDE
    /// (forward from all except listed). Derived from IGMPv3 reports; IGMPv2
    /// reports are treated as EXCLUDE {} (all sources accepted).
    pub filter_mode: IgmpFilterMode,
    /// Source list for IGMPv3 source filtering. In INCLUDE mode, these are
    /// the accepted sources. In EXCLUDE mode, these are the blocked sources.
    ///
    /// Bounded to `MAX_IGMP_SOURCES` per group — matches Linux's
    /// `/proc/sys/net/ipv4/igmp_max_msf` default. Excess sources from
    /// `IP_ADD_SOURCE_MEMBERSHIP` are rejected with `ENOBUFS`.
    pub sources: ArrayVec<Ipv4Addr, MAX_IGMP_SOURCES>,
    /// Group membership timer. When this expires without a membership report
    /// being received, the group is considered to have no listeners on this
    /// interface and is removed from the forwarding state.
    pub group_timer: Timer,
    /// Number of local socket joins on this interface (reference count).
    /// When `ref_count` drops to zero, a Leave message is sent (IGMPv2) or
    /// a TO_IN {} record is sent (IGMPv3), and the entry is scheduled for
    /// removal after `last_member_query_count` unresponded queries.
    pub ref_count: u32,
}

/// Maximum source filter entries per multicast group per interface.
///
/// Linux default for `igmp_max_msf` is 10. This constant is the compile-time
/// upper bound; the runtime limit is governed by the sysctl value.
/// Configurable at runtime via `/proc/sys/net/ipv4/igmp_max_msf`.
pub const MAX_IGMP_SOURCES: usize = 64;

16.29.2 MLD (Multicast Listener Discovery) -- IPv6

MLD is the IPv6 equivalent of IGMP. It uses ICMPv6 message types instead of a separate protocol number:

| MLD Version | RFC | ICMPv6 Types | IGMP Equivalent |
|-------------|-----|--------------|-----------------|
| MLDv1 | RFC 2710 | 130 (Query), 131 (Report), 132 (Done) | IGMPv2 |
| MLDv2 | RFC 3810 | 130 (Query), 143 (Report) | IGMPv3 |

MLD shares the same state machine as IGMP (querier election, timer parameters, filter modes) but operates on IPv6 multicast addresses (ff00::/8). Key differences:

  • Link-local scope: MLD messages use hop limit 1 and link-local source addresses. The kernel validates both on receipt and silently drops non-conforming messages.
  • Router Alert option: MLD messages carry the IPv6 Hop-by-Hop Router Alert option (RFC 2711) so that routers intercept them without inspecting every multicast packet.
  • Scope-aware: IPv6 multicast addresses encode scope (link-local ff02::, site-local ff05::, organization ff08::, global ff0e::). The multicast forwarding path respects scope boundaries — link-local groups are never forwarded beyond the originating link.

Internally, McGroupState is parameterized over address family. The IPv6 variant uses Ipv6Addr for group and sources fields with identical timer logic.

16.29.3 Multicast Forwarding Cache (MFC)

The MFC is the kernel's multicast routing table — analogous to the unicast FIB (Section 16.6) but indexed by (source, group) pairs instead of destination prefix. Each entry specifies the expected incoming VIF and the set of outgoing VIFs with per-VIF TTL thresholds.

/// Multicast forwarding cache entry — (S,G) or (*,G) routing entry.
///
/// Installed by userspace routing daemons via `MRT_ADD_MFC` setsockopt.
/// Looked up on the multicast forwarding hot path (per-packet for
/// non-cached flows). Entries are stored in an `XArray` keyed by a hash
/// of (source, group) for O(1) lookup.
///
/// **Path temperature**: MFC lookup is hot-path (per multicast packet).
/// All fields are read-only after installation except atomic counters and
/// `last_use`. The `pending` queue is accessed only for unresolved entries
/// (rare — only until the routing daemon responds to the upcall).
pub struct MfcEntry {
    /// Source address. For (*,G) entries (any-source multicast / ASM),
    /// this is `Ipv4Addr::UNSPECIFIED` (0.0.0.0). For (S,G) entries
    /// (source-specific multicast / SSM), this is the specific source.
    pub source: Ipv4Addr,
    /// Multicast group address (class D: 224.0.0.0/4).
    pub group: Ipv4Addr,
    /// Incoming VIF index — the VIF on which packets for this (S,G) must
    /// arrive. Packets arriving on a different VIF trigger an
    /// `IGMPMSG_WRONGVIF` upcall (RPF check failure).
    pub parent_vif: u16,
    /// Per-VIF TTL thresholds. Index = VIF index. If `ttl_thresholds[i] > 0`
    /// and the packet's TTL >= `ttl_thresholds[i]`, the packet is forwarded
    /// out VIF `i`. A threshold of 0 means "do not forward on this VIF".
    /// This array doubles as the outgoing interface set (OIF list).
    pub ttl_thresholds: [u8; MAXVIFS],
    /// Packet counter — total packets forwarded via this MFC entry.
    /// u64 for 50-year uptime at sustained multicast rates.
    pub packets: AtomicU64,
    /// Byte counter — total bytes forwarded via this MFC entry.
    pub bytes: AtomicU64,
    /// Timestamp of last packet match (nanoseconds, `CLOCK_MONOTONIC_RAW`).
    /// Used for MFC cache expiry — entries unused for `mfc_expiry_timeout`
    /// are candidates for garbage collection by the routing daemon.
    pub last_use: AtomicU64,
    /// Pending packet queue for unresolved entries. When a multicast packet
    /// arrives with no matching MFC entry, the kernel creates a pending
    /// entry, queues up to 8 packets, and sends an upcall to userspace.
    /// Once the daemon installs the MFC entry via `MRT_ADD_MFC`, pending
    /// packets are drained and forwarded.
    ///
    /// Bounded to 8 entries (matching Linux `MFC_LINES`). Excess packets
    /// for an unresolved entry are silently dropped.
    pub pending: SpinLock<ArrayVec<NetBufHandle, 8>>,
}

/// Maximum number of virtual interfaces per multicast routing table.
///
/// Matches Linux `MAXVIFS` (32). Each VIF maps to either a physical
/// `NetDevice` or a tunnel endpoint. 32 VIFs is sufficient for all
/// practical multicast routing topologies; the PIM register VIF
/// consumes one slot.
pub const MAXVIFS: usize = 32;
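The (S,G)-then-(*,G) lookup and the TTL-threshold forwarding rule can be sketched with a HashMap standing in for the XArray. These are illustrative stand-in types, not the kernel structures:

```rust
use std::collections::HashMap;

const MAXVIFS: usize = 32;

/// Stand-in MFC entry: incoming VIF plus per-VIF TTL thresholds.
struct MfcEntry {
    parent_vif: u16,
    ttl_thresholds: [u8; MAXVIFS],
}

struct Mfc {
    /// Keyed by (source, group); 0.0.0.0 source marks a (*,G) entry.
    entries: HashMap<([u8; 4], [u8; 4]), MfcEntry>,
}

impl Mfc {
    /// Exact (S,G) match first, then fall back to the (*,G) wildcard.
    fn lookup(&self, source: [u8; 4], group: [u8; 4]) -> Option<&MfcEntry> {
        self.entries
            .get(&(source, group))
            .or_else(|| self.entries.get(&([0u8; 4], group)))
    }
}

/// Outgoing VIF set: forward out VIF i iff ttl_thresholds[i] > 0 and the
/// packet's TTL is >= the threshold.
fn oif_set(entry: &MfcEntry, ttl: u8) -> Vec<usize> {
    entry
        .ttl_thresholds
        .iter()
        .enumerate()
        .filter(|&(_, &t)| t > 0 && ttl >= t)
        .map(|(i, _)| i)
        .collect()
}

fn main() {
    let mut thresholds = [0u8; MAXVIFS];
    thresholds[1] = 1;  // forward on VIF 1 for any TTL >= 1
    thresholds[2] = 16; // forward on VIF 2 only for TTL >= 16
    let mut entries = HashMap::new();
    entries.insert(([0u8; 4], [239, 1, 1, 1]),
                   MfcEntry { parent_vif: 0, ttl_thresholds: thresholds });
    let mfc = Mfc { entries };

    // A specific source falls back to the (*,G) entry.
    let e = mfc.lookup([192, 0, 2, 9], [239, 1, 1, 1]).unwrap();
    assert_eq!(e.parent_vif, 0);
    assert_eq!(oif_set(e, 8), vec![1]);     // TTL 8: VIF 2 threshold unmet
    assert_eq!(oif_set(e, 64), vec![1, 2]); // TTL 64: both VIFs
    println!("ok");
}
```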

16.29.4 Virtual Interface (VIF)

VIFs abstract physical and tunnel interfaces for multicast routing. Each VIF has an index (0 to MAXVIFS - 1) and maps to a NetDevice or a point-to-point tunnel.

/// Virtual interface for multicast routing.
///
/// Wraps a real `NetDevice` or an IPIP tunnel endpoint. Installed via
/// `MRT_ADD_VIF` setsockopt, removed via `MRT_DEL_VIF`.
pub struct VifDevice {
    /// VIF index (0-based, max `MAXVIFS - 1`).
    pub vif_index: u16,
    /// VIF flags.
    pub flags: VifFlags,
    /// TTL threshold — packets with TTL <= this value are not forwarded
    /// through this VIF. Allows scoping multicast by hop count.
    pub threshold: u8,
    /// Rate limit in kilobits per second (0 = unlimited). Applied per-VIF
    /// to prevent a single multicast stream from saturating a slow link.
    pub rate_limit: u32,
    /// Associated network device. `None` only for tunnel VIFs where the
    /// device is resolved dynamically from `local_addr`/`remote_addr`.
    pub dev: Option<Arc<NetDevice>>,
    /// Local address (source IP for tunnel VIFs). For physical interfaces,
    /// this is the interface's primary IPv4 address.
    pub local_addr: Ipv4Addr,
    /// Remote address (destination IP for IPIP tunnel VIFs). Zero for
    /// physical (non-tunnel) interfaces.
    pub remote_addr: Ipv4Addr,
    /// Inbound packet counter.
    pub in_packets: AtomicU64,
    /// Outbound packet counter.
    pub out_packets: AtomicU64,
}

bitflags::bitflags! {
    /// VIF configuration flags (matching Linux `VIFF_*` constants).
    pub struct VifFlags: u32 {
        /// VIF is an IPIP tunnel (encapsulate in IP-in-IP to `remote_addr`).
        const TUNNEL       = 0x1;
        /// VIF is a PIM Register tunnel (see PIM Register Tunnel section).
        const REGISTER     = 0x4;
        /// Use interface index instead of local/remote address to identify
        /// the VIF's device. Preferred on systems with unnumbered interfaces.
        const USE_IFINDEX  = 0x8;
    }
}

16.29.5 MRT Socket API

Multicast routing is controlled via raw socket options on an AF_INET / SOCK_RAW / IPPROTO_IGMP socket (IPv4) or an AF_INET6 / SOCK_RAW / IPPROTO_ICMPV6 socket (IPv6). Only one MRT control socket may be active per network namespace; issuing MRT_INIT on a second socket returns EADDRINUSE.

/// Multicast routing initialization sequence:
///
/// let mrt_fd = socket(AF_INET, SOCK_RAW, IPPROTO_IGMP);
/// setsockopt(mrt_fd, IPPROTO_IP, MRT_INIT, &1u32, size_of::<u32>());
/// // Now add VIFs and MFC entries via setsockopt.

IPv4 socket options (level IPPROTO_IP, values match Linux <linux/mroute.h>):

| Socket Option | Value | Direction | Description |
|---------------|-------|-----------|-------------|
| MRT_INIT | 200 | Set | Initialize multicast routing for this namespace |
| MRT_DONE | 201 | Set | Shut down multicast routing, remove all VIFs and MFC entries |
| MRT_ADD_VIF | 202 | Set | Add a virtual interface (struct vifctl) |
| MRT_DEL_VIF | 203 | Set | Delete a virtual interface by VIF index |
| MRT_ADD_MFC | 204 | Set | Add/update a multicast forwarding cache entry (struct mfcctl) |
| MRT_DEL_MFC | 205 | Set | Delete a multicast forwarding cache entry |
| MRT_VERSION | 206 | Get | Return MRT API version (0x0305 for full IPv4 MRT) |
| MRT_ASSERT | 207 | Set | Enable/disable IGMPMSG_WRONGVIF upcalls (RPF violation alerts) |
| MRT_PIM | 208 | Set | Enable PIM processing (required for PIM Register VIF) |
| MRT_TABLE | 209 | Set | Select MRT table ID (multi-table support, Linux 3.5+) |
| MRT_ADD_MFC_PROXY | 210 | Set | Add MFC entry on behalf of another table (inter-table proxy) |
| MRT_DEL_MFC_PROXY | 211 | Set | Delete proxied MFC entry |
| MRT_FLUSH | 212 | Set | Flush MFC/VIF entries (flags: MRT_FLUSH_MFC=1, MRT_FLUSH_MFC_STATIC=2, MRT_FLUSH_VIFS=4, MRT_FLUSH_VIFS_STATIC=8) |

IPv6 socket options (level IPPROTO_IPV6, on AF_INET6 / SOCK_RAW / IPPROTO_ICMPV6):

| Socket Option | Value | Direction | Description |
|---------------|-------|-----------|-------------|
| MRT6_INIT | 200 | Set | Initialize IPv6 multicast routing |
| MRT6_DONE | 201 | Set | Shut down IPv6 multicast routing |
| MRT6_ADD_MIF | 202 | Set | Add a multicast interface (MIF, IPv6 equivalent of VIF) |
| MRT6_DEL_MIF | 203 | Set | Delete a multicast interface |
| MRT6_ADD_MFC | 204 | Set | Add IPv6 multicast forwarding entry |
| MRT6_DEL_MFC | 205 | Set | Delete IPv6 multicast forwarding entry |
| MRT6_TABLE | 209 | Set | Select MRT6 table ID |

struct mfcctl wire format (passed to MRT_ADD_MFC / MRT_DEL_MFC):

/// Multicast Forwarding Cache control structure.
/// Matches Linux `struct mfcctl` (include/uapi/linux/mroute.h).
/// Field order matches Linux: origin first, then mcastgrp.
#[repr(C)]
pub struct MfcCtl {
    /// Source address (origin of the multicast stream).
    pub mfcc_origin: InAddr,       // struct in_addr (4 bytes, offset 0)
    /// Multicast group address (destination).
    pub mfcc_mcastgrp: InAddr,     // struct in_addr (4 bytes, offset 4)
    /// Incoming VIF index (the VIF on which packets should arrive).
    pub mfcc_parent: u16,          // offset 8
    /// Per-VIF TTL thresholds. mfcc_ttls[i] > 0 means forward to VIF i
    /// with TTL threshold mfcc_ttls[i]. 0 means do not forward to VIF i.
    pub mfcc_ttls: [u8; MAXVIFS],  // MAXVIFS = 32, offset 10
    /// Explicit padding mirroring the 2-byte hole a C compiler inserts after
    /// `mfcc_ttls` to align `mfcc_pkt_cnt` (u32) to a 4-byte boundary.
    /// Offset 42 + 2 = 44.
    pub _pad: [u8; 2],
    /// Packet counter for this MFC entry (kernel-maintained, read by userspace).
    pub mfcc_pkt_cnt: u32,         // offset 44
    /// Byte counter for this MFC entry (kernel-maintained, read by userspace).
    pub mfcc_byte_cnt: u32,        // offset 48
    /// Counter of packets arriving on the wrong VIF (kernel-maintained).
    pub mfcc_wrong_if: u32,        // offset 52
    /// Expiry timer in seconds (0 = no expiry). Set by userspace.
    pub mfcc_expire: i32,          // offset 56
}
// Size: 4 + 4 + 2 + 32 + 2(pad) + 4 + 4 + 4 + 4 = 60 bytes.
const_assert!(size_of::<MfcCtl>() == 60);
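The stated size and offsets can be checked with a mirror of the layout. This is a sketch assuming a Rust 1.77+ toolchain for `offset_of!`; `InAddr` is a 4-byte stand-in for `struct in_addr`:

```rust
use std::mem::{offset_of, size_of};

const MAXVIFS: usize = 32;

/// 4-byte stand-in for struct in_addr.
#[repr(C)]
struct InAddr([u8; 4]);

/// Mirror of the MfcCtl layout above, including the explicit padding field.
#[repr(C)]
struct MfcCtl {
    mfcc_origin: InAddr,
    mfcc_mcastgrp: InAddr,
    mfcc_parent: u16,
    mfcc_ttls: [u8; MAXVIFS],
    _pad: [u8; 2],
    mfcc_pkt_cnt: u32,
    mfcc_byte_cnt: u32,
    mfcc_wrong_if: u32,
    mfcc_expire: i32,
}

fn main() {
    // Matches the offset comments in the struct definition above.
    assert_eq!(size_of::<MfcCtl>(), 60);
    assert_eq!(offset_of!(MfcCtl, mfcc_parent), 8);
    assert_eq!(offset_of!(MfcCtl, mfcc_ttls), 10);
    assert_eq!(offset_of!(MfcCtl, mfcc_pkt_cnt), 44);
    assert_eq!(offset_of!(MfcCtl, mfcc_byte_cnt), 48);
    assert_eq!(offset_of!(MfcCtl, mfcc_wrong_if), 52);
    assert_eq!(offset_of!(MfcCtl, mfcc_expire), 56);
    println!("ok");
}
```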

16.29.6 Upcall Mechanism

When the kernel receives a multicast packet with no matching MFC entry, it notifies the userspace routing daemon via the MRT control socket. The daemon computes the correct forwarding path (using PIM, DVMRP, or static configuration) and installs the MFC entry.

Upcall message types (delivered as IGMPMSG_* on the raw socket):

| Message | Value | Trigger |
|---------|-------|---------|
| IGMPMSG_NOCACHE | 1 | Multicast packet received with no MFC entry |
| IGMPMSG_WRONGVIF | 2 | Packet arrived on unexpected VIF (RPF check failure) |
| IGMPMSG_WHOLEPKT | 3 | Entire packet delivered for PIM Register encapsulation |

Upcall flow (NOCACHE):

  1. Multicast packet arrives; MFC lookup returns no match.
  2. Kernel creates a pending (unresolved) MFC entry and queues the packet (up to 8).
  3. Kernel constructs an igmpmsg header (source, group, incoming VIF) and delivers it to the MRT control socket via the raw socket receive queue.
  4. Userspace daemon reads the upcall, computes the route, and calls setsockopt(MRT_ADD_MFC) to install the forwarding entry.
  5. Kernel drains the pending queue: all queued packets are forwarded per the new entry.
  6. If no MRT_ADD_MFC arrives within 10 seconds, the pending entry expires and queued packets are dropped. The next packet for this (S,G) triggers a new upcall.

RPF check (Reverse Path Forwarding): For every MFC entry, parent_vif specifies the expected incoming interface. If a packet arrives on a different VIF, it fails the RPF check. When MRT_ASSERT is enabled, the kernel sends an IGMPMSG_WRONGVIF upcall so the routing daemon can log the event or adjust routing state. Packets failing RPF are always dropped regardless of the MRT_ASSERT setting.
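The NOCACHE/WRONGVIF decision and the bounded pending queue can be sketched as follows (stub types; in the kernel the queue lives in MfcEntry::pending and the upcall goes to the MRT socket):

```rust
use std::collections::VecDeque;

/// Per-packet verdict for the flow above. A WrongVif verdict always drops
/// the packet; the upcall flag adds the IGMPMSG_WRONGVIF notification.
#[derive(Debug, PartialEq)]
enum Verdict {
    Forward,
    UpcallNocache,
    DropWrongVif { upcall: bool },
}

/// Stub MFC entry carrying only the expected incoming VIF.
struct MfcEntry {
    parent_vif: u16,
}

const PENDING_MAX: usize = 8;

/// Steps 1-2 plus the RPF check: no entry -> NOCACHE upcall; wrong incoming
/// VIF -> drop (with WRONGVIF upcall only when MRT_ASSERT is enabled).
fn classify(entry: Option<&MfcEntry>, in_vif: u16, mrt_assert: bool) -> Verdict {
    match entry {
        None => Verdict::UpcallNocache,
        Some(e) if e.parent_vif != in_vif => Verdict::DropWrongVif { upcall: mrt_assert },
        Some(_) => Verdict::Forward,
    }
}

/// Bounded pending queue for an unresolved entry: at most 8 packets,
/// excess silently dropped. Returns true if the packet was queued.
fn enqueue_pending(q: &mut VecDeque<Vec<u8>>, pkt: Vec<u8>) -> bool {
    if q.len() < PENDING_MAX {
        q.push_back(pkt);
        true
    } else {
        false
    }
}

fn main() {
    let e = MfcEntry { parent_vif: 3 };
    assert_eq!(classify(Some(&e), 3, false), Verdict::Forward);
    assert_eq!(classify(Some(&e), 1, true), Verdict::DropWrongVif { upcall: true });
    assert_eq!(classify(None, 3, false), Verdict::UpcallNocache);

    let mut q = VecDeque::new();
    for i in 0..10u8 {
        enqueue_pending(&mut q, vec![i]);
    }
    assert_eq!(q.len(), PENDING_MAX); // the 9th and 10th packets were dropped
    println!("ok");
}
```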

16.29.7 PIM Register Tunnel

PIM-SM (Protocol Independent Multicast - Sparse Mode, RFC 7761) uses a special register VIF to deliver multicast traffic from a first-hop router to the Rendezvous Point (RP) before the shortest-path tree (SPT) is established.

Register flow:

  1. Source sends multicast to group G. First-hop router (DR) has no downstream tree.
  2. DR encapsulates the multicast packet in a PIM Register message (unicast to the RP).
  3. RP decapsulates the inner packet and forwards it down the shared tree (*,G).
  4. RP sends a PIM Register-Stop to the DR once the (S,G) SPT is established.
  5. DR stops encapsulating; traffic flows natively on the (S,G) tree.

Kernel role: UmkaOS provides the register VIF (VifFlags::REGISTER) and the IGMPMSG_WHOLEPKT upcall mechanism. The PIM daemon (pimd, pim6sd, FRR pimd) handles PIM protocol messages; the kernel handles only forwarding plane operations. When a packet matches an MFC entry whose outgoing VIF set includes the register VIF, the kernel delivers the entire packet to the MRT socket via IGMPMSG_WHOLEPKT instead of forwarding it on a physical interface. The daemon encapsulates the packet in a PIM Register message and transmits it as a unicast IP packet to the RP.

16.29.8 IPv6 Multicast Routing

IPv6 multicast routing mirrors the IPv4 architecture with the following substitutions:

| IPv4 | IPv6 |
|------|------|
| IGMP (protocol 2) | MLD (ICMPv6 types 130, 131, 132, 143) |
| MfcEntry (Ipv4Addr) | Mf6cEntry (Ipv6Addr source/group) |
| VifDevice | MifDevice (Multicast Interface, same structure with Ipv6Addr) |
| MRT_* setsockopt on IPPROTO_IGMP socket | MRT6_* setsockopt on IPPROTO_ICMPV6 socket |
| PIM Register (protocol 103) | PIM6 Register (same mechanism, IPv6 addresses) |
| /proc/net/ip_mr_vif, /proc/net/ip_mr_cache | /proc/net/ip6_mr_vif, /proc/net/ip6_mr_cache |

The Mf6cEntry struct is identical to MfcEntry except source and group are Ipv6Addr. IPv6 scope rules apply: packets with link-local group addresses (ff02::/16) are never forwarded beyond the originating link, regardless of MFC entries.
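Scope enforcement reduces to reading the scope nibble of the group address (RFC 4291 Section 2.7). A minimal sketch:

```rust
/// IPv6 multicast scope is the low nibble of the second address byte:
/// 0x1 = interface-local, 0x2 = link-local, 0x5 = site-local,
/// 0x8 = organization-local, 0xe = global. Returns None for non-multicast.
fn mcast_scope(addr: &[u8; 16]) -> Option<u8> {
    (addr[0] == 0xff).then(|| addr[1] & 0x0f)
}

/// Scope boundary rule: link-local scope (and smaller) is never forwarded
/// beyond the originating link, regardless of MFC entries.
fn may_forward(addr: &[u8; 16]) -> bool {
    matches!(mcast_scope(addr), Some(scope) if scope > 0x2)
}

fn main() {
    let mut ff02_1 = [0u8; 16]; // ff02::1 (all-nodes, link-local)
    ff02_1[0] = 0xff; ff02_1[1] = 0x02; ff02_1[15] = 1;
    let mut ff0e_1 = [0u8; 16]; // ff0e::1 (global scope)
    ff0e_1[0] = 0xff; ff0e_1[1] = 0x0e; ff0e_1[15] = 1;

    assert_eq!(mcast_scope(&ff02_1), Some(0x2));
    assert!(!may_forward(&ff02_1)); // stays on-link
    assert!(may_forward(&ff0e_1));  // forwardable
    println!("ok");
}
```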

16.29.9 Per-Namespace Multicast Routing State

Each network namespace (Section 17.1) maintains independent multicast routing state. The MRT control socket, VIF table, and MFC table are all per-namespace. MRT_TABLE / MRT6_TABLE enables multiple routing tables within a single namespace (analogous to policy routing for unicast, Section 16.6).

16.29.10 Sysctls

| Sysctl Path | Default | Description |
|-------------|---------|-------------|
| net.ipv4.igmp_max_memberships | 20 | Maximum multicast groups per socket |
| net.ipv4.igmp_max_msf | 10 | Maximum source filter entries per group per socket |
| net.ipv4.igmp_qrv | 2 | IGMP robustness variable (query retransmit count) |
| net.ipv4.conf.&lt;iface&gt;.mc_forwarding | 0 | Multicast forwarding enabled (read-only; set by MRT_INIT) |
| net.ipv4.conf.&lt;iface&gt;.force_igmp_version | 0 | Force IGMP version (0 = auto-detect, 2 or 3) |
| net.ipv4.conf.all.mc_forwarding | 0 | Global multicast forwarding status (read-only) |
| net.ipv6.conf.&lt;iface&gt;.mc_forwarding | 0 | IPv6 multicast forwarding (read-only; set by MRT6_INIT) |

All sysctl paths are exposed under /proc/sys/ for Linux compatibility (Section 20.5) and are per-namespace — each NetNamespace holds its own sysctl values.

16.29.11 Cross-References

  • Section 16.6 -- unicast FIB, policy routing, route lookup
  • Section 16.2 -- packet receive and forwarding path, L3 dispatch
  • Section 16.3 -- raw socket creation, setsockopt() dispatch
  • Section 16.13 -- VIF-to-NetDevice mapping, interface counters
  • Section 17.1 -- per-namespace MRT table, network namespace isolation
  • Section 16.16 -- PIM register tunnel, IPIP encapsulation
  • Section 16.17 -- NETLINK_ROUTE multicast routing netlink messages
  • Section 16.18 -- BPF programs can inspect multicast forwarding decisions
  • Section 16.27 -- VLAN-aware multicast: VIF can reference a VLAN device

16.30 IPVS — IP Virtual Server

IPVS (IP Virtual Server) is a Layer 4 load balancer implemented inside the kernel. It receives connection requests addressed to a Virtual IP (VIP) and a configured port, selects a Real Server (RS) from a pool using a pluggable scheduling algorithm, and forwards packets to the chosen backend. The original Linux IPVS implementation (ip_vs module, part of the Linux Virtual Server project) is widely deployed as the data-plane engine for Kubernetes kube-proxy --mode=ipvs.

UmkaOS implements IPVS inside umka-net (Section 16.2) as a netfilter hook set (Section 16.18). It is transparent to both clients and real servers: clients connect to the VIP as if it were a single endpoint; real servers see connections from either the load balancer (NAT mode) or directly from clients (DR/TUN modes).

Linux parallel: Linux's ip_vs module is located in net/netfilter/ipvs/. UmkaOS's IPVS provides binary-compatible ipvsadm support (both ioctl and Generic Netlink interfaces) and identical /proc/net/ip_vs* output.

16.30.1 Overview

IPVS supports four packet forwarding methods:

  • NAT (Masquerade): The load balancer rewrites the destination IP and port of each incoming SYN to the chosen real server's address (DNAT). Return traffic passes back through the load balancer, which rewrites the source IP/port back to the VIP (SNAT). Both directions traverse the load balancer. Real servers need no special configuration; they see connections from the load balancer's IP.

  • DR (Direct Routing): The load balancer rewrites only the destination MAC address of the frame to the chosen real server's MAC; the IP header is unchanged. The real server must have the VIP configured on a loopback interface (with ARP disabled for that VIP) so that it accepts the packet. Return traffic goes directly from the real server to the client without passing through the load balancer. DR is the most performant mode because the load balancer handles only inbound packets. Requires all servers on the same L2 segment.

  • TUN (IP-in-IP Tunneling): The load balancer encapsulates the original IP packet in a new IP header addressed to the real server (IP-in-IP, protocol 4, or GRE optionally). The real server decapsulates and processes the inner packet. Like DR, return traffic bypasses the load balancer. Allows real servers on different L3 networks. Real servers must have the VIP configured on a tunnel interface with ARP disabled.

  • LocalNode: The VIP resolves to the load balancer host itself. Packets are delivered to a local socket. Used when the load balancer is also a real server.
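The NAT-mode rewrites can be illustrated on a plain 4-tuple. Checksum fixups and connection tracking are omitted; Flow is an illustrative stand-in, not a kernel type:

```rust
/// Connection 4-tuple as seen on the wire.
#[derive(Clone, Debug, PartialEq)]
struct Flow {
    src: [u8; 4],
    sport: u16,
    dst: [u8; 4],
    dport: u16,
}

/// NAT mode, inbound: rewrite VIP:vport to the chosen real server (DNAT).
fn dnat_inbound(mut f: Flow, rs: [u8; 4], rs_port: u16) -> Flow {
    f.dst = rs;
    f.dport = rs_port;
    f
}

/// NAT mode, outbound: rewrite the real server source back to the VIP
/// (SNAT), so the client sees replies from the address it connected to.
fn snat_outbound(mut f: Flow, vip: [u8; 4], vport: u16) -> Flow {
    f.src = vip;
    f.sport = vport;
    f
}

fn main() {
    let vip = [198, 51, 100, 1];
    let rs = [10, 0, 0, 7];
    let inbound = Flow { src: [203, 0, 113, 5], sport: 40000, dst: vip, dport: 80 };

    // Client -> VIP:80 becomes client -> RS:8080 (port mapping allowed).
    let to_rs = dnat_inbound(inbound.clone(), rs, 8080);
    assert_eq!((to_rs.dst, to_rs.dport), (rs, 8080));

    // RS reply is SNATed back so the source reads VIP:80.
    let reply = Flow { src: rs, sport: 8080, dst: inbound.src, dport: inbound.sport };
    let to_client = snat_outbound(reply, vip, 80);
    assert_eq!((to_client.src, to_client.sport), (vip, 80));
    println!("ok");
}
```

In DR and TUN modes neither rewrite occurs: the IP header is untouched (DR) or wrapped in an outer header (TUN), which is why return traffic can bypass the load balancer.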

16.30.2 Data Structures

/// A virtual service: the VIP:port tuple that clients connect to.
///
/// Each virtual service has its own scheduler and set of real servers.
/// Connections are tracked in the global `IPVS_CONN_TABLE` (not per-service),
/// matching Linux's architecture. Multiple virtual services may share the
/// same VIP with different ports or protocols.
pub struct IpvsService {
    /// Virtual IP address (IPv4 or IPv6).
    pub addr: IpAddr,
    /// Virtual port (network byte order).
    pub port: u16,
    /// Transport protocol (TCP, UDP, or SCTP).
    pub protocol: IpvsProtocol,
    /// Service-level flags (e.g., persistence, hashed scheduler config).
    pub flags: IpvsServiceFlags,
    /// Persistent session timeout (seconds). 0 = persistence disabled.
    /// When non-zero, the source address is used for session affinity:
    /// all connections from the same client IP are sent to the same RS
    /// for `timeout` seconds after the last connection.
    pub timeout: u32,
    /// Netmask applied to the client IP before persistence lookup.
    /// For IPv4 persistence: typically 255.255.255.255 (per-host) or
    /// 255.255.255.0 (per-subnet). Ignored when `timeout` is 0.
    pub netmask: u32,
    /// Scheduling algorithm used to select a real server for new connections.
    pub scheduler: Arc<dyn IpvsScheduler>,
    /// Pool of real servers with pre-computed weight CDF.
    ///
    /// RCU-protected for lock-free reads on the per-connection scheduling hot path.
    /// Writers (server add/remove/weight change) clone the table, rebuild the CDF,
    /// and RCU-publish the new snapshot. Write serialisation is provided by the
    /// admin netlink/ioctl path (single-threaded command processing).
    pub real_servers: RcuCell<IpvsServerTable>,
    /// Aggregate statistics for this virtual service.
    pub stats: IpvsStats,
}

/// Pre-computed server table for lock-free scheduler access.
///
/// Published via `RcuCell`: readers take an `RcuReadGuard` (zero contention),
/// writers clone-modify-publish on admin operations (server add/remove/weight change).
/// No hardcoded server limit — `Box<[...]>` is resized on server add/remove.
/// Kubernetes Services with thousands of endpoints (e.g., headless services for
/// large StatefulSets) are supported without arbitrary caps.
pub struct IpvsServerTable {
    /// Real servers for this virtual service. Heap-allocated at service creation
    /// time (initial capacity) and resized on server add/remove. Admin operations
    /// are cold-path; heap allocation is acceptable. Clone cost during RcuCell
    /// publish is O(N) regardless of inline vs heap.
    pub servers: Box<[Arc<IpvsRealServer>]>,
    /// Pre-computed cumulative weight distribution for weighted selection.
    ///
    /// `cdf[i]` = sum of `servers[0..=i].weight`. Weighted schedulers perform
    /// a binary search on this array for O(log N) weighted selection instead
    /// of O(N) linear scan. Rebuilt on every write (server add/remove/weight
    /// change). Empty entries (weight = 0, drain mode) contribute 0 to the CDF
    /// and are skipped by weighted schedulers. Same length as `servers`.
    pub cdf: Box<[u32]>,
}

/// A real server: a backend host that handles connections for a virtual service.
pub struct IpvsRealServer {
    /// Real server IP address.
    pub addr: IpAddr,
    /// Real server port. May differ from the virtual port (port mapping).
    pub port: u16,
    /// Scheduling weight.
    ///
    /// - `0` — **drain mode**: no new connections are scheduled to this server;
    ///   existing connections in the `IpvsConnTable` continue until they close
    ///   naturally or time out. Used for graceful removal before maintenance.
    ///   Weight 0 is NOT the same as removing the server — the entry remains in
    ///   the IPVS table with its existing connections tracked until the weight
    ///   is raised back above 0 or the server is explicitly deleted via
    ///   `IPVS_CMD_DEL_DEST` / `IP_VS_SO_SET_DELDEST`.
    /// - `1–65535` — normal operation; higher weight means proportionally more
    ///   new connections scheduled (weighted round-robin / least-connections).
    pub weight: u16,
    /// Packet forwarding method for this real server.
    pub fwd_method: IpvsFwdMethod,
    /// Number of currently active connections (TCP ESTABLISHED or UDP active).
    /// AtomicU32 with saturating arithmetic (connection count cannot go negative).
    pub activeconns: AtomicU32,
    /// Number of inactive connections (TCP TIME_WAIT, FIN_WAIT, etc.).
    pub inactconns: AtomicU32,
    /// Per-real-server statistics (bytes, packets, connections).
    pub stats: IpvsStats,
}

/// An established IPVS connection entry.
///
/// Created on the first SYN (TCP) or first packet (UDP) from a new client;
/// destroyed when the connection enters TIME_WAIT and the timeout expires.
/// Keyed in `IpvsConnTable` by `(protocol, caddr, cport, vaddr, vport)`.
pub struct IpvsConn {
    /// Transport protocol.
    pub protocol: IpvsProtocol,
    /// Client (source) IP address.
    pub caddr: IpAddr,
    /// Client (source) port.
    pub cport: u16,
    /// Virtual IP address (destination as seen by client).
    pub vaddr: IpAddr,
    /// Virtual port.
    pub vport: u16,
    /// Destination IP (real server address after forwarding decision).
    pub daddr: IpAddr,
    /// Destination port (real server port, may differ from vport).
    pub dport: u16,
    /// Forwarding method in use for this connection.
    pub fwd_method: IpvsFwdMethod,
    /// TCP/UDP state as tracked by IPVS (independent of nf_conntrack).
    pub state: IpvsConnState,
    /// Remaining time before this connection entry is garbage-collected (seconds).
    /// `AtomicU32` because the timer callback (softirq) decrements this field
    /// concurrently with the packet forwarding path (NAPI softirq) resetting it
    /// on TCP state transitions, and the `IP_VS_SO_GET_DESTS` userspace query
    /// (syscall) reading it. Timer decrements use `fetch_sub(1, Relaxed)`.
    /// Packet path resets use `.store(new_val, Relaxed)`. Userspace reads use
    /// `.load(Relaxed)`.
    pub timeout: AtomicU32,
    /// Timer handle that fires on timeout expiry, triggering GC.
    /// Uses the kernel's `TimerHandle` type ([Section 7.8](07-scheduling.md#timekeeping-and-clock-management)).
    pub timer: TimerHandle,
    /// Back-pointer to the real server (for `activeconns`/`inactconns` accounting).
    pub real_server: Weak<IpvsRealServer>,
}

/// Packet forwarding method.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum IpvsFwdMethod {
    /// Full NAT: DNAT inbound, SNAT outbound. Both directions via load balancer.
    Masquerade,
    /// Direct Routing: rewrite destination MAC only. Real server replies directly.
    DirectRouting,
    /// IP-in-IP tunnel: encapsulate packet, real server decapsulates and replies directly.
    Tunnel,
    /// Local node: deliver to local socket on the same host.
    LocalNode,
}

/// TCP/UDP connection state as tracked by IPVS.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum IpvsConnState {
    None,
    /// TCP connection fully established (ESTABLISHED).
    EstablishedTcp,
    /// SYN sent by client, SYN-ACK not yet seen.
    SynSent,
    /// SYN-ACK seen, waiting for client ACK.
    SynRecv,
    /// FIN sent by one side; connection closing.
    FinWait,
    /// TCP TIME_WAIT: waiting for duplicate segments to expire.
    TimeWait,
    /// TCP CLOSE: both FINs exchanged.
    Close,
    /// TCP CLOSE_WAIT: FIN received, application not yet closed.
    CloseWait,
    /// TCP LAST_ACK: FIN sent after CLOSE_WAIT, awaiting ACK.
    LastAck,
    /// TCP LISTEN: server socket listening (LocalNode mode).
    Listen,
    /// SYN-ACK sent by server, completing three-way handshake.
    Synack,
    /// UDP active flow (packet seen within timeout window).
    Udp,
    /// ICMP error response in flight.
    Icmp,
}

/// Transport protocol selector for IPVS services and connections.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum IpvsProtocol {
    Tcp  = 6,
    Udp  = 17,
    Sctp = 132,
}

bitflags::bitflags! {
    /// Virtual service flags.
    pub struct IpvsServiceFlags: u32 {
        /// Persistent service: use source-IP session affinity.
        const PERSISTENT   = 1 << 0;
        /// Hashed scheduler: use a hash of the destination IP (dh scheduler).
        const HASHED       = 1 << 1;
        /// One-packet scheduling: schedule every UDP packet independently.
        const ONE_PACKET   = 1 << 2;
        /// SYN proxy: validate TCP SYN cookies before creating a connection entry.
        const SYN_PROXY    = 1 << 3;
    }
}

/// Aggregate statistics for a virtual service or real server.
pub struct IpvsStats {
    /// Total connections handled (atomic, updated on connection creation).
    pub conns:   AtomicU64,
    /// Total inbound packets processed.
    pub inpkts:  AtomicU64,
    /// Total outbound packets processed (NAT mode only; DR/TUN: outbound bypasses LB).
    pub outpkts: AtomicU64,
    /// Total inbound bytes.
    pub inbytes:  AtomicU64,
    /// Total outbound bytes.
    pub outbytes: AtomicU64,
}

16.30.3 Scheduling Algorithms

Scheduling algorithms implement the IpvsScheduler trait:

/// Pluggable scheduling algorithm for selecting a real server.
///
/// Implementations must be `Send + Sync` (called from multiple CPUs concurrently).
/// They must not block; internal state uses lock-free structures or fine-grained locking.
pub trait IpvsScheduler: Send + Sync {
    /// Short name used in `ipvsadm -s` and `/proc/net/ip_vs`.
    fn name(&self) -> &'static str;

    /// Select a real server for a new connection.
    ///
    /// `service` provides access to the real server list and connection table.
    /// Returns `None` if no server with non-zero weight is available (all
    /// servers are administratively disabled or at zero weight).
    fn schedule(
        &self,
        service: &IpvsService,
        conn: &IpvsConn,
    ) -> Option<Arc<IpvsRealServer>>;
}

All schedulers access service.real_servers via an RcuReadGuard — no lock, no contention on the per-connection scheduling hot path. The IpvsServerTable is immutable for the duration of the RCU read-side critical section. N (the real-server count) is bounded only by available memory (no hardcoded limit).

UmkaOS provides the following built-in schedulers:

Round Robin (rr): Cycles sequentially through the servers array, skipping servers with weight = 0. Uses an AtomicUsize index into the array, incremented with fetch_add(..., Relaxed) on each scheduling decision. Lock-free: RCU read guard + atomic index.
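A user-space sketch of this selection loop (the `Server` struct and free-standing `rr_schedule` are illustrative stand-ins for the kernel's scheduler state, which lives behind the RCU read guard):

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

/// Illustrative stand-in for a real-server entry (weight 0 = drained).
struct Server {
    weight: u32,
}

/// Round robin: advance an atomic cursor, skipping zero-weight servers.
/// Returns the chosen index, or None if every server is drained.
fn rr_schedule(servers: &[Server], next: &AtomicUsize) -> Option<usize> {
    let n = servers.len();
    for _ in 0..n {
        // Relaxed is sufficient: the cursor is only a fairness hint,
        // not a synchronization point.
        let idx = next.fetch_add(1, Ordering::Relaxed) % n;
        if servers[idx].weight > 0 {
            return Some(idx);
        }
    }
    None // all servers have weight 0
}
```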

Weighted Round Robin (wrr): Servers are selected proportionally to their weight using the pre-computed cdf array in IpvsServerTable. Selection: generate a random value in [0, total_weight) and binary-search the CDF array to find the corresponding server. The CDF is rebuilt (O(N)) on server weight change (admin action, cold path). Per-connection scheduling is O(log N) binary search on the CDF — lock-free under RCU read guard.
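The CDF build and selection steps can be sketched as follows (plain slices stand in for IpvsServerTable, and the random draw is passed in as `point` rather than generated):

```rust
/// Build the cumulative weight table. O(N); rebuilt on weight change
/// (admin action, cold path).
fn build_cdf(weights: &[u64]) -> Vec<u64> {
    let mut acc = 0u64;
    weights
        .iter()
        .map(|w| {
            acc += *w;
            acc
        })
        .collect()
}

/// Select the server owning `point` in [0, total_weight): the first CDF
/// entry strictly greater than `point`. O(log N) binary search.
fn wrr_pick(cdf: &[u64], point: u64) -> usize {
    cdf.partition_point(|&c| c <= point)
}
```

With weights [3, 1, 2], points 0..3 map to server 0, point 3 to server 1, and points 4..6 to server 2, giving each server a share proportional to its weight.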

Least Connection (lc): Selects the server with the fewest active connections. Scans the servers array, comparing active_conns.load(Relaxed). O(N) per decision — cheap for typical pool sizes; very large pools should prefer mh (O(1)) or wrr (O(log N)). Lock-free: RCU read guard + per-server atomic counter.

Weighted Least Connection (wlc): Selects the server minimizing active_conns / weight. Equivalent to lc but normalizes by weight so higher-weight servers absorb proportionally more connections. Same O(N) scan as lc.

Source Hashing (sh): Hashes the source IP address to select a server. server_index = jhash(saddr) % N. Provides session affinity — the same client IP always reaches the same server (assuming stable server membership). Lock-free: pure computation + RCU.

Destination Hashing (dh): Hashes the destination IP address to select a server. server_index = jhash(daddr) % N. Used for cache-affinity load balancing (e.g., reverse proxies where each server caches a different content shard).

Shortest Expected Delay (sed): Selects the server with the lowest expected response time, estimated as (active_conns + 1) / weight. Biases toward servers that have both low load and high weight. O(N) scan per decision.

Never Queue (nq): Selects any server with zero active connections. If all servers have at least one connection, falls back to sed. Optimizes for latency-sensitive workloads where idle servers should be used first.
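Both estimators can avoid floating point by cross-multiplying the (active_conns + 1) / weight comparison. A sketch, with servers as (active_conns, weight) pairs (illustrative types, not the kernel's):

```rust
/// sed: pick the server minimizing (active_conns + 1) / weight.
/// Cross-multiplication keeps the comparison in integer arithmetic:
/// (a1+1)/w1 < (a2+1)/w2  <=>  (a1+1)*w2 < (a2+1)*w1.
fn sed_pick(servers: &[(u64, u64)]) -> Option<usize> {
    servers
        .iter()
        .enumerate()
        .filter(|&(_, &(_, w))| w > 0) // skip drained servers
        .min_by(|&(_, &(a1, w1)), &(_, &(a2, w2))| {
            ((a1 + 1) * w2).cmp(&((a2 + 1) * w1))
        })
        .map(|(i, _)| i)
}

/// nq: prefer any idle server; otherwise fall back to sed.
fn nq_pick(servers: &[(u64, u64)]) -> Option<usize> {
    servers
        .iter()
        .position(|&(a, w)| w > 0 && a == 0) // an idle server, if any
        .or_else(|| sed_pick(servers))
}
```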

Maglev Hashing (mh): Consistent hashing using Google's Maglev algorithm. Builds a lookup table of configurable size (prime from 251 to 131071, default 65521) where each entry maps to a real server. Connection scheduling is O(1): server = table[hash(tuple) % table_size]. Provides minimal disruption on server membership changes (only connections hashed to the changed server are remapped). Used by kube-proxy (--ipvs-scheduler=mh) for Kubernetes active-active load balancing. No state synchronization needed between LB nodes.
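A compact sketch of the Maglev table build under the constraints above (std's DefaultHasher stands in for the kernel hash; `m` must be prime so each backend's (offset, skip) sequence is a full permutation of the table):

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Stand-in hash; the kernel would use its keyed hash.
fn mh_hash(name: &str, seed: u64) -> u64 {
    let mut s = DefaultHasher::new();
    seed.hash(&mut s);
    name.hash(&mut s);
    s.finish()
}

/// Build a Maglev lookup table of prime size `m`: each backend claims
/// slots in its own (offset, skip) permutation order, round-robin,
/// until the table is full. Lookup is then table[hash(tuple) % m], O(1).
fn maglev_build(backends: &[&str], m: usize) -> Vec<usize> {
    let params: Vec<(usize, usize)> = backends
        .iter()
        .map(|b| (mh_hash(b, 1) as usize % m, mh_hash(b, 2) as usize % (m - 1) + 1))
        .collect();
    let mut next = vec![0usize; backends.len()];
    let mut table = vec![usize::MAX; m]; // MAX = empty slot
    let mut filled = 0;
    while filled < m {
        for (i, &(offset, skip)) in params.iter().enumerate() {
            // Advance backend i along its permutation to the next free slot.
            loop {
                let slot = (offset + next[i] * skip) % m;
                next[i] += 1;
                if table[slot] == usize::MAX {
                    table[slot] = i;
                    filled += 1;
                    break;
                }
            }
            if filled == m {
                break;
            }
        }
    }
    table
}
```

The round-robin fill is what bounds disruption: removing one backend and rebuilding leaves most slots assigned to the same surviving backend.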

Locality-Based Least Connection (lblc): Assigns connections for the same destination IP to the same server, providing cache locality. If the assigned server is overloaded (active connections exceed weight), falls back to least-connection selection across all servers. Used in CDN and caching proxy deployments where each server caches different content.

Locality-Based Least Connection with Replication (lblcr): Like lblc but maintains a per-destination server set instead of a single server assignment. When the current set is overloaded, a new server is replicated into the set via least-connection selection. Servers are pruned from the set when idle. Provides both locality and load distribution for popular destinations.

Weighted Failover (fo): Always selects the available server with the highest weight. All traffic goes to a single primary server; traffic moves to the next-highest-weight server only when the primary becomes unavailable. Used for active-passive failover configurations.

Overflow (ovf): Selects the highest-weight server and routes all traffic to it until its active connection count exceeds its weight. Once overflowed, connections spill to the next-highest-weight server. Provides weighted active-passive behavior with overflow capacity.

16.30.3.1 Netfilter Hook Points

IPVS hooks into the netfilter framework (Section 16.18):

Hook point Priority Purpose
NF_INET_LOCAL_IN NF_IP_PRI_NAT_DST + 1 Match incoming packets to VIPs; create or look up connection entries; forward to real server.
NF_INET_FORWARD NF_IP_PRI_CONNTRACK Handle forwarded traffic in DR and TUN modes; update connection state.
NF_INET_POST_ROUTING NF_IP_PRI_NAT_SRC SNAT return traffic (NAT/Masquerade mode): rewrite source address to VIP.
NF_INET_LOCAL_OUT NF_IP_PRI_NAT_DST LocalNode forwarding: intercept locally generated packets destined for VIPs on loopback.

The NF_INET_LOCAL_IN hook is the primary entry point. For each incoming packet:

  1. Connection table lookup: Look up (protocol, saddr, sport, daddr, dport) in the global IPVS_CONN_TABLE. If a matching IpvsConn is found, update its timeout and proceed to forwarding. This is the fast path for established connections.

  2. Virtual service lookup: If no connection entry exists, look up (protocol, daddr, dport) in the global virtual service table (RcuHashMap<(IpvsProtocol, IpAddr, u16), Arc<IpvsService>>). If no match, let the packet pass through (NF_ACCEPT). The service table uses RcuHashMap because the key is a composite tuple (not a simple integer, so XArray is inapplicable), lookups occur on the per-packet hot path (lock-free RCU reads required), and no ordered iteration or range queries are needed — only exact-match by 3-tuple.

  3. Scheduling: Call service.scheduler.schedule(&service, &new_conn) to select a real server. If no server is available (weight = 0 for all), drop the packet (NF_DROP) and log to the IPVS stats (no_route).

  4. Connection entry creation: Allocate an IpvsConn, insert into the global IPVS_CONN_TABLE, increment real_server.activeconns.

  5. Forwarding: Rewrite the packet according to fwd_method and re-inject via ip_route_output or dev_queue_xmit (for DR).

nf_conntrack integration: IPVS creates nf_conntrack entries for NAT-mode connections to enable TCP state tracking and ensure symmetrical NAT rewrite on both directions. The conntrack entry is linked to the IpvsConn via a private extension (nf_ct_ext_add). DR and TUN modes do not create conntrack entries for the return path (which bypasses the load balancer).

16.30.4 Connection Table

The IPVS connection table is a single global XArray of IpvsConn entries with RCU-protected reads for lock-free lookups on the packet forwarding hot path. This matches Linux's single global connection hash table architecture. RSS provides CPU affinity for connection locality; per-service tables would add complexity without measurable benefit.

/// Global IPVS connection table. XArray keyed by connection hash (u64)
/// provides O(log₆₄ N) lookup with native RCU-safe reads — no explicit
/// bucket chains, no per-bucket caps, no DoS surface from hash collisions.
///
/// XArray's tree structure eliminates the bucket chain length problem:
/// entries with the same hash prefix share XArray tree nodes, but each
/// entry occupies its own slot. There is no linear scan within a bucket.
///
/// Tunable via `/proc/sys/net/ipv4/vs/conn_tab_bits` for compatibility
/// with Linux tooling (the value is accepted but only affects the hash
/// function seed, not XArray's internal tree structure).
pub struct IpvsConnTable {
    /// XArray mapping connection_hash (u64) -> ArrayVec<Arc<IpvsConn>, 16>.
    /// Each XArray slot holds a small chain (up to 16 entries) to handle
    /// hash collisions. `xa_store()` overwrites a single entry; without
    /// chaining, a collision would silently drop an active connection.
    ///
    /// **Chain length of 16**: With SipHash-1-3 and millions of connections,
    /// natural collision probability per bucket is <0.001%. The chain length
    /// of 16 (up from 4) provides additional headroom against hash collision
    /// attacks: an attacker would need to find 17 inputs that collide under
    /// SipHash-1-3 with a secret key, which is computationally infeasible
    /// without side-channel information. Memory overhead is minimal: each
    /// `Arc` is 8 bytes, so increasing from 4 to 16 adds 96 bytes per
    /// XArray slot that has collisions (most slots have 0-1 entries).
    ///
    /// **Overflow monitoring**: When a chain reaches length 16, `insert()`
    /// returns `ResourceExhausted` and increments a per-table
    /// `chain_overflow_count: AtomicU64` counter, exposed via
    /// `/proc/sys/net/ipv4/vs/conn_chain_overflows` for monitoring.
    ///
    /// Readers (packet forwarding hot path) call `xa_load()` under
    /// `rcu_read_lock()`, then scan the ArrayVec for exact 5-tuple match.
    /// Writers (insert/remove) modify the ArrayVec under XArray's internal
    /// lock.
    /// Integer-keyed mapping → XArray per collection policy.
    table: XArray<ArrayVec<Arc<IpvsConn>, 16>>,
    /// Overflow counter: incremented when `insert()` fails due to chain
    /// length exceeding 16. Exposed via procfs for DoS monitoring.
    chain_overflow_count: AtomicU64,
    /// Current number of active connection entries.
    count: AtomicUsize,
}

impl IpvsConnTable {
    /// Look up a connection by 5-tuple. Called from the hot path (NF_INET_LOCAL_IN).
    /// Computes `conn_hash(proto, caddr, cport, vaddr, vport)` and performs
    /// `xa_load()` under `rcu_read_lock()`. O(log₆₄ N) with zero contention
    /// for the XArray lookup.
    ///
    /// After `xa_load()`, iterates all entries in the `ArrayVec` chain and
    /// returns the first entry whose 5-tuple fields exactly match the query
    /// parameters. Returns `None` if no entry in the chain matches (normal
    /// "connection not found" result, not a warning).
    pub fn lookup(
        &self,
        proto: IpvsProtocol,
        caddr: IpAddr, cport: u16,
        vaddr: IpAddr, vport: u16,
    ) -> Option<Arc<IpvsConn>>;

    /// Insert a newly created connection entry.
    /// Computes the connection hash. Loads the existing `ArrayVec` at that
    /// XArray slot (if any), pushes the new `IpvsConn` into it, and stores
    /// the updated chain via `xa_store()`. If the slot is empty, creates a
    /// new `ArrayVec` with the single entry.
    ///
    /// # Errors
    /// Returns `KernelError::ResourceExhausted` if the chain at this hash
    /// slot already contains 16 entries (maximum chain length). This indicates
    /// extreme hash collision pressure; the caller logs a warning, increments
    /// `chain_overflow_count`, and drops the connection.
    pub fn insert(&self, conn: Arc<IpvsConn>) -> Result<(), KernelError>;

    /// Remove an expired or closed connection entry. Called from the timer
    /// expiry path or explicit connection teardown.
    ///
    /// Loads the `ArrayVec` chain at the connection's hash slot, finds the
    /// entry matching the connection's 5-tuple via `retain()` or
    /// `swap_remove()`, and stores the updated chain via `xa_store()`.
    /// If the chain becomes empty after removal, calls `xa_erase()` to
    /// reclaim the XArray slot. RCU publishing is handled by XArray
    /// internally; the old chain is freed after the next grace period.
    pub fn remove(&self, conn: &IpvsConn);
}

/// Global IPVS connection table. Shared across all virtual services.
/// Matches Linux's architecture: a single hash table (here, XArray) indexed
/// by connection 5-tuple hash. Initialized at IPVS module load time.
static IPVS_CONN_TABLE: OnceCell<IpvsConnTable> = OnceCell::new();

The hash key is conn_hash(protocol, caddr, cport, vaddr, vport) producing a u64. XArray provides RCU-safe reads natively: the forwarding hot path calls xa_load() under rcu_read_lock() with O(log₆₄ N) lookup and zero lock contention, then scans the ArrayVec chain (up to 16 entries) for an exact 5-tuple match. Writers load the existing chain, modify it (push/remove), and call xa_store() with XArray's internal RCU publishing. The chain length of 16 provides ample headroom against both natural hash collisions (<0.001% with SipHash-1-3) and deliberate collision attacks. Chain overflow (>16 entries per slot) returns ResourceExhausted and increments chain_overflow_count.
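The hash-then-scan structure can be sketched in user space (a HashMap of Vec chains stands in for XArray<ArrayVec<_, 16>>, and a toy XOR hash for keyed SipHash-1-3; none of this is the kernel's actual code):

```rust
use std::collections::HashMap;

/// 5-tuple key of a connection entry (IPv4-only for brevity).
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
struct ConnKey { proto: u8, caddr: u32, cport: u16, vaddr: u32, vport: u16 }

/// Toy stand-in for the kernel's keyed SipHash-1-3.
fn conn_hash(k: &ConnKey) -> u64 {
    (k.proto as u64)
        ^ ((k.caddr as u64) << 8)
        ^ ((k.cport as u64) << 40)
        ^ (k.vaddr as u64).rotate_left(17)
        ^ ((k.vport as u64) << 24)
}

const MAX_CHAIN: usize = 16;

/// Insert: load the chain at the hash slot, reject on overflow
/// (the kernel's ResourceExhausted), otherwise append.
fn insert(table: &mut HashMap<u64, Vec<ConnKey>>, k: ConnKey) -> Result<(), ()> {
    let chain = table.entry(conn_hash(&k)).or_default();
    if chain.len() >= MAX_CHAIN {
        return Err(()); // chain_overflow_count would be incremented here
    }
    chain.push(k);
    Ok(())
}

/// Hot-path lookup: hash, load the chain, scan for an exact 5-tuple match.
fn lookup(table: &HashMap<u64, Vec<ConnKey>>, k: &ConnKey) -> Option<ConnKey> {
    table.get(&conn_hash(k))?.iter().copied().find(|c| c == k)
}
```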

Connection expiry uses per-connection TimerHandle instances. When a timer fires, the connection's timeout field is atomically decremented via timeout.fetch_sub(1, Relaxed); if the previous value was 1 (now zero), the entry is removed from the table and the real_server.activeconns (or inactconns) counter is decremented. TCP state transitions (SYN_SENT → ESTABLISHED → TIME_WAIT → CLOSED) reset the timer atomically via timeout.store(new_val, Relaxed) with the appropriate timeout value for the new state (same timeouts as Linux: ESTABLISHED 15 min, TIME_WAIT 120 s, SYN_SENT 1 min, SYN_RECV 1 min).
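A sketch of the tick/transition logic under the defaults stated above (the enum subset and helper names are illustrative; only the timeouts quoted in this section are encoded):

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Subset of IpvsConnState sufficient for the timeout sketch.
#[derive(Clone, Copy)]
enum ConnState { SynSent, SynRecv, EstablishedTcp, TimeWait, Udp }

/// Timeout (seconds) installed on entering `state`.
fn state_timeout_secs(state: ConnState) -> u64 {
    match state {
        ConnState::SynSent | ConnState::SynRecv => 60,
        ConnState::EstablishedTcp => 900, // timeout_tcp sysctl default
        ConnState::TimeWait => 120,
        ConnState::Udp => 300,            // timeout_udp sysctl default
    }
}

/// State transition: atomically install the new state's timeout.
fn enter_state(timeout: &AtomicU64, state: ConnState) {
    timeout.store(state_timeout_secs(state), Ordering::Relaxed);
}

/// One-second timer tick. Returns true when the connection has just
/// expired (previous value was 1, now zero) and must be removed.
fn tick(timeout: &AtomicU64) -> bool {
    timeout.fetch_sub(1, Ordering::Relaxed) == 1
}
```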

16.30.5 Health Checking Integration

IPVS itself does not perform health checks. Health checking is the responsibility of a user-space daemon (keepalived, HAProxy, or a cloud-native controller). The kernel provides the following mechanisms for user-space to signal server health:

  • IP_VS_SO_SET_EDITDEST (ioctl) or IPVS_CMD_SET_DEST (Generic Netlink): Update IpvsRealServer.weight. Setting weight = 0 drains the server: no new connections are assigned, but existing connections in the IpvsConnTable continue until they time out or are explicitly removed. Setting weight > 0 re-enables the server.

  • IP_VS_SO_SET_DELDEST / IPVS_CMD_DEL_DEST: Remove the server from the pool immediately. Existing IpvsConn entries retain their Weak<IpvsRealServer> reference; the strong Arc is released from the real_servers list. When the last IpvsConn drops its Weak reference, the IpvsRealServer is freed.

Graceful drain sequence (as used by keepalived before a rolling upgrade):

  1. IPVS_CMD_SET_DEST weight=0 — stops new connection scheduling.
  2. Poll IP_VS_SO_GET_DESTS: wait until activeconns = 0 and inactconns = 0.
  3. IPVS_CMD_DEL_DEST — remove the server entry.

This sequence is identical to Linux's ip_vs behaviour and is relied upon by Kubernetes kube-proxy during endpoint removal.

16.30.6 Userspace Interface

ipvsadm communicates with the IPVS kernel subsystem via two equivalent interfaces:

Legacy socket ioctl (compatibility with ipvsadm < 1.28 and older scripts): A raw IP socket is created with socket(AF_INET, SOCK_RAW, IPPROTO_RAW). setsockopt and getsockopt calls on this socket with level IPPROTO_IP and option names in the IP_VS_SO_* range carry IPVS commands:

Option Direction Purpose
IP_VS_SO_SET_ADD set Add a virtual service
IP_VS_SO_SET_EDIT set Edit a virtual service (scheduler, flags)
IP_VS_SO_SET_DEL set Delete a virtual service
IP_VS_SO_SET_FLUSH set Delete all virtual services
IP_VS_SO_SET_ADDDEST set Add a real server to a virtual service
IP_VS_SO_SET_EDITDEST set Edit a real server (weight, fwd method)
IP_VS_SO_SET_DELDEST set Remove a real server
IP_VS_SO_SET_TIMEOUT set Set TCP/UDP/SCTP connection timeouts
IP_VS_SO_GET_VERSION get Kernel IPVS version string
IP_VS_SO_GET_INFO get Global stats: num services, conn_tab_size
IP_VS_SO_GET_SERVICES get List of all virtual services
IP_VS_SO_GET_SERVICE get Single virtual service by VIP:port
IP_VS_SO_GET_DESTS get Real servers for a virtual service

Generic Netlink (modern interface, ipvsadm ≥ 1.28, kube-proxy): The IPVS subsystem registers a Generic Netlink family named "IPVS" with the kernel's Generic Netlink layer. Commands mirror the ioctl set: IPVS_CMD_NEW_SERVICE, IPVS_CMD_SET_SERVICE, IPVS_CMD_DEL_SERVICE, IPVS_CMD_GET_SERVICE, IPVS_CMD_NEW_DEST, IPVS_CMD_SET_DEST, IPVS_CMD_DEL_DEST, IPVS_CMD_GET_DEST. Attributes carry the same fields as the ioctl structures (ip_vs_service_user, ip_vs_dest_user) but in netlink TLV form, enabling extensibility without ABI breaks.

procfs read-only status:

/proc/net/ip_vs          — virtual service list with scheduler and stats
/proc/net/ip_vs_conn     — active connection table dump
/proc/sys/net/ipv4/vs/   — sysctl namespace:
    conn_tab_bits        (rw) hash table size as log2 (default: 12)
    expire_nodest_conn   (rw) expire connections on dest removal (default: 0)
    expire_quiescent_template (rw) expire persistence templates (default: 0)
    nat_icmp_send        (rw) send ICMP errors from NAT (default: 0)
    sync_threshold       (rw) connection sync threshold (default: 3 50)
    timeout_tcp          (rw) TCP ESTABLISHED timeout seconds (default: 900)
    timeout_tcp_fin      (rw) TCP FIN_WAIT timeout (default: 120)
    timeout_udp          (rw) UDP flow timeout (default: 300)

16.30.7 IPVS and Kubernetes kube-proxy

Kubernetes kube-proxy --mode=ipvs uses IPVS for all Service type ClusterIP, NodePort, and LoadBalancer load balancing. For UmkaOS to support kube-proxy in IPVS mode, the following must be satisfied — all of which the above design meets:

  • IPv4 and IPv6 (IpAddr is an enum over Ipv4Addr and Ipv6Addr; IPVS services are created independently for each address family).
  • FWD_MASQ (NAT mode): Required for ClusterIP services where the real server is on a different node. UmkaOS supports full NAT with nf_conntrack integration.
  • sh, rr, lc schedulers: kube-proxy defaults to rr; users may select lc or sh. All three are implemented.
  • Session persistence (IpvsServiceFlags::PERSISTENT, non-zero timeout): Used for sessionAffinity: ClientIP Services. UmkaOS honours the timeout field.
  • Graceful server drain (weight = 0): kube-proxy sets weight to 0 when a pod is being terminated and the endpoint is removed from the Endpoints object.
  • /proc/sys/net/ipv4/vs/ sysctl namespace: kube-proxy writes conn_tab_bits and expire_nodest_conn. Both are emulated by UmkaOS's procfs.
  • Generic Netlink IPVS family: kube-proxy uses the IPVS Netlink family for all service and destination management. UmkaOS's implementation exposes identical attributes and command semantics.

kube-proxy additionally uses ipset (netfilter IP sets) for efficient NodePort and externalIPs matching. UmkaOS's netfilter layer (Section 16.18) supports ipset-style match modules; kube-proxy's ipset rules operate identically.

16.30.8 Linux Compatibility

UmkaOS's IPVS subsystem is binary-compatible with Linux's ip_vs module behaviour:

  • ipvsadm (all versions): both the ioctl socket API and the Generic Netlink API work without recompilation. The ioctl option numbers, structure layouts (ip_vs_service_user, ip_vs_dest_user, ip_vs_get_info), and the Generic Netlink family name and attribute types are identical to Linux 5.19+.
  • /proc/net/ip_vs and /proc/net/ip_vs_conn: output format identical to Linux (column widths, field ordering). Scripts that awk/grep these files work unchanged.
  • /proc/sys/net/ipv4/vs/ sysctl tree: all keys present, same defaults, same semantics.
  • Connection state machine: TCP state timeouts and transition logic match Linux's ip_vs_proto_tcp.c behaviour exactly, ensuring that keepalived's connection drain logic (which polls activeconns/inactconns via IP_VS_SO_GET_DESTS) operates correctly.
  • Scheduling algorithm names: "rr", "wrr", "lc", "wlc", "lblc", "lblcr", "sh", "dh", "sed", "nq", "fo", "ovf", "mh" — identical strings to Linux, used by ipvsadm -s and kube-proxy's scheduler selection.

16.31 Network Service Provider

Provider model: A NIC with Tier M firmware IS the network service provider directly — the device advertises EXTERNAL_NETWORK via CapAdvertise, and the host creates NetworkServiceClient via PeerServiceProxy (Section 5.11). The NIC can also serve as cluster transport (Section 5.10). Without Tier M firmware, the host runs a KABI NIC driver and provides network service as a host-proxy. Sharing model: shared — multiple peers get independent virtual interfaces (per-client MAC in L2 mode, per-client NAT in L3 mode).

When a node has an external NIC (a physical network interface reaching outside the cluster fabric), the network subsystem can provide that interface as a cluster service via the peer protocol. Remote peers that lack direct external connectivity use this service to reach external networks.

This is the network-subsystem instantiation of the capability service provider model described in Section 5.7.

16.31.1 Motivation

Not every node in a cluster has external network access:

  • CXL-attached compute sleds may have only a CXL link to a host — no Ethernet port.
  • Embedded accelerator nodes (FPGAs, inference ASICs) connect to the cluster via PCIe or internal RDMA but have no external NIC.
  • Air-gapped partitions may designate specific gateway nodes for external traffic (security policy).
  • DPU-mediated access: a BlueField DPU owns the physical port; the host reaches external networks via the DPU's ServiceId("nic_offload") (Section 5.11). The network service provider extends this pattern cluster-wide: other nodes can route through the DPU-equipped host.

16.31.2 Service Provider and Wire Protocol

// umka-net/src/service_provider.rs

/// Provides external network access to cluster peers.
/// The service provider creates a virtual network interface on the serving
/// host, bridged or routed to the physical external NIC. Remote peers
/// submit and receive packets via the peer protocol.
pub struct NetServiceProvider {
    /// Physical NIC backing this service.
    device: NetDeviceHandle,
    /// Unique service instance identifier.
    service_id: ServiceInstanceId,
    /// Service endpoint on the peer protocol.
    endpoint: PeerServiceEndpoint,
    /// Maximum transmission unit (bytes) for the external link.
    mtu: u32,
    /// Whether RDMA proxy is available (requires RDMA-capable external NIC).
    rdma_capable: bool,
    /// Connected client peers, each with a virtual interface.
    /// XArray provides O(1) lookup by PeerId (u64) with native RCU read-side access.
    clients: XArray<NetServiceClient>,
}

/// Per-client state on the provider side.
pub struct NetServiceClient {
    /// Peer consuming this network service.
    peer_id: PeerId,
    /// MAC address assigned to this client's virtual interface.
    mac: [u8; 6],
    /// IPv4/IPv6 addresses assigned (from provider's subnet or DHCP relay).
    /// 8 entries: accommodates dual-stack (IPv4 + IPv6 link-local + IPv6 global
    /// + IPv6 ULA) plus additional addresses from multiple subnets or aliases.
    addresses: ArrayVec<IpAddr, 8>,
    /// Traffic shaping: bandwidth limit (bytes/sec), 0 = unlimited.
    bandwidth_limit: u64,
    /// Packet counters for diagnostics and QoS enforcement.
    tx_packets: AtomicU64,
    rx_packets: AtomicU64,
}

Bandwidth enforcement: Token bucket rate limiter per client, applied to the TX path (client → external):

  • bucket_size = bandwidth_limit (bytes).
  • refill_rate = bandwidth_limit (bytes/sec).
  • Packets exceeding the rate are queued (bounded: 256 packets). Queue overflow causes packets to be dropped, with -ENOBUFS reported to the client via a LinkStatus message carrying the RATE_LIMITED flag.
  • RX path (external → client): the provider's MACVLAN/NAT handles rate limiting via tc qdisc if configured. No per-client kernel enforcement on RX by default (external traffic rate is bounded by the physical link).
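A minimal sketch of the per-client token bucket, with elapsed time passed in explicitly rather than read from a clock:

```rust
/// Per-client TX token bucket: bucket_size = refill_rate = bandwidth_limit,
/// as specified above. Tokens are bytes.
struct TokenBucket {
    tokens: u64,      // currently available bytes
    bucket_size: u64, // burst capacity (bytes)
    refill_rate: u64, // bytes credited per second
}

impl TokenBucket {
    fn new(bandwidth_limit: u64) -> Self {
        TokenBucket {
            tokens: bandwidth_limit,
            bucket_size: bandwidth_limit,
            refill_rate: bandwidth_limit,
        }
    }

    /// Credit tokens for `elapsed_ms` milliseconds, capped at bucket size.
    fn refill(&mut self, elapsed_ms: u64) {
        let credit = self.refill_rate * elapsed_ms / 1000;
        self.tokens = (self.tokens + credit).min(self.bucket_size);
    }

    /// Attempt to transmit `len` bytes now; false means "queue the packet"
    /// (and drop with -ENOBUFS once the 256-entry queue overflows).
    fn try_consume(&mut self, len: u64) -> bool {
        if self.tokens >= len {
            self.tokens -= len;
            true
        } else {
            false
        }
    }
}
```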

PeerCapFlags: EXTERNAL_NETWORK (bit 8) — advertised by peers that provide external network access.

ServiceId: ServiceId("external_nic", 1).

PeerServiceDescriptor.properties (32 bytes):

/// Wire format: all integer fields are little-endian (`Le32`/`Le16`)
/// because this struct is embedded in `PeerServiceDescriptor.properties`
/// and transmitted cross-node via the peer protocol.
#[repr(C)]
pub struct ExternalNicProperties {
    /// External link speed in Mbps. Reflects the CURRENT negotiated link
    /// speed (updated on link renegotiation events). Not the maximum
    /// capability. Clients can monitor changes via LinkStatus messages.
    ///
    /// Unit: Mbps (not Gbps) to preserve sub-Gbps precision for common
    /// speeds like 2.5 Gbps (2500 Mbps, IEEE 802.3bz 2.5GBASE-T). At
    /// Mbps granularity, Le32 covers up to ~4.3 Pbps. This matches
    /// Linux's `ethtool_link_ksettings.speed` unit (Mbps, u32).
    pub link_speed_mbps: Le32,
    /// MTU of the external link.
    pub mtu: Le32,
    /// Capabilities bitmask.
    /// bit 0: L2 bridging (client gets own MAC on external network)
    /// bit 1: L3 routing (NAT/masquerade, client uses provider's IP)
    /// bit 2: RDMA proxy (provider's NIC supports RDMA, client can
    ///         establish RDMA connections to external hosts via proxy)
    /// bit 3: VLAN tagging (client can request a specific VLAN ID)
    pub capabilities: Le32,
    /// Number of client slots available (0 = unlimited).
    pub max_clients: Le16,
    /// Padding to fill the 32-byte `PeerServiceDescriptor.properties` slot.
    /// 4+4+4+2+18 = 32 bytes. Must be zeroed on transmit.
    pub _pad: [u8; 18],
}
const_assert!(size_of::<ExternalNicProperties>() == 32);

16.31.3 Access Modes

The network service provider supports three access modes, selected per-client at ServiceBind time via the properties blob in ServiceBindPayload:

L3 routing (NAT/masquerade) — simplest mode. The provider performs source NAT for outbound client traffic, using the provider's own external IP. Inbound connections to the client require explicit port-forwarding rules. No external MAC or IP allocated to the client. Analogous to a home router.

L2 bridging — the provider creates a virtual interface (MACVLAN or equivalent) on its physical NIC, assigns it to the client. The client gets its own MAC address and obtains an IP via DHCP or static configuration on the external network. The client is a full L2 participant on the external segment. Requires the external switch to accept multiple MACs on the provider's port.

MAC address assignment (L2 bridging mode): Provider generates a locally-administered MAC address per client:

mac[0] = 0x02             (locally administered, unicast)
mac[1] = provider_peer_id[0]  (scoped to provider)
mac[2..6] = hash(client_peer_id, service_instance_id)[0..4]

This guarantees uniqueness within one provider (different client_peer_id values hash differently) and across providers (different provider_peer_id bytes). No collision detection is needed: the 32-bit hash space vastly exceeds typical client counts per provider.
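A sketch of the derivation (std's DefaultHasher stands in for the kernel hash, and `client_mac` is an illustrative helper name, not the spec's API):

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Derive the locally-administered, unicast per-client MAC laid out above.
fn client_mac(provider_peer_id: u64, client_peer_id: u64, service_instance_id: u64) -> [u8; 6] {
    let mut h = DefaultHasher::new();
    client_peer_id.hash(&mut h);
    service_instance_id.hash(&mut h);
    let digest = h.finish().to_le_bytes();

    let mut mac = [0u8; 6];
    mac[0] = 0x02;                              // locally administered, unicast
    mac[1] = provider_peer_id.to_le_bytes()[0]; // scoped to the provider
    mac[2..6].copy_from_slice(&digest[0..4]);   // 32-bit per-client hash
    mac
}
```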

RDMA proxy — extends L2 bridging with RDMA capability. The provider allocates RDMA resources (protection domains, queue pairs) on its external NIC and proxies RDMA verbs from the client. The client can establish RDMA connections to external hosts (e.g., remote storage arrays, cross-datacenter GPU clusters) via the provider's NIC. Higher complexity; requires the provider's NIC to support multiple protection domains (standard for ConnectX, EFA, SFC).

16.31.4 Wire Protocol

Network service wire messages use ServiceMessage/ServiceResponse on the bound ring pair, with the following opcodes:

#[repr(u16)]
pub enum NetServiceOpcode {
    /// Client → provider: transmit packet to external network.
    /// Payload: Ethernet frame (L2 mode) or IP packet (L3 mode).
    TxPacket         = 0x0001,
    /// Provider → client: received packet from external network.
    /// Payload: Ethernet frame (L2 mode) or IP packet (L3 mode).
    RxPacket         = 0x0002,
    /// Client → provider: request address configuration.
    /// Payload: NetAddrRequest (requested IP, or DHCP flag).
    AddrRequest      = 0x0010,
    /// Provider → client: address configuration response.
    /// Payload: NetAddrResponse (assigned IP, gateway, DNS).
    AddrResponse     = 0x0011,
    /// Client → provider: configure packet filter (optional).
    /// Payload: BPF program (cBPF, max 256 instructions).
    SetFilter        = 0x0020,
    /// Provider → client: link status change notification.
    /// Payload: NetLinkStatus (up/down, speed change).
    LinkStatus       = 0x0030,
    /// RDMA proxy: client → provider: create remote QP.
    RdmaCreateQp     = 0x0040,
    /// RDMA proxy: provider → client: QP created, connection info.
    RdmaCreateQpAck  = 0x0041,
    /// RDMA proxy: client → provider: destroy remote QP.
    RdmaDestroyQp    = 0x0042,
}

Capability gating: remote network access requires CAP_NET_REMOTE (Section 9.1). Checked at ServiceBind time. L2 bridging and RDMA proxy modes additionally require CAP_NET_RAW (matching Linux's CAP_NET_RAW: it grants raw packet access on the external network segment).

NetAddrRequest / NetAddrResponse wire structs:

/// Address configuration request from client to provider.
/// Wire format: cross-node. Multi-byte integers use `Le16`.
#[repr(C)]
pub struct NetAddrRequest {
    /// Requested address mode.                    // offset 0, 1 byte
    /// 0 = DHCP (provider relays DHCP on behalf of client)
    /// 1 = static IPv4 (use requested_ipv4 field)
    /// 2 = static IPv6 (use requested_ipv6 field)
    /// 3 = SLAAC (IPv6 stateless autoconfiguration via provider's RA)
    pub mode: u8,
    pub _pad: [u8; 3],                            // offset 1, 3 bytes
    /// Requested VLAN ID (0 = no VLAN). Only used if provider
    /// advertises VLAN capability (bit 3 in ExternalNicProperties).
    pub vlan_id: Le16,                             // offset 4, 2 bytes
    pub _pad2: [u8; 2],                            // offset 6, 2 bytes
    /// Requested IPv4 address (for mode=1). Network byte order.
    pub requested_ipv4: [u8; 4],                   // offset 8, 4 bytes
    /// Requested IPv4 prefix length (e.g., 24 for /24).
    pub ipv4_prefix_len: u8,                       // offset 12, 1 byte
    pub _pad3: [u8; 3],                            // offset 13, 3 bytes
    /// Requested IPv6 address (for mode=2). Network byte order.
    pub requested_ipv6: [u8; 16],                  // offset 16, 16 bytes
    /// Requested IPv6 prefix length (e.g., 64).
    pub ipv6_prefix_len: u8,                       // offset 32, 1 byte
    pub _pad4: [u8; 3],                            // offset 33, 3 bytes
    // Total: 1+3+2+2+4+1+3+16+1+3 = 36 bytes
}
const_assert!(size_of::<NetAddrRequest>() == 36);

/// Address configuration response from provider to client.
/// Wire format: cross-node. Status field uses `Lei32` (little-endian i32).
#[repr(C)]
pub struct NetAddrResponse {
    /// 0 = success, negative errno on failure.
    /// -EADDRINUSE: requested IP already assigned to another client.
    /// -EINVAL: invalid address/prefix.
    /// -ENOTSUP: mode not supported by provider.
    pub status: Lei32,
    /// Assigned IPv4 address (may differ from requested if DHCP).
    pub assigned_ipv4: [u8; 4],
    pub ipv4_prefix_len: u8,
    pub _pad: [u8; 3],
    /// Default gateway IPv4.
    pub gateway_ipv4: [u8; 4],
    /// Primary DNS server IPv4.
    pub dns_ipv4: [u8; 4],
    /// Assigned IPv6 address.
    pub assigned_ipv6: [u8; 16],
    pub ipv6_prefix_len: u8,
    pub _pad2: [u8; 3],
    /// Default gateway IPv6.
    pub gateway_ipv6: [u8; 16],
    /// Assigned MAC address (for L2 bridging mode).
    pub assigned_mac: [u8; 6],
    pub _pad3: [u8; 2],
}
const_assert!(size_of::<NetAddrResponse>() == 64);
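The padding arithmetic behind the 64-byte assertion can be checked mechanically. A compilable sketch, with `Lei32` modeled as a transparent 4-byte wrapper (every field then has alignment 1, so `repr(C)` inserts no hidden padding):

```rust
/// Stand-in for the spec's little-endian i32 wire type (alignment 1).
#[repr(transparent)]
pub struct Lei32(pub [u8; 4]);

/// Field list mirrors NetAddrResponse above; per-field sizes in comments.
#[repr(C)]
pub struct NetAddrResponse {
    pub status: Lei32,           // 4
    pub assigned_ipv4: [u8; 4],  // 4
    pub ipv4_prefix_len: u8,     // 1
    pub _pad: [u8; 3],           // 3
    pub gateway_ipv4: [u8; 4],   // 4
    pub dns_ipv4: [u8; 4],       // 4
    pub assigned_ipv6: [u8; 16], // 16
    pub ipv6_prefix_len: u8,     // 1
    pub _pad2: [u8; 3],          // 3
    pub gateway_ipv6: [u8; 16],  // 16
    pub assigned_mac: [u8; 6],   // 6
    pub _pad3: [u8; 2],          // 2  => 64 total
}
```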

Packet transport: TxPacket and RxPacket carry raw frames/packets inline in the ServiceMessage payload. For frames exceeding the ring entry size (default 224 bytes payload), the continuation protocol (Section 5.1) is used. For bulk traffic, the provider and client negotiate remote write for the data path (same split-transfer pattern as DSM data messages, Section 6.6): provider pushes received packets into client's pre-registered data region, then sends RxPacket with FLAG_BULK_DATA + region offset.
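The three transmit paths above reduce to a small dispatch. A hedged sketch (the enum and function names are illustrative, not part of the wire protocol; a real implementation would also weigh traffic volume, not just frame size, when selecting the bulk path):

```rust
/// Default ring entry payload size from the text above.
const RING_ENTRY_PAYLOAD: usize = 224;

#[derive(Debug, PartialEq)]
enum TxPath {
    Inline,          // frame fits in one ServiceMessage payload
    Continuation,    // Section 5.1 continuation protocol
    BulkRemoteWrite, // negotiated remote-write path (FLAG_BULK_DATA)
}

fn choose_tx_path(frame_len: usize, bulk_negotiated: bool) -> TxPath {
    if frame_len <= RING_ENTRY_PAYLOAD {
        TxPath::Inline
    } else if bulk_negotiated {
        TxPath::BulkRemoteWrite
    } else {
        TxPath::Continuation
    }
}
```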

SetFilter BPF encoding: Payload is a serialized classic BPF (cBPF) program:

/// BPF filter program for SetFilter opcode.
/// Wire format: cross-node. `len` uses `Le16`.
///
/// **Size rationale**: 2056 bytes (8 + 256×8). Fixed-size inline array is
/// intentional for wire-format simplicity — avoids variable-length parsing
/// in the RDMA receive path. 256 instructions covers all standard cBPF
/// socket filters (Linux SO_ATTACH_FILTER max is 4096, but network service
/// filters are simpler). This struct is allocated in SetFilter request
/// slots (cold path, not per-packet).
#[repr(C)]
pub struct BpfFilterProgram {
    /// Number of instructions (max 256).
    pub len: Le16,
    pub _pad: [u8; 6],
    /// Array of BPF instructions (8 bytes each, standard Linux
    /// sock_filter format: code[2] + jt[1] + jf[1] + k[4]).
    /// Only first `len` entries are valid.
    pub insns: [SockFilter; 256],
}
const_assert!(size_of::<BpfFilterProgram>() == 2056);

Maximum program size: 256 * 8 + 8 = 2056 bytes. The provider verifies the BPF program before loading:

  • Standard BPF verifier checks (no backward jumps, no division by zero, bounded execution).
  • Only packet filtering opcodes allowed: BPF_LD, BPF_LDX, BPF_ALU, BPF_JMP, BPF_RET, BPF_ST/BPF_STX (scratch memory M[] only, not kernel memory). No BPF_MISC.
  • If verification fails: return -EINVAL.

Applied to: RX path only (filters incoming packets before delivering to client). TX path is unfiltered (client controls what it sends).
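The verification rules above can be sketched as a single pass over the instruction array. This is illustrative structure only, under the standard `sock_filter` encoding (class in the low 3 bits of `code`); a real verifier would also validate per-opcode operands and the `BPF_JA` target carried in `k`:

```rust
/// Classic sock_filter layout: code[2] + jt[1] + jf[1] + k[4].
pub struct SockFilter { pub code: u16, pub jt: u8, pub jf: u8, pub k: u32 }

// BPF instruction classes (low 3 bits of `code`).
const BPF_LD: u16 = 0x00;
const BPF_LDX: u16 = 0x01;
const BPF_ST: u16 = 0x02;
const BPF_STX: u16 = 0x03;
const BPF_ALU: u16 = 0x04;
const BPF_JMP: u16 = 0x05;
const BPF_RET: u16 = 0x06;

pub fn verify(prog: &[SockFilter]) -> Result<(), i32> {
    const EINVAL: i32 = 22;
    let n = prog.len();
    if n == 0 || n > 256 {
        return Err(-EINVAL);
    }
    for (i, insn) in prog.iter().enumerate() {
        match insn.code & 0x07 {
            BPF_LD | BPF_LDX | BPF_ALU | BPF_ST | BPF_STX | BPF_RET => {}
            BPF_JMP => {
                // jt/jf are unsigned offsets, so conditional jumps are
                // forward-only by construction; both targets must land
                // on a valid instruction.
                if i + 1 + insn.jt as usize >= n || i + 1 + insn.jf as usize >= n {
                    return Err(-EINVAL);
                }
            }
            _ => return Err(-EINVAL), // BPF_MISC (0x07) rejected
        }
        // ALU|DIV|K with a zero immediate divides by zero at run time.
        if insn.code == 0x34 && insn.k == 0 {
            return Err(-EINVAL);
        }
    }
    // Bounded execution: the program must end in a return.
    if prog[n - 1].code & 0x07 != BPF_RET {
        return Err(-EINVAL);
    }
    Ok(())
}
```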

RdmaCreateQp / RdmaCreateQpResponse wire structs:

// Note: these structs are intentionally RDMA-specific. They manage QPs
// on the provider's external RDMA NIC for reaching non-UmkaOS hosts.
// Client↔provider communication uses the transport-neutral peer protocol.

/// RDMA proxy: client requests creation of a remote QP on the
/// provider's external RDMA NIC.
/// Wire format: cross-node. Multi-byte integers use `Le32`.
#[repr(C)]
pub struct RdmaCreateQpRequest {
    /// QP type: 0 = RC (Reliable Connected), 1 = UD (Unreliable Datagram).
    pub qp_type: u8,
    pub _pad: [u8; 3],
    /// Maximum send queue depth.
    pub max_send_wr: Le32,
    /// Maximum receive queue depth.
    pub max_recv_wr: Le32,
    /// Maximum scatter/gather entries per send WR.
    pub max_send_sge: Le32,
    /// Maximum scatter/gather entries per recv WR.
    pub max_recv_sge: Le32,
    /// Remote peer's GID (for RC connection setup). 16 bytes.
    pub remote_gid: [u8; 16],
    /// Remote peer's QP number (for RC connection).
    pub remote_qpn: Le32,
    pub _pad2: [u8; 4],
}

/// RDMA proxy: provider confirms QP creation.
/// Wire format: cross-node. Multi-byte integers use `Lei32`/`Le32`.
#[repr(C)]
pub struct RdmaCreateQpResponse {
    /// 0 = success, negative errno on failure.
    pub status: Lei32,
    /// Allocated QP number on the provider's external NIC.
    pub local_qpn: Le32,
    /// Provider's GID on the external NIC.
    pub local_gid: [u8; 16],
    /// Protection domain ID (for MR registration).
    pub pd_id: Le32,
    pub _pad: [u8; 4],
}
const_assert!(size_of::<RdmaCreateQpRequest>() == 44);
const_assert!(size_of::<RdmaCreateQpResponse>() == 32);

RDMA proxy connection flow:

  1. Client sends RdmaCreateQp with remote GID/QPN of the external host it wants to connect to.
  2. Provider creates a local QP on its external RDMA NIC, transitions it through INIT -> RTR -> RTS using the remote GID/QPN.
  3. Provider sends RdmaCreateQpResponse with the local QPN and GID.
  4. Client informs the external host of (provider_gid, local_qpn) so the external host can complete its side of the RC handshake.
  5. Data path: the provider operates the external QP on behalf of the client. The client submits high-level transfer requests via ServiceMessage (buffer region offset, length, operation type). The provider translates these into RDMA verbs on the external QP. This adds one hop of latency (client→provider via peer transport, then provider→external via RDMA). For latency-critical external RDMA access, a local RDMA NIC is required.

  6. On RdmaDestroyQp: provider destroys the QP and frees NIC resources.

Full RDMA proxy data path protocol: Phase 3+ (requires verb batching and completion coalescing to amortize the extra hop).
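The QP bring-up in step 2 follows the standard verbs state model. A minimal sketch of that walk (types and function names are illustrative; comments note what each modify-QP transition installs):

```rust
/// QP states used in step 2 (standard verbs model).
#[derive(Clone, Copy, Debug, PartialEq)]
enum QpState { Reset, Init, Rtr, Rts }

/// One modify-QP step.
fn next(state: QpState) -> QpState {
    match state {
        QpState::Reset => QpState::Init, // port, pkey index, access flags
        QpState::Init => QpState::Rtr,   // remote GID/QPN, path MTU, recv PSN
        QpState::Rtr => QpState::Rts,    // timeout, retry counts, send PSN
        QpState::Rts => QpState::Rts,    // connected; no further transition
    }
}

/// The provider's bring-up walk for a freshly created RC QP.
fn bring_up() -> Vec<QpState> {
    let mut state = QpState::Reset;
    let mut trace = vec![state];
    while state != QpState::Rts {
        state = next(state);
        trace.push(state);
    }
    trace
}
```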

16.31.5 Integration with IP Stack

On the client side, the network service provider appears as a standard network interface (netdev) in the client's IP stack. The client's routing table, firewall rules, and socket layer work unchanged — they see a network interface, not a remote service. The NetServiceClient kernel module creates a virtual netdev whose xmit function sends TxPacket via the service endpoint, and whose receive path delivers RxPacket payloads to the local network stack.
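The virtual netdev's transmit hook can be sketched as follows. The trait and type names are hypothetical, not from the spec; the point is that `xmit` wraps the frame in a TxPacket service message instead of touching hardware:

```rust
/// Hypothetical shape of a client-side netdev; only xmit is sketched.
trait NetDevice {
    fn xmit(&mut self, frame: &[u8]) -> Result<(), i32>;
}

/// Stand-in for the service endpoint: a queue of serialized TxPacket
/// payloads that would go out via ServiceMessage.
struct NetServiceNetdev {
    tx_queue: Vec<Vec<u8>>,
}

impl NetDevice for NetServiceNetdev {
    fn xmit(&mut self, frame: &[u8]) -> Result<(), i32> {
        // A real implementation builds a TxPacket ServiceMessage here,
        // using the continuation protocol for frames over the ring
        // entry payload size, rather than copying into a Vec.
        self.tx_queue.push(frame.to_vec());
        Ok(())
    }
}
```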

On the provider side, the virtual interface is backed by a MACVLAN (L2 mode) or by an iptables MASQUERADE rule (L3 mode) on the physical NIC. Standard Linux-compatible networking primitives are used; there is no custom packet processing in the provider's data path.

16.31.6 Drain Protocol

On graceful shutdown of the provider peer (Section 5.8):

  1. Send ServiceDrainNotify to all connected clients; its alternative_peer field points to another peer with EXTERNAL_NETWORK capability (if any).
  2. Clients switch to the alternative provider or fall back to no external connectivity.
  3. Provider tears down virtual interfaces and NAT rules.
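The client's decision in step 2 is a simple failover. An illustrative sketch (the types are hypothetical; the spec only fixes the alternative_peer semantics):

```rust
/// Client-side external-connectivity state after a drain notification.
#[derive(Debug, PartialEq)]
enum Uplink {
    Provider(u64), // peer id of the external_nic provider in use
    None,          // no external connectivity
}

fn on_drain_notify(alternative_peer: Option<u64>) -> Uplink {
    match alternative_peer {
        Some(peer) => Uplink::Provider(peer), // switch to the alternative
        None => Uplink::None,                 // fall back: no connectivity
    }
}
```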

16.31.7 Relationship to DPU NIC Offload

ServiceId("nic_offload") (Section 5.11) is a local service: the host consumes its own DPU's NIC to reach the network. The DPU provides NIC hardware offload (checksum, TSO, RSS) to its local host over PCIe.

ServiceId("external_nic") (this section) is a cluster-wide service: any peer can consume another peer's external network access. The provider may itself be using a DPU via ServiceId("nic_offload") internally — the client doesn't know or care.

These are complementary, not overlapping:

  • DPU on Host A provides nic_offload → Host A uses it locally.
  • Host A provides external_nic → Host B (no external NIC) uses it remotely.
  • Both use the same CapAdvertiseServiceBind mechanism.