
Chapter 15: Networking

Socket layer, NetBuf, routing, TCP stack, congestion control, kTLS, overlays/tunnels, netlink, packet filtering, interface naming


15.1 TCP Stack Extensibility

Linux problem: MPTCP took many years to get into mainline because it required deep changes to the TCP stack. The monolithic TCP implementation made it hard to add new transport protocols. Congestion control algorithms are pluggable, but the socket layer itself is tightly coupled to TCP internals — adding a fundamentally new transport like QUIC kernel offload requires invasive surgery across net/ipv4/, net/ipv6/, and the socket layer.

UmkaOS design:

15.1.1 Network Stack Architecture

umka-net is the Tier 1 network stack. It runs in its own isolation domain, separate from umka-core. The kernel never executes protocol processing directly — all network I/O crosses the domain boundary via ring buffers (~23 cycles per crossing, Section 10.2).

The stack is layered, with each layer communicating through well-defined internal interfaces:

Application (userspace)
    |  syscall (socket, bind, listen, accept, read, write, sendmsg, recvmsg)
    v
umka-core: socket dispatch (translates fd ops to umka-net ring commands)
    |  domain ring buffer (~23 cycles)
    v
umka-net (Tier 1):
    Socket layer (protocol-agnostic)
    |
    Transport layer (TCP, UDP, SCTP, MPTCP)
    |
    Network layer (IPv4, IPv6, routing, netfilter)
    |
    Link layer (ARP/NDP, bridge, VLAN)
    |  domain ring buffer (~23 cycles)
    v
NIC driver (Tier 1): device-specific TX/RX

The four domain switches (two domain entries — NIC driver and umka-net — each requiring an enter and an exit) add ~92 cycles total to the data path (4 × ~23 cycles; see Section 15.1.7 for detailed analysis). For comparison, a single sendmsg() syscall in Linux costs ~700-1800 cycles in syscall transition overhead on modern hardware with Spectre/Meltdown mitigations enabled (~200-400 cycles pre-mitigation). That figure is the SYSCALL/SYSRET ring crossing alone — not a full Linux process context switch, which costs 5,000-20,000 cycles due to TLB flushes, cache pollution, and scheduler overhead. The domain boundary is cheaper than even a bare syscall transition.

Linux comparison: Linux's network stack is monolithic — TCP, IP, Netfilter, and the socket layer all execute in the same address space with no isolation. A buffer overflow in a Netfilter module can corrupt TCP connection state. In UmkaOS, the isolation is layered: hardware domain isolation enforces the boundaries between umka-net and the NIC driver (and between umka-net and umka-core), while within umka-net, Rust's ownership model and memory safety enforce separation between modules. A bug in the VXLAN tunnel parser therefore cannot corrupt the TCP congestion window of an unrelated connection.

15.1.2 Socket Abstraction

The socket layer is protocol-agnostic. Transport protocols register implementations of a common trait:

/// Protocol-agnostic socket operations.
/// Each transport protocol (TCP, UDP, SCTP, MPTCP) implements this trait.
pub trait SocketOps: Send + Sync {
    /// Bind the socket to a local address.
    fn bind(&self, addr: &SockAddr) -> Result<(), KernelError>;

    /// Mark the socket as a passive listener.
    fn listen(&self, backlog: u32) -> Result<(), KernelError>;

    /// Accept an incoming connection (blocking or non-blocking).
    fn accept(&self) -> Result<(SlabRef<dyn SocketOps>, SockAddr), KernelError>;

    /// Initiate an outgoing connection.
    fn connect(&self, addr: &SockAddr) -> Result<(), KernelError>;

    /// Send a message (scatter-gather, ancillary data, destination address).
    fn sendmsg(&self, msg: &MsgHdr, flags: u32) -> Result<usize, KernelError>;

    /// Receive a message (scatter-gather, ancillary data, source address).
    fn recvmsg(&self, msg: &mut MsgHdr, flags: u32) -> Result<usize, KernelError>;

    /// Set a socket option (protocol-specific behavior).
    fn setsockopt(&self, level: i32, name: i32, val: &[u8]) -> Result<(), KernelError>;

    /// Get a socket option value.
    fn getsockopt(&self, level: i32, name: i32, buf: &mut [u8]) -> Result<usize, KernelError>;

    /// Retrieves the local address of a bound socket.
    /// Returns the address in `addr` and its length.
    fn getsockname(&self, addr: &mut SockAddr) -> Result<usize, KernelError>;

    /// Retrieves the remote address of a connected socket.
    /// Returns the address in `addr` and its length.
    fn getpeername(&self, addr: &mut SockAddr) -> Result<usize, KernelError>;

    /// Poll for readiness events (POLLIN, POLLOUT, POLLERR, POLLHUP).
    fn poll(&self, events: PollEvents) -> PollEvents;

    /// Shut down part of a full-duplex connection.
    fn shutdown(&self, how: ShutdownHow) -> Result<(), KernelError>;

    /// Close the socket and release all resources.
    /// For TCP: initiates FIN handshake (or RST if SO_LINGER with timeout 0).
    /// For UDP: releases port binding and queued buffers.
    /// Called when the last file descriptor reference is dropped (VFS layer
    /// guarantees exactly one call per socket lifetime). Error recovery paths
    /// do NOT call close() directly — they mark the socket as errored and let
    /// the VFS drop path handle cleanup.
    ///
    /// Close is **best-effort**: if close() returns Err (e.g., TCP FIN
    /// handshake timeout), the error is logged and the socket is released
    /// regardless. The VFS layer always frees the socket resources after
    /// this call, matching Linux semantics where close(2) errors on
    /// sockets are not retryable (POSIX: "If close() is interrupted by a
    /// signal [...] the state of fildes is unspecified"; Linux: close always
    /// releases the fd regardless of error). Applications requiring durable
    /// delivery must use `shutdown(SHUT_WR)` + `read()` for EOF confirmation
    /// before calling close(), same as on Linux.
    fn close(&self) -> Result<(), KernelError>;
}
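The registration flow implied above ("transport protocols register implementations of a common trait") can be sketched as follows. This is an illustrative assumption, not the UmkaOS source: `ProtocolRegistry`, `tcp_ctor`, and the `&'static str` stand-in for `SlabRef<dyn SocketOps>` are hypothetical names.

```rust
use std::collections::HashMap;

// Hypothetical sketch: a (family, type, protocol) triple maps to a socket
// constructor; socket(2) dispatch resolves the triple at creation time.
// A &'static str stands in for SlabRef<dyn SocketOps>.
type SocketCtor = fn() -> &'static str;

fn tcp_ctor() -> &'static str { "tcp" }
fn udp_ctor() -> &'static str { "udp" }

pub struct ProtocolRegistry {
    table: HashMap<(i32, i32, i32), SocketCtor>,
}

impl ProtocolRegistry {
    pub fn new() -> Self {
        Self { table: HashMap::new() }
    }

    /// Called once at protocol module init (e.g., TCP registers itself).
    pub fn register(&mut self, family: i32, sock_type: i32, proto: i32, ctor: SocketCtor) {
        self.table.insert((family, sock_type, proto), ctor);
    }

    /// socket(2) dispatch: resolve the triple and construct the socket.
    pub fn create(&self, family: i32, sock_type: i32, proto: i32) -> Option<&'static str> {
        self.table.get(&(family, sock_type, proto)).map(|ctor| ctor())
    }
}

fn main() {
    const AF_INET: i32 = 2;
    const SOCK_STREAM: i32 = 1;
    const SOCK_DGRAM: i32 = 2;

    let mut reg = ProtocolRegistry::new();
    reg.register(AF_INET, SOCK_STREAM, 6, tcp_ctor);  // IPPROTO_TCP
    reg.register(AF_INET, SOCK_DGRAM, 17, udp_ctor);  // IPPROTO_UDP

    assert_eq!(reg.create(AF_INET, SOCK_STREAM, 6), Some("tcp"));
    assert_eq!(reg.create(AF_INET, SOCK_STREAM, 99), None); // unregistered
    println!("ok");
}
```

Because the socket layer only sees the trait object, adding a new transport (the MPTCP pain point described in 15.1) is a registration call, not surgery on the socket layer.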

15.1.3 NetBuf: Packet Buffer

The NetBuf is UmkaOS's native packet data structure — the equivalent of Linux's sk_buff. It carries packet data and metadata through the entire network stack: from NIC driver RX through protocol processing, firewall evaluation, socket delivery, and back out through TX. Unlike Linux's sk_buff (~240 bytes, accumulated over 30 years of organic growth), NetBuf is designed from scratch for zero-copy domain crossings, scatter-gather I/O, and reference-counted sharing.

Design principles: 1. Separation of metadata and data: The NetBuf struct is a metadata header (~296 bytes, fits in 5 cache lines). Packet data lives in separately allocated DMA-eligible pages (via DmaBufferHandle, Section 11.1.5). When a NetBuf crosses an isolation domain boundary (umka-net to NIC driver or vice versa), only the metadata header is copied (~296 bytes); data pages are shared via the DMA buffer pool (shared isolation domain: PKEY 14 on x86-64 / domain 2 on AArch64; see Section 10.2).

Size note: The 296-byte figure is the full NetBuf struct size as allocated from the slab cache. The "256 handles per page" figure (4096 / 16 = 256) found in NetBufPool documentation refers to the compact NetBufHandle token (16 bytes: pool-id + slot-index + generation), not the full NetBuf. Handles are stored in ring buffers and transmission queues; full NetBuf objects are stored separately in the slab pool.

  2. Per-CPU allocation, no global lock: NetBufs are allocated from per-CPU NetBufPool slabs (Section 4.1). The fast path (alloc/free) never touches a global lock or cross-CPU data structure.
  3. Reference-counted for zero-copy: Multiple NetBufs can reference the same underlying data pages (e.g., XDP_REDIRECT to multiple interfaces, TCP zero-copy receive delivering the same page fragment to multiple sockets). Cloning a NetBuf increments the data page refcount without copying data.
  4. Scatter-gather native: Large packets (GSO/GRO aggregates, jumbo frames) use a fragment list rather than requiring contiguous allocation. The fragment list is inline for small fragment counts (up to 6) and spills to a heap-allocated extension for larger counts.
  5. RDMA-eligible: Data pages allocated from the RDMA pool (Section 5.1.4.3) can be used directly for RDMA operations without re-registration. The flags field tracks RDMA eligibility.
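The reference-counted zero-copy principle above can be modeled in miniature. This is a sketch under a stated simplification: `std::sync::Arc` stands in for the DMA pool's page refcount (which in the real design lives in the `DmaBufferHandle` metadata, not in `Arc`), and `MiniNetBuf`/`DataPages` are illustrative names.

```rust
use std::sync::Arc;

// Arc models the refcounted data pages; the metadata header is plain owned data.
struct DataPages(Vec<u8>);

struct MiniNetBuf {
    data: Arc<DataPages>, // shared, refcounted payload pages
    data_offset: u32,     // independently owned metadata
    tail_offset: u32,
}

impl MiniNetBuf {
    /// Zero-copy clone: a new metadata header pointing at the same data
    /// pages. Only the refcount changes — the payload is never memcpy'd.
    fn clone_shared(&self) -> MiniNetBuf {
        MiniNetBuf {
            data: Arc::clone(&self.data),
            data_offset: self.data_offset,
            tail_offset: self.tail_offset,
        }
    }
}

fn main() {
    let orig = MiniNetBuf {
        data: Arc::new(DataPages(vec![0u8; 1500])),
        data_offset: 0,
        tail_offset: 1500,
    };
    let clone = orig.clone_shared();
    // Both headers reference the exact same pages (no data copy) ...
    assert!(Arc::ptr_eq(&orig.data, &clone.data));
    // ... and the data refcount reflects two consumers.
    assert_eq!(Arc::strong_count(&orig.data), 2);
    println!("ok");
}
```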
/// A variable-length array backed by the kernel slab allocator.
/// Semantics similar to `SmallVec<[T; N]>`: up to N elements stored
/// inline without allocation; overflow spills to a slab-allocated
/// heap block.
///
/// Used for small, bounded collections on hot paths (e.g., routing next-hop
/// arrays, scatter-gather lists) where heap allocation must be avoided.
pub struct SlabVec<T, const N: usize> {
    /// Inline storage for the common case (N elements).
    inline: [MaybeUninit<T>; N],
    /// Pointer to slab-allocated overflow storage. NULL if len <= N.
    overflow: *mut T,
    /// Current element count.
    len: usize,
    /// Capacity of the overflow allocation (in elements). 0 if inline.
    overflow_cap: usize,
}
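The inline-then-spill behavior of `SlabVec` can be demonstrated with a simplified, allocation-backed model. Assumed simplifications: a std `Vec` stands in for the slab-allocated overflow block, and `Option<T>` replaces `MaybeUninit` inline storage; `MiniSlabVec` is an illustrative name.

```rust
// Simplified model of SlabVec: first N elements inline, the rest spill
// to a heap-backed overflow block on first push past capacity N.
struct MiniSlabVec<T: Copy, const N: usize> {
    inline: [Option<T>; N], // inline storage for the common case
    overflow: Vec<T>,       // "slab-allocated" spill beyond N elements
    len: usize,
}

impl<T: Copy, const N: usize> MiniSlabVec<T, N> {
    fn new() -> Self {
        Self { inline: [None; N], overflow: Vec::new(), len: 0 }
    }

    fn push(&mut self, v: T) {
        if self.len < N {
            self.inline[self.len] = Some(v); // common case: no allocation
        } else {
            self.overflow.push(v); // spill: allocation happens only here
        }
        self.len += 1;
    }

    fn get(&self, i: usize) -> Option<T> {
        if i >= self.len {
            None
        } else if i < N {
            self.inline[i]
        } else {
            Some(self.overflow[i - N])
        }
    }
}

fn main() {
    let mut v: MiniSlabVec<u32, 6> = MiniSlabVec::new();
    for i in 0..8 {
        v.push(i); // 6 elements inline, 2 spilled
    }
    assert_eq!(v.get(5), Some(5)); // last inline element
    assert_eq!(v.get(7), Some(7)); // served from the overflow block
    assert_eq!(v.overflow.len(), 2);
    println!("ok");
}
```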
// umka-net/src/netbuf.rs

/// Packet buffer — carries packet data and metadata through the network stack.
///
/// `NetBuf` is the UmkaOS equivalent of Linux's `sk_buff`. It is a metadata header
/// (~296 bytes) that references separately allocated data pages. The struct itself
/// is allocated from a per-CPU `NetBufPool` (slab-backed, Section 4.1).
///
/// **Lifetime**: Allocated via `NetBufPool::alloc()`, freed via `NetBufPool::free()`
/// or when the last reference is dropped. Reference counting applies to the data
/// pages (`data_handle`), not the `NetBuf` struct itself — the struct is owned by
/// exactly one consumer at a time. Zero-copy sharing is achieved by cloning: the
/// clone gets a new `NetBuf` struct (from the local CPU's pool) pointing to the
/// same data pages with an incremented refcount.
///
/// **Domain crossing protocol** (see Section 15.1.7):
/// When a NetBuf crosses the umka-net / NIC driver isolation domain boundary:
/// 1. The sending domain allocates a new `NetBuf` struct in the receiving domain's
///    per-CPU pool (via a cross-domain slab allocation helper).
/// 2. The metadata fields are copied to the new struct (~296 bytes memcpy).
/// 3. The `data_handle` (DMA buffer reference) is shared — both domains can access
///    data pages through the shared DMA buffer pool (PKEY 14 / domain 2).
/// 4. The original `NetBuf` struct is freed in the sending domain's pool.
/// This ensures each domain operates on its own metadata (preventing TOCTOU attacks
/// on header offsets) while sharing the bulk data zero-copy.
///
/// **Cross-reference**: `DmaBufferHandle` (Section 11.1.5), `NetBufPool` (below),
/// `NetDeviceVTable` TX/RX paths (Section 12.3.3), XDP bounce buffer (Section 18.1.4),
/// NAPI batching (Section 15.1.7), TCP zero-copy receive (Section 15.1.4.9).
#[repr(C)]
pub struct NetBuf {
    // ---- Data region pointers (linear buffer) ----

    /// DMA buffer handle for the underlying data pages.
    ///
    /// Points to a DMA-mapped memory region allocated via `KernelServicesVTable::
    /// alloc_dma_buffer()` (Section 11.1.5). The handle is valid for the lifetime of
    /// this NetBuf (or until the data pages are explicitly released). The same
    /// handle may be shared across multiple cloned NetBufs (refcounted in the
    /// DMA buffer pool).
    ///
    /// For scatter-gather packets, this handle refers to the linear (header) portion
    /// only. Fragment data is in separate DMA handles within `frags`.
    pub data_handle: DmaBufferHandle,

    /// Offset from `data_handle` base to the start of the allocated buffer region.
    ///
    /// The region `[head_offset .. end_offset)` is the total allocated linear buffer.
    /// `head_offset` is typically 0 but may be non-zero if the buffer was carved from
    /// a larger DMA allocation (e.g., page fragment sub-allocation for small packets).
    pub head_offset: u32,

    /// Offset from `data_handle` base to the start of packet data.
    ///
    /// The region `[head_offset .. data_offset)` is headroom — available for
    /// prepending headers (e.g., tunnel encapsulation adds an outer IP/UDP header).
    /// `push()` decrements `data_offset` to claim headroom; if insufficient headroom
    /// remains, the caller must reallocate (or use `NetBuf::prepend_realloc()`).
    ///
    /// **Invariant**: `head_offset <= data_offset <= tail_offset <= end_offset`.
    pub data_offset: u32,

    /// Offset from `data_handle` base to the end of packet data.
    ///
    /// `tail_offset - data_offset` is the linear data length. `put()` increments
    /// `tail_offset` to append data; `pull()` increments `data_offset` to consume
    /// a header (advancing past it after parsing).
    pub tail_offset: u32,

    /// Offset from `data_handle` base to the end of the allocated buffer region.
    ///
    /// `end_offset - tail_offset` is tailroom — available for appending data
    /// (e.g., padding, FCS). The total linear buffer size is `end_offset - head_offset`.
    pub end_offset: u32,

    // ---- Protocol metadata (parsed by the stack) ----

    /// Byte offset from `data_offset` to the start of the L2 (link-layer) header (signed).
    ///
    /// For Ethernet frames, this is 0 (L2 header is at the start of data). For
    /// packets received after L2 processing (e.g., after bridge forwarding or XDP),
    /// this may be negative (L2 header was in the headroom and has been consumed).
    /// Set by the NIC driver or the L2 processing layer.
    /// `i16::MIN` (-32768) = sentinel meaning "L2 layer not present or not parsed".
    /// Valid range: -32767 to 32767. Typical values: 0 (L2 starts at data_offset).
    pub l2_offset: i16,

    /// Byte offset from `data_offset` to the start of the L3 (network-layer) header.
    ///
    /// For IPv4/IPv6. Set during L3 header parsing. Used by checksum offload,
    /// GSO segmentation, and BPF helpers (`bpf_skb_load_bytes()`). Value 0xFFFF
    /// means "not set" (packet has not been parsed to L3 yet).
    pub l3_offset: u16,

    /// Byte offset from `data_offset` to the start of the L4 (transport-layer) header.
    ///
    /// For TCP/UDP/SCTP. Set during L4 header parsing. Used by checksum offload
    /// (provides the checksum start offset to the NIC) and GRO coalescing.
    /// Value 0xFFFF means "not set".
    pub l4_offset: u16,

    /// Byte offset from `data_offset` to the start of the inner L3 header.
    ///
    /// Non-zero only for encapsulated packets (VXLAN, Geneve, GRE, IPIP).
    /// Used by GSO for tunnel segmentation offload and by XDP decap helpers.
    /// Value 0xFFFF means "not encapsulated".
    pub inner_l3_offset: u16,

    /// Byte offset from `data_offset` to the start of the inner L4 header.
    ///
    /// Non-zero only for encapsulated packets. Value 0xFFFF means "not encapsulated".
    pub inner_l4_offset: u16,

    // ---- Checksum state ----

    /// Checksum offload status. Determines whether software checksum verification
    /// or computation is needed.
    ///
    /// **RX path** (NIC to stack):
    /// - `None`: NIC did not verify checksum; software must verify.
    /// - `Unnecessary`: NIC verified the full L4 checksum; software can skip.
    /// - `Complete`: NIC computed a raw checksum over `[csum_start .. end]` and
    ///   stored it in `csum_value`. Software must fold and verify.
    ///
    /// **TX path** (stack to NIC):
    /// - `None`: Software computed the full checksum; NIC should not touch it.
    /// - `Partial`: Software filled the pseudo-header checksum; NIC must compute
    ///   the L4 checksum from `csum_start` for `csum_offset` bytes and write the
    ///   result at `csum_start + csum_offset`. This matches Linux's
    ///   `CHECKSUM_PARTIAL` semantics.
    pub checksum_status: ChecksumStatus,

    /// Byte offset from `data_offset` where checksum computation starts.
    ///
    /// Used with `ChecksumStatus::Partial` (TX) and `ChecksumStatus::Complete` (RX).
    /// For TX partial offload, this is the start of the L4 header.
    pub csum_start: u16,

    /// Byte offset from `csum_start` to the checksum field within the L4 header.
    ///
    /// Used with `ChecksumStatus::Partial` (TX). For TCP, this is 16 (offset of
    /// the checksum field in the TCP header). For UDP, this is 6.
    pub csum_offset: u16,

    /// Raw checksum value from hardware (RX `Complete` mode) or computed by
    /// software. Interpretation depends on `checksum_status`.
    pub csum_value: u32,

    // ---- VLAN ----

    /// 802.1Q VLAN tag. `vlan_present` indicates whether this field is valid.
    ///
    /// Format: bits [15:13] = PCP (priority), bit [12] = DEI, bits [11:0] = VID.
    /// This matches the on-wire 802.1Q TCI format.
    pub vlan_tci: u16,

    /// Whether `vlan_tci` contains a valid VLAN tag.
    ///
    /// True if: (a) the NIC extracted the VLAN tag via hardware offload (the tag
    /// was stripped from the frame and placed here), or (b) software VLAN processing
    /// parsed and extracted the tag. False for untagged frames.
    pub vlan_present: bool,

    // ---- Packet classification ----

    /// IP protocol number from the (outer) L3 header. Set during L3 parsing.
    /// Values: 6 (TCP), 17 (UDP), 1 (ICMP), 58 (ICMPv6), 132 (SCTP), etc.
    /// 0 means "not yet parsed".
    pub protocol: u8,

    /// Address family of the (outer) L3 header.
    ///
    /// `AddressFamily::Inet` for IPv4, `AddressFamily::Inet6` for IPv6.
    /// Set during L3 parsing. Used for routing table selection and BPF
    /// program dispatch.
    pub addr_family: AddressFamily,

    /// Packet direction and processing state flags.
    pub flags: NetBufFlags,

    // ---- Routing decision cache ----

    /// Cached routing lookup result. Populated by the first routing table lookup
    /// for this packet (L3 input or output path). Subsequent consumers (e.g.,
    /// conntrack, firewall, forwarding) reuse the cached result without repeating
    /// the FIB lookup. `None` if routing has not been performed yet.
    ///
    /// **Cross-reference**: `RouteLookupResult` (Section 15.1.4, routing table spec).
    /// The cached result includes the resolved next-hop, output interface, and MTU.
    /// The cache is valid for the lifetime of this NetBuf — routing table changes
    /// (RCU-swapped) do not invalidate in-flight packets' cached routes, which is
    /// safe because the old routing table remains valid until the RCU grace period
    /// completes, and no NetBuf outlives an RCU grace period (packets are processed
    /// within a single softirq / NAPI poll cycle).
    pub route_cache: Option<RouteLookupResult>,

    // ---- GSO (Generic Segmentation Offload) ----

    /// GSO type. Non-zero if this NetBuf represents an aggregated super-packet
    /// that must be segmented before transmission (if the NIC does not support
    /// hardware TSO/USO) or was coalesced by GRO on the receive path.
    pub gso_type: GsoType,

    /// MSS (Maximum Segment Size) for GSO segmentation.
    ///
    /// When `gso_type != GsoType::None`, the packet must be split into segments
    /// of at most `gso_size` bytes of L4 payload each. The NIC (via TSO) or
    /// software GSO performs the segmentation. Value 0 when `gso_type == None`.
    pub gso_size: u16,

    /// Number of segments in this GSO packet.
    ///
    /// For GRO-coalesced packets, this is the count of original packets merged
    /// into this aggregate. Used for byte/packet accounting and for calculating
    /// the number of ACKs to expect. Value 0 when `gso_type == None`.
    pub gso_segs: u16,

    // ---- Scatter-gather fragment list ----

    /// Number of valid entries in `frags`. Range: 0 (linear-only packet) to
    /// `MAX_INLINE_FRAGS` for inline storage, or up to `frag_ext.len()` if
    /// the extension list is allocated.
    pub nr_frags: u8,

    /// Inline fragment storage for common cases (up to 6 fragments).
    ///
    /// Most packets have 0-3 fragments (linear header + 1-3 page fragments for
    /// payload). The inline array avoids a heap allocation for the common case.
    /// Fragments beyond `MAX_INLINE_FRAGS` (6) spill to `frag_ext`.
    pub frags: [NetBufFrag; MAX_INLINE_FRAGS],

    /// Extension fragment list for packets with more than `MAX_INLINE_FRAGS`
    /// fragments (e.g., large GSO aggregates with many page fragments).
    ///
    /// Heap-allocated via the slab allocator (Section 4.1) on demand. `None` for
    /// packets with 6 or fewer fragments. When present, `frags[0..MAX_INLINE_FRAGS]`
    /// holds the first 6 fragments and `frag_ext` holds the remainder.
    pub frag_ext: Option<SlabVec<NetBufFrag, 0>>,

    // ---- Reference counting and ownership ----

    /// Atomic reference count for the data pages.
    ///
    /// Starts at 1 on allocation. Incremented by `NetBuf::clone_shared()` (zero-copy
    /// clone). When it reaches 0, the data pages are returned to the DMA buffer pool.
    /// The `NetBuf` struct itself is always singly-owned and freed to its CPU's pool
    /// independently of the data refcount.
    ///
    /// **Implementation**: This is a pointer to a shared atomic counter that lives in
    /// the `DmaBufferHandle`'s metadata region (not in the NetBuf struct). Multiple
    /// cloned NetBufs point to the same counter. Shown here for documentation; the
    /// actual refcount is accessed via `data_handle.refcount()`.
    // (refcount is part of DmaBufferHandle, not stored inline)

    /// Hash value computed over the packet's flow key (src/dst IP, src/dst port,
    /// protocol). Used for:
    /// - Receive flow steering (RFS): selecting the CPU queue
    /// - Conntrack bucket selection
    /// - ECMP next-hop selection (consistent hashing)
    /// - Socket demultiplexing
    ///
    /// Computed once (by the NIC via RSS hardware hash, or by software during L3/L4
    /// parsing) and reused by all consumers. Value 0 means "not computed".
    pub flow_hash: u32,

    /// Timestamp of packet arrival (RX) or queuing (TX), in nanoseconds since boot
    /// (CLOCK_MONOTONIC_RAW). Set by the NIC driver from hardware timestamping if
    /// available, otherwise set by umka-net from the kernel clock at first touch.
    /// Used for RTT estimation, packet scheduling (pacing), and SO_TIMESTAMPNS.
    pub timestamp_ns: u64,

    /// Network interface index on which this packet was received (RX) or will be
    /// transmitted (TX). Indexes into the per-namespace interface table
    /// (`NetNamespace::interfaces`, Section 16.1.1). Set by the NIC driver on RX;
    /// set by routing on TX.
    pub ifindex: u32,

    /// NUMA node of the CPU that allocated this NetBuf. Used for NUMA-aware
    /// freeing: when a NetBuf is freed on a different NUMA node than where it was
    /// allocated, it is returned to a cross-node return magazine (Section 4.1) rather
    /// than the local CPU's pool, to avoid remote memory access on the next alloc.
    pub alloc_numa_node: u16,

    /// Mark value (equivalent to Linux `skb->mark`). Set by iptables/nftables MARK
    /// target (translated to BPF), policy routing rules, or `SO_MARK` socket option.
    /// Used for routing table selection (policy routing, Section 15.1.4) and traffic
    /// classification (tc, QoS).
    pub mark: u32,

    /// Connection tracking reference. Index into the conntrack hash table (Section
    /// 12.2.2). `CONNTRACK_UNTRACKED` (u32::MAX) means this packet is not tracked.
    /// Populated by the prerouting conntrack BPF hook. Used by NAT and stateful
    /// firewall rules.
    pub conntrack_idx: u32,

    /// Priority / traffic class. Used by the QoS/tc layer for queue selection.
    /// Initialized from the IP TOS/DSCP field or from `SO_PRIORITY`.
    pub priority: u32,
}
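The four-offset discipline documented on the fields above (invariant `head_offset <= data_offset <= tail_offset <= end_offset`, with `push()`/`pull()`/`put()` adjusting the middle offsets) can be modeled on its own. This is an illustrative sketch: `Offsets` and the string errors are simplifications, not the real NetBuf methods.

```rust
// Model of the linear buffer offsets: headroom before data, tailroom after.
#[derive(Debug)]
struct Offsets { head: u32, data: u32, tail: u32, end: u32 }

impl Offsets {
    fn headroom(&self) -> u32 { self.data - self.head }
    fn tailroom(&self) -> u32 { self.end - self.tail }
    fn len(&self) -> u32 { self.tail - self.data }

    /// push(): claim headroom to prepend a header (e.g., tunnel encap).
    fn push(&mut self, n: u32) -> Result<(), &'static str> {
        if self.headroom() < n { return Err("insufficient headroom"); }
        self.data -= n;
        Ok(())
    }

    /// pull(): consume a parsed header, advancing past it.
    fn pull(&mut self, n: u32) -> Result<(), &'static str> {
        if self.len() < n { return Err("pull past tail"); }
        self.data += n;
        Ok(())
    }

    /// put(): append data into tailroom.
    fn put(&mut self, n: u32) -> Result<(), &'static str> {
        if self.tailroom() < n { return Err("insufficient tailroom"); }
        self.tail += n;
        Ok(())
    }
}

fn main() {
    // A 1500-byte payload with the default 128-byte headroom.
    let mut b = Offsets { head: 0, data: 128, tail: 1628, end: 1700 };
    // VXLAN encapsulation prepends four headers into the headroom:
    b.push(8).unwrap();   // VXLAN
    b.push(8).unwrap();   // outer UDP
    b.push(20).unwrap();  // outer IPv4
    b.push(14).unwrap();  // outer Ethernet
    assert_eq!(b.headroom(), 78);  // 128 - 50 bytes consumed
    assert_eq!(b.len(), 1550);
    assert!(b.push(100).is_err()); // would underflow the headroom
    println!("ok");
}
```

Note how the 128-byte default headroom (`NET_BUF_DEFAULT_HEADROOM`) accommodates the full Ethernet + IP + UDP + VXLAN prepend (50 bytes) without a reallocation.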

/// Lightweight handle to a `NetBuf`. Does not own the buffer — the caller
/// must ensure the referenced `NetBuf` outlives this handle. Used to pass
/// buffer references across ring buffer boundaries without copying the
/// full `NetBuf` struct.
///
/// Encodes the DMA pool index and slot offset for O(1) pointer reconstruction
/// without storing a raw pointer (avoids KASLR leaks in ring buffers).
///
/// Explicit 16-byte layout (with `#[repr(C)]`):
///   bytes 0-1:  pool_id (u16)
///   bytes 2-3:  _pad0 (explicit, aligns slot_idx to 4 bytes)
///   bytes 4-7:  slot_idx (u32)
///   bytes 8-9:  generation (u16)
///   bytes 10-15: _pad1 (explicit, pads to 16 bytes for ring buffer alignment)
/// 16 bytes total → 256 handles per 4KB page.
#[derive(Copy, Clone, Debug)]
#[repr(C)]
pub struct NetBufHandle {
    /// DMA pool index (selects which pool this handle refers to).
    pub pool_id: u16,
    /// Explicit padding to align `slot_idx` to a 4-byte boundary.
    pub _pad0: [u8; 2],
    /// Slot index within the pool's backing slab.
    pub slot_idx: u32,
    /// Generation counter matching the pool slot's generation (prevents
    /// stale handle use after buffer recycle).
    pub generation: u16,
    /// Explicit padding to bring the struct to exactly 16 bytes for
    /// ring buffer alignment (256 handles per 4KB page).
    pub _pad1: [u8; 6],
}
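The generation check that makes a stale `NetBufHandle` fail safely after a slot is recycled can be sketched as follows. Assumed simplifications: `pool_id` and the padding are omitted, and `Pool`/`Slot` are illustrative stand-ins for the real pool structures.

```rust
// A handle resolves only while the slot's generation still matches;
// recycling the slot bumps the generation, invalidating old handles.
#[derive(Copy, Clone)]
struct Handle { slot_idx: u32, generation: u16 }

struct Slot { generation: u16, in_use: bool }

struct Pool { slots: Vec<Slot> }

impl Pool {
    fn alloc(&mut self, idx: usize) -> Handle {
        let s = &mut self.slots[idx];
        s.in_use = true;
        Handle { slot_idx: idx as u32, generation: s.generation }
    }

    /// Freeing bumps the slot generation, invalidating outstanding handles.
    fn free(&mut self, h: Handle) {
        let s = &mut self.slots[h.slot_idx as usize];
        s.in_use = false;
        s.generation = s.generation.wrapping_add(1);
    }

    /// O(1) resolution; a stale handle yields None, never a reference
    /// into a recycled buffer.
    fn resolve(&self, h: Handle) -> Option<&Slot> {
        let s = self.slots.get(h.slot_idx as usize)?;
        (s.in_use && s.generation == h.generation).then_some(s)
    }
}

fn main() {
    let mut pool = Pool { slots: vec![Slot { generation: 0, in_use: false }] };
    let h = pool.alloc(0);
    assert!(pool.resolve(h).is_some()); // live handle resolves
    pool.free(h);
    assert!(pool.resolve(h).is_none()); // stale handle rejected
    let h2 = pool.alloc(0);             // slot recycled, new generation
    assert!(pool.resolve(h2).is_some());
    assert!(pool.resolve(h).is_none()); // old handle still rejected
    println!("ok");
}
```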

/// Maximum number of scatter-gather fragments stored inline in `NetBuf::frags`.
///
/// 6 fragments covers the common case. A worst-case 64KB GSO aggregate built from
/// 4KB pages would need up to 15 fragments (ceil(64KB / 4KB) - 1, with the first
/// page in the linear buffer) and spills to `frag_ext`, but MTU-sized packets use
/// only 1-2 pages, so 6 inline fragments cover >99% of real packets while keeping
/// `NetBuf` at ~296 bytes (5 cache lines; competitive with Linux's `sk_buff` at ~240 bytes).
pub const MAX_INLINE_FRAGS: usize = 6;

/// Scatter-gather fragment: a reference to a contiguous region within a DMA buffer.
///
/// Each fragment represents a page (or page range) of packet data that is not
/// contiguous with the linear buffer. Fragments are used for:
/// - TCP zero-copy receive: userspace pages are directly referenced as fragments
/// - GRO coalescing: appended packet payloads become fragments
/// - sendfile()/splice(): file pages are attached as fragments without copying
#[repr(C)]
pub struct NetBufFrag {
    /// DMA buffer handle for this fragment's data pages.
    ///
    /// May be the same as `NetBuf::data_handle` (different region of the same
    /// DMA allocation) or a completely separate DMA buffer. The handle's refcount
    /// is incremented when the fragment is attached and decremented when removed.
    pub handle: DmaBufferHandle,

    /// Byte offset within the DMA buffer where this fragment's data begins.
    pub offset: u32,

    /// Length of this fragment's data in bytes.
    pub length: u32,
}

/// Checksum offload status (matches Linux CHECKSUM_* semantics).
#[repr(u8)]
pub enum ChecksumStatus {
    /// No checksum information. Software must compute/verify.
    None = 0,
    /// Hardware verified the checksum is correct (RX). Software may skip verification.
    /// Equivalent to Linux `CHECKSUM_UNNECESSARY`.
    Unnecessary = 1,
    /// Hardware computed a raw ones-complement sum over the packet data (RX). The
    /// raw value is in `NetBuf::csum_value`. Software must fold it and verify
    /// against the pseudo-header. Equivalent to Linux `CHECKSUM_COMPLETE`.
    Complete = 2,
    /// Software filled the pseudo-header checksum; hardware must complete the L4
    /// checksum computation (TX). `csum_start` and `csum_offset` specify the
    /// computation range. Equivalent to Linux `CHECKSUM_PARTIAL`.
    Partial = 3,
}
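The fold-and-verify step for the RX `Complete` mode amounts to standard Internet (ones-complement) checksum arithmetic. A minimal sketch, with assumed helper names (`csum_partial`, `csum_fold`; the pseudo-header contribution is assumed to be folded into the seed by the caller):

```rust
/// Accumulate 16-bit big-endian words into a 32-bit ones-complement sum.
fn csum_partial(data: &[u8], mut sum: u32) -> u32 {
    let mut chunks = data.chunks_exact(2);
    for c in &mut chunks {
        sum += u32::from(u16::from_be_bytes([c[0], c[1]]));
    }
    if let [last] = chunks.remainder() {
        sum += u32::from(*last) << 8; // odd trailing byte, zero-padded
    }
    sum
}

/// Fold a 32-bit accumulator to a 16-bit ones-complement checksum.
fn csum_fold(mut sum: u32) -> u16 {
    while sum >> 16 != 0 {
        sum = (sum & 0xFFFF) + (sum >> 16); // wrap the carries back in
    }
    !(sum as u16)
}

fn main() {
    // A segment whose stored checksum field is correct sums (including
    // that field) to 0xFFFF, so the fold yields 0: packet is valid.
    let valid = [0x00, 0x01, 0xFF, 0xFE];
    assert_eq!(csum_fold(csum_partial(&valid, 0)), 0);

    // Carry propagation: 0x0001_0002 folds to 0x0003, complement 0xFFFC.
    assert_eq!(csum_fold(0x0001_0002), 0xFFFC);
    println!("ok");
}
```

On the TX `Partial` path the stack performs the inverse: it writes the pseudo-header sum into the checksum field and lets the NIC finish from `csum_start` onward.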

/// GSO (Generic Segmentation Offload) type.
///
/// Identifies the segmentation algorithm needed when software GSO must split
/// a super-packet into MTU-sized frames. Hardware TSO/USO supersedes software
/// GSO when the NIC reports the corresponding offload capability.
#[repr(u8)]
pub enum GsoType {
    /// Not a GSO packet. No segmentation needed.
    None = 0,
    /// TCP segmentation (TSO). Split at MSS boundaries, rewrite TCP sequence
    /// numbers and checksums per segment.
    TcpV4 = 1,
    /// TCP segmentation for IPv6.
    TcpV6 = 2,
    /// UDP fragmentation offload (UFO). Split at MSS boundaries, generate
    /// IP fragments (IPv4) or fragment extension headers (IPv6).
    Udp = 3,
    /// TCP segmentation for tunnel-encapsulated packets. Outer and inner
    /// headers are rewritten per segment.
    TcpTunnel = 4,
    /// UDP segmentation for tunnel-encapsulated packets.
    UdpTunnel = 5,
    /// GRO partial: a GRO-coalesced packet that was only partially merged
    /// (different IP IDs or non-contiguous sequence numbers). Must be
    /// re-segmented before delivery if the receiver cannot handle it.
    GroPartial = 6,
}
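The relationship between `gso_size` and `gso_segs` is a ceiling division over the L4 payload. A sketch of the accounting (illustrative function name; real software GSO also rewrites sequence numbers and checksums per segment):

```rust
// Split an L4 payload into MSS-sized segment lengths; the last segment
// may be short. gso_segs == number of entries returned.
fn gso_segment_lengths(payload_len: u32, gso_size: u16) -> Vec<u32> {
    let mss = u32::from(gso_size);
    let mut segs = Vec::new();
    let mut remaining = payload_len;
    while remaining > 0 {
        let seg = remaining.min(mss);
        segs.push(seg);
        remaining -= seg;
    }
    segs
}

fn main() {
    // A 64KB TCP super-packet with MSS 1448 splits into 46 segments:
    // 45 full segments plus one 376-byte tail (45 * 1448 = 65160).
    let segs = gso_segment_lengths(65536, 1448);
    assert_eq!(segs.len(), 46);
    assert_eq!(segs[45], 376);
    assert_eq!(segs.iter().sum::<u32>(), 65536);
    println!("ok");
}
```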

bitflags! {
    /// Packet processing flags.
    pub struct NetBufFlags: u32 {
        /// Packet is locally generated (TX), not forwarded.
        const LOCAL_OUT       = 1 << 0;
        /// Packet is destined for local delivery (RX), not forwarded.
        const LOCAL_IN        = 1 << 1;
        /// Packet is being forwarded (neither locally generated nor locally destined).
        const FORWARDED       = 1 << 2;
        /// Packet data pages are in the RDMA-eligible pool (Section 5.1.4.3).
        /// Can be used directly for RDMA operations without re-registration.
        const RDMA_ELIGIBLE   = 1 << 3;
        /// Packet was decapsulated (tunnel outer headers stripped).
        const DECAPPED        = 1 << 4;
        /// Packet requires encryption before transmission (IPsec or WireGuard).
        const NEEDS_ENCRYPT   = 1 << 5;
        /// Packet has been decrypted (IPsec or WireGuard).
        const DECRYPTED       = 1 << 6;
        /// XDP metadata area is valid (contains XDP metadata prepended by the driver).
        const XDP_META_VALID  = 1 << 7;
        /// Data pages are shared (refcount > 1). Write operations must
        /// copy-on-write to avoid corrupting other consumers.
        const SHARED_DATA     = 1 << 8;
        /// Packet is a clone created by `clone_shared()`. The data pages are
        /// shared with the original; metadata is independently owned.
        const CLONED          = 1 << 9;
        /// Software GSO segmentation is needed before TX (NIC lacks TSO support).
        const NEEDS_GSO       = 1 << 10;
    }
}
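The `SHARED_DATA` flag encodes a copy-on-write discipline: a writer must gain exclusive ownership of the data pages before an in-place header rewrite (e.g., NAT). A sketch under an assumed simplification — `std::sync::Arc` models the page refcount, and `MiniBuf`/`make_writable` are illustrative names:

```rust
use std::sync::Arc;

struct MiniBuf {
    data: Arc<Vec<u8>>, // refcounted stand-in for the shared data pages
}

impl MiniBuf {
    /// Make the data privately owned, copying only if it is shared.
    /// Arc::make_mut clones the payload iff strong_count > 1 — the same
    /// condition the SHARED_DATA flag tracks.
    fn make_writable(&mut self) -> &mut Vec<u8> {
        Arc::make_mut(&mut self.data)
    }
}

fn main() {
    let mut a = MiniBuf { data: Arc::new(vec![1u8, 2, 3]) };
    let b = MiniBuf { data: Arc::clone(&a.data) }; // zero-copy clone

    a.make_writable()[0] = 99; // shared, so this triggers a private copy

    assert_eq!(a.data[0], 99);
    assert_eq!(b.data[0], 1); // the clone still sees the original bytes
    println!("ok");
}
```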

15.1.3.1 NetBufPool: Per-CPU Slab Pool

/// Per-CPU pool for `NetBuf` metadata struct allocation.
///
/// Each CPU maintains its own slab of pre-allocated `NetBuf` structs. The fast path
/// (alloc/free) is a single pointer swap with no locks, no atomics, and no cross-CPU
/// traffic — the pool pointer lives in the `CpuLocalBlock::slab_magazines` array
/// (Section 3.1.2).
///
/// **Capacity**: Each CPU's magazine holds a configurable number of free-buffer
/// handles (default: 256 — one 4KB page of 16-byte `NetBufHandle` tokens:
/// 4096 / 16 = 256; see the size note in Section 15.1.3). The full `NetBuf`
/// structs (~296 bytes each) live in separate slab pages. When a magazine is
/// exhausted, the CPU requests a new slab page from the global slab allocator
/// (Section 4.1, one atomic increment).
///
/// **NAPI batch integration**: During NAPI poll (Section 15.1.7), the driver allocates
/// a batch of up to 64 NetBufs at once via `NetBufPool::alloc_batch()`. This amortizes
/// any fallback to the global allocator across the entire batch. The batch is processed
/// within a single NAPI poll cycle — no NetBuf from the batch outlives the poll call.
///
/// **NUMA awareness**: NetBufs freed on a different NUMA node than their allocation
/// node are placed on a cross-node return list (per-CPU, per-source-node). The return
/// list is drained back to the origin node's pool in batches of 32 during idle time
/// or when the local pool is full. This prevents NUMA-remote memory from accumulating
/// on a local CPU's free list.
///
/// **Cross-reference**: Section 4.1 (slab allocator), Section 3.1.2 (CpuLocalBlock),
/// NAPI batching (Section 15.1.7).
pub struct NetBufPool {
    // Implementation is the standard slab magazine pattern from Section 4.1.
    // No additional fields are specified here — NetBufPool is a type alias for
    // `SlabPool<NetBuf>` with the NetBuf-specific size class.
}
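The magazine fast path described above can be sketched as follows. This is a simplified stand-in, not the real implementation: `NetBufHandle` uses the 16-byte layout mentioned above (pool-id + slot-index + generation), the `Magazine` type and `refill_from_global()` are hypothetical names, and the global slab allocator is simulated by minting a fresh batch of handles.

```rust
/// 16-byte handle: pool-id + slot-index + generation (as described above).
#[derive(Clone, Copy, Debug, PartialEq)]
struct NetBufHandle {
    pool_id: u32,
    slot: u32,
    generation: u64,
}

/// Simplified per-CPU magazine. Only its owning CPU ever touches it, so the
/// fast path needs no locks, no atomics, and no cross-CPU traffic.
struct Magazine {
    free: Vec<NetBufHandle>,
    capacity: usize, // default 256
}

impl Magazine {
    fn alloc(&mut self) -> Option<NetBufHandle> {
        if let Some(h) = self.free.pop() {
            return Some(h); // fast path: a single pop from the local free list
        }
        // Slow path: refill from the global slab allocator (one atomic
        // increment in the real design; simulated here as a fresh slab batch).
        self.refill_from_global();
        self.free.pop()
    }

    fn free_buf(&mut self, h: NetBufHandle) {
        if self.free.len() < self.capacity {
            self.free.push(h); // fast path: a single push
        }
        // else: in the real pool, overflow drains back to the global allocator
        // (or onto the cross-node return list for NUMA-remote frees).
    }

    fn refill_from_global(&mut self) {
        for slot in 0..64u32 {
            self.free.push(NetBufHandle { pool_id: 0, slot, generation: 1 });
        }
    }
}
```

Because alloc and free are LIFO on the local free list, a just-freed handle (cache-hot metadata) is the first one handed out again.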

15.1.3.2 NetBuf Operations

impl NetBuf {
    /// Allocate a new NetBuf with a linear buffer of at least `size` bytes.
    ///
    /// The buffer is allocated from the current CPU's `NetBufPool` (metadata) and
    /// `DmaBufferHandle` pool (data). `headroom` bytes are reserved before the data
    /// region for header prepend operations. Total allocation is `headroom + size`
    /// bytes, rounded up to the DMA allocator's alignment (typically cache-line, 64B).
    ///
    /// Returns `Err(KernelError::NoMem)` if the DMA pool is exhausted.
    ///
    /// **Default headroom**: `NET_BUF_DEFAULT_HEADROOM` (128 bytes) — sufficient for
    /// an outer Ethernet (14) + IP (20) + UDP (8) + VXLAN (8) header plus alignment
    /// padding. Callers that know they need less (e.g., loopback) may specify 0.
    ///
    /// # Preconditions
    /// - Must be called with preemption disabled (NAPI context or explicit
    ///   `PreemptGuard`), because the per-CPU pool requires CPU pinning.
    pub fn alloc(size: u32, headroom: u32) -> Result<NetBuf, KernelError>;

    /// Allocate a batch of `count` NetBufs, each with `size` bytes and `headroom`.
    ///
    /// More efficient than `count` individual `alloc()` calls because the slab
    /// magazine is checked once and the DMA pool may satisfy the entire batch from
    /// a single large allocation (if the DMA allocator supports bulk alloc).
    /// Used by NAPI poll to pre-allocate RX buffers for a batch of up to 64 packets.
    ///
    /// Returns the number of successfully allocated NetBufs (may be less than `count`
    /// if memory is low). Partial success is not an error — the caller processes
    /// however many buffers were obtained.
    pub fn alloc_batch(
        out: &mut [MaybeUninit<NetBuf>],
        count: usize,
        size: u32,
        headroom: u32,
    ) -> usize;

    /// Free this NetBuf, returning the metadata struct to the local CPU's pool.
    ///
    /// If the data page refcount reaches 0, the DMA buffer is also freed.
    /// If the data pages are shared (`SHARED_DATA` flag), only the refcount is
    /// decremented. Safe to call from any CPU — if the NetBuf was allocated on a
    /// different NUMA node, it is placed on the cross-node return list.
    pub fn free(self);

    /// Prepend `len` bytes of headroom, advancing `data_offset` backward.
    ///
    /// Used to prepend protocol headers (e.g., IP header before TCP payload,
    /// Ethernet header before IP). The caller writes the header into the newly
    /// exposed region `[new_data_offset .. old_data_offset)`.
    ///
    /// # Panics
    /// Panics if `data_offset - len < head_offset` (insufficient headroom).
    /// Callers must check headroom or use `prepend_realloc()` for untrusted sizes.
    pub fn push(&mut self, len: u32) -> &mut [u8];

    /// Consume `len` bytes from the front of the data region.
    ///
    /// Used after parsing a protocol header: the header is consumed (data_offset
    /// advances past it) so the next layer sees its own header at `data_offset`.
    /// Returns a slice to the consumed header bytes (valid until the NetBuf is freed).
    ///
    /// # Panics
    /// Panics if `data_offset + len > tail_offset` (consuming more than available).
    pub fn pull(&mut self, len: u32) -> &[u8];

    /// Append `len` bytes at the tail of the data region.
    ///
    /// Used to append data to the linear buffer (e.g., padding, trailer).
    /// Returns a mutable slice to the newly appended region.
    ///
    /// # Panics
    /// Panics if `tail_offset + len > end_offset` (insufficient tailroom).
    pub fn put(&mut self, len: u32) -> &mut [u8];

    /// Create a zero-copy clone of this NetBuf.
    ///
    /// Allocates a new `NetBuf` struct from the local CPU's pool. The new struct
    /// gets a copy of all metadata fields. The data pages are shared: the DMA
    /// buffer's refcount is incremented, and both the original and clone set the
    /// `SHARED_DATA` flag. Subsequent writes to either NetBuf's data region trigger
    /// copy-on-write (the writer allocates new data pages and copies before modifying).
    ///
    /// **Use cases**: XDP_REDIRECT to multiple interfaces, TCP retransmission queue
    /// (keeping a reference to sent data for potential retransmit), multicast forwarding.
    pub fn clone_shared(&self) -> Result<NetBuf, KernelError>;

    /// Linearize the packet: copy all scatter-gather fragments into the linear buffer.
    ///
    /// After linearization, the entire packet is in the contiguous region
    /// `[data_offset .. tail_offset)` and `nr_frags == 0`. This is required before
    /// passing the packet to consumers that do not support scatter-gather (e.g., some
    /// BPF helpers, legacy protocol parsers).
    ///
    /// If the linear buffer is too small to hold all fragment data, a new larger
    /// DMA buffer is allocated and all data (linear + fragments) is copied into it.
    ///
    /// Returns `Err(KernelError::NoMem)` if reallocation fails.
    pub fn linearize(&mut self) -> Result<(), KernelError>;

    /// Total packet length (linear data + all fragments).
    ///
    /// This is the logical packet size visible to protocols. For GSO packets,
    /// this is the aggregate size before segmentation.
    pub fn len(&self) -> u32 {
        let linear = self.tail_offset - self.data_offset;
        let frag_total: u32 = self.frags[..self.nr_frags as usize]
            .iter()
            .map(|f| f.length)
            .sum();
        // frag_ext contribution omitted for brevity; follows same pattern
        linear + frag_total
    }

    /// Return a read-only slice to the linear data region.
    ///
    /// Does NOT include scatter-gather fragments. Use `linearize()` first if you
    /// need the entire packet as a contiguous slice.
    pub fn linear_data(&self) -> &[u8];

    /// Return a mutable slice to the linear data region.
    ///
    /// If the data pages are shared (`SHARED_DATA` flag), this triggers copy-on-write:
    /// new data pages are allocated, the linear data is copied, and the original
    /// pages' refcount is decremented.
    pub fn linear_data_mut(&mut self) -> Result<&mut [u8], KernelError>;

    /// Attach a page fragment to this NetBuf's scatter-gather list.
    ///
    /// The fragment's DMA buffer handle refcount is incremented. If `nr_frags`
    /// exceeds `MAX_INLINE_FRAGS` and `frag_ext` is `None`, a `SlabVec` is
    /// allocated for overflow storage.
    pub fn add_frag(&mut self, frag: NetBufFrag) -> Result<(), KernelError>;

    /// Adjust the data region by `delta` bytes (positive = grow, negative = shrink).
    ///
    /// This is the underlying operation for `bpf_skb_adjust_room()` BPF helper.
    /// Positive delta inserts space at the current `data_offset` (for encapsulation);
    /// negative delta removes space (for decapsulation). May trigger reallocation
    /// if the adjustment exceeds available headroom or tailroom.
    pub fn adjust_room(&mut self, delta: i32) -> Result<(), KernelError>;
}
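The offset arithmetic behind `push`/`pull`/`put` can be illustrated with a self-contained stand-in. `LinearBuf` below is a hypothetical simplification (a real NetBuf references DMA pages via `data_handle` and tracks `head_offset`/`end_offset` too), but the invariant is the same: `push` moves `data_offset` backward into headroom, `pull` advances it past a parsed header, `put` grows the tail.

```rust
// Simplified stand-in for the NetBuf linear region (illustrative only).
struct LinearBuf {
    buf: Vec<u8>,
    data_offset: usize, // first byte of packet data
    tail_offset: usize, // one past the last byte of packet data
}

impl LinearBuf {
    fn with_headroom(headroom: usize, cap: usize) -> Self {
        LinearBuf { buf: vec![0; headroom + cap], data_offset: headroom, tail_offset: headroom }
    }
    /// Append `len` bytes at the tail (like `NetBuf::put`).
    fn put(&mut self, len: usize) -> &mut [u8] {
        assert!(self.tail_offset + len <= self.buf.len(), "insufficient tailroom");
        let start = self.tail_offset;
        self.tail_offset += len;
        &mut self.buf[start..self.tail_offset]
    }
    /// Prepend `len` header bytes (like `NetBuf::push`): data_offset moves backward.
    fn push(&mut self, len: usize) -> &mut [u8] {
        assert!(self.data_offset >= len, "insufficient headroom");
        self.data_offset -= len;
        let end = self.data_offset + len;
        &mut self.buf[self.data_offset..end]
    }
    /// Consume `len` bytes from the front (like `NetBuf::pull`): data_offset advances.
    fn pull(&mut self, len: usize) -> &[u8] {
        assert!(self.data_offset + len <= self.tail_offset, "short packet");
        let start = self.data_offset;
        self.data_offset += len;
        &self.buf[start..self.data_offset]
    }
    fn len(&self) -> usize { self.tail_offset - self.data_offset }
}
```

On TX, a 100-byte payload written with `put` followed by `push(8)` (UDP), `push(20)` (IP), and `push(14)` (Ethernet) yields a 142-byte frame with no copies; on RX, each layer `pull`s its own header before handing the buffer up.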

/// Default headroom reserved in newly allocated NetBufs.
///
/// 128 bytes is sufficient for: Ethernet (14) + 802.1Q (4) + outer IPv6 (40) +
/// UDP (8) + VXLAN (8) + inner Ethernet (14) + alignment padding (40).
/// This covers the common tunnel encapsulation case without reallocation.
pub const NET_BUF_DEFAULT_HEADROOM: u32 = 128;

/// Sentinel value for conntrack index indicating the packet is not tracked.
pub const CONNTRACK_UNTRACKED: u32 = u32::MAX;

15.1.3.3 Domain Crossing Protocol

When a NetBuf crosses the isolation domain boundary between umka-net and a NIC driver (in either direction), the following protocol applies. It ensures that each domain operates only on metadata it owns (preventing TOCTOU races on header offsets) while sharing the bulk data zero-copy:

RX path (NIC driver to umka-net):

1. The NIC driver completes DMA into a data page from the shared DMA buffer pool.
2. The driver allocates a NetBuf from its own per-CPU pool and fills in the metadata (data offsets, checksum status, VLAN tag, RSS hash, timestamp).
3. The driver writes the NetBuf metadata to the submission ring buffer shared with umka-net (PKEY 1 / shared read-only descriptors, Section 10.2). The ring entry contains a serialized copy of the NetBuf metadata fields — not a pointer to the driver's NetBuf struct (which lives in the driver's private domain).
4. The driver frees its local NetBuf struct (the metadata has been serialized to the ring).
5. umka-net reads the ring entry, allocates a NetBuf from its own per-CPU pool, and deserializes the metadata. The `data_handle` field references the same DMA pages (shared domain).
6. umka-net processes the packet through the protocol stack.

TX path (umka-net to NIC driver): Symmetric — umka-net serializes metadata to the TX submission ring, the driver deserializes into its own NetBuf and programs the NIC's TX descriptor with the data_handle's physical address.

NAPI batching: Steps 3-5 are batched. The driver writes up to 64 ring entries before signaling umka-net (doorbell). umka-net processes the entire batch in a single NAPI poll iteration, amortizing the 4 domain switches across all 64 packets (Section 15.1.7).
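The ring entry written in step 3 can be sketched as a plain-old-data descriptor. The field set below is illustrative (the real entry mirrors whichever NetBuf metadata fields the driver fills in); the point is that the entry crosses the ring by value, with no pointers into the driver's private domain:

```rust
// Hypothetical serialized RX ring entry. Because it is a POD snapshot,
// umka-net deserializes it into its own NetBuf without ever dereferencing
// driver memory; only `data_handle` refers to the shared DMA pages.
#[repr(C)]
#[derive(Clone, Copy, Debug, PartialEq)]
struct RxRingDesc {
    data_handle: u64, // DMA buffer handle (shared-domain pages)
    data_offset: u32, // where packet data starts within the buffer
    len: u32,         // packet length
    rss_hash: u32,    // RSS hash computed by the NIC
    vlan_tag: u16,    // stripped VLAN tag, if any
    csum_ok: u8,      // checksum already verified by NIC
    _pad: u8,
}
```

Serialization on this layout is a straight memory copy in both directions, which is what keeps the per-entry cost small enough to amortize across a 64-packet NAPI batch.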

XDP interaction: XDP programs execute in the BPF isolation domain (Section 18.1.4) before the RX path reaches umka-net. For XDP, the driver copies the packet descriptor into a BPF-accessible bounce buffer (or maps the data pages read-only into the BPF domain for zero-copy XDP, Section 18.1.4). The XDP program receives an XdpContext pointer and returns an XdpAction value that determines packet fate. XdpAction::Pass delivers the packet to umka-net via the normal RX path above. XdpAction::Drop, XdpAction::Tx, and XdpAction::Redirect are handled entirely within the driver/BPF domain, never crossing to umka-net.

/// Context passed to an XDP BPF program attached to a network interface.
///
/// Read-only view of the packet: the program sees packet byte offsets into the
/// NIC's DMA buffer. The program's return value determines packet fate.
///
/// # Linux Compatibility
/// Layout-compatible with Linux's `struct xdp_md` (the BPF program ABI context).
/// Existing Linux XDP programs compile and run without modification. UmkaOS-specific
/// fields (if any) are appended after the Linux-compatible fields and are optional.
#[repr(C)]
pub struct XdpContext {
    /// Byte offset from DMA buffer start to the first byte of packet data.
    pub data:            u32,
    /// Byte offset from DMA buffer start to the byte AFTER the last packet byte.
    pub data_end:        u32,
    /// Byte offset to the start of XDP metadata (between `data_meta` and `data`).
    /// Zero if no metadata has been set. Set via `bpf_xdp_adjust_meta()`.
    pub data_meta:       u32,
    /// Ingress network interface index (1-based). Zero if not applicable.
    pub ingress_ifindex: u32,
    /// Receive queue index on the ingress interface (zero-based).
    pub rx_queue_index:  u32,
    /// Egress interface index for `XdpAction::Redirect`. Set by `bpf_redirect()`.
    /// Zero if not redirecting.
    pub egress_ifindex:  u32,
}

/// XDP program return codes.
///
/// Values MUST match Linux's `enum xdp_action` for BPF program binary portability.
/// Existing Linux XDP programs that return these values work without recompilation.
#[repr(u32)]
pub enum XdpAction {
    /// Unrecoverable error in the XDP program. Drop packet; bump `xdp_aborted` counter.
    Aborted  = 0,
    /// Discard the packet silently. Fastest drop path.
    Drop     = 1,
    /// Pass the packet up to the normal network stack for processing.
    Pass     = 2,
    /// Retransmit the packet out the same NIC queue it arrived on.
    Tx       = 3,
    /// Redirect the packet to another NIC or another CPU queue via `bpf_redirect()`.
    Redirect = 4,
}
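A minimal XDP-style program against this ABI might look like the sketch below. The types are local mirrors of `XdpContext`/`XdpAction` (trimmed to the fields used) so the example is self-contained; a real program would be compiled to BPF and access the DMA buffer only through verifier-checked bounds, which the explicit length check stands in for here.

```rust
#[repr(u32)]
#[derive(Debug, PartialEq)]
enum XdpAction { Aborted = 0, Drop = 1, Pass = 2, Tx = 3, Redirect = 4 }

/// Trimmed mirror of the XdpContext above: byte offsets into the DMA buffer.
#[repr(C)]
struct XdpContext { data: u32, data_end: u32 }

/// Hypothetical filter: pass only IPv4 frames (EtherType 0x0800), drop the rest.
fn xdp_ipv4_only(ctx: &XdpContext, dma: &[u8]) -> XdpAction {
    let start = ctx.data as usize;
    let end = ctx.data_end as usize;
    // Bounds check first — the BPF verifier rejects programs that read
    // packet bytes before proving the access is within [data, data_end).
    if end > dma.len() || end < start + 14 {
        return XdpAction::Aborted;
    }
    // EtherType lives at bytes 12-13 of the Ethernet header, big-endian.
    let ethertype = u16::from_be_bytes([dma[start + 12], dma[start + 13]]);
    if ethertype == 0x0800 { XdpAction::Pass } else { XdpAction::Drop }
}
```

Returning `Pass` hands the packet to umka-net via the normal RX path above; `Drop` and `Aborted` resolve entirely inside the driver/BPF domain without any domain crossing.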

15.1.4 Routing Table (FIB — Forwarding Information Base)

The routing table provides longest-prefix-match (LPM) lookup for IPv4 and IPv6 destination addresses, supporting policy routing (multiple tables with rule-based selection), VRF (Virtual Routing and Forwarding, Section 15.2), and ECMP (Equal-Cost Multi-Path) with weighted next-hops.

Design principles:

1. **RCU-protected**: Route lookup is on the per-packet forwarding path. Readers (packet processing) access the routing table under `rcu_read_lock()` with zero lock acquisition. Writers (netlink RTM_NEWROUTE/RTM_DELROUTE, Section 15.2.1) clone the table, apply mutations, and atomically swap via `RcuCell::update()` (same pattern as `NetNamespace::interfaces`, Section 16.1.1).
2. **Per-namespace**: Each NetNamespace holds its own `RcuCell<RouteTable>` (Section 16.1.1). VRFs within a namespace have separate tables, identified by table ID.
3. **Unified data structure**: IPv4 and IPv6 share the same trie implementation (operating on 128-bit addresses; IPv4 addresses are stored as IPv4-mapped-IPv6). This eliminates code duplication and simplifies policy routing rules that apply to both address families.
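The clone-and-swap update pattern can be sketched with standard-library primitives. `RcuCellSim` below is a hypothetical stand-in for `RcuCell` (using an `RwLock<Arc<_>>` only to make the swap atomic in safe Rust; the real RcuCell has no reader-side lock), and the table body is reduced to a map, but the shape is the same: readers take an `Arc` snapshot, the writer mutates a private clone and installs it atomically.

```rust
use std::collections::BTreeMap;
use std::sync::{Arc, RwLock};

#[derive(Clone)]
struct RouteTable {
    // table id -> routes as (prefix, prefix_len); real tables hold a FibTrie.
    tables: BTreeMap<u32, Vec<(u128, u8)>>,
}

// Stand-in for `RcuCell<RouteTable>` — shape only, not the implementation.
struct RcuCellSim(RwLock<Arc<RouteTable>>);

impl RcuCellSim {
    /// Read side: grab the current snapshot; lookups then proceed on it
    /// with no further synchronization (analogous to an RCU read section).
    fn read(&self) -> Arc<RouteTable> {
        Arc::clone(&self.0.read().unwrap())
    }
    /// Write side: clone the current version, mutate the private copy,
    /// and swap the new version in atomically. In-flight readers keep
    /// their old snapshot alive via the Arc refcount.
    fn update(&self, mutate: impl FnOnce(&mut RouteTable)) {
        let mut guard = self.0.write().unwrap();
        let mut new = (**guard).clone();
        mutate(&mut new);
        *guard = Arc::new(new);
    }
}
```

With the persistent trie described below, the `clone()` in `update` shares all unchanged trie nodes between versions, so an update copies only the O(W) path to the modified leaf.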

15.1.4.1 Data Structures

// umka-net/src/routing.rs

/// Forwarding Information Base — the routing table for a network namespace.
///
/// Contains one or more numbered routing tables (Linux supports 256 tables by
/// default; UmkaOS supports up to 4096). Table 253 (`RT_TABLE_DEFAULT`) and table
/// 254 (`RT_TABLE_MAIN`) are always present. Table 255 (`RT_TABLE_LOCAL`) holds
/// routes for local addresses (auto-populated when addresses are assigned).
///
/// **RCU integration**: The entire `RouteTable` is behind `RcuCell<RouteTable>` in
/// `NetNamespace` ([Section 16.1.1](16-containers.md#1611-capability-domain-mapping)). Lookups read under RCU; mutations clone-and-swap.
/// The clone is a logical clone — the trie nodes themselves are `Arc`-shared between
/// the old and new versions (persistent/path-copied trie), so cloning a table with
/// N routes and applying a single-route mutation costs O(W) allocations, where
/// W = key width (at most 128 for IPv6), not O(N). This is the path-copy cost for a
/// Patricia trie: bounded by key width, independent of table size.
///
/// **Cross-reference**: `NetNamespace::routes` ([Section 16.1.1](16-containers.md#1611-capability-domain-mapping)), `FibRule` (below),
/// `NetBuf::route_cache` (Section 15.1.3), `bpf_fib_lookup()` (Section 15.2.2),
/// VRF (Section 15.2), netlink RTM_* messages (Section 15.2.1).
pub struct RouteTable {
    /// Named routing tables, indexed by table ID.
    ///
    /// Standard table IDs (matching Linux `RT_TABLE_*` constants):
    /// - 0: `RT_TABLE_UNSPEC` (used in rules to mean "any table")
    /// - 253: `RT_TABLE_DEFAULT` (default routes)
    /// - 254: `RT_TABLE_MAIN` (main routing table, where `ip route add` goes)
    /// - 255: `RT_TABLE_LOCAL` (local and broadcast addresses, auto-managed)
    /// - 1-252, 256-4095: user-defined tables for policy routing and VRF
    ///
    /// Stored as a `BTreeMap` for ordered iteration (netlink dump) and O(log K)
    /// lookup by table ID, where K is the number of tables (typically 3-10).
    /// The per-table trie provides O(W) prefix lookup where W is the address width.
    pub tables: BTreeMap<u32, FibTrie>,

    /// Policy routing rules, evaluated in priority order.
    ///
    /// Rules select which routing table to consult based on packet attributes
    /// (source address, destination address, mark, incoming interface, IP protocol,
    /// source/destination port, UID). If no rule matches, the default rule chain
    /// applies: local table (255) first, main table (254), then default table (253).
    ///
    /// Sorted by `FibRule::priority` (ascending). Lower numeric priority = higher
    /// precedence (matching Linux semantics where priority 0 is highest).
    pub rules: Vec<FibRule>,

    /// Default rule chain (compiled from `rules`).
    ///
    /// For the common case where no custom policy rules are configured, this is
    /// `[255, 254, 253]` — check local, main, default, in that order. When custom
    /// rules exist, lookups evaluate `rules` first, falling through to this default
    /// chain only if no rule matches. Caching the default chain avoids allocating
    /// and iterating the rules list for the common no-policy-routing case.
    pub default_chain: Vec<u32>,
}
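The default-chain fallthrough (local, then main, then default) together with `Throw` semantics can be sketched as below. The per-table lookup is reduced to an exact-match map standing in for the FibTrie LPM, and `Verdict` is a hypothetical two-case reduction of `RouteType`:

```rust
use std::collections::BTreeMap;

#[derive(Clone, Copy, Debug, PartialEq)]
enum Verdict {
    Forward(u32), // forward out this ifindex
    Throw,        // skip this table, try the next one in the chain
}

/// Resolve `dst` by consulting tables in chain order; the first table that
/// yields a non-Throw match wins. Returns the output ifindex, or None for
/// "no route" (destination unreachable).
fn resolve(
    tables: &BTreeMap<u32, BTreeMap<u128, Verdict>>, // table id -> dst -> verdict
    default_chain: &[u32],
    dst: u128,
) -> Option<u32> {
    for &tid in default_chain {
        match tables.get(&tid).and_then(|t| t.get(&dst)) {
            Some(Verdict::Forward(ifindex)) => return Some(*ifindex),
            Some(Verdict::Throw) | None => continue, // fall through to next table
        }
    }
    None
}
```

For the common no-policy-routing case this is exactly the cached `default_chain` walk: at most three table lookups, with no rule evaluation at all.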

/// Compressed radix trie for longest-prefix-match IP routing.
///
/// Implements a path-compressed (Patricia) trie over 128-bit keys (IPv4 addresses
/// are stored as IPv4-mapped-IPv6: `::ffff:a.b.c.d`). Path compression collapses
/// single-child internal nodes.
///
/// **Lookup complexity**: O(W) where W = 32 (IPv4) or W = 128 (IPv6). Lookup cost
/// is bounded by key width, independent of table size N — a 4M-entry trie takes the
/// same O(W) steps as a 1K-entry trie. For route lookup at 100 Mpps line rate,
/// W=128 means at most 128 bit comparisons per packet — negligible compared to DRAM
/// latency for the node reads. Path compression reduces actual node visits to 5-20
/// for typical routing tables (worst case bounded by W, not N).
///
/// **Why path-compressed Patricia trie (not LC-trie)**:
/// Linux uses an LC-trie (Level-Compressed trie) for IPv4 FIB, which provides
/// excellent lookup performance for dense, well-distributed prefix tables. However:
/// 1. LC-trie requires periodic rebalancing (level compression ratios change as
///    routes are added/removed), which conflicts with UmkaOS's RCU clone-and-swap model.
/// 2. A path-compressed Patricia trie supports persistent (functional) updates:
///    inserting or deleting a route copies only the O(W) nodes on the path from
///    root to the modified leaf (W = key width, bounded by 128 for IPv6), sharing
///    all other nodes with the previous version via `Arc`. This makes
///    `RcuCell` clone-and-swap cheap.
/// 3. For typical routing tables (10-1000 entries for host routing, up to ~1M entries
///    for full Internet BGP table), Patricia trie lookup is 5-20 memory accesses,
///    which is comparable to LC-trie (3-10 accesses) and well within the performance
///    budget (route lookup << TCP processing per packet).
///
/// **Full BGP table**: For routers carrying a full Internet routing table (~1M IPv4
/// prefixes, ~200K IPv6 prefixes), the Patricia trie uses approximately 2-4 MB of
/// memory (each node ~40 bytes, ~1.5-2x the number of prefixes due to internal
/// branching nodes). Lookup is ~20-25 memory accesses worst case, ~500-1000 ns on
/// modern CPUs with L2/L3 cache. This is acceptable because full-BGP hosts are
/// routers where routing lookup is a small fraction of per-packet processing.
///
/// **Cache optimization**: Trie nodes are allocated from a dedicated slab pool
/// (not the general-purpose allocator) to improve spatial locality. Nodes along
/// hot paths (default route, /8 aggregates) are likely to remain in L2 cache.
pub struct FibTrie {
    /// Root node of the trie. `None` for an empty table.
    ///
    /// The root is `Arc`-shared between RCU versions of the routing table.
    /// Cloning a `FibTrie` for RCU update creates a new `FibTrie` with `Arc::clone`
    /// of the root — zero allocation if the route being modified is not on the
    /// root's path.
    pub root: Option<Arc<FibTrieNode>>,

    /// Number of route entries (prefixes) in this trie.
    /// Used for netlink dump pagination and sysctl reporting.
    pub entry_count: u32,

    /// Table ID (for cross-referencing with `RouteTable::tables`).
    pub table_id: u32,
}

/// A node in the path-compressed Patricia trie.
///
/// Each node represents either:
/// - An **internal branching node**: has children but no route entry. The `prefix`
///   and `prefix_len` fields define the common prefix shared by all descendants.
/// - A **leaf node**: has a route entry (`route`) and possibly children (a prefix
///   that is both a route and a branching point, e.g., 10.0.0.0/8 with more
///   specific routes 10.1.0.0/16, 10.2.0.0/16 as children).
///
/// Path compression: internal nodes with a single child are collapsed. The
/// `prefix_len` may skip multiple bits between parent and child.
pub struct FibTrieNode {
    /// The prefix bits for this node (stored as a 128-bit value).
    ///
    /// Only the first `prefix_len` bits are significant. The remaining bits are zero.
    /// IPv4 routes use IPv4-mapped-IPv6 encoding: `::ffff:a.b.c.d` (prefix_len =
    /// 96 + IPv4 prefix length).
    pub prefix: u128,

    /// Number of significant bits in `prefix`. Range: 0 (default route) to 128.
    ///
    /// For IPv4 routes, this is 96 + the IPv4 prefix length (e.g., /24 becomes 120).
    /// For IPv6 routes, this is the native prefix length (e.g., /64 becomes 64).
    pub prefix_len: u8,

    /// Route entry at this prefix. `Some` if this prefix is a destination in the
    /// routing table. `None` if this is a pure branching node (exists only to
    /// connect more-specific child prefixes).
    pub route: Option<RouteEntry>,

    /// Left child (next bit after `prefix_len` is 0).
    ///
    /// `Arc`-shared for persistent data structure: cloning a trie for RCU update
    /// shares unchanged subtrees between old and new versions.
    pub left: Option<Arc<FibTrieNode>>,

    /// Right child (next bit after `prefix_len` is 1).
    pub right: Option<Arc<FibTrieNode>>,
}
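The longest-prefix-match walk over this node layout can be sketched as follows. `Node` mirrors `FibTrieNode` with the route payload reduced to an id, and `v4_key` shows the IPv4-mapped-IPv6 encoding described above (so an IPv4 /8 becomes prefix_len 104, a /16 becomes 112):

```rust
use std::sync::Arc;

struct Node {
    prefix: u128,
    prefix_len: u8,
    route: Option<u32>,          // simplified route payload
    left: Option<Arc<Node>>,     // next bit after prefix_len is 0
    right: Option<Arc<Node>>,    // next bit after prefix_len is 1
}

/// IPv4-mapped-IPv6 key: ::ffff:a.b.c.d (prefix_len = 96 + IPv4 prefix length).
fn v4_key(a: [u8; 4]) -> u128 {
    (0xffff_u128 << 32) | u32::from_be_bytes(a) as u128
}

/// Do the top `plen` bits of `key` equal the node's prefix?
fn prefix_matches(key: u128, prefix: u128, plen: u8) -> bool {
    plen == 0 || (key ^ prefix) >> (128 - plen as u32) == 0
}

fn lookup(root: &Option<Arc<Node>>, key: u128) -> Option<u32> {
    let mut best = None; // deepest node with a route seen so far
    let mut cur = root.as_ref();
    while let Some(node) = cur {
        if !prefix_matches(key, node.prefix, node.prefix_len) {
            break; // diverged: the last recorded match is the LPM answer
        }
        if node.route.is_some() {
            best = node.route;
        }
        if node.prefix_len == 128 {
            break;
        }
        // Branch on the first bit after this node's prefix (bit 0 = MSB).
        let bit = (key >> (127 - node.prefix_len as u32)) & 1;
        cur = if bit == 1 { node.right.as_ref() } else { node.left.as_ref() };
    }
    best
}
```

Each loop iteration descends at least one bit (usually many, thanks to path compression), giving the O(W) bound: the walk is over key bits, never over the N routes in the table.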

/// A single route entry in the FIB.
///
/// Corresponds to a row in the output of `ip route show` or an `RTM_NEWROUTE`
/// netlink message.
#[repr(C)]
pub struct RouteEntry {
    /// Destination prefix (redundant with `FibTrieNode::prefix` but stored here
    /// for self-contained netlink serialization and `bpf_fib_lookup()` results
    /// without requiring a trie node reference).
    pub dst_prefix: u128,

    /// Destination prefix length (0-128). Same as `FibTrieNode::prefix_len`.
    pub dst_prefix_len: u8,

    /// Source prefix for source-specific routing (used with `ip route add ... src`).
    ///
    /// When non-zero, this route matches only if the packet's source address also
    /// falls within this prefix. `src_prefix_len == 0` means "any source" (the
    /// common case). Linux supports source-specific routing only for IPv6
    /// (`RT6_F_POLICY`); UmkaOS supports it for both address families.
    pub src_prefix: u128,

    /// Source prefix length (0 = match any source).
    pub src_prefix_len: u8,

    /// Next-hop(s) for this route.
    ///
    /// Single next-hop for simple routes. Multiple next-hops for ECMP (Equal-Cost
    /// Multi-Path). The `NextHopGroup` handles weighted distribution.
    pub next_hops: NextHopGroup,

    /// Route scope — defines the "reach" of this route.
    ///
    /// Values match Linux `RT_SCOPE_*`:
    /// - `Universe` (0): global routes (reachable via gateways)
    /// - `Site` (200): interior routes within a site
    /// - `Link` (253): directly attached (on-link, no gateway needed)
    /// - `Host` (254): local host route (loopback, local address)
    /// - `Nowhere` (255): destination is unreachable
    pub scope: RouteScope,

    /// Route type — determines packet handling action.
    ///
    /// Values match Linux `RTN_*`:
    /// - `Unicast` (1): normal forwarding
    /// - `Local` (2): local delivery (address on this host)
    /// - `Broadcast` (3): broadcast address
    /// - `Blackhole` (6): silently drop
    /// - `Unreachable` (7): drop and return ICMP host unreachable
    /// - `Prohibit` (8): drop and return ICMP administratively prohibited
    /// - `Throw` (9): policy routing: skip this table and try the next rule
    pub route_type: RouteType,

    /// Route protocol — identifies who installed this route.
    ///
    /// Values match Linux `RTPROT_*`:
    /// - `Kernel` (2): installed by the kernel (e.g., directly connected networks)
    /// - `Boot` (3): installed during boot (static routes from config)
    /// - `Static` (4): installed by administrator (`ip route add`)
    /// - `Zebra` (11) / `Bird` (12): installed by routing daemons
    /// - `Dhcp` (16): installed by DHCP client
    ///
    /// Used for route management (e.g., `ip route flush proto dhcp`).
    pub protocol: RouteProtocol,

    /// Route metric (preference). Lower metric = preferred route.
    ///
    /// When multiple routes match the same destination prefix, the route with
    /// the lowest metric is selected. Matches Linux `ip route add ... metric N`.
    /// Default: 0 for kernel routes, 1024 for DHCP, configurable for static routes.
    pub metric: u32,

    /// Preferred source address for packets originating from this host via this route.
    ///
    /// Set via `ip route add ... src <addr>`. When a locally-generated packet uses
    /// this route and has not yet chosen a source address, this address is used.
    /// `[0u8; 16]` means "no preference, use default address selection (RFC 6724)".
    pub prefsrc: [u8; 16],

    /// MTU override for this route.
    ///
    /// If non-zero, packets using this route are limited to this MTU instead of
    /// the output interface's MTU. Used for path MTU discovery caching and for
    /// tunnel routes with reduced MTU. Zero means "use interface MTU".
    pub mtu: u32,

    /// Route flags.
    pub flags: RouteFlags,

    /// Route expiry time (nanoseconds since boot, CLOCK_MONOTONIC_RAW).
    ///
    /// Zero means "no expiry" (permanent route). Non-zero means the route expires
    /// and is garbage-collected after this time. Used for: DHCP routes (expire when
    /// lease expires), redirected routes (expire after `net.ipv4.route.gc_timeout`),
    /// and path MTU entries (expire after `net.ipv4.route.mtu_expires`).
    pub expires_ns: u64,
}

/// Group of next-hops for a route, supporting ECMP.
///
/// Single-next-hop routes are the common case and are stored inline (no heap
/// allocation). Multi-path routes store their next-hops in a slab-allocated
/// vector.
pub enum NextHopGroup {
    /// Single next-hop (the common case for host routing tables).
    Single(NextHop),

    /// Multiple weighted next-hops for Equal-Cost Multi-Path routing.
    ///
    /// Traffic is distributed across next-hops proportionally to their weights.
    /// The selection is deterministic per flow: the `NetBuf::flow_hash` (Section
    /// 12.1.3) is used to pick a next-hop, ensuring all packets of the same flow
    /// follow the same path (avoiding TCP reordering).
    ///
    /// **Selection algorithm**: `flow_hash % total_weight` determines which
    /// next-hop handles the packet. Each next-hop occupies a range of the weight
    /// space proportional to its weight. For example, with weights [3, 1, 1]
    /// (total 5): hash % 5 in [0,2] -> hop 0, [3,3] -> hop 1, [4,4] -> hop 2.
    ///
    /// **Resilient hashing**: When a next-hop goes down (link failure, neighbor
    /// unreachable), traffic is redistributed only among the remaining next-hops.
    /// Flows that were already using a surviving next-hop are not disrupted.
    /// This matches Linux's `nexthop` group resilient hashing (kernel 5.13+).
    Multipath {
        /// The next-hops and their weights.
        hops: SlabVec<NextHop>,
        /// Sum of all weights. Cached to avoid recomputing on every packet.
        total_weight: u32,
    },
}
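The weighted selection described above reduces to mapping `flow_hash % total_weight` onto per-hop weight ranges. A minimal sketch (the resilient-hashing behavior additionally needs a bucket table so surviving flows keep their buckets when a hop dies; what is shown here is only the baseline weighted mapping):

```rust
/// Pick a next-hop index for a flow. With weights [3, 1, 1] (total 5),
/// hash % 5 in [0,2] -> hop 0, [3,3] -> hop 1, [4,4] -> hop 2 — exactly
/// the range layout described above. Deterministic per flow: the same
/// flow_hash always selects the same hop, avoiding TCP reordering.
fn select_hop(weights: &[u32], flow_hash: u32) -> usize {
    let total: u32 = weights.iter().sum();
    let mut point = flow_hash % total;
    for (i, &w) in weights.iter().enumerate() {
        if point < w {
            return i;
        }
        point -= w;
    }
    unreachable!("point < total by construction")
}
```

Caching `total_weight` in the `Multipath` variant, as the struct above does, removes the `sum()` from the per-packet path.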

/// A single next-hop in the routing table.
#[repr(C)]
pub struct NextHop {
    /// Gateway IP address.
    ///
    /// The IP address of the next router to forward the packet to. For directly
    /// connected networks (on-link routes), this is `[0u8; 16]` and the packet is
    /// sent directly to the destination's link-layer address (resolved via ARP/NDP).
    ///
    /// Stored as 128-bit value: IPv4 gateways use IPv4-mapped-IPv6 encoding.
    pub gateway: [u8; 16],

    /// Output interface index. Indexes into `NetNamespace::interfaces` ([Section 16.1.1](16-containers.md#1611-capability-domain-mapping)).
    ///
    /// The physical or virtual interface through which the packet is transmitted
    /// to reach the gateway (or the destination, for on-link routes).
    pub ifindex: u32,

    /// Weight for ECMP distribution. Range: 1-256 (matching Linux `ip route add ...
    /// nexthop ... weight N`). Higher weight = proportionally more traffic.
    /// Default: 1 (equal distribution). Ignored for `NextHopGroup::Single`.
    pub weight: u16,

    /// Next-hop flags.
    pub flags: NextHopFlags,

    /// MPLS label stack (for MPLS forwarding, if configured).
    ///
    /// `label_count == 0` for non-MPLS routes (the common case). When non-zero,
    /// `labels[0..label_count]` are pushed as an MPLS label stack before
    /// forwarding. Used by MPLS-based VPNs and segment routing.
    pub label_count: u8,

    /// MPLS labels to push (up to 4 labels deep, matching Linux `RTA_ENCAP`).
    pub labels: [u32; 4],

    /// Encapsulation type for tunnel routes.
    ///
    /// When `encap_type != EncapType::None`, this next-hop requires tunnel
    /// encapsulation before forwarding. The encapsulation parameters are in
    /// `encap_data`. Used for VXLAN, Geneve, MPLS, and BPF lightweight tunnels.
    pub encap_type: EncapType,
}

/// Route scope — defines how far a destination is reachable.
#[repr(u8)]
pub enum RouteScope {
    /// Global scope: reachable via gateways (Internet routes).
    Universe = 0,
    /// Site-internal scope: reachable within a site but not globally.
    Site = 200,
    /// Link-local scope: directly attached to this link (no gateway).
    Link = 253,
    /// Host scope: this host itself (loopback, local addresses).
    Host = 254,
    /// Nowhere: destination is unreachable.
    Nowhere = 255,
}

/// Route type — determines packet handling action.
#[repr(u8)]
pub enum RouteType {
    /// Unknown / unspecified.
    Unspec = 0,
    /// Unicast route: forward to next-hop.
    Unicast = 1,
    /// Local route: deliver locally (address on this host).
    Local = 2,
    /// Broadcast route: deliver as link-layer broadcast.
    Broadcast = 3,
    /// Anycast route: deliver to any of a set of local addresses.
    Anycast = 4,
    /// Multicast route: deliver via multicast.
    Multicast = 5,
    /// Blackhole: silently drop.
    Blackhole = 6,
    /// Unreachable: drop and send ICMP Destination Unreachable.
    Unreachable = 7,
    /// Prohibit: drop and send ICMP Administratively Prohibited.
    Prohibit = 8,
    /// Throw: skip this table, continue to next policy rule.
    Throw = 9,
}

/// Route protocol identifier — who installed this route.
#[repr(u8)]
pub enum RouteProtocol {
    /// Route origin is unknown.
    Unspec = 0,
    /// Installed by ICMP redirect.
    Redirect = 1,
    /// Installed by the kernel (directly connected networks, local addresses).
    Kernel = 2,
    /// Installed at boot from static configuration.
    Boot = 3,
    /// Installed by administrator (static route via `ip route add`).
    Static = 4,
    /// Installed by the OSPF routing daemon.
    Ospf = 8,
    /// Installed by the RIP routing daemon.
    Rip = 9,
    /// Installed by the BGP routing daemon (Zebra/Quagga/FRR).
    Zebra = 11,
    /// Installed by the BIRD routing daemon.
    Bird = 12,
    /// Installed by a DHCP client.
    Dhcp = 16,
    /// Installed by a BPF program (lightweight tunnel, custom routing).
    Bpf = 200,
}

bitflags! {
    /// Route flags.
    pub struct RouteFlags: u32 {
        /// Route was installed by an ICMP redirect and may expire.
        const REDIRECT    = 1 << 0;
        /// Notify userspace when this route is used (for route monitoring).
        const NOTIFY      = 1 << 1;
        /// Route uses a cached gateway (PMTU entry).
        const CACHE       = 1 << 2;
        /// Route is an on-link route (gateway is on the directly attached network).
        const ONLINK      = 1 << 3;
        /// Route is a link-prefixed route (prefix is assigned to a link, not a host).
        const PREFIX_RT   = 1 << 4;
    }
}

bitflags! {
    /// Next-hop flags.
    pub struct NextHopFlags: u16 {
        /// Next-hop is currently dead (link down or neighbor unreachable).
        /// Excluded from ECMP distribution until restored.
        const DEAD        = 1 << 0;
        /// Next-hop is on-link (no gateway resolution needed).
        const ONLINK      = 1 << 1;
        /// Next-hop is a tunnel (requires encapsulation via `encap_type`).
        const ENCAP       = 1 << 2;
    }
}

/// Encapsulation type for tunnel routes.
#[repr(u8)]
pub enum EncapType {
    /// No encapsulation.
    None = 0,
    /// MPLS encapsulation (push label stack).
    Mpls = 1,
    /// BPF lightweight tunnel (custom encapsulation via BPF program).
    Bpf = 2,
    /// SEG6 (Segment Routing over IPv6).
    Seg6 = 3,
}

/// Standard routing table IDs (matching Linux RT_TABLE_* constants).
pub const RT_TABLE_UNSPEC: u32 = 0;
pub const RT_TABLE_DEFAULT: u32 = 253;
pub const RT_TABLE_MAIN: u32 = 254;
pub const RT_TABLE_LOCAL: u32 = 255;

15.1.4.2 Policy Routing Rules

/// Policy routing rule (evaluated for every packet to select the routing table).
///
/// Rules are evaluated in priority order (ascending). The first matching rule
/// determines which routing table to consult. If no rule matches, the default
/// chain (local -> main -> default) is used.
///
/// Corresponds to `ip rule add` commands and `RTM_NEWRULE`/`RTM_DELRULE` netlink
/// messages (Section 15.2.1).
///
/// **Performance**: For hosts without custom policy rules (the common case), rule
/// evaluation is skipped entirely — `RouteTable::default_chain` is used directly
/// (a static array lookup, ~1 cache miss). For hosts with policy rules, rules are
/// stored in a sorted `Vec` and evaluated linearly. The typical rule count is 3-20
/// (Linux installs 3 rules by default), so a linear scan over the hot match fields
/// is the fastest option at this scale.
///
/// **Scalability beyond 64 rules**: Large rule sets are uncommon but do occur in
/// VPN concentrators, multi-tenant routers with per-tenant tables, and
/// Kubernetes-style per-pod policies (50-200 rules). To avoid a per-packet linear
/// scan in those deployments, the implementation switches automatically to an
/// interval trie keyed on (src_prefix, dst_prefix, tos) tuples once
/// `rules.len() > POLICY_TRIE_THRESHOLD` (64). The trie lookup costs O(W) bit
/// steps (W = 128 for IPv6) — a fixed constant independent of the rule count N —
/// and touches ~3-4 cache lines, so it overtakes the linear scan once N exceeds
/// the threshold.
pub struct FibRule {
    /// Rule priority. Lower = higher precedence.
    ///
    /// Linux default rules: 0 (local table), 32766 (main table), 32767 (default
    /// table). User rules are typically in the range 1-32765.
    pub priority: u32,

    /// Source address prefix to match. `src_len == 0` means "any source".
    pub src: u128,
    /// Source prefix length (0-128). 0 = match all sources.
    pub src_len: u8,

    /// Destination address prefix to match. `dst_len == 0` means "any destination".
    pub dst: u128,
    /// Destination prefix length (0-128). 0 = match all destinations.
    pub dst_len: u8,

    /// Incoming interface name to match. Empty string means "any interface".
    /// Matches Linux `ip rule add iif <name>`.
    pub iif: InterfaceName,

    /// Outgoing interface name to match. Empty string means "any interface".
    /// Matches Linux `ip rule add oif <name>`.
    pub oif: InterfaceName,

    /// Packet mark to match (`NetBuf::mark`). `mark_mask == 0` means "any mark".
    /// Matches Linux `ip rule add fwmark <value>/<mask>`.
    pub mark: u32,
    /// Mask applied to `NetBuf::mark` before comparing with `mark`.
    pub mark_mask: u32,

    /// IP protocol to match (e.g., 6 for TCP, 17 for UDP). 0 = any protocol.
    /// Matches Linux `ip rule add ipproto <proto>`.
    pub ip_proto: u8,

    /// Source port range to match. Both 0 = any port.
    /// Matches Linux `ip rule add sport <start>-<end>`.
    pub sport_start: u16,
    pub sport_end: u16,

    /// Destination port range to match. Both 0 = any port.
    pub dport_start: u16,
    pub dport_end: u16,

    /// UID range to match (originating process UID). Both 0 = any UID.
    /// Matches Linux `ip rule add uidrange <start>-<end>`.
    pub uid_start: u32,
    pub uid_end: u32,

    /// Action to take when this rule matches.
    pub action: FibRuleAction,

    /// Target routing table ID (for `FibRuleAction::Lookup`).
    ///
    /// When `action == Lookup`, the packet is looked up in this table.
    /// Ignored for other actions.
    pub table: u32,

    /// Whether this rule suppresses prefix lengths below a threshold.
    ///
    /// `suppress_prefixlen`: if the lookup in `table` returns a match with a
    /// prefix length <= this value, the match is suppressed and the next rule
    /// is tried. Used to implement "don't use the default route from this table"
    /// (set `suppress_prefixlen = 0` to suppress /0 default routes).
    /// Value 0xFFFF disables suppression.
    pub suppress_prefixlen: u16,
}

/// Policy routing rule action.
#[repr(u8)]
pub enum FibRuleAction {
    /// Look up the packet in the specified routing table (`FibRule::table`).
    Lookup = 1,
    /// Drop the packet (equivalent to blackhole route at rule level).
    Blackhole = 2,
    /// Drop and send ICMP Destination Unreachable.
    Unreachable = 3,
    /// Drop and send ICMP Administratively Prohibited.
    Prohibit = 4,
}
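The per-rule match predicate implied by the fields above can be sketched as follows. This is a simplified stand-in, not the spec's implementation: only a subset of `FibRule`'s fields is reproduced, and `prefix_match` is an illustrative helper.

```rust
/// Pared-down stand-in for the full `FibRule` (subset of match fields only).
struct FibRule {
    src: u128,
    src_len: u8,
    dst: u128,
    dst_len: u8,
    mark: u32,
    mark_mask: u32,
    sport_start: u16,
    sport_end: u16,
}

/// True if the top `len` bits of `addr` equal the top `len` bits of `prefix`.
fn prefix_match(addr: u128, prefix: u128, len: u8) -> bool {
    if len == 0 {
        return true; // len 0 = match everything
    }
    let shift = 128 - len as u32;
    (addr >> shift) == (prefix >> shift)
}

impl FibRule {
    /// First-match semantics: the caller scans rules in priority order and
    /// acts on the first rule for which this returns true.
    fn matches(&self, src: u128, dst: u128, mark: u32, sport: u16) -> bool {
        prefix_match(src, self.src, self.src_len)
            && prefix_match(dst, self.dst, self.dst_len)
            // mark_mask == 0 means "any mark"
            && (self.mark_mask == 0 || (mark & self.mark_mask) == self.mark)
            // both ports zero means "any source port"
            && ((self.sport_start == 0 && self.sport_end == 0)
                || (self.sport_start..=self.sport_end).contains(&sport))
    }
}
```

The interface, protocol, and UID fields follow the same zero-means-wildcard pattern and are elided here.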

15.1.4.3 Route Lookup Algorithm

/// Result of a FIB lookup, cached in `NetBuf::route_cache`.
///
/// Contains all information needed for packet forwarding without re-consulting
/// the routing table. The result is valid for the lifetime of the NetBuf (see
/// RCU safety note in `NetBuf::route_cache`).
#[repr(C)]
pub struct RouteLookupResult {
    /// The selected next-hop for this packet.
    ///
    /// For ECMP routes, this is the specific next-hop selected by
    /// `flow_hash % total_weight`. For single-hop routes, this is the sole next-hop.
    pub next_hop: NextHop,

    /// Effective MTU for this path.
    ///
    /// `min(route.mtu, output_interface.mtu)`. If `route.mtu == 0`, this is just
    /// the output interface's MTU. Used for IP fragmentation decisions and TCP MSS
    /// clamping.
    pub mtu: u32,

    /// Preferred source address for locally-originated packets.
    ///
    /// Copied from `RouteEntry::prefsrc` if set, otherwise determined by the
    /// source address selection algorithm (RFC 6724 for IPv6, longest-match for IPv4).
    pub prefsrc: [u8; 16],

    /// Route type (Unicast, Local, Broadcast, Blackhole, etc.).
    /// Determines the forwarding action.
    pub route_type: RouteType,

    /// Table ID that provided this result. Used for debugging and netlink reporting.
    pub table_id: u32,
}

impl RouteTable {
    /// Perform a FIB lookup for the given destination address.
    ///
    /// This is the primary packet-path entry point. The algorithm:
    ///
    /// 1. **Rule evaluation**: If policy rules are configured, iterate `rules` in
    ///    priority order. For each matching rule:
    ///    a. If `action == Lookup`: look up the destination in the specified table.
    ///       If a route is found and not suppressed by `suppress_prefixlen`, return it.
    ///    b. If `action == Blackhole/Unreachable/Prohibit`: return immediately with
    ///       the corresponding `RouteType`.
    /// 2. **Default chain**: If no rule matched (or no custom rules exist), look up
    ///    the destination in each table in `default_chain` order (local, main, default).
    ///    Return the first match.
    /// 3. **No route**: If no table contains a matching route, return
    ///    `Err(KernelError::NetUnreachable)`.
    ///
    /// **ECMP selection**: When the matching `RouteEntry` has a `NextHopGroup::Multipath`,
    /// the specific next-hop is selected using `flow_hash % total_weight` (see
    /// `NextHopGroup::Multipath` documentation). Dead next-hops (with `DEAD` flag)
    /// are excluded and their weight is subtracted from `total_weight` for the
    /// selection computation.
    ///
    /// **Performance**: For the common case (no policy rules, single default route),
    /// lookup is: 1 array access (default_chain[0] = table 255) + 1 trie walk
    /// (local table miss, typically 1-2 nodes) + 1 array access (default_chain[1]
    /// = table 254) + 1 trie walk (main table hit). Total: ~4-6 memory accesses,
    /// ~200-400 ns with warm cache.
    ///
    /// # Preconditions
    /// - Caller holds `rcu_read_lock()` (packet processing context).
    /// - `dst` is a 128-bit address (IPv4 uses IPv4-mapped-IPv6 encoding).
    ///
    /// # Cross-reference
    /// - `bpf_fib_lookup()` BPF helper (Section 15.2.2): wraps this function,
    ///   requires `CAP_NET_ROUTE_READ` capability in the BPF domain.
    /// - `NetBuf::route_cache` (Section 15.1.3): the result is stored here to
    ///   avoid repeated lookups for the same packet.
    pub fn lookup(
        &self,
        dst: u128,
        src: u128,
        mark: u32,
        ifindex: u32,
        protocol: u8,
        sport: u16,
        dport: u16,
        uid: u32,
        flow_hash: u32,
    ) -> Result<RouteLookupResult, KernelError>;
}
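The ECMP selection rule described above (`flow_hash % total_weight` with DEAD next-hops excluded) can be sketched concretely; `Hop` here is an illustrative stand-in for the spec's `NextHop`/`NextHopFlags` types.

```rust
/// Illustrative stand-in for one member of a multipath next-hop group.
struct Hop {
    weight: u16,
    dead: bool,
}

/// Weighted ECMP selection: hash the flow onto [0, live_total) and walk the
/// live next-hops until the point falls inside one hop's weight band.
/// DEAD hops contribute nothing to the total, matching the spec's rule that
/// their weight is subtracted before the modulo.
fn select_hop(hops: &[Hop], flow_hash: u32) -> Option<usize> {
    let live_total: u32 = hops.iter().filter(|h| !h.dead).map(|h| h.weight as u32).sum();
    if live_total == 0 {
        return None; // all next-hops dead: no usable route
    }
    let mut point = flow_hash % live_total;
    for (i, h) in hops.iter().enumerate() {
        if h.dead {
            continue;
        }
        if point < h.weight as u32 {
            return Some(i);
        }
        point -= h.weight as u32;
    }
    unreachable!("point is always < live_total")
}
```

Because the selection depends only on `flow_hash`, all packets of a flow stick to the same next-hop as long as the set of live hops is stable.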

impl FibTrie {
    /// Longest-prefix-match lookup in this trie.
    ///
    /// Walks the trie from root to leaf, following the path determined by the
    /// destination address bits. At each node, if `node.route` is `Some`, it is
    /// recorded as the current best match. The walk continues into the child
    /// determined by the next bit of `dst` (left for 0, right for 1). When no
    /// more children exist (or the next bit's child is `None`), the most recent
    /// recorded match is returned.
    ///
    /// **Complexity**: O(W) where W is the prefix length of the matching route.
    /// For IPv4, W <= 32; for IPv6, W <= 128. In practice, path compression reduces
    /// the number of actual node visits to 5-20 for typical routing tables.
    ///
    /// Returns `None` if no prefix in this trie matches `dst`.
    pub fn longest_prefix_match(&self, dst: u128) -> Option<&RouteEntry>;

    /// Insert a route into this trie.
    ///
    /// Creates a new trie version by path-copying: only the nodes on the path from
    /// root to the inserted prefix are newly allocated; all other nodes are shared
    /// with the previous version via `Arc`. Returns the new root.
    ///
    /// If a route with the same `(dst_prefix, dst_prefix_len)` already exists, it
    /// is replaced. The old route's memory is freed after the RCU grace period.
    ///
    /// # Usage pattern (RCU update)
    /// ```
    /// let mut new_table = route_table.clone();  // Arc-shared nodes, O(1)
    /// let new_trie = old_trie.insert(entry);    // path-copy, O(W) where W = key width
    /// new_table.tables.insert(table_id, new_trie);
    /// namespace.routes.update(new_table);        // RCU swap
    /// ```
    pub fn insert(&self, entry: RouteEntry) -> FibTrie;

    /// Remove a route from this trie.
    ///
    /// Same path-copying strategy as `insert()`. Returns the new root, or the
    /// same trie unchanged if no matching route was found.
    pub fn remove(&self, dst_prefix: u128, dst_prefix_len: u8) -> FibTrie;
}

15.1.4.3a FIB Trie Construction: Level-Compressed Trie (LC-Trie) Reference Algorithm

Reference: Nilsson & Karlsson, "IP-Address Lookup Using LC-Tries" (IEEE J-SAC, June 1999).

UmkaOS uses a path-compressed Patricia trie (described above) rather than an LC-trie, because the LC-trie's level-compression step requires periodic rebalancing incompatible with RCU clone-and-swap persistence. This section specifies the LC-trie algorithm for reference — it is used in the /proc/net/fib_trie compatibility dump (Section 15.5) and serves as the authoritative definition for the level-compression rationale cited in the FibTrie documentation.

LC-trie data structures:

/// A node in the level-compressed FIB trie.
pub enum FibNode {
    /// Internal branching node: branch on `stride` bits starting at `skip` bits from MSB.
    Branch {
        /// Number of bits to skip (path compression: skip nodes with one child).
        skip: u8,
        /// Number of bits in this level's branch key (level compression).
        stride: u8,
        /// 2^stride children. Index by extracting `stride` bits at position `skip`.
        children: Box<[FibNode]>,
    },
    /// Leaf node: matched prefix, points to nexthop table.
    Leaf {
        nexthop: NextHopId,
    },
}

pub struct FibTrie {
    pub root: FibNode,
    /// Total number of prefixes stored.
    pub prefix_count: u32,
}

Path compression: if an internal node has exactly one non-empty child, skip it (increment skip counter). This eliminates chains of single-child nodes common in sparse prefix distributions.

Level compression: if the subtree below a node is fully populated to depth k (all 2^k branches present), the top k single-bit levels are replaced by one node with `stride = k`, turning k sequential one-bit branches into a single k-bit table lookup.

Construction algorithm:

1. Insert all prefixes as leaf nodes at their natural bit depth (an uncompressed binary trie).
2. Bottom-up pass: absorb each single-child internal node into its child's `skip` counter (path compression).
3. Bottom-up pass: where a subtree is fully populated to depth k, merge its top k levels into one `Branch` with `stride = k` (level compression); subtrees whose leaves all carry the same nexthop collapse to a single `Leaf`.
4. Repeat passes 2-3 until no further compression is possible.

Lookup (O(W) steps, W = 32 or 128):

fn lookup(trie: &FibTrie, dest: u128) -> NextHopId {
    let mut node = &trie.root;
    let mut bit_pos = 0usize;
    loop {
        match node {
            FibNode::Leaf { nexthop } => return *nexthop,
            FibNode::Branch { skip, stride, children } => {
                // Path compression: advance past the skipped bits.
                bit_pos += *skip as usize;
                // Level compression: branch on `stride` bits in one step.
                let index = extract_bits(dest, bit_pos, *stride as usize);
                node = &children[index];
                bit_pos += *stride as usize;
            }
        }
    }
}

Why O(W) not O(log N): the trie walk terminates when a Leaf is reached or all W address bits are consumed. Each loop iteration advances bit_pos by at least skip + stride >= 1 bits (stride >= 1 for any branching node). Therefore the loop executes at most W iterations regardless of prefix count N. A 4M-entry trie takes the same O(W) steps as a 1K-entry trie.
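The `extract_bits` helper the lookup relies on can be written directly over the 128-bit address representation used throughout this chapter (a sketch; `bit_pos + n <= 128` and `n < 128` are assumed, which any legal skip/stride combination satisfies):

```rust
/// Extract `n` bits of `addr`, starting `bit_pos` bits from the most
/// significant bit, as a child-array index.
fn extract_bits(addr: u128, bit_pos: usize, n: usize) -> usize {
    if n == 0 {
        return 0;
    }
    // Shift the field down to the low bits, then mask off everything above it.
    ((addr >> (128 - bit_pos - n)) & ((1u128 << n) - 1)) as usize
}
```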

15.1.4.4 VRF Integration

Each VRF (Virtual Routing and Forwarding) instance is a separate routing table. When a VRF is created (via ip link add vrf0 type vrf table 100), a new entry is added to RouteTable::tables with the specified table ID. A policy rule is also added that directs packets received on VRF-enslaved interfaces to the VRF's table:

ip rule add iif <vrf-interface> table <vrf-table-id>
ip rule add oif <vrf-interface> table <vrf-table-id>

This integrates naturally with the policy routing rule evaluation described above: when a packet arrives on a VRF-enslaved interface, the matching rule directs the lookup to the VRF's private routing table, providing L3 domain isolation.

Cross-reference: VRF (Section 15.2, lines 742-744), NetNamespace (Section 16.1.1), policy routing rules (Section 15.2.1, RTM_NEWRULE).

15.1.4.5 Route Management via Netlink

Route management is performed via NETLINK_ROUTE (Section 15.2.1):

  • RTM_NEWROUTE: Insert or replace a route. Translates to FibTrie::insert().
  • RTM_DELROUTE: Remove a route. Translates to FibTrie::remove().
  • RTM_GETROUTE: Perform a FIB lookup and return the result (used by ip route get). Translates to RouteTable::lookup().
  • RTM_NEWRULE / RTM_DELRULE: Add or remove a policy routing rule. Translates to insertion/removal in RouteTable::rules (followed by re-sort).

All write operations are serialized by NetNamespace::config_lock (Section 16.1.1). After mutation, the entire RouteTable is published via RcuCell::update(). The old RouteTable's trie nodes are freed after the RCU grace period, but most nodes are shared with the new version (via Arc) and are only freed when their reference count reaches zero.

15.1.4.6 bpf_fib_lookup() Integration

The bpf_fib_lookup() BPF helper (Section 15.2.2, capability: CAP_NET_ROUTE_READ) wraps RouteTable::lookup() for BPF programs. The helper:

  1. Reads the destination address and other lookup keys from the BPF program's packet context (NetBuf metadata accessible via the BPF domain's read-only mapping, or from function parameters for TC/XDP programs).
  2. Calls RouteTable::lookup() under rcu_read_lock() in the umka-net domain (cross-domain helper invocation, ~23 cycles for domain switch on x86-64).
  3. Writes the result (RouteLookupResult) to BPF-accessible memory.
  4. Returns BPF_FIB_LKUP_RET_SUCCESS (0) on match, or one of the following error codes:
     • BPF_FIB_LKUP_RET_BLACKHOLE (1)
     • BPF_FIB_LKUP_RET_UNREACHABLE (2)
     • BPF_FIB_LKUP_RET_PROHIBIT (3)
     • BPF_FIB_LKUP_RET_NOT_FWDED (4): lookup succeeded but packet is local
     • BPF_FIB_LKUP_RET_FWD_DISABLED (5): forwarding disabled on interface
     • BPF_FIB_LKUP_RET_NO_NEIGH (7): next-hop neighbor not resolved

This matches the Linux bpf_fib_lookup() return value semantics, ensuring compatibility with existing XDP programs (e.g., Cilium, Katran) that use this helper for fast-path routing decisions.
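The mapping from lookup result to helper return code is a direct translation (a sketch; only the route types that produce distinct codes are shown, and the constants mirror the Linux values listed above):

```rust
#[allow(dead_code)]
enum RouteType { Unicast, Local, Blackhole, Unreachable, Prohibit }

const BPF_FIB_LKUP_RET_SUCCESS: u32 = 0;
const BPF_FIB_LKUP_RET_BLACKHOLE: u32 = 1;
const BPF_FIB_LKUP_RET_UNREACHABLE: u32 = 2;
const BPF_FIB_LKUP_RET_PROHIBIT: u32 = 3;
const BPF_FIB_LKUP_RET_NOT_FWDED: u32 = 4;

/// Translate a successful FIB lookup's RouteType into the BPF helper's
/// return code. Error paths like NO_NEIGH are decided later, at neighbor
/// resolution, and are not derived from the RouteType.
fn fib_ret(rt: RouteType) -> u32 {
    match rt {
        RouteType::Unicast => BPF_FIB_LKUP_RET_SUCCESS,
        RouteType::Local => BPF_FIB_LKUP_RET_NOT_FWDED,
        RouteType::Blackhole => BPF_FIB_LKUP_RET_BLACKHOLE,
        RouteType::Unreachable => BPF_FIB_LKUP_RET_UNREACHABLE,
        RouteType::Prohibit => BPF_FIB_LKUP_RET_PROHIBIT,
    }
}
```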


#### 15.1.4.7 Neighbor Subsystem (ARP/NDP)

Every IP packet ultimately requires L2 (link-layer) address resolution — mapping
an IP next-hop address to a hardware (MAC) address. The neighbor subsystem manages
this mapping for both IPv4 (ARP — RFC 826) and IPv6 (NDP — RFC 4861).

```rust
/// Neighbor cache entry — maps an L3 (IP) address to an L2 (MAC) address.
///
/// Each entry tracks the state of neighbor reachability and the resolved
/// hardware address. Entries transition through a state machine matching
/// RFC 4861 Section 7.3 (NDP) and RFC 826 (ARP).
pub struct NeighborEntry {
    /// L3 (network) address. IPv4 (4 bytes) or IPv6 (16 bytes).
    pub ip_addr: IpAddr,

    /// L2 (hardware) address. Ethernet MAC (6 bytes). Valid only in
    /// REACHABLE, STALE, DELAY, PROBE states.
    pub hw_addr: [u8; 6],

    /// Current state in the neighbor reachability state machine.
    pub state: AtomicU8,  // NeighborState discriminant

    /// Output network interface index.
    pub ifindex: u32,

    /// Timestamp of last confirmed reachability (nanoseconds, monotonic).
    /// Used for REACHABLE→STALE timeout (default 30 seconds for IPv6,
    /// configurable via `base_reachable_time` sysctl).
    pub confirmed_ns: AtomicU64,

    /// Number of unanswered solicitations sent in PROBE state.
    pub probes_sent: AtomicU8,

    /// Queue of packets waiting for address resolution (INCOMPLETE state).
    /// Maximum 3 packets queued; excess are dropped (matching Linux behavior).
    pub pending_queue: SpinLock<ArrayVec<NetBufHandle, 3>>,

    /// Hash table linkage.
    pub hash_node: HashListNode,

    /// RCU head for deferred freeing.
    pub rcu_head: RcuHead,

    /// Reference count.
    pub refcount: AtomicU32,
}

/// Neighbor reachability states (RFC 4861 Section 7.3.2).
#[repr(u8)]
pub enum NeighborState {
    /// Address resolution in progress. Solicitations are being sent.
    /// Packets to this neighbor are queued (up to 3).
    Incomplete = 0,
    /// L2 address is known and recently confirmed reachable.
    /// Timeout: base_reachable_time (default 30s, randomized +/-50%).
    Reachable = 1,
    /// Reachable timeout expired. L2 address is probably still valid
    /// but has not been confirmed recently.
    Stale = 2,
    /// Traffic was sent to this neighbor from STALE state.
    /// Wait delay_first_probe_time (default 5s) before probing.
    Delay = 3,
    /// Actively probing. Unicast solicitations sent every retrans_timer
    /// (default 1s). Max ucast_solicit (default 3) probes before FAILED.
    Probe = 4,
    /// Address resolution failed. All queued packets dropped with
    /// EHOSTUNREACH. Entry may be garbage-collected.
    Failed = 5,
    /// Permanently configured (static ARP entry / `ip neigh add`).
    /// Never times out or transitions.
    Permanent = 6,
}

/// Per-namespace neighbor table.
pub struct NeighborTable {
    /// Hash table mapping IP addresses to neighbor entries.
    /// RCU-protected for lockless lookup on the packet forwarding path.
    /// Keyed by (ifindex, ip_addr) for interface-scoped lookups.
    pub entries: RcuHashTable<NeighborEntry>,

    /// Garbage collection timer. Runs periodically to remove FAILED and
    /// expired STALE entries.
    pub gc_timer: Timer,

    /// Configuration parameters (sysctl-configurable per-interface).
    pub config: NeighborConfig,
}

pub struct NeighborConfig {
    /// Time after which a REACHABLE entry transitions to STALE.
    /// Default: 30_000_000_000 ns (30 seconds). Randomized +/-50%.
    pub base_reachable_time_ns: u64,
    /// Delay before first probe in DELAY state.
    /// Default: 5_000_000_000 ns (5 seconds).
    pub delay_first_probe_ns: u64,
    /// Interval between unicast probes in PROBE state.
    /// Default: 1_000_000_000 ns (1 second).
    pub retrans_timer_ns: u64,
    /// Maximum unicast probes before transition to FAILED.
    pub ucast_solicit: u8,  // Default: 3
    /// Maximum multicast probes (INCOMPLETE state).
    pub mcast_solicit: u8,  // Default: 3
    /// Maximum entries in the neighbor table. Default: 1024 (gc_thresh3).
    pub gc_thresh3: u32,
}
```

State machine transitions:

                   ┌─────────────┐
    ┌──────────────│ INCOMPLETE  │──── resolution timeout ────► FAILED
    │              └──────┬──────┘
    │                     │ reply received
    │                     ▼
    │              ┌─────────────┐
    │              │  REACHABLE  │
    │              └──────┬──────┘
    │                     │ reachable_time expires
    │                     ▼
    │              ┌─────────────┐
    │              │    STALE    │
    │              └──────┬──────┘
    │                     │ traffic sent to neighbor
    │                     ▼
    │              ┌─────────────┐
    │              │    DELAY    │
    │              └──────┬──────┘
    │                     │ delay_first_probe expires
    │                     ▼
    │              ┌─────────────┐
    │              │    PROBE    │──── max probes exceeded ────► FAILED
    │              └──────┬──────┘
    │                     │ reply received
    │                     ▼
    └─────────────── REACHABLE (confirmed)
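The diagram reduces to a set of pure transition functions (a sketch: timer re-arming, solicitation transmission, and the pending queue are elided, and the STALE→DELAY step, which is triggered by outgoing traffic rather than a timer or reply, is shown separately):

```rust
#[derive(Clone, Copy, PartialEq, Debug)]
enum NState { Incomplete, Reachable, Stale, Delay, Probe, Failed, Permanent }

/// A confirming reply (ARP reply / Neighbor Advertisement) arrived.
fn on_reply(s: NState) -> NState {
    match s {
        NState::Permanent => s,  // static entries never transition
        _ => NState::Reachable,
    }
}

/// The state's timer expired; `probes` counts unanswered solicitations.
fn on_timer(s: NState, probes: u8, max_probes: u8) -> NState {
    match s {
        NState::Incomplete | NState::Probe if probes >= max_probes => NState::Failed,
        NState::Reachable => NState::Stale,  // base_reachable_time expired
        NState::Delay => NState::Probe,      // delay_first_probe expired
        other => other,                      // retransmit the solicitation, stay
    }
}

/// Outgoing traffic touched a STALE entry.
fn on_output(s: NState) -> NState {
    if s == NState::Stale { NState::Delay } else { s }
}
```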

ARP operation (IPv4): On cache miss, the VFS/routing layer calls neighbor_resolve() which sends an ARP request (broadcast on the local network) and queues the packet. When the ARP reply arrives, the neighbor entry transitions to REACHABLE, the queued packets are transmitted, and subsequent packets use the cached MAC address directly.

NDP operation (IPv6): Similar to ARP but uses ICMPv6 Neighbor Solicitation (multicast to solicited-node address) and Neighbor Advertisement. NDP also handles Router Solicitation/Advertisement for default gateway discovery and Duplicate Address Detection (DAD).

Integration with routing (Section 15.1.4): The routing table lookup returns a next-hop IP address. NetBuf::route_cache stores the RouteLookupResult which includes the next-hop. Before transmission, the output path calls neighbor_lookup(ifindex, next_hop) to resolve the L2 address.


/// Socket address structure. Matches Linux's `struct sockaddr_storage` (128 bytes)
/// to accommodate all address families (AF_INET, AF_INET6, AF_UNIX, etc.).
/// The `family` field discriminates the actual address type.
#[repr(C)]
pub struct SockAddr {
    /// Address family (AF_INET, AF_INET6, AF_UNIX, etc.).
    pub family: u16,
    /// Address data. Interpretation depends on family:
    /// - AF_INET: bytes [2..8] = struct sockaddr_in (port, addr, padding)
    /// - AF_INET6: bytes [2..28] = struct sockaddr_in6 (port, flowinfo, addr, scope_id)
    /// - AF_UNIX: bytes [2..108] = struct sockaddr_un (path)
    pub data: [u8; 126],
}
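Decoding an AF_INET address from the flat layout can be sketched as follows (`decode_inet` is an illustrative helper, not part of the spec; note that `data` begins at byte 2 of the 128-byte storage, so the sockaddr_in port occupies `data[0..2]` and the address `data[2..6]`):

```rust
const AF_INET: u16 = 2;

/// Extract (port, IPv4 address) from a `SockAddr`-style buffer.
/// The port is stored in network byte order, per `struct sockaddr_in`.
fn decode_inet(family: u16, data: &[u8]) -> Option<(u16, [u8; 4])> {
    if family != AF_INET || data.len() < 6 {
        return None;
    }
    let port = u16::from_be_bytes([data[0], data[1]]);
    let addr = [data[2], data[3], data[4], data[5]];
    Some((port, addr))
}
```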

/// Socket factory trait. Each protocol registers a factory during initialization.
/// The factory creates socket instances when `socket()` syscall is invoked.
pub trait SocketFactory: Send + Sync {
    /// Create a new socket instance.
    /// `family` is AF_INET or AF_INET6. `sock_type` is SOCK_STREAM, SOCK_DGRAM, etc.
    /// `protocol` is IPPROTO_TCP, IPPROTO_UDP, etc. (or 0 for default).
    fn create_socket(
        &self,
        family: AddressFamily,
        sock_type: SocketType,
        protocol: u16,
    ) -> Result<SlabRef<dyn SocketOps>, KernelError>;
}

/// Address family constants (matches Linux AF_* values).
#[repr(u16)]
pub enum AddressFamily {
    Unspec = 0,   // AF_UNSPEC
    Unix = 1,     // AF_UNIX / AF_LOCAL
    Inet = 2,     // AF_INET (IPv4)
    Inet6 = 10,   // AF_INET6 (IPv6)
    Netlink = 16, // AF_NETLINK
    // ... other families as needed
}

/// Socket type constants (matches Linux SOCK_* values).
#[repr(u32)]
pub enum SocketType {
    Stream = 1,    // SOCK_STREAM (TCP)
    Dgram = 2,     // SOCK_DGRAM (UDP)
    Raw = 3,       // SOCK_RAW
    Seqpacket = 5, // SOCK_SEQPACKET (SCTP)
    // ... other types as needed
}

// SCTP is supported as a transport protocol (SOCK_SEQPACKET, multihoming). The
// `SocketOps` trait and congestion control framework are designed to accommodate
// SCTP's multi-stream and multihoming semantics: `connect()` supports multiple
// addresses (SCTP associations), `send()`/`recv()` carry stream identifiers via
// ancillary data (cmsg), and the congestion controller interface
// ([Section 15.1.5](#1515-congestion-control-framework)) supports per-path CWND
// (SCTP requires independent congestion state per destination address).
// Full SCTP specification: see [Section 15.7](#157-sctp-stream-control-transmission-protocol).

#### 15.1.4.8 TCP Control Block and State Machine

**TcpState enum** (the 11 connection states of RFC 793, with TIME_WAIT carrying its 2×MSL expiry deadline):

```rust
pub enum TcpState {
    Closed,
    Listen,
    SynSent,
    SynReceived,
    Established,
    FinWait1,
    FinWait2,
    CloseWait,
    Closing,
    LastAck,
    TimeWait { expiry: Instant },  // 2 × MSL = 120s with the default MSL of 60s; configurable
}
```

TcpCb — TCP control block (per-socket, ~256 bytes):

```rust
pub struct TcpCb {
    pub state: TcpState,

    // === Send-side sequence variables (RFC 793 §3.2) ===
    pub snd_una: u32,       // oldest unacknowledged sequence number
    pub snd_nxt: u32,       // next sequence number to send
    pub snd_wnd: u32,       // current send window (from remote receiver)
    pub snd_up: u32,        // urgent pointer (send side)
    pub snd_wl1: u32,       // sequence number of last window update
    pub snd_wl2: u32,       // ack number of last window update
    pub iss: u32,           // initial send sequence number

    // === Receive-side sequence variables ===
    pub rcv_nxt: u32,       // next expected receive sequence number
    pub rcv_wnd: u32,       // current receive window advertised to peer
    pub rcv_up: u32,        // urgent pointer (receive side)
    pub irs: u32,           // initial receive sequence number

    // === RTT estimation (RFC 6298 Jacobson/Karels) ===
    pub srtt_us: u32,       // smoothed RTT estimate (microseconds × 8)
    pub rttvar_us: u32,     // RTT variance (microseconds × 4)
    pub rto_us: u32,        // current RTO (microseconds), clamped [200ms, 120s]
    pub rtt_seq: u32,       // sequence number being timed

    // === Congestion control (via CongestionOps trait, Section 15.1.5) ===
    pub cwnd: u64,          // congestion window (bytes); u64 to support high-BDP paths (>4.3 GB at 400 Gbps/100ms RTT)
    pub ssthresh: u64,      // slow-start threshold; u64 matches cwnd width
    /// Stateless algorithm descriptor — a &'static reference to one of the registered
    /// CongestionOps implementations. No per-connection heap allocation; the ops
    /// pointer is 8 bytes. Per-connection state lives in `cong_priv` below.
    pub cong_ops: &'static dyn CongestionOps,
    /// 64-byte inline per-connection state for the congestion algorithm.
    /// Algorithms with ≤64 bytes of state (Reno, CUBIC, Vegas) store directly here.
    /// Larger algorithms (BBR v2) store a heap box pointer in the first 8 bytes and
    /// free it in CongestionOps::release(). The engine zeroes cong_priv before init().
    pub cong_priv: CongPriv,

    // === SACK state (RFC 2018) ===
    pub sack_ok: bool,
    pub sack_scoreboard: SackScoreboard,
    pub reorder_head: Option<NetBufHandle>,   // out-of-order receive queue head

    // === Retransmission queue ===
    pub retrans_head: Option<NetBufHandle>,   // oldest unacknowledged segment
    pub retrans_stamp: Instant,              // timestamp of last retransmission

    // === Timer handles (Section 6.5.4 timer wheel) ===
    pub retransmit_timer: TimerHandle,
    pub delack_timer: TimerHandle,
    pub keepalive_timer: TimerHandle,
    pub timewait_timer: TimerHandle,
    pub zwp_timer: TimerHandle,             // zero-window probe

    // === Options negotiated at connect time ===
    pub ts_ok: bool,        // TCP timestamps (RFC 7323)
    pub wscale_ok: bool,
    pub rcv_wscale: u8,     // our receive window scale
    pub snd_wscale: u8,     // peer's send window scale
    pub mss_clamp: u16,     // effective MSS (min of ours and peer's)
}
```

SACK scoreboard (RFC 2018 + RFC 6675):

```rust
pub struct SackScoreboard {
    /// Up to 4 SACK blocks per ACK (RFC 2018 §3 limit).
    /// Each block marks received bytes above snd_una.
    pub blocks: ArrayVec<SackBlock, 4>,
    pub pipe: u32,          // RFC 6675 "pipe" variable (bytes in flight estimate)
    pub recovery_point: u32, // sequence number where recovery ends
    pub in_recovery: bool,
}
pub struct SackBlock { pub start: u32, pub end: u32 }
```
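Because TCP sequence numbers wrap modulo 2^32, scoreboard membership tests must use wrapping comparisons rather than plain `<`. A minimal sketch (helper names are illustrative):

```rust
/// True if `a` precedes `b` in sequence space (RFC 793 modular arithmetic).
fn seq_before(a: u32, b: u32) -> bool {
    (b.wrapping_sub(a) as i32) > 0
}

/// True if `seq` falls inside the SACK block [start, end) — i.e., the byte
/// at `seq` has been reported as received by the peer.
fn sacked(start: u32, end: u32, seq: u32) -> bool {
    !seq_before(seq, start) && seq_before(seq, end)
}
```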

TCP State Machine

The TCP state machine follows RFC 793 with RFC 1122 corrections and TCP Extensions (RFC 7323). UmkaOS implements all 11 states. The state is stored in TcpCb.state: TcpState.

State Transition Table

| From State | Event | Guard | Actions | To State |
|------------|-------|-------|---------|----------|
| CLOSED | passive open (listen) | | allocate TCB, set backlog queue | LISTEN |
| CLOSED | active open (connect) | | send SYN, start connect timer (75s) | SYN_SENT |
| LISTEN | recv SYN | backlog not full | send SYN+ACK, start SYN-ACK timer (1s×3) | SYN_RECEIVED |
| LISTEN | recv SYN | backlog full | drop or send SYN cookie | LISTEN |
| LISTEN | send (active data) | | send SYN, become active | SYN_SENT |
| SYN_SENT | recv SYN+ACK | ack matches our SYN seq | send ACK, cancel connect timer | ESTABLISHED |
| SYN_SENT | recv SYN (simultaneous open) | | send SYN+ACK | SYN_RECEIVED |
| SYN_SENT | connect timer expires | retries exhausted | delete TCB, notify app: ETIMEDOUT | CLOSED |
| SYN_RECEIVED | recv ACK | ack matches SYN+ACK seq | move to accept queue, notify app | ESTABLISHED |
| SYN_RECEIVED | SYN-ACK timer expires | retries < 3 | retransmit SYN+ACK | SYN_RECEIVED |
| SYN_RECEIVED | SYN-ACK timer expires | retries >= 3 | delete TCB | CLOSED |
| SYN_RECEIVED | recv RST | | delete TCB | CLOSED |
| ESTABLISHED | app close / shutdown(WR) | | send FIN, start FIN timer | FIN_WAIT_1 |
| ESTABLISHED | recv FIN | | send ACK, notify app: EOF | CLOSE_WAIT |
| ESTABLISHED | recv RST | | notify app: ECONNRESET, delete TCB | CLOSED |
| FIN_WAIT_1 | recv ACK (of our FIN) | | cancel FIN timer | FIN_WAIT_2 |
| FIN_WAIT_1 | recv FIN | | send ACK | CLOSING |
| FIN_WAIT_1 | recv FIN+ACK | ack covers our FIN | send ACK, start TIME_WAIT timer (2×MSL) | TIME_WAIT |
| FIN_WAIT_2 | recv FIN | | send ACK, start TIME_WAIT timer (2×MSL) | TIME_WAIT |
| FIN_WAIT_2 | FIN_WAIT_2 timer expires | 60s idle guard (RFC 1122 §4.2.2.20) | delete TCB | CLOSED |
| CLOSE_WAIT | app close | | send FIN, start FIN timer | LAST_ACK |
| CLOSING | recv ACK (of our FIN) | | start TIME_WAIT timer (2×MSL) | TIME_WAIT |
| LAST_ACK | recv ACK (of our FIN) | | delete TCB | CLOSED |
| TIME_WAIT | TIME_WAIT timer expires | 2×MSL elapsed | delete TCB | CLOSED |
| TIME_WAIT | recv SYN | seq > last seq seen | recycle TCB, send SYN+ACK (RFC 1122 §4.2.2.13) | SYN_RECEIVED |
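A few rows of the table translate directly into a transition function (a sketch: events are abbreviated, the segment-sending and timer actions from the Actions column are reduced to comments, and unlisted (state, event) pairs are treated as no-ops):

```rust
#[derive(Clone, Copy, PartialEq, Debug)]
enum St { Established, FinWait1, FinWait2, Closing, TimeWait, CloseWait, LastAck }

#[derive(Clone, Copy)]
enum Ev { AppClose, RecvFin, RecvAckOfFin, RecvFinAck }

fn step(s: St, e: Ev) -> St {
    match (s, e) {
        (St::Established, Ev::AppClose) => St::FinWait1,  // send FIN, start FIN timer
        (St::Established, Ev::RecvFin) => St::CloseWait,  // send ACK, EOF to app
        (St::FinWait1, Ev::RecvAckOfFin) => St::FinWait2, // cancel FIN timer
        (St::FinWait1, Ev::RecvFin) => St::Closing,       // send ACK
        (St::FinWait1, Ev::RecvFinAck) => St::TimeWait,   // send ACK, start 2×MSL timer
        (St::FinWait2, Ev::RecvFin) => St::TimeWait,      // send ACK, start 2×MSL timer
        (St::Closing, Ev::RecvAckOfFin) => St::TimeWait,  // start 2×MSL timer
        (St::CloseWait, Ev::AppClose) => St::LastAck,     // send FIN
        (s, _) => s,                                      // no transition
    }
}
```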

Timer Specifications

Each timer is stored as a TimerHandle in TcpCb (see struct definition above):

| Timer | Field in TcpCb | Duration | Action on expiry |
|-------|----------------|----------|------------------|
| retransmit | retransmit_timer | RTO (1s initial, exponential backoff, max 120s per RFC 6298) | Retransmit oldest unacked segment; double RTO |
| persist | zwp_timer | RTO-based (5s-60s per RFC 1122 §4.2.2.17) | Send zero-window probe segment |
| keepalive | keepalive_timer | tcp_keepalive_time sysctl (default 7200s); probe interval tcp_keepalive_intvl (default 75s) | Send keepalive probe; after tcp_keepalive_probes (default 9) failures → RST + ECONNRESET |
| time_wait | timewait_timer | 2×MSL (MSL default 60s → 2×MSL = 120s; configurable down to 1s) | Delete TCB |
| syn_ack | (in SYN queue entry) | 1s, max 3 retries (RFC 1122 §4.2.2.13) | Retransmit SYN+ACK; on 3rd expiry → discard |
| fin_wait2 | (FIN_WAIT_2 state) | tcp_fin_timeout sysctl (default 60s) | Force close the connection |
| connect | (SYN_SENT state) | 75s total (RFC 1122 §4.2.3.5) | Abort with ETIMEDOUT |
| delack | delack_timer | min(40ms, RTT/2) (RFC 1122 §4.2.3.2) | Send delayed ACK; cancelled when the ACK is piggybacked on outgoing data |

Retransmit timer algorithm (RFC 6298 Jacobson/Karels):

Initial RTO = 1s.
On new RTT sample M:
  if first measurement: SRTT = M; RTTVAR = M/2
  else: RTTVAR = (3/4) × RTTVAR + (1/4) × |SRTT - M|
        SRTT   = (7/8) × SRTT   + (1/8) × M
RTO = SRTT + max(G, 4 × RTTVAR)  where G = clock granularity (1ms)
Clamped to [200ms, 120s].
Exponential backoff on expiry: RTO ← RTO × 2, max retries = 15 (RFC 1122).
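
The estimator above translates directly into code. A minimal sketch (f64 milliseconds, illustrative type names; not the umka-net implementation):

```rust
/// RFC 6298 RTO estimator sketch. All times in milliseconds;
/// G (clock granularity) = 1 ms, RTO clamped to [200 ms, 120 s]
/// per the constants in this section.
struct RtoEstimator {
    srtt: Option<f64>, // smoothed RTT; None until the first sample
    rttvar: f64,       // RTT variance
    rto: f64,          // current retransmission timeout
}

impl RtoEstimator {
    fn new() -> Self {
        Self { srtt: None, rttvar: 0.0, rto: 1000.0 } // initial RTO = 1s
    }

    /// Feed a new RTT sample M (ms) and recompute RTO.
    fn on_sample(&mut self, m: f64) {
        const G: f64 = 1.0; // clock granularity (ms)
        match self.srtt {
            None => {
                // First measurement: SRTT = M, RTTVAR = M/2
                self.srtt = Some(m);
                self.rttvar = m / 2.0;
            }
            Some(srtt) => {
                // RTTVAR = 3/4·RTTVAR + 1/4·|SRTT − M|; SRTT = 7/8·SRTT + 1/8·M
                self.rttvar = 0.75 * self.rttvar + 0.25 * (srtt - m).abs();
                self.srtt = Some(0.875 * srtt + 0.125 * m);
            }
        }
        let srtt = self.srtt.unwrap();
        self.rto = (srtt + G.max(4.0 * self.rttvar)).clamp(200.0, 120_000.0);
    }

    /// Exponential backoff on retransmit timer expiry.
    fn on_expiry(&mut self) {
        self.rto = (self.rto * 2.0).min(120_000.0);
    }
}
```

A first sample of 100 ms yields SRTT = 100, RTTVAR = 50, and RTO = 100 + max(1, 200) = 300 ms; one timer expiry then doubles that to 600 ms.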

TIME_WAIT Optimization (TW Recycling)

UmkaOS uses a hash-bucketed TIME_WAIT table (separate from the main TCB hash) to avoid holding a full TcpCb for each TIME_WAIT connection. A TwEntry stores:

pub struct TwEntry {
    pub local:   SocketAddr,  // local IP:port
    pub remote:  SocketAddr,  // remote IP:port
    pub ts_val:  u32,         // last timestamp value seen (for RFC 7323 PAWS)
    pub ts_ecr:  u32,         // last echoed timestamp (for PAWS)
    pub rcv_nxt: u32,         // expected sequence number (to detect stale SYNs)
    pub expiry:  Instant,     // when to delete this entry
}

TIME_WAIT entries are stored in a per-CPU TwBucket ring (no hash collision; O(1) insert/expire). Expired entries are reaped lazily on the next connection to the same 4-tuple, or by the per-CPU timer interrupt that fires once per hz/8 (125ms) to reclaim slots.
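
The lazy-reap behavior can be modeled in a few lines, using u64 ticks in place of Instant and a trivial 4-tuple hash (TwSlot/TwBucket here are illustrative stand-ins, not the real per-CPU ring):

```rust
/// Sketch of a TIME_WAIT bucket with lazy expiry. Addresses are
/// simplified to (IPv4, port) tuples for illustration.
#[derive(Clone)]
struct TwSlot {
    local: (u32, u16),
    remote: (u32, u16),
    expiry: u64, // tick at which the entry dies
}

struct TwBucket {
    slots: Vec<Option<TwSlot>>,
}

impl TwBucket {
    fn new(cap: usize) -> Self {
        Self { slots: vec![None; cap] }
    }

    fn hash(local: (u32, u16), remote: (u32, u16), cap: usize) -> usize {
        // Trivial 4-tuple hash for the sketch.
        ((local.0 ^ remote.0) as usize ^ (local.1 ^ remote.1) as usize) % cap
    }

    fn insert(&mut self, slot: TwSlot) {
        let cap = self.slots.len();
        let i = Self::hash(slot.local, slot.remote, cap);
        self.slots[i] = Some(slot); // ring semantics: newest entry wins the slot
    }

    /// Lookup with lazy reap: an expired entry is reclaimed on the spot.
    fn lookup(&mut self, local: (u32, u16), remote: (u32, u16), now: u64) -> Option<&TwSlot> {
        let cap = self.slots.len();
        let i = Self::hash(local, remote, cap);
        let expired = matches!(&self.slots[i], Some(s) if s.expiry <= now);
        if expired {
            self.slots[i] = None; // lazy reap: reclaim the slot immediately
            return None;
        }
        match &self.slots[i] {
            Some(s) if s.local == local && s.remote == remote => Some(s),
            _ => None,
        }
    }
}
```

The real implementation adds the PAWS timestamp fields and the periodic (125 ms) reaper on top of this lookup-time reclamation.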

Simultaneous Open (RFC 793 §3.4)

Both sides send SYN without receiving one first → both enter SYN_SENT → both receive SYN → both transition to SYN_RECEIVED → both send SYN+ACK → receive the other's SYN+ACK → ESTABLISHED. This is rare (requires NAT traversal or carefully crafted sockets) but must be handled correctly.

RST Generation Rules (RFC 793 §3.4, RFC 5961)

  • Segment arrives on a LISTEN socket carrying an ACK (no matching connection): send RST. Segments with neither SYN nor ACK are silently dropped.
  • Incoming RST segments (blind RST injection guard per RFC 5961): accept the RST only if its sequence number exactly matches rcv_nxt; if it merely falls within the receive window, send a challenge ACK instead of aborting the connection.
  • RST on TIME_WAIT: silently ignore (prevents RST-based TIME_WAIT assassination).

ACK processing (fast retransmit / fast recovery):

On receiving ACK:
  if ACK advances snd_una:
    update snd_una, reset dupack counter
    call congestion.on_ack(bytes_acked)
    update RTT estimate if timestamp or RTT timer matches
    if SACK: update scoreboard, recompute pipe
  else if ACK == snd_una (duplicate ACK):
    dupacks++
    if dupacks == 3: fast retransmit (RFC 5681)
      enter fast recovery:
        ssthresh = max(cwnd/2, 2×mss)
        cwnd = ssthresh + 3×mss
        retransmit snd_una segment immediately
    else if SACK: run RFC 6675 loss recovery (SACK-based retransmit)
  on leaving fast recovery (new ACK): cwnd = ssthresh (deflate)

Tail Loss Probe (TLP, RFC 8985 RACK-TLP):

  • After the last segment is sent, if no ACK arrives within max(2×SRTT, 10ms), send one new segment (or retransmit the last unacked one) as a probe
  • Allows faster loss recovery without triggering a full RTO

Integration with existing UmkaOS components:

  • Uses the CongestionOps trait (Section 15.1.5) — pluggable CUBIC/BBR/RENO/custom
  • Uses the timer wheel (Section 6.5.4) for all eight TCP timers listed above
  • Uses NetBufHandle (Section 15.1.3) for the retransmit/reorder queues
  • Socket state lives in TcpSock, which embeds TcpCb plus the Socket from Section 15.1.2

15.1.4.9 TCP Zero-Copy Receive (SO_ZEROCOPY)

Zero-copy TCP receive delivers incoming data directly into user-space pages without an intermediate kernel copy, reducing CPU usage for high-throughput bulk transfers (file serving, video streaming, bulk data pipelines).

Enable:

int one = 1;
setsockopt(fd, SOL_SOCKET, SO_ZEROCOPY, &one, sizeof(one));

Receive:

ssize_t n = recvmsg(fd, &msg, MSG_ZEROCOPY);

On the zero-copy path: kernel maps incoming NetBuf pages (Section 15.1.3) directly into the process's address space as read-only anonymous mappings. The user sees data in msg.msg_iov as usual, but the pages are shared with the kernel receive buffer (no memcpy).

UmkaOS advantage: NetBuf pages are pre-registered as zero-copy eligible at buffer pool creation time (no per-receive decision needed). All TCP receives are zero-copy capable; SO_ZEROCOPY only enables the user-mapping step.

Completion notification (mandatory):

After processing the received data, the application must notify the kernel that it is done with the pages, so they can be returned to the buffer pool:

// Read the completion notification (blocks if not yet ready)
ssize_t ret = recvmsg(fd, &notification_msg, MSG_ERRQUEUE);
// notification_msg contains a struct sock_extended_err:
//   ee_errno = 0, ee_origin = SO_EE_ORIGIN_ZEROCOPY
//   ee_data = highest sequence number of acknowledged zero-copy data

After this call returns, the user-mapped pages are unmapped from the process's address space and returned to the NetBuf pool.

Constraints:

  • Minimum useful payload size: 4 KB. Below 4 KB, the mapping overhead exceeds the copy cost; the kernel falls back to a regular copy silently.
  • User buffer pointers in msg_iov must be page-aligned (enforced; EFAULT returned if not aligned).
  • Not compatible with in-kernel kTLS decryption (data must be decrypted before zero-copy mapping). Compatible with kTLS hardware offload (data arrives pre-decrypted from the NIC).
  • On copy fallback (e.g., data < 4 KB): the MSG_ZEROCOPY flag has no effect; a standard copy is used. No error is returned. The application does NOT need to drain MSG_ERRQUEUE for the copy path.

Zero-copy vs copy-fallback notification semantics (send side):

SO_ZEROCOPY also covers the transmit direction. The two code paths for a sendmsg() with MSG_ZEROCOPY are mutually exclusive per call and have different notification behavior:

  • Zero-copy path (kernel maps pages into NIC DMA directly): Notification is mandatory. The application MUST drain MSG_ERRQUEUE after the send to release the page pin and allow the buffer to be reused. Failure to drain causes page leaks. The notification carries SO_EE_ORIGIN_ZEROCOPY in the error ancillary data with the send range.

  • Copy-fallback path (kernel copies data, frees the original buffer immediately): No notification is sent. The kernel freed the buffer internally after copying. The application MUST NOT drain MSG_ERRQUEUE for this send — there is nothing to drain, and waiting would block. The send flags return value indicates MSG_ZEROCOPY_SKIPPED to signal the fallback occurred.

Applications can unconditionally drain MSG_ERRQUEUE with MSG_DONTWAIT and handle EAGAIN as "no notification pending" — this is safe for both paths.

Page reclaim policy for leaked zero-copy pages:

If an application crashes (SIGKILL) or hangs before draining MSG_ERRQUEUE, zero-copy mapped pages would leak indefinitely without a reclaim mechanism. UmkaOS tracks every outstanding zero-copy page and enforces bounded lifetimes:

/// Tracks a single zero-copy page lent to userspace via MSG_ERRQUEUE.
///
/// Inserted into the per-netns timeout list when the page is mapped into the
/// process address space. Removed when the application drains MSG_ERRQUEUE or
/// when the reclaim worker expires it.
pub struct ZcopyPageRef {
    /// Socket that owns this zero-copy mapping.
    pub socket_id: u64,
    /// Deadline after which the page is forcibly reclaimed.
    /// Set to `Instant::now() + Duration::from_secs(60)` at delivery time.
    pub deadline: Instant,
    /// Physical page frame backing this mapping.
    pub page: PageRef,
}

Each network namespace maintains a zcopy_timeout_list: Mutex<Vec<ZcopyPageRef>> ordered by deadline. The reclaim rules are:

  • Deadline: 60 seconds from when the packet was delivered to MSG_ERRQUEUE. This is generous enough for any well-behaved application but prevents indefinite page leaks.
  • Reclaim worker: A kernel thread (zcopy_reclaim_worker) runs every 10 seconds, scanning the timeout list and reclaiming (unmapping from userspace + returning to the NetBuf pool) all pages past their deadline.
  • Socket close (normal): When a socket is closed, all pending zero-copy pages for that socket are immediately reclaimed — the deadline is set to Instant::now() and the reclaim worker is woken.
  • Application crash (SIGKILL path): The SIGKILL handler runs the file descriptor cleanup path, which closes all open sockets. Socket close triggers the immediate reclaim described above, so pages are recovered promptly even on abnormal exit.
  • Accounting: The per-netns count of outstanding zero-copy pages is exposed via umkafs at /System/Network/<netns>/zcopy_pages_outstanding for monitoring.

This design ensures that zero-copy pages are never permanently leaked, regardless of application behavior.
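
The reclaim scan itself is a simple deadline sweep. A sketch under the assumptions above, with u64 millisecond timestamps standing in for Instant and PageRef reduced to a plain handle:

```rust
/// Stand-in for the real physical-page handle.
struct PageRef(u64);

/// Simplified ZcopyPageRef with an integer deadline for illustration.
struct ZcopyPageRef {
    socket_id: u64,
    deadline_ms: u64,
    page: PageRef,
}

/// One pass of the reclaim worker (runs every 10s in the design above).
/// Returns the pages to hand back to the NetBuf pool; keeps the rest.
fn reap_expired(list: &mut Vec<ZcopyPageRef>, now_ms: u64) -> Vec<PageRef> {
    let mut reclaimed = Vec::new();
    let mut i = 0;
    while i < list.len() {
        if list[i].deadline_ms <= now_ms {
            let entry = list.swap_remove(i); // O(1) removal; order not needed
            reclaimed.push(entry.page);
        } else {
            i += 1;
        }
    }
    reclaimed
}

/// Socket-close path: force all of this socket's pages to expire now,
/// so the next reclaim pass recovers them immediately.
fn expire_socket(list: &mut Vec<ZcopyPageRef>, socket_id: u64, now_ms: u64) {
    for e in list.iter_mut().filter(|e| e.socket_id == socket_id) {
        e.deadline_ms = now_ms;
    }
}
```

The real worker additionally unmaps the pages from the owning process before returning them to the pool.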

Performance: Zero-copy receive eliminates a memcpy of typically 16-64 KB per call, saving ~10-50 μs per receive on a modern CPU. Break-even vs. copy: ~4 KB on x86-64 (mmap overhead amortized over one or more pages).

```rust
/// Poll event bitflags (matches Linux POLL* values).
#[repr(transparent)]
pub struct PollEvents(u16);

impl PollEvents {
    pub const IN:   Self = Self(0x0001); // POLLIN
    pub const PRI:  Self = Self(0x0002); // POLLPRI
    pub const OUT:  Self = Self(0x0004); // POLLOUT
    pub const ERR:  Self = Self(0x0008); // POLLERR
    pub const HUP:  Self = Self(0x0010); // POLLHUP
    pub const NVAL: Self = Self(0x0020); // POLLNVAL
}

/// Shutdown direction (matches Linux SHUT_* values).
#[repr(i32)]
pub enum ShutdownHow {
    Rd   = 0, // SHUT_RD
    Wr   = 1, // SHUT_WR
    RdWr = 2, // SHUT_RDWR
}

/// Message header for sendmsg/recvmsg (matches Linux struct msghdr).
#[repr(C)]
pub struct MsgHdr {
    /// Optional destination address (for connectionless sockets).
    pub msg_name: *mut SockAddr,
    pub msg_namelen: u32,
    /// Scatter-gather array (iovec).
    pub msg_iov: *mut IoVec,
    pub msg_iovlen: usize,
    /// Ancillary data (control messages).
    pub msg_control: *mut u8,
    pub msg_controllen: usize,
    /// Flags.
    pub msg_flags: i32,
}

/// I/O vector for scatter-gather I/O (matches Linux struct iovec).
#[repr(C)]
pub struct IoVec {
    pub iov_base: *mut u8,
    pub iov_len: usize,
}

/// Connection tracking state (matches Linux conntrack states).
#[repr(u8)]
pub enum ConntrackState {
    New         = 0,
    Established = 1,
    Related     = 2,
    Invalid     = 3,
    Untracked   = 4,
}

/// NAT type applied to a connection.
#[repr(u8)]
pub enum NatType {
    None       = 0,
    Snat       = 1, // Source NAT
    Dnat       = 2, // Destination NAT
    Masquerade = 3, // Source NAT with auto-IP
}
```


**Socket concurrency model**: The socket dispatch layer wraps each `dyn SocketOps`
in a per-socket `RwLock<SlabRef<dyn SocketOps>>`. The `RwLock` serializes socket
lifecycle operations (`close`, which drops the `SlabRef`) against concurrent data
operations (`sendmsg`, `recvmsg`). Socket-internal state uses fine-grained interior
mutability (per-field atomics or internal locks), so the outer `RwLock` is never
contended on the data path -- readers acquire the read lock for all data operations,
and only `close` acquires the write lock. Slab allocation avoids per-socket
heap allocation; `SlabRef` provides stable references suitable for the `RwLock` wrapper,
and matches the return type of `SocketOps::accept()`. The dispatch layer — not the trait
implementation — acquires the appropriate lock before calling trait methods:
- Data-path operations (`sendmsg`, `recvmsg`, `poll`) acquire a **shared (read) lock**,
  allowing concurrent sends/receives from multiple threads (matching Linux's behavior
  where multiple threads can read/write the same socket simultaneously).
- Lifecycle operations (`close`, `shutdown`, `setsockopt`, `bind`, `listen`)
  acquire an **exclusive (write) lock**, ensuring they are serialized with respect to
  all other operations. Note that `connect()` acquires the write lock only to transition
  the socket state to `SYN_SENT`, then releases the lock and blocks on a wait queue,
  allowing concurrent `close()` or `shutdown()` to abort the connection attempt.

The `SocketOps` trait methods all take `&self` because the dispatch layer guarantees
the correct lock is held before invocation. Implementations use interior mutability
(per-field atomics or fine-grained locks) for their mutable state, as is standard for
Rust traits shared across threads.

This means:
- `close()` waits for any in-flight `recvmsg()` or `sendmsg()` to complete before
  proceeding. No use-after-free is possible. To prevent unbounded waits when
  `recvmsg()` is blocked in a long-polling receive, `close()` sets a `SOCK_DEAD`
  flag (visible to the socket's wait queue) before acquiring the write lock. This
  wakes any blocked readers, which check the flag and return `-EBADF`, releasing
  their read locks. This mirrors Linux's `sock_flag(sk, SOCK_DEAD)` mechanism.
- `shutdown()` + `recvmsg()`: `shutdown(SHUT_RD)` acquires the exclusive lock, sets a
  "read-shutdown" flag, and releases the lock. Subsequent `recvmsg()` calls see the
  flag and return 0 (EOF) without waiting for the exclusive lock.
- The `RwLock` is a per-CPU reader-optimized lock (the shared-lock fast path is a
  single atomic increment on the local CPU's counter, adding < 10 ns to data-path
  operations).
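
A minimal user-space sketch of this lock discipline, using std::sync::RwLock in place of the per-CPU reader-optimized lock. The types, the loopback stub, and the raw -9 (EBADF) error code are illustrative, not the real dispatch layer:

```rust
use std::sync::RwLock;
use std::sync::atomic::{AtomicBool, Ordering};

trait SocketOps: Send + Sync {
    fn recvmsg(&self, buf: &mut [u8]) -> Result<usize, i32>;
}

struct SocketEntry {
    /// SOCK_DEAD: set by close() BEFORE it takes the write lock, so
    /// blocked readers can observe it, bail out, and release their locks.
    dead: AtomicBool,
    /// None after close() drops the socket.
    ops: RwLock<Option<Box<dyn SocketOps>>>,
}

impl SocketEntry {
    /// Data path: shared lock, so many threads can recv concurrently.
    fn recvmsg(&self, buf: &mut [u8]) -> Result<usize, i32> {
        if self.dead.load(Ordering::Acquire) {
            return Err(-9); // EBADF: close() already started
        }
        let guard = self.ops.read().unwrap();
        match guard.as_ref() {
            Some(ops) => ops.recvmsg(buf),
            None => Err(-9),
        }
    }

    /// Lifecycle path: flag first, then exclusive lock.
    fn close(&self) {
        self.dead.store(true, Ordering::Release);
        let mut guard = self.ops.write().unwrap(); // waits for in-flight ops
        *guard = None; // drop the socket; no use-after-free possible
    }
}

/// Trivial stub standing in for a real transport implementation.
struct LoopbackStub;
impl SocketOps for LoopbackStub {
    fn recvmsg(&self, _buf: &mut [u8]) -> Result<usize, i32> { Ok(4) }
}
```

The write lock in close() cannot be acquired until every in-flight read-locked recvmsg/sendmsg returns, which is exactly the no-use-after-free property described above.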

Socket objects returned by `accept()` are allocated from a per-CPU slab allocator (the
kernel slab allocator described in Section 4.1), not the general-purpose heap. `SlabRef<T>`
is a typed reference into a slab pool, providing O(1) allocation and deallocation without
contending on a global heap lock. This is critical for servers handling millions of
concurrent connections — `Box<dyn SocketOps>` would introduce a heap allocation on every
`accept()`, creating allocator contention under load. The slab is pre-sized per socket
type during protocol registration and grows in page-granularity chunks on demand.

**Static dispatch for the common path**: The `SocketOps` trait uses dynamic dispatch
(`dyn SocketOps`) to support runtime protocol registration and heterogeneous socket
collections. However, the common case — TCP sockets using the built-in CUBIC congestion
control — is monomorphized at compile time via generic specialization. The TCP
implementation calls its own concrete methods directly on the hot path (connect, send,
recv); `dyn` dispatch is only exercised when the socket layer must operate on a
protocol-agnostic socket handle (e.g., `epoll` readiness checks across mixed socket
types, or the `close()` path that iterates the fd table). This ensures the TCP fast path
has zero vtable overhead.

Protocol registration happens at umka-net initialization:

```rust
/// Register a transport protocol with the socket layer.
/// Called during umka-net init for built-in protocols (TCP, UDP, SCTP, MPTCP).
/// Can also be called at runtime to register dynamically loaded protocols.
pub fn register_protocol(
    family: AddressFamily,       // AF_INET, AF_INET6
    sock_type: SocketType,       // SOCK_STREAM, SOCK_DGRAM, SOCK_SEQPACKET
    protocol: u16,               // IPPROTO_TCP, IPPROTO_UDP, IPPROTO_SCTP, IPPROTO_MPTCP
    factory: Box<dyn SocketFactory>,
) -> Result<(), KernelError>;
```

Adding a new transport (e.g., QUIC kernel offload) requires only: (1) implement SocketOps, (2) implement SocketFactory, (3) call register_protocol with the appropriate family/type/protocol tuple. No changes to the socket layer, syscall dispatch, or any other transport's code.
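
A registry behind register_protocol() can be sketched as a map keyed on the (family, type, protocol) tuple. Everything here (the enum variants, the one-method SocketFactory, the "EEXIST" error string) is illustrative, not the real KABI:

```rust
use std::collections::HashMap;

#[derive(PartialEq, Eq, Hash, Clone, Copy)]
enum AddressFamily { Inet, Inet6 }

#[derive(PartialEq, Eq, Hash, Clone, Copy)]
enum SocketType { Stream, Dgram }

trait SocketFactory {
    fn name(&self) -> &'static str;
}

struct Registry {
    map: HashMap<(AddressFamily, SocketType, u16), Box<dyn SocketFactory>>,
}

impl Registry {
    fn new() -> Self {
        Self { map: HashMap::new() }
    }

    /// Fails if the (family, type, protocol) tuple is already claimed,
    /// so two transports cannot shadow each other.
    fn register(
        &mut self,
        family: AddressFamily,
        sock_type: SocketType,
        protocol: u16,
        factory: Box<dyn SocketFactory>,
    ) -> Result<(), &'static str> {
        let key = (family, sock_type, protocol);
        if self.map.contains_key(&key) {
            return Err("EEXIST");
        }
        self.map.insert(key, factory);
        Ok(())
    }

    /// socket(2) dispatch: find the factory for the requested tuple.
    fn lookup(&self, family: AddressFamily, sock_type: SocketType, protocol: u16)
        -> Option<&dyn SocketFactory>
    {
        self.map.get(&(family, sock_type, protocol)).map(|b| b.as_ref())
    }
}

struct TcpFactory;
impl SocketFactory for TcpFactory {
    fn name(&self) -> &'static str { "tcp" }
}
```

A new transport is just another register() call with its own tuple; no existing entry changes.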

Linux comparison: Linux's struct proto_ops serves a similar role, but the implementation is entangled with struct sock internals. Adding MPTCP to Linux required modifying tcp_input.c, tcp_output.c, the socket layer, and the connection tracking subsystem. UmkaOS's trait boundary enforces that transports are self-contained.

TCP timer implementation (RTO computation, delayed ACK, zero-window probe, TIME_WAIT, keepalive) follows RFC 6298 (RTO), RFC 1122 (delayed ACK ≤ 500ms, keepalive ≥ 2h), RFC 7323 (timestamps/window scaling), and RFC 7413 (TFO). Timer integration uses the kernel's hierarchical timer wheel (Section 6.5.4, 06-scheduling.md) with O(1) insertion and cancellation. Key constants (matching Linux defaults for compatibility): TCP_RTO_MIN = 200ms, TCP_RTO_MAX = 120s, TCP_TIMEWAIT_LEN = 60s, TCP_DELACK_MAX = 200ms. Specific timer algorithms (SRTT smoothing, Karn's algorithm) are implementation details within these RFC-defined bounds and are not specified further in the architecture.

15.1.5 Congestion Control Framework

Congestion control is pluggable via a trait, selectable per-socket at runtime:

The congestion control interface is the CongestionOps trait (fully specified in Section 15.4.1). Each algorithm is a stateless descriptor (&'static dyn CongestionOps); per-connection state lives in TcpCb.cong_priv (64 bytes inline). All byte counters in the interface use u64 to support high-BDP networks: at 400 Gbps with 100ms RTT the bandwidth-delay product is ~5 GB, which would overflow u32 (max ~4.3 GB). Using u64 avoids silent overflow on datacenter and WAN paths.
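
The overflow claim is easy to check numerically (integer math in Gbps and milliseconds):

```rust
/// Bandwidth-delay product in bytes: BDP = bandwidth × RTT.
fn bdp_bytes(gbps: u64, rtt_ms: u64) -> u64 {
    // bits/s → bytes/s (÷8), scaled by RTT in milliseconds (÷1000)
    gbps * 1_000_000_000 / 8 * rtt_ms / 1000
}
```

At 400 Gbps × 100 ms this gives 5,000,000,000 bytes (~5 GB), comfortably past u32::MAX (~4.29 GB).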

Built-in algorithms:

| Algorithm | Description | Default |
|---|---|---|
| BBR | Google's bottleneck bandwidth and RTT-based CC. | Yes |
| CUBIC | Linux default since 2.6.19. Cubic function for cwnd growth. | No |
| BBRv3 | Revised BBR merging bandwidth and loss models into a single state machine; available from Google's BBR repository (not yet in Linux mainline as of 2026). | No |
| Reno | Classic AIMD (additive increase, multiplicative decrease). | No |

Per-socket selection via setsockopt(fd, IPPROTO_TCP, TCP_CONGESTION, "bbr") — same API as Linux. Applications that set congestion control on Linux work identically on UmkaOS.

eBPF struct_ops: Custom congestion control algorithms can be loaded at runtime via eBPF, using the same struct_ops mechanism as Linux 5.6+. An eBPF program implements the CongestionOps trait methods as BPF functions, which are JIT-compiled and attached to the per-socket congestion control slot. This enables production A/B testing of new algorithms without kernel rebuilds — the same workflow used at Meta and Google on Linux.

15.1.6 MPTCP as First-Class Transport

MPTCP (RFC 8684) is designed into umka-net from the start, not retrofitted onto an existing TCP implementation. This avoids the years of integration pain that Linux experienced.

Architecture:

                    MPTCP Connection
                   /       |        \
              Subflow 0  Subflow 1  Subflow 2
              (WiFi)     (LTE)      (Ethernet)
                 |          |          |
              TCP stack  TCP stack  TCP stack
              (per-subflow congestion control)

Key design decisions:

  • Subflow management: A path manager component handles subflow creation and teardown. It monitors available network interfaces and creates subflows when new paths appear (e.g., WiFi connects). Subflow teardown is graceful (DATA_FIN) or abrupt (RST on path failure).

  • Packet scheduler: Distributes data segments across subflows. Built-in policies: round-robin, lowest-RTT (send on the subflow with the shortest current RTT estimate), and redundant (duplicate on all subflows for ultra-low-latency). Scheduler is pluggable via a trait, same pattern as congestion control.

  • Sequence number separation: Connection-level Data Sequence Numbers (DSN) are independent of per-subflow TCP sequence numbers. This is architecturally baked in — the MPTCP layer maintains a DSN-to-subflow-sequence mapping, and the per-subflow TCP machines operate with their own sequence spaces. In Linux, this separation was retrofitted and required careful locking; in UmkaOS, the type system enforces the distinction (DataSeqNum vs SubflowSeqNum are distinct newtypes).

  • Middlebox fallback: If a middlebox strips MPTCP options from the SYN/ACK, the connection falls back to single-path TCP transparently. The application sees a working connection regardless.

  • Use cases requiring MPTCP: iOS and macOS use MPTCP for seamless WiFi/cellular handoff. Multipath TCP proxies improve connection reliability for mobile clients. WireGuard multipath tunnels bond multiple network paths for increased throughput.
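
The DSN/subflow-sequence separation noted above can be sketched with newtypes: distinct wrapper types make it a compile error to pass a connection-level DSN into per-subflow sequence arithmetic. The DsnMap table here is an illustration, not the real MPTCP layer:

```rust
use std::collections::BTreeMap;

/// Connection-level data sequence number (64-bit DSN per RFC 8684).
#[derive(Clone, Copy, PartialEq, Eq, PartialOrd, Ord, Debug)]
struct DataSeqNum(u64);

/// Per-subflow TCP sequence number (ordinary 32-bit TCP space).
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
struct SubflowSeqNum(u32);

/// DSN → (subflow id, subflow seq) mapping maintained by the MPTCP layer.
struct DsnMap {
    map: BTreeMap<DataSeqNum, (u8, SubflowSeqNum)>,
}

impl DsnMap {
    fn new() -> Self {
        Self { map: BTreeMap::new() }
    }

    /// Record where a connection-level byte range was scheduled.
    fn record(&mut self, dsn: DataSeqNum, subflow: u8, seq: SubflowSeqNum) {
        self.map.insert(dsn, (subflow, seq));
    }

    /// Resolve a DSN back to the subflow that carried it.
    fn lookup(&self, dsn: DataSeqNum) -> Option<(u8, SubflowSeqNum)> {
        self.map.get(&dsn).copied()
    }
}
```

Because DataSeqNum and SubflowSeqNum are different types, mixing the two sequence spaces fails at compile time rather than at 3 a.m. in production.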

15.1.7 Domain Switch Overhead Analysis

Clarification: umka-core mediates domain transitions (switching PKRU/POR_EL0 state) but does NOT copy packet data. The NIC driver and umka-net share a zero-copy ring buffer in shared memory (accessible to both domains via PKEY 1). Domain switches occur when the CPU transitions between executing umka-core code (for dispatch/scheduling), NIC driver code (for DMA completion processing), and umka-net code (for TCP/IP processing). The 4 switches represent: (1) umka-core→NIC driver for interrupt dispatch, (2) NIC driver→umka-core on return, (3) umka-core→umka-net for protocol processing, (4) umka-net→umka-core on return. Data flows through shared-memory ring buffers without additional copies.

The network stack (umka-net, Tier 1) runs in its own isolation domain. Every packet traverses two domain boundaries, each requiring an entry and exit switch (4 switches total):

  1. umka-core to NIC driver and back: domain switch to enter NIC driver domain for interrupt handling (~23 cycles), then domain switch to return to umka-core (~23 cycles)
  2. umka-core to umka-net and back: domain switch to enter umka-net domain for TCP processing (~23 cycles), then domain switch to return to umka-core for socket delivery (~23 cycles)

For high-throughput networking (100 Gbps), the overhead matters.

Per-packet cost analysis (1500-byte frames at 100 Gbps = ~8.3M packets/sec):

Domain switches per packet:      4 (2 domain entries x 2 switches each)
Cycles per switch:               ~23 (WRPKRU, per [Section 10.2](10-drivers.md#102-isolation-mechanisms-and-performance-modes))
Total domain switch overhead/packet:  ~92 cycles (~20ns at 4.5 GHz)
Time budget per packet:          ~120ns (at 8.3M pps)
Domain switch overhead fraction: ~17% (at 4.5 GHz) to ~26% (at 3 GHz)

This 17-26% overhead (depending on clock speed) is unacceptable for production networking. Four mitigations reduce it to a negligible fraction:

  • Batching: Process packets in batches of up to 64. Each batch requires only 4 domain switches total (not 4 per packet), because the receiving domain processes the entire batch before returning. This reduces per-packet domain switch overhead to <1 cycle at batch size 64.
  • NAPI-style polling: After the first interrupt, switch to polling mode. The NIC driver translates its hardware-specific completion events into standardized KABI completion descriptors. These KABI descriptors are written to a shared isolation domain (the shared read-only PKEY per Section 10.2's domain allocation table), accessible by both umka-net and the NIC driver. umka-net reads the KABI descriptors directly from the shared domain without a per-packet domain switch. Write access to ring doorbell registers remains in the NIC driver's private domain — umka-net can observe completions but cannot manipulate the hardware directly. This deliberately places the ring descriptors outside the NIC driver's private domain, following the standard UmkaOS pattern for zero-copy data exchange between domains. No per-packet interrupt or domain switch while in polling mode. The polling-to-interrupt transition uses an adaptive threshold based on packet rate.
  • XDP fast path: XDP programs run in the NIC driver's isolation domain, processing packets before they reach umka-net. Packets that are dropped, redirected, or TX-bounced by XDP never incur the driver-to-umka-net domain switch. For workloads like DDoS mitigation where >90% of packets are dropped, this eliminates nearly all domain switches.
  • GRO (Generic Receive Offload): Coalesce multiple small packets into larger aggregates before delivery across domain boundaries, amortizing the per-packet domain switch cost across multiple original packets.
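
The budget arithmetic above, plus the batching amortization, can be reproduced as a quick calculation (a sketch; constants taken from this section):

```rust
/// Percentage of the per-packet time budget consumed by domain switches.
fn switch_overhead_pct(cycles_per_switch: f64, switches_per_pkt: f64, ghz: f64, pps: f64) -> f64 {
    let budget_ns = 1e9 / pps;                                    // time per packet
    let overhead_ns = switches_per_pkt * cycles_per_switch / ghz; // cycles → ns
    100.0 * overhead_ns / budget_ns
}
```

At 8.3M pps the budget is ~120 ns; 4 × 23 cycles costs ~17% of it at 4.5 GHz and ~26% at 3 GHz. With 64-packet batches the per-packet switch count drops to 4/64, pushing the overhead well under 1%.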

NIC Hardware Offloads

Modern NICs perform significant protocol processing in hardware, offloading work from the CPU. UmkaOS exposes these through the NIC driver's KABI and umka-net configuration:

| Offload | Direction | Description | Benefit |
|---|---|---|---|
| TSO (TCP Segmentation Offload) | TX | Application sends large (up to 64KB) TCP segments; NIC splits into MTU-sized packets with correct TCP sequence numbers and checksums | Eliminates per-packet CPU segmentation; up to 5x throughput improvement for bulk transfers |
| GSO (Generic Segmentation Offload) | TX | Software fallback for TSO — umka-net segments just before the NIC driver if hardware TSO is unavailable. Also handles UDP (UFO) and tunnel-encapsulated packets (GSO_ENCAP) | Same API for applications regardless of NIC capability |
| GRO (Generic Receive Offload) | RX | Coalesce multiple received packets into larger aggregates before protocol processing | Reduces per-packet overhead; amortizes domain switch cost |
| TX Checksum Offload | TX | NIC computes TCP/UDP/IP checksums in hardware; umka-net marks the NetBuf with CHECKSUM_PARTIAL and provides the checksum start/offset | Saves ~50ns CPU per packet |
| RX Checksum Offload | RX | NIC verifies checksums and reports status; umka-net skips software verification for CHECKSUM_COMPLETE or CHECKSUM_UNNECESSARY packets | Saves ~50ns CPU per packet |
| Scatter-Gather I/O | TX | NIC can DMA from non-contiguous memory (multiple physical pages); umka-net passes a scatter-gather list instead of copying to a contiguous buffer | Eliminates linearization copy for large packets |

Offload capabilities are queried at driver bind time via NicDriver::query_offloads() and are individually toggleable at runtime via sysfs (/sys/class/net/<dev>/offload/{tso,gso,tx_csum,rx_csum,sg}), matching Linux's ethtool -K semantics. Offloads are enabled by default when the NIC reports support. GSO is always available as a software fallback for NICs without TSO.

Receive Flow Steering (RFS / aRFS)

On multi-queue NICs, interrupt affinity determines which CPU processes each received packet. Without flow steering, a packet may arrive on CPU 0 (interrupt handler) while the consuming application runs on CPU 5 — the packet traverses the socket buffer cache-cold, adding ~2-5μs cross-CPU latency at high packet rates.

UmkaOS implements both software and hardware flow steering:

  • RFS (Receive Flow Steering) — software-based. When a socket performs recvmsg(), umka-net records the {flow_hash → cpu} mapping in a per-NIC flow table. On the next packet for that flow, the softirq handler checks the table and, if the target CPU differs from the current CPU, enqueues the packet to the target CPU's backlog via inter-processor interrupt (IPI). This steers subsequent packets to the CPU where the application is running, improving cache locality.

sysfs control:

  /sys/class/net/<dev>/queues/rx-<N>/rps_flow_cnt — entries per RX queue (default: 0 = disabled)
  /proc/sys/net/core/rps_sock_flow_entries — global flow table size (default: 0 = disabled)

  • aRFS (Accelerated RFS) — hardware-based. For NICs that support hardware flow steering (Intel i40e/ice, Mellanox mlx5, Broadcom bnxt), umka-net programs the NIC's flow director or n-tuple filter table to steer packets to the correct RX queue at the hardware level. This eliminates the software IPI redirect — the NIC delivers the packet directly to the correct CPU's RX queue via MSI-X.

The NIC driver implements the ndo_rx_flow_steer() KABI method. umka-net calls it when the flow table is updated. aRFS is preferred over RFS when the NIC supports it; umka-net falls back to software RFS automatically.

Socket-Level Busy Polling (SO_BUSY_POLL)

For latency-critical applications (HFT, DPDK-adjacent workloads, <10μs requirement), interrupt-driven packet delivery has an inherent latency floor: the time from NIC DMA completion → MSI-X → interrupt handler → softirq → socket wakeup is typically 5-20μs even with well-tuned interrupt coalescing.

Busy polling eliminates this floor by having the application poll the NIC's completion queue directly from the recvmsg() / poll() / epoll_wait() syscall path:

Per-socket:
  setsockopt(fd, SOL_SOCKET, SO_BUSY_POLL, &timeout_us, sizeof(timeout_us));

Global default:
  /proc/sys/net/core/busy_poll = <microseconds>    — busy-poll timeout for poll()/select()
  /proc/sys/net/core/busy_read = <microseconds>    — busy-poll timeout for read()/recvmsg()

When busy polling is active, recvmsg() and epoll_wait() spin in a tight loop calling the NAPI poll function (napi_poll()) directly from process context, which checks the RX completion queue without waiting for an interrupt. (Note: Linux removed the per-driver ndo_busy_poll() callback in kernel 4.11; busy polling now goes through the NAPI subsystem uniformly.) The thread burns CPU cycles during the poll window but reduces receive latency to ~1-3μs (NIC DMA completion → next poll iteration).

Trade-offs: Busy polling trades CPU efficiency for latency. A thread busy-polling at 50μs timeout wastes those cycles if no packet arrives. This is appropriate for dedicated-CPU, latency-critical workloads (trading NICs, real-time control) but inappropriate for shared servers. The per-socket granularity ensures only opted-in sockets pay the CPU cost.

Measured overhead target: With batching + NAPI polling active, the domain switch overhead for sustained 100 Gbps throughput is <2% of CPU time. This is comparable to Linux's combined interrupt + softirq overhead for the same workload, making domain isolation effectively free at high packet rates.

15.1.8 Kernel TLS (kTLS)

Kernel TLS (kTLS): UmkaOS supports TCP_ULP with tls to offload TLS record-layer encryption/decryption to the kernel (TLS_TX, TLS_RX socket options). This enables sendfile() for HTTPS without userspace encryption (used by nginx, Envoy, HAProxy). The TLS record layer runs in umka-net (Tier 1); key material is confined to the connection's socket structure and wiped on close. Hardware TLS offload to capable NICs is supported via the standard NETIF_F_HW_TLS_TX / NETIF_F_HW_TLS_RX feature flags.

Offload Negotiation and Fallback

After the TLS handshake completes in userspace, the application calls setsockopt(SOL_TLS, TLS_TX, tls_crypto_info, ...) (and optionally TLS_RX) to hand off the record layer to the kernel. At this point the kernel decides whether to use NIC hardware offload or software kTLS:

  1. Capability discovery: NIC drivers expose TLS offload support via a TlsOffloadCaps bitfield advertised to umka-net during device registration. Capabilities are reported per direction (TX, RX) and per cipher suite so the stack can make per-connection decisions.

  2. Supported cipher suites for offload (NICs may support a subset):

     • TLS_CIPHER_AES_GCM_128 — most widely supported
     • TLS_CIPHER_AES_GCM_256
     • TLS_CIPHER_CHACHA20_POLY1305

Software kTLS mandatory cipher support (all must be implemented):

| Cipher Suite | TLS Identifier | Mandated By | Key/IV Size | Tag Size |
|---|---|---|---|---|
| AES-128-GCM | TLS_AES_128_GCM_SHA256 | RFC 8446 §B.4 (MUST) | 16 B / 12 B | 16 B |
| AES-256-GCM | TLS_AES_256_GCM_SHA384 | RFC 8446 §B.4 (SHOULD) | 32 B / 12 B | 16 B |
| ChaCha20-Poly1305 | TLS_CHACHA20_POLY1305_SHA256 | RFC 8446 §B.4 (SHOULD) | 32 B / 12 B | 16 B |

TLS 1.2 backward compatibility also supports:

| Cipher Suite | TLS ID | Standard | Key/IV Size |
|---|---|---|---|
| AES-128-GCM (TLS 1.2) | TLS_RSA_WITH_AES_128_GCM_SHA256 | RFC 5246 | 16 B / 4 B + 8 B |
| AES-256-GCM (TLS 1.2) | TLS_RSA_WITH_AES_256_GCM_SHA384 | RFC 5246 | 32 B / 4 B + 8 B |

  3. Negotiation sequence:

         userspace: setsockopt(SOL_TLS, TLS_TX, tls_crypto_info, ...)
         kernel:    check if the NIC supports the negotiated cipher suite and direction
                    if yes → call the driver's .tls_dev_add(), pass key material to the NIC
                    if no  → fall back silently to software kTLS (same API, app unchanged)

     The setsockopt() API is identical to Linux (SOL_TLS socket option) for full compatibility with existing TLS-aware applications.

  4. Transparent fallback: if the NIC rejects offload (key table full, unsupported cipher suite, or device error), the kernel falls back to software kTLS without surfacing the failure to the application — the setsockopt() call succeeds either way. Only the data path changes (NIC encrypt/decrypt vs. kernel encrypt/decrypt); the socket API and application behaviour are identical in both cases.

  5. Asymmetric offload: TX and RX offload are independently negotiated. A NIC may support TX offload but not RX (or vice versa). Each direction is offloaded if and only if the NIC supports it for the negotiated cipher suite; the other direction falls back to software kTLS. Both directions may coexist on the same connection.

15.1.8.1 kTLS Mandatory Cipher Support

All three cipher suites below must be supported by the software kTLS implementation. NIC hardware offload for any subset of them is optional (driver declares support via the TlsOffloadCaps bitfield advertised during device registration).

Mandatory cipher suites:

| Cipher suite | TLS version | Linux kTLS since | RFC mandate | Notes |
| --- | --- | --- | --- | --- |
| TLS_AES_128_GCM_SHA256 | TLS 1.3 | Linux 4.13 | RFC 8446 §9.1 MUST | Default TLS 1.3 cipher; most widely NIC-offloaded |
| TLS_AES_256_GCM_SHA384 | TLS 1.3 | Linux 5.1 | RFC 8446 §9.1 SHOULD | Required for high-security deployments |
| TLS_CHACHA20_POLY1305_SHA256 | TLS 1.3 | Linux 5.11 | RFC 8446 §9.1 SHOULD | Required on platforms lacking AES hardware acceleration |

RFC 8446 §9.1 requirement: Implementations MUST implement TLS_AES_128_GCM_SHA256. TLS_AES_256_GCM_SHA384 and TLS_CHACHA20_POLY1305_SHA256 are recommended. UmkaOS implements all three.

Crypto info structs (passed via setsockopt(SOL_TLS, TLS_TX/TLS_RX, ...)). These are layout-compatible with Linux's tls_crypto_info family in include/uapi/linux/tls.h, ensuring unmodified applications work without recompilation:

/// Base TLS crypto info header. Passed as the first field of each cipher-specific
/// struct. Layout matches Linux `struct tls_crypto_info` for setsockopt compat.
#[repr(C)]
pub struct KtlsCryptoInfo {
    /// TLS protocol version: `0x0303` = TLS 1.2, `0x0304` = TLS 1.3.
    pub version: u16,
    /// Cipher type constant. Must match one of the `CIPHER_*` values below.
    pub cipher_type: u16,
}

/// Cipher type constants. Values match Linux `TLS_CIPHER_*` in `linux/tls.h`
/// for setsockopt compatibility with existing TLS-aware userspace applications.
pub mod cipher_type {
    /// AES-128-GCM. Value = 51 (`TLS_CIPHER_AES_GCM_128`). Linux 4.13+.
    pub const AES_GCM_128:       u16 = 51;
    /// AES-256-GCM. Value = 52 (`TLS_CIPHER_AES_GCM_256`). Linux 5.1+.
    pub const AES_GCM_256:       u16 = 52;
    /// ChaCha20-Poly1305. Value = 54 (`TLS_CIPHER_CHACHA20_POLY1305`). Linux 5.11+.
    pub const CHACHA20_POLY1305: u16 = 54;
}

/// AES-128-GCM crypto parameters for kTLS (TLS 1.2 or TLS 1.3).
/// Mandatory cipher (RFC 8446 §9.1 MUST).
/// Layout matches Linux `struct tls12_crypto_info_aes_gcm_128`.
#[repr(C)]
pub struct KtlsAes128GcmInfo {
    /// Base header (`version` + `cipher_type = cipher_type::AES_GCM_128`).
    pub info: KtlsCryptoInfo,
    /// Implicit nonce (IV) — 8 bytes. XOR'd with the sequence number to form
    /// the full 12-byte GCM nonce together with `salt`.
    pub iv:      [u8; 8],
    /// AES-128 symmetric key — 16 bytes.
    pub key:     [u8; 16],
    /// Fixed salt — 4 bytes. Prepended to `iv` to form the 12-byte GCM nonce.
    pub salt:    [u8; 4],
    /// TLS record sequence number — 8 bytes. Used for nonce reconstruction
    /// and AAD construction on the receive path.
    pub rec_seq: [u8; 8],
}

/// AES-256-GCM crypto parameters for kTLS (TLS 1.2 or TLS 1.3).
/// Layout matches Linux `struct tls12_crypto_info_aes_gcm_256`.
#[repr(C)]
pub struct KtlsAes256GcmInfo {
    /// Base header (`version` + `cipher_type = cipher_type::AES_GCM_256`).
    pub info: KtlsCryptoInfo,
    /// Implicit nonce (IV) — 8 bytes.
    pub iv:      [u8; 8],
    /// AES-256 symmetric key — 32 bytes.
    pub key:     [u8; 32],
    /// Fixed salt — 4 bytes.
    pub salt:    [u8; 4],
    /// TLS record sequence number — 8 bytes.
    pub rec_seq: [u8; 8],
}

/// ChaCha20-Poly1305 crypto parameters for kTLS (TLS 1.3 only).
/// ChaCha20 uses a 96-bit (12-byte) nonce directly; there is no separate salt.
/// Layout matches Linux `struct tls12_crypto_info_chacha20_poly1305`.
#[repr(C)]
pub struct KtlsChaCha20Poly1305Info {
    /// Base header (`version` + `cipher_type = cipher_type::CHACHA20_POLY1305`).
    pub info: KtlsCryptoInfo,
    /// Full 12-byte nonce (no salt split; nonce XOR'd with sequence number).
    pub iv:      [u8; 12],
    /// ChaCha20 symmetric key — 32 bytes.
    pub key:     [u8; 32],
    /// TLS record sequence number — 8 bytes.
    pub rec_seq: [u8; 8],
}
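The nonce-reconstruction rule described in the `iv`/`salt` field comments can be made concrete. A minimal sketch (not the umka-net implementation) of the TLS 1.3 AES-GCM per-record nonce, per RFC 8446 §5.3: the static IV is `salt || iv`, XOR'd with the left-padded record sequence number.

```rust
/// Sketch: TLS 1.3 AES-GCM per-record nonce (RFC 8446 §5.3).
/// static_iv = salt (4 B) || iv (8 B); nonce = static_iv XOR pad(seq).
fn tls13_gcm_nonce(salt: [u8; 4], iv: [u8; 8], rec_seq: u64) -> [u8; 12] {
    let mut nonce = [0u8; 12];
    nonce[..4].copy_from_slice(&salt);
    nonce[4..].copy_from_slice(&iv);
    // The 64-bit sequence number is left-padded to 12 bytes, so only the
    // low 8 bytes of the nonce are affected by the XOR.
    for (n, s) in nonce[4..].iter_mut().zip(rec_seq.to_be_bytes()) {
        *n ^= s;
    }
    nonce
}

fn main() {
    // Record 0 uses the static IV unchanged; record 1 flips the last bit.
    let n0 = tls13_gcm_nonce([1, 2, 3, 4], [5, 6, 7, 8, 9, 10, 11, 12], 0);
    assert_eq!(n0, [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]);
    let n1 = tls13_gcm_nonce([1, 2, 3, 4], [5, 6, 7, 8, 9, 10, 11, 12], 1);
    assert_eq!(n1[11], 12 ^ 1);
}
```

Because each record's nonce is derived from the stored `rec_seq`, the kernel (or NIC) never needs per-record nonce material from userspace after the handoff.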

NIC hardware offload: When a capable NIC driver is bound (e.g., a NIC advertising TlsOffloadCaps::TX_AES_128_GCM), the kernel calls the driver's .tls_dev_add() callback, passing the crypto parameters. The NIC encrypts outbound records and/or decrypts inbound records in hardware. Software kTLS is always available as a fallback — the setsockopt() call succeeds even when NIC offload is unavailable or the key table is full.

TlsOffloadCaps bitfield (reported by driver via NetDeviceInfo during device registration):

bitflags! {
    /// NIC TLS offload capability flags. Reported by driver via NetDeviceInfo.
    /// Each flag indicates per-direction, per-cipher hardware offload capability.
    pub struct TlsOffloadCaps: u32 {
        /// TX offload: AES-128-GCM (TLS 1.3)
        const TX_AES_128_GCM          = 0x0001;
        /// RX offload: AES-128-GCM (TLS 1.3)
        const RX_AES_128_GCM          = 0x0002;
        /// TX offload: AES-256-GCM (TLS 1.3)
        const TX_AES_256_GCM          = 0x0004;
        /// RX offload: AES-256-GCM (TLS 1.3)
        const RX_AES_256_GCM          = 0x0008;
        /// TX offload: ChaCha20-Poly1305 (TLS 1.3)
        const TX_CHACHA20_POLY1305    = 0x0010;
        /// RX offload: ChaCha20-Poly1305 (TLS 1.3)
        const RX_CHACHA20_POLY1305    = 0x0020;
        /// Device supports software-assisted crypto (e.g., Intel QuickAssist)
        const CRYPTO_ENGINE_ASSIST    = 0x0040;
        /// TX/RX offload: both directions for all three TLS 1.3 ciphers
        const FULL_TLS13 = Self::TX_AES_128_GCM.bits()
                         | Self::RX_AES_128_GCM.bits()
                         | Self::TX_AES_256_GCM.bits()
                         | Self::RX_AES_256_GCM.bits()
                         | Self::TX_CHACHA20_POLY1305.bits()
                         | Self::RX_CHACHA20_POLY1305.bits();
    }
}
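The per-connection decision in step 1 of the negotiation sequence reduces to a flag test against this bitfield. A hedged sketch using plain u32 constants mirroring the flags above (the bitflags! type itself is elided for brevity):

```rust
// Flag values mirror the TlsOffloadCaps definition above.
const TX_AES_128_GCM: u32 = 0x0001;
const RX_AES_128_GCM: u32 = 0x0002;
const TX_AES_256_GCM: u32 = 0x0004;
const RX_AES_256_GCM: u32 = 0x0008;
const TX_CHACHA20_POLY1305: u32 = 0x0010;
const RX_CHACHA20_POLY1305: u32 = 0x0020;

// Cipher type constants from the `cipher_type` module above.
const AES_GCM_128: u16 = 51;
const AES_GCM_256: u16 = 52;
const CHACHA20_POLY1305: u16 = 54;

#[derive(Clone, Copy)]
enum Dir { Tx, Rx }

/// Returns true if the NIC can offload this (cipher, direction) pair;
/// false means this direction silently uses software kTLS instead.
fn nic_can_offload(caps: u32, cipher: u16, dir: Dir) -> bool {
    let required = match (cipher, dir) {
        (AES_GCM_128, Dir::Tx) => TX_AES_128_GCM,
        (AES_GCM_128, Dir::Rx) => RX_AES_128_GCM,
        (AES_GCM_256, Dir::Tx) => TX_AES_256_GCM,
        (AES_GCM_256, Dir::Rx) => RX_AES_256_GCM,
        (CHACHA20_POLY1305, Dir::Tx) => TX_CHACHA20_POLY1305,
        (CHACHA20_POLY1305, Dir::Rx) => RX_CHACHA20_POLY1305,
        _ => return false, // unknown cipher: software kTLS only
    };
    caps & required == required
}

fn main() {
    // A NIC that only offloads AES-128-GCM TX: asymmetric offload.
    let caps = TX_AES_128_GCM;
    assert!(nic_can_offload(caps, AES_GCM_128, Dir::Tx));
    assert!(!nic_can_offload(caps, AES_GCM_128, Dir::Rx)); // RX falls back
    assert!(!nic_can_offload(caps, CHACHA20_POLY1305, Dir::Tx));
}
```

The example illustrates the asymmetric-offload rule: each direction is tested independently, and a `false` result never fails the setsockopt(), it only selects the software path.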

Key rotation: When the TLS library rotates keys (TLS 1.3 key update message), it calls setsockopt(SOL_TLS, TLS_TX, new_crypto_info, ...) again with the new key. If the connection is offloaded to a NIC, UmkaOS calls the driver's .tls_dev_add() with a TLS_OFFLOAD_OP_UPDATE operation. The NIC must update its key atomically (no packet may be encrypted with an old key after the new key is installed). If the NIC cannot perform atomic key rotation, UmkaOS removes the offload and falls back to software kTLS for the remainder of the connection.


15.2 Network Overlay and Tunneling

Linux problem: Overlay networking (VXLAN, Geneve) was bolted onto the stack over many years. Bridge/veth code is complex and poorly isolated — a bug in the bridge module can crash the kernel.

UmkaOS design:

Tunnel protocols as umka-net modules — Each tunnel type runs as a Tier 1 module and implements a TunnelDevice trait:

/// Tunnel device interface for encapsulation protocols.
pub trait TunnelDevice: NetDevice {
    /// Tunnel-specific metadata (VNI, flow label, key, etc.).
    type Metadata: TunnelMetadata;

    /// Encapsulate an inner packet for transmission through the tunnel.
    fn encap(&self, inner: &Packet, metadata: &Self::Metadata) -> Result<Packet>;

    /// Decapsulate a received packet, returning the inner packet.
    fn decap(&self, outer: &Packet) -> Result<(Packet, Self::Metadata)>;

    /// Maximum overhead added by encapsulation (for MTU calculation).
    fn encap_overhead(&self) -> usize;
}

Supported tunnel protocols:

| Protocol | Description | Use case |
| --- | --- | --- |
| VXLAN | Virtual Extensible LAN (UDP port 4789) | Cloud overlay, OpenStack |
| Geneve | Generic Network Virtualization Encapsulation | OVN, next-gen cloud overlay |
| GRE/GRE6 | Generic Routing Encapsulation | Site-to-site tunnels |
| IPIP/SIT | IP-in-IP and IPv6-in-IPv4 | IPv6 transition |
| WireGuard | Modern VPN (ChaCha20-Poly1305) | Secure tunnels |
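As a concrete illustration of what a VXLAN TunnelDevice's encap() prepends, a minimal sketch of the 8-byte VXLAN header (RFC 7348); a real implementation would also prepend the outer Ethernet, IP, and UDP headers.

```rust
/// Sketch: the 8-byte VXLAN header (RFC 7348) carrying a 24-bit VNI.
/// A real VXLAN TunnelDevice::encap() would place outer Ethernet, IPv4,
/// and UDP (dst port 4789) headers in front of this.
fn vxlan_header(vni: u32) -> [u8; 8] {
    assert!(vni < (1 << 24), "VNI is a 24-bit identifier");
    let mut hdr = [0u8; 8];
    hdr[0] = 0x08; // flags: I bit set = VNI field is valid
    // Bytes 1-3 are reserved; the VNI occupies bytes 4-6; byte 7 reserved.
    hdr[4..7].copy_from_slice(&vni.to_be_bytes()[1..]);
    hdr
}

fn main() {
    assert_eq!(vxlan_header(0x123456), [0x08, 0, 0, 0, 0x12, 0x34, 0x56, 0]);
    // encap_overhead() for VXLAN over IPv4:
    // 14 (outer Eth) + 20 (IP) + 8 (UDP) + 8 (VXLAN) = 50 bytes
    assert_eq!(14 + 20 + 8 + vxlan_header(0).len(), 50);
}
```

The 50-byte figure is exactly what `encap_overhead()` must report for IPv4 VXLAN so that inner-MTU calculation avoids fragmentation.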

WireGuard tunnel specification (Noise IK handshake, ChaCha20-Poly1305 AEAD, key rotation, roaming) follows the WireGuard protocol specification (wireguard.com/protocol). Kernel integration details are deferred to Phase 3 implementation (Phase 3 is when the full TCP/IP stack and network drivers land; see Section 23.2.3):

  • Key storage: private keys are stored in kernel memory (not accessible from userspace after configuration) and zeroized on interface teardown.
  • Netlink interface: GENL_FAMILY "wireguard" with WG_CMD_SET_DEVICE and WG_CMD_GET_DEVICE — compatible with wg(8) and wg-quick.
  • Namespace interaction: WireGuard interfaces are namespace-aware (they can be moved between namespaces like any netdev).
  • Rekeying timers: REKEY_AFTER_MESSAGES (2^60), REKEY_AFTER_TIME (120 s), REJECT_AFTER_TIME (180 s) — per the upstream protocol spec.
  • Deferral rationale: WireGuard is a self-contained protocol module that plugs into the register_tunnel() framework above. No architectural changes are needed — only implementation of the Noise IK state machine and ChaCha20-Poly1305 AEAD (both provided by the crypto subsystem, Section 8.2).

Isolation tier: WireGuard runs as a Tier 1 driver (ring 0, hardware memory-domain isolated via MPK/POE/equivalent). The rationale:

  • WireGuard requires direct access to the network stack's packet path (NetBuf interface) for performance. The ring-crossing overhead of Tier 2 (~5–15 μs per batch) is unacceptable for a cryptographic tunnel that is otherwise a ~1–5 μs per-packet operation; routing every packet through a full user-ring boundary would eliminate the performance advantage of in-kernel tunneling.
  • Tier 1 isolation (hardware memory domain) still confines a WireGuard crash to a driver reload without causing a kernel panic, providing meaningful fault containment at lower cost than a full Tier 2 process boundary.
  • The WireGuard cryptographic state (device private key, peer public keys, session symmetric keys, handshake state machine) lives entirely within the WireGuard Tier 1 isolation domain and is never readable by other Tier 1 drivers or by Tier 0 code. Key zeroization on interface teardown is enforced before the domain is released.

Configuration: WireGuard interfaces are configured via the standard wg(8) / wg-quick(8) userspace tools using Linux-compatible Generic Netlink (GENL_FAMILY "wireguard", WG_CMD_* commands). No API changes from Linux — existing WireGuard tooling works without modification.

Software L2 switch — A Linux bridge equivalent in umka-net, supporting:

  • STP (Spanning Tree Protocol) for loop prevention
  • VLAN filtering (802.1Q tag-aware forwarding)
  • FDB (Forwarding Database) learning with configurable aging
  • Per-port traffic shaping

Virtual device pairs:

  • veth: virtual ethernet pairs for namespace connectivity. Required for Docker and Kubernetes pod networking. Each end of the pair lives in a different network namespace.
  • macvlan/ipvlan: lightweight container networking without bridges. macvlan assigns unique MACs per container; ipvlan shares the parent MAC and routes by IP.

VRF (Virtual Routing and Forwarding) — L3 domain isolation for multi-tenant routing. Each VRF has its own routing table and forwarding decisions, enabling multiple tenants to use overlapping IP ranges on the same host.

Hardware offload — Tunnel encap/decap can be offloaded to NIC hardware via KABI. This is the equivalent of Linux TC flower offload:

  • NIC firmware handles VXLAN/Geneve encap/decap in hardware.
  • umka-net falls back to the software path transparently if the NIC lacks offload support.
  • Offload rules are programmed via the same TunnelDevice trait (the NIC driver implements the trait with hardware acceleration).

XDP integration — XDP programs can inspect inner headers of tunneled packets via a "decap-before-XDP" mode:

  • Because XDP runs in the NIC driver before packets reach umka-net, XDP programs that need to see inner headers must explicitly call a BPF helper (e.g., bpf_xdp_decap()).
  • The helper invokes the NIC's hardware offload or a fast-path software decapsulator to strip the tunnel headers; the XDP program then sees the inner (original) packet headers.
  • This allows filtering and load-balancing decisions based on inner flow information, and avoids the Linux problem where XDP programs must manually parse tunnel headers.

Container networking compatibility — Docker bridge network mode and Kubernetes CNI plugins (Calico, Cilium, Flannel) must work without modification. This requires:

  • veth pair creation via netlink
  • Bridge port management via netlink
  • VXLAN device creation via netlink
  • iptables/nftables rules for masquerade and port mapping

All of these are covered by the netlink subsystem (Section 15.2.1) and BPF-based packet filtering (Section 15.2.2).

15.2.1 Netlink Subsystem

Netlink is the primary kernel-userspace IPC mechanism for network configuration. Docker, Kubernetes CNI plugins (Calico, Cilium, Flannel), iproute2 (ip route, ip addr, ip link), and NetworkManager all depend on netlink. UmkaOS implements netlink as a socket family within umka-net.

Socket family: AF_NETLINK sockets are created via socket(AF_NETLINK, SOCK_DGRAM, protocol). Each protocol family controls a different subsystem:

| Protocol | Purpose | Capability required |
| --- | --- | --- |
| NETLINK_ROUTE | Routes, addresses, links, neighbors, rules | CAP_NET_ADMIN for writes; reads are unprivileged |
| NETLINK_AUDIT | Audit event delivery (see Section 19.2.9) | CAP_AUDIT_READ |
| NETLINK_KOBJECT_UEVENT | Device hotplug events (udev) | Unprivileged (receive only) |
| NETLINK_GENERIC | Generic extensible netlink (genetlink) | Per-family capability check |
| NETLINK_NETFILTER | Conntrack entry dump/delete/event, nftables rule management. Required by Docker, Kubernetes kube-proxy, and conntrack-tools. | CAP_NET_ADMIN |

Message format: Every netlink message starts with an nlmsghdr (16 bytes):

/// Netlink message header (matches Linux struct nlmsghdr exactly).
#[repr(C)]
pub struct NlMsgHdr {
    /// Total message length including header.
    pub nlmsg_len: u32,
    /// Message type (RTM_NEWROUTE, RTM_DELADDR, etc.).
    pub nlmsg_type: u16,
    /// Flags (NLM_F_REQUEST, NLM_F_DUMP, NLM_F_ACK, etc.).
    pub nlmsg_flags: u16,
    /// Sequence number (for request/response matching).
    pub nlmsg_seq: u32,
    /// Sending process port ID (0 = kernel).
    pub nlmsg_pid: u32,
}
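A worked example of the header in use: serializing the request that `ip link` sends to dump all interfaces. This is a sketch, not the umka-net encoder; the message-type and flag constants are the Linux uapi values, and netlink headers use host byte order.

```rust
/// Sketch: serialize an NlMsgHdr for an RTM_GETLINK dump request.
/// Constants are the Linux uapi values (linux/rtnetlink.h, linux/netlink.h).
const RTM_GETLINK: u16 = 18;
const NLM_F_REQUEST: u16 = 0x0001;
const NLM_F_DUMP: u16 = 0x0100 | 0x0200; // NLM_F_ROOT | NLM_F_MATCH

fn getlink_dump_request(seq: u32, payload_len: u32) -> Vec<u8> {
    let nlmsg_len: u32 = 16 + payload_len; // header + payload (ifinfomsg)
    let mut buf = Vec::with_capacity(16);
    buf.extend_from_slice(&nlmsg_len.to_ne_bytes());
    buf.extend_from_slice(&RTM_GETLINK.to_ne_bytes());
    buf.extend_from_slice(&(NLM_F_REQUEST | NLM_F_DUMP).to_ne_bytes());
    buf.extend_from_slice(&seq.to_ne_bytes());
    buf.extend_from_slice(&0u32.to_ne_bytes()); // nlmsg_pid: 0 = kernel assigns
    buf
}

fn main() {
    // A 16-byte ifinfomsg payload follows the header in a real request.
    let msg = getlink_dump_request(1, 16);
    assert_eq!(msg.len(), 16); // only the header is serialized here
    assert_eq!(u32::from_ne_bytes([msg[0], msg[1], msg[2], msg[3]]), 32);
    assert_eq!(u16::from_ne_bytes([msg[4], msg[5]]), RTM_GETLINK);
}
```

The kernel answers a dump request with a multipart stream of RTM_NEWLINK messages terminated by NLMSG_DONE, with each reply carrying the request's sequence number for matching.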

Messages are followed by type-specific payload structs (ifinfomsg for links, ifaddrmsg for addresses, rtmsg for routes) and nested TLV attributes (rtattr). umka-net implements the full NETLINK_ROUTE message set required for container networking:

  • Link management: RTM_NEWLINK, RTM_DELLINK, RTM_GETLINK — create/destroy/query veth pairs, bridges, VXLAN devices, macvlan/ipvlan
  • Address management: RTM_NEWADDR, RTM_DELADDR, RTM_GETADDR — assign/remove IPv4/IPv6 addresses
  • Route management: RTM_NEWROUTE, RTM_DELROUTE, RTM_GETROUTE — manipulate routing tables (including per-VRF tables)
  • Neighbor management: RTM_NEWNEIGH, RTM_DELNEIGH, RTM_GETNEIGH — ARP/NDP neighbor table entries
  • Rule management: RTM_NEWRULE, RTM_DELRULE — policy routing rules

Capability gating: Netlink write operations require the appropriate capability in the caller's network namespace. Read operations and multicast group subscriptions are unprivileged, matching Linux semantics. This ensures that unprivileged containers can observe network state but cannot modify it without explicit capability grants.

Multicast groups: Processes subscribe to multicast groups (e.g., RTNLGRP_LINK, RTNLGRP_IPV4_ROUTE) to receive asynchronous notifications of network state changes. This is how ip monitor, container runtimes, and NetworkManager track link state.

15.2.2 Packet Filtering (BPF-Based)

UmkaOS does not implement a separate nftables or iptables subsystem. Packet filtering uses the BPF-based filtering infrastructure described in Section 18.1.4 (eBPF).

Architecture: All packet filtering hooks (prerouting, input, forward, output, postrouting) are BPF attachment points. BPF programs attached to these hooks perform the equivalent of iptables/nftables rules: matching on headers, NATing, dropping, marking, and logging.

nftables/iptables compatibility: The syscall interface (Section 18.1) translates legacy iptables and nftables rule manipulations (setsockopt for iptables, netlink NFT_MSG_* for nftables) into equivalent BPF programs that are compiled and attached to the appropriate hooks. This translation happens transparently:

  • iptables -t nat -A POSTROUTING -s 10.0.0.0/8 -j MASQUERADE is translated to a BPF program attached to the postrouting hook that performs source NAT
  • nft add rule ip filter input tcp dport 80 accept is translated to a BPF program attached to the input hook

This approach provides Docker/Kubernetes compatibility (which depend on iptables/nftables for port mapping, masquerade, and network policy) without maintaining a separate packet filtering subsystem. The BPF JIT ensures that translated rules execute at native speed.

Connection tracking (conntrack): Stateful NAT (MASQUERADE, DNAT, SNAT) requires tracking connection state to map return packets back to the original source. UmkaOS implements connection tracking as a BPF-accessible hash map maintained by umka-net:

/// 5-tuple identifying one direction of a connection.
/// For ICMP, `src_port` and `dst_port` are repurposed as type and code.
#[repr(C)]
pub struct ConntrackTuple {
    /// Source IP address (IPv4-mapped-IPv6 for v4, native for v6).
    pub src_addr: [u8; 16], // 16 bytes raw array to guarantee layout
    /// Destination IP address.
    pub dst_addr: [u8; 16], // 16 bytes
    /// Source port (or ICMP type).
    pub src_port: u16,
    /// Destination port (or ICMP code).
    pub dst_port: u16,
    /// IP protocol number (TCP=6, UDP=17, ICMP=1, ICMPv6=58, etc.).
    pub protocol: u8,
    /// Explicit padding byte to avoid implicit alignment padding between
    /// `protocol` (u8 at offset 36) and `zone` (u16 at offset 38).
    /// Without this field, `#[repr(C)]` inserts 1 byte of uninitialized
    /// padding that would corrupt Jenkins hash if hashed as raw bytes.
    /// This field is always zero and included in the hash.
    pub _pad: u8,
    /// Conntrack zone (u16, matches Linux `nf_conntrack` zones). Part of the
    /// hash key to allow overlapping IP ranges in different network namespaces
    /// to coexist without collisions. Two entries with identical 5-tuples but
    /// different zones are distinct connections.
    pub zone: u16,
}
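The layout claims in the comments above (40 bytes, alignment 2, no implicit padding once `_pad` is explicit) can be checked mechanically. A sketch with a local copy of the struct:

```rust
// Local copy of ConntrackTuple, reproduced to check the layout claims.
#[repr(C)]
struct ConntrackTuple {
    src_addr: [u8; 16],
    dst_addr: [u8; 16],
    src_port: u16,
    dst_port: u16,
    protocol: u8,
    _pad: u8,
    zone: u16,
}

fn main() {
    use core::mem::{align_of, size_of};
    // 16 + 16 + 2 + 2 + 1 + 1 + 2 = 40 bytes with alignment 2: the
    // explicit `_pad` byte leaves no implicit padding, so the struct can
    // be hashed as a raw byte slice without touching uninitialized bytes.
    assert_eq!(size_of::<ConntrackTuple>(), 40);
    assert_eq!(align_of::<ConntrackTuple>(), 2);
}
```

Dropping `_pad` would not change the size (the compiler would insert the same byte before `zone`), but that byte would be uninitialized and poison a raw-bytes hash, which is exactly what the explicit field prevents.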

/// Connection tracking entry.
/// Keyed by the (protocol, src_ip, src_port, dst_ip, dst_port) 5-tuple plus the conntrack zone.
#[repr(C)]
pub struct ConntrackEntry {
    /// Original direction 5-tuple.
    pub original: ConntrackTuple,
    /// Reply direction 5-tuple (after NAT translation).
    pub reply: ConntrackTuple,
    /// Connection state (NEW, ESTABLISHED, RELATED, INVALID, UNTRACKED).
    pub state: ConntrackState,
    /// NAT type applied (SNAT, DNAT, MASQUERADE, or None).
    pub nat_type: NatType,
    /// Conntrack zone — duplicated from ConntrackTuple for fast access during
    /// NAT and accounting without dereferencing the tuple. Must always equal
    /// `original.zone`.
    pub zone: u16,
    /// Network namespace inode number of the process that created this connection.
    /// Set when the connection is first tracked; never changes afterwards.
    ///
    /// Used by `bpf_ct_lookup()` to enforce namespace isolation: a BPF program
    /// running in network namespace A must not observe conntrack entries from
    /// namespace B. The filter `entry.net_ns_inum == caller_ns.inum` enforces
    /// this boundary. Cross-namespace access requires `CAP_NET_ADMIN` in the
    /// initial network namespace AND the `BPF_F_CONNTRACK_GLOBAL` flag.
    pub net_ns_inum: u64,
    /// Connection mark (set by iptables CONNMARK target). Used by Kubernetes
    /// kube-proxy for service routing and by iptables CONNMARK save/restore.
    pub mark: u32,
    /// Timeout (nanoseconds since boot). Entry is garbage-collected after expiry.
    pub timeout_ns: u64,
    /// Packet/byte counters (for accounting). AtomicU64 because counters are
    /// updated on the forwarding path under RCU (no per-entry lock held) and
    /// read by BPF programs and conntrack dumps concurrently.
    pub packets_original: AtomicU64,
    pub packets_reply: AtomicU64,
    pub bytes_original: AtomicU64,
    pub bytes_reply: AtomicU64,
}

The conntrack table is a concurrent hash map with per-bucket spinlocks and RCU-protected lookup (matching Linux's nf_conntrack design: a global hash table with per-bucket locking, not per-CPU sharding, because connection state must be visible across all CPUs for NAT reply-direction lookups).

Conntrack hash table scalability design:

The hash table uses Jenkins hash (same as Linux nf_conntrack) over the 5-tuple + zone, distributing entries uniformly across buckets. The design separates the read path (hot, lockless) from the write path (rare, per-bucket locked):

Read path (packet lookup — hot, every packet):
  1. Compute hash(5-tuple, zone) → bucket index
  2. rcu_read_lock()
  3. Walk bucket chain (RCU-protected linked list), compare 5-tuples
  4. Return ConntrackEntry pointer (valid under RCU read-side)
  5. rcu_read_unlock()
  Cost: ~40-80 ns (hash + 1-2 pointer chases, no atomics, no locks)

Write path (new connection — rare, ~1 per 1000 packets for typical HTTP):
  1. Compute hash → bucket index
  2. spin_lock(&bucket[idx].lock)
  3. Allocate ConntrackEntry from per-CPU slab cache (no global lock)
  4. Insert at head of bucket chain (RCU publish: rcu_assign_pointer)
  5. spin_unlock(&bucket[idx].lock)
  Cost: ~200-400 ns (lock + slab alloc + RCU publish)

Conntrack table sizing: UmkaOS's conntrack hash table size is determined at boot from available physical memory and is runtime-resizable:

initial_buckets = clamp(
    next_power_of_two(system_ram_bytes / 65536),
    65_536,       // minimum: 64K buckets
    16_777_216    // maximum: 16M buckets
)

Example: 4 GB RAM → 65536 buckets; 64 GB → 1048576; 256 GB → 4194304.
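The sizing rule can be expressed directly; the sketch below reproduces the clamp formula and the worked examples:

```rust
/// Sketch of the boot-time conntrack bucket sizing formula above.
fn initial_buckets(system_ram_bytes: u64) -> u64 {
    (system_ram_bytes / 65_536)
        .next_power_of_two()
        .clamp(65_536, 16_777_216)
}

fn main() {
    const GIB: u64 = 1 << 30;
    assert_eq!(initial_buckets(4 * GIB), 65_536);
    assert_eq!(initial_buckets(64 * GIB), 1_048_576);
    assert_eq!(initial_buckets(256 * GIB), 4_194_304);
    // Clamp bounds: tiny and huge systems hit the floor and ceiling.
    assert_eq!(initial_buckets(512 << 20), 65_536);     // 512 MiB
    assert_eq!(initial_buckets(4096 * GIB), 16_777_216); // 4 TiB
}
```

Because RAM sizes are powers of two in practice, next_power_of_two is usually a no-op; it matters only for odd memory configurations, keeping the bucket count a power of two so the hash can mask instead of divide.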

Runtime resize: The table doubles when the running average chain length exceeds 8 (monitored per-bucket via exponentially-weighted moving average), and halves when the average falls below 2 for more than 60 continuous seconds. Resize is RCU-safe: a new table is allocated, all entries rehashed with RCU-protected pointer update, and the old table freed after a grace period. No packet drops during resize.

The kernel tunable conntrack.max_buckets overrides the boot-time calculation (requires CAP_NET_ADMIN). For Linux compatibility, the boot parameter nf_conntrack_buckets=N is also accepted. The maximum connection count is capped by nf_conntrack_max (default: conntrack_buckets × 4).

Memory per bucket: ~24 bytes (spinlock + RCU list head + counter). Memory per entry: ~200 bytes (ConntrackEntry + RCU callback + slab metadata). Total memory at maximum fill is dominated by entries, not buckets.

Contention analysis for 256+ CPUs: The per-bucket spinlock is the only serialization point. Under uniform hash distribution, the probability of two CPUs contending on the same bucket during insertion is:

P(contention) = (insert_rate × lock_hold_time) / num_buckets

For 256 CPUs each creating 10K connections/sec (2.56M total inserts/sec), with ~200 ns lock hold time and 1048576 buckets (typical on a 64 GB system per the memory-based formula):

P(contention) = (2.56M × 200ns) / 1048576 = 0.51 / 1048576 ≈ 0.0000005 per insert

This means contention occurs approximately once per 2,000,000 inserts — negligible. At 10M inserts/sec (extreme load), contention rises to ~once per 500,000 inserts, still well within acceptable limits. Larger memory gives more buckets, so contention decreases further on memory-rich systems.
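The arithmetic above can be reproduced directly:

```rust
/// Per-insert contention probability from the formula above.
fn p_contention(inserts_per_sec: f64, lock_hold_ns: f64, buckets: f64) -> f64 {
    inserts_per_sec * (lock_hold_ns * 1e-9) / buckets
}

fn main() {
    // 256 CPUs x 10K conn/s = 2.56M inserts/s, 200 ns hold, 1048576 buckets:
    let p = p_contention(2.56e6, 200.0, 1_048_576.0);
    assert!((p - 4.88e-7).abs() < 1e-8);          // ~0.0000005 per insert
    assert!((1.0 / p - 2_048_000.0).abs() < 1.0); // ~once per 2M inserts
    // Extreme load: 10M inserts/sec -> ~once per 500K inserts.
    let p10 = p_contention(1e7, 200.0, 1_048_576.0);
    assert!((1.0 / p10 - 524_288.0).abs() < 1.0);
}
```

Doubling the bucket count halves the probability, which is why the memory-based sizing formula directly translates into lower contention on larger machines.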

Scaling beyond 10M connections/sec: For extreme-scale deployments (512+ CPUs, 10M+ new connections/sec), two additional strategies are available:

  1. Per-namespace sharding: Each network namespace has its own conntrack table. Traffic sharded across N namespaces yields N independent hash tables, each with 1/N the contention. Kubernetes pod networking naturally provides this sharding (each pod has its own network namespace). The kernel creates a new conntrack table per namespace via struct net→ct (same as Linux), so no additional design is required — the sharding is automatic.

  2. Percpu insertion batching: For workloads with extremely high short-lived connection rates (SYN floods, UDP scanning), insertions can be batched per-CPU and flushed to the global table periodically. This trades insertion latency (~1ms batch interval) for reduced lock contention. Enabled via umka.net.conntrack.batch_insert=1 (default: disabled, as most workloads don't need it).

Table saturation policy: When the conntrack table reaches its maximum entry count (umka.net.conntrack.max, default: conntrack_buckets × 4, tunable), new connection attempts receive -ENOMEM from bpf_ct_insert(). The BPF program decides the policy: drop the packet (default for SYN flood protection) or allow it untracked (stateless fallback). Under sustained SYN flood conditions (10-100x normal rate), the percpu batching mode and early drop heuristic (evict the oldest unassured connection in the target bucket) prevent table-full drops for legitimate traffic.

Hash distribution under NAT pools: When a small NAT pool (e.g., 4 public IPs) serves a large private network, reverse-flow lookups (external → internal) all target the same ~4 destination addresses, which would create a hot-bucket problem if the hash covered addresses alone. Mitigation: the bucket index is computed over the full 5-tuple including both ports, so entries sharing a pool IP are still scattered across buckets by their translated port numbers. For aggressive scanning workloads that fix the source IP and port, the administrator can enable umka.net.conntrack.nat_pool_scatter=1, which additionally hashes on a per-session nonce, breaking the hot-bucket skew at the cost of one extra memory access per lookup.

Garbage collection: Expired entries are reclaimed by a per-CPU GC thread that scans its local slab and removes entries whose timeout_ns has passed. GC runs every umka.net.conntrack.gc_interval_ms (default: 1000ms). Removal holds the bucket lock briefly (~100 ns) and uses call_rcu() to defer freeing until all RCU readers have completed.

BPF programs at the prerouting and postrouting hooks query and update conntrack entries via BPF helper functions (bpf_ct_lookup(), bpf_ct_insert(), bpf_ct_set_nat()). This integrates with the BPF-based packet filtering: a MASQUERADE BPF program creates a conntrack entry with SNAT on the outgoing path; the prerouting hook automatically reverses the NAT for return packets by looking up the conntrack entry.

BPF conntrack access: helper-only, no direct mapping.

BPF programs access conntrack state exclusively via the bpf_ct_lookup() helper. The conntrack hash table is NOT mapped read-only into the BPF address space.

The helper enforces namespace isolation automatically: a BPF program attached to a network interface in namespace N sees only conntrack entries belonging to namespace N. The attachment point determines the filtering context — no explicit namespace argument needed, and no bypass possible.

The ~50–100 ns per-call overhead of the helper is negligible relative to per-packet processing cost. This is a deliberate UmkaOS design decision: the Linux optimization of mapping the full conntrack table read-only into BPF space creates a namespace isolation bypass — a BPF program can walk the raw hash table to enumerate all connections across all namespaces, violating container isolation in Kubernetes multi-tenant environments. UmkaOS eliminates this attack surface from day one.

BPF helper isolation model: The general isolation rules for BPF programs in the networking stack (and all other subsystems) are:

  1. Domain confinement: Each BPF program executes in a dedicated BPF isolation domain (Section 18.1.4), separate from both umka-core and the driver or subsystem that loaded it. An XDP program attached to a NIC driver does not run in the driver's domain — it runs in its own BPF domain and accesses driver or subsystem state only through verified BPF helpers, which perform cross-domain access on the program's behalf. This means a verifier bug in a BPF program cannot compromise the NIC driver's memory or umka-net's internal state. The map access control (rule 2) and capability-gated helpers (rule 3) are enforced by this domain boundary, not solely by the verifier's static analysis.

  2. Map access control: BPF maps are owned by the isolation domain that created them. A BPF program can only access maps owned by its own domain. Cross-domain map sharing is explicit: the owning domain grants a capability (with MAP_READ, MAP_WRITE, or both permission bits) to the target domain via the standard capability delegation mechanism (Section 8.1.1). The verifier rejects programs that reference map file descriptors for which the loading domain does not hold a valid capability.

  3. Capability-gated helpers: BPF helpers that access kernel state beyond the program's own domain require the BPF domain to hold the corresponding capability. For example: bpf_sk_lookup() (socket table lookup) requires CAP_NET_LOOKUP; bpf_fib_lookup() (route table lookup) requires CAP_NET_ROUTE_READ; bpf_ct_lookup() / bpf_ct_insert() require CAP_NET_CONNTRACK. Enforcement is dual: the verifier rejects programs at load time if the BPF domain does not hold the required capabilities (see rule 5), and the eBPF runtime re-checks the domain's capability set at helper invocation time. The runtime check is necessary because capabilities can be revoked after a program is loaded (Section 8.1.1) — without it, a revoked capability would remain effective until the program is explicitly unloaded.

  4. Cross-domain packet redirect: XDP redirect actions (XDP_REDIRECT, bpf_redirect_map()) that forward a packet to an interface in a different driver's isolation domain require the source domain to hold CAP_NET_REDIRECT for the target interface. Without this capability, the redirect returns -EACCES and the packet is dropped. This prevents a compromised NIC driver from injecting traffic into another driver's domain.

XDP Redirect Rate Limiting:

XDP programs can redirect frames to any network interface, including loopback and physical interfaces. Without rate limiting, a malicious or buggy XDP program can saturate links at line rate.

UmkaOS enforces:

  1. Redirect to the same interface (hairpin): always allowed.
  2. Redirect to another interface in the root network namespace: requires CAP_NET_ADMIN. Unrestricted redirect in the root ns is intentional (the root ns is trusted).
  3. Redirect to another interface in a non-root network namespace:
     • The target interface must have a configured TX rate limit (ip link set dev X rate <limit>bps via netlink).
     • If no rate limit is configured, the redirect is rejected with XDP_ABORTED and a log message: "XDP redirect blocked: no rate limit on [interface]".
     • The rate limit is enforced by the token-bucket scheduler already present in the UmkaOS Tier 1 network stack (Section 15.5).

Rationale: CAP_NET_ADMIN is required in the root ns because it controls physical hardware. In tenant namespaces, the rate limit is the safety valve — tenants can use XDP redirect but cannot monopolize shared physical links.
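A minimal token-bucket sketch of the per-interface TX limiter described above (byte-granularity; this is not the Section 15.5 scheduler itself):

```rust
/// Minimal token bucket: `rate` bytes/sec sustained, `burst` bytes peak.
struct TokenBucket {
    tokens: f64,  // currently available bytes
    burst: f64,   // bucket capacity (maximum burst)
    rate: f64,    // refill rate, bytes per second
    last_ns: u64, // timestamp of the last refill
}

impl TokenBucket {
    fn new(rate_bytes_per_sec: f64, burst_bytes: f64) -> Self {
        TokenBucket {
            tokens: burst_bytes,
            burst: burst_bytes,
            rate: rate_bytes_per_sec,
            last_ns: 0,
        }
    }

    /// Returns true if a `bytes`-sized frame may be redirected now.
    fn try_send(&mut self, now_ns: u64, bytes: f64) -> bool {
        let dt = (now_ns - self.last_ns) as f64 * 1e-9;
        self.last_ns = now_ns;
        self.tokens = (self.tokens + dt * self.rate).min(self.burst);
        if self.tokens >= bytes {
            self.tokens -= bytes;
            true
        } else {
            false // frame rejected (the XDP_ABORTED path)
        }
    }
}

fn main() {
    // 1000 B/s with a 100 B burst.
    let mut tb = TokenBucket::new(1000.0, 100.0);
    assert!(tb.try_send(0, 100.0));           // full burst allowed
    assert!(!tb.try_send(0, 1.0));            // bucket empty, frame dropped
    assert!(tb.try_send(100_000_000, 100.0)); // 0.1 s later: 100 B refilled
}
```

The burst parameter bounds how far a tenant can exceed the sustained rate in any instant, which is the property that keeps XDP redirect from monopolizing a shared physical link.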

  5. Verifier enforcement: The verifier enforces constraints (2)–(4) at program load time by checking the BPF domain's capability set against the program's map references and helper calls. Programs that reference inaccessible maps or call helpers requiring capabilities the domain does not hold are rejected before JIT compilation. This is a static gate: it prevents unauthorized programs from being loaded in the first place. Runtime capability checks at helper invocation time (rule 3) serve a distinct purpose: they enforce capability revocation for already-loaded programs. Both mechanisms are primary for their respective concerns — load-time verification prevents unauthorized loading, and runtime checks ensure revocation takes immediate effect.

Linux compatibility: The conntrack subsystem exposes /proc/net/nf_conntrack and the NETLINK_NETFILTER netlink family for userspace tools (conntrack -L, conntrack -D). Docker and Kubernetes depend on conntrack for NAT state visibility.

Advantages over separate subsystems: A single filtering mechanism (BPF) eliminates the complexity of maintaining iptables, ip6tables, ebtables, arptables, and nftables as separate subsystems — a major source of bugs and inconsistencies in Linux networking. Connection tracking is the sole stateful component, shared by all BPF-translated NAT rules regardless of their original iptables/nftables syntax.


15.3 Network Interface Naming

Linux problem: Network interface naming was chaotic (eth0 could be different NICs each boot). systemd's "predictable names" (enp0s3, etc.) partially fixed this but introduced confusing names and edge cases.

UmkaOS design:

  • Deterministic, stable device naming in sysfs based on physical topology (bus/slot/function) from the first boot.
  • The device manager assigns stable names based on firmware (ACPI, Device Tree) hints first, then physical topology, then driver enumeration order as a last resort.
  • User-defined naming rules via a declarative config (similar to udev rules but simpler).
  • Network namespaces get their own independent naming scope (see Section 16.1 for network namespace architecture).

15.3.1 AF_UNIX Socket Specification

Unix domain sockets provide local inter-process communication with semantics that differ from network sockets. UmkaOS implements the full Linux AF_UNIX interface for compatibility with systemd, D-Bus, X11/Wayland, and container runtimes.

Socket types:

| Type | Semantics | Use case |
|---|---|---|
| SOCK_STREAM | Byte stream, in-order, reliable | D-Bus, systemd socket activation |
| SOCK_DGRAM | Datagram, message-boundary-preserving; reliable for AF_UNIX (in-kernel, no packet loss), unreliable for AF_INET/AF_INET6 UDP | Logging, low-overhead IPC (AF_UNIX); DNS/NTP (UDP) |
| SOCK_SEQPACKET | Message-preserving stream, in-order, reliable | Protocol-framed IPC (e.g., varlink) |

Address format:

/// Unix domain socket address (matches Linux struct sockaddr_un).
/// Path sockets start with a non-NUL byte; abstract sockets start with NUL.
#[repr(C)]
pub struct SockAddrUnix {
    /// Address family (AF_UNIX = 1).
    pub sun_family: u16,
    /// Path name or abstract name.
    /// - Path socket: null-terminated filesystem path (max 107 bytes plus the terminating NUL)
    /// - Abstract socket: sun_path[0] = '\0', followed by abstract name (no filesystem entry)
    /// The Linux limit is 108 bytes total (sizeof(sockaddr_un) - 2 for sun_family).
    pub sun_path: [u8; 108],
}

Abstract namespace: Names starting with \0 (e.g., \0com.example.app) exist independently of the filesystem. Abstract sockets are destroyed when the last reference closes and are not affected by filesystem operations (unlink, rename). They are scoped to the network namespace (Section 16.1), providing isolation between containers.
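To make the two address forms concrete, a hypothetical user-space encoder (not part of the kernel API; `unix_addr` is an invented helper) could fill sun_path like this:

```rust
pub const AF_UNIX: u16 = 1;

/// Reproduction of the SockAddrUnix layout above, so the example stands alone.
#[repr(C)]
pub struct SockAddrUnix {
    pub sun_family: u16,
    pub sun_path: [u8; 108],
}

/// Hypothetical helper: encode a pathname or abstract name.
/// Returns the address and the `addrlen` to pass to bind(2); the length is
/// what tells the kernel how much of an abstract name is significant.
pub fn unix_addr(name: &[u8], abstract_ns: bool) -> Option<(SockAddrUnix, usize)> {
    if name.len() > 107 {
        return None; // leave room for the leading NUL or the terminator
    }
    let mut sun_path = [0u8; 108];
    let addrlen = if abstract_ns {
        // Abstract: sun_path[0] stays '\0', name follows, no terminator.
        sun_path[1..1 + name.len()].copy_from_slice(name);
        2 + 1 + name.len() // sun_family + NUL prefix + name
    } else {
        // Path socket: NUL-terminated filesystem path.
        sun_path[..name.len()].copy_from_slice(name);
        2 + name.len() + 1 // sun_family + path + NUL
    };
    Some((SockAddrUnix { sun_family: AF_UNIX, sun_path }, addrlen))
}
```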

Control messages (SCM_RIGHTS, SCM_CREDENTIALS):

/// Ancillary data types for AF_UNIX sockets.
pub enum UnixControlMsg {
    /// Pass file descriptors to the receiver.
    /// The sender's fd table entries are duplicated into the receiver's fd table.
    /// The sender's fds remain open after sendmsg() returns; the receiver
    /// holds independent duplicates (transfer is a copy, not a move).
    /// Receives as: cmsg_level = SOL_SOCKET, cmsg_type = SCM_RIGHTS,
    ///              cmsg_data = [i32; N] (array of fds)
    ScmRights {
        /// File descriptors to duplicate (max 253 per message, matching Linux SCM_MAX_FD).
        fds: [i32; 253],
        /// Number of valid entries in fds.
        count: usize,
    },

    /// Send sender's credentials to the receiver.
    /// Works on all Unix socket types (SOCK_STREAM, SOCK_DGRAM, SOCK_SEQPACKET).
    /// On SOCK_STREAM, at least one byte of non-ancillary data must accompany the message.
    /// The receiver can validate the sender's identity.
    /// Receives as: cmsg_level = SOL_SOCKET, cmsg_type = SCM_CREDENTIALS,
    ///              cmsg_data = struct ucred
    ScmCredentials {
        /// Sender's PID in the receiver's PID namespace (translated if different).
        pid: i32,
        /// Sender's UID in the receiver's user namespace.
        uid: u32,
        /// Sender's GID in the receiver's user namespace.
        gid: u32,
    },
}

/// Credential structure for SCM_CREDENTIALS (matches Linux struct ucred).
#[repr(C)]
pub struct UCred {
    pub pid: i32,
    pub uid: u32,
    pub gid: u32,
}

SCM_CREDENTIALS outgoing validation: When a process sends SCM_CREDENTIALS via sendmsg(), the kernel validates the supplied fields before transmission (matching Linux's scm_check_creds()):

  • pid: Must equal the sender's real PID (in the sender's PID namespace). Spoofing a different PID requires CAP_SYS_ADMIN in the sender's user namespace.
  • uid: Must equal the sender's real, effective, or saved-set UID. Spoofing a different UID requires CAP_SETUID in the sender's user namespace.
  • gid: Must equal the sender's real, effective, or saved-set GID. Spoofing a different GID requires CAP_SETGID in the sender's user namespace.

Without this validation, any process could forge arbitrary credentials — a critical security hole for D-Bus, systemd, and all services relying on SO_PEERCRED.
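A sketch of this check, mirroring Linux's scm_check_creds() logic over simplified task credentials (the `TaskCreds` struct and its boolean capability fields are stand-ins for the real task state):

```rust
/// Simplified sender credentials for the example.
pub struct TaskCreds {
    pub pid: i32,
    pub uids: [u32; 3], // real, effective, saved-set
    pub gids: [u32; 3], // real, effective, saved-set
    pub cap_sys_admin: bool,
    pub cap_setuid: bool,
    pub cap_setgid: bool,
}

/// Validate outgoing SCM_CREDENTIALS fields against the sender's identity.
pub fn scm_check_creds(sender: &TaskCreds, pid: i32, uid: u32, gid: u32) -> Result<(), i32> {
    const EPERM: i32 = 1;
    // PID must be the sender's own unless it may impersonate (CAP_SYS_ADMIN).
    if pid != sender.pid && !sender.cap_sys_admin {
        return Err(-EPERM);
    }
    // UID must be one of real/effective/saved-set unless CAP_SETUID.
    if !sender.uids.contains(&uid) && !sender.cap_setuid {
        return Err(-EPERM);
    }
    // GID likewise, unless CAP_SETGID.
    if !sender.gids.contains(&gid) && !sender.cap_setgid {
        return Err(-EPERM);
    }
    Ok(())
}
```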

SO_PEERCRED: The getsockopt(SOL_SOCKET, SO_PEERCRED, ...) call retrieves the credentials of the peer process at connect() time. This is the standard authentication mechanism for D-Bus and systemd. The credentials are snapshotted when the connection is established and do not change if the peer later calls setuid() or exits.

Socketpair: socketpair(AF_UNIX, type, 0, sv) creates a connected pair of unnamed sockets. Both ends are interchangeable (no client/server distinction). Used for pthreads IPC, async I/O notification pipes, and subprocess communication.

Autobind: Binding to an empty address (sun_path[0] = '\0' with length 2) triggers autobind, which assigns a unique abstract name \0<inode>. This is used for unnamed socket peers that need a bindable address for getsockname().


15.4 Pluggable TCP Congestion Control

Linux parallel: Linux exposes tcp_congestion_ops as a loadable module API. UmkaOS provides the same extensibility through the CongestionOps trait registered in umka-net's congestion control registry. Section 15.1.4 introduced BBR and the trait outline; this section specifies the full interface, registration lifecycle, per-socket selection, and the data structures that congestion algorithms receive.

15.4.1 CongestionOps Trait (Full Specification)

/// Congestion control algorithm interface.
///
/// Each algorithm is a stateless descriptor (`&'static dyn CongestionOps`).
/// Per-connection algorithm state lives in `TcpCb.cong_priv` (64 bytes inline,
/// heap-allocated if larger).
///
/// Methods marked `optional` have a default no-op implementation.
/// The TCP engine calls every method from within umka-net's isolation domain;
/// no domain crossing is required.
pub trait CongestionOps: Send + Sync {
    /// Algorithm name (ASCII, null-terminated, max 16 bytes including NUL).
    /// Used by TCP_CONGESTION sockopt and /proc/sys/net/ipv4/tcp_congestion_control.
    fn name(&self) -> &'static str;

    /// Capability flags declared by this algorithm.
    fn flags(&self) -> CaFlags;

    /// Called when a new TCP connection is allocated and this algorithm is selected.
    /// Initialise per-connection state in `cb.cong_priv`.
    fn init(&self, cb: &mut TcpCb);

    /// Called when the connection is destroyed or a different algorithm is selected.
    /// Release per-connection resources allocated in `init`.
    fn release(&self, cb: &mut TcpCb);

    /// Return the slow-start threshold for the current congestion event.
    /// Must return a value >= 2*MSS; never returns 0 (TCP engine enforces).
    fn ssthresh(&self, cb: &mut TcpCb) -> u64;

    /// Called on each ACK in the congestion-avoidance phase.
    ///
    /// `ack` is the acknowledged sequence number.
    /// `acked` is the number of bytes newly acknowledged by this ACK.
    ///
    /// Typical CUBIC/Reno implementation: advance cwnd by acked/cwnd per ACK.
    fn cong_avoid(&self, cb: &mut TcpCb, ack: u32, acked: u32);

    /// Optional full-ACK processing hook (used by BBR, not by Reno/CUBIC).
    ///
    /// Called instead of `cong_avoid` when `CaFlags::CA_FLAG_FULL_CONTROL` is set.
    /// Provides the full `TcpAck` structure for pacing-based algorithms.
    ///
    /// Default implementation: delegates to `cong_avoid(cb, ack.ack_seq, ack.bytes_acked)`.
    fn cong_control(&self, cb: &mut TcpCb, ack: &TcpAck) {
        self.cong_avoid(cb, ack.ack_seq, ack.bytes_acked);
    }

    /// Notify algorithm of a TCP state transition.
    ///
    /// Called when the connection enters a new state (e.g., CongState::Loss
    /// on RTO or triple-duplicate ACK). Algorithms may reset cwnd or adjust
    /// internal state here.
    fn set_state(&self, cb: &mut TcpCb, new_state: CongState);

    /// Notify algorithm of a discrete congestion event.
    ///
    /// Called for events that do not change TCP state but affect the congestion
    /// algorithm (e.g., the sender starts transmitting after an idle period).
    fn cwnd_event(&self, cb: &mut TcpCb, ev: CaEvent);

    /// Optional: called after SACK processing with a per-ACK rate sample.
    ///
    /// Used by RTT-based algorithms (BBR) that need delivery-rate estimation.
    /// Only called when `CaFlags::CA_FLAG_RTT_BASED` is set in `flags()`.
    ///
    /// Default implementation: no-op.
    fn pkts_acked(&self, cb: &mut TcpCb, sample: &RateSample) {
        let _ = (cb, sample);
    }

    /// Return the cwnd value to restore on an undo event (spurious RTO).
    ///
    /// Called when F-RTO or DSACK identifies a retransmit as spurious.
    /// Must return a value >= the current cwnd (undo cannot reduce cwnd).
    fn undo_cwnd(&self, cb: &TcpCb) -> u64;

    /// Optional: fill `buf` with TCP_INFO algorithm-specific bytes (up to 16 bytes).
    ///
    /// The TCP engine calls this when `getsockopt(TCP_INFO)` is issued.
    /// The returned bytes are appended to the standard `tcp_info` struct
    /// as the `tcpi_opt_vals` extension field.
    ///
    /// Returns: number of bytes written (0 if algorithm has no info to export).
    /// Default implementation: writes 0 bytes.
    fn get_info(&self, cb: &TcpCb, buf: &mut [u8]) -> usize {
        let _ = (cb, buf);
        0
    }
}

Inline private state: TcpCb.cong_priv is a CongPriv union — 64 bytes of inline storage. Algorithms with <= 64 bytes of per-connection state (Reno, CUBIC, Vegas) store directly there. Algorithms needing more (BBR v2 with bandwidth estimation tables) allocate a heap box and store its raw pointer in the first 8 bytes of cong_priv, freeing it in release(). The TCP engine zeroes cong_priv before calling init() so algorithms may rely on zero-initialisation.
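The boxed-state pattern for oversized algorithms can be sketched as follows. `CongPriv` is reproduced here as raw bytes, and `BbrState` plus the helper names are illustrative, not the real types:

```rust
/// 64-byte inline private area, aligned so a pointer fits in the first slot.
#[repr(C, align(8))]
pub struct CongPriv(pub [u8; 64]);

/// A hypothetical per-connection state larger than 64 bytes.
pub struct BbrState {
    pub bw_samples: [u64; 16],
    pub cycle_idx: u8,
}

/// init(): allocate on the heap and stash the pointer in the first 8 bytes.
pub fn bbr_attach(p: &mut CongPriv) {
    let raw = Box::into_raw(Box::new(BbrState { bw_samples: [0; 16], cycle_idx: 0 }));
    p.0[..8].copy_from_slice(&(raw as usize).to_ne_bytes());
}

/// Per-ACK access: recover the pointer. Sound only between attach/detach.
pub fn bbr_state(p: &mut CongPriv) -> &mut BbrState {
    let raw = usize::from_ne_bytes(p.0[..8].try_into().unwrap());
    unsafe { &mut *(raw as *mut BbrState) }
}

/// release(): free the box and re-zero the slot, matching the engine's
/// guarantee that `cong_priv` is zeroed before the next `init()`.
pub fn bbr_detach(p: &mut CongPriv) {
    let raw = usize::from_ne_bytes(p.0[..8].try_into().unwrap());
    if raw != 0 {
        unsafe { drop(Box::from_raw(raw as *mut BbrState)) };
        p.0.fill(0);
    }
}
```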

15.4.2 Supporting Types

/// Discrete congestion events delivered via `cwnd_event()`.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum CaEvent {
    /// Sender began transmitting after an idle period (no in-flight data).
    TxStart,
    /// cwnd was reset to IW after an idle period (TCP RFC 5681 §4.1).
    CwndRestart,
    /// Quick-ack mode completed (returned to delayed-ACK).
    CompleteQuickAck,
    /// A loss event was detected (fast-retransmit or RTO).
    Loss,
}

/// TCP connection congestion states.
///
/// The TCP engine transitions between these states; algorithms receive
/// `set_state()` on every transition.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum CongState {
    /// Normal operation (slow-start or congestion avoidance).
    Open,
    /// Disorder: duplicate ACKs or SACK holes detected, no loss confirmed.
    Disorder,
    /// ECN congestion signal received; cwnd reduced without loss.
    Cwr,
    /// Confirmed loss; performing fast recovery (RFC 6675).
    Recovery,
    /// RTO fired; entering slow-start from ssthresh.
    Loss,
}

/// Algorithm capability flags.
bitflags! {
    pub struct CaFlags: u32 {
        /// Algorithm uses RTT-based control; receive `pkts_acked()` calls.
        const CA_FLAG_RTT_BASED    = 0x1;
        /// Algorithm is per-connection (not flow-aggregate); used by BBR.
        const CA_FLAG_CONN         = 0x2;
        /// Algorithm requires ECN; negotiation fails if peer does not support ECN.
        const CA_FLAG_NEEDS_ECN    = 0x4;
        /// Algorithm implements full ACK processing via `cong_control()`.
        const CA_FLAG_FULL_CONTROL = 0x8;
    }
}

/// Per-ACK delivery rate sample (cf. BBR's delivery rate estimation,
/// draft-cheng-iccrg-delivery-rate-estimation).
///
/// Computed by the TCP engine's rate-sampling code after SACK processing.
/// Delivered to algorithms that set `CA_FLAG_RTT_BASED`.
#[derive(Debug, Clone, Copy)]
pub struct RateSample {
    /// Total bytes delivered since connection start at this ACK's arrival.
    pub delivered: u64,
    /// Bytes delivered that were CE-marked (ECN congestion experienced).
    pub delivered_ce: u64,
    /// Elapsed time (us) of the measurement interval.
    pub interval_us: u32,
    /// Sender-side measurement interval (us).
    pub snd_interval_us: u32,
    /// Receiver-side measurement interval (us).
    pub rcv_interval_us: u32,
    /// Latest RTT sample (us); 0 if unavailable.
    pub rtt_us: u32,
    /// Packets lost during this ACK's interval (SACK + RTO inference).
    pub losses: u32,
    /// Packets newly ACKed or SACKed.
    pub acked_sacked: u32,
    /// `delivered` counter value at the start of the measurement interval.
    pub prior_delivered: u64,
    /// True if the sender was app-limited during this interval.
    pub is_app_limited: bool,
}
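One common formulation of how a pacing algorithm turns a RateSample into a bandwidth estimate divides newly delivered bytes by the longer of the send and receive intervals, which under-estimates rather than over-estimates. The helper below is a sketch over a reduced copy of the struct:

```rust
/// Reduced copy of RateSample with only the fields the rate math needs.
pub struct RateSample {
    pub delivered: u64,        // total bytes delivered at this ACK
    pub prior_delivered: u64,  // delivered counter at interval start
    pub snd_interval_us: u32,  // sender-side measurement interval
    pub rcv_interval_us: u32,  // receiver-side measurement interval
}

/// Delivery rate in bytes/sec, or None if the sample is unusable.
/// Taking max(snd, rcv) biases the estimate downward, the safe direction
/// for a congestion controller.
pub fn delivery_rate_bps(s: &RateSample) -> Option<u64> {
    let interval_us = s.snd_interval_us.max(s.rcv_interval_us) as u64;
    if interval_us == 0 {
        return None;
    }
    let delivered = s.delivered.checked_sub(s.prior_delivered)?;
    Some(delivered * 1_000_000 / interval_us)
}
```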

/// Partial view of a received ACK, passed to `cong_control()`.
#[derive(Debug, Clone, Copy)]
pub struct TcpAck {
    /// Acknowledged sequence number (cumulative).
    pub ack_seq: u32,
    /// Bytes newly acknowledged by this ACK (excluding retransmits).
    pub bytes_acked: u32,
    /// SACK score: number of packets newly SACKed.
    pub sack_newly_sacked: u32,
    /// Receiver-advertised window (in bytes, after scaling).
    pub win: u32,
    /// ECN congestion window reduction signal received with this ACK.
    pub ecn_cwr: bool,
}
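Tying the types together, here is a Reno-style sketch of the three core hooks over a reduced control block. Field names such as `acked_bytes` and `prior_cwnd` are assumptions for the example, not the real `TcpCb`:

```rust
/// Reduced control block (cwnd, ssthresh, and MSS in bytes).
pub struct TcpCb {
    pub cwnd: u64,
    pub ssthresh: u64,
    pub mss: u64,
    pub prior_cwnd: u64,  // cwnd before the last reduction, for undo
    pub acked_bytes: u64, // byte counter for congestion avoidance
}

pub struct Reno;

impl Reno {
    /// ssthresh(): halve cwnd, never below the mandated 2*MSS floor.
    pub fn ssthresh(&self, cb: &mut TcpCb) -> u64 {
        (cb.cwnd / 2).max(2 * cb.mss)
    }

    /// cong_avoid(): exponential growth in slow start, ~1 MSS per RTT after.
    pub fn cong_avoid(&self, cb: &mut TcpCb, _ack: u32, acked: u32) {
        if cb.cwnd < cb.ssthresh {
            cb.cwnd += acked as u64; // slow start
        } else {
            // Byte-counted congestion avoidance: once a full cwnd of data
            // has been ACKed (roughly one RTT), grow by one MSS.
            cb.acked_bytes += acked as u64;
            if cb.acked_bytes >= cb.cwnd {
                cb.acked_bytes -= cb.cwnd;
                cb.cwnd += cb.mss;
            }
        }
    }

    /// undo_cwnd(): restore at least the pre-reduction window.
    pub fn undo_cwnd(&self, cb: &TcpCb) -> u64 {
        cb.cwnd.max(cb.prior_cwnd)
    }
}
```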

15.4.3 Registration API

The congestion control registry is a module-global list inside umka-net. It is initialised at umka-net startup with the builtin algorithms and is thereafter modified only by explicit register/unregister calls.

/// Entry in the congestion control registry.
pub struct CongCtlEntry {
    pub ops: &'static dyn CongestionOps,
    /// Active socket count using this algorithm. Unregister fails while > 0.
    pub refcnt: AtomicU32,
}

/// Global congestion control registry (umka-net internal).
///
/// Read path (TCP connection creation, `setsockopt(TCP_CONGESTION)`): uses RCU.
/// Write path (register/unregister, rare module load/unload): acquires write mutex,
/// builds a new `Arc<Vec<CongCtlEntry>>`, atomically replaces via `RcuCell::update`.
/// The old `Arc` is dropped after the RCU grace period, safely freeing memory.
///
/// This eliminates the `SpinLock` from the read path: 100K+ connections/sec pay
/// only an RCU read guard (~1-3 cycles) rather than a lock + cache-line bounce.
static CONG_CTL_LIST: RcuCell<Vec<CongCtlEntry>> = RcuCell::new_empty();
/// Serializes register/unregister calls. Never held during actual TCP processing.
static CONG_CTL_WRITE_LOCK: Mutex<()> = Mutex::new(());

/// Register a congestion control algorithm.
///
/// # Errors
/// - `KernelError::AlreadyExists` if an algorithm with the same name is registered.
/// - `KernelError::InvalidArgument` if `ops.name()` is empty or longer than 15 bytes.
pub fn tcp_register_congestion_control(
    ops: &'static dyn CongestionOps,
) -> Result<(), KernelError> {
    let name = ops.name();
    if name.is_empty() || name.len() > 15 {
        return Err(KernelError::InvalidArgument);
    }
    let _guard = CONG_CTL_WRITE_LOCK.lock();
    let current = CONG_CTL_LIST.read();
    if current.iter().any(|e| e.ops.name() == name) {
        return Err(KernelError::AlreadyExists);
    }
    let mut new_list: Vec<CongCtlEntry> = current.iter()
        .map(|e| CongCtlEntry { ops: e.ops, refcnt: AtomicU32::new(e.refcnt.load(Relaxed)) })
        .collect();
    new_list.push(CongCtlEntry { ops, refcnt: AtomicU32::new(0) });
    CONG_CTL_LIST.update(new_list);
    Ok(())
}

/// Unregister a congestion control algorithm.
///
/// # Errors
/// - `KernelError::NotFound` if the algorithm is not registered.
/// - `KernelError::Busy` if any socket is using this algorithm (refcnt > 0).
/// - `KernelError::PermissionDenied` if trying to unregister `"reno"`.
pub fn tcp_unregister_congestion_control(name: &str) -> Result<(), KernelError> {
    if name == "reno" {
        return Err(KernelError::PermissionDenied);
    }
    let _guard = CONG_CTL_WRITE_LOCK.lock();
    let current = CONG_CTL_LIST.read();
    let entry = current.iter().find(|e| e.ops.name() == name)
        .ok_or(KernelError::NotFound)?;
    if entry.refcnt.load(Acquire) > 0 {
        return Err(KernelError::Busy);
    }
    let new_list: Vec<CongCtlEntry> = current.iter()
        .filter(|e| e.ops.name() != name)
        .map(|e| CongCtlEntry { ops: e.ops, refcnt: AtomicU32::new(e.refcnt.load(Relaxed)) })
        .collect();
    CONG_CTL_LIST.update(new_list);
    Ok(())
}

/// Look up an algorithm by name. The RCU read guard protects only the list
/// traversal; the returned `ops` is `&'static` (each algorithm is a static
/// descriptor), so it remains valid after the guard is dropped.
fn tcp_find_congestion_control(name: &str) -> Option<&'static dyn CongestionOps> {
    let guard = CONG_CTL_LIST.read_guard();
    guard.iter().find(|e| e.ops.name() == name).map(|e| e.ops)
}

Builtin algorithms:

| Name | Default? | Description |
|---|---|---|
| reno | fallback | RFC 5681 Reno -- always available, never unregistered |
| bbr | yes | BBR v2 (pacing + bandwidth estimation) |
| cubic | no | CUBIC (RFC 8312) |

reno is the fallback: if the system default is unregistered between socket() and connect(), new sockets fall back to reno.

15.4.4 Per-Socket Selection Lifecycle

At connect() time: The TCP engine calls tcp_init_cong_control(cb):

/// Attach the selected (or system-default) congestion control algorithm
/// to a newly connecting TCP socket.
///
/// Called once, from `tcp_connect()`, before the SYN is transmitted.
///
/// Fallback: if the socket's requested algorithm was removed from the
/// registry between `setsockopt` and `connect`, the engine silently falls
/// back to `reno` and still returns `Ok(())` -- matching Linux behaviour,
/// where algorithm removal does not fail in-flight connects.
pub fn tcp_init_cong_control(cb: &mut TcpCb) -> Result<(), KernelError> {
    let guard = CONG_CTL_LIST.read_guard();
    let default_name = tcp_default_cong_control();
    let name: &str = cb.cong_name.as_deref().unwrap_or(&default_name);
    // `reno` is always registered, so the fallback lookup cannot fail.
    let entry = guard.iter().find(|e| e.ops.name() == name)
        .or_else(|| guard.iter().find(|e| e.ops.name() == "reno"))
        .expect("reno is always registered");
    // Track active users on the registry entry itself; unregister checks this.
    entry.refcnt.fetch_add(1, AcqRel);
    cb.cong_ops = entry.ops;
    entry.ops.init(cb);
    Ok(())
}

At socket close() / algorithm change: tcp_cleanup_cong_control(cb):

/// Detach the congestion control algorithm from a TCP socket.
///
/// Called on connection close or when TCP_CONGESTION setsockopt changes the
/// algorithm. After this call, `cb.cong_ops` points at `RenoCongestionOps`
/// until a new algorithm is selected; this prevents use-after-free if a
/// racing timer fires between `release()` and the next `init()`.
pub fn tcp_cleanup_cong_control(cb: &mut TcpCb) {
    let ops = cb.cong_ops;
    ops.release(cb);
    // Drop the active-user count on the registry entry (see CongCtlEntry).
    let guard = CONG_CTL_LIST.read_guard();
    if let Some(entry) = guard.iter().find(|e| e.ops.name() == ops.name()) {
        entry.refcnt.fetch_sub(1, AcqRel);
    }
    cb.cong_ops = &RenoCongestionOps;
}

TCP_CONGESTION sockopt (setsockopt(IPPROTO_TCP, TCP_CONGESTION, "bbr\0", 4)):

  1. Verify caller holds Capability::NetAdmin OR socket is not yet connected (unprivileged processes may set the algorithm before connecting, matching Linux).
  2. Null-terminate and validate the name (max 15 bytes, ASCII printable).
  3. Look up the algorithm in the registry. Return ENOENT if not found.
  4. If the socket is already connected, call tcp_cleanup_cong_control() then tcp_init_cong_control() with the new name.
  5. If not yet connected, store the name in cb.cong_name for use at connect().

getsockopt(IPPROTO_TCP, TCP_CONGESTION, buf, len) copies cb.cong_ops.name() into buf (null-terminated, matching Linux).
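Step 2's name validation might look like the sketch below. The exact policy beyond "max 15 bytes, ASCII printable" is an assumption (here: no NUL, no spaces, stopping at the first NUL in a padded buffer):

```rust
/// Validate a user-supplied TCP_CONGESTION algorithm name.
pub fn validate_cong_name(raw: &[u8]) -> Result<&str, i32> {
    const EINVAL: i32 = 22;
    // Take bytes up to the first NUL (the user buffer may be padded).
    let end = raw.iter().position(|&b| b == 0).unwrap_or(raw.len());
    let name = &raw[..end];
    if name.is_empty() || name.len() > 15 {
        return Err(-EINVAL);
    }
    // Printable ASCII, excluding space (names are space-separated in
    // tcp_available_congestion_control, so spaces cannot appear).
    if !name.iter().all(|&b| (0x21..=0x7e).contains(&b)) {
        return Err(-EINVAL);
    }
    // Safe: validated as ASCII above, hence valid UTF-8.
    Ok(core::str::from_utf8(name).unwrap())
}
```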

15.4.5 System Default

The system default is stored in umka-net as:

/// System-wide default TCP congestion control algorithm name.
/// Initialised to "bbr". Writable by processes holding `Capability::NetAdmin`.
/// Read by `tcp_init_cong_control()` for sockets that have not called
/// `TCP_CONGESTION` setsockopt before connect().
static DEFAULT_CONG_CTL: RwLock<ArrayString<16>> =
    RwLock::new(ArrayString::from("bbr"));

/// Return the current system default algorithm name.
///
/// Returns an owned copy: a `&str` borrowed from the lock guard could not
/// outlive the read lock, so the 16-byte name is copied out instead
/// (`ArrayString<16>` is `Copy`).
pub fn tcp_default_cong_control() -> ArrayString<16> {
    *DEFAULT_CONG_CTL.read()
}

/proc/sys/net/ipv4/tcp_congestion_control: Reads return the current default. Writes (requiring Capability::NetAdmin) verify that the named algorithm is registered and update DEFAULT_CONG_CTL. Writing an unknown name returns ENOENT. This sysctl is per-network-namespace (Section 16.1), so each container may independently configure its default.

15.4.6 TCP Sysctl Entries (/proc/sys/net/ipv4/tcp_*)

UmkaOS must implement the following /proc/sys/net/ipv4/tcp_* entries for Linux compatibility. These are required by Docker, Kubernetes, monitoring tools (Prometheus node_exporter, Datadog agent), and system tuning scripts (sysctl -w). All entries are per-network-namespace unless noted otherwise.

| Sysctl | Type | Default | Description |
|---|---|---|---|
| tcp_syn_retries | u8 | 6 | Max SYN retransmits before aborting a connect attempt. |
| tcp_synack_retries | u8 | 5 | Max SYN-ACK retransmits for a passive connection. |
| tcp_fin_timeout | u32 (seconds) | 60 | Time a socket stays in FIN-WAIT-2 before being forcibly closed. |
| tcp_keepalive_time | u32 (seconds) | 7200 | Idle time before the first keepalive probe is sent. |
| tcp_keepalive_intvl | u32 (seconds) | 75 | Interval between successive keepalive probes. |
| tcp_keepalive_probes | u8 | 9 | Number of unacknowledged probes before declaring the connection dead. |
| tcp_max_syn_backlog | u32 | 4096 | Maximum length of the per-socket SYN backlog (incomplete connections). |
| tcp_max_tw_buckets | u32 | 262144 | Maximum number of TIME-WAIT sockets. Excess sockets are destroyed immediately with a RST. |
| tcp_tw_reuse | u8 (0/1/2) | 2 | Allow reuse of TIME-WAIT sockets for new outgoing connections. 0 = disabled, 1 = global, 2 = loopback only. |
| tcp_window_scaling | u8 (bool) | 1 | Enable RFC 1323 window scaling (required for windows > 64 KB). |
| tcp_sack | u8 (bool) | 1 | Enable RFC 2018 Selective Acknowledgments. |
| tcp_timestamps | u8 (bool) | 1 | Enable RFC 1323 timestamps (used for RTT measurement and PAWS). |
| tcp_ecn | u8 (0/1/2) | 2 | Explicit Congestion Notification. 0 = disabled, 1 = enabled, 2 = server-only (negotiate if peer requests). |
| tcp_congestion_control | string | "bbr" | Default congestion control algorithm (see Section 15.4.5). |
| tcp_available_congestion_control | string (read-only) | n/a | Space-separated list of registered algorithms. Not writable. |
| tcp_rmem | 3 × u32 | 4096 131072 6291456 | Min, default, max TCP receive buffer sizes (bytes). Auto-tuning operates within [min, max]. |
| tcp_wmem | 3 × u32 | 4096 16384 4194304 | Min, default, max TCP send buffer sizes (bytes). |
| tcp_mem | 3 × u64 (pages) | auto | Low, pressure, high watermarks for total TCP memory consumption (in pages). Below low: no pressure. Above high: new allocations may fail. Auto-computed at boot from total system memory. |
| tcp_slow_start_after_idle | u8 (bool) | 1 | Reset cwnd to initial window after an idle period (RFC 2861). Set to 0 for long-lived connections with bursty traffic. |
| tcp_no_metrics_save | u8 (bool) | 0 | If 1, do not cache TCP metrics (RTT, cwnd) in the route cache on connection close. |
| tcp_base_mss | u32 | 1024 | Starting MSS for Path MTU Discovery (PMTUD) search. |
| tcp_mtu_probing | u8 (0/1/2) | 0 | Enable PMTUD probing. 0 = disabled, 1 = enabled when ICMP blackhole detected, 2 = always enabled. |
| tcp_fastopen | u32 (bitmask) | 0x1 | TFO configuration. Bit 0 = client enable, bit 1 = server enable, bit 2 = server without cookie, bit 10 = client no-cookie. |
| tcp_fastopen_key | hex string | random | 128-bit server TFO cookie key (hex, e.g., "00112233-44556677-8899aabb-ccddeeff"). Writable for cluster-consistent TFO. |

Implementation notes:

  • All entries are readable/writable via both /proc/sys/net/ipv4/ and the sysctl(2) system call.
  • Values are validated on write: out-of-range values return EINVAL.
  • Per-namespace scoping means Docker containers and Kubernetes pods see isolated sysctl namespaces (consistent with Linux net.ipv4.tcp_* namespace support).
  • tcp_mem auto-computation follows the Linux heuristic: low = total_pages/16, pressure = total_pages/8, high = total_pages/4, clamped to sane minimums.
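The tcp_mem heuristic can be sketched directly. The clamping floor below is an illustrative assumption (the text only requires "sane minimums"):

```rust
/// Boot-time tcp_mem computation: (low, pressure, high) watermarks in pages,
/// derived from total system memory per the Linux-style heuristic.
pub fn compute_tcp_mem(total_pages: u64) -> (u64, u64, u64) {
    // Illustrative floor so tiny VMs still get a working stack.
    const MIN_LOW: u64 = 1024;
    let low = (total_pages / 16).max(MIN_LOW);
    let pressure = (total_pages / 8).max(2 * MIN_LOW);
    let high = (total_pages / 4).max(4 * MIN_LOW);
    (low, pressure, high)
}
```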

15.4.7 /proc/net/ Filesystem Entries

UmkaOS must expose the following /proc/net/ entries for compatibility with Docker, Kubernetes, monitoring tools (Prometheus node_exporter, Datadog agent, ss, netstat, ip, sar), and container health checks. All entries must match Linux output byte-for-byte — tools parse these files with hardcoded field offsets, column positions, and sscanf/awk patterns.

Each entry is per-network-namespace (containers see only their own network state). Implementation uses umka's procfs layer (Section 19.3).

| Path | Format | Description |
|---|---|---|
| /proc/net/dev | Fixed-width columns, header on lines 1-2 | Per-interface statistics: interface name, rx bytes, rx packets, rx errs, rx drop, rx fifo, rx frame, rx compressed, rx multicast, tx bytes, tx packets, tx errs, tx drop, tx fifo, tx colls, tx carrier, tx compressed. One row per interface. Used by ifconfig, Prometheus node_network_* metrics, sar -n DEV. |
| /proc/net/snmp | Paired "<Protocol>: <field names>" / "<Protocol>: <values>" lines | SNMP MIB-II counters. Sections: Ip, Icmp, IcmpMsg, Tcp, Udp, UdpLite. Each section has a header line (field names) followed by a values line. Used by SNMP exporters, netstat -s, monitoring dashboards. |
| /proc/net/netstat | Same format as /proc/net/snmp | Extended TCP/IP statistics. Sections: TcpExt (SYN cookies, listen overflows, out-of-window drops, fast retransmits, etc.), IpExt (InOctets, OutOctets, InMcastPkts, etc.). Used by netstat -s, ss -s, TCP debugging. |
| /proc/net/tcp | Fixed columns, header on line 1 | TCP socket table. Columns: sl, local_address (hex IP:port), rem_address, st (state), tx_queue:rx_queue, tr:tm->when, retrnsmt, uid, timeout, inode, plus additional fields. Hex-encoded IPv4 addresses (little-endian on little-endian hosts). Used by ss, netstat -tnp, container health probes. |
| /proc/net/tcp6 | Same format as tcp | TCP6 socket table. IPv6 addresses as 32-hex-char strings. Required for IPv6-enabled containers and dual-stack Kubernetes. |
| /proc/net/udp | Same format as tcp (fewer fields) | UDP socket table. Columns match Linux's format. Used by ss -unp, netstat -unp. |
| /proc/net/udp6 | Same format as udp | UDP6 socket table. |
| /proc/net/unix | Fixed columns, header on line 1 | Unix domain socket table. Columns: Num, RefCount, Protocol, Flags, Type, St, Inode, Path. Used by ss -x, container debugging. |
| /proc/net/if_inet6 | Space-separated, no header | IPv6 interface addresses. Columns: address (32 hex chars, no colons), ifindex (hex), prefix_len (hex), scope (hex), flags (hex), ifname. Used by ip -6 addr, NetworkManager, container IPv6 setup. |
| /proc/net/route | Tab-separated, header on line 1 | IPv4 routing table (FIB). Columns: Iface, Destination, Gateway, Flags, RefCnt, Use, Metric, Mask, MTU, Window, IRTT. All addresses in hex (network byte order). Used by route -n and legacy routing tools. ip route uses netlink, but some containers still parse this file. |
| /proc/net/arp | Fixed columns, header on line 1 | ARP cache. Columns: IP address, HW type, Flags, HW address, Mask, Device. Used by arp -n, container network debugging, ARP monitoring. |
| /proc/net/fib_trie | Indented tree structure | Routing trie dump. Shows the LC-trie structure of the IPv4 FIB. Used by ip route show table all internals and network diagnostic tools. |
| /proc/net/fib_triestat | Key-value pairs | FIB trie statistics: number of nodes, leaves, prefixes, null pointers, trie depth. Used by routing performance analysis tools. |

Implementation requirements:

  • Atomicity: Each read() must return a consistent snapshot. Use seq_file-style iteration with RCU read-side protection for socket tables and routing tables, so readers never see partial updates.
  • Hex encoding: IPv4 addresses in /proc/net/tcp, /proc/net/route, and /proc/net/arp are encoded as 8-hex-character little-endian (on little-endian hosts) values — this matches Linux's %08X format and tools depend on it.
  • Performance: /proc/net/tcp can be large on busy servers (100K+ sockets). Use seq_file pagination to avoid allocating the entire output in kernel memory. Kubernetes liveness probes may read these files every few seconds.
  • Namespace isolation: Each entry shows only the network state visible within the reading process's network namespace. A container must not see the host's socket table.
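The hex-encoding requirement can be illustrated with a small formatter, assuming the Linux convention (address printed as %08X of the in-memory u32, port as %04X in host order):

```rust
/// Format one /proc/net/tcp address column on a little-endian host.
pub fn proc_net_tcp_addr(ip: [u8; 4], port: u16) -> String {
    // `ip` is in network byte order; reading it as a little-endian u32
    // reproduces the byte-swapped %08X output tools expect.
    format!("{:08X}:{:04X}", u32::from_le_bytes(ip), port)
}
```

This is why 127.0.0.1 appears as "0100007F" in the file: the network-order bytes read back-to-front as a little-endian u32.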

15.5 Traffic Control and Queue Disciplines (tc/qdisc)

The Traffic Control subsystem schedules packets on each network device's transmit path. It sits between the socket layer (where sendmsg() delivers a NetBufHandle) and the NIC driver's hardware transmit ring. Qdiscs enable rate limiting, latency control, and hierarchical QoS without modifying NIC drivers.

Linux parallel: Linux implements tc through net/sched/ -- struct Qdisc, struct Qdisc_ops, and the RTM_NEWQDISC netlink interface. UmkaOS maps these concepts faithfully so that iproute2 tc and Kubernetes CNI plugins using tc (Cilium, Calico, bandwidth plugin) operate without modification.

15.5.1 Architecture

sendmsg() -> socket TX queue -> NetDev::transmit(buf)
                                    |
                            root qdisc enqueue(buf)
                                    |   [rate limiting / shaping wait here]
                            NIC driver poll / NAPI TX
                                    |
                            root qdisc dequeue()
                                    |
                            NIC hardware ring enqueue

Each NetDev has one root qdisc (TX path) and optionally one ingress qdisc (RX path, for filtering and policing before socket delivery). The root qdisc may be classful (HTB, HFSC) -- containing child qdiscs on leaf classes -- or classless (pfifo_fast, fq_codel).

15.5.2 TcHandle and the Handle Namespace

/// Traffic control handle (major:minor encoded as a u32).
///
/// Major identifies a qdisc; minor identifies a class within that qdisc.
/// Minor 0 refers to the qdisc itself (not any class).
///
/// Encoding: upper 16 bits = major, lower 16 bits = minor.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub struct TcHandle(pub u32);

impl TcHandle {
    /// Root qdisc of the device (attach point for the first qdisc).
    pub const ROOT: TcHandle = TcHandle(0xFFFF_0000);
    /// Ingress pseudo-qdisc handle.
    pub const INGRESS: TcHandle = TcHandle(0xFFFF_FFF1);
    /// Clsact pseudo-qdisc handle (used by BPF/Cilium for tc redirect).
    pub const CLSACT: TcHandle = TcHandle(0xFFFF_FFF2);

    pub fn new(major: u16, minor: u16) -> Self {
        TcHandle(((major as u32) << 16) | (minor as u32))
    }
    pub fn major(self) -> u16 { (self.0 >> 16) as u16 }
    pub fn minor(self) -> u16 { self.0 as u16 }
}
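As a usage illustration, iproute2 writes handles in hex "major:minor" form ("1:10", "ffff:"). A hypothetical parser for that syntax (not part of the spec; `TcHandle` is reproduced in reduced form so the example stands alone):

```rust
/// Reduced copy of TcHandle from above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct TcHandle(pub u32);

impl TcHandle {
    pub fn new(major: u16, minor: u16) -> Self {
        TcHandle(((major as u32) << 16) | (minor as u32))
    }
    pub fn major(self) -> u16 { (self.0 >> 16) as u16 }
    pub fn minor(self) -> u16 { self.0 as u16 }
}

/// Parse tc's hex "major:minor" syntax. An empty minor ("1:") means
/// minor 0, i.e. the qdisc itself rather than a class.
pub fn parse_tc_handle(s: &str) -> Option<TcHandle> {
    let (maj, min) = s.split_once(':')?;
    let major = u16::from_str_radix(maj, 16).ok()?;
    let minor = if min.is_empty() {
        0
    } else {
        u16::from_str_radix(min, 16).ok()?
    };
    Some(TcHandle::new(major, minor))
}
```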

15.5.3 QdiscOps Trait

/// Qdisc algorithm interface.
///
/// Implementations are stateless descriptors. Per-device qdisc state lives in
/// `Qdisc.priv_data`. All methods execute in umka-net's isolation domain.
pub trait QdiscOps: Send + Sync {
    /// Algorithm name (ASCII, max 15 bytes + NUL, matches Linux IFNAMSIZ for qdiscs).
    fn name(&self) -> &'static str;

    /// Enqueue a packet.
    ///
    /// The qdisc takes ownership of `buf`. If the queue is full, the qdisc
    /// must drop `buf` (calling `netbuf_free(buf)`) and return
    /// `Err(NetDevError::QueueFull)`. Returning `Ok(())` guarantees eventual
    /// dequeue.
    fn enqueue(&self, buf: NetBufHandle, qdisc: &mut Qdisc) -> Result<(), NetDevError>;

    /// Dequeue the next packet to transmit.
    ///
    /// Returns `None` if the qdisc has no packet to transmit right now
    /// (queue empty, or rate limited -- the qdisc will call `netdev_wake_queue()`
    /// when it is ready). The NIC driver calls this from its NAPI TX poll.
    fn dequeue(&self, qdisc: &mut Qdisc) -> Option<NetBufHandle>;

    /// Reset the qdisc to its initial (empty) state, dropping all queued packets.
    /// Called when the device is brought down (NETDEV_DOWN).
    fn reset(&self, qdisc: &mut Qdisc);

    /// Free all resources allocated by this qdisc instance.
    /// Called after `reset()` when the qdisc is detached or the device destroyed.
    fn destroy(&self, qdisc: &mut Qdisc);

    /// Reconfigure the qdisc from netlink attributes.
    ///
    /// Called for `RTM_NEWQDISC` with `NLM_F_REPLACE` on an existing qdisc,
    /// or after initial creation. Must validate `opts` before mutating state.
    /// On error, the existing configuration is unchanged.
    fn change(
        &self,
        qdisc: &mut Qdisc,
        opts: &NlAttrSet,
    ) -> Result<(), KernelError>;

    /// Serialise the qdisc's current configuration into `skb` as netlink attributes.
    /// Called for `RTM_GETQDISC` and `RTM_NEWQDISC` replies.
    fn dump(&self, qdisc: &Qdisc, skb: &mut NetBuf) -> Result<(), KernelError>;

    /// Return current statistics snapshot.
    fn stats(&self, qdisc: &Qdisc) -> QdiscStats;
}

15.5.4 Qdisc Struct

Multi-queue native Qdisc design:

Each Qdisc instance is scoped to a single TX queue. Multi-queue NICs (virtio-net, i40e, mlx5) create N Qdisc instances — one per hardware queue. Contention is per-queue, eliminating the single-lock bottleneck of a per-NIC design.

struct Qdisc {
    lock: SpinLock<()>, // per-queue, not per-NIC
    queue_index: u16,   // which TX queue this Qdisc serves
    // ... scheduler-specific state (full definition below)
}

Lock-free fast path (simple FIFO): For pfifo (a plain packet-limited FIFO with no classes and no traffic shaping), the enqueue path uses a lock-free ring buffer: producers reserve a slot by atomically advancing the ring tail via AtomicUsize::fetch_add -- no spinlock. The consumer (the NIC driver's NAPI TX poll) reads the head and advances it after DMA completes. Throughput is limited only by ring-buffer capacity and NIC hardware speed.
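A single-threaded sketch of this fast path, with `u64` standing in for NetBufHandle (0 = empty slot). A production MPSC ring also needs a per-slot publication flag so the consumer never reads a reserved-but-unwritten slot; that detail is omitted here.

```rust
use std::sync::atomic::{AtomicU64, AtomicUsize, Ordering};

const RING: usize = 4; // power of two; a real qdisc sizes this to the queue limit

struct PfifoRing {
    slots: [AtomicU64; RING],
    tail: AtomicUsize, // next slot to write (producer side, fetch_add)
    head: AtomicUsize, // next slot to read (consumer side, NAPI TX poll)
}

impl PfifoRing {
    fn new() -> Self {
        Self {
            slots: std::array::from_fn(|_| AtomicU64::new(0)),
            tail: AtomicUsize::new(0),
            head: AtomicUsize::new(0),
        }
    }

    /// Enqueue: Err(()) models NetDevError::QueueFull (tail drop).
    fn enqueue(&self, handle: u64) -> Result<(), ()> {
        let head = self.head.load(Ordering::Acquire);
        let tail = self.tail.load(Ordering::Relaxed);
        if tail.wrapping_sub(head) >= RING {
            return Err(()); // full: caller frees the NetBuf
        }
        let idx = self.tail.fetch_add(1, Ordering::AcqRel) % RING;
        self.slots[idx].store(handle, Ordering::Release);
        Ok(())
    }

    /// Dequeue: called from the NIC driver's TX poll after DMA completion.
    fn dequeue(&self) -> Option<u64> {
        let head = self.head.load(Ordering::Relaxed);
        if head == self.tail.load(Ordering::Acquire) {
            return None; // empty
        }
        let h = self.slots[head % RING].swap(0, Ordering::AcqRel);
        self.head.store(head + 1, Ordering::Release);
        Some(h)
    }
}

fn main() {
    let ring = PfifoRing::new();
    for i in 1..=4 { ring.enqueue(i).unwrap(); }
    assert!(ring.enqueue(5).is_err());   // full: tail drop
    assert_eq!(ring.dequeue(), Some(1)); // FIFO order preserved
    assert!(ring.enqueue(5).is_ok());    // slot freed by the dequeue
}
```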

Lock path (hierarchical schedulers): HTB, HFSC, and CBS schedulers require traversing the class hierarchy to find the target leaf queue. These take the queue_lock for the duration of the hierarchy traversal and enqueue. The lock is per-queue, so N queues can run simultaneously on N cores.

RX path: Symmetric per-queue design. Each RX queue has an associated NAPI instance; no shared state between queues on the RX path.

/// A qdisc instance attached to a single TX queue of a network device.
///
/// Scoped to one hardware TX queue (`queue_index`). Multi-queue NICs have one
/// `Qdisc` per queue. Statistics fields are atomics -- readable without locking.
pub struct Qdisc {
    /// Algorithm implementation.
    pub ops: &'static dyn QdiscOps,
    /// This qdisc's handle (major:0 = the qdisc itself).
    pub handle: TcHandle,
    /// Parent handle: `TcHandle::ROOT` for the device root qdisc,
    /// or the parent HTB class handle for leaf qdiscs.
    pub parent: TcHandle,
    /// Which TX hardware queue this Qdisc instance serves.
    pub queue_index: u16,
    /// Weak reference to the owning device (prevents retain cycle).
    pub dev: Weak<NetDev>,
    /// Algorithm-private state for this Qdisc instance (e.g., HTB class tree,
    /// TBF token bucket, FQ flow table). One allocation per Qdisc instance
    /// (one per TX queue) — this is a cold-path configuration object, not a
    /// per-packet hot-path structure.
    ///
    /// **Not related to TCP congestion control**: TCP CC uses `TcpCb.cong_priv`
    /// (a 64-byte inline `CongPriv` union, zero heap allocation per connection)
    /// with a `&'static dyn CongestionOps` ops pointer. See Section 15.1.5.
    /// `Box<dyn Any>` here is exclusively for Qdisc (traffic-shaping) algorithms.
    pub priv_data: Box<dyn Any + Send>,
    /// Bytes enqueued (cumulative; wraps on overflow).
    pub bytes: AtomicU64,
    /// Packets enqueued (cumulative).
    pub packets: AtomicU64,
    /// Packets dropped due to queue full or policing.
    pub drops: AtomicU64,
    /// Packets that exceeded the rate limit (overlimit / shaped).
    pub overlimits: AtomicU64,
    /// Queue length in packets (current, not cumulative).
    pub qlen: AtomicU32,
    pub flags: QdiscFlags,
    /// Optional size table for overhead accounting (ATM cell padding, etc.).
    pub stab: Option<SizeTable>,
    /// Serialises enqueue/dequeue for hierarchical schedulers (HTB, HFSC, CBS).
    /// Not used by pfifo (lock-free ring buffer fast path).
    lock: SpinLock<()>,
}

15.5.5 Builtin Qdiscs

pfifo_fast -- Default for Newly Created Devices

Three-band strict-priority FIFO. Band 0 is highest priority, band 2 lowest. Packet priority is determined by the IP DSCP field: DSCP bits [5:3] are mapped to a band via a static priority map (matching Linux's prio_map). SO_PRIORITY on the socket overrides the DSCP classification.

Parameters: maximum queue depth 1000 packets per band (fixed; not configurable via RTM_NEWQDISC for pfifo_fast). Total limit: 3000 packets.

Enqueue: append to tail of the packet's band. Drop if band is at its limit (tail drop). Dequeue: scan bands 0 to 2; return head of first non-empty band.
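A sketch of the band logic just described, with `u64` standing in for NetBufHandle. The `dscp_to_band` map here is illustrative (loosely following the spirit of Linux's priority map: expedited/control traffic to band 0, background to band 2, everything else to band 1); the exact table is not specified above.

```rust
use std::collections::VecDeque;

const BAND_LIMIT: usize = 1000; // fixed per-band depth, as stated above

struct PfifoFast {
    bands: [VecDeque<u64>; 3], // band 0 = highest priority
}

/// Map DSCP bits [5:3] (the class-selector bits) to a band. Illustrative.
fn dscp_to_band(dscp: u8) -> usize {
    match (dscp >> 3) & 0x7 {
        0b101 | 0b110 | 0b111 => 0, // EF / network-control range: highest
        0b001 => 2,                 // CS1 (background): lowest
        _ => 1,                     // default band
    }
}

impl PfifoFast {
    fn new() -> Self {
        Self { bands: [VecDeque::new(), VecDeque::new(), VecDeque::new()] }
    }

    /// Tail-drop enqueue into the packet's band.
    fn enqueue(&mut self, handle: u64, dscp: u8) -> Result<(), ()> {
        let band = dscp_to_band(dscp);
        if self.bands[band].len() >= BAND_LIMIT { return Err(()); }
        self.bands[band].push_back(handle);
        Ok(())
    }

    /// Strict priority: scan bands 0..2, first non-empty band wins.
    fn dequeue(&mut self) -> Option<u64> {
        self.bands.iter_mut().find_map(|b| b.pop_front())
    }
}

fn main() {
    let mut q = PfifoFast::new();
    q.enqueue(1, 8).unwrap();  // CS1 (DSCP 8)  -> band 2
    q.enqueue(2, 0).unwrap();  // default       -> band 1
    q.enqueue(3, 46).unwrap(); // EF (DSCP 46)  -> band 0
    assert_eq!(q.dequeue(), Some(3)); // band 0 drained first
    assert_eq!(q.dequeue(), Some(2));
    assert_eq!(q.dequeue(), Some(1));
}
```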

fq_codel -- Fair Queue with Controlled Delay

FQ-CoDel (RFC 8290) combines per-flow FIFO queuing (fair queuing) with the CoDel AQM algorithm for delay control.

/// Maximum per-flow packet queue depth for fq_codel.
/// Limits worst-case memory per flow; typical flows stay well under this.
/// Keeps the embedded ring compact: 32 handles × 8 bytes = 256 bytes
/// (four 64-byte cache lines) per flow.
pub const FQ_CODEL_FLOW_DEPTH: usize = 32;

/// Sentinel index meaning "end of intrusive list" (no next flow).
pub const FQ_FLOW_NONE: u32 = u32::MAX;

/// FQ-CoDel qdisc private state.
///
/// All per-packet structures are pre-allocated at qdisc creation — there is
/// no dynamic allocation on the TX fast path. `flows` is allocated once as a
/// `Box<[CodelFlow]>` with `num_flows` entries. Flow lists use intrusive links
/// embedded in `CodelFlow` rather than heap-allocated `LinkedList` nodes.
pub struct FqCodelPriv {
    /// Hash table of per-flow queues; indexed by 5-tuple hash mod `num_flows`.
    /// Allocated once at qdisc creation; never grown or shrunk at runtime.
    pub flows: Box<[CodelFlow]>,
    /// CoDel target delay (default: 5 ms). Packets sojourning longer than
    /// this in the queue are ECN-marked or dropped.
    pub target_us: u32,
    /// CoDel interval (default: 100 ms). Minimum time between consecutive drops.
    pub interval_us: u32,
    /// DRR quantum in bytes (default: 1514 = one full Ethernet frame:
    /// 1500-byte MTU + 14-byte Ethernet header).
    pub quantum: u32,
    /// Number of per-flow queues (default: 1024, must be power-of-two).
    pub num_flows: u32,
    /// Total packet limit across all flows (default: 10240).
    pub limit: u32,
    /// Number of packets currently queued across all flows.
    pub backlog: u32,
    /// Head of the new-flows intrusive list (index into `flows`; FQ_FLOW_NONE = empty).
    /// New flows (sparse, recently active after idle) are served before old flows.
    pub new_flows_head: u32,
    /// Head of the old-flows intrusive list (index into `flows`; FQ_FLOW_NONE = empty).
    pub old_flows_head: u32,
}

/// Per-flow state within fq_codel.
///
/// The packet queue is a fixed-capacity ring buffer embedded directly in this
/// struct (no heap allocation after flow initialization). The flow's position
/// in new_flows or old_flows is tracked via intrusive links (`next_active`),
/// eliminating `LinkedList` node allocation on flow transitions.
pub struct CodelFlow {
    /// Packet queue for this flow: fixed-capacity ring buffer, no allocation.
    /// Enqueue: write to queue_buf[tail % FQ_CODEL_FLOW_DEPTH], advance tail.
    /// Dequeue: read from queue_buf[head % FQ_CODEL_FLOW_DEPTH], advance head.
    /// Drop (tail drop): discard tail packet if (tail - head) == FQ_CODEL_FLOW_DEPTH.
    pub queue_buf: [Option<NetBufHandle>; FQ_CODEL_FLOW_DEPTH],
    pub queue_head: u32,
    pub queue_tail: u32,
    /// DRR deficit counter (credits accumulated for this flow).
    pub deficit: i32,
    /// CoDel state: whether the flow is in "dropping state".
    pub dropping: bool,
    /// Time (in microseconds) of the next scheduled CoDel drop while in
    /// dropping state.
    pub drop_next_us: u64,
    /// Number of packets dropped by CoDel on this flow.
    pub drop_count: u32,
    /// Number of packets ECN-marked (CE) instead of dropped.
    pub ecn_mark: u32,
    /// Intrusive list link: next flow index in the active list (new or old).
    /// `FQ_FLOW_NONE` means this flow is not in any active list or is the tail.
    pub next_active: u32,
    /// Which active list this flow is currently in.
    pub list_tag: FqFlowList,
}

/// Which active list a flow is currently in.
#[repr(u8)]
pub enum FqFlowList {
    None = 0,  // idle, not in any list
    New  = 1,  // in new_flows (sparse)
    Old  = 2,  // in old_flows (bulk)
}

Scheduling: Deficit Round Robin across the two lists (new then old). Each flow gets quantum bytes of credit per round. CoDel monitors the sojourn time of the head packet in each flow; if the sojourn exceeds target for longer than interval, it either ECN-marks (if the packet has the ECT bit) or drops, computing the next drop time as drop_next = drop_next + interval / sqrt(drop_count) (matching RFC 8290).
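The control law above can be shown numerically. This sketch uses an integer square root (the kernel fast path avoids floating point); constant names are illustrative and match the defaults stated above.

```rust
const INTERVAL_US: u64 = 100_000; // CoDel interval, 100 ms (default above)

/// Integer square root via Newton's method (no floating point).
fn isqrt(n: u64) -> u64 {
    if n < 2 { return n; }
    let mut x = n;
    let mut y = (x + 1) / 2;
    while y < x {
        x = y;
        y = (x + n / x) / 2;
    }
    x
}

/// Next scheduled drop time per the control law:
/// drop_next' = drop_next + interval / sqrt(drop_count).
fn control_law(drop_next_us: u64, drop_count: u32) -> u64 {
    drop_next_us + INTERVAL_US / isqrt(drop_count.max(1) as u64)
}

fn main() {
    // As drop_count grows, drops accelerate: the gap shrinks as 1/sqrt(n).
    assert_eq!(control_law(0, 1), 100_000);  // first drop: full interval
    assert_eq!(control_law(0, 4), 50_000);   // interval / 2
    assert_eq!(control_law(0, 100), 10_000); // interval / 10
}
```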

Sparse flow optimisation: flows that send a packet after an idle period are placed in the new-flow list with a full quantum, allowing latency-sensitive flows (DNS, SSH) to bypass the bulk-flow queue.

htb -- Hierarchical Token Bucket

HTB enables guaranteed bandwidth allocation with optional bursting up to a configured ceiling. It is the standard QoS mechanism for Kubernetes network bandwidth enforcement.

/// HTB class state.
pub struct HtbClass {
    pub handle: TcHandle,
    pub parent: TcHandle,
    /// Guaranteed rate (bytes/second). Token bucket replenished at this rate.
    pub rate: u64,
    /// Ceiling rate (bytes/second). Class may borrow up to this rate if parent allows.
    pub ceil: u64,
    /// Token bucket tokens available (in bytes; negative = in deficit).
    pub tokens: i64,
    /// Ceiling token bucket.
    pub ctokens: i64,
    /// Last time tokens were updated (ktime_us).
    pub t_c: u64,
    /// Maximum burst size in bytes (rate * burst_us).
    pub burst: u32,
    /// Leaf qdisc (if this is a leaf class).
    pub leaf: Option<Box<Qdisc>>,
    /// Child classes (if inner class).
    pub children: Vec<Arc<SpinLock<HtbClass>>>,
    /// HTB level (0 = leaf, increases toward root).
    pub level: u32,
    /// Priority queue key for dequeue scheduling.
    pub pq_key: u64,
}

HTB maintains per-level priority queues (HtbLevel arrays) in the qdisc's private data. At each dequeue call, HTB walks from the root downward, selecting the highest-priority class that has tokens available. Borrowed bandwidth: a child class at its rate limit may borrow from its parent's excess capacity up to ceil. Token buckets are replenished lazily on each dequeue, computed from elapsed time since t_c.
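The lazy replenishment step can be sketched in isolation. Field names follow HtbClass above; `now_us` stands in for the kernel clock, and the borrow-from-parent step is only indicated in a comment.

```rust
/// Minimal token bucket with lazy refill, as used by HTB classes above.
struct TokenBucket {
    rate: u64,   // guaranteed rate, bytes/second
    burst: u64,  // bucket capacity in bytes
    tokens: i64, // current tokens (negative = in deficit)
    t_c: u64,    // last update time, microseconds
}

impl TokenBucket {
    /// Recompute tokens from elapsed time since t_c, capping at burst.
    /// Called on each dequeue -- no periodic timer needed.
    fn update(&mut self, now_us: u64) {
        let elapsed = now_us.saturating_sub(self.t_c);
        let refill = (self.rate as u128 * elapsed as u128 / 1_000_000) as i64;
        self.tokens = (self.tokens + refill).min(self.burst as i64);
        self.t_c = now_us;
    }

    /// May this class send `len` bytes at its guaranteed rate right now?
    fn can_send(&mut self, now_us: u64, len: u64) -> bool {
        self.update(now_us);
        if self.tokens >= len as i64 {
            self.tokens -= len as i64;
            true
        } else {
            false // at rate limit: HTB would try borrowing up to ceil
        }
    }
}

fn main() {
    // 1 MB/s rate, 10 KB burst, bucket initially full.
    let mut tb = TokenBucket { rate: 1_000_000, burst: 10_000, tokens: 10_000, t_c: 0 };
    assert!(tb.can_send(0, 10_000));    // drains the bucket
    assert!(!tb.can_send(0, 1_500));    // empty: borrow or wait
    assert!(tb.can_send(2_000, 1_500)); // 2 ms later: 2000 bytes refilled
}
```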

noqueue -- No Queuing

Used for loopback and virtual devices (veth, tun/tap) where the driver accepts packets immediately. Enqueue: calls the device's hard-start-xmit directly and returns. Dequeue: always returns None (nothing is buffered). If the device rejects the packet, enqueue() propagates the error to the caller.

15.5.6 Classifiers (tc Filters)

Filters classify packets into qdisc classes. Each filter is attached to a qdisc (or a filter chain within it) and inspects the packet to return a class handle.

/// Classifier (tc filter) interface.
pub trait ClsOps: Send + Sync {
    fn name(&self) -> &'static str;

    /// Classify `buf` into a class.
    ///
    /// Returns:
    /// - `ClsResult::Class(handle)`: packet goes to this class
    /// - `ClsResult::Drop`: packet is dropped immediately
    /// - `ClsResult::Ok`: no match; continue to next filter in chain
    /// - `ClsResult::Redir(ifindex)`: redirect to another device (tc redirect action)
    fn classify(&self, buf: NetBufHandle, tp: &TcFilter) -> ClsResult;

    /// Install or update a filter from netlink attributes.
    fn change(&self, tp: &mut TcFilter, opts: &NlAttrSet) -> Result<(), KernelError>;

    /// Destroy the filter, releasing allocated resources.
    fn destroy(&self, tp: &mut TcFilter);

    /// Dump the filter configuration as netlink attributes.
    fn dump(&self, tp: &TcFilter, skb: &mut NetBuf) -> Result<(), KernelError>;
}

#[derive(Debug)]
pub enum ClsResult {
    /// Packet classified into the specified class.
    Class(TcHandle),
    /// Packet should be dropped.
    Drop,
    /// No match; fall through to next filter.
    Ok,
    /// Redirect packet to another network device (by ifindex).
    Redir(u32),
}
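A sketch of how the classify step might walk a filter chain using these results: filters run in priority order, the first Class/Drop/Redir decision wins, and Ok falls through to the next filter and finally to the qdisc's default class. The `Verdict` type and chain-walk function are illustrative, not part of the interface above.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ClsResult {
    Class(u32), // u32 stands in for TcHandle
    Drop,
    Ok,
    Redir(u32),
}

#[derive(Debug, PartialEq)]
enum Verdict { SendTo(u32), Dropped, Redirected(u32) }

/// Walk the chain; each filter is modeled as a closure over the packet.
fn classify_chain(filters: &[&dyn Fn() -> ClsResult], default_class: u32) -> Verdict {
    for f in filters {
        match f() {
            ClsResult::Class(h) => return Verdict::SendTo(h),
            ClsResult::Drop => return Verdict::Dropped,
            ClsResult::Redir(ifindex) => return Verdict::Redirected(ifindex),
            ClsResult::Ok => continue, // no match: next filter in chain
        }
    }
    Verdict::SendTo(default_class) // no filter matched
}

fn main() {
    let miss = || ClsResult::Ok;
    let hit = || ClsResult::Class(0x0001_0010);
    // Second filter matches after the first falls through.
    assert_eq!(classify_chain(&[&miss, &hit], 0x0001_0001),
               Verdict::SendTo(0x0001_0010));
    // Nothing matches: qdisc default class.
    assert_eq!(classify_chain(&[&miss], 0x0001_0001),
               Verdict::SendTo(0x0001_0001));
    // A Drop verdict short-circuits the chain.
    assert_eq!(classify_chain(&[&(|| ClsResult::Drop)], 1), Verdict::Dropped);
}
```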

Builtin classifiers:

  • u32: Bitmask matching on arbitrary 32-bit words at fixed offsets in the packet header. Supports up to 128 keys per filter and optional hash tables for O(1) lookup on large rule sets. Used for IP address and port matching.

  • flower: Exact-match classifier on a set of header fields (Ethernet type, IP src/dst, L4 proto, TCP/UDP ports, VLAN id, MPLS label, etc.). Backed by a hash table; O(1) lookup regardless of rule count. Used by Kubernetes CNI plugins for policy enforcement and network overlays.

  • bpf: Attaches a verified eBPF program as the classifier. The program receives the NetBufHandle (via a BPF map lookup that translates the handle to the data pointer) and returns a class handle or TC_ACT_SHOT. This is the primary mechanism used by Cilium and Calico for Kubernetes network policy -- all policy logic is compiled to eBPF by the CNI plugin and loaded via RTM_NEWTFILTER. Requires Capability::NetAdmin.

15.5.7 Traffic Control Netlink Messages

The following RTM message types are handled by the rtnetlink processor (Section 15.2.1):

Message Direction Description
RTM_NEWQDISC user->kernel Create or replace a qdisc on a device
RTM_DELQDISC user->kernel Delete a qdisc; reverts to pfifo_fast
RTM_GETQDISC user->kernel Get one qdisc, or dump all qdiscs on all devices with NLM_F_DUMP; kernel replies RTM_NEWQDISC
RTM_NEWTFILTER user->kernel Attach a filter to a qdisc
RTM_DELTFILTER user->kernel Remove a filter
RTM_GETTFILTER user->kernel Get one filter, or dump all filters on a qdisc with NLM_F_DUMP
RTM_NEWCHAIN user->kernel Create a named filter chain on a qdisc
RTM_DELCHAIN user->kernel Delete a filter chain
RTM_GETCHAIN user->kernel Get/dump filter chains

All mutating operations require Capability::NetAdmin.

15.5.8 Integration with cgroups Network Bandwidth Enforcement

The net_cls cgroup controller marks each socket's packets with the cgroup's classid (a TcHandle). The u32 classifier is configured by the container runtime to match on this classid, routing packets to an HTB leaf class with the container's bandwidth limit. This is how Docker's --network-opt bandwidth limit and Kubernetes's bandwidth CNI plugin enforce per-container egress shaping.

The net_prio cgroup controller sets the skb_priority field on packets from a cgroup, which pfifo_fast uses for band selection -- providing per-cgroup priority without a full HTB hierarchy.

Integration path:

  1. The container runtime creates an HTB qdisc on the host-side veth of the container's network namespace.
  2. An HTB class is created with the container's rate/ceil limits.
  3. A u32 filter matches the classid (from net_cls) to the HTB class.
  4. umka-net's cgroup accounting hook stamps each outgoing packet's classid from current_task.cgroup.net_cls_classid before calling QdiscOps::enqueue.
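The stamping step can be sketched as follows. The classid uses the same major:minor u32 encoding as TcHandle, so 0x0010_0001 corresponds to tc handle "10:1". The struct and field names here are illustrative stand-ins for the packet metadata and cgroup state.

```rust
/// Stand-ins for the packet metadata and the task's cgroup state.
struct NetBufMeta { classid: u32 }
struct TaskCgroup { net_cls_classid: u32 }

/// Called on the TX path just before QdiscOps::enqueue: copy the cgroup's
/// classid onto the packet so the u32 filter can match it into an HTB class.
fn stamp_classid(buf: &mut NetBufMeta, cgroup: &TaskCgroup) {
    if cgroup.net_cls_classid != 0 {
        buf.classid = cgroup.net_cls_classid;
    }
}

fn main() {
    let cg = TaskCgroup { net_cls_classid: 0x0010_0001 }; // tc class 10:1
    let mut buf = NetBufMeta { classid: 0 };
    stamp_classid(&mut buf, &cg);
    assert_eq!(buf.classid >> 16, 0x10);  // major: 10 (hex)
    assert_eq!(buf.classid & 0xffff, 1);  // minor: 1
}
```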

15.5.9 Ingress Path

The ingress pseudo-qdisc (handle TcHandle::INGRESS) and clsact (handle TcHandle::CLSACT) attach to the RX path rather than TX. Packets arrive from the NIC driver before protocol processing; classifiers may drop them, redirect them to another device (tc redirect), or pass them through for normal processing.

clsact supports two hook points:

  • egress (TC_EGRESS): applied after routing, before NIC TX -- same as the TX qdisc chain but without buffering/shaping.
  • ingress (TC_INGRESS): applied before IP routing -- used by Cilium for pre-routing network policy and XDP-equivalent packet manipulation without the full XDP driver port.

Both hooks execute eBPF classifiers attached via RTM_NEWTFILTER and are the primary mechanism for Kubernetes CNI plugin data planes.


15.6 IPsec and XFRM Framework

The XFRM (transform) framework provides the kernel infrastructure for IPsec (ESP and AH) and any other per-packet cryptographic transform. IKEv2 key exchange is handled in userspace (strongSwan, libreswan, or systemd-networkd's IKEv2 client); the kernel implements packet transformation and the SA/SP databases.

Linux parallel: Linux's XFRM lives in net/xfrm/. UmkaOS implements the same xfrm_user netlink interface so that strongSwan, ip xfrm, and NetworkManager's IKEv2 support work unmodified.

15.6.1 Security Association (SA) -- XfrmState

/// An IPsec Security Association (SA).
///
/// An SA represents a one-directional security relationship between two endpoints.
/// It is identified by the triple (destination address, SPI, protocol) -- the `XfrmId`.
/// IKEv2 creates SAs in pairs (one for each direction).
pub struct XfrmState {
    /// SA identifier: (destination, SPI, protocol -- AH=51 or ESP=50).
    pub id: XfrmId,
    /// Source address of this SA (used to select the correct local interface).
    pub saddr: XfrmAddress,
    /// Traffic selector: which packets this SA covers (src/dst/proto/port ranges).
    /// For tunnel mode, this is the inner traffic; for transport, the endpoint pair.
    pub selector: XfrmSelector,
    /// Authenticated encryption algorithm (preferred: AES-GCM-128/256).
    /// Mutually exclusive with `auth` + `enc`.
    pub aead: Option<Box<AeadTfm>>,
    /// Authentication algorithm (HMAC-SHA256, etc.). Used with `enc` for CBC+HMAC.
    pub auth: Option<Box<ShashTfm>>,
    /// Encryption algorithm (AES-CBC, ChaCha20). Used with `auth`.
    pub enc: Option<Box<SkcipherTfm>>,
    /// SA lifetime limits (bytes transmitted, packets transmitted, wall-clock time).
    pub lifetime: XfrmLifetime,
    /// Counters for bytes, packets, and replay-window errors.
    pub stats: XfrmStats,
    /// ESP sequence number; incremented atomically on each TX packet.
    pub seq: AtomicU32,
    /// Anti-replay window (Section 15.6.5).
    pub replay_window: ReplayWindow,
    /// IPsec mode: Transport (host-to-host) or Tunnel (gateway-to-gateway).
    pub mode: XfrmMode,
    /// Address family: AF_INET or AF_INET6.
    pub family: AddressFamily,
    pub flags: XfrmStateFlags,
    /// Outer header overhead added by this SA (used by PMTU).
    pub header_len: u16,
    /// Optional UDP encapsulation port (NAT traversal: ESP-in-UDP, RFC 3948).
    pub encap: Option<XfrmEncap>,
    /// RCU-protected: the SA is read-locked during packet processing.
    _rcu: RcuHead,
}

/// SA identifier triple.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub struct XfrmId {
    /// Destination IP address (outer for tunnel, inner for transport).
    pub daddr: XfrmAddress,
    /// Security Parameters Index (network byte order).
    pub spi: u32,
    /// Protocol: IPPROTO_ESP (50) or IPPROTO_AH (51).
    pub proto: u8,
}

15.6.2 Security Policy (SP) -- XfrmPolicy

/// An IPsec Security Policy.
///
/// Policies are checked on every packet before the SA lookup.
/// A policy may require one or more transforms (an SA bundle), allow the
/// packet without transformation, or block it entirely.
pub struct XfrmPolicy {
    /// Traffic selector: src/dst addresses, L4 protocol, port ranges.
    pub selector: XfrmSelector,
    /// Action: apply transforms (Ipsec), pass (Allow), or drop (Block).
    pub action: XfrmAction,
    /// Required SA template chain. Each entry specifies the mode, protocol
    /// (ESP or AH), and algorithm requirements. At most 4 chained SAs
    /// (e.g., AH transport + ESP tunnel -- unusual but valid per RFC 4301).
    pub xfrm_vec: ArrayVec<XfrmTmpl, 4>,
    /// Priority (lower = higher priority). Policies are searched in priority order.
    pub priority: u32,
    /// Direction: In (inbound), Out (outbound), or Fwd (forwarded packets).
    pub dir: XfrmDir,
    pub flags: XfrmPolicyFlags,
    /// Index assigned at creation (for RTM_GETPOLICY lookup by index).
    pub index: u32,
    _rcu: RcuHead,
}

#[derive(Debug, Clone, Copy)]
pub enum XfrmAction {
    Allow,
    Block,
    Ipsec,
}

#[derive(Debug, Clone, Copy)]
pub enum XfrmMode {
    Transport,
    Tunnel,
}

#[derive(Debug, Clone, Copy)]
pub enum XfrmDir {
    In,
    Out,
    Fwd,
}

15.6.3 SA and SP Databases

SAD (Security Association Database): Hash table keyed by XfrmId (dst, SPI, proto). Load factor target <= 0.75; resizes when exceeded. The SAD uses a two-layer concurrency design to eliminate lock contention on the per-packet lookup path:

  • Read path (per-packet, hot path): rcu_dereference() on sad_rcu — zero lock contention. xfrm_state_lookup() acquires only an RCU read guard, never sad_lock.
  • Write path (SA install/delete, cold path): acquire sad_lock (a SpinLock), clone the current HashMap, insert or remove the entry, then publish the new map via rcu_assign() on sad_rcu. Release sad_lock. The old HashMap is freed after the current RCU grace period expires.

Individual XfrmState entries have their own RCU protection for in-place state updates (lifetime counters, replay windows). SA policy changes (SPD) are rare; SA lookups occur per packet.
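The clone-and-publish discipline can be sketched in userspace Rust. `Mutex<Arc<…>>` stands in for RcuCell + rcu_assign (std has no RCU; real kernel code defers freeing the old map to a grace period instead of relying on Arc refcounting), and the SA key is flattened to a tuple.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

type SaKey = (u32, u32, u8); // (dst as IPv4 word, SPI, proto) -- illustrative

struct Sad {
    /// Published snapshot. Readers clone the Arc (analogue of
    /// rcu_dereference) and never block writers.
    current: Mutex<Arc<HashMap<SaKey, &'static str>>>, // value = SA stand-in
    /// Serialises writers only (analogue of sad_lock); never held by readers.
    write_lock: Mutex<()>,
}

impl Sad {
    fn new() -> Self {
        Sad { current: Mutex::new(Arc::new(HashMap::new())), write_lock: Mutex::new(()) }
    }

    /// Hot path: take a snapshot, then look up without writer contention.
    fn lookup(&self, key: SaKey) -> Option<&'static str> {
        let snapshot = Arc::clone(&self.current.lock().unwrap());
        snapshot.get(&key).copied()
    }

    /// Cold path: copy, mutate, publish (analogue of rcu_assign).
    fn insert(&self, key: SaKey, sa: &'static str) {
        let _w = self.write_lock.lock().unwrap();
        let mut next = (**self.current.lock().unwrap()).clone();
        next.insert(key, sa);
        *self.current.lock().unwrap() = Arc::new(next);
        // Old map is freed when the last reader snapshot drops
        // (in the kernel: after the RCU grace period).
    }
}

fn main() {
    let sad = Sad::new();
    let key = (0x0a00_0001, 0x1234, 50); // 10.0.0.1, SPI 0x1234, ESP
    assert_eq!(sad.lookup(key), None);
    sad.insert(key, "sa-esp-aes-gcm");
    assert_eq!(sad.lookup(key), Some("sa-esp-aes-gcm"));
}
```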

Security Policy Database (SPD) lookup: UmkaOS uses a Patricia trie (radix tree) as the baseline SPD data structure — not a linear list at any policy count.

  • IP prefix matching: Patricia trie on source/destination IP prefixes. O(W) lookup where W = key width (32 bits for IPv4, 128 for IPv6). Policy specificity (more-specific prefixes take priority) is handled by longest-prefix match semantics native to the trie.

  • Port range selectors: When a policy has port-range selectors, a two-level lookup applies: first the Patricia trie on IP prefix, then an interval tree (augmented red-black tree) on port ranges within the prefix bucket. O(W + log P) where P = number of port-range policies matching the IP prefix.

  • Insert/delete: O(W) trie operations. Policy database updates are rare (usually at IPsec SA negotiation) and not on the fast path.

This design is correct for all policy set sizes; there is no threshold above which a different structure is used.

/// XFRM subsystem state (per network namespace).
pub struct XfrmNetns {
    /// Security Association Database (SAD).
    ///
    /// **Read path** (per-packet, hot path): `rcu_dereference(sad_rcu)` — zero lock
    /// contention. Individual `XfrmState` entries have their own RCU protection for
    /// in-place state updates (lifetime counters, replay windows).
    ///
    /// **Write path** (SA install/delete, cold path): acquire `sad_lock`, clone the
    /// HashMap, insert or remove the entry, then publish via `rcu_assign(sad_rcu)`.
    /// Release `sad_lock`. The old HashMap is freed after the current grace period.
    ///
    /// This two-layer design eliminates lock contention on the lookup path entirely.
    /// SA policy changes (SPD) are rare; SA lookups occur per packet.
    pub sad_rcu:  RcuCell<HashMap<XfrmId, Arc<XfrmState>>>,
    /// Serializes concurrent SAD write operations. NOT held during reads.
    pub sad_lock: SpinLock<()>,

    // Security Policy Database (SPD) — policy changes are infrequent; RwLock is fine.
    pub spd_in:   RwLock<PatriciaTrie<Arc<XfrmPolicy>>>,
    pub spd_out:  RwLock<PatriciaTrie<Arc<XfrmPolicy>>>,
    pub spd_fwd:  RwLock<PatriciaTrie<Arc<XfrmPolicy>>>,

    pub nlsk_group: NlMulticastGroup,
}

15.6.4 Packet Processing Hooks

Outbound (TX) -- xfrm_output(netns, buf):

  1. After routing decides the output interface, before QdiscOps::enqueue.
  2. Look up SPD (Out) with the packet's 5-tuple selector.
  3. If no matching policy or action == Allow: pass through.
  4. If action == Block: drop, return Err(KernelError::PermissionDenied).
  5. For action == Ipsec: find or trigger creation of the required SA chain.
  6. If SA exists in SAD: proceed.
  7. If SA missing: send XFRM_MSG_ACQUIRE to the IKEv2 daemon (via netlink); hold the packet in a xfrm_bundle_pending queue for up to 5 seconds. If no SA arrives within 5 seconds, drop and return EHOSTUNREACH.
  8. For each SA in the bundle (in order): apply transform.
  9. Transport mode: insert ESP/AH header between IP header and payload.
  10. Tunnel mode: prepend new outer IP header and ESP/AH; original packet becomes payload.
  11. Update XfrmState.stats.bytes, stats.packets. Check lifetime limits; if exceeded, send XFRM_MSG_EXPIRE and mark SA as expiring.
  12. Increment XfrmState.seq atomically; embed in ESP header.

Inbound (RX) -- xfrm_input(netns, buf):

  1. After IP receive, if protocol == ESP (50) or AH (51).
  2. Extract SPI from the ESP/AH header. Look up SAD by (dst, spi, proto).
  3. If no SA: drop, log INVALID_SPI error.
  4. Check anti-replay window (Section 15.6.5). If replayed: drop.
  5. Decrypt and authenticate using the SA's crypto transforms (AeadTfm for ESP-AEAD). On auth failure: drop, increment stats.integrity_failed.
  6. Strip the ESP/AH header. For tunnel mode: re-inject the inner packet at the IP receive path.
  7. Look up SPD (In) with the inner packet's 5-tuple; verify that a matching policy requiring this SA exists (inbound policy check -- prevents SA bypass by sending non-IPsec traffic to a port that should be protected).
  8. Deliver inner packet to the transport layer (TCP, UDP).

15.6.5 Anti-Replay Window

/// 64-packet anti-replay sliding window.
///
/// Prevents replay attacks where an attacker re-injects captured ESP packets.
/// The window tracks the 64 most recently received sequence numbers.
pub struct ReplayWindow {
    /// Sequence number of the right edge of the window (highest received).
    pub seq: u32,
    /// Bitmask: bit N is set if seq-N has been received. Bit 0 = seq itself.
    pub bitmap: u64,
}

impl ReplayWindow {
    /// Check and record a received sequence number.
    ///
    /// Returns `Ok(())` if the sequence number is acceptable (in window and not seen).
    /// Returns `Err(ReplayError::TooOld)` if the sequence is before the window.
    /// Returns `Err(ReplayError::Duplicate)` if the sequence has been seen.
    ///
    /// On `Ok`, records the sequence number in the bitmap and advances the window
    /// if this is the new highest sequence number.
    pub fn check_and_record(&mut self, new_seq: u32) -> Result<(), ReplayError> {
        if new_seq == 0 {
            // Sequence 0 is invalid per RFC 4303 §3.3.3.
            return Err(ReplayError::TooOld);
        }
        if new_seq > self.seq {
            // New highest: advance window.
            let diff = new_seq - self.seq;
            if diff < 64 {
                self.bitmap = (self.bitmap << diff) | 1;
            } else {
                self.bitmap = 1;
            }
            self.seq = new_seq;
        } else {
            let diff = self.seq - new_seq;
            if diff >= 64 {
                return Err(ReplayError::TooOld);
            }
            let mask = 1u64 << diff;
            if self.bitmap & mask != 0 {
                return Err(ReplayError::Duplicate);
            }
            self.bitmap |= mask;
        }
        Ok(())
    }
}
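A standalone harness exercising check_and_record (the struct and method above are reproduced so the example compiles on its own):

```rust
#[derive(Debug, PartialEq)]
enum ReplayError { TooOld, Duplicate }

struct ReplayWindow { seq: u32, bitmap: u64 }

impl ReplayWindow {
    // Same logic as in the text above.
    fn check_and_record(&mut self, new_seq: u32) -> Result<(), ReplayError> {
        if new_seq == 0 {
            return Err(ReplayError::TooOld); // seq 0 invalid per RFC 4303
        }
        if new_seq > self.seq {
            let diff = new_seq - self.seq;
            if diff < 64 { self.bitmap = (self.bitmap << diff) | 1; }
            else { self.bitmap = 1; }
            self.seq = new_seq;
        } else {
            let diff = self.seq - new_seq;
            if diff >= 64 { return Err(ReplayError::TooOld); }
            let mask = 1u64 << diff;
            if self.bitmap & mask != 0 { return Err(ReplayError::Duplicate); }
            self.bitmap |= mask;
        }
        Ok(())
    }
}

fn main() {
    let mut w = ReplayWindow { seq: 0, bitmap: 0 };
    assert!(w.check_and_record(1).is_ok());   // first packet
    assert!(w.check_and_record(5).is_ok());   // reordering gap is fine
    assert!(w.check_and_record(3).is_ok());   // late arrival within window
    assert_eq!(w.check_and_record(3), Err(ReplayError::Duplicate)); // replay
    assert!(w.check_and_record(200).is_ok()); // jump far ahead: window resets
    assert_eq!(w.check_and_record(100), Err(ReplayError::TooOld)); // left of window
}
```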

15.6.6 Netlink Interface (NETLINK_XFRM)

All XFRM management messages use the NETLINK_XFRM socket family. Messages require Capability::NetAdmin. Key message types:

Message Description
XFRM_MSG_NEWSA Create an SA (called by IKEv2 daemon after key exchange)
XFRM_MSG_DELSA Delete an SA by XfrmId
XFRM_MSG_GETSA Get one SA; or dump all SAs (NLM_F_DUMP)
XFRM_MSG_UPDSA Update an existing SA (rekey without connection teardown)
XFRM_MSG_NEWPOLICY Create a policy
XFRM_MSG_DELPOLICY Delete a policy by index or selector
XFRM_MSG_GETPOLICY Get one policy; or dump all
XFRM_MSG_UPDPOLICY Update a policy
XFRM_MSG_ACQUIRE kernel->daemon: SA needed for a packet; carry policy selector
XFRM_MSG_EXPIRE kernel->daemon: SA lifetime exhausted; carry SA id + hard/soft flag
XFRM_MSG_NEWAE Update SA sequence / replay state (for SA migration)
XFRM_MSG_REPORT kernel->daemon: audit event (policy bypass, integrity failure)

ACQUIRE flow: When xfrm_output encounters a packet matching an Ipsec policy but no matching SA, it sends XFRM_MSG_ACQUIRE to all sockets subscribed to the XFRM_NLGRP_ACQUIRE multicast group. The IKEv2 daemon receives the acquire, negotiates keys with the peer, and installs the SA via XFRM_MSG_NEWSA. The pending packet is held in the kernel and transmitted once the SA is installed.

15.6.7 Crypto API Integration

All IPsec transforms use the Kernel Crypto API (Section 9.1):

IPsec Algorithm Crypto API Request
AES-GCM-128 ESP crypto_alloc_aead("rfc4106(gcm(aes))", 0, 0)
AES-GCM-256 ESP crypto_alloc_aead("rfc4106(gcm(aes))", 0, 0) (256-bit key)
ChaCha20-Poly1305 ESP crypto_alloc_aead("rfc7539esp(chacha20,poly1305)", 0, 0)
AES-CBC-128 + HMAC-SHA256 crypto_alloc_skcipher("cbc(aes)", ...) + crypto_alloc_ahash("hmac(sha256)", ...)
AH HMAC-SHA256 crypto_alloc_ahash("hmac(sha256)", 0, 0)

The algorithm name and key material are supplied by the IKEv2 daemon in XFRM_MSG_NEWSA. The kernel allocates the transform, validates the key length, and stores the handle in XfrmState.aead / .auth / .enc.


15.7 SCTP -- Stream Control Transmission Protocol

SCTP (RFC 4960) is a transport protocol providing multi-homing, multi-streaming, reliable ordered delivery, and message-boundary preservation. UmkaOS implements SCTP as a registered transport in umka-net's socket layer (Section 15.1.2), using the same SocketOps trait and NetBuf pipeline as TCP and UDP.

Use cases in the UmkaOS deployment context: Corosync cluster heartbeat (Section 14.6), telecom DIAMETER/SS7 gateways, and iSCSI login negotiation all require SCTP. The kernel-side SCTP implementation allows these to work over standard AF_INET/AF_INET6 sockets without any userspace SCTP library.

15.7.1 Association State Machine

SCTP connections are called associations. The state machine matches RFC 4960 Section 4:

Closed ──send INIT──────────────────────────► CookieWait
       ◄──INIT-ACK (with cookie)───────────── (peer)
CookieWait ──send COOKIE-ECHO───────────────► CookieEchoed
CookieEchoed ──recv COOKIE-ACK──────────────► Established

Established ──user close────────────────────► ShutdownPending
Established ──recv SHUTDOWN─────────────────► ShutdownReceived
ShutdownPending ──data acked, send SHUTDOWN─► ShutdownSent
ShutdownReceived ──data acked, send SHUTDOWN-ACK──► ShutdownAckSent
ShutdownSent ──recv SHUTDOWN-ACK, send SHUTDOWN-COMPLETE──► Closed
ShutdownAckSent ──recv SHUTDOWN-COMPLETE────► Closed

Cookie mechanism: The INIT-ACK carries a State Cookie -- a MAC-protected (HMAC-SHA256) blob encoding the association parameters, timestamps, and a random nonce. The initiator echoes the cookie in COOKIE-ECHO without the responder storing any state between INIT and COOKIE-ECHO. This prevents memory exhaustion attacks (equivalent to TCP SYN cookies but specified by RFC 4960). The MAC key is rotated every 60 seconds; a grace period of one key-rotation interval accepts cookies from the previous key.
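A toy sketch of the stateless cookie check and key rotation. The keyed DefaultHasher "MAC" is a placeholder for HMAC-SHA256 (NOT cryptographically secure), and the two-key window models the 60-second rotation with its one-interval grace period; all names are illustrative.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::Hasher;

/// Placeholder keyed hash standing in for HMAC-SHA256. NOT secure.
fn mac(key: u64, payload: &[u8]) -> u64 {
    let mut h = DefaultHasher::new();
    h.write_u64(key);
    h.write(payload);
    h.finish()
}

struct CookieKeys { current: u64, previous: u64 }

impl CookieKeys {
    /// Mint the State Cookie for an INIT-ACK. No per-association state is
    /// stored between INIT and COOKIE-ECHO -- everything lives in the cookie.
    fn mint(&self, assoc_params: &[u8]) -> (Vec<u8>, u64) {
        (assoc_params.to_vec(), mac(self.current, assoc_params))
    }

    /// Verify a COOKIE-ECHO: MACs under the current or previous key pass.
    fn verify(&self, payload: &[u8], tag: u64) -> bool {
        mac(self.current, payload) == tag || mac(self.previous, payload) == tag
    }

    /// 60-second rotation: the current key becomes the grace-period key.
    fn rotate(&mut self, fresh_key: u64) {
        self.previous = self.current;
        self.current = fresh_key;
    }
}

fn main() {
    let mut keys = CookieKeys { current: 0xA1, previous: 0 };
    let (payload, tag) = keys.mint(b"init-params+nonce+timestamp");
    assert!(keys.verify(&payload, tag));
    keys.rotate(0xB2);                       // one rotation: grace period
    assert!(keys.verify(&payload, tag));
    keys.rotate(0xC3);                       // second rotation: cookie expired
    assert!(!keys.verify(&payload, tag));
    assert!(!keys.verify(b"tampered", tag)); // modified payload fails the MAC
}
```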

15.7.2 SctpAssoc Struct

/// An SCTP association (the SCTP equivalent of a TCP connection).
pub struct SctpAssoc {
    /// Kernel-assigned association ID (exposed via SCTP_ASSOCINFO sockopt).
    pub assoc_id: u32,
    /// Current state machine state.
    pub state: SctpState,
    /// Local IP addresses (multi-homing: all addresses bound to the socket).
    pub local_addrs: Vec<IpAddr>,
    /// Remote addresses (multi-homed peer; each has independent path state).
    pub peer_addrs: Vec<SctpPeer>,
    /// Index into `peer_addrs` for the active primary path.
    pub primary_path: usize,
    /// Per-stream send/receive state (indexed by Stream ID).
    pub streams: Vec<SctpStream>,
    /// Next TSN (Transmission Sequence Number) to use on the next DATA chunk TX.
    pub tsn_next: AtomicU32,
    /// Cumulative TSN ACKed by peer (from last received SACK).
    pub cum_tsn_ack: u32,
    /// Receiver window advertised by peer (bytes).
    pub rwnd: u32,
    /// Sender-side congestion window (bytes; per-path in SctpPeer).
    /// u64 to support high-BDP paths (400 Gbps × 100ms RTT = ~5 GB BDP exceeds u32 max).
    pub cwnd: u64,
    /// Slow-start threshold (u64 to match cwnd — same high-BDP rationale).
    pub ssthresh: u64,
    /// Partial bytes ACKed (for cwnd increment in congestion avoidance).
    pub partial_bytes_acked: u32,
    /// Effective MTU (minimum across all active paths).
    /// `u16` is sufficient: SCTP MTU cannot exceed 65535 bytes (IPv4/IPv6 packet
    /// limit), and all practical paths use ≤9000 bytes (jumbo Ethernet).
    /// Note: `RouteEntry::mtu` uses `u32` for forward compatibility with hypothetical
    /// future link types; the narrower `u16` here is intentional for the association
    /// aggregate (always bounded by the smallest path MTU, which is ≤65535).
    pub mtu: u16,
    /// Current RTO for the primary path.
    pub rto: Duration,
    /// Heartbeat interval (default: 30 seconds).
    pub hb_interval: Duration,
    /// Retransmit queue: DATA chunks awaiting SACK, indexed by TSN.
    pub retransmit_queue: BTreeMap<u32, NetBuf>,
    /// Out-of-order received DATA chunks awaiting gap fill.
    pub ooo_queue: BTreeMap<u32, NetBuf>,
    /// SCTP socket this association belongs to.
    pub sock: Weak<SctpSock>,
}

15.7.3 Multi-Homing

/// Per-path (per-remote-address) state in an SCTP association.
pub struct SctpPeer {
    /// Remote IP address of this path.
    pub addr: SockAddr,
    /// Path reachability state.
    pub state: PathState,
    /// Per-path congestion window (bytes).
    /// u64 to support high-BDP paths (400 Gbps × 100ms RTT = ~5 GB BDP exceeds u32 max).
    pub cwnd: u64,
    /// Per-path slow-start threshold (u64 to match cwnd — same high-BDP rationale).
    pub ssthresh: u64,
    /// Retransmission Timeout for this path (updated by RTTM: RFC 4960 Section 6.3).
    pub rto: Duration,
    /// RTO minimum (default: 1 second per RFC 4960; tunable via SCTP_RTOINFO).
    pub rto_min: Duration,
    /// RTO maximum (default: 60 seconds).
    pub rto_max: Duration,
    /// Smoothed RTT estimate (us).
    pub srtt_us: u32,
    /// RTT variance estimate (us).
    pub rttvar_us: u32,
    /// Heartbeat timer: fires if no data sent/received for `hb_interval`.
    pub hb_timer: TimerHandle,
    /// Consecutive retransmit timeouts on this path.
    pub error_count: u32,
    /// Threshold for declaring path failure (default: 5, tunable via SCTP_PADDRPARAMS).
    pub max_retrans: u32,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum PathState {
    /// Path is reachable and active.
    Active,
    /// Path is unreachable (error_count exceeded max_retrans).
    Inactive,
    /// Path has been added but not yet confirmed by a HEARTBEAT-ACK.
    Unconfirmed,
}

Path failure and failover: When error_count exceeds max_retrans on the primary path, the association marks it Inactive and selects the next Active path as the new primary. Retransmits are sent on the new primary path. Heartbeats continue on inactive paths; a successful HEARTBEAT-ACK resets error_count and marks the path Active again. If all paths become Inactive, the association is aborted (ABORT chunk sent on the last active path before marking it Inactive).
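
The failover selection described above reduces to a circular scan for the next Active path. A minimal sketch, with PathState as defined above (next_primary is a hypothetical helper; the real code operates on peer_addrs rather than a bare slice):

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum PathState { Active, Inactive, Unconfirmed }

/// Index of the next Active path after `primary`, scanning circularly.
/// Returns None if every path is down (the association must abort).
fn next_primary(paths: &[PathState], primary: usize) -> Option<usize> {
    if paths.is_empty() {
        return None;
    }
    (1..=paths.len())
        .map(|off| (primary + off) % paths.len())
        .find(|&i| paths[i] == PathState::Active)
}
```

Unconfirmed paths are skipped: a path only becomes a failover candidate once a HEARTBEAT-ACK has confirmed it.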

15.7.4 Multi-Streaming

/// Per-stream state in an SCTP association.
pub struct SctpStream {
    /// Stream ID (0 to num_streams - 1).
    pub sid: u16,
    /// Next Stream Sequence Number to assign to an outgoing ordered DATA chunk.
    pub ssn_out: u16,
    /// Next expected SSN for inbound ordered delivery.
    pub ssn_in_expected: u16,
    /// True for ordered streams; false for unordered (no SSN tracking).
    pub ordered: bool,
    /// Reorder buffer for out-of-order ordered chunks: SSN -> NetBuf.
    /// Chunks are delivered to the socket in SSN order; gaps are held here.
    pub reorder_buf: BTreeMap<u16, NetBuf>,
    /// Fragment reassembly buffer: TSN -> partial DATA chunk.
    /// For multi-chunk messages (B-bit=1, E-bit=0 intermediate fragments).
    pub fragment_buf: BTreeMap<u32, NetBuf>,
}

Ordered delivery: When an ordered DATA chunk arrives with ssn != ssn_in_expected, it is placed in reorder_buf. When the gap is filled (the expected SSN arrives), all consecutive SSNs are delivered to the socket receive buffer in order.
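
The gap-fill drain can be sketched with a plain BTreeMap standing in for reorder_buf (payloads shown as Vec<u8> instead of NetBuf; Reorder and push are hypothetical names):

```rust
use std::collections::BTreeMap;

struct Reorder {
    expected: u16,
    buf: BTreeMap<u16, Vec<u8>>,
}

impl Reorder {
    /// Accept one ordered chunk; return everything now deliverable in SSN order.
    fn push(&mut self, ssn: u16, data: Vec<u8>) -> Vec<Vec<u8>> {
        let mut out = Vec::new();
        if ssn != self.expected {
            self.buf.insert(ssn, data); // hold until the gap fills
            return out;
        }
        out.push(data);
        self.expected = self.expected.wrapping_add(1);
        // Drain the run of consecutive SSNs already buffered.
        while let Some(d) = self.buf.remove(&self.expected) {
            out.push(d);
            self.expected = self.expected.wrapping_add(1);
        }
        out
    }
}
```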

Unordered delivery (I-DATA with U-bit set, or DATA with UNORDERED flag): Delivered immediately to the receive buffer regardless of SSN. Fragment reassembly still uses TSN-based tracking.

Message fragmentation: When sendmsg() delivers a message larger than the path MTU, SCTP splits it into DATA chunks. The first chunk has B-bit=1, E-bit=0; middle chunks have both clear; the last has E-bit=1. I-DATA (RFC 8260) adds a Message Identifier and fragment offset for interleaved reassembly -- avoids head-of-line blocking when large messages are mixed with small real-time messages.
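
The B/E-bit assignment can be illustrated with a small sketch (chunk and header overhead is ignored for brevity; fragment and be_bits are hypothetical helpers):

```rust
/// (begin, end) flag pair for fragment `i` of `n`.
/// A single-chunk message carries B=1 and E=1 simultaneously.
fn be_bits(i: usize, n: usize) -> (bool, bool) {
    (i == 0, i + 1 == n)
}

/// Fragment sizes plus B/E bits for a message of `len` bytes over `mtu`.
fn fragment(len: usize, mtu: usize) -> Vec<(usize, bool, bool)> {
    let n = (len + mtu - 1) / mtu; // ceiling division
    (0..n)
        .map(|i| {
            let sz = if i + 1 == n { len - i * mtu } else { mtu };
            let (b, e) = be_bits(i, n);
            (sz, b, e)
        })
        .collect()
}
```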

15.7.5 SCTP Chunk Types

Value  Name               Direction      Purpose
0x00   DATA               bidirectional  User data payload
0x01   INIT               -> peer        Association setup request
0x02   INIT-ACK           <- peer        Setup response with cookie
0x03   SACK               bidirectional  Selective acknowledgement of TSNs
0x04   HEARTBEAT          -> peer        Path liveness probe
0x05   HEARTBEAT-ACK      <- peer        Path liveness response
0x06   ABORT              bidirectional  Immediate association teardown
0x07   SHUTDOWN           -> peer        Graceful shutdown initiation
0x08   SHUTDOWN-ACK       <- peer        Shutdown acknowledgement
0x09   ERROR              bidirectional  Error notification chunk
0x0a   COOKIE-ECHO        -> peer        Echo cookie from INIT-ACK
0x0b   COOKIE-ACK         <- peer        Cookie accepted; association open
0x0e   SHUTDOWN-COMPLETE  <- peer        Shutdown sequence complete
0x40   I-DATA             bidirectional  Interleaved data (RFC 8260)

Unknown chunk types: the high two bits of the type byte tell the receiver how to handle a chunk it does not recognise (RFC 4960 Section 3.2). 00: stop processing the packet and discard the chunk silently. 01: stop processing and discard, but report the unrecognised chunk in an ERROR chunk. 10: skip the chunk and continue processing the bundle. 11: skip, continue, and report.

15.7.6 Socket API Compatibility

SCTP is accessible via two socket styles:

One-to-one (SOCK_STREAM, one association per socket):

fd = socket(AF_INET6, SOCK_STREAM, IPPROTO_SCTP);
// bind, listen, accept, connect -- identical semantics to TCP
// sendmsg / recvmsg -- each sendmsg sends one SCTP message

One-to-many (SOCK_SEQPACKET, multiple associations multiplexed on one socket):

fd = socket(AF_INET, SOCK_SEQPACKET, IPPROTO_SCTP);
// bind; no connect needed -- associations created on first sendmsg to a new peer
// recvmsg returns SCTP notification events when associations change state

SCTP-specific sockopts (level IPPROTO_SCTP):

Sockopt              Get  Set  Description
SCTP_NODELAY         yes  yes  Disable Nagle-equivalent bundling delay
SCTP_MAXSEG          yes  yes  Maximum message size (MTU override)
SCTP_STATUS          yes  no   Association state, primary path, streams
SCTP_ASSOCINFO       yes  yes  RTO params, max retransmits
SCTP_RTOINFO         yes  yes  RTO.initial, RTO.min, RTO.max per assoc
SCTP_PADDRPARAMS     yes  yes  Per-path heartbeat interval and max_retrans
SCTP_EVENTS          yes  yes  Which SCTP notification events to receive
SCTP_INITMSG         yes  yes  Number of streams, max retransmits for INIT
SCTP_PEER_ADDR_INFO  yes  no   State/RTT/cwnd for a specific peer address

15.7.7 Integration with NetBuf

SCTP DATA chunks are carried in NetBuf segments. The SCTP TX path:

  1. sendmsg() delivers user data as a NetBuf (zero-copy from the socket send buffer using NetBuf::from_user_iov()).
  2. If msg_len <= mtu - sctp_header_overhead: wrap in a single DATA chunk, assign TSN, enqueue on the primary path's TX queue.
  3. If msg_len > mtu - sctp_header_overhead: fragment into N chunks. Each fragment is a separate NetBuf chained via the existing scatter-gather list (NetBuf.frags). Fragments share the data pages (reference-counted DmaBufferHandle) -- no copy.
  4. On SACK receipt: retire ACKed NetBufs from retransmit_queue, decrement refcounts.
  5. On retransmit: the retained NetBuf in retransmit_queue is retransmitted without allocating a new buffer.
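
Retiring ACKed chunks in step 4 requires serial-number comparison, because TSNs wrap at 2^32. A sketch of the arithmetic (tsn_after and retire are illustrative, with the retransmit queue reduced to bare TSNs):

```rust
/// True if TSN `a` is logically after `b`, even across the u32 wrap
/// (serial-number arithmetic: the signed difference decides the order).
fn tsn_after(a: u32, b: u32) -> bool {
    (a.wrapping_sub(b) as i32) > 0
}

/// Drop every TSN at or below `cum_ack` from a retransmit queue.
fn retire(queue: &mut Vec<u32>, cum_ack: u32) {
    queue.retain(|&tsn| tsn_after(tsn, cum_ack));
}
```

A naive `tsn <= cum_ack` comparison would mis-retire every queued chunk the moment the TSN counter wraps past zero.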

15.8 AF_VSOCK -- Virtual Machine Sockets

AF_VSOCK (address family 40) enables bidirectional socket communication between a VM guest and its hypervisor host without configuring a network interface. It is used by the QEMU guest agent, containerd's CRI-over-vsock path, systemd-vmspawn, and cloud-init datasource queries.

Linux parallel: Linux implements AF_VSOCK in net/vmw_vsock/. UmkaOS implements the same sockaddr_vm ABI and VMADDR_CID_* constants so that unmodified guest agents and container runtimes work without recompilation.

15.8.1 Address Space

/// Virtual socket address (matches Linux struct sockaddr_vm, 16 bytes).
#[repr(C)]
pub struct SockAddrVm {
    /// Address family: AF_VSOCK (40).
    pub svm_family: u16,
    pub svm_reserved1: u16,
    /// Port number (no privileged/unprivileged distinction -- no system ports below 1024).
    pub svm_port: u32,
    /// Context ID (CID) of the communicating endpoint.
    pub svm_cid: u32,
    /// Must be zero (reserved for future use, matching Linux padding).
    pub svm_zero: [u8; 4],
}

/// Well-known CID values.
pub mod vmaddr_cid {
    /// Bind to all local CIDs (wildcard, for listen sockets).
    pub const ANY: u32 = 0xFFFF_FFFF;
    /// Hypervisor CID (QEMU/KVM host side of the vsock device).
    pub const HYPERVISOR: u32 = 0;
    /// Local loopback within the same VM or host context.
    pub const LOCAL: u32 = 1;
    /// Host (hypervisor userspace, e.g., QEMU process) -- used from the guest.
    pub const HOST: u32 = 2;
}

Guest VMs receive their CID from the hypervisor at VM creation (Section 15.8.6). CIDs >= 3 are dynamically assigned.

15.8.2 VsockTransport Trait

The transport layer is abstracted so that different hypervisor back-ends (virtio-vsock, VMware VMCI, loopback) can be registered. Only one transport is active per boot.

/// Virtual socket transport back-end.
///
/// Implemented by virtio-vsock (Tier 2 guest driver + Tier 1 vhost back-end),
/// and by the loopback transport (for LOCAL-to-LOCAL communication).
pub trait VsockTransport: Send + Sync {
    /// Transport name (for sysfs reporting).
    fn name(&self) -> &'static str;

    /// Initialise the transport at module load.
    fn init(&self) -> Result<(), KernelError>;

    /// Shut down the transport; called when the vsock module is removed.
    fn release(&self);

    /// Initiate a connection from `sock` to its `remote_addr`.
    ///
    /// Sends a REQUEST packet; returns `Ok(())` immediately (async connect).
    /// The caller blocks in `VsockSock.state == Connecting` until a RESPONSE
    /// or RST arrives.
    fn connect(&self, sock: &mut VsockSock) -> Result<(), KernelError>;

    /// Disconnect the socket (send RST or SHUTDOWN).
    fn disconnect(&self, sock: &mut VsockSock, flags: u32) -> Result<(), KernelError>;

    /// Send data from the socket's send buffer (called after a credit update
    /// increases the available send window).
    fn send(&self, sock: &mut VsockSock, msg: &MsgHdr, flags: i32)
        -> Result<usize, KernelError>;

    /// Receive data into the caller's buffer.
    fn recv(&self, sock: &mut VsockSock, msg: &mut MsgHdr, flags: i32)
        -> Result<usize, KernelError>;

    /// Returns true if the socket has incoming data (for poll/epoll).
    fn notify_poll_in(&self, sock: &VsockSock) -> bool;

    /// Returns true if the socket has space to send (for poll/epoll).
    fn notify_poll_out(&self, sock: &VsockSock) -> bool;
}

/// Global active transport.
static VSOCK_TRANSPORT: RwLock<Option<&'static dyn VsockTransport>> =
    RwLock::new(None);

/// Register the active vsock transport (called once at module init).
pub fn vsock_register_transport(t: &'static dyn VsockTransport) -> Result<(), KernelError> {
    let mut slot = VSOCK_TRANSPORT.write();
    if slot.is_some() {
        return Err(KernelError::AlreadyExists);
    }
    // Initialise before publishing: a failed init() must not leave a
    // half-registered transport behind.
    t.init()?;
    *slot = Some(t);
    Ok(())
}

15.8.3 Virtio-Vsock Transport

The virtio-vsock transport uses the existing UmkaOS virtio device model (Section 10.4). It operates over two virtio queues: TX (guest->host) and RX (host->guest), plus an event queue for connection lifecycle notifications.

/// A single virtio-vsock packet header (maps to struct virtio_vsock_hdr in
/// the virtio specification and in Linux). Transmitted between guest and
/// host in virtio ring descriptors; all fields are little-endian on the wire.
#[repr(C, packed)]
pub struct VsockPacket {
    /// Source context ID (64-bit on the wire, although assigned CIDs fit in u32).
    pub src_cid: u64,
    /// Destination context ID.
    pub dst_cid: u64,
    /// Source port.
    pub src_port: u32,
    /// Destination port.
    pub dst_port: u32,
    /// Payload length (bytes following this header).
    pub len: u32,
    /// Socket type: SOCK_STREAM (1) or SOCK_SEQPACKET (5).
    pub type_: u16,
    /// Operation code.
    pub op: VsockOp,
    /// Operation-specific flags (e.g., SHUTDOWN_RCV, SHUTDOWN_SEND).
    pub flags: u32,
    /// Receiver buffer allocation (bytes the sender is willing to buffer).
    pub buf_alloc: u32,
    /// Bytes consumed by the receiver since last credit update.
    pub fwd_cnt: u32,
}

#[repr(u16)]
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum VsockOp {
    Invalid       = 0,
    /// Connection request (CLIENT -> SERVER).
    Request       = 1,
    /// Connection accepted (SERVER -> CLIENT).
    Response      = 2,
    /// Connection reset (either direction).
    Rst           = 3,
    /// Half-close (SHUT_RD or SHUT_WR).
    Shutdown      = 4,
    /// Data payload (either direction).
    Rw            = 5,
    /// Credit update: `buf_alloc` and `fwd_cnt` updated, no payload.
    CreditUpdate  = 6,
    /// Request a credit update from the peer.
    CreditRequest = 7,
}

The host-side vhost_vsock runs as a Tier 1 umka-net thread bound to a specific VM's CID. It processes the virtio vhost ring (Section 10.4.3) and demultiplexes incoming packets to the correct VsockSock by (dst_cid, dst_port).

15.8.4 VsockSock Struct

/// A virtual socket instance.
pub struct VsockSock {
    /// Connection state.
    pub state: VsockState,
    /// Local CID and port.
    pub local_addr: SockAddrVm,
    /// Remote CID and port (zero until connected or accept).
    pub remote_addr: SockAddrVm,
    /// TX buffer: data waiting to be sent, bounded by the peer's credit.
    /// Ring capacity is negotiated at connect time (default: 256 KiB).
    pub send_buf: RingBuffer<u8>,
    /// RX queue: received `NetBuf` segments waiting for `recv()`.
    pub recv_queue: VecDeque<NetBuf>,
    /// Bytes peer has allocated for receiving from us (updated on CREDIT_UPDATE).
    pub credit_peer_buf_alloc: u32,
    /// Bytes peer has consumed from us (fwd_cnt from peer's last CREDIT_UPDATE).
    pub credit_peer_fwd_cnt: u32,
    /// Bytes we have sent to the peer (tracked locally).
    pub bytes_sent: u32,
    /// Our receive buffer allocation (advertised to peer in buf_alloc field).
    pub local_buf_alloc: u32,
    /// Bytes we have consumed from our receive buffer (reported as fwd_cnt).
    pub local_fwd_cnt: u32,
    /// Active transport.
    pub transport: &'static dyn VsockTransport,
    /// Wait queue for blocked send/recv operations.
    pub waitq: WaitQueue,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum VsockState {
    Unconnected,
    Connecting,
    Connected,
    Disconnecting,
    /// Socket has been shut down; no more sends are possible but RX may still arrive.
    Shutdown,
}

15.8.5 Flow Control

AF_VSOCK uses credit-based flow control equivalent to UmkaOS's ring buffer token model (Section 3.1.6):

Send permission: The sender tracks:

send_window = credit_peer_buf_alloc - (bytes_sent - credit_peer_fwd_cnt)

The sender may transmit at most send_window additional bytes before the peer must issue a CREDIT_UPDATE. If send_window == 0, the sender blocks (or returns EAGAIN for O_NONBLOCK sockets) until a CREDIT_UPDATE packet arrives.
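
The window computation must tolerate counter wraparound, since bytes_sent and fwd_cnt are free-running u32 counters. A sketch of the arithmetic (function names are illustrative):

```rust
/// Bytes the peer has buffered on our behalf but not yet consumed.
/// Wrapping subtraction keeps the result correct after either counter wraps.
fn in_flight(bytes_sent: u32, peer_fwd_cnt: u32) -> u32 {
    bytes_sent.wrapping_sub(peer_fwd_cnt)
}

/// How many more bytes we may send before the peer must issue a CREDIT_UPDATE.
fn send_window(peer_buf_alloc: u32, bytes_sent: u32, peer_fwd_cnt: u32) -> u32 {
    peer_buf_alloc.saturating_sub(in_flight(bytes_sent, peer_fwd_cnt))
}
```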

Receive credit replenishment: After delivering N bytes of received data to userspace via recv(), the kernel sends a CREDIT_UPDATE packet with the updated fwd_cnt value, replenishing the peer's send window. The credit update is coalesced: it is sent when local_fwd_cnt increases by at least local_buf_alloc / 2 (50% watermark), preventing a credit update storm on small reads.

Initial negotiation: The REQUEST packet carries the local buf_alloc. The RESPONSE packet carries the peer's buf_alloc. Both sides initialise credit_peer_buf_alloc from the received value before data transfer begins.

15.8.6 Integration with KVM

When a VM is created (Section 17.1):

/// Kernel-global CID allocator for vsock.
/// CIDs 0-2 are reserved (HYPERVISOR, LOCAL, HOST).
/// VMs receive CIDs starting from 3.
static VSOCK_CID_ALLOCATOR: Mutex<BTreeSet<u32>> = Mutex::new(BTreeSet::new());

/// Allocate a CID for a new VM.
///
/// Returns a CID in the range [3, 0xFFFF_FFFE].
/// Returns `KernelError::ResourceExhausted` if no CIDs are available
/// (unlikely: ~4 billion VMs).
pub fn vsock_alloc_cid() -> Result<u32, KernelError> {
    let mut allocated = VSOCK_CID_ALLOCATOR.lock();
    // Find the smallest integer in [3, 0xFFFF_FFFE] not in `allocated`.
    // The bounded range makes exhaustion return an error instead of
    // overflowing the iterator.
    let cid = (3u32..=0xFFFF_FFFE).find(|c| !allocated.contains(c))
        .ok_or(KernelError::ResourceExhausted)?;
    allocated.insert(cid);
    Ok(cid)
}

/// Release a CID when the VM is destroyed.
///
/// Resets all sockets using this CID (sends RST to any connected peer
/// sockets) before returning the CID to the pool.
pub fn vsock_free_cid(cid: u32) {
    // 1. Walk the global socket table; RST all sockets with local_cid == cid
    //    or remote_cid == cid.
    vsock_reset_all_for_cid(cid);
    VSOCK_CID_ALLOCATOR.lock().remove(&cid);
}

The vhost_vsock file descriptor is created in the host context and linked to the VM struct via the KVM device model. When KVM_CREATE_VM is issued, the KVM subsystem calls vsock_alloc_cid() and stores the result in the VM struct. When the VM is destroyed (last kvm_fd closed), vsock_free_cid() is called from the VM teardown path.

15.8.7 sysfs Interface

/sys/kernel/umka/vsock/
|-- local_cid       (r--): This context's CID (guest: assigned CID; host: VMADDR_CID_HOST = 2)
|-- transport       (r--): Active transport name (e.g., "virtio-vsock", "loopback")
`-- connections/    (r--): Per-connection state (one subdir per active socket pair)
    `-- <local_cid>:<local_port>:<remote_cid>:<remote_port>/
        |-- state           (r--): VsockState as string
        |-- send_window     (r--): Current send window in bytes
        `-- recv_queued     (r--): Bytes queued in recv_queue

local_cid is written once at transport init and is read-only thereafter. The connections/ subtree uses the existing umka sysfs dynamic-attribute model (Section 19.3), with one entry per connected VsockSock. Entries appear on RESPONSE (connect) and disappear on RST or Shutdown completion.


15.9 802.1Q VLAN Subsystem

IEEE 802.1Q Virtual LANs allow a single Ethernet link to carry traffic for multiple logical networks. Each frame carries a 4-byte tag inserted between the source MAC address and the EtherType field; the 12-bit VLAN ID (VID) partitions broadcast domains on shared infrastructure without rewiring. UmkaOS implements a full 802.1Q VLAN subsystem inside umka-net (Section 15.1.1), at the link layer, below the IPv4/IPv6 network layer.

Linux parallel: Linux implements 802.1Q in net/8021q/ with the 8021q kernel module. UmkaOS's VLAN subsystem provides full API and behavioral compatibility so that vconfig, ip link, and bridge vlan commands work on UmkaOS without modification.

15.9.1 Overview

An 802.1Q tag is inserted into an Ethernet frame immediately after the 6-byte source MAC address. The tag occupies exactly 4 bytes:

 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|         TPID (0x8100)         |PCP|D|        VID (12 bits)   |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
  • TPID (Tag Protocol Identifier, 16 bits): 0x8100 for 802.1Q; 0x88A8 for 802.1ad (QinQ outer tag). Both TPID values are ≥ 0x0600, so the 2-byte field after the source MAC unambiguously identifies either a plain EtherType (untagged frame -- e.g. 0x0800 IPv4, 0x86DD IPv6, 0x0806 ARP) or a TPID (tagged frame).
  • PCP (Priority Code Point, 3 bits): 802.1p QoS priority, 0–7. Mapped to internal skb_priority by the ingress priority map.
  • DEI (Drop Eligible Indicator, 1 bit, formerly CFI): set by upstream equipment to indicate this frame may be dropped under congestion.
  • VID (VLAN Identifier, 12 bits): 1–4094 usable values. VID 0 is reserved (priority-only tag, no VLAN membership). VID 4095 is reserved by the standard.
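
Packing and unpacking the 16-bit TCI from these three fields is a pair of shifts and masks. An illustrative sketch (these names are not the umka-net API):

```rust
/// Pack PCP (bits 15-13), DEI (bit 12), and VID (bits 11-0) into a TCI.
fn tci_pack(pcp: u8, dei: bool, vid: u16) -> u16 {
    ((pcp as u16 & 0x7) << 13) | ((dei as u16) << 12) | (vid & 0x0FFF)
}

/// Recover (pcp, dei, vid) from a TCI value.
fn tci_unpack(tci: u16) -> (u8, bool, u16) {
    (((tci >> 13) & 0x7) as u8, (tci >> 12) & 1 == 1, tci & 0x0FFF)
}
```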

QinQ (802.1ad): Service provider networks use double tagging to tunnel customer VLANs (C-VLAN) across provider networks. The outer tag uses TPID 0x88A8 (S-VLAN, "service tag"); the inner tag uses 0x8100 (C-VLAN, "customer tag"). UmkaOS models the outer tag as a VlanProto::Dot1AD VLAN device stacked on top of a VlanProto::Dot1Q device. The link-layer transmit path inserts both tags, outer first.

15.9.2 VLAN Device Model

UmkaOS models each VLAN as a virtual NetDevice — a VlanDev — that sits on top of a real ("lower") NetDevice. Multiple VlanDev instances can share the same lower device, each with a distinct VID. The VLAN device has its own MAC address (inherited from the lower device by default, but overridable), its own ARP/NDP state, and its own routing table entries. Its MTU is lower.mtu - 4 to account for the tag bytes.

/// A virtual 802.1Q or 802.1ad VLAN network device.
///
/// Sits on top of a lower `NetDevice` and presents a logical interface
/// restricted to one VLAN ID. Transmit inserts (or requests hardware to
/// insert) the 802.1Q/802.1ad tag; receive strips the tag and demultiplexes.
pub struct VlanDev {
    /// The real ("lower") network device this VLAN rides on.
    pub lower: Arc<NetDevice>,
    /// VLAN ID in the range 1..=4094 (VID 0 and 4095 are reserved).
    pub vlan_id: u16,
    /// Tag protocol: 802.1Q (0x8100) or 802.1ad / QinQ outer tag (0x88A8).
    pub vlan_proto: VlanProto,
    /// Feature flags controlling VLAN device behaviour.
    pub flags: VlanDevFlags,
    /// Ingress priority map: PCP value (0–7) → internal skb_priority.
    /// Populated via IFLA_VLAN_INGRESS_QOS or ioctl SIOCSIFVLAN.
    pub ingress_priority_map: [u8; 8],
    /// Egress priority map: internal skb_priority → PCP value (0–7).
    /// Keys are skb_priority values; values are 3-bit PCP codes.
    pub egress_priority_map: BTreeMap<u32, u8>,
    /// The VLAN's own NetDevice handle (MAC, stats, queue disciplines).
    pub netdev: NetDevice,
}

/// Tag protocol identifier distinguishing 802.1Q from 802.1ad QinQ.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
#[repr(u16)]
pub enum VlanProto {
    /// IEEE 802.1Q customer VLAN tag (TPID = 0x8100).
    Dot1Q  = 0x8100,
    /// IEEE 802.1ad service provider outer tag (TPID = 0x88A8).
    Dot1AD = 0x88A8,
}

bitflags::bitflags! {
    /// Behavioural flags for a VLAN device.
    pub struct VlanDevFlags: u32 {
        /// Reorder Ethernet header on ingress for efficient higher-layer processing.
        const REORDER_HDR    = 1 << 0;
        /// Enable GVRP (GARP VLAN Registration Protocol) on this VLAN interface.
        const GVRP           = 1 << 1;
        /// Loose binding: VLAN device stays UP even if lower device is DOWN.
        const LOOSE_BINDING  = 1 << 2;
        /// Enable MVRP (Multiple VLAN Registration Protocol) on this interface.
        const MVRP           = 1 << 3;
        /// Bridge binding: VLAN follows bridge master state instead of lower device.
        const BRIDGE_BINDING = 1 << 4;
    }
}

VLAN devices are created and destroyed through the netlink RTM_NEWLINK / RTM_DELLINK path (Section 15.9.6). The VLAN subsystem registers a LinkOps implementation named "vlan" with the netlink link-type registry; the kernel dispatches RTM_NEWLINK with IFLA_INFO_KIND = "vlan" to this handler.

15.9.3 Transmit Path

When userspace writes to a socket bound through a VLAN device, the packet reaches the VLAN device's ndo_start_xmit entry point. The transmit path proceeds as follows:

  1. PCP selection: Look up the packet's skb_priority in egress_priority_map. If no entry exists, use PCP 0. Encode as the upper 3 bits of the TCI (Tag Control Information) field: tci = (pcp << 13) | (dei << 12) | vlan_id.

  2. Hardware offload check: Inspect the lower device's feature flags for NETIF_F_HW_VLAN_CTAG_TX (hardware 802.1Q TX offload) or NETIF_F_HW_VLAN_STAG_TX (hardware 802.1ad TX offload).

  3. Offload available: Store tci in NetBuf.vlan_tci and set NetBuf.vlan_present = true. Pass the unmodified frame to the lower device. The NIC inserts the 4-byte tag at the correct position in hardware, saving a memmove of the MAC header.

  4. No offload: Prepend the tag in software. Call netbuf_push_vlan_tag(buf, vlan_proto, tci) which expands the headroom by 4 bytes, memmoves the 12-byte MAC header (6 bytes DA + 6 bytes SA) 4 bytes toward the start of the buffer, and writes the 4-byte tag at byte offset 12. The frame is then handed to the lower device's transmit function.

  5. Lower device enqueue: The (possibly tag-inserted) NetBuf is passed to the lower NetDevice's transmit path, which applies traffic control (qdisc) and enqueues to the NIC ring.

QinQ transmit (outer tag 0x88A8, inner tag 0x8100) follows the same path twice: the inner VlanDev inserts or requests the C-VLAN tag, then the outer VlanDev inserts or requests the S-VLAN tag. Hardware NIC offload for double-tagging requires NETIF_F_HW_VLAN_STAG_TX; if absent, both tags are inserted in software in two sequential passes.
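
The software insertion in step 4 can be sketched on a plain byte vector (push_vlan_tag here is an illustrative stand-in for netbuf_push_vlan_tag, using Vec::insert in place of the headroom-expanding memmove):

```rust
/// Insert a 4-byte 802.1Q/802.1ad tag after the 12-byte MAC header.
/// `tpid` and `tci` are written in network byte order at offset 12;
/// the original EtherType (and payload) shift 4 bytes toward the end.
fn push_vlan_tag(frame: &mut Vec<u8>, tpid: u16, tci: u16) {
    assert!(frame.len() >= 12, "need DA + SA before the tag point");
    let tag = [
        (tpid >> 8) as u8, tpid as u8,
        (tci >> 8) as u8, tci as u8,
    ];
    for (i, b) in tag.iter().enumerate() {
        frame.insert(12 + i, *b);
    }
}
```

Calling it twice, outer tag last at offset 12, reproduces the QinQ double-tag layout described above.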

15.9.4 Receive Path

On receive, the lower NIC driver delivers frames to umka-net's generic receive entry point. The VLAN receive path runs before the frame is dispatched to the network layer:

  1. Hardware tag strip check: If the NIC reported NETIF_F_HW_VLAN_CTAG_RX (or NETIF_F_HW_VLAN_STAG_RX for QinQ), the tag has already been stripped by the NIC and its TCI value is stored in NetBuf.vlan_tci with vlan_tci_valid = true. Proceed to step 3.

  2. Software tag detection: Inspect the EtherType field at byte offset 12 of the frame. If it equals VlanProto::Dot1Q as u16 (0x8100) or VlanProto::Dot1AD as u16 (0x88A8), a tag is present. Extract the 4-byte TCI, store in NetBuf.vlan_tci, set vlan_tci_valid = true. Remove the 4 tag bytes: memmove the 12-byte MAC header 4 bytes toward the end of the buffer (restoring the original untagged layout), advance the frame data pointer.

  3. VID lookup: Extract vid = vlan_tci & 0x0FFF (lower 12 bits). Walk the lower device's VLAN device table (a BTreeMap<u16, Arc<VlanDev>> keyed on VID) to find the registered VlanDev for this VID.

  4. Ingress priority mapping: Extract pcp = (vlan_tci >> 13) & 0x7. Map via VlanDev.ingress_priority_map[pcp] to set NetBuf.priority for upstream QoS.

  5. Dispatch: If a matching VlanDev is found, deliver the frame to that device's receive queue. If no VlanDev is registered for this VID:
       • If a trunk port is configured on the lower device (bridge mode, Section 15.9.7), deliver to the bridge for forwarding.
       • Otherwise, drop the frame and increment the lower device's rx_dropped counter.
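
The software detection and strip of steps 2-3 can be sketched similarly (pop_vlan_tag is a hypothetical helper operating on a Vec<u8> frame in place of a NetBuf):

```rust
/// If the 2-byte field at offset 12 is a known TPID, extract the TCI,
/// strip the 4 tag bytes (restoring the untagged layout), and return
/// (pcp, vid) for the VID demux lookup. Returns None for untagged frames.
fn pop_vlan_tag(frame: &mut Vec<u8>) -> Option<(u8, u16)> {
    if frame.len() < 18 {
        return None; // too short to carry a tag plus EtherType
    }
    let tpid = u16::from_be_bytes([frame[12], frame[13]]);
    if tpid != 0x8100 && tpid != 0x88A8 {
        return None; // ordinary EtherType: untagged frame
    }
    let tci = u16::from_be_bytes([frame[14], frame[15]]);
    frame.drain(12..16); // remove the 4 tag bytes
    Some((((tci >> 13) & 0x7) as u8, tci & 0x0FFF))
}
```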

15.9.5 GARP and MRP

GARP (Generic Attribute Registration Protocol) is defined in IEEE 802.1D-2004 Annex 12. It provides a distributed mechanism for network nodes to register attributes (such as VLAN membership) with 802.1D-capable switches. Each GARP participant runs an applicant state machine and a registrar state machine per attribute:

  • Applicant: drives Join/Leave declaration transmission. States: VO (Very Anxious Observer), VP (Very Anxious Passive), VN (Very Anxious New), AN (Anxious New), AA (Anxious Active), QA (Quiet Active), LA (Leaving Active), AO (Anxious Observer), QO (Quiet Observer), AP (Anxious Passive), QP (Quiet Passive), LO (Leaving Observer). Transitions are triggered by application events (Join, Leave, New) and protocol timers (Join timer, Leave timer, LeaveAll timer). Strictly, the New event and the VN/AN states were introduced by MRP (IEEE 802.1ak); they are listed here because UmkaOS implements the MRP variant of these machines.

  • Registrar: tracks whether an attribute has been declared by a remote participant. States: IN (registered), LV (leaving — Leave message received, awaiting Leave timer expiry), MT (empty — not registered).

GVRP (GARP VLAN Registration Protocol) is the application of GARP to VLAN membership. A GVRP-enabled VLAN device periodically advertises its VID to adjacent 802.1D bridges, which propagate the registration through the spanning tree so that VLANs are dynamically provisioned on trunk links.

MRP (Multiple Registration Protocol) (IEEE 802.1ak, incorporated into 802.1Q-2018) is the successor to GARP. MRP improves scalability and adds support for multiple applications sharing a single PDU exchange:

  • MVRP (Multiple VLAN Registration Protocol): MRP application for VLAN membership. Replaces GVRP on 802.1Q-2011 and later switches.
  • MMRP (Multiple MAC Registration Protocol): MRP application for multicast group membership (Ethernet MAC addresses), enabling efficient multicast pruning in IEEE 802.1Q bridges.

UmkaOS implements MVRP as a Tier 1 kernel worker in umka-net. The MRP engine is structured around the MrpApplication trait:

/// A registered MRP application (e.g., MVRP or MMRP).
pub trait MrpApplication: Send + Sync {
    /// Application identifier (e.g., MVRP_APPLICATION_ID = 0x0021).
    fn application_id(&self) -> u16;

    /// Encode all locally declared attributes into a MRP PDU vector for transmission.
    fn encode_pdu(&self, buf: &mut NetBuf);

    /// Process a received MRP PDU and update local registrar state accordingly.
    fn process_pdu(&self, buf: &NetBuf, port: &MrpPort) -> Result<(), KernelError>;

    /// Called on LeaveAll timer expiry: re-declare all active attributes.
    fn on_leave_all(&self);
}

/// Per-port MRP state.
pub struct MrpPort {
    /// The network device this MRP port is attached to.
    pub netdev: Arc<NetDevice>,
    /// Periodic Join timer handle (default: 200 ms, IEEE 802.1Q-2018 Table 10-7).
    pub join_timer: TimerHandle,
    /// Leave timer handle (default: 600 ms).
    pub leave_timer: TimerHandle,
    /// LeaveAll timer handle (default: 10 s).
    pub leave_all_timer: TimerHandle,
    /// Registered applications sharing this port (MVRP, MMRP).
    pub applications: Vec<Arc<dyn MrpApplication>>,
}

/// Per-attribute applicant state machine state (IEEE 802.1Q-2018 Table 10-3).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum MrpApplicantState {
    /// Very Anxious Observer: no declaration, observing.
    VO,
    /// Very Anxious Passive: declaration pending (no active registrar known).
    VP,
    /// Very Anxious New: new declaration, needs immediate transmission.
    VN,
    /// Anxious New: declaration queued, waiting for join period.
    AN,
    /// Anxious Active: declaration active, waiting for join period.
    AA,
    /// Quiet Active: declaration active and acknowledged.
    QA,
    /// Leaving Active: Leave sent, awaiting Leave timer.
    LA,
    /// Anxious Observer: observing, join pending.
    AO,
    /// Quiet Observer: passively observing a remote declaration.
    QO,
    /// Anxious Passive: passive, join pending.
    AP,
    /// Quiet Passive: passive, idle.
    QP,
    /// Leaving Observer: Leave in progress, was observing.
    LO,
}

/// Per-attribute registrar state (IEEE 802.1Q-2018 Table 10-4).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum MrpRegistrarState {
    /// Attribute is registered (a remote Join has been received and is current).
    IN,
    /// Attribute is leaving (Leave received; leave_timer running).
    LV,
    /// Attribute is not registered.
    MT,
}

The MRP engine drives these state machines on two triggers:

  • Timer expiry (join_timer, leave_timer, leave_all_timer): The mrp_timer kernel worker fires and calls mrp_port_timer_work(), which iterates all registered applications on the port, advances state machines, and transmits pending PDUs via mrp_encode_and_send().

  • PDU receive: The link-layer receive path recognises MRP Ethernet frames by their destination multicast MAC (01:80:C2:00:00:21 for MVRP) and dispatches them to mrp_rcv(), which decodes the PDU and calls MrpApplication::process_pdu().

MVRP's process_pdu updates the VLAN membership database: a JoinIn or JoinMt vector attribute for VID v causes the VLAN subsystem to ensure VID v is admitted on the ingress port of the bridge; a Leave (Lv) attribute event triggers removal after the leave timer expires.
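The registrar half of these state machines is compact enough to sketch. The following is an illustrative model: the enum mirrors MrpRegistrarState above, but `registrar_step` and `RegistrarEvent` are hypothetical names, not the umka-net API.

```rust
/// Illustrative MRP registrar state machine (hypothetical helper names;
/// the real transitions live behind MrpApplication::process_pdu).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum RegState { In, Lv, Mt }

#[derive(Debug, Clone, Copy)]
enum RegistrarEvent {
    RxJoin,           // JoinIn/JoinMt vector attribute received
    RxLeave,          // Leave (or LeaveAll) received
    LeaveTimerExpiry, // leave_timer fired while in LV
}

fn registrar_step(state: RegState, ev: RegistrarEvent) -> RegState {
    use RegState::*;
    use RegistrarEvent::*;
    match (state, ev) {
        // A Join always (re-)registers the attribute and cancels a pending leave.
        (_, RxJoin) => In,
        // A Leave on a registered attribute starts the leave timer (LV),
        // giving other participants a chance to re-declare.
        (In, RxLeave) => Lv,
        // Leave timer expiry with no intervening Join deregisters.
        (Lv, LeaveTimerExpiry) => Mt,
        // All other combinations leave the state unchanged.
        (s, _) => s,
    }
}
```

In MVRP terms: a Join for VID 100 drives the registrar to IN and admits the VLAN on the port; the VLAN is removed only after a Leave followed by an undisturbed leave-timer period.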

15.9.6 Userspace Interface

VLAN devices are managed through two interfaces:

Netlink (preferred, iproute2): ip link add link eth0 name eth0.100 type vlan id 100 protocol 802.1Q. Internally this sends RTM_NEWLINK with:

IFLA_LINK        → ifindex of lower device (eth0)
IFLA_IFNAME      → "eth0.100"
IFLA_LINKINFO:
  IFLA_INFO_KIND → "vlan"
  IFLA_INFO_DATA:
    IFLA_VLAN_ID         → 100
    IFLA_VLAN_PROTOCOL   → ETH_P_8021Q (0x8100) or ETH_P_8021AD (0x88A8)
    IFLA_VLAN_FLAGS      → vlan_flags (e.g., VLAN_FLAG_REORDER_HDR)
    IFLA_VLAN_INGRESS_QOS → list of { from: u32, to: u32 } mappings
    IFLA_VLAN_EGRESS_QOS  → list of { from: u32, to: u32 } mappings

Deletion: ip link del eth0.100 → RTM_DELLINK. Query: ip -d link show eth0.100 → RTM_GETLINK with IFLA_INFO_DATA in reply.
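These attributes use standard netlink TLV encoding: a 4-byte header (u16 length including the header, u16 type) followed by the payload, padded to a 4-byte boundary. A minimal encoder sketch — `nla_put` and `encode_vlan_id` are illustrative helpers, not the actual umka-net functions; the type value 1 for IFLA_VLAN_ID is Linux's if_link.h constant:

```rust
/// Append one netlink attribute (TLV) to `buf`: 4-byte header
/// (len includes the header but not the padding), payload, pad to 4 bytes.
fn nla_put(buf: &mut Vec<u8>, nla_type: u16, payload: &[u8]) {
    let nla_len = (4 + payload.len()) as u16;
    buf.extend_from_slice(&nla_len.to_ne_bytes());
    buf.extend_from_slice(&nla_type.to_ne_bytes());
    buf.extend_from_slice(payload);
    while buf.len() % 4 != 0 {
        buf.push(0); // NLA_ALIGNTO padding, not counted in nla_len
    }
}

/// Encode the IFLA_VLAN_ID attribute for `ip link add ... type vlan id 100`.
fn encode_vlan_id(vid: u16) -> Vec<u8> {
    let mut buf = Vec::new();
    nla_put(&mut buf, 1 /* IFLA_VLAN_ID */, &vid.to_ne_bytes());
    buf
}
```

Nested attributes such as IFLA_LINKINFO are built the same way: encode the inner attributes first, then wrap the resulting buffer in an outer nla_put with the NLA_F_NESTED bit (0x8000) set on the type.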

Legacy ioctl (vconfig compatibility): The ioctl(SIOCGIFVLAN, SIOCSIFVLAN) path is supported for the legacy vconfig tool. vconfig add eth0 100 translates to SIOCSIFVLAN with cmd = ADD_VLAN_CMD. vconfig rem eth0.100 → cmd = DEL_VLAN_CMD. vconfig set_flag eth0.100 1 1 → cmd = SET_VLAN_FLAG_CMD.

/proc/net/vlan/: Read-only informational interface:

/proc/net/vlan/
|-- config         (one line per VLAN device: name, VID, lower device)
`-- eth0.100       (per-device stats: rx/tx bytes, packets, dropped)

The /proc/net/vlan/ tree is implemented via umka's procfs (Section 19.3) dynamic file model. Reads are satisfied without locks by reading per-CPU stats counters and summing them.

15.9.7 Bridge Integration

A bridge (Section 15.9.2) operates in one of two VLAN modes:

  • VLAN-unaware (default): The bridge forwards frames based on MAC address alone; 802.1Q tags are treated as opaque payload. This is the traditional Linux bridge behaviour.

  • VLAN-aware (vlan_filtering = 1): Each bridge port maintains a per-port VLAN filter table. Frames arriving on a port are subject to ingress VLAN filtering; frames leaving a port are subject to egress tagging rules.

The per-port VLAN filter table:

/// A single entry in a bridge port's VLAN filter table.
pub struct BridgeVlanEntry {
    /// VLAN ID this entry applies to (1..=4094).
    pub vid: u16,
    /// If true, this VID is the port's PVID: untagged ingress frames are
    /// assigned this VID for forwarding decisions.
    pub pvid: bool,
    /// If true, frames egressing on this port with this VID are sent untagged
    /// (the tag is stripped on egress).
    pub untagged: bool,
    /// Master flag: this entry was added by the bridge itself (not user-space).
    pub master: bool,
    /// BRFORWARD: this VID is forwarded (not filtered on ingress).
    pub brforward: bool,
}

Bridge port VLAN configuration is done via netlink RTM_SETLINK with IFLA_BRIDGE_VLAN_INFO nested attributes (one per VID, with BRIDGE_VLAN_INFO_PVID and BRIDGE_VLAN_INFO_UNTAGGED flags). The bridge vlan command (from iproute2) uses this interface:

bridge vlan add dev eth0 vid 100 pvid untagged
bridge vlan del dev eth0 vid 200
bridge vlan show

On ingress: if the frame arrives untagged, assign PVID. Perform ingress VLAN filter lookup; drop if the port has no entry for the VID. On egress: if the port's entry for the VID has untagged = true, strip the tag before transmitting.
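These ingress and egress rules can be expressed directly against the per-port table. A simplified sketch with standalone types (the real table is the list of BridgeVlanEntry above):

```rust
struct VlanEntry {
    vid: u16,
    pvid: bool,
    untagged: bool,
}

/// Ingress: assign the PVID to untagged frames, then apply ingress filtering.
/// Returns the VID the frame is forwarded under, or None => drop.
fn ingress_vid(table: &[VlanEntry], frame_vid: Option<u16>) -> Option<u16> {
    let vid = match frame_vid {
        Some(v) => v,
        // Untagged frame: classified into the port's PVID, if one is set.
        None => table.iter().find(|e| e.pvid)?.vid,
    };
    // Ingress VLAN filtering: drop unless the port has an entry for this VID.
    table.iter().find(|e| e.vid == vid).map(|e| e.vid)
}

/// Egress: should the 802.1Q tag be stripped before transmission on this port?
fn egress_untagged(table: &[VlanEntry], vid: u16) -> bool {
    table.iter().any(|e| e.vid == vid && e.untagged)
}
```

For the `bridge vlan add dev eth0 vid 100 pvid untagged` example above, an untagged ingress frame on eth0 is classified into VID 100, and VID-100 frames leave eth0 untagged.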

The bridge's FDB (Forwarding Database) is keyed on (MAC, VID) in VLAN-aware mode, enabling separate MAC learning per VLAN domain.

15.9.8 Linux Compatibility

UmkaOS's 802.1Q VLAN subsystem is fully compatible with Linux's 8021q, garp, mrp, and bridge VLAN subsystems:

  • vconfig (legacy): ADD_VLAN_CMD, DEL_VLAN_CMD, SET_VLAN_FLAG_CMD, SET_VLAN_NAME_TYPE_CMD ioctls all work.
  • ip link (iproute2): all IFLA_VLAN_* netlink attributes parsed and applied.
  • bridge vlan (iproute2): IFLA_BRIDGE_VLAN_INFO and the bridge netlink API work.
  • MVRP/MMRP MRP PDU format matches Linux's mrp module (IEEE 802.1ak-2007 wire format).
  • /proc/net/vlan/config and per-interface stat files present with identical format.
  • Ethtool VLAN offload feature flags (NETIF_F_HW_VLAN_CTAG_TX/RX, NETIF_F_HW_VLAN_STAG_TX/RX) are honoured identically to Linux.

15.10 IPVS — IP Virtual Server

IPVS (IP Virtual Server) is a Layer 4 load balancer implemented inside the kernel. It receives connection requests addressed to a Virtual IP (VIP) and a configured port, selects a Real Server (RS) from a pool using a pluggable scheduling algorithm, and forwards packets to the chosen backend. The original Linux IPVS implementation (ip_vs module, part of the Linux Virtual Server project) is widely deployed as the data-plane engine for Kubernetes kube-proxy --mode=ipvs.

UmkaOS implements IPVS inside umka-net (Section 15.1.1) as a netfilter hook set (Section 15.2.2). It is transparent to both clients and real servers: clients connect to the VIP as if it were a single endpoint; real servers see connections from either the load balancer (NAT mode) or directly from clients (DR/TUN modes).

Linux parallel: Linux's ip_vs module is located in net/netfilter/ipvs/. UmkaOS's IPVS provides binary-compatible ipvsadm support (both ioctl and Generic Netlink interfaces) and identical /proc/net/ip_vs* output.

15.10.1 Overview

IPVS supports four packet forwarding methods:

  • NAT (Masquerade): The load balancer rewrites the destination IP and port of each incoming SYN to the chosen real server's address (DNAT). Return traffic passes back through the load balancer, which rewrites the source IP/port back to the VIP (SNAT). Both directions traverse the load balancer. Real servers need no special configuration; they see connections from the load balancer's IP.

  • DR (Direct Routing): The load balancer rewrites only the destination MAC address of the frame to the chosen real server's MAC; the IP header is unchanged. The real server must have the VIP configured on a loopback interface (with ARP disabled for that VIP) so that it accepts the packet. Return traffic goes directly from the real server to the client without passing through the load balancer. DR is the most performant mode because the load balancer handles only inbound packets. Requires all servers on the same L2 segment.

  • TUN (IP-in-IP Tunneling): The load balancer encapsulates the original IP packet in a new IP header addressed to the real server (IP-in-IP, protocol 4, or GRE optionally). The real server decapsulates and processes the inner packet. Like DR, return traffic bypasses the load balancer. Allows real servers on different L3 networks. Real servers must have the VIP configured on a tunnel interface with ARP disabled.

  • LocalNode: The VIP resolves to the load balancer host itself. Packets are delivered to a local socket. Used when the load balancer is also a real server.
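NAT mode's address rewrite also has to fix up the IP (and TCP/UDP) checksums; the standard technique is the RFC 1624 incremental update rather than a full recompute. A sketch of the arithmetic — assumed here to be how any NAT rewrite path would do it, not quoted from umka-net:

```rust
/// Fold a 32-bit sum into a 16-bit one's-complement value.
fn csum_fold(mut sum: u32) -> u16 {
    while sum >> 16 != 0 {
        sum = (sum & 0xffff) + (sum >> 16);
    }
    sum as u16
}

/// RFC 1624 incremental update: given the old checksum and one 16-bit
/// word changing from `old` to `new`, HC' = ~(~HC + ~m + m').
fn csum_update16(check: u16, old: u16, new: u16) -> u16 {
    let sum = (!check as u32) + (!old as u32) + (new as u32);
    !csum_fold(sum)
}

/// Full one's-complement checksum over 16-bit words (for cross-checking).
fn csum_full(words: &[u16]) -> u16 {
    !csum_fold(words.iter().map(|&w| w as u32).sum())
}
```

The incremental form touches only the changed address/port words, so a DNAT rewrite costs a handful of adds instead of a pass over the whole header.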

15.10.2 Data Structures

/// A virtual service: the VIP:port tuple that clients connect to.
///
/// Each virtual service has its own scheduler, connection table, and set of
/// real servers. Multiple virtual services may share the same VIP with
/// different ports or protocols.
pub struct IpvsService {
    /// Virtual IP address (IPv4 or IPv6).
    pub addr: IpAddr,
    /// Virtual port (network byte order).
    pub port: u16,
    /// Transport protocol (TCP, UDP, or SCTP).
    pub protocol: IpvsProtocol,
    /// Service-level flags (e.g., persistence, hashed scheduler config).
    pub flags: IpvsServiceFlags,
    /// Persistent session timeout (seconds). 0 = persistence disabled.
    /// When non-zero, the source address is used for session affinity:
    /// all connections from the same client IP are sent to the same RS
    /// for `timeout` seconds after the last connection.
    pub timeout: u32,
    /// Netmask applied to the client IP before persistence lookup.
    /// For IPv4 persistence: typically 255.255.255.255 (per-host) or
    /// 255.255.255.0 (per-subnet). Ignored when `timeout` is 0.
    pub netmask: u32,
    /// Scheduling algorithm used to select a real server for new connections.
    pub scheduler: Arc<dyn IpvsScheduler>,
    /// Pool of real servers. Protected by `RwLock` for concurrent reader access
    /// during packet forwarding; writers hold the write lock to add/remove/edit.
    pub real_servers: RwLock<Vec<Arc<IpvsRealServer>>>,
    /// Connection tracking table for this service.
    pub conn_table: IpvsConnTable,
    /// Aggregate statistics for this virtual service.
    pub stats: IpvsStats,
}

/// A real server: a backend host that handles connections for a virtual service.
pub struct IpvsRealServer {
    /// Real server IP address.
    pub addr: IpAddr,
    /// Real server port. May differ from the virtual port (port mapping).
    pub port: u16,
    /// Scheduling weight.
    ///
    /// - `0` — **drain mode**: no new connections are scheduled to this server;
    ///   existing connections in the `IpvsConnTable` continue until they close
    ///   naturally or time out. Used for graceful removal before maintenance.
    ///   Weight 0 is NOT the same as removing the server — the entry remains in
    ///   the IPVS table with its existing connections tracked until the weight
    ///   is raised back above 0 or the server is explicitly deleted via
    ///   `IPVS_CMD_DEL_DEST` / `IP_VS_SO_SET_DELDEST`.
    /// - `1–65535` — normal operation; higher weight means proportionally more
    ///   new connections scheduled (weighted round-robin / least-connections).
    pub weight: u16,
    /// Packet forwarding method for this real server.
    pub fwd_method: IpvsFwdMethod,
    /// Number of currently active connections (TCP ESTABLISHED or UDP active).
    pub activeconns: AtomicI32,
    /// Number of inactive connections (TCP TIME_WAIT, FIN_WAIT, etc.).
    pub inactconns: AtomicI32,
    /// Per-real-server statistics (bytes, packets, connections).
    pub stats: IpvsStats,
}

/// An established IPVS connection entry.
///
/// Created on the first SYN (TCP) or first packet (UDP) from a new client;
/// destroyed when the connection enters TIME_WAIT and the timeout expires.
/// Keyed in `IpvsConnTable` by `(protocol, caddr, cport, vaddr, vport)`.
pub struct IpvsConn {
    /// Transport protocol.
    pub protocol: IpvsProtocol,
    /// Client (source) IP address.
    pub caddr: IpAddr,
    /// Client (source) port.
    pub cport: u16,
    /// Virtual IP address (destination as seen by client).
    pub vaddr: IpAddr,
    /// Virtual port.
    pub vport: u16,
    /// Destination IP (real server address after forwarding decision).
    pub daddr: IpAddr,
    /// Destination port (real server port, may differ from vport).
    pub dport: u16,
    /// Forwarding method in use for this connection.
    pub fwd_method: IpvsFwdMethod,
    /// TCP/UDP state as tracked by IPVS (independent of nf_conntrack).
    pub state: IpvsConnState,
    /// Remaining time before this connection entry is garbage-collected (seconds).
    pub timeout: u32,
    /// Kernel timer that fires on timeout expiry, triggering GC.
    pub timer: KernelTimer,
    /// Back-pointer to the real server (for `activeconns`/`inactconns` accounting).
    pub real_server: Weak<IpvsRealServer>,
}

/// Packet forwarding method.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum IpvsFwdMethod {
    /// Full NAT: DNAT inbound, SNAT outbound. Both directions via load balancer.
    Masquerade,
    /// Direct Routing: rewrite destination MAC only. Real server replies directly.
    DirectRouting,
    /// IP-in-IP tunnel: encapsulate packet, real server decapsulates and replies directly.
    Tunnel,
    /// Local node: deliver to local socket on the same host.
    LocalNode,
}

/// TCP/UDP connection state as tracked by IPVS.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum IpvsConnState {
    None,
    /// TCP connection fully established (ESTABLISHED).
    EstablishedTcp,
    /// SYN sent by client, SYN-ACK not yet seen.
    SynSent,
    /// SYN-ACK seen, waiting for client ACK.
    SynRecv,
    /// FIN sent by one side; connection closing.
    FinWait,
    /// TCP TIME_WAIT: waiting for duplicate segments to expire.
    TimeWait,
    /// TCP CLOSE: both FINs exchanged.
    Close,
    /// TCP CLOSE_WAIT: FIN received, application not yet closed.
    CloseWait,
    /// TCP LAST_ACK: FIN sent after CLOSE_WAIT, awaiting ACK.
    LastAck,
    /// TCP LISTEN: server socket listening (LocalNode mode).
    Listen,
    /// SYN-ACK sent by server, completing three-way handshake.
    Synack,
    /// UDP active flow (packet seen within timeout window).
    Udp,
    /// ICMP error response in flight.
    Icmp,
}

/// Transport protocol selector for IPVS services and connections.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum IpvsProtocol {
    Tcp  = 6,
    Udp  = 17,
    Sctp = 132,
}

bitflags::bitflags! {
    /// Virtual service flags.
    pub struct IpvsServiceFlags: u32 {
        /// Persistent service: use source-IP session affinity.
        const PERSISTENT   = 1 << 0;
        /// Hashed scheduler: use a hash of the destination IP (dh scheduler).
        const HASHED       = 1 << 1;
        /// One-packet scheduling: schedule every UDP packet independently.
        const ONE_PACKET   = 1 << 2;
        /// SYN proxy: validate TCP SYN cookies before creating a connection entry.
        const SYN_PROXY    = 1 << 3;
    }
}

/// Aggregate statistics for a virtual service or real server.
pub struct IpvsStats {
    /// Total connections handled (atomic, updated on connection creation).
    pub conns:   AtomicU64,
    /// Total inbound packets processed.
    pub inpkts:  AtomicU64,
    /// Total outbound packets processed (NAT mode only; DR/TUN: outbound bypasses LB).
    pub outpkts: AtomicU64,
    /// Total inbound bytes.
    pub inbytes:  AtomicU64,
    /// Total outbound bytes.
    pub outbytes: AtomicU64,
}

15.10.3 Scheduling Algorithms

Scheduling algorithms implement the IpvsScheduler trait:

/// Pluggable scheduling algorithm for selecting a real server.
///
/// Implementations must be `Send + Sync` (called from multiple CPUs concurrently).
/// They must not block; all internal state uses lock-free or fine-grained locking.
pub trait IpvsScheduler: Send + Sync {
    /// Short name used in `ipvsadm -s` and `/proc/net/ip_vs`.
    fn name(&self) -> &'static str;

    /// Select a real server for a new connection.
    ///
    /// `service` provides access to the real server list and connection table.
    /// Returns `None` if no server with non-zero weight is available (all
    /// servers are administratively disabled or at zero weight).
    fn schedule(
        &self,
        service: &IpvsService,
        conn: &IpvsConn,
    ) -> Option<Arc<IpvsRealServer>>;
}

UmkaOS provides the following built-in schedulers:

Round Robin (rr): Cycles sequentially through the real server list, skipping servers with weight = 0. Uses an AtomicUsize index into the server vector, incremented with fetch_add(..., Relaxed) on each scheduling decision.

Weighted Round Robin (wrr): Servers are selected proportionally to their weight. Maintains a virtual dispatch table: a Vec<Arc<IpvsRealServer>> with each server repeated weight times. Built (or rebuilt) whenever the server list changes. An AtomicUsize cursor cycles through the table. Servers with weight = 0 are excluded from the table. Table rebuild is protected by a Mutex; the cursor is an AtomicUsize for lock-free cycling.

Least Connection (lc): Selects the server minimising overhead:

overhead(rs) = activeconns(rs) * 256 / weight(rs)

Servers with weight = 0 are skipped. Linear scan of the server list; reads activeconns with Relaxed atomic load (approximate: no lock needed for the heuristic).

Weighted Least Connection (wlc, default): The default scheduler. Selects the server minimising:

overhead(rs) = (activeconns(rs) * 256 + inactconns(rs)) / weight(rs)

Including inactconns prevents newly freed TIME_WAIT slots from being over-selected before the OS reclaims them. Linear scan, Relaxed atomic reads.

Source Hashing (sh): Consistent hashing on the client IP address. Maintains a hash table of size sh_buckets (default: 256, power of two). Each bucket caches the Arc<IpvsRealServer> last assigned to that hash slot. On scheduling, computes bucket = hash(caddr) % sh_buckets and returns the cached server if it has non-zero weight; otherwise re-selects via round-robin and updates the bucket. Provides session persistence without the overhead of a full connection table lookup.

Destination Hashing (dh): Consistent hashing on the destination IP address (the real server's IP). Used to pin outbound connections from a proxy to a consistent upstream backend. Same hash table mechanism as sh, keyed on daddr instead of caddr.
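The sh/dh bucket-cache mechanism can be sketched as follows — `ShTable` and `sh_schedule` are hypothetical names, and a plain first-live-server scan stands in for the round-robin fallback:

```rust
/// Toy source-hashing table: one cached server index per hash bucket.
struct ShTable {
    buckets: Vec<Option<usize>>,
}

/// Return the index of the real server for a client whose address hashes
/// to `caddr_hash`, reusing the cached assignment when the server is live.
fn sh_schedule(tbl: &mut ShTable, weights: &[u16], caddr_hash: u64) -> Option<usize> {
    let b = (caddr_hash as usize) % tbl.buckets.len();
    if let Some(i) = tbl.buckets[b] {
        if weights[i] > 0 {
            return Some(i); // cache hit: stable client -> server mapping
        }
    }
    // Cache miss or cached server drained (weight 0): re-select and
    // update the bucket so subsequent packets stay sticky.
    let i = weights.iter().position(|&w| w > 0)?;
    tbl.buckets[b] = Some(i);
    Some(i)
}
```

The cache gives session persistence at the cost of one array index, without consulting the connection table at all.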

Shortest Expected Delay (sed): Selects the server minimising the expected queueing delay, modelled as:

sed(rs) = (activeconns(rs) + 1) * 256 / weight(rs)

The +1 accounts for the new connection being scheduled. Favours lightly loaded servers, but because weight still divides the overhead, an idle low-weight server can lose to a busy high-weight one; nq addresses exactly this case.

Never Queue (nq): An enhancement of sed that unconditionally prefers servers with activeconns = 0 (idle). Only if no idle server exists does it fall back to sed selection. This eliminates queue delay for new connections when any server is idle.
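As a concrete example of this family, here is a self-contained sketch of the default wlc policy. Plain integers replace the atomics and Arc plumbing of the real IpvsScheduler trait, and the comparison is cross-multiplied to avoid integer-division rounding (a/wa < b/wb ⟺ a·wb < b·wa for positive weights):

```rust
struct Rs {
    weight: u32,
    active: u32,
    inact: u32,
}

/// Weighted Least Connection: pick the index minimising
/// (active*256 + inact) / weight, skipping weight-0 (draining) servers.
fn wlc_schedule(servers: &[Rs]) -> Option<usize> {
    let mut best: Option<usize> = None;
    for (i, rs) in servers.iter().enumerate() {
        if rs.weight == 0 {
            continue; // drain mode: no new connections
        }
        let better = match best {
            None => true,
            Some(j) => {
                let (a, b) = (&servers[i], &servers[j]);
                // overhead(a) < overhead(b), cross-multiplied by the weights
                let oa = (a.active as u64 * 256 + a.inact as u64) * b.weight as u64;
                let ob = (b.active as u64 * 256 + b.inact as u64) * a.weight as u64;
                oa < ob
            }
        };
        if better {
            best = Some(i);
        }
    }
    best
}
```

Returning None when every server has weight 0 is what produces the NF_DROP in scheduling step 3 of Section 15.10.4.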

15.10.4 Netfilter Integration

IPVS registers the following netfilter hooks via the NfHookOps registration API (Section 15.2):

Hook point             Priority                Purpose
NF_INET_LOCAL_IN       NF_IP_PRI_NAT_DST + 1   Match incoming packets to VIPs; create or look up connection entries; forward to real server.
NF_INET_FORWARD        NF_IP_PRI_CONNTRACK     Handle forwarded traffic in DR and TUN modes; update connection state.
NF_INET_POST_ROUTING   NF_IP_PRI_NAT_SRC       SNAT return traffic (NAT/Masquerade mode): rewrite source address to VIP.
NF_INET_LOCAL_OUT      NF_IP_PRI_NAT_DST       LocalNode forwarding: intercept locally generated packets destined for VIPs on loopback.

The NF_INET_LOCAL_IN hook is the primary entry point. For each incoming packet:

  1. Connection table lookup: Look up (protocol, saddr, sport, daddr, dport) in IpvsConnTable. If a matching IpvsConn is found, update its timeout and proceed to forwarding. This is the fast path for established connections.

  2. Virtual service lookup: If no connection entry exists, look up (protocol, daddr, dport) in the global virtual service table (BTreeMap<(IpvsProtocol, IpAddr, u16), Arc<IpvsService>>). If no match, let the packet pass through (NF_ACCEPT).

  3. Scheduling: Call service.scheduler.schedule(&service, &new_conn) to select a real server. If no server is available (weight = 0 for all), drop the packet (NF_DROP) and log to the IPVS stats (no_route).

  4. Connection entry creation: Allocate an IpvsConn, insert into IpvsConnTable, increment real_server.activeconns.

  5. Forwarding: Rewrite the packet according to fwd_method and re-inject via ip_route_output or dev_queue_xmit (for DR).
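The five steps condense into a single decision function. A toy model — HashMaps stand in for IpvsConnTable and the virtual service table, a first-server pick stands in for the scheduler, and `Verdict` mirrors NF_ACCEPT/NF_DROP/forwarding:

```rust
use std::collections::HashMap;

#[derive(Debug, PartialEq)]
enum Verdict {
    Accept,            // not a VIP: pass through (NF_ACCEPT)
    Drop,              // VIP but no available real server (NF_DROP)
    Forwarded(String), // rewritten and sent to this real server
}

type FiveTuple = (u8, String, u16, String, u16); // (proto, caddr, cport, vaddr, vport)
type ServiceKey = (u8, String, u16);             // (proto, vaddr, vport)

/// Toy LOCAL_IN hook mirroring steps 1-5 above.
fn local_in(
    conns: &mut HashMap<FiveTuple, String>,      // 5-tuple -> real server
    services: &HashMap<ServiceKey, Vec<String>>, // VIP -> real server pool
    pkt: FiveTuple,
) -> Verdict {
    // 1. Fast path: established connection.
    if let Some(rs) = conns.get(&pkt) {
        return Verdict::Forwarded(rs.clone());
    }
    // 2. Virtual service lookup; not a VIP => pass through.
    let key = (pkt.0, pkt.3.clone(), pkt.4);
    let Some(pool) = services.get(&key) else { return Verdict::Accept };
    // 3. Scheduling (first server stands in for the pluggable scheduler).
    let Some(rs) = pool.first() else { return Verdict::Drop };
    // 4. Create the connection entry.
    conns.insert(pkt, rs.clone());
    // 5. Forward to the chosen real server.
    Verdict::Forwarded(rs.clone())
}
```

The first packet of a flow pays for steps 2-4; every subsequent packet of the same flow resolves in step 1.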

nf_conntrack integration: IPVS creates nf_conntrack entries for NAT-mode connections to enable TCP state tracking and ensure symmetrical NAT rewrite on both directions. The conntrack entry is linked to the IpvsConn via a private extension (nf_ct_ext_add). DR and TUN modes do not create conntrack entries for the return path (which bypasses the load balancer).

15.10.5 Connection Table

The IPVS connection table is a hash table of IpvsConn entries:

/// The IPVS per-service connection tracking table.
pub struct IpvsConnTable {
    /// Hash buckets, each a vector of connections hashing to that bucket.
    /// Default: 2^20 buckets (1,048,576), tunable via
    /// `/proc/sys/net/ipv4/vs/conn_tab_bits`.
    buckets: Vec<RwLock<Vec<Arc<IpvsConn>>>>,
    /// Number of buckets (always a power of two).
    num_buckets: usize,
    /// Current number of active connection entries.
    count: AtomicUsize,
}

impl IpvsConnTable {
    /// Look up a connection by 5-tuple. Called from the hot path (NF_INET_LOCAL_IN).
    /// Read lock is taken per bucket; no global lock.
    pub fn lookup(
        &self,
        proto: IpvsProtocol,
        caddr: IpAddr, cport: u16,
        vaddr: IpAddr, vport: u16,
    ) -> Option<Arc<IpvsConn>>;

    /// Insert a newly created connection entry.
    pub fn insert(&self, conn: Arc<IpvsConn>);

    /// Remove an expired connection entry. Called from the timer expiry path.
    pub fn remove(&self, conn: &IpvsConn);
}

The hash key is hash(protocol, caddr, cport, vaddr, vport) reduced to key & (num_buckets - 1). Each bucket is individually RwLock-protected; readers (lookups from the netfilter fast path) take a read lock and perform a linear scan of the (typically short) bucket chain. Writers (insert, remove) take the write lock. There is no global table lock.

Connection expiry uses per-connection KernelTimer instances. When a timer fires, the connection's timeout field is decremented; if it reaches zero, the entry is removed from the table and the real_server.activeconns (or inactconns) counter is decremented. TCP state transitions (SYN_SENT → ESTABLISHED → TIME_WAIT → CLOSED) reset the timer with the appropriate timeout value for the new state (same timeouts as Linux: ESTABLISHED 15 min, TIME_WAIT 120 s, SYN_SENT 1 min, SYN_RECV 1 min).
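The bucket indexing and per-bucket locking can be sketched with std::sync::RwLock — DefaultHasher stands in for the kernel's hash, and a String key for the full 5-tuple:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};
use std::sync::RwLock;

/// Toy connection table: power-of-two bucket count, per-bucket RwLock,
/// chains keyed by an opaque 5-tuple key.
struct ConnTable {
    buckets: Vec<RwLock<Vec<(String, u32)>>>, // (key, conn-id)
    mask: usize,                              // num_buckets - 1
}

impl ConnTable {
    fn new(num_buckets: usize) -> Self {
        assert!(num_buckets.is_power_of_two());
        ConnTable {
            buckets: (0..num_buckets).map(|_| RwLock::new(Vec::new())).collect(),
            mask: num_buckets - 1,
        }
    }

    fn bucket_of(&self, key: &str) -> usize {
        let mut h = DefaultHasher::new();
        key.hash(&mut h);
        // key & (num_buckets - 1): cheap modulo for power-of-two sizes.
        (h.finish() as usize) & self.mask
    }

    /// Hot-path lookup: read lock on one bucket only, linear chain scan.
    fn lookup(&self, key: &str) -> Option<u32> {
        let chain = self.buckets[self.bucket_of(key)].read().unwrap();
        chain.iter().find(|(k, _)| k.as_str() == key).map(|&(_, id)| id)
    }

    /// Insert takes the write lock on one bucket; no global lock exists.
    fn insert(&self, key: String, id: u32) {
        self.buckets[self.bucket_of(&key)].write().unwrap().push((key, id));
    }
}
```

Because contention is per bucket, concurrent lookups on different flows never serialise against each other; only lookups and inserts that collide in the same bucket share a lock.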

15.10.6 Health Checking Integration

IPVS itself does not perform health checks. Health checking is the responsibility of a user-space daemon (keepalived, HAProxy, or a cloud-native controller). The kernel provides the following mechanisms for user-space to signal server health:

  • IP_VS_SO_SET_EDITDEST (ioctl) or IPVS_CMD_SET_DEST (Generic Netlink): Update IpvsRealServer.weight. Setting weight = 0 drains the server: no new connections are assigned, but existing connections in the IpvsConnTable continue until they time out or are explicitly removed. Setting weight > 0 re-enables the server.

  • IP_VS_SO_SET_DELDEST / IPVS_CMD_DEL_DEST: Remove the server from the pool immediately. Existing IpvsConn entries retain their Weak<IpvsRealServer> reference; the strong Arc is released from the real_servers list. When the last IpvsConn drops its Weak reference, the IpvsRealServer is freed.

Graceful drain sequence (as used by keepalived before a rolling upgrade):

  1. IPVS_CMD_SET_DEST weight=0 — stops new connection scheduling.
  2. Poll IP_VS_SO_GET_DESTS: wait until activeconns = 0 and inactconns = 0.
  3. IPVS_CMD_DEL_DEST — remove the server entry.

This sequence is identical to Linux's ip_vs behaviour and is relied upon by Kubernetes kube-proxy during endpoint removal.

15.10.7 Userspace Interface

ipvsadm communicates with the IPVS kernel subsystem via two equivalent interfaces:

Legacy socket ioctl (compatibility with ipvsadm < 1.28 and older scripts): A raw IP socket is created with socket(AF_INET, SOCK_RAW, IPPROTO_RAW). setsockopt and getsockopt calls on this socket with level IPPROTO_IP and option names in the IP_VS_SO_* range carry IPVS commands:

Option                  Direction  Purpose
IP_VS_SO_SET_ADD        set        Add a virtual service
IP_VS_SO_SET_EDIT       set        Edit a virtual service (scheduler, flags)
IP_VS_SO_SET_DEL        set        Delete a virtual service
IP_VS_SO_SET_FLUSH      set        Delete all virtual services
IP_VS_SO_SET_ADDDEST    set        Add a real server to a virtual service
IP_VS_SO_SET_EDITDEST   set        Edit a real server (weight, fwd method)
IP_VS_SO_SET_DELDEST    set        Remove a real server
IP_VS_SO_SET_TIMEOUT    set        Set TCP/UDP/SCTP connection timeouts
IP_VS_SO_GET_VERSION    get        Kernel IPVS version string
IP_VS_SO_GET_INFO       get        Global stats: num services, conn_tab_size
IP_VS_SO_GET_SERVICES   get        List of all virtual services
IP_VS_SO_GET_SERVICE    get        Single virtual service by VIP:port
IP_VS_SO_GET_DESTS      get        Real servers for a virtual service

Generic Netlink (modern interface, ipvsadm ≥ 1.28, kube-proxy): The IPVS subsystem registers a Generic Netlink family named "IPVS" with the kernel's Generic Netlink layer. Commands mirror the ioctl set: IPVS_CMD_NEW_SERVICE, IPVS_CMD_SET_SERVICE, IPVS_CMD_DEL_SERVICE, IPVS_CMD_GET_SERVICE, IPVS_CMD_NEW_DEST, IPVS_CMD_SET_DEST, IPVS_CMD_DEL_DEST, IPVS_CMD_GET_DEST. Attributes carry the same fields as the ioctl structures (ip_vs_service_user, ip_vs_dest_user) but in netlink TLV form, enabling extensibility without ABI breaks.

procfs read-only status:

/proc/net/ip_vs          — virtual service list with scheduler and stats
/proc/net/ip_vs_conn     — active connection table dump
/proc/sys/net/ipv4/vs/   — sysctl namespace:
    conn_tab_bits        (rw) hash table size as log2 (default: 20)
    expire_nodest_conn   (rw) expire connections on dest removal (default: 0)
    expire_quiescent_template (rw) expire persistence templates (default: 0)
    nat_icmp_send        (rw) send ICMP errors from NAT (default: 0)
    sync_threshold       (rw) connection sync threshold (default: 3 50)
    timeout_tcp          (rw) TCP ESTABLISHED timeout seconds (default: 900)
    timeout_tcp_fin      (rw) TCP FIN_WAIT timeout (default: 120)
    timeout_udp          (rw) UDP flow timeout (default: 300)

15.10.8 IPVS and Kubernetes kube-proxy

Kubernetes kube-proxy --mode=ipvs uses IPVS for load balancing of Services of type ClusterIP, NodePort, and LoadBalancer. For UmkaOS to support kube-proxy in IPVS mode, the following must be satisfied — all of which the above design meets:

  • IPv4 and IPv6 (IpAddr is an enum over Ipv4Addr and Ipv6Addr; IPVS services are created independently for each address family).
  • FWD_MASQ (NAT mode): Required for ClusterIP services where the real server is on a different node. UmkaOS supports full NAT with nf_conntrack integration.
  • sh, rr, lc schedulers: kube-proxy defaults to rr; users may select lc or sh. All three are implemented.
  • Session persistence (IpvsServiceFlags::PERSISTENT, non-zero timeout): Used for sessionAffinity: ClientIP Services. UmkaOS honours the timeout field.
  • Graceful server drain (weight = 0): kube-proxy sets weight to 0 when a pod is being terminated and the endpoint is removed from the Endpoints object.
  • /proc/sys/net/ipv4/vs/ sysctl namespace: kube-proxy writes conn_tab_bits and expire_nodest_conn. Both are emulated by UmkaOS's procfs.
  • Generic Netlink IPVS family: kube-proxy uses the IPVS Netlink family for all service and destination management. UmkaOS's implementation exposes identical attributes and command semantics.

kube-proxy additionally uses ipset (netfilter IP sets) for efficient NodePort and externalIPs matching. UmkaOS's netfilter layer (Section 15.2.2) supports ipset-style match modules; kube-proxy's ipset rules operate identically.

15.10.9 Linux Compatibility

UmkaOS's IPVS subsystem is binary-compatible with Linux's ip_vs module behaviour:

  • ipvsadm (all versions): both the ioctl socket API and the Generic Netlink API work without recompilation. The ioctl option numbers, structure layouts (ip_vs_service_user, ip_vs_dest_user, ip_vs_get_info), and the Generic Netlink family name and attribute types are identical to Linux 5.19+.
  • /proc/net/ip_vs and /proc/net/ip_vs_conn: output format identical to Linux (column widths, field ordering). Scripts that awk/grep these files work unchanged.
  • /proc/sys/net/ipv4/vs/ sysctl tree: all keys present, same defaults, same semantics.
  • Connection state machine: TCP state timeouts and transition logic match Linux's ip_vs_proto_tcp.c behaviour exactly, ensuring that keepalived's connection drain logic (which polls activeconns/inactconns via IP_VS_SO_GET_DESTS) operates correctly.
  • Scheduling algorithm names: "rr", "wrr", "lc", "wlc", "sh", "dh", "sed", "nq" — identical strings to Linux, used by ipvsadm -s and kube-proxy's scheduler selection.