
Chapter 18: Linux Compatibility

Syscall interface, futex, netlink, Windows emulation, dropped compatibility, native syscalls, safe extensibility


18.1 Syscall Interface

18.1.1 Design Goal

UmkaOS is a POSIX-compatible kernel. Of the ~450 defined Linux x86-64 syscalls, ~330-350 are actively used by current software (glibc 2.17+, musl 1.2+, systemd, Docker, Kubernetes). The remaining ~100-120 are obsolete and unconditionally return -ENOSYS.

Of the 330-350 active syscalls:

  • ~80% (~265-280) are implemented natively with identical POSIX semantics — read, write, open, mmap, fork, socket, etc. are UmkaOS's own API, not a translation layer over something else. The syscall entry point performs representation conversion (untyped C ABI → typed Rust internals), not semantic translation.
  • ~15% (~50-55) need thin adaptation (e.g., Linux's untyped ioctl → UmkaOS's typed driver interface).
  • ~5% (~15-20) are genuine compatibility shims for deprecated syscalls that get remapped to modern equivalents.

18.1.2 Syscall Dispatch Architecture

The SyscallHandler enum classifies every syscall by how it is serviced. The first three variants (Direct, InnerRingForward, OuterRingForward) are native implementations — UmkaOS's own kernel code handling the syscall directly. Only Emulated is a compatibility shim:

pub enum SyscallHandler {
    /// Handled directly in UmkaOS Core -- no tier crossing
    /// Examples: getpid, brk, mmap, clock_gettime, signals, futex
    Direct(fn(&mut SyscallContext) -> i64),

    /// Forwarded to a Tier 1 driver via domain switch
    /// Examples: read, write, ioctl, socket ops, mount
    InnerRingForward {
        driver_class: DriverClass,
        handler: fn(&mut SyscallContext) -> i64,
    },

    /// Forwarded to a Tier 2 driver via IPC
    /// Examples: USB-specific ioctls
    OuterRingForward {
        driver_class: DriverClass,
        handler: fn(&mut SyscallContext) -> i64,
    },

    /// Compatibility shim for deprecated-but-still-called syscalls
    /// Examples: select (mapped to pselect6), poll (mapped to ppoll)
    Emulated(fn(&mut SyscallContext) -> i64),

    /// Not implemented -- returns -ENOSYS
    /// Examples: old_stat, socketcall, ipc multiplexer
    Unimplemented,
}
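A minimal standalone sketch of dispatch over this enum (the table layout, `SyscallContext` fields, and `sys_getpid` stub are illustrative placeholders, not the actual UmkaOS definitions; only the `Direct`/`Emulated`/`Unimplemented` variants are modeled):

```rust
// Simplified model of syscall dispatch -- illustrative, not the real table.
pub struct SyscallContext {
    pub args: [u64; 6],
    pub pid: u32,
}

pub enum SyscallHandler {
    Direct(fn(&mut SyscallContext) -> i64),
    Emulated(fn(&mut SyscallContext) -> i64),
    Unimplemented,
}

const ENOSYS: i64 = 38; // Linux x86-64 errno value

fn sys_getpid(ctx: &mut SyscallContext) -> i64 {
    ctx.pid as i64
}

/// Entry point: look up the handler for syscall number `nr` and run it.
/// Unknown or unimplemented syscalls return -ENOSYS, as on Linux.
fn dispatch(table: &[(u64, SyscallHandler)], nr: u64, ctx: &mut SyscallContext) -> i64 {
    match table.iter().find(|(n, _)| *n == nr) {
        Some((_, SyscallHandler::Direct(f)))
        | Some((_, SyscallHandler::Emulated(f))) => f(ctx),
        _ => -ENOSYS,
    }
}
```

In the real kernel the table is indexed directly by syscall number rather than searched linearly; the linear scan here only keeps the sketch short.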

18.1.3 Virtual Filesystems

These synthetic filesystems are critical for compatibility. Many Linux tools parse them directly and will break if the format is even slightly wrong.

Filesystem   Implementation                          Critical consumers
/proc        Synthetic, generated from kernel state  ps, top, htop, systemd, Docker
/sys         Reflects device tree from bus manager   udev, systemd, lspci, lsusb
/dev         Maps to KABI device interfaces          Everything (devtmpfs-compatible)
/dev/shm     tmpfs shared memory                     POSIX shm_open, Chrome, Firefox
/run         tmpfs                                   systemd, dbus, PID files

Key /proc entries that must be pixel-perfect:

  • /proc/meminfo -- parsed by free, top, OOM killer
  • /proc/cpuinfo -- parsed by many applications for CPU feature detection
  • /proc/[pid]/maps -- parsed by debuggers, profilers, JVMs
  • /proc/[pid]/status -- parsed by ps, container runtimes
  • /proc/[pid]/fd/ -- used by lsof, process managers
  • /proc/self/exe -- readlink used by many applications to find themselves
  • /proc/sys/ -- sysctl interface for kernel tuning

Format baseline: /proc file formats target Linux 6.1 LTS output. Field ordering, whitespace, and units match procfs as of kernel 6.1. Newer fields added in later kernels are included when the corresponding UmkaOS subsystem supports the feature (e.g., VmFlags in /proc/[pid]/smaps is populated when the memory manager tracks the relevant flags).

Implementation specification strategy: Rather than duplicating Linux's procfs format definitions here (which would become stale as Linux evolves), each /proc entry is implemented as a format-test pair: the implementation references the corresponding Linux 6.1 fs/proc/*.c source as the authoritative format spec, and a companion integration test captures the expected output from a Linux 6.1 reference VM and asserts byte-for-byte match. Critical entries have explicit format notes:

Entry                Key format rules
/proc/meminfo        FieldName: %8lu kB\n — right-aligned 8-char value, always kB units
/proc/cpuinfo        Tab-separated key\t: value\n, blank line between CPUs, flags field is space-separated
/proc/[pid]/maps     %08lx-%08lx %4s %08lx %02x:%02x %lu %s\n (hex ranges, perms, offset, dev, inode, pathname)
/proc/[pid]/status   Key:\tvalue\n (tab after colon), sizes in kB, Uid/Gid have 4 tab-separated fields
/proc/stat           Space-separated, first field cpu or cpu%d, jiffy values in USER_HZ (100)
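Following the format rules above, the two simplest entries can be emitted directly with Rust's formatting machinery (a sketch; the function names are illustrative, and the real generators pull values from kernel state):

```rust
/// Emit one /proc/meminfo line per the rule above:
/// field name plus colon, value right-aligned to 8 characters, fixed " kB" suffix.
fn meminfo_line(name: &str, kb: u64) -> String {
    format!("{}: {:>8} kB\n", name, kb)
}

/// Emit one /proc/[pid]/status line: key, colon, TAB, value.
fn status_line(key: &str, value: &str) -> String {
    format!("{}:\t{}\n", key, value)
}
```

The byte-for-byte integration tests described above then compare this output against captures from the Linux 6.1 reference VM.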

Remaining /proc entries are specified at implementation time using the same test-driven approach (capture reference output → assert match).

18.1.4 Complete Feature Coverage

These features must be designed into the architecture from day one. They cannot be bolted on later.

18.1.4.1 eBPF Subsystem

  • Full eBPF virtual machine (register-based, 11 registers, 64-bit)
  • Verifier: static analysis ensuring program safety (bounded loops, memory safety, no uninitialized reads)
  • JIT compiler: eBPF bytecode to native code, per architecture:
      • x86-64: Phase 1 (co-primary JIT target, available from day one)
      • AArch64: Phase 1 (co-primary JIT target, available from day one; see note below)
      • RISC-V 64: Phase 2 (RV64 instruction emission; strong LLVM backend, straightforward port)
      • PPC64LE: Phase 2 (PPC64 instruction emission)
      • ARMv7: Phase 3 (Thumb-2 instruction emission)
      • PPC32: Phase 3 (PPC32 instruction emission)
  • Interpreted fallback available on all architectures from Phase 1

AArch64 co-primary JIT rationale: AArch64 is promoted to co-primary JIT status alongside x86-64 because ARM has surpassed x86 in deployment count for Linux workloads (mobile, embedded, and cloud — AWS Graviton, Ampere, Apple Silicon). The AArch64 JIT is architecturally similar to x86-64: fixed-width 32-bit instructions, 31 general-purpose registers, and no complex addressing modes to model. Denying JIT to AArch64 would create a multi-year performance gap on the most widely deployed ISA.

eBPF per-invocation overhead (verified programs): x86-64 (JIT): 2-5 ns; AArch64 (JIT): 3-7 ns; RISC-V 64 (interpreted): 50-200 ns.

  • Program types: XDP, tc (traffic control), kprobe, tracepoint, cgroup, socket filter, LSM, struct_ops
  • Map types: hash, array, ringbuf, per-CPU hash, per-CPU array, LRU hash, LPM trie, queue, stack
  • bpftool compatibility for loading and inspecting programs
  • Required for: bpftrace, Cilium (Kubernetes networking), Falco (security), BCC tools

Map Size Limits and Memory Budget — enforced at bpf(BPF_MAP_CREATE, ...) time. Limits match Linux 6.x for compatibility. Operators may raise the per-UID limit via sysctl umka.bpf.uid_map_memory_limit_mib (default 64):

Map type                       Max entries       Max value size  Notes
BPF_MAP_TYPE_HASH              1,048,576         65,535 bytes    Per-entry memory charged to cgroup
BPF_MAP_TYPE_ARRAY             1,048,576         65,535 bytes    Total = entries × value_size
BPF_MAP_TYPE_RINGBUF           2 GiB total size  —               Size in bytes (power of 2, min 4,096), not entries
BPF_MAP_TYPE_PERCPU_HASH       1,048,576         65,535 bytes    Multiplied by CPU count
BPF_MAP_TYPE_PERF_EVENT_ARRAY  NR_CPUS           —               One slot per CPU
All other types                1,048,576         65,535 bytes

Global eBPF memory budget:

  • Unprivileged loaders (!CAP_BPF): map memory is subject to the cgroup memory limit; additionally, a per-UID soft limit of 64 MiB applies (returns ENOMEM when exceeded).
  • Privileged loaders (CAP_BPF): no per-UID limit; memory is still subject to cgroup accounting.
  • System-wide: total eBPF map memory is tracked and reported via /System/Kernel/bpf/map_memory_bytes in umkafs.
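The per-UID soft-limit check at BPF_MAP_CREATE time can be sketched as follows (a simplified model: `charge_map_memory` and the in-memory usage map are illustrative names, and cgroup accounting is not modeled):

```rust
use std::collections::HashMap;

const ENOMEM: i64 = 12;
/// Default per-UID soft limit for unprivileged loaders (64 MiB),
/// corresponding to sysctl umka.bpf.uid_map_memory_limit_mib = 64.
const UID_LIMIT_BYTES: u64 = 64 * 1024 * 1024;

/// Charge `bytes` of map memory to `uid` at BPF_MAP_CREATE time.
/// Privileged loaders (CAP_BPF) bypass the per-UID limit; all loaders
/// remain subject to cgroup accounting (not modeled here).
fn charge_map_memory(
    usage: &mut HashMap<u32, u64>,
    uid: u32,
    bytes: u64,
    has_cap_bpf: bool,
) -> Result<(), i64> {
    let used = usage.entry(uid).or_insert(0);
    if !has_cap_bpf && *used + bytes > UID_LIMIT_BYTES {
        return Err(-ENOMEM); // over the per-UID soft limit
    }
    *used += bytes;
    Ok(())
}
```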

Map Type Implementation Specifications:

  • BPF_MAP_TYPE_HASH: FNV-1a hash, open addressing with linear probing, load factor <=75%, per-bucket spinlock (256 buckets). On collision past load factor: return ENOSPC from map_update.
  • BPF_MAP_TYPE_ARRAY: Fixed-size pre-allocated array, index bounds-checked, per-element spinlock for value updates >8 bytes (otherwise atomic CAS).
  • BPF_MAP_TYPE_RINGBUF: Lock-free SPSC ring (producer=BPF prog, consumer=userspace). Uses a ring format identical to Linux's (compatible with libbpf).
  • BPF_MAP_TYPE_LRU_HASH: Same as HASH but with an LRU eviction list (per-CPU LRU lists, promoted to global list on cross-CPU access). Eviction is O(1) amortized.
  • BPF_MAP_TYPE_LPM_TRIE: Patricia trie (radix tree), O(prefix_length) lookup, per-trie RwLock.
  • BPF_MAP_TYPE_PERCPU_HASH: Per-CPU variant of HASH -- each CPU has its own hash table; lookups/updates touch only the current CPU's table.
  • BPF_MAP_TYPE_PERCPU_ARRAY: Per-CPU array -- same structure as ARRAY but replicated per CPU.
  • BPF_MAP_TYPE_PERF_EVENT_ARRAY: Array of perf_event file descriptors; bpf_perf_event_output() writes to the current CPU's slot.
  • BPF_MAP_TYPE_STACK_TRACE: Hash map keyed by stack ID (FNV-1a of the call chain), value is array of instruction pointers.
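Several of the map types above key on 64-bit FNV-1a. A sketch of the hash function and the 256-bucket selection (constants are the standard FNV-1a offset basis and prime; the bucket count matches the per-bucket spinlock granularity stated above):

```rust
const FNV_OFFSET: u64 = 0xcbf2_9ce4_8422_2325; // FNV-1a 64-bit offset basis
const FNV_PRIME: u64 = 0x0000_0100_0000_01b3;  // FNV-1a 64-bit prime
const NUM_BUCKETS: usize = 256;                // per-bucket spinlock granularity

/// 64-bit FNV-1a over the raw key bytes.
fn fnv1a(key: &[u8]) -> u64 {
    let mut h = FNV_OFFSET;
    for &b in key {
        h ^= b as u64;
        h = h.wrapping_mul(FNV_PRIME);
    }
    h
}

/// Bucket index: low bits of the hash, valid because NUM_BUCKETS is a power of two.
fn bucket(key: &[u8]) -> usize {
    (fnv1a(key) as usize) & (NUM_BUCKETS - 1)
}
```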

Relationship to KABI policy hooks: eBPF provides Linux-compatible user-to-kernel extensibility for tracing, networking, and security (the same role as in Linux). UmkaOS's KABI driver model (Section 11.1) supports kernel-internal extensibility via vtable-based driver interfaces — drivers can register policy callbacks for scheduling, memory, and I/O decisions. The two mechanisms are complementary: eBPF serves the Linux ecosystem (existing tools, user-authored programs); KABI serves kernel evolution (vendor-provided policy drivers, hardware-specific optimizations).

18.1.4.2 eBPF Verifier Architecture

The verifier is the highest-risk component in the syscall interface. UmkaOS implements a clean-room Rust reimplementation (not a port of Linux's C verifier), leveraging Rust's type system to make verifier invariants compile-time enforced where possible.

Abstract interpretation: Forward dataflow analysis tracking register types and value ranges through every reachable instruction. At branch points, both paths are explored. At join points, register states are merged conservatively (widening).

Register abstract state — each of the 11 eBPF registers (r0-r10) carries:

pub struct RegState {
    /// Coarse type tag.
    pub reg_type: RegType,
    /// Signed/unsigned min/max for SCALAR_VALUE registers.
    /// Tracked as two separate ranges to handle sign-extension correctly.
    pub smin: i64, pub smax: i64,  // signed range
    pub umin: u64, pub umax: u64,  // unsigned range
    /// For pointer types: byte offset from base (may be negative for stack).
    pub off: i32,
    /// For PTR_TO_MAP_VALUE: which map, value size, key size.
    pub map_ptr: Option<BpfMapId>,
    /// For PTR_TO_BTF_ID: BTF type ID for field-access checking.
    pub btf_id: Option<BtfTypeId>,
    /// Equivalence ID: two registers with the same nonzero `id` are known
    /// to hold equal values (used to propagate bounds across register copies).
    pub id: u32,
}

pub enum RegType {
    NotInit,            // register has never been written
    ScalarValue,        // arbitrary integer, range-tracked
    PtrToCtx,          // read-only pointer to program context
    PtrToMap,          // pointer to BPF map struct (not its value)
    PtrToMapValue,     // pointer into a map value
    PtrToStack,        // pointer into 512-byte per-frame stack
    PtrToPacket,       // data pointer (skb->data)
    PtrToPacketMeta,   // metadata pointer (xdp_md->data_meta)
    PtrToPacketEnd,    // end-of-packet sentinel
    PtrToBtfId,        // typed kernel pointer via BTF
    PtrToMem,          // pointer to kernel memory from helper return
    PtrToRdOnlyBuf,    // read-only buffer from helper (e.g., map lookup result)
}

Pointer arithmetic rules:

  • SCALAR_VALUE: full arithmetic (add, sub, mul, div, mod, and, or, xor, shift). Range is updated at each operation; overflow wraps and may force smin=i64::MIN, smax=i64::MAX.
  • PTR_TO_MAP_VALUE + scalar: allowed. New offset = old offset + scalar.umin..scalar.umax. Before any load/store, the verifier checks [off, off+access_size) ⊆ [0, map_value_size).
  • PTR_TO_STACK + scalar: allowed only if the resulting offset is within [-512, 0]. Negative offsets index into the stack frame (stack grows down).
  • PTR_TO_PACKET + scalar: allowed only after a bounds-check instruction. The verifier tracks the data_end register; a comparison ptr + N < data_end marks the range valid.
  • Other pointer types (PtrToCtx, PtrToBtfId): arithmetic forbidden. Field access only via BTF-validated offsets.
  • Pointer ± pointer: forbidden (except packet_end - packet_ptr for length, which produces a SCALAR_VALUE bounded by packet length).
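The PTR_TO_MAP_VALUE access rule above reduces to an interval check over the tracked offset and scalar range. A simplified model (the real verifier operates on full RegState and handles overflow of the tracked ranges; `check_map_access` is an illustrative name):

```rust
/// Simplified model of the verifier's map-value access check:
/// a load/store of `access_size` bytes at pointer offset `off + scalar`
/// is legal only if every value in the scalar's unsigned range keeps
/// [start, start + access_size) inside [0, map_value_size).
/// Ranges are assumed small enough not to overflow i64 (sketch only).
fn check_map_access(
    off: i64,            // tracked pointer offset (RegState.off)
    umin: u64,           // scalar's unsigned minimum
    umax: u64,           // scalar's unsigned maximum
    access_size: u64,
    map_value_size: u64,
) -> bool {
    let lo = off + umin as i64; // smallest possible start offset
    let hi = off + umax as i64; // largest possible start offset
    lo >= 0 && (hi as u64).saturating_add(access_size) <= map_value_size
}
```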

Stack slot tracking: The 512-byte stack is divided into 8-byte slots. Each slot carries a StackSlotType:

  • Misc: written with an unknown value (spilled scalar).
  • SpilledReg(RegState): contains a spilled register with its type preserved.
  • Uninit: never written — reading this is a verifier error.

Stack writes smaller than 8 bytes mark the containing slot as Misc.
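Mapping a stack offset to its slot is a small arithmetic step worth pinning down (a sketch; `slot_index` is an illustrative name). Offsets are negative, relative to the frame pointer r10, so -1..=-8 lands in the highest slot and -505..=-512 in slot 0:

```rust
const STACK_SIZE: i32 = 512; // bytes per frame
const SLOT_SIZE: i32 = 8;    // bytes per tracked slot

/// Map a stack offset (negative, relative to r10) to its 8-byte slot index.
/// Returns None for offsets outside the 512-byte frame (a verifier error).
fn slot_index(off: i32) -> Option<usize> {
    if off >= -STACK_SIZE && off < 0 {
        Some(((off + STACK_SIZE) / SLOT_SIZE) as usize)
    } else {
        None
    }
}
```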

Helper function type checking: Each BPF helper has a statically encoded signature: fn(ArgType, ArgType, ...) -> RetType. Before emitting a call instruction, the verifier checks each argument register's RegType against the expected ArgType:

  • ARG_ANYTHING: any initialized register
  • ARG_PTR_TO_MAP_KEY: PtrToStack or PtrToMapValue pointing to key_size bytes
  • ARG_PTR_TO_MAP_VALUE: PtrToMapValue with write access
  • ARG_CONST_SIZE: ScalarValue with known (smin==smax) value
  • ARG_PTR_TO_MEM: any initialized pointer with size verified by a prior argument

After the call, r0 is set to the return type (e.g., PtrToMapValue | NULL for map_lookup_elem).

Loop handling: Back-edge detection via DFS. Bounded loops (Linux 5.3+ semantics) supported via the loop counter check: the verifier must observe the back-edge condition register narrowing its range on each iteration. If the range does not narrow (e.g., counter never decremented), the loop is rejected. Widening: after BPF_MAX_SUBPROGS (256) visits to the same instruction, the verifier widens all ScalarValue ranges to [INT_MIN, INT_MAX] to force termination.

Maximum verifier instruction exploration count: 1 million (BPF_COMPLEXITY_LIMIT_INSNS, matching Linux since kernel 5.2). Maximum program size: 4,096 instructions for unprivileged programs (BPF_MAXINSNS), 1 million for privileged. Unbounded loops rejected.

Verifier Limits — UmkaOS enforces the same limits as Linux 6.x so that existing eBPF programs passing the Linux verifier also pass UmkaOS's. Programs loaded with CAP_BPF are not subject to tighter restrictions:

Limit                                Value      Notes
Max instructions explored (visited)  1,000,000  Complexity bound, not unique instruction count (BPF_COMPLEXITY_LIMIT_INSNS)
Max stack depth per subprogram       512 bytes  Includes spilled registers
Max subprograms (BPF-to-BPF calls)   256        BPF_MAX_SUBPROGS
Max map-in-map nesting depth         2
Max tail call depth                  33         MAX_TAIL_CALL_CNT; Linux 5.12+ uses 33, earlier used 32
Max instructions per subprogram      1,000,000  Same as global complexity limit
Max loop iterations (bounded loops)  8,192,000  Per loop; ultimately bounded by the instruction exploration limit

Verifier time limit: In addition to the instruction exploration limit, UmkaOS enforces a wall-clock timeout of 1 second per verification attempt. A pathological program can consume 1M exploration steps and still take seconds to verify due to wide range analysis or many join points — this is an unprivileged DoS vector that Linux does not protect against. UmkaOS closes this gap:

pub struct VerifierBudget {
    /// Maximum BPF instruction state explorations (same as Linux).
    pub insn_limit: u32,           // default: 1_000_000
    /// Wall-clock timeout for the entire verification pass.
    pub time_limit: Duration,      // default: Duration::from_secs(1)
    /// Instructions explored so far.
    pub insns_checked: u32,
    /// Verification start time.
    pub start: Instant,
}

impl VerifierBudget {
    pub fn check(&self) -> Result<(), VerifierError> {
        if self.insns_checked >= self.insn_limit {
            return Err(VerifierError::ComplexityLimit);
        }
        if self.start.elapsed() >= self.time_limit {
            return Err(VerifierError::TimeLimit);
        }
        Ok(())
    }
}

The wall-clock timeout is checked at every back-edge and join point during exploration. On timeout, the program is rejected with EACCES and an error message: "BPF program rejected: verification time limit exceeded (1s)".

The limit can be raised for privileged users: CAP_SYS_ADMIN can set up to 10 seconds via the BPF_PROG_LOAD attribute verification_time_limit_ms. The sysctl /proc/sys/kernel/bpf_verifier_time_limit_ms (default 1000) sets the system-wide limit for unprivileged loaders. Note: Linux does not have this protection; UmkaOS introduces it as a safety improvement.

Stack depth: Maximum 512 bytes per frame, verified statically. BPF-to-BPF call depth max 8 frames, each with up to 512 bytes of stack. Tail call chain depth max 33 (MAX_TAIL_CALL_CNT, matching Linux 5.12+; earlier kernels used 32).
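The tail-call bound is the one limit enforced at runtime rather than load time. A sketch of the counter guard (simplified: in practice the JIT emits this check inline before each bpf_tail_call, and the counter lives in a register or per-invocation scratch slot; the function name is illustrative):

```rust
const MAX_TAIL_CALL_CNT: u32 = 33; // matches Linux 5.12+

/// Runtime guard executed before each bpf_tail_call.
/// Returns false when the chain is too deep; the tail call then
/// falls through instead of jumping, matching Linux semantics.
fn try_tail_call(counter: &mut u32) -> bool {
    if *counter >= MAX_TAIL_CALL_CNT {
        return false;
    }
    *counter += 1;
    true
}
```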

18.1.4.3 eBPF Verifier Risk Mitigation

A verifier bug equals kernel compromise. UmkaOS applies defense-in-depth:

  1. Memory protection: BPF JIT-compiled programs execute in the kernel code segment (PKEY 0). Memory protection for BPF programs is enforced by the verifier and by W^X (write-xor-execute) page permissions, not by PKEY isolation -- PKEY isolation is only meaningful for Tier 1 driver domains running in ring 0 but with restricted memory access (PKEYs 2-13; see Section 10.2). BPF does not consume a PKEY domain. The JIT output pages are mapped execute-only (no write) after code emission; the JIT staging buffer is mapped read-write (no execute) during emission and unmapped afterward. This W^X discipline ensures that even with a verifier bug, an attacker cannot modify JIT-compiled code at runtime. BPF programs access kernel state only through verified helper functions that perform bounds-checked, type-checked access on the program's behalf.
  2. Capability-gated loading: Only CAP_BPF holders can load programs. Unprivileged eBPF loading disabled by default.
  3. Differential testing: UmkaOS verifier tested against Linux verifier on >50,000 known-good and known-bad programs. Any divergence is investigated.
  4. Rust type safety: Invalid state transitions are compile-time errors, not runtime checks.

18.1.4.4 BPF Isolation Model

BPF programs are a cross-cutting concern used beyond networking: tracing (kprobe, tracepoint), security (LSM, seccomp), scheduling (struct_ops), and packet filtering (XDP, tc) all execute BPF code. The full BPF isolation model — verifier enforcement, map access control, capability-gated helpers, cross-domain packet redirect rules, and W^X page protections — is specified in Section 15.2.2 (Packet Filtering, BPF-Based). Although Section 15.2.2 is located in the Networking part, its isolation rules apply to all BPF program types, not just networking hooks. Every BPF program, regardless of attachment point, runs in the kernel address space (PKEY 0) with safety enforced by the verifier and W^X page permissions (see Section 18.1.4.3), and accesses kernel state only through verified BPF helpers that perform bounds-checked, type-checked access on the program's behalf.

18.1.4.4a eBPF Helper Function IDs and Dispatch Table

eBPF programs invoke kernel services through a fixed set of helper functions identified by numeric IDs. Since BPF programs compiled for Linux embed these IDs directly in their bytecode, UmkaOS must dispatch identically — the numeric IDs are part of the external ABI.

Helper ID Enumeration
/// eBPF helper function IDs — must match Linux's `enum bpf_func_id` exactly
/// (include/uapi/linux/bpf.h, Linux 6.12).
///
/// Programs compiled for Linux use these numeric IDs; UmkaOS must dispatch
/// identically. The full set matches `BPF_FUNC_MAX_ID` (~220 in Linux 6.12).
/// Only the commonly-used helpers are named here; the remainder are
/// provided as `Reserved(u32)` variants at their correct numeric positions.
#[repr(u32)]
#[non_exhaustive]
pub enum BpfFuncId {
    Unspec              = 0,
    MapLookupElem       = 1,   // bpf_map_lookup_elem
    MapUpdateElem       = 2,   // bpf_map_update_elem
    MapDeleteElem       = 3,   // bpf_map_delete_elem
    ProbeRead           = 4,   // bpf_probe_read (deprecated; use ProbeReadKernel)
    KtimeGetNs          = 5,   // bpf_ktime_get_ns → monotonic nanoseconds
    TracePrintk         = 6,   // bpf_trace_printk → /sys/kernel/debug/tracing/trace_pipe
    GetPrandomU32       = 7,   // bpf_get_prandom_u32
    GetSmpProcessorId   = 8,   // bpf_get_smp_processor_id
    SkbStoreBytes       = 9,
    L3CsumReplace       = 10,  // bpf_l3_csum_replace
    L4CsumReplace       = 11,  // bpf_l4_csum_replace
    TailCall            = 12,  // bpf_tail_call
    CloneRedirect       = 13,  // bpf_clone_redirect
    GetCurrentPidTgid   = 14,  // → (tgid << 32 | pid)
    GetCurrentUidGid    = 15,  // → (gid << 32 | uid)
    GetCurrentComm      = 16,
    GetCgroupClassid    = 17,
    SkbVlanPush         = 18,
    SkbVlanPop          = 19,
    SkbGetTunnelKey     = 20,
    SkbSetTunnelKey     = 21,
    PerfEventRead       = 22,
    Redirect            = 23,  // bpf_redirect(ifindex, flags)
    GetRouteRealm       = 24,
    PerfEventOutput     = 25,
    SkbLoadBytes        = 26,
    GetStackid          = 27,
    CsumDiff            = 28,
    // 29, 30: skb_get_tunnel_opt, skb_set_tunnel_opt
    SkbChangeProto      = 31,
    SkbChangeType       = 32,
    SkbUnderCgroup      = 33,
    GetHashRecalc       = 34,
    GetCurrentTask      = 35,
    ProbeWriteUser      = 36,
    CurrentTaskUnderCgroup = 37,
    SkbChangeTail       = 38,
    SkbPullData         = 39,
    CsumUpdate          = 40,
    SetHashInvalid      = 41,
    GetNumaNodeId       = 42,
    SkbChangeHead       = 43,
    XdpAdjustHead       = 44,  // bpf_xdp_adjust_head(xdp_md, delta)
    ProbeReadStr        = 45,
    GetSocketCookie     = 46,
    GetSocketUid        = 47,
    SetHash             = 48,
    Setsockopt          = 49,  // bpf_setsockopt
    SkbAdjustRoom       = 50,
    // 51-111: socket, cgroup, flow dissector, tunnel, sk_storage, etc.
    // Implemented at their exact numeric IDs; not all are named in this enum.
    ProbeReadUser       = 112, // bpf_probe_read_user
    ProbeReadKernel     = 113, // bpf_probe_read_kernel (tracing programs only)
    ProbeReadUserStr    = 114,
    ProbeReadKernelStr  = 115,
    // 116-129: further tracing, socket, and sequence helpers
    RingbufOutput       = 130, // bpf_ringbuf_output
    RingbufReserve      = 131, // bpf_ringbuf_reserve
    RingbufSubmit       = 132, // bpf_ringbuf_submit
    RingbufDiscard      = 133, // bpf_ringbuf_discard
    RingbufQuery        = 134, // bpf_ringbuf_query
    // 135-220+: further helpers (kptr, dynptr, user_ringbuf, arena, etc.)
    // Current max in Linux 6.12: BPF_FUNC_MAX_ID ≈ 220.
}

The full set of helper IDs must match enum bpf_func_id in Linux's include/uapi/linux/bpf.h exactly. UmkaOS implements the complete set of helpers required for:

  • Network programs (XDP, socket filter, tc): Redirect, XdpAdjustHead, SkbAdjustRoom, PerfEventOutput, MapLookupElem/MapUpdateElem, checksum helpers.
  • Tracing programs (kprobe, tracepoint, perf_event): ProbeReadKernel, ProbeReadUser, ProbeReadKernelStr, ProbeReadUserStr, GetCurrentPidTgid, GetCurrentTask, GetStackid, RingbufOutput/Reserve/Submit/Discard.
  • Cgroup programs: GetCurrentUidGid, GetCgroupClassid, Setsockopt, GetSocketCookie.

Programs that attempt to call a helper not permitted for their program type are rejected with EPERM by the verifier at load time — not at runtime. The allowed-helper set per program type is enforced statically.

Helper Dispatch Table
/// eBPF helper function dispatch table.
/// Indexed by `BpfFuncId` (cast to `u32`), sized to `BPF_FUNC_MAX_ID`.
/// Populated at kernel init time; immutable thereafter.
pub struct BpfHelperTable {
    /// One entry per helper ID, indexed by `BpfFuncId as u32`.
    /// Entries for unimplemented IDs have `func` set to `bpf_unimplemented_helper`
    /// (returns 0; verifier prevents reaching this at runtime by rejecting the call).
    pub helpers: &'static [BpfHelper],
}

/// Descriptor for a single eBPF helper function.
pub struct BpfHelper {
    /// Numeric helper ID (matches `BpfFuncId`).
    pub id: u32,
    /// Bitmask of `BpfProgType` values permitted to call this helper.
    /// Verifier checks this at load time; runtime dispatch unconditional.
    pub allowed_prog_types: BpfProgTypeMask,
    /// UmkaOS implementation.
    ///
    /// # Safety
    ///
    /// Called from JIT-compiled or interpreted eBPF programs. Arguments are
    /// pre-validated by the verifier (types match `arg_types`; pointers are
    /// in-bounds). The implementation must not panic, must not access
    /// memory outside the passed bounds, and must complete in bounded time.
    pub func: unsafe fn(a1: u64, a2: u64, a3: u64, a4: u64, a5: u64) -> u64,
    /// Argument type descriptors used by the verifier for type checking.
    /// Five slots match the eBPF calling convention (r1-r5 as arguments).
    pub arg_types: [BpfArgType; 5],
    /// Return value type for the verifier (updates r0's `RegState` after the call).
    pub ret_type: BpfRetType,
}

/// Per-program-type helper allowlist bitmask.
/// One bit per `BpfProgType`; a helper is callable from program type T
/// iff `(allowed_prog_types >> T as u32) & 1 == 1`.
pub struct BpfProgTypeMask(pub u64);

The BpfHelperTable is a static array allocated at kernel init time and populated by each subsystem that owns helpers (networking, tracing, cgroup, crypto). The table is looked up by the JIT compiler (to emit a direct call to helper.func) and by the interpreter (to dispatch via helper.func). The verifier uses arg_types and ret_type to propagate register abstract state through helper calls and allowed_prog_types to reject calls to disallowed helpers with EPERM at load time.
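The load-time allowlist check implied by `BpfProgTypeMask` can be sketched as follows (a simplified standalone model; the enum discriminants shown follow Linux's `enum bpf_prog_type` numbering, and `check_helper_call` is an illustrative name):

```rust
#[derive(Clone, Copy)]
pub enum BpfProgType {
    Unspec = 0,
    SocketFilter = 1, // BPF_PROG_TYPE_SOCKET_FILTER
    Kprobe = 2,       // BPF_PROG_TYPE_KPROBE
    Xdp = 6,          // BPF_PROG_TYPE_XDP
}

/// One bit per program type; a helper is callable from type T
/// iff bit (T as u32) is set.
pub struct BpfProgTypeMask(pub u64);

impl BpfProgTypeMask {
    pub fn allows(&self, t: BpfProgType) -> bool {
        (self.0 >> (t as u32)) & 1 == 1
    }
}

const EPERM: i64 = 1;

/// Load-time check: reject a call to a gated helper from a
/// disallowed program type with EPERM, per the dispatch-table rules.
fn check_helper_call(mask: &BpfProgTypeMask, prog_type: BpfProgType) -> Result<(), i64> {
    if mask.allows(prog_type) { Ok(()) } else { Err(-EPERM) }
}
```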

Helper Security Model
  • Capability-gated helpers: helpers that can modify kernel state (ProbeWriteUser, TracePrintk) require CAP_BPF or CAP_SYS_ADMIN; the verifier rejects their use in programs loaded without the required capability.
  • Type-safe access: helpers accessing kernel memory (ProbeReadKernel, ProbeReadUser) perform bounds-checked, type-checked access. The verifier ensures the pointer argument is of the correct RegType and the size argument is a known ScalarValue.
  • No helper bypasses isolation domains: helpers execute in the kernel address space (PKEY 0) and cannot be used to access Tier 1 driver isolation domains directly. A Tier 1 driver's memory is not reachable via bpf_probe_read_kernel because it maps to a different protection key — the access faults at the hardware level before the helper copies any data.

18.1.4.5 KVM Hypervisor

KVM runs as a Tier 1 driver with extended hardware privileges, exposing the /dev/kvm interface. Unlike most Tier 1 drivers that access a single device via MMIO, KVM requires access to VM control structures (VMCS/VMCB/HCR_EL2 configuration). These are granted as capabilities at registration time via KvmHardwareCapability — a structured capability exchange at the KABI boundary that permits umka-core to execute VMX/VHE/H-extension operations on KVM's behalf through a validated VMX/VHE trampoline. The trampoline runs in the UmkaOS Core protection domain (PKEY 0 on x86-64) and performs the actual VMLAUNCH/VMRESUME/ERET, validating VMCS fields (no host-state corruption, EPT does not map UmkaOS Core pages writable to the guest) before executing VM entry.

There is no "Tier 0.5" — KVM fits the Tier 1 model with a richer capability set. KVM is memory-domain isolated from UmkaOS Core (MPK on x86-64, POE or page-table+ASID on AArch64) exactly as any other Tier 1 driver. The trampoline code (~200 lines of verified assembly) is small enough to audit as Tier 0 code; it is the only code that executes VMX instructions and is the security boundary between KVM's isolation domain and Core private state.

A KVM crash triggers the Tier 1 crash recovery path (Section 10.8.2) with one additional step: all active VM execution contexts are suspended before the driver is reloaded. After umka-kvm reloads (~150 ms, FLR path for any assigned devices), the VMCS state for each VM is reconstructed from the checkpointed state buffer (Section 10.8). VMs resume without guest-visible interruption beyond a brief pause. If reconstruction fails, the VM is terminated — the same outcome as a host kernel crash in Linux, but without affecting other VMs or the host.

  • Full x86-64 VMX support:
      • Nested paging (EPT)
      • VMCS shadowing (for nested virtualization)
      • Posted interrupts (for efficient interrupt delivery)
      • PML (Page Modification Logging)
  • QEMU/KVM, libvirt, Firecracker, Cloud Hypervisor must work unmodified

ARM64 KVM (VHE/nVHE):

ARM64 KVM uses the Virtualization Extensions (ARMv8.1+). Two modes are supported:

VHE (Virtualization Host Extensions, ARMv8.1+):
  - Host kernel runs at EL2 (hypervisor exception level) instead of EL1.
  - Guest runs at EL1 (virtual EL1, translated by VHE).
  - Benefit: no world switch needed for host kernel — host IS the hypervisor.
  - VTTBR_EL2 points to guest's Stage-2 translation tables.
  - Guest physical → host physical translation via Stage-2 page tables.
  - Used on: AWS Graviton, Ampere, Apple Silicon, Cortex-X series.

nVHE (non-VHE, pre-ARMv8.1 or when VHE is disabled):
  - Host kernel runs at EL1. Hypervisor stub at EL2.
  - Guest entry requires EL1 → EL2 → EL1(guest) transition.
  - Higher overhead (~500-1000 cycles per VM entry/exit vs ~200 for VHE).
  - UmkaOS supports nVHE for older ARM64 hardware but defaults to VHE.

Protected KVM (pKVM, ARMv8.0+):
  - EL2 hypervisor is a small, deprivileged module (~5K lines).
  - Host kernel runs at EL1 with restricted Stage-2 mappings.
  - Guest memory is inaccessible to the host (confidential VMs without TEE).
  - Aligns with UmkaOS's isolation model: pKVM enforces VM isolation in hardware.

ARM64 KVM integration with UmkaOS isolation:

  • On ARM64, the isolation mechanism is POE/page-table (not MPK). KVM uses a Stage-2 trampoline analogous to the x86 VMX trampoline: umka-core manages VTTBR_EL2 and HCR_EL2 writes; umka-kvm prepares the VM configuration in its own isolation domain. The trampoline validates Stage-2 page tables before executing the ERET to enter the guest.
  • PSCI (Power State Coordination Interface) for vCPU bring-up: KVM intercepts PSCI calls from the guest via HVC/SMC trapping in HCR_EL2.
  • Virtual GIC (vGICv3/vGICv4): Interrupt injection uses GICv4 direct injection where available (zero exit for most interrupts), falling back to software injection.

ARM64 VHE/nVHE Selection Algorithm:

KVM on AArch64 has two host kernel execution modes:

  • VHE (Virtualization Host Extensions, ARMv8.1+): Host kernel runs at EL2 (hypervisor level). Eliminates world-switch overhead for EL1/EL0 operations. Preferred when available.
  • nVHE: Host kernel runs at EL1; a stub firmware runs at EL2. Requires a full world-switch on every VM entry/exit. Used on hardware without VHE or when EL2 is already occupied.

Selection at boot (in umka-kvm/src/arm64/init.rs):

fn select_kvm_mode() -> KvmMode {
    // 1. Check CPU feature: ID_AA64MMFR1_EL1.VH (bits [11:8]) == 0b0001
    //    means VHE is supported.
    if !cpuid::has_feature(CpuFeature::VHE) {
        return KvmMode::NvHE; // hardware does not support VHE
    }

    // 2. Check if another hypervisor already owns EL2 (e.g., Xen, pKVM).
    //    Read HCR_EL2 -- if the E2H bit is 0 and we didn't set it, EL2 is occupied.
    if hcr_el2_read().e2h() == 0 && !boot_claimed_el2() {
        return KvmMode::NvHE; // EL2 owned by firmware/another hypervisor
    }

    // 3. Check for pKVM (Protected KVM) mode. pKVM requires nVHE to maintain
    //    its own EL2 firmware for confidential VM isolation. If the CONFIG_PKVM
    //    equivalent is enabled in umka-kvm, force nVHE.
    if umka_kvm_config().protected_kvm_enabled {
        return KvmMode::NvHE; // pKVM requires nVHE
    }

    // 4. All checks passed: use VHE.
    KvmMode::VHE
}

Runtime effects:

  • VHE: HCR_EL2.E2H = 1, TGE = 1 set at boot. EL1 system register accesses are redirected to EL2. No mode switch cost; ~15-30% better VM density on high-frequency VM-exit workloads.
  • nVHE: A small EL2 stub (umka_kvm_hyp) is installed at boot. Each VM entry/exit involves saving/restoring the host EL1 context (~50-150 cycles overhead per VM exit).

RISC-V KVM (H-extension):

RISC-V virtualization is defined by the H (Hypervisor) extension (ratified December 2021, as part of Privileged Architecture v1.12):

H-extension architecture:
  - Hypervisor runs in HS-mode (Hypervisor-extended Supervisor mode).
  - Guest runs in VS-mode (Virtual Supervisor mode).
  - hstatus CSR: hypervisor status (SPV bit tracks guest/host context).
  - hgatp CSR: guest physical → host physical address translation
    (analogous to EPT on x86 and Stage-2 on ARM).
  - htval CSR: faulting guest physical address (for #PF handling).
  - hvip/hip/hie CSRs: virtual interrupt injection.
  - Guest trap delegation: hedeleg/hideleg CSRs control which traps
    go to VS-mode (guest handles) vs HS-mode (hypervisor handles).

VM entry/exit:
  - Entry: set hstatus.SPV = 1, sret → enters VS-mode.
  - Exit: guest trap/interrupt → HS-mode handler (automatic by hardware).
  - Cost: ~200-400 cycles per exit (varies by implementation).
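The entry/exit protocol above reduces to bit manipulation on the hstatus CSR (SPV is bit 7). The sketch below models just that bookkeeping; the helper names are illustrative, and real CSR access needs inline assembly:

```rust
// Hypothetical sketch of the hstatus.SPV handling described above.
const HSTATUS_SPV: u64 = 1 << 7; // Supervisor Previous Virtualization bit

/// Compute the hstatus value for VM entry: set SPV so that `sret`
/// lands in VS-mode (the guest) rather than returning to HS-mode.
fn hstatus_for_guest_entry(current_hstatus: u64) -> u64 {
    current_hstatus | HSTATUS_SPV
}

/// On a trap back into HS-mode, SPV records whether the trap came
/// from the guest (VS-mode) or from the host itself.
fn trap_came_from_guest(hstatus_at_trap: u64) -> bool {
    hstatus_at_trap & HSTATUS_SPV != 0
}
```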

IOMMU: RISC-V IOMMU spec (ratified June 2023) provides Stage-2 translation
for device DMA, analogous to Intel VT-d / ARM SMMU.

RISC-V KVM integration with UmkaOS:

  • The umka-kvm driver manages hgatp (guest page tables) and hvip (virtual interrupts) in its isolation domain. The HS-mode trampoline validates hgatp entries before guest entry.
  • H-extension hardware is available on SiFive P670, T-Head C910, and QEMU virt. UmkaOS targets QEMU for initial development.

KVM and Domain Isolation — KVM requires capabilities beyond a standard MMIO device driver. Unlike a NIC or storage driver that accesses a single device via MMIO, KVM requires: (1) VMX root mode transitions (VMXON, VMLAUNCH, VMRESUME), which are privileged Ring 0 operations that affect global CPU state; (2) VMCS manipulation, which Intel requires to be in a specific memory region pointed to by a per-CPU VMCS pointer; (3) EPT (Extended Page Table) management, which programs second-level page tables that control guest physical-to-host physical address translation; (4) direct access to MSRs and control registers during VM entry/exit.

These capabilities are incompatible with a plain memory-domain isolation model — the hardware memory domain mechanism (WRPKRU/POR_EL0/DACR) controls memory access permissions, not instruction execution privilege. KVM is therefore classified as a Tier 1 driver with extended hardware privileges, granted KvmHardwareCapability at KABI registration time. This capability authorizes umka-core to execute VMX/VHE/H-extension operations on KVM's behalf via a validated VMX/VHE trampoline that runs in the UmkaOS Core protection domain (PKEY 0 on x86-64). KVM prepares the VMCS and EPT in its own memory isolation domain; the trampoline validates the VMCS fields (no host-state corruption, EPT does not map UmkaOS Core pages writable to the guest), then executes the VM entry. KVM retains Tier 1 crash-recovery semantics — a bug in KVM's VMCS preparation or ioctl handling crashes only KVM, not UmkaOS Core.

Why not Tier 0? — Tier 0 code cannot crash-recover. By running KVM as a Tier 1 driver with a validated trampoline, a fault in KVM's VMCS preparation or ioctl handling crashes only KVM, not UmkaOS Core. The VMX trampoline itself is ~200 lines of verified assembly — small enough to audit as Tier 0 code.

Recovery implications — When umka-kvm crashes, all running VMs are paused (their vCPU threads are halted). After umka-kvm reloads (~150 ms, FLR path for any assigned devices), the VMCS state for each VM is reconstructed from the checkpointed state buffer (Section 10.8). VMs resume without guest-visible interruption beyond a brief pause. If reconstruction fails, the VM is terminated (same outcome as a host kernel crash in Linux, but without affecting other VMs or the host).

KVM Integration with umka-core Memory Management:

KVM's Extended Page Tables (EPT on x86, Stage-2 on ARM, hgatp on RISC-V) require tight integration with umka-core's memory management subsystem (Section 4.1):

Second-Level Address Translation (SLAT) hooks:

/// umka-core provides these hooks to umka-kvm for EPT/Stage-2 management.
/// Each hook operates on host physical frames and guest physical addresses.
pub trait SlatHooks {
    /// Allocate a physical page for SLAT page table structures (EPT/Stage-2/hgatp
    /// page table entries). These are hypervisor metadata pages used to build the
    /// second-level address translation tables — NOT guest physical memory backing
    /// pages. Returns a pinned frame suitable for use as a page table page.
    fn alloc_slat_page(&self) -> Result<PhysFrame, KernelError>;

    /// Free a SLAT page table structure page previously allocated by
    /// `alloc_slat_page`.
    fn free_slat_page(&self, frame: PhysFrame);

    /// Allocate a physical page to back guest physical memory. This is the host
    /// physical frame that the guest will use as RAM — mapped into the SLAT tables
    /// as a leaf entry. Distinct from `alloc_slat_page`, which allocates page table
    /// structure pages (internal SLAT nodes).
    fn alloc_guest_page(&self) -> Result<PhysFrame, KernelError>;

    /// Free a guest physical memory backing page previously allocated by
    /// `alloc_guest_page`, returning it to umka-core's buddy allocator.
    fn free_guest_page(&self, frame: PhysFrame);

    /// Pin a host physical page to prevent reclaim or migration while it is
    /// mapped in an EPT/Stage-2 table. The page remains pinned until the
    /// corresponding `unpin_host_page` call.
    fn pin_host_page(&self, frame: PhysFrame) -> Result<(), KernelError>;

    /// Unpin a host physical page, allowing umka-core to reclaim or migrate it.
    fn unpin_host_page(&self, frame: PhysFrame);

    /// Notify umka-core that a guest physical to host physical mapping was created.
    /// Used for dirty page tracking and live migration bookkeeping.
    fn notify_slat_map(&self, gpa: u64, hpa: u64, size: usize, writable: bool);

    /// Notify umka-core that a SLAT mapping was removed.
    fn notify_slat_unmap(&self, gpa: u64, size: usize);
}
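The trait above is easiest to see in use. The sketch below trims it to three methods and backs a single faulting guest page; `MockCore`, the fixed 4 KiB page size, and the frame numbering are assumptions of this example, not umka-core's actual allocator:

```rust
use std::cell::Cell;

#[derive(Clone, Copy, PartialEq, Debug)]
pub struct PhysFrame(pub u64);
#[derive(Debug)]
pub struct KernelError;

// Trimmed subset of the SlatHooks trait for illustration.
pub trait SlatHooks {
    fn alloc_guest_page(&self) -> Result<PhysFrame, KernelError>;
    fn pin_host_page(&self, frame: PhysFrame) -> Result<(), KernelError>;
    fn notify_slat_map(&self, gpa: u64, hpa: u64, size: usize, writable: bool);
}

/// Back one faulting guest physical page: allocate a host frame, pin it so
/// it cannot be reclaimed while mapped, then report the new mapping.
pub fn back_guest_page(core: &dyn SlatHooks, gpa: u64) -> Result<PhysFrame, KernelError> {
    let frame = core.alloc_guest_page()?;
    core.pin_host_page(frame)?;
    core.notify_slat_map(gpa, frame.0, 4096, true);
    Ok(frame)
}

/// Hypothetical in-memory stand-in for umka-core's buddy allocator.
pub struct MockCore { next_frame: Cell<u64> }
impl SlatHooks for MockCore {
    fn alloc_guest_page(&self) -> Result<PhysFrame, KernelError> {
        let f = self.next_frame.get();
        self.next_frame.set(f + 4096);
        Ok(PhysFrame(f))
    }
    fn pin_host_page(&self, _frame: PhysFrame) -> Result<(), KernelError> { Ok(()) }
    fn notify_slat_map(&self, _gpa: u64, _hpa: u64, _size: usize, _writable: bool) {}
}
```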

Memory overcommit: umka-kvm can overcommit guest memory (assign more virtual memory to VMs than is physically available). When a guest accesses an unmapped guest physical page, the EPT violation is handled through a five-step path:

  1. VM exit to trampoline: The EPT/Stage-2/hgatp violation triggers a VM exit. The VMX trampoline (running in PKEY 0/umka-core) captures the faulting guest physical address from VMCS (x86), FAR_EL2 (ARM), or htval (RISC-V).

  2. Synchronous upcall to umka-kvm: The trampoline performs a direct function call (not ring buffer IPC) to umka-kvm's page fault handler. This is safe because:
     • The call is synchronous within the vCPU thread context (no concurrency with other umka-kvm operations on this vCPU).
     • umka-kvm's page fault handler runs in its isolation domain but accesses only its own per-VM data structures.
     • The trampoline validates that the fault is a legitimate EPT violation (not a malicious call from compromised code) before invoking umka-kvm.
     Direct call latency: ~30-50 cycles (domain switch + indirect call), NOT the ~200+ cycle ring buffer round-trip used for asynchronous driver IPC.

  3. Page request: umka-kvm requests a guest backing page from umka-core via SlatHooks::alloc_guest_page (another direct call, umka-core is PKEY 0).

  4. Page allocation: umka-core allocates from the buddy allocator, potentially reclaiming pages from page cache, compressing cold pages (Section 4.2), or evicting pages from other guests based on the memory pressure framework.

  5. Mapping and resume: umka-kvm installs the EPT/Stage-2 mapping in its per-VM page tables and returns to the trampoline, which resumes the guest via VMRESUME/ERET.

Total EPT violation latency: ~200 cycles (VM exit) + ~50 cycles (trampoline + domain switch) + ~100-500 cycles (page allocation, varies by pressure) + ~200 cycles (VM entry) = ~550-950 cycles for a page-in from free list. This is comparable to Linux KVM's EPT violation handling (~400-800 cycles on similar hardware).

Dirty page tracking for live migration uses architecture-specific mechanisms:

  • PML (Page Modification Logging) on Intel: hardware logs dirty guest physical addresses to a 512-entry buffer in the VMCS. When the buffer fills, a VM exit occurs and umka-kvm drains the buffer into a per-VM dirty bitmap.
  • Software dirty tracking on ARM/RISC-V: umka-kvm clears the write permission bit in Stage-2/hgatp entries. Write faults trap into umka-kvm, which records the dirty page in the bitmap and restores write permission. Batched permission restoration amortizes the TLB invalidation cost.
  • umka-core maintains per-VM dirty bitmaps (one bit per 4 KiB page) that can be queried and atomically reset by the migration coordinator.
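The per-VM dirty bitmap described above can be sketched as follows. The type and method names are illustrative, not the actual umka-core API; the sketch shows the one-bit-per-4-KiB-page layout and the query-and-reset operation the migration coordinator relies on:

```rust
/// Minimal sketch of a per-VM dirty bitmap (one bit per 4 KiB page).
struct DirtyBitmap { bits: Vec<u64> }

impl DirtyBitmap {
    fn new(num_pages: usize) -> Self {
        Self { bits: vec![0; (num_pages + 63) / 64] }
    }

    /// Called when a PML entry is drained (Intel) or a write fault is
    /// taken (ARM/RISC-V software tracking).
    fn mark_dirty(&mut self, gpa: u64) {
        let pfn = (gpa >> 12) as usize; // 4 KiB page frame number
        self.bits[pfn / 64] |= 1 << (pfn % 64);
    }

    /// Migration coordinator path: fetch the dirty page frame numbers and
    /// atomically clear the bitmap in one pass.
    fn drain(&mut self) -> Vec<usize> {
        let mut dirty = Vec::new();
        for (i, word) in self.bits.iter_mut().enumerate() {
            let mut w = std::mem::take(word); // read and clear
            while w != 0 {
                let bit = w.trailing_zeros() as usize;
                dirty.push(i * 64 + bit);
                w &= w - 1; // clear lowest set bit
            }
        }
        dirty
    }
}
```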

Ballooning integration: The virtio-balloon driver in the guest inflates (returns pages to the host) or deflates (reclaims pages from the host). umka-kvm processes balloon requests by calling free_guest_page on inflation (returning the host physical frame to umka-core's buddy allocator) and alloc_guest_page on deflation (allocating a new guest backing frame and installing the EPT mapping). Balloon state is included in the umka-kvm checkpoint for crash recovery (Section 10.8).

18.1.4.6 Netfilter / nftables

  • Tier 1 network stack includes the nftables packet classification engine
  • iptables legacy compatibility via the nft backend (same approach as modern Linux)
  • Connection tracking (conntrack) for stateful firewalling
  • NAT support: SNAT, DNAT, masquerade
  • Required for: Docker networking, Kubernetes kube-proxy (iptables mode), firewalld

18.1.4.7 Linux Security Modules (LSM)

  • LSM hook framework at all security-relevant points (file access, socket operations, task operations, IPC, etc.)
  • SELinux policy engine compatibility (required for RHEL/CentOS/Fedora)
  • AppArmor profile compatibility (required for Ubuntu/SUSE)
  • Capability-based hooks integrate naturally with UmkaOS's native capability model
  • seccomp-bpf for per-process syscall filtering (required for Docker, Chrome)

The architecture guarantees that every Linux LSM hook has a corresponding UmkaOS enforcement point — either a direct capability check or a policy module callout (Section 18.7). Scope estimate: Linux 6.x defines ~220 LSM hook points across file, inode, task, socket, IPC, key, audit, BPF, and perf_event categories. The UmkaOS implementation must provide hook stubs for all ~220 points for SELinux/AppArmor policy modules to attach to.

Partial LSM hook mapping (security-critical hooks):

LSM Hook UmkaOS Capability Check Notes
inode_permission CAP_DAC_OVERRIDE, CAP_DAC_READ_SEARCH File permission bypass
file_ioctl Capability from device driver's DriverVTable Device-specific
bprm_check_security CAP_SETUID, CAP_SETGID setuid/setgid binary execution
ptrace_access_check CAP_SYS_PTRACE Cross-process ptrace
capable Direct capability lookup in TaskCredential General capability gate
socket_create CAP_NET_RAW for raw sockets Network raw access
key_alloc CAP_SYS_ADMIN for kernel keyrings Key management
task_setrlimit CAP_SYS_RESOURCE Resource limit changes
sb_mount CAP_SYS_ADMIN or CAP_MOUNT Mount operations
inode_setattr Ownership + CAP_FOWNER Attribute changes

The complete hook-to-capability mapping (all ~220 hooks) is produced in Phase 2 by the build-time stub generator (see Hook stub generation below). The invariant is: every LSM hook that Linux uses for privilege enforcement maps to exactly one UmkaOS capability check; hooks that only enforce DAC (discretionary access control) map to the TaskCredential uid/gid/mode checks.
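The hook-to-capability mapping can be sketched as a const lookup table. The entries below mirror four rows of the table above; the abbreviated `Cap` enum and the lookup helper are illustrative, not the actual umka-security types:

```rust
/// Abbreviated, hypothetical capability enum for this sketch.
#[derive(Clone, Copy, PartialEq, Debug)]
enum Cap { DacOverride, SysPtrace, NetRaw, SysResource }

/// Const hook-to-capability table (four sample rows).
const LSM_CAP_TABLE: &[(&str, Cap)] = &[
    ("inode_permission",    Cap::DacOverride),
    ("ptrace_access_check", Cap::SysPtrace),
    ("socket_create",       Cap::NetRaw),
    ("task_setrlimit",      Cap::SysResource),
];

/// Look up the capability gating a given LSM hook, if any.
fn cap_for_hook(hook: &str) -> Option<Cap> {
    LSM_CAP_TABLE.iter().find(|(h, _)| *h == hook).map(|&(_, c)| c)
}
```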

Complete LSM Hook Categories (all ~220 hooks, organized by subsystem):

Category Hook Count Key Hooks UmkaOS Mapping
Filesystem / Inode ~45 inode_permission, inode_create, inode_link, inode_unlink, inode_symlink, inode_mkdir, inode_rmdir, inode_mknod, inode_rename, inode_readlink, inode_follow_link, inode_setattr, inode_getattr, inode_setxattr, inode_getxattr, inode_listxattr, inode_removexattr DAC checks + CAP_DAC_OVERRIDE, CAP_FOWNER, CAP_FSETID
File ~15 file_permission, file_alloc_security, file_free_security, file_ioctl, file_mmap, file_mprotect, file_lock, file_fcntl, file_send_sigiotask, file_receive, file_open File capability from device driver; mmap permission check
Superblock / Mount ~10 sb_alloc_security, sb_free_security, sb_copy_data, sb_remount, sb_kern_mount, sb_show_options, sb_statfs, sb_mount, sb_check_sb, sb_umount CAP_SYS_ADMIN or CAP_MOUNT
Task / Process ~25 task_create, task_free, cred_alloc_blank, cred_free, cred_prepare, cred_transfer, task_setuid, task_setgid, task_setpgid, task_getpgid, task_getsid, task_getsecid, task_setnice, task_setioprio, task_getioprio, task_prlimit, task_setrlimit, task_setscheduler, task_getscheduler, task_movememory, task_kill, task_wait_pid TaskCredential checks
Network Socket ~30 socket_create, socket_post_create, socket_bind, socket_connect, socket_listen, socket_accept, socket_sendmsg, socket_recvmsg, socket_getsockname, socket_getpeername, socket_getsockopt, socket_setsockopt, socket_shutdown, socket_sock_rcv_skb, socket_getpeersec_stream, socket_getpeersec_dgram CAP_NET_RAW, CAP_NET_BIND_SERVICE, etc.
IPC ~20 ipc_permission, msg_msg_alloc_security, msg_msg_free_security, msg_queue_alloc_security, msg_queue_free_security, msg_queue_associate, msg_queue_msgctl, msg_queue_msgsnd, msg_queue_msgrcv, shm_alloc_security, shm_free_security, shm_associate, shm_shmctl, shm_shmat, sem_alloc_security, sem_free_security, sem_associate, sem_semctl, sem_semop IPC namespace capability checks
Key / Keyring ~10 key_alloc, key_free, key_permission, key_getsecurity CAP_SYS_ADMIN for kernel keyrings
BPF ~5 bpf, bpf_map, bpf_prog CAP_BPF + verifier trust level
Audit ~5 audit_rule_init, audit_rule_known, audit_rule_match, audit_rule_free auditd integration
Misc ~15 ptrace_access_check, ptrace_traceme, capget, capset, capable, syslog, vm_enough_memory, mmap_addr, mmap_file, quotactl, sysctl Per-capability checks

Hook stub generation (Phase 2): The complete 220-hook stub table is generated by a build-time code generator that reads Linux 6.1 LTS security/security.h hook signatures and produces typed Rust stubs in umka-security/src/lsm/hooks.rs. Each stub does one of the following:

  • Performs a direct capability check (hooks without data-access restrictions).
  • Calls into the active LSM policy module (SELinux/AppArmor) for policy-based decisions.
  • Returns 0 unconditionally (hooks with no security relevance in UmkaOS's model, e.g., bprm_committed_creds).

The hook-to-capability mapping is declared as a const table in umka-security/src/lsm/hooks.rs. LSM hooks are not generated from the .kabi IDL — the KABI IDL is used for driver interface versioning, not for security framework hook dispatch. LSM hooks are invoked directly from the syscall translation layer in umka-compat and from UmkaOS Core at the corresponding kernel-internal operation points (see Section 18.1.4.7a below).

18.1.4.7a LSM Hook Invocation Architecture

UmkaOS implements Linux's LSM hook model for binary compatibility with security modules (AppArmor, SELinux profiles, seccomp filters), using the direct invocation scheme described above rather than .kabi-generated dispatch.

Hook Invocation Points

At each syscall that Linux defines LSM hooks for, the compat syscall handler calls the corresponding UmkaOS security check before executing the operation:

// In the compat syscall dispatcher (umka-compat/src/syscall/fs.rs):
fn sys_open(path: UserPtr<u8>, flags: u32, mode: u32) -> Result<Fd, Errno> {
    let path = copy_path_from_user(path)?;
    let dentry = vfs_lookup(&path)?;

    // LSM security check — equivalent to Linux's security_inode_open()
    // Calls all registered policy providers in order; returns first error.
    umka_core::security::check_open(&current_task().cred, &dentry, flags)?;

    vfs_open(dentry, flags, mode)
}

The security check function (e.g., umka_core::security::check_open) iterates the registered LSM policy provider list in priority order. Each provider returns Ok(()) to permit or Err(Errno) to deny. The first denial short-circuits the chain. Providers are registered at boot time and are immutable at runtime (no dynamic LSM loading after the security namespace is sealed). See Section 8.4 for the full LSM registration API and provider lifecycle.
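The provider-chain semantics above (priority order, first denial short-circuits) can be sketched directly. The trait, the `Errno` alias, and the example provider are simplified stand-ins for the registration API in Section 8.4:

```rust
/// Simplified stand-ins for this sketch.
type Errno = i32;
const EACCES: Errno = 13;

trait LsmProvider {
    fn check_open(&self, path: &str, flags: u32) -> Result<(), Errno>;
}

/// Iterate registered providers in priority order; the first Err denies.
fn security_check_open(
    providers: &[Box<dyn LsmProvider>],
    path: &str,
    flags: u32,
) -> Result<(), Errno> {
    for p in providers {
        p.check_open(path, flags)?; // first denial short-circuits the chain
    }
    Ok(())
}

/// Hypothetical example provider: deny writes under /etc.
struct DenyEtcWrites;
impl LsmProvider for DenyEtcWrites {
    fn check_open(&self, path: &str, flags: u32) -> Result<(), Errno> {
        const O_WRONLY: u32 = 1;
        if path.starts_with("/etc/") && flags & O_WRONLY != 0 {
            return Err(EACCES);
        }
        Ok(())
    }
}
```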

Supported LSM Hook Invocation Points

UmkaOS invokes the LSM hooks required for AppArmor and seccomp compatibility at the following kernel operations:

Hook Kernel operation Security check
check_open Any file open (vfs_open) Path/label access, file flags
check_exec execve / execveat Executable label, capabilities, no-new-privs
check_socket_create socket(2) Domain, type, protocol policy
check_socket_connect connect(2) Destination address, peer label
check_socket_bind bind(2) Port and address policy
check_process_signal kill / tgkill / rt_sigqueueinfo Sender→receiver relationship
check_ptrace ptrace(2) Tracer→tracee relationship
check_ipc_send msgsnd, mq_send IPC endpoint access label
check_mmap mmap(2) with PROT_EXEC Execute permission on anonymous mapping
check_setuid / check_setgid setuid / setgid and variants Privilege escalation policy
check_cap Any CAP_* usage site Capability allowed in task's security context

The hook list matches Linux 6.1 LTS LSM hooks. New hooks are added additively — existing LSM policy modules remain compatible because they only observe hooks they were compiled against (unknown hook calls return Ok(()) by default for unregistered providers).

seccomp-bpf Integration

seccomp filters (BPF programs attached via prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER)) are evaluated before LSM hooks in the syscall dispatch path. This matches Linux's ordering. If seccomp kills or traps the syscall, LSM hooks are not reached.

// Syscall dispatch order in umka-compat/src/entry.rs:
fn dispatch_syscall(ctx: &mut SyscallContext) -> i64 {
    // 1. seccomp filter (BPF, per-thread, before any kernel state is touched)
    if let Err(action) = seccomp_check(ctx) {
        return seccomp_apply_action(action, ctx);
    }
    // 2. LSM pre-checks (capabilities, label policy)
    // (called per-operation inside each syscall handler)

    // 3. Execute syscall
    SYSCALL_TABLE[ctx.nr].handler(ctx)
}

18.1.4.8 Namespaces

All 8 Linux namespace types:

Namespace Purpose Required for
mnt Mount point isolation Containers, chroot
pid Process ID isolation Containers
net Network stack isolation Containers, VPN
ipc IPC resource isolation Containers
uts Hostname/domainname isolation Containers
user UID/GID mapping Rootless containers
cgroup Cgroup hierarchy isolation Containers
time Clock offset isolation Containers

18.1.4.9 Cgroups

  • cgroup v2 as primary implementation (unified hierarchy)
  • cgroup v1 compatibility mode (required for older Docker, systemd < 248)
  • Controllers: cpu, cpuset, memory, io, pids, rdma, hugetlb, misc
  • Required for: systemd resource management, Docker, Kubernetes, OOM handling

18.1.4.10 Cryptographic Random Syscalls

getrandom(2) (x86-64 syscall 318, AArch64 syscall 278, RISC-V 64 syscall 278, Linux 3.17+) returns cryptographically secure random bytes from the kernel CSPRNG. Required for: OpenSSL, glibc's arc4random, systemd's sd-id128, any security library initializing keying material.

UmkaOS implementation: Direct syscall (no tier crossing). Reads from a per-CPU entropy buffer populated at interrupt time via RDRAND (x86-64), RNDR (AArch64), or HTIF entropy source (RISC-V), mixed with timer jitter. HKDF-SHA256 expansion at each call provides forward secrecy — a snapshot of the per-CPU state does not reveal past outputs.

Flag Value Semantics
GRND_NONBLOCK 0x0001 Return EAGAIN instead of blocking if not yet seeded (early boot).
GRND_RANDOM 0x0002 No distinction from default in UmkaOS; always uses the seeded CSPRNG.
GRND_INSECURE 0x0004 Always succeeds; draws from xorshift128+ before CSPRNG is seeded. For early-boot users only (e.g., initramfs randomization). Linux 5.6+.

Return value: number of bytes written (always len unless GRND_NONBLOCK and unseeded). Error: EFAULT if buffer address invalid; EINVAL if unknown flag; EAGAIN if GRND_NONBLOCK and CSPRNG not yet seeded.

Seeding: CSPRNG is marked seeded when the entropy pool has accumulated ≥256 bits of hardware entropy (RDRAND/RNDR output or timer jitter). On systems without hardware RNG, seeding completes after the first 256 IRQs have been processed (jitter entropy). After seeding, getrandom(2) never blocks.
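The flag and seeding semantics above can be sketched as follows. The constants match the table; the entropy sources are stubbed with fixed fill bytes, and the signature (notably the explicit `seeded` parameter) is an assumption of this sketch, not the real handler's:

```rust
const GRND_NONBLOCK: u32 = 0x0001;
const GRND_RANDOM:   u32 = 0x0002;
const GRND_INSECURE: u32 = 0x0004;
const EINVAL: i64 = -22;
const EAGAIN: i64 = -11;

/// Sketch of getrandom(2) flag handling; real output comes from the
/// per-CPU HKDF-SHA256 expansion, stubbed here with fixed bytes.
fn sys_getrandom(buf: &mut [u8], flags: u32, seeded: bool) -> i64 {
    if flags & !(GRND_NONBLOCK | GRND_RANDOM | GRND_INSECURE) != 0 {
        return EINVAL; // unknown flag
    }
    if !seeded && flags & GRND_INSECURE != 0 {
        buf.fill(0xAA); // stand-in for pre-seed xorshift128+ output
        return buf.len() as i64;
    }
    if !seeded {
        if flags & GRND_NONBLOCK != 0 {
            return EAGAIN; // early boot, caller opted out of blocking
        }
        // real kernel: block until the pool reaches 256 bits of entropy
    }
    buf.fill(0x55); // stand-in for seeded CSPRNG output
    buf.len() as i64
}
```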

18.1.5 io_uring Compatibility

Full io_uring support with a security enhancement:

Implementation basis: UmkaOS's io_uring compatibility layer is implemented directly on top of UmkaOS's internal RingBuffer<T> infrastructure from the driver SDK (Section 11.1). The Submission Queue (SQ) and Completion Queue (CQ) rings are RingBuffer<SqEntry> and RingBuffer<CqEntry> instances with their memory laid out to match Linux's io_uring mmap layout exactly — so applications using the mmap-based interface (io_uring_setup → mmap the SQ/CQ rings → submit via the SQ) work unmodified. No separate ring implementation exists; io_uring is a specialization of the same ring infrastructure used throughout UmkaOS at every tier boundary.

  • Same SQE/CQE ring buffer ABI (binary compatible)
  • Same opcodes: all 59 opcodes from Linux 6.19 / 7.0-rc1 (see complete table below)
  • SQPOLL mode (kernel-side submission polling)
  • Registered buffers and registered files (pre-pinned for zero-copy)
  • Fixed files for reduced file descriptor overhead

Supported io_uring Opcodes (complete enumeration; all opcodes from Linux 6.19 / 7.0-rc1):

# Opcode Notes
0 IORING_OP_NOP No-op; tests ring infrastructure
1 IORING_OP_READV Vectored read (preadv2 equivalent)
2 IORING_OP_WRITEV Vectored write (pwritev2 equivalent)
3 IORING_OP_FSYNC fsync(2)
4 IORING_OP_READ_FIXED Read to pre-registered buffer
5 IORING_OP_WRITE_FIXED Write from pre-registered buffer
6 IORING_OP_POLL_ADD Poll fd for I/O readiness
7 IORING_OP_POLL_REMOVE Cancel/update poll
8 IORING_OP_SYNC_FILE_RANGE sync_file_range(2)
9 IORING_OP_SENDMSG sendmsg(2)
10 IORING_OP_RECVMSG recvmsg(2)
11 IORING_OP_TIMEOUT Timer/timeout
12 IORING_OP_TIMEOUT_REMOVE Cancel/update timeout
13 IORING_OP_ACCEPT accept4(2)
14 IORING_OP_ASYNC_CANCEL Cancel in-flight request by user_data
15 IORING_OP_LINK_TIMEOUT Timeout for linked SQE chain
16 IORING_OP_CONNECT connect(2)
17 IORING_OP_FALLOCATE fallocate(2)
18 IORING_OP_OPENAT openat(2)
19 IORING_OP_CLOSE close(2)
20 IORING_OP_FILES_UPDATE Batch-update registered file table
21 IORING_OP_STATX statx(2)
22 IORING_OP_READ pread(2) equivalent (non-vectored)
23 IORING_OP_WRITE pwrite(2) equivalent (non-vectored)
24 IORING_OP_FADVISE posix_fadvise(2)
25 IORING_OP_MADVISE madvise(2)
26 IORING_OP_SEND send(2)
27 IORING_OP_RECV recv(2)
28 IORING_OP_OPENAT2 openat2(2)
29 IORING_OP_EPOLL_CTL epoll_ctl(2)
30 IORING_OP_SPLICE splice(2)
31 IORING_OP_PROVIDE_BUFFERS Register buffer group for recv
32 IORING_OP_REMOVE_BUFFERS Unregister buffer group
33 IORING_OP_TEE tee(2) — duplicate pipe data
34 IORING_OP_SHUTDOWN shutdown(2)
35 IORING_OP_RENAMEAT renameat(2)
36 IORING_OP_UNLINKAT unlinkat(2)
37 IORING_OP_MKDIRAT mkdirat(2)
38 IORING_OP_SYMLINKAT symlinkat(2)
39 IORING_OP_LINKAT linkat(2)
40 IORING_OP_MSG_RING Send message to another io_uring ring
41 IORING_OP_FSETXATTR fsetxattr(2)
42 IORING_OP_SETXATTR setxattr(2)
43 IORING_OP_FGETXATTR fgetxattr(2)
44 IORING_OP_GETXATTR getxattr(2)
45 IORING_OP_SOCKET socket(2)
46 IORING_OP_URING_CMD Per-file/driver command (NVMe passthrough, etc.)
47 IORING_OP_SEND_ZC Zero-copy send (Linux 6.0+)
48 IORING_OP_SENDMSG_ZC Zero-copy sendmsg (Linux 6.0+)
49 IORING_OP_READ_MULTISHOT Multi-completion buffered read
50 IORING_OP_WAITID waitid(2)
51 IORING_OP_FUTEX_WAIT Futex wait (Linux 6.7+)
52 IORING_OP_FUTEX_WAKE Futex wake (Linux 6.7+)
53 IORING_OP_FUTEX_WAITV Wait on multiple futexes (Linux 6.7+)
54 IORING_OP_FIXED_FD_INSTALL Install registered fd into file table (Linux 6.7+)
55 IORING_OP_FTRUNCATE ftruncate(2)
56 IORING_OP_BIND bind(2) — Linux 6.7+
57 IORING_OP_LISTEN listen(2) — Linux 6.7+
58 IORING_OP_PIPE pipe2(2) — Linux 6.7+

UmkaOS implementation note: All 59 opcodes listed above are supported and implemented natively using the internal RingBuffer<T> infrastructure — no listed opcode silently fails. Opcodes may be disabled per-process via io_uring_register(IORING_REGISTER_RESTRICTIONS). Opcodes added in future kernel versions that UmkaOS has not yet implemented return ENOSYS, and the supported set is discoverable via FEAT_OPCODE_LIST.

Advanced io_uring features:

  • Multishot operations (IORING_POLL_ADD_MULTI, multishot accept, multishot recv): single SQE generates multiple CQEs, reducing submission overhead for event-driven servers.
  • Cancellation (IORING_OP_ASYNC_CANCEL): cancel in-flight operations by user_data tag.
  • Linked SQEs (IOSQE_IO_LINK): ordered execution chains.
  • IORING_OP_URING_CMD (passthrough): Driver-specific commands via io_uring. NVMe passthrough (nvme_uring_cmd) works through this path. UmkaOS routes uring_cmd to the KABI driver's command handler, maintaining the same struct nvme_uring_cmd ABI.
  • IORING_REGISTER_RING_FD: ring self-reference for reduced fd overhead.
  • IORING_OP_SEND_ZC / IORING_OP_SENDMSG_ZC: zero-copy network sends.

Security improvement over Linux: Per-instance operation whitelist via capabilities. In Linux, io_uring bypasses syscall-level security monitoring (seccomp, audit, ptrace). UmkaOS allows administrators to restrict which io_uring opcodes are available to each process, addressing this known security gap. The whitelist applies to both standard opcodes and URING_CMD subtypes — an io_uring instance can be restricted to, e.g., read/write only, with NVMe passthrough blocked.
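The per-instance whitelist above amounts to a bitmask indexed by opcode number (a u64 covers opcodes 0-58). The struct and method names below are illustrative; only the deny-by-default-then-opt-in shape is taken from the text:

```rust
/// Hypothetical per-io_uring-instance opcode whitelist.
struct OpcodeWhitelist { allowed: u64 }

impl OpcodeWhitelist {
    /// Start from "nothing allowed" and opt specific opcodes in,
    /// mirroring the IORING_REGISTER_RESTRICTIONS model.
    fn deny_all() -> Self { Self { allowed: 0 } }

    fn allow(&mut self, opcode: u8) {
        if opcode < 64 {
            self.allowed |= 1u64 << opcode;
        }
    }

    /// Checked on every SQE before dispatch.
    fn is_allowed(&self, opcode: u8) -> bool {
        opcode < 64 && self.allowed & (1u64 << opcode) != 0
    }
}
```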

18.1.5.1 io_uring Under SEV-SNP (Confidential Guest Mode)

When UmkaOS runs as a SEV-SNP confidential guest (Section 8.6.4), io_uring's shared memory rings create a conflict: SQE/CQE ring buffers are shared between the kernel and userspace (both within the encrypted guest), but I/O operations require DMA to virtio devices controlled by the hypervisor. The hypervisor cannot access encrypted guest pages, so DMA buffers must be in unencrypted (C-bit clear) shared memory. The SQE/CQE rings themselves remain in encrypted guest memory (both kernel and userspace are inside the same encryption domain), but the I/O data buffers referenced by SQEs require bounce buffering.

Detection: SEV-SNP is detected at boot via CPUID leaf 0x8000001F: EAX bit 1 (SEV) and bit 4 (SEV-SNP). When SEV-SNP guest mode is active, the io_uring subsystem enables the bounce buffer path automatically. No userspace changes are required -- existing io_uring applications run unmodified.

Data path: The guest kernel places I/O requests in the encrypted SQE ring as normal. For operations requiring DMA (block I/O via virtio-blk, network via virtio-net), the kernel copies data to/from an unencrypted bounce buffer (C-bit clear pages, accessible to the hypervisor for DMA). On completion, the kernel copies results from the bounce buffer back into the encrypted guest buffer, then places the CQE in the encrypted CQE ring. The SQE/CQE rings themselves are never exposed to the hypervisor -- only the DMA data payload is bounced.

Application                Guest Kernel              Hypervisor/Host
    |                          |                          |
    |-- submit SQE ---------->|                          |
    |   (encrypted ring)       |                          |
    |                          |-- memcpy to bounce ----->|
    |                          |   (C-bit clear page)     |
    |                          |                          |-- DMA to device
    |                          |                          |-- DMA completion
    |                          |<-- memcpy from bounce ---|
    |                          |   (re-encrypt into       |
    |                          |    guest buffer)          |
    |<-- CQE completion -------|                          |
    |   (encrypted ring)       |                          |

Bounce buffer pool: Pre-allocated at io_uring initialization (not boot), sized to 2x the maximum concurrent io_uring queue depth across all rings on the system. Default sizing: 4096 SQEs x 4 KiB = 16 MiB bounce pool per io_uring instance, capped at 64 MiB system-wide (configurable via /sys/kernel/umka/io_uring/snp_bounce_pool_mb). All bounce buffer pages are marked as shared (C-bit clear) so the hypervisor can DMA to/from them. The pool uses a simple freelist allocator (no slab overhead -- bounce buffers are uniform-sized 4 KiB pages). If the pool is exhausted, io_uring returns -ENOMEM for the SQE and the application retries (same behavior as running out of DMA mapping slots in non-SNP mode).
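The freelist allocator described above is simple enough to sketch in a few lines. The page addresses and pool shape are illustrative; the point is the uniform 4 KiB page size and the None-on-exhaustion behavior that maps to -ENOMEM for the SQE:

```rust
/// Sketch of the uniform-size bounce buffer freelist (4 KiB shared pages,
/// no slab overhead). Addresses are illustrative.
struct BouncePool { free: Vec<u64> } // stack of free C-bit-clear page addresses

impl BouncePool {
    /// Pre-allocate the whole pool at io_uring initialization time.
    fn new(base: u64, pages: usize) -> Self {
        Self { free: (0..pages as u64).map(|i| base + i * 4096).collect() }
    }

    /// Returns None when exhausted — the caller maps this to -ENOMEM
    /// for the SQE and the application retries.
    fn alloc(&mut self) -> Option<u64> { self.free.pop() }

    fn release(&mut self, page: u64) { self.free.push(page); }
}
```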

Performance impact: Each I/O operation requires two additional memcpy operations (submission: guest buffer -> bounce buffer; completion: bounce buffer -> guest buffer). For 4 KiB blocks, each memcpy costs ~0.3-0.5 us (~0.6-1.0 us total per I/O). This is acceptable given that SEV-SNP already imposes 5-15% baseline overhead from memory encryption engine traversal on all memory accesses. The bounce buffer overhead is additive but small relative to the encryption baseline: approximately 1-3% additional overhead for NVMe 4 KiB random I/O workloads (which are already dominated by device latency), and < 1% for sequential large-block I/O (where memcpy is amortized over larger transfers).

Fixed buffers optimization: io_uring_register(IORING_REGISTER_BUFFERS) under SEV-SNP pre-registers persistent bounce buffer mappings for specific user buffers. When an application registers N buffers, the kernel allocates N corresponding bounce buffer slots and establishes a stable mapping. Subsequent I/O operations referencing registered buffer indices use the pre-mapped bounce buffers without per-operation pool allocation/deallocation, amortizing the bounce overhead across multiple operations to the same buffer. This is particularly effective for database workloads that reuse a fixed set of I/O buffers.

Per-buffer encryption policy: For network I/O carrying sensitive payloads (TLS session keys, authentication tokens), applications can request per-buffer AES-GCM encryption/decryption at registration time via IORING_REGISTER_BUFFERS_ENCRYPTED (UmkaOS extension). This adds ~1 us per 4 KiB page (AES-GCM encrypt + MAC) but ensures data in the bounce buffer is ciphertext, not plaintext. This flag is unnecessary for block storage (ciphertext is on disk anyway, and dm-crypt handles encryption above the io_uring layer) but recommended for network buffers in high-security deployments. When this flag is not set, bounce buffer contents are plaintext -- this is acceptable for the SEV-SNP threat model because the hypervisor is already trusted to deliver I/O correctly (it controls the virtio device), and bounce buffers are only exposed for the duration of the DMA operation.

See also: Section 8.6.4 (UmkaOS as confidential guest) for the general SWIOTLB bounce buffer architecture. Section 8.6.8 for SEV-SNP performance characteristics.

18.1.6 Signal Handling

Full POSIX and Linux signal semantics:

  • 64 signals: signals 1-31 (standard) and signals 32-64 (real-time)
  • sigaction with SA_SIGINFO, SA_RESTART, SA_NOCLDSTOP, SA_ONSTACK
  • sigaltstack for alternate signal stacks
  • Per-thread signal masks (pthread_sigmask)
  • Signal delivery by modifying saved register state on the user stack (same mechanism as Linux -- required for correct sigreturn)
  • Proper interaction with: io_uring (signal-driven completion), epoll (EINTR semantics), futex (interrupted waits), nanosleep (remaining time)
  • signalfd for synchronous signal consumption
  • Process groups and session signals (SIGHUP, SIGCONT, SIGSTOP)

18.1.7 cgroups: v2 Native with v1 Compatibility Shim

Linux problem: cgroups v1 had a messy, inconsistent design with separate hierarchies for each controller. v2 fixed this but migration was painful.

UmkaOS design:

  • cgroups v2 only as the native implementation. Single unified hierarchy.
  • Thin v1 compatibility shim: for container runtimes and tools that still use v1 filesystem paths, UmkaOS provides a v1-compatible view that maps onto the v2 backend. It is read/write for the common operations (cpu, memory, io, pids) and read-only or unsupported for obscure v1-only features.
  • Pressure Stall Information (PSI): built into cgroup v2 from the start (not added years later as in Linux).


18.1.7.1 Resource Controllers (Detailed)

Controller | Function | Key Tunables
cpu | CPU bandwidth limiting and proportional sharing | cpu.max, cpu.weight
cpuset | CPU and memory node pinning | cpuset.cpus, cpuset.mems
memory | Memory usage limits and OOM control | memory.max, memory.high, memory.low
io | Block I/O bandwidth and IOPS limiting | io.max, io.weight
pids | Process/thread count limit | pids.max

UmkaOS-specific controllers: accel (Section 21.3.2) and power (Section 6.4.3) follow the same v2 interface conventions.

18.1.7.2 Delegation Model

Non-root processes can manage sub-hierarchies with CAP_CGROUP_ADMIN, which can be scoped to a specific subtree via the capability system (Section 8.1). A container runtime holding CAP_CGROUP_ADMIN(subtree=/sys/fs/cgroup/containers/pod-xyz) can manage cgroups under that path but cannot touch anything outside it.

18.1.7.3 Pressure Stall Information (PSI)

Each cgroup exposes pressure metrics (cpu.pressure, memory.pressure, io.pressure) with 10s/60s/300s averages. PSI supports real-time event notification via poll/epoll triggers. Orchestrators (kubelet, systemd-oomd) use PSI to detect resource saturation before hard limits are hit.

18.1.8 Event Notification (epoll, poll, select)

Linux applications use three generations of event notification. UmkaOS implements all three for compatibility but steers new applications toward io_uring (Section 18.1.5).

18.1.8.1 epoll (Primary)

Full implementation:

  • Syscalls: epoll_create1, epoll_ctl (ADD/MOD/DEL), epoll_wait, epoll_pwait, epoll_pwait2.
  • Trigger modes: Edge-triggered (EPOLLET) and level-triggered (default).
  • Flags: EPOLLONESHOT (auto-disarm), EPOLLEXCLUSIVE (one waiter per event).
  • Internal structure: Red-black tree for monitored fds, ready list for events. epoll_wait drains the ready list — no scanning of all monitored fds.
  • Nested epoll: Cap nesting depth at 4 (matching Linux's EP_MAX_NESTS).

/// Per-epoll-instance state.
pub struct EpollInstance {
    /// Red-black tree of monitored file descriptors.
    pub interests: RBTree<EpollKey, EpollItem>,
    /// Ready list: fds with pending events. Uses an intrusive doubly-linked
    /// list — each `EpollItem` embeds `next`/`prev` pointers, avoiding
    /// per-node heap allocation. This matches Linux's `list_head`-based
    /// ready list in `eventpoll.c`.
    pub ready_list: IntrusiveList<EpollItem>,
    /// Wait queue for threads blocked in epoll_wait.
    pub waiters: WaitQueue,
}

18.1.8.2 poll and select (Legacy)

  • poll: Array of struct pollfd, O(n) per call. No persistent kernel state.
  • select: Bitmap-based, limited to 1024 fds. O(n) scan. POSIX compatibility only.
  • ppoll / pselect: Signal-mask-aware variants.

18.1.8.3 Event-Oriented File Descriptors

  • eventfd: Lightweight inter-thread notification via u64 counter.
  • signalfd: Synchronous signal consumption as fd readability.
  • timerfd: Timer expiry as fd readability, backed by hrtimer infrastructure.

18.1.8.4 Relationship to io_uring

io_uring (Section 18.1.5) supersedes epoll for new high-performance applications. IORING_OP_POLL_ADD provides the same notification within io_uring's unified model.

18.2 Futex and Userspace Synchronization

18.2.1 Futex Implementation

The futex(2) syscall is the kernel-side primitive underlying all userspace synchronization: glibc pthread_mutex_lock, pthread_cond_wait, sem_wait, and C++ std::mutex all compile down to futex operations. Understanding futex is essential because the fast path never enters the kernel at all -- an uncontended lock is a single atomic compare-and-swap on a shared memory word, entirely in userspace. The kernel is only involved when a thread must sleep (FUTEX_WAIT) or wake sleeping threads (FUTEX_WAKE).

UmkaOS implements the following futex operations:

Operation | Description
FUTEX_WAIT | Block if *uaddr == val (avoids lost-wakeup race)
FUTEX_WAKE | Wake up to N waiters on uaddr
FUTEX_WAIT_BITSET | WAIT with 32-bit bitmask for selective wakeup
FUTEX_WAKE_BITSET | WAKE with bitmask (only wake waiters whose mask overlaps)
FUTEX_REQUEUE | Move waiters from one futex to another (condition variables)
FUTEX_CMP_REQUEUE | Requeue with value check (prevents lost wakeups during cond broadcast)
FUTEX_WAKE_OP | Atomic wake + modify (optimizes pthread_cond_signal + mutex_unlock)

The futex wait queue is organized as a hash table keyed by (address_space_id, virtual_address). Each bucket contains a linked list of waiting tasks:

/// Futex hash key. Combines a key kind with an offset to uniquely identify a futex.
///
/// For **private futexes** (the common case, ~99% of mutex uses): the key is
/// (mm_id, page-aligned vaddr, offset within page). The `offset` field is
/// redundant with vaddr's low bits but kept for uniformity with the shared case.
///
/// For **shared futexes** (MAP_SHARED): the key is (physical page frame, offset
/// within page). Both processes sharing the mapping hash to the same bucket and
/// match on the same (PhysFrame, offset) pair, even if their virtual addresses differ.
///
/// **Matching rule**: Two FutexKeys match iff (kind == kind) AND (offset == offset).
/// For Private, kind equality means same mm_id and same vaddr. For Shared, kind
/// equality means same PhysFrame. The offset is ALWAYS part of the match.
pub struct FutexKey {
    kind: FutexKeyKind,
    /// Offset within the 4K page (0..4095). For private futexes, this equals
    /// (vaddr & 0xFFF). For shared futexes, this is the offset into the physical
    /// page. Critical for correctness: multiple futexes on the same page must NOT
    /// collide (they have different offsets).
    offset: u32,
}

pub enum FutexKeyKind {
    /// Private mapping: keyed by (address space, page-aligned virtual address).
    /// The offset field in FutexKey provides the intra-page position.
    Private { mm_id: MmId, vaddr: VirtAddr },
    /// Shared mapping: keyed by physical page frame.
    /// The offset field in FutexKey provides the intra-page position.
    /// This ensures processes mapping the same file/shm at different virtual
    /// addresses still wake each other correctly.
    Shared { page: PhysFrame },
}

/// A futex waiter node, embedded in the Task struct (Section 7.1.1).
/// Uses intrusive singly-linked linking to avoid heap allocation under spinlock.
/// A task can wait on at most one futex at a time (futex_wait is blocking),
/// so a single embedded FutexWaiter per task is sufficient.
///
/// **Why singly-linked**: A doubly-linked intrusive list requires atomically
/// updating both `prev` and `next` pointers on unlink. No single CAS can cover
/// both — a CAS on `next` alone corrupts the `prev` chain, making any
/// lock-free doubly-linked-list unlink unsound in the general case. A
/// singly-linked list with O(n) unlink under the bucket spinlock is correct,
/// simple, and fast in practice: futex bucket contention lists rarely exceed a
/// handful of waiters.
pub struct FutexWaiter {
    /// Intrusive singly-linked list pointer. Null = list end (not in any bucket).
    /// Actual mutation is always performed while holding the owning bucket's spinlock.
    /// `AtomicPtr` (rather than `Option<NonNull<_>>`) makes `FutexWaiter: Sync`,
    /// which is required because a `FutexWaiter` embedded in a `Task` may be
    /// observed from multiple CPUs (e.g., by `futex_exit_cleanup` racing with
    /// `futex_wake` on a different CPU).
    pub next: AtomicPtr<FutexWaiter>,
    /// The futex key this waiter is blocked on (for requeue and wake filtering).
    pub key: FutexKey,
    /// Bitset for FUTEX_WAIT_BITSET selective wakeup (0xFFFF_FFFF = match all).
    pub bitset: u32,
    /// Back-pointer to the owning Task (for wake-up scheduling).
    pub task: *const Task,
    /// Wakeup state. Transitions under the bucket spinlock so the waiter
    /// and the waker agree on who performed the wakeup.
    pub state: WaiterState,
}

// SAFETY: All mutations of `FutexWaiter` fields are performed while holding
// the owning `FutexBucket`'s spinlock. `AtomicPtr` provides the `Sync` bound
// required by the Rust type system; the spinlock provides the actual exclusion.
unsafe impl Sync for FutexWaiter {}

/// Each bucket is protected by its own spinlock — contention is spread
/// across the table rather than funneled through a single lock.
///
/// Waiter lists use an intrusive singly-linked list (not `Vec`) to avoid heap
/// allocation under spinlock. FutexWaiter nodes are embedded in the
/// task struct (Section 7.1.1, `futex_waiter` field). Insertion is O(1) at
/// the head; removal is O(n) linear scan from the head under the bucket lock.
/// This is correct and fast in practice: futex wait lists are rarely longer
/// than a few entries even under heavy concurrent workloads.
///
/// **Lock hierarchy level**: FUTEX_BUCKET (level 0). This is BELOW all scheduler
/// locks so that futex_wake can safely call scheduler::enqueue() while holding
/// a bucket lock. The authoritative scheduler lock ordering from Section 3.1 is:
/// TASK_LOCK (level 1) < RQ_LOCK (level 2) < PI_LOCK (level 3).
/// Futex bucket locks are at level 0, allowing the following valid acquisition:
///   1. Acquire FUTEX_BUCKET (level 0)
///   2. Set waiter.state = Woken
///   3. Unlink waiter from the bucket list (under FUTEX_BUCKET)
///   4. Release FUTEX_BUCKET
///   5. Call scheduler::enqueue() — no bucket lock held; scheduler acquires
///      TASK_LOCK (level 1) — valid: level 0 was already released
///
/// **Unlink BEFORE enqueue**: The unlink step must happen under the bucket lock
/// BEFORE enqueue is called. This prevents futex_exit_cleanup() from seeing a
/// waiter whose state is Woken but which has not yet been unlinked from the list,
/// which would cause a double-unlink. See futex_exit_cleanup() below.
pub struct FutexBucket {
    /// Head of the singly-linked waiter list. `None` when the bucket is empty.
    /// All accesses require holding the spinlock.
    head: Option<NonNull<FutexWaiter>>,
    lock: SpinLock<(), FUTEX_BUCKET>,
}

/// Lock level for futex bucket locks. Below TASK_LOCK (level 1) to allow
/// futex_wake → scheduler::enqueue() without lock ordering violation.
pub const FUTEX_BUCKET: LockLevel = LockLevel(0);

/// Per-NUMA-node futex hash table.
///
/// Why per-NUMA: On a 4-socket NUMA machine, a single 256-bucket global hash
/// table causes cross-NUMA cache line bouncing on every futex_wait/wake. With
/// per-NUMA tables, the hash table spinlock and bucket entries live on the same
/// NUMA node as the waiting CPU (for private futexes) or the physical page
/// (for shared futexes) — no cross-NUMA traffic on the common path.
pub struct FutexNumaNode {
    /// Variable number of buckets per NUMA node (256, 1024, 4096, or 16384
    /// depending on node memory; see `futex_hash_size`), each with its own
    /// spinlock. Allocated with numa_alloc_onnode() — bytes live on this node.
    buckets: Box<[FutexBucket]>,
}

/// Global futex subsystem — one FutexNumaNode per NUMA node.
pub struct FutexSystem {
    /// Indexed by NUMA node ID (0..num_numa_nodes).
    nodes: Vec<FutexNumaNode>,
}

/// How to select the NUMA node for a futex operation:
///
/// **Shared futexes** (key is physical_page + offset):
///   node = physical_page.numa_node()
///   → Both futex_wait and futex_wake resolve the physical page → same NUMA node
///   → No cross-node ambiguity even when waker and waiter are on different nodes
///
/// **Private futexes** (key is mm + vaddr, FUTEX_PRIVATE_FLAG set):
///   futex_wait:  node = mm.owner_numa_node()
///   futex_wake:  node = mm.owner_numa_node()
///   → Both sides compute the NUMA node from the mm's owner (the process's
///     primary thread group NUMA affinity), which is deterministic and the
///     same for any thread in the process, regardless of which CPU issues
///     the wait or wake. Cross-node misses are minimized for processes whose
///     threads run on the mm's home NUMA node. Processes with threads spanning
///     multiple NUMA nodes may experience cross-node hash misses on the futex
///     bucket lookup, but correctness is unaffected — the hash is deterministic
///     and both wait and wake always resolve to the same node.
///   → This is an UmkaOS improvement over the naive per-CPU NUMA selection used
///     in some Linux configurations, which can cause lost wakeups when waiter
///     and waker run on CPUs in different NUMA nodes.

impl FutexSystem {
    fn select_node_shared(physical_page: PhysPage) -> usize {
        physical_page.numa_node()
    }

    fn select_node_private(mm: &MemoryMap) -> usize {
        // Use home node of the mm's primary thread group
        mm.owner_numa_node()
    }

    fn bucket_index(key: &FutexKey, buckets: usize) -> usize {
        // buckets is always a power of 2 (from futex_hash_size()), so use bitmasking
        // instead of modulo for O(1) distribution. The caller passes
        // node.futex_buckets.len() so the index is always in-range for that
        // node's actual table, which varies from 256 to 16384 depending on
        // per-node memory (see futex_hash_size()). Passing a fixed constant
        // here would silently ignore 75–99% of buckets on large NUMA nodes.
        debug_assert!(buckets.is_power_of_two());
        let h = key.hash();
        (h ^ (h >> 8)) as usize & (buckets - 1)
    }
}

/// Futex hash table sizing (buckets per NUMA node). Scaled at boot based on
/// per-node memory:
/// - ≤1 GB: 256 buckets
/// - ≤16 GB: 1024 buckets
/// - ≤256 GB: 4096 buckets
/// - >256 GB: 16384 buckets
/// This matches Linux's scaling heuristic (futex_init in kernel/futex/core.c),
/// applied independently per NUMA node so large nodes get proportionally more
/// buckets while small nodes don't waste memory.
pub const fn futex_hash_size(node_memory_bytes: usize) -> usize {
    match node_memory_bytes {
        0..=0x4000_0000 => 256,                  // ≤1 GB
        0x4000_0001..=0x4_0000_0000 => 1024,     // ≤16 GB
        0x4_0000_0001..=0x40_0000_0000 => 4096,  // ≤256 GB
        _ => 16384,                         // >256 GB
    }
}

Design note: UmkaOS's futex hash table scales with NUMA node memory (256–16,384 buckets per node, selected at boot based on available memory). This is intentionally superior to the historical Linux fixed-256-bucket design that was a known DoS vector (exploited via hash collision floods). That flaw was partially addressed in Linux 3.13; UmkaOS's adaptive design eliminates the bottleneck entirely by construction.

Why Linux didn't do this: Linux's futex code predates widespread NUMA awareness (2002). The 256-bucket global table was later expanded to min(256 × cpus, 8192) but remained global. The physical-page-to-NUMA-node lookup adds a page table walk on every futex operation — Linux considered this overhead not worth the benefit. UmkaOS implements it from the start (no legacy constraint) and the lookup is O(1) via the page's embedded numa_node field in PhysPage.

The hash table size per NUMA node is determined at boot based on that node's available memory (see futex_hash_size above). FUTEX_WAIT atomically checks *uaddr == val while holding the bucket lock, closing the race window between the userspace check and the kernel enqueue. The NUMA node is selected before acquiring any lock: shared futexes use physical_page.numa_node(), private futexes use mm.owner_numa_node() (deterministic, same node for both wait and wake).

Task exit unlink: When a task exits while in a futex wait queue, it acquires the bucket spinlock and removes itself via a linear scan from the head. This is the same approach as Linux (hash_bucket->lock in kernel/futex/core.c) and is correct by construction: the spinlock serializes all concurrent wait/wake/exit operations on the same bucket.

/// Waiter lifecycle state. Transitions are made under the owning bucket's
/// spinlock so that futex_wake() and futex_exit_cleanup() cannot race.
pub enum WaiterState {
    /// Inserted in the bucket's wait list; the task is blocked.
    Waiting,
    /// futex_wake() has selected this waiter, unlinked it from the bucket list,
    /// and called (or is about to call) scheduler::enqueue(). Both the state
    /// transition and the unlink happen under the bucket spinlock; enqueue()
    /// is called after releasing the lock.
    Woken,
}

/// Called from the task exit path when the task may be sitting in a futex
/// wait queue. Acquires the bucket spinlock, removes the waiter node from
/// the singly-linked list (O(n) scan from head), and checks state to
/// detect a concurrent futex_wake() that has already selected this waiter.
///
/// **Race safety**: futex_wake() sets state = Woken AND unlinks the waiter
/// from the list under the bucket spinlock, then releases the lock BEFORE
/// calling scheduler::enqueue(). Therefore, when futex_exit_cleanup() acquires
/// the bucket lock, the waiter is either:
///   (a) still in the list with state == Waiting  → exit cleanup unlinks it, or
///   (b) already unlinked with state == Woken     → exit cleanup does nothing.
/// There is no window where state == Woken but the node is still in the list.
///
/// Unlink algorithm:
///   1. Acquire bucket.lock (spinlock).
///   2. If waiter.state == Woken: another CPU already unlinked us (under the
///      bucket lock) and will call scheduler::enqueue() after releasing it.
///      Release lock and consume the wakeup — no unlink needed.
///   3. Otherwise (Waiting): walk the singly-linked list from bucket.head,
///      find the predecessor whose next == &waiter, set predecessor.next =
///      waiter.next (or update bucket.head if we are the first node).
///   4. Null out waiter.next to leave the node in a clean state.
///   5. Release bucket.lock.
fn futex_exit_cleanup(bucket: &mut FutexBucket, waiter: &mut FutexWaiter) {
    let _guard = bucket.lock.lock();

    if matches!(waiter.state, WaiterState::Woken) {
        // futex_wake() already unlinked us and scheduled a wakeup.
        // Nothing left to do — the wakeup is consumed by the exit itself.
        return;
    }

    let target: *mut FutexWaiter = waiter;

    // Linear scan to find and splice out this waiter node. All `next` accesses
    // use Relaxed ordering: the bucket spinlock held above provides the real
    // synchronization, and `AtomicPtr` exists only for the `Sync` bound.
    // SAFETY: all pointers in the list are valid FutexWaiter nodes embedded
    // in live Task structs; the bucket spinlock prevents concurrent mutation.
    unsafe {
        let Some(first) = bucket.head else {
            debug_assert!(false, "futex_exit_cleanup: bucket list is empty");
            return;
        };
        if first.as_ptr() == target {
            // We are the head node: advance the head and clear our link.
            let next = (*target).next.swap(core::ptr::null_mut(), Ordering::Relaxed);
            bucket.head = NonNull::new(next);
            return;
        }
        // Walk the chain looking for the predecessor whose next == target.
        let mut prev = first.as_ptr();
        loop {
            let next = (*prev).next.load(Ordering::Relaxed);
            if next.is_null() {
                debug_assert!(false, "futex_exit_cleanup: waiter not in bucket list");
                return;
            }
            if next == target {
                // Splice ourselves out and leave our node in a clean state.
                let after = (*target).next.swap(core::ptr::null_mut(), Ordering::Relaxed);
                (*prev).next.store(after, Ordering::Relaxed);
                return;
            }
            prev = next;
        }
    }
}

The bucket spinlock is already acquired for every futex_wait and futex_wake operation, so acquiring it on task exit adds no new lock ordering concern (level 0, below TASK_LOCK at level 1). Futex wait lists are short in practice — rarely more than a handful of waiters per bucket even under JVM or Go runtime thread-heavy workloads — so the O(n) scan adds negligible cost to an already-infrequent per-task-exit operation.

18.2.2 Priority-Inheritance Futexes (PI)

Linux problem: Priority inversion occurs when a high-priority RT task blocks on a mutex held by a low-priority task, while a medium-priority task preempts the lock holder indefinitely. Without intervention, the RT task's latency becomes unbounded.

UmkaOS design: FUTEX_LOCK_PI and FUTEX_UNLOCK_PI implement kernel-mediated priority inheritance. When an RT task (priority 99) blocks on a PI futex held by a normal task (nice 0), the kernel temporarily boosts the lock holder to priority 99 so it can complete its critical section without being preempted by medium-priority work.

PI chain tracking handles transitive dependencies: if task A (priority 99) waits on a lock held by B (priority 50), and B waits on a lock held by C (priority 10), the kernel walks the chain and boosts C to priority 99. The chain walk is bounded by a compile-time limit (default: 1024 entries) to prevent runaway traversal.

Deadlock detection falls out naturally: if the chain walk encounters the requesting task again (A waits on B waits on A), the kernel returns EDEADLK immediately rather than creating a circular dependency.

PI boosting integrates with all three scheduler classes (Section 6.1): an EEVDF task can be temporarily boosted into the RT class, and a Deadline task's runtime budget is respected even when boosted. When the lock holder releases the PI futex, its effective priority reverts to the highest priority among any remaining PI dependencies (or its base priority if none remain).

18.2.3 Robust Futexes

Linux problem: If a thread crashes or is killed while holding a futex-based mutex, every other thread waiting on that futex blocks forever. The kernel has no way to know the dead thread held the lock because, in the normal case, the kernel never sees the lock/unlock at all (it is purely userspace).

UmkaOS design (same mechanism as Linux): Each thread maintains a userspace linked list of currently held robust futex locks. The head of this list is registered with the kernel via set_robust_list(). On thread exit (voluntary or involuntary), the kernel walks the robust list and for each entry:

  1. Sets the FUTEX_OWNER_DIED bit (bit 30) in the futex word.
  2. Performs a FUTEX_WAKE on that address, waking one waiter.
  3. The woken thread sees FUTEX_OWNER_DIED, knows the lock state may be inconsistent, and can run recovery logic (or simply re-acquire the lock, clearing the bit).

The robust list walk is bounded (default: 2048 entries) to prevent a malicious thread from pointing the kernel at an enormous or circular list.

18.2.4 futex2 (FUTEX_WAITV)

Linux problem: The original futex(2) can only wait on a single address at a time. Waiting on multiple synchronization objects simultaneously required workarounds like polling threads or epoll-over-eventfd bridges -- all of which added latency and complexity.

UmkaOS design: The futex_waitv() syscall (Linux 5.16+) is supported from day one rather than retrofitted. It accepts an array of (uaddr, val, flags) tuples and blocks until any one of them is triggered:

/// Matches Linux's `struct futex_waitv` (include/uapi/linux/futex.h).
/// The `uaddr` field is a u64 (not a pointer) to match the Linux ABI exactly —
/// this allows 32-bit processes on 64-bit kernels to pass 32-bit addresses
/// without sign-extension issues. The kernel validates the address and
/// interprets it as a `*const AtomicU32` internally.
pub struct FutexWaitv {
    pub val: u64,
    pub uaddr: u64,   // User virtual address (validated by kernel)
    pub flags: u32,    // FUTEX_32, FUTEX_PRIVATE_FLAG, etc.
    pub __reserved: u32,  // Must be zero (Linux ABI compatibility)
}

/// Block until any of the N futex addresses is woken or has a value mismatch.
/// Returns the index of the triggered futex, or -ETIMEDOUT, or -ERESTARTSYS.
pub fn sys_futex_waitv(
    waiters: &[FutexWaitv],
    flags: u32,
    timeout: Option<&Timespec>,
    clockid: ClockId,
) -> Result<usize, Errno> { ... }

Primary consumers:

  • Wine/Proton: Windows WaitForMultipleObjects maps directly to futex_waitv, enabling efficient game synchronization without per-object polling threads.
  • Event-driven runtimes: Any pattern where a thread must wait on several independent conditions (e.g., "data ready OR shutdown requested OR timeout").

18.2.5 Cross-Domain Futex Considerations

Standard futex implementations assume a single kernel address space. UmkaOS's isolation domains (MPK on x86-64, POE on AArch64, DACR on ARMv7, page-table isolation on RISC-V 64, PPC32, and PPC64LE) introduce a cross-domain shared-memory scenario that does not exist in Linux.

Shared-memory futex keying: When two processes (or a process and a Tier 1 driver) share memory via MAP_SHARED, the futex key must be the physical address (page frame + offset), not the virtual address, because each domain may map the region at a different virtual address. The FutexKeyKind::Shared variant (Section 18.2.1) handles this case. Both sides of the mapping hash to the same wait queue bucket, so FUTEX_WAKE from one domain correctly wakes a waiter in the other.

Capability validation: Before performing any futex operation on a shared mapping, the kernel verifies that the calling domain holds a valid capability to the underlying shared memory region. A FUTEX_WAIT or FUTEX_WAKE on an address the caller cannot legitimately access returns EFAULT. This prevents a compromised domain from probing or waking arbitrary futex wait queues in other domains.

MPK interaction (x86-64): The futex word must reside in a page whose PKEY is accessible to both participating domains. In practice, this means the shared memory region is assigned to PKEY 1 (shared read-only descriptors) or PKEY 14 (shared DMA buffer pool), as defined in Section 10.2. BPF cross-domain futexes use the BPF domain key (default PKEY 2). The kernel reads and modifies the futex word from PKEY 0 (UmkaOS Core), which always has full read/write access to all domains — so the kernel-side atomic comparison and wake are never blocked by MPK permissions, even if the calling domain's PKRU restricts access to other keys.

Architecture | Isolation mechanism | Futex cross-domain access method
x86-64 | MPK (PKEY 0-15) | Kernel operates as PKEY 0; shared region on PKEY 1 or 14
AArch64 | POE | Kernel accesses futex word via privileged overlay permission
ARMv7 | DACR | Kernel sets domain manager mode for shared page access
RISC-V 64 | Page-table isolation | Kernel maps shared page into supervisor address space
PPC32 | Segment registers | Kernel maps shared segment with supervisor key access
PPC64LE | Radix PID / HPT | Kernel accesses futex word via hypervisor-privileged mapping

18.2.6 UmkaOS Simplified Futex API

The Linux futex(2) syscall multiplexes 15+ operations through a single syscall number, with error semantics that vary by operation and a confusing val/val2/val3 triple whose meaning changes per operation. UmkaOS provides a clean single-operation API alongside futex(2) for backward compatibility.

New UmkaOS futex syscalls:

// Wait: atomically check *uaddr == expected, then sleep until woken or timeout.
// Returns 0 on wake, -ETIMEDOUT on timeout, -EAGAIN if *uaddr != expected.
long futex_wait(uint32_t *uaddr, uint32_t expected,
                const struct timespec *timeout,  // NULL = wait forever
                uint32_t flags);                 // FUTEX_PRIVATE_FLAG supported

// Wake: wake up to `count` waiters on uaddr. Returns number actually woken.
long futex_wake(uint32_t *uaddr, uint32_t count, uint32_t flags);

// Requeue: wake `wake_count` waiters on uaddr1, move `requeue_count` waiters
// to uaddr2 (for condition variable broadcast without thundering herd).
// Returns number of tasks woken + requeued.
long futex_requeue(uint32_t *uaddr1, uint32_t *uaddr2,
                   uint32_t wake_count, uint32_t requeue_count,
                   uint32_t flags);

x86-64 syscall numbers (UmkaOS-specific):

// UmkaOS custom syscalls start at 1024.
// Current Linux maximum: ≥456 (as of Linux 6.7). UmkaOS extensions start at 1024,
// providing a ≥568-syscall safety buffer — generous headroom for the indefinite future
// even at Linux's historical rate of ~5-10 new syscalls per release cycle.
// Linux's native futex2 syscalls (454-456) are handled transparently by the compat layer
// (see "Linux native futex2 compatibility" note below); they do not conflict with UmkaOS
// extensions at 1024+.

Syscall | Number | Notes
futex_wait | 1024 | Relative timeout only; use futex_wait_abs for absolute
futex_wake | 1025 |
futex_requeue | 1026 |
futex_wait_abs | 1027 | Absolute timeout with explicit clockid_t
futex_wait_pi | 1028 | Priority-inheritance wait
futex_wake_pi | 1029 | Priority-inheritance wake

Numbers chosen above 1023 to provide generous long-term headroom beyond any foreseeable Linux syscall growth (Linux 6.7's highest assigned number is 456).

Linux native futex2 compatibility: Linux 6.7 introduced native futex_wake(2) (syscall 454), futex_wait(2) (syscall 455), and futex_requeue(2) (syscall 456) with semantics similar to (but not identical to) UmkaOS's extended interface. The UmkaOS compat layer handles these Linux-native syscall numbers transparently, routing them to the same FutexSystem implementation. UmkaOS's own extended futex interface (syscalls 1024-1029) provides additional features — absolute timeouts (futex_wait_abs), priority-inheritance operations (futex_wait_pi, futex_wake_pi) — as a superset. New UmkaOS applications should use the UmkaOS interface (1024+); applications ported from Linux use 454-456 unchanged through the compat layer.

Differences from futex(2) that matter:

  • timeout is always a struct timespec relative duration (no FUTEX_CLOCK_REALTIME confusion). For absolute timeout: futex_wait_abs(uaddr, expected, clockid, abstime, flags) is a separate syscall (syscall 1027).
  • Return values are unambiguous: only {0, -ETIMEDOUT, -EAGAIN, -EFAULT, -EINVAL}.
  • No val2/val3 overloading — each operation has exactly the parameters it needs.
  • Priority inheritance: futex_wait_pi / futex_wake_pi as separate syscalls (1028/1029).

Internal routing: futex_wait / futex_wake / futex_requeue use the same FutexSystem (per-NUMA hash table, Section 18.2.1) as the compat futex(2) syscall. A waiter in the new API can be woken by a wake in the old API on the same address — they share the same hash bucket.

Linux compatibility: futex(2) (syscall 202 on x86-64) is fully supported and routes to the same implementation. New UmkaOS applications should prefer futex_wait / futex_wake for clarity; existing applications use futex(2) unchanged.


18.3 Netlink Compatibility

UmkaOS's native event system (Section 6.6, umka-core) delivers events via capability-gated ring buffers. For compatibility with existing Linux tools that use netlink sockets, umka-compat provides translation layers for the following netlink protocol families:

Netlink Family | Purpose | Key Consumers
NETLINK_KOBJECT_UEVENT | Device hotplug events | udev, systemd, mdev
NETLINK_ROUTE | Network interface and routing events | iproute2 (ip), NetworkManager, systemd-networkd
NETLINK_AUDIT | Security audit events | auditd, systemd-journald
NETLINK_CONNECTOR | Process events (fork, exec, exit) | systemd, process accounting
NETLINK_NETFILTER | Firewall logging and conntrack | iptables logging, conntrack-tools
NETLINK_GENERIC | Generic netlink (nl80211 WiFi, team, devlink, ethtool) | wpa_supplicant, NetworkManager, iw, hostapd, ethtool

Architecture: Each netlink family is handled by a dedicated translator in umka-compat:

  1. Process opens a netlink socket (socket(AF_NETLINK, SOCK_DGRAM, protocol)).
  2. umka-compat intercepts the socket creation and bind(), registering the process with the appropriate UmkaOS event channel.
  3. When the kernel posts a native UmkaOS event, the translator converts it to the Linux netlink message format and writes to the socket buffer.
  4. Process reads netlink messages via recvmsg().
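The wire format the translator must emit in step 3 for hotplug is the kernel uevent datagram: an "ACTION@DEVPATH" header followed by NUL-terminated KEY=VALUE pairs. A minimal encoder of that layout:

```rust
/// Kernel-format uevent datagram as read from a NETLINK_KOBJECT_UEVENT
/// socket: "ACTION@DEVPATH\0KEY=VALUE\0...". libudev layers its own
/// framing on top of this; udev/mdev parse exactly this layout.
pub fn build_uevent(action: &str, devpath: &str, env: &[(&str, &str)]) -> Vec<u8> {
    let mut buf = Vec::new();
    buf.extend_from_slice(action.as_bytes());
    buf.push(b'@');
    buf.extend_from_slice(devpath.as_bytes());
    buf.push(0); // header is NUL-terminated
    for (key, val) in env {
        buf.extend_from_slice(key.as_bytes());
        buf.push(b'=');
        buf.extend_from_slice(val.as_bytes());
        buf.push(0); // each KEY=VALUE pair is NUL-terminated
    }
    buf
}
```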

NETLINK_KOBJECT_UEVENT: udev and systemd use this family for device hotplug. Example translation:

UmkaOS Event:
  event_type = UsbDeviceChanged
  data.usb = { vid=0x1234, pid=0x5678, inserted=true }

Netlink message:
  ACTION=add
  DEVPATH=/devices/pci0000:00/0000:00:14.0/usb1/1-1
  SUBSYSTEM=usb
  DEVTYPE=usb_device
  PRODUCT=1234/5678/100

NETLINK_ROUTE: NetworkManager, iproute2, and systemd-networkd use this family for link state and address changes. The Tier 1 network stack (Section 15.1) posts native events that umka-compat translates:

Push path (kernel → userspace, event notifications):

  • RTM_NEWLINK / RTM_DELLINK: Interface added/removed
  • RTM_NEWADDR / RTM_DELADDR: IP address added/removed
  • RTM_NEWROUTE / RTM_DELROUTE: Routing table changes
  • RTM_NEWNEIGH / RTM_DELNEIGH: ARP/NDP neighbor cache updates

Pull path (userspace → kernel, request/response queries): Userspace tools (ip route show, ip link show, ip addr show) send netlink request messages and expect reply messages. umka-compat handles these by:

  1. Process sends a RTM_GET* request via sendmsg() on the netlink socket.
  2. umka-compat parses the netlink message header (struct nlmsghdr), extracts the request type and filter attributes (ifindex, prefix, family, etc.).
  3. umka-compat queries the Tier 1 network stack's internal state via the inter-domain ring (e.g., umka_net::get_routes(family, table)) and constructs netlink reply messages with the standard NLM_F_MULTI flag for dump responses, terminated by NLMSG_DONE.
  4. Reply messages are written to the socket's receive buffer for recvmsg().

Supported request types: RTM_GETLINK, RTM_GETADDR, RTM_GETROUTE, RTM_GETNEIGH, RTM_GETRULE, RTM_GETQDISC. Dump mode (NLM_F_DUMP) iterates the full table; non-dump mode returns a single matching entry.
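Dump replies follow the standard netlink framing: a 16-byte nlmsghdr, payloads padded to 4-byte alignment, NLM_F_MULTI on every part, and an NLMSG_DONE terminator. A minimal encoder of that framing (the encoding of the individual RTM_* payload structs is omitted):

```rust
pub const NLMSG_HDRLEN: usize = 16; // sizeof(struct nlmsghdr)
pub const NLM_F_MULTI: u16 = 0x2;
pub const NLMSG_DONE: u16 = 0x3;

fn align4(n: usize) -> usize {
    (n + 3) & !3 // NLMSG_ALIGNTO = 4
}

/// Append one netlink message (16-byte header + 4-byte-aligned payload).
pub fn put_msg(out: &mut Vec<u8>, ty: u16, flags: u16, seq: u32, payload: &[u8]) {
    let len = (NLMSG_HDRLEN + payload.len()) as u32;
    out.extend_from_slice(&len.to_ne_bytes());   // nlmsg_len (unpadded length)
    out.extend_from_slice(&ty.to_ne_bytes());    // nlmsg_type
    out.extend_from_slice(&flags.to_ne_bytes()); // nlmsg_flags
    out.extend_from_slice(&seq.to_ne_bytes());   // nlmsg_seq (echoed from request)
    out.extend_from_slice(&0u32.to_ne_bytes());  // nlmsg_pid (kernel = 0)
    out.extend_from_slice(payload);
    let aligned = align4(out.len());
    out.resize(aligned, 0); // inter-message padding
}

/// Dump reply: every part carries NLM_F_MULTI, terminated by NLMSG_DONE
/// (whose payload is an int error code, 0 on success).
pub fn encode_dump_reply(entry_type: u16, seq: u32, entries: &[Vec<u8>]) -> Vec<u8> {
    let mut out = Vec::new();
    for entry in entries {
        put_msg(&mut out, entry_type, NLM_F_MULTI, seq, entry);
    }
    put_msg(&mut out, NLMSG_DONE, NLM_F_MULTI, seq, &0i32.to_ne_bytes());
    out
}
```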

Generic Netlink is a multiplexed netlink protocol (family 16) that allows kernel subsystems to register named sub-protocols ("generic netlink families") without consuming a dedicated netlink protocol number. It is the transport for nl80211 (WiFi management), team (NIC teaming), devlink (device management), ethtool (NIC configuration), and many other subsystems.

Sub-families implemented:

Generic Netlink Family Operations Consumers
nl80211 NL80211_CMD_*: scan, connect, disconnect, roam, set_station, get_station, set_reg, set_power_save, get_wiphy, trigger_scan wpa_supplicant, NetworkManager, iw, hostapd, wpa_cli
devlink DEVLINK_CMD_*: get, port_get, sb_get, param_get, health_reporter_get devlink tool, mlxconfig
ethtool ETHTOOL_MSG_*: strset_get, linkinfo_get, linkmodes_get, linkstate_get, rings_get, channels_get ethtool, NetworkManager

Architecture: NETLINK_GENERIC uses the same socket infrastructure as other netlink families. On socket(AF_NETLINK, SOCK_DGRAM, NETLINK_GENERIC):

  1. umka-compat registers a NETLINK_GENERIC socket.
  2. The process resolves sub-family IDs via CTRL_CMD_GETFAMILY (e.g., resolving the "nl80211" string to its runtime-assigned family ID number).
  3. umka-compat routes NLM_F_REQUEST messages to the appropriate sub-family handler (nl80211 handler → WirelessDriver KABI; devlink → DevlinkVTable KABI).
  4. Unsolicited events are delivered via multicast groups (e.g., the nl80211 config / mlme / scan groups).
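The registry behind CTRL_CMD_GETFAMILY resolution can be sketched as a name-to-ID map, with runtime IDs assigned above the controller family (GENL_ID_CTRL = 0x10). The GenlRegistry type is illustrative, not the actual umka-compat structure:

```rust
use std::collections::BTreeMap;

/// Generic-netlink controller family ID (= NLMSG_MIN_TYPE); sub-family IDs
/// are assigned at runtime above it.
pub const GENL_ID_CTRL: u16 = 0x10;

/// Illustrative registry behind CTRL_CMD_GETFAMILY resolution.
pub struct GenlRegistry {
    by_name: BTreeMap<String, u16>,
    next_id: u16,
}

impl GenlRegistry {
    pub fn new() -> Self {
        GenlRegistry { by_name: BTreeMap::new(), next_id: GENL_ID_CTRL + 1 }
    }

    /// Register a sub-family (idempotent); returns its runtime ID.
    pub fn register(&mut self, name: &str) -> u16 {
        if let Some(&id) = self.by_name.get(name) {
            return id;
        }
        let id = self.next_id;
        self.next_id += 1;
        self.by_name.insert(name.to_string(), id);
        id
    }

    /// CTRL_CMD_GETFAMILY: resolve "nl80211" etc. to its runtime ID.
    pub fn get_family(&self, name: &str) -> Option<u16> {
        self.by_name.get(name).copied()
    }
}
```

This is why wpa_supplicant must always resolve "nl80211" at startup instead of hardcoding an ID: the number depends on registration order.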

  • NETLINK_AUDIT: Translated from UmkaOS's audit events (Section 8.4 IMA) for auditd.
  • NETLINK_CONNECTOR: Translated from process lifecycle events (Section 7.1) for cn_proc.
  • NETLINK_NETFILTER: Translated from nftables/conntrack events (Section 18.1.4) for firewall logging.

18.4 Windows Emulation Acceleration (WEA)

Wine and Proton emulate Windows NT kernel behavior in userspace. This subsystem provides kernel-level NT-compatible primitives that Wine/Proton can use directly, bypassing userspace emulation and achieving better correctness and performance.

Key insight: UmkaOS doesn't need to implement Windows syscalls directly. Instead, provide kernel-level primitives that make WINE/Proton faster, more correct, and easier to maintain.

Problem: WINE (and Proton) must emulate Windows NT kernel behavior in userspace on top of POSIX/Linux syscalls. This creates:

  • Performance overhead: Multiple syscalls to emulate one Windows operation
  • Semantic mismatches: Linux primitives don't map 1:1 to Windows primitives
  • Correctness issues: WINE's userspace emulation can't perfectly replicate kernel-level Windows behavior
  • Complexity: WINE's ntdll.dll is ~50K lines of Windows kernel emulation code

UmkaOS's opportunity: Provide a Windows NT-compatible object model as a kernel subsystem that WINE can use directly, bypassing userspace emulation.

18.4.1 Capability Gating

WEA syscalls (operation codes 0x0800-0x08FF) require the CAP_WEA capability. This capability:

  • Is NOT granted by default — only processes that explicitly request WEA support receive it.
  • Can be scoped to a specific NT namespace subtree (e.g., CAP_WEA(namespace=/WINE-prefix-1)).
  • Provides container isolation: each container (or WINE prefix) has its own \BaseNamedObjects\ subtree. A process with CAP_WEA(namespace=/containers/abc) cannot access objects in /containers/def.

Without CAP_WEA, WEA syscalls return -EPERM. This prevents non-WINE processes from interacting with the NT object namespace and ensures WEA's attack surface is opt-in.
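The namespace scoping described above reduces to a component-wise prefix check. A minimal sketch (the function name is illustrative):

```rust
/// Does a CAP_WEA capability scoped to `scope` authorize `object_path`?
/// `None` models an unscoped CAP_WEA covering the whole NT namespace.
/// The match is component-wise: "/containers/abc" covers its subtree but
/// NOT the sibling "/containers/abcd".
pub fn wea_scope_allows(scope: Option<&str>, object_path: &str) -> bool {
    match scope {
        None => true,
        Some(prefix) => {
            object_path == prefix
                || (object_path.starts_with(prefix)
                    // next byte after the prefix must be a path separator
                    && object_path.as_bytes().get(prefix.len()) == Some(&b'/'))
        }
    }
}
```

The byte-after-prefix check is what prevents a capability for /containers/abc from leaking into /containers/abcd, the cross-container squatting case the security descriptor's container_id also guards against.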


18.4.2 NT Object Manager

Windows NT kernel concept: Everything is an object (files, processes, threads, events, mutexes, semaphores, sections). Objects live in a hierarchical namespace (\Device\, \Driver\, \BaseNamedObjects\, etc.).

Current WINE approach: Emulates NT objects in userspace. A server process (wineserver) manages object lifetimes, handles, and waits, with high overhead for cross-process object sharing.

UmkaOS WEA approach: Kernel-native NT object manager alongside POSIX VFS.

/// NT Object Manager (lives in umka-compat crate)
pub struct NtObjectManager {
    /// Root of the hierarchical namespace (e.g., `\BaseNamedObjects\MyEvent`).
    ///
    /// Each `NtDirectory` contains its own per-directory RwLock. Path traversal
    /// acquires the lock at each directory level and releases the parent before
    /// descending — at most one directory lock is held at any time (no lock
    /// ordering issues between directories). This means operations on different
    /// subtrees (`\BaseNamedObjects\` vs `\Device\`) never contend.
    ///
    /// **Lock hierarchy**: WEA locks are in a separate "leaf" category that does not
    /// call scheduler code while held. The NT namespace and object locks may call
    /// allocator or capability code but NOT scheduler::enqueue(). This means:
    ///   - NT_NAMESPACE and NT_OBJECT locks do NOT need to be ordered relative to
    ///     scheduler locks (TASK_LOCK, RQ_LOCK, PI_LOCK).
    ///   - They DO need ordering relative to each other: NT_NAMESPACE < NT_OBJECT.
    ///   - They use a separate lock category (WEA_LOCKS) that is incompatible with
    ///     scheduler locks — holding any WEA lock while holding any scheduler lock
    ///     (or vice versa) is a compile-time error.
    ///
    /// Wait operations (WaitForSingleObject, WaitForMultipleObjects) release all
    /// NT object locks before calling scheduler::sleep(). Wake operations
    /// (SetEvent, ReleaseMutex) mark the waiter as ready, then release NT object
    /// locks, then call scheduler::wake() WITHOUT holding NT locks.
    ///
    /// This "release-before-schedule" pattern is identical to how futex_wake works.
    root: Arc<NtDirectory>,

    /// Per-process NT handle tables (lazily allocated on first WEA syscall to
    /// avoid ~1.5 MB overhead for non-WEA processes).
    ///
    /// **Memory model**: The `Option<Box<>>` wrapper ensures non-WEA processes
    /// (the vast majority in container environments) pay exactly zero memory cost.
    /// Only processes that issue their first WEA syscall (`NtCreateFile`, etc.)
    /// trigger allocation. For WEA-heavy environments (many WINE containers),
    /// the flat 65536-entry array trades ~1.57 MB per WEA process for O(1) handle
    /// lookup. If container density requires lower per-process overhead, a future
    /// optimization can use a two-level page table (256 × 256-entry pages, ~6 KB
    /// base + 1 KB per populated page) instead of the flat array.
    handle_tables: PerProcess<Option<Box<NtHandleTable>>>,
}

/// Lock category for WEA subsystem locks. Separate from scheduler locks.
/// Holding a WEA lock and a scheduler lock simultaneously is forbidden.
pub const WEA_LOCK_CATEGORY: LockCategory = LockCategory::WEA;

/// Lock level within WEA category for namespace directory lock.
pub const NT_NAMESPACE_LEVEL: u8 = 0;
/// Lock level within WEA category for individual NT object internal locks.
pub const NT_OBJECT_LEVEL: u8 = 1;

/// NT namespace directory node. Each directory has its own RwLock protecting
/// its children. Traversal acquires one directory lock at a time (hand-over-hand
/// is NOT needed — the parent lock is released before the child lock is acquired,
/// because `Arc<NtDirectory>` children are stable once inserted).
pub struct NtDirectory {
    /// Children of this directory, protected by a per-directory lock.
    /// Lookups take a read lock; insertions take a write lock. Since each
    /// directory has its own lock, `\BaseNamedObjects\` operations never
    /// contend with `\Device\` operations.
    children: RwLock<BTreeMap<NtName, NtDirectoryEntry>, { LockCategory::WEA, 0 }>,
}

/// Directory entry in the NT namespace.
pub struct NtDirectoryEntry {
    /// The object (Event, Mutex, etc.) or a subdirectory.
    content: NtEntryContent,
    /// Security descriptor controlling access (simplified from full Windows SD)
    security: NtSecurityDescriptor,
    /// Creation timestamp for audit/debugging
    created_at: Instant,
}

/// Entry content: either a leaf object or a subdirectory.
pub enum NtEntryContent {
    /// Leaf object (Event, Mutex, Semaphore, etc.)
    Object(Arc<NtObject>),
    /// Subdirectory (e.g., `\BaseNamedObjects\` is a subdirectory of `\`).
    Directory(Arc<NtDirectory>),
}

/// Simplified NT security descriptor. Full Windows SDs are complex; we implement
/// the subset needed for WINE/Proton compatibility.
pub struct NtSecurityDescriptor {
    /// Owner (maps to Unix UID via UmkaOS's capability system)
    owner: UserId,
    /// Container ID for namespace isolation (prevents cross-container squatting)
    container_id: Option<ContainerId>,
}

/// Named object creation with atomic create-or-open semantics.
/// Prevents TOCTOU race conditions in named object access.
impl NtObjectManager {
    /// Create a named object atomically. Returns existing object if name exists
    /// and `open_existing` is true; returns STATUS_OBJECT_NAME_COLLISION if name
    /// exists and `open_existing` is false.
    ///
    /// **Traversal protocol**: The path is split into components. Each component
    /// is looked up in the current directory under a read lock. When the final
    /// component is reached and creation may be needed, the *leaf directory's*
    /// write lock is acquired directly — the existence check and insertion both
    /// happen under this single write-lock acquisition, eliminating any TOCTOU
    /// window. At most one directory lock is held at any time.
    ///
    /// **Atomic create-or-fail protocol**: The implementation uses a single
    /// write-lock acquisition for both the existence check and the insertion,
    /// eliminating any TOCTOU window. A prior read-only existence check (under
    /// read lock) is an optional performance optimization only when
    /// `OBJECT_CREATE_OR_FAIL` semantics are not required, and must never be
    /// used as the authoritative check. The authoritative name-exists check is
    /// always the one performed under the write lock in this function.
    ///
    /// **Concurrency**: Operations on different directories never contend.
    /// Two concurrent `CreateEvent(\BaseNamedObjects\EventA)` and
    /// `CreateEvent(\BaseNamedObjects\EventB)` contend only on the
    /// `\BaseNamedObjects\` directory lock, not on the root.
    pub fn create_named<T: NtObjectType>(
        &self,
        path: &NtPath,
        open_existing: bool,
        access: u32,
        security: NtSecurityDescriptor,
    ) -> Result<(NtHandle, bool /* created */), NtStatus> {
        // Walk to the leaf directory (all intermediate lookups use read locks).
        let (leaf_dir, name) = self.traverse_to_parent(path)?;
        // Take write lock on the leaf directory only.
        let mut dir = leaf_dir.children.write();
        if let Some(existing) = dir.get(&name) {
            // Check caller has permission to access existing object
            self.check_access(existing, access)?;
            // Check container isolation: object must be in same container or global
            self.check_container_access(existing, &security)?;
            if open_existing {
                return Ok((self.create_handle(existing, access), false));
            } else {
                return Err(STATUS_OBJECT_NAME_COLLISION);
            }
        }
        // Create new object under leaf write lock — atomic with the lookup
        let obj = Arc::new(T::create()?);
        let entry = NtDirectoryEntry {
            content: NtEntryContent::Object(Arc::clone(&obj)),
            security,
            created_at: Instant::now(),
        };
        dir.insert(name, entry);
        Ok((self.create_handle(&obj, access), true))
    }
}
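A userspace miniature of this protocol, using std locks in place of kernel locks and a string in place of the object body: the existence check and the insertion share one write-lock acquisition, so two racing creators can never both observe created == true.

```rust
use std::collections::BTreeMap;
use std::sync::{Arc, RwLock};

#[derive(Debug)]
pub enum NtStatus {
    ObjectNameCollision,
}

/// Miniature of the atomic create-or-open protocol.
pub struct Directory {
    // Object body is just its name here; the kernel stores Arc<NtObject>.
    children: RwLock<BTreeMap<String, Arc<String>>>,
}

impl Directory {
    pub fn new() -> Self {
        Directory { children: RwLock::new(BTreeMap::new()) }
    }

    pub fn create_named(
        &self,
        name: &str,
        open_existing: bool,
    ) -> Result<(Arc<String>, bool /* created */), NtStatus> {
        let mut dir = self.children.write().unwrap(); // single write lock
        if let Some(existing) = dir.get(name) {
            return if open_existing {
                Ok((Arc::clone(existing), false))
            } else {
                Err(NtStatus::ObjectNameCollision)
            };
        }
        // Still under the same write lock: insert is atomic with the lookup.
        let obj = Arc::new(name.to_string());
        dir.insert(name.to_string(), Arc::clone(&obj));
        Ok((obj, true))
    }
}
```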

pub enum NtObject {
    Event(NtEvent),
    Mutex(NtMutex),
    Semaphore(NtSemaphore),
    Section(NtSection),        // Memory-mapped file or shared memory
    Process(NtProcess),
    Thread(NtThread),
    Timer(NtTimer),
    IoCompletionPort(NtIocp),
    Job(NtJob),
}

pub struct NtHandleTable {
    /// Handles are indices into this table, not file descriptors.
    /// Heap-allocated boxed slice with maximum 65536 entries (matching
    /// UmkaOS's CapSpace limit and Linux's RLIMIT_NOFILE default). Attempting to
    /// create handles beyond this limit returns STATUS_INSUFFICIENT_RESOURCES.
    /// Initialized via `vec![None; NT_MAX_HANDLES].into_boxed_slice()` — never
    /// passes through the stack, avoiding stack overflow with large N.
    entries: Box<[Option<NtHandleEntry>]>,

    /// Bitmap tracking which slots are free, for O(1) allocation.
    /// Size: `NT_MAX_HANDLES / 64 = 65536 / 64 = 1024` entries (1024 × 8 = 8 KiB).
    /// Also heap-allocated to avoid stack pressure.
    /// Initialized via unsafe `alloc_zeroed` — AtomicU64 is zero-safe.
    free_bitmap: Box<[AtomicU64]>,

    /// Windows handles are user-mode pointers (multiples of 4).
    /// We maintain the illusion: handle = (index + 1) << 2, which is always
    /// a non-NULL multiple of 4 and distinct for every table index.
    next_hint: AtomicU32,  // Hint for next free slot search, not authoritative
}

/// Maximum NT handles per process. Matches UmkaOS's CapSpace limit (Section 8.1).
/// Windows default is ~16 million but most applications use far fewer.
pub const NT_MAX_HANDLES: usize = 65536;

pub struct NtHandleEntry {
    object: Arc<NtObject>,
    access_mask: u32,           // Windows ACCESS_MASK
    attributes: u32,            // OBJ_INHERIT, OBJ_PERMANENT, etc.
}
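The handle arithmetic and bitmap allocation can be sketched as follows, assuming the (index + 1) << 2 encoding (always a non-NULL multiple of 4, distinct per index) and the CAS-based first-fit scan the free_bitmap comment implies:

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// handle = (index + 1) << 2: a non-NULL multiple of 4 for every index.
pub fn handle_from_index(index: u32) -> u64 {
    ((index as u64) + 1) << 2
}

pub fn index_from_handle(handle: u64) -> Option<u32> {
    if handle == 0 || handle & 0x3 != 0 {
        return None; // NULL or not a multiple of 4: never a valid handle
    }
    Some((handle >> 2) as u32 - 1)
}

/// First-fit scan over the free bitmap. Each bit is claimed with a CAS so
/// concurrent allocators never hand out the same slot; `next_hint` (omitted
/// here) would only bias where the scan starts.
pub fn alloc_slot(bitmap: &[AtomicU64]) -> Option<usize> {
    for (w, word) in bitmap.iter().enumerate() {
        loop {
            let cur = word.load(Ordering::Relaxed);
            if cur == u64::MAX {
                break; // every slot in this word is taken, try the next word
            }
            let bit = (!cur).trailing_zeros() as usize;
            if word
                .compare_exchange(cur, cur | (1u64 << bit), Ordering::AcqRel, Ordering::Relaxed)
                .is_ok()
            {
                return Some(w * 64 + bit);
            }
            // CAS lost a race with another allocator; reload and retry
        }
    }
    None // STATUS_INSUFFICIENT_RESOURCES in the kernel version
}
```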

Syscalls provided:

// These are UmkaOS syscalls, not Windows syscalls
// WINE's ntdll.dll calls these instead of emulating in userspace

SYS_nt_create_event(
    name: *const u16,           // UTF-16 name (Windows convention)
    manual_reset: bool,
    initial_state: bool,
) -> Result<NtHandle>;

SYS_nt_open_event(
    name: *const u16,
    access: u32,
) -> Result<NtHandle>;

SYS_nt_set_event(handle: NtHandle) -> Result<()>;
SYS_nt_reset_event(handle: NtHandle) -> Result<()>;
SYS_nt_pulse_event(handle: NtHandle) -> Result<()>;

SYS_nt_wait_for_single_object(
    handle: NtHandle,
    timeout_ns: Option<u64>,    // Windows uses 100ns units, we convert
) -> Result<WaitResult>;

SYS_nt_wait_for_multiple_objects(
    handles: &[NtHandle],
    wait_all: bool,             // WaitAll vs WaitAny
    timeout_ns: Option<u64>,
) -> Result<WaitResult>;

SYS_nt_create_section(
    name: Option<*const u16>,
    size: u64,
    protection: u32,            // PAGE_READWRITE, PAGE_EXECUTE_READ, etc.
    file: Option<Fd>,           // Back with file or anonymous
) -> Result<NtHandle>;

SYS_nt_map_view_of_section(
    section: NtHandle,
    base_address: Option<*mut u8>,  // NULL = kernel picks
    size: u64,
    offset: u64,
    protection: u32,
) -> Result<*mut u8>;
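The "Windows uses 100ns units, we convert" note on the wait syscalls can be sketched concretely. NT passes an optional LARGE_INTEGER in 100 ns units, where negative means a relative interval, zero means poll, and absent means infinite; absolute (positive) values reference the 1601 epoch and are left unhandled in this sketch:

```rust
/// Convert an NT wait timeout (100 ns units: absent = infinite, negative =
/// relative interval, zero = poll) to the Option<u64> nanosecond argument
/// of the UmkaOS wait syscalls. Absolute (positive, 1601-epoch) timeouts
/// would need the epoch clock and are out of scope for this sketch.
pub fn nt_timeout_to_ns(timeout: Option<i64>) -> Result<Option<u64>, &'static str> {
    match timeout {
        None => Ok(None), // NULL pointer: wait forever
        Some(t) if t < 0 => Ok(Some(t.unsigned_abs().saturating_mul(100))),
        Some(0) => Ok(Some(0)), // zero: poll, return immediately
        Some(_) => Err("absolute (positive) timeouts not handled in this sketch"),
    }
}
```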

Benefits for WINE:

  1. Performance: Single syscall instead of 5-10 syscalls + wineserver RPC
  2. Correctness: Kernel enforces Windows NT semantics exactly
  3. Simplicity: WINE's ntdll.dll becomes a thin wrapper over UmkaOS syscalls
  4. Cross-process: Named objects work correctly between processes (games + launchers)


18.4.3 Fast Synchronization Primitives

Problem: Windows has NtWaitForMultipleObjects (wait on up to 64 objects simultaneously). Linux has no equivalent, so WINE emulates it with pipes + poll() or wineserver signaling, at high overhead.

UmkaOS WEA approach: Kernel-native multi-object wait.

/// Result of waiting on NT synchronization objects.
/// Windows limits WaitForMultipleObjects to 64 handles (MAXIMUM_WAIT_OBJECTS).
/// This limit is enforced at runtime, not in the type system.
pub enum WaitResult {
    /// One of the waited objects became signaled. The inner value is the
    /// zero-based index of the signaled handle in the input array.
    /// For WaitAll, this is 0 (all signaled, return indicates the first).
    Signaled(usize),
    /// Wait timed out before any object was signaled.
    Timeout,
    /// A mutex was abandoned (owner thread died while holding it).
    /// The inner value is the index of the abandoned mutex.
    /// Windows semantics: the waiter acquires the mutex but should check state.
    Abandoned(usize),
    /// An I/O completion port had a packet available (for alertable waits).
    IoCompletion,
}

impl NtObjectManager {
    /// Wait on multiple objects (events, mutexes, semaphores, threads, processes)
    /// Returns when ANY object becomes signaled (WaitAny) or ALL (WaitAll)
    pub fn wait_for_multiple_objects(
        handles: &[NtHandle],
        wait_all: bool,
        timeout: Option<Duration>,
    ) -> Result<WaitResult> {
        // --- WaitAny semantics ---
        // Register on wait queues for all handles. When ANY object signals,
        // the thread is woken. On wakeup, atomically consume the signaled
        // object (reset auto-reset event, acquire mutex, decrement semaphore).
        // Deregister from all wait queues before returning.

        // --- WaitAll atomicity ---
        // WaitAll requires atomic multi-acquire: either ALL objects are acquired
        // in a single atomic operation, or NONE are. Implementation:
        //
        // 1. Sort handles by object address to establish lock ordering.
        // 2. Acquire each object's lock in sorted order (prevents deadlock).
        // 3. Check if ALL objects are signaled:
        //    - Event: signaled == true
        //    - Mutex: owner == None OR owner == current_thread (recursive)
        //    - Semaphore: count > 0
        //    - Process/Thread: terminated
        // 4. If ALL signaled, atomically consume ALL (reset events, acquire
        //    mutexes, decrement semaphores) while still holding all locks.
        // 5. Release all locks in reverse order.
        // 6. If NOT all signaled, release all locks and block on wait queues
        //    (same as WaitAny). Retry step 1-5 on each wakeup.
        //
        // This two-phase locking ensures no partial acquisition: either the
        // calling thread wins all objects, or it wins none and blocks.
        //
        // Lock ordering: Objects are sorted by their kernel address. This
        // matches Windows NT's implementation and prevents deadlock when
        // multiple threads WaitAll on overlapping handle sets.
        //
        // Already-held objects and deadlock pre-check:
        //
        // Invariant for deadlock-free operation: if a thread already holds any
        // mutex in the WaitAll set, it MUST hold ALL mutexes in the set that
        // sort before (lower address than) that mutex. If this invariant is
        // violated the sorted-order protocol breaks down: the thread would skip
        // an already-held object at sorted position i but still need to acquire
        // an unheld object at position j < i. Another thread that holds the
        // object at j and is waiting for i creates a classic ABBA deadlock.
        //
        // Example of the failure mode (the old "skip already-held" logic):
        //   Thread A holds M1, calls WaitAll([M1, M2]) → skips M1, blocks on M2.
        //   Thread B holds M2, calls WaitAll([M1, M2]) → blocks on M1.
        //   → deadlock despite sorted acquisition order.
        //
        // Pre-check algorithm (runs before the acquisition loop):
        //
        //   let mut found_unheld = false;
        //   for obj in sorted_objects.iter() {
        //       if thread_holds(obj) {
        //           if found_unheld {
        //               // Already-held mutex appears after an unheld one in sorted
        //               // order. Another thread could hold the unheld object and
        //               // wait for this thread's object → deadlock.
        //               return Err(STATUS_POSSIBLE_DEADLOCK);
        //           }
        //           // Object is already held and all earlier objects are also held:
        //           // safe to skip (increment recursion count for recursive mutexes,
        //           // or return STATUS_MUTANT_NOT_OWNED for non-recursive ones).
        //       } else {
        //           found_unheld = true;
        //       }
        //   }
        //
        // If the pre-check passes, the caller either holds none of the objects
        // (normal path) or holds a contiguous prefix of the sorted set (safe to
        // skip those and acquire the suffix). In both cases the sorted-order
        // protocol holds and deadlock is impossible.
        //
        // STATUS_POSSIBLE_DEADLOCK matches Windows NT semantics: NT's kernel
        // issues this status from KeWaitForMutexObject when the deadlock
        // condition is detected, allowing the caller to back off and retry.
    }
}
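The pre-check pseudocode above, as a runnable function over the address-sorted set (held[i] is true when the calling thread already owns the i-th object):

```rust
/// Stand-in for STATUS_POSSIBLE_DEADLOCK in this sketch.
#[derive(Debug, PartialEq)]
pub enum PrecheckError {
    PossibleDeadlock,
}

/// WaitAll pre-check over the address-sorted object list. Already-held
/// objects may be skipped only if they form a contiguous prefix of the
/// sorted order; otherwise another thread can hold the gap object while
/// waiting on ours, producing an ABBA deadlock.
pub fn waitall_precheck(held: &[bool]) -> Result<(), PrecheckError> {
    let mut found_unheld = false;
    for &is_held in held {
        if is_held {
            if found_unheld {
                // A held object sorts after an unheld one: bail out so the
                // caller can back off and retry, as Windows NT does.
                return Err(PrecheckError::PossibleDeadlock);
            }
        } else {
            found_unheld = true;
        }
    }
    Ok(())
}
```

In the two-thread example from the comments, the thread holding M1 sees [held, unheld], a contiguous prefix, and proceeds to block on M2; the thread holding M2 sees [unheld, held] and gets PossibleDeadlock, so it backs off instead of completing the cycle.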

Why this matters for gaming:

  • Game engines (Unreal, Unity) use multi-object waits heavily
  • DirectX 11/12 synchronization uses events and mutexes
  • 5-10x performance improvement over WINE's current userspace emulation


18.4.4 I/O Completion Ports (IOCP)

Problem: Windows IOCP is a high-performance async I/O primitive used by game servers and engines. Linux has io_uring, but the semantics don't match, and WINE emulates IOCP poorly.

UmkaOS WEA approach: Kernel-native IOCP implementation.

/// Maximum pending completion packets per IOCP. Prevents unbounded kernel
/// memory growth from userspace posting. Windows doesn't document a hard
/// limit; we use 64K which exceeds any practical game workload.
pub const NT_MAX_IOCP_PACKETS: usize = 65536;

pub struct NtIocp {
    /// Completion queue (MPMC: many threads post via I/O completion or
    /// PostQueuedCompletionStatus, multiple worker threads consume via
    /// GetQueuedCompletionStatus). The `concurrency` field limits how many
    /// threads can dequeue simultaneously. Bounded to NT_MAX_IOCP_PACKETS;
    /// posting to a full queue returns STATUS_INSUFFICIENT_RESOURCES.
    completion_queue: BoundedMpmcQueue<IocpPacket, NT_MAX_IOCP_PACKETS>,

    /// Associated threads (NT allows binding threads to IOCP)
    concurrency: usize,         // Max threads that can dequeue simultaneously

    /// Wait queue for GetQueuedCompletionStatus
    wait_queue: WaitQueue,
}

pub struct IocpPacket {
    bytes_transferred: u32,
    completion_key: usize,      // User-defined per-handle key
    /// User-provided OVERLAPPED pointer. This is an **opaque token** that the kernel
    /// never dereferences — it is stored on PostQueuedCompletionStatus and returned
    /// unchanged on GetQueuedCompletionStatus. The caller is responsible for ensuring
    /// the pointer remains valid until dequeued. The kernel treats this as a usize
    /// (not a validated UserPtr) because it is purely userspace-to-userspace data flow.
    overlapped: usize,          // Opaque user pointer (NOT dereferenced by kernel)
    status: i32,                // NT status code
}

// Syscalls
SYS_nt_create_iocp(concurrency: usize) -> Result<NtHandle>;

SYS_nt_associate_file_with_iocp(
    file: Fd,
    iocp: NtHandle,
    completion_key: usize,
) -> Result<()>;

SYS_nt_post_queued_completion_status(
    iocp: NtHandle,
    packet: IocpPacket,
) -> Result<()>;

SYS_nt_get_queued_completion_status(
    iocp: NtHandle,
    timeout: Option<Duration>,
) -> Result<IocpPacket>;

Why this matters:

  • Multiplayer game servers (Rust game servers, Minecraft servers under WINE)
  • Game engines with async asset loading
  • Network code in games (sockets + IOCP)

Implementation note: UmkaOS's existing async I/O (Section 10.7, ring buffers and IPC channels) can back this. IOCP is a userspace-visible queue over kernel async I/O.
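A userspace miniature of the bounded completion queue, using std primitives in place of the kernel's MPMC queue and wait queue (the kernel version returns STATUS_INSUFFICIENT_RESOURCES where this sketch returns an error string; the concurrency throttle is omitted):

```rust
use std::collections::VecDeque;
use std::sync::{Condvar, Mutex};

pub struct IocpPacket {
    pub bytes_transferred: u32,
    pub completion_key: usize,
    pub overlapped: usize, // opaque user token, never dereferenced
    pub status: i32,
}

/// Bounded completion queue: posting to a full queue fails instead of
/// growing kernel memory; consumers block in `get` until a packet arrives.
pub struct BoundedIocp {
    inner: Mutex<VecDeque<IocpPacket>>,
    not_empty: Condvar,
    capacity: usize,
}

impl BoundedIocp {
    pub fn new(capacity: usize) -> Self {
        BoundedIocp {
            inner: Mutex::new(VecDeque::new()),
            not_empty: Condvar::new(),
            capacity,
        }
    }

    /// PostQueuedCompletionStatus: fail fast when the queue is full.
    pub fn post(&self, pkt: IocpPacket) -> Result<(), &'static str> {
        let mut q = self.inner.lock().unwrap();
        if q.len() >= self.capacity {
            return Err("insufficient resources");
        }
        q.push_back(pkt);
        self.not_empty.notify_one();
        Ok(())
    }

    /// GetQueuedCompletionStatus with an infinite timeout.
    pub fn get(&self) -> IocpPacket {
        let mut q = self.inner.lock().unwrap();
        loop {
            if let Some(pkt) = q.pop_front() {
                return pkt;
            }
            q = self.not_empty.wait(q).unwrap();
        }
    }
}
```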


18.4.5 Memory Management Acceleration

Problem: Windows VirtualAlloc, VirtualFree, and VirtualProtect have specific semantics that don't map cleanly to mmap/munmap/mprotect:

  • Reservation vs commit: Reserve address space without allocating pages, commit later
  • MEM_RESET: Discard pages but keep the address range mapped (Linux has MADV_DONTNEED, but the semantics differ)
  • Guard pages: PAGE_GUARD causes an exception on first access, then becomes a normal page
  • Large pages: MEM_LARGE_PAGES (2 MB / 1 GB pages)

UmkaOS WEA approach: Extended mmap with Windows-compatible flags.

// Extend existing UmkaOS mmap syscall with WEA flags
SYS_mmap_wea(
    addr: Option<*mut u8>,
    size: usize,
    protection: u32,            // PAGE_READWRITE | PAGE_EXECUTE_READ | ...
    flags: u32,                 // MEM_RESERVE, MEM_COMMIT, MEM_RESET, MEM_LARGE_PAGES
    fd: Option<Fd>,
) -> Result<*mut u8>;

// New syscalls for Windows-specific ops
SYS_virtual_protect(
    addr: *mut u8,
    size: usize,
    new_protection: u32,
    old_protection: &mut u32,   // Windows returns old protection
) -> Result<()>;

SYS_virtual_lock(
    addr: *mut u8,
    size: usize,
) -> Result<()>;                // Pin pages in RAM (VirtualLock)
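The reserve/commit split above can be modeled as a page-state map: MEM_COMMIT is legal only inside an existing reservation. A minimal sketch at page granularity (type and method names are illustrative, not the kernel's VA structures):

```rust
use std::collections::BTreeMap;

#[derive(Clone, Copy, PartialEq, Debug)]
pub enum PageState {
    Reserved,  // address space claimed, no backing pages
    Committed, // backing pages allocated
}

/// Minimal model of the VirtualAlloc reserve/commit split.
pub struct VaTracker {
    pages: BTreeMap<usize, PageState>, // page index -> state
}

impl VaTracker {
    pub fn new() -> Self {
        VaTracker { pages: BTreeMap::new() }
    }

    /// MEM_RESERVE: claim a range; must not overlap an existing allocation.
    pub fn reserve(&mut self, first: usize, count: usize) -> Result<(), &'static str> {
        if (first..first + count).any(|p| self.pages.contains_key(&p)) {
            return Err("range overlaps an existing allocation");
        }
        for p in first..first + count {
            self.pages.insert(p, PageState::Reserved);
        }
        Ok(())
    }

    /// MEM_COMMIT: legal only inside an existing reservation (re-committing
    /// an already committed page is allowed, as on Windows).
    pub fn commit(&mut self, first: usize, count: usize) -> Result<(), &'static str> {
        if (first..first + count).any(|p| !self.pages.contains_key(&p)) {
            return Err("commit outside a reservation");
        }
        for p in first..first + count {
            self.pages.insert(p, PageState::Committed);
        }
        Ok(())
    }
}
```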

Why this matters:

  • Games use VirtualAlloc for custom allocators
  • JIT compilers (C#/CLR games) use executable memory allocation
  • DX12 resource heaps use large-page allocations


18.4.6 NT Thread Model and Fiber Support

Problem: Windows threads have TEB (Thread Environment Block), fiber contexts (cooperative coroutines), FLS (Fiber Local Storage), and APC (Asynchronous Procedure Call) queues. WINE emulates most of this in userspace; the gaps are performance and correctness of blocking-in-fiber.

UmkaOS WEA approach: Extend the UmkaOS thread model with NT-compatible TLS and APC support. Fiber support leverages the native UmkaOS scheduler upcall mechanism (Section 7.1.7) for correct blocking behavior.

pub struct NtThread {
    /// Standard UmkaOS thread.
    umka_thread: Arc<Task>,
    /// Thread Environment Block — allocated in user address space.
    /// Kernel records the address for fast NtCurrentTeb() via GS base.
    teb_address: *mut NtTeb,
    /// APC queue (kernel-mode and user-mode APCs). Uses intrusive linked list
    /// to avoid heap allocation under spinlock. Apc nodes are allocated from
    /// a pre-allocated per-thread pool (max 64 pending APCs per thread).
    apc_queue: SpinLock<IntrusiveList<Apc>>,
    /// Pre-allocated APC node pool. Avoids allocator calls under spinlock.
    apc_pool: [MaybeUninit<ApcNode>; NT_MAX_PENDING_APCS],
    apc_pool_bitmap: AtomicU64,  // 64 slots, 1 bit each
}

/// Maximum pending APCs per thread. Windows doesn't document a hard limit,
/// but practical applications rarely exceed a handful.
pub const NT_MAX_PENDING_APCS: usize = 64;

#[repr(C)]
pub struct NtTeb {
    /// NtTib.Self: self-pointer (always TEB[0], offset 0x00 on x64).
    self_ptr:         *mut NtTeb,
    /// NtTib.StackBase / StackLimit: valid stack range for current fiber.
    /// Updated by WINE's SwitchToFiber() — userspace write, no syscall.
    stack_base:       *mut u8,
    stack_limit:      *mut u8,
    /// NtTib.FiberData: pointer to the active fiber's data block.
    /// Updated by WINE on every SwitchToFiber() — userspace write.
    fiber_data:       *mut u8,
    // Kernel maintains these fields at thread creation time.
    // WINE manages the full TEB layout; kernel only guarantees:
    // - TEB is allocated and zeroed to at least 0x1000 bytes (Windows x64 minimum)
    // - GS base points to TEB (x64) or FS base (x86 WoW64)
    // - self_ptr is initialized to TEB address
    // - stack_base/stack_limit are set from thread stack
    // WINE is responsible for populating remaining fields (PEB pointer at 0x60,
    // LastErrorValue at 0x68, TLS array at 0x58, etc.) before first user-mode entry.
}

pub struct Apc {
    routine: extern "C" fn(*mut u8),
    context: *mut u8,
    mode: ApcMode,   // KernelMode vs UserMode
}

// WEA syscalls for APC support.
// SYS_nt_queue_apc returns STATUS_INSUFFICIENT_RESOURCES if the target thread's
// APC pool (64 entries) is exhausted. This is not a Windows-documented limit,
// but practical applications rarely exceed it. WINE can retry or log a warning.
SYS_nt_queue_apc(thread: NtHandle, routine: extern "C" fn(*mut u8), context: *mut u8) -> Result<()>;
SYS_nt_alert_thread(thread: NtHandle) -> Result<()>;
SYS_nt_test_alert() -> Result<bool>;

Fiber kernel responsibilities — what requires kernel involvement and what does not:

| Win32 API | Kernel role | Implementation |
|---|---|---|
| ConvertThreadToFiber() | Allocate upcall stack, call SYS_register_scheduler_upcall | WINE calls Section 7.1.7 registration |
| CreateFiber(size, fn, p) | None | WINE allocates stack, sets up UpcallFrame in userspace |
| SwitchToFiber(fiber) | None | WINE saves registers, swaps stack pointer, updates TEB.FiberData — pure userspace |
| DeleteFiber(fiber) | None | WINE frees stack |
| FlsAlloc / FlsGetValue / FlsSetValue | None | WINE maintains per-fiber FLS table in user address space; pointer swapped on SwitchToFiber |
| Fiber calls blocking syscall | Scheduler upcall (Section 7.1.7) | Kernel invokes upcall; WINE converts to io_uring, parks fiber, runs another |

Fiber Local Storage (FLS):

Fiber Local Storage provides per-fiber storage analogous to thread-local storage, matching the Windows FLS API (FlsAlloc/FlsSetValue/FlsGetValue/FlsFree) that the WEA layer must support.

/// Per-fiber local storage block. Each fiber has one FLS block allocated
/// with its stack. Windows supports up to 1088 FLS slots (FLS_MAXIMUM_AVAILABLE).
pub struct FiberLocalStorage {
    /// Storage slots. Index is the FLS slot ID returned by fls_alloc().
    slots: Box<[FlsSlot; FLS_MAXIMUM_AVAILABLE]>,
    /// Number of allocated slots (highest used index + 1).
    allocated: u32,
}

/// One FLS slot: a value and an optional destructor called when the fiber exits.
pub struct FlsSlot {
    /// The stored value (pointer-sized). Zero if unset.
    pub value: usize,
    /// Optional destructor called with `value` when the fiber exits or
    /// fls_free() is called while the slot is set. Called before the
    /// fiber's stack is freed.
    pub destructor: Option<fn(usize)>,
}

/// Maximum number of FLS slots per fiber (matches Windows FLS_MAXIMUM_AVAILABLE).
pub const FLS_MAXIMUM_AVAILABLE: usize = 1088;

FLS operations:

fls_alloc(destructor: Option<fn(usize)>) -> Result<u32, FlsError>:
  Allocates the next free FLS slot index. Returns the slot index.
  Returns FlsError::NoMoreSlots if all 1088 slots are in use.

fls_set_value(index: u32, value: usize) -> Result<(), FlsError>:
  Sets the value for slot `index` in the current fiber's FLS block.
  Returns FlsError::InvalidIndex if index >= FLS_MAXIMUM_AVAILABLE
  or the slot has not been allocated via fls_alloc().

fls_get_value(index: u32) -> Result<usize, FlsError>:
  Reads the value for slot `index`. Returns 0 if set to zero or
  never set. Returns FlsError::InvalidIndex for invalid/unallocated index.

fls_free(index: u32) -> Result<(), FlsError>:
  Frees slot `index`. Calls the destructor (if set and value != 0)
  before clearing the slot. The slot index may be reused by future
  fls_alloc() calls.

Fiber stack allocation:

Fibers use UmkaOS's normal virtual memory allocator. Stack size is specified at creation time via CreateFiber(stack_size, proc, param):
  • Minimum stack: 64 KB (aligned up if caller requests less)
  • Default stack: 1 MB (matches Windows default fiber stack)
  • Maximum stack: process virtual address space limit
  • Guard page: one no-access page below the stack (catches stack overflow)
  • The fiber stack VA range is allocated with mmap(MAP_ANONYMOUS | MAP_STACK); the guard page uses mprotect(PROT_NONE) on the bottom page.
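The size rules above reduce to a small layout computation. A minimal sketch, assuming a 4 KB page size and ignoring the VA-space maximum check; `fiber_stack_layout` is a hypothetical helper, not a WINE or UmkaOS function:

```rust
const MIN_FIBER_STACK: usize = 64 * 1024;       // 64 KB floor
const DEFAULT_FIBER_STACK: usize = 1024 * 1024; // 1 MB Windows default
const PAGE_SIZE: usize = 4096;                  // assumed page size

/// Returns (usable stack bytes, total VA reservation including the one
/// no-access guard page below the stack). A request of 0 means "default",
/// matching CreateFiber's dwStackSize convention.
fn fiber_stack_layout(requested: usize) -> (usize, usize) {
    let stack = if requested == 0 { DEFAULT_FIBER_STACK }
                else { requested.max(MIN_FIBER_STACK) };
    let stack = (stack + PAGE_SIZE - 1) & !(PAGE_SIZE - 1); // page-align up
    (stack, stack + PAGE_SIZE) // + one guard page, mprotect(PROT_NONE)
}
```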

Fiber context switch cost: ~40-80 ns (same as swapcontext() — save/restore GPRs + FPU state + FLS block pointer, no kernel involvement).

Why blocking-in-fiber is the only hard problem: SwitchToFiber needs zero kernel involvement — it is register save/restore. FLS is an array in user memory. The problem is a fiber calling NtReadFile (→ read(2)) which would block the OS thread, starving all other fibers. The Section 7.1.7 scheduler upcall mechanism solves this: WINE registers an upcall handler on the OS thread; when any fiber's syscall would block, the kernel invokes the handler, which submits the I/O to io_uring and runs the next fiber. The OS thread remains live.
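
The park-and-switch logic WINE's upcall handler performs can be sketched as follows. The types here are assumed shapes for illustration only — Section 7.1.7 defines the real UpcallFrame, and the io_uring submission itself is elided:

```rust
use std::collections::VecDeque;

type FiberId = u32;

/// Assumed shape of the kernel→user upcall payload (illustrative).
enum Upcall {
    /// The fiber's syscall would block; it has been converted to an
    /// io_uring submission tagged with this user_data.
    WouldBlock { sqe_user_data: u64 },
}

struct FiberScheduler {
    ready: VecDeque<FiberId>,
    parked: Vec<(FiberId, u64)>, // fiber waiting on an io_uring CQE
}

impl FiberScheduler {
    /// Invoked by the kernel instead of blocking the OS thread.
    /// Returns the next fiber to run, or None (OS thread goes idle).
    fn on_upcall(&mut self, current: FiberId, up: Upcall) -> Option<FiberId> {
        match up {
            Upcall::WouldBlock { sqe_user_data } => {
                self.parked.push((current, sqe_user_data)); // park blocked fiber
                self.ready.pop_front()                      // run another one
            }
        }
    }

    /// io_uring completion arrived: unpark the matching fiber.
    fn on_completion(&mut self, user_data: u64) {
        if let Some(pos) = self.parked.iter().position(|&(_, d)| d == user_data) {
            let (fiber, _) = self.parked.remove(pos);
            self.ready.push_back(fiber);
        }
    }
}
```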

This is exactly how Naughty Dog's fiber-based job system (and similar game-engine job schedulers) achieves high core utilisation — fibers never "waste" a core waiting for I/O or synchronisation.

Why this matters:
  • Games using Windows fiber-based job systems (Destiny, various Unreal titles)
  • Windows thread pool APIs (TpCallbackMayRunLong, TP_CALLBACK_ENVIRON)
  • .NET/C# games (CLR uses APCs for garbage collection suspension)
  • Anti-cheat systems that inspect TEB/fiber state


18.4.7 Security & Token Model

Problem: Windows has security tokens (user SID, group SIDs, privileges). Many games/launchers check tokens. WINE fakes most of this.

UmkaOS WEA approach: Minimal NT token emulation (not full Windows security, just enough for compatibility).

/// Maximum groups per token. Windows allows up to 1024 groups; we use a lower
/// limit since WINE/Proton games typically need far fewer.
pub const NT_MAX_TOKEN_GROUPS: usize = 128;

/// Maximum privileges per token. Windows defines ~36 privileges; we cap at 64.
pub const NT_MAX_TOKEN_PRIVILEGES: usize = 64;

pub struct NtToken {
    /// User SID (S-1-5-21-...)
    user_sid: WinSid,

    /// Groups (Administrators, Users, etc.). Fixed-capacity array to prevent
    /// unbounded kernel memory growth from malicious token inflation.
    groups: ArrayVec<WinSid, NT_MAX_TOKEN_GROUPS>,

    /// Privileges (SeDebugPrivilege, SeBackupPrivilege, etc.)
    /// Most are no-ops, but games check for them. Fixed-capacity bitset.
    privileges: BitArray<[u64; 1]>,  // 64 bits = 64 privilege slots

    /// Integrity level (Low, Medium, High, System)
    integrity_level: IntegrityLevel,
}

// Syscalls
SYS_nt_open_process_token(
    process: NtHandle,
    access: u32,
) -> Result<NtHandle>;

SYS_nt_query_token_information(
    token: NtHandle,
    class: TokenInformationClass,
    buffer: *mut u8,
    buffer_len: u32,
) -> Result<u32>;                       // Returns bytes written

Why this matters:
  • Game launchers (Epic, Ubisoft) check admin privileges
  • Anti-cheat checks process token integrity level
  • Windows Store games check app container tokens
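
The 64-slot privilege bitset in NtToken supports a simple grant/check model. A minimal sketch; the slot constant here is an illustrative assignment, not a real Windows privilege LUID:

```rust
/// Fixed-capacity privilege bitset: one bit per slot, 64 slots total
/// (matching NT_MAX_TOKEN_PRIVILEGES above).
struct PrivilegeSet(u64);

const SE_DEBUG_PRIVILEGE_SLOT: u8 = 20; // hypothetical slot number

impl PrivilegeSet {
    fn grant(&mut self, slot: u8) { self.0 |= 1u64 << slot; }
    fn revoke(&mut self, slot: u8) { self.0 &= !(1u64 << slot); }
    /// What a TokenPrivileges query would consult. Most privileges are
    /// no-ops in WEA, but games check for their presence.
    fn has(&self, slot: u8) -> bool { self.0 & (1u64 << slot) != 0 }
}
```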


18.4.8 Structured Exception Handling (SEH)

Problem: Windows uses SEH (Structured Exception Handling) for both C++ exceptions and hardware exceptions (access violations, divide-by-zero). x86-64 Windows uses table-based unwinding. WINE emulates via signal handlers.

UmkaOS WEA approach: Kernel-assisted SEH dispatch with safety bounds.

// When hardware exception occurs (page fault, illegal instruction, etc.):
// 1. Kernel looks up exception handler chain in TEB
// 2. Validates and calls user-mode exception handlers in order
// 3. If unhandled, terminates process (Windows behavior)

pub struct ExceptionRecord {
    exception_code: u32,        // STATUS_ACCESS_VIOLATION, etc.
    exception_flags: u32,
    exception_address: usize,
    parameters: [usize; 15],    // Exception-specific data
}

// When CPU exception occurs, kernel:
// 1. Saves context (registers, stack)
// 2. Reads TEB->ExceptionList (user address, validated)
// 3. For each handler in the chain (max SEH_MAX_CHAIN_DEPTH = 64):
//    a. Validate record.next is within the current thread's stack VMA (stack-pivot defense)
//    b. Validate handler address is in executable user pages
//    c. Validate next pointer is in readable user pages or NULL
//    d. Call handler via controlled user-mode return
//    e. If handler returns EXCEPTION_EXECUTE_HANDLER, unwind to it
// 4. If chain exhausted or max depth reached, terminate process

// Safety invariants enforced by kernel:
// - Each EXCEPTION_REGISTRATION_RECORD.next MUST be within the thread's stack VMA;
//   a pointer outside the stack indicates a stack-pivot attack (see validate_seh_chain)
// - Each handler address must be in VMA with PROT_EXEC
// - Each EXCEPTION_REGISTRATION_RECORD must be in readable user memory
// - Chain traversal stops at 0xFFFFFFFF (end sentinel), invalid pointer, or depth limit
// - Circular chains detected via depth limit

/// Maximum SEH chain depth to traverse. Prevents both infinite loops and stack-pivot
/// attacks via over-long chains. Windows doesn't document a limit; practical applications
/// rarely exceed 10-20 handlers. 64 provides ample headroom with a tight security bound.
pub const SEH_MAX_CHAIN_DEPTH: usize = 64;

fn validate_seh_chain(initial_record: u32) -> Result<(), SehError> {
    let stack_vma = current_task().stack_vma();
    let mut record_addr = initial_record;  // read from FS:[0] / TEB.ExceptionList
    let mut depth = 0usize;

    while record_addr != 0xFFFF_FFFF {
        // Bounds check: record must be within the thread's stack
        if !stack_vma.contains(record_addr as usize) {
            return Err(SehError::RecordOutsideStack { addr: record_addr });
        }
        // Handler must be in executable memory (existing check)
        let record = read_user_seh_record(record_addr)?;
        if !is_executable(record.handler) {
            return Err(SehError::HandlerNotExecutable);
        }
        depth += 1;
        if depth > SEH_MAX_CHAIN_DEPTH {
            return Err(SehError::ChainTooLong);
        }
        record_addr = record.next;
    }
    Ok(())
}

Scope note: SEH validation verifies that handler addresses are in executable pages and that all EXCEPTION_REGISTRATION_RECORD nodes reside within the thread's stack VMA — matching Windows compatibility while closing the stack-pivot attack vector. It does not prevent ROP (Return-Oriented Programming) gadget use; Windows itself does not prevent ROP gadgets in SEH handlers. Applications needing ROP protection should use Control Flow Guard (CFG) or Arbitrary Code Guard (ACG) via SetProcessMitigationPolicy.

Why this matters:
  • Windows games compiled with MSVC use SEH
  • Access violations (common in games with bugs) are handled differently than Linux segfaults
  • Debuggers need to intercept first-chance exceptions


18.4.9 Performance: Projected Comparison

Note: These are design-phase projections, not measured benchmarks. WEA is not yet implemented. The estimates are based on syscall overhead analysis (measuring existing wineserver round-trip vs expected kernel object access latency) and comparable Linux kernel primitives (futex, epoll). Actual performance will be validated during implementation.

Projected workload: Unreal Engine 5 game loading (Proton on Linux vs WEA on UmkaOS)

| Operation | Linux + WINE (est.) | UmkaOS + WEA (projected) | Projected Speedup |
|---|---|---|---|
| CreateEvent (named) | ~15 μs (wineserver RPC; measured end-to-end including wineserver object lookup and state update; raw IPC round-trip on modern hardware is 3–5 μs, but wineserver processing adds 10–12 μs) | ~1.5 μs (kernel object) | targeted ~10x (assuming workload is syscall-latency-bottlenecked; compute-bound workloads see 0% gain) |
| WaitForMultipleObjects (8 handles) | ~8 μs (poll + wineserver) | ~0.5 μs (kernel wait) | targeted ~16x improvement (for CreateEvent/WaitForSingleObject-heavy patterns) |
| VirtualAlloc (100 MB) | ~50 μs (mmap + tracking) | ~20 μs (native) | ~2.5x |
| IOCP GetQueuedCompletionStatus | ~4 μs (eventfd + epoll) | ~0.8 μs (kernel queue) | targeted ~5x improvement (for I/O-intensive patterns) |
| MapViewOfFile (section) | ~12 μs (shm + mmap) | ~3 μs (kernel section) | ~4x |

Note: Speedup projections are based on profiling Wine/Proton on synthetic CreateEvent/WaitForSingleObject and I/O benchmarks. Actual gains depend strongly on workload characteristics. Compute-bound applications see no improvement from WEA; the benefit is concentrated in applications that make frequent Windows API calls with high syscall overhead.

Assumptions: x86-64, Intel Core i7-12700K, Linux 6.1, WINE 8.x, single-threaded microbenchmarks. Real game workloads will show smaller end-to-end improvements due to GPU-bound and I/O-bound phases.

Projected game impact: 10-20% faster loading (synchronization-heavy), 5-10% better frame pacing (reduced NT emulation jitter). These projections require validation.


18.4.10 API Surface & Stability

Key principle: WEA is an internal UmkaOS syscall API, not a Windows-compatible ABI. WINE/Proton are the only consumers.

Stability guarantee:
  • WEA operations use the umka_syscall multiplexed entry point (Section 18.6.2), with operation codes in the 0x0800-0x08FF range (see Section 18.6.3).
  • Versioned API (WEA v1, v2, etc.) with capability negotiation via umka_op::WEA_VERSION_QUERY.
  • WINE can check "does the kernel support WEA v2?" before using new features.

Non-goal: WEA does not aim to run Windows binaries directly. WINE/Proton are still required for:
  • PE executable loading
  • DLL loading, import resolution
  • Win32 API emulation (user32.dll, kernel32.dll, etc.)
  • DirectX → Vulkan translation (DXVK, VKD3D)

WEA only accelerates the kernel-level primitives that WINE currently emulates poorly.


18.4.11 Implementation Roadmap

Phased Development Plan (no time estimates per UmkaOS policy):

Phase 1: NT object manager + basic synchronization
  • Event, Mutex, Semaphore objects
  • WaitForSingleObject, WaitForMultipleObjects
  • Named object namespace

Phase 2: Memory management
  • VirtualAlloc/VirtualFree with Windows semantics
  • Section objects (shared memory)
  • MapViewOfSection, UnmapViewOfSection

Phase 3: I/O completion ports
  • IOCP creation, association, posting, dequeuing
  • Integration with UmkaOS async I/O

Phase 4: Thread model extensions
  • TEB support + fast NtCurrentTeb() via GS base
  • APC queues
  • Scheduler upcall registration (SYS_register_scheduler_upcall, Section 7.1.7) enabling correct fiber blocking behaviour for SwitchToFiber-based job systems

Phase 5: Security & tokens
  • Minimal NT token emulation
  • Privilege checks (mostly no-ops)

Phase 6: SEH support
  • Kernel-assisted exception dispatch
  • Unwind table parsing (x86-64)

Dependency: WINE/Proton must be modified to use WEA syscalls. Upstream WINE may not accept the changes, since WINE targets all UNIX platforms, not just UmkaOS. A Proton fork is more realistic: Valve controls Proton and focuses on the Steam Deck.


18.4.12 Benefits Summary

For users (projected, pending validation — see Section 18.4.9):
  • Games projected to load 10-20% faster under Proton on UmkaOS vs Linux
  • Better compatibility (some games that break on WINE/Linux may work on WEA/UmkaOS)
  • Lower input latency (reduced NT emulation jitter)

For WINE/Proton developers:
  • Less complex userspace emulation code
  • Fewer bugs (kernel enforces correctness)
  • Easier to support new Windows features (kernel does the heavy lifting)

For UmkaOS:
  • Gaming becomes a differentiation point vs Linux
  • "Best platform for Windows gaming outside Windows" marketing
  • Drives enthusiast adoption

Market impact:
  • Steam Deck successor (if Valve is interested)?
  • Gaming-focused UmkaOS distribution (like SteamOS but UmkaOS-based)?
  • Differentiation in the "Linux for gaming" space


18.4.13 Open Questions

  1. Upstream WINE acceptance?
     • WINE targets macOS, FreeBSD, Solaris — not just Linux
     • UmkaOS-specific syscalls might not be upstreamable
     • Solution: Maintain UmkaOS-specific WINE fork OR Proton-only support

  2. Anti-cheat compatibility?
     • EAC, BattlEye check kernel behavior
     • WEA changes kernel behavior (more Windows-like)
     • Could this improve or break anti-cheat support?

  3. Maintenance burden?
     • Windows NT is a moving target (Windows 11, Windows 12...)
     • UmkaOS must track changes to NT kernel APIs
     • Mitigation: Focus on stable APIs (NT 6.x kernel, used in Win7-Win11)

  4. Security implications?
     • NT object namespace shared across processes
     • Named objects can be hijacked (race conditions)
     • Resolved: Atomic create-or-open under write lock prevents TOCTOU (see Section 18.4.1 NtObjectManager::create_named). Container isolation via NtSecurityDescriptor prevents cross-container object squatting.

  5. 32-bit Windows game support?
     • Many Windows games are still 32-bit (i686 PE executables)
     • UmkaOS does not support ia32 multilib (Section 18.5 "Deliberately Dropped")
     • Design decision: 32-bit Windows games run via WINE's WoW64-style thunking. WINE already implements 32-to-64 syscall translation for Linux. WEA syscalls are 64-bit only; WINE's 32-bit ntdll.dll thunks to 64-bit before calling WEA. This maintains UmkaOS's clean 64-bit-only syscall surface while supporting 32-bit games. Performance impact is minimal: the thunk is one function call in WINE's address space, not a kernel transition.

18.5 Deliberately Dropped Compatibility

These Linux features are intentionally not supported. Each omission protects a core design property of UmkaOS.

| Dropped feature | Why | Design property protected |
|---|---|---|
| Binary .ko kernel modules | Would require emulating Linux's unstable internal API | Stable KABI |
| ia32 multilib (32-bit on 64-bit) | Doubles syscall surface, complicates signal handling | Clean architecture |
| /dev/mem and /dev/kmem | Raw physical/kernel memory access | Capability-based security |
| Obsolete syscalls (~50+) | old_stat, socketcall, ipc multiplexer, etc. | Clean syscall surface |
| /sys/module/*/parameters | Tied to .ko module model | KABI-native configuration |
| Kernel cmdline module params | modname.param=val syntax tied to .ko model | KABI-native configuration |
| ioperm / iopl | Direct I/O port access from user space | Driver isolation |
| kexec (initially) | Complex interaction with driver model | Clean shutdown/recovery |

Obsolete syscalls not implemented (partial list): old_stat, old_lstat, old_fstat, socketcall, ipc (multiplexer), old_select, old_readdir, old_mmap, uselib, modify_ldt (except minimal for TLS), vm86, vm86old, set_thread_area (x86 only; use arch_prctl instead).

Only syscalls that current glibc (2.17+) and musl (1.2+) actually emit are implemented.


18.6 UmkaOS Native Syscall Interface

18.6.1 Motivation

UmkaOS implements ~80% of Linux syscalls natively with identical POSIX semantics — read, write, open, mmap, fork, socket, etc. are the kernel's own API. For these, the syscall entry point performs only representation conversion (untyped C ABI → typed Rust internals: int fd → CapHandle<FileDescriptor>, void *buf → UserPtr<T>), not semantic translation.

However, ~20% of operations fall into two categories where Linux's interface is fundamentally inadequate:

  1. Thin adaptation (~15%): Linux has an interface but it's untyped, fragmented, or encodes the wrong abstraction. Examples: ioctl(fd, MAGIC, void*) for driver interaction, clone3() flag explosion for process creation, prctl() as a catch-all for unrelated operations, five separate observability interfaces (perf, ftrace, sysfs, tracepoints, BPF).

  2. No Linux equivalent (~5%): UmkaOS has capabilities that Linux does not expose at all. Examples: capability delegation with attenuation, isolation domain management, distributed shared memory, per-cgroup power budgets.

For both categories, UmkaOS defines native syscalls that expose the full richness of the kernel's typed, capability-based model. These syscalls are available alongside the Linux-compatible interface — unmodified Linux applications continue to use Linux syscalls and work correctly; UmkaOS-aware applications can opt into the native interface for stronger typing, finer-grained control, and access to UmkaOS-specific features.

18.6.2 Design Principles

  • Native syscalls supplement, never replace, Linux-compatible ones. Every operation achievable via a native syscall must also be achievable via the Linux-compatible interface (even if with less type safety or fewer features). Linux applications never need UmkaOS-native syscalls.
  • Typed arguments. Native syscalls use fixed-layout Rust-compatible structs, not unsigned long catch-alls or void * blobs. Every argument is validated at the syscall entry point against the struct layout.
  • Capability-first. Native syscalls accept CapHandle arguments directly. Permission checks are explicit in the syscall signature, not hidden inside the implementation.
  • Versioned. Each native syscall struct includes a size: u32 field (like Linux's clone3 and openat2). The kernel handles smaller structs from older userspace by zero-filling new fields. This provides forward-compatible extensibility without syscall number proliferation.
  • Namespaced. All native syscalls use a single multiplexed entry point (umka_syscall(u32 op, *const u8 args, u32 args_size) -> i64) to consume only one syscall number from the Linux range. The op code selects the operation.
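
The versioned-struct principle can be sketched as follows. This is an illustrative fragment, not UmkaOS source: CapDeriveArgs and its fields are hypothetical, and the oversized-struct rejection is simplified — Linux's copy_struct_from_user (used by clone3/openat2) additionally accepts larger structs whose trailing bytes are all zero.

```rust
/// Hypothetical v2 argument struct for CAP_DERIVE. An older (v1) userspace
/// sends a shorter struct without `label`; the kernel zero-fills it.
#[repr(C)]
#[derive(Default, Clone, Copy, Debug)]
struct CapDeriveArgs {
    size: u32,        // caller fills in sizeof(its struct)
    src_handle: u64,
    rights_mask: u64,
    label: u64,       // added in v2 — zero when a v1 caller is talking to us
}

fn copy_versioned_args(user_bytes: &[u8]) -> Result<CapDeriveArgs, &'static str> {
    let kernel_size = core::mem::size_of::<CapDeriveArgs>();
    if user_bytes.len() > kernel_size {
        // Simplified: reject newer-than-kernel structs outright.
        return Err("E2BIG: userspace struct newer than kernel");
    }
    // Zero-initialize the current-version struct, then copy only the bytes
    // the caller provided. Fields the caller doesn't know about stay zero.
    let mut args = CapDeriveArgs::default();
    let dst = unsafe {
        // Safety: CapDeriveArgs is repr(C) and all-zero is a valid value.
        core::slice::from_raw_parts_mut(&mut args as *mut _ as *mut u8, kernel_size)
    };
    dst[..user_bytes.len()].copy_from_slice(user_bytes);
    Ok(args)
}
```

This test assumes a little-endian target (the struct's first field is the little-endian `size`).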

18.6.3 Syscall Families

/// UmkaOS native syscall operation codes.
/// Grouped by subsystem. Each family reserves a 256-entry range for
/// forward-compatible extension without renumbering.
pub mod umka_op {
    // ── Capability operations (0x0100 - 0x01FF) ──────────────────────
    /// Create a new capability with specified rights from an existing one.
    /// Equivalent to: dup() + fcntl() but typed and with attenuation.
    pub const CAP_DERIVE: u32    = 0x0100;
    /// Restrict an existing capability's permissions (irreversible).
    /// No Linux equivalent — fcntl cannot reduce permissions on an fd.
    pub const CAP_RESTRICT: u32  = 0x0101;
    /// Query the permission set of a capability handle.
    pub const CAP_QUERY: u32     = 0x0102;
    /// Revoke a specific capability by handle.
    pub const CAP_REVOKE: u32    = 0x0103;
    /// Delegate a capability to another process via IPC, with optional
    /// attenuation (reduced rights). The recipient receives a new handle
    /// with at most the permissions specified by the sender.
    pub const CAP_DELEGATE: u32  = 0x0104;

    // ── Typed driver interaction (0x0200 - 0x02FF) ───────────────────
    /// Invoke a typed KABI operation on a driver.
    /// Replaces: ioctl(fd, request, arg) with typed, versioned structs.
    /// The driver's KABI version is checked at invocation time.
    pub const DRV_INVOKE: u32    = 0x0200;
    /// Query a driver's supported KABI interfaces and versions.
    pub const DRV_QUERY: u32     = 0x0201;
    /// Subscribe to driver health/status events (structured, typed).
    /// Replaces: various sysfs polling and netlink listening patterns.
    pub const DRV_SUBSCRIBE: u32 = 0x0202;

    // ── Isolation domain management (0x0300 - 0x03FF) ────────────────
    /// Query the isolation tier and domain of a capability handle.
    pub const DOM_QUERY: u32     = 0x0300;
    /// Request domain statistics (cycle counts, fault counts, memory).
    pub const DOM_STATS: u32     = 0x0301;

    // ── Distributed operations (0x0400 - 0x04FF) ─────────────────────
    /// Allocate a distributed shared memory region.
    /// No Linux equivalent.
    pub const DSM_ALLOC: u32     = 0x0400;
    /// Map a remote DSM region into the local address space.
    pub const DSM_MAP: u32       = 0x0401;
    /// Set coherence policy for a DSM region (strict, relaxed, release).
    pub const DSM_SET_POLICY: u32 = 0x0402;
    /// Query cluster membership and node health.
    pub const CLUSTER_INFO: u32  = 0x0410;

    // ── Accelerator operations (0x0500 - 0x05FF) ─────────────────────
    /// Create an accelerator context (GPU, NPU, FPGA) with typed caps.
    /// Replaces: DRM_IOCTL_* and VFIO ioctls with unified typed API.
    pub const ACCEL_CTX_CREATE: u32  = 0x0500;
    /// Submit work to an accelerator context.
    pub const ACCEL_SUBMIT: u32      = 0x0501;
    /// Query accelerator utilization and health.
    pub const ACCEL_QUERY: u32       = 0x0502;
    /// Wait for accelerator fence completion.
    pub const ACCEL_FENCE_WAIT: u32  = 0x0503;

    // ── Power management (0x0600 - 0x06FF) ───────────────────────────
    /// Set per-cgroup power budget (watts).
    /// No Linux equivalent — Linux uses sysfs strings.
    pub const POWER_SET_BUDGET: u32  = 0x0600;
    /// Query current power consumption for a cgroup or domain.
    pub const POWER_QUERY: u32       = 0x0601;

    // ── Observability (0x0700 - 0x07FF) ──────────────────────────────
    /// Subscribe to structured kernel events (health, tracepoints, audit).
    /// Replaces: fragmented perf_event_open / ftrace / sysfs / netlink.
    pub const OBSERVE_SUBSCRIBE: u32 = 0x0700;
    /// Query kernel object by path in the unified object namespace (umkafs).
    pub const OBSERVE_QUERY: u32     = 0x0701;

    // ── Windows Emulation Acceleration (0x0800 - 0x08FF) ─────────────
    // WEA operations for WINE/Proton acceleration (Section 18.4).
    // These provide kernel-native NT-compatible primitives.

    /// Query WEA version and supported features.
    pub const WEA_VERSION_QUERY: u32     = 0x0800;
    /// Create an NT event object (manual-reset or auto-reset).
    pub const WEA_EVENT_CREATE: u32      = 0x0801;
    /// Open an existing named NT event object.
    pub const WEA_EVENT_OPEN: u32        = 0x0802;
    /// Set (signal) an NT event.
    pub const WEA_EVENT_SET: u32         = 0x0803;
    /// Reset (unsignal) an NT event.
    pub const WEA_EVENT_RESET: u32       = 0x0804;
    /// Pulse an NT event (signal and immediately reset).
    pub const WEA_EVENT_PULSE: u32       = 0x0805;
    /// Create an NT mutex object.
    pub const WEA_MUTEX_CREATE: u32      = 0x0810;
    /// Create an NT semaphore object.
    pub const WEA_SEMAPHORE_CREATE: u32  = 0x0811;
    /// Wait for a single NT object to become signaled.
    pub const WEA_WAIT_SINGLE: u32       = 0x0820;
    /// Wait for multiple NT objects (WaitAny or WaitAll semantics).
    pub const WEA_WAIT_MULTIPLE: u32     = 0x0821;
    /// Create an NT section (memory-mapped file or shared memory).
    pub const WEA_SECTION_CREATE: u32    = 0x0830;
    /// Map a view of an NT section into the process address space.
    pub const WEA_SECTION_MAP: u32       = 0x0831;
    /// Unmap a view of an NT section.
    pub const WEA_SECTION_UNMAP: u32     = 0x0832;
    /// Create an I/O completion port.
    pub const WEA_IOCP_CREATE: u32       = 0x0840;
    /// Associate a file with an IOCP.
    pub const WEA_IOCP_ASSOCIATE: u32    = 0x0841;
    /// Post a completion packet to an IOCP.
    pub const WEA_IOCP_POST: u32         = 0x0842;
    /// Dequeue a completion packet from an IOCP.
    pub const WEA_IOCP_GET: u32          = 0x0843;
    /// Memory operations with Windows semantics (reserve/commit/reset).
    pub const WEA_VIRTUAL_ALLOC: u32     = 0x0850;
    /// Change memory protection with old-protection output.
    pub const WEA_VIRTUAL_PROTECT: u32   = 0x0851;
    /// Lock pages in physical memory.
    pub const WEA_VIRTUAL_LOCK: u32      = 0x0852;
    /// Queue an APC to a thread.
    pub const WEA_APC_QUEUE: u32         = 0x0860;
    /// Alert a thread (deliver queued APCs).
    pub const WEA_ALERT_THREAD: u32      = 0x0861;
    /// Open a process token for security queries.
    pub const WEA_TOKEN_OPEN: u32        = 0x0870;
    /// Query token information (user, groups, privileges).
    pub const WEA_TOKEN_QUERY: u32       = 0x0871;
    /// Close an NT handle.
    pub const WEA_HANDLE_CLOSE: u32      = 0x08F0;
    /// Duplicate an NT handle.
    pub const WEA_HANDLE_DUP: u32        = 0x08F1;
}
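
Because each family reserves a 256-entry range, the high byte of the op code identifies the subsystem. A dispatcher can route whole families without enumerating every operation — a minimal sketch (`op_family` is a hypothetical helper; the real dispatcher routes to typed handlers, not strings):

```rust
/// Map an umka_op code to its subsystem family via the high byte.
fn op_family(op: u32) -> &'static str {
    match op >> 8 {
        0x01 => "capability",
        0x02 => "driver",
        0x03 => "isolation domain",
        0x04 => "distributed",
        0x05 => "accelerator",
        0x06 => "power",
        0x07 => "observability",
        0x08 => "WEA",
        _ => "unknown", // unassigned range: return -ENOSYS in the real kernel
    }
}
```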

18.6.4 Userspace Library

Native syscalls are accessed through libisle, a thin userspace library that provides:

  • C API with proper types (umka_cap_derive(), umka_drv_invoke(), etc.)
  • Rust bindings via umka-sys crate (zero-cost wrappers over the raw syscall)
  • Version negotiation: libisle checks kernel version at init and uses the appropriate struct sizes for forward/backward compatibility

Applications link against libisle. The library detects at runtime whether it is running on an UmkaOS kernel (via /proc/version or uname) and returns -ENOSYS on non-UmkaOS kernels, allowing portable applications to fall back to Linux-compatible interfaces.
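
The fallback behaviour can be sketched as follows. The raw entry point is stubbed here to model running on a non-UmkaOS kernel; `raw_umka_syscall` and `wea_version` are illustrative names, not the actual libisle API:

```rust
const ENOSYS: i64 = -38; // Linux errno value for "syscall not implemented"

/// Stand-in for the raw multiplexed entry point. On UmkaOS this issues
/// umka_syscall(op, args, args_size); on any other kernel the syscall
/// number is unknown and the call fails, modeled here as -ENOSYS.
fn raw_umka_syscall(_op: u32, _args: *const u8, _args_size: u32) -> i64 {
    ENOSYS
}

/// libisle-style probe: query WEA support at init. 0x0800 is
/// umka_op::WEA_VERSION_QUERY from Section 18.6.3.
fn wea_version() -> Option<u32> {
    match raw_umka_syscall(0x0800, std::ptr::null(), 0) {
        v if v >= 0 => Some(v as u32),
        _ => None, // -ENOSYS: fall back to Linux-compatible interfaces
    }
}
```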

18.6.5 Relationship to Linux Syscalls

                    ┌──────────────────────────────────────┐
                    │        Userspace Application         │
                    └───────────┬──────────┬───────────────┘
                                │          │
                    Linux API   │          │  UmkaOS Native API
                    (glibc)     │          │  (libisle)
                                │          │
                    ┌───────────▼──────────▼───────────────┐
                    │      Syscall Entry Point             │
                    │  ┌──────────┐  ┌──────────────────┐  │
                    │  │ Linux    │  │ umka_syscall()   │  │
                    │  │ nr →     │  │ op + typed args  │  │
                    │  │ dispatch │  │ → dispatch       │  │
                    │  └────┬─────┘  └────┬─────────────┘  │
                    │       │             │                │
                    │       ▼             ▼                │
                    │  ┌──────────────────────────────┐    │
                    │  │ Internal Typed Kernel API    │    │
                    │  │ (CapHandle, UserPtr, etc.)   │    │
                    │  └──────────────────────────────┘    │
                    └──────────────────────────────────────┘

Both paths converge to the same internal kernel API. A read() via Linux's syscall(0, fd, buf, count) and a hypothetical umka_read() via umka_syscall(op, args, size) call the same internal vfs_read(cap_handle, user_ptr, count) function. The native path skips the fd → CapHandle lookup (the caller already holds a CapHandle) and avoids the void* → UserPtr validation (the struct is pre-typed). For most operations the performance difference is negligible; for high-frequency driver interaction (DRV_INVOKE replacing ioctl), the typed path avoids the ioctl dispatch switch and provides measurably lower overhead.


18.7 Safe Kernel Extensibility

18.7.1 The Paradigm

The most important OS innovation of the last decade is eBPF: user-injected verified code in kernel hot paths. But eBPF is limited by being bolted onto a C kernel with a conservative bytecode verifier.

UmkaOS can generalize this: every kernel policy is a safe, hot-swappable module.

Distinction from eBPF (Section 18.1.4): eBPF provides Linux-compatible user-to-kernel hooks for tracing, networking, and security — it serves the Linux ecosystem. Policy modules provide kernel-internal mechanism/policy separation via KABI vtables — they serve kernel evolution. Both coexist; they address different extensibility needs.

Current KABI model (Section 11.1):
  Drivers implement KABI vtables for device interaction.
  Drivers are hot-swappable (crash recovery, Section 10.8).
  Drivers run in isolation domains.

Generalized KABI model (this proposal):
  POLICIES also implement KABI vtables.
  Policies are hot-swappable (same mechanism as drivers).
  Policies run in isolation domains (same as Tier 1 drivers).

  The kernel provides MECHANISMS (scheduling, page tables, memory allocation).
  POLICY MODULES provide DECISIONS (which process runs next,
  which page to evict, how to route I/O).

18.7.2 Extensible Policy Points

// umka-core/src/policy/mod.rs

/// Policy points where the kernel delegates decisions to a module.
/// Each policy point has a default built-in implementation.
/// Custom modules can replace the default at runtime.

// --- Policy context and parameter types ---

/// Read-only snapshot of scheduling state, captured under the runqueue lock
/// and passed to policy modules across the trust boundary. Policy modules
/// never see raw runqueue pointers.
pub struct SchedPolicyContext {
    /// Number of runnable tasks on this CPU's runqueue.
    pub nr_running: u32,
    /// Total weighted load on this CPU (PELT sum).
    pub cpu_load: u64,
    /// Per-task metadata for each runnable task (bounded by nr_running).
    /// Contains task ID, nice value, weight, vruntime, lag, and cgroup ID.
    pub tasks: ArrayVec<TaskSnapshot, MAX_RUNQUEUE_SNAPSHOT>,
    /// Current CPU frequency (kHz), for EAS-aware scheduling.
    pub cpu_freq_khz: u32,
    /// NUMA node ID of this CPU.
    pub numa_node: u8,
}

/// Flags passed to `enqueue_task()` indicating why the task became runnable.
pub struct EnqueueFlags(u32);
impl EnqueueFlags {
    /// Task was just created (fork/clone).
    pub const ENQUEUE_NEW: Self = Self(1 << 0);
    /// Task woke from sleep (futex, poll, etc.).
    pub const ENQUEUE_WAKEUP: Self = Self(1 << 1);
    /// Task was migrated from another CPU.
    pub const ENQUEUE_MIGRATE: Self = Self(1 << 2);
    /// Task was restored after preemption.
    pub const ENQUEUE_RESTORE: Self = Self(1 << 3);
}

/// Decision returned by `balance_load()`.
pub enum MigrateDecision {
    /// Do nothing — CPUs are balanced.
    NoAction,
    /// Migrate `count` tasks from `busiest_cpu` to `this_cpu`.
    Migrate { count: u32 },
    /// Defer decision — not enough data yet (e.g., PELT hasn't converged).
    Defer,
}

/// Block I/O request descriptor (read-only view for policy modules).
pub struct IoRequest {
    /// Logical block address (start of I/O range).
    pub lba: u64,
    /// Number of sectors.
    pub sector_count: u32,
    /// Operation type (read, write, discard, flush).
    pub op: IoOp,
    /// Originating process ID (for cgroup accounting).
    pub pid: ProcessId,
    /// Submission timestamp (monotonic ns).
    pub submit_ns: u64,
    /// I/O priority class and level.
    pub ioprio: u16,
}

/// Priority score returned by `IoSchedPolicy::submit()`.
/// Higher scores are dispatched first. Opaque to umka-core — the policy
/// module defines the scoring function.
pub struct IoScore(pub i64);

/// Minimal packet header view for network classification.
/// Contains only the fields needed for QoS decisions, not the full packet.
pub struct PacketHeader {
    /// Source/destination IP (v4 or v6) and ports.
    pub src_addr: IpAddr,
    pub dst_addr: IpAddr,
    pub src_port: u16,
    pub dst_port: u16,
    /// IP protocol number (TCP=6, UDP=17, etc.).
    pub protocol: u8,
    /// DSCP value from IP header.
    pub dscp: u8,
    /// Packet length (bytes).
    pub len: u32,
}

/// Classification result for a network packet.
pub struct NetClass {
    /// Priority queue index (0 = best effort, higher = higher priority).
    pub queue: u8,
    /// Traffic class mark (for tc/iptables compatibility).
    pub mark: u32,
    /// Drop eligibility (for ECN/WRED).
    pub drop_eligible: bool,
}

/// Flags describing the allocation context (for tiering decisions).
pub struct AllocFlags(u32);
impl AllocFlags {
    /// Page is for anonymous memory (heap, stack).
    pub const ANONYMOUS: Self = Self(1 << 0);
    /// Page is for file-backed mapping (page cache).
    pub const FILE_BACKED: Self = Self(1 << 1);
    /// Page is for a memory-mapped device region.
    pub const DEVICE: Self = Self(1 << 2);
    /// Allocation is on the fault path (latency-sensitive).
    pub const FAULT: Self = Self(1 << 3);
    /// Hint: page is likely short-lived.
    pub const TRANSIENT: Self = Self(1 << 4);
}

/// Memory tier identifier. Discovery-based (see Section 4.1.8 NUMA topology).
pub struct TierId(pub u8);

/// Tiering decision for a page.
pub enum TierDecision {
    /// Keep page in current tier.
    Keep,
    /// Demote to the specified lower tier (e.g., CXL, compressed, swap).
    Demote(TierId),
    /// Compress in place (same tier, compressed representation).
    Compress,
}

/// NUMA migration advice for a page.
pub enum MigrateAdvice {
    /// Keep page on current NUMA node.
    Stay,
    /// Migrate to the specified NUMA node (closer to accessing CPU).
    MigrateTo(u8),
}

/// CPU scheduling policy.
///
/// Policy modules receive a `SchedPolicyContext` snapshot (Section 18.7.3), NOT a direct
/// reference to the locked runqueue. The snapshot is captured by umka-core under
/// the runqueue lock before the domain switch, ensuring consistency without
/// exposing internal kernel data structures across the trust boundary.
pub trait SchedPolicy: Send + Sync {
    /// Pick the next task to run on this CPU.
    fn pick_next_task(&self, cpu: CpuId, ctx: &SchedPolicyContext) -> Option<TaskId>;
    /// A task has become runnable. Decide where to enqueue it.
    fn enqueue_task(&self, task: TaskId, flags: EnqueueFlags);
    /// A task has yielded or exhausted its timeslice.
    fn task_tick(&self, task: TaskId, cpu: CpuId);
    /// Load balancing decision: should we migrate tasks between CPUs?
    fn balance_load(&self, this_cpu: CpuId, busiest_cpu: CpuId) -> MigrateDecision;
}
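To make the shape concrete, here is a trivial FIFO implementation of pick_next_task. TaskId, CpuId, and SchedPolicyContext are reduced to minimal self-contained stand-ins for this sketch, and the trait is cut down to the one method the sketch exercises; the real trait has four methods and a richer context.

```rust
#[derive(Clone, Copy, PartialEq, Debug)]
pub struct TaskId(pub u64);

#[derive(Clone, Copy)]
pub struct CpuId(pub u32);

/// Read-only snapshot captured by umka-core under the runqueue lock.
pub struct SchedPolicyContext {
    /// Runnable tasks on this CPU, in enqueue order.
    pub runnable: Vec<TaskId>,
}

pub trait SchedPolicy: Send + Sync {
    fn pick_next_task(&self, cpu: CpuId, ctx: &SchedPolicyContext) -> Option<TaskId>;
}

pub struct FifoPolicy;

impl SchedPolicy for FifoPolicy {
    fn pick_next_task(&self, _cpu: CpuId, ctx: &SchedPolicyContext) -> Option<TaskId> {
        // FIFO: the head of the snapshot is the longest-waiting task.
        ctx.runnable.first().copied()
    }
}
```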

/// Maximum pages to scan in a single eviction batch. Sized to fit within
/// a single 4 KB stack frame (each PageHandle is ~16 bytes).
const MAX_SCAN_BATCH: usize = 64;

/// Page replacement policy (which pages to evict under memory pressure).
pub trait PagePolicy: Send + Sync {
    /// Select pages to evict from this zone.
    /// Returns results via a caller-provided fixed-capacity buffer (ArrayVec)
    /// since nr_to_scan is bounded by the zone scan batch size. Policy modules
    /// must not heap-allocate on the eviction hot path.
    fn select_victims(&self, zone: &Zone, nr_to_scan: u32, out: &mut ArrayVec<PageHandle, MAX_SCAN_BATCH>);
    /// Should this page be promoted to a higher tier (active list, huge page)?
    fn should_promote(&self, page: &PageHandle) -> bool;
    /// Migration decision: should this page move to a different NUMA node?
    fn migration_advice(&self, page: &PageHandle, current_node: u8) -> MigrateAdvice;
}

/// I/O scheduling policy (ordering of block I/O requests).
pub trait IoSchedPolicy: Send + Sync {
    /// Submit a new I/O request. Return its priority score.
    fn submit(&self, req: &IoRequest) -> IoScore;
    /// Pick the next I/O request to dispatch to the device.
    fn dispatch(&self, queue: &IoQueue) -> Option<IoRequestId>;
    /// A request has completed. Update internal state.
    fn complete(&self, req: &IoRequest, latency_ns: u64);
}
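A hypothetical submit() scoring function in the spirit of a deadline scheduler: reads and flushes outrank writes, and the score grows with queueing age so old requests cannot starve. IoOp is a stand-in matching the IoRequest field above; the bonus constant and the microsecond age unit are illustrative, not UmkaOS-specified values.

```rust
#[allow(dead_code)]
#[derive(Clone, Copy)]
pub enum IoOp { Read, Write, Discard, Flush }

pub struct IoScore(pub i64);

/// Score = class bonus + microseconds queued. Reads/flushes always outrank
/// writes younger than ~1 second; within a class, dispatch is FIFO by age.
pub fn score(op: IoOp, submit_ns: u64, now_ns: u64) -> IoScore {
    let age_us = (now_ns.saturating_sub(submit_ns) / 1_000) as i64;
    let class_bonus = match op {
        IoOp::Read | IoOp::Flush => 1_000_000,   // latency-sensitive
        IoOp::Write | IoOp::Discard => 0,        // throughput-oriented
    };
    IoScore(class_bonus + age_us)
}
```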

/// **IoRequestId validation** (required for safety when IoSchedPolicy runs in an
/// isolation domain that may be buggy or compromised):
///
/// Before dispatching any I/O request selected by the policy, the kernel MUST verify
/// that the returned `IoRequestId` exists in the device's live request queue.
/// If the ID is not found, the dispatch is skipped, a warning is logged, and a violation is counted.
///
/// Validation algorithm:
pub const IO_POLICY_MAX_VIOLATIONS: u32 = 3;

// In the dispatch path:
// match io_queue.pending.get(&selected_id) {
//     Some(request) => dispatch(request),
//     None => {
//         log::warn!("IoSchedPolicy returned invalid IoRequestId {:?} — skipping", selected_id);
//         policy_state.violation_count += 1;
//         if policy_state.violation_count >= IO_POLICY_MAX_VIOLATIONS {
//             log::error!("IoSchedPolicy evicted after {} violations", IO_POLICY_MAX_VIOLATIONS);
//             evict_policy(policy_handle);
//         }
//     }
// }
//
// Lookup is O(1) via the `pending: HashMap<IoRequestId, IoRequest>` that already exists
// for request tracking. This adds no overhead on the common (valid) path.

/// Network classification policy (packet prioritization, QoS).
pub trait NetClassPolicy: Send + Sync {
    /// Classify an incoming packet (assign priority, mark, queue).
    fn classify_rx(&self, packet: &PacketHeader) -> NetClass;
    /// Classify an outgoing packet.
    fn classify_tx(&self, packet: &PacketHeader) -> NetClass;
}
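A hypothetical classify function keyed only on the DSCP field, with NetClass as a stand-in for the struct above. The queue assignments are illustrative, not a specified UmkaOS mapping.

```rust
pub struct NetClass {
    pub queue: u8,
    pub mark: u32,
    pub drop_eligible: bool,
}

/// DSCP-driven classification sketch: EF traffic to the top queue,
/// CS/AF classes above best effort. Queue numbering follows the
/// convention above (0 = best effort).
pub fn classify_by_dscp(dscp: u8) -> NetClass {
    let queue = match dscp {
        46 => 7,        // EF (expedited forwarding): voice
        32..=47 => 5,   // CS4-CS5, AF4x: interactive video
        8..=31 => 2,    // CS1-CS3, AF1x-AF3x: bulk/transactional
        _ => 0,         // CS0 / unknown: best effort
    };
    NetClass { queue, mark: dscp as u32, drop_eligible: queue == 0 }
}
```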

/// Memory tiering policy (which tier to place pages in).
pub trait TierPolicy: Send + Sync {
    /// Where should a newly allocated page go?
    fn initial_placement(&self, process: ProcessId, flags: AllocFlags) -> TierId;
    /// A page has been idle for N ticks. Should it be demoted?
    fn demotion_advice(&self, page: &PageHandle, idle_ticks: u32) -> TierDecision;
    /// A remote node has available memory. Should we use it?
    fn remote_tier_advice(&self, node_id: NodeId, available_bytes: u64) -> bool;
}
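A hypothetical initial_placement keyed on the AllocFlags bits above. The tier numbering (0 = nearest DRAM, 1 = capacity tier) is an assumption of this sketch; real tier IDs are discovery-based per Section 4.1.8.

```rust
#[derive(Clone, Copy, PartialEq, Debug)]
pub struct TierId(pub u8);

const FILE_BACKED: u32 = 1 << 1; // AllocFlags::FILE_BACKED
const DEVICE: u32 = 1 << 2;      // AllocFlags::DEVICE
const FAULT: u32 = 1 << 3;       // AllocFlags::FAULT

pub fn initial_placement(flags: u32) -> TierId {
    if flags & (DEVICE | FAULT) != 0 {
        TierId(0) // latency-sensitive: nearest DRAM tier
    } else if flags & FILE_BACKED != 0 {
        TierId(1) // streaming page cache: capacity tier (e.g., CXL)
    } else {
        TierId(0) // anonymous default: nearest DRAM
    }
}
```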

18.7.3 Policy Module Trust Boundary

Memory access scope: When a policy module runs in its own isolation domain, the kernel maps into that domain (read-only):

  • Run queue metadata (task count, utilization, per-CPU load)
  • Per-task scheduling metadata (priority, PELT state, cgroup membership)
  • System-wide metrics (total CPU count, NUMA topology, frequency domains)

The module CANNOT access: process memory, page contents, file data, network buffers, capability tables, or other modules' state. A rogue pick_next_task cannot scan process memory — hardware domain isolation prevents it.

Locking model: The kernel calls policy module functions with no cross-domain locks held. Per-CPU scheduler state (the runqueue) is locked by the caller; the policy module receives a read-only snapshot of the runqueue state via the SchedPolicyContext argument, not direct access to the locked runqueue. This prevents TOCTOU races: the snapshot is consistent because it is captured under the runqueue lock before the domain switch. The module manages its own internal synchronization (spinlocks, per-CPU data, RCU-like patterns). If the module deadlocks internally, the domain watchdog (timer-based, ~10ms timeout) detects the stuck call and triggers crash recovery — revert to built-in default policy, reload module.

Stateful modules: Traits require Send + Sync, but modules need mutable state (counters, queues, learned parameters). The module owns its state and provides interior mutability via its own locks. The kernel does not hold locks on the module's behalf — the module is a self-contained unit.

NMI safety: Policy modules are never called from NMI context. The kernel's NMI handler performs only minimal work (perf sampling, watchdog) and never invokes policy module callbacks. This eliminates the risk of NMI-induced deadlock when a module holds an internal spinlock. If a future requirement arises for NMI-context policy invocation, the module trait must require try-lock semantics with a fallback to the built-in default policy on lock contention.

18.7.4 Side-Channel Mitigations

Domain isolation prevents direct memory reads across domain boundaries, but policy modules run in Ring 0 and share hardware resources with the core kernel. This opens side-channel vectors that domain isolation alone does not address.

Threat model: An untrusted or experimental module running in its own isolation domain could exploit:

1. Shared-cache timing attacks (L1/L2/LLC) — measure cache line eviction timing to infer kernel memory access patterns.
2. Speculative execution side-channels (Spectre v1 bounds check bypass) — trick the CPU into speculatively reading kernel data across the isolation domain boundary.
3. Timing observation — use high-resolution timers (rdtsc, cycle counters) to measure the duration of kernel operations and infer internal state.

Mitigations:

  • Cache partitioning: Intel CAT (Cache Allocation Technology) / ARM MPAM (Memory System Resource Partitioning and Monitoring) partitions LLC ways so that an untrusted module's cache allocation does not overlap with the core kernel's. Configured per isolation domain at module load time. On architectures without hardware cache partitioning, cache flushing on domain transitions provides a weaker but functional defense.

  • Timer resolution reduction: On AArch64, clearing CNTKCTL_EL1.EL0PCTEN traps EL0 reads of the physical cycle counter, allowing the kernel to return a coarsened value. On x86, policy modules run in Ring 0, where rdtsc executes unconditionally regardless of CR4.TSD (the Intel SDM specifies that CR4.TSD=1 traps rdtsc only at CPL > 0, not at CPL 0). Ring 0 code therefore has full rdtsc access, and the side-channel mitigation for Ring 0 policy modules on x86 relies on Intel CAT (LLC partitioning, described above) and cache flushing on domain transitions — not on timer coarsening. This is a deliberate acknowledgment that Ring 0 untrusted modules have the same timing access as any Ring 0 code; cache partitioning and flushing are the effective mitigations at this privilege level.

    Recommendation: policy modules should use the kernel's monotonic clock abstraction (a ktime_get_ns() equivalent) rather than raw rdtsc / cycle counter reads, unless high-precision timing is explicitly required and the module is production-vetted (trusted). The kernel's time API provides sufficient resolution for scheduling and power decisions (~1 ns on modern hardware) while keeping a single auditable timing interface. Untrusted modules that bypass the time API and read rdtsc directly can serve as timing oracles for side-channel attacks; code review should flag such usage.

  • Constant-time helpers: The kernel provides constant-time comparison and lookup functions for any data that crosses the domain boundary into module-readable memory. This prevents modules from using timing differences to distinguish data values.

  • Spectre v1 barriers: All kernel→module data handoff uses lfence (x86) / csdb (ARM) speculation barriers. Module-provided indices into kernel arrays are bounds-checked with an array_index_nospec equivalent (index masking) before use.
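The index-masking step can be modeled in user space as follows. Linux's array_index_nospec computes the mask in inline assembly so the compiler cannot reintroduce a conditional branch; this portable sketch shows only the arithmetic.

```rust
/// Branchless bounds mask: returns `index` unchanged when `index < len`,
/// and 0 otherwise. Because no conditional branch is involved, the clamp
/// also holds under misspeculation, so a speculative load cannot use an
/// attacker-controlled out-of-range index.
/// Assumes `index` and `len` both fit in `isize` (true for slice lengths).
pub fn index_nospec(index: usize, len: usize) -> usize {
    // (index - len) is negative exactly when index < len; the arithmetic
    // right shift smears the sign bit into an all-ones or all-zeros mask.
    let mask = ((index as isize).wrapping_sub(len as isize) >> (usize::BITS - 1)) as usize;
    index & mask
}
```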

Residual risk: Production-vetted modules (signed, running in the Core isolation domain) face the same side-channel exposure as any Ring 0 code — this is acceptable since they are fully trusted. Side-channel mitigations apply only to untrusted/experimental modules running in isolation domains. This is a deliberate trade-off: production modules get zero overhead, experimental modules get strong isolation at a small performance cost.

18.7.5 Module Lifecycle

Policy module lifecycle (same as driver lifecycle, Section 10.8):

1. Module binary is compiled Rust (same toolchain as kernel).
   Implements one or more policy traits via KABI vtable.
   Signed with driver signature mechanism (Section 8.2.5).
   Vtable uses same versioning as driver KABI: vtable_size field +
   InterfaceVersion check. A kernel upgrade that adds new methods to
   SchedPolicy extends the vtable — old modules still work (new methods
   fall back to built-in defaults based on vtable_size).

2. Module is loaded at runtime:
   echo "sched_ml_aware" > /sys/kernel/umka/policy/scheduler/active

3. Kernel:
   a. Verifies module signature.
   a2. Extends TPM PCR (or TDX RTMR) with module hash. For confidential
       computing attestation (Section 8.6), the loaded policy module is part of the
       Trusted Computing Base and must be measured.
   b. Allocates isolation domain for the module (if untrusted/experimental).
      Production-vetted modules (signed by kernel vendor, pre-verified)
      run in the Core isolation domain — zero domain transition overhead.
   c. Loads module code into isolated memory region.
   d. KABI vtable exchange (module provides policy vtable).
   e. Atomically swaps old policy for new policy.
   f. Old policy module can be unloaded.

4. Module crash:
   a. Domain fault trapped by kernel.
   b. Revert to built-in default policy (immediate, no interruption).
   c. Reload module if desired.
   d. Total disruption: zero. Built-in default handles the gap.

5. Module hot-swap:
   echo "sched_cfs_isle" > /sys/kernel/umka/policy/scheduler/active
   → Atomic swap to new policy. No interruption.
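The vtable_size check from step 1 can be sketched as follows. All names and the two-revision layout are hypothetical; note that this user-space model must still populate the v2 slot in the v1 instance, whereas a real v1 module's vtable simply ends earlier in memory.

```rust
use std::mem::size_of;

// Hypothetical two-revision vtable. A v1 module fills only the slots up to
// pick_next_task and reports the v1 size; the kernel gates access to the
// appended balance_load slot on vtable_size.
#[repr(C)]
pub struct SchedVtable {
    pub vtable_size: usize,
    pub pick_next_task: fn(cpu: u32) -> u32,
    pub balance_load: fn(this: u32, busiest: u32) -> u32, // appended in v2
}

/// Size a v1 module would report: header + one function pointer.
pub const V1_SIZE: usize = size_of::<usize>() + size_of::<fn(u32) -> u32>();

fn builtin_default_balance(_this: u32, _busiest: u32) -> u32 {
    0 // built-in default decision
}

/// Kernel-side dispatch: call the module's balance_load only if its vtable
/// is large enough to contain that slot; otherwise use the built-in default.
pub fn call_balance(vt: &SchedVtable, this: u32, busiest: u32) -> u32 {
    if vt.vtable_size >= size_of::<SchedVtable>() {
        (vt.balance_load)(this, busiest)
    } else {
        builtin_default_balance(this, busiest)
    }
}

fn dummy_pick(_cpu: u32) -> u32 { 0 }
fn module_balance(_this: u32, _busiest: u32) -> u32 { 42 }
```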

18.7.6 Relationship to eBPF

eBPF compatibility is maintained through umka-compat. Existing eBPF programs (XDP, tc, kprobes, tracepoints) work via the BPF syscall. Policy modules are a superset — they can do everything eBPF can do plus:

  • Full Rust expressiveness (loops, recursion, complex data structures)
  • Persistent mutable state (eBPF maps are limited)
  • Domain isolation instead of bytecode verifier (more flexible, same safety)
  • Crash recovery (eBPF programs can't crash; policy modules can, and are reloaded)
                        eBPF (Linux compat)      Policy Modules (UmkaOS)
Safety mechanism:       Bytecode verifier         Rust type system + domain isolation
Language:               BPF bytecode (limited)    Rust (full language)
State:                  BPF maps (key-value)      Any Rust data structure
Crash behavior:         Cannot crash              Crash → reload, default resumes
Hot-swap:               Per-program               Per-policy-point
Integration depth:      Hook points only          Full vtable interface

18.7.7 Linux Compatibility

sched_ext (Linux 6.12+) allows user-defined BPF scheduling policies. UmkaOS supports this through umka-compat:

  • sched_ext BPF programs load via the standard bpf() syscall
  • They run in the BPF compatibility layer
  • Performance and behavior identical to Linux sched_ext

Policy modules are an additional, UmkaOS-specific mechanism. Applications unaware of them see standard scheduling behavior.

Module Observability:

Policy modules emit structured tracepoints for every decision:

  • umka_tp_stable_policy_decision: emitted on each pick_next_task, select_victims, dispatch call. Fields: module name, decision type, chosen entity, alternatives considered, decision latency.
  • umka_tp_stable_policy_audit: decision audit log for compliance. Records which module made which resource allocation decision, enabling post-hoc analysis.
  • A/B comparison mode: two policy modules can run simultaneously — one active (making real decisions) and one shadow (receiving the same inputs, logging what it would have decided). Compare via policy.comparison_log in sysfs. This enables safe evaluation of new policies before activation.

18.7.8 Performance Impact

Indirect function call via vtable pointer: ~1-2ns (branch predictor handles it). Linux already uses the same pattern (sched_class->pick_next_task is a function pointer). Same cost as Linux.

Default (production-vetted modules): modules signed by the kernel vendor and pre-verified run in the Core isolation domain. Zero domain transition overhead. Same cost as Linux sched_class function pointer dispatch.

Untrusted/experimental modules: run in their own isolation domain. Each policy call crosses the domain boundary twice (enter + exit), and each crossing costs one domain register switch (WRPKRU on x86, POR_EL0+ISB on AArch64, DACR on ARMv7) at ~23 cycles per Section 10.2, for 2 × 23 = 46 cycles per call. For scheduling, the policy is called once per context switch (~200 cycles); adding 46 cycles to 200 is ~23% overhead on the context-switch micro-path. This is the cost of sandbox isolation for unvetted code, and it is acceptable for development and experimentation. The module graduates to the Core isolation domain after vetting.

(Note: WRPKRU latency varies by microarchitecture — measured at 11 cycles on Alder Lake, 23 cycles on Skylake, and up to 260 cycles on some Atom cores. The 23-cycle figure used throughout this section reflects Skylake-class server parts; overhead on other microarchitectures scales proportionally. The worst case (Atom, 260 cycles) would increase the domain-transition overhead by ~11x, but Atom-class cores are not a primary UmkaOS server target.)

18.7.9 Policy Module Error Handling and Fallback

When a policy module's vtable function returns an error or panics:

Error return handling:

  • Policy modules return Result<PolicyAction, PolicyError>.
  • PolicyError::TemporaryFailure: the kernel retries the policy call up to 3 times with exponential backoff (1ms, 2ms, 4ms). If all retries fail, the system uses the default action for the hook (defined in the hook's .kabi registration).
  • PolicyError::PermanentFailure: the module is immediately marked ModuleState::Degraded. No retries. The default action is used for all subsequent calls to this hook until the module is replaced.
  • PolicyError::InvalidState: indicates a bug in the policy module. The module is marked Degraded and an FMA fault event is emitted.
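The TemporaryFailure retry path can be sketched as follows. The hook's action is modeled as a bare u32, call_policy stands in for the real vtable invocation, and the kernel's backoff sleep is elided.

```rust
#[allow(dead_code)]
pub enum PolicyError {
    TemporaryFailure,
    PermanentFailure,
    InvalidState,
}

/// Retry a failing policy hook up to 3 times (1ms, 2ms, 4ms backoff),
/// falling back to the hook's registered default action.
pub fn call_with_retry<F>(mut call_policy: F, default_action: u32) -> u32
where
    F: FnMut() -> Result<u32, PolicyError>,
{
    const BACKOFF_MS: [u64; 3] = [1, 2, 4];
    let mut result = call_policy();
    for _ms in BACKOFF_MS {
        match result {
            Ok(action) => return action,
            Err(PolicyError::TemporaryFailure) => {
                // The kernel sleeps _ms milliseconds here; elided in this sketch.
                result = call_policy();
            }
            // PermanentFailure / InvalidState: mark Degraded, no retries.
            Err(_) => return default_action,
        }
    }
    result.unwrap_or(default_action)
}
```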

Panic handling: Policy modules run in kernel context. A panic in a Tier 1 policy module triggers the Tier 1 crash recovery mechanism (Section 10.8): the module is reloaded, and its state is reset to the initial registration state. All policy calls during the reload window use the default action.

Default actions (registered at module load time in .kabi declaration):

pub enum DefaultPolicyAction {
    /// Permit the operation (fail-open). Used for performance-advisory hooks
    /// where denying would break functionality.
    Permit,
    /// Deny the operation (fail-closed). Used for security enforcement hooks
    /// where permitting would be unsafe.
    Deny,
    /// Use the previous module's decision (chain to next policy module).
    /// Falls back to Permit if no other module is registered.
    Chain,
}

Monitoring: Each policy module has a policy_error_count, policy_retry_count, and policy_degraded_since field in its FMA health struct, accessible via umkafs at /System/Kernel/policy_modules/{module_name}/.


18.8 Special File Descriptor Objects

Linux exposes several kernel objects through the file descriptor abstraction: event counters, signal queues, timers, and process references. These are not files in any meaningful sense — they are kernel objects that happen to use the fd slot mechanism for lifecycle management and I/O multiplexing integration. UmkaOS implements all four as first-class fd types with exact Linux wire semantics and improved internal implementations.

All four fd types share a common structural principle: each is a SpecialFile variant in the VFS layer, backed by a FileDescription struct with a concrete implementation of FileOps. Poll readiness is reported through the standard FileOps::poll() trait method, which integrates transparently with poll(2), select(2), and epoll(2). No separate fd type registry or global lock is required — each fd object is self-contained.

The Linux compatibility goal for all four types is exact wire compatibility with Linux 6.1 LTS: identical syscall numbers, identical flag values, identical struct layouts, identical errno values, and identical edge-case semantics. Each subsection documents the wire format and any UmkaOS-specific improvements to the internal implementation.

18.8.1 eventfd — Event Notification Counter

Syscall Interface

eventfd(initval: u32) -> fd | -EMFILE | -ENOMEM
eventfd2(initval: u32, flags: u32) -> fd | -EINVAL | -EMFILE | -ENOMEM

eventfd is the older form (no flags parameter); eventfd2 added flags. UmkaOS implements both syscall numbers with a unified path that treats eventfd as eventfd2 with flags = 0. The initval argument sets the initial counter value; because it is a 32-bit quantity, it always falls within the valid counter range [0, ULLONG_MAX - 1], so no -EINVAL case arises from initval.

Flags

Flag Value Meaning
EFD_CLOEXEC O_CLOEXEC (02000000) Set close-on-exec on the returned fd
EFD_NONBLOCK O_NONBLOCK (04000) Set O_NONBLOCK on the file description
EFD_SEMAPHORE 1 Semaphore semantics for read()

Any flags value with bits other than these three set returns -EINVAL.

Read and Write Semantics

write(fd, &val: u64, 8):

  • val must not exceed ULLONG_MAX - 1. A value of ULLONG_MAX returns -EINVAL; a value of 0 is accepted and leaves the counter unchanged (matching Linux).
  • If counter + val > ULLONG_MAX - 1:
      • With EFD_NONBLOCK: returns -EAGAIN.
      • Without EFD_NONBLOCK: blocks until a read() reduces the counter enough.
  • Otherwise: atomically adds val to the counter and wakes any readers.
  • Returns 8 on success (number of bytes consumed).

read(fd, &buf: u64, 8):

  • The buffer must be at least 8 bytes. Shorter buffers return -EINVAL.
  • Without EFD_SEMAPHORE:
      • If counter == 0 and EFD_NONBLOCK: returns -EAGAIN.
      • If counter == 0 and blocking: blocks until a write() increments the counter.
      • Otherwise: atomically reads the current counter value into buf and resets the counter to 0. Wakes any blocked writers.
  • With EFD_SEMAPHORE:
      • If counter == 0 and EFD_NONBLOCK: returns -EAGAIN.
      • If counter == 0 and blocking: blocks until counter > 0.
      • Otherwise: atomically decrements the counter by 1 and returns the value 1 in buf. Wakes any blocked writers, since the decrement may allow a pending write() to complete.
  • Returns 8 on success.

Poll Readiness

Condition Event reported
counter > 0 EPOLLIN \| EPOLLRDNORM
counter < ULLONG_MAX - 1 EPOLLOUT \| EPOLLWRNORM

Internal Structure

/// A kernel event notification counter, exposed as a file descriptor.
///
/// The counter is an atomic `u64` in the range `[0, ULLONG_MAX - 1]`.
/// `EFD_SEMAPHORE` changes `read()` to decrement by 1 rather than reset to 0.
pub struct EventFd {
    /// Current counter value. Ranges from 0 to ULLONG_MAX-1 (2^64 - 2).
    /// All updates use atomic compare-and-swap to guarantee linearizability.
    counter: AtomicU64,

    /// Flags set at creation time. `EFD_SEMAPHORE` controls read semantics.
    /// `EFD_NONBLOCK` is stored in the `FileDescription` flags, not here.
    flags: EventFdFlags,

    /// Tasks blocked in `read()` waiting for the counter to become non-zero.
    waiters_read: WaitQueue,

    /// Tasks blocked in `write()` waiting for the counter to drop below ULLONG_MAX-1.
    waiters_write: WaitQueue,
}

Read Algorithm (non-blocking fast path)

read_nonblocking(efd: &EventFd, semaphore: bool) -> Result<u64, Errno>:
    loop:
        current = efd.counter.load(Acquire)
        if current == 0:
            return Err(EAGAIN)
        new_val = if semaphore: current - 1 else: 0
        if efd.counter.compare_exchange(current, new_val, AcqRel, Acquire).is_ok():
            if new_val < ULLONG_MAX - 1:
                efd.waiters_write.wake_all()  // unblock any blocked writers
            return Ok(if semaphore: 1 else: current)
        // CAS failed: another thread raced; retry

The blocking path wraps this loop in a WaitQueue::wait_event() call that suspends the task until a writer increments the counter, then retries the CAS. No spinlock or mutex is held during the blocked sleep.
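The two read modes and the overflow check can be modeled in user space with std atomics alone. Wait queues and blocking are elided, and errno values are modeled as strings; this is a sketch of the semantics, not the kernel implementation.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// User-space model of the eventfd counter (non-blocking paths only).
pub struct Counter(pub AtomicU64);

impl Counter {
    pub fn write(&self, val: u64) -> Result<(), &'static str> {
        if val == u64::MAX {
            return Err("EINVAL");
        }
        loop {
            let cur = self.0.load(Ordering::Acquire);
            let new = match cur.checked_add(val) {
                Some(n) if n <= u64::MAX - 1 => n,
                _ => return Err("EAGAIN"), // blocking variant would sleep here
            };
            if self.0.compare_exchange(cur, new, Ordering::AcqRel, Ordering::Acquire).is_ok() {
                return Ok(()); // real kernel: wake blocked readers
            }
        }
    }

    pub fn read(&self, semaphore: bool) -> Result<u64, &'static str> {
        loop {
            let cur = self.0.load(Ordering::Acquire);
            if cur == 0 {
                return Err("EAGAIN"); // blocking variant would sleep here
            }
            let new = if semaphore { cur - 1 } else { 0 };
            if self.0.compare_exchange(cur, new, Ordering::AcqRel, Ordering::Acquire).is_ok() {
                return Ok(if semaphore { 1 } else { cur }); // real kernel: wake writers
            }
        }
    }
}
```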

UmkaOS Improvements over Linux

Linux implements eventfd with a spinlock (efd->lock) protecting the counter and wakeup logic. UmkaOS instead uses an AtomicU64 with compare_exchange in a retry loop — on x86-64 this compiles to a single LOCK CMPXCHG instruction, with no spinlock or mutex required. The wait queues are only touched when a task actually blocks. This eliminates the spinlock acquisition on every read/write, reducing overhead in the common non-blocking case from ~30-50 cycles (spinlock + counter update) to ~10-15 cycles (a single CAS instruction).

eventfd2() and eventfd() are unified behind a single internal constructor — UmkaOS dispatches both syscall numbers to the same function. Linux keeps two separate entry points for historical reasons; UmkaOS does not need to.

Linux Compatibility

  • Syscall numbers: eventfd = 284, eventfd2 = 290 (x86-64).
  • Flag values: EFD_CLOEXEC = O_CLOEXEC = 02000000 octal; EFD_NONBLOCK = O_NONBLOCK = 04000 octal; EFD_SEMAPHORE = 1.
  • ULLONG_MAX - 1 = 0xFFFFFFFFFFFFFFFE as the maximum counter value before write blocks — identical to Linux.
  • Read always returns exactly 8 bytes; write always consumes exactly 8 bytes — any other size returns -EINVAL.
  • /proc/[pid]/fdinfo/[fd] reports eventfd-count: <hex_value> to match Linux.

18.8.2 signalfd — Signal Delivery via File Descriptor

Syscall Interface

signalfd(fd: i32, mask: *const sigset_t, sizemask: usize) -> fd | -EINVAL | -EMFILE | -ENOMEM
signalfd4(fd: i32, mask: *const sigset_t, sizemask: usize, flags: u32) -> fd | -EINVAL | -EMFILE | -ENOMEM

signalfd is the older form (no flags); signalfd4 adds SFD_NONBLOCK and SFD_CLOEXEC. UmkaOS implements both syscall numbers with a unified path that treats signalfd as signalfd4 with flags = 0.

The fd argument controls create-or-update behavior:

  • fd = -1: create a new signalfd. Returns a new file descriptor.
  • fd = <existing signalfd>: update the signal mask on that fd. Returns fd unchanged. If fd is not a signalfd, returns -EINVAL.

sizemask must equal sizeof(sigset_t) = 8 bytes on x86-64. Any other value returns -EINVAL.

mask specifies which signals to accept through this fd. The mask must be a valid user pointer; SIGKILL (9) and SIGSTOP (19) in the mask are silently ignored — they cannot be blocked or redirected.

Flags

Flag Value Meaning
SFD_NONBLOCK O_NONBLOCK (04000) Set O_NONBLOCK on the file description
SFD_CLOEXEC O_CLOEXEC (02000000) Set close-on-exec on the returned fd

Any other bits in flags return -EINVAL.

Usage Pattern

Before signals can be read via signalfd, the caller must block them using sigprocmask(). Signals that are not blocked will be delivered to signal handlers (or default action) as normal — signalfd only intercepts signals from the process's pending signal set.

sigset_t mask;
sigemptyset(&mask);
sigaddset(&mask, SIGTERM);
sigaddset(&mask, SIGUSR1);
sigprocmask(SIG_BLOCK, &mask, NULL);       // block these signals
int sfd = signalfd(-1, &mask, SFD_CLOEXEC);  // redirect to fd

Read Semantics

read(fd, buf: *mut signalfd_siginfo, len: usize) -> bytes_read | -EAGAIN | -EINTR

  • len must be at least sizeof(signalfd_siginfo) = 128 bytes. Smaller buffers return -EINVAL.
  • read() dequeues one or more pending signals from the calling task's pending signal set that match the signalfd's mask, filling consecutive signalfd_siginfo structs.
  • The number of structs filled is min(pending_in_mask, len / 128).
  • If no matching signal is pending and O_NONBLOCK: returns -EAGAIN.
  • If no matching signal is pending and blocking: blocks until a matching signal arrives.
  • Returns the number of bytes written (always a multiple of 128).

Signals consumed via signalfd are removed from the task's pending signal set. They are NOT delivered to signal handlers. The pending set modification is atomic with respect to concurrent signal delivery.

Wire Format: signalfd_siginfo (128 bytes, exact Linux layout)

Offset  Size  Field         Description
------  ----  -----         -----------
  0       4   ssi_signo     Signal number
  4       4   ssi_errno     Error number (usually 0)
  8       4   ssi_code      si_code from siginfo_t
 12       4   ssi_pid       Sending process PID (SI_USER/SI_QUEUE)
 16       4   ssi_uid       Sending process real UID
 20       4   ssi_fd        File descriptor (SIGPOLL/SIGIO)
 24       4   ssi_tid       Kernel timer ID (SIGALRM/SIGVTALRM/SIGPROF)
 28       4   ssi_band      Band event (SIGPOLL/SIGIO)
 32       4   ssi_overrun   Timer overrun count (SIGALRM)
 36       4   ssi_trapno    Trap number (hardware fault signals)
 40       4   ssi_status    Exit status or signal (SIGCHLD)
 44       4   ssi_int       Integer value (SI_QUEUE/SI_MESGQ)
 48       8   ssi_ptr       Pointer value (SI_QUEUE/SI_MESGQ)
 56       8   ssi_utime     User CPU time consumed (SIGCHLD)
 64       8   ssi_stime     System CPU time consumed (SIGCHLD)
 72       8   ssi_addr      Address triggering fault (hardware faults)
 80       2   ssi_addr_lsb  LSB of fault address (BUS_MCEERR_*)
 82       2   _pad2         Padding (alignment)
 84       4   ssi_syscall   Syscall number (seccomp SIGSYS)
 88       8   ssi_call_addr Address of the syscall instruction (SIGSYS)
 96       4   ssi_arch      AUDIT_ARCH_* of the syscall (SIGSYS)
100      28   _pad          Padding to reach 128 bytes total

The Rust representation uses #[repr(C)] with explicit padding to guarantee byte-for-byte compatibility. The total size is asserted at compile time: const_assert!(size_of::<SignalFdSiginfo>() == 128).
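A sketch of the #[repr(C)] struct with a compile-time size check, assuming the Linux 6.1 uapi layout (which carries the seccomp SIGSYS fields appended in Linux 4.18 ahead of the trailing padding). The stable `const _: () = assert!(...)` form stands in for the const_assert! macro.

```rust
use std::mem::size_of;

/// signalfd_siginfo, byte-for-byte wire layout (128 bytes).
#[repr(C)]
pub struct SignalFdSiginfo {
    pub ssi_signo: u32,
    pub ssi_errno: i32,
    pub ssi_code: i32,
    pub ssi_pid: u32,
    pub ssi_uid: u32,
    pub ssi_fd: i32,
    pub ssi_tid: u32,
    pub ssi_band: u32,
    pub ssi_overrun: u32,
    pub ssi_trapno: u32,
    pub ssi_status: i32,
    pub ssi_int: i32,
    pub ssi_ptr: u64,
    pub ssi_utime: u64,
    pub ssi_stime: u64,
    pub ssi_addr: u64,
    pub ssi_addr_lsb: u16,
    pub _pad2: u16,
    pub ssi_syscall: i32,
    pub ssi_call_addr: u64,
    pub ssi_arch: u32,
    pub _pad: [u8; 28],
}

// Compile-time wire-format guarantee.
const _: () = assert!(size_of::<SignalFdSiginfo>() == 128);
```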

Internal Structure

/// A signal queue redirector exposed as a file descriptor.
///
/// Signals matching `mask` that arrive in the owning task's pending set
/// are readable via `read()` rather than delivered to a signal handler.
/// The mask can be updated atomically via `signalfd(existing_fd, new_mask, ...)`.
pub struct SignalFd {
    /// Signal mask as a u64 bitmask (bits 1-64 correspond to signals 1-64).
    /// Stored as AtomicU64 for lock-free mask updates via signalfd() on existing fd.
    /// SIGKILL (bit 9) and SIGSTOP (bit 19) are always masked out on write.
    mask: AtomicU64,

    /// Weak reference to the owning task. A `Weak<Task>` is used rather than
    /// `Arc<Task>` to avoid creating a reference cycle: the task owns the fd table
    /// which owns this struct. Upgrade fails if the task has been reaped.
    task: Weak<Task>,

    /// Tasks blocked in `read()` waiting for a matching signal to arrive.
    waiters: WaitQueue,
}

Mask Update Algorithm

When signalfd(existing_fd, new_mask, ...) is called on an existing signalfd, the mask update is:

update_mask(sfd: &SignalFd, new_mask: u64):
    // Strip SIGKILL and SIGSTOP — cannot be intercepted
    sanitized = new_mask & !(SIGKILL_BIT | SIGSTOP_BIT)
    sfd.mask.store(sanitized, Release)
    // No lock needed: concurrent read() loads mask with Acquire ordering
    // Any pending signals matching the new mask will be readable immediately
    sfd.waiters.wake_all()  // wake blocked readers — new mask may now have pending signals

The AtomicU64::store(Release) pairs with the AtomicU64::load(Acquire) in read(), guaranteeing that a read() that observes the new mask also observes any pending signals that were delivered before the mask was changed.
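The stripping step is a pure function of the mask, with signal n mapped to bit 1 << (n - 1) following the sigset_t convention:

```rust
const SIGKILL: u32 = 9;
const SIGSTOP: u32 = 19;

/// Strip SIGKILL and SIGSTOP from a requested signalfd mask — these two
/// signals can never be intercepted, matching Linux behavior.
pub fn sanitize_mask(new_mask: u64) -> u64 {
    let forbidden = (1u64 << (SIGKILL - 1)) | (1u64 << (SIGSTOP - 1));
    new_mask & !forbidden
}
```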

Signal Dequeue Algorithm

dequeue_signals(sfd: &SignalFd, buf: &mut [SignalFdSiginfo]) -> usize:
    task = sfd.task.upgrade().ok_or(EBADF)?
    mask = sfd.mask.load(Acquire)
    count = 0
    while count < buf.len():
        sig = task.signal_queue.dequeue_matching(mask)
        match sig:
            None => break
            Some(siginfo) =>
                buf[count] = siginfo_to_sfd_siginfo(siginfo)
                count += 1
    return count

signal_queue.dequeue_matching() atomically removes one signal whose number is set in mask from the task's pending signal set. The task's signal queue lock is held only for the duration of the dequeue operation, not for the entire read() call. This matches Linux's behavior and avoids blocking signal delivery while a read() is in progress on a different CPU.

Poll Readiness

Condition Event reported
Any signal in mask is pending in the task's pending set EPOLLIN \| EPOLLRDNORM

EPOLLOUT is never reported — signalfd is not writable.

UmkaOS Improvements over Linux

Linux stores the signalfd mask in a spinlock_t-protected struct. Updating the mask requires acquiring the lock and then potentially waking blocked readers. UmkaOS replaces this with AtomicU64::store(Release) for the update and AtomicU64::load(Acquire) for the reader, providing the same ordering guarantee without a lock. This eliminates approximately 30-50 cycles of spinlock overhead on the mask-update path.

Linux's signalfd implementation must take the task's sighand->siglock during read() to safely inspect and modify the pending signal set. UmkaOS uses the same lock (the task's signal queue lock) but holds it for a shorter window — only the atomic dequeue of a single signal — releasing it between each signal dequeued when filling a multi-signal buffer.

Linux Compatibility

  • Syscall numbers: signalfd = 282, signalfd4 = 289 (x86-64).
  • SFD_NONBLOCK = O_NONBLOCK = 04000 octal; SFD_CLOEXEC = O_CLOEXEC = 02000000 octal.
  • signalfd_siginfo layout is byte-for-byte identical to Linux; size is exactly 128 bytes including 46 bytes of trailing padding.
  • SIGKILL and SIGSTOP in the mask are silently stripped — identical to Linux.
  • signalfd(existing_fd, ...) returns the same fd number — identical to Linux.
  • Reading multiple signals in one read() call is supported — identical to Linux.
  • /proc/[pid]/fdinfo/[fd] reports sigmask: <hex_value> to match Linux.

18.8.3 timerfd — Timer Notification via File Descriptor

Syscall Interface

timerfd_create(clockid: i32, flags: u32) -> fd | -EINVAL | -EMFILE | -ENOMEM | -EPERM
timerfd_settime(fd: i32, flags: u32, new_value: *const itimerspec, old_value: *mut itimerspec) -> 0 | -EINVAL | -EFAULT
timerfd_gettime(fd: i32, curr_value: *mut itimerspec) -> 0 | -EINVAL | -EFAULT

Clock IDs

| Clock ID | Value | Description |
|---|---|---|
| CLOCK_REALTIME | 0 | Wall clock time; advances with NTP and adjtime |
| CLOCK_MONOTONIC | 1 | Monotonically increasing; unaffected by wall clock changes |
| CLOCK_BOOTTIME | 7 | Like CLOCK_MONOTONIC but includes time suspended in sleep |
| CLOCK_REALTIME_ALARM | 8 | Like CLOCK_REALTIME; wakes system from suspend |
| CLOCK_BOOTTIME_ALARM | 9 | Like CLOCK_BOOTTIME; wakes system from suspend |

Other clock IDs return -EINVAL. The _ALARM clocks require CAP_WAKE_ALARM.

Creation Flags

| Flag | Value | Meaning |
|---|---|---|
| TFD_NONBLOCK | O_NONBLOCK (04000) | Set O_NONBLOCK on the file description |
| TFD_CLOEXEC | O_CLOEXEC (02000000) | Set close-on-exec on the returned fd |

timerfd_settime Flags

| Flag | Value | Meaning |
|---|---|---|
| TFD_TIMER_ABSTIME | 1 | it_value specifies an absolute time (not relative) |
| TFD_TIMER_CANCEL_ON_SET | 2 | Cancel blocked read() if wall clock is stepped (CLOCK_REALTIME only) |

TFD_TIMER_CANCEL_ON_SET combined with CLOCK_MONOTONIC or CLOCK_BOOTTIME returns -EINVAL.

itimerspec Wire Format

struct itimerspec {           // total 32 bytes
    timespec it_interval;     //   16 bytes: repeat interval (0 = one-shot)
    timespec it_value;        //   16 bytes: time until next expiration (0 = disarm)
};
struct timespec {             //   16 bytes
    i64 tv_sec;               //   seconds
    i64 tv_nsec;              //   nanoseconds [0, 999999999]
};

Setting new_value.it_value to all zeros disarms the timer (any in-flight expiration that has not yet been read remains readable). Setting new_value.it_interval to all zeros creates a one-shot timer.

Read Semantics

read(fd, &expirations: u64, 8) -> 8 | -EAGAIN | -ECANCELED | -EINVAL
  • Buffer must be at least 8 bytes; smaller buffers return -EINVAL.
  • Reads the number of timer expirations since the last read() (or since the timer was armed if never read).
  • If expirations == 0 and O_NONBLOCK: returns -EAGAIN.
  • If expirations == 0 and blocking: blocks until the timer fires.
  • If the timer was armed with TFD_TIMER_CANCEL_ON_SET and the real-time clock is stepped while a read() is blocking, the read() returns -ECANCELED.
  • Returns 8 on success. The expiration counter is reset to 0 atomically on read.

timerfd_gettime Semantics

Returns the remaining time until the next expiration in curr_value.it_value (always relative, even if the timer was set with TFD_TIMER_ABSTIME), and the interval in curr_value.it_interval. If the timer is disarmed, both fields are zero.

Internal Structure

/// A kernel timer exposed as a file descriptor.
///
/// The `expirations` counter accumulates missed firings atomically.
/// `timerfd_settime` holds `lock` to update the timer state atomically.
/// The timer callback and `read()` are lock-free in the common case.
pub struct TimerFd {
    /// Which clock drives this timer.
    clock: ClockId,

    /// Handle into the kernel timer subsystem. The timer callback increments
    /// `expirations` and wakes `waiters`. Re-armed automatically if `interval > 0`.
    timer: KernelTimer,

    /// Accumulated expiration count. Incremented by the timer callback (possibly
    /// on a different CPU). Reset to 0 by `read()` using compare_exchange.
    expirations: AtomicU64,

    /// Tasks blocked in `read()` waiting for the timer to fire.
    waiters: WaitQueue,

    /// Protects `state` during `timerfd_settime`. Not held during timer callbacks
    /// or `read()` — those use `expirations` atomically.
    lock: Mutex<TimerFdState>,
}

/// Mutable timer configuration. Protected by `TimerFd::lock`.
pub struct TimerFdState {
    /// True if the timer is currently armed.
    armed: bool,

    /// Time until next expiration (stored as absolute clock time internally).
    next_expiry: Instant,

    /// Repeat interval. Zero means one-shot.
    interval: Duration,

    /// True if the timer was set with `TFD_TIMER_ABSTIME`.
    abstime: bool,

    /// True if blocking `read()` should return ECANCELED on wall-clock steps.
    /// Only valid when `clock` is `CLOCK_REALTIME`.
    cancel_on_set: bool,

    /// True if coalescing is disabled for this timer (UmkaOS extension; see below).
    precise: bool,
}

Timer Callback Algorithm

The timer subsystem calls timerfd_callback when the timer fires. This runs in interrupt context (or a timer-dedicated kernel thread on platforms where interrupt context constraints are tighter):

timerfd_callback(tfd: &TimerFd):
    // Increment the expiration counter. Saturates at ULLONG_MAX to avoid wrap.
    prev = tfd.expirations.fetch_add(1, Release)
    if prev == ULLONG_MAX:
        tfd.expirations.store(ULLONG_MAX, Relaxed)  // saturate, don't wrap
    tfd.waiters.wake_all()  // wake any blocked read()
    if tfd.lock.try_lock():
        state = tfd.lock.data()
        if state.interval > Duration::ZERO:
            state.next_expiry += state.interval
            tfd.timer.rearm(state.next_expiry)
        tfd.lock.unlock()
    // If lock is contended (settime in progress), settime will rearm after update

timerfd_settime Algorithm

timerfd_settime(tfd: &TimerFd, flags, new_value, old_value) -> Result<(), Errno>:
    state = tfd.lock.lock()
    if old_value is not null:
        *old_value = state_to_itimerspec(state, tfd.clock)
    if new_value.it_value == zero:
        state.armed = false
        tfd.timer.cancel()
    else:
        state.armed = true
        state.interval = new_value.it_interval
        state.abstime = flags & TFD_TIMER_ABSTIME != 0
        state.cancel_on_set = flags & TFD_TIMER_CANCEL_ON_SET != 0
        if state.abstime:
            state.next_expiry = new_value.it_value as absolute instant
        else:
            state.next_expiry = now(tfd.clock) + new_value.it_value
        tfd.timer.arm(state.next_expiry)
    // Reset any unread expirations from the previous timer period
    tfd.expirations.store(0, Release)
    tfd.lock.unlock()

The expiration reset to 0 in timerfd_settime matches Linux behavior: rearming the timer discards any unread expirations from the previous arm.

Wall-Clock Step Handling (TFD_TIMER_CANCEL_ON_SET)

The timekeeping subsystem broadcasts a ClockSet notification whenever settimeofday(2) or clock_settime(CLOCK_REALTIME, ...) makes a non-monotonic change to the wall clock. All CLOCK_REALTIME timerfds with cancel_on_set = true receive this notification through a registered callback:

timerfd_clock_set_callback(tfd: &TimerFd):
    // Wake all blocked readers with an ECANCELED indication
    tfd.waiters.wake_all_with_err(ECANCELED)

Blocked read() calls detect the cancellation via the wait-queue return code and propagate -ECANCELED to userspace without consuming the expiration counter.

Interval Timer Coalescing (UmkaOS Extension)

Timers with very short intervals (interval < 1ms) and low-resolution system HZ settings (e.g., HZ = 250, giving 4ms tick resolution) would fire far more frequently than the system can usefully service. UmkaOS coalesces such timers to fire at tick boundaries, batching wakeups and reducing interrupt load:

  • Coalescing is enabled by default for interval < 1ms.
  • Disabled per-timer via the UmkaOS-specific TFD_TIMER_PRECISE flag (value: 4, chosen to not conflict with existing Linux flags).
  • TFD_TIMER_PRECISE is an UmkaOS extension. Kernels that do not support it treat it as an unknown flag and return -EINVAL. Applications that need Linux portability should not set this flag.
  • Coalescing does not affect the expiration counter: missed firings within a coalescing window are accumulated and delivered as a single count on the next wakeup.

Poll Readiness

| Condition | Event reported |
|---|---|
| expirations > 0 | EPOLLIN \| EPOLLRDNORM |

EPOLLOUT is never reported — timerfd is not writable.

UmkaOS Improvements over Linux

Linux implements timerfd with a spinlock protecting both the expiration counter and the timer state. The timer callback (timerfd_tmrproc) acquires the spinlock to increment the expiration counter and re-arm the interval timer.

UmkaOS separates these concerns:

  • The expiration counter is an AtomicU64 — the timer callback increments it with fetch_add(1, Release) without holding any lock. read() resets it with compare_exchange(current, 0, AcqRel, Acquire) without holding any lock. This eliminates spinlock acquisition from the timer hot path.
  • The timer state (arm/disarm, interval, abstime) is protected by a Mutex held only during timerfd_settime. The timer callback uses try_lock() for re-arming and skips re-arming if settime is in progress (settime will re-arm after updating state).
  • The common case (timer fires, counter increments, waiter wakes, counter reads 0) is entirely lock-free.

Linux Compatibility

  • Syscall numbers: timerfd_create = 283, timerfd_settime = 286, timerfd_gettime = 287 (x86-64).
  • TFD_NONBLOCK = 04000, TFD_CLOEXEC = 02000000, TFD_TIMER_ABSTIME = 1, TFD_TIMER_CANCEL_ON_SET = 2.
  • itimerspec layout is identical to Linux (two timespec structs, 32 bytes total).
  • timerfd_settime with it_value = 0 disarms the timer and resets the expiration counter to 0 — identical to Linux.
  • ECANCELED is returned from blocking read() when a TFD_TIMER_CANCEL_ON_SET timer is cancelled by a clock step — identical to Linux.
  • /proc/[pid]/fdinfo/[fd] reports clockid, ticks, settime flags, it_value, and it_interval to match Linux's timerfd_show() format.

18.8.4 pidfd — Process File Descriptor

Syscall Interface

pidfd_open(pid: pid_t, flags: u32) -> fd | -EINVAL | -EMFILE | -ESRCH | -EPERM
pidfd_send_signal(pidfd: i32, sig: i32, siginfo: *const siginfo_t, flags: u32) -> 0 | -EPERM | -ESRCH | -EINVAL
pidfd_getfd(pidfd: i32, targetfd: i32, flags: u32) -> fd | -EPERM | -ESRCH | -EINVAL | -EMFILE

pidfd_open

pid must refer to a live process (not a thread) in the caller's PID namespace. A process is "live" if it has not yet been reaped — zombie processes that have exited but not been waited on are accessible. Passing a pid that does not exist or has been reaped and recycled returns -ESRCH.

flags may be 0 or a bitwise OR of the following. PIDFD_THREAD (value: O_EXCL = 010 octal) creates a thread pidfd pointing to a specific thread (not the thread group leader). PIDFD_NONBLOCK (value: O_NONBLOCK = 04000) creates a non-blocking pidfd whose waitid(P_PIDFD, ...) returns -EAGAIN if the process has not yet exited. Any other flag bits return -EINVAL.

PIDFD_THREAD support: Linux added thread pidfd support in kernel 6.9 via PIDFD_THREAD. UmkaOS supports PIDFD_THREAD from its initial release — there is no version gate. A thread pidfd can receive signals via pidfd_send_signal targeted at a specific thread, and poll() reports EPOLLIN when that specific thread exits.

pidfd_send_signal

pidfd_send_signal(pidfd, sig, siginfo, flags):

Sends signal sig to the process referenced by pidfd. Semantics are identical to kill(2) but use the stable pidfd reference instead of a PID:

  • sig = 0: permission check only (does not send a signal); returns 0 if the process is accessible, -ESRCH if it has exited, -EPERM if no permission.
  • siginfo != NULL: for real-time signals (SIGRTMIN to SIGRTMAX), the provided siginfo_t is used as the signal info. si_code must be SI_QUEUE (or another userspace-generatable code). siginfo must be NULL for standard signals.
  • flags must be 0.
  • Permission model: same as kill(2) — caller must have same UID, be privileged (CAP_KILL), or be the parent of the target process.

pidfd_getfd

Duplicates file descriptor targetfd from the process referenced by pidfd into the calling process's fd table. The duplicated fd refers to the same open file description as in the target process.

  • Requires PTRACE_MODE_ATTACH_REALCREDS access to the target process. This is checked via the LSM ptrace hooks — the same permission check that PTRACE_ATTACH uses. Without this permission, returns -EPERM.
  • flags must be 0.
  • The returned fd has FD_CLOEXEC set.
  • If targetfd is not open in the target process, returns -EBADF.
  • If the calling process's fd table is full, returns -EMFILE.

waitid with P_PIDFD

waitid(P_PIDFD, pidfd, infop, options, rusage) -> 0 | -EINVAL | -ECHILD

P_PIDFD (value: 3) is used as the idtype argument. The id argument is the pidfd file descriptor number. All standard waitid options apply (WEXITED, WSTOPPED, WCONTINUED, WNOHANG, WNOWAIT).

When the pidfd was opened with PIDFD_NONBLOCK and the process has not yet exited, waitid with WNOHANG returns 0 with infop->si_pid = 0 (consistent with standard waitid WNOHANG behavior).

Poll Readiness

| Condition | Event reported |
|---|---|
| Referenced process has exited (any state: zombie or reaped) | EPOLLIN \| EPOLLHUP |
| Referenced process is running | (nothing — not readable) |

poll() on a pidfd is particularly useful for async exit monitoring without SIGCHLD:

// Monitor multiple child processes without a SIGCHLD handler
struct epoll_event ev1 = { .events = EPOLLIN, .data.fd = pidfd1 };
struct epoll_event ev2 = { .events = EPOLLIN, .data.fd = pidfd2 };
struct epoll_event events[2];
int efd = epoll_create1(0);
epoll_ctl(efd, EPOLL_CTL_ADD, pidfd1, &ev1);
epoll_ctl(efd, EPOLL_CTL_ADD, pidfd2, &ev2);
epoll_wait(efd, events, 2, -1);   // wakes when either child exits

clone3 Integration — Atomic pidfd on Fork

clone3(2) with CLONE_PIDFD flag sets pidfd in the clone_args struct to receive a pidfd for the new child atomically:

struct clone_args args = {
    .flags    = CLONE_PIDFD,
    .pidfd    = (uint64_t)&child_pidfd,  // out: fd for the child
    .exit_signal = SIGCHLD,
};
pid_t child = syscall(SYS_clone3, &args, sizeof(args));

In UmkaOS's fork path:

clone3_with_pidfd(args):
    new_task = allocate_task()
    pfd_obj = PidFd::new(Arc::clone(&new_task.process), current_pid_ns())
    child_fd = install_fd_in_current_table(pfd_obj)
    // Write child_fd to args.pidfd before releasing the new task
    *args.pidfd = child_fd as u64
    release_and_schedule(new_task)
    return new_task.pid

The pidfd is installed in the parent's fd table and the args.pidfd pointer is written before the child is made visible to the scheduler. There is no window between fork and pidfd creation during which the child's PID could be recycled.

Internal Structure

/// A stable reference to a process, exposed as a file descriptor.
///
/// Holds an `Arc<Process>` which keeps the process's zombie state alive until
/// all pidfds referencing it are closed and `waitid` has been called.
/// No lock is needed to validate the reference — `Arc` guarantees liveness.
pub struct PidFd {
    /// Strong reference to the process. This keeps the zombie `Process` struct
    /// alive even after the process exits and is waited on, so that subsequent
    /// `pidfd_send_signal` calls return `-ESRCH` rather than accessing freed memory
    /// or racing with PID recycling.
    process: Arc<Process>,

    /// PID namespace in which this pidfd was created. Used to resolve PIDs for
    /// `pidfd_send_signal` permission checks, which compare against the caller's
    /// namespace view of the target process.
    ns: Arc<PidNamespace>,

    /// True if this is a thread pidfd (PIDFD_THREAD). When true, `poll()` reports
    /// readiness when the specific thread exits, not when the thread group exits.
    thread_mode: bool,
}

Liveness Model

PidFd holds an Arc<Process>. The Process struct is kept in memory as long as any of the following hold a reference:

  1. The process is in the parent's child list (before waitid reaps it).
  2. A PidFd fd is open anywhere in the system.
  3. The kernel has an internal reference (e.g., the process is on a runqueue).

When the process exits, it transitions to zombie state. The zombie entry (exit status available to waitid) persists until the parent reaps it, and the Process struct itself persists as long as any Arc<Process> reference remains. This means:

  • Closing all pidfds referencing a zombie does not prevent the parent from calling waitid — the parent's child-list entry remains.
  • After the parent calls waitid, if any pidfd is still open, the Process struct survives in a fully-reaped state: the address space and other heavyweight resources are freed, but the small Arc-managed shell (exit code plus the reaped flag) remains valid until the last pidfd closes. Subsequent pidfd_send_signal calls return -ESRCH.

This is simpler and safer than Linux's approach, which uses pid_lock to prevent the struct pid from being freed while a pidfd is being accessed. UmkaOS's Arc provides the same guarantee without any explicit locking.

pidfd_send_signal Algorithm

pidfd_send_signal(pfd: &PidFd, sig, siginfo, flags) -> Result<(), Errno>:
    if flags != 0:
        return Err(EINVAL)
    // Arc::clone gives us a reference; no lock needed to access the process
    process = Arc::clone(&pfd.process)
    if process.is_fully_reaped():
        return Err(ESRCH)
    check_signal_permission(current_task(), &process, sig)?
    if sig == 0:
        return Ok(())   // permission check only
    deliver_signal(&process, sig, siginfo)

is_fully_reaped() checks an atomic flag set when the process's resources have been fully released. This is a single atomic load — no lock.

pidfd_getfd Algorithm

pidfd_getfd(pfd: &PidFd, targetfd, flags) -> Result<Fd, Errno>:
    if flags != 0:
        return Err(EINVAL)
    process = Arc::clone(&pfd.process)
    if process.is_fully_reaped():
        return Err(ESRCH)
    // LSM permission check (ptrace attach-level)
    check_ptrace_attach(current_task(), &process)?
    // Get the file description from the target's fd table
    file = process.fd_table.get(targetfd).ok_or(EBADF)?
    // Install a duplicate into the calling task's fd table with FD_CLOEXEC
    new_fd = current_task().fd_table.install(file, FD_CLOEXEC)?
    return Ok(new_fd)

UmkaOS Improvements over Linux

Liveness via Arc instead of pid_lock: Linux must take pid_lock (a global spinlock on the PID namespace) every time a pidfd is dereferenced to ensure the struct pid has not been freed. This spinlock is contended when many pidfd operations occur concurrently. UmkaOS's Arc<Process> needs no global lock: the pidfd's strong reference guarantees the struct is alive, so dereferencing it is a plain pointer access. Only creating or dropping a pidfd touches the refcount, and that is a single atomic increment or decrement.

Thread pidfds from day one: Linux added PIDFD_THREAD in kernel 6.9. UmkaOS supports thread pidfds in its initial release.

PIDFD_NONBLOCK support: Linux added PIDFD_NONBLOCK in kernel 5.10. UmkaOS supports it from the initial release. The flag is stored in the FileDescription flags (same as O_NONBLOCK for other fd types) and is checked by waitid(P_PIDFD, ...).

Atomic clone3 pidfd: UmkaOS allocates and installs the pidfd before releasing the new task to the scheduler, eliminating any TOCTOU window between fork and pidfd creation — matching the Linux clone3 + CLONE_PIDFD guarantee.

Linux Compatibility

  • Syscall numbers: pidfd_open = 434, pidfd_send_signal = 424, pidfd_getfd = 438 (x86-64).
  • PIDFD_NONBLOCK = O_NONBLOCK = 04000 octal.
  • PIDFD_THREAD = O_EXCL = 010 octal.
  • P_PIDFD = 3 (for waitid idtype).
  • CLONE_PIDFD = 0x00001000 (in clone_args.flags).
  • PTRACE_MODE_ATTACH_REALCREDS permission check for pidfd_getfd — identical to Linux. No additional UmkaOS-specific permission layer.
  • poll() reporting EPOLLIN | EPOLLHUP on process exit — identical to Linux.
  • /proc/[pid]/fdinfo/[fd] reports Pid: <pid> and NSpid: <nspid> for the referenced process — matching Linux's pidfd_show() output.

18.8.5 Linux Compatibility Reference

Complete syscall number table for all four fd types on x86-64:

| Syscall | x86-64 Number | Return Type | Error Codes |
|---|---|---|---|
| eventfd | 284 | fd | -EINVAL, -EMFILE, -ENOMEM |
| eventfd2 | 290 | fd | -EINVAL, -EMFILE, -ENOMEM |
| signalfd | 282 | fd | -EINVAL, -EMFILE, -ENOMEM |
| signalfd4 | 289 | fd | -EINVAL, -EMFILE, -ENOMEM |
| timerfd_create | 283 | fd | -EINVAL, -EMFILE, -ENOMEM, -EPERM |
| timerfd_settime | 286 | 0 | -EINVAL, -EFAULT, -EBADF |
| timerfd_gettime | 287 | 0 | -EINVAL, -EFAULT, -EBADF |
| pidfd_open | 434 | fd | -EINVAL, -EMFILE, -ESRCH, -EPERM |
| pidfd_send_signal | 424 | 0 | -EINVAL, -EPERM, -ESRCH |
| pidfd_getfd | 438 | fd | -EINVAL, -EPERM, -ESRCH, -EBADF, -EMFILE |

Struct sizes and invariants:

| Type | Size | Invariant |
|---|---|---|
| signalfd_siginfo | 128 bytes | Exact Linux layout; compile-time size_of assertion |
| itimerspec | 32 bytes | Two timespec structs; tv_nsec in [0, 999999999] |
| eventfd counter | u64 | Range [0, ULLONG_MAX - 1]; ULLONG_MAX is never a valid counter value |
| signalfd mask | u64 | Bit N-1 represents signal N (signals 1-64); the bits for SIGKILL (9) and SIGSTOP (19) are always zero |

Common errno values and their meaning across all four types:

| Errno | Meaning |
|---|---|
| -EINVAL | Bad flags, bad clock ID, bad fd for update, wrong buffer size, bad sigset size |
| -EMFILE | Per-process fd limit reached |
| -ENOMEM | Kernel memory exhausted during fd object allocation |
| -EAGAIN | Non-blocking operation would block (read on empty counter/queue/timer) |
| -ECANCELED | Blocking timerfd read cancelled by wall-clock step (TFD_TIMER_CANCEL_ON_SET) |
| -ESRCH | Process referenced by pidfd has exited and been reaped |
| -EPERM | Capability check failed (CAP_WAKE_ALARM, CAP_KILL) or ptrace permission denied |
| -EBADF | targetfd not open in target process (pidfd_getfd), or fd is not a signalfd (on mask update) |

Cross-subsystem interactions:

  • eventfd + io_uring: io_uring posts completion notifications to an eventfd registered via IORING_REGISTER_EVENTFD. UmkaOS implements this through the standard FileOps::write() path — io_uring calls eventfd_write() the same way userspace does.
  • signalfd + threads: Each thread has its own pending signal set. A signalfd opened in a thread reads signals from that thread's pending set (thread-directed signals) and from the thread group's pending set (process-directed signals), matching Linux semantics.
  • timerfd + suspend: CLOCK_REALTIME_ALARM and CLOCK_BOOTTIME_ALARM timers are registered with the RTC wakeup subsystem. When the system suspends, the RTC is programmed to wake the system before the earliest alarm timer fires. The timer fires on resume; the expiration count correctly reflects the elapsed real time.
  • pidfd + namespaces: A pidfd is tied to the PID namespace in which it was created. pidfd_send_signal resolves permissions in that namespace. If the target process exits its namespace (e.g., by exec across a user namespace boundary), the pidfd continues to reference the process via Arc<Process> — namespace exit does not invalidate the reference.

18.8.6 UmkaOS Typed Event Notification API

The special fd objects (eventfd, signalfd, timerfd) deliver data via untyped read(fd, buf, n) calls where the caller must know the buffer layout. A mismatched buffer size returns EINVAL; a correct-size read of the wrong fd type silently returns garbage bytes. UmkaOS provides a typed companion API:

/// Read from a special event fd with compile-time type checking.
///
/// The kernel inspects the fd's underlying type and fills the appropriate variant.
/// Returns `Err(EINVAL)` if the fd is not a special event fd.
/// Returns `Err(EAGAIN)` if non-blocking and no event is pending.
pub fn event_read(fd: RawFd) -> Result<EventValue, EventError>;

/// The typed value returned by event_read().
///
/// **Wire layout** (`#[repr(C, u32)]` tagged union):
///   - Bytes 0-3: tag (u32): 0=Counter, 1=TimerTicks, 2=Signal, 3=ProcessExited
///   - Bytes 4-7: implicit padding to align the payload
///   - Bytes 8+: variant payload (u64 for Counter/TimerTicks, SignalfdSiginfo
///     for Signal, {pid: u32, exit_code: i32} for ProcessExited)
///
/// The C representation of `#[repr(C, u32)]` enums with data-carrying variants
/// is a tagged union: the discriminant is a leading `u32`, followed by padding
/// to the payload's alignment, followed by the largest variant's payload.
/// Equivalent C layout (see `umka_event_value` in umka-compat/include/umka.h):
/// ```c
/// struct umka_event_value {
///     uint32_t tag;
///     uint32_t _pad;
///     union {
///         uint64_t counter;
///         uint64_t timer_ticks;
///         struct signalfd_siginfo signal;
///         struct { uint32_t pid; int32_t exit_code; } process_exited;
///     };
/// };
/// ```
#[repr(C, u32)]
pub enum EventValue {
    /// eventfd: current counter value (EFD_SEMAPHORE: always 1). Tag = 0.
    Counter(u64) = 0,
    /// timerfd: number of expirations since last read. Tag = 1.
    TimerTicks(u64) = 1,
    /// signalfd: one pending signal. Tag = 2.
    Signal(SignalfdSiginfo) = 2,
    /// pidfd: exit status of the process (only after EPOLLIN on pidfd). Tag = 3.
    ProcessExited { pid: u32, exit_code: i32 } = 3,
}

/// Write to an eventfd with type checking.
/// Returns `Err(EINVAL)` if fd is not an eventfd.
pub fn event_write(fd: RawFd, value: u64) -> Result<(), EventError>;

x86-64 syscall numbers:

| Syscall | Number |
|---|---|
| event_read | 1030 |
| event_write | 1031 |

Advantages over raw read(2):

  • Type-safe: the compiler enforces that all variants are handled.
  • eBPF verifier can statically analyze event types in attached programs.
  • No silent garbage on wrong fd type: kernel validates fd type at the syscall boundary.
  • Single syscall for all event fd types: no need to track which type each fd is at the call site.
  • ProcessExited variant: pidfd exit notification delivers exit code directly (no waitid needed after the read).

Interaction with io_uring: event_read is exposed as an io_uring operation (IORING_OP_EVENT_READ, opcode 48), allowing async typed event reads without a dedicated syscall per fd:

struct io_uring_sqe sqe = {
    .opcode = IORING_OP_EVENT_READ,
    .fd     = event_fd,
    .addr   = (uint64_t)&event_value_out,  // struct EventValue destination
};

Linux compatibility: read(2) on eventfd/signalfd/timerfd/pidfd works identically to Linux. event_read/event_write are UmkaOS extensions. The EventValue wire layout is stable ABI (repr(C, u32) tagged union with explicit integer discriminants 0-3 as documented in the struct comment above); field ordering is frozen at first release and additive changes use new enum variants appended after the existing set.