Chapter 19: System API¶
Syscall interface, futex, netlink, Windows emulation, dropped compatibility, native syscalls, safe extensibility
The system API layer provides Linux syscall compatibility (unmodified glibc/musl work
out of the box) plus UmkaOS-native extensions. Futex, io_uring, eBPF, and netlink are
fully supported. Native multi-object wait (SYNC_WAIT_ANY / SYNC_WAIT_ALL) provides
heterogeneous waiting on fds, events, PIDs, timers, and semaphores in a single call.
Windows emulation acceleration (WEA) provides NT kernel object primitives for
WINE/Proton — built as a translation layer on top of the native wait primitives.
Safe kernel extensibility allows hot-swappable policy modules via KABI vtables.
19.1 Syscall Interface¶
19.1.1 Design Goal¶
UmkaOS is a POSIX-compatible kernel. Of the roughly 450 defined Linux x86-64
syscalls, about 330-350 are actively used by current software (glibc 2.17+,
musl 1.2+, systemd, Docker, Kubernetes). The remaining ~100-120 are obsolete
and return -ENOSYS unconditionally.
Of the 330-350 active syscalls:
- ~80% (~265-280) are implemented natively with identical POSIX semantics — read,
write, open, mmap, fork, socket, etc. are UmkaOS's own API, not a translation
layer over something else. The syscall entry point performs representation conversion
(untyped C ABI → typed Rust internals), not semantic translation.
- ~15% (~50-55) need thin adaptation (e.g., Linux's untyped ioctl → UmkaOS's typed
driver interface).
- ~5% (~15-20) are genuine compatibility shims for deprecated syscalls that get
remapped to modern equivalents.
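The third category can be made concrete with the select → pselect6 remapping mentioned later in this chapter: the shim's real work is argument-representation conversion. The sketch below is a simplified illustration with hypothetical struct definitions (the kernel's actual wire types use KernelLong fields, per Section 19.1.3); it shows only the timeval-to-timespec widening such a shim performs before delegating to the modern handler.

```rust
/// Microsecond-resolution timeout used by the legacy select(2) ABI.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct Timeval { pub tv_sec: i64, pub tv_usec: i64 }

/// Nanosecond-resolution timeout used by pselect6(2).
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct Timespec { pub tv_sec: i64, pub tv_nsec: i64 }

/// Lossless widening a select-to-pselect6 compatibility shim performs
/// before delegating: microseconds become nanoseconds, seconds pass through.
pub fn timeval_to_timespec(tv: Timeval) -> Timespec {
    Timespec { tv_sec: tv.tv_sec, tv_nsec: tv.tv_usec * 1_000 }
}
```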
19.1.2 Syscall Dispatch Architecture¶
The SyscallHandler enum classifies every syscall by how it is serviced. The first
three variants (Direct, InnerRingForward, OuterRingForward) are native
implementations — UmkaOS's own kernel code handling the syscall directly. Only Emulated
is a compatibility shim:
pub enum SyscallHandler {
/// Handled directly in UmkaOS Core -- no tier crossing
/// Examples: getpid, brk, mmap, clock_gettime, signals, futex, uname
/// Note: uname() reads hostname/domainname from
/// current_task().nsproxy.uts_ns ([Section 17.1](17-containers.md#namespace-architecture)).
Direct(fn(&mut SyscallContext) -> i64),
/// Forwarded to a Tier 1 driver via domain switch
/// Examples: read, write, ioctl, socket ops, mount
InnerRingForward {
driver_class: DriverClass,
handler: fn(&mut SyscallContext) -> i64,
},
/// Forwarded to a Tier 2 driver via IPC
/// Examples: USB-specific ioctls
OuterRingForward {
driver_class: DriverClass,
handler: fn(&mut SyscallContext) -> i64,
},
/// Compatibility shim for deprecated-but-still-called syscalls
/// Examples: select (mapped to pselect6), poll (mapped to ppoll)
Emulated(fn(&mut SyscallContext) -> i64),
/// Not implemented -- returns -ENOSYS
/// Examples: old_stat, socketcall, ipc multiplexer
Unimplemented,
}
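To show how the classification above drives dispatch, here is a minimal self-contained sketch (stub SyscallContext, a reduced handler set, positive-numbered table only) of a table-driven dispatcher: native variants and shims are both just function pointers, and anything unmapped falls through to -ENOSYS.

```rust
/// Reduced stand-ins for the real types, for illustration only.
pub struct SyscallContext { pub nr: i32, pub args: [u64; 6], pub ret: i64 }

pub enum SyscallHandler {
    Direct(fn(&mut SyscallContext) -> i64),
    Emulated(fn(&mut SyscallContext) -> i64),
    Unimplemented,
}

const ENOSYS: i64 = 38;

/// Look up the handler for ctx.nr in the (positive-numbered) dispatch table
/// and invoke it; out-of-range or unimplemented numbers yield -ENOSYS.
pub fn dispatch(table: &[SyscallHandler], ctx: &mut SyscallContext) -> i64 {
    let ret = match table.get(ctx.nr as usize) {
        Some(SyscallHandler::Direct(f)) | Some(SyscallHandler::Emulated(f)) => f(ctx),
        Some(SyscallHandler::Unimplemented) | None => -ENOSYS,
    };
    ctx.ret = ret;
    ret
}
```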
19.1.2.1 SyscallContext¶
SyscallContext is the per-invocation state frame passed to every syscall handler.
It is constructed by the architecture-specific syscall entry code (Layer 1) and passed
by mutable reference to dispatch_syscall() (Layer 2).
/// Per-invocation syscall state frame. Constructed by the architecture-specific
/// entry stub from saved registers and the current task pointer.
///
/// **Lifetime**: Lives on the kernel stack for the duration of the syscall.
/// The `'a` lifetime ties it to the current task's existence on this CPU —
/// the task cannot be freed or migrated while a syscall is in progress.
///
/// **Namespace access**: Handlers access namespace-specific views through
/// `ctx.task.nsproxy` ([Section 17.1](17-containers.md#namespace-architecture)). For example:
/// - `ctx.task.nsproxy.pid_ns` — PID translation
/// - `ctx.task.nsproxy.net_ns` — network namespace routing
/// - `ctx.task.nsproxy.mount_ns` — mount visibility
///
/// **Filesystem context**: `ctx.task.fs` provides the task's root directory
/// (chroot boundary), current working directory, and umask
/// ([Section 8.1](08-process.md#process-and-task-management--fsstruct)).
pub struct SyscallContext<'a> {
/// Syscall number. Positive values index into the Linux-compatible dispatch
/// table; negative values index into the UmkaOS-native dispatch table.
/// Extracted from the architecture-specific syscall number register
/// (see per-architecture register mapping table below).
///
/// **UmkaOS native syscall range**: Negative numbers `-1` through `-4096`
/// are reserved for UmkaOS-native syscalls. The dispatch table uses
/// `(-nr - 1)` as the index into the native table (so `-1` → index 0,
/// `-4096` → index 4095). Numbers below `-4096` return `-ENOSYS`.
/// This range is large enough for all planned UmkaOS extensions
/// (capability operations, cluster primitives, driver management,
/// live evolution) while leaving the entire positive namespace for
/// Linux-compatible syscalls (currently up to ~450 on x86-64).
///
/// On AArch64, the entry stub sign-extends `w8` to `x8` via `sxtw`
/// so that native negative syscall numbers are correctly represented
/// in the 64-bit register.
pub nr: i32,
/// Syscall arguments (up to 6), extracted from architecture-specific
/// registers. Unused arguments are zero-filled. The argument registers
/// differ per architecture — see the register mapping table below.
pub args: [u64; 6],
/// Reference to the calling task. Provides access to:
/// - `task.nsproxy` — namespace set for namespace-aware syscalls
/// - `task.cred` — per-task credentials (`RcuCell<Arc<Cred>>`) for ALL
/// permission checks. This is the authoritative credential source.
/// `task.process.cred` is the baseline credential inherited on fork;
/// per-task overrides (setresuid, prctl) modify `task.cred` independently.
/// Always use `task.cred.read()` (RCU-protected, lock-free) for capability
/// and UID/GID checks — never `task.process.cred` directly.
/// - `task.process.mm` — memory descriptor for mmap/brk/munmap
/// - `task.files` — file descriptor table for fd-based syscalls
/// - `task.fs` — filesystem context (root, cwd, umask) for path resolution
/// - `task.signal_mask` — blocked signal set for signal-related syscalls
/// - `task.capabilities` — per-thread capability restriction handle
pub task: &'a Task,
/// Return value, set by the handler before returning. Negative values
/// are negated errno codes (e.g., `-ENOENT` = -2). The entry stub
/// writes this value back to the architecture-specific return register.
pub ret: i64,
}
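The split dispatch documented on the nr field can be sketched as a small classifier; this is an illustrative standalone function (hypothetical TableIndex enum, not a quoted kernel API) implementing the (-nr - 1) mapping described above.

```rust
/// Where a syscall number lands: the Linux-compatible table (positive),
/// the UmkaOS-native table (-1..=-4096, index = -nr - 1), or nowhere.
#[derive(Debug, PartialEq)]
pub enum TableIndex { Linux(usize), Native(usize), Invalid }

pub fn table_index(nr: i32) -> TableIndex {
    if nr >= 0 {
        TableIndex::Linux(nr as usize)
    } else if nr >= -4096 {
        // -1 maps to index 0, -4096 maps to index 4095.
        TableIndex::Native((-(nr as i64) - 1) as usize)
    } else {
        TableIndex::Invalid // dispatcher returns -ENOSYS
    }
}
```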
Per-architecture register mapping:
The syscall entry stub extracts the syscall number and up to 6 arguments from architecture-specific registers. The mapping is fixed by the Linux ABI and must be identical for binary compatibility:
| Register | x86-64 | AArch64 | RISC-V 64 | ARMv7 | PPC64LE | PPC32 | s390x | LoongArch64 |
|---|---|---|---|---|---|---|---|---|
| Syscall number | rax (eax) | x8 (w8) | a7 | r7 | r0 | r0 | r1 | a7 |
| Arg 0 | rdi | x0 | a0 | r0 | r3 | r3 | r2 | a0 |
| Arg 1 | rsi | x1 | a1 | r1 | r4 | r4 | r3 | a1 |
| Arg 2 | rdx | x2 | a2 | r2 | r5 | r5 | r4 | a2 |
| Arg 3 | r10 | x3 | a3 | r3 | r6 | r6 | r5 | a3 |
| Arg 4 | r8 | x4 | a4 | r4 | r7 | r7 | r6 | a4 |
| Arg 5 | r9 | x5 | a5 | r5 | r8 | r8 | r7 | a5 |
| Return value | rax | x0 | a0 | r0 | r3 | r3 | r2 | a0 |
Notes:
- x86-64: r10 is used for arg 3 instead of rcx because SYSCALL clobbers
rcx (saves rip there). glibc's syscall() wrapper moves the fourth argument
from rcx to r10 before the SYSCALL instruction.
- AArch64: The syscall number is in w8 (32-bit view of x8). The entry stub
sign-extends to x8 via sxtw for UmkaOS-native negative syscall numbers.
- RISC-V 64: Uses the ecall instruction. The syscall number is in a7, which
differs from the standard calling convention (where a7 is argument 7).
- ARMv7: Uses svc #0. The syscall number is in r7. Arguments overlap with
the standard ARM calling convention registers.
- PPC64LE/PPC32: Uses sc (system call) instruction. The syscall number is in
r0, which is NOT the first argument register (that is r3).
- s390x: Uses SVC 0 (Supervisor Call) instruction. UmkaOS uses the modern
s390x syscall convention: syscall number in %r1, SVC immediate must be 0. The
legacy SVC immediate encoding (syscall number in the SVC operand) is not supported.
Arguments are in r2-r7, return value in r2. The SVC triggers a PSW (Program
Status Word) swap, which saves the old PSW and loads the new PSW from the SVC
old/new PSW pair.
- LoongArch64: Uses the SYSCALL instruction. The syscall number is in a7,
arguments in a0-a5, return value in a0 — identical register convention to RISC-V.
InnerRingForward dispatch protocol:
1. The handler resolves the syscall's file descriptor to its OpenFile entry,
which contains a FileOps vtable pointer.
2. Invokes the appropriate FileOps method (read, write, ioctl, etc.)
with a domain switch into the Tier 1 driver's isolation domain.
3. For non-fd syscalls (mount, umount), the handler dispatches to the
VFS KABI vtable via KabiDispatch::invoke().
4. Return value is translated back to Linux errno convention.
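Steps 1-2 of the protocol above reduce to an fd-table lookup followed by a vtable call. The sketch below uses stub FileOps/OpenFile types and elides the domain switch that brackets the vtable call in the real path; the indirection structure is what it is meant to show.

```rust
/// Stub vtable: the real FileOps carries read/write/ioctl/etc.
pub struct FileOps { pub read: fn(&OpenFile, &mut [u8]) -> i64 }
pub struct OpenFile { pub ops: &'static FileOps }

const EBADF: i64 = 9;

/// Resolve fd to its OpenFile entry and invoke the FileOps read method.
/// In the real path, a domain switch into the Tier 1 driver's isolation
/// domain brackets the vtable call (elided here).
pub fn sys_read(fd_table: &[Option<OpenFile>], fd: usize, buf: &mut [u8]) -> i64 {
    match fd_table.get(fd).and_then(|e| e.as_ref()) {
        Some(file) => (file.ops.read)(file, buf),
        None => -EBADF, // closed or out-of-range fd
    }
}
```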
19.1.3 Foundational ABI Types¶
19.1.3.1 KernelLong / KernelULong¶
The C long type is 4 bytes on ILP32 (ARMv7, PPC32) and 8 bytes on LP64 (x86-64,
AArch64, RISC-V 64, PPC64LE, s390x, LoongArch64). All #[repr(C)] ABI structs that
contain C long or unsigned long fields MUST use KernelLong / KernelULong
instead of hard-coded i64/u64 to ensure correct layout on all 8 supported
architectures.
/// Rust equivalent of Linux's `__kernel_long_t`.
/// C `long` is 4 bytes on ILP32 (ARMv7, PPC32) and 8 bytes on LP64.
///
/// **ABI rule**: Every `#[repr(C)]` struct exposed to userspace via syscall,
/// ioctl, procfs, or core dump that contains a C `long` field MUST use this
/// type. Using `i64` directly is a 32-bit ABI break.
///
/// **Review checklist item**: "Does this ABI struct use KernelLong for all
/// C `long` fields?"
#[cfg(target_pointer_width = "64")]
pub type KernelLong = i64;
#[cfg(target_pointer_width = "32")]
pub type KernelLong = i32;
/// Rust equivalent of Linux's `__kernel_ulong_t`.
/// Same width rules as `KernelLong` but unsigned.
#[cfg(target_pointer_width = "64")]
pub type KernelULong = u64;
#[cfg(target_pointer_width = "32")]
pub type KernelULong = u32;
Affected structs (non-exhaustive): Timeval, RusageWire, SigInfoSigchld,
SigInfoSigpoll, SigInfoSigfault padding, ElfPrstatus signal masks, epoll_event
(packing), AccelCbsServer (32-bit atomics). Each struct definition includes
per-architecture const_assert! for its size.
const_assert! pattern for per-architecture size verification:
#[cfg(target_pointer_width = "64")]
const_assert!(size_of::<Timeval>() == 16);
#[cfg(target_pointer_width = "32")]
const_assert!(size_of::<Timeval>() == 8);
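Putting the two pieces together, a wire struct such as Timeval (listed among the affected structs) declares its C long fields as KernelLong so its size tracks the platform. A minimal sketch, using a plain size check in place of the const_assert! macro:

```rust
/// Width-tracking alias for C `long` (see KernelLong above).
#[cfg(target_pointer_width = "64")]
pub type KernelLong = i64;
#[cfg(target_pointer_width = "32")]
pub type KernelLong = i32;

/// ABI-correct timeval: 16 bytes on LP64, 8 bytes on ILP32, because both
/// fields follow the platform's C `long`. Hard-coding i64 here would
/// silently break the 32-bit ABI.
#[repr(C)]
pub struct Timeval {
    pub tv_sec: KernelLong,
    pub tv_usec: KernelLong,
}
```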
19.1.4 Virtual Filesystems¶
These synthetic filesystems are critical for compatibility. Many Linux tools parse them directly and will break if the format is even slightly wrong.
| Filesystem | Implementation | Critical consumers |
|---|---|---|
| /proc | Synthetic, generated from kernel state | ps, top, htop, systemd, Docker |
| /sys | Reflects device tree from bus manager | udev, systemd, lspci, lsusb |
| /dev | Maps to KABI device interfaces | Everything (devtmpfs-compatible) |
| /dev/shm | tmpfs shared memory | POSIX shm_open, Chrome, Firefox |
| /run | tmpfs | systemd, dbus, PID files |
Key /proc entries that must be pixel-perfect:
- /proc/meminfo -- parsed by free, top, OOM killer
- /proc/cpuinfo -- parsed by many applications for CPU feature detection
- /proc/[pid]/maps -- parsed by debuggers, profilers, JVMs
- /proc/[pid]/status -- parsed by ps, container runtimes
- /proc/[pid]/fd/ -- used by lsof, process managers
- /proc/self/exe -- readlink used by many applications to find themselves
- /proc/sys/ -- sysctl interface for kernel tuning
Format baseline: /proc file formats target Linux 6.1 LTS output. Field ordering, whitespace, and units match procfs as of kernel 6.1. Newer fields added in later kernels are included when the corresponding UmkaOS subsystem supports the feature (e.g., VmFlags in /proc/[pid]/smaps is populated when the memory manager tracks the relevant flags).
Implementation specification strategy: Rather than duplicating Linux's procfs format definitions here (which would become stale as Linux evolves), each /proc entry is implemented as a format-test pair: the implementation references the corresponding Linux 6.1 fs/proc/*.c source as the authoritative format spec, and a companion integration test captures the expected output from a Linux 6.1 reference VM and asserts byte-for-byte match. Critical entries have explicit format notes:
| Entry | Key format rules |
|---|---|
| /proc/meminfo | `FieldName: %8lu kB\n` -- right-aligned 8-char value, space-colon-space, always kB units |
| /proc/cpuinfo | Tab-separated `key\t: value\n`, blank line between CPUs, flags field is space-separated |
| /proc/[pid]/maps | `%08lx-%08lx %4s %08lx %02x:%02x %lu %s\n` (hex ranges, perms, offset, dev, inode, pathname) |
| /proc/[pid]/status | `Key:\tvalue\n` (tab after colon), sizes in kB, Uid/Gid have 4 tab-separated fields |
| /proc/stat | Space-separated, first field `cpu` or `cpu%d`, jiffy values in USER_HZ (100) |
Remaining /proc entries are specified at implementation time using the same test-driven approach (capture reference output → assert match).
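The format-test pairing can be illustrated with the /proc/meminfo rule. The formatter below is a hypothetical helper (not quoted from the implementation) that renders one field in the `FieldName: %8lu kB` shape stated in the table above; the companion test then asserts the exact bytes, which is the same byte-for-byte discipline the integration tests apply against the Linux 6.1 reference output.

```rust
/// Render one /proc/meminfo line: field name, colon, value right-aligned
/// in 8 characters, fixed " kB" suffix, trailing newline.
pub fn meminfo_line(name: &str, kib: u64) -> String {
    format!("{}: {:>8} kB\n", name, kib)
}
```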
19.1.5 Complete Feature Coverage¶
These features must be designed into the architecture from day one. They cannot be bolted on later.
For eBPF subsystem specification, see Section 19.2.
19.1.5.1 KVM Hypervisor¶
KVM runs as a Tier 1 driver with extended hardware privileges, exposing the /dev/kvm
interface. Unlike most Tier 1 drivers that access a single device via MMIO, KVM requires
access to VM control structures (VMCS/VMCB/HCR_EL2 configuration). These are granted as
capabilities at registration time via KvmHardwareCapability — a structured capability
exchange at the KABI boundary that permits umka-core to execute VMX/VHE/H-extension
operations on KVM's behalf through a validated VMX/VHE trampoline. The trampoline runs
in the UmkaOS Core protection domain (PKEY 0 on x86-64) and performs the actual
VMLAUNCH/VMRESUME/ERET, validating VMCS fields (no host-state corruption, EPT does
not map UmkaOS Core pages writable to the guest) before executing VM entry.
There is no "Tier 0.5" — KVM fits the Tier 1 model with a richer capability set. KVM is memory-domain isolated from UmkaOS Core (MPK on x86-64, POE or page-table+ASID on AArch64) exactly as any other Tier 1 driver. The trampoline code (~200 lines of verified assembly) is small enough to audit as Tier 0 code; it is the only code that executes VMX instructions and is the security boundary between KVM's isolation domain and Core private state.
A KVM crash triggers the Tier 1 crash recovery path (Section 11.7.2) with one additional step: all active VM execution contexts are suspended before the driver is reloaded. After umka-kvm reloads (~150 ms, FLR path for any assigned devices), the VMCS state for each VM is reconstructed from the checkpointed state buffer (Section 11.9). VMs resume without guest-visible interruption beyond a brief pause. If reconstruction fails, the VM is terminated — the same outcome as a host kernel crash in Linux, but without affecting other VMs or the host.
- Full x86-64 VMX support:
- Nested paging (EPT)
- VMCS shadowing (for nested virtualization)
- Posted interrupts (for efficient interrupt delivery)
- PML (Page Modification Logging)
- QEMU/KVM, libvirt, Firecracker, Cloud Hypervisor must work unmodified
ARM64 KVM (VHE/nVHE):
ARM64 KVM uses the Virtualization Extensions (ARMv8.1+). Two modes are supported:
VHE (Virtualization Host Extensions, ARMv8.1+):
- Host kernel runs at EL2 (hypervisor exception level) instead of EL1.
- Guest runs at EL1 (virtual EL1, translated by VHE).
- Benefit: no world switch needed for host kernel — host IS the hypervisor.
- VTTBR_EL2 points to guest's Stage-2 translation tables.
- Guest physical → host physical translation via Stage-2 page tables.
- Used on: AWS Graviton, Ampere, Apple Silicon, Cortex-X series.
nVHE (non-VHE, pre-ARMv8.1 or when VHE is disabled):
- Host kernel runs at EL1. Hypervisor stub at EL2.
- Guest entry requires EL1 → EL2 → EL1(guest) transition.
- Higher overhead (~500-1000 cycles per VM entry/exit vs ~200 for VHE).
- UmkaOS supports nVHE for older ARM64 hardware but defaults to VHE.
Protected KVM (pKVM, ARMv8.0+):
- EL2 hypervisor is a small, deprivileged module (~5K lines).
- Host kernel runs at EL1 with restricted Stage-2 mappings.
- Guest memory is inaccessible to the host (confidential VMs without TEE).
- Aligns with UmkaOS's isolation model: pKVM enforces VM isolation in hardware.
ARM64 KVM integration with UmkaOS isolation:
- On ARM64, the isolation mechanism is POE/page-table (not MPK). KVM uses a
Stage-2 trampoline analogous to the x86 VMX trampoline: umka-core manages
VTTBR_EL2 and HCR_EL2 writes; umka-kvm prepares the VM configuration in its own
isolation domain. The trampoline validates Stage-2 page tables before executing
the ERET to enter the guest.
- PSCI (Power State Coordination Interface) for vCPU bring-up: KVM intercepts PSCI
calls from the guest via HVC/SMC trapping in HCR_EL2.
- Virtual GIC (vGICv3/vGICv4): Interrupt injection uses GICv4 direct injection where
available (zero exit for most interrupts), falling back to software injection.
ARM64 VHE/nVHE Selection Algorithm:
KVM on AArch64 has two host kernel execution modes:
- VHE (Virtualization Host Extensions, ARMv8.1+): Host kernel runs at EL2 (hypervisor level). Eliminates world-switch overhead for EL1/EL0 operations. Preferred when available.
- nVHE: Host kernel runs at EL1; a stub firmware runs at EL2. Requires a full world-switch on every VM entry/exit. Used on hardware without VHE or when EL2 is already occupied.
Selection at boot (in umka-kvm/src/arm64/init.rs):
fn select_kvm_mode() -> KvmMode {
    // 1. CPU feature check: ID_AA64MMFR1_EL1.VH (bits [11:8]) == 1 means
    //    VHE is supported.
    if !cpuid::has_feature(CpuFeature::VHE) {
        return KvmMode::NvHE; // hardware does not support VHE
    }
    // 2. Check whether another hypervisor already owns EL2 (e.g., Xen, pKVM).
    //    Read HCR_EL2 -- if the E2H bit is 0 and we did not set it, EL2 is occupied.
    if hcr_el2_read().e2h() == 0 && !boot_claimed_el2() {
        return KvmMode::NvHE; // EL2 owned by firmware/another hypervisor
    }
    // 3. pKVM (Protected KVM) requires nVHE to maintain its own EL2 firmware
    //    for confidential-VM isolation. If the CONFIG_PKVM equivalent is
    //    enabled in umka-kvm, force nVHE.
    if umka_kvm_config().protected_kvm_enabled {
        return KvmMode::NvHE; // pKVM requires nVHE
    }
    // 4. All checks passed: use VHE.
    KvmMode::VHE
}
Runtime effects:
- VHE: HCR_EL2.E2H = 1, TGE = 1 set at boot. EL1 system register accesses are redirected to EL2. No mode switch cost; ~15-30% better VM density on high-frequency VM-exit workloads.
- nVHE: A small EL2 stub (umka_kvm_hyp) is installed at boot. Each VM entry/exit involves saving/restoring the host EL1 context (~50-150 cycles overhead per VM exit).
RISC-V KVM (H-extension):
RISC-V virtualization is defined by the H (Hypervisor) extension (ratified December 2021, as part of Privileged Architecture v1.12):
H-extension architecture:
- Hypervisor runs in HS-mode (Hypervisor-extended Supervisor mode).
- Guest runs in VS-mode (Virtual Supervisor mode).
- hstatus CSR: hypervisor status (SPV bit tracks guest/host context).
- hgatp CSR: guest physical → host physical address translation
(analogous to EPT on x86 and Stage-2 on ARM).
- htval CSR: faulting guest physical address (for #PF handling).
- hvip/hip/hie CSRs: virtual interrupt injection.
- Guest trap delegation: hedeleg/hideleg CSRs control which traps
go to VS-mode (guest handles) vs HS-mode (hypervisor handles).
VM entry/exit:
- Entry: set hstatus.SPV = 1, sret → enters VS-mode.
- Exit: guest trap/interrupt → HS-mode handler (automatic by hardware).
- Cost: ~200-400 cycles per exit (varies by implementation).
IOMMU: RISC-V IOMMU spec (ratified June 2023) provides Stage-2 translation
for device DMA, analogous to Intel VT-d / ARM SMMU.
RISC-V KVM integration with UmkaOS:
- The umka-kvm driver manages hgatp (guest page tables) and hvip (virtual interrupts) in its isolation domain. The HS-mode trampoline validates hgatp entries before guest entry.
- H-extension hardware is available on SiFive P670, T-Head C910, and QEMU virt. UmkaOS targets QEMU for initial development.
KVM and Domain Isolation — KVM requires capabilities beyond a standard MMIO device
driver. Unlike a NIC or storage driver that accesses a single device via MMIO, KVM requires:
(1) VMX root mode transitions (VMXON, VMLAUNCH, VMRESUME), which are privileged
Ring 0 operations that affect global CPU state; (2) VMCS manipulation, which Intel
requires to be in a specific memory region pointed to by a per-CPU VMCS pointer;
(3) EPT (Extended Page Table) management, which programs second-level page tables
that control guest physical-to-host physical address translation; (4) direct access
to MSRs and control registers during VM entry/exit.
These capabilities are incompatible with a plain memory-domain isolation model — the
hardware memory domain mechanism (WRPKRU/POR_EL0/DACR) controls memory access permissions,
not instruction execution privilege. KVM is therefore classified as a Tier 1 driver with
extended hardware privileges, granted KvmHardwareCapability at KABI registration time.
This capability authorizes umka-core to execute VMX/VHE/H-extension operations on KVM's
behalf via a validated VMX/VHE trampoline that runs in the UmkaOS Core protection domain
(PKEY 0 on x86-64). KVM prepares the VMCS and EPT in its own memory isolation domain; the
trampoline validates the VMCS fields (no host-state corruption, EPT does not map UmkaOS Core
pages writable to the guest), then executes the VM entry. KVM retains Tier 1
crash-recovery semantics — a bug in KVM's VMCS preparation or ioctl handling crashes only
KVM, not UmkaOS Core.
Why not Tier 0? — Tier 0 code cannot crash-recover. By running KVM as a Tier 1 driver with a validated trampoline, a fault in KVM's VMCS preparation or ioctl handling crashes only KVM, not UmkaOS Core. The VMX trampoline itself is ~200 lines of verified assembly — small enough to audit as Tier 0 code.
Recovery implications — When umka-kvm crashes, all running VMs are paused (their vCPU threads are halted). After umka-kvm reloads (~150 ms, FLR path for any assigned devices), the VMCS state for each VM is reconstructed from the checkpointed state buffer (Section 11.9). VMs resume without guest-visible interruption beyond a brief pause. If reconstruction fails, the VM is terminated (same outcome as a host kernel crash in Linux, but without affecting other VMs or the host).
KVM Integration with umka-core Memory Management:
KVM's Extended Page Tables (EPT on x86, Stage-2 on ARM, hgatp on RISC-V) require tight integration with umka-core's memory management subsystem (Section 4.1):
Second-Level Address Translation (SLAT) hooks:
/// umka-core provides these hooks to umka-kvm for EPT/Stage-2 management.
/// Each hook operates on host physical frames and guest physical addresses.
pub trait SlatHooks {
/// Allocate a physical page for SLAT page table structures (EPT/Stage-2/hgatp
/// page table entries). These are hypervisor metadata pages used to build the
/// second-level address translation tables — NOT guest physical memory backing
/// pages. Returns a pinned frame suitable for use as a page table page.
/// Allocates from the VM's pre-allocated SLAT page pool first (GFP_ATOMIC
/// safe — pool access is O(1) with no sleeping). If the pool is exhausted,
/// falls back to the buddy allocator. The `pool_or_fallback` parameter
/// controls this behavior:
/// - `SlatAllocMode::PoolOnly`: Only try the pool. Returns `Err` if empty.
/// Used during VM exit handling where sleeping is not permitted.
/// - `SlatAllocMode::PoolThenBuddy`: Try pool first, then buddy (may sleep
/// if GFP_KERNEL). Used during VM setup and pre-fault paths.
fn alloc_slat_page(&self, mode: SlatAllocMode) -> Result<PhysFrame, KernelError>;
/// Free a SLAT page table structure page previously allocated by
/// `alloc_slat_page`.
fn free_slat_page(&self, frame: PhysFrame);
/// Allocate a physical page to back guest physical memory. This is the host
/// physical frame that the guest will use as RAM — mapped into the SLAT tables
/// as a leaf entry. Distinct from `alloc_slat_page`, which allocates page table
/// structure pages (internal SLAT nodes).
fn alloc_guest_page(&self) -> Result<PhysFrame, KernelError>;
/// Free a guest physical memory backing page previously allocated by
/// `alloc_guest_page`, returning it to umka-core's buddy allocator.
fn free_guest_page(&self, frame: PhysFrame);
/// Pin a host physical page to prevent reclaim or migration while it is
/// mapped in an EPT/Stage-2 table. The page remains pinned until the
/// corresponding `unpin_host_page` call.
fn pin_host_page(&self, frame: PhysFrame) -> Result<(), KernelError>;
/// Unpin a host physical page, allowing umka-core to reclaim or migrate it.
fn unpin_host_page(&self, frame: PhysFrame);
/// Notify umka-core that a guest physical to host physical mapping was created.
/// Used for dirty page tracking and live migration bookkeeping.
fn notify_slat_map(&self, gpa: u64, hpa: u64, size: usize, writable: bool);
/// Notify umka-core that a SLAT mapping was removed.
fn notify_slat_unmap(&self, gpa: u64, size: usize);
}
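The PoolOnly/PoolThenBuddy semantics documented on alloc_slat_page can be sketched as a small allocation policy. The types below (SlatPool, Vec-backed free lists) are illustrative stand-ins, not the umka-core allocator; the point is the pool-first, sleep-avoiding fallback order.

```rust
#[derive(Clone, Copy, PartialEq, Debug)]
pub enum SlatAllocMode { PoolOnly, PoolThenBuddy }

#[derive(Debug, PartialEq)]
pub struct PhysFrame(pub u64);

/// Pre-allocated per-VM pool of SLAT page-table pages (stub: a Vec of PFNs).
pub struct SlatPool { pub free: Vec<u64> }

impl SlatPool {
    /// Pool-first allocation: O(1), never sleeps. Only PoolThenBuddy may
    /// fall back to the (possibly sleeping) buddy allocator, modeled here
    /// as another free list.
    pub fn alloc(
        &mut self,
        mode: SlatAllocMode,
        buddy: &mut Vec<u64>,
    ) -> Result<PhysFrame, &'static str> {
        if let Some(pfn) = self.free.pop() {
            return Ok(PhysFrame(pfn)); // pool hit, no sleeping
        }
        match mode {
            // VM-exit handling path: sleeping is not permitted, so fail fast.
            SlatAllocMode::PoolOnly => Err("SLAT pool exhausted"),
            // VM setup / pre-fault path: buddy fallback is allowed.
            SlatAllocMode::PoolThenBuddy => {
                buddy.pop().map(PhysFrame).ok_or("out of memory")
            }
        }
    }
}
```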
Memory overcommit: umka-kvm can overcommit guest memory (assign more virtual memory to VMs than is physically available). When a guest accesses an unmapped guest physical page, the EPT violation is handled through a five-step path:
1. VM exit to trampoline: The EPT/Stage-2/hgatp violation triggers a VM exit. The VMX trampoline (running in PKEY 0/umka-core) captures the faulting guest physical address from VMCS (x86), FAR_EL2 (ARM), or htval (RISC-V).
2. Synchronous upcall to umka-kvm (Architectural exception to the Unified Domain Model): The trampoline performs a direct function call (not ring buffer IPC) to umka-kvm's page fault handler. This is an explicit exception to the universal rule "different domain = ring buffer" (see 00-design-philosophy.md §Unified Domain Model).
   Justification: The EPT violation is synchronous within the vCPU thread context. A ring buffer round-trip (~200+ cycles) on every SLAT fault would add ~100-200 ns to every guest page fault — unacceptable for KVM performance. The direct call costs ~80-130 cycles round-trip (two one-way domain switches: Tier 0 PKEY 0 -> umka-kvm PKEY 7 for the handler call, then umka-kvm PKEY 7 -> Tier 0 PKEY 0 for return to the vcpu_run loop before VMRESUME). Each one-way domain switch costs ~30-50 cycles (WRPKRU + register save/restore); the sum ~60-100 cycles plus call overhead gives ~80-130 total. This is 2-3x cheaper than the ring path.
   Safety: The direct call is safe because:
   - The call is synchronous within the vCPU thread context (no concurrency with other umka-kvm operations on this vCPU).
   - umka-kvm's page fault handler runs in its isolation domain but accesses only its own per-VM data structures.
   - The trampoline validates that the fault is a legitimate EPT violation (not a malicious call from compromised code) before invoking umka-kvm.
   - If umka-kvm crashes during the upcall, the domain crash recovery mechanism handles it identically to a ring-based crash — the blast radius is the same.
3. Page request: umka-kvm requests a guest backing page from umka-core via SlatHooks::alloc_guest_page (another direct call; umka-core is PKEY 0).
4. Page allocation: umka-core allocates from the buddy allocator, potentially reclaiming pages from page cache, compressing cold pages (Section 4.12), or evicting pages from other guests based on the memory pressure framework.
5. Mapping and resume: umka-kvm installs the EPT/Stage-2 mapping in its per-VM page tables and returns to the trampoline, which resumes the guest via VMRESUME/ERET.
Total EPT violation latency: ~200 cycles (VM exit) + ~50 cycles (trampoline + domain switch) + ~100-500 cycles (page allocation, varies by pressure) + ~200 cycles (VM entry) = ~550-950 cycles for a page-in from free list. This is comparable to Linux KVM's EPT violation handling (~400-800 cycles on similar hardware).
Dirty page tracking for live migration uses architecture-specific mechanisms:
- PML (Page Modification Logging) on Intel: hardware logs dirty guest physical addresses to a 512-entry buffer in the VMCS. When the buffer fills, a VM exit occurs and umka-kvm drains the buffer into a per-VM dirty bitmap.
- Software dirty tracking on ARM/RISC-V: umka-kvm clears the write permission bit in Stage-2/hgatp entries. Write faults trap into umka-kvm, which records the dirty page in the bitmap and restores write permission. Batched permission restoration amortizes the TLB invalidation cost.
- umka-core maintains per-VM dirty bitmaps (one bit per 4 KiB page) that can be queried and atomically reset by the migration coordinator.
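The per-VM dirty bitmap described above (one bit per 4 KiB page, atomically queried and reset by the migration coordinator) can be sketched as follows. This is a simplified single-threaded illustration (Vec-backed words, no atomics); the real bitmap lives in umka-core and is reset atomically with respect to concurrent markers.

```rust
/// One bit per 4 KiB guest page. `mark_dirty` is called from the PML
/// drain path (x86) or the write-fault path (ARM/RISC-V);
/// `query_and_reset` is the migration coordinator's snapshot.
pub struct DirtyBitmap { words: Vec<u64> }

impl DirtyBitmap {
    pub fn new(pages: usize) -> Self {
        Self { words: vec![0; (pages + 63) / 64] }
    }
    pub fn mark_dirty(&mut self, gfn: usize) {
        self.words[gfn / 64] |= 1 << (gfn % 64);
    }
    /// Return the dirty guest frame numbers and clear the bitmap in one pass.
    pub fn query_and_reset(&mut self) -> Vec<usize> {
        let mut dirty = Vec::new();
        for (i, w) in self.words.iter_mut().enumerate() {
            let mut bits = std::mem::take(w); // snapshot word, zero it
            while bits != 0 {
                let b = bits.trailing_zeros() as usize;
                dirty.push(i * 64 + b);
                bits &= bits - 1; // clear lowest set bit
            }
        }
        dirty
    }
}
```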
Ballooning integration: The virtio-balloon driver in the guest inflates (returns
pages to the host) or deflates (reclaims pages from the host). umka-kvm processes
balloon requests by calling free_guest_page on inflation (returning the host physical
frame to umka-core's buddy allocator) and alloc_guest_page on deflation (allocating a
new guest backing frame and installing the EPT mapping). Balloon state is included in the umka-kvm
checkpoint for crash recovery (Section 11.9).
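The balloon bookkeeping above amounts to moving frames between the guest's backing set and the host allocator. A minimal sketch with stub frame lists (the free_guest_page/alloc_guest_page calls are represented by plain list operations here):

```rust
/// Stub per-VM balloon state: the set of host frames backing guest RAM.
pub struct Balloon { pub guest_frames: Vec<u64> }

impl Balloon {
    /// Inflation: the guest surrendered `n` pages, so their host frames
    /// go back to the host allocator (free_guest_page in the real path).
    pub fn inflate(&mut self, n: usize, host_free: &mut Vec<u64>) {
        for _ in 0..n {
            if let Some(f) = self.guest_frames.pop() {
                host_free.push(f);
            }
        }
    }
    /// Deflation: the guest reclaims `n` pages; allocate new backing frames
    /// (alloc_guest_page + EPT install in the real path). Returns how many
    /// frames were actually obtained.
    pub fn deflate(&mut self, n: usize, host_free: &mut Vec<u64>) -> usize {
        let mut got = 0;
        for _ in 0..n {
            match host_free.pop() {
                Some(f) => { self.guest_frames.push(f); got += 1; }
                None => break, // host under memory pressure
            }
        }
        got
    }
}
```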
19.1.5.2 Netfilter / nftables¶
- Tier 1 network stack includes the nftables packet classification engine
- iptables legacy compatibility via the nft backend (same approach as modern Linux)
- Connection tracking (conntrack) for stateful firewalling
- NAT support: SNAT, DNAT, masquerade
- Required for: Docker networking, Kubernetes kube-proxy (iptables mode), firewalld
19.1.5.3 Linux Security Modules (LSM)¶
- LSM hook framework at all security-relevant points (file access, socket operations, task operations, IPC, etc.)
- SELinux policy engine compatibility (required for RHEL/CentOS/Fedora)
- AppArmor profile compatibility (required for Ubuntu/SUSE)
- Capability-based hooks integrate naturally with UmkaOS's native capability model
- seccomp-bpf for per-process syscall filtering (required for Docker, Chrome)
The architecture guarantees that every Linux LSM hook has a corresponding UmkaOS enforcement point — either a direct capability check or a policy module callout (Section 19.9). Scope estimate: Linux 6.x defines ~220 LSM hook points across file, inode, task, socket, IPC, key, audit, BPF, and perf_event categories. The UmkaOS implementation must provide hook stubs for all ~220 points for SELinux/AppArmor policy modules to attach to.
Partial LSM hook mapping (security-critical hooks):
| LSM Hook | UmkaOS Capability Check | Notes |
|---|---|---|
| inode_permission | CAP_DAC_OVERRIDE, CAP_DAC_READ_SEARCH | File permission bypass |
| file_ioctl | Capability from device driver's DriverVTable | Device-specific |
| bprm_check_security | CAP_SETUID, CAP_SETGID | setuid/setgid binary execution |
| ptrace_access_check | CAP_SYS_PTRACE | Cross-process ptrace |
| capable | Direct capability lookup in TaskCredential | General capability gate |
| socket_create | CAP_NET_RAW for raw sockets | Network raw access |
| key_alloc | CAP_SYS_ADMIN for kernel keyrings | Key management |
| task_setrlimit | CAP_SYS_RESOURCE | Resource limit changes |
| sb_mount | CAP_MOUNT | Mount operations (regular mounts require CAP_MOUNT only; CAP_SYS_ADMIN is for pivot_root) |
| inode_setattr | Ownership + CAP_FOWNER | Attribute changes |
The complete hook-to-capability mapping (all ~220 hooks) is generated by a build-time code
generator that reads Linux 6.1 LTS security/security.h (see hook stub generation below).
The invariant is: every LSM hook that Linux uses for privilege
enforcement maps to exactly one UmkaOS capability check; hooks that only enforce DAC
(discretionary access control) map to the TaskCredential uid/gid/mode checks.
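The one-hook-one-check invariant means each generated stub is essentially a forward to a single capability test. The sketch below pictures what one generated stub for task_setrlimit might look like; the Capability and Cred types here are simplified stand-ins for the real TaskCredential machinery, not quoted output of the generator.

```rust
/// Stub capability set (the real kernel uses a bitmask in TaskCredential).
#[derive(Clone, Copy, PartialEq, Debug)]
pub enum Capability { DacOverride, SysPtrace, SysResource }

pub struct Cred { pub caps: Vec<Capability> }

impl Cred {
    pub fn capable(&self, c: Capability) -> bool { self.caps.contains(&c) }
}

const EPERM: i64 = 1;

/// Generated stub for the `task_setrlimit` LSM hook: per the invariant,
/// it maps 1:1 onto the CAP_SYS_RESOURCE check. 0 = allowed, -EPERM = denied.
pub fn hook_task_setrlimit(cred: &Cred) -> i64 {
    if cred.capable(Capability::SysResource) { 0 } else { -EPERM }
}
```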
Complete LSM Hook Categories (all ~220 hooks, organized by subsystem):
| Category | Hook Count | Key Hooks | UmkaOS Mapping |
|---|---|---|---|
| Filesystem / Inode | ~45 | inode_permission, inode_create, inode_link, inode_unlink, inode_symlink, inode_mkdir, inode_rmdir, inode_mknod, inode_rename, inode_readlink, inode_follow_link, inode_setattr, inode_getattr, inode_setxattr, inode_getxattr, inode_listxattr, inode_removexattr | DAC checks + CAP_DAC_OVERRIDE, CAP_FOWNER, CAP_FSETID |
| File | ~15 | file_permission, file_alloc_security, file_free_security, file_ioctl, file_mmap, file_mprotect, file_lock, file_fcntl, file_send_sigiotask, file_receive, file_open | File capability from device driver; mmap permission check |
| Superblock / Mount | ~10 | sb_alloc_security, sb_free_security, sb_copy_data, sb_remount, sb_kern_mount, sb_show_options, sb_statfs, sb_mount, sb_check_sb, sb_umount | CAP_MOUNT for mount/umount; CAP_SYS_ADMIN for pivot_root and MNT_LOCKED override |
| Task / Process | ~25 | task_create, task_free, cred_alloc_blank, cred_free, cred_prepare, cred_transfer, task_setuid, task_setgid, task_setpgid, task_getpgid, task_getsid, task_getsecid, task_setnice, task_setioprio, task_getioprio, task_prlimit, task_setrlimit, task_setscheduler, task_getscheduler, task_movememory, task_kill, task_wait_pid | TaskCredential checks |
| Network Socket | ~30 | socket_create, socket_post_create, socket_bind, socket_connect, socket_listen, socket_accept, socket_sendmsg, socket_recvmsg, socket_getsockname, socket_getpeername, socket_getsockopt, socket_setsockopt, socket_shutdown, socket_sock_rcv_skb, socket_getpeersec_stream, socket_getpeersec_dgram | CAP_NET_RAW, CAP_NET_BIND_SERVICE, etc. |
| IPC | ~20 | ipc_permission, msg_msg_alloc_security, msg_msg_free_security, msg_queue_alloc_security, msg_queue_free_security, msg_queue_associate, msg_queue_msgctl, msg_queue_msgsnd, msg_queue_msgrcv, shm_alloc_security, shm_free_security, shm_associate, shm_shmctl, shm_shmat, sem_alloc_security, sem_free_security, sem_associate, sem_semctl, sem_semop | IPC namespace capability checks |
| Key / Keyring | ~10 | key_alloc, key_free, key_permission, key_getsecurity | CAP_SYS_ADMIN for kernel keyrings |
| BPF | 3 | bpf, bpf_map, bpf_prog | CAP_BPF + verifier trust level |
| Audit | 4 | audit_rule_init, audit_rule_known, audit_rule_match, audit_rule_free | auditd integration |
| Misc | ~15 | ptrace_access_check, ptrace_traceme, capget, capset, capable, syslog, vm_enough_memory, mmap_addr, mmap_file, quotactl, sysctl | Per-capability checks |
Hook stub generation (Phase 2): The complete 220-hook stub table is generated by a
build-time code generator that reads Linux 6.1 LTS security/security.h hook signatures
and produces typed Rust stubs in umka-security/src/lsm/hooks.rs. Each stub either:
- Performs a direct capability check (hooks without data-access restrictions).
- Calls into the active LSM policy module (SELinux/AppArmor) for policy-based decisions.
- Returns 0 unconditionally (hooks with no security relevance in UmkaOS's model, e.g., bprm_committed_creds).
The hook-to-capability mapping is declared as a const table in umka-security/src/lsm/hooks.rs.
LSM hooks are not generated from the .kabi IDL — the KABI IDL is used for driver
interface versioning, not for security framework hook dispatch. LSM hooks are invoked
directly from the syscall translation layer in umka-sysapi and from UmkaOS Core at the
corresponding kernel-internal operation points (see Section 19.1.5.4 below).
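As an illustration of the const hook-to-capability table, here is a runnable miniature. The hook and capability names are placeholders for illustration, not the actual umka-security definitions:

```rust
// Hypothetical miniature of the const hook→capability mapping table.
// LsmHook, Capability, and HOOK_CAPS are illustrative names only.
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
enum Capability { DacOverride, SysPtrace, NetRaw, SysResource, Mount }

#[derive(Clone, Copy, PartialEq, Eq, Debug)]
enum LsmHook { InodePermission, PtraceAccessCheck, SocketCreate, TaskSetrlimit, SbMount }

// One row per hook: which capability gates it (None = pure DAC check).
const HOOK_CAPS: &[(LsmHook, Option<Capability>)] = &[
    (LsmHook::InodePermission,   Some(Capability::DacOverride)),
    (LsmHook::PtraceAccessCheck, Some(Capability::SysPtrace)),
    (LsmHook::SocketCreate,      Some(Capability::NetRaw)),
    (LsmHook::TaskSetrlimit,     Some(Capability::SysResource)),
    (LsmHook::SbMount,           Some(Capability::Mount)),
];

fn required_cap(hook: LsmHook) -> Option<Capability> {
    HOOK_CAPS.iter().find(|(h, _)| *h == hook).and_then(|(_, c)| *c)
}

fn main() {
    assert_eq!(required_cap(LsmHook::SbMount), Some(Capability::Mount));
    assert_eq!(required_cap(LsmHook::PtraceAccessCheck), Some(Capability::SysPtrace));
    println!("ok");
}
```

A const table like this gives the build-time generator a single place to emit every hook's enforcement rule, and lookups compile to a table scan with no heap use.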
19.1.5.4 LSM Hook Invocation Architecture¶
UmkaOS implements Linux's LSM hook model for binary compatibility with security modules
(AppArmor, SELinux profiles, seccomp filters). The subsections below specify the
userspace copy-in helpers, the hook invocation points, and the dispatch ordering for
these hooks.
19.1.5.4.1.1 Userspace Path Copying¶
All syscall handlers that accept a pathname argument from userspace use
copy_path_from_user to safely copy the NUL-terminated string into a
kernel-owned KernelPath (a #![no_std]-compatible NUL-terminated byte string).
Two entry points exist for user→kernel path transfers: this warm-path variant
(heap-allocated) and copy_path_from_user_stack (hot-path, stack-allocated).
/// Copy a NUL-terminated pathname from userspace into a kernel-owned byte buffer.
/// Returns a `KernelPath` (`#![no_std]`-compatible; `PathBuf` is not available).
///
/// Reads up to `max_len` bytes from `user_ptr`, stopping at the first NUL.
/// Returns `EFAULT` if any byte in the range is unmapped or inaccessible.
/// Returns `ENAMETOOLONG` if no NUL terminator is found within `max_len` bytes.
/// Returns `ENOENT` if the resulting path is empty (first byte is NUL).
///
/// `max_len` is typically `PATH_MAX` (4096) for standard pathname syscalls.
/// Callers that accept shorter names (e.g., `gethostname`) pass a tighter bound.
///
/// The returned `KernelPath` owns its allocation (warm path — bounded by `max_len`).
/// Hot-path callers that need to avoid allocation should use `copy_path_from_user_stack`
/// with an `ArrayVec<u8, PATH_MAX>` instead.
///
/// Note: copies up to `max_len` bytes then scans in-kernel for NUL, unlike
/// Linux's `strncpy_from_user` which stops at NUL. Future optimization:
/// word-at-a-time NUL scanning on source before copy.
///
/// # Safety contract
/// `user_ptr` is a raw pointer from userspace — the function performs full access
/// validation via `copy_from_user()` ([Section 4.15](04-memory.md#extended-memory-operations--user-kernel-copy))
/// before dereferencing any byte.
pub fn copy_path_from_user(user_ptr: *const u8, max_len: usize) -> Result<KernelPath, Errno> {
if user_ptr.is_null() {
return Err(Errno::EFAULT);
}
let mut buf = Vec::with_capacity(max_len);
// SAFETY: copy_from_user validates the entire user range [user_ptr, user_ptr + max_len)
// and returns EFAULT on any unmapped or inaccessible page.
let copied = unsafe { copy_from_user(buf.spare_capacity_mut(), user_ptr, max_len)? };
// Scan for NUL terminator.
let nul_pos = copied.iter().position(|&b| b == 0)
.ok_or(Errno::ENAMETOOLONG)?;
if nul_pos == 0 {
return Err(Errno::ENOENT);
}
unsafe { buf.set_len(nul_pos); }
Ok(KernelPath::from_bytes(buf))
}
/// Copy a NUL-terminated pathname from userspace into a caller-provided stack buffer.
///
/// Hot-path variant of `copy_path_from_user()` — avoids heap allocation on
/// every `sys_open()`, `sys_openat()`, `sys_stat()`, etc. The caller supplies
/// a `&mut [u8; PATH_MAX]` (4096 bytes on the stack). The function copies
/// bytes from `user_ptr` into `buf`, scans for the NUL terminator, and returns
/// a `&CStr` borrowing the stack buffer.
///
/// # Errors
///
/// - `EFAULT` — `user_ptr` is null or any byte in the range fails
/// `copy_from_user()` validation.
/// - `ENAMETOOLONG` — no NUL terminator found within `PATH_MAX` bytes.
/// - `ENOENT` — the path is empty (first byte is NUL).
///
/// # Safety
///
/// `user_ptr` must be a valid userspace pointer (validated by `copy_from_user`).
pub fn copy_path_from_user_stack<'a>(
user_ptr: *const u8,
buf: &'a mut [u8; PATH_MAX],
) -> Result<&'a CStr, Errno> {
if user_ptr.is_null() {
return Err(Errno::EFAULT);
}
// SAFETY: copy_from_user validates the entire user range before copying.
let copied = unsafe { copy_from_user(buf.as_mut_ptr(), user_ptr, PATH_MAX)? };
// Find the NUL terminator within the copied region.
let nul_pos = buf[..copied].iter().position(|&b| b == 0)
.ok_or(Errno::ENAMETOOLONG)?;
if nul_pos == 0 {
return Err(Errno::ENOENT);
}
// SAFETY: buf[..nul_pos+1] contains a valid NUL-terminated C string.
Ok(unsafe { CStr::from_bytes_with_nul_unchecked(&buf[..nul_pos + 1]) })
}
19.1.5.4.1.2 Hook Invocation Points¶
At each syscall that Linux defines LSM hooks for, the compat syscall handler calls the corresponding UmkaOS security check before executing the operation:
// In the compat syscall dispatcher (umka-sysapi/src/syscall/fs.rs):
fn sys_open(path: UserPtr<u8>, flags: u32, mode: u32) -> Result<Fd, Errno> {
let task = current_task();
// Hot path: use stack-based copy to avoid heap allocation per open().
let mut path_buf = [0u8; PATH_MAX];
let path = copy_path_from_user_stack(path.as_ptr(), &mut path_buf)?;
// Acquire read lock on FsStruct to get a consistent (root, pwd) snapshot.
// Without this lock, a concurrent chroot() or chdir() could produce a
// root/pwd pair from different points in time.
let fs = task.fs.read();
// Determine lookup flags from open flags. Default open() follows terminal
// symlinks (LOOKUP_FOLLOW). O_NOFOLLOW clears this flag.
let lookup_flags = if flags & O_NOFOLLOW != 0 {
LookupFlags::empty()
} else {
LookupFlags::FOLLOW
};
// path_lookup() takes the full resolution context: mount namespace (mount tree),
// root dentry (chroot boundary), cwd dentry (relative path base), and lookup flags.
// This is equivalent to Linux's path_openat() → link_path_walk() chain.
let dentry = path_lookup(
&task.nsproxy.load().mount_ns, // mount namespace for mount traversal
&fs.root, // chroot root (FsStruct.root)
&fs.pwd, // current working directory (FsStruct.pwd)
&path, // userspace path string
lookup_flags, // LOOKUP_FOLLOW unless O_NOFOLLOW
)?;
// LSM security check — equivalent to Linux's security_inode_open()
// Calls all registered policy providers in order; returns first error.
umka_core::security::check_open(&task.cred, &dentry, flags)?;
vfs_open(dentry, flags, mode)
}
sys_read handler — argument extraction from SyscallContext:
Signature convention note: sys_open above uses typed parameters directly
(the dispatch macro extracts arguments before calling the handler), while sys_read
below uses raw SyscallContext extraction. Both conventions are valid — the typed-
parameter form is preferred for new handlers (clearer, compile-time type checking).
The SyscallContext form is shown here to illustrate the raw extraction mechanism
that the dispatch macro generates internally.
// In umka-sysapi/src/syscall/fs.rs:
fn sys_read(ctx: &mut SyscallContext) -> i64 {
let fd = ctx.args[0] as i32;
let buf = UserPtr::<u8>::new(ctx.args[1] as *mut u8);
let count = ctx.args[2] as usize;
// `?` is unavailable here (the handler returns a raw i64, not a Result),
// so errors are collected via a closure and converted to negative errnos.
let result = (|| -> Result<usize, Errno> {
let file = current_task().files.get(fd).ok_or(Errno::EBADF)?;
let mut user_buf = UserSliceMut::new(buf, count)?;
// Must go through vfs_read() — not file.ops.read() directly — to ensure
// LSM checks (security_file_permission), access mode verification (FMODE_READ),
// fsnotify events, and file position locking are applied consistently.
vfs_read(&file, &mut user_buf, &mut file.f_pos.lock())
})();
match result {
Ok(n) => n as i64,
Err(e) => -(e as i64),
}
}
The security check function (e.g., umka_core::security::check_open) iterates the
registered LSM policy provider list in priority order. Each provider returns Ok(())
to permit or Err(Errno) to deny. The first denial short-circuits the chain. Providers
are registered at boot time and are immutable at runtime (no dynamic LSM loading after
the security namespace is sealed). See Section 9.1 for the full LSM
registration API and provider lifecycle.
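A minimal sketch of the first-denial-wins chain. Provider and function names here are illustrative, not the actual Section 9.1 registration API:

```rust
// Sketch of the LSM provider chain: iterate in priority order,
// short-circuit on the first denial. Names are hypothetical.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Errno { EACCES }

type Provider = fn(path: &str) -> Result<(), Errno>;

fn check_open(providers: &[Provider], path: &str) -> Result<(), Errno> {
    for p in providers {
        p(path)?;   // first Err short-circuits the remaining providers
    }
    Ok(())          // all providers permitted
}

fn allow_all(_: &str) -> Result<(), Errno> { Ok(()) }
fn deny_shadow(path: &str) -> Result<(), Errno> {
    if path == "/etc/shadow" { Err(Errno::EACCES) } else { Ok(()) }
}

fn main() {
    let chain: &[Provider] = &[allow_all, deny_shadow];
    assert_eq!(check_open(chain, "/etc/hosts"), Ok(()));
    assert_eq!(check_open(chain, "/etc/shadow"), Err(Errno::EACCES));
    println!("ok");
}
```

Because the provider slice is built once at boot and sealed, the chain needs no locking on the syscall hot path.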
19.1.5.4.1.3 Supported LSM Hook Invocation Points¶
UmkaOS invokes the LSM hooks required for AppArmor and seccomp compatibility at the following kernel operations:
| Hook | Kernel operation | Security check |
|---|---|---|
| check_open | Any file open (vfs_open) | Path/label access, file flags |
| check_file_permission | read(2), write(2), readv, writev, pread64, pwrite64, sendfile | Per-operation file access revalidation |
| check_exec | execve / execveat | Executable label, capabilities, no-new-privs |
| check_socket_create | socket(2) | Domain, type, protocol policy |
| check_socket_connect | connect(2) | Destination address, peer label |
| check_socket_bind | bind(2) | Port and address policy |
| check_process_signal | kill / tgkill / rt_sigqueueinfo | Sender→receiver relationship |
| check_ptrace | ptrace(2) | Tracer→tracee relationship |
| check_ipc_send | msgsnd, mq_send | IPC endpoint access label |
| check_mmap | mmap(2) with PROT_EXEC | Execute permission on anonymous mapping |
| check_setuid / check_setgid | setuid / setgid and variants | Privilege escalation policy |
| check_cap | Any CAP_* usage site | Capability allowed in task's security context |
check_file_permission placement in read/write dispatch:
The check_file_permission hook (equivalent to Linux's security_file_permission())
is invoked at the start of every read/write syscall dispatch, before any data
transfer or page cache access. This is the per-operation revalidation hook — distinct
from check_open which runs only at open time. SELinux and AppArmor use this hook to
enforce label transitions and revoke access after policy reload without requiring the
file to be closed and reopened.
// In the VFS read dispatch path (umka-vfs/src/read_write.rs):
fn vfs_read(file: &OpenFile, buf: &mut UserSliceMut, pos: &mut i64) -> Result<usize, Errno> {
// 1. LSM file_permission hook — before any I/O.
// Checks: task credentials vs. file label, MAY_READ permission.
// If the LSM denies access, returns EACCES immediately.
// This revalidation catches: SELinux policy reloads that revoke
// read access, AppArmor profile updates, capability drops.
umka_core::security::check_file_permission(
&current_task().cred,
file,
FilePermission::MAY_READ,
)?;
// 2. Validate userspace buffer (EFAULT on bad pointer).
// 3. Dispatch to FileOps::read() (page cache, direct I/O, etc.).
file.f_ops.read(file, buf, pos)
}
fn vfs_write(file: &OpenFile, buf: &UserSlice, pos: &mut i64) -> Result<usize, Errno> {
// 1. LSM file_permission hook — before any I/O.
umka_core::security::check_file_permission(
&current_task().cred,
file,
FilePermission::MAY_WRITE,
)?;
// 2. Validate userspace buffer.
// 3. Dispatch to FileOps::write().
file.f_ops.write(file, buf, pos)
}
The hook ordering within the full read syscall path is:
1. seccomp-bpf filter (in dispatch_syscall, before any handler code)
2. fdget_pos() — resolve fd to OpenFile, acquire f_pos serialization
3. check_file_permission — LSM per-operation revalidation
4. UserSliceMut::new() — validate userspace buffer pointer
5. FileOps::read() — actual data transfer (page cache, readahead, etc.)
6. Update f_pos and return byte count
The hook list matches Linux 6.1 LTS LSM hooks. New hooks are added additively — existing
LSM policy modules remain compatible because they only observe hooks they were compiled
against (unknown hook calls return Ok(()) by default for unregistered providers).
19.1.5.4.1.4 seccomp-bpf Integration¶
seccomp filters (BPF programs attached via prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER))
are evaluated before LSM hooks in the syscall dispatch path. This matches Linux's
ordering. If seccomp kills or traps the syscall, LSM hooks are not reached.
// Syscall dispatch order in umka-sysapi/src/entry.rs:
fn dispatch_syscall(ctx: &mut SyscallContext) -> i64 {
// 1. seccomp filter (BPF, per-thread, before any kernel state is touched)
if let Err(action) = seccomp_check(ctx) {
return seccomp_apply_action(action, ctx);
}
// 2. LSM pre-checks (capabilities, label policy)
// (called per-operation inside each syscall handler)
// 3. Execute syscall via the bidirectional dispatch table.
// ctx.nr is i32: positive = Linux compat, negative = UmkaOS native.
// ORIGIN points to the boundary element; positive nr indexes forward,
// negative nr indexes backward via two's complement arithmetic.
let nr = ctx.nr as isize;
let biased = (nr as usize).wrapping_add(TABLE.max_umka as usize);
if biased >= TABLE.total as usize {
// Unknown syscall numbers return -ENOSYS (POSIX convention).
// Seccomp filters run BEFORE dispatch (step 1 above), so a filter
// can override this with any SECCOMP_RET_* action.
return -(Errno::ENOSYS as i64);
}
// Safety: bounds checked above. ORIGIN + signed offset is in-table.
// SyscallEntry is a bare function pointer (fn(&mut SyscallContext) -> i64),
// not a struct with a .handler() method. Unimplemented slots point to
// sys_ni_syscall which returns -ENOSYS.
// Compute the origin pointer from `origin_idx` (provenance-safe — no stored
// raw pointer). `nr` is sign-extended: negative = UmkaOS native (indexes
// backward from origin), positive = Linux compat (indexes forward).
let idx = (TABLE.origin_idx as isize + nr as isize) as usize;
let handler: SyscallEntry = TABLE.table[idx];
handler(ctx)
}
19.1.5.4.1.5 Bidirectional Dispatch Table¶
The dispatch table uses a bidirectional layout that unifies Linux-compatible and UmkaOS-native syscalls in a single contiguous array, with zero namespace branching overhead:
Memory layout:
┌──────────────────────┬─────────────────────────┐
│ UmkaOS native │ Linux compat │
│ handlers [M-1 .. 0] │ handlers [0 .. N-1] │
└──────────────────────┴─────────────────────────┘
↑
ORIGIN
- Linux syscalls (positive nr): ORIGIN[nr] — indexes forward.
- UmkaOS native ops (negative nr): ORIGIN[nr] — two's complement arithmetic indexes backward automatically. No branch, no sign test.
- Bounds check: a single unsigned compare covers both directions. The bias trick maps the range [-M, +N) to [0, M+N): if (nr + M) as u64 >= (M + N) as u64 { return -ENOSYS; }
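The bias trick can be verified in a standalone sketch. The constants mirror MAX_UMKA_NR/MAX_LINUX_NR, and wrapping_add mirrors the dispatch code:

```rust
// Demonstrates the bias trick: one unsigned compare covers both syscall
// namespaces. M = UmkaOS native (negative nr) extent, N = Linux (positive) extent.
const M: i64 = 4096; // mirrors MAX_UMKA_NR
const N: i64 = 1024; // mirrors MAX_LINUX_NR

fn in_bounds(nr: i64) -> bool {
    // Maps [-M, +N) onto [0, M+N); anything outside lands at or above M+N
    // when reinterpreted as unsigned, so one compare rejects both directions.
    (nr.wrapping_add(M) as u64) < ((M + N) as u64)
}

fn main() {
    assert!(in_bounds(0));          // Linux syscall 0 (read)
    assert!(in_bounds(N - 1));      // highest Linux slot
    assert!(!in_bounds(N));         // just past the Linux extent
    assert!(in_bounds(-1));         // first UmkaOS native op
    assert!(in_bounds(-M));         // deepest native slot
    assert!(!in_bounds(-M - 1));    // past the native extent
    assert!(!in_bounds(i64::MIN));  // wild values are also rejected
    println!("ok");
}
```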
/// Current Linux syscall count with headroom. Linux 6.7 max ≈460;
/// 1024 provides decades of growth at ~5-10 new syscalls per release.
pub const MAX_LINUX_NR: usize = 1024;
/// Maximum UmkaOS native op magnitude. Covers all families (0x0100-0x0BFF)
/// with headroom for future families. 4096 entries × 8 bytes = 32 KB.
pub const MAX_UMKA_NR: usize = 4096;
/// Total syscall table entries (UmkaOS + Linux).
pub const MAX_SYSCALLS: usize = MAX_UMKA_NR + MAX_LINUX_NR;
/// Syscall dispatch entry: a bare function pointer for hot-path dispatch.
/// Each entry is exactly 8 bytes. Unimplemented slots point to
/// `sys_ni_syscall` rather than using `Option<SyscallEntry>`, which would
/// add a null check to every dispatch even with niche optimization.
type SyscallEntry = fn(&mut SyscallContext) -> i64;
/// Bidirectional syscall dispatch table.
/// Owned by Layer 2 (umka-sysapi, replaceable via live evolution).
/// The `origin_idx` and bounds are read by `dispatch_syscall` above.
///
/// HOT PATH — this table is indexed on every syscall entry. Heap allocation
/// is forbidden; the backing store is a fixed-size array sized at compile
/// time from `MAX_SYSCALLS`. The entire table (5120 × 8 = 40 KB) fits in
/// a single static allocation with no indirection.
pub struct BidirectionalSyscallTable {
/// Full backing array: [umka handlers | linux handlers].
/// `table[0..MAX_UMKA_NR]` = UmkaOS native (in reverse order from ORIGIN).
/// `table[MAX_UMKA_NR..MAX_SYSCALLS]` = Linux compat.
/// Unimplemented slots point to `sys_ni_syscall` (returns -ENOSYS).
/// Fixed-size — no heap allocation on the per-syscall hot path.
table: [SyscallEntry; MAX_SYSCALLS],
/// Index of the boundary element (= MAX_UMKA_NR). Used instead of a raw
/// `*const SyscallEntry` for provenance safety: the origin pointer is
/// computed at lookup time as `&table[origin_idx]`. This avoids storing
/// a raw pointer with no lifetime guarantee and no Send/Sync impl.
pub origin_idx: usize,
/// Number of UmkaOS native entries (backward extent from ORIGIN).
pub max_umka: u32,
/// Total table entries (max_umka + max_linux). Used for bounds check.
pub total: u32,
}
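A self-contained miniature of the table plus dispatch, with tiny extents so the layout is visible. The handler names and the reduced SyscallContext are illustrative:

```rust
// Miniature bidirectional dispatch table: M = 2 native slots, N = 3 Linux slots.
struct SyscallContext { nr: i32 }
type SyscallEntry = fn(&mut SyscallContext) -> i64;

fn sys_ni_syscall(_: &mut SyscallContext) -> i64 { -38 } // -ENOSYS
fn sys_read(_: &mut SyscallContext) -> i64 { 100 }       // Linux nr 0 (stand-in)
fn sys_umka_wait(_: &mut SyscallContext) -> i64 { 200 }  // native nr -1 (stand-in)

const M: usize = 2; // native extent
const N: usize = 3; // Linux extent

fn dispatch(table: &[SyscallEntry], origin_idx: usize, ctx: &mut SyscallContext) -> i64 {
    let nr = ctx.nr as isize;
    // Single unsigned compare covers [-M, +N).
    if (nr.wrapping_add(M as isize) as usize) >= M + N {
        return -38; // -ENOSYS
    }
    // Origin pointer computed from the index (provenance-safe).
    let idx = (origin_idx as isize + nr) as usize;
    table[idx](ctx)
}

fn main() {
    // Layout: [native -2, native -1 | linux 0, linux 1, linux 2]
    let table: [SyscallEntry; M + N] =
        [sys_ni_syscall, sys_umka_wait, sys_read, sys_ni_syscall, sys_ni_syscall];
    let origin_idx = M; // boundary element = first Linux slot
    assert_eq!(dispatch(&table, origin_idx, &mut SyscallContext { nr: 0 }), 100);
    assert_eq!(dispatch(&table, origin_idx, &mut SyscallContext { nr: -1 }), 200);
    assert_eq!(dispatch(&table, origin_idx, &mut SyscallContext { nr: 3 }), -38);
    println!("ok");
}
```

Note how negative nr values index backward from origin_idx with no sign test anywhere in the lookup.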
Per-architecture entry asm (Layer 1, non-replaceable):
Each architecture's syscall entry saves registers, extracts the syscall number as a
signed value, then calls dispatch_syscall() in Layer 2. The sign-extension
instruction is architecture-specific:
| Arch | Sign-extend | Indexed load | Notes |
|---|---|---|---|
| x86-64 | cdqe (eax→rax) | call [origin + rax*8] | cdqe replaces Linux's movzx; ~0 cycle (rename stage on modern cores) |
| AArch64 | sxtw x8, w8 | ldr x9, [origin, x8, lsl #3] | Same cost as Linux's uxtw |
| ARMv7 | Implicit (32-bit native) | ldr pc, [origin, r7, lsl #2] | No extension needed (native 32-bit) |
| RISC-V 64 | sext.w a7, a7 | slli t0,a7,3; add t0,origin,t0; ld t0,0(t0) | Sign-extend replaces zero-extend |
| PPC64LE | extsw r0, r0 | sldi r0,r0,3; ldx r12,origin,r0 | Same cost as Linux's clrldi |
| PPC32 | Implicit (32-bit native) | slwi r0,r0,2; lwzx r12,origin,r0 | No extension needed |
| s390x | lgfr %r1, %r1 | sllg %r1,%r1,3; lg %r1,0(%r1,origin) | Sign-extend from 32-bit SVC operand. PSW swap saves old PSW; entry code in SVC new PSW handler. |
| LoongArch64 | sext.w $a7, $a7 | slli.d $t0,$a7,3; ldx.d $t0,origin,$t0 | Sign-extend replaces zero-extend; syscall 0 instruction triggers. |
Speculative execution hardening at syscall entry:
The syscall table index is derived from an untrusted user register. To prevent Spectre
v1 (bounds check bypass) from speculatively indexing past the table bounds, the entry
stub applies array_index_nospec() — a branchless clamp that forces the index to zero
when it exceeds the table size, even during speculative execution:
// x86-64: clamp after bounds check (cmov to zero if CF=0)
cmp rax, MAX_ENTRIES
sbb rcx, rcx // rcx = 0xFFFF...F if rax < MAX, 0 otherwise
and rax, rcx // rax = original if in-bounds, 0 if out-of-bounds (speculatively)
Equivalent patterns: AArch64 uses CSEL + CSDB, ARMv7 uses MOVCC + CSDB,
RISC-V uses conditional mask + FENCE, PPC uses isel + ori speculation barrier.
RISC-V uaccess pointer masking (Spectre v1): In addition to array_index_nospec()
for the syscall dispatch table index, RISC-V requires pointer masking in all uaccess
operations (copy_from_user, copy_to_user, get_user, put_user). The user-supplied
address is masked to ensure it falls within the user virtual address range, preventing
speculative access to kernel memory:
// RISC-V uaccess pointer masking (before any user memory access):
// user_addr &= (user_addr < TASK_SIZE) ? 0xFFFF_FFFF_FFFF_FFFF : 0
// Implemented as branchless conditional mask:
sltu t0, a0, TASK_SIZE_REG // t0 = 1 if user_addr < TASK_SIZE
neg t0, t0 // t0 = 0xFFFF...F if in-range, 0 if out-of-range
and a0, a0, t0 // clamp to 0 if out-of-range (speculatively)
fence // speculation barrier
RISC-V scounteren CSR restriction: UmkaOS disables user-mode access to performance
counters by clearing scounteren bits for rdcycle (bit 0), rdtime (bit 1), and
rdinstret (bit 2). User-mode rdcycle/rdinstret are timing side-channel primitives
— they provide cycle-accurate measurement that enables Spectre-style attacks.
rdtime access is maintained through the vDSO (clock_gettime) which adds controlled
jitter. The scounteren CSR is set once during boot per hart and is not modifiable by
userspace. If a future RISC-V extension provides safe performance counter access
(deprivileged, with configurable resolution), UmkaOS can re-enable the relevant bits.
Per-architecture syscall entry mitigation costs:
| Mitigation | x86-64 (modern) | x86-64 (pre-ADL) | AMD Zen 4 | AArch64 | PPC64 | ARMv7 | RISC-V 64 | s390x | LoongArch64 |
|---|---|---|---|---|---|---|---|---|---|
| KPTI (page table switch) | ~100-200 ns | ~100-200 ns | N/A | ~100 ns (A75 only) | N/A | ~200 ns (A15) | N/A (no known vuln) | N/A | N/A |
| VERW (MDS/RFDS buffer clear) | ~5-15 cycles | ~5-15 cycles | N/A | N/A | N/A | N/A | N/A | N/A | N/A |
| BHB clear (branch history) | N/A (BHI_DIS_S hw) | ~150-200 cycles | N/A | ~50 cycles (CLEARBHB) | N/A | ~50 cycles | N/A | N/A | N/A |
| RSB fill | N/A (eIBRS) | ~20-40 cycles | N/A (eIBRS) | N/A | N/A | N/A | N/A | N/A | N/A |
| Retpoline overhead | N/A (eIBRS) | ~2-5 cycles/indirect | N/A (AutoIBRS) | N/A (BTI hw) | Expolines | N/A | N/A | Expolines | N/A |
| RFI flush (L1D) | N/A | N/A | N/A | N/A | ~500-1000 cycles | N/A | N/A | N/A | N/A |
| Cumulative | ~15-25 cycles | ~175-240 cycles | ~30-60 cycles | ~50-80 cycles | ~500-1000 cycles | ~50-80 cycles | ~5-10 cycles | ~20-40 cycles | ~5-10 cycles |
"modern" x86-64 = Alder Lake+ (eIBRS, BHI_DIS_S, hardware MDS fix). "pre-ADL" =
Skylake through Tiger Lake (software mitigations for everything). PPC64 POWER7-9
has the highest per-syscall cost due to the L1D flush (Meltdown mitigation via RFI);
POWER10 eliminates most software mitigations via hardware fixes.
ARMv7 (Cortex-A15/A17) requires Spectre-BHB mitigation; newer cores (A7, A53) are
not affected. RISC-V 64 has no known Meltdown/Spectre vulnerabilities on current
hardware — mitigation cost is limited to scounteren CSR restriction (no per-syscall
cost). s390x uses expolines (execute-relative-long trampoline) for Spectre-v2 on z14+.
LoongArch64 has no known microarchitectural side-channel vulnerabilities as of 3A6000.
Cost vs single-table (Linux-only) design: +1 lea instruction for the bias in the
bounds check. On out-of-order cores (all production x86-64, AArch64, PPC64), this
executes in the shadow of the cmp — 0 additional cycles. On in-order cores
(low-end RISC-V, ARMv7): 1 additional cycle.
No 32-bit compat layers: UmkaOS builds separate kernels per architecture (Section 19.7). There are no 32-bit compat dispatch tables (no i386-on-x86-64, no AArch32-on-AArch64, no PPC32-on-PPC64LE). To run 32-bit binaries, use the corresponding 32-bit UmkaOS kernel (e.g., the ARMv7 kernel for ARMv7 binaries). This eliminates the doubled syscall surface and signal handling complexity that compat layers introduce.
Live evolution: the bidirectional table is owned by Layer 2 (umka-sysapi). During
SysAPI layer replacement, the new version builds a new table (potentially with different
max_umka/max_linux if syscalls were added), then Layer 1 atomically updates its
entry-function pointer to the new Layer 2's dispatch_syscall. The TLB flush IPI
(already part of the live evolution protocol) ensures all CPUs see the new code.
19.1.5.5 Namespaces¶
All 8 Linux namespace types:
| Namespace | Purpose | Required for |
|---|---|---|
| mnt | Mount point isolation | Containers, chroot |
| pid | Process ID isolation | Containers |
| net | Network stack isolation | Containers, VPN |
| ipc | IPC resource isolation | Containers |
| uts | Hostname/domainname isolation | Containers |
| user | UID/GID mapping | Rootless containers |
| cgroup | Cgroup hierarchy isolation | Containers |
| time | Clock offset isolation | Containers |
Namespace propagation: SyscallContext carries a reference to the calling
task's NamespaceSet (Section 17.1).
Each handler accesses namespace-specific views through the task's nsproxy
(all 8 namespace types) plus the credential's user namespace:
- ctx.task.nsproxy.pid_ns — PID translation
- ctx.task.nsproxy.net_ns — network stack isolation
- ctx.task.nsproxy.mount_ns — mount point visibility
- ctx.task.nsproxy.ipc_ns — IPC resource isolation (SysV IPC, POSIX mqueues)
- ctx.task.nsproxy.uts_ns — hostname/domainname isolation
- ctx.task.nsproxy.cgroup_ns — cgroup hierarchy root isolation
- ctx.task.nsproxy.time_ns — clock offset isolation (CLOCK_MONOTONIC/BOOTTIME)
- ctx.task.cred.user_ns — UID/GID mapping and capability scope (on the credential, not nsproxy)
The nsproxy reference is obtained from ctx.task (the calling task) and
remains valid for the syscall's duration. The user_ns is on the task's
credential rather than nsproxy because it governs capability interpretation
and UID mapping, which are credential properties.
19.1.5.6 Cgroups¶
- cgroup v2 as primary implementation (unified hierarchy)
- cgroup v1 compatibility mode (required for older Docker, systemd < 248)
- Controllers: cpu, cpuset, memory, io, pids, rdma, hugetlb, misc
- Required for: systemd resource management, Docker, Kubernetes, OOM handling
19.1.5.7 Cryptographic Random Syscalls¶
getrandom(2) (x86-64: 318, AArch64: 278, ARMv7: 384, RISC-V 64: 278,
PPC32: 359, PPC64LE: 359, s390x: 349, LoongArch64: 278; Linux 3.17+) returns
cryptographically secure random bytes from the kernel CSPRNG.
Required for: OpenSSL, glibc's arc4random, systemd's sd-id128, any security
library initializing keying material.
UmkaOS implementation: Direct syscall (no tier crossing). Reads from a per-CPU
entropy buffer populated at interrupt time via RDRAND (x86-64), RNDR (AArch64), or
HTIF entropy source (RISC-V), mixed with timer jitter. HKDF-SHA256 expansion at
each call provides forward secrecy — a snapshot of the per-CPU state does not
reveal past outputs.
| Flag | Value | Semantics |
|---|---|---|
| GRND_NONBLOCK | 0x0001 | Return EAGAIN instead of blocking if not yet seeded (early boot). |
| GRND_RANDOM | 0x0002 | No distinction from default in UmkaOS; always uses the seeded CSPRNG. |
| GRND_INSECURE | 0x0004 | Always succeeds; draws from xorshift128+ before CSPRNG is seeded. For early-boot users only (e.g., initramfs randomization). Linux 5.6+. |
Return value: number of bytes written (always len unless GRND_NONBLOCK and unseeded).
Error: EFAULT if buffer address invalid; EINVAL if unknown flag; EAGAIN if
GRND_NONBLOCK and CSPRNG not yet seeded.
Seeding: CSPRNG is marked seeded when the entropy pool has accumulated ≥256 bits
of hardware entropy (RDRAND/RNDR output or timer jitter). On systems without hardware
RNG, seeding completes after the first 256 IRQs have been processed (jitter entropy).
After seeding, getrandom(2) never blocks.
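The flag-handling rules above can be sketched as a pure decision function. The flag values match the table; the Errno and Action names are illustrative:

```rust
// Decision sketch for getrandom(2) flag handling (values match Linux uapi).
const GRND_NONBLOCK: u32 = 0x0001;
const GRND_RANDOM:   u32 = 0x0002;
const GRND_INSECURE: u32 = 0x0004;

#[derive(Debug, PartialEq)]
enum Errno { EINVAL, EAGAIN }

#[derive(Debug, PartialEq)]
enum Action { UseCsprng, UseInsecure, Block }

fn getrandom_path(flags: u32, seeded: bool) -> Result<Action, Errno> {
    if flags & !(GRND_NONBLOCK | GRND_RANDOM | GRND_INSECURE) != 0 {
        return Err(Errno::EINVAL);        // unknown flag bit
    }
    if seeded {
        return Ok(Action::UseCsprng);     // GRND_RANDOM == default in UmkaOS
    }
    if flags & GRND_INSECURE != 0 {
        return Ok(Action::UseInsecure);   // pre-seed: xorshift128+ path
    }
    if flags & GRND_NONBLOCK != 0 {
        return Err(Errno::EAGAIN);        // early boot, caller opted out of blocking
    }
    Ok(Action::Block)                     // wait for seeding, then draw from CSPRNG
}

fn main() {
    assert_eq!(getrandom_path(0, true), Ok(Action::UseCsprng));
    assert_eq!(getrandom_path(GRND_RANDOM, true), Ok(Action::UseCsprng));
    assert_eq!(getrandom_path(GRND_NONBLOCK, false), Err(Errno::EAGAIN));
    assert_eq!(getrandom_path(GRND_INSECURE, false), Ok(Action::UseInsecure));
    assert_eq!(getrandom_path(0x100, true), Err(Errno::EINVAL));
    println!("ok");
}
```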
For io_uring subsystem specification, see Section 19.3.
19.1.6 Modern File Descriptor Operations¶
Linux 5.6-5.9 introduced two syscalls that are now required by systemd, container runtimes, and security-hardened applications. UmkaOS implements both natively.
19.1.6.1 close_range(2)¶
/// close_range(2) — close a range of file descriptors efficiently.
/// Added in Linux 5.9. Required by systemd (used in service startup to close
/// inherited fds), container runtimes (close all fds except stdio before exec),
/// and security-hardened applications.
///
/// Syscall number: 436 (x86-64), 436 (AArch64), 436 (ARMv7), 436 (RISC-V),
/// 436 (PPC32), 436 (PPC64LE), 436 (s390x), 436 (LoongArch64).
///
/// Replaces the old pattern of:
/// for fd in 3..getrlimit(RLIMIT_NOFILE) { close(fd); }
/// which is O(n) in the fd limit (potentially millions of iterations).
/// close_range is O(n) in the number of *open* fds in the range.
pub fn sys_close_range(first: u32, last: u32, flags: u32) -> Result<(), Errno> { ... }
Parameters:
| Parameter | Type | Description |
|---|---|---|
| first | u32 | First fd to close (inclusive) |
| last | u32 | Last fd to close (inclusive). u32::MAX means "close all fds >= first" |
| flags | u32 | Bitflags controlling close behavior (see below) |
Flags:
| Flag | Value | Effect |
|---|---|---|
| CLOSE_RANGE_UNSHARE | 1 << 1 | Unshare the fd table before closing. Creates a private copy of the fd table (equivalent to unshare(CLONE_FILES)) and then closes the range. This is atomic — no window where other threads observe partial state |
| CLOSE_RANGE_CLOEXEC | 1 << 2 | Instead of closing, set O_CLOEXEC on all fds in range. Useful for "close everything except stdio on exec" without actually closing fds now |
Error cases:
| Error | Condition |
|---|---|
| EINVAL | first > last |
| ENOMEM | CLOSE_RANGE_UNSHARE allocation failure (fd table clone) |
| EMFILE | fd table manipulation failure |
Implementation notes:
- Walk the fd table bitmap, closing each open fd in [first, last]. The bitmap allows skipping gaps in O(1) per word, so complexity is O(open_fds_in_range), not O(last - first).
- For CLOSE_RANGE_UNSHARE: clone the FdTable (copy-on-write — only the bitmap and pointer array are duplicated, not the underlying File objects), then close the range in the private copy. The clone-then-close sequence is performed under the task's files lock, making it atomic with respect to other threads.
- For CLOSE_RANGE_CLOEXEC: set the close-on-exec bit in the fd table bitmap without closing any fd. This is a single bitmap OR operation per word.
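The bitmap walk can be sketched in userspace. The flat u64 bitmap here is a simplification of the real FdTable:

```rust
// Sketch of the close_range bitmap walk: scan 64-fd words, skipping
// all-zero words in O(1) each, so cost tracks open fds, not the range width.
fn fds_to_close(open_bitmap: &[u64], first: u32, last: u32) -> Vec<u32> {
    let mut out = Vec::new();
    for (word_idx, &w) in open_bitmap.iter().enumerate() {
        if w == 0 { continue; }            // whole word has no open fds — skip
        let mut bits = w;
        while bits != 0 {
            let bit = bits.trailing_zeros();
            let fd = word_idx as u32 * 64 + bit;
            if fd >= first && fd <= last {
                out.push(fd);              // fd is open and inside the range
            }
            bits &= bits - 1;              // clear lowest set bit
        }
    }
    out
}

fn main() {
    // fds 0, 1, 2 (stdio), 5, and 70 are open.
    let mut bitmap = [0u64; 2];
    for fd in [0u32, 1, 2, 5, 70] {
        bitmap[(fd / 64) as usize] |= 1u64 << (fd % 64);
    }
    // close_range(3, u32::MAX, 0): everything except stdio.
    assert_eq!(fds_to_close(&bitmap, 3, u32::MAX), vec![5, 70]);
    println!("ok");
}
```

Even with last = u32::MAX the walk touches only two words here, which is the whole point of replacing the close-everything loop over RLIMIT_NOFILE.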
Dispatch classification: SyscallHandler::Direct — operates entirely on the
calling task's FdTable, no driver interaction required.
19.1.6.2 openat2(2)¶
/// openat2(2) — open file with extended options.
/// Added in Linux 5.6. Provides RESOLVE_* flags for path resolution control,
/// essential for container security (preventing symlink escape attacks).
///
/// Syscall number: 437 (x86-64), 437 (AArch64), 437 (ARMv7), 437 (RISC-V),
/// 437 (PPC32), 437 (PPC64LE), 437 (s390x), 437 (LoongArch64).
#[repr(C)]
pub struct OpenHow {
/// Open flags (O_RDONLY, O_WRONLY, O_RDWR, O_CREAT, O_EXCL, etc.)
pub flags: u64,
/// File creation mode (only used with O_CREAT/O_TMPFILE).
pub mode: u64,
/// Path resolution restriction flags (RESOLVE_* bitfield).
pub resolve: u64,
}
// Layout: 3 × u64 = 24 bytes.
const_assert!(size_of::<OpenHow>() == 24);
pub fn sys_openat2(
dirfd: Fd,
pathname: UserPtr<u8>,
how: UserPtr<OpenHow>,
size: usize,
) -> Result<Fd, Errno> { ... }
RESOLVE_* flags:

| Flag | Value | Effect |
|---|---|---|
| `RESOLVE_NO_XDEV` | `0x01` | Fail if path crosses a mount point |
| `RESOLVE_NO_MAGICLINKS` | `0x02` | Fail on `/proc/[pid]/fd/*` style magic links |
| `RESOLVE_NO_SYMLINKS` | `0x04` | Fail if any path component is a symlink |
| `RESOLVE_BENEATH` | `0x08` | Fail if resolution would escape above dirfd (no `..` traversal past dirfd) |
| `RESOLVE_IN_ROOT` | `0x10` | Treat dirfd as the filesystem root (absolute paths in symlinks resolve relative to dirfd, not the real root) |
| `RESOLVE_CACHED` | `0x20` | Only succeed if the result is already in the dcache (no disk I/O). Returns `EAGAIN` on cache miss. Added in Linux 5.12 |
Error cases:

| Error | Condition |
|---|---|
| `EINVAL` | Unknown flags in `how.flags`, unknown bits in `how.resolve`, `how.mode` set without `O_CREAT`/`O_TMPFILE` |
| `E2BIG` | `size > sizeof(OpenHow)` and extra bytes are non-zero (unknown extension fields) |
| `EFAULT` | `how` or `pathname` points to unmapped memory |
| `EXDEV` | `RESOLVE_NO_XDEV` and path crosses a mount point |
| `ELOOP` | `RESOLVE_NO_SYMLINKS` and a component is a symlink, or `RESOLVE_NO_MAGICLINKS` and a magic link is encountered |
| `EAGAIN` | `RESOLVE_CACHED` and the dentry is not in the dcache |
Extensibility via size parameter:

The `size` parameter enables forward and backward compatibility for `OpenHow`,
following the same pattern as `perf_event_open` uses for `perf_event_attr.size`:

- If `size > sizeof(OpenHow)`: the kernel checks that all bytes beyond the known struct size are zero. If they are, the call proceeds (forward compatibility — new userspace, old kernel). If any are non-zero, returns `E2BIG` (the application is using an extension the kernel does not understand).
- If `size < sizeof(OpenHow)`: the kernel zero-fills the missing trailing fields (backward compatibility — old userspace, new kernel). The minimum accepted `size` is `OPEN_HOW_SIZE_VER0` (24 bytes, covering `flags` + `mode` + `resolve`).
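The size-versioning rules can be sketched on raw bytes (a minimal model; `copy_struct_from_user` here is an illustrative helper operating on an in-memory slice, not the actual umka-sysapi routine):

```rust
/// Known struct size in this kernel build; here also OPEN_HOW_SIZE_VER0.
const OPEN_HOW_SIZE: usize = 24;

/// Size-versioned struct copy. `user_bytes` models the userspace buffer,
/// whose length is the `size` argument. Returns the 24 bytes the kernel
/// will interpret, or an errno.
fn copy_struct_from_user(user_bytes: &[u8]) -> Result<[u8; OPEN_HOW_SIZE], i32> {
    const EPERM_E2BIG: i32 = 7; // E2BIG
    const EINVAL: i32 = 22;
    if user_bytes.len() < OPEN_HOW_SIZE {
        // Below the minimum accepted size (OPEN_HOW_SIZE_VER0): malformed.
        // (Zero-fill of missing trailing fields would apply only to sizes
        // between VER0 and a future, larger sizeof(OpenHow).)
        return Err(EINVAL);
    }
    // Forward compatibility: every byte beyond the known struct must be zero,
    // otherwise the caller is using an extension this kernel cannot honor.
    if user_bytes[OPEN_HOW_SIZE..].iter().any(|&b| b != 0) {
        return Err(EPERM_E2BIG);
    }
    let mut out = [0u8; OPEN_HOW_SIZE];
    out.copy_from_slice(&user_bytes[..OPEN_HOW_SIZE]);
    Ok(out)
}

fn main() {
    assert!(copy_struct_from_user(&[0u8; 24]).is_ok()); // exact size
    assert!(copy_struct_from_user(&[0u8; 40]).is_ok()); // zeroed extension: ok
    assert_eq!(copy_struct_from_user(&[1u8; 32]), Err(7)); // non-zero tail: E2BIG
    assert_eq!(copy_struct_from_user(&[0u8; 8]), Err(22)); // below VER0: EINVAL
    println!("ok");
}
```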
Security properties:

- `RESOLVE_BENEATH` is the key container security feature. It prevents `..` traversal above the starting directory, which closes the classic container escape via symlinks. Essential for:
  - Container runtimes opening files inside the container rootfs without symlink escape
  - Web servers serving static files without directory traversal attacks
  - Unpacking archives safely (tar entries containing `../../../etc/passwd`)
- `RESOLVE_IN_ROOT` makes `dirfd` act as a virtual chroot — absolute symlink targets are resolved relative to `dirfd`, not the real filesystem root. Combined with `RESOLVE_NO_MAGICLINKS`, this provides robust filesystem sandboxing without requiring `chroot` or `pivot_root`.
- `RESOLVE_CACHED` supports io_uring: allows non-blocking open that fails immediately with `EAGAIN` if the dentry is not cached. This avoids blocking the io_uring submission thread, which would stall the entire ring. Used by `IORING_OP_OPENAT2` (Section 19.3).
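The essence of the `RESOLVE_BENEATH` check is a depth counter over path components — a deliberate simplification of the real VFS walk, which also re-applies the rule to every expanded symlink target (`stays_beneath` is an illustrative name):

```rust
/// Returns true if `path`, resolved relative to the starting dirfd, stays
/// at or beneath it — i.e. no ".." ever takes the walk above depth 0.
/// Absolute paths are rejected outright under RESOLVE_BENEATH: they
/// escape the starting directory by definition.
fn stays_beneath(path: &str) -> bool {
    if path.starts_with('/') {
        return false;
    }
    let mut depth: i32 = 0;
    for comp in path.split('/') {
        match comp {
            "" | "." => {} // empty (double slash) and "." do not move the walk
            ".." => {
                depth -= 1;
                if depth < 0 {
                    return false; // would escape above dirfd
                }
            }
            _ => depth += 1,
        }
    }
    true
}

fn main() {
    assert!(stays_beneath("a/b/c"));
    assert!(stays_beneath("a/../b")); // dips back to depth 0, never below
    assert!(!stays_beneath("../etc/passwd")); // classic tar-entry escape
    assert!(!stays_beneath("a/../../x"));
    assert!(!stays_beneath("/etc/passwd"));
    println!("ok");
}
```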
Dispatch classification: SyscallHandler::InnerRingForward — path resolution
requires VFS traversal (Section 14.1). Mount namespace boundaries
are enforced by the VFS layer itself (Section 17.1). Capability
checks follow the standard open path (Section 9.1).
Cross-references:

- Section 14.1 — VFS path resolution and dentry cache
- Section 17.1 — mount namespace interaction with `RESOLVE_NO_XDEV`
- Section 9.1 — capability checks on open
- Section 19.3 — `IORING_OP_OPENAT2` opcode
19.1.7 Signal Handling¶
Full POSIX and Linux signal semantics:

- 64 signals: signals 1-31 (standard) and signals 32-64 (real-time)
- `sigaction` with `SA_SIGINFO`, `SA_RESTART`, `SA_NOCLDSTOP`, `SA_ONSTACK`
- `sigaltstack` for alternate signal stacks
- Per-thread signal masks (`pthread_sigmask`)
- Signal delivery by modifying saved register state on the user stack (same mechanism as Linux -- required for correct `sigreturn`)
- Proper interaction with: io_uring (signal-driven completion), epoll (`EINTR` semantics), futex (interrupted waits), nanosleep (remaining time)
- `signalfd` for synchronous signal consumption
- Process groups and session signals (`SIGHUP`, `SIGCONT`, `SIGSTOP`)
19.1.8 Capability and Credential Syscalls¶
Container runtimes (Docker, containerd, Podman, crun), privilege-dropping daemons
(sshd, nginx), and security tools (capsh, setpriv) rely on capget(), capset(),
and prctl() for capability management. These syscalls are critical for Linux
compatibility — container startup fails without them.
19.1.8.1 capget(2) and capset(2)¶
/// capget(2) — get process capabilities.
/// Syscall number: 125 (x86-64).
///
/// Linux ABI: capget/capset use a versioned header to identify the
/// capability data format. UmkaOS supports _LINUX_CAPABILITY_VERSION_3
/// (v3 header, two __user_cap_data_struct elements for 64-bit capability sets).
///
/// Dispatch classification: SyscallHandler::Direct
#[repr(C)]
pub struct CapUserHeader {
/// Capability version. Must be _LINUX_CAPABILITY_VERSION_3 (0x20080522).
/// If the caller passes version 0, the kernel writes the preferred
/// version into this field and returns EINVAL (discovery protocol).
pub version: u32,
/// Target process ID. 0 = calling process. Non-zero = inspect another
/// process (requires CAP_SYS_PTRACE or same-user with appropriate
/// namespace relationship).
pub pid: i32,
}
// Layout: 4 + 4 = 8 bytes.
const_assert!(size_of::<CapUserHeader>() == 8);
#[repr(C)]
pub struct CapUserData {
/// Effective capability bits (low 32 bits in element 0, high 32 in element 1).
pub effective: u32,
/// Permitted capability bits.
pub permitted: u32,
/// Inheritable capability bits.
pub inheritable: u32,
}
// Layout: 3 × u32 = 12 bytes.
const_assert!(size_of::<CapUserData>() == 12);
pub fn sys_capget(
header: UserPtr<CapUserHeader>,
data: UserPtr<[CapUserData; 2]>,
) -> Result<(), Errno> { ... }
capget validation:

- Copy `CapUserHeader` from userspace.
- If `header.version == 0`: write `_LINUX_CAPABILITY_VERSION_3` into `header.version` and return `EINVAL`. This is the version discovery protocol used by libcap.
- If `header.version != _LINUX_CAPABILITY_VERSION_3` (0x20080522): return `EINVAL`.
- If `header.pid == 0`: target = current task.
- If `header.pid != 0`: look up the target task. Check that the caller has `CAP_SYS_PTRACE` in the target's user namespace, OR the caller's euid matches the target's ruid/euid/suid (Linux ptrace-style access check). Return `ESRCH` if the target PID does not exist.
- Read the target's credential under `rcu_read_lock()`.
- Copy `cap_effective`, `cap_permitted`, `cap_inheritable` into the two `CapUserData` elements (low 32 bits in `data[0]`, high 32 bits in `data[1]`).
- Copy the data array to userspace.
/// capset(2) — set process capabilities.
/// Syscall number: 126 (x86-64).
///
/// Only the calling process's own capabilities can be modified (pid must be
/// 0 or the caller's own PID). Linux removed the ability to set another
/// process's capabilities in kernel 2.6.24.
///
/// Dispatch classification: SyscallHandler::Direct
pub fn sys_capset(
header: UserPtr<CapUserHeader>,
data: UserPtr<[CapUserData; 2]>,
) -> Result<(), Errno> { ... }
capset validation against commit_creds invariants:

- Copy `CapUserHeader` from userspace. Validate version = `_LINUX_CAPABILITY_VERSION_3`.
- If `header.pid != 0 && header.pid != current_pid()`: return `EPERM`.
- Copy the two `CapUserData` elements from userspace. Reconstruct 64-bit sets: `effective = (data[1].effective as u64) << 32 | data[0].effective as u64`, etc.
- `new_cred = prepare_creds(current_task)`.
- Permitted set shrinking: `new_cred.cap_permitted = old.cap_permitted & new_permitted`. The caller can only drop bits from `cap_permitted`, never raise them. If `new_permitted & !old.cap_permitted != 0`: return `EPERM`.
- Effective set: `new_cred.cap_effective = new_effective & new_cred.cap_permitted`. If `new_effective & !new_cred.cap_permitted != 0`: return `EPERM`. (Effective must be a subset of permitted.)
- Inheritable set: to raise a bit in `cap_inheritable`, the caller needs either (a) the bit in `cap_permitted`, or (b) `CAP_SETPCAP` in `cap_effective`.
- Ambient invariant maintenance: after updating the three sets, enforce `cap_ambient <= cap_permitted & cap_inheritable`.
- `commit_creds(current_task, new_cred)` — this enforces all five invariants from Section 9.9.
- On success, the UmkaOS-native capability translation is updated: `umka-sysapi` synchronizes the `SystemCaps` changes to the task's UmkaOS `CapSpace` (Section 9.9). Bits dropped from `cap_permitted` permanently revoke the corresponding UmkaOS capabilities.
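The two central invariant checks — permitted may only shrink, effective must stay a subset of permitted — can be modeled in isolation (a sketch on bare 64-bit sets; `validate_capset` is an illustrative name, and inheritable/ambient handling is omitted):

```rust
/// Sketch of the capset invariant checks. `old_permitted` is the caller's
/// current cap_permitted; returns the new (permitted, effective) pair or
/// EPERM, mirroring the bullet list above.
fn validate_capset(
    old_permitted: u64,
    new_permitted: u64,
    new_effective: u64,
) -> Result<(u64, u64), i32> {
    const EPERM: i32 = 1;
    // Permitted may only shrink: any newly-raised bit is a violation.
    if new_permitted & !old_permitted != 0 {
        return Err(EPERM);
    }
    // Effective must be a subset of the (new) permitted set.
    if new_effective & !new_permitted != 0 {
        return Err(EPERM);
    }
    Ok((new_permitted, new_effective))
}

fn main() {
    let old = 0b1111u64;
    assert!(validate_capset(old, 0b0101, 0b0001).is_ok()); // dropping bits: ok
    assert_eq!(validate_capset(old, 0b1_0000, 0), Err(1)); // raising permitted: EPERM
    assert_eq!(validate_capset(old, 0b0101, 0b0010), Err(1)); // effective ⊄ permitted
    println!("ok");
}
```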
19.1.8.2 prctl(2) Capability Operations¶
/// prctl(2) — process control.
/// Syscall number: 157 (x86-64).
///
/// prctl is a multiplexer for per-process control operations. This section
/// specifies the capability-related operations. Other prctl operations
/// (PR_SET_NAME, PR_SET_PDEATHSIG, PR_SET_TIMERSLACK, etc.) are specified
/// in their respective subsystem sections.
///
/// Dispatch classification: SyscallHandler::Direct
pub fn sys_prctl(option: i32, arg2: u64, arg3: u64, arg4: u64, arg5: u64)
-> Result<i64, Errno> { ... }
Capability-related prctl operations:

| Operation | Value | Args | Effect | Capability Required |
|---|---|---|---|---|
| `PR_CAPBSET_READ` | 23 | arg2 = cap number | Returns 1 if cap is in bounding set, 0 if not | None |
| `PR_CAPBSET_DROP` | 24 | arg2 = cap number | Drop cap from bounding set (permanent, irreversible) | `CAP_SETPCAP` |
| `PR_CAP_AMBIENT` | 47 | arg2 = sub-op, arg3 = cap | Manipulate ambient set (see sub-operations below) | Varies |
| `PR_SET_SECUREBITS` | 28 | arg2 = new securebits | Set securebits flags | `CAP_SETPCAP` |
| `PR_GET_SECUREBITS` | 27 | — | Returns current securebits value | None |
| `PR_SET_NO_NEW_PRIVS` | 38 | arg2 = 1 | Set no_new_privs (one-way, irreversible) | None (self-restriction) |
| `PR_GET_NO_NEW_PRIVS` | 39 | — | Returns no_new_privs flag (0 or 1) | None |
| `PR_SET_KEEPCAPS` | 8 | arg2 = 0 or 1 | Set/clear `SECBIT_KEEP_CAPS` | None |
| `PR_GET_KEEPCAPS` | 7 | — | Returns KEEP_CAPS flag (0 or 1) | None |
PR_CAP_AMBIENT sub-operations (arg2):

| Sub-operation | Value | Effect |
|---|---|---|
| `PR_CAP_AMBIENT_IS_SET` | 1 | Returns 1 if arg3 cap is in ambient set, 0 if not |
| `PR_CAP_AMBIENT_RAISE` | 2 | Add arg3 cap to ambient set. Requires: cap in both `cap_permitted` and `cap_inheritable`, and `SECBIT_NO_CAP_AMBIENT_RAISE` not set |
| `PR_CAP_AMBIENT_LOWER` | 3 | Remove arg3 cap from ambient set |
| `PR_CAP_AMBIENT_CLEAR_ALL` | 4 | Clear entire ambient set |
Validation details: Each prctl operation is fully specified with its
prepare_creds / commit_creds sequence in
Section 9.9. The syscall
dispatch layer in umka-sysapi validates the prctl option value, checks that
unused arguments are zero (returns EINVAL otherwise, matching Linux's check for
PR_CAP_AMBIENT where arg4 and arg5 must be 0), and dispatches to the
corresponding credential operation function.
Error cases common to all capability prctl operations:

| Error | Condition |
|---|---|
| `EINVAL` | Unknown option value, cap number out of range (>= 64), unknown sub-operation for `PR_CAP_AMBIENT`, non-zero unused arguments |
| `EPERM` | Missing required capability (`CAP_SETPCAP`), or attempting to raise ambient cap not in permitted/inheritable |
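The `PR_CAP_AMBIENT_RAISE` precondition — cap present in both permitted and inheritable, securebit not blocking — can be sketched as a pure predicate (`may_raise_ambient` is an illustrative name over bare 64-bit sets):

```rust
/// Precondition check for PR_CAP_AMBIENT_RAISE: the cap must be in both
/// cap_permitted and cap_inheritable, and SECBIT_NO_CAP_AMBIENT_RAISE
/// must not be set. Caps are bit positions in 64-bit sets.
fn may_raise_ambient(
    cap: u32,
    permitted: u64,
    inheritable: u64,
    no_cap_ambient_raise: bool,
) -> bool {
    if cap >= 64 || no_cap_ambient_raise {
        return false; // out-of-range cap, or securebit forbids raising
    }
    let bit = 1u64 << cap;
    permitted & bit != 0 && inheritable & bit != 0
}

fn main() {
    let cap_net_bind = 10; // CAP_NET_BIND_SERVICE
    let set = 1u64 << cap_net_bind;
    assert!(may_raise_ambient(cap_net_bind, set, set, false));
    assert!(!may_raise_ambient(cap_net_bind, set, 0, false)); // not inheritable
    assert!(!may_raise_ambient(cap_net_bind, set, set, true)); // securebit blocks
    println!("ok");
}
```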
Cross-references:

- Section 9.9 — Full credential structure and `commit_creds` invariants
- Section 9.9 — Detailed `prepare_creds` / `commit_creds` sequences for each prctl operation
- Section 9.9 — How capabilities are transformed across `execve()`
- Section 9.2 — `SystemCaps` bitflags definition
- Section 17.1 — User namespace capability scope
- Section 10.3 — `PR_SET_SECCOMP` (seccomp prctl operations, specified separately)
19.1.9 Scheduling Syscalls¶
sched_setattr(2) / sched_getattr(2) (syscall numbers 314/315, x86-64) are the
primary interfaces for configuring per-task scheduling parameters. UmkaOS supports the
standard Linux struct sched_attr fields (size, sched_policy, sched_flags,
sched_nice, sched_priority, sched_runtime, sched_deadline, sched_period)
with identical semantics. See Section 7.1 for the full EEVDF/RT/DL dispatch.
UmkaOS extension — sched_latency_nice: UmkaOS defines a new sched_latency_nice: i32
field in struct sched_attr (at the end, after all Linux-standard fields) and a new flag
SCHED_FLAG_LATENCY_NICE = 0x80. This is a UmkaOS-original extension — it is NOT
present in Linux mainline (the concept was discussed on LKML but never merged). Applications
that use latency_nice are UmkaOS-only and will not work on upstream Linux kernels. See
Section 7.1 for the weight table and
effective_slice formula.
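A sketch of the extended layout, with the UmkaOS field appended after the Linux-standard fields (field order follows Linux's `struct sched_attr`; the `util_min`/`util_max` clamp fields are omitted for brevity, and the trailing field is the UmkaOS extension):

```rust
/// Sketch of the UmkaOS-extended sched_attr. The first eight fields are
/// the Linux-standard layout (48 bytes); sched_latency_nice is appended
/// at the end, discoverable via the size field in the same way openat2
/// versions OpenHow by size.
#[repr(C)]
struct SchedAttr {
    size: u32,
    sched_policy: u32,
    sched_flags: u64,
    sched_nice: i32,
    sched_priority: u32,
    sched_runtime: u64,
    sched_deadline: u64,
    sched_period: u64,
    // UmkaOS-original extension — not present in mainline Linux.
    sched_latency_nice: i32,
}

const SCHED_FLAG_LATENCY_NICE: u64 = 0x80;

fn main() {
    // 48 Linux-standard bytes + 4 extension bytes + 4 bytes tail padding
    // (the struct is u64-aligned under repr(C)).
    assert_eq!(std::mem::size_of::<SchedAttr>(), 56);
    assert_eq!(SCHED_FLAG_LATENCY_NICE, 0x80);
    println!("ok");
}
```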
Dispatch classification: SyscallHandler::Direct for all scheduling syscalls
(sched_setscheduler, sched_getscheduler, sched_setattr, sched_getattr,
sched_setparam, sched_getparam, sched_yield, sched_get_priority_max,
sched_get_priority_min, sched_rr_get_interval).
19.1.10 Key Management Syscalls¶
add_key(2) (syscall 248, x86-64), request_key(2) (syscall 249), and keyctl(2)
(syscall 250) provide the kernel key retention service interface. These are used by
fscrypt (Section 15.20), LUKS/dm-crypt, NFS Kerberos, and
ecryptfs for key lifecycle management.
Full sys_add_key and sys_request_key signatures, validation sequences, and error
handling are specified in Section 10.2. The keyctl(2) multiplexer
supports all standard Linux operations (KEYCTL_GET_KEYRING_ID, KEYCTL_DESCRIBE,
KEYCTL_READ, KEYCTL_LINK, KEYCTL_UNLINK, KEYCTL_SEARCH, KEYCTL_SETPERM,
KEYCTL_REVOKE, KEYCTL_INVALIDATE, etc.).
Dispatch classification: SyscallHandler::Direct for all three syscalls. Key operations
are serviced entirely within umka-core — no tier crossing.
19.1.11 cgroups: v2 Native with v1 Compatibility Shim¶
Linux problem: cgroups v1 had a messy, inconsistent design with separate hierarchies for each controller. v2 fixed this, but migration was painful.

UmkaOS design:

- cgroups v2 only as the native implementation. Single unified hierarchy.
- Thin v1 compatibility shim: for container runtimes and tools that still use v1 filesystem paths, provide a v1-compatible view that maps to the v2 backend. This is read/write for the common operations (cpu, memory, io, pids) and read-only/unsupported for obscure v1-only features.
- Pressure Stall Information (PSI): built into cgroup v2 from the start (not added years later, as in Linux).
19.1.11.1 Resource Controllers (Detailed)¶
| Controller | Function | Key Tunables |
|---|---|---|
| `cpu` | CPU bandwidth limiting and proportional sharing | `cpu.max`, `cpu.weight` |
| `cpuset` | CPU and memory node pinning | `cpuset.cpus`, `cpuset.mems` |
| `memory` | Memory usage limits and OOM control | `memory.max`, `memory.high`, `memory.low` |
| `io` | Block I/O bandwidth and IOPS limiting | `io.max`, `io.weight` |
| `pids` | Process/thread count limit | `pids.max` |
UmkaOS-specific controllers: `accel` (Section 22.5) and `power` (Section 22.5) follow the same v2 interface conventions.
19.1.11.2 Delegation Model¶
Non-root processes can manage sub-hierarchies with CAP_CGROUP_ADMIN, which can be
scoped to a specific subtree via the capability system (Section 9.1). A container runtime
holding CAP_CGROUP_ADMIN(subtree=/sys/fs/cgroup/containers/pod-xyz) can manage cgroups
under that path but cannot touch anything outside it.
19.1.11.3 Pressure Stall Information (PSI)¶
Each cgroup exposes pressure metrics (cpu.pressure, memory.pressure, io.pressure)
with 10s/60s/300s averages. PSI supports real-time event notification via poll/epoll
triggers. Orchestrators (kubelet, systemd-oomd) use PSI to detect resource saturation
before hard limits are hit.
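The pressure files use the standard PSI text format, e.g. `some avg10=0.12 avg60=0.08 avg300=0.02 total=123456` (one `some` line, and for memory/io a `full` line). A minimal consumer-side parser sketch (`parse_psi_line` is an illustrative name):

```rust
use std::collections::HashMap;

/// Parse one PSI line, e.g. read from memory.pressure:
///   "some avg10=0.12 avg60=0.08 avg300=0.02 total=123456"
/// Returns the kind ("some" or "full") and the avg/total fields.
fn parse_psi_line(line: &str) -> Option<(String, HashMap<String, f64>)> {
    let mut parts = line.split_whitespace();
    let kind = parts.next()?.to_string();
    let mut fields = HashMap::new();
    for kv in parts {
        let (k, v) = kv.split_once('=')?;
        fields.insert(k.to_string(), v.parse().ok()?);
    }
    Some((kind, fields))
}

fn main() {
    let line = "some avg10=0.12 avg60=0.08 avg300=0.02 total=123456";
    let (kind, f) = parse_psi_line(line).unwrap();
    assert_eq!(kind, "some");
    assert_eq!(f["avg10"], 0.12);
    assert_eq!(f["total"], 123456.0);
    println!("ok");
}
```

An orchestrator polling `memory.pressure` would typically alert when `avg10` crosses a threshold well before `memory.max` is reached.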
19.1.12 Event Notification (epoll, poll, select)¶
Linux applications use three generations of event notification. UmkaOS implements all three for compatibility but steers new applications toward io_uring (Section 19.3).
19.1.12.1 epoll (Primary)¶
Syscalls: epoll_create1, epoll_ctl (ADD/MOD/DEL), epoll_wait,
epoll_pwait, epoll_pwait2.
| Syscall | x86-64 Number | Signature |
|---|---|---|
| `epoll_create1` | 291 | `(flags: i32) -> fd` \| -EINVAL \| -EMFILE \| -ENOMEM |
| `epoll_ctl` | 233 | `(epfd: i32, op: i32, fd: i32, event: *mut epoll_event) -> 0` \| -EBADF \| -EEXIST \| -EINVAL \| -ENOENT \| -ENOMEM \| -ELOOP \| -EPERM |
| `epoll_wait` | 232 | `(epfd: i32, events: *mut epoll_event, maxevents: i32, timeout: i32) -> n` \| -EBADF \| -EINTR \| -EINVAL \| -EFAULT |
| `epoll_pwait` | 281 | `(epfd, events, maxevents, timeout, sigmask: *const sigset_t, sigsetsize: usize) -> n` \| ... |
| `epoll_pwait2` | 441 | `(epfd, events, maxevents, timeout: *const timespec, sigmask, sigsetsize) -> n` \| ... |
epoll_create (legacy, number 213) is supported for compatibility — the size argument
is ignored (must be > 0). epoll_create1 is the preferred entry point; flags accepts
EPOLL_CLOEXEC (= O_CLOEXEC = 02000000 octal).
19.1.12.1.1 Wire Format: epoll_event¶
FIX-029: Linux applies __attribute__((packed)) to epoll_event ONLY on x86-64
(arch/x86/include/uapi/asm/epoll.h). On all other architectures, the struct has
natural alignment (4-byte padding between events and data). UmkaOS must replicate
this quirk exactly for binary compatibility.
/// On x86-64: packed (12 bytes, no padding between events and data).
/// This is a Linux ABI quirk — x86-64 is the ONLY architecture that packs this struct.
#[cfg(target_arch = "x86_64")]
#[repr(C, packed)]
pub struct EpollEvent {
/// Event mask (EPOLLIN, EPOLLOUT, EPOLLET, etc.).
pub events: u32,
/// User-supplied opaque value (returned by epoll_wait).
pub data: u64,
}
/// On all other architectures: natural C alignment (16 bytes with 4-byte padding).
#[cfg(not(target_arch = "x86_64"))]
#[repr(C)]
pub struct EpollEvent {
/// Event mask (EPOLLIN, EPOLLOUT, EPOLLET, etc.).
pub events: u32,
/// User-supplied opaque value (returned by epoll_wait).
pub data: u64,
}
#[cfg(target_arch = "x86_64")]
const_assert!(size_of::<EpollEvent>() == 12);
#[cfg(not(target_arch = "x86_64"))]
const_assert!(size_of::<EpollEvent>() == 16);
Portability note: Any code that copies EpollEvent arrays to/from userspace must
use size_of::<EpollEvent>() for stride calculation, never a hard-coded 12 or 16.
The epoll_wait implementation computes maxevents * size_of::<EpollEvent>() for
the copy_to_user length.
19.1.12.1.2 Event Flags¶
| Flag | Value | Meaning |
|---|---|---|
| `EPOLLIN` | 0x001 | Data available for read |
| `EPOLLOUT` | 0x004 | Write will not block |
| `EPOLLRDHUP` | 0x2000 | Peer closed writing half of connection (stream socket) |
| `EPOLLPRI` | 0x002 | Urgent/OOB data or exceptional condition |
| `EPOLLERR` | 0x008 | Error condition (always reported, cannot be masked) |
| `EPOLLHUP` | 0x010 | Hang up (always reported, cannot be masked) |
| `EPOLLET` | 1 << 31 | Edge-triggered mode. Note: this is bit 31 of a u32 field (`events` in `epoll_event`), so the Rust constant must be `pub const EPOLLET: u32 = 1u32 << 31;` (0x80000000). Using `1 << 31` in a signed i32 context would be UB (sign bit). |
| `EPOLLONESHOT` | 1 << 30 | Disable monitoring after one event (re-arm with `EPOLL_CTL_MOD`) |
| `EPOLLEXCLUSIVE` | 1 << 28 | Wake at most one waiter for this fd (avoids thundering herd) |
| `EPOLLWAKEUP` | 1 << 29 | Keep system awake while event is processed (requires `CAP_BLOCK_SUSPEND`) |
| `EPOLLRDNORM` | 0x040 | Normal data readable (equivalent to `EPOLLIN` for most files) |
| `EPOLLWRNORM` | 0x100 | Normal data writable (equivalent to `EPOLLOUT` for most files) |
19.1.12.1.3 Internal Data Structures¶
/// Composite key for the interests RB-tree. Uniquely identifies a monitored
/// (fd, file) pair. Including the file pointer in the key lets a stale entry
/// be detected when the same fd number is reused after close+reopen.
#[derive(Ord, PartialOrd, Eq, PartialEq)]
pub struct EpollKey {
/// File descriptor number in the monitoring process.
pub fd: i32,
/// Pointer to the `OpenFile` struct (used as identity, not dereferenced
/// for ordering — `Ord` is derived from the raw pointer value).
pub file: *const OpenFile,
}
/// Per-monitored-fd state within an epoll instance.
///
/// Each `EpollItem` is simultaneously:
/// 1. A node in the interests RB-tree (keyed by `EpollKey`).
/// 2. Potentially a node in the ready list (intrusive linked list).
/// 3. The owner of a `WaitQueueEntry` installed on the target file's WaitQueue.
///
/// Slab-allocated from `EPOLL_ITEM_SLAB` to avoid per-item heap allocation.
pub struct EpollItem {
/// Back-pointer to the owning `EpollInstance`. Needed by `ep_poll_callback`
/// to access the ready list and wake the epoll waiters.
pub ep: *const EpollInstance,
/// The fd and file pointer this item monitors.
pub key: EpollKey,
/// Events mask requested by the user (EPOLLIN, EPOLLOUT, EPOLLET, etc.).
/// Updated by `EPOLL_CTL_MOD`. Read atomically by `ep_poll_callback`.
pub events: AtomicU32,
/// User-supplied opaque data returned in `epoll_event.data` by `epoll_wait`.
pub data: u64,
/// WaitQueueEntry installed on the target file's WaitQueue.
/// The wakeup function is `ep_poll_callback`. The `private` field points
/// back to this `EpollItem`.
pub wait: WaitQueueEntry,
/// Intrusive list linkage for the ready list. An item is on the ready list
/// when `on_ready_list` is true.
pub ready_link: IntrusiveListNode,
/// Fast deduplication flag. Set to true (via CAS) when this item is
/// appended to the ready list. Checked by `ep_poll_callback` to avoid
/// acquiring the ready-list spinlock when the item is already queued.
/// Reset to false by `epoll_wait` after the item is removed from the
/// ready list (for level-triggered) or after delivery (for edge-triggered).
pub on_ready_list: AtomicBool,
/// True if this item was added with `EPOLLONESHOT`. After one event is
/// delivered, the events mask is zeroed (disabled) until re-armed with
/// `EPOLL_CTL_MOD`.
pub oneshot: bool,
/// True if this item was added with `EPOLLEXCLUSIVE`.
pub exclusive: bool,
/// Nesting depth for nested epoll detection. 0 for regular files,
/// incremented when the target fd is itself an epoll fd.
pub nesting_depth: u8,
/// RB-tree linkage for the interests tree.
pub rb_node: RBTreeNode,
}
/// Per-epoll-instance state. Created by `epoll_create1`, destroyed when
/// the epoll fd is closed (refcount drops to zero).
///
/// **Locking discipline**:
/// - `interests_lock` (Mutex): protects the RB-tree. Held during `epoll_ctl`
/// ADD/MOD/DEL operations. This is a cold-path lock — `epoll_ctl` is not
/// on the event delivery hot path.
/// - `ready_lock` (SpinLock): protects the ready list. Held briefly by
/// `ep_poll_callback` (to append) and `epoll_wait` (to drain). This is the
/// hot-path lock — it must be fast.
/// - `waiters`: WaitQueue for threads blocked in `epoll_wait`. Woken by
/// `ep_poll_callback` after appending to the ready list.
///
/// **UmkaOS improvement**: Linux uses three locks per epoll instance
/// (`ep->lock`, `ep->mtx`, `ep->wq.lock`). UmkaOS uses two: one Mutex for
/// the interest set (cold) and one SpinLock for the ready list (hot). The
/// `AtomicBool` on each `EpollItem` further reduces contention by allowing
/// `ep_poll_callback` to skip the spinlock entirely when the item is already
/// on the ready list.
///
/// **Lock ordering with signal delivery**: `task.sighand.lock` < `ep.ready_list` (SpinLock).
/// `signalfd_notify()` acquires the `ready_list` SpinLock inside the signal delivery
/// path (which holds `sighand.lock`). `ep_poll_callback()` from non-signal waitqueues
/// acquires `ready_list` directly (no `sighand` involvement). This ordering is safe
/// because the `ready_list` lock is never held when acquiring `sighand.lock`.
pub struct EpollInstance {
/// RB-tree of monitored file descriptors, keyed by `EpollKey`.
/// The data is inside the Mutex so that the only way to access the tree
/// is through the `MutexGuard` — enforcing the locking discipline via
/// the type system. Held during `epoll_ctl` ADD/MOD/DEL. Not held
/// during `epoll_wait` or `ep_poll_callback`.
pub interests: Mutex<RBTree<EpollKey, EpollItem>>,
/// Ready list: items with pending events, linked via `EpollItem::ready_link`.
/// The data is inside the SpinLock so that the only way to access the
/// list is through the `SpinLockGuard`. IRQ-safe (acquired with IRQs
/// disabled) because `ep_poll_callback` may run from interrupt context.
pub ready_list: SpinLock<IntrusiveList<EpollItem>>,
/// Wait queue for threads blocked in `epoll_wait`.
pub waiters: WaitQueueHead,
/// Number of items in the interests tree. Used for O(1) size queries
/// and to enforce per-user epoll item limits.
pub item_count: AtomicU64,
/// Nesting depth of this epoll instance. 0 for top-level instances.
/// Incremented when this epoll fd is added to another epoll instance.
/// `epoll_ctl` rejects ADD if the resulting nesting depth would exceed
/// `EP_MAX_NESTS` (4).
pub nesting_depth: u8,
/// User who created this epoll instance. Used to enforce the per-user
/// limit on total epoll-watched fds (`/proc/sys/fs/epoll/max_user_watches`,
/// default: ~204K derived from available lowmem / sizeof(EpollItem)).
pub user: Arc<UserStruct>,
}
19.1.12.1.4 ep_poll_callback — The Wakeup Hot Path¶
When a monitored file's state changes (e.g., data arrives on a socket, a pipe
becomes writable), the file's WaitQueue fires and calls the WaitQueueEntry::wakeup
function. For epoll entries, this function is ep_poll_callback:
ep_poll_callback(entry: *mut WaitQueueEntry, poll_key: PollEvents) -> bool:
item = container_of(entry, EpollItem, wait)
ep = item.ep
// 1. Check if the file's current events match what we're monitoring.
// The poll_key argument carries the event that fired.
// If the fired event is not in our interest mask, skip.
revents = poll_key
interest = item.events.load(Relaxed)
if revents & interest == 0:
return false // spurious for this item
// 2. Fast dedup: if already on the ready list, skip the spinlock.
// CAS from false→true. If it was already true, another callback
// already queued this item — nothing to do.
if item.on_ready_list.compare_exchange(false, true, AcqRel, Relaxed).is_err():
return true // already queued, will be processed by epoll_wait
// 3. Append to ready list under spinlock (data inside the SpinLock).
{
let mut ready = ep.ready_list.lock_irqsave();
ready.push_back(&item.ready_link);
} // guard dropped, IRQs restored
// 4. Wake one thread blocked in epoll_wait.
// If EPOLLEXCLUSIVE: wake_up_one() — only one waiter proceeds.
// Otherwise: wake_up() — standard semantic (all non-exclusive + one exclusive).
if item.exclusive:
ep.waiters.wake_up_one()
else:
ep.waiters.wake_up()
return true
Cost: The common case (item already on the ready list from a previous callback
that has not yet been drained by epoll_wait) is a single failed AtomicBool CAS —
no spinlock, no list manipulation, no wakeup. This is the typical steady-state for
high-throughput servers where events arrive faster than epoll_wait drains them.
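The dedup CAS in step 2 maps directly onto Rust atomics; a standalone model of just that fast path (`try_enqueue` is an illustrative name — the real callback follows a successful CAS with the spinlocked list append):

```rust
use std::sync::atomic::{AtomicBool, Ordering};

/// Model of the ep_poll_callback dedup: the first caller to flip the flag
/// false→true takes the slow path (spinlock + ready-list append); every
/// concurrent or repeated caller sees the CAS fail and returns early
/// without touching the lock.
fn try_enqueue(on_ready_list: &AtomicBool) -> bool {
    on_ready_list
        .compare_exchange(false, true, Ordering::AcqRel, Ordering::Relaxed)
        .is_ok()
}

fn main() {
    let flag = AtomicBool::new(false);
    assert!(try_enqueue(&flag));  // first callback: would append to ready list
    assert!(!try_enqueue(&flag)); // steady state: one failed CAS, no spinlock
    flag.store(false, Ordering::Release); // epoll_wait drained the item
    assert!(try_enqueue(&flag));  // the next event enqueues again
    println!("ok");
}
```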
19.1.12.1.5 epoll_ctl Algorithm¶
epoll_ctl(epfd, op, fd, event) -> Result<(), Errno>:
ep = lookup_epoll_instance(epfd)?
target_file = fdget(fd)?
// Reject monitoring of epoll fds that would create a cycle or exceed nesting depth.
if target_file is EpollInstance:
check_nesting_depth(ep, target_file, EP_MAX_NESTS=4)? // returns ELOOP if too deep
check_no_cycle(ep, target_file)? // returns ELOOP if cycle
key = EpollKey { fd, file: target_file.as_ptr() }
// NOTE: `ep.interests` is `Mutex<RBTree<...>>`. All access below goes
// through the MutexGuard. `ep.ready_list` is `SpinLock<IntrusiveList<...>>`.
// Access to the ready list goes through the SpinLockGuard.
let mut interests = ep.interests.lock();
match op:
EPOLL_CTL_ADD:
if interests.contains(&key):
drop(interests);
return Err(EEXIST)
check_user_watch_limit(ep.user)? // ENOMEM if over limit
// Allocate the EpollItem from slab.
item = EPOLL_ITEM_SLAB.alloc()?
item.ep = ep
item.key = key
item.events.store(event.events, Relaxed)
item.data = event.data
item.on_ready_list.store(false, Relaxed)
item.oneshot = event.events & EPOLLONESHOT != 0
item.exclusive = event.events & EPOLLEXCLUSIVE != 0
item.wait.wakeup = ep_poll_callback
item.wait.private = item as *mut _ as usize
// Install the wait entry on the target file's WaitQueue.
// This calls target_file.f_ops.poll() with a PollTable whose
// queue_proc installs item.wait on the file's WaitQueue(s).
let mut pt = PollTable {
queue_proc: ep_ptable_queue_proc,
private: item as *mut _ as *mut (),
events: event.events & EP_EVENT_MASK,
};
revents = target_file.f_ops.poll(
target_file.inode, target_file.private_data,
event.events, Some(&mut pt),
)?
// Insert into the RB-tree.
interests.insert(key, item)
ep.item_count.fetch_add(1, Relaxed)
// If the file is already ready, put the item on the ready list now.
if revents & event.events != 0:
if item.on_ready_list.compare_exchange(false, true, AcqRel, Relaxed).is_ok():
{ let mut ready = ep.ready_list.lock_irqsave(); ready.push_back(&item.ready_link); }
ep.waiters.wake_up()
drop(interests);
EPOLL_CTL_MOD:
item = interests.get_mut(&key).ok_or(ENOENT)?
// EPOLLEXCLUSIVE cannot be used with MOD — Linux returns EINVAL.
if event.events & EPOLLEXCLUSIVE != 0:
drop(interests);
return Err(EINVAL)
item.events.store(event.events, Release)
item.data = event.data
item.oneshot = event.events & EPOLLONESHOT != 0
// Re-poll to check if the modified events are already ready.
revents = target_file.f_ops.poll(
target_file.inode, target_file.private_data,
event.events, None, // None = don't re-register wait entry
)?
if revents & event.events != 0:
if item.on_ready_list.compare_exchange(false, true, AcqRel, Relaxed).is_ok():
{ let mut ready = ep.ready_list.lock_irqsave(); ready.push_back(&item.ready_link); }
ep.waiters.wake_up()
drop(interests);
EPOLL_CTL_DEL:
item = interests.remove(&key).ok_or(ENOENT)?
// Remove from ready list if present.
if item.on_ready_list.load(Acquire):
{ let mut ready = ep.ready_list.lock_irqsave(); ready.remove(&item.ready_link); }
// Remove the wait entry from the target file's WaitQueue.
// The file's WaitQueue lock is acquired internally.
target_file_wq.remove(&item.wait)
ep.item_count.fetch_sub(1, Relaxed)
drop(interests);
// Free the EpollItem back to slab.
EPOLL_ITEM_SLAB.free(item)
Ok(())
ep_ptable_queue_proc is the PollTable::queue_proc callback used during ADD:
ep_ptable_queue_proc(wq: &WaitQueueHead, pt: &mut PollTable):
item = pt.private as *mut EpollItem
item.wait.flags = WaitFlags::empty() // non-exclusive on the target file's WQ
// item.wait.private already points back to the EpollItem (set during ADD)
// and must not be overwritten — the fired events reach ep_poll_callback
// via its poll_key argument, not via `private`.
wq.add_wait_queue(&item.wait)
19.1.12.1.6 epoll_wait Algorithm¶
epoll_wait(epfd, events_buf, maxevents, timeout) -> Result<usize, Errno>:
if maxevents <= 0:
return Err(EINVAL)
ep = lookup_epoll_instance(epfd)?
// Fast path: check if the ready list already has items.
{
let ready = ep.ready_list.lock_irqsave();
if ready.is_empty():
drop(ready);
if timeout == 0:
return Ok(0) // non-blocking, nothing ready
// Block: wait until ep_poll_callback puts something on the ready list
// or timeout expires or a signal arrives.
result = ep.waiters.wait_event_timeout(
|| !ep.ready_list_is_empty_relaxed(),
timeout,
)
match result:
Err(EINTR) => return Err(EINTR) // signal interrupted
Err(ETIME) => return Ok(0) // timeout, nothing ready
Ok(()) => {} // items ready, proceed
}
// Drain ready list into user buffer.
// Move the entire ready list to a local transfer list under the spinlock,
// then process it without holding the spinlock.
transfer_list = IntrusiveList::new()
{
let mut ready = ep.ready_list.lock_irqsave();
ready.splice_to(&mut transfer_list);
}
count = 0
while count < maxevents && !transfer_list.is_empty():
item = transfer_list.pop_front()
// Re-poll the file to get current readiness.
// pt=None: don't re-register wait entries, just check status.
revents = item.key.file.f_ops.poll(
item.key.file.inode, item.key.file.private_data,
item.events.load(Relaxed), None,
).unwrap_or(0)
revents = revents & item.events.load(Relaxed)
if revents == 0:
// No longer ready (race: state changed between callback and drain).
item.on_ready_list.store(false, Release)
continue
// Copy event to user buffer.
events_buf[count] = EpollEvent {
events: revents,
data: item.data,
}
count += 1
// Level-triggered vs edge-triggered re-add behavior:
if item.oneshot:
// EPOLLONESHOT: disable this item until re-armed with EPOLL_CTL_MOD.
item.events.store(0, Release)
item.on_ready_list.store(false, Release)
else if item.events.load(Relaxed) & EPOLLET != 0:
// Edge-triggered: do NOT re-add to ready list.
// The item will only fire again when ep_poll_callback is called
// for a NEW event transition (e.g., new data arrives).
item.on_ready_list.store(false, Release)
else:
// Level-triggered (default): re-add to ready list because the
// condition may still be true (e.g., socket still has data).
// This ensures the next epoll_wait will re-check and report
// if the file is still ready.
{ let mut ready = ep.ready_list.lock_irqsave(); ready.push_back(&item.ready_link); }
// on_ready_list stays true
// Any remaining items in transfer_list (count hit maxevents) go back
// to the ready list — they were not delivered this round.
if !transfer_list.is_empty():
{ let mut ready = ep.ready_list.lock_irqsave(); transfer_list.splice_to(&mut ready); }
// Wake again — there are still ready items not yet delivered
ep.waiters.wake_up()
return Ok(count)
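The splice-and-drain pattern above can be modeled in userspace. This is an illustrative sketch, not kernel code: `VecDeque` stands in for the intrusive ready list, `std::mem::take` stands in for the under-spinlock splice, and undelivered items are pushed back exactly as in the `maxevents` overflow path.

```rust
use std::collections::VecDeque;

// Minimal model of the epoll_wait drain loop: splice the ready list into
// a local transfer list, deliver up to `maxevents` items, and return the
// undelivered remainder to the ready list for the next epoll_wait call.
fn drain(ready: &mut VecDeque<u64>, maxevents: usize) -> Vec<u64> {
    // "Splice under the spinlock": one pointer swap, ready list becomes empty.
    let mut transfer: VecDeque<u64> = std::mem::take(ready);
    let mut delivered = Vec::new();
    while delivered.len() < maxevents {
        match transfer.pop_front() {
            Some(item) => delivered.push(item),
            None => break,
        }
    }
    // Items past maxevents were not delivered this round: put them back
    // so the next epoll_wait sees them.
    ready.extend(transfer);
    delivered
}
```

The key property, as in the kernel algorithm, is that the ready list is locked only for the two splices, never while items are processed.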
19.1.12.1.7 Level-Triggered vs Edge-Triggered Semantics¶
Level-triggered (default): After epoll_wait delivers an event for an fd,
the EpollItem is re-added to the ready list. On the next epoll_wait call, the
file is re-polled via FileOps::poll(pt=None). If the condition is still true
(e.g., socket still has unread data), the event is reported again. If the condition
has cleared (e.g., all data was read), the item is silently removed from the ready
list and not reported.
This means level-triggered epoll may call FileOps::poll() multiple times for the
same fd between actual state changes. The cost is one virtual call per fd per
epoll_wait invocation — acceptable because level-triggered mode is used for
correctness over performance (applications that miss a read will be reminded).
Edge-triggered (EPOLLET): After epoll_wait delivers an event, the
EpollItem is NOT re-added to the ready list. on_ready_list is set to false.
The item will only appear on the ready list again when ep_poll_callback fires
for a new state transition (new data arrives, new connection accepted, etc.).
This means edge-triggered mode can miss events if the application does not fully
drain the file (e.g., reads only part of the available data). The application must
use non-blocking I/O and loop until EAGAIN to ensure no events are lost. This is
the expected usage pattern and matches Linux exactly.
EPOLLONESHOT: After one event is delivered, the item's events mask is atomically
set to 0. No further events are reported until the application re-arms the item
with EPOLL_CTL_MOD and a new events mask. This is useful for multi-threaded
servers where only one thread should handle each event — the item is disabled
immediately after dispatch, preventing a second thread from picking up the same
event from a subsequent epoll_wait.
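The three re-add policies can be condensed into one decision function. A sketch, with illustrative names (`AfterDelivery`, `post_delivery` are not kernel API); only the `EPOLLET` bit value matches the real ABI:

```rust
// Models the post-delivery branch of the epoll_wait drain loop:
// EPOLLONESHOT disarms and clears the mask, EPOLLET disarms until the
// next wakeup callback, level-triggered (default) re-queues the item.
const EPOLLET: u32 = 1 << 31; // matches the Linux uapi bit

#[derive(Debug, PartialEq)]
enum AfterDelivery {
    ReAdd,              // level-triggered: condition may still hold
    Disarm,             // edge-triggered: wait for a new state transition
    DisarmAndClearMask, // oneshot: dead until EPOLL_CTL_MOD re-arms it
}

fn post_delivery(oneshot: bool, events_mask: u32) -> AfterDelivery {
    if oneshot {
        AfterDelivery::DisarmAndClearMask
    } else if events_mask & EPOLLET != 0 {
        AfterDelivery::Disarm
    } else {
        AfterDelivery::ReAdd
    }
}
```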
EPOLLEXCLUSIVE: When multiple threads are blocked in epoll_wait on the same
epoll instance, and a non-exclusive fd becomes ready, all blocked threads are woken
(thundering herd). With EPOLLEXCLUSIVE, ep_poll_callback calls
wake_up_one() instead of wake_up(), waking exactly one blocked thread.
EPOLLEXCLUSIVE can only be set with EPOLL_CTL_ADD, not EPOLL_CTL_MOD —
attempting MOD with EPOLLEXCLUSIVE returns EINVAL.
19.1.12.1.8 Nested Epoll¶
An epoll fd can be added to another epoll fd. This is used by event loop libraries that compose multiple epoll sets. The constraints are:
- Maximum nesting depth: EP_MAX_NESTS = 4. epoll_ctl(ADD) calculates the resulting nesting depth and returns ELOOP if it would exceed 4.
- Cycle detection: epoll_ctl(ADD) walks the target's epoll tree to verify that adding the target fd would not create a cycle (A monitors B which monitors A). Returns ELOOP if a cycle is detected.
- Nested wakeup propagation: When a nested epoll fd has events ready, its FileOps::poll() checks whether its own ready list is non-empty (returning EPOLLIN if so). The outer epoll's ep_poll_callback fires normally, adding the nested epoll's EpollItem to the outer ready list.
EpollInstance implements FileOps::poll():
EpollInstance::poll(inode, private, events, pt) -> Result<PollEvents>:
if let Some(pt) = pt:
poll_wait(&self.waiters, Some(pt))
mask = PollEvents::empty()
// Check if this epoll instance has any ready items.
self.ready_lock.lock_irqsave()
if !self.ready_list.is_empty():
mask |= EPOLLIN | EPOLLRDNORM
self.ready_lock.unlock_irqrestore()
Ok(mask)
19.1.12.1.9 epoll_pwait / epoll_pwait2¶
epoll_pwait atomically sets the signal mask, calls epoll_wait, and restores
the signal mask on return. This prevents a race between signal delivery and
epoll_wait blocking:
epoll_pwait(epfd, events, maxevents, timeout, sigmask, sigsetsize):
if sigsetsize != sizeof(sigset_t):
return Err(EINVAL)
old_mask = current_task().signal_mask
if sigmask is not null:
current_task().signal_mask = *sigmask
// Recheck pending signals with new mask — if a pending signal is
// now unblocked, do_signal() will run before we block.
recalc_sigpending()
result = epoll_wait(epfd, events, maxevents, timeout)
current_task().signal_mask = old_mask
recalc_sigpending()
// If a signal is pending now, the syscall return path calls do_signal().
// If epoll_wait returned 0 events and a signal arrived, return EINTR.
result
epoll_pwait2 is identical but takes a timespec pointer for nanosecond-precision
timeout instead of the millisecond int timeout of epoll_wait/epoll_pwait.
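The precision difference is just a unit conversion. A small illustrative helper (the tuple return stands in for a `timespec`'s seconds/nanoseconds fields):

```rust
// Converts an epoll_wait/epoll_pwait millisecond timeout into the
// (seconds, nanoseconds) pair that epoll_pwait2's timespec carries.
fn ms_to_timespec(ms: u64) -> (u64, u32) {
    (ms / 1000, ((ms % 1000) * 1_000_000) as u32)
}
```

Going the other way is lossy: a timespec of 1 ns cannot be represented in whole milliseconds, which is why epoll_pwait2 exists at all.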
19.1.12.1.10 UmkaOS Improvements over Linux¶
Simplified locking: Linux's eventpoll.c uses three locks per epoll instance:
ep->lock (spinlock for the ready list and ovflist), ep->mtx (mutex for the
interest set), and ep->wq.lock (wait queue spinlock). The ovflist (overflow list)
exists because Linux cannot append to the ready list while epoll_wait is
transferring items — it redirects callbacks to a temporary list. UmkaOS eliminates
the overflow list entirely by splicing the ready list to a local transfer list
under the spinlock (single atomic pointer swap), then processing the transfer list
without holding any lock. This reduces the lock count from three to two and
eliminates the ovflist drain loop.
AtomicBool deduplication: Linux checks ep_is_linked(&epi->rdllink) under the
spinlock to avoid double-adding. UmkaOS uses on_ready_list.compare_exchange(false,
true) — a single atomic CAS that succeeds only if the item is not already queued.
This avoids acquiring the ready-list spinlock entirely in the common case where
the item is already on the ready list (e.g., a busy socket that fires repeatedly
between epoll_wait calls). On x86-64, the failed CAS is ~5 cycles vs ~25 cycles
for a spinlock acquire+release.
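The CAS deduplication is directly expressible with `std::sync::atomic`. A sketch of the invariant, assuming the surrounding enqueue logic: only the caller that wins the false→true transition proceeds to take the ready-list spinlock; everyone else returns immediately.

```rust
use std::sync::atomic::{AtomicBool, Ordering};

// Returns true iff this caller is responsible for enqueuing the item.
// A failed compare_exchange means the item is already on the ready list,
// so the ready-list spinlock is never touched in that (common) case.
fn try_claim_enqueue(on_ready_list: &AtomicBool) -> bool {
    on_ready_list
        .compare_exchange(false, true, Ordering::AcqRel, Ordering::Relaxed)
        .is_ok()
}
```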
EPOLLEXCLUSIVE uses wake_one(): Linux's EPOLLEXCLUSIVE implementation still
calls wake_up() on the epoll wait queue (which wakes all non-exclusive waiters
plus one exclusive waiter). UmkaOS calls wake_up_one() directly when the triggering
item has EPOLLEXCLUSIVE set, waking exactly one thread regardless of whether other
waiters are exclusive or not.
19.1.12.2 poll and select (Legacy)¶
- poll (syscall 7 on x86-64; 168 on ARMv7/s390x; 167 on PPC; absent on AArch64/RISC-V/LoongArch64 -- use ppoll): Array of struct pollfd { fd: i32, events: i16, revents: i16 }, O(n) per call. No persistent kernel state — each call iterates all fds, calling FileOps::poll(pt=Some) on the first call and FileOps::poll(pt=None) on retries after wakeup. A PollTable with a stack-allocated WaitQueueEntry per fd is used for the first pass.
- select (syscall 23 on x86-64; absent on AArch64/RISC-V/LoongArch64 -- use pselect6): Three fd_set bitmaps (read/write/except), limited to FD_SETSIZE = 1024 fds. O(n) scan on each call. POSIX compatibility only — no new application should use select.
- ppoll (syscall 271) / pselect6 (syscall 270): Signal-mask-aware variants that atomically set the signal mask before blocking, analogous to epoll_pwait.
The poll/select implementation internally calls FileOps::poll() on each fd
with the same PollTable mechanism used by epoll. The only difference is lifetime:
poll/select allocate WaitQueueEntry nodes on the stack (one per fd) and remove them
on return, while epoll installs persistent entries that live as long as the EpollItem.
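The FD_SETSIZE = 1024 ceiling is a property of the fd_set bitmap layout itself. An illustrative userspace model of the bitmap (the type and method names are ours, not the kernel's):

```rust
// select(2)'s fd_set is a fixed 1024-bit bitmap: 16 u64 words. An fd is
// simply a bit index, which is why select cannot monitor fd >= 1024 at
// all, regardless of how many fds the process actually has open.
const FD_SETSIZE: usize = 1024;

struct FdSet {
    bits: [u64; FD_SETSIZE / 64],
}

impl FdSet {
    fn new() -> Self {
        FdSet { bits: [0; FD_SETSIZE / 64] }
    }
    fn set(&mut self, fd: usize) -> Result<(), ()> {
        if fd >= FD_SETSIZE {
            return Err(()); // out of range for a 1024-bit fd_set
        }
        self.bits[fd / 64] |= 1 << (fd % 64);
        Ok(())
    }
    fn is_set(&self, fd: usize) -> bool {
        fd < FD_SETSIZE && self.bits[fd / 64] & (1 << (fd % 64)) != 0
    }
}
```

This fixed layout is also why the kernel must rescan every bit on each call: there is nowhere in the structure to keep persistent per-fd state between calls.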
19.1.12.3 Event-Oriented File Descriptors¶
See Section 19.10 for complete specifications of eventfd,
signalfd, timerfd, and pidfd. Each implements FileOps::poll() by calling
poll_wait() on its internal WaitQueue and returning current readiness.
19.1.12.4 Relationship to io_uring¶
io_uring (Section 19.3) supersedes epoll for new high-performance applications.
IORING_OP_POLL_ADD provides the same notification within io_uring's unified model.
19.1.12.5 vmsplice(2)¶
vmsplice(2) (NR=278): Transfer user pages to/from a pipe. With SPLICE_F_GIFT,
the pipe takes ownership of user pages (zero-copy — pages are pinned and inserted
directly into the pipe's page ring). Without the flag, the kernel copies data into
pipe-owned pages. See Section 17.3 for the pipe buffer structure.
19.2 eBPF Subsystem¶
- Full eBPF virtual machine (register-based, 11 registers, 64-bit)
- Verifier: static analysis ensuring program safety (bounded loops, memory safety, no uninitialized reads)
- JIT compiler: eBPF bytecode to native code, per architecture:
- x86-64: Phase 1 (co-primary JIT target, available from day one)
- AArch64: Phase 1 (co-primary JIT target, available from day one; see note below)
- RISC-V 64: Phase 2 (RV64 instruction emission; strong LLVM backend, straightforward port)
- PPC64LE: Phase 2 (PPC64 instruction emission)
- ARMv7: Phase 3 (Thumb-2 instruction emission)
- PPC32: Phase 3 (PPC32 instruction emission)
- s390x: Phase 3 (s390x instruction emission)
- LoongArch64: Phase 3 (LoongArch64 instruction emission)
- Interpreted fallback available on all architectures from Phase 1
AArch64 co-primary JIT rationale: AArch64 is promoted to co-primary JIT status alongside x86-64 because ARM has surpassed x86 in deployment count for Linux workloads (mobile, embedded, and cloud — AWS Graviton, Ampere, Apple Silicon). The AArch64 JIT is architecturally similar to x86-64: fixed-width 32-bit instructions, 31 general-purpose registers, and no complex addressing modes to model. Denying JIT to AArch64 would create a multi-year performance gap on the most widely deployed ISA.
Verified eBPF performance (JIT overhead, x86-64 baseline): x86-64: 2-5 ns per invocation; AArch64: 3-7 ns; RISC-V 64 (interpreted): 50-200 ns.
- Program types: XDP, tc (traffic control), kprobe, tracepoint, cgroup, socket filter, LSM, struct_ops
19.2.1.1.1 Program Type Phasing¶
Not all program types ship simultaneously. The phasing reflects Cilium/Kubernetes priority (Phase 3) versus advanced use cases (Phase 4+):
| Phase | Program Types | Rationale |
|---|---|---|
| Phase 3 | XDP, tc (cls_act), socket_filter, kprobe, tracepoint, cgroup_skb | Cilium-critical types. XDP + tc = core Kubernetes CNI. kprobe + tracepoint = observability. |
| Phase 4 | LSM, struct_ops, sk_msg, sk_skb, cgroup_sock, cgroup_sockopt, fentry/fexit | Advanced types. LSM BPF = security policy. struct_ops = TCP congestion control. |
| Phase 4+ | perf_event, raw_tracepoint, flow_dissector, sk_lookup | Niche types with smaller user base. |
19.2.1.1.2 Cilium Compatibility Milestone¶
Phase 3 exit criteria include: Cilium test suite pass rate >= 95% for XDP and tc program types. Remaining failures must be triaged as UmkaOS bugs (not spec deviations). See Section 24.2 for the full Phase 3 exit criteria.
The verifier implementation strategy is a GPLv2 Rust port (derivative work) of
Linux's kernel/bpf/verifier.c, preserving the accept/reject boundary and
state-pruning heuristics. This is a derivative work of GPL-licensed code —
UmkaOS kernel crate umka-core is GPLv2, making this legally straightforward.
The port targets the Linux 6.12 LTS verifier as the reference implementation,
with subsequent cherry-picks for critical verifier fixes. The Rust port
maintains function-level correspondence to enable systematic review against the
C original.
- Map types: hash, array, ringbuf, per-CPU hash, per-CPU array, LRU hash, LPM trie,
queue, stack, sockmap, sockhash, devmap, cpumap, xskmap, perf_event_array, stack_trace
- bpftool compatibility for loading and inspecting programs
- Required for: bpftrace, Cilium (Kubernetes networking), Falco (security), BCC tools
eBPF Compatibility Scope — UmkaOS guarantees binary compatibility for the external eBPF ABI: helper function numeric IDs (enum bpf_func_id), map type constants (enum bpf_map_type), program type constants (enum bpf_prog_type), the BPF instruction set encoding (opcodes, register numbering, call convention), the bpf() syscall command numbers and attribute structs, and the ring buffer wire format (libbpf-compatible). A BPF program compiled for Linux and accepted by the Linux verifier will load and execute identically on UmkaOS.
Not guaranteed: internal verifier behavior beyond accept/reject decisions (e.g., the exact set of state-pruning heuristics, the order in which paths are explored, or the specific error messages emitted on rejection may differ from Linux). Implementation-specific optimizations (JIT code layout, hash table bucket counts, map memory allocation strategy) are internal and may diverge. Programs must not depend on verifier exploration order, specific JIT instruction sequences, or undocumented map implementation details.
Map Size Limits and Memory Budget — enforced at bpf(BPF_MAP_CREATE, ...) time.
Limits match Linux 6.x for compatibility. Operators may raise the per-UID limit via
sysctl umka.bpf.uid_map_memory_limit_mib (default 64):
| Map type | Max entries | Max value size | Notes |
|---|---|---|---|
| BPF_MAP_TYPE_HASH | 1,048,576 | 65,535 bytes | Per-entry memory charged to cgroup |
| BPF_MAP_TYPE_ARRAY | 1,048,576 | 65,535 bytes | Total = entries × value_size |
| BPF_MAP_TYPE_RINGBUF | — | 2 GiB (must be power of 2, min 4,096) | Size in bytes, not entries |
| BPF_MAP_TYPE_PERCPU_HASH | 1,048,576 | 65,535 bytes | Multiplied by CPU count |
| BPF_MAP_TYPE_PERF_EVENT_ARRAY | NR_CPUS | — | One slot per CPU |
| All other types | 1,048,576 | 65,535 bytes | |
Global eBPF memory budget:
- Unprivileged loaders (!CAP_BPF): map memory is subject to the cgroup memory limit;
additionally, a per-UID soft limit of 64 MiB applies (returns ENOMEM when exceeded).
- Privileged loaders (CAP_BPF): no per-UID limit; memory is still subject to cgroup
accounting.
- System-wide: total eBPF map memory is tracked and reported via
/ukfs/kernel/bpf/map_memory_bytes in umkafs.
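The unprivileged admission check described above reduces to a simple charge-and-compare. An illustrative sketch (the function names are ours; the 64 MiB default is the documented sysctl value):

```rust
// Per-UID soft limit for unprivileged BPF map memory: sysctl
// umka.bpf.uid_map_memory_limit_mib, default 64 MiB, in bytes.
const UID_MAP_MEMORY_LIMIT: u64 = 64 * 1024 * 1024;

// An array map's memory charge at BPF_MAP_CREATE time.
fn array_map_charge(max_entries: u64, value_size: u64) -> u64 {
    max_entries * value_size
}

// Admission for a !CAP_BPF loader: reject with ENOMEM when the UID's
// running total plus the new charge would exceed the soft limit.
fn admit_unprivileged(uid_used: u64, charge: u64) -> Result<(), &'static str> {
    match uid_used.checked_add(charge) {
        Some(total) if total <= UID_MAP_MEMORY_LIMIT => Ok(()),
        _ => Err("ENOMEM"),
    }
}
```

For example, an array map with 1,048,576 entries of 64 bytes charges 64 MiB and fills the entire default budget by itself.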
Map Type Implementation Specifications:
- BPF_MAP_TYPE_HASH: SipHash-1-3 (hash-flooding resistant, same as Linux since v4.13), chained hashing with per-bucket singly-linked lists (matching Linux's htab_map implementation). Number of buckets = next power of two >= max_entries. Each bucket has a dedicated spinlock protecting its chain. Entries are pre-allocated from a lock-free freelist at map creation (BPF_F_NO_PREALLOC defers allocation to update time). On insert when all pre-allocated elements are exhausted: return E2BIG from map_update. Chained hashing is required because open-addressing probe sequences cross bucket boundaries, making per-bucket locking unsound.
- BPF_MAP_TYPE_ARRAY: Fixed-size pre-allocated array, index bounds-checked, per-element spinlock for value updates >8 bytes (otherwise atomic CAS).
- BPF_MAP_TYPE_RINGBUF: Lock-free SPSC ring (producer=BPF prog, consumer=userspace). Uses Linux's identical ring format (compatible with libbpf).
- BPF_MAP_TYPE_LRU_HASH: Same as HASH but with an LRU eviction list (per-CPU LRU lists, promoted to global list on cross-CPU access). Eviction is O(1) amortized.
- BPF_MAP_TYPE_LPM_TRIE: Patricia trie (radix tree), O(prefix_length) lookup, per-trie RwLock. Performs longest-prefix-match (LPM) semantics — the most specific matching prefix is returned. Maximum prefix length is bounded by key size: 32 bits for IPv4, 128 bits for IPv6, arbitrary for custom keys (up to BPF_MAX_KEY_SIZE = 512 bytes = 4096-bit prefix). Overlapping prefixes are supported; the longest match wins.
- BPF_MAP_TYPE_PERCPU_HASH: Per-CPU variant of HASH -- each CPU has its own hash table; lookups/updates touch only the current CPU's table.
- BPF_MAP_TYPE_PERCPU_ARRAY: Per-CPU array -- same structure as ARRAY but replicated per CPU.
- BPF_MAP_TYPE_PERF_EVENT_ARRAY: Array of perf_event file descriptors; bpf_perf_event_output() writes to the current CPU's slot.
- BPF_MAP_TYPE_STACK_TRACE: Hash map keyed by stack ID (SipHash-1-3 of the call chain), value is array of instruction pointers.
- BPF_MAP_TYPE_DEVMAP: Maps interface indices for XDP redirect. Used by Cilium and load balancers to steer packets between network devices via bpf_redirect_map(). Key: u32 interface index (ifindex). Value: NetDevice reference (kernel-internal, opaque to BPF programs). XDP programs call bpf_redirect_map(&devmap, ifindex, 0) to forward packets to a different NIC without passing through the full network stack. Populated by userspace via bpf(BPF_MAP_UPDATE_ELEM) with the target interface index.
- BPF_MAP_TYPE_CPUMAP: Maps CPU indices for XDP redirect. Used to distribute packets across CPUs from the XDP layer via bpf_redirect_map(&cpumap, target_cpu, 0). Key: u32 CPU index. Value: struct bpf_cpumap_val (queue size + optional chained BPF program fd). When an XDP program redirects a packet to a CPUMAP entry, the packet is enqueued on the target CPU's per-entry ring buffer (SPSC, pre-allocated at map creation). A dedicated kthread on each target CPU drains the ring and reinjects packets into the normal network stack at netif_receive_skb() level. This enables RSS-like packet steering from software when hardware RSS is unavailable or insufficient. Max entries bounded by NR_CPUS. Used by Cloudflare's XDP-based DDoS mitigation to distribute accepted packets across CPUs after filtering.
- BPF_MAP_TYPE_XSKMAP: Maps XDP socket queue indices. Used by AF_XDP to redirect packets to userspace sockets via bpf_redirect_map(). Key: u32 queue index. Value: XskSocket reference (kernel-internal, opaque to BPF programs). Max entries limited by NetDevice::num_rx_queues. XDP programs call bpf_redirect_map(&xskmap, queue_idx, 0) to steer packets into the corresponding AF_XDP socket's UMEM ring, bypassing the kernel network stack entirely. The map is populated by userspace via bpf(BPF_MAP_UPDATE_ELEM) after creating AF_XDP sockets with socket(AF_XDP, SOCK_RAW, 0) + bind().
- BPF_MAP_TYPE_SOCKMAP: Array-indexed map of socket references. Used by sk_msg and sk_skb BPF programs to redirect messages between sockets via bpf_sk_redirect_map() or bpf_msg_redirect_map(). Key: u32 array index. Value: kernel socket reference (opaque to BPF). Enables Cilium/Envoy-style socket-level load balancing and transparent proxying without leaving the kernel. Userspace inserts sockets via bpf(BPF_MAP_UPDATE_ELEM) using the socket fd. Programs attached to BPF_PROG_TYPE_SK_MSG intercept sendmsg() and redirect to a different socket in the map.
- BPF_MAP_TYPE_SOCKHASH: Hash-keyed variant of SOCKMAP. Key: arbitrary bytes (typically a 4-tuple or 5-tuple struct). Value: kernel socket reference. Lookup via bpf_sk_redirect_hash() / bpf_msg_redirect_hash(). Used when the socket selection requires a key richer than a simple array index (e.g., connection tracking). Same SipHash-1-3 hash function as BPF_MAP_TYPE_HASH. Max entries: 1,048,576.
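The "next power of two >= max_entries" bucket-count rule for the hash maps above maps directly onto Rust's integer API. An illustrative one-liner:

```rust
// BPF_MAP_TYPE_HASH sizes its bucket array to the next power of two
// at or above max_entries, so the bucket index is a cheap mask of the
// SipHash output rather than a modulo.
fn bucket_count(max_entries: u32) -> u32 {
    max_entries.next_power_of_two()
}

// With a power-of-two bucket count, hash % buckets becomes hash & mask.
fn bucket_index(hash: u32, buckets: u32) -> u32 {
    hash & (buckets - 1)
}
```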
Relationship to KABI policy hooks: eBPF provides Linux-compatible user-to-kernel extensibility for tracing, networking, and security (the same role as in Linux). UmkaOS's KABI driver model (Section 12.1) supports kernel-internal extensibility via vtable-based driver interfaces — drivers can register policy callbacks for scheduling, memory, and I/O decisions. The two mechanisms are complementary: eBPF serves the Linux ecosystem (existing tools, user-authored programs); KABI serves kernel evolution (vendor-provided policy drivers, hardware-specific optimizations).
19.2.2 eBPF Verifier Architecture¶
The verifier is the highest-risk component in the syscall interface. UmkaOS implements a Rust reimplementation derived from Linux's GPLv2 verifier (fork-and-refactor, preserving the accept/reject boundary — see above), leveraging Rust's type system to make verifier invariants compile-time enforced where possible.
Abstract interpretation: Forward dataflow analysis tracking register types and value ranges through every reachable instruction. At branch points, both paths are explored. At join points, register states are merged conservatively (widening).
Register abstract state — each of the 11 eBPF registers (r0-r10) carries:
pub struct RegState {
/// Coarse type tag.
pub reg_type: RegType,
/// Signed/unsigned min/max for SCALAR_VALUE registers.
/// Tracked as two separate ranges to handle sign-extension correctly.
pub smin: i64, pub smax: i64, // signed range
pub umin: u64, pub umax: u64, // unsigned range
/// For pointer types: byte offset from base (may be negative for stack).
pub off: i32,
/// For PTR_TO_MAP_VALUE: which map, value size, key size.
pub map_ptr: Option<BpfMapId>,
/// For PTR_TO_BTF_ID: BTF type ID for field-access checking.
pub btf_id: Option<BtfTypeId>,
    /// Equivalence ID: two registers with the same nonzero `id` are known
    /// to hold equal values, so a bounds check on one narrows the other.
    pub id: u32,
}
pub enum RegType {
NotInit, // register has never been written
ScalarValue, // arbitrary integer, range-tracked
PtrToCtx, // read-only pointer to program context
PtrToMap, // pointer to BPF map struct (not its value)
PtrToMapValue, // pointer into a map value
PtrToStack, // pointer into 512-byte per-frame stack
PtrToPacket, // data pointer (skb->data)
PtrToPacketMeta, // metadata pointer (xdp_md->data_meta)
PtrToPacketEnd, // end-of-packet sentinel
PtrToBtfId, // typed kernel pointer via BTF
PtrToMem, // pointer to kernel memory from helper return
PtrToRdOnlyBuf, // read-only buffer from helper (e.g., map lookup result)
}
Pointer arithmetic rules:
- SCALAR_VALUE: full arithmetic (add, sub, mul, div, mod, and, or, xor, shift).
Range is updated at each operation; overflow wraps and may force smin=i64::MIN, smax=i64::MAX.
- PTR_TO_MAP_VALUE + scalar: allowed. New offset = old offset + scalar.umin..scalar.umax.
Before any load/store: verifier checks [off, off+access_size) ⊆ [0, map_value_size).
- PTR_TO_STACK + scalar: allowed only if resulting offset is within [-512, 0].
Negative offsets index into the stack frame (stack grows down).
- PTR_TO_PACKET + scalar: allowed only after a bounds check instruction. The verifier
tracks data_end register; a comparison ptr + N < data_end marks the range valid.
- Other pointer types (PtrToCtx, PtrToBtfId): arithmetic forbidden. Field access
only via BTF-validated offsets.
- Pointer ± pointer: forbidden (except packet_end - packet_ptr for length, which
produces a SCALAR_VALUE bounded by packet length).
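The PTR_TO_MAP_VALUE load/store check above, [off, off+access_size) ⊆ [0, map_value_size), is a two-comparison predicate once overflow is handled. A sketch with illustrative names:

```rust
// Verifier bounds check for a load/store through PTR_TO_MAP_VALUE:
// the access must start at a non-negative offset and end at or before
// the map's value size. checked_add guards against off + size wrapping.
fn check_map_access(off: i64, access_size: u64, map_value_size: u64) -> bool {
    if off < 0 {
        return false; // negative offsets are only legal for PTR_TO_STACK
    }
    (off as u64)
        .checked_add(access_size)
        .map_or(false, |end| end <= map_value_size)
}
```

Because the verifier tracks a range of offsets (old offset + scalar.umin..scalar.umax), the real check runs this predicate against the worst-case end of the range, not a single offset.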
Stack slot tracking: The 512-byte stack is divided into 8-byte slots. Each slot
carries a StackSlotType:
- Misc: written with an unknown value (spilled scalar).
- SpilledReg(RegState): contains a spilled register with its type preserved.
- Uninit: never written — reading this is a verifier error.
Stack writes smaller than 8 bytes mark the containing slot as Misc.
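A simplified model of the slot bookkeeping, treating every full 8-byte write as a register spill for brevity (the real SpilledReg variant carries the spilled RegState; all names here are illustrative):

```rust
// The 512-byte BPF stack frame is 64 slots of 8 bytes. A full-width
// write preserves the stored register's type; a narrower write degrades
// the containing slot to Misc (unknown scalar), and unwritten slots
// stay Uninit, which makes any read from them a verifier error.
#[derive(Clone, Copy, Debug, PartialEq)]
enum StackSlotType {
    Uninit,
    Misc,
    SpilledReg, // real verifier: SpilledReg(RegState)
}

fn record_write(slots: &mut [StackSlotType; 64], off: usize, size: usize) {
    let slot = off / 8;
    slots[slot] = if size == 8 && off % 8 == 0 {
        StackSlotType::SpilledReg
    } else {
        StackSlotType::Misc
    };
}
```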
Helper function type checking: Each BPF helper has a statically-encoded
signature: fn(ArgType, ArgType, ...) -> RetType. Before emitting a call instruction,
the verifier checks each argument register's RegType against the expected ArgType:
- ARG_ANYTHING: any initialized register
- ARG_PTR_TO_MAP_KEY: PtrToStack or PtrToMapValue pointing to key_size bytes
- ARG_PTR_TO_MAP_VALUE: PtrToMapValue with write access
- ARG_CONST_SIZE: ScalarValue with known (smin==smax) value
- ARG_PTR_TO_MEM: any initialized pointer with size verified by prior arg
After the call: r0 is set to the return type (e.g., PtrToMapValue | NULL for map_lookup_elem).
Loop handling: Back-edge detection via DFS. Bounded loops (Linux 5.3+ semantics)
supported via the loop counter check: the verifier must observe the back-edge condition
register narrowing its range on each iteration. If the range does not narrow (e.g.,
counter never decremented), the loop is rejected. Widening: after
BPF_VERIFIER_WIDEN_VISITS (8) visits to the same instruction, the verifier widens
all ScalarValue ranges to [INT_MIN, INT_MAX] to force termination. (Note:
BPF_VERIFIER_WIDEN_VISITS is a separate constant from BPF_MAX_SUBPROGS (256) —
the subprogram limit and the widening threshold are unrelated concepts.)
Maximum verifier instruction exploration count: 1 million (BPF_COMPLEXITY_LIMIT_INSNS,
matching Linux since kernel 5.2). Maximum program size: 4,096 instructions for unprivileged
programs (BPF_MAXINSNS), 1 million for privileged. Unbounded loops rejected.
Verifier Limits — UmkaOS enforces the same limits as Linux 6.x so that existing eBPF
programs passing the Linux verifier also pass UmkaOS's. Programs loaded with CAP_BPF are
not subject to tighter restrictions:
| Limit | Value | Notes |
|---|---|---|
| Max instructions explored (visited) | 1,000,000 | Complexity bound, not unique instruction count (BPF_COMPLEXITY_LIMIT_INSNS) |
| Max stack depth per subprogram | 512 bytes | Includes spilled registers |
| Max subprograms (BPF-to-BPF calls) | 256 | BPF_MAX_SUBPROGS |
| Max map-in-map nesting depth | 2 | |
| Max tail call depth | 33 | MAX_TAIL_CALL_CNT; Linux 5.12+ uses 33, earlier used 32 |
| Max instructions per subprogram | 1,000,000 | Same as global complexity limit |
| Max loop iterations (bounded loops) | 8,192,000 | Per loop; ultimately bounded by the instruction exploration limit |
Verifier time limit: In addition to the instruction exploration limit, UmkaOS enforces a wall-clock timeout of 1 second per verification attempt. A pathological program can consume 1M exploration steps and still take seconds to verify due to wide range analysis or many join points — this is an unprivileged DoS vector that Linux does not protect against. UmkaOS closes this gap:
pub struct VerifierBudget {
/// Maximum BPF instruction state explorations (same as Linux).
pub insn_limit: u32, // default: 1_000_000
/// Wall-clock timeout for the entire verification pass (nanoseconds).
/// Default: 1_000_000_000 (1 second). `std::time::Instant` is not
/// available in `#![no_std]`; use raw u64 nanoseconds from `ktime_get_ns()`.
pub time_limit_ns: u64, // default: 1_000_000_000
/// Instructions explored so far.
pub insns_checked: u32,
/// Verification start time (nanoseconds, from `ktime_get_ns()`).
/// `Instant` is not available in `#![no_std]`.
pub start_ns: u64,
}
impl VerifierBudget {
pub fn check(&self) -> Result<(), VerifierError> {
if self.insns_checked >= self.insn_limit {
return Err(VerifierError::ComplexityLimit);
}
let elapsed = ktime_get_ns() - self.start_ns;
if elapsed >= self.time_limit_ns {
return Err(VerifierError::TimeLimit);
}
Ok(())
}
}
The wall-clock timeout is checked at every back-edge and join point during exploration. On
timeout, the program is rejected with EACCES and an error message:
"BPF program rejected: verification time limit exceeded (1s)".
The limit can be raised for privileged users: CAP_SYS_ADMIN can set up to 10 seconds via
the BPF_PROG_LOAD attribute verification_time_limit_ms. The sysctl
/proc/sys/kernel/bpf_verifier_time_limit_ms (default 1000) sets the system-wide limit for
unprivileged loaders. Note: Linux does not have this protection; UmkaOS introduces it as a
safety improvement.
Stack depth: Maximum 512 bytes per frame, verified statically. BPF-to-BPF call depth
max 8 frames, each with up to 512 bytes of stack. Tail call chain depth max 33
(MAX_TAIL_CALL_CNT, matching Linux 5.12+; earlier kernels used 32).
JIT cycle-budget enforcement: The JIT compiler inserts a decrement + branch at
each backward jump (loop header). The counter is initialized to bpf_jit_limit
(default 1M instructions). When the counter reaches zero, the program is terminated
with BPF_PROG_RUN_TIMEOUT. This provides runtime enforcement complementing the
verifier's static bound on loop iterations.
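The emitted decrement-and-branch can be modeled with an explicit loop. This is an illustrative sketch of the semantics, not the JIT's machine code; the function and error names are ours except BPF_PROG_RUN_TIMEOUT:

```rust
// Models the JIT's runtime cycle budget: a counter initialized to
// bpf_jit_limit is decremented at every loop back-edge, and the program
// is terminated when it hits zero. Returns the remaining budget on
// normal completion.
fn run_loop(mut budget: u64, iterations: u64) -> Result<u64, &'static str> {
    for _ in 0..iterations {
        // The JIT emits this decrement + conditional branch at each
        // backward jump (loop header).
        if budget == 0 {
            return Err("BPF_PROG_RUN_TIMEOUT");
        }
        budget -= 1;
        // ... loop body runs here ...
    }
    Ok(budget)
}
```

This is why the runtime bound complements rather than replaces the verifier: the static analysis proves termination in the abstract, while the counter catches any concrete execution that exceeds the budget anyway.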
19.2.3 eBPF Verifier Risk Mitigation¶
A verifier bug equals kernel compromise. UmkaOS applies defense-in-depth:
- Memory protection: Each loaded BPF program is assigned to a dedicated BPF isolation domain (see Section 11.3). BPF programs do NOT run in the same isolation domain as umka-core. The JIT output pages are mapped into the BPF program's own isolation domain with execute-only permissions (no write) after code emission; the JIT staging buffer is mapped read-write (no execute) during emission and unmapped afterward. This W^X discipline, combined with domain isolation, ensures defense-in-depth: even with a verifier bug, an attacker cannot modify JIT-compiled code at runtime or access Core memory directly. BPF programs access kernel state only through verified helper functions that perform bounds-checked, type-checked cross-domain reads into the Core isolation domain on the program's behalf.
- Capability-gated loading: Only
CAP_BPFholders can load programs. Unprivileged eBPF loading disabled by default. - Differential testing: UmkaOS verifier tested against Linux verifier on >50,000 known-good and known-bad programs. Any divergence is investigated.
- Rust type safety: Invalid state transitions are compile-time errors, not runtime checks.
19.2.4 BPF Isolation Model¶
BPF programs are a cross-cutting concern used beyond networking: tracing (kprobe, tracepoint), security (LSM, seccomp), scheduling (struct_ops), and packet filtering (XDP, tc) all execute BPF code. The full BPF isolation model — verifier enforcement, map access control, capability-gated helpers, cross-domain packet redirect rules, and W^X page protections — is specified in Section 16.18 (Packet Filtering, BPF-Based). Although Section 16.18 is located in the Networking part, its isolation rules apply to all BPF program types, not just networking hooks. Every BPF program, regardless of attachment point, runs in its own dedicated BPF isolation domain with safety enforced by both the verifier AND domain isolation (see Section 19.2 and Section 11.3). BPF helpers that access Core data perform validated cross-domain reads into the Core domain on the program's behalf.
19.2.5 eBPF Helper Function IDs and Dispatch Table¶
eBPF programs invoke kernel services through a fixed set of helper functions identified by numeric IDs. Since BPF programs compiled for Linux embed these IDs directly in their bytecode, UmkaOS must dispatch identically — the numeric IDs are part of the external ABI.
19.2.5.1 Helper ID Enumeration¶
/// eBPF helper function IDs — must match Linux's `enum bpf_func_id` exactly
/// (include/uapi/linux/bpf.h, Linux 6.12).
///
/// Programs compiled for Linux use these numeric IDs; UmkaOS must dispatch
/// identically. Helper ID range is 0..211 (inclusive, 212 helpers total).
/// The dispatch table is sized to `__BPF_FUNC_MAX_ID` (212 — sentinel value,
/// one past the last valid ID). An implementation sizing the table as
/// `[...; 211]` would have an off-by-one (OOB on helper 211). Use `[...; 212]`
/// or `[...; __BPF_FUNC_MAX_ID]`.
/// Only the commonly-used helpers are named here; the remainder are
/// provided as named `ReservedNNN = NNN` variants at their correct numeric positions.
#[repr(u32)]
#[non_exhaustive]
pub enum BpfFuncId {
Unspec = 0,
MapLookupElem = 1, // bpf_map_lookup_elem
MapUpdateElem = 2, // bpf_map_update_elem
MapDeleteElem = 3, // bpf_map_delete_elem
ProbeRead = 4, // bpf_probe_read (deprecated; use ProbeReadKernel)
KtimeGetNs = 5, // bpf_ktime_get_ns → monotonic nanoseconds
TracePrintk = 6, // bpf_trace_printk → /sys/kernel/debug/tracing/trace_pipe
GetPrandomU32 = 7, // bpf_get_prandom_u32
GetSmpProcessorId = 8, // bpf_get_smp_processor_id
SkbStoreBytes = 9,
L3CsumReplace = 10, // bpf_l3_csum_replace
L4CsumReplace = 11, // bpf_l4_csum_replace
TailCall = 12, // bpf_tail_call
CloneRedirect = 13, // bpf_clone_redirect
GetCurrentPidTgid = 14, // → (tgid << 32 | pid)
GetCurrentUidGid = 15, // → (gid << 32 | uid)
GetCurrentComm = 16,
GetCgroupClassid = 17,
SkbVlanPush = 18,
SkbVlanPop = 19,
SkbGetTunnelKey = 20,
SkbSetTunnelKey = 21,
PerfEventRead = 22,
Redirect = 23, // bpf_redirect(ifindex, flags)
GetRouteRealm = 24,
PerfEventOutput = 25,
SkbLoadBytes = 26,
GetStackid = 27,
CsumDiff = 28, // bpf_csum_diff
SkbGetTunnelOpt = 29,
SkbSetTunnelOpt = 30,
SkbChangeProto = 31,
SkbChangeType = 32,
SkbUnderCgroup = 33,
GetHashRecalc = 34,
GetCurrentTask = 35,
ProbeWriteUser = 36,
CurrentTaskUnderCgroup = 37,
SkbChangeTail = 38,
SkbPullData = 39,
CsumUpdate = 40, // bpf_csum_update
SetHashInvalid = 41, // bpf_set_hash_invalid
GetNumaNodeId = 42,
SkbChangeHead = 43,
XdpAdjustHead = 44, // bpf_xdp_adjust_head(xdp_md, delta)
// XDP headroom: minimum 256 bytes reserved before packet data.
// bpf_xdp_adjust_head() moves the data pointer by `delta` bytes:
// delta < 0: grow headroom (move data start backward, up to 256 bytes).
// delta > 0: shrink headroom (move data start forward into packet).
// Exceeding bounds (headroom < 0 or > 256, or data_end < data) returns
// -EINVAL. The 256-byte minimum matches Linux's XDP_PACKET_HEADROOM.
XdpAdjustTail = 65, // bpf_xdp_adjust_tail(xdp_md, delta)
// Adjusts the tail (end) of the XDP packet data. delta > 0 extends
// the packet; delta < 0 trims it. Returns -EINVAL if the resulting
// packet length would be < ETH_HLEN (14) or exceed the page boundary.
// Used for packet encapsulation/decapsulation at XDP layer.
XdpAdjustMeta = 54, // bpf_xdp_adjust_meta(xdp_md, delta)
// Adjusts the metadata area preceding the packet data. delta < 0
// grows metadata space (moving data_meta backward); delta > 0 shrinks
// it. Metadata is used to pass per-packet information from XDP to TC
// or the network stack (e.g., flow hash, classification result).
// Returns -EINVAL if data_meta would move past data or below the
// headroom limit.
ProbeReadStr = 45,
GetSocketCookie = 46,
GetSocketUid = 47,
SetHash = 48,
Setsockopt = 49,
SkbAdjustRoom = 50,
// --- Redirect helpers (critical for XDP / TC) ---
RedirectMap = 51, // bpf_redirect_map(&map, key, flags) — AF_XDP, devmap
SkRedirectMap = 52, // bpf_sk_redirect_map — sockmap redirect
// Note: Redirect = 23 is already defined above.
RedirectNeigh = 152, // bpf_redirect_neigh(ifindex, params, plen, flags)
// Redirect packet to a neighbor (next-hop) on a different
// interface. Unlike bpf_redirect(), this performs a full
// L3 neighbor lookup (ARP/NDP) and fills in the L2 header.
// Used by Cilium for host-routing mode to bypass iptables.
// `params`: optional `struct bpf_redir_neigh` (next-hop addr).
// Linux 5.10+, BPF_PROG_TYPE_SCHED_CLS only.
RedirectPeer = 155, // bpf_redirect_peer(ifindex, flags)
// Redirect packet to the peer device of a veth pair,
// skipping the normal netif_receive_skb() path on the
// peer. This avoids a full trip through the receiving
// device's TC ingress and delivers directly to the peer's
// network namespace. Critical for Cilium's veth-based
// container networking — saves ~2-4 us per packet.
// Linux 5.10+, BPF_PROG_TYPE_SCHED_CLS only.
GetCurrentCgroupId = 80, // bpf_get_current_cgroup_id
// --- Socket lookup helpers ---
SkLookupTcp = 84,
SkLookupUdp = 85,
// --- Storage / signal helpers ---
SkStorageGet = 107,
SkStorageDelete = 108,
SendSignal = 109,
TcpGenSyncookie = 110,
// --- Probe read helpers (replacements for deprecated ProbeRead) ---
ProbeReadUser = 112,
ProbeReadKernel = 113,
ProbeReadUserStr = 114,
ProbeReadKernelStr = 115,
// --- Ring buffer helpers ---
RingbufOutput = 130,
RingbufReserve = 131,
RingbufSubmit = 132,
RingbufDiscard = 133,
RingbufQuery = 134,
CsumLevel = 135, // bpf_csum_level
UserRingbufDrain = 209,
// --- Timer / kptr / dynptr helpers ---
SysClose = 168, // bpf_sys_close (BPF_PROG_TYPE_SYSCALL only)
TimerInit = 169,
TimerSetCallback = 170,
TimerStart = 171,
TimerCancel = 172,
KptrXchg = 194,
DynptrFromMem = 197,
DynptrRead = 201,
DynptrWrite = 202,
// --- Conntrack / FIB helpers (critical for Cilium, load balancers) ---
FibLookup = 69, // bpf_fib_lookup(ctx, params, plen, flags)
// Perform a FIB (routing table) lookup from BPF.
// Returns next-hop ifindex, MAC addresses, and MTU.
// Used by Cilium for policy-based routing and direct
// XDP forwarding without entering the full IP stack.
// `params`: `struct bpf_fib_lookup` (44 bytes).
// Return values: BPF_FIB_LKUP_RET_SUCCESS (0),
// BPF_FIB_LKUP_RET_BLACKHOLE (1),
// BPF_FIB_LKUP_RET_UNREACHABLE (2),
// BPF_FIB_LKUP_RET_PROHIBIT (3),
// BPF_FIB_LKUP_RET_NOT_FWDED (4),
// BPF_FIB_LKUP_RET_FWD_DISABLED (5),
// BPF_FIB_LKUP_RET_UNSUPP_LWT (6),
// BPF_FIB_LKUP_RET_NO_NEIGH (7),
// BPF_FIB_LKUP_RET_FRAG_NEEDED (8).
// Note: conntrack operations (ct_lookup, ct_insert, ct_delete) are exposed
// via BPF kfuncs in Linux 6.x (not classic helpers). They do not consume
// a BpfFuncId slot. UmkaOS implements these as kfuncs registered by the
// conntrack subsystem — see [Section 16.18](16-networking.md#packet-filtering-bpf-based--bpf-kfuncs-for-conntrack).
// --- Iteration / control flow ---
Loop = 181,
ForEachMapElem = 164,
// The full BpfFuncId enum (all 212 helpers 0..211 through Linux 6.12,
// __BPF_FUNC_MAX_ID = 212) is generated at build time from the
// Linux 6.12 `include/uapi/linux/bpf.h` header using a
// const-generating build.rs script. The enum above lists only
// the architecturally critical helpers for each program type.
// Helper IDs must match Linux exactly — they are part of the
// eBPF bytecode ABI.
//
// Implementation status: BPF programs detect helper availability
// at load time via the verifier (not at runtime). If a BPF program
// calls a helper that is not yet implemented, the verifier rejects
// the program with ENOTSUPP. There is no runtime "helper not
// found" path — all validation is static at BPF_PROG_LOAD time.
//
// Forward-compatibility: unknown helper IDs from newer Linux
// eBPF bytecode are rejected by the verifier with ENOTSUPP.
// Invalid function IDs are rejected at load time, not represented in the enum.
// The verifier validates all BPF_CALL instructions against the known
// BpfFuncId range; any ID outside 0..=211 AND outside the UmkaOS
// extension range (0x1000+) is rejected with EINVAL. The verifier
// accepts both the Linux range (0..=211) and the UmkaOS extension range.
// The dispatch table uses an XArray (sparse integer-keyed mapping per
// collection policy) rather than a dense array, supporting both the
// contiguous Linux range and the sparse UmkaOS extensions efficiently.
// --- UmkaOS-specific helpers (extension range 0x1000+) ---
/// UmkaOS-specific: read a named per-CPU counter from the tracepoint
/// subsystem. Used for stable ABI drop counters (e.g., "tp_drops").
/// The counter name is passed as a pointer + length pair (r1, r2).
/// Returns the sum across all CPUs as a u64 in r0.
BpfPerCpuCounter = 0x1001, // UmkaOS extension range starts at 0x1000
}
The full set of helper IDs must match enum bpf_func_id in Linux's
include/uapi/linux/bpf.h exactly. UmkaOS implements the complete set of helpers required
for:
- Network programs (XDP, socket filter, tc): Redirect, XdpAdjustHead,
SkbAdjustRoom, PerfEventOutput, MapLookupElem/MapUpdateElem, checksum helpers.
- Tracing programs (kprobe, tracepoint, perf_event): ProbeReadKernel,
ProbeReadUser, ProbeReadKernelStr, ProbeReadUserStr, GetCurrentPidTgid,
GetCurrentTask, GetStackid, RingbufOutput/Reserve/Submit/Discard.
- BPF_PROG_TYPE_TRACING (fentry/fexit): uses the tracing helper set above. Programs attach to kernel function entry/exit via BTF-based function signature matching; the BPF program receives the function's arguments (fentry) or return value (fexit) as its context. The attach target is specified by BTF ID, and the verifier validates argument types against the target function's BTF signature.
- Cgroup programs: GetCurrentUidGid, GetCgroupClassid, Setsockopt,
GetSocketCookie.
Program types that attempt to call a helper not permitted for their type receive EPERM
from the verifier at load time — not at runtime. The allowed-helper set per program type
is enforced statically.
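The load-time ID validation described above (the Linux range 0..=211 plus the UmkaOS extension range at 0x1000+, with ENOTSUPP for unimplemented helpers and EINVAL for out-of-range IDs) can be sketched as follows. The extension-range width (`EXT_COUNT`) is an illustrative assumption, not a value fixed by this chapter:

```rust
/// Sketch of BPF_CALL target validation at load time, assuming the ranges
/// described above. `EXT_COUNT` is illustrative, not normative.
const BPF_FUNC_MAX_ID: u32 = 212; // one past the last valid Linux helper ID
const EXT_BASE: u32 = 0x1000;     // UmkaOS extension range start
const EXT_COUNT: u32 = 16;        // assumed width of the extension range

#[derive(Debug, PartialEq)]
enum HelperCheck {
    Ok,      // known, implemented helper
    NotSupp, // in-range but unimplemented: reject with ENOTSUPP
    Inval,   // outside both valid ranges: reject with EINVAL
}

fn check_helper_id(id: u32, implemented: &[u32]) -> HelperCheck {
    let in_linux_range = id < BPF_FUNC_MAX_ID;
    let in_ext_range = (EXT_BASE..EXT_BASE + EXT_COUNT).contains(&id);
    if !in_linux_range && !in_ext_range {
        return HelperCheck::Inval;
    }
    if implemented.contains(&id) {
        HelperCheck::Ok
    } else {
        HelperCheck::NotSupp
    }
}
```

All three outcomes are decided at BPF_PROG_LOAD time; there is no runtime "helper not found" path.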
19.2.5.2 BpfProg and BpfCmd¶
/// Maximum number of BPF maps a single program can reference.
/// Matches Linux `MAX_USED_MAPS` (64, since kernel 3.18).
pub const BPF_MAX_MAPS: usize = 64;
/// A loaded BPF program.
///
/// `insns` uses `Box<[BpfInsn]>` (heap-allocated, exactly-sized) because
/// privileged BPF programs can be up to 1M instructions (8 MiB at 8 bytes
/// per `BpfInsn`). `ArrayVec<BpfInsn, 4096>` cannot hold programs beyond
/// `BPF_MAXINSNS`. Allocation occurs once at program load time (cold path —
/// `bpf(BPF_PROG_LOAD, ...)`), and the slice is immutable after verifier +
/// JIT processing.
pub struct BpfProg {
pub prog_type: BpfProgType,
pub jit_image: Option<JitImage>,
/// Verified BPF instructions. Heap-allocated at load time.
/// Unprivileged programs: max `BPF_MAXINSNS` (4096) instructions.
/// Privileged programs (CAP_BPF): max 1M instructions.
pub insns: Box<[BpfInsn]>,
pub maps: ArrayVec<Arc<BpfMap>, BPF_MAX_MAPS>,
pub aux: BpfProgAux,
pub refcount: AtomicU64,
}
/// bpf(2) command discriminant. Matches Linux `enum bpf_cmd` values.
#[repr(u32)]
pub enum BpfCmd {
MapCreate = 0, MapLookupElem = 1, MapUpdateElem = 2, MapDeleteElem = 3,
MapGetNextKey = 4, ProgLoad = 5, ObjPin = 6, ObjGet = 7,
ProgAttach = 8, ProgDetach = 9, ProgTestRun = 10, ProgGetNextId = 11,
// ... remaining commands match Linux numbering
}
Detach lifecycle: On detach, BpfProg.refcount is decremented. When refcount
reaches 0: JIT image freed, maps dereferenced. Maps persist independently (own refcount
from userspace fd) — a map shared by two programs is only freed when both programs
detach AND all userspace fds to the map are closed.
Full program lifecycle: Refcount tracks all references (userspace fd, kernel
attachment points). On last fd close: if no kernel attachments remain, program is
freed. On detach: refcount decremented; freed when zero. Attachment data is stored
as BpfAttachTarget enum variants: Tracepoint(tp_id), XdpDev(ifindex),
CgroupSkb(cgroup_id, direction), TcClass(ifindex, prio), etc.
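The refcount lifecycle above (one reference per userspace fd and per kernel attachment, free on last drop) can be sketched with an atomic counter. `ProgRef` and its methods are illustrative stand-ins, not the actual UmkaOS teardown code:

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Minimal refcount sketch of the BpfProg lifecycle: each fd and each
/// kernel attachment holds one reference; the last drop frees the program
/// (JIT image, map references). Names are illustrative.
struct ProgRef {
    refcount: AtomicU64,
}

impl ProgRef {
    fn new() -> Self {
        // Created with one reference held by the loading fd.
        ProgRef { refcount: AtomicU64::new(1) }
    }
    fn attach(&self) {
        // A kernel attachment point takes a reference.
        self.refcount.fetch_add(1, Ordering::Relaxed);
    }
    /// Drop one reference (fd close or detach). Returns true when this
    /// was the last reference and the program should be freed now.
    fn put(&self) -> bool {
        self.refcount.fetch_sub(1, Ordering::Release) == 1
    }
}
```

Maps referenced by the program follow the same pattern independently, which is why a shared map outlives both programs until every reference is gone.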
19.2.5.3 Helper Dispatch Table¶
/// BPF program type — determines the execution context, available helpers,
/// and return value semantics. Matches Linux `enum bpf_prog_type` values.
#[repr(u32)]
pub enum BpfProgType {
Unspec = 0,
SocketFilter = 1,
Kprobe = 2,
SchedCls = 3,
SchedAct = 4,
Tracepoint = 5,
Xdp = 6,
PerfEvent = 7,
CgroupSkb = 8,
CgroupSock = 9,
LwtIn = 10,
LwtOut = 11,
LwtXmit = 12,
SockOps = 13,
SkSkb = 14,
CgroupDevice = 15,
SkMsg = 16,
RawTracepoint = 17,
CgroupSockAddr = 18,
LwtSeg6local = 19,
LircMode2 = 20,
SkReuseport = 21,
FlowDissector = 22,
CgroupSysctl = 23,
RawTracepointWritable = 24,
CgroupSockopt = 25,
Tracing = 26,
StructOps = 27,
Extension = 28,
Lsm = 29,
SkLookup = 30,
Syscall = 31,
Netfilter = 32,
// 33+ reserved for future Linux additions
}
/// eBPF helper function dispatch table.
/// Indexed by `BpfFuncId` (cast to `u32`), sized to `__BPF_FUNC_MAX_ID` (212).
/// Populated at kernel init time; immutable thereafter.
pub struct BpfHelperTable {
/// One entry per helper ID, indexed by `BpfFuncId as u32`.
/// Entries for unimplemented IDs have `func` set to `bpf_unimplemented_helper`
/// (returns 0; verifier prevents reaching this at runtime by rejecting the call).
pub helpers: &'static [BpfHelper],
}
/// Descriptor for a single eBPF helper function.
pub struct BpfHelper {
/// Numeric helper ID (matches `BpfFuncId`).
pub id: u32,
/// Bitmask of `BpfProgType` values permitted to call this helper.
/// Verifier checks this at load time; runtime dispatch unconditional.
pub allowed_prog_types: BpfProgTypeMask,
/// Capability bitmask required to use this helper. The verifier checks
/// `current_task().caps.has_all(required_caps)` at program load time
/// and rejects the program with `EPERM` if the loading task lacks any
/// required capability. Helpers that only read kernel state (e.g.,
/// `BPF_FUNC_ktime_get_ns`) set this to `SystemCaps::empty()`.
/// Helpers that modify kernel state (e.g., `BPF_FUNC_probe_write_user`,
/// `BPF_FUNC_trace_printk`) require `CAP_BPF` or `CAP_SYS_ADMIN`.
/// This field replaces the ad-hoc capability checks scattered across
/// individual helper implementations in Linux — the verifier enforces
/// the check uniformly for all helpers via a single code path.
pub required_caps: SystemCaps,
/// UmkaOS implementation.
///
/// # Safety
///
/// Called from JIT-compiled or interpreted eBPF programs. Arguments are
/// pre-validated by the verifier (types match `arg_types`; pointers are
/// in-bounds). The implementation must not panic, must not access
/// memory outside the passed bounds, and must complete in bounded time.
pub func: unsafe fn(a1: u64, a2: u64, a3: u64, a4: u64, a5: u64) -> u64,
/// Argument type descriptors used by the verifier for type checking.
/// Five slots match the eBPF calling convention (r1-r5 as arguments).
pub arg_types: [BpfArgType; 5],
/// Return value type for the verifier (updates r0's `RegState` after the call).
pub ret_type: BpfRetType,
}
/// Per-program-type helper allowlist bitmask.
/// One bit per `BpfProgType`; a helper is callable from program type T
/// iff `(allowed_prog_types >> T as u32) & 1 == 1`.
pub struct BpfProgTypeMask(pub u64);
The BpfHelperTable is a static array allocated at kernel init time and populated by
each subsystem that owns helpers (networking, tracing, cgroup, crypto). The table is
looked up by the JIT compiler (to emit a direct call to helper.func) and by the
interpreter (to dispatch via helper.func). The verifier uses arg_types and
ret_type to propagate register abstract state through helper calls and allowed_prog_types
to reject calls to disallowed helpers with EPERM at load time.
Program-type → helper allowlist table. The verifier uses BpfProgTypeMask to
restrict which helpers each program type can call. Representative allowlist (subset;
full table populated by each subsystem at init):
| Helper | SocketFilter | SchedCls/Act | XDP | Tracepoint | Kprobe | CgroupSkb | LwtXmit | SkSkb | Lsm |
|---|---|---|---|---|---|---|---|---|---|
| MapLookupElem | Y | Y | Y | Y | Y | Y | Y | Y | Y |
| MapUpdateElem | Y | Y | Y | Y | Y | Y | Y | Y | Y |
| MapDeleteElem | Y | Y | Y | Y | Y | Y | Y | Y | Y |
| KtimeGetNs | Y | Y | Y | Y | Y | Y | Y | Y | Y |
| GetCurrentPidTgid | - | - | - | Y | Y | - | - | - | Y |
| GetCurrentUidGid | - | - | - | Y | Y | - | - | - | Y |
| SkbLoadBytes | Y | Y | - | - | - | Y | Y | Y | - |
| SkbStoreBytes | - | Y | - | - | - | - | Y | Y | - |
| L3CsumReplace | - | Y | - | - | - | - | Y | - | - |
| L4CsumReplace | - | Y | - | - | - | - | Y | - | - |
| CloneRedirect | - | Y | - | - | - | - | - | - | - |
| Redirect | - | Y | Y | - | - | - | - | - | - |
| RedirectMap | - | Y | Y | - | - | - | - | - | - |
| XdpAdjustHead | - | - | Y | - | - | - | - | - | - |
| XdpAdjustTail | - | - | Y | - | - | - | - | - | - |
| PerfEventOutput | Y | Y | Y | Y | Y | Y | Y | Y | Y |
| ProbeReadKernel | - | - | - | Y | Y | - | - | - | Y |
| ProbeReadUser | - | - | - | Y | Y | - | - | - | Y |
| GetStackid | - | - | - | Y | Y | - | - | - | - |
| SkbGetTunnelKey | - | Y | - | - | - | - | Y | - | - |
| GetCurrentCgroupId | - | Y | - | Y | Y | Y | - | - | Y |
| RingbufOutput | Y | Y | Y | Y | Y | Y | Y | Y | Y |
| RingbufReserve | Y | Y | Y | Y | Y | Y | Y | Y | Y |
| RingbufSubmit | Y | Y | Y | Y | Y | Y | Y | Y | Y |
| RingbufDiscard | Y | Y | Y | Y | Y | Y | Y | Y | Y |
Y = allowed, - = rejected by the verifier at load time. The full allowlist is
populated by each subsystem's bpf_register_helpers() call during init. A helper
not permitted for a program type produces verifier error EPERM; an unimplemented
helper ID produces ENOTSUPP; an ID outside the valid ranges produces EINVAL.
19.2.5.4 Helper Security Model¶
- Capability-gated helpers: helpers that can modify kernel state (ProbeWriteUser, TracePrintk) require CAP_BPF or CAP_SYS_ADMIN; the verifier rejects their use in programs loaded without the required capability.
- Type-safe access: helpers accessing kernel memory (ProbeReadKernel, ProbeReadUser) perform bounds-checked, type-checked access. The verifier ensures the pointer argument is of the correct RegType and the size argument is a known ScalarValue.
- No helper bypasses isolation domains: BPF helpers perform validated cross-domain reads into Core memory on the program's behalf, but cannot be used to access Tier 1 driver isolation domains directly. A Tier 1 driver's memory is not reachable via bpf_probe_read_kernel because it maps to a different protection key — the access faults at the hardware level before the helper copies any data.
19.2.6 TC/Classifier BPF Program Context (SkBuff)¶
TC (traffic control) classifiers and actions (BPF_PROG_TYPE_SCHED_CLS,
BPF_PROG_TYPE_SCHED_ACT) receive a pointer to SkBuff as their program
context. This struct is the UmkaOS equivalent of Linux's struct __sk_buff
(include/uapi/linux/bpf.h). Field offsets MUST match Linux's definition
exactly — BPF programs compiled against Linux headers access fields by
byte offset, not by name. Binary compatibility with Cilium, Calico, and all
K8s CNI plugins depends on this layout.
The SkBuff is a "mirror" of the internal NetBuf (Section 16.5):
accesses to SkBuff fields are transparently rewritten by the BPF verifier into
accesses to the corresponding fields in the real NetBuf. This indirection
provides ABI stability (the internal NetBuf layout may change between kernel
versions) and a verification layer (the verifier enforces read/write permissions
per field per program type).
/// TC/classifier BPF program context. Field offsets MUST match Linux's
/// `struct __sk_buff` (include/uapi/linux/bpf.h, Linux 6.12) for binary
/// compatibility. BPF programs compiled against Linux headers access fields
/// by offset — any layout deviation silently corrupts program behavior.
///
/// New fields can only be added to the END of this structure (append-only ABI).
///
/// The verifier's `convert_ctx_access()` rewrites loads/stores on this struct
/// into accesses to the real `NetBuf` fields at the correct internal offsets.
/// Not all fields are readable/writable from all program types — the verifier's
/// `is_valid_access()` callback enforces per-program-type access control.
// kernel-internal, not KABI
#[repr(C)]
pub struct SkBuff {
/// Total packet length (bytes), including non-linear (paged) data.
/// Note: `data_end - data` gives the LINEAR data length only; `len`
/// may be larger when the packet spans multiple pages.
pub len: u32, // offset 0
/// Packet type: PACKET_HOST, PACKET_BROADCAST, PACKET_MULTICAST,
/// PACKET_OTHERHOST, PACKET_OUTGOING. Values from `if_packet.h`.
pub pkt_type: u32, // offset 4
/// General-purpose 32-bit mark. Shared across netfilter, TC, IPsec,
/// and routing subsystems. Cilium uses this for identity propagation.
pub mark: u32, // offset 8
/// TX queue index on the NIC. TC can override for custom balancing.
pub queue_mapping: u32, // offset 12
/// Layer 3 protocol (ETH_P_IP, ETH_P_IPV6, etc.). Network byte order.
pub protocol: u32, // offset 16
/// Boolean: 1 if a VLAN header is present, 0 otherwise.
/// Removed from internal sk_buff in Linux 6.0.
///
/// **Verifier rewrite**: `convert_ctx_access()` for reads at offset 20
/// (`vlan_present`) MUST emit: `result = (skb->vlan_tci != 0) ? 1 : 0`.
/// Linux removed `vlan_present` from its internal `sk_buff` in 6.0;
/// UmkaOS retains it in NetBuf as an explicit `u8` field for clarity
/// (see [Section 16.5](16-networking.md#netbuf-packet-buffer)). The verifier rewrite above applies
/// to BPF program access, not to `netbuf_to_sk_buff()` conversion (which
/// reads `nb.vlan_present` directly). Writes to this offset are rejected
/// by `is_valid_access()`. Retained for backward compatibility.
pub vlan_present: u32, // offset 20
/// VLAN Tag Control Information (priority + DEI + VID).
pub vlan_tci: u32, // offset 24
/// VLAN protocol ID (ETH_P_8021Q, ETH_P_8021AD). Network byte order.
pub vlan_proto: u32, // offset 28
/// Queuing priority (0-63 effective). Only meaningful with `skbprio` qdisc.
pub priority: u32, // offset 32
/// Interface index of the device the packet arrived on (0 if local).
pub ingress_ifindex: u32, // offset 36
/// Interface index of the device the packet is currently "on".
/// Updated on redirect. On egress, the device picked for TX.
pub ifindex: u32, // offset 40
/// TC index — carries Type of Service (TOS/DSCP) info from `dsmark` qdisc.
/// `BPF_PROG_TYPE_SCHED_CLS` programs can modify this.
pub tc_index: u32, // offset 44
/// Control buffer: 5 × u32 with no pre-defined meaning. Shared between
/// network subsystems and BPF programs for per-packet metadata passing.
pub cb: [u32; 5], // offset 48 (20 bytes)
/// Flow hash computed from packet headers. Optionally HW-offloaded.
pub hash: u32, // offset 68
/// TC class ID. Set by `BPF_PROG_TYPE_SCHED_CLS` in direct-action mode.
/// Only meaningful when the program returns `TC_ACT_OK` and the qdisc
/// has classes.
pub tc_classid: u32, // offset 72
/// Pointer to start of linear packet data (L3 header).
/// Used with `data_end` for direct packet access bounds checking.
pub data: u32, // offset 76
/// Pointer past the last byte of linear packet data.
pub data_end: u32, // offset 80
/// NAPI struct ID this packet came from.
pub napi_id: u32, // offset 84
// --- Fields below accessed by BPF_PROG_TYPE_sk_skb types ---
/// Address family of the associated socket (AF_INET, AF_INET6, etc.).
pub family: u32, // offset 88
/// Remote IPv4 address. Network byte order.
pub remote_ip4: u32, // offset 92
/// Local IPv4 address. Network byte order.
pub local_ip4: u32, // offset 96
/// Remote IPv6 address (4 × u32). Network byte order.
pub remote_ip6: [u32; 4], // offset 100 (16 bytes)
/// Local IPv6 address (4 × u32). Network byte order.
pub local_ip6: [u32; 4], // offset 116 (16 bytes)
/// Remote L4 port. Network byte order.
pub remote_port: u32, // offset 132
/// Local L4 port. Host byte order.
pub local_port: u32, // offset 136
// --- End of sk_skb accessible region ---
/// Pointer to start of XDP metadata area (between data_meta and data).
/// If no metadata is set, equals `data`.
pub data_meta: u32, // offset 140
/// Pointer to `struct bpf_flow_keys` (flow dissector context).
/// Only accessible from `BPF_PROG_TYPE_FLOW_DISSECTOR` programs.
/// Uses `__bpf_md_ptr` layout for 32/64-bit compatibility.
pub flow_keys: u64, // offset 144 (__bpf_md_ptr)
/// Timestamp in nanoseconds since boot. On egress with `fq` qdisc,
/// can be set to a future time for bandwidth shaping. Meaning depends
/// on `tstamp_type` (since Linux 5.18).
pub tstamp: u64, // offset 152
/// Wire length — length of the data as it will appear on the wire.
pub wire_len: u32, // offset 160
/// Number of GSO (Generic Segmentation Offload) segments.
pub gso_segs: u32, // offset 164
/// Pointer to `struct bpf_sock` (associated socket info). Read-only.
/// Uses `__bpf_md_ptr` layout for 32/64-bit compatibility.
pub sk: u64, // offset 168 (__bpf_md_ptr)
/// GSO segment size.
pub gso_size: u32, // offset 176
/// Timestamp type: BPF_SKB_TSTAMP_UNSPEC (0) or
/// BPF_SKB_TSTAMP_DELIVERY_MONO (1). Determines `tstamp` semantics.
pub tstamp_type: u8, // offset 180
/// Padding (24 bits). Reserved for future use. Must be zero.
pub _pad: [u8; 3], // offset 181
/// Hardware receive timestamp (nanoseconds). Set by NIC if supported.
pub hwtstamp: u64, // offset 184
}
// Total size: 192 bytes. Matches Linux 6.12 struct __sk_buff.
// Size assertion (verified at compile time):
const _: () = assert!(
core::mem::size_of::<SkBuff>() == 192,
"SkBuff must be exactly 192 bytes to match Linux __sk_buff ABI"
);
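Beyond the total-size assertion, individual field offsets can be pinned at compile time with `core::mem::offset_of!` (stable since Rust 1.77). A self-contained sketch over a miniature stand-in struct — the real checks would name SkBuff's fields with the offsets listed above:

```rust
// Miniature stand-in for the SkBuff prefix, demonstrating compile-time
// offset pinning. Offsets match the layout documented above.
#[repr(C)]
struct SkBuffPrefix {
    len: u32,           // offset 0
    pkt_type: u32,      // offset 4
    mark: u32,          // offset 8
    queue_mapping: u32, // offset 12
    protocol: u32,      // offset 16
    vlan_present: u32,  // offset 20
}

// Each assertion fails the build (not the boot) if the layout drifts.
const _: () = {
    assert!(core::mem::offset_of!(SkBuffPrefix, mark) == 8);
    assert!(core::mem::offset_of!(SkBuffPrefix, protocol) == 16);
    assert!(core::mem::offset_of!(SkBuffPrefix, vlan_present) == 20);
};
```

Pinning every offset, not just the total size, catches field reordering that would otherwise preserve the 192-byte footprint while silently breaking the byte-offset ABI.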
TC action return values (matching Linux include/uapi/linux/pkt_cls.h):
/// TC classifier/action return values. BPF_PROG_TYPE_SCHED_CLS programs
/// in direct-action mode return these to indicate packet fate.
/// Values match Linux TC_ACT_* constants.
#[repr(i32)]
pub enum TcAction {
/// Continue to next filter/action in the chain.
Unspec = -1,
/// Accept the packet (deliver to stack / transmit).
Ok = 0,
/// Reclassify the packet (restart classification from the beginning).
Reclassify = 1,
/// Drop the packet immediately.
Shot = 2,
/// Pass to the next filter but do not alter the packet.
Pipe = 3,
/// Stolen — packet consumed by the action (freed internally).
Stolen = 4,
/// Queue for userspace processing (via netlink).
Queued = 5,
/// Repeat the action processing.
Repeat = 6,
/// Redirect the packet to another interface (used with bpf_redirect).
Redirect = 7,
}
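A minimal sketch of how the TC layer might act on these return values after running a direct-action classifier; `PacketFate` and `apply_tc_verdict` are illustrative names, not the actual UmkaOS qdisc code:

```rust
/// TC return values as defined above (subset used in this sketch).
#[repr(i32)]
enum TcAction { Ok = 0, Shot = 2, Pipe = 3, Redirect = 7 }

#[derive(Debug, PartialEq)]
enum PacketFate { Deliver, Drop, Redirected, NextFilter }

/// Illustrative mapping from a classifier's raw i32 return value to the
/// packet's fate. Unknown / chain-continuation codes fall through.
fn apply_tc_verdict(v: i32) -> PacketFate {
    match v {
        x if x == TcAction::Ok as i32 => PacketFate::Deliver,
        x if x == TcAction::Shot as i32 => PacketFate::Drop,
        x if x == TcAction::Redirect as i32 => PacketFate::Redirected,
        _ => PacketFate::NextFilter, // Unspec/Pipe/etc.: continue the chain
    }
}
```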
NetBuf → SkBuff conversion: When a TC BPF program is attached to an
interface, the TC classifier constructs an SkBuff from the NetBuf before
invoking the BPF program. This is the TC equivalent of the
netbuf_to_xdp_context() conversion for XDP (Section 16.5).
/// Construct an SkBuff context from a NetBuf for TC BPF program execution.
///
/// The SkBuff fields are populated from the NetBuf's metadata and the
/// associated socket (if any). Fields not available from the NetBuf
/// (e.g., socket address fields) are set to zero.
///
/// # Safety
///
/// - `nb` must be a valid NetBuf with initialized metadata fields.
/// - The returned SkBuff is valid only for the duration of the BPF program
/// execution — it must not outlive the NetBuf it references.
fn netbuf_to_sk_buff(nb: &NetBuf, ifindex: u32, ingress_ifindex: u32) -> SkBuff {
let sk_info = nb.socket_info(); // Option<&SocketInfo>
SkBuff {
len: nb.len,
pkt_type: nb.pkt_type as u32,
mark: nb.mark,
queue_mapping: nb.queue_mapping as u32,
protocol: nb.protocol,
vlan_present: nb.vlan_present as u32,
vlan_tci: nb.vlan_tci as u32,
vlan_proto: nb.vlan_proto as u32,
priority: nb.priority,
ingress_ifindex,
ifindex,
tc_index: nb.tc_index as u32,
cb: nb.cb,
hash: nb.hash,
tc_classid: 0, // set by the BPF program's return path
data: nb.data_offset,
data_end: nb.data_offset + nb.linear_len(),
napi_id: nb.napi_id,
family: sk_info.map_or(0, |s| s.family as u32),
remote_ip4: sk_info.map_or(0, |s| s.remote_ip4),
local_ip4: sk_info.map_or(0, |s| s.local_ip4),
remote_ip6: sk_info.map_or([0; 4], |s| s.remote_ip6),
local_ip6: sk_info.map_or([0; 4], |s| s.local_ip6),
remote_port: sk_info.map_or(0, |s| s.remote_port),
local_port: sk_info.map_or(0, |s| s.local_port),
data_meta: nb.data_meta_offset,
flow_keys: 0, // populated only for FLOW_DISSECTOR programs
tstamp: nb.tstamp_ns,
wire_len: nb.wire_len,
gso_segs: nb.gso_segs as u32,
sk: 0, // populated by verifier rewrite (bpf_sock pointer)
gso_size: nb.gso_size as u32,
tstamp_type: nb.tstamp_type,
_pad: [0; 3],
hwtstamp: nb.hwtstamp_ns,
}
}
After the BPF program returns, the TC classifier reads back mutable fields
(mark, priority, tc_index, tc_classid, queue_mapping, tstamp,
tstamp_type) from the SkBuff and applies them to the NetBuf:
// Post-TC BPF: sync mutable fields back to NetBuf.
fn sk_buff_writeback(nb: &mut NetBuf, ctx: &SkBuff) {
nb.mark = ctx.mark;
nb.priority = ctx.priority;
nb.tc_index = ctx.tc_index as u16;
nb.queue_mapping = ctx.queue_mapping as u16;
nb.tstamp_ns = ctx.tstamp;
nb.tstamp_type = ctx.tstamp_type;
// tc_classid is consumed by the qdisc layer, not written back to NetBuf.
}
Verifier access control: The verifier's is_valid_access() callback for
BPF_PROG_TYPE_SCHED_CLS permits:
- Read: all fields.
- Write: mark, priority, tc_index, cb[0..5], tc_classid,
queue_mapping, tstamp, tstamp_type.
- Forbidden writes: len, pkt_type, protocol, data, data_end,
hash, napi_id, ifindex, ingress_ifindex, all socket address fields,
flow_keys, sk, hwtstamp. These are derived from hardware or internal
state and must not be corrupted by BPF programs.
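The per-field write policy above can be sketched as an offset-range check in the style of `is_valid_access()`. Offsets follow the SkBuff layout in 19.2.6; the function name and shape are illustrative (the real callback also sees access size and register type):

```rust
/// Sketch of a SCHED_CLS write-permission check keyed by field offset.
/// Offsets match the SkBuff layout above; names are illustrative.
fn sched_cls_write_allowed(offset: usize) -> bool {
    matches!(offset,
        8            // mark
        | 12         // queue_mapping
        | 32         // priority
        | 44         // tc_index
        | 48..=67    // cb[0..5] (20 bytes)
        | 72         // tc_classid
        | 152..=159  // tstamp (u64)
        | 180        // tstamp_type
    )
}
```

Everything else (len, data/data_end, hash, socket address fields, hwtstamp, ...) is rejected at load time, so a verified program can never corrupt hardware-derived state.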
19.2.7 BPF Kfunc Framework¶
Kfuncs are kernel functions callable from BPF programs, registered at runtime
by kernel subsystems. They complement BPF helpers (which use fixed numeric IDs
in the BpfFuncId enum) with a more flexible, type-safe mechanism based on BTF.
Key differences from BPF helpers:
| Property | BPF Helpers | Kfuncs |
|---|---|---|
| Registration | Compile-time BpfFuncId enum slot | Runtime bpf_register_kfunc_set() |
| Type safety | Manual argument checking in verifier | BTF-based automatic type matching |
| Namespace | Global (all program types see all IDs) | Module-scoped (only visible to programs in the registering module's scope) |
| ABI stability | Stable (numeric IDs are part of bytecode ABI) | Unstable (kfunc signatures may change between kernel versions) |
| Dispatch | Fixed function pointer table indexed by ID | BTF-resolved indirect call (verified at load time) |
Registration:
/// A set of kfuncs registered by a kernel subsystem.
/// Each entry maps a function name (matched via BTF) to a function pointer
/// and a set of flags controlling verification behavior.
// kernel-internal, not KABI — BpfKfuncSet and BpfKfuncDesc contain Rust fat
// pointers (&'static str, &'static [T]) and raw pointers with platform-dependent
// sizes. Never crosses a KABI or compilation boundary. All users are compiled
// together in the kernel. No #[repr(C)] needed — Rust-default layout is sufficient.
pub struct BpfKfuncSet {
/// Module that owns this kfunc set. Kfuncs are unregistered when the
/// module is unloaded. For built-in subsystems: `None` (permanent).
pub owner: Option<&'static KabiModule>,
/// Array of kfunc descriptors. Terminated by a zero-initialized entry.
pub funcs: &'static [BpfKfuncDesc],
}
/// Descriptor for a single kfunc.
// kernel-internal, not KABI (see BpfKfuncSet comment above).
pub struct BpfKfuncDesc {
/// Function name (must match a BTF function definition in the kernel's
/// BTF data). The verifier resolves calls by name, not by numeric ID.
pub name: &'static str,
/// Function pointer. The verifier validates that the BPF program's
/// call-site argument types match the BTF-declared parameter types.
pub func: *const (),
/// Flags controlling verification rules for this kfunc.
pub flags: BpfKfuncFlags,
}
bitflags::bitflags! {
/// Flags for kfunc verification behavior.
pub struct BpfKfuncFlags: u32 {
/// Kfunc may sleep (only callable from sleepable BPF programs).
const SLEEPABLE = 1 << 0;
/// Kfunc acquires a reference that the BPF program must release.
/// The verifier tracks the reference and rejects programs that
/// leak it (same as PTR_TO_BTF_ID_OR_NULL tracking for helpers).
const ACQUIRE_REF = 1 << 1;
/// Kfunc releases a reference previously acquired by an ACQUIRE_REF kfunc.
const RELEASE_REF = 1 << 2;
/// Kfunc returns a pointer that may be NULL. The verifier forces
/// a NULL check before the return value is dereferenced.
const RET_NULL = 1 << 3;
/// Kfunc is destructive (modifies kernel state). Requires CAP_BPF.
const DESTRUCTIVE = 1 << 4;
}
}
/// Register a kfunc set with the BPF subsystem.
/// After registration, BPF programs loaded in the registering module's
/// scope can call these kfuncs. The verifier resolves kfunc calls at
/// program load time via BTF name matching.
///
/// # Errors
///
/// - `EEXIST`: a kfunc with the same name is already registered.
/// - `EINVAL`: a kfunc descriptor has invalid flags or a NULL function pointer.
pub fn bpf_register_kfunc_set(set: &'static BpfKfuncSet) -> Result<(), Errno>;
/// Unregister a kfunc set. Called automatically when the owning module
/// is unloaded. Any BPF programs currently using kfuncs from this set
/// are marked as faulted and detached (they cannot execute after the
/// backing function is gone).
pub fn bpf_unregister_kfunc_set(set: &'static BpfKfuncSet);
Verifier integration:
- At BPF program load time, the verifier encounters a BPF_CALL instruction whose target is a kfunc (identified by BTF function name, not numeric ID).
- The verifier looks up the kfunc name in the registered kfunc sets. If not found: reject with ENOTSUPP.
- The verifier compares the BPF program's call-site argument types (tracked as RegType values in the verifier state) against the kfunc's BTF-declared parameter types. Mismatches are rejected with EINVAL.
- If the kfunc has ACQUIRE_REF: the verifier marks the return register as holding a reference that must be released before program exit.
- If the kfunc has SLEEPABLE: the verifier rejects the call if the BPF program is not in a sleepable context (BPF_F_SLEEPABLE flag).
- If the kfunc has DESTRUCTIVE: the verifier checks CAP_BPF on the loading credential; reject with EPERM if absent.
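The flag checks above can be sketched as a small standalone model. This is a hypothetical simplification: `LoadCtx` and the errno-name return values are illustrative stand-ins for verifier state and the loading credential, and the `SLEEPABLE` bit value is assumed (its definition precedes this excerpt).

```rust
// Hypothetical, simplified model of the verifier-side kfunc flag checks.
// Flag values mirror the BpfKfuncFlags bits from the text; SLEEPABLE's
// bit position is an assumption.
const SLEEPABLE: u32 = 1 << 0; // assumed value
const ACQUIRE_REF: u32 = 1 << 1;
const DESTRUCTIVE: u32 = 1 << 4;

struct LoadCtx {
    sleepable_prog: bool, // BPF_F_SLEEPABLE set on the loading program
    has_cap_bpf: bool,    // CAP_BPF present on the loading credential
}

/// Returns Ok(acquires_ref) if the call is admissible, Err(errno name) otherwise.
fn check_kfunc_call(flags: u32, ctx: &LoadCtx) -> Result<bool, &'static str> {
    if flags & SLEEPABLE != 0 && !ctx.sleepable_prog {
        return Err("EINVAL"); // sleepable kfunc called from non-sleepable program
    }
    if flags & DESTRUCTIVE != 0 && !ctx.has_cap_bpf {
        return Err("EPERM"); // destructive kfunc without CAP_BPF
    }
    // If ACQUIRE_REF is set, the caller must track the returned reference
    // and prove it is released before program exit.
    Ok(flags & ACQUIRE_REF != 0)
}
```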
Conntrack kfuncs (bpf_ct_lookup, bpf_ct_insert, bpf_ct_set_nat)
are registered by the conntrack subsystem via bpf_register_kfunc_set().
See Section 16.18 for details.
19.2.8 KABI Tracepoint Ring for Tier 1 Tracing¶
Tier 1 subsystems (umka-vfs, umka-net, umka-block, filesystem drivers) run in their own isolation domains. Standard kprobes and tracepoints within Tier 1 code fire while the CPU is in the Tier 1 domain, but BPF programs cannot directly read Tier 1 memory (Section 11.3).
Solution: Tier 1 tracepoints emit data through a KABI tracepoint ring —
a per-CPU SpscRing<TracepointRecord, 256> (Section 3.6)
shared between the Tier 1 domain (producer) and Nucleus (consumer).
/// Tracepoint record written by Tier 1 code into the KABI tracepoint ring.
/// Fixed-size for lock-free SPSC. Variable-length data (filenames, paths)
/// is truncated to fit within the inline buffer.
#[repr(C)]
pub struct TracepointRecord {
/// Tracepoint ID (matches static tracepoint ABI, [Section 20.2](20-observability.md#stable-tracepoint-abi)).
pub tp_id: u32,
/// Explicit padding for u64 alignment of timestamp_ns.
pub _pad: u32,
/// Timestamp (monotonic nanoseconds).
pub timestamp_ns: u64,
/// Number of valid bytes in `data`.
pub data_len: u16,
/// Inline payload (tracepoint arguments serialized by the tracepoint macro).
pub data: [u8; 238],
}
// TracepointRecord must be exactly 256 bytes for cache-line-aligned SPSC ring entries.
// Fields: 4 (tp_id) + 4 (_pad) + 8 (timestamp_ns) + 2 (data_len) + 238 (data) = 256.
const_assert!(core::mem::size_of::<TracepointRecord>() == 256);
Flow:
1. Tier 1 tracepoint fires → serializes arguments into TracepointRecord →
pushes to per-CPU SpscRing (no domain switch, ~15 ns).
2. Nucleus BPF tracepoint consumer drains the ring on context switch or timer
tick → copies deserialized arguments into the BPF program's isolation domain
and invokes the attached BPF programs.
3. BPF bpf_probe_read_kernel() works normally on the copied data because
it has been marshalled into the BPF domain's accessible memory.
If the ring is full (producer outpaces consumer): the tracepoint record is
dropped and a per-CPU tp_drops: AtomicU64 counter is incremented. BPF
programs can read the drop count via bpf_per_cpu_counter("tp_drops").
Tracepoint firing is never blocked — observability must not affect Tier 1
latency.
VFS-specific tracepoints: umka-vfs emits tracepoints for vfs_read,
vfs_write, vfs_open, vfs_unlink, vfs_rename, vfs_fsync, matching
Linux's trace_* events. BPF programs using SEC("tp/vfs/vfs_read") attach
to the Nucleus consumer side and receive the same arguments as Linux tracepoints.
The indirection through the KABI ring adds ~50-100 ns latency to tracepoint
delivery (ring push + consumer drain), which is acceptable for observability
(tracepoints are not on the I/O critical path).
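The producer side of this flow can be illustrated with a userspace model of `TracepointRecord` serialization, including the truncation rule for variable-length data. The layout matches the struct and const_assert above; the `serialize` helper is a hypothetical stand-in for what the tracepoint macro would emit.

```rust
// Userspace model of TracepointRecord construction on the Tier 1 producer
// side. Variable-length payloads are truncated to the 238-byte inline
// buffer rather than rejected, so tracepoint firing never blocks.
#[repr(C)]
pub struct TracepointRecord {
    pub tp_id: u32,
    pub _pad: u32,          // explicit padding for u64 alignment
    pub timestamp_ns: u64,
    pub data_len: u16,
    pub data: [u8; 238],
}

/// Serialize a payload into a fixed-size record (hypothetical helper).
fn serialize(tp_id: u32, timestamp_ns: u64, payload: &[u8]) -> TracepointRecord {
    let mut rec = TracepointRecord {
        tp_id,
        _pad: 0,
        timestamp_ns,
        data_len: 0,
        data: [0u8; 238],
    };
    let n = payload.len().min(238); // truncate, never block or fail
    rec.data[..n].copy_from_slice(&payload[..n]);
    rec.data_len = n as u16;
    rec
}
```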
19.3 io_uring Subsystem¶
Full io_uring support with a security enhancement:
Implementation basis: UmkaOS's io_uring compatibility layer is implemented directly on top of UmkaOS's internal RingBuffer<T> infrastructure from the driver SDK (Section 12.1). The Submission Queue (SQ) and Completion Queue (CQ) rings are RingBuffer<SqEntry> and RingBuffer<CqEntry> instances with their memory laid out to match Linux's io_uring mmap layout exactly — so applications using the mmap-based interface (io_uring_setup → mmap SQ/CQ rings → submit via SQ) work unmodified. No separate ring implementation exists; io_uring is a specialization of the same ring infrastructure used throughout UmkaOS at every tier boundary.
- Same SQE/CQE ring buffer ABI (binary compatible)
- Same opcodes: all 65 opcodes through Linux 6.15 (see per-opcode version notes in the table below)
- SQPOLL mode (kernel-side submission polling)
- Registered buffers and registered files (pre-pinned for zero-copy)
- Fixed files for reduced file descriptor overhead
Supported io_uring Opcodes (complete enumeration; per-opcode Linux version noted in the table):
| # | Opcode | Notes |
|---|--------|-------|
| 0 | IORING_OP_NOP | No-op; tests ring infrastructure |
| 1 | IORING_OP_READV | Vectored read (preadv2 equivalent) |
| 2 | IORING_OP_WRITEV | Vectored write (pwritev2 equivalent) |
| 3 | IORING_OP_FSYNC | fsync(2) |
| 4 | IORING_OP_READ_FIXED | Read to pre-registered buffer |
| 5 | IORING_OP_WRITE_FIXED | Write from pre-registered buffer |
| 6 | IORING_OP_POLL_ADD | Poll fd for I/O readiness |
| 7 | IORING_OP_POLL_REMOVE | Cancel/update poll |
| 8 | IORING_OP_SYNC_FILE_RANGE | sync_file_range(2) |
| 9 | IORING_OP_SENDMSG | sendmsg(2) |
| 10 | IORING_OP_RECVMSG | recvmsg(2) |
| 11 | IORING_OP_TIMEOUT | Timer/timeout |
| 12 | IORING_OP_TIMEOUT_REMOVE | Cancel/update timeout |
| 13 | IORING_OP_ACCEPT | accept4(2) |
| 14 | IORING_OP_ASYNC_CANCEL | Cancel in-flight request by user_data |
| 15 | IORING_OP_LINK_TIMEOUT | Timeout for linked SQE chain |
| 16 | IORING_OP_CONNECT | connect(2) |
| 17 | IORING_OP_FALLOCATE | fallocate(2) |
| 18 | IORING_OP_OPENAT | openat(2) |
| 19 | IORING_OP_CLOSE | close(2) |
| 20 | IORING_OP_FILES_UPDATE | Batch-update registered file table |
| 21 | IORING_OP_STATX | statx(2) |
| 22 | IORING_OP_READ | pread(2) equivalent (non-vectored) |
| 23 | IORING_OP_WRITE | pwrite(2) equivalent (non-vectored) |
| 24 | IORING_OP_FADVISE | posix_fadvise(2) |
| 25 | IORING_OP_MADVISE | madvise(2) |
| 26 | IORING_OP_SEND | send(2) |
| 27 | IORING_OP_RECV | recv(2) |
| 28 | IORING_OP_OPENAT2 | openat2(2) |
| 29 | IORING_OP_EPOLL_CTL | epoll_ctl(2) |
| 30 | IORING_OP_SPLICE | splice(2) |
| 31 | IORING_OP_PROVIDE_BUFFERS | Register buffer group for recv |
| 32 | IORING_OP_REMOVE_BUFFERS | Unregister buffer group |
| 33 | IORING_OP_TEE | tee(2) — duplicate pipe data |
| 34 | IORING_OP_SHUTDOWN | shutdown(2) |
| 35 | IORING_OP_RENAMEAT | renameat(2) |
| 36 | IORING_OP_UNLINKAT | unlinkat(2) |
| 37 | IORING_OP_MKDIRAT | mkdirat(2) |
| 38 | IORING_OP_SYMLINKAT | symlinkat(2) |
| 39 | IORING_OP_LINKAT | linkat(2) |
| 40 | IORING_OP_MSG_RING | Send message to another io_uring ring |
| 41 | IORING_OP_FSETXATTR | fsetxattr(2) |
| 42 | IORING_OP_SETXATTR | setxattr(2) |
| 43 | IORING_OP_FGETXATTR | fgetxattr(2) |
| 44 | IORING_OP_GETXATTR | getxattr(2) |
| 45 | IORING_OP_SOCKET | socket(2) |
| 46 | IORING_OP_URING_CMD | Per-file/driver command (NVMe passthrough, etc.) |
| 47 | IORING_OP_SEND_ZC | Zero-copy send (Linux 6.0+) |
| 48 | IORING_OP_SENDMSG_ZC | Zero-copy sendmsg (Linux 6.0+) |
| 49 | IORING_OP_READ_MULTISHOT | Multi-completion buffered read |
| 50 | IORING_OP_WAITID | waitid(2) |
| 51 | IORING_OP_FUTEX_WAIT | Futex wait (Linux 6.7+) |
| 52 | IORING_OP_FUTEX_WAKE | Futex wake (Linux 6.7+) |
| 53 | IORING_OP_FUTEX_WAITV | Wait on multiple futexes (Linux 6.7+) |
| 54 | IORING_OP_FIXED_FD_INSTALL | Install registered fd into file table (Linux 6.7+) |
| 55 | IORING_OP_FTRUNCATE | ftruncate(2) |
| 56 | IORING_OP_BIND | bind(2) (Linux 6.11+) |
| 57 | IORING_OP_LISTEN | listen(2) (Linux 6.11+) |
| 58 | IORING_OP_RECV_ZC | Zero-copy recv (Linux 6.15+) |
| 59 | IORING_OP_EPOLL_WAIT | epoll_wait via io_uring (Linux 6.12+) |
| 60 | IORING_OP_READV_FIXED | Vectored read to pre-registered buffer (Linux 6.13+) |
| 61 | IORING_OP_WRITEV_FIXED | Vectored write from pre-registered buffer (Linux 6.13+) |
| 62 | IORING_OP_PIPE | pipe2(2) (Linux 6.13+) |
| 63 | IORING_OP_NOP128 | 128-byte NOP (Linux 6.13+) |
| 64 | IORING_OP_URING_CMD128 | 128-byte uring_cmd (Linux 6.13+) |
UmkaOS implementation note: All 65 opcodes listed above (0-64) are implemented natively on UmkaOS's internal RingBuffer<T> infrastructure — no supported opcode silently fails. Opcodes may be disabled per-process via io_uring_register(IORING_REGISTER_RESTRICTIONS). Opcodes introduced by future Linux kernel versions that UmkaOS has not yet implemented return ENOSYS, and supported opcodes are discoverable via FEAT_OPCODE_LIST.
SqEntry — Submission Queue Entry (binary-compatible with Linux io_uring_sqe, 64 bytes):
/// 64-byte Submission Queue Entry. Matches Linux io_uring ABI exactly.
/// Userspace writes one SqEntry per I/O operation into the SQ ring.
#[repr(C)]
pub struct SqEntry {
pub opcode: u8, // IORING_OP_* (see opcode table above)
pub flags: u8, // IOSQE_FIXED_FILE, IOSQE_IO_DRAIN, IOSQE_IO_LINK, etc.
pub ioprio: u16, // I/O priority (IOPRIO_CLASS_* | prio_level)
pub fd: i32, // File descriptor (or fixed-file index if IOSQE_FIXED_FILE)
pub off: u64, // File offset (or addr2 for some opcodes)
pub addr: u64, // Buffer address (or pointer to iovec array for vectored ops)
pub len: u32, // Buffer length (or iovec count for vectored ops)
pub op_flags: u32, // Union: rw_flags (RWF_*), fsync_flags (IORING_FSYNC_DATASYNC),
// poll_events, sync_range_flags, msg_flags, timeout_flags,
// accept_flags, cancel_flags, open_flags, statx_flags,
// fadvise_advice, splice_flags, rename_flags, unlink_flags,
// hardlink_flags, xattr_flags, uring_cmd_flags
pub user_data: u64, // Opaque tag — copied verbatim into CqEntry.user_data
pub buf_index: u16, // Union: buf_index (pre-registered buffer index) /
// buf_group (for IORING_OP_PROVIDE_BUFFERS)
pub personality: u16, // Credentials personality (io_uring_register creds)
pub splice_fd_in: i32, // Union: splice source fd / file_index / addr_len
pub addr3: u64, // Extended address field (opcode-dependent)
pub _resv: u64, // Reserved — must be zero
}
/// 1+1+2+4+8+8+4+4+8+2+2+4+8+8 = 64 bytes
const_assert!(size_of::<SqEntry>() == 64);
VFS completion → CqEntry conversion:
When a VFS operation completes, the result is written to the io_uring
completion queue as a CqEntry (binary-compatible with Linux io_uring_cqe):
#[repr(C)]
pub struct CqEntry {
pub user_data: u64, // Copied from SqEntry.user_data (opaque to kernel)
pub res: i32, // Bytes transferred (success) or -errno (error)
pub flags: u32, // IORING_CQE_F_BUFFER for buffer selection, etc.
}
/// 8(user_data) + 4(res) + 4(flags) = 16 bytes
const_assert!(size_of::<CqEntry>() == 16);
/// 32-byte Completion Queue Entry. Used when io_uring_setup() is called
/// with IORING_SETUP_CQE32 flag (bit 11). The CQ ring allocates 32-byte
/// slots instead of 16-byte slots. Required for URING_CMD128 extended results.
#[repr(C)]
pub struct CqEntry32 {
pub user_data: u64,
pub res: i32,
pub flags: u32,
/// Extended completion data (16 bytes). Usage is opcode-dependent.
pub big_cqe: [u64; 2],
}
const_assert!(size_of::<CqEntry32>() == 32);
Each completion is posted as
CqEntry { user_data: sqe.user_data, res: vfs_result_or_neg_errno, flags: 0 }.
For IORING_OP_READ returning 4096 bytes: res = 4096.
For a failed IORING_OP_FSYNC: res = -EIO.
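The result mapping above can be sketched as a small standalone function. The `complete` helper is hypothetical; the `CqEntry` layout matches the struct above, and EIO's value (5) is the standard errno number.

```rust
// Sketch of the VFS-result to CqEntry mapping: success reports bytes
// transferred, failure reports the negated errno.
#[repr(C)]
#[derive(Debug, PartialEq)]
pub struct CqEntry {
    pub user_data: u64,
    pub res: i32,
    pub flags: u32,
}

const EIO: i32 = 5; // standard errno value

/// Hypothetical completion helper: map a VFS result into a CQE.
fn complete(user_data: u64, vfs_result: Result<usize, i32>) -> CqEntry {
    let res = match vfs_result {
        Ok(bytes) => bytes as i32, // bytes transferred
        Err(errno) => -errno,      // negated errno, e.g. -EIO
    };
    CqEntry { user_data, res, flags: 0 }
}
```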
19.3.1 io_uring VFS Integration¶
io_uring file operations cross a tier boundary: the io_uring subsystem runs in Tier 0 (kernel core), while VFS runs in Tier 1 (hardware-isolated via MPK/POE/DACR). This subsection specifies the async completion routing, correlation tracking, crash recovery, and ordering guarantees for this domain crossing.
SQE correlation token — embedded in every in-flight VFS async operation to track the originating SQE across the tier boundary:
/// Correlation token for an in-flight io_uring operation dispatched to VFS.
/// Embedded in every `VfsAsyncOp` submitted to the KABI completion ring.
/// The VFS Tier 1 component treats this as opaque — it copies the token
/// into the completion message without interpretation.
/// Size: exactly 16 bytes (tag: 8 + submit_ns: 8). The submitting CPU is
/// available from CpuLocal at io_uring_enter() time and does not need
/// per-SQE storage.
#[repr(C)]
pub struct SqeCorrelation {
/// Userspace-supplied tag (copied from `SqEntry.user_data`).
/// Returned verbatim in `CqEntry.user_data` on completion.
pub tag: u64,
/// Timestamp (nanoseconds since boot) when the SQE was submitted
/// to the VFS ring. Used by the kernel for latency accounting and
/// stall detection (operations exceeding `IO_STALL_THRESHOLD_NS`
/// are logged at WARN level).
pub submit_ns: u64,
}
const_assert!(core::mem::size_of::<SqeCorrelation>() == 16);
/// Async VFS operation submitted from io_uring (Tier 0) to VFS (Tier 1)
/// via the KABI request ring. Fixed 128 bytes (one ring slot).
#[repr(C, align(64))]
pub struct VfsAsyncOp {
/// Correlation token for completion routing back to io_uring.
pub correlation: SqeCorrelation, // 16 bytes
/// Opcode identifying the VFS operation (maps from IORING_OP_*).
pub opcode: VfsAsyncOpcode, // 4 bytes
/// File descriptor (resolved to kernel `FileRef` by Tier 0 before
/// submission; the fd index is passed for Tier 1 to locate the
/// pre-resolved `FileRef` in the shared file table).
pub fd: i32, // 4 bytes
/// Byte offset within the file (for read/write/fsync range).
pub offset: u64, // 8 bytes
/// Length in bytes (for read/write) or flags (for fsync/openat).
pub len_or_flags: u64, // 8 bytes
/// User buffer virtual address (for read/write). For registered
/// buffers (IORING_OP_READ_FIXED), this is the pre-registered
/// buffer index instead.
pub buf_addr: u64, // 8 bytes
/// Pathname for path-based ops (OPENAT, RENAMEAT, etc.).
/// Pointer to a kernel-space copy (copied from userspace by Tier 0).
pub path_ptr: u64, // 8 bytes
/// Path length in bytes (0 for non-path ops).
pub path_len: u32, // 4 bytes
/// I/O priority (from SQE ioprio field).
pub ioprio: u16, // 2 bytes
/// Padding.
pub _pad: [u8; 2], // 2 bytes
/// Additional opcode-specific parameters (e.g., open flags, rename flags).
pub params: [u8; 64], // 64 bytes
// Total: 128 bytes
}
const_assert!(core::mem::size_of::<VfsAsyncOp>() == 128);
/// VFS operation completion message written from VFS (Tier 1) to the
/// KABI completion ring for io_uring (Tier 0) to consume. Fixed 32 bytes.
#[repr(C)]
pub struct VfsCompletionMsg {
/// Correlation token — copied verbatim from the originating VfsAsyncOp.
pub correlation: SqeCorrelation, // 16 bytes
/// Result value: positive = bytes transferred, negative = -errno.
pub result: i32, // 4 bytes
/// Completion flags (e.g., IORING_CQE_F_BUFFER for buffer selection).
pub flags: u32, // 4 bytes
/// Extended result for multi-shot ops (e.g., buffer ID).
pub extra: u64, // 8 bytes
// Total: 32 bytes
}
const_assert!(core::mem::size_of::<VfsCompletionMsg>() == 32);
/// VFS async operation opcodes (internal mapping from IORING_OP_*).
#[repr(u32)]
pub enum VfsAsyncOpcode {
Read = 0,
Write = 1,
Fsync = 2,
Openat = 3,
Close = 4,
Statx = 5,
Renameat = 6,
Unlinkat = 7,
Mkdirat = 8,
Symlinkat = 9,
Linkat = 10,
Fadvise = 11,
Fallocate = 12,
Splice = 13,
Fsetxattr = 14,
Fgetxattr = 15,
Setxattr = 16,
Getxattr = 17,
Ftruncate = 18,
}
Domain crossing protocol:
- Submission (Tier 0 -> Tier 1): io_uring_enter() dequeues SQEs from the userspace SQ ring, constructs a VfsAsyncOp for each file-related opcode (read, write, fsync, openat, statx, etc.), attaches an SqeCorrelation token, and pushes the operation into the KABI request ring shared with the VFS Tier 1 domain. Non-file opcodes (poll, timeout, cancel, msg_ring) are handled entirely within Tier 0 and never cross into VFS.
- Completion (Tier 1 -> Tier 0): When VFS completes an operation, it writes a VfsCompletionMsg { correlation: SqeCorrelation, result: i32, flags: u32 } to the KABI completion ring. The kernel (Tier 0) reads from this ring, looks up the originating IoRingCtx via correlation.tag (which encodes the ring fd internally), and writes a CqEntry { user_data: correlation.tag, res: result, flags: flags } to the io_uring CQ ring mapped to userspace.
- Batching: The Tier 0 completion reader batches CQE writes: it drains up to CQ_BATCH_SIZE (32) completions from the KABI ring per iteration, writes them to the CQ ring, then issues a single eventfd_signal() (if an eventfd is registered) or wakes the io_uring_enter(IORING_ENTER_GETEVENTS) waiter.
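The batching step can be modeled in isolation: drain up to CQ_BATCH_SIZE completions per iteration, then signal once per batch rather than once per CQE. This is a toy model (the KABI ring is stood in by a VecDeque, and `signals` counts hypothetical eventfd_signal() calls); only the batching logic is intended to match the text.

```rust
// Toy model of the Tier 0 completion reader's batching loop.
use std::collections::VecDeque;

const CQ_BATCH_SIZE: usize = 32;

struct Msg {
    user_data: u64,
    result: i32,
}

/// Drain up to one batch of completions; returns how many were drained.
fn drain_batch(
    ring: &mut VecDeque<Msg>,
    cq: &mut Vec<(u64, i32)>,
    signals: &mut u32,
) -> usize {
    let mut drained = 0;
    while drained < CQ_BATCH_SIZE {
        match ring.pop_front() {
            Some(m) => {
                cq.push((m.user_data, m.result)); // write CQE to the CQ ring
                drained += 1;
            }
            None => break,
        }
    }
    if drained > 0 {
        *signals += 1; // one eventfd_signal() per batch, not per CQE
    }
    drained
}
```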
Block I/O completion to CQE posting: For block I/O operations
(IORING_OP_READ, IORING_OP_WRITE, IORING_OP_READ_FIXED,
IORING_OP_WRITE_FIXED), the bio completion callback runs in softirq/IRQ
context. It enqueues an IoCompletionWork item to the io_uring instance's
io_wq (the per-ring I/O worker pool). An io_wq worker thread (running in
process context) picks up the item and calls io_complete(), which: (1) writes
the CqEntry to the CQ ring (with user_data, res, and flags), (2) signals
the eventfd (if registered via IORING_REGISTER_EVENTFD), and (3) wakes any
io_uring_enter(IORING_ENTER_GETEVENTS) waiter. The softirq-to-process-context
handoff via io_wq avoids holding softirq context across CQE ring writes and
eventfd signaling, which could otherwise cause latency spikes for other softirq
handlers sharing the same CPU.
Bio→CQE correlation: When io_uring submits a block I/O SQE, the submission
path sets bio.end_io = io_uring_bio_end_io and stores the originating SQE's
IoRingInflight index and the io_uring ring reference in bio.private
(cast from a pointer to the correlation struct):
/// Correlation data stored in bio.private (as `usize`) during io_uring
/// submission. Enables the bio's `end_io` callback to locate the originating
/// io_uring ring and SQE for CQE posting.
pub struct BioIoUringPrivate {
/// Weak reference to the IoRingCtx that submitted this bio.
/// Weak because the ring may be destroyed while I/O is in flight
/// (the bio outlives the ring if the user closes the ring fd).
pub ring: Weak<IoRingCtx>,
/// Index into the ring's inflight XArray. Used to look up the
/// IoRingInflight entry (which holds user_data for the CQE).
pub inflight_idx: u32,
}
The correlation flow:
- Submission: IoRingOps::submit() processes the SQE, inserts an IoRingInflight entry into ctx.inflight (keyed by a monotonic index), sets bio.end_io = io_uring_bio_end_io, and stores the correlation data in bio.private (cast from *const BioIoUringPrivate { ring, inflight_idx }).
- Completion: The bio's end_io callback (running in softirq context) reads bio.private as a *const BioIoUringPrivate and upgrades the Weak<IoRingCtx> to Arc. If the ring is still alive, it enqueues an IoCompletionWork { inflight_idx, result: status } (status is the i32 parameter passed to the end_io callback by bio_complete()).
- CQE posting: The io_wq worker looks up ctx.inflight[inflight_idx], reads user_data, constructs CqEntry { user_data, res: result, flags: 0 }, writes it to the CQ ring, and removes the inflight entry.
- Ring-dead case: If Weak::upgrade() returns None (the ring was destroyed), the bio completion is silently discarded — no CQE is posted because no consumer exists. The bio's pages are still properly freed.
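The ring-dead check can be demonstrated with std's own Weak/Arc, which is the same mechanism the text describes. `RingCtx` and `bio_end_io` here are simplified stand-ins for IoRingCtx and the real end_io callback.

```rust
// Model of the end_io ring-dead check: the bio holds a Weak reference to
// the ring context; if the ring was destroyed while I/O was in flight,
// the completion is discarded rather than posted.
use std::sync::{Arc, Mutex, Weak};

struct RingCtx {
    completions: Mutex<Vec<(u32, i32)>>, // stand-in for the io_wq handoff
}

struct BioPrivate {
    ring: Weak<RingCtx>,
    inflight_idx: u32,
}

/// Returns true if a completion was enqueued, false if the ring was dead.
fn bio_end_io(private: &BioPrivate, status: i32) -> bool {
    match private.ring.upgrade() {
        Some(ctx) => {
            ctx.completions
                .lock()
                .unwrap()
                .push((private.inflight_idx, status));
            true
        }
        // Ring destroyed: silently discard. The bio's pages are still
        // freed by the caller; there is simply no consumer for the CQE.
        None => false,
    }
}
```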
Crash recovery (VFS Tier 1 failure):
If the VFS Tier 1 domain crashes (detected via the isolation fault handler in Section 11.9), all in-flight SQEs that were dispatched to VFS are drained with failure completions:
- The kernel enumerates all IoRingCtx instances system-wide (via a global XArray<Weak<IoRingCtx>> registry indexed by ring id).
- For each ring, the kernel scans inflight: XArray<IoRingInflight> for entries whose target domain is the crashed VFS instance.
- Each such entry is completed with CqEntry { user_data: correlation.tag, res: -ECANCELED, flags: 0 }.
- The IoRingCtx itself remains valid — the io_uring instance is not destroyed. Userspace observes -ECANCELED on affected operations and may retry. The VFS Tier 1 component is reloaded by the crash recovery subsystem; subsequent io_uring submissions to the new VFS instance proceed normally.
Linked SQE chain crash handling: When a VFS Tier 1 crash kills an in-flight
bio that is part of a linked SQE chain (IOSQE_IO_LINK), the chain is terminated
and all successor SQEs are cancelled:
- The crash recovery scan (step 2 above) identifies the failed SQE's IoRingInflight entry and follows the link_next chain.
- The failed SQE receives a CQE with res: -EIO and flags: 0 (IORING_CQE_F_MORE cleared — no further completions for this SQE).
- Each successor SQE in the chain receives a CQE with res: -ECANCELED and flags: 0. These SQEs were never dispatched to VFS, so no I/O was performed for them.
- For IOSQE_IO_HARDLINK chains: the behavior is the same as IOSQE_IO_LINK during crash recovery — all successors are cancelled. Hard-link semantics ("execute regardless of prior result") apply only to normal operation errors, not to infrastructure crashes. A VFS crash means the execution environment is gone; dispatching successors to a crashed domain would fail immediately anyway.
- The chain is fully terminated. The application observes -EIO on the crashed operation and -ECANCELED on all dependent operations, which is sufficient to identify the failure scope and retry the entire chain after VFS recovery completes.
This design ensures that a VFS crash never leaves an io_uring ring in an inconsistent state: every submitted SQE always produces exactly one CQE.
Ordering guarantees:
- CQE delivery is unordered with respect to SQE submission order. This matches Linux io_uring semantics: operations complete in whatever order the storage stack and scheduler produce.
- IOSQE_IO_LINK chains are respected: linked SQEs are submitted to VFS sequentially (the next linked SQE is not dispatched until the previous one completes). If a linked SQE fails, subsequent SQEs in the chain are cancelled with -ECANCELED. Link chains are tracked entirely within Tier 0 (the IoRingInflight entry records the chain linkage); VFS sees individual independent operations.
- IOSQE_IO_HARDLINK: like IOSQE_IO_LINK but the linked SQE executes regardless of the prior SQE's result (the chain is not broken on failure). Used for cleanup operations that must run even when a preceding operation fails — e.g., an IORING_OP_CLOSE hard-linked after an IORING_OP_WRITE ensures the fd is closed even if the write returns an error. Internally, IoRingInflight.link_next records the chain; the dispatch logic checks IOSQE_IO_HARDLINK in the SQE flags and skips the -ECANCELED propagation that IOSQE_IO_LINK would apply.
- IOSQE_IO_DRAIN barriers ensure all previously submitted SQEs complete before the drain-flagged SQE is dispatched. Implemented by holding the drain SQE in Tier 0 until inflight.count() == 0 for that ring.
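The drain-barrier rule (hold the drain SQE until the ring's inflight count reaches zero) can be sketched as a toy state machine. The `Ring` type and its methods are hypothetical simplifications; only the barrier condition is taken from the text.

```rust
// Toy model of the IOSQE_IO_DRAIN barrier: a drain-flagged SQE is held
// until inflight == 0, then dispatched. The barrier is re-checked on
// every completion.
struct Ring {
    inflight: usize,
    pending_drain: bool,
    dispatched_drain: bool,
}

impl Ring {
    fn try_dispatch_drain(&mut self) {
        if self.pending_drain && self.inflight == 0 {
            self.pending_drain = false;
            self.dispatched_drain = true; // barrier satisfied: dispatch now
        }
    }

    fn complete_one(&mut self) {
        self.inflight -= 1;
        self.try_dispatch_drain(); // each completion may release the barrier
    }
}
```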
Advanced io_uring features:
- Multishot operations (IORING_POLL_ADD_MULTI, multishot accept, multishot recv): a single SQE generates multiple CQEs, reducing submission overhead for event-driven servers. A multishot operation remains active until explicitly cancelled or an error occurs. Each intermediate CQE has IORING_CQE_F_MORE set in CqEntry.flags to indicate that more CQEs will follow from the same SQE. The final CQE (error or cancellation) clears IORING_CQE_F_MORE. Applications must not reuse the SQE's user_data tag for a new submission until a CQE without IORING_CQE_F_MORE is observed.
- Cancellation (IORING_OP_ASYNC_CANCEL): cancel in-flight operations by user_data tag.
- Linked SQEs (IOSQE_IO_LINK): ordered execution chains. EventFD notification and linked SQE dispatch ordering: when a linked SQE completes, the CQE is posted to the completion ring BEFORE the next linked SQE is dispatched. If eventfd notification is enabled, the eventfd is signaled after CQE posting but before next-SQE dispatch. This means userspace may observe the CQE via eventfd before the successor SQE begins execution -- this is correct and intentional. Userspace must not assume the successor has started when it sees the predecessor's CQE. The successor dispatches asynchronously after CQE posting completes. Sequence: linked_sqe_complete() -> cqe_post() -> eventfd_signal() -> dispatch_next_linked_sqe(). This ordering ensures that eventfd-based event loops can process intermediate results before successor side-effects begin.
- IORING_OP_URING_CMD (passthrough): driver-specific commands via io_uring. NVMe passthrough (nvme_uring_cmd) works through this path. UmkaOS routes uring_cmd to the KABI driver's command handler, maintaining the same struct nvme_uring_cmd ABI. NVMe passthrough commands complete through the NVMe driver's interrupt handler, which posts the CQE directly to the io_uring CQ ring (bypassing the VFS completion path).
- IORING_REGISTER_RING_FD: ring self-reference for reduced fd overhead.
- IORING_OP_SEND_ZC / IORING_OP_RECV_ZC: zero-copy network I/O.
Network socket completion path: Network socket operations (IORING_OP_RECV,
IORING_OP_SEND, IORING_OP_RECVMSG, IORING_OP_SENDMSG) post CQEs via the
socket's wakeup callback. When data arrives (or send buffer space becomes available),
the socket layer invokes the registered wakeup function, which writes the CQE
directly to the CQ ring and signals the eventfd (if registered) or wakes the
io_uring_enter(IORING_ENTER_GETEVENTS) waiter. No VFS tier crossing is needed
for socket completions — the network stack runs in Tier 1 and writes CQEs to the
Tier 0 completion ring via the standard KABI completion path.
Special fd compatibility: IORING_OP_READ on timerfd/signalfd/eventfd
calls the fd's FileOps::read() implementation. timerfd.read() returns the
expiration count (u64). signalfd.read() returns SignalfdSiginfo struct.
eventfd.read() returns the counter value. All three support
IORING_OP_POLL_ADD for event-driven notification.
SQPOLL idle and IO_DRAIN interaction: The SQPOLL thread (IoRingSqpoll) idles
(parks via schedule()) when the SQ ring is empty for longer than idle_timeout_ms.
It is woken by io_uring_enter(IORING_ENTER_SQ_WAKEUP) when userspace submits new
SQEs. IOSQE_IO_DRAIN interacts with SQPOLL normally: the drain barrier SQE is held
in the submission path until all previously submitted in-flight operations complete
(inflight.count() == 0), then the drain SQE is dispatched. The SQPOLL thread does
not idle while a drain barrier is pending — it polls for completions to make progress
on draining.
Namespace binding: io_uring operations are bound to the submitting thread's namespace
context at io_uring_enter() time, NOT at individual SQE submission time. This is
critical for correctness: SQEs are written to the shared ring by userspace without kernel
involvement, so the kernel has no opportunity to capture namespaces per-SQE. The namespace
context is captured once when io_uring_enter() transitions into the kernel, and all
SQEs in the submission batch use that single captured context.
/// Namespace context for io_uring operations. Most fields are captured at
/// `io_uring_enter()` and applied to all SQEs in the submission batch.
/// The mount namespace (`mnt_ns`) is the exception: it is captured once at
/// `io_uring_setup()` time and fixed for the lifetime of the io_uring instance.
/// This prevents a TOCTOU attack where a thread calls `unshare(CLONE_NEWNS)`
/// between setup and enter to escape mount restrictions. Stored in IoRingCtx.
pub struct IoRingNsCtx {
/// Mount namespace for path-based operations (open, rename, unlink, etc.).
/// **Captured at `io_uring_setup()` time**, NOT at enter time. Fixed for the
/// lifetime of the io_uring instance. A process that unshares its mount
/// namespace after setup continues to use the original mount namespace for
/// all io_uring path operations. This matches Linux io_uring behavior
/// (see io_uring_create, `ctx->sq_data->mm` and mount pinning).
pub mnt_ns: Arc<MountNamespace>,
/// Network namespace for socket operations (connect, bind, send, recv).
/// Captured at `io_uring_enter()` time.
pub net_ns: Arc<NetNamespace>,
/// PID namespace for process-related operations (waitid).
/// Captured at `io_uring_enter()` time.
pub pid_ns: Arc<PidNamespace>,
/// User namespace for permission checks (credential validation).
/// Captured at `io_uring_enter()` time.
pub user_ns: Arc<UserNamespace>,
/// Time namespace for IORING_OP_TIMEOUT operations. Captured at
/// `io_uring_setup()` time (same as `mnt_ns`) so that timeout operations
/// in time-namespaced containers use the container's
/// CLOCK_MONOTONIC/CLOCK_BOOTTIME offsets, not the host's.
pub time_ns: Arc<TimeNamespace>,
/// Root directory reference captured at `io_uring_setup()` time.
/// Used for `AT_FDCWD` resolution in async path operations (OPENAT,
/// RENAMEAT, UNLINKAT, etc.) where the submitting task's root may
/// have changed by the time the async worker executes the SQE.
pub root: MountDentry,
/// Working directory reference captured at `io_uring_setup()` time.
/// Used for `AT_FDCWD` resolution in async path operations. Without
/// this capture, a concurrent `chdir()` in the submitting thread
/// could cause the async worker to resolve relative paths against
/// an unexpected directory.
pub pwd: MountDentry,
}
Namespace binding protocol:
- io_uring_setup(): captures mnt_ns and time_ns from current_task().nsproxy into ring.ns_ctx. These are fixed for the io_uring instance's lifetime.
- io_uring_enter(): captures net_ns, pid_ns, and user_ns from current_task().nsproxy into ring.ns_ctx. Each field is an Arc::clone of the corresponding namespace from the task's current nsproxy — this is a reference count increment, not a deep copy.
- All SQEs in the batch use ring.ns_ctx for namespace resolution. Path-based operations (OPENAT, RENAMEAT, UNLINKAT, MKDIRAT, SYMLINKAT, LINKAT, STATX) resolve paths relative to ns_ctx.mnt_ns. Socket operations (CONNECT, ACCEPT, BIND, SEND, RECV, SENDMSG, RECVMSG) use ns_ctx.net_ns.
- If the calling process calls unshare(CLONE_NEWNS) or setns() between io_uring_enter() calls, the next io_uring_enter() captures the updated namespaces. The old ns_ctx is dropped (Arc refcount decrement) and replaced with the new one.
- In-flight operations from previous submissions continue using their captured context. Each async worker thread holds a reference to the IoRingNsCtx that was active when its batch was submitted. The namespace objects are kept alive by the Arc references until all operations in the batch complete.
- This matches Linux io_uring behavior: namespace context is per-submission-batch.
Seccomp interaction: io_uring_setup(), io_uring_enter(), and
io_uring_register() are regular syscalls subject to seccomp-BPF filtering
(Section 10.3). However, individual io_uring opcodes submitted
via SQE are NOT filtered by seccomp — they bypass the syscall entry path
entirely, executing as kernel-internal operations on the SQ polling thread.
This is identical to Linux behavior and is the primary motivation for the
per-instance operation whitelist below.
Security improvement over Linux: Per-instance operation whitelist via capabilities.
In Linux, io_uring bypasses syscall-level security monitoring (seccomp, audit, ptrace).
UmkaOS allows administrators to restrict which io_uring opcodes are available to
each process, addressing this known security gap. The whitelist applies to both standard
opcodes and URING_CMD subtypes — an io_uring instance can be restricted to, e.g.,
read/write only, with NVMe passthrough blocked.
Whitelist configuration interface: Whitelists are configured via
io_uring_register(fd, IORING_REGISTER_RESTRICTIONS, arg, nr) (Linux 5.13+
compatible). The arg points to an array of struct io_uring_restriction entries:
/// Per-restriction entry (matches Linux struct io_uring_restriction, 16 bytes).
#[repr(C)]
pub struct IoUringRestriction {
/// Restriction type: 0 = IORING_RESTRICTION_REGISTER_OP,
/// 1 = IORING_RESTRICTION_SQE_OP, 2 = IORING_RESTRICTION_SQE_FLAGS_ALLOWED,
/// 3 = IORING_RESTRICTION_SQE_FLAGS_REQUIRED.
pub opcode: u16,
/// For SQE_OP: the io_uring opcode number (0-64) to allow.
/// For REGISTER_OP: the register opcode to allow.
/// For FLAGS: the SQE flags bitmask.
pub arg: u8,
pub resv: u8,
pub resv2: [u32; 3],
}
// Layout: 2 + 1 + 1 + 12 = 16 bytes.
const_assert!(size_of::<IoUringRestriction>() == 16);
Restrictions are additive: each SQE_OP entry adds one allowed opcode. Once
IORING_REGISTER_RESTRICTIONS is called, only explicitly allowed opcodes succeed;
all others return -EACCES in the CQE. Restrictions are immutable after registration
(calling IORING_REGISTER_RESTRICTIONS a second time returns -EBUSY). If no
restrictions are registered, all opcodes are allowed (default open policy, matching Linux).
UmkaOS extends Linux's restriction model with an additional check: URING_CMD
subtypes are filtered by the KABI driver's declared command whitelist
(Section 12.6), so even if IORING_OP_URING_CMD
is allowed, the driver controls which passthrough commands are accepted.
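The additive whitelist semantics above can be sketched as a standalone model: a default-open policy until registration, one-shot registration (EBUSY on the second attempt), and -EACCES in the CQE for disallowed opcodes. The `Restrictions` type is hypothetical; errno values are the standard numbers.

```rust
// Sketch of the per-instance opcode whitelist described in the text.
const EACCES: i32 = 13;
const EBUSY: i32 = 16;

struct Restrictions {
    registered: bool,
    allowed: [bool; 65], // one slot per opcode 0-64
}

impl Restrictions {
    fn new() -> Self {
        Restrictions { registered: false, allowed: [false; 65] }
    }

    /// IORING_REGISTER_RESTRICTIONS: additive, immutable after first call.
    fn register(&mut self, opcodes: &[u8]) -> Result<(), i32> {
        if self.registered {
            return Err(EBUSY); // second registration attempt
        }
        for &op in opcodes {
            self.allowed[op as usize] = true;
        }
        self.registered = true;
        Ok(())
    }

    /// Per-SQE check; Err value is the res posted in the CQE.
    fn check(&self, opcode: u8) -> Result<(), i32> {
        if !self.registered {
            return Ok(()); // default open policy, matching Linux
        }
        if self.allowed[opcode as usize] { Ok(()) } else { Err(-EACCES) }
    }
}
```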
19.3.2 Credential Personalities (IORING_REGISTER_PERSONALITY)¶
io_uring supports credential personalities: pre-registered TaskCredential snapshots
that SQEs can reference to execute operations under different credentials without
requiring the submitting thread to change its own credentials. This is Linux-compatible
(io_uring_register(2) with IORING_REGISTER_PERSONALITY opcode 9, Linux 5.6+).
```rust
/// Opcode for io_uring_register(2) to register a credential personality.
pub const IORING_REGISTER_PERSONALITY: u32 = 9;
/// Opcode to unregister a previously registered personality.
pub const IORING_UNREGISTER_PERSONALITY: u32 = 10;
/// Maximum registered personalities per io_uring instance. Bounded to
/// prevent unbounded memory growth from credential snapshots.
pub const IORING_MAX_PERSONALITIES: u32 = 256;
```
Registration protocol:
- Userspace calls io_uring_register(ring_fd, IORING_REGISTER_PERSONALITY, NULL, 0).
- The kernel snapshots the calling thread's current TaskCredential (Section 8.2) — this captures uid, gid, supplementary groups, capabilities, LSM context, and user namespace.
- The snapshot is stored in IoRingCtx.personalities (XArray keyed by personality ID). The personality ID (a small integer, 1-based) is returned to userspace as the syscall return value.
- The TaskCredential is cloned (Arc increment) — the snapshot is independent of any future credential changes by the registering thread.
SQE personality resolution:
When an SQE has a non-zero personality field (see SqEntry.personality: u16), the
io_uring submission path resolves the personality ID to the stored credential snapshot
before dispatching the operation:
```rust
/// Resolve SQE personality to stored credentials.
/// Called during SQE dispatch (io_uring_enter → submission loop).
///
/// If personality is 0: use the credentials captured at io_uring_enter()
/// time (the submitting thread's current credentials).
/// If personality is non-zero: look up the registered credential snapshot
/// in IoRingCtx.personalities. Return -EINVAL if the ID is not registered.
fn resolve_personality(
    ctx: &IoRingCtx,
    sqe: &SqEntry,
) -> Result<Arc<TaskCredential>, Errno> {
    if sqe.personality == 0 {
        // For SQPOLL mode: the SQPOLL kernel thread's own credentials
        // are NOT used. Instead, use the ring creator's credentials
        // (captured at io_uring_setup() time). For non-SQPOLL: this
        // is the calling task's own credentials (same result as
        // current_task().cred but routed through ctx.creator_cred
        // for uniformity and SQPOLL correctness).
        return Ok(ctx.creator_cred.clone());
    }
    ctx.personalities
        .load(sqe.personality as u64)
        .ok_or(Errno::EINVAL)
}
```
The resolved credentials are used for all permission checks during the operation:
file access checks (inode_permission), capability checks (capable_wrt_inode),
LSM hooks, and cgroup accounting. The operation executes as if the registering thread
(at registration time) had submitted it directly.
IoRingCtx personality storage (added to the IoRingCtx struct):
```rust
/// Registered credential personalities. Key = personality ID (1-based u16,
/// stored as u64 for XArray compatibility). Value = cloned TaskCredential
/// snapshot from the thread that called IORING_REGISTER_PERSONALITY.
/// Empty until the first personality is registered.
/// XArray for O(1) lookup by personality ID on the SQE submission hot path.
pub personalities: XArray<Arc<TaskCredential>>,
```
Unregistration: io_uring_register(ring_fd, IORING_UNREGISTER_PERSONALITY, NULL, id)
removes the personality. In-flight SQEs that already resolved the personality continue
using the cloned credential — the Arc keeps it alive until the last operation completes.
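The unregistration lifetime rule follows directly from Arc semantics, sketched below with a plain `HashMap` standing in for the XArray and a `Cred` struct standing in for TaskCredential (all names here are illustrative):

```rust
use std::collections::HashMap;
use std::sync::Arc;

/// Stand-in for TaskCredential.
struct Cred {
    uid: u32,
}

/// Stand-in for the per-ring personality table.
struct Personalities {
    next_id: u16,
    table: HashMap<u16, Arc<Cred>>,
}

impl Personalities {
    /// IORING_REGISTER_PERSONALITY: snapshot and return a 1-based ID.
    fn register(&mut self, cred: Cred) -> u16 {
        let id = self.next_id;
        self.next_id += 1;
        self.table.insert(id, Arc::new(cred));
        id
    }

    /// SQE resolution: clone the Arc (refcount bump), so the in-flight
    /// operation owns a reference independent of the table entry.
    fn resolve(&self, id: u16) -> Option<Arc<Cred>> {
        self.table.get(&id).cloned()
    }

    /// IORING_UNREGISTER_PERSONALITY: drop the table's reference only.
    fn unregister(&mut self, id: u16) -> bool {
        self.table.remove(&id).is_some()
    }
}
```

After `unregister`, new SQEs referencing the ID fail resolution, but any operation that already called `resolve` keeps its clone alive until it completes.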
Security: Personality registration requires no special privilege — the snapshot captures only
the calling thread's own credentials, so a thread can never register credentials it does not
already hold (matching Linux, which performs no capability check for
IORING_REGISTER_PERSONALITY). The security-relevant property is retention: a privileged
thread that registers a personality and later drops privileges can still submit SQEs under
the snapshot, analogous to keeping a privileged fd open. Rings shared across privilege
boundaries should therefore register restrictions (Section 19.3.1) before handing out the fd.
Async worker credential override: io_wq worker threads inherit the ring creator's
credentials at worker creation time. Each worker stores creator_cred: Arc<TaskCredential>.
When executing an SQE, the worker temporarily overrides current.cred with the resolved
personality credentials (or creator_cred for personality=0) for the duration of the
operation. The worker calls override_creds(resolved_cred) before dispatching the
operation and revert_creds() after completion. This ensures file permission checks,
LSM hooks, and cgroup accounting all use the personality's credentials rather than the
worker thread's own identity. The override is stack-scoped (RAII guard) — an early
return or panic automatically reverts credentials.
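The stack-scoped override can be sketched as a Drop guard over a thread-local credential slot (a minimal userland illustration of the override_creds()/revert_creds() pattern; `CURRENT_CRED`, `CredGuard`, and the uid-as-credential simplification are all assumptions of this sketch):

```rust
use std::cell::RefCell;

thread_local! {
    // Stand-in for the worker thread's current credential pointer.
    // 0 represents the worker's own identity.
    static CURRENT_CRED: RefCell<u32> = RefCell::new(0);
}

/// RAII guard: restores the saved credentials when dropped, even on
/// early return or unwind.
struct CredGuard {
    saved: u32,
}

/// Install the resolved personality credentials for the current scope.
fn override_creds(uid: u32) -> CredGuard {
    let saved = CURRENT_CRED.with(|c| c.replace(uid));
    CredGuard { saved }
}

impl Drop for CredGuard {
    fn drop(&mut self) {
        // revert_creds(): runs automatically at scope exit.
        CURRENT_CRED.with(|c| *c.borrow_mut() = self.saved);
    }
}

fn current_uid() -> u32 {
    CURRENT_CRED.with(|c| *c.borrow())
}
```

Permission checks performed while the guard is live see the personality's identity; once the guard drops, the worker is back to its own.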
FsStruct snapshot for async workers: Async path operations (OPENAT, RENAMEAT,
etc.) use the IoRingNsCtx.root and IoRingNsCtx.pwd captured at io_uring_setup()
time, NOT the async worker thread's own FsStruct. This prevents TOCTOU races where
the submitting thread's chdir() would affect in-flight operations. See
Section 8.1 for FsStruct definition and CLONE_FS semantics.
19.3.3 Direct I/O Path (O_DIRECT)¶
The primary io_uring use case — high-performance storage I/O — relies on O_DIRECT
to bypass the page cache entirely. UmkaOS defines the DirectIoOps trait as the
contract between io_uring (Tier 0) and filesystem/block layer (Tier 1) for direct
I/O operations. The crossing uses kabi_call! which resolves to ring dispatch
(io_uring is in domain 0, filesystem is in a Tier 1 domain). The full DirectIoOps trait definition (with read_direct,
write_direct, alignment requirements, and cache coherency semantics) is specified
in the "Direct I/O Operations (O_DIRECT)" section below.
```rust
/// Constraints for direct I/O alignment. Queried once per file at open time
/// and cached in the File struct for the lifetime of the fd.
pub struct DirectIoConstraints {
    /// Minimum alignment for the user buffer virtual address (bytes).
    /// Typically 512 for legacy block devices, 4096 for NVMe with 4K LBAs.
    /// Derived from the block device's logical block size.
    pub buf_align: u32,
    /// Minimum alignment for the file offset (bytes). Same as buf_align
    /// for most devices; may differ for devices with non-power-of-two sectors.
    pub offset_align: u32,
    /// Minimum I/O size granularity (bytes). The len field in the iovec
    /// must be a multiple of this value. Equal to the device's logical block
    /// size (512 or 4096).
    pub len_align: u32,
    /// Maximum single I/O size (bytes). Larger requests are split by the
    /// block layer. Derived from the device's max_sectors queue limit.
    /// Typical value: 512 KiB for NVMe, 1 MiB for SCSI.
    pub max_io_size: u32,
}

/// Per-operation context for direct I/O, passed from io_uring to the
/// filesystem's DirectIoOps implementation.
pub struct DirectIoCtx {
    /// io_uring correlation token for async completion routing.
    pub correlation: SqeCorrelation,
    /// Pre-mapped DMA handle if the buffer is from a registered buffer pool.
    /// Some for READ_FIXED/WRITE_FIXED operations (zero per-op DMA overhead).
    /// None for regular READ/WRITE operations (DMA mapping established
    /// per-operation by the block layer).
    pub dma_handle: Option<Arc<PinnedDmaBuf>>,
    /// I/O priority from the SQE's ioprio field. Passed through to the
    /// block layer's I/O scheduler (Section 15.18).
    pub ioprio: u16,
    /// Whether this operation was submitted with RWF_NOWAIT (non-blocking).
    /// If set and the operation would block (e.g., metadata lookup requires
    /// I/O), return EAGAIN immediately instead of blocking the io_wq thread.
    pub nowait: bool,
}
```
**Alignment enforcement:**
1. On `io_uring_enter()` SQE validation, the kernel checks alignment constraints
for `O_DIRECT` file descriptors: `sqe.addr % constraints.buf_align == 0`,
`sqe.off % constraints.offset_align == 0`, and `sqe.len % constraints.len_align == 0`.
Misaligned requests are rejected immediately with `CqEntry { res: -EINVAL }` — they
never reach the filesystem layer.
2. **Registered buffer alignment**: `io_uring_register(IORING_REGISTER_BUFFERS)` verifies
that each registered buffer's virtual address and length satisfy the strictest
`DirectIoConstraints` of any currently open `O_DIRECT` fd in the ring. If no `O_DIRECT`
fd is open at registration time, the default 512-byte alignment is enforced (the minimum
for any block device). Buffers that fail alignment checks are rejected with `-EINVAL`.
3. **Fallback to buffered I/O**: The alignment violation policy is controlled per-fd
by the `DIO_ALIGN_FALLBACK` / `DIO_ALIGN_STRICT` mode (see "Direct I/O Operations
(O_DIRECT)" below). By default (`DIO_ALIGN_FALLBACK`), misaligned `O_DIRECT`
requests silently fall back to buffered I/O — this matches common Linux filesystem
behavior and avoids breaking applications that occasionally issue misaligned I/O
(e.g., reading the last partial block of a file). Applications that want deterministic
direct I/O with no silent fallback can opt in to `DIO_ALIGN_STRICT` via
`fcntl(F_SETFL, O_DIRECT_STRICT)`, which returns `-EINVAL` on misaligned requests.
Note: the SQE validation in step 1 above applies in **both** modes — it catches
requests that violate the block device's hard alignment constraints (which cannot
be served even via buffered I/O). The fallback/strict distinction applies only to
requests that satisfy block device constraints but not the filesystem's preferred
alignment.
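The step-1 validation reduces to three modulus checks against the cached constraints. A minimal sketch (the `Constraints` struct and `validate_dio_sqe` are illustrative stand-ins, not the kernel's actual entry point):

```rust
/// Simplified stand-in for DirectIoConstraints (u64 to match SQE fields).
struct Constraints {
    buf_align: u64,
    offset_align: u64,
    len_align: u64,
}

/// Hard alignment check performed at io_uring_enter() SQE validation,
/// before the request reaches the filesystem layer.
fn validate_dio_sqe(c: &Constraints, addr: u64, off: u64, len: u64) -> Result<(), i32> {
    const EINVAL: i32 = -22; // surfaced as CqEntry { res: -EINVAL }
    if addr % c.buf_align != 0
        || off % c.offset_align != 0
        || len % c.len_align != 0
    {
        return Err(EINVAL);
    }
    Ok(())
}
```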
**DMA mapping for registered buffers:**
When `IORING_REGISTER_BUFFERS` is called for an `O_DIRECT` workload, each buffer is:
1. Pinned in physical memory (`pin_user_pages_fast`).
2. DMA-mapped via the IOMMU (`dma_map_sg` or `dma_map_page`) to obtain an IOVA.
3. The IOVA is stored in `PinnedDmaBuf.iova` and reused for every subsequent
`READ_FIXED`/`WRITE_FIXED` operation referencing that buffer index.
This eliminates per-I/O IOMMU TLB invalidation and page pinning overhead — the two
largest sources of kernel-side latency for NVMe `O_DIRECT` workloads. For a 4 KiB
random read workload on NVMe, registered direct I/O buffers reduce kernel overhead
from ~3.5 us to ~1.2 us per I/O (measured: IOMMU map/unmap accounts for ~1.8 us,
pin/unpin ~0.5 us).
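The map-once/reuse-many structure behind these numbers can be sketched as a per-index IOVA cache (`DmaTable`, the synthetic IOVA formula, and the `map_calls` counter are all illustrative; the counter exists only to make the amortization visible):

```rust
use std::collections::HashMap;

/// Stand-in for the registered-buffer DMA table: IOVA established once at
/// registration, looked up on every READ_FIXED/WRITE_FIXED.
struct DmaTable {
    iova_by_index: HashMap<u16, u64>,
    map_calls: u32, // how many expensive IOMMU mappings were performed
}

impl DmaTable {
    /// IORING_REGISTER_BUFFERS path: pin + dma_map happens once, here.
    fn register(&mut self, index: u16) -> u64 {
        self.map_calls += 1;
        // Illustrative IOVA layout: 1 MiB-spaced slots in a fixed window.
        let iova = 0x8000_0000 + (index as u64) * 0x10_0000;
        self.iova_by_index.insert(index, iova);
        iova
    }

    /// Hot path: no pinning, no IOMMU work — just a lookup.
    fn iova_for(&self, index: u16) -> Option<u64> {
        self.iova_by_index.get(&index).copied()
    }
}
```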
> **See also**: [Section 15.18](15-storage.md#io-priority-and-scheduling) for how `ioprio` from the SQE is
> propagated through the block layer. [Section 15.2](15-storage.md#block-io-and-volume-management) for the
> bio submission path that `DirectIoOps` implementations use internally.
## io_uring Under SEV-SNP (Confidential Guest Mode)
When UmkaOS runs as a SEV-SNP confidential guest ([Section 9.7](09-security.md#confidential-computing--guest-mode-umkaos-as-a-confidential-guest)),
io_uring's shared memory rings create a conflict: SQE/CQE ring buffers are shared
between the kernel and userspace (both within the encrypted guest), but I/O operations
require DMA to virtio devices controlled by the hypervisor. The hypervisor cannot access
encrypted guest pages, so DMA buffers must be in unencrypted (C-bit clear) shared memory.
The SQE/CQE rings themselves remain in encrypted guest memory (both kernel and userspace
are inside the same encryption domain), but the I/O data buffers referenced by SQEs
require bounce buffering.
**Detection**: SEV-SNP is detected at boot via `CPUID` leaf `0x8000001F`, `EAX` bit 1
(SEV) and bit 4 (SEV-SNP). When SEV-SNP guest mode is active, the io_uring subsystem
enables the bounce buffer path automatically. No userspace changes are required --
existing io_uring applications run unmodified.
**Data path**: The guest kernel places I/O requests in the encrypted SQE ring as normal.
For operations requiring DMA (block I/O via virtio-blk, network via virtio-net), the
kernel copies data to/from an unencrypted bounce buffer (C-bit clear pages, accessible
to the hypervisor for DMA). On completion, the kernel copies results from the bounce
buffer back into the encrypted guest buffer, then places the CQE in the encrypted CQE
ring. The SQE/CQE rings themselves are never exposed to the hypervisor -- only the DMA
data payload is bounced.
**Bounce buffer pool**: Pre-allocated at io_uring initialization (not boot), sized to
2x the maximum concurrent io_uring queue depth across all rings on the system. Default
sizing: 4096 SQEs x 4 KiB = 16 MiB bounce pool per io_uring instance, capped at 64 MiB
system-wide (configurable via `/sys/kernel/umka/io_uring/snp_bounce_pool_mb`). All bounce
buffer pages are marked as shared (C-bit clear) so the hypervisor can DMA to/from them.
The pool uses a simple freelist allocator (no slab overhead -- bounce buffers are
uniform-sized 4 KiB pages). If the pool is exhausted, io_uring returns `-ENOMEM` for
the SQE and the application retries (same behavior as running out of DMA mapping slots
in non-SNP mode).
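The freelist allocator is a stack of free slot indices, pre-filled at ring init; exhaustion maps to `-ENOMEM`. A minimal sketch (`BouncePool` and its methods are illustrative names):

```rust
/// Uniform-size (4 KiB) bounce slot freelist. No slab metadata — just a
/// stack of free indices into the pre-allocated shared (C-bit clear) region.
struct BouncePool {
    free: Vec<u32>,
}

const ENOMEM: i32 = -12;

impl BouncePool {
    /// Pre-fill all slots as free at io_uring initialization.
    fn new(slots: u32) -> Self {
        BouncePool {
            free: (0..slots).rev().collect(),
        }
    }

    /// Grab a slot for one SQE's DMA payload; exhausted pool fails the
    /// SQE with -ENOMEM and the application retries.
    fn alloc(&mut self) -> Result<u32, i32> {
        self.free.pop().ok_or(ENOMEM)
    }

    /// Return the slot on I/O completion.
    fn free_slot(&mut self, idx: u32) {
        self.free.push(idx);
    }
}
```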
**Performance impact**: Each I/O operation requires two additional `memcpy` operations
(submission: guest buffer -> bounce buffer; completion: bounce buffer -> guest buffer).
For 4 KiB blocks, each `memcpy` costs ~0.3-0.5 us (~0.6-1.0 us total per I/O). This
is acceptable given that SEV-SNP already imposes 5-15% baseline overhead from memory
encryption engine traversal on all memory accesses. The bounce buffer overhead is
additive but small relative to the encryption baseline: approximately 1-3% additional
overhead for NVMe 4 KiB random I/O workloads (which are already dominated by device
latency), and < 1% for sequential large-block I/O (where `memcpy` is amortized over
larger transfers).
**Fixed buffers optimization**: `io_uring_register(IORING_REGISTER_BUFFERS)` under
SEV-SNP pre-registers persistent bounce buffer mappings for specific user buffers. When
an application registers N buffers, the kernel allocates N corresponding bounce buffer
slots and establishes a stable mapping. Subsequent I/O operations referencing registered
buffer indices use the pre-mapped bounce buffers without per-operation pool
allocation/deallocation, amortizing the bounce overhead across multiple operations to the
same buffer. This is particularly effective for database workloads that reuse a fixed set
of I/O buffers.
**Per-buffer encryption policy**: For network I/O carrying sensitive payloads (TLS
session keys, authentication tokens), applications can request per-buffer AES-GCM
encryption/decryption at registration time via `IORING_REGISTER_BUFFERS_ENCRYPTED`
(UmkaOS extension). This adds ~1 us per 4 KiB page (AES-GCM encrypt + MAC) but ensures
data in the bounce buffer is ciphertext, not plaintext. This flag is unnecessary for
block storage (ciphertext is on disk anyway, and dm-crypt handles encryption above the
io_uring layer) but recommended for network buffers in high-security deployments. When
this flag is not set, bounce buffer contents are plaintext -- this is acceptable for the
SEV-SNP threat model because the hypervisor is already trusted to deliver I/O correctly
(it controls the virtio device), and bounce buffers are only exposed for the duration of
the DMA operation.
> **See also**: [Section 9.7](09-security.md#confidential-computing--guest-mode-umkaos-as-a-confidential-guest)
> (UmkaOS as confidential guest) for the general SWIOTLB bounce buffer architecture.
> [Section 24.4](24-roadmap.md#formal-verification-readiness--performance-impact) for SEV-SNP performance
> characteristics.
## io_uring State Ownership and Live Evolution
io_uring ring state is **owned by the task** (via the file descriptor table), not
by the io_uring subsystem component. This design, inspired by Theseus OS's
state spill avoidance principle (Boos et al., OSDI 2020), enables live evolution
of the io_uring component without draining in-flight operations or serializing
ring state.
**Design rationale**: In a conventional live-update model, component state must
be exported, serialized, and imported into the replacement component. For io_uring,
this would require draining all in-flight I/O (tens of ms latency) or atomically
snapshotting ring state mid-flight (high complexity, subtle races). Instead,
UmkaOS structures io_uring so the component holds only code and soft caches — all
per-ring state lives in task-owned structures that persist across component swaps.
**`IoRingCtx` — task-owned ring state** (allocated on `io_uring_setup()`, referenced
through the fd table via `File::private_data`):
```rust
/// io_uring setup parameters. Matches Linux `struct io_uring_params`
/// (include/uapi/linux/io_uring.h) exactly for binary compatibility.
/// Passed by userspace to `io_uring_setup()` and returned with kernel-filled
/// fields (sq_off, cq_off, features).
#[repr(C)]
pub struct IoRingParams {
/// Requested SQ entries (must be power of 2, max IORING_MAX_ENTRIES=32768).
pub sq_entries: u32,
/// Requested CQ entries (must be power of 2; if 0, kernel uses 2*sq_entries).
pub cq_entries: u32,
/// Setup flags (IORING_SETUP_* bitfield). Key values:
/// IOPOLL=0x1, SQPOLL=0x2, SQ_AFF=0x4, CQSIZE=0x8, CLAMP=0x10,
/// ATTACH_WQ=0x20, R_DISABLED=0x40, SUBMIT_ALL=0x80, COOP_TASKRUN=0x100,
/// TASKRUN_FLAG=0x200, SQE128=0x400, CQE32=0x800, SINGLE_ISSUER=0x1000,
/// DEFER_TASKRUN=0x2000, NO_MMAP=0x4000, REGISTERED_FD_ONLY=0x8000,
/// NO_SQARRAY=0x10000.
pub flags: u32,
/// SQ thread CPU affinity (only when IORING_SETUP_SQ_AFF is set).
pub sq_thread_cpu: u32,
/// SQ thread idle timeout in milliseconds (only when IORING_SETUP_SQPOLL is set).
pub sq_thread_idle: u32,
/// Feature flags (kernel-filled on return). Bitmask of IORING_FEAT_*.
pub features: u32,
/// Working group ID for work queue sharing (IORING_SETUP_ATTACH_WQ).
pub wq_fd: u32,
/// Reserved for future use. Must be zero on input.
pub resv: [u32; 3],
/// Offsets of SQ ring fields within the mmap'd SQ ring region.
/// Kernel-filled on return.
pub sq_off: IoSqringOffsets,
/// Offsets of CQ ring fields within the mmap'd CQ ring region.
/// Kernel-filled on return.
pub cq_off: IoCqringOffsets,
}
const_assert!(core::mem::size_of::<IoRingParams>() == 120);
/// Offsets within the SQ ring mmap region. Matches Linux `struct io_sqring_offsets`.
#[repr(C)]
pub struct IoSqringOffsets {
pub head: u32,
pub tail: u32,
pub ring_mask: u32,
pub ring_entries: u32,
pub flags: u32,
pub dropped: u32,
pub array: u32,
pub resv1: u32,
pub user_addr: u64,
}
const_assert!(core::mem::size_of::<IoSqringOffsets>() == 40);
/// Offsets within the CQ ring mmap region. Matches Linux `struct io_cqring_offsets`.
#[repr(C)]
pub struct IoCqringOffsets {
pub head: u32,
pub tail: u32,
pub ring_mask: u32,
pub ring_entries: u32,
pub overflow: u32,
pub cqes: u32,
pub flags: u32,
pub resv1: u32,
pub user_addr: u64,
}
const_assert!(core::mem::size_of::<IoCqringOffsets>() == 40);
/// Per-ring state for one io_uring instance. Owned by the task's fd table
/// (via Arc in File::private_data), NOT by the io_uring subsystem component.
/// Persists across live evolution of the io_uring component code.
///
/// On io_uring_setup(): allocated, inserted into fd table as a new File.
/// On close(fd) or task exit: Drop cleans up rings, cancels in-flight ops,
/// unpins registered buffers.
///
/// The io_uring component provides stateless functions that operate on this
/// struct. Component swap replaces the functions; this struct is unchanged.
pub struct IoRingCtx {
/// Submission queue ring (shared with userspace via mmap).
/// Layout matches Linux io_uring SQ ring exactly for binary compatibility.
pub sq_ring: MappedPages,
/// Completion queue ring (shared with userspace via mmap).
pub cq_ring: MappedPages,
/// SQ entries array (separate mmap region, indexed by SQ ring entries).
pub sqes: MappedPages,
/// Kernel-internal state flags. Separate from `params.flags` (which is the
/// userspace ABI struct and must remain `u32`). Bit 0 = `IORING_CTX_DYING`
/// (set by `io_uring_files_cancel()` during task exit to reject new
/// submissions). `AtomicU32` for interior mutability: the cancel path
/// accesses through `Arc<IoRingCtx>` (`&self`), while `io_uring_enter()`
/// reads concurrently. `Release` on store, `Acquire` on load.
pub state_flags: AtomicU32,
/// Registered buffer table (pre-pinned DMA buffers for zero-copy I/O).
/// None until io_uring_register(IORING_REGISTER_BUFFERS) is called.
/// Bounded by IORING_MAX_REG_BUFFERS (default: 32768, matching Linux).
///
/// **Persistent DMA mappings**: Each registered buffer is pinned in memory
/// and its DMA mapping (IOVA) is established once at registration time.
/// Subsequent `READ_FIXED` / `WRITE_FIXED` operations reuse the pre-mapped
/// IOVA without per-operation `dma_map` / `dma_unmap` calls. Mappings are
/// torn down only on `IORING_UNREGISTER_BUFFERS` or ring destruction.
/// This eliminates IOMMU TLB invalidation overhead on every I/O.
///
/// **In-flight bio safety on ring destruction**: Each bio referencing a
/// registered buffer holds an `Arc<IoRingCtx>`. The ring's reference count
/// prevents destruction until all in-flight bios complete and drop their
/// Arc references. DMA mappings are torn down only after the last bio
/// completes.
///
/// Protected by `SpinLock` for interior mutability: `io_uring_files_cancel()`
/// clears this through `Arc<IoRingCtx>` (`&self`). Concurrent
/// `READ_FIXED`/`WRITE_FIXED` operations are rejected by the
/// `IORING_CTX_DYING` flag check before accessing the buffers.
// Registered buffers are a warm-path operation (set up once, used many
// times). Vec is acceptable here because the maximum count is bounded by
// IORING_MAX_REG_BUFFERS and allocation happens only during
// IORING_REGISTER_BUFFERS.
pub registered_buffers: SpinLock<Option<Vec<PinnedDmaBuf>>>,
/// Registered file table (pre-resolved file descriptors).
/// None until io_uring_register(IORING_REGISTER_FILES) is called.
/// Heap-allocated: IORING_MAX_REG_FILES = 32768, so inline ArrayVec
/// would waste 512 KiB per io_uring instance. Allocated once at
/// registration time; length = user-requested count (≤ IORING_MAX_REG_FILES).
///
/// Protected by `SpinLock` (same rationale as `registered_buffers`).
pub registered_files: SpinLock<Option<Box<[Option<Arc<File>>]>>>,
/// In-flight operation tracking. Key = SQE index, value = operation state.
/// Used for cancellation (IORING_OP_ASYNC_CANCEL) and linked-SQE chains.
/// XArray for O(1) indexed lookup (Section 3.1.13).
pub inflight: XArray<IoRingInflight>,
/// Namespace context captured at the most recent io_uring_enter().
/// All SQEs in a submission batch use this context.
pub ns_ctx: IoRingNsCtx,
/// Registered eventfd for CQE notification. Set via
/// `IORING_REGISTER_EVENTFD`. When set, `eventfd_signal()` is called
/// on each CQE post (batched — one signal per completion drain cycle,
/// not per CQE). Applications use this to integrate io_uring with
/// `epoll` or `select`-based event loops. Cleared via
/// `IORING_UNREGISTER_EVENTFD` or on ring destruction.
pub eventfd: Option<Arc<EventFd>>,
/// Per-instance operation restriction whitelist (None = all allowed).
pub restrictions: Option<IoRingRestrictions>,
/// SQPOLL configuration. None if SQPOLL is not enabled for this ring.
pub sqpoll: Option<IoRingSqpoll>,
/// Ring parameters captured at setup time (SQ size, CQ size, flags).
/// This is the userspace ABI struct (matches Linux `struct io_uring_params`
/// exactly). Kernel-internal flags (e.g., `IORING_CTX_DYING`) live in
/// `state_flags` above, NOT in `params.flags`.
pub params: IoRingParams,
/// SEV-SNP bounce buffer pool (None if not running in confidential guest).
pub snp_bounce: Option<Arc<SnpBouncePool>>,
/// CQE overflow list. When the CQ ring is full and a new CQE must be
/// posted, the CQE is appended here instead of being dropped. On the
/// next `io_uring_enter()`, the kernel drains this list into the CQ
/// ring before processing new SQEs. The overflow list is bounded at
/// 2× the SQ depth (matching Linux). When the overflow list is also
/// full, completions are dropped and `IORING_CQ_OVERFLOW` flag is set
/// in the SQ flags visible to userspace. Userspace must drain
/// completions (via `IORING_ENTER_GETEVENTS`) before submitting more
/// SQEs. Matches Linux's CQE overflow behavior (kernel 5.5+).
/// Capacity = 2 × sq_entries, set at io_uring_setup() time.
/// Bounded: max `2 * IORING_MAX_ENTRIES` (65536). Uses Box<[...]> to avoid
/// heap reallocation under SpinLock. Allocated once at io_uring_setup().
pub cq_overflow: SpinLock<Box<[Option<IoUringCqe>]>>,
/// WaitQueue for task-exit drain. `io_uring_files_cancel()` sleeps
/// here waiting for in-flight operations to complete. The I/O
/// completion path (`io_uring_cqe_post()`) wakes this queue whenever
/// `inflight.count()` decrements, providing event-driven exit drain
/// instead of 1ms polling. No-op when no task is exiting.
pub exit_wq: WaitQueue,
/// State version tag for cross-evolution compatibility checking.
/// Incremented when the IoRingCtx layout changes between io_uring
/// component versions. The new component checks this on first access
/// and migrates inline if needed (see evolution protocol below).
pub state_version: u64,
}
/// SQPOLL thread state, owned by the ring (not the io_uring component).
pub struct IoRingSqpoll {
/// Handle to the kernel SQPOLL thread polling this ring's SQ.
pub thread: TaskRef,
/// Idle timeout in milliseconds. SQPOLL thread parks after this
/// duration of no SQ activity; woken by io_uring_enter(IORING_ENTER_SQ_WAKEUP).
pub idle_timeout_ms: u32,
/// CPU affinity for the SQPOLL thread (set via IORING_REGISTER_IOWQ_AFF).
pub cpu: Option<CpuId>,
}
/// In-flight operation state for one SQE.
pub struct IoRingInflight {
/// Original SQE user_data (echoed in CQE for correlation).
pub user_data: u64,
/// Operation type (opcode from SQE).
pub opcode: u16,
/// Linked-SQE chain next pointer (None if standalone or chain tail).
pub link_next: Option<u32>,
/// Cancellation token. Set to Cancelled if ASYNC_CANCEL targets this op.
pub cancel: AtomicU8,
/// Reference to the async work item (for io_wq-offloaded operations).
pub work: Option<IoWqWorkRef>,
}
pub const IORING_MAX_REG_BUFFERS: usize = 32768;
pub const IORING_MAX_REG_FILES: usize = 32768;
/// Current IoRingCtx layout version. Bumped on struct changes.
pub const IORING_CTX_VERSION: u64 = 1;
The io_uring component is a stateless processor:

```rust
/// io_uring component interface. The component provides these operations
/// but holds NO per-ring state. All state is in IoRingCtx (task-owned).
///
/// On live evolution, these function pointers are replaced atomically.
/// IoRingCtx instances are unchanged — they persist in their owning tasks.
pub trait IoRingOps: Send + Sync {
    /// Process a batch of SQEs from the submission queue.
    /// Called from io_uring_enter() syscall path.
    /// Returns number of SQEs successfully submitted.
    fn submit(&self, ctx: &mut IoRingCtx, to_submit: u32) -> Result<u32, Errno>;
    /// Post a completion entry to the CQ ring.
    /// Called from I/O completion callbacks (block, network, VFS).
    fn complete(&self, ctx: &IoRingCtx, cqe: CqEntry);
    /// Register/unregister buffers, files, restrictions, etc.
    /// Called from io_uring_register() syscall path.
    fn register(&self, ctx: &mut IoRingCtx, opcode: u32, arg: UserPtr, nr: u32) -> Result<(), Errno>;
    /// Cancel an in-flight operation by user_data tag.
    fn cancel(&self, ctx: &mut IoRingCtx, user_data: u64, flags: u32) -> Result<(), Errno>;
    /// SQPOLL thread main loop body (called repeatedly by the SQPOLL thread).
    /// Returns true if work was found (SQ not empty), false if idle.
    fn sqpoll_tick(&self, ctx: &mut IoRingCtx) -> bool;
}
```
Component-internal state (soft, regenerable): The io_uring component may hold global
soft state for performance:

| State | Purpose | On swap |
|---|---|---|
| io_wq worker thread pool | Offloads blocking operations (fsync, statx, etc.) | Existing workers finish current op with old code, then pick up new code on next work item. Pool is shared via Arc — new component inherits the reference. |
| Per-CPU completion batch lists | Batch CQE posting for cache efficiency | Flushed on swap (soft state — rebuilt on first completion). |
| Opcode dispatch table | Fast opcode → handler function map | Rebuilt by new component on load (trivial — 65-entry table). |
Live evolution protocol for io_uring (io_uring-specific instantiation of the generic Phase A/A'/B/C lifecycle defined in Section 13.18):
Phase A — Preparation (normal operation continues):
1. New io_uring component binary loaded.
2. New component calls io_wq_pool.upgrade_ops(new_work_fn) —
existing workers finish current item, then use new code for
subsequent items. No drain, no stall.
3. New component rebuilds its opcode dispatch table.
Phase B — Atomic swap (~1-10 μs, same as any component):
4. IPI stop-the-world.
5. IoRingOps vtable pointer swapped (old → new).
6. SQPOLL threads (if any): their next sqpoll_tick() call invokes
new component code. The threads are NOT restarted — they hold
Arc<IoRingCtx> and simply call through the new vtable.
7. CPUs released.
Phase C — Post-swap:
8. Flush per-CPU completion batch lists (one CQ ring doorbell per CPU).
9. Version check: if new component requires IoRingCtx layout changes
(state_version mismatch), it migrates each ring inline on first
access (lazy migration). Migration adds/removes fields and bumps
state_version. This is bounded: at most one migration per ring
per evolution event.
Why this is better than drain-and-recreate:
| Property | Drain-and-recreate | Ownership model |
|---|---|---|
| Swap latency | Tens of ms (drain all in-flight I/O) | ~1-10 μs (atomic vtable swap) |
| Serialization bugs | StateSerializer must capture every field | No serialization — data unchanged |
| In-flight I/O | Lost or stalled during drain | Continues uninterrupted |
| SQPOLL | Thread killed and restarted | Thread continues with new code |
| Attack surface | Deserializer can be exploited | No deserialization path |
| Implementation | Medium (write StateSerializer + tests) | Low (restructure ownership) |
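The core mechanism — swap the ops vtable, leave the task-owned state untouched — can be sketched in a few lines (a userland illustration: `RwLock<Arc<dyn Ops>>` stands in for the kernel's IPI-protected vtable pointer, and `Ctx`, `V1`, `V2` are hypothetical):

```rust
use std::sync::{Arc, RwLock};

/// Stand-in for IoRingOps: stateless code operating on caller-owned state.
trait Ops: Send + Sync {
    fn submit(&self, ctx: &mut Ctx) -> u32;
}

/// Stand-in for IoRingCtx: task-owned, survives the component swap.
struct Ctx {
    submitted: u32,
}

struct V1;
impl Ops for V1 {
    fn submit(&self, ctx: &mut Ctx) -> u32 {
        ctx.submitted += 1;
        1
    }
}

/// "New component": different code, same state layout.
struct V2;
impl Ops for V2 {
    fn submit(&self, ctx: &mut Ctx) -> u32 {
        ctx.submitted += 2; // e.g., batches two at a time
        2
    }
}

/// Phase B in miniature: replace the vtable pointer; Ctx is untouched,
/// no drain and no serialization.
fn swap(slot: &RwLock<Arc<dyn Ops>>, new: Arc<dyn Ops>) {
    *slot.write().unwrap() = new;
}
```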
Struct versioning and ABI decoupling: IoRingCtx is referenced from the
fd table as File::private_data: Arc<dyn Any + Send + Sync>. The io_uring
component downcasts to IoRingCtx on access. The task management subsystem
never inspects IoRingCtx internals — it only calls Drop on fd close. This
decouples IoRingCtx's layout from the task subsystem, allowing the io_uring
component to evolve its struct freely. Cross-version compatibility is handled
by the state_version field: if a new component encounters an old-version
IoRingCtx, it migrates the struct inline (adding new fields with defaults,
removing obsolete fields). Migration is O(1) per field change and happens
at most once per ring per evolution event.
Task struct integration: No new field is added to the Task struct.
IoRingCtx is reached through the existing Task.files: Arc<FdTable> →
FdTable[fd] → File::private_data path. This is the same path Linux uses
(task_struct → files → fdtable → file → private_data → io_ring_ctx).
Tasks that never use io_uring pay zero overhead.
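The `private_data` decoupling is the standard `Arc<dyn Any>` downcast pattern, sketched below (`RingCtx`, `store`, and `downcast` are illustrative names; `Arc::downcast` is the real std API for `Arc<dyn Any + Send + Sync>`):

```rust
use std::any::Any;
use std::sync::Arc;

/// Stand-in for IoRingCtx.
struct RingCtx {
    state_version: u64,
}

/// The fd table stores an opaque pointer — the task subsystem never
/// inspects what is behind it, only drops it on fd close.
fn store(ctx: RingCtx) -> Arc<dyn Any + Send + Sync> {
    Arc::new(ctx)
}

/// Only the io_uring component knows the concrete type and downcasts
/// back on access.
fn downcast(private: Arc<dyn Any + Send + Sync>) -> Option<Arc<RingCtx>> {
    private.downcast::<RingCtx>().ok()
}
```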
19.3.4 io_uring Exit Cleanup¶
When a task exits, all in-flight io_uring operations must be cancelled and fixed
resource registrations released BEFORE close_files() (Step 5 in
Section 8.2). If close_files() ran first, it would attempt to
close io_uring fds, which triggers IoRingCtx::drop(). But Drop calls
io_uring_cancel_all() internally, which may block waiting for in-flight bios to
complete. Those bios may hold references to files in the dying task's fd table
(e.g., the target file of an IORING_OP_READ). Closing those files first would
require waiting for the bio to release its reference — deadlock. Linux solves this
identically: io_uring_task_cancel() runs before exit_files() in do_exit().
Pseudocode convention: Code in this section uses Rust syntax and follows Rust ownership,
borrowing, and type rules. `&self` methods use interior mutability for mutation. Atomic
fields use `.store()`/`.load()`.
Task struct integration: Consistent with the io_uring ownership model described
in Section 19.3, no new field
is added to the Task struct. IoRingCtx instances are discovered by scanning the
task's fd table for file descriptors whose File::private_data downcasts to
IoRingCtx. This matches the Linux approach: io_uring_task_cancel() walks the task's
tctx->xa (an XArray of io_uring instances the task has submitted to).
However, repeatedly scanning the fd table on every exit is inefficient. UmkaOS uses a
per-task io_uring_tctx that is lazily allocated on the first io_uring_enter() call:
/// Per-task io_uring tracking context. Lazily allocated on first io_uring_enter().
/// Stored in Task.io_uring_tctx (Option<Box<IoUringTaskCtx>>).
///
/// This is a lightweight index — it does NOT own the IoRingCtx instances
/// (those are owned by the fd table via Arc). It holds Weak references
/// for O(1) enumeration during exit cleanup.
pub struct IoUringTaskCtx {
/// Weak references to io_uring instances this task has interacted with.
/// Bounded: a task rarely has more than 16 io_uring instances.
/// Weak<IoRingCtx> ensures this index never keeps a ring alive: closing
/// the fd can still free the ring.
pub rings: ArrayVec<Weak<IoRingCtx>, 16>,
}
The Task struct field:
/// io_uring task context. None if this task has never called io_uring_enter().
/// Lazily allocated on first io_uring_enter(). Tasks that never use io_uring
/// pay zero overhead (Option is a single pointer, None = null).
pub io_uring_tctx: Option<Box<IoUringTaskCtx>>,
Exit cleanup procedure — called at Step 3e in do_exit(), after perf event
cleanup (Step 3d) and before mm teardown (Step 4). This is per-thread cleanup —
every exiting thread runs it:
/// Cancel all in-flight io_uring operations for the dying task.
///
/// MUST be called BEFORE close_files() in do_exit(). Reason: io_uring
/// operations may hold references to files in the task's fd table. If
/// close_files() runs first, the close blocks waiting for io_uring to
/// release its reference. io_uring won't release until its cleanup runs
/// — deadlock.
///
/// # Preconditions
/// - `task.flags` has `PF_EXITING` set.
/// - Task's address space is still intact (Step 4 has NOT run yet).
/// io_uring cancellation runs as Step 3e, BEFORE mm teardown (Step 4).
///
/// # Postconditions
/// - All io_uring instances owned by this task have no in-flight operations.
/// - Fixed buffer and file registrations are released.
/// - The io_uring fd itself is NOT closed here — that happens in close_files()
/// (Step 5), which triggers final ring teardown via IoRingCtx::drop().
fn io_uring_files_cancel(task: &Task) {
let tctx = match task.io_uring_tctx.as_ref() {
Some(tctx) => tctx,
None => return, // Task never used io_uring — nothing to cancel.
};
for weak_ctx in tctx.rings.iter() {
let ctx = match weak_ctx.upgrade() {
Some(ctx) => ctx,
None => continue, // Ring already closed (fd was closed earlier).
};
// Step 1: Prevent new SQE submissions. Any concurrent io_uring_enter()
// from another thread sharing this fd (via CLONE_FILES) will see
// this flag and return -ECANCELED immediately.
ctx.state_flags.fetch_or(IORING_CTX_DYING, Release);
// Step 2: Cancel all in-flight operations.
// Walk the inflight XArray and cancel each pending operation.
let mut cancelled = 0u32;
for (idx, entry) in ctx.inflight.iter() {
// Set the cancellation token. The I/O completion path checks
// this and posts -ECANCELED CQEs.
entry.cancel.store(CANCEL_DYING, Release);
// Attempt per-opcode cancellation:
match entry.opcode {
// Block I/O: cancel the underlying bio/request.
IORING_OP_READ | IORING_OP_WRITE | IORING_OP_READV
| IORING_OP_WRITEV | IORING_OP_READ_FIXED
| IORING_OP_WRITE_FIXED | IORING_OP_READV_FIXED
| IORING_OP_WRITEV_FIXED => {
if let Some(ref work) = entry.work {
work.cancel();
}
}
// Poll: remove from the poll waitqueue.
IORING_OP_POLL_ADD => {
// poll_remove is idempotent if already completed.
io_poll_remove(&ctx, entry.user_data);
}
// Timer: cancel the hrtimer.
IORING_OP_TIMEOUT | IORING_OP_LINK_TIMEOUT => {
io_timeout_cancel(&ctx, entry.user_data);
}
// Accept: cancel pending accept on the socket.
IORING_OP_ACCEPT => {
if let Some(ref work) = entry.work {
work.cancel();
}
}
// All other opcodes: the cancellation token is sufficient.
// The completion path checks the token and posts -ECANCELED.
_ => {}
}
cancelled += 1;
}
// Step 3: Wait for all in-flight operations to complete or cancel.
// Event-driven: register on `ctx.exit_wq` (WaitQueue). The I/O
// completion path (`io_uring_cqe_post`) wakes this queue whenever
// `inflight.count()` decrements. This converts O(1s) worst-case
// polling to O(completion_latency) with zero busy-wait overhead.
// Bounded by a 1-second timeout as a safety net.
let deadline = clock_monotonic_ns() + 1_000_000_000; // 1 second
while ctx.inflight.count() > 0 {
if clock_monotonic_ns() > deadline {
log_warn!(
"io_uring exit: {} ops still inflight after 1s, pid={}",
ctx.inflight.count(),
task.pid()
);
// Force-remove remaining entries. Bios will complete
// asynchronously and find Weak<IoRingCtx> → None
// (ring-dead case), discarding the completion.
ctx.inflight.clear();
break;
}
// Sleep until the completion path wakes us (O(completion_latency)),
// or until the 1-second deadline (safety net).
let remaining_ns = deadline.saturating_sub(clock_monotonic_ns());
ctx.exit_wq.wait_timeout(Duration::from_nanos(remaining_ns));
}
// Step 4: Release fixed buffer registrations (IORING_REGISTER_BUFFERS).
// Unpins pages, tears down DMA mappings.
// Acquires the SpinLock protecting registered_buffers (interior mutability
// through Arc<IoRingCtx>). Concurrent READ_FIXED/WRITE_FIXED are rejected
// by the IORING_CTX_DYING check in io_uring_enter() before reaching here.
{
let mut bufs = ctx.registered_buffers.lock();
if let Some(ref mut buffers) = *bufs {
for buf in buffers.drain(..) {
buf.unpin_and_unmap();
}
}
*bufs = None;
}
// Step 5: Release fixed file registrations (IORING_REGISTER_FILES).
// Drops the Arc<File> references, allowing the files to be closed.
{
let mut files_guard = ctx.registered_files.lock();
if let Some(ref mut files) = *files_guard {
for slot in files.iter_mut() {
*slot = None;
}
}
*files_guard = None;
}
// Step 6: Cancel SQPOLL thread if active. The thread holds an
// Arc<IoRingCtx> and will exit on its next sqpoll_tick() when it
// observes IORING_CTX_DYING.
if let Some(ref sqpoll) = ctx.sqpoll {
sqpoll.thread.wake(); // Wake from idle to observe DYING flag.
}
}
}
/// Flag bit set in `IoRingCtx.state_flags` to indicate the ring is dying.
/// Prevents new SQE submissions from io_uring_enter().
const IORING_CTX_DYING: u32 = 1 << 0;
/// Cancellation token value for dying-task cleanup.
const CANCEL_DYING: u8 = 2;
Relationship to IoRingCtx::drop(): io_uring_files_cancel() cancels in-flight
operations and releases registrations, but does NOT free the ring itself. The
IoRingCtx is owned by the fd table (via Arc in File::private_data). When
close_files() (Step 5) closes the io_uring fd, Arc::drop runs IoRingCtx::drop(),
which frees the SQ/CQ ring pages, SQPOLL thread (if any), and the context struct.
By this point, all in-flight operations are already cancelled — drop() only needs
to free memory, not wait for I/O.
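Under these invariants, drop() reduces to pure resource freeing. A hedged sketch (field and helper names such as free_ring_pages are illustrative assumptions):

```rust
// Sketch of IoRingCtx::drop — runs when close_files() drops the last Arc.
// By this point io_uring_files_cancel() has already drained in-flight
// operations, so drop only frees memory. Names are illustrative.
impl Drop for IoRingCtx {
    fn drop(&mut self) {
        // Invariant established by io_uring_files_cancel() (Step 3e).
        debug_assert_eq!(self.inflight.count(), 0);
        if let Some(sqpoll) = self.sqpoll.take() {
            // The thread has observed IORING_CTX_DYING and left its loop;
            // dropping the TaskRef joins and frees the kernel thread.
            drop(sqpoll.thread);
        }
        self.free_ring_pages(); // SQ/CQ ring pages back to the allocator.
        // registered_buffers / registered_files were already cleared in
        // io_uring_files_cancel() Steps 4 and 5.
    }
}
```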
SQPOLL thread lifecycle: If the io_uring instance has an SQPOLL thread
(IORING_SETUP_SQPOLL), the thread observes IORING_CTX_DYING on its next
sqpoll_tick() call and exits its main loop. The TaskRef in IoRingSqpoll is
dropped during IoRingCtx::drop() (Step 5), which joins and frees the kernel thread.
The SQPOLL thread does not hold file references — it reads SQEs from the ring and
dispatches them, but the file references are resolved at dispatch time from the
ring's registered file table (already released in Step 5 of io_uring_files_cancel)
or from the task's fd table (still open until close_files()).
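The SQPOLL exit path described above can be sketched as a main loop (sqpoll_tick and sqpoll_idle_wait are illustrative names, not the spec's exact API):

```rust
// Sketch of the SQPOLL kernel thread's main loop. The thread holds an
// Arc<IoRingCtx> and exits once it observes IORING_CTX_DYING.
fn sqpoll_main(ctx: Arc<IoRingCtx>) {
    loop {
        if ctx.state_flags.load(Acquire) & IORING_CTX_DYING != 0 {
            break; // Ring is dying; the TaskRef is dropped in IoRingCtx::drop().
        }
        let submitted = ctx.sqpoll_tick(); // Drain and dispatch pending SQEs.
        if submitted == 0 {
            // Idle: sleep until io_uring_enter() or the exit path wakes us
            // (Step 6 of io_uring_files_cancel calls thread.wake()).
            ctx.sqpoll_idle_wait();
        }
    }
}
```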
19.3.5 Direct I/O Operations (O_DIRECT)¶
Direct I/O bypasses the page cache and transfers data directly between user buffers and storage devices. This is the primary I/O path for databases (PostgreSQL, MySQL, RocksDB), key-value stores, and high-performance storage applications that manage their own caching. io_uring + O_DIRECT is the dominant high-performance I/O pattern on Linux; UmkaOS specifies it as a first-class path.
/// Filesystem direct I/O operations. Bypasses the page cache entirely.
///
/// Called from io_uring `IORING_OP_READ` / `IORING_OP_WRITE` (and their
/// `_FIXED` variants) when the file descriptor has `O_DIRECT` set. Also
/// called from synchronous `read()` / `write()` with `O_DIRECT`.
///
/// Implementations are provided by each filesystem that supports direct I/O
/// (ext4, XFS, btrfs, etc.) and registered via `InodeOps`. Filesystems that
/// do not support direct I/O (tmpfs, procfs, sysfs) return `EINVAL` when
/// `O_DIRECT` is specified at `open()` time.
pub trait DirectIoOps: Send + Sync {
/// Read directly from storage into the caller's buffer.
///
/// # Arguments
/// - `inode`: The file's inode (provides extent mapping via `IomapOps`).
/// - `offset`: Byte offset within the file to read from. Must be aligned
/// to the block device's logical sector size (typically 512 bytes; 4096
/// for 4Kn drives). Misaligned offset returns `EINVAL`.
/// - `buf`: Destination buffer. The buffer's physical address must be
/// aligned to the logical sector size. For io_uring registered buffers
/// (`IORING_OP_READ_FIXED`), alignment is guaranteed at registration
/// time. For non-registered buffers, the kernel checks alignment and
/// falls back to buffered I/O if misaligned (see fallback policy below).
///
/// # Returns
/// - `Ok(n)`: Number of bytes read. May be less than `buf.len()` if the
/// read extends past EOF (short read). Zero if `offset >= file_size`.
/// - `Err(IoError::InvalidAlignment)`: Buffer or offset not aligned to
/// logical sector size, and fallback to buffered I/O is disabled
/// (see `DIO_ALIGN_STRICT` below).
/// - `Err(IoError::Io)`: Storage device I/O error.
///
/// # Cache coherency
/// Before initiating the direct read, the kernel calls
/// `invalidate_inode_pages2_range(inode, offset, offset + len)` to evict
/// any cached pages in the page cache that overlap the read range. This
/// ensures the direct read returns data from storage, not stale cached
/// data. The invalidation is a no-op if no cached pages exist for the
/// range (common case for O_DIRECT-only workloads).
///
/// If a cached page in the range is dirty (modified by a buffered write
/// but not yet written back), the invalidation forces a writeback before
/// eviction, ensuring the direct read sees the most recent data.
fn read_direct(
&self,
inode: &Inode,
offset: u64,
buf: &mut [u8],
) -> Result<usize, IoError>;
/// Write directly from the caller's buffer to storage.
///
/// # Arguments
/// - `inode`: The file's inode.
/// - `offset`: Byte offset within the file. Alignment requirements are
/// the same as `read_direct`.
/// - `buf`: Source buffer containing the data to write. Alignment
/// requirements are the same as `read_direct`.
/// - `flags`: Write behavior flags.
///
/// # Returns
/// - `Ok(n)`: Number of bytes written. Equal to `buf.len()` on success
/// (direct writes are all-or-nothing at the bio level; partial writes
/// indicate a device error on the unwritten portion).
/// - `Err(IoError::InvalidAlignment)`: Buffer or offset misaligned.
/// - `Err(IoError::NoSpace)`: Filesystem is full (ENOSPC).
/// - `Err(IoError::Io)`: Storage device I/O error.
///
/// # Cache coherency
/// After the direct write completes, the kernel calls
/// `invalidate_inode_pages2_range(inode, offset, offset + len)` to evict
/// any cached pages overlapping the written range. This ensures subsequent
/// buffered reads see the data written by the direct write, not stale
/// cached copies. The invalidation is performed after the write (not before)
/// because the authoritative data is now on storage.
///
/// If `WriteFlags::DSYNC` is set, the write is guaranteed durable on
/// return (equivalent to `fdatasync()` for the written range). The
/// filesystem issues a storage flush/FUA after the data write completes.
fn write_direct(
&self,
inode: &Inode,
offset: u64,
buf: &[u8],
flags: WriteFlags,
) -> Result<usize, IoError>;
}
bitflags! {
/// Flags controlling direct I/O write behavior.
pub struct WriteFlags: u32 {
/// Data-sync: ensure written data (not necessarily metadata) is durable
/// on return. Maps to REQ_FUA at the block layer.
const DSYNC = 1 << 0;
/// File-sync: ensure both data and metadata are durable. More expensive
/// than DSYNC (requires metadata journal flush on journaling filesystems).
const SYNC = 1 << 1;
/// Append mode: writes are always appended to the end of the file
/// regardless of the offset parameter. The actual write offset is
/// returned in the CQE result for io_uring callers.
const APPEND = 1 << 2;
/// No wait: return EAGAIN immediately if the write would block on
/// extent allocation or journal space. Used by io_uring IOSQE_ASYNC
/// to avoid blocking the submission thread.
const NOWAIT = 1 << 3;
}
}
Alignment requirements and fallback policy:
| Parameter | Requirement |
|---|---|
| Buffer address | Aligned to block device logical sector size (512 or 4096 bytes) |
| File offset | Aligned to block device logical sector size |
| Transfer length | Multiple of block device logical sector size |
If any alignment requirement is violated:
- Default behavior (DIO_ALIGN_FALLBACK, the default): The kernel silently falls back to buffered I/O. The operation succeeds but goes through the page cache. This matches Linux's behavior and is required for compatibility — many applications open with O_DIRECT but occasionally issue misaligned I/O (e.g., reading the last partial block of a file).
- Strict mode (DIO_ALIGN_STRICT, per-fd via fcntl(F_SETFL, O_DIRECT_STRICT)): Returns EINVAL on misaligned requests. Used by applications that want to guarantee they never accidentally fall back to buffered I/O (databases that audit their I/O alignment). O_DIRECT_STRICT is a UmkaOS extension; Linux applications that do not use it get the fallback behavior automatically.
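The alignment check itself is pure bit arithmetic. A minimal runnable sketch (the function name dio_aligned and the 512-byte example sector size are illustrative, not the spec's API):

```rust
/// Returns true if a direct I/O request satisfies all three alignment
/// requirements (buffer address, file offset, transfer length) for the
/// given logical sector size, which is always a power of two.
fn dio_aligned(buf_addr: u64, offset: u64, len: u64, sector: u64) -> bool {
    debug_assert!(sector.is_power_of_two());
    let mask = sector - 1;
    buf_addr & mask == 0 && offset & mask == 0 && len & mask == 0
}

fn main() {
    // Sector-aligned buffer, offset, and length: eligible for direct I/O.
    assert!(dio_aligned(0x1000, 512, 4096, 512));
    // Misaligned offset: buffered fallback (default) or EINVAL (strict mode).
    assert!(!dio_aligned(0x1000, 100, 4096, 512));
}
```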
io_uring integration specifics:
- IORING_OP_READ_FIXED / IORING_OP_WRITE_FIXED with O_DIRECT: The registered buffer's DMA mapping (established at IORING_REGISTER_BUFFERS time) is reused directly. No per-operation dma_map/dma_unmap calls. This is the fastest I/O path: userspace buffer → pre-mapped IOVA → NVMe SQ entry → completion.
- IORING_OP_READ / IORING_OP_WRITE with O_DIRECT: The user buffer must be pinned and DMA-mapped per operation. The io_uring async worker handles the pinning (via get_user_pages_fast()) and DMA mapping. Registered buffers avoid this overhead.
- Concurrent direct I/O and buffered I/O: UmkaOS permits concurrent direct and buffered I/O to the same file (matching Linux). The cache coherency invalidations in read_direct/write_direct ensure consistency. However, concurrent buffered writes and direct reads to overlapping regions have inherently racy semantics — the direct read may see pre-write or post-write data depending on timing. Applications that mix I/O modes must use their own synchronization (e.g., fsync() between buffered writes and direct reads). This is the same behavior as Linux; POSIX does not define ordering between O_DIRECT and buffered I/O to the same file.
19.4 Futex and Userspace Synchronization¶
19.4.1 Futex Implementation¶
The futex(2) syscall is the kernel-side primitive underlying all userspace synchronization:
glibc pthread_mutex_lock, pthread_cond_wait, sem_wait, and C++ std::mutex all
compile down to futex operations. Understanding futex is essential because the fast path
never enters the kernel at all -- an uncontended lock is a single atomic compare-and-swap
on a shared memory word, entirely in userspace. The kernel is only involved when a thread
must sleep (FUTEX_WAIT) or wake sleeping threads (FUTEX_WAKE).
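The fast path can be made concrete with the classic three-state futex mutex (0 = unlocked, 1 = locked, 2 = locked with waiters), a well-known userspace design. This is a hedged, self-contained sketch: futex_wait and futex_wake are stub placeholders for the real syscall, and single-threaded use exercises only the syscall-free fast path.

```rust
use std::sync::atomic::{AtomicU32, Ordering::{Acquire, Release, SeqCst}};

// Stubs standing in for the futex(2) syscall. A real implementation
// blocks/wakes in the kernel; the uncontended path never reaches them.
fn futex_wait(_word: &AtomicU32, _expected: u32) {}
fn futex_wake(_word: &AtomicU32, _n: u32) {}

/// Three-state futex mutex: 0 = unlocked, 1 = locked, 2 = contended.
struct FutexMutex { state: AtomicU32 }

impl FutexMutex {
    const fn new() -> Self { Self { state: AtomicU32::new(0) } }

    fn lock(&self) {
        // Fast path: uncontended acquire is one CAS, no syscall.
        if self.state.compare_exchange(0, 1, Acquire, Acquire).is_ok() {
            return;
        }
        // Slow path: mark contended, sleep until the holder wakes us.
        while self.state.swap(2, Acquire) != 0 {
            futex_wait(&self.state, 2); // FUTEX_WAIT: sleep if *uaddr == 2
        }
    }

    fn unlock(&self) {
        // Fast path: state was 1 (no waiters), a plain release store suffices.
        if self.state.swap(0, Release) == 2 {
            futex_wake(&self.state, 1); // FUTEX_WAKE: wake one waiter
        }
    }
}

fn main() {
    let m = FutexMutex::new();
    m.lock();   // fast path: CAS 0 → 1, no kernel entry
    m.unlock(); // fast path: swap back to 0, no wake needed
    assert_eq!(m.state.load(SeqCst), 0);
}
```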
UmkaOS implements the following futex operations:
| Operation | Description |
|---|---|
| FUTEX_WAIT | Block if *uaddr == val (avoids lost-wakeup race) |
| FUTEX_WAKE | Wake up to N waiters on uaddr |
| FUTEX_WAIT_BITSET | WAIT with 32-bit bitmask for selective wakeup |
| FUTEX_WAKE_BITSET | WAKE with bitmask (only wake waiters whose mask overlaps) |
| FUTEX_REQUEUE | Move waiters from one futex to another (condition variables) |
| FUTEX_CMP_REQUEUE | Requeue with value check (prevents lost wakeups during cond broadcast) |
| FUTEX_WAKE_OP | Atomic wake + modify: atomically reads old value from *uaddr2, applies op(old, oparg), writes result. Then wakes up to val waiters on uaddr1 and (if cmp(old, cmparg) is true) up to val2 waiters on uaddr2. The op arg encodes: oparg (12 bits), cmparg (12 bits), op (4 bits: SET/ADD/OR/ANDN/XOR), cmp (4 bits: EQ/NE/LT/LE/GT/GE). Optimizes pthread_cond_signal + mutex_unlock into a single syscall. |
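The FUTEX_WAKE_OP operand packing in the table can be illustrated with a small encoder. This follows the Linux FUTEX_OP bit layout (op in bits 28-31, cmp in bits 24-27, oparg in bits 12-23, cmparg in bits 0-11); the helper name futex_op is illustrative:

```rust
/// Pack a FUTEX_WAKE_OP operation word using the Linux FUTEX_OP layout:
/// op[31:28] | cmp[27:24] | oparg[23:12] | cmparg[11:0].
fn futex_op(op: u32, oparg: u32, cmp: u32, cmparg: u32) -> u32 {
    ((op & 0xf) << 28) | ((cmp & 0xf) << 24)
        | ((oparg & 0xfff) << 12) | (cmparg & 0xfff)
}

const FUTEX_OP_SET: u32 = 0;    // op: *uaddr2 = oparg
const FUTEX_OP_CMP_GT: u32 = 4; // cmp: wake uaddr2 waiters if old > cmparg

fn main() {
    // "Set *uaddr2 to 1; wake uaddr2 waiters if the old value was > 0":
    // the kind of fused signal+unlock word glibc builds for condvars.
    let word = futex_op(FUTEX_OP_SET, 1, FUTEX_OP_CMP_GT, 0);
    assert_eq!(word >> 28, FUTEX_OP_SET);
    assert_eq!((word >> 24) & 0xf, FUTEX_OP_CMP_GT);
    assert_eq!((word >> 12) & 0xfff, 1); // oparg
    assert_eq!(word & 0xfff, 0);         // cmparg
}
```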
The futex wait queue is organized as a hash table keyed by a FutexKey: (address space, virtual address) for private futexes, (physical frame, offset) for shared ones.
Each bucket contains a linked list of waiting tasks:
/// Futex hash key. Combines a key kind with an offset to uniquely identify a futex.
///
/// For **private futexes** (the common case, ~99% of mutex uses): the key is
/// (mm_id, page-aligned vaddr, offset within page). The `offset` field is
/// redundant with vaddr's low bits but kept for uniformity with the shared case.
///
/// For **shared futexes** (MAP_SHARED): the key is (physical page frame, offset
/// within page). Both processes sharing the mapping hash to the same bucket and
/// match on the same (PhysFrame, offset) pair, even if their virtual addresses differ.
///
/// **Matching rule**: Two FutexKeys match iff (kind == kind) AND (offset == offset).
/// For Private, kind equality means same mm_id and same vaddr. For Shared, kind
/// equality means same PhysFrame. The offset is ALWAYS part of the match.
pub struct FutexKey {
kind: FutexKeyKind,
/// Offset within the 4K page (0..=4095). For private futexes, this equals
/// (vaddr & 0xFFF). For shared futexes, this is the offset into the physical
/// page. Critical for correctness: multiple futexes on the same page must NOT
/// collide (they have different offsets).
offset: u32,
}
impl FutexKey {
/// Hash the key for bucket distribution. Uses SipHash-1-3 (half-round SipHash)
/// with a per-boot random key for security against hash-flooding attacks.
/// The hash combines the discriminant, kind-specific fields, and offset
/// into a 64-bit value suitable for power-of-2 bucket masking.
///
/// For Private: hash(mm_id, page_aligned_vaddr, offset).
/// For Shared: hash(PhysFrame, offset).
///
/// Per-boot key initialized from `get_random_bytes()` during futex subsystem
/// init (before any userspace runs).
pub fn hash(&self) -> u64 {
// FUTEX_HASH_KEY: SipKey initialized at boot from CSPRNG.
let mut h = SipHasher13::new_with_key(&FUTEX_HASH_KEY);
match &self.kind {
FutexKeyKind::Private { mm_id, vaddr } => {
h.write_u8(0); // discriminant
h.write_u64(mm_id.0);
h.write_u64(vaddr.as_u64());
}
FutexKeyKind::Shared { page } => {
h.write_u8(1);
h.write_u64(page.as_u64());
}
}
h.write_u32(self.offset);
h.finish()
}
}
pub enum FutexKeyKind {
/// Private mapping: keyed by (address space, page-aligned virtual address).
/// The offset field in FutexKey provides the intra-page position.
Private { mm_id: MmId, vaddr: VirtAddr },
/// Shared mapping: keyed by physical page frame.
/// The offset field in FutexKey provides the intra-page position.
/// This ensures processes mapping the same file/shm at different virtual
/// addresses still wake each other correctly.
Shared { page: PhysFrame },
}
/// A futex waiter node, embedded in the Task struct (Section 8.1.1).
/// Uses intrusive singly-linked linking to avoid heap allocation under spinlock.
/// A task can wait on at most one futex at a time (futex_wait is blocking),
/// so a single embedded FutexWaiter per task is sufficient.
///
/// **Why singly-linked**: A doubly-linked intrusive list requires atomically
/// updating both `prev` and `next` pointers on unlink. No single CAS can cover
/// both — a CAS on `next` alone corrupts the `prev` chain, making any
/// lock-free doubly-linked-list unlink unsound in the general case. A
/// singly-linked list with O(n) unlink under the bucket spinlock is correct,
/// simple, and fast in practice: futex bucket contention lists rarely exceed a
/// handful of waiters.
pub struct FutexWaiter {
/// Intrusive singly-linked list pointer. `None` = list end (not in any bucket).
/// All mutations are performed while holding the owning bucket's spinlock,
/// which provides the necessary synchronization. `AtomicPtr` is unnecessary
/// because no lock-free access pattern exists — the bucket spinlock serializes
/// all reads and writes. `Option<NonNull<_>>` matches the type used by the
/// bucket's `head` field and by traversal code (eliminating casts).
pub next: Option<NonNull<FutexWaiter>>,
/// The futex key this waiter is blocked on (for requeue and wake filtering).
pub key: FutexKey,
/// Bitset for FUTEX_WAIT_BITSET selective wakeup (0xFFFF_FFFF = match all).
pub bitset: u32,
/// Back-pointer to the owning Task (for wake-up scheduling).
/// Valid for the lifetime of the owning Task. `FutexWaiter` is embedded
/// in the `Task` struct, so the Task always outlives the waiter.
/// Dereferenced only under the `FutexBucket` spinlock.
pub task: *const Task,
/// Wakeup state. Transitions under the bucket spinlock so the waiter
/// and the waker agree on who performed the wakeup.
pub state: WaiterState,
}
// SAFETY: All mutations of `FutexWaiter` fields are performed while holding
// the owning `FutexBucket`'s spinlock. `FutexWaiter` contains raw pointers
// (`NonNull`, `*const Task`) that are not `Send`/`Sync` by default. The bucket
// spinlock provides the required exclusive access guarantee. A `FutexWaiter`
// embedded in a `Task` may be observed from multiple CPUs (e.g., by
// `futex_exit_cleanup` racing with `futex_wake`), but both paths acquire the
// bucket spinlock first.
//
// `Send` is required because a `FutexWaiter` allocated on CPU 0's task stack
// is accessed by `futex_wake()` running on CPU 1 under the bucket lock.
// `SpinLock<FutexBucketInner>` requires `FutexBucketInner: Send` for the
// guard to be `Send`; since `FutexBucketInner` contains
// `Option<NonNull<FutexWaiter>>`, this transitively requires `FutexWaiter: Send`.
unsafe impl Send for FutexWaiter {}
unsafe impl Sync for FutexWaiter {}
/// Each bucket is protected by its own spinlock — contention is spread
/// across the table rather than funneled through a single lock.
///
/// Waiter lists use an intrusive singly-linked list (not `Vec`) to avoid heap
/// allocation under spinlock. FutexWaiter nodes are embedded in the
/// task struct (Section 8.1.1, `futex_waiter` field). Insertion is O(1) at
/// the head; removal is O(n) linear scan from the head under the bucket lock.
/// This is correct and fast in practice: futex wait lists are rarely longer
/// than a few entries even under heavy concurrent workloads.
///
/// **Lock hierarchy level**: FUTEX_BUCKET (level 0). This is BELOW all scheduler
/// locks so that futex_wake can safely call scheduler::enqueue() while holding
/// a bucket lock. The authoritative lock ordering from [Section 3.5](03-concurrency.md#locking-strategy) is:
/// FUTEX_BUCKET (level 0) < RT_MUTEX (level 10) < TASK_LOCK (level 20) < PI_LOCK (level 45) < RQ_LOCK (level 50).
/// Futex bucket locks are at level 0, allowing the following valid acquisition:
/// 1. Acquire FUTEX_BUCKET (level 0)
/// 2. Set waiter.state = Woken
/// 3. Unlink waiter from the bucket list (under FUTEX_BUCKET)
/// 4. Release FUTEX_BUCKET
/// 5. Call scheduler::enqueue() — no bucket lock held; scheduler acquires
/// TASK_LOCK (level 20) — valid: level 0 was already released
///
/// **Unlink BEFORE enqueue**: The unlink step must happen under the bucket lock
/// BEFORE enqueue is called. This prevents futex_exit_cleanup() from seeing a
/// waiter whose state is Woken but which has not yet been unlinked from the list,
/// which would cause a double-unlink. See futex_exit_cleanup() below.
pub struct FutexBucket {
/// The SpinLock wraps the waiter list head directly, so the Rust type
/// system enforces lock-before-access: callers must acquire the guard
/// before reading or writing `head`. The previous layout had `head`
/// outside the SpinLock data (`SpinLock<()>`), which allowed unsound
/// access to `head` without holding the lock.
inner: SpinLock<FutexBucketInner, FUTEX_BUCKET>,
}
/// Lock-protected interior of a futex bucket.
pub struct FutexBucketInner {
/// Head of the singly-linked waiter list. `None` when the bucket is empty.
pub head: Option<NonNull<FutexWaiter>>,
}
/// Lock level for futex bucket locks. Below RT_MUTEX (level 10) and TASK_LOCK (level 20) to allow
/// futex_wake → scheduler::enqueue() without lock ordering violation.
pub const FUTEX_BUCKET: LockLevel = LockLevel(0);
/// Per-NUMA-node futex hash table.
///
/// Why per-NUMA: On a 4-socket NUMA machine, a single 256-bucket global hash
/// table causes cross-NUMA cache line bouncing on every futex_wait/wake. With
/// per-NUMA tables, the hash table spinlock and bucket entries live on the same
/// NUMA node as the waiting CPU (for private futexes) or the physical page
/// (for shared futexes) — no cross-NUMA traffic on the common path.
pub struct FutexNumaNode {
/// Variable number of buckets per NUMA node (256, 1024, 4096, or 16384
/// depending on node memory; see `futex_hash_size`), each with its own
/// spinlock. Allocated with numa_alloc_onnode() — bytes live on this node.
buckets: Box<[FutexBucket]>,
}
/// Global futex subsystem — one FutexNumaNode per NUMA node.
pub struct FutexSystem {
/// Indexed by NUMA node ID (0..num_numa_nodes).
/// Allocated once at boot, never resized. `Box<[...]>` instead of `Vec`
/// because the NUMA topology is fixed after boot.
nodes: Box<[FutexNumaNode]>,
}
/// How to select the NUMA node for a futex operation:
///
/// **Shared futexes** (key is physical_page + offset):
/// node = physical_page.numa_node()
/// → Both futex_wait and futex_wake resolve the physical page → same NUMA node
/// → No cross-node ambiguity even when waker and waiter are on different nodes
///
/// **Private futexes** (key is mm + vaddr, FUTEX_PRIVATE_FLAG set):
/// futex_wait: node = mm.owner_numa_node()
/// futex_wake: node = mm.owner_numa_node()
/// → Both sides compute the NUMA node from the mm's owner (the process's
/// primary thread group NUMA affinity), which is deterministic and the
/// same for any thread in the process, regardless of which CPU issues
/// the wait or wake. Cross-node misses are minimized for processes whose
/// threads run on the mm's home NUMA node. Processes with threads spanning
/// multiple NUMA nodes may experience cross-node hash misses on the futex
/// bucket lookup, but correctness is unaffected — the hash is deterministic
/// and both wait and wake always resolve to the same node.
/// → This is an UmkaOS improvement over the naive per-CPU NUMA selection used
/// in some Linux configurations, which can cause lost wakeups when waiter
/// and waker run on CPUs in different NUMA nodes.
impl FutexSystem {
fn select_node_shared(physical_page: PhysFrame) -> usize {
physical_page.numa_node()
}
fn select_node_private(mm: &MemoryMap) -> usize {
// Use home node of the mm's primary thread group.
// MemoryMap::owner_numa_node() returns the NUMA node of the mm's
// owning task (task_struct.mm_struct owner). This is the NUMA node
// where the process was initially placed (set at fork time from
// the parent's CPU affinity). It is deterministic and the same for
// all threads sharing this mm, making it suitable as the futex hash
// table node selector.
mm.owner_numa_node()
}
fn bucket_index(key: &FutexKey, buckets: usize) -> usize {
// buckets is always a power of 2 (from futex_hash_size()), so use bitmasking
// instead of modulo for O(1) distribution. The caller passes
// node.futex_buckets.len() so the index is always in-range for that
// node's actual table, which varies from 256 to 16384 depending on
// per-node memory (see futex_hash_size()). Passing a fixed constant
// here would silently ignore 75–99% of buckets on large NUMA nodes.
debug_assert!(buckets.is_power_of_two());
let h = key.hash();
(h ^ (h >> 8)) as usize & (buckets - 1)
}
}
/// Futex hash table sizing (buckets per NUMA node). Scaled at boot based on
/// per-node memory:
/// - ≤1 GB: 256 buckets
/// - ≤16 GB: 1024 buckets
/// - ≤256 GB: 4096 buckets
/// - >256 GB: 16384 buckets
/// This matches Linux's scaling heuristic (futex_init in kernel/futex/core.c),
/// applied independently per NUMA node so large nodes get proportionally more
/// buckets while small nodes don't waste memory.
pub const fn futex_hash_size(node_memory_bytes: usize) -> usize {
match node_memory_bytes {
0..=0x4000_0000 => 256, // ≤1 GB
0x4000_0001..=0x4_0000_0000 => 1024, // >1 GB, ≤16 GB
0x4_0000_0001..=0x40_0000_0000 => 4096, // >16 GB, ≤256 GB
_ => 16384, // >256 GB
}
}
Design note: UmkaOS's futex hash table scales with NUMA node memory (256–16,384 buckets per node, selected at boot based on available memory). This is intentionally superior to the historical Linux fixed-256-bucket design that was a known DoS vector (exploited via hash collision floods). That flaw was partially addressed in Linux 3.13; UmkaOS's adaptive design eliminates the bottleneck entirely by construction.
Why Linux didn't do this: Linux's futex code predates widespread NUMA awareness (2002). The
256-bucket global table was later expanded to min(256 × cpus, 8192) but remained global. The
physical-page-to-NUMA-node lookup adds a page table walk on every futex operation — Linux
considered this overhead not worth the benefit. UmkaOS implements it from the start (no legacy
constraint) and the lookup is O(1) via the page's embedded numa_node field in PhysPage.
The hash table size per NUMA node is determined at boot based on that node's available
memory (see futex_hash_size above). FUTEX_WAIT atomically checks *uaddr == val
while holding the bucket lock, closing the race window between the userspace check and
the kernel enqueue. The NUMA node is selected before acquiring any lock: shared futexes
use physical_page.numa_node(), private futexes use mm.owner_numa_node()
(deterministic, same node for both wait and wake).
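The check-under-lock sequence can be sketched as follows; select_numa_node, read_user_u32, enqueue_waiter, and scheduler::block are illustrative helper names, not the spec's exact API:

```rust
// Sketch of the FUTEX_WAIT path. The userspace word is re-read while
// holding the bucket lock, closing the race between the userspace check
// and the kernel enqueue. Helper names are illustrative assumptions.
fn futex_wait(key: FutexKey, uaddr: VirtAddr, val: u32) -> Result<(), Errno> {
    let node = select_numa_node(&key);  // NUMA node chosen before any lock
    let bucket = node.bucket_for(&key);
    let mut guard = bucket.inner.lock(); // FUTEX_BUCKET, level 0
    // Atomic re-check under the bucket lock: if the value changed since
    // userspace observed it, a wake already happened — do not sleep.
    if read_user_u32(uaddr)? != val {
        return Err(Errno::EAGAIN);
    }
    enqueue_waiter(&mut guard, current_task(), key); // O(1) at list head
    drop(guard);        // release the bucket lock, then...
    scheduler::block(); // ...sleep until futex_wake enqueues us
    Ok(())
}
```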
Task exit unlink: When a task exits while in a futex wait queue, it acquires
the bucket spinlock and removes itself via a linear scan from the head. This is the
same approach as Linux (hash_bucket->lock in kernel/futex/core.c) and is
correct by construction: the spinlock serializes all concurrent wait/wake/exit
operations on the same bucket.
/// Waiter lifecycle state. Transitions are made under the owning bucket's
/// spinlock so that futex_wake() and futex_exit_cleanup() cannot race.
pub enum WaiterState {
/// Inserted in the bucket's wait list; the task is blocked.
Waiting,
/// futex_wake() has selected this waiter, unlinked it from the bucket list,
/// and called (or is about to call) scheduler::enqueue(). Both the state
/// transition and the unlink happen under the bucket spinlock; enqueue()
/// is called after releasing the lock.
Woken,
}
/// Called from the task exit path when the task may be sitting in a futex
/// wait queue. Acquires the bucket spinlock, removes the waiter node from
/// the singly-linked list (O(n) scan from head), and checks state to
/// detect a concurrent futex_wake() that has already selected this waiter.
///
/// **Race safety**: futex_wake() sets state = Woken AND unlinks the waiter
/// from the list under the bucket spinlock, then releases the lock BEFORE
/// calling scheduler::enqueue(). Therefore, when futex_exit_cleanup() acquires
/// the bucket lock, the waiter is either:
/// (a) still in the list with state == Waiting → exit cleanup unlinks it, or
/// (b) already unlinked with state == Woken → exit cleanup does nothing.
/// There is no window where state == Woken but the node is still in the list.
///
/// Unlink algorithm:
/// 1. Acquire bucket.lock (spinlock).
/// 2. If waiter.state == Woken: another CPU already unlinked us (under the
/// bucket lock) and will call scheduler::enqueue() after releasing it.
/// Release lock and consume the wakeup — no unlink needed.
/// 3. Otherwise (Waiting): walk the singly-linked list from bucket.head,
/// find the predecessor whose next == &waiter, set predecessor.next =
/// waiter.next (or update bucket.head if we are the first node).
/// 4. Null out waiter.next to leave the node in a clean state.
/// 5. Release bucket.lock.
fn futex_exit_cleanup(bucket: &FutexBucket, waiter: &mut FutexWaiter) {
// FutexBucket was refactored: the waiter list head lives inside
// SpinLock<FutexBucketInner>. Acquire the inner lock and access
// the head through the guard.
let mut guard = bucket.inner.lock();
if waiter.state == WaiterState::Woken {
// futex_wake() already unlinked us and scheduled a wakeup.
// Nothing left to do — the wakeup is consumed by the exit itself.
return;
}
// Linear scan to find and splice out this waiter node.
// SAFETY: All pointers in the list are valid FutexWaiter nodes embedded
// in live Task structs. The bucket spinlock prevents concurrent mutation.
let target = NonNull::from(waiter as &FutexWaiter);
let mut cursor: *mut Option<NonNull<FutexWaiter>> = &raw mut guard.head;
loop {
// SAFETY: cursor always points to a valid head or next field.
match unsafe { &mut *cursor } {
None => {
// Waiter not found — should be unreachable if caller is correct.
debug_assert!(false, "futex_exit_cleanup: waiter not in bucket list");
break;
}
Some(node) if *node == target => {
// Found our node. Splice it out.
// SAFETY: node is a valid FutexWaiter embedded in a live Task.
unsafe { *cursor = (*node.as_ptr()).next.take() };
break;
}
Some(node) => {
// SAFETY: node is a valid FutexWaiter.
cursor = unsafe { &raw mut (*node.as_ptr()).next };
}
}
}
}
The bucket spinlock is already acquired for every futex_wait and futex_wake operation, so acquiring it on task exit adds no new lock ordering concern (level 0, below TASK_LOCK at level 20). Futex wait lists are short in practice — rarely more than a handful of waiters per bucket even under JVM or Go runtime thread-heavy workloads — so the O(n) scan adds negligible cost to an already-infrequent per-task-exit operation.
19.4.2 Priority-Inheritance Futexes (PI)¶
Linux problem: Priority inversion occurs when a high-priority RT task blocks on a mutex held by a low-priority task, while a medium-priority task preempts the lock holder indefinitely. Without intervention, the RT task's latency becomes unbounded.
UmkaOS design: FUTEX_LOCK_PI and FUTEX_UNLOCK_PI implement kernel-mediated priority inheritance. When an RT task (priority 99) blocks on a PI futex held by a normal task (nice 0), the kernel temporarily boosts the lock holder to priority 99 so it can complete its critical section without being preempted by medium-priority work.
PI chain tracking handles transitive dependencies: if task A (priority 99) waits on a lock held by B (priority 50), and B waits on a lock held by C (priority 10), the kernel walks the chain and boosts C to priority 99. The chain walk is bounded by a compile-time limit (default: 1024 entries) to prevent runaway traversal.
Deadlock detection falls out naturally: if the chain walk encounters the requesting task
again (A waits on B waits on A), the kernel returns EDEADLK immediately rather than
creating a circular dependency.
PI boosting integrates with all three scheduler classes (Section 7.1): an EEVDF task can be temporarily boosted into the RT class, and a Deadline task's runtime budget is respected even when boosted. When the lock holder releases the PI futex, its effective priority reverts to the highest priority among any remaining PI dependencies (or its base priority if none remain).
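The chain walk and deadlock check above can be sketched as a pure function over a simplified task model. Indices stand in for task pointers, there is no locking, and the priority convention (higher number = higher priority, RT style) is an assumption of this sketch, not the kernel's actual structures:

```rust
// Hypothetical model of the PI chain walk: boost every lock owner along the
// chain to at least the waiter's effective priority, bounded by PI_CHAIN_MAX,
// returning EDEADLK-style errors on a cycle back to the original waiter.
const PI_CHAIN_MAX: usize = 1024;

struct Task {
    base_prio: u8,                   // restored when all PI dependencies drop
    effective_prio: u8,              // possibly boosted
    blocked_on_owner: Option<usize>, // index of the task owning the awaited lock
}

#[derive(Debug, PartialEq)]
enum PiError {
    Deadlock,     // chain loops back to the requesting task
    ChainTooLong, // bounded-traversal limit hit
}

fn pi_boost_chain(tasks: &mut [Task], waiter: usize) -> Result<(), PiError> {
    let boost = tasks[waiter].effective_prio;
    let mut cur = waiter;
    for _ in 0..PI_CHAIN_MAX {
        let next = tasks[cur].blocked_on_owner; // copy out before mutating
        match next {
            None => return Ok(()), // reached a runnable owner: chain complete
            Some(owner) if owner == waiter => return Err(PiError::Deadlock),
            Some(owner) => {
                if tasks[owner].effective_prio < boost {
                    tasks[owner].effective_prio = boost; // transitive boost
                }
                cur = owner;
            }
        }
    }
    Err(PiError::ChainTooLong)
}
```

Running this on the A(99) → B(50) → C(10) chain from the text boosts C to 99; a two-task cycle reports a deadlock instead of looping.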
19.4.3 Robust Futexes¶
Linux problem: If a thread crashes or is killed while holding a futex-based mutex, every other thread waiting on that futex blocks forever. The kernel has no way to know the dead thread held the lock because, in the normal case, the kernel never sees the lock/unlock at all (it is purely userspace).
UmkaOS design (same mechanism as Linux): Each thread maintains a userspace linked list
of currently held robust futex locks. The head of this list is registered with the kernel
via set_robust_list(). On thread exit (voluntary or involuntary), the kernel walks the
robust list and for each entry:
- Sets the FUTEX_OWNER_DIED bit (bit 30) in the futex word.
- Performs a FUTEX_WAKE on that address, waking one waiter.
- The woken thread sees FUTEX_OWNER_DIED, knows the lock state may be inconsistent, and can run recovery logic (or simply re-acquire the lock, clearing the bit).
The robust list walk is bounded (default: 2048 entries) to prevent a malicious thread from pointing the kernel at an enormous or circular list.
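The exit-time walk can be illustrated with a deliberately simplified model in which the robust list is a slice of futex words rather than a chain of userspace pointers. The two bit constants match the Linux ABI; everything else is illustrative:

```rust
// FUTEX_OWNER_DIED and the TID mask per the Linux futex ABI.
const FUTEX_OWNER_DIED: u32 = 0x4000_0000; // bit 30
const FUTEX_TID_MASK: u32 = 0x3fff_ffff;
const ROBUST_LIST_LIMIT: usize = 2048;     // bounded walk

/// Walk a dead thread's robust list: flag every lock it still owns and
/// record the indices that need a FUTEX_WAKE (one waiter each).
fn robust_exit_walk(dead_tid: u32, futexes: &mut [u32]) -> Vec<usize> {
    let mut to_wake = Vec::new();
    for (i, word) in futexes.iter_mut().enumerate().take(ROBUST_LIST_LIMIT) {
        if *word & FUTEX_TID_MASK == dead_tid {
            *word |= FUTEX_OWNER_DIED; // lock state may be inconsistent
            to_wake.push(i);           // wake one waiter on this address
        }
    }
    to_wake
}
```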
19.4.4 futex2 (FUTEX_WAITV)¶
Linux problem: The original futex(2) can only wait on a single address at a time.
Waiting on multiple synchronization objects simultaneously required workarounds like
polling threads or epoll-over-eventfd bridges -- all of which added latency and
complexity.
UmkaOS design: The futex_waitv() syscall (Linux 5.16+, syscall number 449 on
all architectures — new syscalls use unified numbering across architectures)
is supported from day one rather than retrofitted. It accepts an array of
(uaddr, val, flags) tuples and blocks until any one of them is triggered:
/// Matches Linux's `struct futex_waitv` (include/uapi/linux/futex.h).
/// The `uaddr` field is a u64 (not a pointer) to match the Linux ABI exactly —
/// this allows 32-bit processes on 64-bit kernels to pass 32-bit addresses
/// without sign-extension issues. The kernel validates the address and
/// interprets it as a `*const AtomicU32` internally.
pub struct FutexWaitv {
pub val: u64,
pub uaddr: u64, // User virtual address (validated by kernel)
pub flags: u32, // FUTEX_32, FUTEX_PRIVATE_FLAG, etc.
pub __reserved: u32, // Must be zero (Linux ABI compatibility)
}
const_assert!(core::mem::size_of::<FutexWaitv>() == 24);
/// Block until any of the N futex addresses is woken or has a value mismatch.
/// Returns the index of the triggered futex, or -ETIMEDOUT, or -ERESTARTSYS.
pub fn sys_futex_waitv(
waiters: &[FutexWaitv],
flags: u32,
timeout: Option<&Timespec>,
clockid: ClockId,
) -> Result<usize, Errno> { ... }
Primary consumers:
- Wine/Proton: Windows WaitForMultipleObjects maps directly to futex_waitv,
enabling efficient game synchronization without per-object polling threads.
- Event-driven runtimes: Any pattern where a thread must wait on several
independent conditions (e.g., "data ready OR shutdown requested OR timeout").
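A sketch of the value-check pass the kernel performs over the wait set before blocking. `WaitvEntry` is a simplified stand-in for `FutexWaitv` with a slice index in place of a validated user pointer:

```rust
// Simplified model of the sys_futex_waitv pre-block scan.
struct WaitvEntry {
    val: u64,     // expected value
    uaddr: usize, // index into `words`, standing in for a user address
}

/// Scan the wait set once before blocking: if any futex word already
/// differs from its expected value, report that entry's index so the
/// syscall can return immediately instead of sleeping.
fn waitv_precheck(words: &[u32], set: &[WaitvEntry]) -> Option<usize> {
    set.iter().position(|w| u64::from(words[w.uaddr]) != w.val)
}
```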
futex_waitv exit cleanup: A task waiting on multiple futexes has one
FutexWaiter per futex in the wait set. On task exit (SIGKILL), all waiters are
unlinked from their respective hash buckets. The exit path iterates the task's
waitv list (array of bucket references) and removes each waiter under the bucket
spinlock. This is O(N) in the number of waited futexes.
19.4.5 Physical Page Stability for Shared Futex Keys¶
Shared futex keys use FutexKeyKind::Shared { page: PhysFrame } for cross-process
futex sharing. The physical page identity is critical for correct hash bucket selection:
two processes sharing a futex via MAP_SHARED must hash to the same bucket, which
requires the same PhysFrame value. Three kernel subsystems can change a page's
physical address, invalidating in-flight futex keys:
Page migration (NUMA balancing) interaction:
When the NUMA balancing scanner or move_pages(2) migrates a page to a different
NUMA node, the physical frame number changes. Any futex waiters keyed on the old
physical frame would be stranded in the wrong hash bucket — FUTEX_WAKE on the new
frame would miss them.
Protocol:
1. Before migrating a page with potential futex waiters, the migration path calls
futex_key_invalidate(old_phys_frame).
2. futex_key_invalidate() walks all futex hash buckets that could map to
old_phys_frame (one bucket per NUMA node, since shared futexes use
physical_page.numa_node() for node selection) and wakes ALL waiters whose
key matches the old physical frame.
3. Woken waiters re-fault on the new page (the old PTE has been invalidated by
the migration path) and re-issue FUTEX_WAIT with the new physical frame as
key. This re-hashes them into the correct bucket on the destination NUMA node.
4. This is a spurious wakeup from the application's perspective — correct futex
code handles spurious wakeups by re-checking the futex value in a loop
(this is a documented requirement of the futex API).
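The re-check loop that makes these wakeups harmless is the standard futex usage pattern. In this sketch the `wait` closure is a hypothetical injection point standing in for the FUTEX_WAIT syscall, so a spurious wakeup can be simulated:

```rust
use std::sync::atomic::{AtomicU32, Ordering};

/// Correct futex consumer: always re-check the futex word after waking.
/// A migration-induced wakeup simply causes one extra loop iteration.
fn wait_until_changed(word: &AtomicU32, expected: u32, mut wait: impl FnMut()) -> u32 {
    loop {
        let v = word.load(Ordering::Acquire);
        if v != expected {
            return v; // genuine state change
        }
        wait(); // FUTEX_WAIT(word, expected) — may return spuriously
    }
}
```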
Swap interaction:
When a page is swapped out, the physical frame is freed — there is no valid
PhysFrame for the futex key. Active futex waiters on a swapped-out page are
handled as follows:
- The swap-out path calls futex_key_invalidate(old_phys_frame) before unmapping the page, waking all waiters. This is the same wake-all protocol as page migration.
- Woken waiters re-fault on the swapped-out address. The page fault handler initiates swap-in, allocating a new physical frame and loading the page contents from swap.
- After swap-in completes, the waiter re-issues FUTEX_WAIT with the new physical frame. The futex value check (*uaddr == val) catches any races: if another thread modified the futex word while the page was swapped out, FUTEX_WAIT returns -EAGAIN instead of sleeping.
This approach is simpler than maintaining swap-entry-keyed futex buckets. The cost is one extra page fault per swapped-out futex waiter, which is acceptable because swap-out of actively-contended futex pages is rare in practice.
THP (Transparent Huge Page) interaction:
Futex keys always use the base page (4 KiB) physical frame, not the compound page head. When a THP is used for a futex, the key is the specific 4 KiB sub-page within the 2 MiB compound page that contains the futex word:
THP split (when the kernel breaks a 2 MiB THP into 512 individual 4 KiB pages)
does NOT change the physical addresses of the constituent sub-pages — the 512
base pages retain their original physical frame numbers. Therefore, THP split
does not require futex_key_invalidate() and does not generate spurious
wakeups. The futex keys remain valid across the split.
THP collapse (when the kernel promotes 512 contiguous 4 KiB pages into a single 2 MiB THP) similarly preserves physical addresses — the pages are already physically contiguous (that is the precondition for collapse). Futex keys remain valid.
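The arithmetic behind this invariant is simple: the key frame depends only on the futex word's physical address, never on the compound head. A sketch (4 KiB base pages and 2 MiB THPs assumed):

```rust
// Futex keys use the 4 KiB base frame containing the futex word, never the
// 2 MiB compound head — so THP split/collapse cannot change the key.
const PAGE_SHIFT: u64 = 12;          // 4 KiB base pages
const THP_MASK: u64 = (1 << 21) - 1; // offset within a 2 MiB huge page

/// Physical frame number of the base page holding `phys_addr`.
fn futex_key_frame(phys_addr: u64) -> u64 {
    phys_addr >> PAGE_SHIFT
}

/// Sub-page index (0..512) of the futex word within its 2 MiB region.
fn thp_subpage_index(phys_addr: u64) -> u64 {
    (phys_addr & THP_MASK) >> PAGE_SHIFT
}
```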
File truncation interaction:
When a file is truncated (ftruncate(2), unlink(2) via eviction,
fallocate(FALLOC_FL_PUNCH_HOLE)), pages beyond the new EOF are removed from
the page cache via truncate_inode_pages_range()
(Section 14.1). If any process holds a
MAP_SHARED mapping of the truncated region and another thread is blocked in
FUTEX_WAIT on an address within that region, the physical page backing the
futex word is freed — stranding the waiter in a hash bucket keyed on a now-stale
PhysFrame.
Protocol:
1. truncate_inode_pages_range() iterates each page in the truncated range.
Before removing a page from the page cache and freeing it, it calls
futex_key_invalidate(page.phys_frame()).
2. All shared futex waiters keyed on that physical frame are woken with
-EINVAL status (not a spurious wakeup — the futex word's backing memory
no longer exists).
3. The woken waiter's FUTEX_WAIT syscall returns -EINVAL. The application
sees EINVAL, which is the correct error for a futex on unmapped memory.
Well-written futex code checks the return value; glibc's
pthread_cond_wait() propagates this as an error.
4. If the waiter subsequently accesses the truncated address, it receives
SIGBUS (as with any access beyond EOF on a shared mapping).
The cost is one futex_key_invalidate() call per truncated page — bounded
by the truncation range and typically negligible compared to the I/O cost of
page cache teardown.
/// Invalidate all shared futex keys referencing the given physical frame.
/// Called by the page migration, swap-out, and truncation paths before
/// changing or freeing a page's physical frame. Wakes all waiters so they
/// can re-fault and re-key (migration/swap) or observe EINVAL (truncation).
///
/// This function uses a per-page secondary index to avoid scanning all
/// hash buckets. Each physical page that has futex waiters is tracked in
/// `PAGE_FUTEX_INDEX`, mapping page addresses to the bucket indices
/// containing waiters on that page. On invalidation, only those buckets
/// are scanned — typically 1-2 buckets rather than all 256-4096.
///
/// The secondary index is maintained by FUTEX_WAIT (insert on block) and
/// wake/timeout (remove when bucket's last waiter on that page is removed).
pub fn futex_key_invalidate(frame: PhysFrame) {
let numa_node = frame.numa_node();
let node = &FUTEX_SYSTEM.nodes[numa_node];
// Look up which buckets have waiters on this physical page.
// PAGE_FUTEX_INDEX: XArray<PhysAddr, ArrayVec<u32, 8>> — maps page
// addresses to the (small) set of bucket indices with waiters on that page.
// 8 entries per page is sufficient: a 4 KiB page has at most 1024 u32-aligned
// futex words, but in practice a page rarely has waiters in more than 1-2 buckets.
// Insertion uses try_push() — if the ArrayVec is full (>8 buckets on one page,
// extremely unlikely), the entry is replaced by a saturated sentinel (not
// removed — removal would make invalidation wrongly skip the page) and
// invalidation falls back to scanning all buckets for that NUMA node. This
// prevents panic on overflow at the cost of slower invalidation in the
// degenerate case.
let bucket_indices = match PAGE_FUTEX_INDEX.load(frame.addr()) {
Some(indices) => indices.clone(),
None => return, // No waiters on this page — nothing to do.
};
for &bucket_idx in &bucket_indices {
let bucket = &node.buckets[bucket_idx as usize];
let mut guard = bucket.inner.lock();
// Wake all waiters whose key is Shared { page: frame }.
// Woken waiters will re-fault and re-wait with the new frame.
wake_all_matching(&mut guard, |waiter| {
matches!(waiter.key.kind, FutexKeyKind::Shared { page } if page == frame)
});
}
// Remove the secondary index entry — the page is being migrated/freed.
PAGE_FUTEX_INDEX.remove(frame.addr());
}
19.4.6 Cross-Domain Futex Considerations¶
Standard futex implementations assume a single kernel address space. UmkaOS's isolation domains — MPK domains (x86-64), POE domains (AArch64), DACR domains (ARMv7), and page-table isolation domains (RISC-V 64, PPC32, PPC64LE) — introduce a cross-domain shared-memory scenario that does not exist in Linux.
Shared-memory futex keying: When two processes (or a process and a Tier 1 driver)
share memory via MAP_SHARED, the futex key must be the physical address (page frame +
offset), not the virtual address, because each domain may map the region at a different
virtual address. The FutexKeyKind::Shared variant (Section 19.4) handles this case. Both
sides of the mapping hash to the same wait queue bucket, so FUTEX_WAKE from one domain
correctly wakes a waiter in the other.
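The keying rule can be sketched end to end: derive (frame, offset) through a translation function (a stand-in for the page-table walk), then hash. Two mappings of the same page at different virtual addresses produce identical keys. The hash function here is illustrative, not UmkaOS's actual one:

```rust
const PAGE_SIZE: u64 = 4096;

#[derive(PartialEq, Debug, Clone, Copy)]
struct SharedKey {
    frame: u64,  // physical frame number
    offset: u64, // futex word's offset within the page
}

/// Derive the shared futex key for a virtual address; `translate` stands
/// in for the page-table walk the kernel performs.
fn shared_key(vaddr: u64, translate: impl Fn(u64) -> u64) -> SharedKey {
    let paddr = translate(vaddr);
    SharedKey { frame: paddr >> 12, offset: paddr % PAGE_SIZE }
}

/// Illustrative multiplicative hash over (frame, offset).
fn bucket_index(key: SharedKey, nbuckets: u64) -> u64 {
    (key.frame.wrapping_mul(0x9E37_79B9).wrapping_add(key.offset)) % nbuckets
}
```

Because the key ignores the virtual address, FUTEX_WAKE issued through either mapping lands in the same bucket.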
Capability validation: Before performing any futex operation on a shared mapping,
the kernel verifies that the calling domain holds a valid capability to the underlying
shared memory region. A FUTEX_WAIT or FUTEX_WAKE on an address the caller cannot
legitimately access returns EFAULT. This prevents a compromised domain from probing or
waking arbitrary futex wait queues in other domains.
MPK interaction (x86-64): The futex word must reside in a page whose PKEY is accessible to both participating domains. In practice, this means the shared memory region is assigned to PKEY 1 (shared read-only descriptors) or PKEY 14 (shared DMA buffer pool), as defined in Section 11.2. BPF cross-domain futexes use the BPF domain key (default PKEY 2). The kernel reads and modifies the futex word from PKEY 0 (UmkaOS Core), which always has full read/write access to all domains — so the kernel-side atomic comparison and wake are never blocked by MPK permissions, even if the calling domain's PKRU restricts access to other keys.
| Architecture | Isolation mechanism | Futex cross-domain access method |
|---|---|---|
| x86-64 | MPK (PKEY 0-15) | Kernel operates as PKEY 0; shared region on PKEY 1 or 14 |
| AArch64 | POE | Kernel accesses futex word via privileged overlay permission |
| ARMv7 | DACR | Kernel sets domain manager mode for shared page access |
| RISC-V 64 | Page-table isolation | Kernel maps shared page into supervisor address space |
| PPC32 | Segment registers | Kernel maps shared segment with supervisor key access |
| PPC64LE | Radix PID / HPT | Kernel accesses futex word via hypervisor-privileged mapping |
19.4.7 UmkaOS Simplified Futex API¶
The Linux futex(2) syscall multiplexes 15+ operations through a single syscall number
with stringly-typed error semantics and a confusing val/val2/val3 triple that means
different things per operation. UmkaOS provides a clean single-operation API alongside
futex(2) for backward compatibility.
New UmkaOS futex syscalls:
// Wait: atomically check *uaddr == expected, then sleep until woken or timeout.
// Returns 0 on wake, -ETIMEDOUT on timeout, -EAGAIN if *uaddr != expected.
long futex_wait(uint32_t *uaddr, uint32_t expected,
const struct timespec *timeout, // NULL = wait forever
uint32_t flags); // FUTEX_PRIVATE_FLAG supported
// Wake: wake up to `count` waiters on uaddr. Returns number actually woken.
long futex_wake(uint32_t *uaddr, uint32_t count, uint32_t flags);
// Requeue: wake `wake_count` waiters on uaddr1, move `requeue_count` waiters
// to uaddr2 (for condition variable broadcast without thundering herd).
// Returns number of tasks woken + requeued.
long futex_requeue(uint32_t *uaddr1, uint32_t *uaddr2,
uint32_t wake_count, uint32_t requeue_count,
uint32_t flags);
Syscall numbers (UmkaOS-native, negative):
UmkaOS native syscalls use negative numbers, dispatched through the bidirectional syscall table (Section 19.1). Negative numbers are collision-proof with Linux's positive syscall numbers — no overlap possible regardless of future Linux growth. These ops are in the Sync family (-0x0930):
| Syscall | Number | Notes |
|---|---|---|
| futex_wait | -0x0930 | Relative timeout only; use futex_wait_abs for absolute |
| futex_wake | -0x0931 | |
| futex_requeue | -0x0932 | |
| futex_wait_abs | -0x0933 | Absolute timeout with explicit clockid_t |
| futex_wait_pi | -0x0934 | Priority-inheritance wait |
| futex_wake_pi | -0x0935 | Priority-inheritance wake |
Linux native futex2 compatibility: Linux 6.7 introduced native futex_wake(2)
(syscall 454), futex_wait(2) (syscall 455), and futex_requeue(2) (syscall 456) with
semantics similar to (but not identical to) UmkaOS's extended interface. The UmkaOS compat
layer handles these Linux-native syscall numbers transparently, routing them to the same
FutexSystem implementation. UmkaOS's own extended futex interface (negative syscall
numbers -0x0930 through -0x0935) provides additional features — absolute timeouts
(futex_wait_abs), priority-inheritance operations (futex_wait_pi, futex_wake_pi) —
as a superset. New UmkaOS applications should use the UmkaOS native interface (negative
numbers via libumka); applications ported from Linux use 454-456 unchanged through the
SysAPI layer.
Differences from futex(2) that matter:
- timeout is always a struct timespec relative duration (no FUTEX_CLOCK_REALTIME confusion). For absolute timeouts, futex_wait_abs(uaddr, expected, clockid, abstime, flags) is a separate syscall (number -0x0933).
- Return values are unambiguous: only {0, -ETIMEDOUT, -EAGAIN, -EFAULT, -EINVAL}.
- No val2/val3 overloading — each operation has exactly the parameters it needs.
- Priority inheritance: futex_wait_pi / futex_wake_pi as separate syscalls (-0x0934 / -0x0935).
Internal routing: futex_wait / futex_wake / futex_requeue use the same
FutexSystem (per-NUMA hash table, Section 19.2.1) as the compat futex(2) syscall.
A waiter in the new API can be woken by a wake in the old API on the same address —
they share the same hash bucket.
Linux compatibility: futex(2) (syscall 202 on x86-64) is fully supported and
routes to the same implementation. New UmkaOS applications should prefer futex_wait /
futex_wake for clarity; existing applications use futex(2) unchanged.
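As a usage sketch, the classic three-state futex lock (in the style of Drepper's "Futexes Are Tricky") maps directly onto the simplified API. The `wait`/`wake` closures are stand-ins for the futex_wait/futex_wake syscalls so the state machine can be shown without a kernel:

```rust
use std::sync::atomic::{AtomicU32, Ordering};

const UNLOCKED: u32 = 0;
const LOCKED: u32 = 1;    // locked, no waiters
const CONTENDED: u32 = 2; // locked, at least one (possible) waiter

fn lock(word: &AtomicU32, wait: &mut impl FnMut(u32)) {
    // Fast path: uncontended acquire with no syscall at all.
    if word
        .compare_exchange(UNLOCKED, LOCKED, Ordering::Acquire, Ordering::Relaxed)
        .is_ok()
    {
        return;
    }
    loop {
        // Mark contended so the eventual unlock knows it must futex_wake.
        if word.swap(CONTENDED, Ordering::Acquire) == UNLOCKED {
            return; // we took the lock while marking it contended
        }
        wait(CONTENDED); // futex_wait(word, CONTENDED, NULL, flags)
    }
}

fn unlock(word: &AtomicU32, wake: &mut impl FnMut()) {
    if word.swap(UNLOCKED, Ordering::Release) == CONTENDED {
        wake(); // futex_wake(word, 1, flags) — someone may be sleeping
    }
}
```

The uncontended round trip touches only the atomic word; the kernel is entered only when the CONTENDED state is observed.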
19.5 Netlink Event Compatibility¶
UmkaOS's native event system (Section 7.9, umka-core) delivers events via capability-gated ring buffers. For compatibility with existing Linux tools that use netlink sockets, umka-sysapi provides translation layers for the following netlink protocol families:
| Netlink Family | Purpose | Key Consumers |
|---|---|---|
| NETLINK_KOBJECT_UEVENT | Device hotplug events | udev, systemd, mdev |
| NETLINK_ROUTE | Network interface and routing events | iproute2 (ip), NetworkManager, systemd-networkd |
| NETLINK_AUDIT | Security audit events (Section 20.3) | auditd, systemd-journald |
| NETLINK_CONNECTOR | Process events (fork, exec, exit) | systemd, process accounting |
| NETLINK_NETFILTER | Firewall logging and conntrack | iptables logging, conntrack-tools |
| NETLINK_GENERIC | Generic netlink (nl80211 WiFi, team, devlink, ethtool) | wpa_supplicant, NetworkManager, iw, hostapd, ethtool |
Architecture: Each netlink family is handled by a dedicated translator in umka-sysapi:
- Process opens a netlink socket (socket(AF_NETLINK, SOCK_DGRAM, protocol)).
- umka-sysapi intercepts the socket creation and bind(), registering the process with the appropriate UmkaOS event channel.
- When the kernel posts a native UmkaOS event, the translator converts it to the Linux netlink message format and writes to the socket buffer.
- Process reads netlink messages via recvmsg().
19.5.1 NETLINK_KOBJECT_UEVENT (Device Events)¶
udev and systemd use this for device hotplug. Example translation:
UmkaOS Event:
event_type = UsbDeviceChanged
data.usb = { vid=0x1234, pid=0x5678, inserted=true }
Netlink message:
ACTION=add
DEVPATH=/devices/pci0000:00/0000:00:14.0/usb1/1-1
SUBSYSTEM=usb
DEVTYPE=usb_device
PRODUCT=1234/5678/100
DEVPATH synthesis: UmkaOS has no sysfs tree. The SysAPI layer constructs
Linux-compatible DEVPATH strings from the KABI device registry
(Section 12.1) using DevpathBuilder:
/// Builds a sysfs-compatible DEVPATH string from the KABI device tree.
/// Output must match the pattern that udev rules expect (bus/slot/function
/// for PCI, hub-port for USB, etc.).
pub struct DevpathBuilder;
impl DevpathBuilder {
/// Construct DEVPATH for a device. Examples:
/// PCI: /devices/pci0000:00/0000:00:14.0
/// USB: /devices/pci0000:00/0000:00:14.0/usb1/1-1
/// SCSI: /devices/pci0000:00/0000:00:1f.2/ata1/host0/target0:0:0/0:0:0:0
/// NET: /devices/virtual/net/eth0
pub fn build(device: &KabiDeviceInfo) -> ArrayString<256> {
// Walk the device's parent chain in the KABI registry.
// For each ancestor, emit the bus-specific path component:
// PCI: "{domain:04x}:{bus:02x}:{slot:02x}.{fn:x}"
// USB: "usb{busnum}/{busnum}-{port}"
// Platform: device name as-is
// Virtual: "virtual/{subsystem}/{name}"
// Concatenate with "/" separators, prefix with "/devices/".
}
}
The builder handles PCI, USB, SCSI (host/target/lun), platform, and virtual
devices. Unknown bus types emit the KABI device name verbatim. The generated
DEVPATH is used in NETLINK_KOBJECT_UEVENT messages and in the umkafs
/sys/devices/ compatibility tree.
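A miniature, runnable version of the walk. The component types and their fields are assumptions of this sketch; the real builder reads them from the KABI registry:

```rust
/// Simplified stand-in for the KABI parent chain: each ancestor contributes
/// one bus-specific path component, root first.
enum BusComponent {
    PciRoot { domain: u16, bus: u8 },
    Pci { domain: u16, bus: u8, slot: u8, func: u8 },
    Usb { busnum: u8, port: u8 },
    Virtual { subsystem: &'static str, name: &'static str },
}

fn build_devpath(chain: &[BusComponent]) -> String {
    let mut path = String::from("/devices");
    for c in chain {
        path.push('/');
        match c {
            BusComponent::PciRoot { domain, bus } =>
                path.push_str(&format!("pci{domain:04x}:{bus:02x}")),
            BusComponent::Pci { domain, bus, slot, func } =>
                path.push_str(&format!("{domain:04x}:{bus:02x}:{slot:02x}.{func:x}")),
            BusComponent::Usb { busnum, port } =>
                path.push_str(&format!("usb{busnum}/{busnum}-{port}")),
            BusComponent::Virtual { subsystem, name } =>
                path.push_str(&format!("virtual/{subsystem}/{name}")),
        }
    }
    path
}
```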
19.5.1.1 Uevent Wire Format¶
Each uevent message is a sequence of null-terminated KEY=VALUE strings
concatenated into a single netlink datagram. The format matches Linux exactly
so that unmodified udev, systemd-udevd, and mdev parse it correctly.
Mandatory attributes (always present, in this order):
| Key | Source | Example |
|---|---|---|
| ACTION | Event type mapping: DeviceArrival→add, DeviceRemoval→remove, DriverBind→bind, DriverUnbind→unbind, PropertyChange→change | ACTION=add |
| DEVPATH | DevpathBuilder::build() output | DEVPATH=/devices/pci0000:00/0000:00:14.0/usb1/1-1 |
| SUBSYSTEM | Derived from BusType or service publication | SUBSYSTEM=usb |
| SEQNUM | Monotonic u64 counter (UEVENT_SEQNUM.fetch_add(1, Relaxed)) | SEQNUM=1234 |
Bus-specific attributes (appended based on device type):
| Bus | Additional Keys |
|---|---|
| PCI | PCI_SLOT_NAME=DDDD:BB:SS.F, PCI_ID=VVVV:DDDD, PCI_SUBSYS_ID=VVVV:DDDD, PCI_CLASS=CCSSPP, DRIVER=<name> (if bound) |
| USB | DEVTYPE=usb_device\|usb_interface, PRODUCT=VVVV/PPPP/RRRR, TYPE=CC/SS/PP, BUSNUM=NNN, DEVNUM=NNN |
| Platform | MODALIAS=platform:<name>, OF_COMPATIBLE_N=<string> (one per DT compatible entry) |
| SCSI | DEVTYPE=scsi_device\|scsi_host, SCSI_HOST=N, SCSI_CHANNEL=N, SCSI_ID=N, SCSI_LUN=N |
| Block | DEVTYPE=disk\|partition, DISKSEQ=N, PARTN=N (partitions only), MAJOR=N, MINOR=N |
| Network | DEVTYPE=<empty>\|wlan\|bridge\|veth, INTERFACE=<name>, IFINDEX=N |
Wire encoding: The datagram begins with a header string
ACTION@DEVPATH\0 (e.g., add@/devices/pci0000:00/...\0) followed by the
KEY=VALUE\0 pairs. Total message size is bounded by UEVENT_BUFFER_SIZE
(2048 bytes, matching Linux). Messages exceeding this size are split into
a base uevent plus supplementary NETLINK_KOBJECT_UEVENT datagrams with
the same SEQNUM.
/// Uevent sequence number. Monotonically increasing across all uevent types.
/// Userspace uses this to detect missed events (gap in sequence).
static UEVENT_SEQNUM: AtomicU64 = AtomicU64::new(1);
/// Maximum uevent datagram size (bytes). Matches Linux UEVENT_BUFFER_SIZE.
const UEVENT_BUFFER_SIZE: usize = 2048;
/// Build a uevent netlink datagram from a registry event.
///
/// # Arguments
/// - `action`: One of "add", "remove", "change", "bind", "unbind".
/// - `device`: Reference to the DeviceNode in the registry.
/// - `buf`: Caller-provided buffer of at least UEVENT_BUFFER_SIZE bytes.
///
/// # Returns
/// Number of bytes written to `buf`, or `Err(UeventError::Overflow)` if
/// the attribute set exceeds UEVENT_BUFFER_SIZE.
pub fn build_uevent(
action: &str,
device: &DeviceNode,
buf: &mut [u8; UEVENT_BUFFER_SIZE],
) -> Result<usize, UeventError>;
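A simplified body for illustration: the attribute set is reduced to a caller-supplied list and the return type is collapsed, but the wire layout (`ACTION@DEVPATH` header string, then KEY=VALUE pairs, all NUL-terminated, size-bounded) matches the description above:

```rust
const UEVENT_BUFFER_SIZE: usize = 2048; // matches Linux

/// Serialize one uevent datagram: "ACTION@DEVPATH\0" then "KEY=VALUE\0"
/// pairs. Errors out instead of truncating when the bound is exceeded.
fn build_uevent_sketch(
    action: &str,
    devpath: &str,
    kv: &[(&str, &str)],
) -> Result<Vec<u8>, ()> {
    let mut out = Vec::new();
    out.extend_from_slice(action.as_bytes());
    out.push(b'@');
    out.extend_from_slice(devpath.as_bytes());
    out.push(0);
    for (k, v) in kv {
        out.extend_from_slice(k.as_bytes());
        out.push(b'=');
        out.extend_from_slice(v.as_bytes());
        out.push(0);
    }
    if out.len() > UEVENT_BUFFER_SIZE { Err(()) } else { Ok(out) }
}
```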
19.5.1.2 Uevent Attribute File (/sys/devices/.../uevent)¶
Reading /sys/devices/<devpath>/uevent returns the same KEY=VALUE\n pairs
(newline-separated, not null-separated — text format for human consumption).
Writing add, remove, or change to this file triggers a synthetic uevent
for the device — used by udevadm trigger and systemd cold-plug replay.
19.5.2 NETLINK_ROUTE (Network Events)¶
NetworkManager, iproute2, and systemd-networkd use this for link state and address changes. The Tier 1 network stack (Section 16.1) posts native events that umka-sysapi translates:
Push path (kernel → userspace, event notifications):
- RTM_NEWLINK / RTM_DELLINK: Interface added/removed
- RTM_NEWADDR / RTM_DELADDR: IP address added/removed
- RTM_NEWROUTE / RTM_DELROUTE: Routing table changes
- RTM_NEWNEIGH / RTM_DELNEIGH: ARP/NDP neighbor cache updates
Pull path (userspace → kernel, request/response queries):
Userspace tools (ip route show, ip link show, ip addr show) send netlink
request messages and expect reply messages. umka-sysapi handles these by:
- Process sends a RTM_GET* request via sendmsg() on the netlink socket.
- umka-sysapi parses the netlink message header (struct nlmsghdr), extracts the request type and filter attributes (ifindex, prefix, family, etc.).
- umka-sysapi queries the Tier 1 network stack's internal state via the inter-domain ring (e.g., umka_net::get_routes(family, table)) and constructs netlink reply messages with the standard NLM_F_MULTI flag for dump responses, terminated by NLMSG_DONE.
- Reply messages are written to the socket's receive buffer for recvmsg().
Supported request types: RTM_GETLINK, RTM_GETADDR, RTM_GETROUTE,
RTM_GETNEIGH, RTM_GETRULE, RTM_GETQDISC. Dump mode (NLM_F_DUMP)
iterates the full table; non-dump mode returns a single matching entry.
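The dump-mode framing can be sketched at the message level (header fields only; payload encoding elided). Every entry message carries NLM_F_MULTI, and the stream is terminated by an NLMSG_DONE message:

```rust
// Constants per <linux/netlink.h>.
const NLM_F_MULTI: u16 = 0x02;
const NLMSG_DONE: u16 = 0x3;

#[derive(Debug, PartialEq)]
struct Msg {
    msg_type: u16,
    flags: u16,
}

/// Frame an NLM_F_DUMP response: n entry messages plus the terminator.
fn dump_reply(entry_type: u16, n_entries: usize) -> Vec<Msg> {
    let mut out: Vec<Msg> = (0..n_entries)
        .map(|_| Msg { msg_type: entry_type, flags: NLM_F_MULTI })
        .collect();
    out.push(Msg { msg_type: NLMSG_DONE, flags: NLM_F_MULTI });
    out
}
```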
19.5.3 NETLINK_GENERIC (Generic Netlink)¶
Generic Netlink is a multiplexed netlink protocol (family 16) that allows kernel subsystems to register named sub-protocols ("generic netlink families") without consuming a dedicated netlink protocol number. It is the transport for nl80211 (WiFi management), team (NIC teaming), devlink (device management), ethtool (NIC configuration), and many other subsystems.
Sub-families implemented:
| Generic Netlink Family | Operations | Consumers |
|---|---|---|
| nl80211 | NL80211_CMD_*: scan, connect, disconnect, roam, set_station, get_station, set_reg, set_power_save, get_wiphy, trigger_scan | wpa_supplicant, NetworkManager, iw, hostapd, wpa_cli |
| devlink | DEVLINK_CMD_*: get, port_get, sb_get, param_get, health_reporter_get | devlink tool, mlxconfig |
| ethtool | ETHTOOL_MSG_*: strset_get, linkinfo_get, linkmodes_get, linkstate_get, rings_get, channels_get | ethtool, NetworkManager |
Architecture: NETLINK_GENERIC uses the same socket infrastructure as other
netlink families. On socket(AF_NETLINK, SOCK_DGRAM, NETLINK_GENERIC):
1. umka-sysapi registers a NETLINK_GENERIC socket.
2. The process resolves sub-family IDs via CTRL_CMD_GETFAMILY (e.g., resolves
"nl80211" string to its runtime-assigned family ID number).
3. umka-sysapi routes NLM_F_REQUEST messages to the appropriate sub-family
handler (nl80211 handler → WirelessDriver KABI; devlink → DevlinkVTable KABI).
4. Unsolicited events are delivered via multicast groups (e.g., the nl80211
multicast groups config, mlme, and scan).
19.5.4 Other Netlink Families¶
- NETLINK_AUDIT: Translated from UmkaOS's audit events (Section 9.5 IMA) for auditd.
- NETLINK_CONNECTOR: Translated from process lifecycle events (Section 8.1) for cn_proc.
- NETLINK_NETFILTER: Translated from nftables/conntrack events (Section 19.1) for firewall logging.
19.6 Windows Emulation Acceleration (WEA)¶
Wine and Proton emulate Windows NT kernel behavior in userspace. This subsystem provides kernel-level NT-compatible primitives that Wine/Proton can use directly, bypassing userspace emulation and achieving better correctness and performance.
Key insight: UmkaOS doesn't need to implement Windows syscalls directly. Instead, provide kernel-level primitives that make WINE/Proton faster, more correct, and easier to maintain.
Problem: WINE (and Proton) must emulate Windows NT kernel behavior in userspace on top of POSIX/Linux syscalls. This creates:
- Performance overhead: Multiple syscalls to emulate one Windows operation
- Semantic mismatches: Linux primitives don't map 1:1 to Windows primitives
- Correctness issues: WINE's userspace emulation can't perfectly replicate kernel-level Windows behavior
- Complexity: WINE's ntdll.dll is ~50K lines of Windows kernel emulation code
UmkaOS's opportunity: Provide a Windows NT-compatible object model as a kernel subsystem that WINE can use directly, bypassing userspace emulation.
Architectural principle — WEA wraps native UmkaOS primitives, not vice versa:
WEA is a translation layer, not a parallel kernel subsystem. Every WEA primitive is built on top of an existing UmkaOS native mechanism:
| WEA feature | Built on | Native spec |
|---|---|---|
| NT Events, Mutexes, Semaphores | UmkaOS SyncEvent / SyncSemaphore | Section 19.8 |
| WaitForMultipleObjects | SYNC_WAIT_ANY / SYNC_WAIT_ALL | Section 19.8 |
| I/O Completion Ports | BoundedMpmcQueue + io_uring completion | Section 19.3, Section 11.8 |
| VirtualAlloc / VirtualProtect | VMM mmap + demand paging | Section 4.15 |
| NT Thread / Fiber model | UmkaOS task + ucontext-style context | Section 8.1 |
| Security tokens | TaskCredential + UmkaOS capabilities | Section 9.9 |
| Structured Exception Handling | Signal delivery + VEH chain | Section 8.5 |
The WEA layer adds NT semantics (object naming, handle table, security
descriptors, alertable waits, mutex abandonment) without duplicating the
underlying kernel mechanisms. Native UmkaOS applications that need multi-object
wait use SYNC_WAIT_ANY directly — they never touch the WEA layer. WEA is
opt-in via CAP_WEA and imposes zero overhead on non-WINE processes.
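The WaitForMultipleObjects mapping can be illustrated as pure result logic: given the signaled state of each object (which the native SYNC_WAIT call would report), compute the Win32-style return value. This models a zero-timeout poll; real waits block in SYNC_WAIT_ANY / SYNC_WAIT_ALL. The constants follow the Win32 convention:

```rust
const WAIT_OBJECT_0: u32 = 0;
const WAIT_TIMEOUT: u32 = 0x102;

/// wait-any returns WAIT_OBJECT_0 + index of the first signaled object
/// (SYNC_WAIT_ANY's triggered index); wait-all completes only when every
/// object is signaled (SYNC_WAIT_ALL semantics).
fn wait_for_multiple(signaled: &[bool], wait_all: bool) -> u32 {
    if wait_all {
        if signaled.iter().all(|&s| s) { WAIT_OBJECT_0 } else { WAIT_TIMEOUT }
    } else {
        match signaled.iter().position(|&s| s) {
            Some(i) => WAIT_OBJECT_0 + i as u32,
            None => WAIT_TIMEOUT,
        }
    }
}
```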
Naming convention: WEA types use the Nt prefix (e.g., NtHandle, NtIocp,
NtMutant, NtSection) to distinguish NT-semantic objects from their POSIX/UmkaOS
equivalents. This prevents confusion when both exist in the same kernel: NtMutant
has abandonment semantics while Mutex does not; NtHandle supports inheritance
while Fd does not. The Nt prefix matches Wine/ReactOS convention and makes
code review immediately clear about which semantic model is in play.
19.6.1 Capability Gating¶
WEA syscalls (operation codes 0x0800-0x08FF) require CAP_WEA capability. This capability:
- Is NOT granted by default — only processes that explicitly request WEA support receive it.
- Can be scoped to a specific NT namespace subtree (e.g., CAP_WEA(namespace=/WINE-prefix-1)).
- Container isolation: each container (or WINE prefix) has its own \BaseNamedObjects\ subtree.
A process with CAP_WEA(namespace=/containers/abc) cannot access objects in /containers/def.
Without CAP_WEA, WEA syscalls return -EPERM. This prevents non-WINE processes from
interacting with the NT object namespace and ensures WEA's attack surface is opt-in.
19.6.2 NT Object Manager¶
Windows NT kernel concept: Everything is an object (files, processes, threads, events, mutexes, semaphores, sections). Objects live in a hierarchical namespace (\Device\, \Driver\, \BaseNamedObjects\, etc.).
Current WINE approach: Emulates NT objects in userspace. Server process (wineserver) manages object lifetimes, handles, waits. High overhead for cross-process object sharing.
UmkaOS WEA approach: Kernel-native NT object manager alongside POSIX VFS.
/// NT Object Manager (lives in umka-sysapi crate)
pub struct NtObjectManager {
/// Root of the hierarchical namespace (e.g., `\BaseNamedObjects\MyEvent`).
///
/// Each `NtDirectory` contains its own per-directory RwLock. Path traversal
/// acquires the lock at each directory level and releases the parent before
/// descending — at most one directory lock is held at any time (no lock
/// ordering issues between directories). This means operations on different
/// subtrees (`\BaseNamedObjects\` vs `\Device\`) never contend.
///
/// **Lock hierarchy**: WEA locks are in a separate "leaf" category that does not
/// call scheduler code while held. The NT namespace and object locks may call
/// allocator or capability code but NOT scheduler::enqueue(). This means:
/// - NT_NAMESPACE and NT_OBJECT locks do NOT need to be ordered relative to
/// scheduler locks (TASK_LOCK, RQ_LOCK, PI_LOCK).
/// - They DO need ordering relative to each other: NT_NAMESPACE < NT_OBJECT.
/// - They use a separate lock category (WEA_LOCKS) that is incompatible with
/// scheduler locks — holding any WEA lock while holding any scheduler lock
/// (or vice versa) is a compile-time error.
///
/// Wait operations (WaitForSingleObject, WaitForMultipleObjects) release all
/// NT object locks before calling scheduler::sleep(). Wake operations
/// (SetEvent, ReleaseMutex) mark the waiter as ready, then release NT object
/// locks, then call scheduler::wake() WITHOUT holding NT locks.
///
/// This "release-before-schedule" pattern is identical to how futex_wake works.
root: Arc<NtDirectory>,
/// Per-process NT handle tables (lazily allocated on first WEA syscall to
/// avoid ~1.5 MB overhead for non-WEA processes).
///
/// **Memory model**: The `Option<Box<>>` wrapper ensures non-WEA processes
/// (the vast majority in container environments) pay exactly zero memory cost.
/// Only processes that issue their first WEA syscall (`NtCreateFile`, etc.)
/// trigger allocation. For WEA-heavy environments (many WINE containers),
/// the flat 65536-entry array trades ~1.57 MB per WEA process for O(1) handle
/// lookup. If container density requires lower per-process overhead, a future
/// optimization can use a two-level page table (256 × 256-entry pages, ~6 KB
/// base + 1 KB per populated page) instead of the flat array.
handle_tables: PerProcess<Option<Box<NtHandleTable>>>,
}
/// Lock category for WEA subsystem locks. Separate from scheduler locks.
/// Holding a WEA lock and a scheduler lock simultaneously is forbidden.
pub const WEA_LOCK_CATEGORY: LockCategory = LockCategory::WEA;
/// Lock level within WEA category for namespace directory lock.
pub const NT_NAMESPACE_LEVEL: u8 = 0;
/// Lock level within WEA category for individual NT object internal locks.
pub const NT_OBJECT_LEVEL: u8 = 1;
/// NT object name. Fixed-size inline string. NT names exceeding 255 UTF-8 bytes
/// are rejected with STATUS_OBJECT_NAME_INVALID.
pub type NtName = ArrayString<256>;
/// NT namespace path. Backslash-separated components. Total path bounded by
/// MAX_NT_PATH_COMPONENTS (32) to prevent stack overflow during traversal.
pub type NtPath = ArrayString<1024>;
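The name and component bounds above can be sketched as a standalone validation routine. This is an illustrative sketch, not the kernel's actual parser: `split_nt_path` and `NtPathError` are hypothetical names, and the limits mirror the constants stated in the comments (255-byte component names, 32 components per path).

```rust
/// Illustrative sketch of NT path splitting under the bounds above.
/// Backslash-separated components; each component at most 255 UTF-8 bytes;
/// at most MAX_NT_PATH_COMPONENTS components per path.
pub const MAX_NT_PATH_COMPONENTS: usize = 32;
pub const MAX_NT_NAME_BYTES: usize = 255;

#[derive(Debug, PartialEq)]
pub enum NtPathError {
    NameTooLong,       // maps to STATUS_OBJECT_NAME_INVALID
    TooManyComponents, // exceeds MAX_NT_PATH_COMPONENTS
    EmptyComponent,    // e.g. `\A\\B` or a trailing backslash
}

pub fn split_nt_path(path: &str) -> Result<Vec<&str>, NtPathError> {
    // A single leading backslash denotes the namespace root.
    let trimmed = path.strip_prefix('\\').unwrap_or(path);
    let mut components = Vec::new();
    for comp in trimmed.split('\\') {
        if comp.is_empty() {
            return Err(NtPathError::EmptyComponent);
        }
        if comp.len() > MAX_NT_NAME_BYTES {
            return Err(NtPathError::NameTooLong);
        }
        components.push(comp);
        if components.len() > MAX_NT_PATH_COMPONENTS {
            return Err(NtPathError::TooManyComponents);
        }
    }
    Ok(components)
}
```

Bounding both the component length and the component count up front is what lets the traversal code use fixed-size inline strings and a non-recursive walk.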
/// NT namespace directory node. Each directory has its own RwLock protecting
/// its children. Traversal acquires one directory lock at a time (hand-over-hand
/// is NOT needed — the parent lock is released before the child lock is acquired,
/// because `Arc<NtDirectory>` children are stable once inserted).
pub struct NtDirectory {
/// Children of this directory, protected by a per-directory lock.
/// Lookups take a read lock; insertions take a write lock. Since each
/// directory has its own lock, `\BaseNamedObjects\` operations never
/// contend with `\Device\` operations.
children: RwLock<BTreeMap<NtName, NtDirectoryEntry>, { LockCategory::WEA, 0 }>,
}
/// Directory entry in the NT namespace.
pub struct NtDirectoryEntry {
/// The object (Event, Mutex, etc.) or a subdirectory.
content: NtEntryContent,
/// Security descriptor controlling access (simplified from full Windows SD)
security: NtSecurityDescriptor,
/// Creation timestamp for audit/debugging
created_at: Instant,
}
/// Entry content: either a leaf object or a subdirectory.
pub enum NtEntryContent {
/// Leaf object (Event, Mutex, Semaphore, etc.)
Object(Arc<NtObject>),
/// Subdirectory (e.g., `\BaseNamedObjects\` is a subdirectory of `\`).
Directory(Arc<NtDirectory>),
}
/// Simplified NT security descriptor. Full Windows SDs are complex; we implement
/// the subset needed for WINE/Proton compatibility.
pub struct NtSecurityDescriptor {
/// Owner (maps to Unix UID via UmkaOS's capability system)
owner: UserId,
/// Container ID for namespace isolation (prevents cross-container squatting)
container_id: Option<ContainerId>,
}
/// Named object creation with atomic create-or-open semantics.
/// Prevents TOCTOU race conditions in named object access.
impl NtObjectManager {
/// Create a named object atomically. Returns existing object if name exists
/// and `open_existing` is true; returns STATUS_OBJECT_NAME_COLLISION if name
/// exists and `open_existing` is false.
///
/// **Traversal protocol**: The path is split into components. Each component
/// is looked up in the current directory under a read lock. When the final
/// component is reached and creation may be needed, the *leaf directory's*
/// write lock is acquired directly — the existence check and insertion both
/// happen under this single write-lock acquisition, eliminating any TOCTOU
/// window. At most one directory lock is held at any time.
///
/// **Atomic create-or-fail protocol**: The implementation uses a single
/// write-lock acquisition for both the existence check and the insertion,
/// eliminating any TOCTOU window. A prior read-only existence check (under
/// read lock) is an optional performance optimization only when
/// `OBJECT_CREATE_OR_FAIL` semantics are not required, and must never be
/// used as the authoritative check. The authoritative name-exists check is
/// always the one performed under the write lock in this function.
///
/// **Concurrency**: Operations on different directories never contend.
/// Two concurrent `CreateEvent(\BaseNamedObjects\EventA)` and
/// `CreateEvent(\BaseNamedObjects\EventB)` contend only on the
/// `\BaseNamedObjects\` directory lock, not on the root.
pub fn create_named<T: NtObjectType>(
&self,
path: &NtPath,
open_existing: bool,
access: u32,
security: NtSecurityDescriptor,
) -> Result<(NtHandle, bool /* created */), NtStatus> {
// Walk to the leaf directory (all intermediate lookups use read locks).
let (leaf_dir, name) = self.traverse_to_parent(path)?;
// Take write lock on the leaf directory only.
let mut dir = leaf_dir.children.write();
if let Some(existing) = dir.get(&name) {
// Check caller has permission to access existing object
self.check_access(existing, access)?;
// Check container isolation: object must be in same container or global
self.check_container_access(existing, &security)?;
if open_existing {
return Ok((self.create_handle(existing, access), false));
} else {
return Err(STATUS_OBJECT_NAME_COLLISION);
}
}
// Create new object under leaf write lock — atomic with the lookup
let obj = Arc::new(T::create()?);
let entry = NtDirectoryEntry {
content: NtEntryContent::Object(Arc::clone(&obj)),
security,
created_at: Instant::now(),
};
dir.insert(name, entry);
Ok((self.create_handle(&obj, access), true))
}
}
pub enum NtObject {
Event(NtEvent),
Mutex(NtMutex),
Semaphore(NtSemaphore),
Section(NtSection), // Memory-mapped file or shared memory
Process(NtProcess),
Thread(NtThread),
Timer(NtTimer),
IoCompletionPort(NtIocp),
Job(NtJob),
}
pub struct NtHandleTable {
/// Handles are indices into this table, not file descriptors.
/// Heap-allocated boxed slice with maximum 65536 entries (matching
/// UmkaOS's CapSpace limit and Linux's RLIMIT_NOFILE default). Attempting to
/// create handles beyond this limit returns STATUS_INSUFFICIENT_RESOURCES.
/// Initialized via `vec![None; NT_MAX_HANDLES].into_boxed_slice()` — never
/// passes through the stack, avoiding stack overflow with large N.
entries: Box<[Option<NtHandleEntry>]>,
/// Bitmap tracking which slots are free, for O(1) allocation.
/// Size: `NT_MAX_HANDLES / 64 = 65536 / 64 = 1024` entries (1024 × 8 = 8 KiB).
/// Also heap-allocated to avoid stack pressure.
/// Initialized via `vec![AtomicU64::new(0); 1024].into_boxed_slice()` — safe
/// constructor, no unsafe needed. `AtomicU64` has the same in-memory
/// representation as `u64` per Rust documentation.
free_bitmap: Box<[AtomicU64]>,
    /// Windows handles are user-mode values that are nonzero multiples of 4.
    /// We maintain the illusion with an invertible encoding:
    /// handle = (index + 1) << 2, so index 0 maps to handle 4 and handle 0
    /// stays reserved as the invalid value.
    next_hint: AtomicU32, // Hint for next free slot search, not authoritative
}
/// Maximum NT handles per process. Matches UmkaOS's CapSpace limit (Section 9.1).
/// Windows default is ~16 million but most applications use far fewer.
pub const NT_MAX_HANDLES: usize = 65536;
pub struct NtHandleEntry {
object: Arc<NtObject>,
access_mask: u32, // Windows ACCESS_MASK
attributes: u32, // OBJ_INHERIT, OBJ_PERMANENT, etc.
}
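A minimal sketch of an invertible handle encoding consistent with the constraints above (handles are nonzero multiples of 4, table capacity 65536). The helper names are illustrative, not the kernel's API; `(index + 1) << 2` is one encoding that satisfies both properties while keeping handle 0 reserved as the invalid value.

```rust
/// Illustrative handle <-> table-index encoding sketch.
pub const NT_MAX_HANDLES: usize = 65536;

pub fn encode_handle(index: usize) -> u32 {
    debug_assert!(index < NT_MAX_HANDLES);
    // Shift past the low two bits so every handle is a multiple of 4;
    // the +1 keeps handle value 0 free to mean "invalid".
    ((index as u32) + 1) << 2
}

pub fn decode_handle(handle: u32) -> Option<usize> {
    // Reject 0 (invalid handle) and values that are not multiples of 4.
    if handle == 0 || handle & 0x3 != 0 {
        return None;
    }
    let index = (handle >> 2) as usize - 1;
    if index < NT_MAX_HANDLES { Some(index) } else { None }
}
```

Because decoding validates alignment and range before indexing, a forged handle value from userspace can at worst produce `None`, never an out-of-bounds table access.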
Handle close semantics (WEA_HANDLE_CLOSE): Closing an NT handle decrements
the Arc<NtObject> refcount. If this was the last reference, per-object-type
cleanup runs:
| Object Type | Close Action |
|---|---|
| Event | If auto-reset: no special cleanup (state is transient). If manual-reset: no special cleanup. Named events persist in the namespace until all references close. |
| Mutex | If the closing thread holds the mutex (mutex.owner == current_thread), the mutex is released (owner set to None, next waiter woken). This matches Windows CloseHandle behavior for mutexes. |
| Semaphore | Refcount decrement only. Count is not adjusted. |
| IoCompletionPort | All threads blocked in GetQueuedCompletionStatus are woken with STATUS_CANCELLED. Pending I/O packets in the IOCP queue are drained and discarded. |
| Section | Memory-mapped views are not unmapped (matching Windows: views outlive the section handle). Views are unmapped when the process exits or explicitly calls UnmapViewOfFile. |
| Timer | Cancel the pending timer callback, if any. |
| Process / Thread | Handle close does not terminate the process/thread. The kernel object is freed only when both the handle refcount AND the process/thread itself have terminated. |
| Job | If the last handle closes and JOB_OBJECT_LIMIT_KILL_ON_JOB_CLOSE is set, all processes in the job are terminated. |
Syscalls provided:
// These are UmkaOS syscalls, not Windows syscalls
// WINE's ntdll.dll calls these instead of emulating in userspace
SYS_nt_create_event(
name: *const u16, // UTF-16 name (Windows convention)
manual_reset: u32, // 0 = auto-reset, 1 = manual-reset (NOT bool — see CLAUDE.md rule 8:
// Rust `bool` has a validity invariant; values != 0|1 from userspace
// registers are instant UB. Validate: if manual_reset > 1 return EINVAL.)
initial_state: u32, // 0 = non-signaled, 1 = signaled (same rationale as above)
) -> Result<NtHandle>;
SYS_nt_open_event(
name: *const u16,
access: u32,
) -> Result<NtHandle>;
SYS_nt_set_event(handle: NtHandle) -> Result<()>;
SYS_nt_reset_event(handle: NtHandle) -> Result<()>;
SYS_nt_pulse_event(handle: NtHandle) -> Result<()>;
SYS_nt_wait_for_single_object(
handle: NtHandle,
timeout_ns: Option<u64>, // Windows uses 100ns units, we convert
) -> Result<WaitResult>;
SYS_nt_wait_for_multiple_objects(
handles: &[NtHandle],
wait_all: bool, // WaitAll vs WaitAny
timeout_ns: Option<u64>,
) -> Result<WaitResult>;
SYS_nt_create_section(
name: Option<*const u16>,
size: u64,
protection: u32, // PAGE_READWRITE, PAGE_EXECUTE_READ, etc.
file: Option<Fd>, // Back with file or anonymous
) -> Result<NtHandle>;
SYS_nt_map_view_of_section(
section: NtHandle,
base_address: Option<*mut u8>, // NULL = kernel picks
size: u64,
offset: u64,
protection: u32,
) -> Result<*mut u8>;
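The timeout conversion noted in the comments above ("Windows uses 100ns units, we convert") happens on the ntdll/WEA boundary. The sketch below assumes the documented NT convention for wait timeouts — a signed 64-bit value where negative means a relative interval in 100 ns units, zero means poll, and a null pointer means wait forever; the `WaitTimeout` enum and function name are illustrative.

```rust
/// Illustrative NT timeout translation sketch.
#[derive(Debug, PartialEq)]
pub enum WaitTimeout {
    Infinite,        // caller passed a null timeout pointer
    Poll,            // zero: check state and return immediately
    RelativeNs(u64), // negative NT value: relative interval, converted to ns
    AbsoluteNt(u64), // positive NT value: absolute deadline in 100 ns units
}

pub fn convert_nt_timeout(timeout: Option<i64>) -> WaitTimeout {
    match timeout {
        None => WaitTimeout::Infinite,
        Some(0) => WaitTimeout::Poll,
        // Negative: relative interval in 100 ns units. unsigned_abs avoids
        // overflow on i64::MIN; saturating_mul caps pathological values.
        Some(t) if t < 0 => WaitTimeout::RelativeNs(t.unsigned_abs().saturating_mul(100)),
        // Positive: absolute time in 100 ns units since the NT epoch.
        Some(t) => WaitTimeout::AbsoluteNt(t as u64),
    }
}
```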
Benefits for WINE:
1. Performance: a single syscall instead of 5-10 syscalls + wineserver RPC
2. Correctness: the kernel enforces Windows NT semantics exactly
3. Simplicity: WINE's ntdll.dll becomes a thin wrapper over UmkaOS syscalls
4. Cross-process: named objects work correctly between processes (games + launchers)
19.6.3 Fast Synchronization Primitives¶
Problem: Windows has NtWaitForMultipleObjects (wait on up to 64 objects simultaneously). Linux has no equivalent — WINE emulates with pipes + poll() or wineserver signaling. High overhead.
UmkaOS WEA approach: WEA synchronization is a thin translation layer over
UmkaOS native synchronization primitives (Section 19.8).
The kernel has ONE wait implementation — SYNC_WAIT_ANY / SYNC_WAIT_ALL — which
supports heterogeneous waitable types (UmkaWaitHandle: fd, event, pid, timer,
semaphore). WEA translates NT handle types and NT-specific semantics (alertable waits,
mutex abandonment, APCs) into native SYNC operations:
WEA_WAIT_MULTIPLE(handles, bWaitAll, timeout)
→ WEA: translate NtHandle[] → UmkaWaitHandle[] (Event→Event, Mutex→Event+ownership, ...)
→ WEA: if bWaitAll: umka_sync_wait_all(); else: umka_sync_wait_any()
→ WEA: translate result → WAIT_OBJECT_0+index / WAIT_ABANDONED_0+index / WAIT_TIMEOUT
Similarly, WEA NT objects are wrappers around native UmkaOS primitives:
| WEA type | Underlying native primitive | WEA syscall | Native syscall |
|---|---|---|---|
| NT Event (auto-reset/manual-reset) | SyncEvent | WEA_EVENT_CREATE (0x0801) | SYNC_EVENT_CREATE (0x0910) |
| NT Mutex | SyncEvent + ownership tracking | WEA_MUTEX_CREATE (0x0810) | SYNC_EVENT_CREATE + per-handle owner field |
| NT Semaphore | SyncSemaphore | WEA_SEMAPHORE_CREATE (0x0811) | SYNC_SEM_CREATE (0x0920) |
| NT WaitForMultipleObjects | SYNC_WAIT_ANY / SYNC_WAIT_ALL | WEA_WAIT_MULTIPLE (0x0821) | SYNC_WAIT_ANY (0x0900) / SYNC_WAIT_ALL (0x0901) |
The WaitAll atomicity protocol below describes the NT-specific multi-acquire semantics (sorted-order locking, deadlock pre-check, mutex abandonment). These semantics are layered on top of the native SYNC_WAIT_ALL kernel implementation, not a replacement for it.
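The handle- and result-translation steps of WEA_WAIT_MULTIPLE can be sketched as follows. The native-side types here are illustrative stand-ins (the real UmkaWaitHandle carries fds, PIDs, and timers too); the WAIT_* return bases are the documented Win32 values.

```rust
/// Illustrative sketch of WEA's translation layer around the native wait.
#[derive(Debug, PartialEq, Clone, Copy)]
pub enum UmkaWaitHandle {
    Event(u64),
    Semaphore(u64),
}

pub enum NtWaitable {
    Event { native_id: u64 },
    // An NT mutex is a SyncEvent plus per-handle ownership tracking.
    Mutex { native_id: u64, owner: Option<u32> },
    Semaphore { native_id: u64 },
}

/// NtHandle[] -> UmkaWaitHandle[] (the "Event -> Event, Mutex -> Event +
/// ownership" step from the flow above).
pub fn translate_waitables(handles: &[NtWaitable]) -> Vec<UmkaWaitHandle> {
    handles
        .iter()
        .map(|h| match h {
            NtWaitable::Event { native_id } => UmkaWaitHandle::Event(*native_id),
            NtWaitable::Mutex { native_id, .. } => UmkaWaitHandle::Event(*native_id),
            NtWaitable::Semaphore { native_id } => UmkaWaitHandle::Semaphore(*native_id),
        })
        .collect()
}

// Documented Win32 return-value bases for the result translation step.
pub const WAIT_OBJECT_0: u32 = 0x00;
pub const WAIT_ABANDONED_0: u32 = 0x80;
pub const WAIT_TIMEOUT: u32 = 0x102;

/// Native result -> WAIT_OBJECT_0+index / WAIT_ABANDONED_0+index / WAIT_TIMEOUT.
pub fn translate_result(signaled_index: Option<usize>, abandoned: bool) -> u32 {
    match signaled_index {
        None => WAIT_TIMEOUT,
        Some(i) if abandoned => WAIT_ABANDONED_0 + i as u32,
        Some(i) => WAIT_OBJECT_0 + i as u32,
    }
}
```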
/// Result of waiting on NT synchronization objects.
/// Windows limits WaitForMultipleObjects to 64 handles (MAXIMUM_WAIT_OBJECTS).
/// This limit is enforced at runtime, not in the type system.
pub enum WaitResult {
/// One of the waited objects became signaled. The inner value is the
/// zero-based index of the signaled handle in the input array.
/// For WaitAll, all objects were acquired and the value is 0.
Signaled(usize),
/// Wait timed out before any object was signaled.
Timeout,
/// A mutex was abandoned (owner thread died while holding it).
/// The inner value is the index of the abandoned mutex.
/// Windows semantics: the waiter acquires the mutex but should check state.
Abandoned(usize),
/// An I/O completion port had a packet available (for alertable waits).
IoCompletion,
}
impl NtObjectManager {
/// Wait on multiple objects (events, mutexes, semaphores, threads, processes)
/// Returns when ANY object becomes signaled (WaitAny) or ALL (WaitAll)
pub fn wait_for_multiple_objects(
handles: &[NtHandle],
wait_all: bool,
timeout: Option<Duration>,
) -> Result<WaitResult> {
// --- WaitAny semantics ---
// Register on wait queues for all handles. When ANY object signals,
// the thread is woken. On wakeup, atomically consume the signaled
// object (reset auto-reset event, acquire mutex, decrement semaphore).
// Deregister from all wait queues before returning.
// --- WaitAll atomicity ---
// WaitAll requires atomic multi-acquire: either ALL objects are acquired
// in a single atomic operation, or NONE are. Implementation:
//
// 1. Sort handles by object address to establish lock ordering.
// 2. Acquire each object's lock in sorted order (prevents deadlock).
// 3. Check if ALL objects are signaled:
// - Event: signaled == true
// - Mutex: owner == None OR owner == current_thread (recursive)
// - Semaphore: count > 0
// - Process/Thread: terminated
// 4. If ALL signaled, atomically consume ALL (reset events, acquire
// mutexes, decrement semaphores) while still holding all locks.
// 5. Release all locks in reverse order.
// 6. If NOT all signaled, release all locks and block on wait queues
// (same as WaitAny). Retry step 1-5 on each wakeup.
//
// This two-phase locking ensures no partial acquisition: either the
// calling thread wins all objects, or it wins none and blocks.
//
// Lock ordering: Objects are sorted by their kernel address. This
// matches Windows NT's implementation and prevents deadlock when
// multiple threads WaitAll on overlapping handle sets.
//
// Already-held objects and deadlock pre-check:
//
// Invariant for deadlock-free operation: if a thread already holds any
// mutex in the WaitAll set, it MUST hold ALL mutexes in the set that
// sort before (lower address than) that mutex. If this invariant is
// violated the sorted-order protocol breaks down: the thread would skip
// an already-held object at sorted position i but still need to acquire
// an unheld object at position j < i. Another thread that holds the
// object at j and is waiting for i creates a classic ABBA deadlock.
//
// Example of the failure mode (the old "skip already-held" logic):
// Thread A holds M1, calls WaitAll([M1, M2]) → skips M1, blocks on M2.
// Thread B holds M2, calls WaitAll([M1, M2]) → blocks on M1.
// → deadlock despite sorted acquisition order.
//
// Pre-check algorithm (runs before the acquisition loop):
//
// let mut found_unheld = false;
// for obj in sorted_objects.iter() {
// if thread_holds(obj) {
// if found_unheld {
// // Already-held mutex appears after an unheld one in sorted
// // order. Another thread could hold the unheld object and
// // wait for this thread's object → deadlock.
// return Err(STATUS_POSSIBLE_DEADLOCK);
// }
// // Object is already held and all earlier objects are also held:
// // safe to skip (increment recursion count for recursive mutexes,
// // or return STATUS_MUTANT_NOT_OWNED for non-recursive ones).
// } else {
// found_unheld = true;
// }
// }
//
// If the pre-check passes, the caller either holds none of the objects
// (normal path) or holds a contiguous prefix of the sorted set (safe to
// skip those and acquire the suffix). In both cases the sorted-order
// protocol holds and deadlock is impossible.
//
// STATUS_POSSIBLE_DEADLOCK matches Windows NT semantics: NT's kernel
// issues this status from KeWaitForMutexObject when the deadlock
// condition is detected, allowing the caller to back off and retry.
}
}
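The pre-check pseudocode above is runnable almost verbatim. Here is a self-contained version, with objects identified by their sorted kernel addresses and the "thread_holds" predicate abstracted as a slice of already-held addresses (both illustrative simplifications):

```rust
/// Runnable sketch of the WaitAll deadlock pre-check described above.
#[derive(Debug, PartialEq)]
pub enum PreCheck {
    /// Safe: the caller holds exactly the first `held_prefix` objects in
    /// sorted order (possibly zero) and may acquire the remaining suffix.
    Ok { held_prefix: usize },
    /// An already-held object sorts after an unheld one -> the sorted-order
    /// protocol can no longer prevent ABBA deadlock (STATUS_POSSIBLE_DEADLOCK).
    PossibleDeadlock,
}

pub fn wait_all_precheck(sorted_addrs: &[usize], held: &[usize]) -> PreCheck {
    let mut found_unheld = false;
    let mut held_prefix = 0;
    for &addr in sorted_addrs {
        if held.contains(&addr) {
            if found_unheld {
                return PreCheck::PossibleDeadlock;
            }
            held_prefix += 1;
        } else {
            found_unheld = true;
        }
    }
    PreCheck::Ok { held_prefix }
}
```

The tests below replay the failure mode from the comment: thread A holding M1 passes the check (contiguous prefix), while thread B holding only M2 is rejected before it can block on M1.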
Why this matters for gaming:
- Game engines (Unreal, Unity) use multi-object waits heavily
- DirectX 11/12 synchronization uses events and mutexes
- 5-10x performance improvement over WINE's current userspace emulation
19.6.4 I/O Completion Ports (IOCP)¶
Problem: Windows IOCP is a high-performance async I/O primitive used by game servers, engines. Linux has io_uring but semantics don't match. WINE emulates IOCP poorly.
UmkaOS WEA approach: Kernel-native IOCP implementation.
/// Maximum pending completion packets per IOCP. Prevents unbounded kernel
/// memory growth from userspace posting. Windows doesn't document a hard
/// limit; we use 64K which exceeds any practical game workload.
pub const NT_MAX_IOCP_PACKETS: usize = 65536;
/// NtIocp is always heap-allocated via `Arc<NtIocp>` because the inline
/// BoundedMpmcQueue is 65536 × 24 bytes ≈ 1.5 MB — far too large for any
/// kernel stack. All code paths create `Arc::new(NtIocp { .. })`.
pub struct NtIocp {
/// Completion queue (MPMC: many threads post via I/O completion or
/// PostQueuedCompletionStatus, multiple worker threads consume via
/// GetQueuedCompletionStatus). The `concurrency` field limits how many
/// threads can dequeue simultaneously. Bounded to NT_MAX_IOCP_PACKETS;
/// posting to a full queue returns STATUS_INSUFFICIENT_RESOURCES.
/// Heap-allocated: the queue alone is ~1.5 MB (65536 entries × 24 bytes),
/// which exceeds kernel stack limits. The owning NtIocp is always behind
/// Arc, so this field lives on the heap.
completion_queue: BoundedMpmcQueue<IocpPacket, NT_MAX_IOCP_PACKETS>,
/// Associated threads (NT allows binding threads to IOCP)
/// Max threads that can dequeue simultaneously. Clamped to
/// [1, nr_cpu_ids * 2] at creation time. A value of 0 in the NT API means
/// "number of processors" (matches Windows behavior). Values above
/// nr_cpu_ids * 2 are clamped (prevents resource waste from misconfigured
/// Wine prefixes passing unreasonable concurrency values).
concurrency: usize,
/// Wait queue for GetQueuedCompletionStatus
wait_queue: WaitQueue,
}
pub struct IocpPacket {
bytes_transferred: u32,
completion_key: usize, // User-defined per-handle key
/// User-provided OVERLAPPED pointer. This is an **opaque token** that the kernel
/// never dereferences — it is stored on PostQueuedCompletionStatus and returned
/// unchanged on GetQueuedCompletionStatus. The caller is responsible for ensuring
/// the pointer remains valid until dequeued. The kernel treats this as a usize
/// (not a validated UserPtr) because it is purely userspace-to-userspace data flow.
overlapped: usize, // Opaque user pointer (NOT dereferenced by kernel)
status: i32, // NT status code
}
// Syscalls
SYS_nt_create_iocp(concurrency: usize) -> Result<NtHandle>;
SYS_nt_associate_file_with_iocp(
file: Fd,
iocp: NtHandle,
completion_key: usize,
) -> Result<()>;
SYS_nt_post_queued_completion_status(
iocp: NtHandle,
packet: IocpPacket,
) -> Result<()>;
SYS_nt_get_queued_completion_status(
iocp: NtHandle,
timeout: Option<Duration>,
) -> Result<IocpPacket>;
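The concurrency clamping rule described in the NtIocp comments can be sketched as a single helper (name illustrative). It assumes `nr_cpu_ids >= 1`, which holds for any running system:

```rust
/// Illustrative sketch of IOCP concurrency clamping: 0 means "number of
/// processors" (NT convention), and the result is clamped to
/// [1, nr_cpu_ids * 2] to bound resource usage from misconfigured prefixes.
pub fn clamp_iocp_concurrency(requested: usize, nr_cpu_ids: usize) -> usize {
    debug_assert!(nr_cpu_ids >= 1);
    let value = if requested == 0 { nr_cpu_ids } else { requested };
    value.clamp(1, nr_cpu_ids * 2)
}
```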
Why this matters:
- Multiplayer game servers (Rust game servers, Minecraft servers under Wine)
- Game engines with async asset loading
- Network code in games (sockets + IOCP)
Implementation note: IOCP is a thin wrapper over UmkaOS's existing async I/O
infrastructure. The completion queue (BoundedMpmcQueue) uses the same ring buffer
primitives as KABI domain rings (Section 11.8). File
association (SYS_nt_associate_file_with_iocp) registers an io_uring-style completion
callback (Section 19.3) that posts IocpPacket entries to the IOCP queue
when I/O completes. GetQueuedCompletionStatus dequeues from the ring with the
concurrency limiter. The kernel does NOT implement a separate I/O dispatch path for
IOCP — all I/O goes through the standard block/network paths; only the completion
notification is redirected from CQE to IOCP queue.
19.6.5 Memory Management Acceleration¶
Problem: Windows VirtualAlloc, VirtualFree, VirtualProtect have specific semantics that don't map cleanly to mmap/munmap/mprotect:
- Reservation vs commit: Reserve address space without allocating pages, commit later
- MEM_RESET: Discard pages but keep address range mapped (Linux has MADV_DONTNEED but semantics differ)
- Guard pages: PAGE_GUARD causes exception on first access, then becomes normal page
- Large pages: MEM_LARGE_PAGES (2MB/1GB pages)
UmkaOS WEA approach: Extended mmap with Windows-compatible flags. SYS_mmap_wea
is a thin wrapper around the standard VMM mmap path
(Section 4.15), adding NT-specific flag translation:
MEM_RESERVE → MAP_NORESERVE, MEM_COMMIT → demand-paging fault handler,
MEM_RESET → madvise(MADV_DONTNEED), PAGE_GUARD → VMA flag + page fault hook.
The VMM implementation is shared — WEA only translates flags and tracks per-VMA
NT allocation state (reserved vs committed regions).
// Extend existing UmkaOS mmap syscall with WEA flags
SYS_mmap_wea(
addr: Option<*mut u8>,
size: usize,
protection: u32, // PAGE_READWRITE | PAGE_EXECUTE_READ | ...
flags: u32, // MEM_RESERVE, MEM_COMMIT, MEM_RESET, MEM_LARGE_PAGES
fd: Option<Fd>,
) -> Result<*mut u8>;
// New syscalls for Windows-specific ops
SYS_virtual_protect(
addr: *mut u8,
size: usize,
new_protection: u32,
old_protection: &mut u32, // Windows returns old protection
) -> Result<()>;
SYS_virtual_lock(
addr: *mut u8,
size: usize,
) -> Result<()>; // Pin pages in RAM (VirtualLock)
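The flag translation described above can be sketched as a pure function. The MEM_* constants are the documented Win32 values; the `NativeMapOp` enum and the function name are illustrative, and the sketch assumes the Windows rule that MEM_RESET cannot be combined with commit or reserve:

```rust
/// Illustrative sketch of SYS_mmap_wea flag translation.
/// Documented Win32 allocation-type values:
pub const MEM_COMMIT: u32 = 0x0000_1000;
pub const MEM_RESERVE: u32 = 0x0000_2000;
pub const MEM_RESET: u32 = 0x0008_0000;

const STATUS_INVALID_PARAMETER: u32 = 0xC000_000D;

#[derive(Debug, PartialEq)]
pub enum NativeMapOp {
    /// MEM_RESERVE alone -> mmap(MAP_NORESERVE), no backing pages yet.
    ReserveOnly,
    /// MEM_COMMIT -> pages materialize via the demand-paging fault handler.
    CommitDemandPaged,
    /// MEM_RESET -> madvise(MADV_DONTNEED) on an existing range.
    ResetRange,
}

pub fn translate_wea_flags(flags: u32) -> Result<NativeMapOp, u32> {
    let reset = flags & MEM_RESET != 0;
    let commit = flags & MEM_COMMIT != 0;
    let reserve = flags & MEM_RESERVE != 0;
    match (reset, commit, reserve) {
        (true, false, false) => Ok(NativeMapOp::ResetRange),
        // MEM_RESET must not be combined with commit/reserve (Windows rule).
        (true, _, _) => Err(STATUS_INVALID_PARAMETER),
        (false, true, _) => Ok(NativeMapOp::CommitDemandPaged),
        (false, false, true) => Ok(NativeMapOp::ReserveOnly),
        (false, false, false) => Err(STATUS_INVALID_PARAMETER),
    }
}
```

Note that MEM_COMMIT | MEM_RESERVE in one call is valid (reserve and commit together), which is why the commit arm ignores the reserve bit.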
Why this matters:
- Games use VirtualAlloc for custom allocators
- JIT compilers (C#/CLR games) use executable memory allocation
- DX12 resource heaps use large page allocations
19.6.6 NT Thread Model and Fiber Support¶
Problem: Windows threads have TEB (Thread Environment Block), fiber contexts (cooperative coroutines), FLS (Fiber Local Storage), and APC (Asynchronous Procedure Call) queues. WINE emulates most of this in userspace; the gaps are performance and correctness of blocking-in-fiber.
UmkaOS WEA approach: Extend UmkaOS thread model with NT-compatible TLS and APC support. Fiber support leverages the native UmkaOS scheduler upcall mechanism (Section 8.1) for correct blocking behaviour.
pub struct NtThread {
/// Standard UmkaOS thread.
umka_thread: Arc<Task>,
/// Thread Environment Block — allocated in user address space.
/// Kernel records the address for fast NtCurrentTeb() via GS base.
teb_address: *mut NtTeb,
/// APC queue (kernel-mode and user-mode APCs). Uses intrusive linked list
/// to avoid heap allocation under spinlock. Apc nodes are allocated from
/// a pre-allocated per-thread pool (max 64 pending APCs per thread).
apc_queue: SpinLock<IntrusiveList<Apc>>,
/// Pre-allocated APC node pool. Avoids allocator calls under spinlock.
apc_pool: [MaybeUninit<ApcNode>; NT_MAX_PENDING_APCS],
apc_pool_bitmap: AtomicU64, // 64 slots, 1 bit each
}
/// Maximum pending APCs per thread. Windows doesn't document a hard limit,
/// but practical applications rarely exceed a handful.
pub const NT_MAX_PENDING_APCS: usize = 64;
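Allocation from the 64-slot APC pool bitmap (bit set = slot in use) can be sketched as a lock-free CAS loop; the helper names are illustrative, not the kernel's actual API:

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Illustrative sketch: find and claim a free slot in a 64-bit bitmap.
/// Returns None when all 64 slots are in use (the syscall would then
/// return STATUS_INSUFFICIENT_RESOURCES).
pub fn apc_pool_alloc(bitmap: &AtomicU64) -> Option<usize> {
    loop {
        let current = bitmap.load(Ordering::Acquire);
        let free = !current;
        if free == 0 {
            return None; // pool exhausted
        }
        let slot = free.trailing_zeros() as usize;
        let new = current | (1u64 << slot);
        // CAS claims the slot; retry if another thread raced us.
        if bitmap
            .compare_exchange_weak(current, new, Ordering::AcqRel, Ordering::Acquire)
            .is_ok()
        {
            return Some(slot);
        }
    }
}

pub fn apc_pool_free(bitmap: &AtomicU64, slot: usize) {
    debug_assert!(slot < 64);
    bitmap.fetch_and(!(1u64 << slot), Ordering::AcqRel);
}
```

Because allocation never calls the heap allocator, queueing an APC is safe under the per-thread spinlock, which is the point of the pre-allocated pool.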
// Kernel-to-userspace boundary. NtTeb contains raw pointers with
// platform-dependent sizes (x86-64 only in practice for WEA). No
// fixed const_assert; the kernel allocates at least 0x1000 bytes
// regardless of sizeof(NtTeb) to match Windows TEB page layout.
// kernel-internal, not KABI
#[repr(C)]
pub struct NtTeb {
/// NtTib.Self: self-pointer (always TEB[0], offset 0x00 on x64).
self_ptr: *mut NtTeb,
/// NtTib.StackBase / StackLimit: valid stack range for current fiber.
/// Updated by WINE's SwitchToFiber() — userspace write, no syscall.
stack_base: *mut u8,
stack_limit: *mut u8,
/// NtTib.FiberData: pointer to the active fiber's data block.
/// Updated by WINE on every SwitchToFiber() — userspace write.
fiber_data: *mut u8,
// Kernel maintains these fields at thread creation time.
// WINE manages the full TEB layout; kernel only guarantees:
// - TEB is allocated and zeroed to at least 0x1000 bytes (Windows x64 minimum)
// - GS base points to TEB (x64) or FS base (x86 WoW64)
// - self_ptr is initialized to TEB address
// - stack_base/stack_limit are set from thread stack
// WINE is responsible for populating remaining fields (PEB pointer at 0x60,
// LastErrorValue at 0x68, TLS array at 0x58, etc.) before first user-mode entry.
}
pub struct Apc {
routine: extern "C" fn(*mut u8),
context: *mut u8,
mode: ApcMode, // KernelMode vs UserMode
}
// WEA syscalls for APC support.
// SYS_nt_queue_apc returns STATUS_INSUFFICIENT_RESOURCES if the target thread's
// APC pool (64 entries) is exhausted. This is not a Windows-documented limit,
// but practical applications rarely exceed it. WINE can retry or log a warning.
SYS_nt_queue_apc(thread: NtHandle, routine: extern "C" fn(*mut u8), context: *mut u8) -> Result<()>;
SYS_nt_alert_thread(thread: NtHandle) -> Result<()>;
SYS_nt_test_alert() -> Result<bool>;
Fiber kernel responsibilities — what requires kernel involvement and what does not:
| Win32 API | Kernel role | Implementation |
|---|---|---|
| ConvertThreadToFiber() | Allocate upcall stack, call SYS_register_scheduler_upcall | WINE calls Section 8.1 registration |
| CreateFiber(size, fn, p) | None | WINE allocates stack, sets up UpcallFrame in userspace |
| SwitchToFiber(fiber) | None | WINE saves registers, swaps stack pointer, updates TEB.FiberData — pure userspace |
| DeleteFiber(fiber) | None | WINE frees stack |
| FlsAlloc / FlsGetValue / FlsSetValue | None | WINE maintains per-fiber FLS table in user address space; pointer swapped on SwitchToFiber |
| Fiber calls blocking syscall | Scheduler upcall (Section 8.1) | Kernel invokes upcall; WINE converts to io_uring, parks fiber, runs another |
Fiber Local Storage (FLS):
Fiber Local Storage provides per-fiber storage analogous to thread-local storage, matching the Windows FLS API (FlsAlloc/FlsSetValue/FlsGetValue/FlsFree) that the WEA compatibility layer must support.
/// Per-fiber local storage block. Each fiber has one FLS block allocated
/// with its stack. Windows supports up to 1088 FLS slots (FLS_MAXIMUM_AVAILABLE).
pub struct FiberLocalStorage {
/// Storage slots. Index is the FLS slot ID returned by fls_alloc().
slots: Box<[FlsSlot; FLS_MAXIMUM_AVAILABLE]>,
/// Number of allocated slots (highest used index + 1).
allocated: u32,
}
/// One FLS slot: a value and an optional destructor called when the fiber exits.
pub struct FlsSlot {
/// The stored value (pointer-sized). Zero if unset.
pub value: usize,
/// Optional destructor called with `value` when the fiber exits or
/// fls_free() is called while the slot is set. Called before the
/// fiber's stack is freed.
pub destructor: Option<fn(usize)>,
}
/// Maximum number of FLS slots per fiber (matches Windows FLS_MAXIMUM_AVAILABLE).
pub const FLS_MAXIMUM_AVAILABLE: usize = 1088;
FLS operations:
fls_alloc(destructor: Option<fn(usize)>) -> Result<u32, FlsError>:
Allocates the next free FLS slot index. Returns the slot index.
Returns FlsError::NoMoreSlots if all 1088 slots are in use.
fls_set_value(index: u32, value: usize) -> Result<(), FlsError>:
Sets the value for slot `index` in the current fiber's FLS block.
Returns FlsError::InvalidIndex if index >= FLS_MAXIMUM_AVAILABLE
or the slot has not been allocated via fls_alloc().
fls_get_value(index: u32) -> Result<usize, FlsError>:
Reads the value for slot `index`. Returns 0 if set to zero or
never set. Returns FlsError::InvalidIndex for invalid/unallocated index.
fls_free(index: u32) -> Result<(), FlsError>:
Frees slot `index`. Calls the destructor (if set and value != 0)
before clearing the slot. The slot index may be reused by future
fls_alloc() calls.
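The four operations specified above can be sketched as a small userspace model. This is illustrative only: it uses a growable `Vec` with a caller-chosen capacity instead of the fixed 1088-slot `FLS_MAXIMUM_AVAILABLE` array, and tracks allocation with a per-slot flag rather than a separate `allocated` counter:

```rust
/// Illustrative model of the FLS operations above.
#[derive(Debug, PartialEq)]
pub enum FlsError { NoMoreSlots, InvalidIndex }

#[derive(Clone, Copy)]
struct FlsSlot {
    allocated: bool,
    value: usize,
    destructor: Option<fn(usize)>,
}

pub struct FiberLocalStorage {
    slots: Vec<FlsSlot>,
}

impl FiberLocalStorage {
    pub fn new(capacity: usize) -> Self {
        let empty = FlsSlot { allocated: false, value: 0, destructor: None };
        Self { slots: vec![empty; capacity] }
    }

    /// Allocates the lowest free slot index.
    pub fn fls_alloc(&mut self, destructor: Option<fn(usize)>) -> Result<u32, FlsError> {
        for (i, slot) in self.slots.iter_mut().enumerate() {
            if !slot.allocated {
                *slot = FlsSlot { allocated: true, value: 0, destructor };
                return Ok(i as u32);
            }
        }
        Err(FlsError::NoMoreSlots)
    }

    pub fn fls_set_value(&mut self, index: u32, value: usize) -> Result<(), FlsError> {
        match self.slots.get_mut(index as usize) {
            Some(s) if s.allocated => { s.value = value; Ok(()) }
            _ => Err(FlsError::InvalidIndex),
        }
    }

    pub fn fls_get_value(&self, index: u32) -> Result<usize, FlsError> {
        match self.slots.get(index as usize) {
            Some(s) if s.allocated => Ok(s.value),
            _ => Err(FlsError::InvalidIndex),
        }
    }

    /// Runs the destructor (if set and value != 0), then frees the slot
    /// for reuse by future fls_alloc() calls.
    pub fn fls_free(&mut self, index: u32) -> Result<(), FlsError> {
        match self.slots.get_mut(index as usize) {
            Some(s) if s.allocated => {
                if let Some(d) = s.destructor {
                    if s.value != 0 { d(s.value); }
                }
                *s = FlsSlot { allocated: false, value: 0, destructor: None };
                Ok(())
            }
            _ => Err(FlsError::InvalidIndex),
        }
    }
}
```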
Fiber stack allocation:
Fibers use UmkaOS's normal virtual memory allocator. Stack size is specified at
creation time via CreateFiber(stack_size, proc, param):
- Minimum stack: 64 KB (aligned up if caller requests less)
- Default stack: 1 MB (matches Windows default fiber stack)
- Maximum stack: process virtual address space limit
- Guard page: one no-access page below the stack (catches stack overflow)
- The fiber stack VA range is allocated with mmap(MAP_ANONYMOUS | MAP_STACK);
the guard page uses mprotect(PROT_NONE) on the bottom page.
Fiber context switch cost: ~40-80 ns (same as swapcontext() — save/restore
GPRs + FPU state + FLS block pointer, no kernel involvement).
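The stack-size policy above reduces to a small sizing function. This sketch assumes 4 KiB pages and that a requested size of 0 selects the default (as CreateFiber does on Windows); the function name is illustrative:

```rust
/// Illustrative fiber stack sizing per the policy above: 0 -> default 1 MB,
/// sub-minimum requests rounded up to 64 KB, result page-aligned upward.
/// (The PROT_NONE guard page sits below this range and is not counted here.)
pub const PAGE_SIZE: usize = 4096;
pub const FIBER_STACK_MIN: usize = 64 * 1024;
pub const FIBER_STACK_DEFAULT: usize = 1024 * 1024;

pub fn fiber_stack_size(requested: usize) -> usize {
    let size = if requested == 0 {
        FIBER_STACK_DEFAULT
    } else {
        requested.max(FIBER_STACK_MIN)
    };
    // Round up to a whole number of pages.
    (size + PAGE_SIZE - 1) & !(PAGE_SIZE - 1)
}
```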
Why blocking-in-fiber is the only hard problem: SwitchToFiber needs zero
kernel involvement — it is register save/restore. FLS is an array in user
memory. The problem is a fiber calling NtReadFile (→ read(2)) which would
block the OS thread, starving all other fibers. The Section 8.1 scheduler upcall
mechanism solves this: WINE registers an upcall handler on the OS thread; when
any fiber's syscall would block, the kernel invokes the handler, which submits
the I/O to io_uring and runs the next fiber. The OS thread remains live.
This is exactly how Naughty Dog's fiber-based job system (and similar game-engine job schedulers) achieves high core utilisation — fibers never "waste" a core waiting for I/O or synchronisation.
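The upcall flow can be sketched as a toy state machine. `FiberThread`, `on_would_block`, and `on_complete` are illustrative names modeling the sequence described above (park the blocked fiber, record the pending I/O in place of an io_uring submission, resume the next runnable fiber); real code operates on saved register contexts, not this toy state:

```rust
/// Toy model of the scheduler-upcall flow: one OS thread multiplexing fibers.
/// Illustrative only; names and representation are not WINE/UmkaOS code.
use std::collections::VecDeque;

struct FiberThread {
    ready: VecDeque<u32>,    // runnable fiber ids
    parked: Vec<(u32, u32)>, // (fiber id, pending I/O id) -- "submitted to io_uring"
    current: Option<u32>,    // fiber currently running on the OS thread
}

impl FiberThread {
    /// Upcall handler: invoked (conceptually by the kernel) instead of blocking.
    fn on_would_block(&mut self, io_id: u32) -> Option<u32> {
        let blocked = self.current.take().expect("a fiber is running");
        self.parked.push((blocked, io_id)); // park fiber, I/O stays in flight
        self.current = self.ready.pop_front(); // the OS thread runs the next fiber
        self.current
    }

    /// I/O completion: the parked fiber becomes runnable again.
    fn on_complete(&mut self, io_id: u32) {
        if let Some(pos) = self.parked.iter().position(|&(_, io)| io == io_id) {
            let (fiber, _) = self.parked.remove(pos);
            self.ready.push_back(fiber);
        }
    }
}
```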
Why this matters:
- Games using Windows fiber-based job systems (Destiny, various Unreal titles)
- Windows thread pool APIs (TpCallbackMayRunLong, TP_CALLBACK_ENVIRON)
- .NET/C# games (CLR uses APCs for garbage collection suspension)
- Anti-cheat systems that inspect TEB/fiber state
19.6.7 Security & Token Model¶
Problem: Windows has security tokens (user SID, group SIDs, privileges). Many games/launchers check tokens. WINE fakes most of this.
UmkaOS WEA approach: Minimal NT token emulation (not full Windows security, just enough for compatibility).
/// Maximum groups per token. Windows allows up to 1024 groups; we use a lower
/// limit since WINE/Proton games typically need far fewer.
pub const NT_MAX_TOKEN_GROUPS: usize = 128;
/// Maximum privileges per token. Windows defines ~36 privileges; we cap at 64.
pub const NT_MAX_TOKEN_PRIVILEGES: usize = 64;
pub struct NtToken {
/// User SID (S-1-5-21-...)
user_sid: WinSid,
/// Groups (Administrators, Users, etc.). Fixed-capacity array to prevent
/// unbounded kernel memory growth from malicious token inflation.
groups: ArrayVec<WinSid, NT_MAX_TOKEN_GROUPS>,
/// Privileges (SeDebugPrivilege, SeBackupPrivilege, etc.)
/// Most are no-ops, but games check for them. Fixed-capacity bitset.
privileges: BitArray<[u64; 1]>, // 64 bits = 64 privilege slots
/// Integrity level (Low, Medium, High, System)
integrity_level: IntegrityLevel,
}
// Syscalls
SYS_nt_open_process_token(
process: NtHandle,
access: u32,
) -> Result<NtHandle>;
SYS_nt_query_token_information(
token: NtHandle,
class: TokenInformationClass,
buffer: *mut u8,
buffer_len: u32,
) -> Result<u32>; // Returns bytes written
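A privilege check against the 64-bit bitset might look like the following sketch, using a plain u64 in place of BitArray<[u64; 1]>. The two privilege indices shown follow the standard winnt.h LUID values; the type and method names are illustrative:

```rust
/// Sketch of a privilege check against the 64-bit privilege bitset above.
/// SE_* values match winnt.h privilege LUIDs; `TokenPrivileges` is illustrative.
const SE_BACKUP_PRIVILEGE: u32 = 17; // SeBackupPrivilege
const SE_DEBUG_PRIVILEGE: u32 = 20;  // SeDebugPrivilege

struct TokenPrivileges(u64); // stands in for BitArray<[u64; 1]>

impl TokenPrivileges {
    fn grant(&mut self, privilege: u32) {
        debug_assert!(privilege < 64, "only 64 privilege slots");
        self.0 |= 1u64 << privilege;
    }
    fn is_held(&self, privilege: u32) -> bool {
        privilege < 64 && self.0 & (1u64 << privilege) != 0
    }
}
```

Even though most privileges are no-ops in WEA, the bit must still read back as held, since launchers and anti-cheat code query it via WEA_TOKEN_QUERY.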
Why this matters:
- Game launchers (Epic, Ubisoft) check admin privileges
- Anti-cheat checks process token integrity level
- Windows Store games check app container tokens
19.6.8 Structured Exception Handling (SEH)¶
Problem: Windows uses SEH (Structured Exception Handling) for both C++ exceptions and hardware exceptions (access violations, divide-by-zero). x86-64 Windows uses table-based unwinding. WINE emulates via signal handlers.
UmkaOS WEA approach: Kernel-assisted SEH dispatch with safety bounds.
// When hardware exception occurs (page fault, illegal instruction, etc.):
// 1. Kernel looks up exception handler chain in TEB
// 2. Validates and calls user-mode exception handlers in order
// 3. If unhandled, terminates process (Windows behavior)
pub struct ExceptionRecord {
exception_code: u32, // STATUS_ACCESS_VIOLATION, etc.
exception_flags: u32,
exception_address: usize,
parameters: [usize; 15], // Exception-specific data
}
// When CPU exception occurs, kernel:
// 1. Saves context (registers, stack)
// 2. Reads TEB->ExceptionList (user address, validated)
// 3. For each handler in the chain (max SEH_MAX_CHAIN_DEPTH = 64):
// a. Validate record.next is within the current thread's stack VMA (stack-pivot defense)
// b. Validate handler address is in executable user pages
// c. Validate next pointer is in readable user pages or NULL
// d. Call handler via controlled user-mode return
// e. If handler returns EXCEPTION_EXECUTE_HANDLER, unwind to it
// 4. If chain exhausted or max depth reached, terminate process
// Safety invariants enforced by kernel:
// - Each EXCEPTION_REGISTRATION_RECORD.next MUST be within the thread's stack VMA;
// a pointer outside the stack indicates a stack-pivot attack (see validate_seh_chain)
// - Each handler address must be in VMA with PROT_EXEC
// - Each EXCEPTION_REGISTRATION_RECORD must be in readable user memory
// - Chain traversal stops at 0xFFFFFFFF (end sentinel), invalid pointer, or depth limit
// - Circular chains detected via depth limit
/// Maximum SEH chain depth to traverse. Prevents both infinite loops and stack-pivot
/// attacks via over-long chains. Windows doesn't document a limit; practical applications
/// rarely exceed 10-20 handlers. 64 provides ample headroom with a tight security bound.
pub const SEH_MAX_CHAIN_DEPTH: usize = 64;
fn validate_seh_chain(initial_record: u32) -> Result<(), SehError> {
let stack_vma = current_task().stack_vma();
let mut record_addr = initial_record; // read from FS:[0] / TEB.ExceptionList
let mut depth = 0usize;
while record_addr != 0xFFFF_FFFF {
// Bounds check: record must be within the thread's stack
if !stack_vma.contains(record_addr as usize) {
return Err(SehError::RecordOutsideStack { addr: record_addr });
}
// Handler must be in executable memory
let record = read_user_seh_record(record_addr)?;
if !is_executable(record.handler) {
return Err(SehError::HandlerNotExecutable);
}
depth += 1;
if depth > SEH_MAX_CHAIN_DEPTH {
return Err(SehError::ChainTooLong);
}
record_addr = record.next;
}
Ok(())
}
Scope note: SEH validation verifies that handler addresses are in executable pages and that all
EXCEPTION_REGISTRATION_RECORD nodes reside within the thread's stack VMA — matching Windows compatibility while closing the stack-pivot attack vector. It does not prevent ROP (Return-Oriented Programming) gadget use; Windows itself does not prevent ROP gadgets in SEH handlers. Applications needing ROP protection should use Control Flow Guard (CFG) or Arbitrary Code Guard (ACG) via SetProcessMitigationPolicy.
Why this matters:
- Windows games compiled with MSVC use SEH
- Access violations (common in games with bugs) are handled differently than Linux segfaults
- Debuggers need to intercept first-chance exceptions
19.6.9 Performance: Projected Comparison¶
Note: These are design-phase projections, not measured benchmarks. WEA is not yet implemented. The estimates are based on syscall overhead analysis (measuring existing wineserver round-trip vs expected kernel object access latency) and comparable Linux kernel primitives (futex, epoll). Actual performance will be validated during implementation.
Projected workload: Unreal Engine 5 game loading (Proton on Linux vs WEA on UmkaOS)
| Operation | Linux + WINE (est.) | UmkaOS + WEA (projected) | Projected Speedup |
|---|---|---|---|
| CreateEvent (named) | ~15 μs (wineserver RPC; measured end-to-end including wineserver object lookup and state update; raw IPC round-trip on modern hardware is 3–5 μs, but wineserver processing adds 10–12 μs) | ~1.5 μs (kernel object) | targeted ~10x (assuming workload is syscall-latency-bottlenecked; compute-bound workloads see 0% gain) |
| WaitForMultipleObjects (8 handles) | ~8 μs (poll + wineserver) | ~0.5 μs (kernel wait) | targeted ~16x improvement (for CreateEvent/WaitForSingleObject-heavy patterns) |
| VirtualAlloc (100 MB) | ~50 μs (mmap + tracking) | ~20 μs (native) | ~2.5x |
| IOCP GetQueuedCompletionStatus | ~4 μs (eventfd + epoll) | ~0.8 μs (kernel queue) | targeted ~5x improvement (for I/O-intensive patterns) |
| MapViewOfFile (section) | ~12 μs (shm + mmap) | ~3 μs (kernel section) | ~4x |
Note: Speedup projections are based on profiling Wine/Proton on synthetic CreateEvent/WaitForSingleObject and I/O benchmarks. Actual gains depend strongly on workload characteristics. Compute-bound applications see no improvement from WEA; the benefit is concentrated in applications that make frequent Windows API calls with high syscall overhead.
Assumptions: x86-64, Intel Core i7-12700K, Linux 6.1, WINE 8.x, single-threaded microbenchmarks. Real game workloads will show smaller end-to-end improvements due to GPU-bound and I/O-bound phases.
Projected game impact: 10-20% faster loading (synchronization-heavy), 5-10% better frame pacing (reduced NT emulation jitter). These projections require validation.
19.6.10 API Surface & Stability¶
Key principle: WEA is an internal UmkaOS syscall API, not a Windows-compatible ABI. WINE/Proton are the only consumers.
Stability guarantee:
- WEA operations use negative syscall numbers in the -0x0800..-0x08FF range, dispatched
directly through the bidirectional table (Section 19.1,
Section 19.8). No multiplexer
overhead — each WEA op dispatches as fast as any Linux syscall.
- Versioned API (WEA v1, v2, etc.) with capability negotiation via umka_op::WEA_VERSION_QUERY (-0x0800).
- WINE can check: "Does kernel support WEA v2?" before using new features.
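The negotiation pattern is: query once at startup, then gate each optional feature on the reported version and feature bits. A sketch; `WeaVersionInfo` and the feature-bit names are hypothetical, since the WEA_VERSION_QUERY result layout is not yet specified:

```rust
/// Sketch of WEA capability negotiation. The struct layout and feature bits
/// are hypothetical placeholders for the real WEA_VERSION_QUERY ABI.
#[derive(Clone, Copy)]
struct WeaVersionInfo {
    version: u32,  // WEA major version (1, 2, ...)
    features: u64, // feature bitmask within that version
}

const WEA_FEAT_APC: u64 = 1 << 0;    // hypothetical feature bit
const WEA_FEAT_TOKENS: u64 = 1 << 1; // hypothetical feature bit

/// WINE calls this once at startup with the kernel's reply, then caches it.
fn wea_has(info: WeaVersionInfo, min_version: u32, feature: u64) -> bool {
    info.version >= min_version && info.features & feature != 0
}
```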
Non-goal: WEA does not aim to run Windows binaries directly. WINE/Proton still required for:
- PE executable loading
- DLL loading, import resolution
- Win32 API emulation (user32.dll, kernel32.dll, etc.)
- DirectX → Vulkan translation (DXVK, VKD3D)
WEA only accelerates the kernel-level primitives that WINE currently emulates poorly.
19.6.11 Implementation Roadmap¶
Phased Development Plan (no time estimates per UmkaOS policy):
Phase 1: NT object manager + basic synchronization
- Event, Mutex, Semaphore objects
- WaitForSingleObject, WaitForMultipleObjects
- Named object namespace
Phase 2: Memory management
- VirtualAlloc/VirtualFree with Windows semantics
- Section objects (shared memory)
- MapViewOfSection, UnmapViewOfSection
Phase 3: I/O completion ports
- IOCP creation, association, posting, dequeuing
- Integration with UmkaOS async I/O
Phase 4: Thread model extensions
- TEB support + fast NtCurrentTeb() via GS base
- APC queues
- Scheduler upcall registration (SYS_register_scheduler_upcall, Section 8.1)
enabling correct fiber blocking behaviour for SwitchToFiber-based job systems
Phase 5: Security & tokens
- Minimal NT token emulation
- Privilege checks (mostly no-ops)
Phase 6: SEH support
- Kernel-assisted exception dispatch
- Unwind table parsing (x86-64)
Dependency: WINE/Proton must be modified to use WEA syscalls. Upstream WINE may not accept (they target all UNIX platforms). Proton fork more realistic (Valve controls it, Steam Deck focus).
19.6.12 Benefits Summary¶
For users (projected, pending validation — see Section 19.6):
- Games projected to run 10-20% faster loading under Proton on UmkaOS vs Linux
- Better compatibility (some games that break on WINE/Linux may work on WEA/UmkaOS)
- Lower input latency (reduced NT emulation jitter)

For WINE/Proton developers:
- Less complex userspace emulation code
- Fewer bugs (kernel enforces correctness)
- Easier to support new Windows features (kernel does heavy lifting)

For UmkaOS:
- Gaming becomes a differentiation point vs Linux
- "Best platform for Windows gaming outside Windows" marketing
- Drives enthusiast adoption

Market impact:
- Steam Deck successor (if Valve interested)?
- Gaming-focused UmkaOS distribution (like SteamOS but UmkaOS-based)?
- Differentiation in the "Linux for gaming" space
19.6.13 Open Questions¶
- Upstream WINE acceptance?
  - WINE targets macOS, FreeBSD, Solaris — not just Linux
  - UmkaOS-specific syscalls might not be upstreamable
  - Solution: Maintain UmkaOS-specific WINE fork OR Proton-only support
- Anti-cheat compatibility?
  - EAC, BattlEye check kernel behavior
  - WEA changes kernel behavior (more Windows-like)
  - Could this improve or break anti-cheat support?
- Maintenance burden?
  - Windows NT is a moving target (Windows 11, Windows 12...)
  - UmkaOS must track changes to NT kernel APIs
  - Mitigation: Focus on stable APIs (NT 6.x kernel, used in Win7-Win11)
- Security implications?
  - NT object namespace shared across processes
  - Named objects can be hijacked (race conditions)
  - Resolved: Atomic create-or-open under write lock prevents TOCTOU — see Section 19.6, NtObjectManager::create_named. Container isolation via NtSecurityDescriptor prevents cross-container object squatting.
- 32-bit Windows game support?
  - Many Windows games are still 32-bit (i686 PE executables)
  - UmkaOS does not support ia32 multilib (Section 19.7 "Deliberately Dropped")
  - Design decision: 32-bit Windows games run via WINE's WoW64-style thunking. WINE already implements 32-to-64 syscall translation for Linux. WEA syscalls are 64-bit only; WINE's 32-bit ntdll.dll thunks to 64-bit before calling WEA. This maintains UmkaOS's clean 64-bit-only syscall surface while supporting 32-bit games. Performance impact is minimal: the thunk is one function call in WINE's address space, not a kernel transition.
19.7 Deliberately Dropped Compatibility¶
These Linux features are intentionally not supported. Each omission protects a core design property of UmkaOS.
| Dropped feature | Why | Design property protected |
|---|---|---|
| Binary .ko kernel modules | Would require emulating Linux's unstable internal API. UmkaOS uses .uko modules with stable KABI. Linux module tools (modprobe, lsmod, rmmod) work unmodified with .uko via compatible syscalls and /lib/modules/umka-X.Y.Z/ layout (Section 12.7). | Stable KABI |
| 32-bit compat layers (i386-on-x86-64, AArch32-on-AArch64, PPC32-on-PPC64LE) | Doubles syscall surface, complicates signal handling. UmkaOS builds separate kernels per architecture — run the 32-bit kernel for 32-bit binaries. | Clean architecture |
| /dev/mem and /dev/kmem | Raw physical/kernel memory access | Capability-based security |
| Obsolete syscalls (~50+) | old_stat, socketcall, ipc multiplexer, etc. | Clean syscall surface |
| /sys/module/*/parameters | Tied to .ko module model; replaced by /ukfs/kernel/drivers/<name>/config/ (Section 11.4) | KABI-native configuration |
| Kernel cmdline module params | modname.param=val syntax tied to .ko model; replaced by umka.driver.<name>.<key>=<value> (Section 20.9) | KABI-native configuration |
| ioperm / iopl | Direct I/O port access from user space | Driver isolation |
| kexec (initially) | Complex interaction with driver model | Clean shutdown/recovery |
Obsolete syscalls not implemented (partial list): old_stat, old_lstat,
old_fstat, socketcall, ipc (multiplexer), old_select, old_readdir,
old_mmap, uselib, modify_ldt (except minimal for TLS), vm86, vm86old,
set_thread_area (x86 only; use arch_prctl instead).
Only syscalls that current glibc (2.17+) and musl (1.2+) actually emit are implemented.
19.8 UmkaOS Native Syscall Interface¶
19.8.1 Motivation¶
UmkaOS implements ~80% of Linux syscalls natively with identical POSIX semantics — read,
write, open, mmap, fork, socket, etc. are the kernel's own API. For these,
the syscall entry point performs only representation conversion (untyped C ABI → typed
Rust internals: int fd → CapHandle<FileDescriptor>, void *buf → UserPtr<T>),
not semantic translation.
However, ~20% of operations fall into two categories where Linux's interface is fundamentally inadequate:
- Thin adaptation (~15%): Linux has an interface but it's untyped, fragmented, or encodes the wrong abstraction. Examples: ioctl(fd, MAGIC, void*) for driver interaction, clone3() flag explosion for process creation, prctl() as a catch-all for unrelated operations, five separate observability interfaces (perf, ftrace, sysfs, tracepoints, BPF).
- No Linux equivalent (~5%): UmkaOS has capabilities that Linux does not expose at all. Examples: capability delegation with attenuation, isolation domain management, distributed shared memory, per-cgroup power budgets.
For both categories, UmkaOS defines native syscalls that expose the full richness of the kernel's typed, capability-based model. These syscalls are available alongside the Linux-compatible interface — unmodified Linux applications continue to use Linux syscalls and work correctly; UmkaOS-aware applications can opt into the native interface for stronger typing, finer-grained control, and access to UmkaOS-specific features.
19.8.2 Design Principles¶
- Native syscalls supplement, never replace, Linux-compatible ones. Every operation achievable via a native syscall must also be achievable via the Linux-compatible interface (even if with less type safety or fewer features). Linux applications never need UmkaOS-native syscalls.
- Typed arguments. Native syscalls use fixed-layout Rust-compatible structs, not unsigned long catch-alls or void * blobs. Every argument is validated at the syscall entry point against the struct layout.
- Capability-first. Native syscalls accept CapHandle arguments directly. Permission checks are explicit in the syscall signature, not hidden inside the implementation.
- Versioned. Each native syscall struct includes a size: u32 field (like Linux's clone3 and openat2). The kernel handles smaller structs from older userspace by zero-filling new fields. This provides forward-compatible extensibility without syscall number proliferation.
- Negative-number namespace. UmkaOS native syscalls use negative syscall numbers, dispatched through the same bidirectional table as Linux-compatible syscalls (Section 19.1). Linux syscalls occupy positive numbers; UmkaOS native ops occupy negative numbers. This is collision-proof by construction — no overlap possible regardless of how many syscalls either side adds. Each native op has its own syscall number and dispatches directly (no multiplexer indirection), making native calls as fast as Linux calls (~0 extra cycles on out-of-order cores).
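The versioned-struct rule (size field, zero-fill missing tail, reject unknown non-zero tail) can be sketched on raw bytes. This mirrors the copy-and-check behavior Linux applies for clone3/openat2 (its E2BIG rule); the function name is illustrative:

```rust
/// Sketch of the versioned-struct copy rule (the clone3/openat2 pattern).
/// Accept an older, smaller struct by zero-filling the missing tail; reject
/// a newer, larger struct only if its unknown tail contains non-zero bytes.
fn copy_versioned(user: &[u8], kernel_size: usize) -> Result<Vec<u8>, &'static str> {
    let mut out = vec![0u8; kernel_size]; // fields the caller omitted default to zero
    if user.len() <= kernel_size {
        // Older userspace: copy what it gave us; the rest stays zero.
        out[..user.len()].copy_from_slice(user);
        Ok(out)
    } else if user[kernel_size..].iter().all(|&b| b == 0) {
        // Newer userspace, but it isn't using fields we don't know: accept.
        out.copy_from_slice(&user[..kernel_size]);
        Ok(out)
    } else {
        // Newer userspace relying on fields this kernel doesn't understand.
        Err("E2BIG: unknown non-zero fields")
    }
}
```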
19.8.3 Syscall Families¶
/// UmkaOS native syscall numbers (negative).
///
/// Negative syscall numbers are dispatched through the same bidirectional table
/// as Linux-compatible (positive) syscalls. Each family reserves a 256-entry range
/// for forward-compatible extension without renumbering. The hex suffix encodes
/// the family (0x01 = capability, 0x02 = driver, etc.) and the offset within it.
///
/// Userspace passes these as the syscall number directly — no multiplexer.
/// Arguments are operation-specific (via registers), matching Linux convention.
/// Complex operations use a versioned struct pointer + size as first two args
/// (like Linux's clone3/openat2 pattern).
pub mod umka_op {
// ── Capability operations (-0x0100 .. -0x01FF) ───────────────────
/// Create a new capability with specified rights from an existing one.
/// Equivalent to: dup() + fcntl() but typed and with attenuation.
pub const CAP_DERIVE: i32 = -0x0100;
/// Restrict an existing capability's permissions (irreversible).
/// No Linux equivalent — fcntl cannot reduce permissions on an fd.
pub const CAP_RESTRICT: i32 = -0x0101;
/// Query the permission set of a capability handle.
pub const CAP_QUERY: i32 = -0x0102;
/// Revoke a specific capability by handle.
pub const CAP_REVOKE: i32 = -0x0103;
/// Delegate a capability to another process via IPC, with optional
/// attenuation (reduced rights). The recipient receives a new handle
/// with at most the permissions specified by the sender.
pub const CAP_DELEGATE: i32 = -0x0104;
// ── Typed driver interaction (-0x0200 .. -0x02FF) ────────────────
/// Invoke a typed KABI operation on a driver.
/// Replaces: ioctl(fd, request, arg) with typed, versioned structs.
/// The driver's KABI version is checked at invocation time.
pub const DRV_INVOKE: i32 = -0x0200;
/// Query a driver's supported KABI interfaces and versions.
pub const DRV_QUERY: i32 = -0x0201;
/// Subscribe to driver health/status events (structured, typed).
/// Replaces: various sysfs polling and netlink listening patterns.
pub const DRV_SUBSCRIBE: i32 = -0x0202;
// ── Isolation domain management (-0x0300 .. -0x03FF) ─────────────
/// Query the isolation tier and domain of a capability handle.
pub const DOM_QUERY: i32 = -0x0300;
/// Request domain statistics (cycle counts, fault counts, memory).
pub const DOM_STATS: i32 = -0x0301;
// ── Distributed operations (-0x0400 .. -0x04FF) ──────────────────
// Maps to DSM syscall functions in Section 6.14 (06-dsm.md):
// DSM_ALLOC → dsm_create()
// DSM_MAP → dsm_attach() + dsm_mmap()
// DSM_SET_POLICY → dsm_set_coherence()
/// Allocate a distributed shared memory region.
/// No Linux equivalent.
pub const DSM_ALLOC: i32 = -0x0400;
/// Map a remote DSM region into the local address space.
pub const DSM_MAP: i32 = -0x0401;
/// Set coherence policy for a DSM region (strict, relaxed, release).
pub const DSM_SET_POLICY: i32 = -0x0402;
/// Query cluster membership and node health.
pub const CLUSTER_INFO: i32 = -0x0410;
/// Distributed flock with DLM backend. Acquires a cluster-wide file lock
/// via the DLM ([Section 15.15](15-storage.md#distributed-lock-manager)), replacing the node-local
/// `flock()` semantics with cross-node locking. Supports the same lock
/// modes (LOCK_SH, LOCK_EX, LOCK_UN) plus DLM-specific flags (e.g.,
/// LOCK_NOQUEUE for non-blocking try-lock). Returns -ENOLCK if no DLM
/// lockspace is associated with the file's filesystem.
/// Args: (fd: i32, operation: i32, flags: u32)
pub const FLOCK2: i32 = -0x0420;
// ── Accelerator operations (-0x0500 .. -0x05FF) ──────────────────
/// Create an accelerator context (GPU, NPU, FPGA) with typed caps.
/// Replaces: DRM_IOCTL_* and VFIO ioctls with unified typed API.
pub const ACCEL_CTX_CREATE: i32 = -0x0500;
/// Submit work to an accelerator context.
pub const ACCEL_SUBMIT: i32 = -0x0501;
/// Query accelerator utilization and health.
pub const ACCEL_QUERY: i32 = -0x0502;
/// Wait for accelerator fence completion.
pub const ACCEL_FENCE_WAIT: i32 = -0x0503;
// ── Power management (-0x0600 .. -0x06FF) ────────────────────────
/// Set per-cgroup power budget (watts).
/// No Linux equivalent — Linux uses sysfs strings.
pub const POWER_SET_BUDGET: i32 = -0x0600;
/// Query current power consumption for a cgroup or domain.
pub const POWER_QUERY: i32 = -0x0601;
// ── Observability (-0x0700 .. -0x07FF) ───────────────────────────
/// Subscribe to structured kernel events (health, tracepoints, audit).
/// Replaces: fragmented perf_event_open / ftrace / sysfs / netlink.
pub const OBSERVE_SUBSCRIBE: i32 = -0x0700;
/// Query kernel object by path in the unified object namespace (umkafs).
pub const OBSERVE_QUERY: i32 = -0x0701;
// ── Windows Emulation Acceleration (-0x0800 .. -0x08FF) ──────────
// WEA operations for WINE/Proton acceleration (Section 19.6).
// WEA maps NT semantics onto the native SYNC_* primitives below.
/// Query WEA version and supported features.
pub const WEA_VERSION_QUERY: i32 = -0x0800;
/// Create an NT event object (manual-reset or auto-reset).
pub const WEA_EVENT_CREATE: i32 = -0x0801;
/// Open an existing named NT event object.
pub const WEA_EVENT_OPEN: i32 = -0x0802;
/// Set (signal) an NT event.
pub const WEA_EVENT_SET: i32 = -0x0803;
/// Reset (unsignal) an NT event.
pub const WEA_EVENT_RESET: i32 = -0x0804;
/// Pulse an NT event (signal and immediately reset).
pub const WEA_EVENT_PULSE: i32 = -0x0805;
/// Create an NT mutex object.
pub const WEA_MUTEX_CREATE: i32 = -0x0810;
/// Create an NT semaphore object.
pub const WEA_SEMAPHORE_CREATE: i32 = -0x0811;
/// Wait for a single NT object to become signaled.
pub const WEA_WAIT_SINGLE: i32 = -0x0820;
/// Wait for multiple NT objects (WaitAny or WaitAll semantics).
pub const WEA_WAIT_MULTIPLE: i32 = -0x0821;
/// Create an NT section (memory-mapped file or shared memory).
pub const WEA_SECTION_CREATE: i32 = -0x0830;
/// Map a view of an NT section into the process address space.
pub const WEA_SECTION_MAP: i32 = -0x0831;
/// Unmap a view of an NT section.
pub const WEA_SECTION_UNMAP: i32 = -0x0832;
/// Create an I/O completion port.
pub const WEA_IOCP_CREATE: i32 = -0x0840;
/// Associate a file with an IOCP.
pub const WEA_IOCP_ASSOCIATE: i32 = -0x0841;
/// Post a completion packet to an IOCP.
pub const WEA_IOCP_POST: i32 = -0x0842;
/// Dequeue a completion packet from an IOCP.
pub const WEA_IOCP_GET: i32 = -0x0843;
/// Memory operations with Windows semantics (reserve/commit/reset).
pub const WEA_VIRTUAL_ALLOC: i32 = -0x0850;
/// Change memory protection with old-protection output.
pub const WEA_VIRTUAL_PROTECT: i32 = -0x0851;
/// Lock pages in physical memory.
pub const WEA_VIRTUAL_LOCK: i32 = -0x0852;
/// Queue an APC to a thread.
pub const WEA_APC_QUEUE: i32 = -0x0860;
/// Alert a thread (deliver queued APCs).
pub const WEA_ALERT_THREAD: i32 = -0x0861;
/// Open a process token for security queries.
pub const WEA_TOKEN_OPEN: i32 = -0x0870;
/// Query token information (user, groups, privileges).
pub const WEA_TOKEN_QUERY: i32 = -0x0871;
/// Close an NT handle.
pub const WEA_HANDLE_CLOSE: i32 = -0x08F0;
/// Duplicate an NT handle.
pub const WEA_HANDLE_DUP: i32 = -0x08F1;
// ── Unified Wait and Synchronization (-0x0900 .. -0x09FF) ────────
// Native UmkaOS synchronization primitives. These fill genuine gaps
// in POSIX: heterogeneous unified wait (no epoll+*fd dance), named
// events (cross-process signaling without fd passing), and kernel-
// managed synchronization objects with umkafs visibility.
// WEA (-0x0800) maps NT semantics onto these same kernel primitives.
/// Wait for any handle in an array to become signaled.
/// Handles can mix fds, events, pids, and timers in one call.
/// Returns the index of the first signaled handle, or -ETIMEDOUT.
/// This is the POSIX-flavored unified wait — no NT object model,
/// no handle tables, no security descriptors.
pub const SYNC_WAIT_ANY: i32 = -0x0900;
/// Wait for ALL handles in an array to become signaled.
pub const SYNC_WAIT_ALL: i32 = -0x0901;
/// Create a named or anonymous event (manual-reset or auto-reset).
/// Named events appear in umkafs at /ukfs/kernel/sync/<name>.
pub const SYNC_EVENT_CREATE: i32 = -0x0910;
/// Open an existing named event by path.
pub const SYNC_EVENT_OPEN: i32 = -0x0911;
/// Signal an event. Manual-reset: all waiters wake, stays signaled.
/// Auto-reset: one waiter wakes, event auto-clears.
pub const SYNC_EVENT_SIGNAL: i32 = -0x0912;
/// Reset a manual-reset event to non-signaled state.
pub const SYNC_EVENT_RESET: i32 = -0x0913;
/// Destroy an event (anonymous) or close a handle (named).
pub const SYNC_EVENT_CLOSE: i32 = -0x0914;
/// Create a named or anonymous semaphore with initial and max count.
pub const SYNC_SEM_CREATE: i32 = -0x0920;
/// Open an existing named semaphore by path.
pub const SYNC_SEM_OPEN: i32 = -0x0921;
/// Release (increment) a semaphore count.
pub const SYNC_SEM_RELEASE: i32 = -0x0922;
/// Extended futex: wait with relative timeout.
pub const FUTEX_WAIT: i32 = -0x0930;
/// Extended futex: wake waiters.
pub const FUTEX_WAKE: i32 = -0x0931;
/// Extended futex: requeue waiters.
pub const FUTEX_REQUEUE: i32 = -0x0932;
/// Extended futex: wait with absolute timeout and explicit clockid.
pub const FUTEX_WAIT_ABS: i32 = -0x0933;
/// Extended futex: priority-inheritance wait.
pub const FUTEX_WAIT_PI: i32 = -0x0934;
/// Extended futex: priority-inheritance wake.
pub const FUTEX_WAKE_PI: i32 = -0x0935;
// ── Batch VFS operations (-0x0C00 .. -0x0CFF) ──────────────────
// Batched variants of common VFS syscalls. Each call performs N
// operations in a single kernel entry, amortizing SYSCALL/SYSRET +
// KPTI overhead (~200 ns) across all entries. The per-operation
// cost (path resolution, permission check) is unchanged — only
// the syscall transition overhead is amortized.
/// Batch faccessat2: check access permissions for multiple paths.
/// Args: (entries: *mut BatchAccessEntry, count: u32, flags: u32)
/// Each entry contains (dirfd, path, mode, flags) and receives
/// a result (0 or -errno). Equivalent to N faccessat2() calls.
pub const VFS_ACCESS_BATCH: i32 = -0x0C00;
/// Batch fstatat: stat multiple paths in one kernel entry.
/// Args: (entries: *mut BatchStatEntry, count: u32, flags: u32)
pub const VFS_STAT_BATCH: i32 = -0x0C01;
/// Batch openat: open multiple files in one kernel entry.
/// Returns N file descriptors (or per-entry errors).
/// Args: (entries: *mut BatchOpenEntry, count: u32, flags: u32)
pub const VFS_OPEN_BATCH: i32 = -0x0C02;
/// Batch unlinkat: remove multiple files in one kernel entry.
/// Args: (entries: *mut BatchUnlinkEntry, count: u32, flags: u32)
pub const VFS_UNLINK_BATCH: i32 = -0x0C03;
/// Batch readlinkat: read multiple symlinks in one kernel entry.
/// Args: (entries: *mut BatchReadlinkEntry, count: u32, flags: u32)
pub const VFS_READLINK_BATCH: i32 = -0x0C04;
// ── VVAR page management (-0x0D00 .. -0x0DFF) ───────────────────
/// Map the cgroup gauge page for a given cgroup fd.
/// Returns the mapping address (or -errno).
/// Args: (cgroup_fd: i32) -> *const CgroupGaugePage
pub const MAP_CGROUP_GAUGE: i32 = -0x0D00;
/// Map the scheduler hint page for the calling task.
/// Returns the mapping address (or -errno).
/// Args: () -> *const SchedHintPage
pub const MAP_SCHED_HINT: i32 = -0x0D01;
// ── Typed FD operations (-0x0E00 .. -0x0EFF) ─────────────────────
/// Read a typed event value from an eventfd/signalfd/timerfd/pidfd.
/// Returns an `EventValue` tagged union. See [Section 19.10](#special-file-descriptor-objects).
pub const EVENT_READ: i32 = -0x0E00;
/// Write to an eventfd with type checking.
pub const EVENT_WRITE: i32 = -0x0E01;
// ── Process management (-0x0A00 .. -0x0AFF) ─────────────────────
/// Register a kernel-managed cleanup action that runs on process exit
/// (including SIGKILL, OOM kill, unhandled fault). See Section 8.4.
pub const EXIT_CLEANUP_REGISTER: i32 = -0x0A00;
// ── Debug (-0x0B00 .. -0x0BFF) ──────────────────────────────────
/// Issue a DebugCap for a target process (requires CAP_DEBUG/CAP_SYS_PTRACE).
pub const PTRACE_CAP_ISSUE: i32 = -0x0B00;
/// Attach a debug session using a DebugCapFd.
pub const PTRACE_ATTACH_CAP: i32 = -0x0B01;
/// Grant debug access to another process.
pub const GRANT_DEBUG_CAP: i32 = -0x0B02;
/// Issue a non-transferable DebugCap for the calling process.
pub const SELF_DEBUG_CAP: i32 = -0x0B03;
/// Revoke a DebugCap by its fd (issuer only).
pub const DEBUG_CAP_REVOKE: i32 = -0x0B04;
}
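One way to realize the collision-proof bidirectional table is a flat array indexed by a signed-to-unsigned mapping: positive (Linux) numbers index forward, negative (native) numbers index a second region. A sketch under assumed bounds (LINUX_MAX and NATIVE_MAX are illustrative; the real table layout is defined by the Section 19.1 dispatch table):

```rust
/// Sketch of bidirectional dispatch-table indexing. Bounds are assumptions
/// for illustration, not the kernel's actual table dimensions.
const LINUX_MAX: i32 = 512;     // assumed ceiling for positive (Linux) numbers
const NATIVE_MAX: i32 = 0x0F00; // covers native families -0x0100 .. -0x0EFF

/// Maps a syscall number to a flat table index, or None if out of range.
fn table_index(nr: i32) -> Option<usize> {
    if (0..LINUX_MAX).contains(&nr) {
        Some(nr as usize) // Linux region: indices [0, 512)
    } else if (-NATIVE_MAX..0).contains(&nr) {
        Some((LINUX_MAX + (-nr)) as usize - 1) // native region follows contiguously
    } else {
        None // unknown number -> -ENOSYS path
    }
}
```

Because the two regions are disjoint by sign, no renumbering is ever needed when either side grows, which is the collision-proof property claimed above.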
Family map (for quick reference):
| Family | Range | Count | Description |
|---|---|---|---|
| Capability | -0x0100 .. -0x01FF | 5 defined | Capability CRUD + delegation |
| Driver | -0x0200 .. -0x02FF | 3 defined | Typed KABI driver interaction |
| Isolation | -0x0300 .. -0x03FF | 2 defined | Domain query/stats |
| Distributed | -0x0400 .. -0x04FF | 5 defined | DSM + cluster + flock2 |
| Accelerator | -0x0500 .. -0x05FF | 4 defined | GPU/NPU/FPGA typed API |
| Power | -0x0600 .. -0x06FF | 2 defined | Per-cgroup power budgets |
| Observability | -0x0700 .. -0x07FF | 2 defined | Structured events + umkafs query |
| WEA | -0x0800 .. -0x08FF | 26 defined | WINE/Proton NT acceleration |
| Sync | -0x0900 .. -0x09FF | 16 defined | Unified wait/event/sem + futex |
| Process | -0x0A00 .. -0x0AFF | 1 defined | Exit cleanup |
| Debug | -0x0B00 .. -0x0BFF | 5 defined | DebugCap API |
| Batch VFS | -0x0C00 .. -0x0CFF | 5 defined | Batched access/stat/open/unlink/readlink |
| VVAR mgmt | -0x0D00 .. -0x0DFF | 2 defined | Map cgroup gauge / sched hint pages |
| Typed FD | -0x0E00 .. -0x0EFF | 2 defined | Typed event_read / event_write |
19.8.3.1 Batch VFS Operations¶
Problem: Build systems, container startup, and shell scripts call
faccessat2() / fstatat() / openat() in tight loops over dozens or
hundreds of paths. Each syscall pays ~200 ns of entry/exit overhead (SYSCALL
instruction + KPTI page table switch + SYSRET). For 100 files, that's 20 μs
of pure overhead with zero useful work. The batch variants perform N operations
in a single kernel entry, paying the transition cost once.
Batch entry structs (versioned — size field enables forward-compatible
extension, following the clone3/openat2 pattern):
/// Entry for VFS_ACCESS_BATCH. Equivalent to one faccessat2() call.
#[repr(C)]
pub struct BatchAccessEntry {
/// Size of this struct (for versioning). Must be >= 24.
pub size: u32,
/// Directory fd for relative paths (AT_FDCWD for cwd).
pub dirfd: i32,
/// Pointer to NUL-terminated path string (userspace address).
pub path_ptr: u64,
/// Access mode: F_OK (0), R_OK (4), W_OK (2), X_OK (1), or OR'd.
pub mode: u32,
/// Flags: AT_EACCESS (0x200), AT_SYMLINK_NOFOLLOW (0x100), AT_EMPTY_PATH (0x1000).
pub flags: u32,
/// Result filled by kernel: 0 on success, negative errno on failure.
/// The kernel processes ALL entries regardless of individual failures
/// (no early abort — the caller needs all results).
pub result: i32,
pub _pad: u32,
}
const_assert!(core::mem::size_of::<BatchAccessEntry>() == 32);
/// Entry for VFS_STAT_BATCH. Equivalent to one fstatat() call.
#[repr(C)]
pub struct BatchStatEntry {
/// Size of this struct (for versioning).
pub size: u32,
pub dirfd: i32,
pub path_ptr: u64,
pub flags: u32, // AT_SYMLINK_NOFOLLOW, AT_EMPTY_PATH, AT_STATX_*
pub _pad: u32,
/// Pointer to struct statx buffer (userspace address). The kernel
/// writes statx data here on success.
pub statx_buf_ptr: u64,
/// statx mask (STATX_BASIC_STATS, STATX_ALL, etc.).
pub statx_mask: u32,
/// Result: 0 or negative errno.
pub result: i32,
}
const_assert!(core::mem::size_of::<BatchStatEntry>() == 40);
/// Entry for VFS_OPEN_BATCH. Equivalent to one openat2() call.
#[repr(C)]
pub struct BatchOpenEntry {
/// Size of this struct (for versioning).
pub size: u32,
pub dirfd: i32,
pub path_ptr: u64,
/// Open flags (O_RDONLY, O_WRONLY, O_RDWR, O_CREAT, etc.).
pub flags: u32,
/// File mode for O_CREAT (ignored otherwise).
pub mode: u32,
/// openat2 resolve flags (RESOLVE_BENEATH, RESOLVE_NO_SYMLINKS, etc.).
pub resolve: u64,
/// Result: non-negative fd on success, negative errno on failure.
pub result: i32,
pub _pad: u32,
}
const_assert!(core::mem::size_of::<BatchOpenEntry>() == 40);
Semantics:
- The count argument is capped at 256 entries per call (prevents unbounded
  kernel time in a single syscall).
- The kernel processes all entries unconditionally — individual failures do not
  abort the batch. Each entry gets its own result.
- Partial-read semantics: For read/write batch variants (future extension),
  a partial read (fewer bytes than requested) is NOT an error — the entry's
  result contains the number of bytes actually read (>= 0). The caller must
  inspect each entry's result individually. This matches POSIX read()
  semantics where short reads are normal (EOF, signal interruption,
  non-blocking socket).
- The syscall return value is the number of entries successfully processed
  (== count unless a fault occurs while reading the entry array itself).
  A return value < count means the kernel faulted reading entry N from the
  user array (entries 0..N-1 were processed; entry N and beyond were not).
- Path strings are copied from userspace one at a time (same copy_from_user
  as individual syscalls — no new security surface).
- All per-entry capability checks, LSM hooks, and audit records are identical
  to the individual syscall equivalents. The batch is purely a syscall-entry
  optimization.
libumka wrapper:
#include <umka/batch.h>
// Check 50 paths in one kernel entry.
struct umka_access_entry entries[50];
for (int i = 0; i < 50; i++) {
entries[i] = (struct umka_access_entry){
.size = sizeof(entries[0]),
.dirfd = AT_FDCWD,
.path_ptr = (uint64_t)paths[i],
.mode = R_OK,
.flags = 0,
};
}
int n = umka_access_batch(entries, 50, 0);
// Each entries[i].result is now 0 or -EACCES/-ENOENT/etc.
19.8.3.2 Unified Wait and Event Specification¶
Problem: POSIX provides no single call to wait on heterogeneous kernel objects.
Waiting for "socket data OR timer expiry OR child exit OR event signal" requires
converting everything to file descriptors (timerfd, pidfd, eventfd, signalfd)
and funneling through epoll. This is verbose, error-prone, and every new waitable
type needs a new *fd wrapper syscall. NT's WaitForMultipleObjects solves this
but brings the entire NT object model. UmkaOS provides the capability without the
baggage.
UmkaWaitHandle — the unified waitable type:
/// A handle that can be waited on via SYNC_WAIT_ANY / SYNC_WAIT_ALL.
/// Heterogeneous: different handle types can be mixed in one wait call.
#[repr(C, u32)]
pub enum UmkaWaitHandle {
/// Any pollable file descriptor (socket, pipe, eventfd, timerfd, pidfd,
/// epoll fd, io_uring fd). The wait checks for POLLIN readiness.
/// This means all existing Linux *fd patterns continue to work — you
/// can mix them with native UmkaOS handles in one wait.
Fd { fd: i32, events: u32 } = 0,
/// UmkaOS native event (from SYNC_EVENT_CREATE / SYNC_EVENT_OPEN).
Event { handle: EventHandle } = 1,
/// Process exit (signaled when pid exits). No fd allocation needed —
/// the kernel checks the task struct directly.
Pid { pid: u32 } = 2,
/// Inline timeout (signaled after `timeout_ns` nanoseconds from wait start).
/// Avoids allocating a timerfd for simple "data or timeout" patterns.
Timer { timeout_ns: u64 } = 3,
/// UmkaOS native semaphore (from SYNC_SEM_CREATE / SYNC_SEM_OPEN).
/// Signaled when count > 0. SYNC_WAIT_ANY on a semaphore decrements count
/// (like sem_wait). SYNC_WAIT_ALL checks without decrementing.
Semaphore { handle: SemHandle } = 4,
}
// Layout: tag(4) + pad(4) + max_variant(Timer: u64 = 8) = 16 bytes.
const_assert!(size_of::<UmkaWaitHandle>() == 16);
/// Opaque handle to a kernel event object. Not a file descriptor.
/// Lightweight: 4 bytes, no fd table entry, no VFS overhead.
///
/// **Layout**: bits [31:24] = generation (8 bits), bits [23:0] = slot index
/// (24 bits). 16M concurrent sync objects (2^24). The 8-bit generation in
/// the handle is a fast-reject filter only — the authoritative check is
/// against the slot's internal `generation: u64`, which never wraps.
/// A stale handle (wrong generation) returns `EINVAL` on any operation.
///
/// **Longevity**: Slot indices are recycled via a free list. The per-slot
/// internal generation (u64) is incremented on each reuse, ensuring ABA
/// detection even after billions of reuses over 50-year uptime.
pub type EventHandle = u32;
/// Opaque handle to a kernel semaphore object. Same layout as `EventHandle`:
/// bits [31:24] = generation, bits [23:0] = slot index.
pub type SemHandle = u32;
SYNC_WAIT_ANY semantics:
/// Wait for ANY of `handles[0..count]` to become signaled.
///
/// Returns:
/// Ok(index) — the index (0-based) of the first signaled handle.
/// If multiple handles are signaled simultaneously, the lowest index wins
/// (deterministic, same as NT WaitForMultipleObjects with bWaitAll=FALSE).
/// Err(-ETIMEDOUT) — `timeout_ns` elapsed with no handle signaled.
/// Err(-EINVAL) — count == 0, count > SYNC_WAIT_MAX_HANDLES, or invalid handle.
/// Err(-EINTR) — interrupted by signal (restartable).
///
/// `timeout_ns`: 0 = non-blocking poll, u64::MAX = wait forever.
///
/// Side effects:
/// - Fd handles: no side effect (same as epoll — readiness is reported, not consumed).
/// - Event (auto-reset): the event is reset to non-signaled after one waiter wakes.
/// - Semaphore: count is decremented by 1.
/// - Pid: no side effect (process is not reaped — use waitpid for that).
/// - Timer: no side effect (the timeout is consumed by the wait itself).
pub fn umka_sync_wait_any(
handles: UserPtr<UmkaWaitHandle>,
count: u32,
timeout_ns: u64,
) -> Result<u32, Errno>;
/// Maximum handles per wait call. 64 is sufficient for all realistic use cases
/// (NT limits WaitForMultipleObjects to 64; Go's select is typically <20).
pub const SYNC_WAIT_MAX_HANDLES: u32 = 64;
SYNC_WAIT_ALL semantics:
Same as SYNC_WAIT_ANY but returns only when ALL handles are signaled simultaneously.
No partial consumption: either all side effects fire (all auto-reset events reset, all
semaphores decrement) or none do (timeout/interrupt returns with no state change). This
is atomic — avoids the classic "wait for A and B, got A, B was revoked before we could
check" race.
SYNC_EVENT_CREATE parameters:
/// Parameters for SYNC_EVENT_CREATE.
#[repr(C)]
pub struct SyncEventCreateParams {
/// Event name (NUL-terminated, max 255 bytes). Empty string = anonymous event.
/// Named events are registered in umkafs at /ukfs/kernel/sync/<name>
/// and openable by any process in the same user namespace (or with CAP_SYNC
/// for cross-namespace access).
pub name: [u8; 256],
/// Manual-reset (1) or auto-reset (0).
/// Manual-reset: event stays signaled until explicit SYNC_EVENT_RESET.
/// All threads blocked in SYNC_WAIT_* wake up.
/// Auto-reset: event auto-clears after waking exactly one thread.
pub manual_reset: u32,
/// Initial state: 1 = signaled, 0 = non-signaled.
pub initial_state: u32,
/// [OUT] Assigned event handle.
pub out_handle: EventHandle,
}
// Layout: 256 + 4 + 4 + 4 = 268 bytes.
const_assert!(size_of::<SyncEventCreateParams>() == 268);
Kernel implementation:
/// Kernel-internal event object.
pub struct KernelEvent {
/// Current state: true = signaled.
pub signaled: AtomicBool,
/// Manual-reset or auto-reset.
pub manual_reset: bool,
/// Wait queue for threads blocked on this event.
pub waiters: WaitQueueHead,
/// Name (empty = anonymous). Used for umkafs registration.
pub name: ArrayString<256>,
/// Owning user namespace (for access control on named events).
pub user_ns: NamespaceId,
/// Reference count (handle count + internal references).
/// Internal identifier: u64 per 50-year policy. Functionally bounded
/// by concurrent handle count.
pub refcount: AtomicU64,
}
/// Kernel-internal semaphore object.
pub struct KernelSemaphore {
/// Current count.
/// Internal identifier: u64 per 50-year policy. Functionally bounded
/// by max_count.
pub count: AtomicU64,
/// Maximum count (set at creation, immutable).
pub max_count: u64,
/// Wait queue for threads blocked when count == 0.
pub waiters: WaitQueueHead,
pub name: ArrayString<256>,
pub user_ns: NamespaceId,
/// Internal identifier: u64 per 50-year policy. Functionally bounded
/// by concurrent handle count.
pub refcount: AtomicU64,
}
Relationship to WEA (§19.4):
WEA's WEA_WAIT_MULTIPLE (0x0821) is implemented ON TOP of SYNC_WAIT_ANY /
SYNC_WAIT_ALL. The WEA layer translates NT handle types (HANDLE → UmkaWaitHandle)
and NT semantics (alertable waits, APCs) into native SYNC operations. The kernel has
ONE wait implementation — WEA is a translation layer, not a separate subsystem.
NT WaitForMultipleObjects(handles, bWaitAll, timeout)
→ WEA layer: translate HANDLE[] → UmkaWaitHandle[]
→ if bWaitAll: umka_sync_wait_all(handles, count, timeout)
else: umka_sync_wait_any(handles, count, timeout)
→ WEA layer: translate result back to NT WAIT_OBJECT_0+index
Similarly, WEA's WEA_EVENT_CREATE (0x0801) calls SYNC_EVENT_CREATE (0x0910)
internally, then wraps the EventHandle in an NT HANDLE with NT-specific metadata
(security descriptor, object attributes). Native UmkaOS applications skip the NT
wrapper and use SYNC_EVENT_* directly.
Relationship to epoll/poll/select:
SYNC_WAIT_ANY with only Fd handles is functionally equivalent to poll() — same
semantics, same result. Applications can mix: use epoll for the hot fd-polling path
(epoll's edge-triggered mode is still optimal for high-fd-count servers) and use
SYNC_WAIT_ANY when they need to wait on fds + events + pids in one call.
SYNC_WAIT_ANY is NOT a replacement for epoll. It's a complement — for the cases
where epoll's "everything must be an fd" requirement forces unnecessary complexity.
Typical use case — server with mixed wait sources:
// POSIX approach (verbose):
int epfd = epoll_create1(0);
int tfd = timerfd_create(CLOCK_MONOTONIC, 0);
int pfd = pidfd_open(child_pid, 0);
int efd = eventfd(0, 0);
epoll_ctl(epfd, EPOLL_CTL_ADD, sockfd, ...);
epoll_ctl(epfd, EPOLL_CTL_ADD, tfd, ...);
epoll_ctl(epfd, EPOLL_CTL_ADD, pfd, ...);
epoll_ctl(epfd, EPOLL_CTL_ADD, efd, ...);
int n = epoll_wait(epfd, events, 4, -1);
// ... decode which fd, close tfd/pfd/efd ...
// UmkaOS native approach (direct):
UmkaWaitHandle handles[4] = {
{ .Fd = { sockfd, POLLIN } },
{ .Timer = { 5000000000ULL } }, // 5 second timeout
{ .Pid = { child_pid } },
{ .Event = { my_shutdown_event } },
};
uint32_t idx = umka_sync_wait_any(handles, 4, UINT64_MAX);
switch (idx) {
case 0: /* socket ready */ break;
case 1: /* timer expired */ break;
case 2: /* child exited */ break;
case 3: /* shutdown signaled */ break;
}
No intermediate fd allocation. No epoll setup. No cleanup. One call.
19.8.4 Userspace Library¶
Native syscalls are accessed through libumka, a thin userspace library that provides:
- C API with proper types (umka_cap_derive(), umka_drv_invoke(), etc.)
- Rust bindings via the umka-sys crate (zero-cost wrappers over the raw syscall)
- Version negotiation: libumka checks the kernel version at init and uses the
  appropriate struct sizes for forward/backward compatibility
Applications link against libumka. The library detects at runtime whether it is running
on an UmkaOS kernel (via /proc/version or uname) and returns -ENOSYS on non-UmkaOS
kernels, allowing portable applications to fall back to Linux-compatible interfaces.
19.8.5 Relationship to Linux Syscalls¶
┌──────────────────────────────────────┐
│ Userspace Application │
└───────────┬──────────┬───────────────┘
│ │
Linux API │ │ UmkaOS Native API
(glibc) │ │ (libumka)
nr = +N │ │ nr = -N
│ │
┌───────────▼──────────▼───────────────┐
│ Syscall Entry (Layer 1) │
│ Sign-extend nr, call Layer 2 │
└───────────────────┬──────────────────┘
│
┌───────────────────▼──────────────────┐
│ Bidirectional Dispatch (Layer 2) │
│ │
│ ┌──────────────┬───────────────┐ │
│ │ UmkaOS native│ Linux compat │ │
│ │ [ORIGIN-N] │ [ORIGIN+N] │ │
│ └──────┬───────┴────┬──────────┘ │
│ │ │ │
│ ▼ ▼ │
│ ┌──────────────────────────────┐ │
│ │ Internal Typed Kernel API │ │
│ │ (CapHandle, UserPtr, etc.) │ │
│ └──────────────────────────────┘ │
└──────────────────────────────────────┘
Both paths converge to the same internal kernel API through the same
bidirectional dispatch table (Section 19.1).
A read() via Linux's syscall(0, fd, buf, count) and a native capability query via
syscall(-0x0102, cap_handle) both dispatch through ORIGIN[nr] — positive nr indexes
forward (Linux), negative nr indexes backward (UmkaOS native). The native path skips the
fd→CapHandle lookup (the caller already holds a CapHandle) and avoids the void* →
UserPtr validation (the struct is pre-typed). For most operations the performance
difference is negligible; for high-frequency driver interaction (DRV_INVOKE replacing
ioctl) and WEA synchronization primitives, the direct dispatch avoids both the ioctl
switch and the former multiplexer overhead.
19.9 Safe Kernel Extensibility¶
19.9.1 The Paradigm¶
The most important OS innovation of the last decade is eBPF: user-injected verified code in kernel hot paths. But eBPF is limited by being bolted onto a C kernel with a conservative bytecode verifier.
UmkaOS can generalize this: every kernel policy is a safe, hot-swappable module.
Distinction from eBPF (Section 19.2): eBPF provides Linux-compatible user-to-kernel hooks for tracing, networking, and security — it serves the Linux ecosystem. Policy modules provide kernel-internal mechanism/policy separation via KABI vtables — they serve kernel evolution. Both coexist; they address different extensibility needs.
Current KABI model (Section 12.1):
- Drivers implement KABI vtables for device interaction.
- Drivers are hot-swappable (crash recovery, Section 11.7).
- Drivers run in isolation domains.
Generalized KABI model (this proposal):
- POLICIES also implement KABI vtables.
- Policies are hot-swappable (same mechanism as drivers).
- Policies run in isolation domains (same as Tier 1 drivers).
The kernel provides MECHANISMS (scheduling, page tables, memory allocation).
POLICY MODULES provide DECISIONS (which process runs next, which page to
evict, how to route I/O).
19.9.2 Extensible Policy Points¶
// umka-core/src/policy/mod.rs
/// Policy points where the kernel delegates decisions to a module.
/// Each policy point has a default built-in implementation.
/// Custom modules can replace the default at runtime.
// --- Policy context and parameter types ---
/// Maximum number of tasks in a runqueue snapshot passed to policy modules.
/// 64 is chosen because: (1) it bounds the ArrayVec size to 64 * sizeof(TaskSnapshot)
/// ≈ 2.5 KB, fitting within a single 4 KB page for stack-safe snapshot capture;
/// (2) runqueues with >64 runnable tasks are heavily loaded — policy decisions on
/// such queues are dominated by aggregate metrics (total load, nr_running), not
/// per-task details; (3) the snapshot is captured under the runqueue lock, so
/// bounding the copy keeps the critical section short (~1 μs for 64 entries).
/// On runqueues with more than 64 runnable tasks, the snapshot contains the
/// first 64 tasks in scheduling order (by virtual deadline).
pub const MAX_RUNQUEUE_SNAPSHOT: usize = 64;
/// Read-only snapshot of scheduling state, captured under the runqueue lock
/// and passed to policy modules across the trust boundary. Policy modules
/// never see raw runqueue pointers.
pub struct SchedPolicyContext {
/// Number of runnable tasks on this CPU's runqueue.
pub nr_running: u32,
/// Context switches completed on this CPU (monotonic, u64 for 50-year uptime).
pub nr_switches: u64,
/// CPU utilization in permille (0–1000). Derived from PELT util_avg.
/// 1000 = fully utilized. Updated every sched_latency_ns tick.
pub cpu_util_permille: u16,
/// Average vruntime of all runnable tasks (ns). Useful for policy modules
/// that need to detect scheduling fairness drift or starvation.
pub avg_vruntime_ns: u64,
/// Total weighted load on this CPU (PELT load_avg sum across all SE).
pub load_avg: u64,
/// Number of RT-class (SCHED_FIFO/SCHED_RR) tasks on this runqueue.
/// Policy modules use this to avoid placing CFS tasks on RT-heavy CPUs.
pub nr_rt_running: u32,
/// Time spent idle since the last snapshot (ns). Computed as
/// `now - last_snapshot_ns - busy_ns`. Useful for EAS power estimation.
pub idle_ns: u64,
/// LLC miss rate indicator in permille (0–1000). Sampled from PMU counters
/// ([Section 20.8](20-observability.md#performance-monitoring-unit)) at each snapshot. 0 = no misses,
/// 1000 = every access is a miss. Policy modules use this to detect
/// cache-thrashing CPUs and avoid migration into them.
pub cache_pressure: u16,
/// Per-task metadata for each runnable task (bounded by nr_running).
/// Contains task ID, nice value, weight, vruntime, lag, and cgroup ID.
pub tasks: ArrayVec<TaskSnapshot, MAX_RUNQUEUE_SNAPSHOT>,
/// Current CPU frequency (kHz), for EAS-aware scheduling.
pub cpu_freq_khz: u32,
/// NUMA node ID of this CPU.
pub numa_node: u8,
}
/// Per-task scheduling snapshot for policy modules.
/// Captured under the runqueue lock and passed to SchedPolicyContext.
/// Contains the minimum information needed for scheduling policy decisions.
/// `#[repr(C)]` for deterministic layout: this struct is passed to separately
/// compiled Evolvable policy modules across the trust boundary. Without
/// `#[repr(C)]`, field reordering across compiler versions could silently
/// misinterpret fields. Kernel-internal — does NOT cross a KABI or userspace
/// boundary.
#[repr(C)]
pub struct TaskSnapshot {
/// Global task ID.
pub task_id: u64, // offset 0, size 8
/// Nice value (-20..19), already mapped to weight.
pub nice: i8, // offset 8, size 1
/// Explicit padding: `#[repr(C)]` alignment for `weight: u32` requires
/// 3 bytes of padding after `nice: i8`.
pub _pad0: [u8; 3], // offset 9, size 3
/// Scheduling weight (derived from nice via sched_prio_to_weight[]).
pub weight: u32, // offset 12, size 4
/// Virtual runtime (EEVDF). Lower = earlier deadline.
pub vruntime: u64, // offset 16, size 8
/// EEVDF lag (eligible virtual time - actual virtual time).
pub lag: i64, // offset 24, size 8
/// Cgroup ID of the task's cpu controller cgroup.
pub cgroup_id: u64, // offset 32, size 8
/// Whether the task is currently eligible (lag >= 0).
pub eligible: u8, // offset 40, size 1
/// Explicit trailing padding to reach the next multiple of the max
/// field alignment (u64 = 8 bytes). 48 - 41 = 7 bytes.
pub _pad1: [u8; 7], // offset 41, size 7
}
// TaskSnapshot layout: 8+1+3pad+4+8+8+8+1+7pad = 48 bytes.
// 48 * 64 (MAX_RUNQUEUE_SNAPSHOT) = 3072 bytes, fits in one page.
const_assert!(core::mem::size_of::<TaskSnapshot>() == 48);
/// Flags passed to `enqueue_task()` indicating why the task became runnable.
pub struct EnqueueFlags(u32);
impl EnqueueFlags {
/// Task was just created (fork/clone).
pub const ENQUEUE_NEW: Self = Self(1 << 0);
/// Task woke from sleep (futex, poll, etc.).
pub const ENQUEUE_WAKEUP: Self = Self(1 << 1);
/// Task was migrated from another CPU.
pub const ENQUEUE_MIGRATE: Self = Self(1 << 2);
/// Task was restored after preemption.
pub const ENQUEUE_RESTORE: Self = Self(1 << 3);
/// Task is being enqueued for the first time via `wake_up_new_task()`.
/// Combined with ENQUEUE_NEW. Signals to `place_entity()` that this
/// is a brand-new task (not a wake from sleep) and the initial
/// vruntime placement should use the forked-child algorithm
/// (position relative to min_vruntime with vlag=0).
/// Used in [Section 7.1](07-scheduling.md#scheduler--wake_up_new_task-forked-task-activation) and
/// [Section 8.1](08-process.md#process-and-task-management--process-creation) step 21.
pub const ENQUEUE_INITIAL: Self = Self(1 << 4);
}
/// Decision returned by `balance_load()`.
pub enum MigrateDecision {
/// Do nothing — CPUs are balanced.
NoAction,
/// Migrate `count` tasks from `busiest_cpu` to `this_cpu`.
Migrate { count: u32 },
/// Defer decision — not enough data yet (e.g., PELT hasn't converged).
Defer,
}
/// Block I/O request descriptor (read-only view for policy modules).
pub struct IoRequest {
/// Logical block address (start of I/O range).
pub lba: u64,
/// Number of sectors.
pub sector_count: u32,
/// Operation type (read, write, discard, flush).
pub op: BioOp,
/// Originating process ID (for cgroup accounting).
pub pid: ProcessId,
/// Submission timestamp (monotonic ns).
pub submit_ns: u64,
/// I/O priority class and level.
pub ioprio: u16,
}
/// Priority score returned by `IoSchedPolicy::submit()`.
/// Higher scores are dispatched first. Opaque to umka-core — the policy
/// module defines the scoring function.
pub struct IoScore(pub i64);
/// Minimal packet header view for network classification.
/// Contains only the fields needed for QoS decisions, not the full packet.
pub struct PacketHeader {
/// Source/destination IP (v4 or v6) and ports.
pub src_addr: IpAddr,
pub dst_addr: IpAddr,
pub src_port: u16,
pub dst_port: u16,
/// IP protocol number (TCP=6, UDP=17, etc.).
pub protocol: u8,
/// DSCP value from IP header.
pub dscp: u8,
/// Packet length (bytes).
pub len: u32,
}
/// Classification result for a network packet.
pub struct NetClass {
/// Priority queue index (0 = best effort, higher = higher priority).
pub queue: u8,
/// Traffic class mark (for tc/iptables compatibility).
pub mark: u32,
/// Drop eligibility (for ECN/WRED).
pub drop_eligible: bool,
}
/// Flags describing the allocation context (for tiering decisions).
pub struct AllocFlags(u32);
impl AllocFlags {
/// Page is for anonymous memory (heap, stack).
pub const ANONYMOUS: Self = Self(1 << 0);
/// Page is for file-backed mapping (page cache).
pub const FILE_BACKED: Self = Self(1 << 1);
/// Page is for a memory-mapped device region.
pub const DEVICE: Self = Self(1 << 2);
/// Allocation is on the fault path (latency-sensitive).
pub const FAULT: Self = Self(1 << 3);
/// Hint: page is likely short-lived.
pub const TRANSIENT: Self = Self(1 << 4);
}
/// Memory tier identifier. Discovery-based (see Section 4.9 NUMA topology).
pub struct TierId(pub u8);
/// Tiering decision for a page.
pub enum TierDecision {
/// Keep page in current tier.
Keep,
/// Demote to the specified lower tier (e.g., CXL, compressed, swap).
Demote(TierId),
/// Compress in place (same tier, compressed representation).
Compress,
}
/// NUMA migration advice for a page.
pub enum MigrateAdvice {
/// Keep page on current NUMA node.
Stay,
/// Migrate to the specified NUMA node (closer to accessing CPU).
MigrateTo(u8),
}
/// CPU scheduling policy.
///
/// Policy modules receive a `SchedPolicyContext` snapshot (Section 19.9.2), NOT a direct
/// reference to the locked runqueue. The snapshot is captured by umka-core under
/// the runqueue lock before the domain switch, ensuring consistency without
/// exposing internal kernel data structures across the trust boundary.
///
/// **Naming**: `SchedPolicy` is the runtime policy trait. The struct implementing
/// `SchedPolicy` is the "scheduler policy module." For live replacement, the
/// policy module also implements `EvolvableComponent`
/// ([Section 13.18](13-device-classes.md#live-kernel-evolution--core-component-replacement-design)), which provides
/// `export_state()`/`import_state()` for zero-downtime replacement.
pub trait SchedPolicy: Send + Sync {
/// Pick the next task to run on this CPU.
fn pick_next_task(&self, cpu: CpuId, ctx: &SchedPolicyContext) -> Option<TaskId>;
/// A task has become runnable. Decide where to enqueue it.
fn enqueue_task(&self, cpu: CpuId, task: TaskId, flags: EnqueueFlags);
/// A task has yielded or exhausted its timeslice.
fn task_tick(&self, task: TaskId, cpu: CpuId);
/// Load balancing decision: should we migrate tasks between CPUs?
fn balance_load(&self, this_cpu: CpuId, busiest_cpu: CpuId) -> MigrateDecision;
}
/// Maximum pages to scan in a single eviction batch. Sized to fit within
/// a single 4 KB stack frame (each PageHandle is ~16 bytes).
const MAX_SCAN_BATCH: usize = 64;
/// Page replacement policy (which pages to evict under memory pressure).
pub trait PagePolicy: Send + Sync {
/// Select pages to evict from this zone.
/// Returns results via a caller-provided fixed-capacity buffer (ArrayVec)
/// since nr_to_scan is bounded by the zone scan batch size. Policy modules
/// must not heap-allocate on the eviction hot path.
fn select_victims(&self, zone: &Zone, nr_to_scan: u32, out: &mut ArrayVec<PageHandle, MAX_SCAN_BATCH>);
/// Should this page be promoted to a higher tier (active list, huge page)?
fn should_promote(&self, page: &PageHandle) -> bool;
/// Migration decision: should this page move to a different NUMA node?
fn migration_advice(&self, page: &PageHandle, current_node: u8) -> MigrateAdvice;
}
/// I/O scheduling policy (ordering of block I/O requests).
///
/// The authoritative I/O scheduler trait definition is `IoSchedOps` in
/// [Section 15.18](15-storage.md#io-priority-and-scheduling), which uses the full storage-layer types
/// (`DeviceIoQueues`, `CpuId`, `PickResult`). This simplified trait is the
/// policy extensibility interface exposed to replaceable policy modules —
/// it wraps the storage-layer details behind `IoQueue` and `IoRequestId`.
/// Implementers should consult [Section 15.18](15-storage.md#io-priority-and-scheduling) for the
/// complete `IoQueue` definition and scheduling semantics.
pub trait IoSchedPolicy: Send + Sync {
/// Submit a new I/O request. Return its priority score.
fn submit(&self, req: &IoRequest) -> IoScore;
/// Pick the next I/O request to dispatch to the device.
fn dispatch(&self, queue: &IoQueue) -> Option<IoRequestId>;
/// A request has completed. Update internal state.
fn complete(&self, req: &IoRequest, latency_ns: u64);
}
/// **IoRequestId validation** (required for safety when IoSchedPolicy runs in an
/// isolation domain that may be buggy or compromised):
///
/// Before dispatching any I/O request selected by the policy, the kernel MUST verify
/// that the returned `IoRequestId` exists in the device's live request queue.
/// If the ID is not found, the dispatch is silently skipped and a violation is counted.
///
/// Validation algorithm:
pub const IO_POLICY_MAX_VIOLATIONS: u32 = 3;
// In the dispatch path:
// match io_queue.pending.get(&selected_id) {
// Some(request) => dispatch(request),
// None => {
// log::warn!("IoSchedPolicy returned invalid IoRequestId {:?} — skipping", selected_id);
// policy_state.violation_count += 1;
// if policy_state.violation_count >= IO_POLICY_MAX_VIOLATIONS {
// log::error!("IoSchedPolicy evicted after {} violations", IO_POLICY_MAX_VIOLATIONS);
// evict_policy(policy_handle);
// }
// }
// }
//
// Lookup is O(1) via the `pending: XArray<IoRequest>` (keyed by IoRequestId,
// an integer request tag) that already exists for request tracking. This adds
// no overhead on the common (valid) path.
/// Network classification policy (packet prioritization, QoS).
pub trait NetClassPolicy: Send + Sync {
/// Classify an incoming packet (assign priority, mark, queue).
fn classify_rx(&self, packet: &PacketHeader) -> NetClass;
/// Classify an outgoing packet.
fn classify_tx(&self, packet: &PacketHeader) -> NetClass;
}
/// Memory tiering policy (which tier to place pages in).
///
/// Called on warm paths (page allocation, periodic tier scanner, migration
/// decisions). Never called from the page fault hot path — initial placement
/// is cached in the per-cgroup tier hint.
pub trait TierPolicy: Send + Sync {
/// Where should a newly allocated page go?
fn initial_placement(&self, process: ProcessId, flags: AllocFlags) -> TierId;
/// A page has been idle for N ticks. Should it be demoted?
fn demotion_advice(&self, page: &PageHandle, idle_ticks: u32) -> TierDecision;
/// A page has been accessed frequently. Should it be promoted to a faster tier?
/// `heat` is the access frequency estimate from the page scanner (higher = hotter).
fn promotion_advice(&self, page: &PageHandle, heat: u32) -> TierDecision;
/// Access count threshold below which a page is eligible for demotion.
/// Pages with `idle_ticks >= migration_threshold()` are candidates.
fn migration_threshold(&self) -> u32;
/// A remote node has available memory. Should we use it?
fn remote_tier_advice(&self, node_id: NodeId, available_bytes: u64) -> bool;
}
19.9.3 Policy Module Trust Boundary¶
Memory access scope: When a policy module runs in its own isolation domain, the kernel maps into that domain (read-only):
- Run queue metadata (task count, utilization, per-CPU load)
- Per-task scheduling metadata (priority, PELT state, cgroup membership)
- System-wide metrics (total CPU count, NUMA topology, frequency domains)
The module CANNOT access: process memory, page contents, file data, network buffers,
capability tables, or other modules' state. A rogue pick_next_task cannot scan
process memory — hardware domain isolation prevents it.
Locking model: The kernel calls policy module functions with no cross-domain locks held.
Per-CPU scheduler state (the runqueue) is locked by the caller; the policy module receives
a read-only snapshot of the runqueue state via the SchedPolicyContext argument, not direct
access to the locked runqueue. This prevents TOCTOU races: the snapshot is consistent because
it is captured under the runqueue lock before the domain switch. The module manages its own
internal synchronization (spinlocks, per-CPU data, RCU-like patterns). If the module deadlocks
internally, the domain watchdog (timer-based, ~10ms timeout) detects the stuck call and
triggers crash recovery — revert to built-in default policy, reload module.
Policy module crash recovery (three tiers, in priority order):
1. Watchdog window (first 5 seconds after swap): If the new policy triggers
a crash or anomaly within the watchdog window, the retained old policy pointer
is restored via AtomicPtr::store(). Zero state migration needed — the old
policy's internal state was never freed.
2. Post-watchdog crash: The old policy pointer has been freed (watchdog window
expired). Fall back to the built-in default policy compiled into the kernel.
This is always safe but may be suboptimal (e.g., EEVDF default vs. learned policy).
3. Later recovery (optional): The crashed policy module may be reloaded from
disk via the standard KABI module load path
(Section 12.7), re-initialized, and hot-swapped
using the normal live evolution protocol.
Stateful modules: Traits require Send + Sync, but modules need mutable state
(counters, queues, learned parameters). The module owns its state and provides
interior mutability via its own locks. The kernel does not hold locks on the module's
behalf — the module is a self-contained unit.
NMI safety: Policy modules are never called from NMI context. The kernel's NMI handler performs only minimal work (perf sampling, watchdog) and never invokes policy module callbacks. This eliminates the risk of NMI-induced deadlock when a module holds an internal spinlock. If a future requirement arises for NMI-context policy invocation, the module trait must require try-lock semantics with a fallback to the built-in default policy on lock contention.
19.9.4 Side-Channel Mitigations¶
Domain isolation prevents direct memory reads across domain boundaries, but policy modules run in Ring 0 and share hardware resources with the core kernel. This opens side-channel vectors that domain isolation alone does not address.
Threat model: An untrusted or experimental module running in its own isolation domain
could exploit:
1. Shared-cache timing attacks (L1/L2/LLC) — measure cache line eviction timing to
infer kernel memory access patterns.
2. Speculative execution side-channels (Spectre v1 bounds check bypass) — trick the
CPU into speculatively reading kernel data across the isolation domain boundary.
3. Timing observation — use high-resolution timers (rdtsc, cycle counters) to
measure the duration of kernel operations and infer internal state.
Mitigations:
- Cache partitioning: Intel CAT (Cache Allocation Technology) / ARM MPAM (Memory System Resource Partitioning and Monitoring) partitions LLC ways so that an untrusted module's cache allocation does not overlap with the core kernel's. Configured per isolation domain at module load time. On architectures without hardware cache partitioning, cache flushing on domain transitions provides a weaker but functional defense.
- Timer resolution reduction: On AArch64, `CNTKCTL_EL1.EL0PCTEN` traps EL0 cycle counter reads, allowing the kernel to return a coarsened value. On x86, policy modules run in Ring 0, where `rdtsc` executes unconditionally regardless of `CR4.TSD` (the Intel SDM specifies that `CR4.TSD=1` only traps `rdtsc` at CPL>0, not CPL=0). Ring 0 code therefore has full `rdtsc` access. The side-channel mitigation for Ring 0 policy modules on x86 relies on Intel CAT (LLC partitioning, described above) and cache flushing on domain transitions — not on timer coarsening. This is a deliberate acknowledgment that Ring 0 untrusted modules have the same timing access as any Ring 0 code; cache partitioning and flushing are the effective mitigations at this privilege level. Recommendation: policy modules should use the kernel's monotonic clock abstraction (`ktime_get_ns()` equivalent) rather than raw `rdtsc` / cycle counter reads, unless high-precision timing is explicitly required and the module is production-vetted (trusted). The kernel's time API provides sufficient resolution for scheduling and power decisions (~1ns on modern hardware) while maintaining a single auditable timing interface. Untrusted modules that bypass the time API and use raw `rdtsc` directly can serve as timing oracles for side-channel attacks; code review should flag such usage.
- Constant-time helpers: The kernel provides constant-time comparison and lookup functions for any data that crosses the domain boundary into module-readable memory. This prevents modules from using timing differences to distinguish data values.
- Spectre v1 barriers: All kernel→module data handoff uses `lfence` (x86) / `csdb` (ARM) speculation barriers. Module-provided indices into kernel arrays are bounds-checked with an `array_index_nospec` equivalent (index masking) before use.
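Two of these helpers can be sketched in ordinary Rust. These are illustrative, userspace-testable forms; the kernel versions additionally pair the index mask with an `lfence`/`csdb` barrier and must defeat compiler optimization (e.g., via volatile reads or `core::hint::black_box`).

```rust
/// Branch-free index clamp, mirroring the idea behind Linux's
/// array_index_mask_nospec(): the mask is all-ones when index < len and
/// zero otherwise (requires len <= isize::MAX, true for any real array).
/// A mispredicted bounds check then cannot speculatively read out of bounds.
fn index_nospec(index: usize, len: usize) -> usize {
    let mask =
        (!(index | len.wrapping_sub(1).wrapping_sub(index)) as isize >> (usize::BITS - 1)) as usize;
    index & mask
}

/// Constant-time byte comparison: runtime depends only on the length, never
/// on the position of the first mismatch.
fn ct_eq(a: &[u8], b: &[u8]) -> bool {
    if a.len() != b.len() {
        return false;
    }
    // Accumulate differences with OR instead of early-exiting on mismatch.
    let mut diff: u8 = 0;
    for (x, y) in a.iter().zip(b) {
        diff |= x ^ y;
    }
    diff == 0
}
```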
Residual risk: Production-vetted modules (signed, running in the Core isolation domain) face the same side-channel exposure as any Ring 0 code — this is acceptable since they are fully trusted. Side-channel mitigations apply only to untrusted/experimental modules running in isolation domains. This is a deliberate trade-off: production modules get zero overhead, experimental modules get strong isolation at a small performance cost.
19.9.5 Module Lifecycle¶
Policy module lifecycle (same as driver lifecycle, Section 11.7):
1. Module binary is compiled Rust (same toolchain as kernel).
Implements one or more policy traits via KABI vtable.
Signed with driver signature mechanism (Section 9.2.5).
Vtable uses same versioning as driver KABI: `kabi_version` field as
primary discriminant + `vtable_size` for bounds safety
([Section 12.2](12-kabi.md#kabi-abi-rules-and-lifecycle) Rules 2a, 6). A kernel upgrade that adds new methods to SchedPolicy
extends the vtable — old modules still work (new methods fall back to
built-in defaults based on vtable_size).
2. Module is loaded at runtime:
echo "sched_ml_aware" > /sys/kernel/umka/policy/scheduler/active
3. Kernel:
a. Signature verification and measurement enforcement (see below).
b. Allocates isolation domain for the module (if untrusted/experimental).
Production-vetted modules (signed by kernel vendor, pre-verified)
run in the Core isolation domain — zero domain transition overhead.
c. Loads module code into isolated memory region.
d. KABI vtable exchange (module provides policy vtable).
e. Replacement mechanism selected based on state ownership
(see [Section 13.18](13-device-classes.md#live-kernel-evolution--design-explicit-state-ownership-graph)):
**Stateless policy modules** (no owned mutable state — all 9 core policy
traits: SchedPolicy, PagePolicy, IoSchedPolicy, NetClassPolicy,
TierPolicy, VmmPolicy, CapPolicy, PhysAllocPolicy, CongestionOps):
- AtomicPtr swap of the vtable pointer (~1 μs, no stop-the-world).
- No quiescence needed — no policy-owned state to drain.
- No Phase A/A'/B/C lifecycle — no export_state/import_state.
- In-flight callers see either the old or new vtable atomically
(Ordering::Release on writer, Ordering::Acquire on reader).
- Post-swap health watchdog monitors the new module for 5 seconds
(see Stateless Policy Swap Watchdog below).
**Stateful policy modules** (own mutable state that must survive
replacement — e.g., a scheduler policy with learned ML parameters,
a custom page policy with per-NUMA counters):
- Full Phase A/A'/B/C lifecycle
([Section 13.18](13-device-classes.md#live-kernel-evolution--component-replacement-flow)).
- Phase A: export_state() on old module, import_state() on new.
- Phase A': quiescence — in-flight calls drain, new calls queued
in PendingOpsPerCpu.
- Phase B: stop-the-world IPI, atomic vtable swap, pending ops
transfer.
- Phase C: new module drains pending ops, watchdog active.
- Rollback on crash within watchdog window.
The decision rule: does the module own mutable state that must survive
replacement? If no → AtomicPtr. If yes → Phase A/A'/B/C. The module's
KabiPolicyManifest (see below) declares which mechanism it requires via
the `replacement_mode` field.
f. Old policy module can be unloaded (stateless: immediately after
watchdog window; stateful: after Phase C2 cleanup).
4. Module crash:
a. Domain fault trapped by kernel.
b. Revert to built-in default policy (immediate, no interruption).
c. Reload module if desired.
d. Total disruption: zero. Built-in default handles the gap.
5. Module hot-swap:
echo "sched_cfs_umka" > /sys/kernel/umka/policy/scheduler/active
→ Replacement mechanism selected per step 3.e above. No interruption.
19.9.5.1 Policy Module Signature Verification and Measurement Enforcement¶
Step 3.a above is the gate: every policy module load passes through signature verification and TPM measurement. The enforcement level is tied to the boot security posture — not a separate knob.
Boot parameter: `umka.module_sig=enforce|advisory|off`

| Mode | Default When | Unsigned Module Behavior | TPM Measurement | Attestation Impact |
|---|---|---|---|---|
| `enforce` | Secure boot active | Rejected. Load fails with `-EKEYREJECTED`. dmesg: "policy module {name}: signature verification failed, load rejected (enforce mode)" | Module hash extended into PCR[15]. Signed status recorded in IMA log. | Full attestation chain intact. |
| `advisory` | Secure boot inactive | Loaded with warning. dmesg: "policy module {name}: UNSIGNED — loaded in advisory mode, attestation score reduced". Module runs in its own isolation domain regardless of vendor trust (never promoted to Core domain). | Module hash extended into PCR[15]. IMA log records `unsigned_policy_module` event. | Attestation score reduced. Remote attestor sees unsigned module in IMA log. |
| `off` | Never (explicit opt-in only) | Loaded silently. No signature check, no TPM measurement. | None. | Attestation unavailable. |
Enforcement override rules:
- When secure boot is active (UEFI Secure Boot or equivalent DTB-verified boot chain), the default is `enforce`. The boot parameter `umka.module_sig=advisory` or `umka.module_sig=off` is rejected — dmesg logs `"umka.module_sig={value} ignored: secure boot is active, enforcing signatures"` and the mode remains `enforce`. This prevents a bootloader-level attacker from downgrading enforcement via command line.
- When secure boot is inactive, the default is `advisory`. The administrator may set `enforce` (stricter) or `off` (bare-metal debugging) via boot parameter.
- Runtime change: enforcement mode is immutable after boot. There is no sysfs or sysctl to change it. The only way to change the mode is to reboot with a different `umka.module_sig=` parameter.
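The mode-resolution rule can be sketched as follows. `resolve_sig_mode` is a hypothetical helper, not kernel API, and the real loader also emits the dmesg warning when it discards a downgrade attempt.

```rust
#[derive(Clone, Copy, PartialEq, Debug)]
enum ModuleSigMode {
    Enforce,
    Advisory,
    Off,
}

/// Resolve the effective enforcement mode from the boot parameter and the
/// secure-boot state, per the override rules above.
fn resolve_sig_mode(boot_param: Option<ModuleSigMode>, secure_boot: bool) -> ModuleSigMode {
    if secure_boot {
        // Secure boot pins enforce; advisory/off downgrades are ignored.
        ModuleSigMode::Enforce
    } else {
        // Without secure boot: default advisory, the admin may pick any mode.
        boot_param.unwrap_or(ModuleSigMode::Advisory)
    }
}
```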
Signature verification flow (step 3.a in detail):
/// Policy module load gate. Called before any code from the module executes.
///
/// Returns Ok(ModuleTrust) on success, Err on rejection.
pub fn verify_policy_module(
binary: &[u8],
sig_mode: ModuleSigMode,
) -> Result<ModuleTrust, ModuleLoadError> {
// 1. Parse .kabi_sig section from ELF.
let sig = parse_kabi_sig(binary)?;
// 2. Verify signature against trusted policy signing key.
let sig_result = match &sig {
Some(s) => kabi_sig_verify(s, binary),
None => Err(SigError::NoSignature),
};
// 3. Enforce based on boot-time mode.
let trust = match (sig_result, sig_mode) {
(Ok(()), _) => {
// Signed and valid → production-vetted.
ModuleTrust::Vetted
}
(Err(_), ModuleSigMode::Enforce) => {
// Unsigned or invalid signature in enforce mode → reject.
return Err(ModuleLoadError::SignatureRejected);
}
(Err(_), ModuleSigMode::Advisory) => {
// Unsigned in advisory mode → allow in isolation domain.
log_warning!("policy module: UNSIGNED, loaded in advisory mode");
ModuleTrust::Untrusted
}
(Err(_), ModuleSigMode::Off) => {
// No verification → untrusted.
ModuleTrust::Untrusted
}
};
// 4. TPM measurement (unless mode is Off).
if sig_mode != ModuleSigMode::Off {
let hash = sha256(binary);
tpm_extend_pcr(15, &hash);
ima_log_policy_module(binary, trust);
}
Ok(trust)
}
/// Module trust level determines isolation domain placement.
/// `#[repr(u8)]` is required because `ModuleTrust` is embedded in
/// `KabiPolicyManifest` (a `#[repr(C)]` ELF-section struct). A bare
/// Rust enum without explicit repr has unstable layout across compiler versions.
#[repr(u8)]
pub enum ModuleTrust {
/// Signed by kernel vendor. Runs in Core isolation domain (zero overhead).
Vetted,
/// Unsigned or experimental. Runs in its own isolation domain.
Untrusted,
}
pub enum ModuleSigMode {
/// Reject unsigned modules. Default when secure boot is active.
Enforce,
/// Allow unsigned with warning. Default when secure boot is inactive.
Advisory,
/// No verification, no measurement. Bare-metal debugging only.
Off,
}
Isolation domain assignment based on trust:
| Trust Level | Isolation Domain | Domain Switch Cost | Can Access Core State? |
|---|---|---|---|
| `Vetted` | Core domain (same as kernel) | 0 cycles | Yes (read-only snapshots via policy context) |
| `Untrusted` | Dedicated per-module domain | ~23 cycles (MPK) / ~40-80 cycles (POE) | No — hardware isolation enforced |
An unsigned module loaded in advisory mode is always Untrusted — it can
never be promoted to Vetted without a valid signature. This ensures that even
in development mode, unsigned code is hardware-isolated from the kernel core.
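The never-promote rule, combined with the manifest check that a module declaring more trust than its signature proves is rejected (Section 19.9.6), can be sketched as follows. `validate_trust` is an illustrative helper, not kernel API.

```rust
#[derive(Clone, Copy, PartialEq, PartialOrd, Debug)]
enum ModuleTrust {
    Untrusted, // declared first, so derived ordering ranks it lowest
    Vetted,
}

/// Check the manifest's declared trust against the signature verification
/// result: declared trust may never exceed verified trust. An unsigned
/// module claiming Vetted is rejected; a signed module may still opt down
/// to Untrusted.
fn validate_trust(
    declared: ModuleTrust,
    verified: ModuleTrust,
) -> Result<ModuleTrust, &'static str> {
    if declared > verified {
        Err("declared_trust exceeds signature verification result")
    } else {
        Ok(declared)
    }
}
```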
Relationship to driver signature verification:
Policy module signing uses the same key hierarchy as driver signing
(Section 9.3): ML-DSA-65 signature
in .kabi_sig ELF section, verified against the trusted driver signing key
chain. The policy module signing key MAY be the same as the driver signing key
or a separate sub-key — both are valid configurations. The verification code
path is shared.
19.9.6 KabiPolicyManifest¶
Every policy module binary embeds a KabiPolicyManifest in the .kabi_manifest ELF
section, following the same pattern as KabiDriverManifest
(Section 12.6).
The kernel loader reads this section before any module code executes.
/// ELF-embedded policy module manifest.
/// Placed in section `.kabi_manifest` by the linker script.
/// Generated by `umka-kabi-gen --policy --output-dir`.
/// Policy module authors do not write or modify this struct directly.
// kernel-internal, not KABI
#[repr(C)]
pub struct KabiPolicyManifest {
/// Magic: 0x4B504F4C ("KPOL") — identifies a valid policy manifest.
/// Distinguishes policy manifests from driver manifests (0x4B424944 "KBID").
pub magic: u32,
/// Manifest structure version (currently 1). Loader rejects unknown versions.
pub manifest_version: u32,
/// Policy type: which policy point this module implements.
/// Must match exactly one of the registered policy point identifiers.
pub policy_type: PolicyType,
/// Explicit padding to align `kabi_version` (u64) to 8-byte boundary.
/// `#[repr(C)]` would insert 4 bytes of implicit padding here;
/// making it explicit prevents information disclosure.
pub _pad0: [u8; 4],
/// KABI version this policy module was compiled against.
/// Compatibility check uses the same rules as driver KABI:
/// same major required, kernel minor >= module minor
/// ([Section 12.7](12-kabi.md#kabi-service-dependency-resolution--kabiproviderindex-boot-time-service-map)).
pub kabi_version: u64,
/// Null-terminated UTF-8 module name (max 63 bytes + null).
pub module_name: [u8; 64],
/// Module version (semantic: major << 32 | minor << 16 | patch).
/// Used for state migration ordering when a stateful policy module
/// implements EvolvableComponent.
pub module_version: u64,
/// Entry point: symbol name of the function that returns the policy
/// vtable pointer. The kernel calls this once during module load.
///
/// For Tier 0 (InKernel): `transport_ctx` is null (direct function calls,
/// no transport setup needed). For Tier 1 (SharedMemRing): `transport_ctx`
/// points to the KABI ring buffer descriptor. For Tier 2 (ProcessIpc):
/// `transport_ctx` points to the IPC channel handle. The callee casts the
/// opaque pointer to the appropriate transport-specific type based on
/// `transport`.
/// **Type relationship**: `PolicyVtableHeader` is the common prefix
/// shared by both stateless and stateful vtables. Its first two fields
/// (`vtable_size`, `kabi_version`) are layout-identical to the first
/// two fields of `VtableHeader` (see [Section 13.18](13-device-classes.md#live-kernel-evolution)).
/// The kernel inspects `PolicyModuleMetadata::replacement_mode` to
/// determine the actual type behind the pointer:
/// - `Stateless` → the returned pointer is a `*const PolicyVtableHeader`
/// (or a concrete vtable whose first two fields match it).
/// - `Stateful` → the returned pointer is a `*const VtableHeader`
/// (which extends `PolicyVtableHeader` with `quiescing`,
/// `pending_ops_ptr`, `state_version`, `export_state`,
/// `import_state`). The kernel casts to `VtableHeader` after
/// verifying `vtable_size >= size_of::<VtableHeader>()`.
/// In both cases the return type is `*const PolicyVtableHeader`
/// because it is the minimal common prefix. Callers upcast only
/// after checking `replacement_mode`.
pub entry_fn: Option<unsafe extern "C" fn(
transport: KabiTransportClass,
transport_ctx: *const (),
) -> *const PolicyVtableHeader>,
/// Bitmask of supported transport classes. If the kernel's selected
/// transport is not in this mask, the module load fails with `ENOTSUP`.
/// Bit N corresponds to `KabiTransportClass` variant N.
/// Example: `0b011` = supports InKernel and SharedMemRing.
pub transport_mask: u8,
/// Fallback tier bias for architectures without fast isolation.
/// When Tier 1 is unavailable (e.g., RISC-V, LoongArch64), the kernel
/// uses this to decide whether to promote to Tier 0 (trusted, in-kernel)
/// or demote to Tier 2 (isolated, higher latency). Values:
/// `0` = prefer Tier 0 (default for signed, trusted modules).
/// `1` = prefer Tier 2 (default for unsigned or low-trust modules).
pub fallback_bias: u8,
/// Trust level declared by the module author. The kernel validates
/// this against signature verification results — an unsigned module
/// declaring Vetted trust is rejected (trust level cannot exceed
/// the signature verification result from verify_policy_module()).
pub declared_trust: ModuleTrust,
/// Replacement mode: stateless (AtomicPtr swap) or stateful
/// (Phase A/A'/B/C lifecycle with export_state/import_state).
/// See Module Lifecycle step 3.e for the semantics of each mode.
pub replacement_mode: PolicyReplacementMode,
/// KABI transport class required by this policy module.
/// Determines how the kernel communicates with the module at runtime:
/// `InKernel` (Tier 0, direct function call), `SharedMemRing` (Tier 1,
/// ring buffer in isolation domain), or `ProcessIpc` (Tier 2, Ring 3).
/// Must match the module's compiled tier. See [Section 12.6](12-kabi.md#kabi-transport-classes).
pub transport: KabiTransportClass, // see enum definition below
/// Required isolation tier for this policy module (0, 1, or 2).
/// The kernel rejects a module whose declared tier does not match the
/// `transport` field (e.g., `transport = SharedMemRing` with `tier = 0`
/// is invalid). Unsigned or untrusted modules cannot declare Tier 0.
pub required_tier: u8,
/// Identifies which kernel service this policy implements.
/// Used by the dependency DAG to ensure incompatible policies are not
/// loaded simultaneously (e.g., two different `IoSchedPolicy` modules).
/// Values are assigned per policy point (e.g., 1 = IoSched, 2 = PageReplace).
pub service_class: u32,
pub _reserved: [u8; 1],
}
// Layout: magic(4) + manifest_version(4) + policy_type(4) + _pad0(4) +
// kabi_version(8) + module_name(64) + module_version(8) + entry_fn(8) +
// transport_mask(1) + fallback_bias(1) + declared_trust(1) +
// replacement_mode(1) + transport(1) + required_tier(1) +
// implicit pad(2) + service_class(4) + _reserved(1) +
// pad(3 to 8-byte struct alignment) = 120 bytes (LP64).
#[cfg(target_pointer_width = "64")]
const_assert!(size_of::<KabiPolicyManifest>() == 120);
/// KABI transport class — determines how the kernel communicates with
/// the policy module at runtime. Must match the module's compiled tier.
/// See [Section 12.6](12-kabi.md#kabi-transport-classes) for full transport semantics.
#[repr(u8)]
pub enum KabiTransportClass {
/// Tier 0: direct function calls within the kernel address space.
/// Fastest path — no marshalling overhead. Only for signed, trusted modules.
InKernel = 0,
/// Tier 1: KABI ring buffer in a shared memory isolation domain (MPK/POE).
/// Module runs at Ring 0 but in a separate memory protection domain.
SharedMemRing = 1,
/// Tier 2: cross-address-space IPC (Ring 3 process).
/// Full process isolation with IOMMU protection. Highest latency.
ProcessIpc = 2,
}
/// Identifies which policy point a module implements.
///
/// **Mapping to `PolicyPointId`** (defined in the live evolution framework,
/// [Section 13.18](13-device-classes.md#live-kernel-evolution)): `PolicyType` is used in policy module manifests
/// (KABI ABI boundary), while `PolicyPointId` is used internally by the evolution
/// framework. The canonical mapping table:
///
/// | `PolicyType` (manifest) | Value | `PolicyPointId` (internal) | Value | Notes |
/// |-------------------------|-------|----------------------------|-------|-------|
/// | `Scheduler` | 1 | *(none)* | — | Scheduler uses Phase A/A'/B/C, not stateless swap |
/// | `PageReplacement` | 2 | `PageReclaimPolicy` | 1 | |
/// | `IoScheduler` | 3 | `IoSchedOps` | 4 | |
/// | `NetClassifier` | 4 | `NetClassPolicy` | 8 | |
/// | `MemoryTiering` | 5 | `TierPolicy` | 7 | |
/// | `VmmPolicy` | 6 | `VmmPolicy` | 2 | |
/// | `CapPolicy` | 7 | `CapPolicy` | 3 | |
/// | `CongestionControl` | 8 | `CongestionOps` | 6 | |
/// | `PhysAllocPolicy` | 9 | `PhysAllocPolicy` | 0 | |
/// | `QdiscOps` | 10 | `QdiscOps` | 5 | |
#[repr(u32)]
pub enum PolicyType {
/// CPU scheduling policy (SchedPolicy trait).
Scheduler = 1,
/// Page replacement policy (PagePolicy trait).
PageReplacement = 2,
/// I/O scheduling policy (IoSchedPolicy trait).
IoScheduler = 3,
/// Network classification policy (NetClassPolicy trait).
NetClassifier = 4,
/// Memory tiering policy (TierPolicy trait).
MemoryTiering = 5,
/// Virtual memory manager policy (VmmPolicy trait).
VmmPolicy = 6,
/// Capability system policy (CapPolicy trait).
CapPolicy = 7,
/// TCP congestion control algorithm (CongestionOps trait).
CongestionControl = 8,
/// Physical memory allocator policy (PhysAllocPolicy trait).
/// Controls zone fallback ordering, migration-type selection, and
/// compaction heuristics. Referenced by the buddy allocator's warm-path
/// `select_block()` call ([Section 4.2](04-memory.md#physical-memory-allocator--allocation-policy)).
/// The `PhysAllocPolicyVTable` `#[repr(C)]` vtable struct is defined in
/// [Section 4.2](04-memory.md#physical-memory-allocator--physallocpolicyvtable).
PhysAllocPolicy = 9,
/// Queue discipline scheduling policy (QdiscOps trait).
/// Maps to `PolicyPointId::QdiscOps` (5) in the live evolution framework.
/// Controls packet scheduling order and shaping decisions in the TC layer
/// ([Section 16.21](16-networking.md#traffic-control-and-queue-disciplines)).
QdiscOps = 10,
}
/// Replacement mode for policy modules. Declared in the manifest and
/// validated by the evolution framework at load time.
#[repr(u8)] // 1 byte: embedded in the #[repr(C)] KabiPolicyManifest layout
pub enum PolicyReplacementMode {
/// Stateless: AtomicPtr swap, no quiescence, no state export/import.
/// The module must not own any mutable state that persists across calls.
Stateless = 0,
/// Stateful: Full Phase A/A'/B/C lifecycle. The module implements
/// EvolvableComponent for state export/import.
Stateful = 1,
}
Manifest validation at load time:
1. Parse the `.kabi_manifest` section from the ELF binary.
2. Verify `magic == 0x4B504F4C` ("KPOL"). Reject if the magic is `0x4B424944` (driver manifest) — a driver cannot be loaded as a policy module, and vice versa.
3. Verify `manifest_version == 1`. Reject unknown versions.
4. Verify `kabi_version` is compatible with the running kernel's KABI version.
5. Verify `policy_type` matches the target policy point (the sysfs path determines which policy point is being loaded — e.g., writing to `/sys/kernel/umka/policy/scheduler/active` requires `PolicyType::Scheduler`).
6. Verify `replacement_mode` is valid: `Stateless` (0) or `Stateful` (1). Reject unknown values. If `Stateful`, verify the module exports `export_state()` and `import_state()` symbols — a module declaring `Stateful` without these symbols cannot participate in the Phase A/A'/B/C lifecycle.
7. If `replacement_mode == Stateful`, verify that the current T0 generation counter (Section 13.18) matches the module's compiled generation. A stale stateful module compiled against generation N cannot safely import state from generation N+1 — the state layout may have changed.
8. Verify `entry_fn` is non-null and resolves to a valid symbol in the module.
9. Verify `declared_trust` does not exceed the signature verification result.
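The magic, version, and replacement-mode checks above can be sketched as follows. The error enum and function name are illustrative, not the kernel's actual loader API.

```rust
const KPOL_MAGIC: u32 = 0x4B50_4F4C; // "KPOL" (policy manifest)
const KBID_MAGIC: u32 = 0x4B42_4944; // "KBID" (driver manifest)

#[derive(Debug, PartialEq)]
enum ManifestError {
    DriverNotPolicy,
    BadMagic,
    UnknownVersion,
    BadReplacementMode,
}

/// Validate the leading KabiPolicyManifest fields (field subset; the real
/// loader validates the full struct, including kabi_version and entry_fn).
fn validate_header(magic: u32, version: u32, replacement_mode: u8) -> Result<(), ManifestError> {
    match magic {
        KPOL_MAGIC => {}
        // A driver manifest explicitly cannot be loaded as a policy module.
        KBID_MAGIC => return Err(ManifestError::DriverNotPolicy),
        _ => return Err(ManifestError::BadMagic),
    }
    if version != 1 {
        return Err(ManifestError::UnknownVersion);
    }
    // 0 = Stateless, 1 = Stateful; anything else is rejected.
    if replacement_mode > 1 {
        return Err(ManifestError::BadReplacementMode);
    }
    Ok(())
}
```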
Policy loader algorithm (invoked when the administrator writes to
`/sys/kernel/umka/policy/<point>/active`):
1. Read the module name from the sysfs write buffer.
2. Look up the module binary in `/lib/umka/policy/` by name.
3. Verify the ELF signature (see Signature Verification above).
4. Parse and validate the `.kabi_manifest` (steps 1-9 above).
5. Map the module's `.text` and `.rodata` sections into kernel virtual address space.
6. Call the module's `entry_fn()` to obtain the vtable pointer.
7. Validate the returned vtable: `vtable_size >= KERNEL_MIN_VTABLE_SIZE`, `kabi_version` compatible, all required method pointers non-null.
8. Dispatch to the appropriate replacement mechanism based on `replacement_mode`:
   - `Stateless`: AtomicPtr swap + RCU grace period + post-swap notify.
   - `Stateful`: Full Phase A/A'/B/C lifecycle (Section 13.18).
9. Activate the post-swap health watchdog (5-second monitoring window).
19.9.7 KABI Vtable Wrappers for Policy Traits¶
Policy traits (SchedPolicy, PagePolicy, etc.) are Rust traits with Rust-native
calling conventions. For KABI ABI stability, policy modules export their vtables as
#[repr(C)] structs with the standard vtable_size/kabi_version header — the same
pattern used by driver vtables (Section 12.1).
Each policy trait has a corresponding #[repr(C)] KABI vtable struct. The kabi-gen
tool generates these from .kabi IDL files. The kernel dispatches policy calls through
kabi_call! for bounds safety and version compatibility.
/// Common header for all KABI vtable structs. Every KABI vtable
/// (driver or policy) begins with these two fields.
#[repr(C)]
pub struct PolicyVtableHeader {
/// Byte size of the vtable struct. Used for bounds safety:
/// the kernel reads only the first min(vtable_size, KERNEL_EXPECTED_SIZE)
/// bytes. Methods beyond vtable_size are treated as absent.
pub vtable_size: u64,
/// Primary version discriminant: KabiVersion::as_u64().
pub kabi_version: u64,
}
// Layout: 8 + 8 = 16 bytes.
const_assert!(size_of::<PolicyVtableHeader>() == 16);
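The bounds rule can be made concrete with a small sketch (a hypothetical helper, not the actual `kabi_vtable_call!` internals): with the 16-byte header plus an 8-byte `ctx` pointer, the first method slot begins at offset 24, and a slot is callable only if it fits entirely within the module-declared `vtable_size`.

```rust
/// A method slot is "present" only if its 8-byte function pointer lies
/// entirely within the module-declared vtable_size. Methods beyond it are
/// treated as absent and fall back to built-in defaults.
fn method_present(vtable_size: u64, method_offset: u64) -> bool {
    const FN_PTR_SIZE: u64 = 8; // 64-bit target
    method_offset + FN_PTR_SIZE <= vtable_size
}
```

For example, a V1 vtable with a 16-byte header, an 8-byte `ctx`, and 4 method pointers declares `vtable_size = 56`; a fifth method added at offset 56 by a later KABI revision reads as absent for that module.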
/// KABI vtable for SchedPolicy. Generated by kabi-gen from sched_policy.kabi.
/// Replaces the Rust-native `dyn SchedPolicy` trait object with a C-ABI-stable
/// function pointer table.
// kernel-internal, not KABI
#[repr(C)]
pub struct SchedPolicyVTable {
pub vtable_size: u64,
pub kabi_version: u64,
/// Context pointer: opaque pointer to the module's internal state.
/// Passed as the first argument to every vtable function.
/// The kernel never dereferences this — it is the module's responsibility.
pub ctx: *mut core::ffi::c_void,
// V1 methods (mandatory).
pub pick_next_task: unsafe extern "C" fn(
ctx: *mut core::ffi::c_void, cpu: CpuId, ctx_snapshot: *const SchedPolicyContext,
) -> PickNextResult,
pub enqueue_task: unsafe extern "C" fn(
ctx: *mut core::ffi::c_void, cpu: CpuId, task: TaskId, flags: u32,
),
pub task_tick: unsafe extern "C" fn(
ctx: *mut core::ffi::c_void, task: TaskId, cpu: CpuId,
),
pub balance_load: unsafe extern "C" fn(
ctx: *mut core::ffi::c_void, this_cpu: CpuId, busiest_cpu: CpuId,
) -> MigrateDecisionRepr,
}
/// KABI vtable for PagePolicy.
// kernel-internal, not KABI
#[repr(C)]
pub struct PagePolicyVTable {
pub vtable_size: u64,
pub kabi_version: u64,
pub ctx: *mut core::ffi::c_void,
pub select_victims: unsafe extern "C" fn(
ctx: *mut core::ffi::c_void, zone: *const Zone, nr_to_scan: u32,
out_buf: *mut PageHandle, out_cap: u32, out_len: *mut u32,
),
pub should_promote: unsafe extern "C" fn(
ctx: *mut core::ffi::c_void, page: *const PageHandle,
) -> u8, // 0 = false, 1 = true (bool invalid across KABI boundary)
pub migration_advice: unsafe extern "C" fn(
ctx: *mut core::ffi::c_void, page: *const PageHandle, current_node: u8,
) -> MigrateAdviceRepr,
}
/// KABI vtable for IoSchedPolicy.
// kernel-internal, not KABI
#[repr(C)]
pub struct IoSchedPolicyVTable {
pub vtable_size: u64,
pub kabi_version: u64,
pub ctx: *mut core::ffi::c_void,
pub submit: unsafe extern "C" fn(
ctx: *mut core::ffi::c_void, req: *const IoRequest,
) -> i64,
pub dispatch: unsafe extern "C" fn(
ctx: *mut core::ffi::c_void, queue: *const IoQueue,
) -> DispatchResult,
pub complete: unsafe extern "C" fn(
ctx: *mut core::ffi::c_void, req: *const IoRequest, latency_ns: u64,
),
}
/// KABI vtable for NetClassPolicy.
// kernel-internal, not KABI
#[repr(C)]
pub struct NetClassPolicyVTable {
pub vtable_size: u64,
pub kabi_version: u64,
pub ctx: *mut core::ffi::c_void,
pub classify_rx: unsafe extern "C" fn(
ctx: *mut core::ffi::c_void, packet: *const PacketHeader,
) -> NetClassRepr,
pub classify_tx: unsafe extern "C" fn(
ctx: *mut core::ffi::c_void, packet: *const PacketHeader,
) -> NetClassRepr,
}
/// KABI vtable for TierPolicy.
// kernel-internal, not KABI
#[repr(C)]
pub struct TierPolicyVTable {
pub vtable_size: u64,
pub kabi_version: u64,
pub ctx: *mut core::ffi::c_void,
pub initial_placement: unsafe extern "C" fn(
ctx: *mut core::ffi::c_void, process: ProcessId, flags: u32,
) -> u8,
pub demotion_advice: unsafe extern "C" fn(
ctx: *mut core::ffi::c_void, page: *const PageHandle, idle_ticks: u32,
) -> TierDecisionRepr,
pub promotion_advice: unsafe extern "C" fn(
ctx: *mut core::ffi::c_void, page: *const PageHandle, heat: u32,
) -> TierDecisionRepr,
pub migration_threshold: unsafe extern "C" fn(
ctx: *mut core::ffi::c_void,
) -> u32,
pub remote_tier_advice: unsafe extern "C" fn(
ctx: *mut core::ffi::c_void, node_id: NodeId, available_bytes: u64,
) -> u8, // 0 = false, 1 = true (bool invalid across KABI boundary)
}
/// KABI vtable for VmmPolicy. Generated by kabi-gen from vmm_policy.kabi.
/// Corresponds to the `VmmPolicy` trait ([Section 4.8](04-memory.md#virtual-memory-manager)).
// kernel-internal, not KABI
#[repr(C)]
pub struct VmmPolicyVTable {
pub vtable_size: u64,
pub kabi_version: u64,
pub ctx: *mut core::ffi::c_void,
pub handle_anon_fault: unsafe extern "C" fn(
ctx: *mut core::ffi::c_void, mm: *const MmStruct, vma: *const Vma,
addr: u64, access: u32,
) -> i32,
pub handle_cow_fault: unsafe extern "C" fn(
ctx: *mut core::ffi::c_void, mm: *const MmStruct, vma: *const Vma,
addr: u64, old_pfn: u64,
) -> i32,
pub handle_file_fault: unsafe extern "C" fn(
ctx: *mut core::ffi::c_void, mm: *const MmStruct, vma: *const Vma,
addr: u64, access: u32,
) -> i32,
pub should_promote_thp: unsafe extern "C" fn(
ctx: *mut core::ffi::c_void, mm: *const MmStruct, addr: u64,
) -> u8, // 0 = no, 1 = yes, 2 = defer (bool invalid across KABI boundary)
pub tlb_flush_strategy: unsafe extern "C" fn(
ctx: *mut core::ffi::c_void, mm: *const MmStruct,
addr_start: u64, addr_end: u64, nr_pages: u64,
) -> u8,
pub pcid_evict_candidate: unsafe extern "C" fn(
ctx: *mut core::ffi::c_void, active_pcids: *const PcidEntry, nr_active: u32,
) -> u32,
pub readahead_window: unsafe extern "C" fn(
ctx: *mut core::ffi::c_void, ra_state: *const FileRaState, offset: u64,
) -> u32,
}
/// KABI vtable for CapPolicy. Generated by kabi-gen from cap_policy.kabi.
/// Corresponds to the `CapPolicy` trait ([Section 9.1](09-security.md#capability-based-foundation)).
// kernel-internal, not KABI
#[repr(C)]
pub struct CapPolicyVTable {
pub vtable_size: u64,
pub kabi_version: u64,
pub ctx: *mut core::ffi::c_void,
pub capable: unsafe extern "C" fn(
ctx: *mut core::ffi::c_void, creds: *const Credentials,
cap: u32, ns: *const UserNamespace,
) -> u8, // 0 = false, 1 = true (bool invalid across KABI boundary)
pub delegate_check: unsafe extern "C" fn(
ctx: *mut core::ffi::c_void, parent: *const CapEntry,
requested_rights: u64, target_domain: u64, target_tier: u8,
out_constraints: *mut CapConstraints,
) -> i32,
pub revocation_order: unsafe extern "C" fn(
ctx: *mut core::ffi::c_void, root: *const CapEntry,
out_buf: *mut CapId, out_cap: u32, out_len: *mut u32,
),
pub inherit_on_exec: unsafe extern "C" fn(
ctx: *mut core::ffi::c_void, cap: *const CapEntry,
new_creds: *const Credentials, exec_flags: u32,
out_constraints: *mut CapConstraints,
) -> u8, // 0 = false, 1 = true (bool invalid across KABI boundary)
pub evaluate_constraints: unsafe extern "C" fn(
ctx: *mut core::ffi::c_void, constraints: *const CapConstraints,
context: *const CapCheckContext,
) -> i32,
pub syscaps_to_permissions: unsafe extern "C" fn(
ctx: *mut core::ffi::c_void, caps: u64, target_class: u32,
) -> u64,
pub lsm_cap_check: unsafe extern "C" fn(
ctx: *mut core::ffi::c_void, creds: *const Credentials,
cap: *const CapEntry, operation: u32,
) -> i32,
pub cluster_revoke: unsafe extern "C" fn(
ctx: *mut core::ffi::c_void, cap_id: u64, peer_set: *const PeerSet,
) -> i32,
}
/// KABI vtable for CongestionOps. Generated by kabi-gen from congestion_ops.kabi.
/// Corresponds to the `CongestionOps` trait ([Section 16.10](16-networking.md#pluggable-tcp-congestion-control)).
// kernel-internal, not KABI
#[repr(C)]
pub struct CongestionOpsVTable {
pub vtable_size: u64,
pub kabi_version: u64,
pub ctx: *mut core::ffi::c_void,
pub name: unsafe extern "C" fn(
ctx: *mut core::ffi::c_void, out_buf: *mut u8, buf_len: u32,
) -> u32,
pub flags: unsafe extern "C" fn(
ctx: *mut core::ffi::c_void,
) -> u32,
pub init: unsafe extern "C" fn(
ctx: *mut core::ffi::c_void, cb: *mut TcpCb,
),
pub release: unsafe extern "C" fn(
ctx: *mut core::ffi::c_void, cb: *mut TcpCb,
),
pub ssthresh: unsafe extern "C" fn(
ctx: *mut core::ffi::c_void, cb: *mut TcpCb,
) -> u64,
pub cong_avoid: unsafe extern "C" fn(
ctx: *mut core::ffi::c_void, cb: *mut TcpCb, ack: u32, acked: u32,
),
pub cong_control: unsafe extern "C" fn(
ctx: *mut core::ffi::c_void, cb: *mut TcpCb, ack: *const TcpAck,
),
pub set_state: unsafe extern "C" fn(
ctx: *mut core::ffi::c_void, cb: *mut TcpCb, new_state: u8,
),
pub cwnd_event: unsafe extern "C" fn(
ctx: *mut core::ffi::c_void, cb: *mut TcpCb, ev: u8,
),
pub pkts_acked: unsafe extern "C" fn(
ctx: *mut core::ffi::c_void, cb: *mut TcpCb, sample: *const RateSample,
),
pub undo_cwnd: unsafe extern "C" fn(
ctx: *mut core::ffi::c_void, cb: *const TcpCb,
) -> u64,
pub get_info: unsafe extern "C" fn(
ctx: *mut core::ffi::c_void, cb: *const TcpCb, out_buf: *mut u8, buf_len: u32,
) -> u32,
}
// KABI vtable structs (SchedPolicyVTable, PagePolicyVTable, IoSchedPolicyVTable,
// NetClassPolicyVTable, TierPolicyVTable, VmmPolicyVTable, CapPolicyVTable,
// CongestionOpsVTable): all contain function pointers whose sizes depend on
// the target platform (8 bytes on 64-bit, 4 bytes on 32-bit). Size is
// self-described via the `vtable_size` field. No fixed const_assert.
kabi_vtable_call! dispatch for policy vtables:
Policy vtable calls use kabi_vtable_call! (Section 12.4) — the
SDK-internal vtable bounds-checking macro. The macro provides bounds-safety (checks
method offset against vtable_size) and version compatibility (checks kabi_version
against the kernel's expected version). Driver code NEVER calls kabi_vtable_call!
directly — it uses kabi_call!(handle, method, args) (Section 12.8)
which is the public transport-abstraction API. Example dispatch:
// Scheduler policy dispatch — called from the scheduler tick path.
// The vtable pointer is loaded via AtomicPtr::load(Acquire) from
// the per-policy-point global: SCHED_POLICY_VTABLE.
let vtable: *const SchedPolicyVTable = SCHED_POLICY_VTABLE.load(Ordering::Acquire);
// SAFETY: vtable is non-null (initialized to built-in default at boot).
// kabi_vtable_call! verifies vtable_size covers the pick_next_task slot.
let result = unsafe {
kabi_vtable_call!(vtable, SchedPolicyVTable, pick_next_task, PickNextResult::SKIP, (*vtable).ctx, cpu, &ctx_snapshot)
};
kabi_call_t0! — RCU-protected variant for Tier 0 dispatch:
kabi_call_t0! wraps kabi_vtable_call! (Section 12.4) in an
RCU read-side critical section. It is used exclusively for Tier 0 (Core domain)
vtable calls where the vtable pointer is swapped via AtomicPtr during live
evolution. The RCU protection prevents the caller from dereferencing a vtable that
has been freed by a concurrent evolution swap (the old vtable is freed via
call_rcu() after Phase C).
/// Tier 0 dispatch macro. Wraps kabi_vtable_call! in an RCU read-side
/// critical section.
/// MUST be used for all Core-domain vtable calls where the vtable pointer is
/// loaded from an AtomicPtr that the evolution framework may swap.
/// Callers MUST NOT block inside kabi_call_t0! (RCU read-side is non-preemptible).
macro_rules! kabi_call_t0 {
($vtable_ptr:expr, $VTable:ty, $method:ident, $default:expr $(, $arg:expr)*) => {{
let _guard = rcu_read_lock();
let vtable = $vtable_ptr.load(Ordering::Acquire);
kabi_vtable_call!(vtable, $VTable, $method, $default $(, $arg)*)
}};
}
See Section 13.18 and Section 11.2 for the full T0 dispatch protocol and the RCU lifetime guarantee.
ABI representation types: The KABI vtable uses #[repr(C)] result types
(PickNextResult, MigrateDecisionRepr, MigrateAdviceRepr, TierDecisionRepr,
NetClassRepr, DispatchResult) instead of Rust enums. These are C-ABI-stable
integer-plus-payload structs generated by kabi-gen. The kernel's policy dispatch
wrapper converts between the #[repr(C)] KABI types and the Rust-native trait types.
19.9.8 Stateless Policy Swap Watchdog¶
Stateless policy modules (AtomicPtr replacement) have no Phase A/A'/B/C lifecycle and no built-in rollback mechanism. To prevent a buggy policy module from degrading system behavior undetected, the kernel runs a lightweight post-swap health watchdog (Section 13.18):
The canonical struct definition is StatelessPolicyWatchdog in
Section 13.18. It tracks the
old vtable pointer, old module pages, watchdog deadline, health check interval,
vtable slot pointer, and pre-swap baseline metrics. PolicyHealthThresholds
(also defined there) sets per-policy-type configurable limits for error rate,
latency increase, and consecutive faults.
Watchdog protocol:

- Before swap: Capture the old vtable pointer and retain the old module's memory pages. Record baseline health metrics (error rate, average latency).
- AtomicPtr swap: `POLICY_VTABLE.store(new_vtable, Ordering::Release)`.
- Watchdog activation: Start a periodic timer (default 500ms interval) that reads the policy module's FMA health counters (`policy_error_count`, `policy_retry_count` from the FMA health struct).
- Health checks (every 500ms for 5 seconds = 10 checks):
    - Compare the current error rate against `max_error_rate_per_mille`.
    - Compare average call latency against the pre-swap baseline + `max_latency_increase_ns`.
    - Check for fault/panic events on this policy point.
    - If any threshold is exceeded: increment `anomaly_count`. If `anomaly_count >= 2` (sustained anomaly, not a transient spike): trigger automatic revert.
- Automatic revert: `POLICY_VTABLE.store(old_vtable, Ordering::Release)`. Log FMA event `HealthEventClass::PolicyAutoRevert` with the anomaly details. The new module is unloaded. The old module resumes as if nothing happened. dmesg: `"policy module {name}: auto-reverted after {anomaly} anomalies within watchdog window"`.
- Watchdog expiry (no anomalies): Old module pages are freed. The old vtable pointer is cleared. The swap is considered successful.
Observability: Watchdog state is visible at
/ukfs/kernel/policy_modules/{name}/watchdog_state (values: inactive,
monitoring, reverted).
Stateful Evolvable components: For stateful EvolvableComponent replacements
(scheduler, VFS, TCP stack), behavioral health monitoring extends beyond the
5-second crash watchdog. After the crash watchdog expires, a configurable soak
period (60-300 seconds) compares FMA health metrics against a pre-evolution baseline,
alerting on sustained degradation without automatic rollback (forward-only semantics).
See Section 13.18.
19.9.9 Relationship to eBPF¶
eBPF compatibility is maintained through umka-sysapi. Existing eBPF programs
(XDP, tc, kprobes, tracepoints) work via the BPF syscall. Policy modules are
a superset — they can do everything eBPF can do plus:
- Full Rust expressiveness (loops, recursion, complex data structures)
- Persistent mutable state (eBPF maps are limited)
- Domain isolation instead of bytecode verifier (more flexible, same safety)
- Crash recovery (eBPF programs can't crash; policy modules can, and are reloaded)
| | eBPF (Linux compat) | Policy Modules (UmkaOS) |
|---|---|---|
| Safety mechanism | Bytecode verifier | Rust type system + domain isolation |
| Language | BPF bytecode (limited) | Rust (full language) |
| State | BPF maps (key-value) | Any Rust data structure |
| Crash behavior | Cannot crash | Crash → reload, default resumes |
| Hot-swap | Per-program | Per-policy-point |
| Integration depth | Hook points only | Full vtable interface |
19.9.10 Linux Compatibility¶
sched_ext (Linux 6.12+) allows user-defined BPF scheduling policies. UmkaOS
supports this through umka-sysapi:
- sched_ext BPF programs load via the standard bpf() syscall
- They run in the BPF compatibility layer
- Performance and behavior identical to Linux sched_ext
Policy modules are an additional, UmkaOS-specific mechanism. Applications unaware of them see standard scheduling behavior.
Module Observability:

Policy modules emit structured tracepoints for every decision:

- `umka_tp_stable_policy_decision`: emitted on each `pick_next_task`, `select_victims`, `dispatch` call. Fields: module name, decision type, chosen entity, alternatives considered, decision latency.
- `umka_tp_stable_policy_audit`: decision audit log for compliance. Records which module made which resource allocation decision, enabling post-hoc analysis.
- A/B comparison mode: two policy modules can run simultaneously — one active (making real decisions) and one shadow (receiving the same inputs, logging what it would have decided). Compare via `policy.comparison_log` in sysfs. This enables safe evaluation of new policies before activation.
19.9.11 Performance Impact¶
Indirect function call via vtable pointer: ~1-2ns (branch predictor handles it).
Linux already uses the same pattern (sched_class->pick_next_task is a function
pointer). Same cost as Linux.
Default (production-vetted modules): modules signed by the kernel vendor and
pre-verified run in the Core isolation domain. Zero domain transition overhead. Same cost
as Linux sched_class function pointer dispatch.
Untrusted/experimental modules: run in their own isolation domain. Add one domain register switch (WRPKRU on x86, POR_EL0+ISB on AArch64, DACR on ARMv7) instruction (~23 cycles per Section 11.2) per policy call. Each call crosses the domain boundary twice (enter + exit), costing 2 × 23 = 46 cycles. For scheduling: called once per context switch (~200 cycles). Adding 46 cycles to 200 = ~23% overhead on the context switch micro-path. This is the cost of sandbox isolation for unvetted code. Acceptable for development and experimentation. The module graduates to Core isolation domain after vetting.
(Note: WRPKRU latency varies by microarchitecture — measured at 11 cycles on Alder Lake, 23 cycles on Skylake, and up to 260 cycles on some Atom cores. The 23-cycle figure used throughout this section reflects Skylake-class server parts; overhead on other microarchitectures scales proportionally. The worst case (Atom, 260 cycles) would increase the domain-transition overhead by ~11x, but Atom-class cores are not a primary UmkaOS server target.)
19.9.12 Policy Module Error Handling and Fallback¶
When a policy module's vtable function returns an error or panics:
Error return handling:
- Policy modules return Result<PolicyAction, PolicyError>.
- PolicyError::TemporaryFailure: the kernel retries the policy call up to 3 times with exponential backoff (1ms, 2ms, 4ms). If all retries fail, the system uses the default action for the hook (defined in the hook's .kabi registration).
- PolicyError::PermanentFailure: the module is immediately marked ModuleState::Degraded. No retries. Default action is used for all subsequent calls to this hook until the module is replaced.
- PolicyError::InvalidState: indicates a bug in the policy module. The module is marked Degraded and a FMA fault event is emitted.
Panic handling: Policy modules run in kernel context. A panic in a Tier 1 policy module triggers the Tier 1 crash recovery mechanism (Section 11.9): the module is reloaded, and its state is reset to the initial registration state. All policy calls during the reload window use the default action.
Default actions (registered at module load time in .kabi declaration):
pub enum DefaultPolicyAction {
/// Permit the operation (fail-open). Used for performance-advisory hooks
/// where denying would break functionality.
Permit,
/// Deny the operation (fail-closed). Used for security enforcement hooks
/// where permitting would be unsafe.
Deny,
/// Use the previous module's decision (chain to next policy module).
/// Falls back to Permit if no other module is registered.
Chain,
}
Monitoring: Each policy module has a policy_error_count, policy_retry_count, and policy_degraded_since field in its FMA health struct, accessible via umkafs at /ukfs/kernel/policy_modules/{module_name}/.
19.10 Special File Descriptor Objects¶
Linux exposes several kernel objects through the file descriptor abstraction: event counters, signal queues, timers, and process references. These are not files in any meaningful sense — they are kernel objects that happen to use the fd slot mechanism for lifecycle management and I/O multiplexing integration. UmkaOS implements all four as first-class fd types with exact Linux wire semantics and improved internal implementations.
All four fd types share a common structural principle: each is a SpecialFile variant
in the VFS layer, backed by an OpenFile struct with a concrete implementation
of FileOps. Poll readiness is reported through the standard FileOps::poll() trait
method, which integrates transparently with poll(2), select(2), and epoll(2).
No separate fd type registry or global lock is required — each fd object is
self-contained.
The Linux compatibility goal for all four types is exact wire compatibility with Linux 6.1 LTS: identical syscall numbers, identical flag values, identical struct layouts, identical errno values, and identical edge-case semantics. Each subsection documents the wire format and any UmkaOS-specific improvements to the internal implementation.
19.10.1 eventfd — Event Notification Counter¶
19.10.1.1.1 Syscall Interface¶
eventfd(initval: u32, flags: u32) -> fd | -EINVAL | -EMFILE | -ENOMEM
eventfd2(initval: u32, flags: u32) -> fd | -EINVAL | -EMFILE | -ENOMEM
eventfd and eventfd2 are identical in UmkaOS — Linux introduced eventfd2 to add
the flags parameter, but UmkaOS exposes both syscall numbers with the same
implementation. The initval argument sets the initial counter value. Because
initval is a u32, it always falls within the counter's valid range of
[0, ULLONG_MAX - 1]; the -EINVAL check for a value of ULLONG_MAX applies to
the full u64 values supplied via write().
19.10.1.1.2 Flags¶
| Flag | Value | Meaning |
|---|---|---|
| `EFD_CLOEXEC` | `O_CLOEXEC` (02000000) | Set close-on-exec on the returned fd |
| `EFD_NONBLOCK` | `O_NONBLOCK` (04000) | Set `O_NONBLOCK` on the file description |
| `EFD_SEMAPHORE` | 1 | Semaphore semantics for `read()` |
Any flags value with bits other than these three set returns -EINVAL.
19.10.1.1.3 Read and Write Semantics¶
write(fd, &val: u64, 8):

- `val` must be in the range `[1, ULLONG_MAX - 1]`. A value of `0` or `ULLONG_MAX` returns `-EINVAL`.
- If `counter + val > ULLONG_MAX - 1`:
    - With `EFD_NONBLOCK`: returns `-EAGAIN`.
    - Without `EFD_NONBLOCK`: blocks until a `read()` reduces the counter enough.
- Otherwise: atomically adds `val` to the counter and wakes any readers.
- Returns 8 on success (number of bytes consumed).

read(fd, &buf: u64, 8):

- The buffer must be at least 8 bytes. Shorter buffers return `-EINVAL`.
- Without `EFD_SEMAPHORE`:
    - If `counter == 0` and `EFD_NONBLOCK`: returns `-EAGAIN`.
    - If `counter == 0` and blocking: blocks until a `write()` increments the counter.
    - Otherwise: atomically reads the current counter value into `buf` and resets the counter to 0. Wakes any blocked writers.
- With `EFD_SEMAPHORE`:
    - If `counter == 0` and `EFD_NONBLOCK`: returns `-EAGAIN`.
    - If `counter == 0` and blocking: blocks until counter > 0.
    - Otherwise: atomically decrements the counter by 1 and returns the value 1 in `buf`. Wakes any blocked writers if the counter was at `ULLONG_MAX - 1` before the decrement.
- Returns 8 on success.
19.10.1.1.4 Poll Readiness¶
| Condition | Event reported |
|---|---|
| `counter > 0` | `EPOLLIN \| EPOLLRDNORM` |
| `counter < ULLONG_MAX - 1` | `EPOLLOUT \| EPOLLWRNORM` |
19.10.1.1.5 FileOps::poll() Implementation¶
EventFd::poll(inode, private, events, pt) -> Result<PollEvents>:
efd = private as &EventFd
// Register on both wait queues — readers and writers may poll.
poll_wait(&efd.waiters_read, pt)
poll_wait(&efd.waiters_write, pt)
mask = PollEvents::empty()
val = efd.counter.load(Acquire)
if val > 0:
mask |= EPOLLIN | EPOLLRDNORM
if val < ULLONG_MAX - 1:
mask |= EPOLLOUT | EPOLLWRNORM
Ok(mask)
Both wait queues are registered because the eventfd can be polled for both read
readiness (counter > 0) and write readiness (counter < ULLONG_MAX - 1). The
poll_wait calls are no-ops when pt is None (re-poll after wakeup). The counter
load uses Acquire ordering to pair with the Release in write() and read(),
ensuring the poll result reflects the most recent counter update.
19.10.1.1.6 Internal Structure¶
/// A kernel event notification counter, exposed as a file descriptor.
///
/// The counter is an atomic `u64` in the range `[0, ULLONG_MAX - 1]`.
/// `EFD_SEMAPHORE` changes `read()` to decrement by 1 rather than reset to 0.
pub struct EventFd {
/// Current counter value. Ranges from 0 to ULLONG_MAX-1 (2^64 - 2).
/// All updates use atomic compare-and-swap to guarantee linearizability.
counter: AtomicU64,
/// Flags set at creation time. `EFD_SEMAPHORE` controls read semantics.
/// `EFD_NONBLOCK` is stored in the `OpenFile` flags, not here.
flags: EventFdFlags,
/// Tasks blocked in `read()` waiting for the counter to become non-zero.
waiters_read: WaitQueue,
/// Tasks blocked in `write()` waiting for the counter to drop below ULLONG_MAX-1.
waiters_write: WaitQueue,
}
19.10.1.1.7 Read Algorithm (non-blocking fast path)¶
read_nonblocking(efd: &EventFd, semaphore: bool) -> Result<u64, Errno>:
loop:
current = efd.counter.load(Acquire)
if current == 0:
return Err(EAGAIN)
new_val = if semaphore: current - 1 else: 0
if efd.counter.compare_exchange(current, new_val, AcqRel, Acquire).is_ok():
if new_val < ULLONG_MAX - 1:
efd.waiters_write.wake_one() // unblock one blocked writer (at most one can succeed per read)
return Ok(if semaphore: 1 else: current)
// CAS failed: another thread raced; retry
The blocking path wraps this loop in a WaitQueue::wait_event() call that suspends
the task until a writer increments the counter, then retries the CAS. No spinlock or
mutex is held during the blocked sleep.
19.10.1.1.8 UmkaOS Improvements over Linux¶
Linux implements eventfd with a spinlock (efd->lock) protecting the counter and
wakeup logic. UmkaOS instead uses AtomicU64 with compare_exchange in a retry
loop; on x86-64 this compiles to a single LOCK CMPXCHG instruction, so no lock
is required. The wait queues are only accessed (never held as locks) when a
task actually blocks. This eliminates the spinlock acquisition on every
read/write, reducing overhead in the common non-blocking case from ~30-50
cycles (spinlock + counter update) to ~10-15 cycles (a single CAS).
eventfd2() and eventfd() are unified behind a single internal constructor —
UmkaOS dispatches both syscall numbers to the same function. Linux keeps two separate
entry points for historical reasons; UmkaOS does not need to.
19.10.1.1.9 Linux Compatibility¶
- Syscall numbers: `eventfd` = 284, `eventfd2` = 290 (x86-64).
- Flag values: `EFD_CLOEXEC` = `O_CLOEXEC` = 02000000 octal; `EFD_NONBLOCK` = `O_NONBLOCK` = 04000 octal; `EFD_SEMAPHORE` = 1.
- `ULLONG_MAX - 1` = 0xFFFFFFFFFFFFFFFE as the maximum counter value before write blocks — identical to Linux.
- Read always returns exactly 8 bytes; write always consumes exactly 8 bytes — any other size returns `-EINVAL`.
- `/proc/[pid]/fdinfo/[fd]` reports `eventfd-count: <hex_value>` to match Linux.
19.10.2 signalfd — Signal Delivery via File Descriptor¶
19.10.2.1.1 Syscall Interface¶
signalfd(fd: i32, mask: *const sigset_t, sizemask: usize) -> fd | -EINVAL | -EMFILE | -ENOMEM
signalfd4(fd: i32, mask: *const sigset_t, sizemask: usize, flags: u32) -> fd | -EINVAL | -EMFILE | -ENOMEM
signalfd is the older form (no flags); signalfd4 adds SFD_NONBLOCK and
SFD_CLOEXEC. UmkaOS implements both syscall numbers with a unified path that treats
signalfd as signalfd4 with flags = 0.
The fd argument controls create-or-update behavior:
- `fd = -1`: create a new signalfd. Returns a new file descriptor.
- `fd = <existing signalfd>`: update the signal mask on that fd. Returns `fd` unchanged. If `fd` is not a signalfd, returns `-EINVAL`.
sizemask must equal sizeof(sigset_t) = 8 bytes on x86-64. Any other value returns
-EINVAL.
mask specifies which signals to accept through this fd. The mask must be a valid user
pointer; SIGKILL (9) and SIGSTOP (19) in the mask are silently ignored — they
cannot be blocked or redirected.
19.10.2.1.2 Flags¶
| Flag | Value | Meaning |
|---|---|---|
| `SFD_NONBLOCK` | `O_NONBLOCK` (04000) | Set `O_NONBLOCK` on the file description |
| `SFD_CLOEXEC` | `O_CLOEXEC` (02000000) | Set close-on-exec on the returned fd |
Any other bits in flags return -EINVAL.
19.10.2.1.3 Usage Pattern¶
Before signals can be read via signalfd, the caller must block them using
sigprocmask(). Signals that are not blocked will be delivered to signal handlers
(or default action) as normal — signalfd only intercepts signals from the process's
pending signal set.
sigset_t mask;
sigemptyset(&mask);
sigaddset(&mask, SIGTERM);
sigaddset(&mask, SIGUSR1);
sigprocmask(SIG_BLOCK, &mask, NULL); // block these signals
int sfd = signalfd(-1, &mask, SFD_CLOEXEC); // redirect to fd
19.10.2.1.4 Read Semantics¶
- `len` must be at least `sizeof(signalfd_siginfo)` = 128 bytes. Smaller buffers return `-EINVAL`.
- `read()` dequeues one or more pending signals from the calling task's pending signal set that match the signalfd's mask, filling consecutive `signalfd_siginfo` structs.
- The number of structs filled is `min(pending_in_mask, len / 128)`.
- If no matching signal is pending and `O_NONBLOCK`: returns `-EAGAIN`.
- If no matching signal is pending and blocking: blocks until a matching signal arrives.
- Returns the number of bytes written (always a multiple of 128).
Signals consumed via signalfd are removed from the task's pending signal set. They are NOT delivered to signal handlers. The pending set modification is atomic with respect to concurrent signal delivery.
19.10.2.1.5 Wire Format: signalfd_siginfo (128 bytes, exact Linux layout)¶
Offset Size Field Description
------ ---- ----- -----------
0 4 ssi_signo Signal number
4 4 ssi_errno Error number (usually 0)
8 4 ssi_code si_code from siginfo_t
12 4 ssi_pid Sending process PID (SI_USER/SI_QUEUE)
16 4 ssi_uid Sending process real UID
20 4 ssi_fd File descriptor (SIGPOLL/SIGIO)
24 4 ssi_tid Kernel timer ID (SIGALRM/SIGVTALRM/SIGPROF)
28 4 ssi_band Band event (SIGPOLL/SIGIO)
32 4 ssi_overrun Timer overrun count (SIGALRM)
36 4 ssi_trapno Trap number (hardware fault signals)
40 4 ssi_status Exit status or signal (SIGCHLD)
44 4 ssi_int Integer value (SI_QUEUE/SI_MESGQ)
48 8 ssi_ptr Pointer value (SI_QUEUE/SI_MESGQ)
56 8 ssi_utime User CPU time consumed (SIGCHLD)
64 8 ssi_stime System CPU time consumed (SIGCHLD)
72 8 ssi_addr Address triggering fault (hardware faults)
80 2 ssi_addr_lsb LSB of fault address (BUS_MCEERR_*)
82 2 __pad2 Alignment padding
84 4 ssi_syscall Syscall number (SIGSYS/seccomp)
88 8 ssi_call_addr Instruction address that triggered SIGSYS
96 4 ssi_arch AUDIT_ARCH_* value for seccomp
100 28 __pad Reserved padding to 128 bytes total
The Rust representation uses #[repr(C)] with explicit padding to guarantee
byte-for-byte compatibility. The total size is asserted at compile time:
const_assert!(size_of::<SignalFdSiginfo>() == 128).
19.10.2.1.6 Internal Structure¶
/// A signal queue redirector exposed as a file descriptor.
///
/// Signals matching `mask` that arrive in the owning task's pending set
/// are readable via `read()` rather than delivered to a signal handler.
/// The mask can be updated atomically via `signalfd(existing_fd, new_mask, ...)`.
pub struct SignalFd {
/// Signal mask using `SignalSet` encoding: bit `N-1` represents signal `N`.
/// Bit 0 = signal 1 (SIGHUP), bit 8 = signal 9 (SIGKILL), bit 18 = signal
/// 19 (SIGSTOP), bit 63 = signal 64 (SIGRTMAX). This matches `sigset_t`
/// encoding for Linux ABI compatibility.
/// Stored as AtomicU64 for lock-free mask updates via signalfd() on existing fd.
/// SIGKILL (bit 8) and SIGSTOP (bit 18) are always masked out on write.
mask: AtomicU64,
/// Weak reference to the owning task. A `Weak<Task>` is used rather than
/// `Arc<Task>` to avoid creating a reference cycle: the task owns the fd table
/// which owns this struct. Upgrade fails if the task has been reaped.
task: Weak<Task>,
/// Tasks blocked in `read()` waiting for a matching signal to arrive.
/// Also used by `FileOps::poll()` — `poll_wait()` registers on this queue,
/// and `signalfd_notify()` ([Section 8.5](08-process.md#signal-handling--sending-a-signal)) wakes it
/// when a matching signal is enqueued.
waiters: WaitQueue,
/// Intrusive list linkage for the per-task `signalfd_list`. This list is
/// walked by `signalfd_notify()` in the signal delivery path to wake
/// signalfd poll waiters. The list is protected by the task's signal lock
/// for insertion/removal; reads use RCU.
task_link: IntrusiveListNode,
}
19.10.2.1.7 Mask Update Algorithm¶
When signalfd(existing_fd, new_mask, ...) is called on an existing signalfd,
the mask update is:
update_mask(sfd: &SignalFd, new_mask: u64):
// Strip SIGKILL and SIGSTOP — cannot be intercepted
sanitized = new_mask & !(SIGKILL_BIT | SIGSTOP_BIT)
sfd.mask.store(sanitized, Release)
// No lock needed: concurrent read() loads mask with Acquire ordering
// Any pending signals matching the new mask will be readable immediately
sfd.waiters.wake_all() // wake blocked readers — new mask may now have pending signals
The AtomicU64::store(Release) pairs with the AtomicU64::load(Acquire) in read(),
guaranteeing that a read() that observes the new mask also observes any pending
signals that were delivered before the mask was changed.
19.10.2.1.8 Signal Dequeue Algorithm¶
dequeue_signals(sfd: &SignalFd, buf: &mut [SignalFdSiginfo]) -> usize:
task = sfd.task.upgrade().ok_or(EBADF)?
mask = sfd.mask.load(Acquire)
count = 0
while count < buf.len():
sig = task.signal_queue.dequeue_matching(mask)
match sig:
None => break
Some(siginfo) =>
buf[count] = siginfo_to_sfd_siginfo(siginfo)
count += 1
return count
signal_queue.dequeue_matching() atomically removes one signal whose number is set
in mask from the task's pending signal set. The task's signal queue lock is held
only for the duration of the dequeue operation, not for the entire read() call. This
matches Linux's behavior and avoids blocking signal delivery while a read() is in
progress on a different CPU.
19.10.2.1.9 Poll Readiness¶
| Condition | Event reported |
|---|---|
| Any signal in `mask` is pending in the task's pending set | `EPOLLIN \| EPOLLRDNORM` |
EPOLLOUT is never reported — signalfd is not writable.
19.10.2.1.10 FileOps::poll() Implementation¶
SignalFd::poll(inode, private, events, pt) -> Result<PollEvents>:
sfd = private as &SignalFd
poll_wait(&sfd.waiters, pt)
mask = PollEvents::empty()
task = sfd.task.upgrade().ok_or(EBADF)?
sig_mask = sfd.mask.load(Acquire)
// Check both thread-private and process-wide pending signal sets.
if task.pending_task.has_any_matching(sig_mask)
|| task.process.pending_process.has_any_matching(sig_mask):
mask |= EPOLLIN | EPOLLRDNORM
Ok(mask)
The signalfd_notify() function in the signal delivery path
(Section 8.5) calls sfd.waiters.wake_up() whenever a
signal matching the signalfd's mask is enqueued, triggering ep_poll_callback for
any epoll items monitoring this signalfd.
19.10.2.1.11 UmkaOS Improvements over Linux¶
Linux stores the signalfd mask in a spinlock_t-protected struct. Updating the mask
requires acquiring the lock and then potentially waking blocked readers. UmkaOS replaces
this with AtomicU64::store(Release) for the update and AtomicU64::load(Acquire)
for the reader, providing the same ordering guarantee without a lock. This eliminates
approximately 30-50 cycles of spinlock overhead on the mask-update path.
Linux's signalfd implementation must take the task's sighand->siglock during
read() to safely inspect and modify the pending signal set. UmkaOS uses the same lock
(the task's signal queue lock) but holds it for a shorter window — only the atomic
dequeue of a single signal — releasing it between each signal dequeued when filling
a multi-signal buffer.
19.10.2.1.12 Linux Compatibility¶
- Syscall numbers: `signalfd` = 282, `signalfd4` = 289 (x86-64).
- `SFD_NONBLOCK` = `O_NONBLOCK` = 04000 octal; `SFD_CLOEXEC` = `O_CLOEXEC` = 02000000 octal.
- `signalfd_siginfo` layout is byte-for-byte identical to Linux; size is exactly 128 bytes including 28 bytes of trailing padding.
- `SIGKILL` and `SIGSTOP` in the mask are silently stripped — identical to Linux.
- `signalfd(existing_fd, ...)` returns the same fd number — identical to Linux.
- Reading multiple signals in one `read()` call is supported — identical to Linux.
- `/proc/[pid]/fdinfo/[fd]` reports `sigmask: <hex_value>` to match Linux.
19.10.3 timerfd — Timer Notification via File Descriptor¶
19.10.3.1.1 Syscall Interface¶
timerfd_create(clockid: i32, flags: u32) -> fd | -EINVAL | -EMFILE | -ENOMEM
timerfd_settime(fd: i32, flags: u32, new_value: *const itimerspec, old_value: *mut itimerspec) -> 0 | -EINVAL | -EFAULT
timerfd_gettime(fd: i32, curr_value: *mut itimerspec) -> 0 | -EINVAL | -EFAULT
19.10.3.1.2 Clock IDs¶
| Clock ID | Value | Description |
|---|---|---|
| `CLOCK_REALTIME` | 0 | Wall clock time; advances with NTP and adjtime |
| `CLOCK_MONOTONIC` | 1 | Monotonically increasing; unaffected by wall clock changes |
| `CLOCK_BOOTTIME` | 7 | Like `CLOCK_MONOTONIC` but includes time suspended in sleep |
| `CLOCK_REALTIME_ALARM` | 8 | Like `CLOCK_REALTIME`; wakes system from suspend |
| `CLOCK_BOOTTIME_ALARM` | 9 | Like `CLOCK_BOOTTIME`; wakes system from suspend |
Other clock IDs return -EINVAL. The _ALARM clocks require CAP_WAKE_ALARM.
19.10.3.1.3 Creation Flags¶
| Flag | Value | Meaning |
|---|---|---|
| `TFD_NONBLOCK` | `O_NONBLOCK` (04000) | Set `O_NONBLOCK` on the file description |
| `TFD_CLOEXEC` | `O_CLOEXEC` (02000000) | Set close-on-exec on the returned fd |
19.10.3.1.4 timerfd_settime Flags¶
| Flag | Value | Meaning |
|---|---|---|
| `TFD_TIMER_ABSTIME` | 1 | `it_value` specifies an absolute time (not relative) |
| `TFD_TIMER_CANCEL_ON_SET` | 2 | Cancel blocked `read()` if wall clock is stepped (`CLOCK_REALTIME` only) |
TFD_TIMER_CANCEL_ON_SET combined with CLOCK_MONOTONIC or CLOCK_BOOTTIME
returns -EINVAL.
19.10.3.1.5 itimerspec Wire Format¶
struct itimerspec { // total 32 bytes
timespec it_interval; // 16 bytes: repeat interval (0 = one-shot)
timespec it_value; // 16 bytes: time until next expiration (0 = disarm)
};
struct timespec { // 16 bytes (time64 variant, used by timerfd_settime64)
i64 tv_sec; // seconds
i64 tv_nsec; // nanoseconds [0, 999999999]
};
// On ILP32 architectures, the old timerfd_settime (NR 286 on ARMv7) uses
// 8-byte timespec with KernelLong fields. The time64 variant shown here
// is used by timerfd_settime64 (NR 411).
Setting new_value.it_value to all zeros disarms the timer (any in-flight expiration
that has not yet been read remains readable). Setting new_value.it_interval to all
zeros creates a one-shot timer.
19.10.3.1.6 Read Semantics¶
- Buffer must be at least 8 bytes; smaller buffers return `-EINVAL`.
- Reads the number of timer expirations since the last `read()` (or since the timer was armed, if never read).
- If `expirations == 0` and `O_NONBLOCK`: returns `-EAGAIN`.
- If `expirations == 0` and blocking: blocks until the timer fires.
- If the timer has the `TFD_TIMER_CANCEL_ON_SET` flag and the real-time clock is stepped while a `read()` is blocking, the `read()` returns `-ECANCELED`.
- Returns 8 on success. The expiration counter is reset to 0 atomically on read.
19.10.3.1.7 timerfd_gettime Semantics¶
Returns the remaining time until the next expiration in curr_value.it_value (always
relative, even if the timer was set with TFD_TIMER_ABSTIME), and the interval in
curr_value.it_interval. If the timer is disarmed, both fields are zero.
19.10.3.1.8 Internal Structure¶
/// A kernel timer exposed as a file descriptor.
///
/// The `expirations` counter accumulates missed firings atomically.
/// `timerfd_settime` holds `lock` to update the timer state atomically.
/// The timer callback and `read()` are lock-free in the common case.
pub struct TimerFd {
/// Which clock drives this timer.
clock: ClockId,
/// Handle into the kernel timer subsystem. The timer callback increments
/// `expirations` and wakes `waiters`. Re-armed automatically if `interval > 0`.
timer: KernelTimer,
/// Accumulated expiration count. Incremented by the timer callback (possibly
/// on a different CPU). Reset to 0 by `read()` using compare_exchange.
expirations: AtomicU64,
/// Tasks blocked in `read()` waiting for the timer to fire.
waiters: WaitQueue,
/// Protects the timer configuration during `timerfd_settime`. Never held
/// during `read()`; that path uses `expirations` atomically. Must be
/// `SpinLock` (not `Mutex`) because the timer callback runs in softirq
/// context, where sleeping is unsound; the callback only ever uses
/// `try_lock()` from that context (see the callback algorithm below).
lock: SpinLock<TimerFdState>,
}
/// Mutable timer configuration. Protected by `TimerFd::lock` (SpinLock).
pub struct TimerFdState {
/// True if the timer is currently armed.
armed: bool,
/// Time until next expiration (stored as absolute clock time internally).
next_expiry: Instant,
/// Repeat interval. Zero means one-shot.
interval: Duration,
/// True if the timer was set with `TFD_TIMER_ABSTIME`.
abstime: bool,
/// True if blocking `read()` should return ECANCELED on wall-clock steps.
/// Only valid when `clock` is `CLOCK_REALTIME`.
cancel_on_set: bool,
/// True if coalescing is disabled for this timer (UmkaOS extension; see below).
precise: bool,
}
19.10.3.1.9 Timer Callback Algorithm¶
The timer subsystem calls timerfd_callback when the timer fires. This runs in
interrupt context (softirq). The lock used is a SpinLock (IRQ-safe), not a
Mutex — try_lock in interrupt context is sound for SpinLock but not for Mutex
(which has priority-inheritance semantics incompatible with IRQ context):
fn timerfd_callback(tfd: &TimerFd) {
// Increment the expiration counter. Saturates at u64::MAX to avoid wrap.
// Uses CAS loop: read current value, if < u64::MAX then CAS(current, current+1).
// This avoids the TOCTOU race in fetch_add-then-check (fetch_add wraps
// the stored value before the check can saturate it).
loop {
let current = tfd.expirations.load(Acquire);
if current == u64::MAX {
break; // already saturated
}
if tfd.expirations.compare_exchange_weak(
current, current + 1, Release, Relaxed
).is_ok() {
break;
}
}
tfd.waiters.wake_all(); // wake any blocked read()
// Try to rearm for the next interval. SpinLock::try_lock() returns
// Option<SpinLockGuard<TimerFdState>>. Access protected data through
// the guard (RAII). If the lock is contended (settime in progress),
// settime will rearm after its update completes.
if let Some(mut guard) = tfd.lock.try_lock() {
if guard.interval > Duration::ZERO {
guard.next_expiry += guard.interval;
tfd.timer.rearm(guard.next_expiry);
}
// guard dropped here, releasing the lock
}
}
19.10.3.1.10 timerfd_settime Algorithm¶
timerfd_settime(tfd: &TimerFd, flags, new_value, old_value) -> Result<(), Errno>:
state = tfd.lock.lock()
if old_value is not null:
*old_value = state_to_itimerspec(state, tfd.clock)
if new_value.it_value == zero:
state.armed = false
tfd.timer.cancel()
else:
state.armed = true
state.interval = new_value.it_interval
state.abstime = flags & TFD_TIMER_ABSTIME != 0
state.cancel_on_set = flags & TFD_TIMER_CANCEL_ON_SET != 0
if state.abstime:
state.next_expiry = new_value.it_value as absolute instant
else:
state.next_expiry = now(tfd.clock) + new_value.it_value
tfd.timer.arm(state.next_expiry)
// Reset any unread expirations from the previous timer period
tfd.expirations.store(0, Release)
tfd.lock.unlock()
The expiration reset to 0 in timerfd_settime matches Linux behavior: rearming
the timer discards any unread expirations from the previous arm.
19.10.3.1.11 Wall-Clock Step Handling (TFD_TIMER_CANCEL_ON_SET)¶
The timekeeping subsystem broadcasts a ClockSet notification whenever
settimeofday(2) or clock_settime(CLOCK_REALTIME, ...) makes a non-monotonic
change to the wall clock. All CLOCK_REALTIME timerfds with cancel_on_set = true
receive this notification through a registered callback:
timerfd_clock_set_callback(tfd: &TimerFd):
// Wake all blocked readers with a ECANCELED indication
tfd.waiters.wake_all_with_err(ECANCELED)
Blocked read() calls detect the cancellation via the wait-queue return code and
propagate -ECANCELED to userspace without consuming the expiration counter.
19.10.3.1.12 Interval Timer Coalescing (UmkaOS Extension)¶
Timers with very short intervals (interval < 1ms) and low-resolution system HZ
settings (e.g., HZ = 250, giving 4ms tick resolution) would fire far more
frequently than the system can usefully service. UmkaOS coalesces such timers to fire
at tick boundaries, batching wakeups and reducing interrupt load:
- Coalescing is enabled by default for interval < 1ms.
- Disabled per-timer via the UmkaOS-specific TFD_TIMER_PRECISE flag (value: 4,
  chosen not to conflict with existing Linux flags). TFD_TIMER_PRECISE is an
  UmkaOS extension; kernels that do not support it treat it as an unknown flag
  and return -EINVAL. Applications that need Linux portability should not set
  this flag.
- Coalescing does not affect the expiration counter: missed firings within a
  coalescing window are accumulated and delivered as a single count on the
  next wakeup.
19.10.3.1.13 Poll Readiness¶
| Condition | Event reported |
|---|---|
| expirations > 0 | EPOLLIN \| EPOLLRDNORM |

EPOLLOUT is never reported; a timerfd is not writable.
19.10.3.1.14 FileOps::poll() Implementation¶
TimerFd::poll(inode, private, events, pt) -> Result<PollEvents>:
tfd = private as &TimerFd
poll_wait(&tfd.waiters, pt)
mask = PollEvents::empty()
if tfd.expirations.load(Acquire) > 0:
mask |= EPOLLIN | EPOLLRDNORM
Ok(mask)
The timer callback (timerfd_callback) increments expirations with the
saturating compare-exchange loop shown above (Release ordering) and calls
tfd.waiters.wake_all(), which fires ep_poll_callback for any epoll items
monitoring this timerfd.
19.10.3.1.15 UmkaOS Improvements over Linux¶
Linux implements timerfd with a spinlock protecting both the expiration counter and
the timer state. The timer callback (timerfd_tmrproc) acquires the spinlock to
increment the expiration counter and re-arm the interval timer.
UmkaOS separates these concerns:
- The expiration counter is an AtomicU64. The timer callback increments it
  with a saturating compare-exchange loop (Release ordering) without holding
  any lock; read() resets it with compare_exchange(current, 0, AcqRel,
  Acquire), also lock-free. This eliminates spinlock acquisition from the
  timer hot path.
- The timer state (arm/disarm, interval, abstime) is protected by a SpinLock
  held only during timerfd_settime. The timer callback uses try_lock() for
  re-arming and skips re-arming if settime is in progress (settime will re-arm
  after updating state). SpinLock (not Mutex) because the timer callback runs
  in softirq context where sleeping is unsound.
- The common case (timer fires, counter increments, waiter wakes, counter is
  read and reset) is entirely lock-free.
19.10.3.1.16 Linux Compatibility¶
- Syscall numbers: timerfd_create = 283, timerfd_settime = 286,
  timerfd_gettime = 287 (x86-64).
- TFD_NONBLOCK = 04000, TFD_CLOEXEC = 02000000, TFD_TIMER_ABSTIME = 1,
  TFD_TIMER_CANCEL_ON_SET = 2.
- itimerspec layout is identical to Linux (two timespec structs, 32 bytes
  total).
- timerfd_settime with it_value = 0 disarms the timer and resets the
  expiration counter to 0 — identical to Linux.
- ECANCELED is returned from a blocking read() when a TFD_TIMER_CANCEL_ON_SET
  timer is cancelled by a clock step — identical to Linux.
- /proc/[pid]/fdinfo/[fd] reports clockid, ticks, settime flags, it_value,
  and it_interval to match Linux's timerfd_show() format.
19.10.4 pidfd — Process File Descriptor¶
19.10.4.1.1 Syscall Interface¶
pidfd_open(pid: pid_t, flags: u32) -> fd | -EINVAL | -EMFILE | -ESRCH | -EPERM
pidfd_send_signal(pidfd: i32, sig: i32, siginfo: *const siginfo_t, flags: u32) -> 0 | -EPERM | -ESRCH | -EINVAL
pidfd_getfd(pidfd: i32, targetfd: i32, flags: u32) -> fd | -EPERM | -ESRCH | -EINVAL | -EMFILE
19.10.4.1.2 pidfd_open¶
pid must refer to a live process (not a thread) in the caller's PID namespace.
A process is "live" if it has not yet been reaped — zombie processes that have exited
but not been waited on are accessible. Passing a pid that does not exist or has
been reaped and recycled returns -ESRCH.
flags must be 0 for a process pidfd. The flag PIDFD_THREAD (value: O_EXCL =
0200 octal) creates a thread pidfd pointing to a specific thread (not the
thread group leader). PIDFD_NONBLOCK (value: O_NONBLOCK = 04000 octal) creates
a non-blocking pidfd whose waitid(P_PIDFD, ...) returns -EAGAIN if the process
has not yet exited.
Any other flag bits return -EINVAL.
PIDFD_THREAD support: Linux added thread pidfd support in kernel 6.9 via
PIDFD_THREAD. UmkaOS supports PIDFD_THREAD from its initial release — there is no
version gate. A thread pidfd can receive signals via pidfd_send_signal targeted at
a specific thread, and poll() reports EPOLLIN when that specific thread exits.
19.10.4.1.3 pidfd_send_signal¶
Sends signal sig to the process referenced by pidfd. Semantics are identical to
kill(2) but use the stable pidfd reference instead of a PID:
- sig = 0: permission check only (does not send a signal); returns 0 if the
  process is accessible, -ESRCH if it has exited, -EPERM if no permission.
- siginfo != NULL: for real-time signals (SIGRTMIN to SIGRTMAX), the provided
  siginfo_t is used as the signal info. si_code must be SI_QUEUE (or another
  userspace-generatable code). siginfo must be NULL for standard signals.
- flags must be 0.
- Permission model: same as kill(2) — caller must have the same UID, be
  privileged (CAP_KILL), or be the parent of the target process.
19.10.4.1.4 pidfd_getfd¶
Duplicates file descriptor targetfd from the process referenced by pidfd into
the calling process's fd table. The duplicated fd refers to the same open file
description as in the target process.
- Requires PTRACE_MODE_ATTACH_REALCREDS access to the target process. This is
  checked via the LSM ptrace hooks — the same permission check that
  PTRACE_ATTACH uses. Without this permission, returns -EPERM.
- flags must be 0.
- The returned fd has FD_CLOEXEC set.
- If targetfd is not open in the target process, returns -EBADF.
- If the calling process's fd table is full, returns -EMFILE.
19.10.4.1.5 waitid with P_PIDFD¶
P_PIDFD (value: 3) is used as the idtype argument. The id argument is the
pidfd file descriptor number. All standard waitid options apply (WEXITED,
WSTOPPED, WCONTINUED, WNOHANG, WNOWAIT).
When the pidfd was opened with PIDFD_NONBLOCK and the process has not yet exited,
waitid with WNOHANG returns 0 with infop->si_pid = 0 (consistent with standard
waitid WNOHANG behavior).
19.10.4.1.6 Poll Readiness¶
| Condition | Event reported |
|---|---|
| Referenced process has exited (any state: zombie or reaped) | EPOLLIN \| EPOLLHUP |
| Referenced process is running | (nothing — not readable) |
19.10.4.1.7 FileOps::poll() Implementation¶
PidFd::poll(inode, private, events, pt) -> Result<PollEvents>:
pfd = private as &PidFd
poll_wait(&pfd.process.exit_waiters, pt)
mask = PollEvents::empty()
if pfd.thread_mode:
// Thread pidfd: check if the specific thread has exited.
if pfd.process.thread_has_exited():
mask |= EPOLLIN | EPOLLHUP
else:
// Process pidfd: check if the thread group leader has exited.
if pfd.process.has_exited():
mask |= EPOLLIN | EPOLLHUP
Ok(mask)
The Process::exit_waiters WaitQueue is woken from the do_exit() path when a
process (or thread, for thread pidfds) transitions to zombie state. This triggers
ep_poll_callback for any epoll items monitoring the pidfd.
poll() on a pidfd is particularly useful for async exit monitoring without
SIGCHLD:
// Monitor multiple child processes without SIGCHLD handler
int efd = epoll_create1(0);
epoll_ctl(efd, EPOLL_CTL_ADD, pidfd1, &ev1);
epoll_ctl(efd, EPOLL_CTL_ADD, pidfd2, &ev2);
epoll_wait(efd, events, 2, -1); // wake when either exits
19.10.4.1.8 clone3 Integration — Atomic pidfd on Fork¶
clone3(2) with CLONE_PIDFD flag sets pidfd in the clone_args struct to
receive a pidfd for the new child atomically:
struct clone_args args = {
.flags = CLONE_PIDFD,
.pidfd = (uint64_t)&child_pidfd, // out: fd for the child
.exit_signal = SIGCHLD,
};
pid_t child = syscall(SYS_clone3, &args, sizeof(args));
In UmkaOS's fork path:
clone3_with_pidfd(args):
new_task = allocate_task()
pfd_obj = PidFd::new(Arc::clone(&new_task.process), current_pid_ns())
child_fd = install_fd_in_current_table(pfd_obj)
// Write child_fd to args.pidfd before releasing the new task
*args.pidfd = child_fd as u64
release_and_schedule(new_task)
return new_task.pid
The pidfd is installed in the parent's fd table and the args.pidfd pointer is
written before the child is made visible to the scheduler. There is no window between
fork and pidfd creation during which the child's PID could be recycled.
19.10.4.1.9 Internal Structure¶
/// A stable reference to a process, exposed as a file descriptor.
///
/// Holds an `Arc<Process>` which keeps the process's zombie state alive until
/// all pidfds referencing it are closed and `waitid` has been called.
/// No lock is needed to validate the reference — `Arc` guarantees liveness.
pub struct PidFd {
/// Strong reference to the process. This keeps the zombie `Process` struct
/// alive even after the process exits and is waited on, so that subsequent
/// `pidfd_send_signal` calls return `-ESRCH` rather than accessing freed memory
/// or racing with PID recycling.
process: Arc<Process>,
/// PID namespace in which this pidfd was created. Used to resolve PIDs for
/// `pidfd_send_signal` permission checks, which compare against the caller's
/// namespace view of the target process.
ns: Arc<PidNamespace>,
/// True if this is a thread pidfd (PIDFD_THREAD). When true, `poll()` reports
/// readiness when the specific thread exits, not when the thread group exits.
thread_mode: bool,
}
19.10.4.1.10 Liveness Model¶
PidFd holds an Arc<Process>. The Process struct is kept in memory as long as
any of the following hold a reference:
- The process is in the parent's child list (before waitid reaps it).
- A PidFd fd is open anywhere in the system.
- The kernel has an internal reference (e.g., the process is on a runqueue).
When the process exits, it transitions to zombie state. The full zombie state
(exit code, resource usage) is retained until the parent calls waitid; the
Process struct itself lives until the last Arc reference drops. This means:
- Closing all pidfds referencing a zombie does not prevent the parent from
  calling waitid — the parent's child-list entry remains.
- After the parent calls waitid, any still-open pidfd keeps the Process struct
  alive in a reaped state: the bulk of its memory is released, but the
  Arc-managed shell (exit code, reaped flag) remains. Subsequent
  pidfd_send_signal calls on such a pidfd return -ESRCH.
This is simpler and safer than Linux's approach, which uses pid_lock to prevent
the struct pid from being freed while a pidfd is being accessed. UmkaOS's Arc
provides the same guarantee without any explicit locking.
19.10.4.1.11 pidfd_send_signal Algorithm¶
pidfd_send_signal(pfd: &PidFd, sig, siginfo, flags) -> Result<(), Errno>:
if flags != 0:
return Err(EINVAL)
// Arc::clone gives us a reference; no lock needed to access the process
process = Arc::clone(&pfd.process)
if process.is_fully_reaped():
return Err(ESRCH)
check_signal_permission(current_task(), &process, sig)?
if sig == 0:
return Ok(()) // permission check only
deliver_signal(&process, sig, siginfo)
is_fully_reaped() checks an atomic flag set when the process's resources have been
fully released. This is a single atomic load — no lock.
19.10.4.1.12 pidfd_getfd Algorithm¶
pidfd_getfd(pfd: &PidFd, targetfd, flags) -> Result<Fd, Errno>:
if flags != 0:
return Err(EINVAL)
process = Arc::clone(&pfd.process)
if process.is_fully_reaped():
return Err(ESRCH)
// LSM permission check (ptrace attach-level)
check_ptrace_attach(current_task(), &process)?
// Get the file description from the target's fd table
file = process.fd_table.get(targetfd).ok_or(EBADF)?
// Install a duplicate into the calling task's fd table with FD_CLOEXEC
new_fd = current_task().fd_table.install(file, FD_CLOEXEC)?
return Ok(new_fd)
Cross-PID-namespace semantics: pidfd_getfd() requires that the target
process is visible in the caller's PID namespace. The pidfd itself references the
process via Arc<Process> (namespace-independent), but the visibility check uses
pid_ns_for_children(current_task()): if the target process has no PID in the
caller's PID namespace (it was created in a non-ancestor namespace), ESRCH is
returned. The returned fd inherits the caller's file table context (cloexec flag,
position pointer for regular files), not the target's. This matches Linux behavior.
19.10.4.1.13 UmkaOS Improvements over Linux¶
Liveness via Arc instead of pid_lock: Linux must take pid_lock (a global
spinlock on the PID namespace) every time a pidfd is dereferenced to ensure the
struct pid has not been freed. This spinlock is contended when many pidfd operations
occur concurrently. UmkaOS's Arc<Process> is reference-counted without a
global lock: dereferencing a pidfd is a plain pointer access, and taking a
temporary reference is a single atomic refcount increment, with no shared lock
to contend on.
Thread pidfds from day one: Linux added PIDFD_THREAD in kernel 6.9. UmkaOS
supports thread pidfds in its initial release.
PIDFD_NONBLOCK support: Linux added PIDFD_NONBLOCK in kernel 5.10. UmkaOS
supports it from the initial release. The flag is stored in the OpenFile
flags (same as O_NONBLOCK for other fd types) and is checked by waitid(P_PIDFD, ...).
Atomic clone3 pidfd: UmkaOS allocates and installs the pidfd before releasing the
new task to the scheduler, eliminating any TOCTOU window between fork and pidfd
creation — matching the Linux clone3 + CLONE_PIDFD guarantee.
19.10.4.1.14 Linux Compatibility¶
- Syscall numbers: pidfd_open = 434, pidfd_send_signal = 424,
  pidfd_getfd = 438 (x86-64).
- PIDFD_NONBLOCK = O_NONBLOCK = 04000 octal. PIDFD_THREAD = O_EXCL = 0200
  octal.
- P_PIDFD = 3 (for waitid idtype). CLONE_PIDFD = 0x00001000 (in
  clone_args.flags).
- PTRACE_MODE_ATTACH_REALCREDS permission check for pidfd_getfd — identical to
  Linux. No additional UmkaOS-specific permission layer.
- poll() reporting EPOLLIN | EPOLLHUP on process exit — identical to Linux.
- /proc/[pid]/fdinfo/[fd] reports Pid: <pid> and NSpid: <nspid> for the
  referenced process — matching Linux's pidfd_show() output.
19.10.5 Linux Compatibility Reference¶
Complete syscall number table for all four fd types on x86-64:
| Syscall | x86-64 Number | Return Type | Error Codes |
|---|---|---|---|
| eventfd | 284 | fd | -EINVAL, -EMFILE, -ENOMEM |
| eventfd2 | 290 | fd | -EINVAL, -EMFILE, -ENOMEM |
| signalfd | 282 | fd | -EINVAL, -EMFILE, -ENOMEM |
| signalfd4 | 289 | fd | -EINVAL, -EMFILE, -ENOMEM |
| timerfd_create | 283 | fd | -EINVAL, -EMFILE, -ENOMEM, -EPERM |
| timerfd_settime | 286 | 0 | -EINVAL, -EFAULT, -EBADF |
| timerfd_gettime | 287 | 0 | -EINVAL, -EFAULT, -EBADF |
| pidfd_open | 434 | fd | -EINVAL, -EMFILE, -ESRCH, -EPERM |
| pidfd_send_signal | 424 | 0 | -EINVAL, -EPERM, -ESRCH |
| pidfd_getfd | 438 | fd | -EINVAL, -EPERM, -ESRCH, -EBADF, -EMFILE |
Struct sizes and invariants:
| Type | Size | Invariant |
|---|---|---|
| signalfd_siginfo | 128 bytes | Exact Linux layout; compile-time size_of assertion |
| itimerspec | 32 bytes | Two timespec structs; tv_nsec in [0, 999999999] |
| eventfd counter | u64 | Range [0, ULLONG_MAX - 1]; ULLONG_MAX is never a valid counter value |
| signalfd mask | u64 | Bits 1-64 for signals 1-64; bits 9 (SIGKILL) and 19 (SIGSTOP) always zero |
Common errno values and their meaning across all four types:
| Errno | Meaning |
|---|---|
-EINVAL |
Bad flags, bad clock ID, bad fd for update, wrong buffer size, bad sigset size |
-EMFILE |
Per-process fd limit reached |
-ENOMEM |
Kernel memory exhausted during fd object allocation |
-EAGAIN |
Non-blocking operation would block (read on empty counter/queue/timer) |
-ECANCELED |
Blocking timerfd read cancelled by wall-clock step (TFD_TIMER_CANCEL_ON_SET) |
-ESRCH |
Process referenced by pidfd has exited and been reaped |
-EPERM |
Capability check failed (CAP_WAKE_ALARM, CAP_KILL) or ptrace permission denied |
-EBADF |
targetfd not open in target process (pidfd_getfd), or fd is not a signalfd (on mask update) |
Cross-subsystem interactions:
- eventfd + io_uring: io_uring can post completions to an eventfd via the
  IORING_OP_POLL_ADD opcode targeting an eventfd. UmkaOS implements this
  through the standard FileOps::write() path — io_uring calls eventfd_write()
  the same way userspace does.
- signalfd + threads: Each thread has its own pending signal set. A signalfd
  opened in a thread reads signals from that thread's pending set
  (thread-directed signals) and from the thread group's pending set
  (process-directed signals), matching Linux semantics.
- timerfd + suspend: CLOCK_REALTIME_ALARM and CLOCK_BOOTTIME_ALARM timers are
  registered with the RTC wakeup subsystem. When the system suspends, the RTC
  is programmed to wake the system before the earliest alarm timer fires. The
  timer fires on resume; the expiration count correctly reflects the elapsed
  real time.
- pidfd + namespaces: A pidfd is tied to the PID namespace in which it was
  created. pidfd_send_signal resolves permissions in that namespace. If the
  target process exits its namespace (e.g., by exec across a user namespace
  boundary), the pidfd continues to reference the process via Arc<Process> —
  namespace exit does not invalidate the reference.
19.10.6 UmkaOS Typed Event Notification API¶
The special fd objects (eventfd, signalfd, timerfd) deliver data via untyped
read(fd, buf, n) calls where the caller must know the buffer layout. A
mismatched buffer size returns EINVAL; a correct-size read of the wrong fd type
silently returns garbage bytes. UmkaOS provides a typed companion API:
/// Read from a special event fd with compile-time type checking.
///
/// The kernel inspects the fd's underlying type and fills the appropriate variant.
/// Returns `Err(EINVAL)` if the fd is not a special event fd.
/// Returns `Err(EAGAIN)` if non-blocking and no event is pending.
pub fn event_read(fd: RawFd) -> Result<EventValue, EventError>;
/// The typed value returned by event_read().
///
/// **Wire layout** (`#[repr(C, u32)]` tagged union):
/// - Bytes 0-3: tag (u32): 0=Counter, 1=TimerTicks, 2=Signal, 3=ProcessExited
/// - Bytes 4-7: implicit padding to align the payload
/// - Bytes 8+: variant payload (u64 for Counter/TimerTicks, SignalfdSiginfo
/// for Signal, {pid: u32, exit_code: i32} for ProcessExited)
///
/// The C representation of `#[repr(C, u32)]` enums with data-carrying variants
/// is a tagged union: the discriminant is a leading `u32`, followed by padding
/// to the payload's alignment, followed by the largest variant's payload.
/// Equivalent C layout (see `umka_event_value` in umka-sysapi/include/umka.h):
/// ```c
/// struct umka_event_value {
/// uint32_t tag;
/// uint32_t _pad;
/// union {
/// uint64_t counter;
/// uint64_t timer_ticks;
/// struct signalfd_siginfo signal;
/// struct { uint32_t pid; int32_t exit_code; } process_exited;
/// };
/// };
/// ```
#[repr(C, u32)]
pub enum EventValue {
/// eventfd: current counter value (EFD_SEMAPHORE: always 1). Tag = 0.
Counter(u64) = 0,
/// timerfd: number of expirations since last read. Tag = 1.
TimerTicks(u64) = 1,
/// signalfd: one pending signal. Tag = 2.
Signal(SignalfdSiginfo) = 2,
/// pidfd: exit status of the process (only after EPOLLIN on pidfd). Tag = 3.
ProcessExited { pid: u32, exit_code: i32 } = 3,
}
// Layout: tag(4) + pad(4) + max_variant(SignalfdSiginfo = 128) = 136 bytes.
const_assert!(size_of::<EventValue>() == 136);
/// Write to an eventfd with type checking.
/// Returns `Err(EINVAL)` if fd is not an eventfd.
pub fn event_write(fd: RawFd, value: u64) -> Result<(), EventError>;
Syscall numbers (UmkaOS-specific syscalls use negative numbers to avoid collision with future Linux syscall additions; see Section 19.8):
| Syscall | Number |
|---|---|
| event_read | -0x0E00 |
| event_write | -0x0E01 |
Advantages over raw read(2):
- Type-safe: the compiler enforces that all variants are handled.
- eBPF verifier can statically analyze event types in attached programs.
- No silent garbage on wrong fd type: kernel validates fd type at the syscall boundary.
- Single syscall for all event fd types: no need to track which type each fd is at the call site.
- ProcessExited variant: pidfd exit notification delivers the exit code
  directly (no waitid needed after the read).
Interaction with io_uring:
event_read is exposed as an io_uring operation (IORING_OP_EVENT_READ, opcode 200),
allowing async typed event reads without a dedicated syscall per fd.
UmkaOS extension opcodes start at 200 to avoid collision with upstream Linux opcodes.
Linux programs using standard opcodes (0-64+) work unchanged; the 200+ range provides
generous headroom for Linux to grow (Linux adds ~2-5 new opcodes per release cycle,
and the opcode field is 8 bits wide, so 200 leaves room for ~135 more upstream opcodes).
// Headroom analysis: at the current growth rate of a few opcodes per year,
// Linux would take decades to approach opcode 200. If it ever does, UmkaOS
// will migrate extension opcodes to IORING_OP_URING_CMD (opcode 46) with a
// UmkaOS-specific command code. URING_CMD is Linux's designed extension
// point for per-driver/per-subsystem commands.
struct io_uring_sqe sqe = {
.opcode = IORING_OP_EVENT_READ, // 200
.fd = event_fd,
.addr = (uint64_t)&event_value_out, // struct EventValue destination
};
Linux compatibility: read(2) on eventfd/signalfd/timerfd/pidfd works identically
to Linux. event_read/event_write are UmkaOS extensions. The EventValue wire layout
is stable ABI (repr(C, u32) tagged union with explicit integer discriminants 0-3 as
documented in the struct comment above); field ordering is frozen at first release and
additive changes use new enum variants appended after the existing set.
19.11 Legacy AIO (Asynchronous I/O)¶
Linux legacy AIO (not to be confused with POSIX aio_* from libc) provides
kernel-level asynchronous I/O via io_setup/io_submit/io_getevents. While
superseded by io_uring (Section 19.3), legacy AIO
remains required for database compatibility (PostgreSQL, MySQL/InnoDB, Oracle,
RocksDB) and is specified here for Linux binary compatibility. UmkaOS implements the
complete Linux AIO ABI with exact wire semantics matching Linux 6.1 LTS.
Legacy AIO is strictly limited to O_DIRECT file I/O and IOCB_CMD_POLL. Buffered
I/O submitted through legacy AIO falls back to synchronous execution in the submission
path (same behavior as Linux). Applications requiring asynchronous buffered I/O must
use io_uring.
19.11.1 Syscall Interface¶
Five syscalls constitute the legacy AIO surface. All syscall numbers are x86-64; other architectures use the standard Linux syscall number mapping.
io_setup(nr_events: u32, ctxp: *mut AioCtxId) -> i32 // NR 206
io_destroy(ctx: AioCtxId) -> i32 // NR 207
io_getevents(ctx: AioCtxId, min_nr: KernelLong, nr: KernelLong,
events: *mut IoEvent, timeout: *const Timespec) -> i32 // NR 208
// Note: Linux names arg2 `nr` (not `max_nr`), though semantically it is
// the maximum number of events to return. UmkaOS uses `nr` for ABI parity.
io_submit(ctx: AioCtxId, nr: KernelLong, iocbpp: *const *mut Iocb) -> i32 // NR 209
io_cancel(ctx: AioCtxId, iocb: *mut Iocb, result: *mut IoEvent) -> i32 // NR 210
AioCtxId is a u64 opaque handle (the mmap address of the completion ring, cast
to unsigned long by userspace). UmkaOS stores contexts in a per-mm XArray<AioContext>
keyed by this address for O(1) lookup.
19.11.2 ABI Structures¶
19.11.2.1 struct Iocb (I/O Control Block)¶
/// Linux AIO I/O control block. Userspace allocates these and passes pointers
/// via io_submit(). Layout matches Linux `struct iocb` exactly (64 bytes).
#[repr(C)]
pub struct Iocb {
/// User data token, copied verbatim to IoEvent::data on completion.
pub aio_data: u64,
// On big-endian (PPC32, s390x), the Linux ABI swaps these two fields:
// aio_rw_flags comes before aio_key. Use #[cfg(target_endian = "big")]
// conditional compilation to match the Linux struct iocb layout.
#[cfg(target_endian = "little")]
/// Must be zero (was IOCB_KEY_INTERNAL in early kernels; Linux >=3.x
/// rejects non-zero values with -EINVAL for forward compatibility).
pub aio_key: u32,
#[cfg(target_endian = "little")]
/// RWF_* per-operation flags: RWF_HIPRI (0x1), RWF_DSYNC (0x2),
/// RWF_SYNC (0x4), RWF_NOWAIT (0x8), RWF_APPEND (0x10).
pub aio_rw_flags: u32,
#[cfg(target_endian = "big")]
/// RWF_* per-operation flags (before aio_key on big-endian).
pub aio_rw_flags: u32,
#[cfg(target_endian = "big")]
/// Must be zero (after aio_rw_flags on big-endian).
pub aio_key: u32,
/// Operation opcode (IOCB_CMD_PREAD, IOCB_CMD_PWRITE, etc.).
pub aio_lio_opcode: u16,
/// I/O priority hint passed to the block I/O scheduler.
pub aio_reqprio: i16,
/// File descriptor for the target file.
pub aio_fildes: u32,
/// Userspace buffer address (source for write, destination for read).
pub aio_buf: u64,
/// Buffer length in bytes.
pub aio_nbytes: u64,
/// File offset for positional I/O. -1 for IOCB_CMD_FDSYNC/IOCB_CMD_FSYNC.
pub aio_offset: i64,
/// Reserved, must be zero. Returns -EINVAL if non-zero.
pub aio_reserved2: u64,
/// Flags: IOCB_FLAG_RESFD (0x1) enables eventfd notification,
/// IOCB_FLAG_IOPRIO (0x2) uses aio_reqprio as ioprio value.
pub aio_flags: u32,
/// eventfd file descriptor for completion notification. Only examined
/// when IOCB_FLAG_RESFD is set in aio_flags.
pub aio_resfd: u32,
}
const_assert!(core::mem::size_of::<Iocb>() == 64);
19.11.2.2 struct IoEvent (Completion Event)¶
/// Completion event delivered via the completion ring or io_getevents().
/// Layout matches Linux `struct io_event` exactly (32 bytes).
#[repr(C)]
pub struct IoEvent {
/// Copied from Iocb::aio_data — application correlation token.
pub data: u64,
/// Userspace address of the original Iocb (as u64).
pub obj: u64,
/// Result: positive byte count on success, negative errno on failure.
pub res: i64,
/// Secondary result. Zero on success; negative errno for secondary
/// failures (e.g., partial fsync metadata error).
pub res2: i64,
}
const_assert!(core::mem::size_of::<IoEvent>() == 32);
19.11.3 IOCB_CMD Opcodes¶
Values match Linux include/uapi/linux/aio_abi.h exactly:
| Value | Name | Description |
|---|---|---|
| 0 | IOCB_CMD_PREAD |
Positional read (pread64 equivalent) |
| 1 | IOCB_CMD_PWRITE |
Positional write (pwrite64 equivalent) |
| 2 | IOCB_CMD_FSYNC |
Sync file data + metadata to storage |
| 3 | IOCB_CMD_FDSYNC |
Sync file data only (fdatasync equivalent) |
| 5 | IOCB_CMD_POLL |
Poll for events (Linux 4.18+, POLLIN/POLLOUT mask in aio_buf) |
| 6 | IOCB_CMD_NOOP |
No operation (used for padding/testing) |
| 7 | IOCB_CMD_PREADV |
Positional vectored read |
| 8 | IOCB_CMD_PWRITEV |
Positional vectored write |
Opcode 4 was the experimental IOCB_CMD_PREADX, never exposed to the stable ABI.
UmkaOS returns -EINVAL for opcode 4 and any opcode > 8, matching Linux behavior.
19.11.4 AioContext (Internal Kernel State)¶
/// Per-context state created by io_setup(). One AioContext exists per
/// successful io_setup() call. Stored in the mm's context XArray keyed
/// by the ring mmap address.
pub struct AioContext {
/// Opaque identifier (the mmap address of the completion ring).
/// Used as the XArray key in mm.aio_contexts.
pub id: AioCtxId,
/// Maximum number of concurrent in-flight events, set by io_setup(nr_events).
/// The completion ring is sized to hold this many IoEvent entries.
pub max_events: u32,
/// Completion ring mapped read-only into userspace. Kernel writes
/// completed IoEvent entries here; userspace may read directly.
pub ring: AioCompletionRing,
/// Count of currently in-flight I/O operations. Incremented on
/// io_submit() acceptance, decremented on bio completion callback.
/// io_destroy() waits for this to reach zero before freeing.
pub pending: AtomicU32,
/// Set to true by io_destroy(). Once set, io_submit() returns -EINVAL
/// for this context. Checked with Acquire ordering on every submission.
pub dead: AtomicBool,
/// Weak reference to the owning address space. Cleared on mm teardown.
pub mm: Weak<MmStruct>,
/// Wait queue for tasks blocked in io_getevents() awaiting completions.
pub wait: WaitQueue,
/// u64 generation counter incremented on each completion. Used for
/// stale-wakeup detection in io_getevents() (avoids ABA on ring wrap).
pub completion_gen: AtomicU64,
}
Context registry: Each MmStruct holds an aio_contexts: XArray<Arc<AioContext>>
keyed by context id (the ring mmap address). Lookup is O(1). The XArray is created
lazily on first io_setup() — processes that never use AIO pay zero memory overhead.
19.11.5 AioCompletionRing (Shared Memory Ring)¶
The completion ring is a contiguous allocation mapped into userspace as a read-only
VMA with the VM_DONTCOPY | VM_DONTEXPAND flags (matching Linux). The kernel writes
events at the tail; userspace reads from the head.
/// Ring header, mapped at the start of the completion ring pages.
/// Layout matches Linux `struct aio_ring` for binary compatibility.
///
/// **Cross-language atomicity model**: The kernel uses Rust `AtomicU32` for
/// `head`/`tail`; libaio uses `volatile unsigned *` reads/writes. Both compile
/// to identical load/store instructions on all supported architectures.
/// `AtomicU32` has the same size (4 bytes) and alignment (4 bytes) as `u32`
/// on all targets, so the mmap'd layout is byte-compatible. The kernel writes
/// `tail` with `Release` ordering; userspace reads `tail` with volatile
/// (compiler barrier) and uses `smp_rmb()` (hardware barrier on weakly-ordered
/// architectures). The kernel reads `head` with `Acquire` ordering; userspace
/// writes `head` with volatile + `smp_wmb()`. This pairing is equivalent to
/// the C11 release-acquire model.
#[repr(C)]
pub struct AioRingHeader {
/// Kernel-internal ring index number (matches Linux aio_ring.id).
pub id: u32,
/// Total number of IoEvent slots in the ring.
pub nr: u32,
/// Read cursor (updated by userspace after consuming events).
pub head: AtomicU32,
/// Write cursor (updated by kernel after producing events).
pub tail: AtomicU32,
/// Magic: 0xa10a10a1 (AIO_RING_MAGIC) — userspace checks this to detect
/// a usable ring.
pub magic: u32,
/// ABI compatibility version. Currently 1 (matches Linux
/// AIO_RING_COMPAT_FEATURES).
pub compat_features: u32,
/// Must be zero. Non-zero values indicate incompatible features
/// that userspace does not understand; libaio falls back to syscall.
pub incompat_features: u32,
/// Size of this header in bytes (32) — the offset at which the first
/// IoEvent entry begins.
pub header_length: u32,
}
const_assert!(core::mem::size_of::<AioRingHeader>() == 32);
The ring body follows the header: nr entries of IoEvent, each 32 bytes. The total
ring allocation is rounded up to page size. The ring capacity nr is max_events
rounded up to the next power of two, so slot indexing reduces to a bitmask.
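The sizing rule can be sketched as follows, assuming capacity is max_events rounded up to the next power of two and bounded by the per-context limit from Section 19.11.11 (the helper names are illustrative):

```rust
/// Per-context cap on nr_events (Section 19.11.11).
const AIO_MAX_NR_EVENTS: u32 = 65536;

/// Ring capacity: max_events rounded up to the next power of two, so the
/// slot index is cursor & (nr - 1) instead of a division. 65536 is itself
/// a power of two, so clamping first preserves the invariant.
fn ring_capacity(max_events: u32) -> u32 {
    max_events.clamp(1, AIO_MAX_NR_EVENTS).next_power_of_two()
}

/// Slot index for a monotonically increasing cursor.
fn slot(cursor: u32, nr: u32) -> u32 {
    debug_assert!(nr.is_power_of_two());
    cursor & (nr - 1)
}

fn main() {
    assert_eq!(ring_capacity(100), 128);     // rounded up
    assert_eq!(ring_capacity(65536), 65536); // already a power of two
    assert_eq!(slot(130, 128), 2);           // cursor wraps cleanly
}
```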
Userspace fast path: When head != tail, userspace can read events directly from
the mmap'd ring without entering the kernel. The libaio library exploits this: it
checks head != tail before calling io_getevents(), avoiding a syscall when events
are already available. UmkaOS preserves this optimization by maintaining identical
ring layout and memory ordering (Release on kernel tail update, Acquire on
userspace head read).
19.11.6 Submission Path (io_submit)¶
- Context lookup: Load AioContext from current_mm().aio_contexts.load(ctx_id). Return -EINVAL if not found or dead is set.
- Capacity check: Per Linux ABI, io_submit() returns the number of iocbs successfully submitted (partial accept). If pending.load(Acquire) >= max_events before any iocb is submitted, return -EAGAIN. Otherwise, submit as many as capacity allows and return the count.
- Per-iocb validation (for each of the nr iocbs):
  - Copy Iocb from userspace (single copy_from_user, 64 bytes). Return -EFAULT on bad pointer.
  - Verify aio_key == 0 and aio_reserved2 == 0 — return -EINVAL otherwise.
  - Verify aio_lio_opcode is a supported opcode — return -EINVAL otherwise.
  - Resolve aio_fildes to a File reference. Return -EBADF if invalid.
  - For IOCB_CMD_PREAD/IOCB_CMD_PWRITE: verify the file supports the operation (FileOps::read/FileOps::write is implemented). Verify O_DIRECT alignment requirements (buffer address, length, and offset must be block-aligned). Return -EINVAL on misalignment.
  - For IOCB_FLAG_RESFD: resolve aio_resfd to an eventfd (Section 19.10). Return -EBADF if invalid or not an eventfd.
  - Validate aio_rw_flags: only RWF_HIPRI | RWF_DSYNC | RWF_SYNC | RWF_NOWAIT | RWF_APPEND are permitted. Unknown flags return -EINVAL.
- I/O dispatch: For IOCB_CMD_PREAD/IOCB_CMD_PWRITE, construct a Bio and submit it through the block layer (Section 15.2). The bio completion callback is set to aio_complete(). For IOCB_CMD_FSYNC/IOCB_CMD_FDSYNC, submit an async fsync request to the file's filesystem. For IOCB_CMD_POLL, register the file for poll notification via FileOps::poll().
- Accounting: Increment pending by the number of successfully submitted iocbs. Return the count of successfully submitted iocbs — this may be less than nr if an iocb in the middle of the array fails validation, in which case all preceding iocbs are still submitted.
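The partial-accept contract above can be captured in a toy model: validate iocbs in order, stop at the first failure or at capacity, and surface the error only when nothing was submitted. Validation is reduced here to the opcode check alone, and `pending`/`max_events` model the context fields:

```rust
const EAGAIN: i32 = 11;
const EINVAL: i32 = 22;

/// Stand-in validation: opcode 4 and anything above 8 are invalid
/// (Section 19.11.3); everything else passes.
fn validate(opcode: u16) -> Result<(), i32> {
    match opcode {
        4 | 9.. => Err(-EINVAL),
        _ => Ok(()),
    }
}

/// Partial-accept semantics of io_submit(): submit iocbs in order until
/// one fails validation or capacity runs out. Return the error only when
/// zero iocbs were accepted; otherwise return the partial count.
fn io_submit(pending: &mut u32, max_events: u32, opcodes: &[u16]) -> Result<usize, i32> {
    if *pending >= max_events {
        return Err(-EAGAIN); // no capacity before any submission
    }
    let mut submitted = 0;
    for &op in opcodes {
        if *pending >= max_events {
            break; // capacity exhausted mid-batch: partial count
        }
        if let Err(e) = validate(op) {
            return if submitted == 0 { Err(e) } else { Ok(submitted) };
        }
        *pending += 1; // dispatch elided
        submitted += 1;
    }
    Ok(submitted)
}

fn main() {
    let mut pending = 0;
    // Second iocb uses the dead opcode 4: the first is still submitted.
    assert_eq!(io_submit(&mut pending, 4, &[0, 4, 1]), Ok(1));
    assert_eq!(pending, 1);
    // First iocb bad and nothing submitted: the error surfaces.
    assert_eq!(io_submit(&mut pending, 4, &[9]), Err(-EINVAL));
}
```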
19.11.7 Completion Path¶
- Bio callback: When the block layer completes a bio, the aio_complete() callback fires. It writes an IoEvent to the ring at position tail % nr, advances tail with Release ordering, and increments completion_gen.
- eventfd signaling: If IOCB_FLAG_RESFD was set on the original iocb, the kernel signals the eventfd by calling eventfd_signal(resfd, 1) after writing the ring entry.
- Waiter wakeup: After writing the event, wait.wake_all() unblocks any tasks sleeping in io_getevents().
- Pending decrement: pending.fetch_sub(1, Release). If this was the last pending operation and dead is set, wake the io_destroy() waiter.
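The last two steps — decrementing pending and deciding whether to wake a waiting io_destroy() — can be sketched with the context's atomic fields (ring write and wait-queue wakeup elided; names follow the AioContext struct above, but the boolean return is an illustration, not the real callback signature):

```rust
use std::sync::atomic::{AtomicBool, AtomicU32, AtomicU64, Ordering};

/// The AioContext fields touched by the completion path.
struct Ctx {
    pending: AtomicU32,
    dead: AtomicBool,
    completion_gen: AtomicU64,
}

/// Tail end of aio_complete(): bump the generation counter, decrement
/// pending, and report whether this was the final in-flight operation on
/// a dead context — the case where io_destroy() must be woken.
fn aio_complete(ctx: &Ctx) -> bool {
    ctx.completion_gen.fetch_add(1, Ordering::Release);
    let was = ctx.pending.fetch_sub(1, Ordering::Release);
    was == 1 && ctx.dead.load(Ordering::Acquire)
}

fn main() {
    let ctx = Ctx {
        pending: AtomicU32::new(2),
        dead: AtomicBool::new(false),
        completion_gen: AtomicU64::new(0),
    };
    assert!(!aio_complete(&ctx));            // one op still in flight
    ctx.dead.store(true, Ordering::Release); // io_destroy() arrives
    assert!(aio_complete(&ctx));             // last completion: wake destroyer
    assert_eq!(ctx.completion_gen.load(Ordering::Acquire), 2);
}
```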
19.11.8 io_getevents Blocking Behavior¶
io_getevents(ctx, min_nr, max_nr, events, timeout):
- If min_nr == 0: non-blocking — copy up to max_nr available events and return.
- If min_nr > 0: block until at least min_nr events are available or timeout expires. A NULL timeout means block indefinitely. A zero timeout ({0, 0}) means non-blocking (equivalent to min_nr == 0).
- Events are copied from the ring to the userspace events buffer via copy_to_user. The ring head is advanced after the copy.
- Returns the number of events copied, or -EINTR if interrupted by a signal.
- max_nr is clamped to ring.nr — userspace cannot request more events than exist.
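The blocking decision reduces to a pure function of (min_nr, timeout). A sketch, modeling the userspace timespec pointer as Option<nanoseconds> (the enum and helper are illustrative, not kernel types):

```rust
#[derive(Debug, PartialEq)]
enum WaitMode {
    NonBlocking,
    BlockWithTimeout(u64), // nanoseconds
    BlockForever,
}

/// Blocking decision for io_getevents(), per the rules above:
/// min_nr == 0 or a zero timespec never blocks; a NULL timeout
/// (None here) blocks indefinitely.
fn wait_mode(min_nr: usize, timeout_ns: Option<u64>) -> WaitMode {
    if min_nr == 0 {
        return WaitMode::NonBlocking;
    }
    match timeout_ns {
        None => WaitMode::BlockForever,   // NULL timeout pointer
        Some(0) => WaitMode::NonBlocking, // {0, 0} timespec
        Some(ns) => WaitMode::BlockWithTimeout(ns),
    }
}

fn main() {
    assert_eq!(wait_mode(0, None), WaitMode::NonBlocking);
    assert_eq!(wait_mode(4, None), WaitMode::BlockForever);
    assert_eq!(wait_mode(4, Some(0)), WaitMode::NonBlocking);
    assert_eq!(wait_mode(4, Some(1_000)), WaitMode::BlockWithTimeout(1_000));
}
```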
19.11.9 io_cancel¶
io_cancel(ctx, iocb, result) attempts to cancel an in-flight operation:
- Searches the pending operations for one matching iocb (compared by userspace pointer address).
- If found and the underlying I/O has not yet been dispatched to hardware, cancels it and writes the completion event to result with res = -ECANCELED.
- If the I/O is already in-flight at the hardware level, returns -EAGAIN (cannot cancel — the operation will complete normally).
- If no matching iocb is found, returns -EINVAL.
19.11.10 Cleanup (io_destroy)¶
- Set the dead flag with Release ordering — all subsequent io_submit() calls return -EINVAL.
- Cancel any pending IOCB_CMD_POLL registrations.
- Wait for pending to reach zero. This blocks until all in-flight I/O completes. The caller may be interrupted by a signal (returns -EINTR; the context remains valid and userspace must retry).
- Unmap the completion ring from the process address space.
- Remove the AioContext from the mm.aio_contexts XArray.
- Drop the AioContext (ring pages freed, wait queue released).
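The dead-flag-then-drain sequence can be modeled in userspace with a Condvar standing in for the kernel wait queue (signal interruption, ring unmap, and registry removal are elided; all names are illustrative):

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::{Arc, Condvar, Mutex};

/// Toy model of the io_destroy() drain.
struct Ctx {
    dead: AtomicBool,
    pending: Mutex<u32>,
    drained: Condvar,
}

impl Ctx {
    /// Completion path: decrement pending and wake the destroyer when
    /// the last in-flight operation finishes.
    fn complete_one(&self) {
        let mut p = self.pending.lock().unwrap();
        *p -= 1;
        if *p == 0 {
            self.drained.notify_all();
        }
    }

    /// io_destroy(): mark the context dead, then sleep until the
    /// completion path drives pending to zero.
    fn destroy(&self) {
        self.dead.store(true, Ordering::Release); // new io_submit() now fails
        let mut p = self.pending.lock().unwrap();
        while *p != 0 {
            p = self.drained.wait(p).unwrap(); // loop guards spurious wakeups
        }
        // safe to unmap the ring and drop the context here
    }
}

fn main() {
    let ctx = Arc::new(Ctx {
        dead: AtomicBool::new(false),
        pending: Mutex::new(3),
        drained: Condvar::new(),
    });
    let worker = {
        let ctx = Arc::clone(&ctx);
        std::thread::spawn(move || {
            for _ in 0..3 {
                ctx.complete_one(); // in-flight I/O finishing
            }
        })
    };
    ctx.destroy(); // blocks until pending reaches zero
    worker.join().unwrap();
    assert_eq!(*ctx.pending.lock().unwrap(), 0);
}
```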
19.11.11 Resource Limits¶
- RLIMIT_NOFILE (Section 8.7): each AIO context consumes one slot in the per-mm XArray but does not consume a file descriptor.
- Maximum events per context: 65536 (AIO_MAX_NR_EVENTS). io_setup() returns -EAGAIN if nr_events exceeds this limit.
- System-wide maximum events: /proc/sys/fs/aio-max-nr (default: 1048576). Tracks the sum of max_events across all contexts system-wide. io_setup() returns -EAGAIN if the system-wide limit would be exceeded.
- Maximum simultaneous contexts per process: no explicit limit beyond address space availability. Each context consumes at minimum one page for the ring.
19.11.12 Performance Considerations¶
Legacy AIO has higher per-submission overhead than io_uring due to three factors:
- Per-iocb copy and validation: Each io_submit() call copies and validates every iocb from userspace individually. io_uring validates at setup time and operates on pre-registered shared memory.
- Syscall overhead: Every submission and reap is a full syscall round-trip. io_uring can operate entirely through shared memory with no syscall for submission or completion.
- No batching infrastructure: Legacy AIO has no equivalent of io_uring's SQ polling thread or linked SQE chains.
UmkaOS recommends io_uring for all new applications. Legacy AIO exists solely for binary compatibility with databases and storage engines that have not migrated. No UmkaOS-specific enhancements are planned for the legacy AIO path.