Chapter 10: Driver Architecture and Isolation
Three-tier protection model, isolation mechanisms, KABI, driver model, device registry, zero-copy I/O, IPC, crash recovery, driver subsystems
10.1 Three-Tier Protection Model
UmkaOS organizes code into three driver tiers — the UmkaOS Core microkernel (Tier 0), Tier 1 kernel-adjacent drivers, and Tier 2 user-space drivers — plus standard user space. The "three tiers" refer to the three levels at which kernel/driver code executes (Core, Tier 1, Tier 2); user space is not counted as a tier because it uses the standard Linux process model unchanged.
A fourth class, Tier M (Multikernel Peer), emerges when attached hardware runs its own UmkaOS kernel instance or an UmkaOS-compatible shim. Tier M is not a tier within a single UmkaOS instance — it is a physically separate execution environment with isolation stronger than Tier 2 and near-zero host-side driver complexity. See Section 10.1.2.
+======================================================================+
| UmkaOS CORE (Ring 0) |
| Microkernel: Rust + C/asm for arch boot |
| |
| - Capability manager - Physical memory allocator |
| - Thread/process management - Scheduler (CFS/EEVDF + RT + DL) |
| - IPC primitives - MMU / IOMMU programming |
| - Interrupt routing - vDSO maintenance |
| - Virtual memory manager - Page cache |
| - Timer management - Linux syscall interface |
+======================================================================+
| MPK switch (~23 cycles) | Shared memory (0 copies)
v v
+======================================================================+
| TIER 1: Kernel-Adjacent Drivers |
| Ring 0, MPK-isolated (Intel Memory Protection Keys) |
| |
| - NVMe, AHCI/SATA - High-perf NICs (Intel, Mellanox) |
| - TCP/IP + UDP stack - GPU compute drivers |
| - Block I/O layer - Filesystem impls (ext4, XFS, btrfs) |
| - VirtIO drivers - Crypto subsystem |
| - KVM hypervisor (*) - Netfilter/nftables engine |
+======================================================================+
(*) KVM runs as a Tier 1 driver with extended hardware privileges (KvmHardwareCapability),
which authorizes umka-core to execute VMX/VHE/H-extension operations on KVM's behalf
via a validated VMX/VHE trampoline. KVM retains full Tier 1 crash-recovery semantics.
See Section 18.1.4.5 for the full classification rationale.
| Address-space switch | IOMMU-isolated
| (~200-500 cycles, PCID/ASID) | DMA fencing
v v
+======================================================================+
| TIER 2: User-Space Drivers |
| Ring 3, separate address space, IOMMU-protected DMA |
| |
| - USB drivers - Audio (HDA, USB Audio) |
| - Input devices - Bluetooth, WiFi control plane |
| - Printers, scanners - Third-party / vendor drivers |
| - Display server drivers - Non-performance-critical devices |
+======================================================================+
| Standard Linux syscall interface (100% compatible)
v
+======================================================================+
| USER SPACE (Ring 3) |
| Unmodified Linux binaries: glibc, musl, systemd, Docker, K8s, etc. |
+======================================================================+
════════════ Hardware Fabric Boundary (PCIe / CXL / coherent on-chip) ════════════
+======================================================================+
| TIER M: Multikernel Peer (separate kernel instance) |
| Own CPU complex · own memory · UmkaOS kernel or UmkaOS-compatible shim |
| |
| - SmartNIC / DPU (BlueField, Pensando, Marvell OCTEON) |
| - Computational storage SoC (Arm Cortex-R, Zynq UltraScale+) |
| - On-chip hardware partition (ARM CCA Realm, RISC-V WorldGuard) |
| - GPU / NIC with UmkaOS-compatible firmware shim |
+======================================================================+
(*) Tier M: deployment-time property — 0 to N peers per host.
Host representation: umka-peer-transport (~2,000 lines, device-agnostic).
Complexity management — The core-of-core (scheduler + memory + caps + IPC) should be as small as feasible. For reference: seL4's verified microkernel is ~10K SLOC (but provides far fewer services), QNX's microkernel is ~100K, and the Zircon kernel (Fuchsia) is ~200K. Any subsystem that grows beyond the minimum necessary for its function should be re-evaluated for extraction to Tier 1.
10.1.1 How the Tiers Interact
UmkaOS Core to Tier 1: The core switches the MPK protection domain via WRPKRU (a
single unprivileged instruction, approximately 23 cycles). Both run in Ring 0 and share
the same address space, but MPK keys prevent a Tier 1 driver from reading or writing
memory belonging to the core or to other Tier 1 domains. Communication uses shared-memory
ring buffers -- zero copies, zero transitions for data.
UmkaOS Core to Tier 2: Standard process-based isolation. Tier 2 drivers run in Ring 3 with their own address space. Communication uses mapped shared-memory rings for data (zero copy) and lightweight syscall-based notifications. IOMMU restricts DMA to driver-allocated regions.
How Tier 2 zero-copy works — dual physical page mapping: The shared ring buffers between UmkaOS Core and a Tier 2 driver are backed by a single set of physical pages mapped into two virtual address spaces simultaneously. The kernel side holds a VmArea covering these pages (with VM_SHARED | VM_IO flags) and accesses them through its own virtual address. The Tier 2 driver side calls mmap(UMKA_RING_FD, ...) on a special file descriptor issued at driver registration; the kernel maps the same physical frames into the Tier 2 process address space as a read-write shared mapping. No copy occurs on either side: the kernel writes to its virtual address and the Tier 2 driver reads from its virtual address, both resolving to the same physical frames. The shared region is bounded to the ring buffer size; the Tier 2 driver cannot access kernel memory outside the mapped ring (enforced by VMA boundaries and IOMMU). Cache coherency is guaranteed by the CPU coherency protocol on x86 and ARM; non-coherent platforms add a memory fence before and after ring accesses.
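A minimal sketch of the ring discipline that runs over those doubly-mapped pages — a single-producer/single-consumer queue with release/acquire index publication. A plain array stands in for the shared physical frames, and all names (Ring, push, pop) are illustrative rather than the actual UmkaOS ring API:

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

// Illustrative SPSC ring over memory visible to both sides. In UmkaOS the
// backing store would be the doubly-mapped physical frames; here a plain
// array stands in for them. Names are hypothetical, not the real API.
const SLOTS: usize = 8;

struct Ring {
    head: AtomicUsize, // next slot the producer (kernel side) writes
    tail: AtomicUsize, // next slot the consumer (Tier 2 driver) reads
    slots: [u64; SLOTS], // descriptor payloads, one per slot
}

impl Ring {
    fn new() -> Self {
        Ring { head: AtomicUsize::new(0), tail: AtomicUsize::new(0), slots: [0; SLOTS] }
    }

    // Producer side: fill the slot, then advance head with Release ordering
    // so the consumer observes the slot contents before the new index.
    fn push(&mut self, desc: u64) -> bool {
        let head = self.head.load(Ordering::Relaxed);
        let tail = self.tail.load(Ordering::Acquire);
        if head - tail == SLOTS {
            return false; // ring full — producer must back off
        }
        self.slots[head % SLOTS] = desc;
        self.head.store(head + 1, Ordering::Release);
        true
    }

    // Consumer side: Acquire on head pairs with the producer's Release.
    fn pop(&mut self) -> Option<u64> {
        let tail = self.tail.load(Ordering::Relaxed);
        if tail == self.head.load(Ordering::Acquire) {
            return None; // ring empty
        }
        let desc = self.slots[tail % SLOTS];
        self.tail.store(tail + 1, Ordering::Release);
        Some(desc)
    }
}

fn main() {
    let mut ring = Ring::new();
    assert!(ring.push(0xABCD));
    assert_eq!(ring.pop(), Some(0xABCD));
    assert_eq!(ring.pop(), None);
}
```

Because both indices and slots live inside the bounded mapped region, neither side needs to trust the other beyond the ring protocol itself, which is exactly what the VMA/IOMMU bounds enforce.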
Tier 1/2 to User Space: No direct interaction for control paths — all user-space requests go through UmkaOS Core's syscall layer, which dispatches to the appropriate tier. However, the data path does allow direct shared memory: UmkaOS Core sets up shared ring buffers (Section 10.6) that are mapped into both the driver and user-space address spaces. Once established, data flows through these rings without UmkaOS Core mediation (zero-copy). UmkaOS Core mediates only the ring setup, teardown, and error paths.
UmkaOS Core to Tier M (Peer Kernel): Communication uses typed capability channels
over the hardware fabric — PCIe P2P ring buffers, CXL shared memory, or coherent
on-chip SRAM. No UmkaOS Core data paths cross the hardware boundary. The host-side
umka-peer-transport module (~2,000 lines) manages cluster membership, capability
negotiation, and crash recovery. The peer kernel runs its own scheduler, memory
manager, and capability space independently; the host CPU is not in the device's
data path.
10.1.2 Tier M: Multikernel Peer Isolation
Tier M describes the isolation class of devices running their own UmkaOS kernel instance (or UmkaOS-compatible shim) as cluster peers. The three-tier model (Tiers 0–2) describes isolation within a single UmkaOS instance. Tier M is a between-kernel isolation class.
Isolation properties:
- No shared kernel address space. Tiers 0–2 all execute within the host UmkaOS kernel (Ring 0 or Ring 3) and share kernel data structures at various depths. A Tier M peer has an entirely separate address space, CPU state, and capability namespace. The host UmkaOS Core never maps peer memory.
- Hardware boundary, not software policy. Tier 1 isolation (MPK/POE/DACR) and Tier 2 isolation (IOMMU) are policies enforced by software-programmable hardware registers — a sufficiently privileged exploit can alter them. The Tier M boundary is a hardware fabric (PCIe, CXL, on-chip partition fence); crossing it requires physical access or firmware compromise of the device, a categorically different threat model.
- Isolation stronger than Tier 2. Tier 2 is Ring 3 + IOMMU on the host — the driver still shares the host kernel for syscall dispatch, signal delivery, and page table management. A Tier M peer shares none of these. The only communication surface is the typed capability channel, a significantly smaller attack surface.
- Ordered crash recovery. On peer kernel crash: IOMMU lockout and PCIe bus master disable within 2ms, then distributed state cleanup, then optional FLR and reboot. The host kernel never panics; applications see a brief I/O stall. See Section 5.1.3.
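As a sketch, the ordered recovery sequence from the last bullet can be written down as data, which makes the DMA-fencing-before-cleanup ordering explicit (step names are illustrative, not kernel APIs):

```rust
// Hypothetical sketch of the ordered Tier M peer-crash sequence described
// above. Step names mirror the text; none of these are real UmkaOS APIs.
#[derive(Debug, PartialEq)]
enum RecoveryStep {
    IommuLockout,       // revoke all DMA mappings for the peer (within 2ms)
    BusMasterDisable,   // clear PCIe Bus Master Enable — device can no longer DMA
    DistributedCleanup, // tear down capability channels, fail outstanding I/O
    FunctionLevelReset, // optional: FLR and peer kernel reboot
}

fn recover_peer(attempt_reboot: bool) -> Vec<RecoveryStep> {
    let mut steps = vec![
        RecoveryStep::IommuLockout,
        RecoveryStep::BusMasterDisable,
        RecoveryStep::DistributedCleanup,
    ];
    if attempt_reboot {
        steps.push(RecoveryStep::FunctionLevelReset);
    }
    steps // the host kernel never panics; applications see a bounded I/O stall
}

fn main() {
    let steps = recover_peer(true);
    // DMA must be fenced before any distributed state cleanup runs.
    assert_eq!(steps[0], RecoveryStep::IommuLockout);
    assert_eq!(steps.len(), 4);
}
```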
Performance properties:
- Host-side complexity: ~2,000 lines of device-agnostic umka-peer-transport regardless of device class, vs. 100K–700K lines of device-specific Ring 0 driver code for equivalent traditional devices.
- Host CPU out of data path: the peer kernel manages its own scheduler and I/O independently. Host CPU overhead is proportional to control-path events (peer joins, leaves, crashes, capability renegotiation), not to data throughput.
- Communication latency by fabric:
| Fabric | Example hardware | Round-trip latency |
|---|---|---|
| PCIe P2P | Discrete SmartNIC, DPU | ~1–2 μs |
| CXL coherent | Attached memory/compute expander | ~100–300 ns |
| On-chip hardware partition | ARM CCA Realm, RISC-V WorldGuard | ~10–50 ns |
Hardware forms:
| Form | Examples | Notes |
|---|---|---|
| Discrete device | BlueField-3, Pensando Elba, Marvell OCTEON 10 | Full UmkaOS kernel port |
| Computational storage | Arm Cortex-R NVMe SoC, Zynq UltraScale+ | Full UmkaOS kernel port |
| On-chip partition | ARM CCA Realm (Neoverse V3+), RISC-V WorldGuard | Same physical package; coherent fabric |
| UmkaOS shim | GPU or NIC firmware implementing the UmkaOS peer protocol | Device need not run the full kernel |
The on-chip partition form — ARM CCA Realms (shipping in Neoverse V3+, Cortex-X4+) and RISC-V WorldGuard (specification in progress) — is conceptually identical to the discrete device form. The architectural pattern is the same: a separate UmkaOS instance, a hardware-enforced boundary, typed capability channels as the sole communication surface. The difference is physical proximity, which determines communication latency but not the isolation model.
Availability: Tier M is a deployment-time property. A host may have zero, one, or
many Tier M peers depending on attached hardware. The host kernel supports the
traditional driver model (Tiers 0–2) and the peer model simultaneously with no
configuration distinction — umka-peer-transport loads on demand when a peer device
is detected.
10.2 Isolation Mechanisms and Performance Modes
Hardware-assisted memory isolation enables UmkaOS's three-tier driver model (Section 10.1) at near-zero overhead on platforms with hardware isolation support (x86 MPK ~23 cycles, ARMv7 DACR ~10-20 cycles, AArch64 page-table ~150-300 cycles). On RISC-V, Tier 1 isolation is not available — Tier 1 drivers run as Tier 0 (in-kernel, fully trusted) until RISC-V hardware provides suitable isolation primitives. This section covers the mechanisms, their costs, the threat model, and the adaptive policy that lets UmkaOS run on hardware ranging from x86_64 with MPK (~23 cycles) to RISC-V, where Tier 1 is absent entirely. Isolation is one of eight core capabilities — see Section 1.1 for the full list.
10.2.1 Isolation Philosophy: Best Effort Within Performance Budget
Key principle: Driver isolation in UmkaOS is not a single fixed design point. It is a spectrum that varies across hardware architectures, and the approach is deliberately "best effort within the performance budget" rather than "maximum isolation everywhere."
Why this matters:
- Hardware capability varies widely: x86_64 has MPK (16 domains, ~23 cycles). AArch64 uses page-table + ASID isolation (~150-300 cycles) as the standard mechanism on all current deployed hardware (Graviton 2/3/4, Neoverse V1/V2, Ampere Altra, Kunpeng 920). POE (ARMv8.9+/ARMv9.4+, ~40-80 cycles) is an optional hardware acceleration available on newer silicon (Neoverse V3+, Cortex-X4+) that provides 2-4x speedup when present. ARMv7 has DACR (16 domains, ~10-20 cycles). RISC-V has no suitable isolation mechanism — Tier 1 is not available on RISC-V. A design that mandates uniform isolation would either (a) impose unacceptable overhead on some architectures, or (b) fail to leverage better isolation on architectures that support it.
- Performance is a requirement, not a nice-to-have: The 5% overhead target is non-negotiable. UmkaOS must be a drop-in replacement for Linux — if I/O latency increases by 20%, users will not adopt it regardless of how strong the isolation is.
- The escape hatch always exists: Any Tier 1 driver can be demoted to Tier 2 (full process isolation) at any time — via per-driver manifest, sysfs knob, or automatic crash-count policy. If an administrator values isolation over performance for a specific workload or hardware configuration, that choice is always available. The tradeoff is explicit and user-controlled.
- This is not a bug, it's a feature: Some reviewers may see varying isolation strength across architectures as a "flaw" or "inconsistency." It is neither. It is an honest acknowledgment of hardware reality. The alternative — pretending all architectures have identical isolation capabilities, or mandating full process isolation everywhere (and accepting 20-50% overhead) — would make UmkaOS impractical for its intended use case as a Linux replacement.
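The escape-hatch bullet above mentions an automatic crash-count policy. A minimal sketch of such a policy follows, with an assumed threshold of 3 crashes per 60-second window and hypothetical names (CrashPolicy, record_crash) — the actual UmkaOS policy is specified elsewhere:

```rust
use std::collections::VecDeque;

// Illustrative crash-count demotion policy: after `max_crashes` crashes
// within `window_secs`, a Tier 1 driver is pinned to Tier 2. The threshold
// and names are assumptions for this sketch, not the shipped policy.
struct CrashPolicy {
    window_secs: u64,
    max_crashes: usize,
    crash_times: VecDeque<u64>, // monotonic timestamps of recent crashes
}

impl CrashPolicy {
    fn new(window_secs: u64, max_crashes: usize) -> Self {
        CrashPolicy { window_secs, max_crashes, crash_times: VecDeque::new() }
    }

    // Record a crash at time `now`; returns true when the driver should be
    // demoted from Tier 1 to Tier 2 (full process isolation).
    fn record_crash(&mut self, now: u64) -> bool {
        self.crash_times.push_back(now);
        // Drop crashes that have aged out of the sliding window.
        while let Some(&t) = self.crash_times.front() {
            if now - t > self.window_secs {
                self.crash_times.pop_front();
            } else {
                break;
            }
        }
        self.crash_times.len() >= self.max_crashes
    }
}

fn main() {
    let mut policy = CrashPolicy::new(60, 3);
    assert!(!policy.record_crash(0));
    assert!(!policy.record_crash(10));
    assert!(policy.record_crash(20)); // third crash within 60s: demote
}
```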
The design contract:
| Hardware | Tier 1 Isolation | Overhead | Alternative |
|---|---|---|---|
| x86_64 with MPK | Strong (MPK domains) | ~1-2% | Demote to Tier 2 for stronger isolation |
| AArch64 (mainstream: page-table + ASID) | Moderate (page-table domains) | ~6-12% | Demote to Tier 2, or promote to Tier 0 for performance |
| AArch64 with POE (ARMv8.9+/ARMv9.4+) | Strong (POE indices) | ~2-4% | Demote to Tier 2 for stronger isolation |
| ARMv7 with DACR | Strong (DACR domains) | ~0.5-1% | Demote to Tier 2 for stronger isolation |
| RISC-V | Tier 1 unavailable — Tier 1 drivers run as Tier 0 | 0% overhead (no isolation boundary) | Demote to Tier 2 for isolation |
| PPC32/PPC64LE | Strong-Moderate | ~1-5% | Demote to Tier 2 for stronger isolation |
Summary: UmkaOS provides the best isolation the hardware can deliver within the performance budget, with a user-controlled escape hatch to stronger isolation (Tier 2) when security requirements exceed what the hardware can efficiently provide. This is a pragmatic engineering tradeoff, not a design flaw.
AArch64 deployment note: The global UmkaOS performance budget (≤5% overhead vs Linux) requires POE (ARMv8.9+/ARMv9.4-A, FEAT_S1POE) to be met with Tier 1 on AArch64. Without POE, page-table + ASID isolation costs 6-12% per domain switch, which exceeds the budget for high-throughput workloads (NVMe, network). On current mainstream AArch64 servers (Graviton 2/3/4, Neoverse V1/V2, Ampere Altra) that lack POE, operators have two options:
1. Use Tier 1 and accept the higher overhead (appropriate when crash containment is the priority and workloads have low I/O frequency — e.g., compute-heavy, GPU inference).
2. Prefer Tier 2 for I/O-intensive drivers (USB, SATA, fast storage) and promote only low-frequency drivers to Tier 1. This keeps per-request overhead within budget.
POE support detection is automatic at boot (ID_AA64MMFR3_EL1.S1POE). Operators can also force Tier 2 globally on AArch64 without POE via umka.tier1_aarch64=0.
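The operator decision above can be sketched as a small placement function. The 10,000-ops/sec cutover and all names (DriverProfile, choose_tier) are illustrative assumptions, not the shipped policy engine:

```rust
// Sketch of the per-driver tier decision on AArch64. Thresholds and names
// are assumptions for illustration, not the actual UmkaOS policy.
#[derive(Debug, PartialEq, Clone, Copy)]
enum Tier {
    Tier1,
    Tier2,
}

struct DriverProfile {
    io_ops_per_sec: u64, // expected request rate for this driver
}

fn choose_tier(poe_present: bool, force_tier2: bool, d: &DriverProfile) -> Tier {
    if force_tier2 {
        return Tier::Tier2; // umka.tier1_aarch64=0 forces Tier 2 globally
    }
    if poe_present {
        return Tier::Tier1; // ~40-80 cycles/switch: within the 5% budget
    }
    // Without POE, each round-trip pays page-table switches at ~150-300
    // cycles each, so keep Tier 1 only for low-frequency drivers.
    if d.io_ops_per_sec < 10_000 {
        Tier::Tier1
    } else {
        Tier::Tier2
    }
}

fn main() {
    let nvme = DriverProfile { io_ops_per_sec: 500_000 };
    let gpu_ctl = DriverProfile { io_ops_per_sec: 100 };
    assert_eq!(choose_tier(false, false, &nvme), Tier::Tier2);
    assert_eq!(choose_tier(false, false, &gpu_ctl), Tier::Tier1);
    assert_eq!(choose_tier(true, false, &nvme), Tier::Tier1);
}
```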
10.2.2 How MPK Works
Each page table entry contains a 4-bit protection key (PKEY), assigning the page to one
of 16 domains (0-15). The PKRU register holds per-domain read/write permission bits.
The WRPKRU instruction updates these permissions in approximately 23 cycles (measured:
~23 cycles on Skylake [libmpk, USENIX ATC '19], ~28 cycles on Skylake-SP [EPK, USENIX
ATC '22]) -- no TLB flush, no privilege transition, no system call.
10.2.3 Cost Comparison
| Mechanism | Cost per transition | Isolation strength | Used for |
|---|---|---|---|
| Function call | ~1-5 cycles | None | Linux monolithic |
| Intel MPK WRPKRU | ~23 cycles | Memory domain | Tier 1 drivers |
| Full IPC (seL4-style) | ~600-1000 cycles | Full address space | Too expensive |
| Address-space switch | ~200-600 cycles | Full process | Tier 2 drivers |
MPK gives meaningful isolation -- a Tier 1 driver cannot read or write kernel private data, other driver data, or memory in other MPK domains -- at only approximately 23 cycles per boundary crossing. Combined with IOMMU for DMA fencing, this is the foundation of our performance story.
10.2.4 MPK Domain Allocation
With 16 available domains (PKEY 0-15), the allocation strategy is:
| PKEY | Assignment |
|---|---|
| 0 | UmkaOS Core (kernel private data) |
| 1 | Shared read-only (ring buffer descriptors) |
| 2-13 | Tier 1 driver domains (12 available) |
| 14 | Shared DMA buffer pool |
| 15 | Guard / unmapped |
When more than 12 Tier 1 domains are needed, related drivers are grouped into the same domain (for example, all block drivers share one domain, all network drivers share another). This grouping is configurable via policy.
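The allocation table maps directly onto PKRU values. The sketch below computes a Tier 1 driver's PKRU from it using the architectural PKRU layout (bit 2k = access-disable, bit 2k+1 = write-disable for key k, per the Intel SDM); the helper names are illustrative:

```rust
// Compute a Tier 1 driver's PKRU value from the domain allocation table.
// PKRU layout (architectural): bit 2k = ADk (access disable), bit 2k+1 =
// WDk (write disable). Helper names are illustrative, not the real API.
const PKEY_CORE: u32 = 0; // UmkaOS Core private data — never granted
const PKEY_SHARED_RO: u32 = 1; // ring buffer descriptors, read-only
const PKEY_SHARED_DMA: u32 = 14; // shared DMA buffer pool

fn pkru_deny_all() -> u32 {
    0xFFFF_FFFF // AD|WD set for all 16 keys
}

fn grant_rw(pkru: u32, key: u32) -> u32 {
    pkru & !(0b11 << (2 * key)) // clear both AD and WD
}

fn grant_ro(pkru: u32, key: u32) -> u32 {
    (pkru & !(0b01 << (2 * key))) | (0b10 << (2 * key)) // clear AD, set WD
}

// PKRU for a Tier 1 driver assigned to `own_key` (2-13): read/write its own
// domain and the shared DMA pool, read-only shared descriptors, and no
// access to UmkaOS Core (PKEY 0) or any other domain.
fn tier1_pkru(own_key: u32) -> u32 {
    let mut pkru = pkru_deny_all();
    pkru = grant_rw(pkru, own_key);
    pkru = grant_rw(pkru, PKEY_SHARED_DMA);
    pkru = grant_ro(pkru, PKEY_SHARED_RO);
    pkru
}

fn main() {
    let pkru = tier1_pkru(2);
    // Own domain (key 2): both disable bits clear.
    assert_eq!((pkru >> 4) & 0b11, 0b00);
    // Core (key 0): fully disabled.
    assert_eq!((pkru >> (2 * PKEY_CORE)) & 0b11, 0b11);
    // Shared descriptors (key 1): readable but not writable.
    assert_eq!((pkru >> 2) & 0b11, 0b10);
}
```

A grouped domain (e.g., "all block drivers") simply means several drivers load with the same own_key, which is why grouping trades fault-isolation granularity for domain count.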
10.2.5 WRPKRU Threat Model: Crash Containment, Not Exploitation Prevention
Critical design constraint: WRPKRU is an unprivileged instruction. Any code
running in Ring 0 — including Tier 1 driver code — can execute WRPKRU to modify its
own MPK permission register, granting access to any MPK domain including UmkaOS Core
(PKEY 0). This means MPK isolation provides crash containment (preventing buggy
drivers from corrupting kernel memory) but does not provide exploitation
prevention (compromised Ring 0 code can execute WRPKRU to escape).
Security model — UmkaOS's Tier 1 isolation is designed to survive driver bugs, not driver exploitation. The rationale: the vast majority of kernel crashes are caused by bugs (null dereference, use-after-free, buffer overrun), not by attackers with arbitrary code execution inside a specific driver. For environments requiring defense against compromised Ring 0 code, Tier 2 (full process isolation) provides the strong boundary — at higher latency cost.
What MPK actually protects against:
- Accidental memory corruption: Null pointer dereferences, buffer overruns, and similar bugs that write to wrong addresses are contained — the hardware fault triggers before the driver can corrupt kernel memory.
- Crash recovery: When a driver faults, UmkaOS Core can safely restart it without system panic because driver memory is isolated from core state.
- Fault propagation containment: A bug in one Tier 1 driver cannot corrupt data belonging to other drivers or to UmkaOS Core.
What MPK does NOT protect against:
- Deliberate exploitation: An attacker who achieves arbitrary code execution within a Tier 1 driver can execute WRPKRU to escape isolation. The instruction is unprivileged by design and the sanctioned switch_domain() trampoline uses it legitimately — it cannot be detected or blocked.
- Runtime code injection: JIT code or ROP gadgets that contain WRPKRU can execute the instruction directly.
Driver signing — All Tier 1 drivers must be signed (Section 8.2). An attacker cannot load a malicious driver binary without a valid signature. The attack surface is limited to exploiting bugs in legitimately signed driver code. Combined with Rust's memory safety guarantees and standard Linux hardening (CFI, CET), this raises the bar for achieving arbitrary code execution, but does not eliminate the WRPKRU escape vector.
Tier 2 for exploitation-sensitive workloads — For environments where defense against compromised Ring 0 code is required, drivers should run at Tier 2 (full process isolation). The auto-demotion mechanism (Section 10.5.10.2) allows administrators to pin specific drivers to Tier 2 via policy, trading higher I/O latency for stronger isolation.
10.2.5.1 PKRU Write Elision (Mandatory)
The ~23-cycle WRPKRU cost is per instruction, not per domain crossing. When an I/O
path traverses multiple domains in sequence (e.g., NIC driver → TCP stack → socket
layer), a naive implementation issues a WRPKRU at every boundary — 6 writes for a
3-boundary round-trip. UnderBridge (Gu et al., USENIX ATC '20) demonstrated that many
of these writes are redundant and must be elided.
WRPKRU elision is a mandatory core design decision, not a deferred optimization.
Every WRPKRU instruction in UmkaOS goes through the switch_domain() trampoline
(defined below), which enforces shadow comparison before any hardware write. There
is no code path in the kernel that issues a raw WRPKRU without shadow checking —
this invariant is enforced at the API level (the x86::wrpkru() function is unsafe
and only called from switch_domain()).
The three elision techniques (all implemented from day one):
- Same-permission transition: if domain A and domain B both need read access to a shared buffer, and the only permission change is adding write access to B's private region, the WRPKRU write may be unnecessary if A's private region is already read-disabled. The key insight: WRPKRU sets all 16 domain permissions simultaneously — if the new permission bitmap happens to be identical to the current one, the write is redundant.
- Batched transitions: when crossing A → B → C in rapid succession (e.g., NIC driver → TCP → socket), instead of writing PKRU three times (disable A/enable B, disable B/enable C), compute the final PKRU state and write once. The intermediate states are unnecessary if no untrusted code executes between transitions.
- Cached PKRU shadow: a per-CPU shadow of the current PKRU value (stored in CpuLocalBlock, see Section 3.1.2). Before issuing WRPKRU, switch_domain() compares the desired value against the shadow. If identical, the instruction is skipped entirely. This is a single register comparison (~1 cycle) versus the ~23-cycle WRPKRU.
UmkaOS implementation — every domain switch goes through this trampoline. The
pkru_shadow is stored in CpuLocalBlock for single-instruction access.
No code path in the kernel issues WRPKRU outside this function. The
switch_domain() inline function:
#[inline(always)]
fn switch_domain(target_pkru: u32) {
let shadow = per_cpu::pkru_shadow();
if shadow != target_pkru {
// SAFETY: WRPKRU updates permission bits for all 16 MPK domains.
// target_pkru is computed from the domain allocation table and
// validated at driver load time — only valid permission sets are
// reachable. The caller runs with preemption disabled, so the shadow
// cannot go stale between the comparison and the write.
unsafe { x86::wrpkru(target_pkru) };
per_cpu::set_pkru_shadow(target_pkru);
}
}
Context switch coherence: On every context switch, the scheduler calls
arch::x86_64::isolation::save_pkru(prev_task) to save the outgoing task's PKRU
register value into prev_task.saved_pkru, then calls
arch::x86_64::isolation::restore_pkru(next_task) to load next_task.saved_pkru via
WRPKRU. The per-CPU CpuLocalBlock.isolation_shadow field
(Section 3.1.2) is
updated to next_task.saved_pkru atomically with the WRPKRU execution — the shadow
always reflects the actual hardware PKRU register value on this CPU. This invariant is
required for the validate_current_domain() fast path which reads the shadow without
executing RDPKRU. Any code path that issues WRPKRU outside switch_domain() or the
context switch save/restore functions is a bug: it would desync the shadow from the
hardware register, causing switch_domain() to skip necessary WRPKRU writes on
subsequent domain transitions.
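The coherence invariant can be made concrete with a small mock in which a plain field stands in for the hardware PKRU register; the names mirror the text but are not the actual kernel functions:

```rust
// Mock of the save/restore invariant described above: the per-CPU shadow
// must always equal the hardware PKRU. `hw_pkru` stands in for the real
// register; names are illustrative, not the actual kernel API.
struct Cpu {
    hw_pkru: u32,     // stands in for the hardware PKRU register
    pkru_shadow: u32, // CpuLocalBlock.isolation_shadow
}

struct Task {
    saved_pkru: u32,
}

impl Cpu {
    // switch_domain(): skip the hardware write when the shadow matches.
    // Returns true if a real WRPKRU was issued.
    fn switch_domain(&mut self, target: u32) -> bool {
        if self.pkru_shadow == target {
            return false; // WRPKRU elided
        }
        self.hw_pkru = target; // WRPKRU
        self.pkru_shadow = target; // shadow updated together with it
        true
    }

    // Context switch: save outgoing PKRU, restore incoming, and keep the
    // shadow coherent — a desync would make later switch_domain() calls
    // elide writes that are actually needed.
    fn context_switch(&mut self, prev: &mut Task, next: &Task) {
        prev.saved_pkru = self.hw_pkru;  // save_pkru(prev_task)
        self.hw_pkru = next.saved_pkru;  // restore_pkru(next_task): WRPKRU
        self.pkru_shadow = next.saved_pkru;
    }
}

fn main() {
    let mut cpu = Cpu { hw_pkru: 0xFFFF_FFFF, pkru_shadow: 0xFFFF_FFFF };
    let mut a = Task { saved_pkru: 0x0000_0003 };
    let b = Task { saved_pkru: 0x0000_000C };
    cpu.context_switch(&mut a, &b);
    assert_eq!(cpu.pkru_shadow, cpu.hw_pkru); // invariant holds
    assert!(!cpu.switch_domain(0x0000_000C)); // elided: already current
    assert!(cpu.switch_domain(0x0000_0003));  // real WRPKRU needed
}
```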
Guaranteed savings — on a typical TCP receive path (4 WRPKRU instructions in the
naive case: 2 boundary crossings × 2 switches each), shadow comparison eliminates 1-2
redundant writes (the intermediate transitions where permissions don't actually change).
At ~23 cycles per elided write, this saves ~23-46 cycles per packet — reducing TCP
path overhead from ~2% to ~1-1.5%. On NVMe paths, back-to-back domain transitions
(submit→complete with no intervening domain change) hit the shadow cache and skip the
second WRPKRU pair entirely, saving ~46 cycles.
Generalization to other architectures: The shadow-comparison pattern applies to every architecture's isolation register, not just x86 PKRU:
| Architecture | Register | Shadow location | Skip cost | Hardware write cost |
|---|---|---|---|---|
| x86-64 | PKRU (WRPKRU) | CpuLocalBlock.pkru_shadow | ~1 cycle (compare) | ~23 cycles |
| AArch64 | POR_EL0 (MSR) | CpuLocalBlock.por_shadow | ~1 cycle | ~40-80 cycles |
| ARMv7 | DACR (MCR p15) | CpuLocalBlock.dacr_shadow | ~1 cycle | ~10-20 cycles |
| PPC64 | Radix PID (mtspr) | CpuLocalBlock.rpid_shadow | ~1 cycle | ~30-60 cycles |
| PPC32 | Segment regs (mtsr) | CpuLocalBlock.sr_shadow[16] | ~1 cycle | ~10-30 cycles |
RISC-V has no isolation register (Tier 1 is not available on RISC-V) — shadow elision is not applicable. AArch64 uses POR_EL0 when POE hardware is present; on mainstream AArch64 without POE, the shadow tracks ASID/TTBR0 to elide redundant page-table switches. The shadow pattern provides the largest benefit on x86-64 and AArch64 POE, where the hardware write cost is highest relative to the comparison cost.
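A minimal sketch of that generalization: one shadowed wrapper over an architecture-specific register write, with a counting mock standing in for the hardware (the trait and type names are illustrative):

```rust
// Generic form of the shadow-elision pattern: any architecture's isolation
// register sits behind the same compare-before-write wrapper. Trait and
// type names are illustrative assumptions, not the real UmkaOS API.
trait IsolationReg {
    fn write_hw(&mut self, value: u32); // WRPKRU / MSR POR_EL0 / MCR DACR / mtspr
}

struct Shadowed<R: IsolationReg> {
    reg: R,
    shadow: u32, // must start equal to the actual hardware register value
}

impl<R: IsolationReg> Shadowed<R> {
    // Returns true if a hardware write was actually issued.
    fn set(&mut self, value: u32) -> bool {
        if self.shadow == value {
            return false; // ~1-cycle compare; hardware write elided
        }
        self.reg.write_hw(value); // ~23 cycles on x86, ~10-80 elsewhere
        self.shadow = value;
        true
    }
}

// Mock "hardware register" that counts real writes, for demonstration.
struct CountingReg {
    writes: u32,
}

impl IsolationReg for CountingReg {
    fn write_hw(&mut self, _value: u32) {
        self.writes += 1;
    }
}

fn main() {
    let mut r = Shadowed { reg: CountingReg { writes: 0 }, shadow: 0 };
    assert!(r.set(0xF));  // first transition: real write
    assert!(!r.set(0xF)); // same value: elided
    assert!(r.set(0x3));  // new value: real write
    assert_eq!(r.reg.writes, 2);
}
```

The payoff scales with the ratio of write cost to compare cost, which is why the table shows the largest benefit on x86-64 and AArch64 POE.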
10.2.6 Isolation on Other Architectures
Each supported architecture uses its best available fast isolation mechanism:
| Architecture | Mechanism | Switch Cost | Domains |
|---|---|---|---|
| x86_64 | MPK (WRPKRU) | ~23 cycles | 12 for drivers |
| AArch64 (mainstream) | Page-table + ASID | ~150-300 cycles | Unlimited |
| AArch64 + POE (ARMv8.9+/ARMv9.4+) | POE (MSR POR_EL0) | ~40-80 cycles | 7 usable (3 for drivers after infra deductions; see Section 23.4.3) |
| ARMv7 | DACR (MCR p15) | ~10-20 cycles | 15 usable |
| PPC32 | Segment registers (mtsr) | ~10-30 cycles | 15 usable |
| PPC64LE | Radix PID (mtspr PIDR) | ~30-60 cycles | Per-process |
| RISC-V 64 | None — Tier 1 unavailable | N/A | N/A |
Page-table + ASID isolation is the standard AArch64 mechanism and runs on all current ARM datacenter deployments: Graviton 2/3/4, Neoverse V1/V2, Ampere Altra, Kunpeng 920. POE (ARM FEAT_S1POE) is a hardware acceleration available on ARMv8.9+/ARMv9.4+ silicon (Neoverse V3+, Cortex-X4+) that reduces switch cost to ~40-80 cycles; it is an optional optimization, not the primary mechanism. When domain counts are exhausted, architectures with register-based isolation fall back to page-table switches. ARMv7 DACR is universally available on all Cortex-A cores and matches MPK in both cost and domain count.
10.2.6.1 Per-Architecture Mechanism Details
- aarch64: ARM Memory Domains (up to 16 domains via DACR on ARMv7) are not available on ARMv8/AArch64 in the same form. The standard AArch64 isolation mechanism is page-table-based domain isolation with ASID-preserving switches (~150-300 cycles per TTBR0_EL1 write + ISB + TLBI ASIDE1IS). This is what runs on all current deployed ARM servers: Graviton 2/3/4, Neoverse V1/V2, Ampere Altra, Kunpeng 920. On hardware with ARM FEAT_S1POE (optional from ARMv8.9/ARMv9.4, available on Neoverse V3+ and Cortex-X4+), UmkaOS activates the Permission Overlay Extension as an acceleration: POE provides 8 overlay indices (3 bits from PTE bits [62:60]), with index 0 reserved, giving 7 usable domains — fewer than x86 MPK's 12 driver domains. After infrastructure deductions (index 1: shared read-only, index 2: shared DMA, index 6: userspace, index 7: temporary/debug), only 3 indices remain for Tier 1 driver domains (indices 3-5); see Section 23.4.3 for the full AArch64 grouping scheme. Domain grouping is therefore much more aggressive on AArch64 when POE is active. POE is an optimization that reduces switch cost to ~40-80 cycles (~2-4x improvement); the system operates correctly without it using the page-table path.
- armv7: ARMv7 provides hardware Domain Access Control via the DACR register, supporting 16 memory domains (15 usable — domain 0 reserved for kernel). Each domain can be set to No Access, Client (checked against page permissions), or Manager (unchecked access) via a single MCR instruction to update DACR. This is the closest hardware analogue to x86 MPK on 32-bit ARM — a single privileged (MCR p15) register write switches domain permissions without TLB flushes. Unlike x86 WRPKRU (which is unprivileged and executable from Ring 3), DACR writes require PL1 — this is a security advantage: user-space code cannot forge domain switches.
- riscv64: RISC-V currently has no hardware isolation primitive suitable for Tier 1. SPMP (S-mode Physical Memory Protection) is only active when paging is disabled (satp.mode == Bare) and cannot be used in a kernel with virtual memory enabled. Smmtt (Supervisor Domain Access Protection) targets confidential computing, not MPK-style fast domain switching. Pointer Masking (Smnpm/Ssnpm, ratified Oct 2024) is not a domain isolation mechanism. Tier 1 isolation is not available on RISC-V. Tier 1 drivers on RISC-V platforms are promoted to Tier 0 (in-kernel, statically linked, fully trusted) — the same model as Linux. This is an accepted hardware constraint, not a design flaw. Tier 2 (Ring 3 + IOMMU) remains available on RISC-V for untrusted drivers where isolation is required. When RISC-V ISA extensions provide suitable isolation primitives (e.g., future Smpmp or custom domain extensions), UmkaOS will support Tier 1 on RISC-V without requiring architectural changes — the driver model is designed for this upgrade path.
- ppc32: PPC32 uses segment registers for memory domain isolation. The 32-bit PowerPC architecture provides 16 segment registers (SR0–SR15), each controlling access to a 256 MB virtual address region. Updating a segment register via mtsr is a single supervisor-mode instruction with low overhead (~10-30 cycles). When segments are insufficient, UmkaOS falls back to page-table-based isolation.
- ppc64le: PPC64LE on POWER9+ uses the Radix MMU with partition table entries (process table / PID) for isolation. On POWER8, the Hashed Page Table (HPT) with LPAR (Logical Partitioning) provides hardware-assisted isolation. The Radix MMU's PID-based isolation switches via mtspr PIDR (~30-60 cycles). HPT fallback uses full page table switches (~200-400 cycles).
10.2.6.2 Per-Architecture Isolation Cost Analysis
The x86_64 MPK WRPKRU instruction provides ~23-cycle domain switches (measured on
Skylake-class server cores; varies by microarchitecture — see Section 18.7.8
for full range: 11 cycles on Alder Lake, up to 260 cycles on Atom). Other architectures
use different mechanisms with different cost profiles:
| Architecture | Mechanism | Domain Switch Cost | Domains | Notes |
|---|---|---|---|---|
| x86_64 | MPK (WRPKRU) |
~23 cycles | 12 for drivers | 16 total keys. PKEY 0 (core), 1 (shared descriptors), 14 (shared DMA), 15 (guard) reserved for infrastructure. |
| x86_64 (no MPK) | Page table switch + ASID | ~200-400 cycles | Unlimited | Used when MPK unavailable (pre-Skylake). Full CR3 write + TLB management. |
| aarch64 (mainstream) | Page table switch + ASID | ~150-300 cycles | Unlimited | Standard mechanism on all current ARM servers: Graviton 2/3/4, Neoverse V1/V2, Ampere Altra, Kunpeng 920. TTBR0_EL1 write + ISB + TLBI ASIDE1IS. |
| aarch64 + POE (ARMv8.9+/ARMv9.4+) | MSR POR_EL0 + ISB |
~40-80 cycles | 7 usable | Optional acceleration: ARM FEAT_S1POE available on Neoverse V3+, Cortex-X4+. ISB barrier required (~20-40 cycles). Provides ~2-4x improvement over page-table path. |
| aarch64 + MTE | (not viable for domain isolation) | N/A | N/A | MTE assigns 4-bit tags per 16-byte granule, but tags are compared per-pointer — no single-register switch exists. Valuable for memory safety, not domain isolation. |
| armv7 | DACR (MCR p15) |
~10-20 cycles | 15 usable | Single MCR p15, 0, Rd, c3, c0, 0 writes all 16 domain permissions. No barrier required. Comparable to MPK cost. |
| armv7 (fallback) | Page table switch + CONTEXTIDR | ~150-300 cycles | Unlimited | MCR to TTBR0 + ISB + TLBI. Similar cost profile to aarch64 page-table path. |
| riscv64 | Tier 1 not available | N/A — Tier 1 drivers run as Tier 0 | N/A | No suitable hardware isolation exists with paging enabled. SPMP requires paging disabled; Smmtt targets confidential computing. Tier 1 drivers are promoted to Tier 0 (no isolation overhead). |
| ppc32 | Segment registers (mtsr) | ~10-30 cycles | 15 usable | Single mtsr instruction per 256 MB segment. No barrier required. Comparable to armv7 DACR cost. |
| ppc32 (fallback) | Page table switch | ~200-400 cycles | Unlimited | Full TLB invalidation + page table base update. |
| ppc64le (Radix) | PID switch (mtspr PIDR) | ~30-60 cycles | Process-table scoped | POWER9+ Radix MMU. mtspr PIDR + isync. ~2-3x MPK cost. |
| ppc64le (HPT) | HPT + LPAR switch | ~200-400 cycles | Unlimited | POWER8 Hashed Page Table. tlbie + table update. |
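The dispatch implied by this table can be sketched as a boot-time capability probe. This is an illustrative sketch only: the names `IsolationMechanism`, `HwCaps`, and `select_tier1_mechanism` are hypothetical, not UmkaOS APIs, and the capability flags stand in for the real CPUID/ID-register probes.

```rust
// Hypothetical sketch of boot-time Tier 1 mechanism selection.
// All names here are illustrative, not UmkaOS APIs.

#[derive(Debug, PartialEq)]
enum IsolationMechanism {
    Mpk,       // x86_64 WRPKRU, ~23 cycles
    Poe,       // aarch64 FEAT_S1POE, ~40-80 cycles
    Dacr,      // armv7 domain access control, ~10-20 cycles
    Segments,  // ppc32 mtsr, ~10-30 cycles
    RadixPid,  // ppc64le POWER9+ Radix, ~30-60 cycles
    PageTable, // generic fallback, ~150-400 cycles
    None,      // riscv64: Tier 1 promoted to Tier 0
}

struct HwCaps {
    arch: &'static str,
    has_mpk: bool,   // stand-in for the x86 PKU CPUID probe
    has_poe: bool,   // stand-in for the aarch64 ID-register probe
    radix_mmu: bool, // ppc64le: POWER9+ Radix vs POWER8 HPT
}

fn select_tier1_mechanism(caps: &HwCaps) -> IsolationMechanism {
    use IsolationMechanism::*;
    match caps.arch {
        "x86_64" => if caps.has_mpk { Mpk } else { PageTable },
        "aarch64" => if caps.has_poe { Poe } else { PageTable },
        "armv7" => Dacr,
        "ppc32" => Segments,
        "ppc64le" => if caps.radix_mmu { RadixPid } else { PageTable },
        // No suitable intra-address-space mechanism: Tier 1 runs as Tier 0.
        "riscv64" => None,
        _ => PageTable,
    }
}
```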
Impact on performance budget — The Section 1.2 overhead analysis uses x86_64 MPK (~23 cycles per switch, ~92 cycles per I/O round-trip):
| Architecture | Overhead per NVMe 4KB read | Overhead per TCP RX |
|---|---|---|
| x86_64 MPK | +1% (92 cycles / 10μs) | +2% (~92 cycles / 5μs, with NAPI batching; naive per-packet is ~17-26%, see Section 15.1.7) |
| aarch64 page-table (mainstream) | +6-12% (600-1200 cycles / 10μs) | +12-24% (600-1200 cycles / 5μs) |
| aarch64 + POE (ARMv8.9+/ARMv9.4+) | +2-3% (160-320 cycles / 10μs) | +3-6% (160-320 cycles / 5μs) |
| armv7 DACR | +0.5-1% (40-80 cycles / 10μs) | +1-2% (40-80 cycles / 5μs) |
| riscv64 | N/A — Tier 1 not available; Tier 1 drivers run as Tier 0 (zero isolation overhead, same as Linux) | N/A |
| ppc32 segments | +0.5-1% (40-120 cycles / 10μs) | +1-2% (40-120 cycles / 5μs) |
| ppc64le Radix | +1-2% (120-240 cycles / 10μs) | +2-5% (120-240 cycles / 5μs) |
For armv7 with DACR and ppc32 with segment registers, the overhead is comparable to or better than x86 MPK. For aarch64 with POE and ppc64le with Radix PID, the overhead remains within the 5% budget for storage workloads. On mainstream AArch64 (page-table path), Tier 1 overhead reaches 6-12%, which exceeds the 5% budget for I/O-heavy workloads; administrators can promote performance-critical drivers to Tier 0 or demote to Tier 2 as appropriate.
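The budget arithmetic behind these percentages can be reproduced with a small helper. This sketch treats 1 cycle as roughly 1 ns (so a 10 μs operation is a ~10,000-cycle budget), which is the normalization the table's figures imply rather than a stated assumption of the design; the function names are illustrative.

```rust
// Sketch of the overhead arithmetic used in the tables above.
// Two domain crossings per call, two calls per I/O round-trip,
// so a 23-cycle switch costs ~92 cycles per round-trip.
fn round_trip_overhead_cycles(switch_cost: u64) -> u64 {
    switch_cost * 4
}

/// Overhead as a percentage of the operation's cycle budget
/// (e.g. a 10 us NVMe read ~ 10_000 cycles at ~1 cycle/ns).
fn overhead_percent(switch_cost: u64, op_budget_cycles: u64) -> f64 {
    round_trip_overhead_cycles(switch_cost) as f64 / op_budget_cycles as f64 * 100.0
}
```

With `switch_cost = 23` and a 10,000-cycle budget this reproduces the ~1% x86_64 MPK figure; the aarch64 page-table rows follow from switch costs of 150-300 cycles.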
ARM server reality — Page-table + ASID isolation (~150-300 cycles) is the mechanism that runs on nearly all currently deployed ARM servers. FEAT_S1POE is optional from ARMv8.9/ARMv9.4. Current mainstream datacenter cores — Neoverse V2 (ARMv9.0, AWS Graviton 4, Google Axion), Neoverse V3 (ARMv9.2, AWS Graviton 5, Azure Cobalt 200), Ampere Altra (ARMv8.2), and Kunpeng 920 (ARMv8.2) — do not implement POE. The page-table path is not a fallback; it is the standard operating mode for AArch64. POE is a hardware acceleration that becomes available on ARMv8.9+/ARMv9.4+ silicon (Neoverse V3+, Cortex-X4+) and reduces per-switch cost by ~2-4x when present.
RISC-V reality — RISC-V currently has no hardware isolation mechanism suitable for Tier 1. Tier 1 isolation is not available on RISC-V; all Tier 1 drivers run as Tier 0 (in-kernel, fully trusted, zero isolation overhead). The 5% overhead budget applies to operations that do run — without the Tier 1 isolation layer, there is no overhead to measure on that path. Tier 2 remains available for drivers where isolation is required. When RISC-V ISA extensions provide suitable isolation primitives, UmkaOS will support Tier 1 on RISC-V without architectural changes to the driver model.
10.2.7 Adaptive Isolation Policy (Graceful Degradation)
UmkaOS targets six architectures with fundamentally different isolation capabilities. The design philosophy: use the best isolation the hardware provides; when the hardware provides nothing, degrade gracefully — don't refuse to run. This mirrors Linux's approach to every hardware feature.
Three boot-time modes, selectable via the `umka.isolation=` kernel parameter or runtime sysfs:

- `strict` (default when fast isolation is available): All Tier 1 drivers run in hardware-isolated domains. Full isolation at ~23-80 cycle cost per switch (register-based) or ~150-300 cycles (page-table, AArch64 mainstream).
- `degraded` (default on AArch64 mainstream): Page-table isolation operates the three-tier model with ~150-300 cycle overhead per crossing. This is the normal operating mode for current ARM server deployments, not a degraded state.
- `performance`: Tier 1 drivers promoted to Tier 0 — zero boundary-crossing overhead, matching Linux exactly. IOMMU DMA fencing and capability checks remain active. Appropriate for I/O-heavy workloads where the page-table path overhead is unacceptable.
On RISC-V, the adaptive policy always selects Tier 0 for all Tier 1 drivers — Tier 1 isolation is not available on RISC-V due to hardware capability limitations. This is not a performance mode selection; it is a platform capability constraint that will be resolved when RISC-V hardware provides suitable isolation primitives.
Per-driver overrides are available via driver manifests (the `no_fast_isolation` policy): drivers can individually choose `promote_tier0`, `page_table` (the default), or `demote_tier2` regardless of the global mode.
10.2.7.1 Performance Mode Details
On hardware without fast isolation, Tier 1 drivers are promoted to Tier 0 — they run in the same protection domain as umka-core with zero boundary-crossing overhead. Performance matches Linux exactly. The system logs a prominent warning:
umka: isolation=performance: Tier 1 drivers running WITHOUT memory isolation
umka: Driver crashes may cause kernel panic (same as Linux monolithic behavior)
umka: IOMMU DMA fencing is still active — DMA isolation preserved
Key properties of performance mode:

- IOMMU DMA fencing remains active — even without MPK memory isolation, DMA operations are still restricted to driver-allocated regions.
- Crash recovery is best-effort — without memory isolation, a crashing driver may corrupt umka-core state, making recovery impossible.
- Capability system still enforced — the software-level capability model remains active. Only the memory enforcement is relaxed.
- Security model partially degraded — a malicious driver could exploit the shared address space. This mode is appropriate for trusted environments with known drivers.
Per-driver tier pinning via driver manifest:
```toml
# umka-nvme driver manifest
[driver]
name = "umka-nvme"
preferred_tier = 1
minimum_tier = 1

# Override: on hardware without MPK, run this driver as Tier 0
# instead of using the slow page-table fallback
[driver.isolation_fallback]
no_fast_isolation = "promote_tier0"  # "promote_tier0" | "page_table" | "demote_tier2"
```
Options per driver:
- promote_tier0: run in Tier 0 (fast, no isolation) — for performance-critical drivers
- page_table: use page-table fallback (slow, but isolated) — default
- demote_tier2: move to Tier 2 userspace (full process isolation) — for untrusted or
crash-prone drivers
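The fallback resolution these options describe can be sketched as follows. The types and the `resolve_tier` function are illustrative, not the registry implementation; the policy names mirror the manifest values above.

```rust
// Illustrative sketch of per-driver tier resolution -- not UmkaOS registry code.

#[derive(Debug, PartialEq, Clone, Copy)]
enum Tier { Tier0, Tier1, Tier2 }

#[derive(Debug, PartialEq)]
enum FallbackPolicy { PromoteTier0, PageTable, DemoteTier2 }

/// Resolve the tier a driver actually runs in, given its manifest
/// preference, its no_fast_isolation policy, and whether the platform
/// has a fast (register-based) isolation mechanism.
fn resolve_tier(preferred: Tier, fast_isolation: bool, policy: FallbackPolicy) -> Tier {
    if preferred != Tier::Tier1 || fast_isolation {
        return preferred; // fast hardware isolation available: run as requested
    }
    match policy {
        FallbackPolicy::PromoteTier0 => Tier::Tier0, // fast, no isolation
        FallbackPolicy::PageTable => Tier::Tier1,    // slow page-table domains (default)
        FallbackPolicy::DemoteTier2 => Tier::Tier2,  // full process isolation
    }
}
```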
Historical context — Apple's transition from kexts (Ring 0, no isolation) to DriverKit (userspace, full isolation) took 5 years. UmkaOS's approach is more nuanced: rather than a binary choice between "fast and dangerous" and "safe and slow," hardware-assisted isolation (MPK, POE, DACR, segments, Radix PID) provides a third option — "fast and safe" — on modern hardware. The adaptive isolation policy ensures UmkaOS remains viable on older hardware by honestly trading off isolation for performance when the hardware cannot support both simultaneously.
Isolation is one of eight core capabilities, not the only one. Even on hardware without fast isolation (RISC-V where Tier 1 is unavailable, older x86 without MPK), UmkaOS still provides: driver crash recovery (best-effort in Tier 0, full in Tier 2), distributed kernel primitives, heterogeneous compute management, structured observability, power budgeting, post-quantum security, live kernel evolution, and a stable driver ABI. A RISC-V server operating without Tier 1 isolation retains all seven other capabilities. This is a hardware-imposed constraint, not a design failure — and it is resolved when suitable RISC-V hardware becomes available.
10.4 Driver Isolation Tiers
10.4.1 Tier Classification
Tier 0 has two sub-forms: static (compiled into the kernel binary) and loadable (dynamically loaded but running in the Core domain with no isolation). Both are Tier 0 in the trust and crash-consequence sense; they differ in deployment. See Section 10.4.2 for details.
| Property | Tier 0 Static | Tier 0 Loadable | Tier 1 | Tier 2 |
|---|---|---|---|---|
| Location | Compiled into kernel binary | Ring 0, Core domain, dynamically loaded | Ring 0, dynamically loaded, domain-isolated | Ring 3, separate process |
| KABI transport | Direct vtable call (T0) | Direct vtable call (T0) | Ring buffer (T1) | Ring buffer (T2) |
| Isolation | None | None (same address space) | Hardware memory domains + IOMMU | Full address space + IOMMU |
| Crash behavior | Kernel panic | Kernel panic | Reload module (~50-150ms, design target) | Restart process (~10ms) |
| DMA access | Unrestricted | Unrestricted | IOMMU-fenced | IOMMU-fenced |
| Performance | Zero overhead | ~2–5 cycles (vtable dispatch) | ~23 cycles domain switch + marshaling (x86 MPK) | ~200-500 cycles per crossing |
| Trust level | Maximum (core kernel) | Maximum (signed, sealed index) | High (verified, signed) | Low (untrusted acceptable) |
| Unloadable | No (static) | No (`load_once: true`) | Yes (domain revocation) | Yes (process exit) |
| Examples | APIC, timer, early console, Core allocator | SCSI mid-layer, MDIO bus, SPI bus core, cfg80211 framework, V4L2 core | NVMe, NIC, TCP/IP, FS, GPU, KVM, audio (default), WiFi driver | USB, input, BT, audio (optional demotion), HID |
Tier 1 isolation mechanism per architecture:
The "hardware memory domains" used for Tier 1 isolation are architecture-specific. Not all architectures have a fast isolation mechanism; RISC-V has none at all and runs Tier 1 drivers as Tier 0. See Section 10.2.6.2 for per-architecture cycle costs and Section 10.2.7 for the adaptive policy.
| Architecture | Tier 1 Mechanism | Switch Cost | Domains | Availability |
|---|---|---|---|---|
| x86-64 | MPK (WRPKRU) | ~23 cycles | 12 usable | Intel Skylake+ / AMD Zen 3+ |
| x86-64 (no MPK) | Page table + ASID | ~200-400 cycles | Unlimited | All x86-64 |
| AArch64 (mainstream) | Page table + ASID | ~150-300 cycles | Unlimited | All AArch64 — standard mechanism on Graviton 2/3/4, Neoverse V1/V2, Ampere Altra, Kunpeng 920 |
| AArch64 + POE (ARMv8.9+/ARMv9.4+) | POE (MSR POR_EL0 + ISB) | ~40-80 cycles | 7 usable (3 for drivers after infra deductions; see Section 23.4.3) | Optional acceleration: FEAT_S1POE on Neoverse V3+, Cortex-X4+ |
| ARMv7 | DACR (MCR p15) | ~10-20 cycles | 15 usable | All ARMv7 (universal) |
| RISC-V 64 | Tier 1 not available — Tier 1 drivers run as Tier 0 | N/A (no isolation boundary) | N/A | Hardware capability not yet available on any RISC-V silicon |
| PPC32 | Segment registers (mtsr) | ~10-30 cycles | 15 usable | All PPC32 |
| PPC64LE (POWER9+) | Radix PID (mtspr PIDR) | ~30-60 cycles | Process-scoped | POWER9+ with Radix MMU |
| PPC64LE (POWER8) | HPT + LPAR | ~200-400 cycles | Unlimited | POWER8 |
On RISC-V 64, Tier 1 isolation is not available. As of early 2026, no ratified RISC-V
extension provides a suitable intra-address-space isolation mechanism with paging
enabled (SPMP requires paging disabled; Smmtt targets confidential computing; Pointer
Masking Smnpm/Ssnpm, ratified Oct 2024, is not a domain isolation mechanism).
All Tier 1 drivers on RISC-V are promoted to Tier 0 — they run in-kernel with no
hardware isolation boundary, identical to the Linux monolithic driver model. Tier 2
(Ring 3 + IOMMU) remains available for RISC-V drivers where isolation is required.
When RISC-V hardware provides suitable isolation primitives, UmkaOS will activate Tier 1
on RISC-V without requiring changes to the driver model or driver manifests.
10.4.2 Tier 0: Boot-Critical and Core Framework Code
Tier 0 encompasses all kernel code that runs in Ring 0 inside the Core memory domain, with no hardware isolation boundary between it and the static kernel binary. A crash in any Tier 0 code — static or loadable — causes a kernel panic. Tier 0 is split into two deployment forms.
10.4.2.1 Tier 0 Static
Compiled directly into the kernel binary. Required before any dynamic loading infrastructure is available:
- Local APIC and I/O APIC
- PIT/HPET/TSC timer
- Early serial/VGA console
- ACPI table parsing (early boot only). Security trade-off: ACPI tables are firmware-provided data that the kernel must trust at boot. A malicious or buggy BIOS can supply corrupt ACPI tables (malformed AML, overlapping MMIO regions, impossible NUMA topologies). UmkaOS's Tier 0 ACPI parser performs defensive parsing: all table lengths are bounds-checked, AML interpretation uses a sandboxed evaluator with a cycle limit (no infinite loops), and MMIO regions claimed by ACPI are validated against the e820/UEFI memory map before being mapped. Despite these defenses, ACPI parsing remains the largest attack surface in Tier 0. The firmware quirk framework (Section 10.5.11.6) provides per-platform overrides for known-buggy tables.
Tier 0 static code is held to the highest review standard and kept minimal. Only code that is genuinely required before the module loader and isolation infrastructure are operational belongs here.
10.4.2.2 Tier 0 Loadable Modules
Dynamically loaded into the Core domain after the module loader initialises, but before or during device enumeration. Tier 0 loadable modules:
- Run in Ring 0 in the same memory domain as static Core
- Communicate with static Core and other Tier 0 modules via direct vtable calls (Transport T0, Section 11.1.8) — not ring buffers
- Are loaded by the kernel-internal module loader (Section 11.1.9.6) without requiring userspace
- Are never unloaded (`load_once: true`) — safe Tier 0 unloading would require auditing every execution context in the kernel for stale function pointers, which is not tractable
- Declare their `requires` and `provides` dependencies in their `.kabi` file (Section 11.1.9.2)
- Are transparent to Tier 1 callers — from Tier 1's perspective, calling a Tier 0 loadable service is identical to calling static Core
When to use Tier 0 loadable vs Tier 1: Use Tier 0 loadable for kernel framework modules that provide services to many other drivers and whose correctness can be established by inspection — bus frameworks (SCSI mid-layer, MDIO, SPI bus core), protocol framework layers (cfg80211, V4L2 core, media controller), and subsystem registries (backlight, NVMEM, rfkill). These are trusted, signed, reviewed code that is too large to statically link but too fundamental to pay Tier 1 ring buffer overhead on every call. Use Tier 1 for hardware drivers and any code that benefits from crash containment — if a bug is plausible, it should be Tier 1.
10.4.3 Tier 1: Kernel-Adjacent Drivers (Hardware Memory Domain Isolated)
Tier 1 is crash containment, not a security boundary. Hardware memory domain isolation (MPK, POE, DACR) prevents a buggy driver from corrupting kernel memory by accident. It does not prevent a compromised driver from escaping deliberately — on x86-64, `WRPKRU` is unprivileged and any Ring 0 code can execute it. This is a known architectural property of Intel MPK, not an UmkaOS design flaw, and is documented as an intentional tradeoff. Tier 2 (Ring 3 + IOMMU) is the security boundary for untrusted or third-party drivers. Do not flag Tier 1 MPK escape as a vulnerability — the threat model is stated and the mitigation is Tier 2. See Section 10.4.5 for Tier 2.
Performance-critical drivers run in Ring 0 but are isolated via hardware memory domains (MPK on x86-64, DACR on ARMv7, segment registers on PPC32, Radix PID on PPC64LE, page-table + ASID on AArch64 mainstream, POE on AArch64 ARMv8.9+/ARMv9.4+ when available — see "Tier 1 isolation mechanism per architecture" table above). Note: Tier 1 isolation is not available on RISC-V; Tier 1 drivers on RISC-V run as Tier 0. Each driver (or driver group) that does have Tier 1 isolation is assigned a protection domain. The driver can only access:
- Its own private memory (tagged with its domain key)
- Shared ring buffers (tagged with the shared domain, read-write)
- Shared DMA buffers (tagged with DMA domain, read-write)
- Its MMIO regions (mapped with its domain key)
It cannot access:

- UmkaOS Core private memory
- Other Tier 1 drivers' private memory
- Page tables, capability tables, or scheduler state
- Arbitrary physical memory
Security limitation: Tier 1 isolation protects against bugs, not exploitation.
On x86-64, MPK isolation uses the WRPKRU instruction, which is unprivileged --
any Ring 0 code (including Tier 1 driver code) can execute it to modify its own
domain permissions and access any MPK-protected memory, including UmkaOS Core (PKEY 0).
This means a compromised Tier 1 driver with arbitrary code execution can trivially
bypass MPK isolation. On ARMv7, MCR to DACR is privileged (PL1), which is stronger
-- user-space cannot forge domain switches, but kernel-mode drivers still can. On
PPC32 and PPC64LE, segment register and AMR updates are similarly supervisor-mode.
Tier 1 threat model: MPK (and its architectural equivalents) provides defense against accidental corruption -- buffer overflows, use-after-free, null dereferences that happen to write to the wrong address. It does not defend against deliberate exploitation where an attacker achieves arbitrary code execution within a Tier 1 driver and intentionally escapes the domain. For the exploitation case, Tier 2 (full process isolation in Ring 3) is the appropriate boundary.
Tier 1 trust requirement: Tier 1 drivers run in Ring 0 with only domain isolation (not address space isolation). They must be treated as trusted code: cryptographically signed, manifest-verified (Section 1.2), and subject to the same security review standard as Core kernel code. Tier 1 is not appropriate for third-party, untrusted, or unaudited drivers. Untrusted drivers must use Tier 2 (Ring 3 process isolation) where a compromised driver cannot escalate to kernel privilege regardless of the exploit technique. See Section 10.4.8 (Signal Delivery Across Isolation Boundaries) for the complete domain crossing specification during signal handling.
Mitigations that raise the bar for exploitation are detailed in Section 10.2
("WRPKRU Threat Model: Unprivileged Domain Escape"): binary scanning for unauthorized
WRPKRU/XRSTOR instructions at load time, W^X enforcement on driver code pages,
forward-edge CFI (Clang -fsanitize=cfi-icall), and the NMI watchdog for detecting
PKRU state mismatches.
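The binary-scanning mitigation can be illustrated with a byte-pattern search: WRPKRU encodes as the 3-byte sequence 0F 01 EF. The function name below is hypothetical, and a production scanner must decode instruction boundaries (a naive byte scan can both miss sequences and flag bytes inside an unrelated instruction's immediate); this shows only the pattern search itself.

```rust
// Illustrative load-time scan for the WRPKRU opcode (0F 01 EF).
// XRSTOR (0F AE /5) would be handled similarly but needs ModRM decoding.
const WRPKRU: [u8; 3] = [0x0f, 0x01, 0xef];

/// Return the offsets of every candidate WRPKRU byte sequence in `text`.
fn scan_for_wrpkru(text: &[u8]) -> Vec<usize> {
    let mut hits = Vec::new();
    for (i, w) in text.windows(WRPKRU.len()).enumerate() {
        if w == WRPKRU {
            hits.push(i); // candidate only: must be confirmed on decoded boundaries
        }
    }
    hits
}
```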
Tier 0 fast path: On RISC-V (where Tier 1 is not available), POWER8, or when `isolation=performance` promotes all drivers to Tier 0, MPK-specific mitigations are automatically skipped:

- Binary scanning for `WRPKRU`/`XRSTOR`: skipped (no MPK → no WRPKRU exploit).
- NMI PKRU watchdog: disabled (no PKRU state to verify).
- W^X enforcement and forward-edge CFI remain active — these defend against code injection and control-flow hijacking regardless of isolation tier and are standard hardening measures, not isolation-specific overhead.
Future: PKS (Protection Keys for Supervisor) -- Intel's PKS extension provides
supervisor-mode protection keys that are controlled via MSR writes (privileged
operations that require Ring 0 + CPL 0 MSR access). Unlike WRPKRU (which any Ring 0
code can execute), PKS key modifications go through WRMSR to IA32_PKS, which can
be trapped by a hypervisor or controlled by umka-core. When PKS-capable hardware is
available, UmkaOS will use PKS for Tier 1 isolation, closing the unprivileged-WRPKRU
escape path. PKS is available on Intel Sapphire Rapids and later server CPUs.
10.4.3.1 VirtIO Device Hosting
VirtIO devices in UmkaOS run as Tier 1 drivers (Ring 0, hardware memory domain isolated). Rationale: VirtIO devices are almost always used in virtualized environments where high-throughput I/O is required; Tier 1 gives them direct access to the network and block stacks without ring-crossing overhead, while the MPK/POE/DACR isolation boundary still contains crashes.
- The VirtIO transport layer (PCI or MMIO config space, virtqueue management) is implemented inside the Tier 1 driver domain.
- Virtqueues (split or packed ring format) are backed by `RingBuffer<VirtqDesc>` — the same infrastructure used for other UmkaOS driver rings, providing unified memory accounting across all device types.
- The Linux VirtIO userspace API (vhost-user, vDPA) is surfaced through UmkaOS's compat layer unchanged — guest VMs and containers see standard VirtIO PCI/MMIO devices.
- Tier 2 option: a VirtIO device MAY be hosted as Tier 2 (full userspace process) via vhost-user if the operator prioritizes fault isolation over latency; this adds approximately 5–15 μs of ring-crossing overhead per batch.
10.4.4 Protection Key Exhaustion (Hardware Domain Limit)
Intel MPK provides only 16 protection keys (PKEY 0-15). With PKEY 0 reserved for UmkaOS Core, PKEY 1 for shared read-only descriptors, PKEY 14 for shared DMA, and PKEY 15 as guard, only 12 keys (PKEY 2-13) are available for Tier 1 driver domains (see Section 10.2, "MPK Domain Allocation"). This limits the number of independently isolated Tier 1 drivers to 12 on x86-64 with MPK. Architectures with equivalent mechanisms (AArch64 POE: 7 usable domains, ARMv7 DACR: 15 usable domains, PPC32 segments: 15 usable) face the same constraint. This is a hard hardware limit that cannot be worked around without changing the isolation granularity. PPC64LE (Radix PID) uses process-scoped isolation without a fixed small domain budget, so domain exhaustion does not apply there — but it pays higher per-switch costs (see the Section 10.2.6.2 cost table). RISC-V has no Tier 1 isolation at all; domain exhaustion does not apply.
When domains are exhausted (more concurrent Tier 1 drivers than available hardware domains — 12 on x86 MPK, 7 on AArch64 POE, 15 on ARMv7 DACR, 15 on PPC32 segments), UmkaOS applies three strategies in priority order:
1. Domain grouping (default): Related drivers share a protection key. For example, all block storage drivers (NVMe, AHCI, virtio-blk) share one key, all network drivers (NIC, TCP/IP stack) share another. Grouping reduces isolation granularity -- a bug in one block driver can corrupt another block driver's memory within the same group -- but preserves isolation between groups (network cannot corrupt storage). Grouping policy is configurable via the driver manifest:

   ```toml
   [driver.isolation]
   isolation_group = "block"  # Share isolation domain with other "block" group drivers
   ```

2. Automatic Tier 2 demotion: Drivers below a configurable priority threshold are demoted to Tier 2 (process isolation) when all hardware isolation domains are consumed. Only the most performance-critical drivers retain Tier 1 placement. The priority is determined by `match_priority` in the driver manifest -- higher priority retains Tier 1.

3. Domain virtualization (future): On context switch, the scheduler can save and restore the isolation domain register (PKRU on x86, POR_EL0 on AArch64, DACR on ARMv7, segment registers on PPC32) along with a remapped domain assignment table, allowing more logical domains than hardware provides by time-multiplexing physical domains. Domain virtualization adds overhead to context switches: ~50-100 cycles on the warm-cache fast path (WRPKRU ~20 cycles plus an L1-resident domain table lookup), with cold-cache misses adding ~100-200 cycles to the domain table access. It is used only when strategies 1 and 2 are insufficient. This is a future optimization -- domain grouping and Tier 2 demotion handle all current deployment scenarios.
POE + ASID domains (AArch64 systems with POE support)

On AArch64 systems with Permission Overlay Extensions (ARMv8.9+ / FEAT_S1POE):

- Each Tier 1 driver domain is assigned a POE domain (POR_EL0 register field, up to 8 domains).
- Domain switch: `MSR POR_EL0, x0` (single instruction, ~40-80 cycles, no TLB flush).
- Domain assignment: PKEY 0 = UmkaOS Core private, PKEYs 1-6 = Tier 1 driver domains, PKEY 7 = shared DMA pool. (POE supports 8 domains -- half of x86 MPK's 16.)
- Fallback: If the hardware supports POE but a driver requires exclusive ASID isolation (e.g., a cryptographic device handling key material), the page-table + ASID path is used for that driver even on POE-capable hardware. The driver registers `require_asid_isolation: true` in its `.kabi` manifest.
- Combined POE+ASID: For the highest isolation guarantee on ARMv8.9+, use both: POE for fast memory-domain switching plus a dedicated ASID for the driver domain. This prevents both memory domain escapes (POE) and TLB side-channel attacks (ASID). Cost: ~80-150 cycles per domain switch (ASID flush + POE switch); used for Tier 1 drivers handling sensitive key material.
- Detection: POE availability is checked at boot via `ID_AA64MMFR3_EL1.S1POE != 0`. Exposed via `IsolationCapabilities::poe_available: bool` to the driver subsystem.
When domain grouping is applied, the kernel logs a warning (umka: isolation domain exhausted, grouping {driver_a} with {driver_b}) and exposes the current domain allocation in /sys/kernel/umka/isolation/domains for admin visibility.
Practical impact: A typical server has 5-8 performance-critical driver types (NVMe, NIC, TCP/IP, filesystem, GPU, KVM, virtio, crypto). With grouping, these fit within the hardware domain budget on x86 (12 domains), ARMv7 (15), and PPC32 (15) with room to spare. On AArch64 with POE (7 total usable indices, of which only 3 are available for Tier 1 drivers after infrastructure reservations — see Section 23.4.3 in 23-roadmap.md for the full index allocation), a typical 5-8 driver configuration requires at least one grouping (e.g., NVMe + filesystem share a domain). Systems with unusually many distinct Tier 1 drivers (e.g., multi-vendor NIC + storage + GPU + FPGA configurations) trigger Tier 2 demotion for the lowest-priority drivers.
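Strategies 1 and 2 compose into a single allocation pass, sketched below. The types, the key numbering, and `assign_domains` are illustrative, not the registry's allocator; the real system also honors per-driver manifest pinning.

```rust
use std::collections::HashMap;

// Illustrative sketch of domain-budget strategies 1 and 2: drivers in the
// same isolation_group share one protection key; when keys run out, the
// lowest-priority remaining drivers are demoted to Tier 2.

struct DriverReq<'a> {
    name: &'a str,
    group: &'a str, // manifest [driver.isolation] isolation_group
    priority: u32,  // manifest match_priority; higher keeps Tier 1
}

/// Returns (driver -> key) for Tier 1 placements and the demoted drivers.
fn assign_domains<'a>(
    mut drivers: Vec<DriverReq<'a>>,
    keys_available: usize,
) -> (HashMap<&'a str, usize>, Vec<&'a str>) {
    // Highest priority first, so demotion trims from the bottom.
    drivers.sort_by(|a, b| b.priority.cmp(&a.priority));
    let mut group_key: HashMap<&str, usize> = HashMap::new();
    let mut placed = HashMap::new();
    let mut demoted = Vec::new();
    for d in &drivers {
        if let Some(&k) = group_key.get(d.group) {
            placed.insert(d.name, k); // strategy 1: share the group's key
        } else if group_key.len() < keys_available {
            let k = group_key.len();
            group_key.insert(d.group, k);
            placed.insert(d.name, k);
        } else {
            demoted.push(d.name); // strategy 2: Tier 2 demotion
        }
    }
    (placed, demoted)
}
```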
Long-term trajectory: the domain budget pressure diminishes as devices become
peers — but this is a multi-year ecosystem shift, not a near-term fix. The
devices that consume the most Tier 1 domain slots today — GPU (~700K lines of
handwritten driver code, excluding auto-generated headers), high-end NIC/DPU (~150K lines), and high-throughput storage
controllers — are exactly the devices most suited to become UmkaOS multikernel
peers (Section 5.1.2.2). When a device runs its own UmkaOS kernel and participates as a
cluster peer, it is handled entirely by umka-peer-transport (~2K lines) and
consumes zero MPK domains; it exits the Tier 1 population entirely and is
contained by the IOMMU hard boundary instead.
However, UmkaOS cannot assume vendor adoption. Rewriting device firmware to implement UmkaOS message passing requires vendor investment, ecosystem tooling, and standardization effort that will take years to mature. For the foreseeable future, most devices will continue to use traditional Tier 1 and Tier 2 drivers, and the domain budget strategies above (grouping, Tier 2 demotion, domain virtualization) are the primary long-term solution — not a temporary workaround. Domain virtualization (strategy 3) and PKS (Section 10.4, future work) remain genuinely important during this extended transition window and must be implemented correctly. They cannot be dismissed as "probably never needed."
The peer kernel model is the correct direction — it reduces the Tier 1 population, eliminates device-specific Ring 0 code, and strengthens the isolation boundary — but UmkaOS must operate correctly and efficiently with today's hardware for years before that future materializes. Domain grouping and automatic Tier 2 demotion are therefore the primary and durable strategies. The ecosystem shift toward peer kernels is a beneficial long-term trend that will progressively ease the domain budget, not a solution that UmkaOS can depend on today.
10.4.5 Tier 2: User-Space Drivers (Process-Isolated)
Non-performance-critical drivers run as user-space processes with full address space isolation. Communication with UmkaOS Core uses:
- Shared-memory ring buffers (mapped into both address spaces)
- Lightweight notification via eventfd-like mechanism
- IOMMU-restricted DMA (driver can only DMA to its allocated regions)
Tier 2 MMIO access model. Tier 2 drivers access device MMIO registers via
umka_driver_mmio_map (Section 10.4, KABI syscall table), which maps a device BAR region
into the driver process's address space. This mapping is direct -- the driver reads
and writes device registers without kernel mediation on each access, avoiding per-access
syscall overhead. However, the mapping is kernel-controlled and revocable:
- Setup-time validation. The kernel validates every `umka_driver_mmio_map` request: the BAR index must belong to the driver's assigned device, the offset and size must fall within the BAR's bounds, and the driver must hold the appropriate device capability. The kernel never maps BARs belonging to other devices or kernel-reserved MMIO regions.
- IOMMU containment. Even though the driver can program device registers via MMIO (including registers that initiate DMA), all DMA transactions from the device pass through the IOMMU. The device's IOMMU domain restricts DMA to regions explicitly allocated by the kernel on behalf of the driver (`umka_driver_dma_alloc`). A compromised Tier 2 driver that programs arbitrary DMA addresses into device registers will trigger IOMMU faults -- the DMA is blocked by hardware, not by software trust. This is the same IOMMU fencing applied to Tier 1 drivers, and it is the primary defense against DMA-based attacks from any driver tier.
- MMIO revocation on containment. When the kernel needs to contain a Tier 2 driver (crash, fault, admin action, or auto-demotion), it unmaps all MMIO regions from the driver process's address space as part of the containment sequence. This is a standard virtual memory operation (page table entry removal + TLB invalidation) that completes in microseconds. After MMIO revocation, any subsequent MMIO access by the driver process triggers a page fault and process termination -- the driver cannot issue further device commands. Combined with IOMMU fencing (which blocks DMA initiated before revocation from reaching non-driver memory), MMIO revocation provides a complete device access cutoff without requiring Function Level Reset.
PCIe peer-to-peer DMA and IOMMU group policy -- The "complete device access cutoff" guarantee above depends on all DMA traffic passing through the IOMMU. This holds when the device is in its own IOMMU group (ACS enabled on all upstream PCIe switches). However, devices behind a non-ACS PCIe switch can perform peer-to-peer DMA that bypasses the IOMMU entirely — a contained device could still DMA to a peer device's memory regions without IOMMU interception. UmkaOS addresses this by enforcing an IOMMU group co-isolation policy: when devices share an IOMMU group (no ACS), UmkaOS places all devices in that group under the same Tier 2 driver process (or co-isolates them in the same Tier 1 domain). IOMMU revocation during containment therefore affects the entire group atomically — there is no "partially contained" state where one device in the group is fenced but a peer is not. See Section 10.5.3.8 (IOMMU Groups) for the full ACS detection and group assignment policy.
Synchronous vs. asynchronous revocation -- For deliberate containment actions (admin-initiated revocation, auto-demotion, fault-triggered isolation), MMIO revocation is synchronous: the kernel performs the TLB shootdown and waits for acknowledgment from all CPUs before the containment call returns. This guarantees that no MMIO access from the driver process is possible after the containment operation completes. For the crash case (driver process dies due to SIGSEGV/SIGABRT), the dying process's threads are killed first, so the TLB shootdown is a cleanup operation -- the driver threads are no longer executing, making the timing of the shootdown a correctness concern only for the page allocator (which must not reuse the MMIO-mapped pages until the shootdown completes).
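The synchronous containment path can be modeled as a publish-and-wait acknowledgment protocol. In the toy sketch below, user-space threads stand in for CPUs and an atomic counter stands in for the TLB-shootdown acknowledgments; `synchronous_revoke` and the protocol shape are illustrative, not the kernel's IPI implementation.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;
use std::thread;

// Toy model of synchronous MMIO revocation: the revoking CPU publishes a
// shootdown request and spins until every other "CPU" acknowledges, so the
// containment call cannot return while any CPU might still have stale
// MMIO translations. Illustrative only.
fn synchronous_revoke(n_cpus: usize) -> usize {
    let acks = Arc::new(AtomicUsize::new(0));
    let handles: Vec<_> = (0..n_cpus)
        .map(|_| {
            let acks = Arc::clone(&acks);
            thread::spawn(move || {
                // Each "CPU" invalidates its stale entries, then acknowledges.
                acks.fetch_add(1, Ordering::Release);
            })
        })
        .collect();
    // The revoking CPU waits for all acks before returning -- after this
    // point, no MMIO access from the contained driver is possible.
    while acks.load(Ordering::Acquire) < n_cpus {
        std::hint::spin_loop();
    }
    for h in handles {
        h.join().unwrap();
    }
    acks.load(Ordering::Acquire)
}
```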
- FLR-free recovery (optimistic path). In the normal case, Tier 2 recovery does
not require Function Level Reset. Tier 1 recovery requires FLR because the driver
runs in Ring 0 and may have left the device in an arbitrary hardware state that only
a full reset can clear. Tier 2 recovery can typically avoid FLR because: (a) IOMMU
containment prevents DMA escapes regardless of device state, (b) MMIO revocation
prevents further device manipulation, and (c) the device's hardware state can be
re-initialized by the replacement driver instance during its
`init()` call. However, devices with complex internal state machines (GPUs, SmartNICs, FPGAs) may not be safely re-initializable without a full reset. If the replacement driver's `init()` detects an unresponsive or inconsistent device (no response to MMIO reads, unexpected register state, completion timeout), the registry escalates to FLR. This fallback is not the common case for simple devices (NICs, HID, storage controllers), but should be expected for complex devices with substantial internal firmware state.
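The escalation decision reduces to: skip FLR on the optimistic path, escalate when the replacement driver's probe finds the device unhealthy. A minimal sketch, with invented names (`ProbeResult`, `needs_flr`) standing in for whatever the registry actually uses:

```rust
#[derive(Debug, PartialEq, Clone, Copy)]
pub enum ProbeResult {
    Healthy,
    MmioUnresponsive,    // no response to MMIO reads
    UnexpectedRegisters, // inconsistent register state
    CompletionTimeout,
}

/// FLR-free recovery is the optimistic default for Tier 2; the registry
/// escalates to Function Level Reset only when the replacement driver's
/// init-time probe reports an unresponsive or inconsistent device.
pub fn needs_flr(probe: ProbeResult) -> bool {
    probe != ProbeResult::Healthy
}
```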
10.4.6 Tier Mobility and Auto-Demotion
Key principle: UmkaOS's isolation model is designed for flexibility, not dogma. Different hardware has different isolation capabilities (see Section 10.2 in README.md for the full architecture-specific analysis). The tier system allows administrators to make explicit tradeoffs between isolation and performance:
- Tier 1 provides isolation using the best available hardware mechanism: register-based on x86-64/ARMv7/PPC32/PPC64LE (~1-4% overhead), or page-table-based on AArch64 mainstream (~6-12% overhead), or POE-accelerated on AArch64 ARMv8.9+/ARMv9.4+ (~2-4% overhead). On RISC-V, Tier 1 isolation is not available — Tier 1 drivers run as Tier 0.
- Tier 2 provides strong process-level isolation on all architectures, at the cost of higher latency (~200-600 cycles per domain crossing vs ~23-80 cycles for Tier 1).
- The escape hatch is always available: Any Tier 1 driver can be manually demoted to Tier 2 by the administrator, or automatically demoted after repeated crashes. This allows environments that prioritize security over performance to opt into stronger isolation regardless of hardware capabilities.
Design intent: The system does not force a one-size-fits-all choice. A high-frequency trading system on x86_64 might run all drivers in Tier 1 for maximum performance. A secure enclave handling sensitive data on a RISC-V system might run all drivers in Tier 2 for maximum isolation. Both are valid deployments of the same kernel.
Drivers declare a preferred tier and a minimum tier in their manifest:
# drivers/tier1/nvme/manifest.toml
[driver]
name = "umka-nvme"
preferred_tier = 1
minimum_tier = 1 # NVMe cannot function well in Tier 2
# drivers/tier2/usb-hid/manifest.toml
[driver]
name = "umka-usb-hid"
preferred_tier = 2
minimum_tier = 2
The kernel's policy engine decides the actual tier based on:
- Trust level: Unsigned drivers are forced to Tier 2.
- Crash history: After 3 crashes within a configurable window, a Tier 1 driver is automatically demoted to Tier 2 (if minimum_tier allows).
- Admin overrides: System administrator can force any tier via configuration.
- Signature verification: Cryptographically signed drivers can be granted Tier 1.
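The four rules above compose into a single decision. The sketch below models them in order of precedence (admin override, then signature, then crash-driven demotion); the struct and function names are invented, and the interpretation that "`minimum_tier` allows" demotion only when `minimum_tier = 2` is an assumption based on the manifest examples:

```rust
/// Illustrative inputs to the tier decision; field names are assumptions.
pub struct TierPolicyInput {
    pub preferred_tier: u8,
    pub minimum_tier: u8,          // demotion floor from the manifest
    pub signed: bool,              // cryptographic signature verified
    pub crashes_in_window: u32,    // crash history within the configured window
    pub admin_override: Option<u8>,
}

/// Sketch of the policy-engine rules: admin override wins; unsigned
/// drivers are forced to Tier 2; after 3 crashes in the window, a
/// Tier 1 driver is demoted to Tier 2 if minimum_tier allows.
pub fn decide_tier(p: &TierPolicyInput) -> u8 {
    if let Some(t) = p.admin_override {
        return t;
    }
    if !p.signed {
        return 2; // unsigned drivers never get Tier 1
    }
    let mut tier = p.preferred_tier;
    if tier == 1 && p.crashes_in_window >= 3 && p.minimum_tier >= 2 {
        tier = 2; // auto-demotion
    }
    tier
}
```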
10.4.7 Debugging Across Isolation Domains (ptrace)
ptrace(PTRACE_PEEKDATA) on a Tier 1 driver thread must read memory tagged with the
driver's PKEY, which the debugger process does not have access to. The kernel handles
this by performing the read on behalf of the debugger:
ptrace access flow for MPK-isolated memory (high-level overview):
1. Debugger calls ptrace(PTRACE_PEEKDATA/POKEDATA, target_tid, addr).
2. Kernel checks: does `addr` belong to an MPK-protected region?
3. If yes: kernel performs a TOCTOU-safe PKRU manipulation
(see Security Note below) to grant temporary access,
performs the copy, then restores PKRU. This happens in kernel mode,
so the debugger process never gains direct access.
4. If no: standard ptrace read/write path (no MPK involvement).
ptrace write flow:
Same as read, but with write permission instead of read.
PKRU manipulation is a single WRPKRU instruction (~23 cycles; see
[Section 18.7.8](18-compat.md#1878-performance-impact) for detailed WRPKRU cycle count
analysis (11–260 cycles depending on pipeline state and microarchitecture)).
PTRACE_ATTACH to a Tier 1 driver thread:
Requires CAP_SYS_PTRACE (same as Linux).
The debugger can single-step, set breakpoints, and inspect registers.
Memory access goes through the kernel-mediated PKRU path above.
#### 10.4.7.1 Security Note: TOCTOU Mitigation
The ptrace PKRU manipulation flow has a Time-Of-Check-Time-Of-Use (TOCTOU) concern:
the kernel checks access, changes PKRU, performs the copy, then restores PKRU.
Between the PKRU change and restore, if the traced driver could execute arbitrary code,
it could issue its own `WRPKRU` and escape isolation.
**Mitigation strategy:**
ptrace PKRU-protected access (TOCTOU-safe):
1. Acquire `pt_reg_lock(target_tid)` — traced thread cannot run.
2. Verify debugger holds CAP_SYS_PTRACE and the ptrace relationship is authorized. This check happens before any PKRU state change.
3. Verify the address belongs to a valid MPK region owned by the target.
4. With IRQs disabled and `pt_reg_lock` held:
   a) Save current PKRU
   b) Set PKRU to grant temporary access to the target's PKEY
   c) Perform the copy (read or write)
   d) Restore saved PKRU
5. Release `pt_reg_lock(target_tid)`
This approach creates a **locked validation window**: the traced process cannot execute
between authorization and data copy, and cannot escape by issuing its own `WRPKRU`
because it is blocked by `pt_reg_lock`. The authorization check occurs before any
PKRU manipulation, ensuring that unauthorized debuggers cannot exploit the window.
**Alternative approaches considered:**
1. **Permanently grant debugger PKRU access**: Rejected — violates isolation principle.
2. **Copy through a bounce buffer with kernel mapping**: Adds overhead but would work;
however, PKRU manipulation is fast (~23 cycles) and the lock-based approach is
simpler when the debugger is already ptrace-attached.
3. **Disable PTRACE_PEEKDATA on Tier 1 drivers**: Would compromise debuggability;
the lock-based approach provides security without removing functionality.
The key invariant is: *no user-space code from the traced process runs between PKRU
authorization and PKRU restoration*. `pt_reg_lock` enforces this invariant.
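The locked-window flow can be modeled with plain state in a few lines. This is a simulation, not kernel code: `TracedThread`, `Pkru`, and `ptrace_mediated_copy` are invented names, the lock is modeled as a `runnable` flag, and the PKRU grant encoding is simplified to one bit per PKEY:

```rust
pub struct TracedThread {
    pub runnable: bool, // false while pt_reg_lock is held
}

pub struct Pkru(pub u32); // simplified: one grant bit per PKEY

/// Models the TOCTOU-safe window: authorization is checked before any
/// PKRU change, the traced thread is blocked for the whole window, and
/// PKRU is restored before the lock is released.
pub fn ptrace_mediated_copy(
    target: &mut TracedThread,
    pkru: &mut Pkru,
    authorized: bool, // CAP_SYS_PTRACE + ptrace relationship verified
    target_pkey: u32,
    copy: impl FnOnce(),
) -> Result<(), &'static str> {
    // 1. Acquire pt_reg_lock: the traced thread cannot run.
    target.runnable = false;
    // 2-3. Authorization and address validation happen *before* any
    //      PKRU change; failing here never opens the window.
    if !authorized {
        target.runnable = true;
        return Err("EPERM");
    }
    // 4. Save PKRU, grant temporary access, copy, restore.
    let saved = pkru.0;
    pkru.0 |= 1 << target_pkey; // temporary grant
    copy();                     // kernel-mode copy on the debugger's behalf
    pkru.0 = saved;
    // 5. Release pt_reg_lock.
    target.runnable = true;
    Ok(())
}
```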
Weak-isolation fast path: On platforms without MPK (or equivalent domain registers), the entire PKRU manipulation flow is unnecessary. `ptrace` uses the standard kernel read/write path — the driver's memory is in the same address space with no domain protection, so no temporary access grant is needed. The `pt_reg_lock` and TOCTOU-safe window are only instantiated when the architecture reports hardware domain support.
10.4.8 Signal Delivery Across Isolation Boundaries
When a signal targets a thread running in a Tier 1 (domain-isolated) driver:
Signal delivery to Tier 1 driver thread:
SIGKILL / SIGSTOP (non-catchable):
Kernel handles these directly — no signal frame is pushed.
For SIGKILL: the driver thread is terminated. The kernel runs
the driver's cleanup handler (if registered via KABI) in a
bounded context (timeout: 100ms). If cleanup doesn't complete,
the driver's isolation domain is revoked and all its memory freed.
Catchable signals (SIGSEGV, SIGUSR1, etc.):
1. Kernel saves driver's PKRU state.
2. Kernel sets PKRU to the process's default domain (no driver
memory access) before pushing the signal frame to the user stack.
3. Signal handler runs in the process's normal domain — it cannot
access driver-private memory.
4. On sigreturn: kernel restores the saved PKRU and resumes the
driver code with its original domain permissions.
This ensures a signal handler in application code cannot accidentally
(or maliciously) access driver-private memory while handling a signal
that interrupted driver execution.
> **Weak-isolation fast path**: Without hardware domain registers (no MPK/POE/DACR),
> the PKRU save/restore steps are elided. Signals are delivered using the standard
> kernel signal path without domain register manipulation. The signal handler runs
> with normal kernel permissions — on these platforms, the driver memory is not
> domain-protected anyway, so there is nothing to save or restore.
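The save/switch/restore steps around signal delivery can be sketched as two transitions on a per-thread context. This is a model, not the real signal path: `DriverContext` and the `DEFAULT_DOMAIN_PKRU` value are invented for illustration:

```rust
pub struct DriverContext {
    pub pkru: u32,              // live PKRU while driver code runs
    pub saved_pkru: Option<u32>, // set between signal entry and sigreturn
}

/// Hypothetical encoding: a PKRU value granting only the process's
/// default domain (no driver PKEYs).
pub const DEFAULT_DOMAIN_PKRU: u32 = 0xFFFF_FFFC;

/// Steps 1-2: save the driver's PKRU and switch to the process's
/// default domain before the signal frame is pushed, so the handler
/// cannot touch driver-private memory.
pub fn enter_signal_handler(ctx: &mut DriverContext) {
    ctx.saved_pkru = Some(ctx.pkru);
    ctx.pkru = DEFAULT_DOMAIN_PKRU;
}

/// Step 4: sigreturn restores the saved PKRU so the interrupted driver
/// code resumes with its original domain permissions.
pub fn sigreturn(ctx: &mut DriverContext) {
    if let Some(p) = ctx.saved_pkru.take() {
        ctx.pkru = p;
    }
}
```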
See also: Section 5.2 (SmartNIC and DPU Integration) adds an offload tier where driver data-plane operations are proxied to a DPU over PCIe or shared memory, using the same tier classification and IOMMU fencing model.
10.4.9 eBPF Interaction with Driver Isolation Domains
eBPF programs are a cross-cutting kernel extensibility mechanism used for tracing (kprobe, tracepoint), networking (XDP, tc), security (LSM, seccomp), and scheduling (struct_ops). Because eBPF programs execute in kernel mode with access to kernel data structures, their interaction with driver isolation domains requires explicit specification to prevent isolation domain circumvention.
Threat model: An eBPF program, if not properly constrained, could:
1. Access Tier 1/Tier 2 driver memory directly without going through the isolation boundary
2. Bypass MPK/POE protections by running in the same domain as umka-core
3. Modify driver state without proper capability checks
4. Exfiltrate data from isolated driver memory to user space via BPF maps
Isolation architecture: eBPF programs do not run in the same isolation domain as umka-core (PKEY 0). Each loaded eBPF program is assigned to a dedicated BPF isolation domain that is distinct from:
- umka-core (PKEY 0)
- All Tier 1 driver domains (PKEY 2-13 on x86-64)
- The shared DMA domain (PKEY 14)
- The guard domain (PKEY 15)
This means eBPF programs cannot directly access driver-private memory, umka-core internal state, or any isolation domain's memory without explicit kernel mediation.
Access rules for eBPF programs:
- No direct driver memory access: An eBPF program attached to a kprobe or tracepoint within a Tier 1 driver's code path executes in its own BPF domain, not the driver's domain. The BPF program cannot read or write the driver's private heap, stack, or MMIO-mapped device registers. Any access to driver state must go through BPF helper functions that perform cross-domain access on the program's behalf.
- BPF helper mediation: All BPF helpers that access kernel or driver state (e.g., `bpf_probe_read_kernel()`, `bpf_sk_lookup()`, `bpf_ct_lookup()`) are implemented as kernel-mediated cross-domain operations. The helper:
  - Validates that the target memory region belongs to a domain for which the BPF program's domain holds the appropriate capability (see rule 4)
  - Copies data between the target domain and the BPF program's stack or map memory using kernel-internal mappings that bypass domain restrictions
  - Returns an error if the capability check fails or the access is out of bounds
- Map isolation: BPF maps created by an eBPF program are owned by that program's BPF domain. Other isolation domains (including drivers) cannot access these maps without an explicit capability grant. Cross-domain map sharing follows the standard capability delegation mechanism (Section 8.1.1): the BPF domain must grant `MAP_READ` and/or `MAP_WRITE` capabilities to the target domain. This prevents a compromised driver from exfiltrating data through BPF maps it does not own.
- Capability requirements for driver access: BPF helpers that query or modify driver state require the BPF domain to hold the appropriate capability:
  - `bpf_skb_adjust_room()` (modify packet buffer in NIC driver): requires `CAP_NET_RAW` in the caller's network namespace
  - `bpf_xdp_adjust_head()` / `bpf_xdp_adjust_tail()`: requires `CAP_NET_RAW`
  - Helpers that read driver statistics or state: require `CAP_SYS_ADMIN` or a subsystem-specific read capability

  The verifier rejects at load time any program that calls a helper for which the loading context (the process calling `bpf()`) does not hold the required capabilities. The eBPF runtime re-checks capabilities at helper invocation time to handle capability revocation after program load.
- XDP and driver datapath: XDP programs attached to a NIC driver's receive path do not execute in the NIC driver's isolation domain. Instead:
  - The driver's receive handler (running in the driver's domain) copies the packet descriptor into a shared bounce buffer accessible to the BPF domain
  - The XDP program runs in the BPF domain, reading from and writing to the bounce buffer
  - Return values (`XDP_PASS`, `XDP_DROP`, `XDP_TX`, `XDP_REDIRECT`) are communicated back to the driver via a shared-memory return code
  - If the XDP program modifies the packet (`XDP_TX` or `XDP_REDIRECT` with modified data), the driver copies the modified packet back to its own domain before transmission or redirect

  This bounce-buffer design ensures the XDP program never directly accesses driver-private state (DMA rings, completion queues, device registers).
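One receive iteration of the bounce-buffer flow can be sketched as follows. The domains are simulated with two separate buffers and the XDP program with a closure; `rx_with_xdp` and `XdpVerdict` are illustrative names, not the real driver API:

```rust
#[derive(Debug, PartialEq, Clone, Copy)]
pub enum XdpVerdict { Pass, Drop, Tx, Redirect }

/// Driver copies the packet into the shared bounce buffer, the XDP
/// program runs against that copy in the BPF domain, and only on
/// TX/REDIRECT is the (possibly modified) buffer copied back into
/// driver memory before transmit/redirect.
pub fn rx_with_xdp(
    driver_pkt: &mut Vec<u8>,
    bounce: &mut Vec<u8>,
    xdp_prog: impl FnOnce(&mut Vec<u8>) -> XdpVerdict,
) -> XdpVerdict {
    bounce.clear();
    bounce.extend_from_slice(driver_pkt); // driver-domain -> bounce copy
    let verdict = xdp_prog(bounce);       // runs in the BPF domain
    match verdict {
        XdpVerdict::Tx | XdpVerdict::Redirect => {
            // Copy the modified packet back before transmission/redirect.
            driver_pkt.clear();
            driver_pkt.extend_from_slice(bounce);
        }
        XdpVerdict::Pass | XdpVerdict::Drop => {} // no copy-back needed
    }
    verdict
}
```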
Performance note for 100Gbps+: At 100Gbps with 64-byte packets (~148 Mpps),
per-packet bounce copies become a bottleneck (~10ns each = ~1.5 CPU cores just
for memcpy). For high-speed NICs (≥25Gbps), UmkaOS supports a zero-copy XDP
fast path: the NIC driver maps its receive ring into the BPF isolation domain
as read-only (via the shared DMA buffer pool, PKEY 14 on x86 / domain 2 on
AArch64), allowing XDP programs to inspect packets in-place without a copy.
Modification still requires a copy-on-write to a BPF-writable buffer. This
zero-copy path is opt-in per driver (xdp_features flag XDP_F_ZEROCOPY_RX)
and requires IOMMU to fence the BPF domain's read-only mapping.
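The packet-rate arithmetic behind the ~148 Mpps figure is worth making explicit: a 64-byte frame occupies 84 bytes on the wire once the 7-byte preamble, 1-byte SFD, and 12-byte inter-frame gap are counted. A small helper (name invented) reproduces the numbers:

```rust
/// Line-rate packets-per-second for a given Ethernet frame size,
/// including the 20 bytes of per-frame wire overhead
/// (7B preamble + 1B SFD + 12B inter-frame gap).
pub fn line_rate_pps(link_bps: f64, frame_bytes: u64) -> f64 {
    let wire_bits = ((frame_bytes + 20) * 8) as f64;
    link_bps / wire_bits
}
```

At 100 Gbps with 64-byte frames this gives ~148.8 Mpps; at ~10 ns per bounce copy that is ~1.49 seconds of CPU time per second, i.e. roughly 1.5 cores spent on memcpy alone, matching the note above.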
Weak-isolation fast path: When running without hardware isolation domains (`isolation=performance` or architectures without fast isolation), the bounce buffer is bypassed. XDP/TC programs access the driver's packet buffer directly (true zero-copy, matching Linux's XDP model). The BPF verifier still enforces bounds checking and memory safety — only the domain separation between BPF and driver memory is lost. Since the driver code itself already has unrestricted kernel memory access on these platforms, the bounce buffer would be protecting the driver's memory from BPF while the driver can already read/write all of kernel memory. The per-packet `memcpy` savings are significant at high packet rates (100Gbps with 64-byte packets = ~148M copies/sec eliminated).
- TC (traffic control) BPF: Same model as XDP — TC programs execute in a BPF domain, not in the network driver's or umka-net's domain. Packet data is copied through a shared buffer; the program cannot access umka-net's socket buffers, routing tables, or connection tracking state except through verified BPF helpers (`bpf_fib_lookup()`, `bpf_ct_lookup()`, etc.) that perform capability-checked cross-domain access.
- Kprobe and tracepoint attachment to drivers: When a BPF program is attached to a kprobe within a Tier 1 driver's code:
  - The kprobe fires while the CPU is running in the driver's isolation domain
  - The BPF program is invoked after the kernel switches to the BPF domain
  - The program receives only the function arguments (copied to BPF stack) and cannot access the driver's heap, globals, or MMIO regions
  - Return probes (kretprobe) receive the return value copied to BPF stack

  The domain switch before BPF execution and the argument copy are performed by the kprobe infrastructure in umka-core, ensuring the BPF program is fully contained within its own domain.
- LSM BPF and security hooks: LSM BPF programs attached to security hooks (file open, socket create, etc.) run in a BPF domain. They cannot access the credentials, file descriptors, or socket state of the process that triggered the hook except through BPF helpers (`bpf_get_current_pid_tgid()`, `bpf_get_current_cred()`, etc.) that copy the relevant data into the BPF program's memory. Security decisions (allow/deny) are returned via an integer return code; the program cannot directly modify kernel security state.
Domain allocation for BPF: On x86-64, BPF domains are allocated from the same PKEY pool as Tier 1 drivers (PKEY 2-13). Typical systems run 5-8 Tier 1 driver domains, leaving 4-7 domains for BPF programs. When domain exhaustion occurs (drivers + BPF programs > 12 domains), BPF programs share a common BPF domain rather than each getting a dedicated domain. This reduces isolation granularity between BPF programs but preserves isolation between BPF and drivers and between BPF and umka-core. BPF-to-BPF isolation is a best-effort optimization, not a security guarantee — BPF programs are verified code with bounded execution, and their primary isolation boundary is BPF-to-driver and BPF-to-core, both of which are always maintained regardless of domain pressure. On architectures without a fixed domain limit (PPC64LE, AArch64 mainstream page-table path), each BPF program gets its own domain. On RISC-V (no Tier 1), BPF domains are not applicable — BPF programs run without isolation domains.
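The exhaustion fallback can be modeled as a small allocator over the shared PKEY pool. This is a sketch: the pool bounds (PKEY 2-13) come from the text, but the choice of which PKEY serves as the common shared BPF domain, and the function name, are assumptions:

```rust
pub const FIRST_POOL_PKEY: u32 = 2;
pub const LAST_POOL_PKEY: u32 = 13; // PKEYs 2-13 shared by Tier 1 drivers and BPF

/// Hypothetical: a designated PKEY used as the common BPF domain when
/// the pool is exhausted (all programs then share it).
pub const SHARED_BPF_PKEY: u32 = 13;

/// Give a new BPF program its own PKEY from the pool it shares with
/// Tier 1 drivers; on exhaustion, fall back to the shared BPF domain.
/// `in_use` is the set of PKEYs already held by drivers and programs.
pub fn alloc_bpf_domain(in_use: &[u32]) -> u32 {
    (FIRST_POOL_PKEY..=LAST_POOL_PKEY)
        .find(|k| !in_use.contains(k))
        .unwrap_or(SHARED_BPF_PKEY)
}
```

Note the fallback preserves the guarantees the text calls out: BPF-to-driver and BPF-to-core isolation are unaffected (those domains keep their own PKEYs); only BPF-to-BPF granularity degrades.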
Crash handling: A crash (verifier bug, JIT bug, or helper bug) within a
BPF program triggers the same containment as a Tier 1 driver crash:
- The BPF domain is revoked
- All maps owned by that domain are invalidated (subsequent lookups return
-ENOENT)
- Attached hooks are automatically detached
- The program is marked as faulted and cannot be re-attached without reload
Unlike Tier 1 drivers, BPF programs do not have a recovery path — they are considered stateless (persistent state lives in maps, which survive program reload). The administrator must reload the program manually or via orchestration.
Full specification: The complete BPF isolation model — domain confinement, map access control, capability-gated helpers, cross-domain packet redirect rules, and verifier enforcement — is specified in Section 15.2.2 (Packet Filtering, BPF-Based). Although Section 15.2.2 is located in the Networking part, its isolation rules apply to all BPF program types, not just networking hooks. The rules above are a driver-centric summary; Section 15.2.2 provides the canonical specification.
10.4.10 Tier 2 Interface and SDK
Tier 2 drivers run in separate user-space processes. They communicate with umka-core via dedicated KABI syscalls — not the domain ring buffers used by Tier 1.
KABI syscalls for Tier 2 drivers:
These syscalls use a dedicated syscall range (__NR_umka_driver_base + offset,
allocated from the UmkaOS-private syscall range defined in Section 18.1.2). They are
not Linux-compatible syscalls -- they are UmkaOS-specific and used only by the
Tier 2 driver SDK. The SDK wraps them behind the same KernelServicesVTable
interface that Tier 1 drivers use, so driver code is tier-agnostic.
| KABI Syscall | Syscall Offset | Arguments | Return | Purpose |
|---|---|---|---|---|
| `umka_driver_register` | 0 | `manifest: *const DriverManifest, manifest_size: u64, out_services: *mut KernelServicesVTable, out_device: *mut DeviceDescriptor` | `IoResultCode` | Register with device registry. Kernel validates manifest, assigns capabilities, returns kernel services vtable and device descriptor. Called once at driver process startup. |
| `umka_driver_mmio_map` | 1 | `device_handle: DeviceHandle, bar_index: u32, offset: u64, size: u64, out_vaddr: *mut u64` | `IoResultCode` | Map a device BAR (or portion) into driver address space. Kernel validates BAR ownership, IOMMU group, and capability before creating the mapping. The mapping is revocable: the kernel can unmap it at any time during driver containment (see "Tier 2 MMIO access model" above). |
| `umka_driver_dma_alloc` | 2 | `size: u64, align: u64, flags: AllocFlags, out_vaddr: *mut u64, out_dma_addr: *mut u64` | `IoResultCode` | Allocate DMA-capable memory. Kernel allocates physical pages, creates IOMMU mapping, maps into driver process. Returns both virtual and DMA (bus) addresses. |
| `umka_driver_dma_free` | 3 | `vaddr: u64, size: u64` | `IoResultCode` | Release a DMA buffer. Kernel tears down IOMMU mapping, unmaps from process, frees physical pages. |
| `umka_driver_irq_wait` | 4 | `irq_handle: u32, timeout_ns: u64` | `IoResultCode` | Block until the registered interrupt fires or timeout expires. Returns IO_SUCCESS on interrupt, IO_TIMEOUT on timeout. Uses eventfd internally for efficient wakeup. |
| `umka_driver_complete` | 5 | `request_id: u64, status: IoResultCode, bytes_transferred: u64` | `IoResultCode` | Post an I/O completion to umka-core. The completion is forwarded to the originating io_uring CQ or waiting syscall. |
Error codes: All Tier 2 KABI syscalls return IoResultCode (defined in
umka-driver-sdk/src/abi.rs). Common errors: IO_ERR_INVALID_HANDLE (bad device
handle), IO_ERR_PERMISSION (missing capability), IO_ERR_NO_MEMORY (allocation
failed), IO_ERR_BUSY (resource in use), IO_ERR_TIMEOUT.
Performance: Per-I/O overhead floor is ~200-400ns (two syscall transitions). For high-IOPS devices (NVMe, 100GbE), this is significant — those belong in Tier 1. Tier 2 suits devices where overhead is negligible: USB, printers, audio (~1-10ms periods), experimental drivers, and third-party binaries compiled against the stable SDK.
Security boundary: A Tier 2 driver crash is an ordinary process crash. It cannot corrupt kernel memory or issue DMA outside IOMMU-fenced regions. On containment, the kernel revokes all MMIO mappings (preventing further device register access) and tears down IOMMU entries (causing any residual in-flight DMA to fault). The kernel restarts the driver process if the restart policy permits (~10ms recovery).
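A Tier 2 driver's event loop built on these syscalls is short. The sketch below models the `umka_driver_irq_wait` / `umka_driver_complete` pair behind a trait so it can be exercised without a kernel; the trait, the mock, and the numeric `IoResultCode` values are invented for illustration (the SDK's real wrappers live behind `KernelServicesVTable`):

```rust
pub type IoResultCode = i32;
pub const IO_SUCCESS: IoResultCode = 0; // illustrative values,
pub const IO_TIMEOUT: IoResultCode = 1; // not the real ABI constants

/// Stand-in for the SDK wrappers around the Tier 2 KABI syscalls.
pub trait Kabi {
    fn irq_wait(&mut self, irq_handle: u32, timeout_ns: u64) -> IoResultCode;
    fn complete(&mut self, request_id: u64, status: IoResultCode, bytes: u64) -> IoResultCode;
}

/// Service one request: block on the device IRQ, then post the
/// completion back to umka-core for forwarding to the io_uring CQ
/// or waiting syscall.
pub fn service_one_request<K: Kabi>(
    kabi: &mut K,
    irq_handle: u32,
    request_id: u64,
    bytes: u64,
) -> IoResultCode {
    match kabi.irq_wait(irq_handle, 1_000_000_000) {
        IO_SUCCESS => kabi.complete(request_id, IO_SUCCESS, bytes),
        other => other, // IO_TIMEOUT etc.: caller decides whether to retry
    }
}

/// Test double that records posted completions.
pub struct MockKabi {
    pub completions: Vec<(u64, IoResultCode, u64)>,
    pub irq_result: IoResultCode,
}

impl Kabi for MockKabi {
    fn irq_wait(&mut self, _irq: u32, _timeout_ns: u64) -> IoResultCode {
        self.irq_result
    }
    fn complete(&mut self, request_id: u64, status: IoResultCode, bytes: u64) -> IoResultCode {
        self.completions.push((request_id, status, bytes));
        IO_SUCCESS
    }
}
```

Because the SDK presents the same vtable interface to both tiers, driver logic written against a trait like this stays tier-agnostic, which is the design point of the KABI wrapping described above.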
10.5 Device Registry and Bus Management
Summary: This section specifies the kernel-internal device registry — a topology-aware tree that tracks all hardware devices, their parent/child relationships, driver bindings, power states, and capabilities. It covers: bus enumeration and matching (Section 10.5.4), device lifecycle and hot-plug (Section 10.5.6-Section 10.5.7), power management ordering (Section 10.5.6), crash recovery integration (Section 10.5.10), sysfs compatibility (Section 10.5.12), and firmware management (Section 10.5.15). The registry is the single source of truth for "what hardware exists" and is used by the scheduler (Section 6.1), fault manager (Section 19.1), DPU offload layer (Section 5.2), and unified compute topology (Section 21.6). Readers needing only the API surface can skip to Section 10.5.3 (data model) and Section 10.5.9 (KABI integration).
10.5.1 Motivation and Prior Art
10.5.1.1 The Problem
UmkaOS's KABI provides a clean bilateral vtable exchange between kernel and driver. But the current design has no answer for:
- Device hierarchies: How does the kernel model that a USB keyboard is behind a hub, which is behind an XHCI controller, which sits on a PCI bus? The topology matters for power management ordering, hot-plug teardown, and fault propagation.
- Driver-to-device matching: When the kernel discovers a PCI device with vendor 0x8086 and device 0x2723, how does it know which driver to load? Currently there is no matching mechanism.
- Power management ordering: Suspending a PCI bridge before its child devices causes data loss. The kernel needs to know the topology to get the ordering right.
- Cross-driver services: A NIC may need a PHY driver. A GPU display pipeline may need an I2C controller. There is no way for drivers to discover and use services provided by other drivers.
- Hot-plug: When a USB device is yanked, the kernel must tear down the device, its driver, and all child devices in the correct order.
The key insight from macOS IOKit: the kernel should own the device relationship model. But IOKit's mistake was embedding the model in the driver's C++ class hierarchy, coupling it to the ABI. We build it as a kernel-internal service that drivers access through KABI methods.
10.5.1.2 What We Learn From Existing Systems
Linux (kobject / bus / device / driver / sysfs):
- Device model is a graph of kobject structures exposed via sysfs.
- Bus types (PCI, USB, platform) each implement their own match/probe/remove.
- Strengths: sysfs gives userspace introspection; uevent mechanism for hotplug.
- Weaknesses: driver matching is bus-specific with no unified property system; power
management ordering is heuristic (dpm_list), not topology-derived; the kobject model
is deeply entangled with kernel internals — drivers directly embed and manipulate
kobjects.
macOS IOKit (IORegistry):
- All devices modeled as a tree of C++ objects (IORegistryEntry → IOService → ...).
- Matching uses property dictionaries ("matching dictionaries").
- Power management tree mirrors the registry tree — IOPMPowerState arrays per driver.
- Strengths: property-based matching is elegant; PM ordering derives from the tree; service publication/lookup via IOService matching.
- Weaknesses: C++ class hierarchy is the ABI — changing a base class breaks all drivers (fragile base class problem). This is why Apple deprecated kexts and moved to DriverKit. The matching system is over-general (personality dictionaries are complex). Memory management is manual.
Windows PnP Manager:
- Kernel-mode PnP manager maintains a device tree. Device nodes have properties.
- INF files declare driver matching rules (declarative, external to the binary).
- Power management uses IRP_MN_SET_POWER directed through the tree.
- Strengths: INF-based declarative matching is clean; power IRPs propagate with correct ordering; robust hotplug.
- Weaknesses: IRP-based model is complex; WDM/WDF driver model is notoriously difficult.
Fuchsia (Driver Framework v2):
- "Bind rules" — a simple declarative language — match drivers to devices.
- Driver manager runs as a userspace component. Device topology is a tree of nodes in a namespace.
- Strengths: clean separation of concerns; bind rules are simple and composable; userspace driver manager can be restarted independently.
- Weaknesses: everything going through IPC adds latency; the DFv1-to-DFv2 migration shows that evolving the framework is painful.
10.5.1.3 UmkaOS's Position
We take the best ideas from each:
| Concept | Borrowed From | Adaptation |
|---|---|---|
| Property-based matching | IOKit | Declarative match rules in driver manifest, not runtime OOP matching |
| Registry as a tree | IOKit, Linux | Kernel-internal tree, drivers get opaque handles only |
| PM ordering from topology | IOKit, Windows | Topological sort of device tree, timeouts at each level |
| Service publication/lookup | IOKit | Mediated by registry through KABI, not direct object references |
| Sysfs-compatible output | Linux | Registry is the single source of truth for /sys |
| Uevent hotplug notifications | Linux | Registry emits Linux-compatible uevents |
| Declarative bind rules | Fuchsia | Match rules embedded in driver ELF binary |
What we take from none of them: the registry is a kernel-internal data structure.
Drivers never see it directly. They interact through opaque DeviceHandle values
and KABI vtable methods. No OOP inheritance, no C++ objects, no kobject embedding, no
global symbol tables. The flat, versioned, append-only KABI philosophy is fully preserved.
10.5.2 Design Principles
- Kernel owns the graph, drivers own the hardware logic. The registry manages topology, matching, lifecycle, and power ordering. Drivers manage hardware registers, DMA, and device-specific protocols. Clean separation.
- Drivers are leaves, not framework participants. A driver does not subclass a framework object. It fills in a vtable and receives callbacks. The registry decides when to call those callbacks based on topology and policy.
- No ABI coupling. The registry is kernel-internal. Drivers interact with it through KABI methods appended to `KernelServicesVTable`. If the registry's internal data structures change, no driver recompilation is needed.
- Topology drives policy. Power management ordering, hot-plug teardown, crash recovery cascading, and NUMA affinity are all derived from the device tree topology. No heuristics, no manually maintained ordering lists.
- Capability-mediated access. All cross-driver interactions go through the registry, which validates capabilities and handles tier transitions (isolation domain switches, user-kernel IPC). Drivers never communicate directly.
10.5.3 Registry Data Model
10.5.3.1 DeviceNode
The fundamental unit is a DeviceNode — a kernel-internal structure that drivers never
see directly.
Heap allocation requirement: `DeviceNode` and its child structures (`Vec`, `String`, `HashMap` in `PropertyTable` and `DeviceRegistry`) require heap allocation. The device registry is initialized at boot step 4g (Section 10.5.11), which is after the physical memory allocator and virtual memory subsystem are running (steps 4b-4c). Tier 0 devices (APIC, timer, serial) that are needed before heap init do not use the registry — they are registered retroactively after registry init (Section 10.5.11.1). No registry data structures are used during early boot before the heap is available.
// Kernel-internal — NOT part of KABI
pub struct DeviceNodeId(pub u64); // Unique, monotonically increasing, never reused
pub struct DeviceNode {
// Identity
id: DeviceNodeId,
name: ArrayString<64>, // e.g., "pci0000:00", "0000:00:1f.2", "usb1-1.3"
// Tree structure
parent: Option<DeviceNodeId>,
children: Vec<DeviceNodeId>, // Ordered by discovery time
// Service relationships (non-tree edges)
providers: Vec<ServiceLink>, // Services this node consumes
clients: Vec<ServiceLink>, // Nodes that consume services from this node
// Device identity
bus_type: BusType, // Reuses existing BusType from abi.rs
bus_identity: BusIdentity, // Bus-specific ID (PCI IDs, USB descriptors, etc.)
properties: PropertyTable, // Key-value property store
// Lifecycle
state: DeviceState,
driver_binding: Option<DriverBinding>,
// Placement
numa_node: i32, // -1 = unknown
// Power
power_state: PowerState,
runtime_pm: RuntimePmPolicy,
// Security
device_cap: CapHandle, // Capability for this device
// Resources
resources: DeviceResources, // BAR mappings, IRQs, DMA state
// IOMMU
iommu_group: Option<IommuGroupId>, // Shared IOMMU group (for passthrough)
// Reliability
/// Sliding-window failure tracker. Records timestamps of recent failures
/// in a circular buffer (capacity: 16 entries). The demotion policy checks
/// how many failures occurred within the configured window (default: 1 hour).
/// See `FailureWindow` definition below.
failure_window: FailureWindow,
last_transition_ns: u64,
// State buffer integrity
/// HMAC-SHA256 key for state buffer integrity verification.
/// Generated by umka-core on first driver load for this DeviceHandle.
/// Persists across driver crash/reload cycles; discarded only on
/// DeviceHandle removal (device unplugged or deregistered).
/// See `DriverHmacKey` below for the full key lifecycle specification.
state_hmac_key: Option<DriverHmacKey>,
}
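The `failure_window` field's doc comment describes a 16-entry circular buffer queried against the demotion window. A minimal sketch of such a tracker, under the stated assumptions (fixed capacity 16, timestamps in nanoseconds); the method names are illustrative:

```rust
/// Sliding-window failure tracker: a 16-entry circular buffer of
/// failure timestamps, queried for how many fall inside the
/// configured demotion window.
pub struct FailureWindow {
    timestamps_ns: [u64; 16],
    next: usize, // circular write index
    len: usize,  // number of valid entries (saturates at 16)
}

impl FailureWindow {
    pub fn new() -> Self {
        FailureWindow { timestamps_ns: [0; 16], next: 0, len: 0 }
    }

    /// Record a failure, overwriting the oldest entry when full.
    pub fn record(&mut self, now_ns: u64) {
        self.timestamps_ns[self.next] = now_ns;
        self.next = (self.next + 1) % 16;
        if self.len < 16 {
            self.len += 1;
        }
    }

    /// Failures within [now - window, now]. The demotion policy
    /// compares this count against its threshold (e.g. 3 failures
    /// within the default 1-hour window).
    pub fn failures_within(&self, now_ns: u64, window_ns: u64) -> usize {
        self.timestamps_ns[..self.len]
            .iter()
            .filter(|&&t| now_ns.saturating_sub(t) <= window_ns)
            .count()
    }
}
```

A fixed-capacity buffer is deliberate for a kernel structure: recording is O(1) with no allocation, and overwriting entries older than the window loses nothing the policy cares about.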
DriverHmacKey: Key Storage and Lifecycle
The state_hmac_key field above is backed by DriverHmacKey, which controls every
aspect of key material storage, protection, derivation, and rotation. The key must
reside exclusively in UmkaOS Core private memory so that a compromised Tier 1 driver
(which runs at Ring 0 and can execute WRPKRU) cannot extract it and forge state
buffer integrity tags. See the threat model discussion in Section 10.8 (TOCTOU
mitigation) for why Tier 2 isolation is required to prevent key extraction by an
actively exploited driver.
/// HMAC-SHA256 key for driver state buffer integrity verification.
///
/// Stored in UmkaOS Core private memory (protection key 0 — the PKEY 0 domain
/// is inaccessible to all driver code regardless of privilege level). Driver
/// code running in PKEY 2-13 domains cannot read this key even if it executes
/// arbitrary Ring 0 instructions, because the PKRU register in Core's execution
/// context grants read/write only to PKEY 0 when performing HMAC operations.
///
/// # Key derivation
///
/// Key material is derived via HKDF-SHA256 from:
/// IKM = 256 bits from RDRAND (or platform TRNG on non-x86)
/// Salt = TPM PCR[7] measurement (secure boot policy PCR, 32 bytes)
/// Info = b"umka-driver-hmac" || slot_id.to_le_bytes() || generation.to_le_bytes()
///
/// This binds each key to its driver slot and generation, preventing a key
/// generated for slot 3 generation 5 from being used to verify state produced
/// under slot 3 generation 4 (even if the generation counter wraps — see the
/// generation wrap policy in Section 11.1.5.3).
///
/// # Memory location
///
/// The containing `DeviceNode` is allocated from the `.data.pkey0` slab, which
/// is mapped exclusively to PKEY 0 in the UmkaOS Core page tables. On non-x86
/// architectures that lack MPK/POE, the equivalent protection is achieved via
/// a dedicated kernel-only page table entry that is never present in any driver
/// domain's address space.
pub struct DriverHmacKey {
/// Raw 256-bit key material. Zeroized on driver unload via volatile writes
/// (preventing the compiler from eliding the zeroing as a dead store).
key: Zeroize<[u8; 32]>,
/// Driver slot this key is bound to. Used for audit logging and for
/// verifying that the key is not accidentally applied to a different slot.
driver_slot: DriverSlot,
/// HKDF generation input that was used when this key was derived.
/// Checked before HMAC verification: a key with generation G will not
/// successfully verify state that was tagged under generation G' ≠ G,
/// because the HKDF `Info` field differs.
generation: u32,
}
/// Memory-safe zeroizing wrapper.
///
/// Uses volatile pointer writes via `core::ptr::write_volatile` to prevent
/// the compiler from treating the zeroing as a dead store and eliding it.
/// This is the same pattern used by the `zeroize` crate in userspace.
pub struct Zeroize<T: Copy + Default>(T);
impl<T: Copy + Default> Drop for Zeroize<T> {
fn drop(&mut self) {
// SAFETY: `self.0` is valid, aligned, and exclusively owned here.
// The volatile write prevents the optimizer from removing the zeroing.
unsafe {
core::ptr::write_volatile(&mut self.0, T::default());
}
}
}
Key lifecycle:
- Allocation — DriverHmacKey::new(slot, generation) is called under PKEY 0 protection during driver_load(). The call:
  1. Reads 32 bytes from the platform TRNG (RDRAND on x86-64; SoC TRNG on ARM/RISC-V).
  2. Reads TPM PCR[7] (32 bytes) via the TPM KABI call (Section 8.3).
  3. Derives the key via HKDF-SHA256:
     key = HKDF(IKM=trng_bytes, salt=pcr7, info="umka-driver-hmac" || slot || gen).
  4. Stores the result in DriverHmacKey.key within the PKEY 0 slab.
- Access — Only UmkaOS Core code executing with PKRU granting PKEY 0 read/write can dereference DriverHmacKey.key. Driver code (PKEY 2-13 domains) receives a page fault if it attempts to read the key's memory. The HMAC computation itself is performed by a dedicated Core function (driver_state_hmac_compute) that briefly acquires PKEY 0 access, performs the computation into a stack-local output buffer, then restores the caller's PKRU before returning. The key material is never copied to driver-accessible memory.
- Rotation — On every driver reload (crash recovery or explicit operator unload), generation increments, a new TRNG sample is drawn, and a fresh key is derived. The old DriverHmacKey is dropped, triggering Zeroize::drop, which overwrites the key material with zeros via volatile writes before the slab page is returned to the allocator.
- Storage location — umka_core::driver_registry::SLOT_KEYS[slot] is a static array in the .data.pkey0 linker section. The linker script maps this section to a physical page range that is exclusively assigned to PKEY 0 in the Core page tables. The array is indexed by DriverSlot (the same integer used in DeviceNode); the maximum number of concurrent driver slots is discovered at boot from the device count and the configured tier limits (no compile-time cap).
- Discarding — When a DeviceNode is removed from the registry (device unplugged or driver_deregister() called by the operator), state_hmac_key is set to None, dropping the DriverHmacKey value and zeroizing the key. Subsequent crash recovery for this slot (if a new device is hotplugged to the same slot) generates an entirely fresh key.
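The Allocation step builds the HKDF `Info` field from a fixed label plus the slot and generation counters, which is what binds each key to one slot/generation pair. A minimal sketch of just that byte-layout step (hmac_key_info is a hypothetical helper name; the layout follows the derivation formula quoted above):

```rust
/// Illustrative sketch of the HKDF Info construction described above:
/// "umka-driver-hmac" || slot_id (LE) || generation (LE). Two keys derived
/// for different slots or generations get different Info bytes, so HKDF
/// yields unrelated output even from the same IKM and salt.
fn hmac_key_info(slot_id: u32, generation: u32) -> Vec<u8> {
    let mut info = Vec::with_capacity(16 + 4 + 4);
    info.extend_from_slice(b"umka-driver-hmac"); // 16-byte domain label
    info.extend_from_slice(&slot_id.to_le_bytes());
    info.extend_from_slice(&generation.to_le_bytes());
    info
}

fn main() {
    let a = hmac_key_info(3, 5);
    let b = hmac_key_info(3, 4);
    assert_eq!(a.len(), 24);
    assert_ne!(a, b); // a generation bump changes the derivation context
    assert!(a.starts_with(b"umka-driver-hmac"));
}
```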
10.5.3.2 PropertyTable
Properties are the lingua franca of matching and introspection. They serve the same role as IOKit's property dictionaries and Linux's sysfs attributes.
// PropertyValue variants String, Bytes, and StringArray use heap-allocated
// containers. These are only constructed after heap init (boot step 4b+).
// For pre-heap device identification, Tier 0 devices use fixed-size
// ArrayString<64> in BusIdentity (Section 10.5.3.3) which is stack-allocated.
pub enum PropertyValue {
U64(u64),
I64(i64),
String(String),
Bytes(Vec<u8>),
Bool(bool),
StringArray(Vec<String>),
}
/// Stored as a sorted Vec for cache-friendly iteration and binary search.
/// Device nodes rarely have more than ~30 properties.
pub struct PropertyTable {
entries: Vec<(PropertyKey, String, PropertyValue)>,
}
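The sorted-Vec layout pays off because both lookup and insertion can use binary search over a contiguous array. A simplified userland sketch of that access pattern (the types here are illustrative stand-ins, not the real KABI definitions):

```rust
/// Simplified sketch of the sorted-Vec property table described above.
/// Entries stay sorted by key, so lookup is a binary search — O(log n)
/// and cache-friendly for the ~30 properties a typical node carries.
#[derive(Debug, Clone, PartialEq)]
enum PropertyValue {
    U64(u64),
    Str(String),
}

struct PropertyTable {
    entries: Vec<(String, PropertyValue)>, // kept sorted by key
}

impl PropertyTable {
    fn new() -> Self { Self { entries: Vec::new() } }

    fn set(&mut self, key: &str, value: PropertyValue) {
        match self.entries.binary_search_by(|(k, _)| k.as_str().cmp(key)) {
            Ok(i) => self.entries[i].1 = value, // overwrite in place
            Err(i) => self.entries.insert(i, (key.to_string(), value)),
        }
    }

    fn get(&self, key: &str) -> Option<&PropertyValue> {
        self.entries
            .binary_search_by(|(k, _)| k.as_str().cmp(key))
            .ok()
            .map(|i| &self.entries[i].1)
    }
}

fn main() {
    let mut t = PropertyTable::new();
    t.set("vendor-id", PropertyValue::U64(0x8086));
    t.set("bus-type", PropertyValue::Str("pci".into()));
    assert_eq!(t.get("vendor-id"), Some(&PropertyValue::U64(0x8086)));
    assert_eq!(t.get("missing"), None);
    // Keys end up sorted regardless of insertion order.
    assert_eq!(t.entries[0].0, "bus-type");
}
```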
Standard property keys (well-known constants):
| Key | Type | Description | Set By |
|---|---|---|---|
| "bus-type" | String | "pci", "usb", "platform", "virtio" | Bus enumerator |
| "vendor-id" | U64 | PCI/USB vendor ID | Bus enumerator |
| "device-id" | U64 | PCI/USB device ID | Bus enumerator |
| "subsystem-vendor-id" | U64 | PCI subsystem vendor | Bus enumerator |
| "subsystem-device-id" | U64 | PCI subsystem device | Bus enumerator |
| "class-code" | U64 | PCI class code / USB class | Bus enumerator |
| "revision-id" | U64 | Hardware revision | Bus enumerator |
| "compatible" | StringArray | DT/ACPI compatible strings | Firmware parser |
| "device-name" | String | Human-readable name | Bus enumerator |
| "driver-name" | String | Name of bound driver | Registry |
| "driver-tier" | U64 | Current isolation tier | Registry |
| "numa-node" | I64 | NUMA node ID | Topology scanner |
| "location" | String | Physical topology path (e.g., PCI BDF) | Bus enumerator |
| "serial-number" | String | Device serial if available | Bus enumerator |
Properties set by "Bus enumerator" are populated during device discovery by whatever code enumerates the bus (PCI config space scan, USB hub status, ACPI namespace walk). Properties set by "Registry" are managed by the kernel. Drivers can set custom properties on their own device node via KABI.
10.5.3.3 BusIdentity
A union-like enum holding bus-specific identification. The Pci variant reuses the
existing PciDeviceId type from the driver SDK.
pub enum BusIdentity {
Pci {
segment: u16,
bus: u8,
device: u8,
function: u8,
id: PciDeviceId, // Existing type from abi.rs
},
Usb {
bus_num: u16,
port_path: [u8; 8], // Hub topology chain
port_depth: u8,
vendor_id: u16,
product_id: u16,
device_class: u8,
device_subclass: u8,
device_protocol: u8,
interface_class: u8,
interface_subclass: u8,
interface_protocol: u8,
},
Platform {
compatible: ArrayString<64>, // ACPI _HID or DT compatible
unit_id: u64, // ACPI _UID or DT unit address
},
VirtIo {
device_type: u32,
vendor_id: u32,
device_id: u32,
},
}
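The Pci variant's segment/bus/device/function fields are what the registry renders into the "location" property. A small illustrative helper (pci_location is not part of the KABI) showing the canonical BDF formatting:

```rust
/// Illustrative helper (not from the KABI) rendering a Pci variant's
/// segment/bus/device/function into the canonical BDF form used by the
/// "location" property, e.g. "0000:03:00.0".
fn pci_location(segment: u16, bus: u8, device: u8, function: u8) -> String {
    // Segment and bus/device are zero-padded hex; function is a bare digit.
    format!("{:04x}:{:02x}:{:02x}.{}", segment, bus, device, function)
}

fn main() {
    assert_eq!(pci_location(0, 3, 0, 0), "0000:03:00.0");   // NVMe in the example tree
    assert_eq!(pci_location(0, 0, 0x1f, 2), "0000:00:1f.2"); // SATA controller
}
```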
10.5.3.4 Service Links
Non-tree edges representing provider-client relationships between devices:
pub struct ServiceLink {
service_name: ArrayString<64>, // e.g., "phy", "i2c", "gpio", "block"
node_id: DeviceNodeId,
cap_handle: CapHandle, // Capability for mediated access
}
10.5.3.5 Tree Structure Example
Root
+-- acpi0 (ACPI namespace root)
| +-- pci0000:00 (PCI host bridge, segment 0, bus 0)
| | +-- 0000:00:1f.0 (ISA bridge / LPC)
| | +-- 0000:00:1f.2 (SATA controller)
| | | +-- ata0 (ATA port 0)
| | | | +-- sda (disk)
| | | +-- ata1 (ATA port 1)
| | +-- 0000:00:14.0 (USB XHCI controller)
| | | +-- usb1 (USB bus)
| | | | +-- usb1-1 (hub)
| | | | | +-- usb1-1.1 (keyboard)
| | | | | +-- usb1-1.2 (mouse)
| | +-- 0000:03:00.0 (NVMe controller)
| | | +-- nvme0n1 (NVMe namespace 1)
| | +-- 0000:04:00.0 (NIC - Intel i225)
| | | ...provider-client link: "phy" --> phy0 (not a child)
+-- platform0 (Platform device root)
+-- serial0 (Platform UART)
+-- phy0 (Platform PHY device)
Two types of edges:
- Parent-Child (structural containment): A PCI device is a child of a PCI bridge. A USB device is a child of a USB hub. This is the primary tree structure.
- Provider-Client (service dependency): Lateral edges. A NIC is a client of a PHY's "phy" service. A GPU display driver is a client of an I2C controller's "i2c" service. These edges do not form cycles (enforced by the registry).
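The acyclicity guarantee means the registry must reject any new provider-client edge that would close a loop. A toy sketch of that check, assuming a simple reachability test (the real enforcement mechanism is not specified here; node IDs are plain integers for illustration):

```rust
use std::collections::HashMap;

/// Reachability over existing service links: can `from` reach `to`
/// by following client -> provider edges? Safe to recurse because the
/// invariant below guarantees the graph is already acyclic.
fn reaches(edges: &HashMap<u32, Vec<u32>>, from: u32, to: u32) -> bool {
    if from == to {
        return true;
    }
    edges
        .get(&from)
        .map_or(false, |next| next.iter().any(|&n| reaches(edges, n, to)))
}

/// Add a client -> provider service link, rejecting it if the provider
/// can already reach the client (which would create a cycle).
fn add_service_link(
    edges: &mut HashMap<u32, Vec<u32>>,
    client: u32,
    provider: u32,
) -> Result<(), &'static str> {
    if reaches(edges, provider, client) {
        return Err("service link would create a cycle");
    }
    edges.entry(client).or_default().push(provider);
    Ok(())
}

fn main() {
    let mut edges = HashMap::new();
    // NIC (1) is a client of PHY (2); PHY (2) is a client of I2C (3).
    assert!(add_service_link(&mut edges, 1, 2).is_ok());
    assert!(add_service_link(&mut edges, 2, 3).is_ok());
    // I2C becoming a client of the NIC would close a cycle — rejected.
    assert!(add_service_link(&mut edges, 3, 1).is_err());
}
```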
10.5.3.6 The Registry
// DeviceRegistry uses BTreeMap, HashMap, Vec, and VecDeque — all heap-allocated.
// The registry is initialized at boot step 4g (Section 10.5.11), after the heap
// is available. It is never accessed before heap init.
pub struct DeviceRegistry {
/// All nodes, indexed by ID.
nodes: BTreeMap<DeviceNodeId, DeviceNode>,
/// Next node ID (monotonically increasing).
next_id: AtomicU64,
/// Index: bus identity --> node ID (fast device lookup).
bus_index: HashMap<BusLookupKey, DeviceNodeId>,
/// Index: property key+value --> set of node IDs (for matching).
property_index: HashMap<PropertyKey, Vec<DeviceNodeId>>,
/// Index: driver name --> set of node IDs (for crash recovery).
driver_index: HashMap<ArrayString<64>, Vec<DeviceNodeId>>,
/// Registered match rules from all known driver manifests.
match_rules: Vec<MatchRegistration>,
/// Pending hotplug events.
hotplug_queue: VecDeque<HotplugEvent>,
/// Power management state.
power_manager: PowerManager,
}
The registry lives entirely within UmkaOS Core. It is never exposed as a data structure to drivers.
10.5.3.7 DeviceResources
Each device node tracks its allocated hardware resources. This is the kernel-internal
counterpart of what Linux spreads across struct resource, struct pci_dev fields,
and struct msi_desc lists.
/// Hardware resources allocated to a device. Kernel-internal, NOT part of KABI.
pub struct DeviceResources {
/// PCI Base Address Register mappings (up to 6 BARs per PCI function).
pub bars: [Option<BarMapping>; 6],
/// Interrupt allocations (legacy, MSI, or MSI-X vectors).
pub irqs: Vec<IrqAllocation>,
/// Number of pages currently pinned for DMA by this device.
/// Page reclaim (Section 4.2) checks this count before attempting to compress
/// or swap a page — DMA-pinned pages are never eligible.
pub dma_pin_count: AtomicU32,
/// Maximum DMA-pinnable pages for this device (enforced by cgroup and
/// per-device limits). 0 = unlimited.
pub dma_pin_limit: u32,
/// MMIO regions mapped for this device (non-BAR, e.g., firmware tables).
pub mmio_regions: Vec<MmioRegion>,
/// Legacy I/O port ranges (x86 only, rare in modern hardware).
pub io_ports: Vec<IoPortRange>,
/// DMA address mask — how many bits of physical address the device can
/// generate. Determines bounce buffer requirements.
pub dma_mask: u64, // e.g., 0xFFFFFFFF for 32-bit DMA
pub coherent_dma_mask: u64, // For coherent (non-streaming) DMA
}
pub struct BarMapping {
pub bar_index: u8,
pub phys_addr: u64,
pub size: u64,
pub flags: BarFlags,
/// Kernel virtual address if mapped. None = not yet mapped (lazy).
pub mapped_vaddr: Option<u64>,
}
bitflags::bitflags! {
#[repr(transparent)]
pub struct BarFlags: u32 {
const MEMORY_64 = 1 << 0; // 64-bit MMIO (vs 32-bit)
const IO_PORT = 1 << 1; // I/O port space (legacy x86)
const PREFETCHABLE = 1 << 2; // Can be mapped write-combining
}
}
pub struct IrqAllocation {
pub irq_type: IrqType,
pub vector: u32, // Global IRQ vector number
pub cpu_affinity: Option<u32>, // Preferred CPU for this interrupt
}
#[repr(u32)]
pub enum IrqType {
LegacyPin = 0, // INTx (shared, level-triggered)
Msi = 1, // Message Signaled Interrupt (single vector)
MsiX = 2, // MSI-X (independent vectors, per-queue)
}
pub struct MmioRegion {
pub phys_addr: u64,
pub size: u64,
pub cacheable: bool,
}
pub struct IoPortRange {
pub base: u16,
pub size: u16,
}
DMA pin counting is a critical safety mechanism:
- Every dma_map_*() call through KABI increments the device's dma_pin_count.
- Every dma_unmap_*() call decrements it.
- The page reclaim path (Section 4.2) checks whether a page's owning device has active DMA pins before attempting compression or swap-out. Pages with active DMA mappings are unconditionally skipped — moving a page while a device is DMAing to it would cause silent data corruption.
- On driver crash recovery (Section 10.5.10), all DMA mappings for the crashed driver are forcibly invalidated (IOMMU entries torn down), and dma_pin_count is reset to zero. This is safe because the device has been reset.
- The dma_pin_limit provides defense-in-depth: a buggy or malicious driver cannot pin all of physical memory for DMA. The limit is enforced by the kernel, not the driver.
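Because many CPUs can issue dma_map_*() concurrently, the count/limit pair has to be updated atomically or two racing maps could jointly overshoot the limit. A sketch of one way to do this with a lock-free compare-and-swap loop (try_pin/unpin are hypothetical helper names, not the KABI entry points):

```rust
use std::sync::atomic::{AtomicU32, Ordering};

/// Sketch of dma_pin_count accounting against dma_pin_limit (0 = unlimited),
/// using fetch_update so concurrent dma_map_*() calls cannot overshoot.
fn try_pin(count: &AtomicU32, limit: u32, pages: u32) -> bool {
    count
        .fetch_update(Ordering::AcqRel, Ordering::Acquire, |cur| {
            let new = cur.checked_add(pages)?; // reject on overflow
            if limit != 0 && new > limit {
                None // would exceed the per-device limit — abort the update
            } else {
                Some(new)
            }
        })
        .is_ok()
}

fn unpin(count: &AtomicU32, pages: u32) {
    count.fetch_sub(pages, Ordering::AcqRel);
}

fn main() {
    let count = AtomicU32::new(0);
    assert!(try_pin(&count, 4, 3));  // 3 of 4 permitted pages pinned
    assert!(!try_pin(&count, 4, 2)); // would exceed the limit — rejected
    unpin(&count, 3);
    assert_eq!(count.load(Ordering::Acquire), 0);
}
```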
Resource lifecycle:
Resources are allocated during device discovery (BARs, IRQs) and driver initialization (DMA mappings, additional MMIO). On device removal or driver crash, all resources are reclaimed by the registry in reverse order: DMA mappings first (IOMMU teardown), then IRQs (free vectors), then BAR unmappings, then MMIO unmappings.
10.5.3.8 IOMMU Groups
IOMMU groups model hardware isolation boundaries. An IOMMU group is the smallest unit of device isolation that the hardware can enforce — all devices in a group share the same IOMMU domain (page table).
pub struct IommuGroupId(pub u32);
pub enum IommuDomainType {
/// Kernel DMA domain — device DMA goes through kernel-managed IOMMU
/// page tables. Default for all devices.
Kernel,
/// Identity-mapped DMA domain — IOMMU programs 1:1 physical-to-bus
/// mapping. Device DMA addresses equal physical addresses. Requires
/// explicit admin opt-in per device. See Section 10.5.3.8 "Per-Device DMA
/// Identity Mapping" for constraints and security implications.
Identity {
/// Upper bound of the 1:1 mapping (typically max_phys_addr).
phys_range_end: u64,
},
/// VM passthrough domain — entire group assigned to a VM. The VM's
/// IOMMU page tables control device DMA. Used for VFIO passthrough.
VmPassthrough {
vm_id: u64,
/// Second-level page table root (EPT/NPT base).
page_table_root: u64,
},
/// Userspace DMA domain — for Tier 2 drivers that need direct DMA
/// (e.g., DPDK-style networking). IOMMU restricts DMA to the
/// driver process's permitted regions.
UserspaceDma {
owning_pid: u64,
},
}
Why IOMMU groups matter:
- VFIO passthrough: When assigning a device to a VM (GPU, NIC, NVMe controller, FPGA, etc.), the kernel must assign the entire IOMMU group. If two devices share a group (e.g., GPU and its audio function on the same PCI slot, or NIC and a co-located function), both must be assigned together. The registry validates this constraint before permitting passthrough. See Section 21.5.2.4 for GPU-specific passthrough details.
- ACS (Access Control Services): PCIe ACS capabilities determine group boundaries. With ACS, each PCI function can be its own group. Without ACS, all devices behind a non-ACS bridge form a single group (because they could DMA to each other without going through the IOMMU).
- Isolation guarantee: The IOMMU group is the hardware's isolation primitive. The registry enforces that no device in a passthrough group remains in the kernel domain — leaving one behind would allow the VM to DMA to the kernel device's memory.
Group discovery:
During PCI enumeration (Section 10.5.11.3), the registry determines IOMMU groups by walking the PCI topology and checking ACS capability bits:
For each PCI device:
1. Walk upstream to the root port, checking ACS at each bridge.
2. If all bridges have ACS: device is in its own group.
3. If a bridge lacks ACS: all devices below that bridge share a group.
4. Peer-to-peer devices behind the same non-ACS switch: same group.
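The walk above can be modeled with a toy function: each device carries the chain of bridges between it and the root port, and the deepest bridge lacking ACS becomes the shared group anchor. A sketch under those assumptions (Bridge/group_key are illustrative names, not registry types):

```rust
/// Toy model of the group-discovery walk: a device's upstream path is a
/// list of bridges, each with or without full ACS. The first non-ACS
/// bridge (walking up from the device) anchors a shared group; a fully
/// ACS-capable path yields a singleton group keyed by the device itself.
#[derive(Clone)]
struct Bridge {
    acs: bool,
}

fn group_key(upstream_path: &[Bridge], device_id: u32) -> (bool, u32) {
    for (i, b) in upstream_path.iter().enumerate() {
        if !b.acs {
            // Shared group anchored at bridge i — everything below it
            // could DMA peer-to-peer without IOMMU translation.
            return (false, i as u32);
        }
    }
    (true, device_id) // full ACS all the way up: device is its own group
}

fn main() {
    let acs_path = vec![Bridge { acs: true }, Bridge { acs: true }];
    let no_acs_path = vec![Bridge { acs: false }, Bridge { acs: true }];
    // Full ACS: two devices on the same path get distinct groups.
    assert_ne!(group_key(&acs_path, 1), group_key(&acs_path, 2));
    // Shared non-ACS bridge: both devices collapse into one group.
    assert_eq!(group_key(&no_acs_path, 1), group_key(&no_acs_path, 2));
}
```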
Passthrough assignment flow:
1. Admin requests device passthrough for VM (via /dev/vfio/N or umka-kvm API)
2. Registry looks up device's DeviceNode → iommu_group
3. Registry checks: all devices in group unbound or assignable?
4. If yes: unbind kernel drivers, switch group to VmPassthrough domain
5. Program IOMMU with VM's second-level page tables
6. VM's guest OS sees the device and loads its own driver
7. On VM teardown: switch back to Kernel domain, rebind kernel drivers
The registry prevents partial group assignment: if device A and device B share IOMMU
group 7, and only A is requested for passthrough, the request is rejected with
-EBUSY unless B is also unbound. This prevents a safety violation where the VM
could DMA to B's kernel-managed memory.
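Step 3 of the assignment flow reduces to a simple predicate over the group's members: every device other than the one requested must be unbound. A sketch of that validation (the Member type and EBUSY constant are illustrative, not the real KABI):

```rust
/// Sketch of the whole-group validation: passthrough of one device is
/// permitted only when every sibling in its IOMMU group is unbound.
const EBUSY: i32 = 16; // illustrative errno value

struct Member {
    bound_driver: Option<&'static str>,
}

fn check_group_passthrough(group: &[Member], requested: usize) -> Result<(), i32> {
    for (i, m) in group.iter().enumerate() {
        if i != requested && m.bound_driver.is_some() {
            return Err(-EBUSY); // a sibling is still kernel-bound — reject
        }
    }
    Ok(())
}

fn main() {
    // GPU (index 0) requested for passthrough; its audio function (index 1)
    // is still bound to a kernel driver — the request must fail.
    let group = [
        Member { bound_driver: None },
        Member { bound_driver: Some("hda") },
    ];
    assert_eq!(check_group_passthrough(&group, 0), Err(-16));

    // After unbinding the audio function, passthrough is permitted.
    let group = [Member { bound_driver: None }, Member { bound_driver: None }];
    assert!(check_group_passthrough(&group, 0).is_ok());
}
```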
IOMMU Group Assignment Algorithm (device discovery):
The following algorithm runs during device enumeration (Section 2.1, boot hardware discovery) to assign each device to an isolation domain:
For each PCIe device discovered during enumeration:
a. Query the IOMMU group ID for the device from the IOMMU driver.
(IOMMU groups are defined by hardware — devices sharing a stream ID
or lacking ACS isolation are in the same group.)
b. If the group ID is new (first device in this group):
- Allocate a new isolation domain for this group.
- Register: iommu_group_domains[group_id] = new_domain.
c. If the group ID already has a domain assignment:
- Assign this device to the existing domain.
- Log: "Device [bus:dev.fn] shares IOMMU group [id] with [other devices]
— assigned to same isolation domain [domain_id]."
ACS (Access Control Services) check:
PCIe ACS must be enabled on root ports and upstream ports/switches to allow
per-function IOMMU groups. If ACS is absent on an upstream bridge:
- All devices downstream of that bridge share one IOMMU group.
- They are assigned to a shared isolation domain.
- Log a warning: "PCIe switch at [bus:dev.fn] lacks ACS — [N] devices share
one IOMMU group. Per-device isolation not possible."
- UmkaOS does NOT disable the device — it runs with reduced isolation (shared
domain) and logs the degraded state to FMA (Section 19.1).
Singleton groups (preferred):
When ACS is present and hardware supports per-function translation,
each device gets its own IOMMU group and its own isolation domain.
This is the default and preferred configuration for Tier 1 drivers.
Driver cgroup co-isolation enforcement:
UmkaOS enforces that all devices in an IOMMU group belong to the same driver
cgroup. If a user attempts to assign two devices from the same IOMMU group
to different drivers, the second assignment fails with -EACCES and the error
message: "Device [bus:dev.fn] shares IOMMU group [id] with device [other
bus:dev.fn] — both must be assigned to the same driver."
IOMMU Group Formation: pci_device_group Algorithm
The pseudo-code above describes the consumer side of IOMMU group assignment —
how the device registry attaches devices to existing domains. The following specifies
the formation side: how pci_device_group determines which IOMMU group a newly
enumerated PCI device belongs to. This algorithm matches the Linux implementation
(drivers/iommu/iommu.c, intel/iommu.c, amd/iommu.c, arm/arm-smmu-v3/arm-smmu-v3.c) and is
the authoritative procedure for all three major IOMMU hardware families.
/// ACS flags that together guarantee DMA request isolation between PCIe peers.
/// Without all four bits set on an upstream bridge, devices behind that bridge
/// can issue peer-to-peer DMA that bypasses IOMMU translation entirely.
///
/// - SV (Source Validation): bridge verifies the requester ID is valid
/// - RR (Request Redirection): DMA requests are redirected through the IOMMU
/// - CR (Completion Redirection): completions return through the IOMMU
/// - UF (Upstream Forwarding): upstream traffic is forwarded to the RC
const REQ_ACS_FLAGS: AcsFlags =
AcsFlags::SV | AcsFlags::RR | AcsFlags::CR | AcsFlags::UF;
/// Determine the IOMMU group for a PCI device.
///
/// Called during driver registration and device hotplug. Returns the
/// `IommuGroup` to which this device must belong. A device can only
/// be assigned an `IommuDomain` that covers its entire group — partial
/// assignment is a hardware violation and is rejected by the registry.
///
/// # Algorithm
///
/// The four steps below are executed in order. The first step that
/// produces an existing group terminates the search and returns that group.
/// If no group is found, a fresh group is allocated in step 4.
pub fn pci_device_group(
dev: &PciDevice,
iommu: &IommuInstance,
) -> Arc<IommuGroup> {
// Step 1: DMA alias resolution.
//
// Conventional PCI devices behind a PCIe-to-PCI bridge have their
// requester ID rewritten to the bridge's BDF by the bridge — the IOMMU
// sees the bridge's requester ID, not the device's own BDF. Such devices
// are called "DMA aliases" of the bridge. All devices sharing the same
// alias must be in the same IOMMU group because the IOMMU cannot
// distinguish their DMA transactions.
//
// `resolve_dma_aliases` walks the alias set (via the PCIe alias capability
// and conventional PCI bridge topology) and returns the canonical anchor BDF.
let anchor = resolve_dma_aliases(dev);
if let Some(existing) = iommu.group_for_bdf(anchor.bdf()) {
return existing;
}
// Step 2: ACS boundary walk.
//
// Walk upstream bridges from the device toward the root complex. At each
// bridge, check whether all four REQ_ACS_FLAGS bits are set in the bridge's
// ACS capability register. The first bridge that lacks full ACS is the
// isolation failure point: it cannot prevent peer devices from issuing
// DMA to each other without going through the IOMMU. Move the group
// anchor up to that bridge — all devices below it must share one group.
//
// Stop walking when we reach a bridge that has all four ACS bits set;
// that bridge IS the isolation boundary. Devices on opposite sides of a
// fully ACS-capable bridge can have separate IOMMU groups.
let anchor = walk_acs_boundary(anchor, REQ_ACS_FLAGS);
if let Some(existing) = iommu.group_for_bdf(anchor.bdf()) {
return existing;
}
// Step 3: Multifunction slot grouping.
//
// PCI multifunction devices (multiple functions on the same device number,
// e.g., device 0, functions 0..7) can DMA-alias each other when ACS is
// absent. If the anchor is a multifunction device without full ACS, all
// sibling functions on the same slot must share an IOMMU group.
if anchor.is_multifunction() && !anchor.has_acs(REQ_ACS_FLAGS) {
if let Some(existing) = find_sibling_function_group(&anchor, iommu) {
return existing;
}
}
// Step 4: Allocate a new group.
//
// No existing group was found via aliases, ACS failures, or multifunction
// sharing. This device is hardware-isolated from all others and gets its
// own IOMMU group (the preferred configuration for Tier 1 isolation).
IommuGroup::new(iommu)
}
IommuGroup struct (canonical definition; replaces the forward declaration in Section 10.5.3.8):
/// Maximum PCIe devices in one IOMMU group. ACS-disabled PCIe switches can group
/// entire bus fabrics; 128 is a safe upper bound for realistic PCIe topologies.
pub const IOMMU_GROUP_MAX_DEVICES: usize = 128;
pub struct IommuGroup {
/// Unique group ID assigned at creation. Never reused after group destruction.
pub id: u32,
/// IOMMU hardware instance that manages this group.
pub iommu: Arc<IommuInstance>,
/// PCIe devices sharing this IOMMU domain.
/// Fixed capacity: avoids heap allocation during bus enumeration and device hotplug.
pub devices: ArrayVec<Arc<PciDevice>, IOMMU_GROUP_MAX_DEVICES>,
/// The currently active IOMMU domain. One domain covers the entire group —
/// it is not possible to assign different domains to devices in the same group.
pub domain: RwLock<Option<Arc<IommuDomain>>>,
}
Firmware table parsing (determines IommuInstance → device scope during early boot,
before pci_device_group is called per-device):
- Intel VT-d (ACPI DMAR table): Each DRHD (DMA Remapping Hardware Definition) record describes one VT-d engine and its device scope entries (BDF ranges it manages). Devices not covered by any explicit DRHD scope fall under the catch-all DRHD with the INCLUDE_PCI_ALL flag. RMRR (Reserved Memory Region Reporting) records list physical address ranges that must be identity-mapped in every domain — typically BIOS-owned USB buffers and legacy VGA regions. UmkaOS programs RMRR regions as immutable identity entries in every new IOMMU domain before handing it to a driver.
- AMD-Vi (ACPI IVRS table): IVHD (I/O Virtualization Hardware Definition) records list each AMD IOMMU and the BDF ranges it controls. UmkaOS builds a flat lookup table amd_iommu_dev_table[BDF] during IVRS parsing, giving O(1) device-to-IOMMU resolution at pci_device_group() call time. IVMD (I/O Virtualization Memory Definition) records specify unity-mapped regions (analogous to Intel RMRR).
- ARM SMMU v3 (ACPI IORT table or Device Tree): Stream IDs (SIDs) are assigned by firmware and recorded in IORT iommu-map table entries or DT iommus/iommu-map properties. Each non-PCI device (platform device, ACPI device) gets its own IOMMU group unconditionally — non-PCI devices cannot alias each other. PCI devices behind an SMMU use pci_device_group() as above, with the SMMU providing the IommuInstance.
UmkaOS driver isolation requirement at driver_register():
A Tier 2 driver receives an IommuDomain that covers its entire IOMMU group. The
registration sequence is:
1. Parse the firmware table (DMAR/IVRS/IORT) to find which IommuInstance owns the device, then call pci_device_group() to determine the device's IommuGroup.
2. If the group already has an active IommuDomain: attach this device to that domain (the entire group is now under the driver's control — the registry verifies that all other group members are either unbound or owned by the same driver process).
3. If the group has no active domain: allocate a new IommuDomain, program the IOMMU hardware page tables (initially empty — no DMA permitted), then attach all devices in the group to the new domain.
4. Grant the driver process DMA access via umka_driver_dma_alloc (Section 10.6); each allocation adds an IOVA→PA entry to the domain's page tables and the IOMMU issues an IOTLB invalidation.
5. Any device in the group that issues a DMA transaction to an address outside its domain's IOVA space triggers an IOMMU fault → driver crash recovery path (Section 10.8).
10.5.3.9 IOMMU Implementation Complexity
IOMMU management is one of the most complex subsystems in any OS kernel, and this complexity should not be understated. The following areas are known to be difficult and are called out explicitly as high-effort implementation items:
Nested/two-level translation (SR-IOV + VFIO) — when a VM uses VFIO passthrough with SR-IOV virtual functions, the IOMMU must perform two-level address translation: guest virtual → guest physical (first level, programmed by the guest's IOMMU driver) then guest physical → host physical (second level, programmed by the host). Intel VT-d calls this "scalable mode with first-level and second-level page tables"; AMD-Vi calls it "guest page tables with nested paging." The two-level walk doubles TLB pressure and introduces a multiplicative page table depth (4-level × 4-level = 16 potential memory accesses per translation miss). IOTLB sizing and invalidation granularity are critical performance levers.
Performance bottlenecks — known IOMMU performance traps:
- Map/unmap storm: high-throughput I/O paths (NVMe at millions of IOPS, 100GbE line-rate) can generate millions of IOMMU map/unmap operations per second. Each map/unmap involves IOTLB invalidation. UmkaOS mitigates this with: (1) persistent DMA mappings for ring buffers (map once at driver init, never unmap), (2) batched invalidation (accumulate invalidations, flush once per batch), (3) per-CPU IOMMU invalidation queues to avoid contention.
- IOTLB capacity: hardware IOTLB entries are scarce (~128-512 entries on typical Intel VT-d). Under heavy I/O with many DMA mappings, IOTLB misses add ~100-500ns per translation. Large pages (2MB, 1GB) in IOMMU page tables dramatically reduce IOTLB pressure — UmkaOS's DMA mapping interface prefers large-page-aligned allocations when possible.
- Invalidation latency: IOTLB invalidation on Intel VT-d is not instantaneous. Drain-all invalidation can take ~1-10μs. Page-selective invalidation is faster but not supported on all hardware. UmkaOS checks hardware capability registers and uses the finest granularity available.
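The batched-invalidation mitigation can be sketched as a small queue: unmaps accumulate their IOVA ranges, and one hardware flush drains the whole batch. A toy model under illustrative assumptions (queue type, batch size, and method names are invented for the sketch):

```rust
/// Toy model of batched IOTLB invalidation: unmap operations queue their
/// (iova, len) ranges; a single flush drains the batch, amortizing the
/// expensive hardware invalidation over many unmaps.
struct InvalidationQueue {
    pending: Vec<(u64, u64)>, // queued (iova, len) ranges
    flushes: u64,             // how many hardware invalidations were issued
}

impl InvalidationQueue {
    const BATCH: usize = 32; // illustrative flush threshold

    fn new() -> Self {
        Self { pending: Vec::new(), flushes: 0 }
    }

    fn queue_unmap(&mut self, iova: u64, len: u64) {
        self.pending.push((iova, len));
        if self.pending.len() >= Self::BATCH {
            self.flush();
        }
    }

    fn flush(&mut self) {
        if self.pending.is_empty() {
            return;
        }
        // Real code would issue one ranged (or drain-all) IOTLB
        // invalidation covering every queued range here.
        self.pending.clear();
        self.flushes += 1;
    }
}

fn main() {
    let mut q = InvalidationQueue::new();
    for i in 0..64u64 {
        q.queue_unmap(i * 4096, 4096);
    }
    q.flush(); // drain anything left (nothing here — batches were exact)
    assert_eq!(q.flushes, 2); // 64 unmaps cost only 2 hardware flushes
}
```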
ACS (Access Control Services) — PCIe ACS is required for proper IOMMU group
isolation. Without ACS on a PCIe switch, all devices behind that switch land in the
same IOMMU group (defeating per-device isolation). Many consumer motherboards lack ACS
on the root port or PCIe switch, causing all devices to share one IOMMU group. UmkaOS
detects this at boot and logs a warning. The pcie_acs_override kernel parameter
(Linux compatibility) allows overriding this for testing, but with an explicit security
warning.
Errata — IOMMU hardware has errata. Intel VT-d errata include broken interrupt remapping on certain steppings, incorrect IOTLB invalidation scope, and non-compliant default domain behavior. UmkaOS's errata framework (Section 2.1.4) includes IOMMU errata alongside CPU errata — detected at boot, with workarounds applied automatically.
10.5.3.10 Per-Device DMA Identity Mapping (Opt-In Escape Hatch)
UmkaOS's default IOMMU policy is translated DMA for all devices — every DMA transaction passes through IOMMU page tables. This is non-negotiable for the driver isolation model: crash recovery, DMA fencing, and containment all depend on the kernel's ability to revoke DMA access by reprogramming IOMMU entries.
However, certain scenarios require identity-mapped DMA (device DMA addresses = physical addresses, IOMMU programmed as 1:1 pass-through for that device's domain):
- Latency-critical bare-metal I/O. High-frequency trading NICs, ultra-low-latency NVMe, and RDMA HCAs where the ~100-500ns IOTLB miss penalty on unmapped addresses is unacceptable. Persistent DMA mappings (Section 10.5.3.7) mitigate this for ring buffers, but scatter-gather DMA with dynamic buffer addresses still pays the IOTLB miss cost.
- Broken IOMMU interactions. Devices with firmware or silicon bugs that produce incorrect DMA addresses under translation (e.g., devices that hardcode physical addresses in firmware descriptors, or devices that ignore bus addresses returned by the OS).
- Debug and development. Tracing raw DMA transactions with hardware analyzers is simpler when bus addresses equal physical addresses.
/// Per-device DMA translation policy. Set via admin sysfs or boot parameter.
/// Default is Translated for all devices.
#[repr(u32)]
pub enum DeviceDmaPolicy {
/// All DMA goes through IOMMU page tables (default). Full isolation.
Translated = 0,
/// IOMMU programmed with 1:1 identity mapping for this device's domain.
/// Device DMA addresses equal physical addresses. IOMMU is still active
/// (interrupt remapping, fault reporting) but provides no DMA containment.
Identity = 1,
}
Constraints and trade-offs:
| Property | Translated (default) | Identity |
|---|---|---|
| DMA containment | Full — device can only reach explicitly mapped regions | None — device can DMA to any physical address in its identity window |
| Crash recovery | IOMMU entries revoked → in-flight DMA faults | Identity mapping cannot be selectively revoked without full device reset |
| Driver tier | Any tier | Tier 1 only (kernel-space drivers with CAP_DMA_IDENTITY) |
| IOTLB miss cost | ~100-500ns per miss | Zero (1:1 mapping fits in a single large-page IOTLB entry) |
| Interrupt remapping | Active | Active (identity mapping does not affect interrupt remapping) |
| IOMMU group rule | Per-device | Entire IOMMU group must use Identity if any member does |
Identity mapping scope: The kernel programs a 1:1 IOMMU mapping covering the
physical address range [0, max_phys_addr) in the device's IOMMU domain, using the
largest available page size (typically 1GB pages). This is a single IOTLB entry per
GB of physical memory — effectively zero IOTLB miss overhead. The IOMMU remains
active for interrupt remapping and fault reporting; only DMA address translation is
bypassed.
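The "single IOTLB entry per GB" claim is simple arithmetic: a 1:1 window over [0, max_phys_addr) mapped with 1GB pages needs ceil(size / 1GB) entries. A sketch of that calculation (identity_entries is a hypothetical helper name):

```rust
/// Arithmetic behind the identity-mapping scope: the window
/// [0, max_phys_addr) mapped with 1GB pages needs ceil(size / 1GB)
/// entries — a few hundred large-page entries even on a 512GB machine,
/// versus millions of 4KB mappings under fine-grained translation.
const GB: u64 = 1 << 30;

fn identity_entries(max_phys_addr: u64) -> u64 {
    max_phys_addr.div_ceil(GB) // round up any partial last gigabyte
}

fn main() {
    assert_eq!(identity_entries(64 * GB), 64);
    assert_eq!(identity_entries(64 * GB + 1), 65); // partial GB rounds up
    assert_eq!(identity_entries(512 * GB), 512);
}
```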
Security implications: A device in Identity mode can DMA to any physical address. A compromised or buggy driver controlling such a device can corrupt arbitrary kernel memory. This is equivalent to running without an IOMMU for that device. The kernel mitigates the blast radius:
- Explicit admin opt-in required. Identity mode is set via:
  - Boot parameter: umka.dma_identity=0000:03:00.0 (PCI BDF notation)
  - Sysfs at runtime: /sys/bus/pci/devices/0000:03:00.0/dma_policy (requires CAP_SYS_ADMIN + CAP_DMA_IDENTITY)
  - There is no global iommu.passthrough=1 equivalent. Every device must be individually opted in. This prevents accidentally disabling isolation for all devices.
- Tier 1 restriction. Only Tier 1 (in-kernel) drivers may use Identity mode. Tier 2 (userspace) drivers are denied — a compromised userspace process with identity-mapped DMA would be a full kernel compromise.
- Audit logging. Every Identity mode activation is logged to the security audit subsystem (Section 19.2.9) with the device BDF, requesting process, and admin credential.
- IOMMU group enforcement. If device A is set to Identity and shares an IOMMU group with device B, device B is also switched to Identity (since devices in the same IOMMU group can perform peer-to-peer DMA without IOMMU translation). The kernel logs a warning identifying all affected devices.
- No crash recovery guarantee. The kernel marks devices in Identity mode with a NO_DMA_FENCE flag. On driver crash, the kernel performs a Function Level Reset (FLR) or secondary bus reset instead of relying on IOMMU revocation — this is slower (10-100ms vs microseconds) but is the only safe option without DMA fencing.
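The activation checks above can be condensed into a single policy gate. The following is a minimal host-side sketch; `DriverTier`, `IdentityError`, and `check_identity_optin` are illustrative names, not the real UmkaOS kernel API:

```rust
/// Hypothetical sketch of the per-device Identity-mode policy gate.
/// Names are illustrative, not the real kernel types.
#[derive(Clone, Copy, PartialEq, Debug)]
pub enum DriverTier {
    Tier1, // in-kernel, MPK-isolated
    Tier2, // userspace
}

#[derive(PartialEq, Debug)]
pub enum IdentityError {
    NotTier1,          // Tier 2 drivers are always denied
    MissingCapability, // runtime opt-in needs both capabilities
}

pub fn check_identity_optin(
    tier: DriverTier,
    has_cap_sys_admin: bool,
    has_cap_dma_identity: bool,
) -> Result<(), IdentityError> {
    // Tier 2 (userspace) drivers are denied: identity-mapped DMA from a
    // compromised userspace process would be a full kernel compromise.
    if tier != DriverTier::Tier1 {
        return Err(IdentityError::NotTier1);
    }
    // Runtime (sysfs) opt-in requires CAP_SYS_ADMIN + CAP_DMA_IDENTITY.
    if !(has_cap_sys_admin && has_cap_dma_identity) {
        return Err(IdentityError::MissingCapability);
    }
    Ok(())
}
```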
Implementation in IommuDomainType:
pub enum IommuDomainType {
/// Kernel DMA domain with full IOMMU translation.
Kernel,
/// Identity-mapped DMA domain. IOMMU programs 1:1 mapping.
/// Only for Tier 1 drivers with explicit admin opt-in.
Identity {
/// Physical address range covered by the 1:1 mapping.
phys_range_end: u64,
},
/// VM passthrough domain — VM's page tables control DMA.
VmPassthrough {
vm_id: u64,
page_table_root: u64,
},
/// Userspace DMA domain — for Tier 2 drivers with restricted DMA.
UserspaceDma {
owning_pid: u64,
},
}
Global identity mode on weak-isolation architectures:
On most architectures, per-device identity mapping (above) is the correct granularity: even if one device needs passthrough, the rest should remain IOMMU-translated. However, on architectures where Tier 1 CPU-side isolation is already absent or equivalent to Tier 0, the IOMMU is the only remaining isolation boundary — and if the admin has already accepted that Tier 1 drivers share the kernel address space without hardware memory protection, the IOMMU overhead protects only against rogue DMA from device firmware, not from the driver code itself.
For these cases, UmkaOS provides a global identity mode restricted to platforms where CPU-side Tier 1 isolation is weak or absent:
/// System-wide DMA translation policy. Boot parameter only —
/// cannot be changed at runtime.
///
/// Boot parameter: umka.dma_default_policy={translated,identity}
/// Default: translated (always)
#[repr(u32)]
pub enum SystemDmaPolicy {
/// All devices use IOMMU translation (default on all architectures).
Translated = 0,
/// All Tier 1 devices default to identity-mapped DMA. Tier 2
/// (userspace) devices always remain Translated regardless of this
/// setting. Individual devices can still be overridden to Translated
/// via sysfs. Requires umka.isolation=performance or equivalent
/// weak-isolation architecture.
IdentityDefault = 1,
}
Preconditions for umka.dma_default_policy=identity:
The kernel refuses this boot parameter unless at least one of:
1. umka.isolation=performance is also set (admin has explicitly opted out of
Tier 1 CPU-side isolation — drivers promoted to Tier 0).
2. The architecture has no fast isolation mechanism and Tier 1 uses page-table
switching with overhead equivalent to Tier 2 (currently: PPC64LE POWER8, AArch64
mainstream with I/O-heavy workloads). On RISC-V, Tier 1 is not available and all
Tier 1 drivers already run as Tier 0; this condition does not apply.
If neither condition is met, the kernel prints a boot warning and ignores the parameter:
umka: dma_default_policy=identity rejected: CPU-side Tier 1 isolation is active.
Use umka.isolation=performance or per-device umka.dma_identity=<BDF> instead.
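The precondition check can be sketched as a small boot-time gate. Function and parameter names here are illustrative; the real kernel would evaluate this during early boot parameter parsing:

```rust
/// Sketch of the boot-time gate for umka.dma_default_policy=identity.
/// Names are illustrative; the error string mirrors the boot warning above.
pub fn identity_default_permitted(
    isolation_performance: bool, // umka.isolation=performance was also set
    tier1_isolation_weak: bool,  // arch has no fast Tier 1 isolation mechanism
) -> Result<(), &'static str> {
    if isolation_performance || tier1_isolation_weak {
        Ok(())
    } else {
        // Neither precondition holds: the parameter is rejected and ignored.
        Err("dma_default_policy=identity rejected: CPU-side Tier 1 isolation is active")
    }
}
```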
What global identity mode does NOT affect:
- Tier 2 (userspace) drivers — always IOMMU-translated, regardless of policy.
A compromised userspace process with identity-mapped DMA would be a full kernel
compromise.
- VM passthrough (VmPassthrough) — VM IOMMU domains are unaffected; the
hypervisor's second-level page tables remain in control.
- Interrupt remapping — remains active on all devices. Identity mode disables
DMA address translation only, not interrupt remapping.
- Per-device overrides — individual devices can be set to Translated via sysfs
even when the global default is Identity. This allows an admin to protect specific
devices (e.g., an untrusted USB controller) while running most devices in identity
mode.
Rationale: On RISC-V 64, where Tier 1 isolation is not available and all Tier 1
drivers run as Tier 0 (sharing the kernel address space in Ring 0 with full memory
access), the IOMMU is protecting against a strictly weaker threat (device firmware DMA)
than the one already accepted (driver code CPU access). Paying ~100-500ns per IOTLB
miss on every DMA operation to defend against device firmware — while the driver itself
has unrestricted access to all of kernel memory — is a questionable trade-off for
performance-sensitive workloads. The same logic applies when isolation=performance
explicitly promotes all drivers to Tier 0 on any architecture.
Why this is not Linux's iommu.passthrough=1: Linux's global passthrough exists
for legacy compatibility — many Linux drivers assume physical addresses equal bus
addresses, and passthrough preserved that assumption. UmkaOS's global identity mode
exists for a different reason: to avoid paying IOMMU overhead on platforms where the
security benefit is already negated by the absence of CPU-side isolation. The
precondition check ensures it cannot be enabled on platforms where IOMMU translation
is the critical isolation boundary (x86-64 with MPK, AArch64 with POE, etc.).
10.5.4 Device Matching
10.5.4.1 Match Rules
Drivers declare what hardware they support through match rules embedded in the driver
binary. Match rules are stored in a dedicated ELF section (.kabi_match) and read by the
kernel loader before init() is called.
/// A single match rule. Drivers can declare multiple rules — any match
/// triggers binding.
#[repr(C)]
pub struct MatchRule {
pub rule_size: u32, // Forward compat
pub match_type: MatchType,
pub data: MatchData, // 128-byte union, interpreted per match_type
}
#[repr(u32)]
pub enum MatchType {
PciId = 0, // Match by PCI vendor/device ID (with wildcards)
PciClass = 1, // Match by PCI class code (with mask)
UsbId = 2, // Match by USB vendor/product ID
UsbClass = 3, // Match by USB class/subclass/protocol
VirtIoType = 4, // Match by VirtIO device type
Compatible = 5, // Match by "compatible" string (DT/ACPI)
Property = 6, // Match by arbitrary property key/value
}
/// Match data union — interpreted per MatchType variant.
/// 128 bytes to accommodate the largest variant (Compatible: 128-byte string).
#[repr(C)]
pub union MatchData {
pub pci_id: PciMatchData, // MatchType::PciId or PciClass
pub usb_id: UsbMatchData, // MatchType::UsbId
pub usb_class: UsbClassMatch, // MatchType::UsbClass
pub virtio: VirtIoMatchData, // MatchType::VirtIoType
pub compatible: [u8; 128], // MatchType::Compatible (NUL-terminated)
pub property: PropertyMatch, // MatchType::Property
pub _raw: [u8; 128], // Pad to 128 bytes
}
#[repr(C)]
pub struct UsbMatchData {
pub vendor_id: u16, // 0xFFFF = wildcard
pub product_id: u16, // 0xFFFF = wildcard
}
#[repr(C)]
pub struct UsbClassMatch {
pub class: u8, // USB class code
pub subclass: u8, // 0xFF = wildcard
pub protocol: u8, // 0xFF = wildcard
}
#[repr(C)]
pub struct VirtIoMatchData {
pub device_type: u32, // VirtIO device type ID
}
#[repr(C)]
pub struct PropertyMatch {
pub key: [u8; 64], // Property key (NUL-terminated)
pub value: [u8; 64], // Property value (NUL-terminated)
}
Example — PCI ID match:
#[repr(C)]
pub struct PciMatchData {
pub vendor_id: u16, // 0xFFFF = wildcard
pub device_id: u16, // 0xFFFF = wildcard
pub subsystem_vendor: u16, // 0xFFFF = wildcard
pub subsystem_device: u16, // 0xFFFF = wildcard
pub class_code: u32, // Class code value
pub class_mask: u32, // Bits to compare (0 = ignore class)
}
A match table header in the ELF binary:
#[repr(C)]
pub struct MatchTableHeader {
pub magic: u32, // 0x4D415443 ("MATC")
pub header_size: u32,
pub rule_count: u32,
pub rule_size: u32, // sizeof(MatchRule)
// Followed by `rule_count` MatchRule structs
}
10.5.4.2 Match Engine
The kernel runs a simple priority-ordered match algorithm:
For each DeviceNode in Discovered state:
1. Collect the node's properties and bus identity
2. For each registered driver (sorted by priority):
a. For each MatchRule in that driver's match table:
- Evaluate the rule against the node's properties
- If match: record (driver, node, specificity) as a candidate
3. Select the candidate with highest specificity
4. If found: begin driver loading for this node
5. If no match: node stays in Discovered state (deferred probe)
Match specificity ranking (highest first):
| Rank | Match Type | Score | Example |
|---|---|---|---|
| 1 | Exact vendor + device + subsystem | 100 | This exact card from this exact OEM |
| 2 | Exact vendor + device ID | 80 | Any board with this chip |
| 3 | Full class code match | 60 | Any NVMe controller (class 01:08:02) |
| 4 | Partial class code (masked) | 40 | Any mass storage controller (class 01:xx:xx) |
| 5 | Compatible string (position-weighted) | 20+ | DT/ACPI compatible, first entry scores higher |
| 6 | Generic property match | 10 | Fallback / catchall |
Combination rule: When a single driver has multiple match rules and more than one matches a device, the driver's effective specificity is the maximum of all matching rule scores (not a sum). This ensures an exact vendor/device ID match (score 100) always dominates a class-code match (score 60) from the same driver, reflecting "most specific match wins" semantics.
When two drivers match with equal specificity, the driver with higher match_priority
(declared in its manifest) wins. If still tied, first-registered wins.
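The combination and tie-break rules can be sketched as follows (the `Candidate` record and `select` function are hypothetical names for internal match-engine state):

```rust
/// Hypothetical candidate record produced by step 2 of the match algorithm.
pub struct Candidate {
    pub driver: &'static str,
    pub rule_scores: Vec<u32>, // specificity score of each matching rule
    pub match_priority: u32,   // from the driver's manifest
}

/// Combination rule: max of all matching rule scores, not their sum.
pub fn effective_specificity(c: &Candidate) -> u32 {
    c.rule_scores.iter().copied().max().unwrap_or(0)
}

/// Selection: highest specificity, then match_priority, then first-registered.
pub fn select<'a>(candidates: &'a [Candidate]) -> Option<&'a Candidate> {
    let mut best: Option<&Candidate> = None;
    for c in candidates {
        let key = (effective_specificity(c), c.match_priority);
        match best {
            // `>=` keeps the earlier (first-registered) candidate on full ties.
            Some(b) if (effective_specificity(b), b.match_priority) >= key => {}
            _ => best = Some(c),
        }
    }
    best
}
```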
10.5.4.3 Deferred Matching
Some devices cannot be matched immediately — their driver may not yet be loaded (e.g., initramfs not yet mounted, or driver installed later by package manager).
- Devices with no match stay in Discovered state indefinitely.
- When a new driver is registered (loaded from initramfs, installed at runtime), all Discovered devices are re-evaluated against the new match rules.
- A KABI method registry_rescan() triggers manual re-evaluation.
This is analogous to Linux's deferred probe mechanism, but simpler because the matching is centralized rather than spread across per-bus probe functions.
10.5.4.4 DriverManifest Extensions
The DriverManifest (defined in umka-driver-sdk/src/capability.rs) gains match-related
fields (appended per ABI rules):
// Appended to DriverManifest
pub match_rule_count: u32, // Number of match rules in .kabi_match section
pub is_bus_driver: u32, // 1 = this driver discovers child devices
pub match_priority: u32, // Higher = preferred when specificity ties
pub _pad: u32,
10.5.4.5 Module Loader Queue
When the match engine selects a driver for a device (step 4 in Section 10.5.4.2),
it submits a DriverLoadRequest to the module loader work queue. The module loader
runs as a set of kernel worker threads and serializes concurrent loading, signature
verification, and domain allocation.
LoadReason is the shared type defined in Section 11.1.9.6
(11-kabi.md). Variants used by the device driver loader: HotPlug (device enumeration),
Boot (initramfs/cmdline), Dependency, CrashRecovery, UserRequest.
/// A device-driver-specific load request. More detailed than the KABI-level
/// `ModuleLoadRequest` (Section 11.1.9.6): includes the trigger device, result
/// type `DriverHandle`, priority override, and timeout. Uses the shared
/// `LoadReason` enum from Section 11.1.9.6.
pub struct DriverLoadRequest {
/// Absolute path to the `.kabi` manifest file in the umkafs namespace,
/// e.g., `/System/Kernel/drivers/nvme/nvme.kabi`.
pub manifest_path: Box<str>,
/// Reason for this load request (determines scheduling priority).
pub reason: LoadReason,
/// Device that triggered this load when `reason == LoadReason::HotPlug`.
/// `None` for dependency loads, user requests, and boot-time loads.
pub trigger_device: Option<DeviceHandle>,
/// Completion channel: the loader sends `Ok(DriverHandle)` on success
/// or `Err(KernelError)` on failure (bad signature, manifest error,
/// domain allocation failure, driver `init()` returning an error, etc.).
pub result_tx: oneshot::Sender<Result<DriverHandle, KernelError>>,
/// Priority override. `0` = derive from `LoadReason` (default).
/// `1`–`255` = explicit override (higher = higher priority).
pub priority_override: u8,
/// Load timeout in milliseconds. `0` = system default (30 000 ms).
/// The loader cancels and returns `Err(KernelError::Timeout)` if the
/// driver does not complete `init()` within this window.
pub timeout_ms: u32,
}
/// Priority-ordered work queue for driver module loads.
///
/// Bounded capacity prevents memory exhaustion from a flood of hotplug events
/// (e.g., enumerating a USB hub with 127 devices simultaneously).
/// Default capacity: 256 pending requests.
pub struct ModuleLoaderQueue {
/// Pending load requests ordered by effective priority (highest first).
queue: SpinLock<BinaryHeap<PrioritizedLoadRequest>>,
/// Limits the number of concurrently executing module loads.
/// Default: 4 concurrent loads (one per loader worker thread).
concurrency: Semaphore,
/// Total requests enqueued since boot.
pub total_enqueued: AtomicU64,
/// Total requests that completed successfully.
pub total_loaded: AtomicU64,
/// Total requests that failed (signature rejection, manifest error,
/// driver init failure, timeout, or domain allocation failure).
pub total_failed: AtomicU64,
}
/// Internal wrapper that adds an effective priority to a `DriverLoadRequest`
/// for ordering in the `BinaryHeap` inside `ModuleLoaderQueue`.
struct PrioritizedLoadRequest {
/// Effective priority: `priority_override` if non-zero, else derived from
/// `reason` (HotPlug/Boot = 200, CrashRecovery = 180, Dependency = 150,
/// UserRequest = 100).
pub priority: u8,
pub request: DriverLoadRequest,
}
impl PartialOrd for PrioritizedLoadRequest {
fn partial_cmp(&self, other: &Self) -> Option<core::cmp::Ordering> {
Some(self.cmp(other))
}
}
impl Ord for PrioritizedLoadRequest {
fn cmp(&self, other: &Self) -> core::cmp::Ordering {
        // BinaryHeap is a max-heap, so natural ordering on `priority`
        // already pops the highest-priority request first.
self.priority.cmp(&other.priority)
}
}
impl PartialEq for PrioritizedLoadRequest {
fn eq(&self, other: &Self) -> bool { self.priority == other.priority }
}
impl Eq for PrioritizedLoadRequest {}
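A quick host-side demonstration of this ordering, using a simplified stand-in type (std collections instead of the kernel's allocator):

```rust
use std::cmp::Ordering;
use std::collections::BinaryHeap;

// Simplified stand-in for PrioritizedLoadRequest.
struct Prioritized {
    priority: u8,
    name: &'static str,
}

impl Ord for Prioritized {
    fn cmp(&self, other: &Self) -> Ordering {
        // BinaryHeap is a max-heap: natural ordering on `priority`
        // pops the highest-priority request first.
        self.priority.cmp(&other.priority)
    }
}
impl PartialOrd for Prioritized {
    fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
        Some(self.cmp(other))
    }
}
impl PartialEq for Prioritized {
    fn eq(&self, other: &Self) -> bool {
        self.priority == other.priority
    }
}
impl Eq for Prioritized {}

fn pop_order(requests: Vec<Prioritized>) -> Vec<&'static str> {
    let mut heap: BinaryHeap<Prioritized> = requests.into_iter().collect();
    let mut order = Vec::new();
    while let Some(r) = heap.pop() {
        order.push(r.name);
    }
    order
}
```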
Module loading sequence (executed by a loader worker thread after dequeuing):
1. Verify driver binary signature (ML-DSA-44 or SLH-DSA-128f per Section 8.2).
Reject if signature is absent or invalid.
2. Parse .kabi manifest: validate fields, check KabiVersion compatibility.
3. Allocate an isolation domain (MPK PKEY, POE overlay, or equivalent per arch).
If no domains are available: reject with KernelError::ResourceExhausted.
4. Map driver binary into the new domain (read+execute, no write).
5. Call driver_entry.init(services, descriptor). Apply timeout_ms watchdog.
6. On success: transition device state to Active, send Ok(handle) to result_tx.
7. On failure: free domain, unmap binary, send Err(...) to result_tx.
Registry transitions device state to Error.
10.5.5 Device Lifecycle
10.5.5.1 State Machine
The registry manages each device through a well-defined state machine. Only the kernel initiates transitions — drivers cannot set their own state.
+-> [Error] ------+----> [Quarantined]
| | |
[Discovered] -> [Matching] -> [Loading] -> [Initializing] -> [Active]
^ ^ | |
| | | v
| +--- (no match) -----------+ [Suspending]
| | |
| +-- (admin re-enable) -- [Quarantined] v
+-- (hotplug rescan) ---- [Removed] [Suspended]
| ^ |
| | v
+-- (driver reload) ----- [Stopping] <-------------- [Resuming]
^ |
| v
[Recovering] <------------- [Active]
#[repr(u32)]
pub enum DeviceState {
Discovered = 0, // Node exists, no driver bound
Matching = 1, // Match engine evaluating
Loading = 2, // Driver binary being loaded
Initializing = 3, // driver init() called, waiting for result
Active = 4, // Driver running normally
Suspending = 5, // Suspend requested, waiting for driver ack
Suspended = 6, // Driver has acknowledged suspend
Resuming = 7, // Resume requested, waiting for driver ack
Stopping = 8, // Driver being stopped (unload, removal, admin)
Recovering = 9, // Driver crashed, recovery in progress
Removed = 10, // Device physically removed (hotplug)
Error = 11, // Fatal error, non-functional
Quarantined = 12, // Driver permanently disabled (crash threshold exceeded);
// requires manual re-enable via sysfs
}
10.5.5.2 Transition Table
| From | To | Trigger | Driver Callback |
|---|---|---|---|
| Discovered | Matching | New device or new driver registered | None |
| Matching | Loading | Match found | None |
| Matching | Discovered | No match | None |
| Loading | Initializing | Binary loaded, vtable exchange begins | init() |
| Initializing | Active | init() returns success | None |
| Initializing | Error | init() returns error or timeout | None |
| Active | Suspending | PM suspend request | suspend() |
| Suspending | Suspended | suspend() returns success | None |
| Suspending | Error | suspend() timeout or failure | shutdown() (force) |
| Suspended | Resuming | PM resume request | resume() |
| Resuming | Active | resume() returns success | None |
| Resuming | Recovering | resume() failure | None |
| Active | Stopping | Admin request, unload, or hotplug removal | shutdown() |
| Active | Recovering | Fault detected (domain violation, watchdog, crash) | None |
| Recovering | Loading | Recovery initiated, fresh binary load | (fresh init()) |
| Error | Quarantined | Crash threshold exceeded (5+ failures in window) | None |
| Quarantined | Matching | Manual administrator re-enable via sysfs | None |
| Any | Removed | Physical device gone + teardown complete | shutdown() if possible |
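The transition table lends itself to a compact kernel-side validator. This is a sketch under illustrative names (`is_valid_transition` is not the real registry API); Removed is reachable from any state once teardown completes:

```rust
#[derive(Clone, Copy, PartialEq, Debug)]
pub enum DeviceState {
    Discovered, Matching, Loading, Initializing, Active, Suspending,
    Suspended, Resuming, Stopping, Recovering, Removed, Error, Quarantined,
}

/// Sketch of a transition validator encoding the table above.
pub fn is_valid_transition(from: DeviceState, to: DeviceState) -> bool {
    use DeviceState::*;
    // Removed is a terminal state reachable from anywhere (hotplug yank).
    if to == Removed {
        return true;
    }
    matches!(
        (from, to),
        (Discovered, Matching)
            | (Matching, Loading)
            | (Matching, Discovered)
            | (Loading, Initializing)
            | (Initializing, Active)
            | (Initializing, Error)
            | (Active, Suspending)
            | (Active, Stopping)
            | (Active, Recovering)
            | (Suspending, Suspended)
            | (Suspending, Error)
            | (Suspended, Resuming)
            | (Resuming, Active)
            | (Resuming, Recovering)
            | (Recovering, Loading)
            | (Error, Quarantined)
            | (Quarantined, Matching)
    )
}
```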
10.5.5.3 Timeouts
Every callback has a timeout. If the driver does not respond within the timeout, the kernel force-stops it (same mechanism as crash recovery: revoke isolation domain / kill process).
| Callback | Tier 1 Timeout | Tier 2 Timeout |
|---|---|---|
| init() | 5 seconds | 10 seconds |
| shutdown() | 3 seconds | 5 seconds |
| suspend() | 2 seconds | 5 seconds |
| resume() | 2 seconds | 5 seconds |
All timeouts are configurable via kernel parameters.
10.5.6 Power Management
10.5.6.1 Power States
#[repr(u32)]
pub enum PowerState {
D0Active = 0, // Fully operational
D1LowPower = 1, // Low-power idle (quick resume)
D2DeepSleep = 2, // Deeper sleep (longer resume, less power)
D3Off = 3, // Powered off (full re-init on resume)
}
10.5.6.2 Topology-Driven Ordering
This is the primary advantage of having a kernel-owned device tree. Suspend/resume ordering is derived from topology, not maintained as a separate list.
Suspend order (depth-first, leaves first):
For each subtree rooted at device D:
1. Suspend all clients of D (provider-client links)
2. Recursively suspend all children of D (bottom-up)
3. Suspend D itself
Resume order (exact reverse):
For each subtree rooted at device D:
1. Resume D itself
2. Recursively resume all children of D (top-down)
3. Resume all clients of D
This is computed once by topological sort when a system PM transition begins. Provider-client edges are treated as additional dependency edges in the sort. The result is cached and invalidated when the tree topology changes.
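The parent-child half of the ordering can be sketched as a plain depth-first walk; provider-client edges (the extra dependency edges in the real topological sort) are omitted for brevity, and the `Node` type is illustrative:

```rust
/// Minimal sketch of topology-driven suspend ordering: depth-first,
/// leaves before parents. Provider-client edges are omitted.
pub struct Node {
    pub name: &'static str,
    pub children: Vec<Node>,
}

pub fn suspend_order(node: &Node, out: &mut Vec<&'static str>) {
    for child in &node.children {
        suspend_order(child, out); // recurse: deepest devices suspend first
    }
    out.push(node.name); // the parent suspends after all its children
}

/// Resume order is the exact reverse of suspend order.
pub fn resume_order(node: &Node) -> Vec<&'static str> {
    let mut order = Vec::new();
    suspend_order(node, &mut order);
    order.reverse();
    order
}
```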
Why this is better than Linux: Linux maintains a dpm_list that approximates
topological order but can get it wrong. The ordering is based on registration order and
heuristic adjustments, not the actual device tree. UmkaOS computes the correct order
directly from the tree.
10.5.6.3 PM Failure Handling
When a driver fails to suspend within its timeout:
- Registry marks the node as Error.
- Driver is force-stopped (revoke isolation domain / kill process).
- Suspend continues for remaining devices — one broken driver does not block the entire system.
- On resume, the failed device's driver is reloaded fresh (leveraging crash recovery from Section 10.8).
- Failure is logged with context for admin diagnosis.
This directly implements the principle from Section 17.2: "Tier 1 and Tier 2 drivers that fail to suspend within a timeout are forcibly stopped and restarted on resume."
10.5.6.4 Runtime Power Management
Beyond system suspend, individual devices can enter low-power states when idle:
pub struct RuntimePmPolicy {
pub enabled: bool,
pub idle_timeout_ms: u32, // Enter D1 after this idle period
pub min_state: PowerState, // Deepest state allowed during runtime PM
}
The registry tracks I/O activity per device (through KABI call frequency). When a device
has been idle for idle_timeout_ms, the registry initiates a runtime suspend of that
device alone. Children are only suspended if they are also idle.
Runtime PM is independent of system PM. A device can be in D1 (runtime idle) while the system is fully running.
10.5.7 Hot-Plug
10.5.7.1 Bus Drivers as Event Sources
Bus drivers (PCI host bridge, USB XHCI, USB hub) are the source of hotplug events. They detect device arrival/departure and report to the registry through KABI methods.
A bus driver is identified by is_bus_driver = 1 in its DriverManifest. It has the
HOTPLUG_NOTIFY capability (already defined in capability.rs).
10.5.7.2 Device Arrival
1. Bus driver detects new device
(PCIe hot-add interrupt, USB port status change, ACPI _STA change)
2. Bus driver calls registry_report_device() via KABI
- Passes: parent handle, bus type, bus-specific identity, initial properties
3. Registry creates a new DeviceNode in Discovered state
4. Registry populates properties from the bus driver's report
5. Registry runs the match engine on the new node
6. If match found: load driver, init, transition to Active
7. Registry emits uevent for Linux compatibility (udev/systemd)
10.5.7.3 Device Removal (Orderly)
1. Bus driver detects device departure (link down, port status change)
2. Bus driver calls registry_report_removal() via KABI
3. Registry processes the subtree bottom-up:
a. For each child (deepest first):
- Stop the child's driver (shutdown callback)
- Release capabilities
- Remove child node
b. Stop the target device's driver
c. Release all capabilities
d. Remove the DeviceNode
4. Registry emits uevent (removal)
10.5.7.4 Surprise Removal
When a device is physically yanked without warning (e.g., USB unplug during I/O):
- Bus driver detects absence (failed transaction, link down).
- Registry receives the removal report.
- All pending I/O for the device and its children is completed with -EIO.
- shutdown() is called on the driver — it may fail quickly because the hardware is gone. This is expected and handled gracefully (timeout → force-stop).
- The node subtree is torn down.
This mirrors crash recovery but is initiated by the bus driver rather than by a fault.
10.5.7.5 Uevent Compatibility
For Linux userspace compatibility (udev, systemd-udevd), the registry emits uevent notifications matching the Linux format:
ACTION=add
DEVPATH=/devices/pci0000:00/0000:03:00.0
SUBSYSTEM=pci
PCI_ID=8086:2723
PCI_CLASS=028000
DRIVER=umka-iwlwifi
This feeds into umka-compat/src/sys/ for sysfs and umka-compat/src/dev/ for
devtmpfs, as outlined in Section 18.1.3.
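A trivial sketch of assembling such a payload (illustrative only; the display form above uses newlines, while the actual netlink uevent wire format separates key=value pairs with NUL bytes — the separator is a parameter here for that reason):

```rust
/// Assemble a uevent payload from key=value pairs.
/// `separator` is '\n' for display, '\0' for the netlink wire form.
pub fn format_uevent(pairs: &[(&str, &str)], separator: char) -> String {
    pairs
        .iter()
        .map(|(k, v)| format!("{}={}", k, v))
        .collect::<Vec<_>>()
        .join(&separator.to_string())
}
```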
10.5.8 Service Discovery
10.5.8.1 The Problem
Drivers sometimes need services from other drivers — not through direct communication, but through mediated access. Examples:
- NIC needs a PHY driver (MII bus)
- GPU display pipeline needs I2C controller for DDC/EDID
- RAID controller needs to discover member disks
- Filesystem driver needs its underlying block device
In Linux, each of these has a subsystem-specific mechanism (phylib, i2c_adapter, md_personality, etc.) with its own registration/lookup API. In IOKit, it is done through IOService matching. UmkaOS unifies service discovery through the registry.
10.5.8.2 Service Publication
A driver can publish a named service on its device node:
Driver A (e.g., PHY driver):
1. Completes init, device node is Active
2. Calls registry_publish_service("phy", &phy_vtable)
3. Registry records: node A provides service "phy" with given vtable
The phy_vtable is a service-specific C-ABI vtable (same flat, versioned approach as
all other KABI vtables). The registry stores a reference to it.
10.5.8.3 Service Lookup
A driver can look up a named service:
Driver B (e.g., NIC driver):
1. Needs PHY service
2. Calls registry_lookup_service("phy", scope=ParentSubtree)
3. Registry searches for a node in scope that publishes "phy"
4. Registry validates Driver B has PEER_DRIVER_IPC capability
5. Registry creates a provider-client link (B consumes A's "phy")
6. Registry returns a wrapped service vtable and a ServiceHandle
Lookup scope options:
#[repr(u32)]
pub enum ServiceLookupScope {
Siblings = 0, // Same parent only
ParentSubtree = 1, // Parent and all its descendants
Global = 2, // Entire registry (expensive, rare)
Specific = 3, // A specific node (by DeviceHandle)
}
10.5.8.4 Mediated Access
The registry mediates all cross-driver service access. This is critical:
- The registry validates capabilities before returning a service handle.
- The returned vtable is wrapped by the registry — calls go through a trampoline that:
- Validates the service handle is still valid
- Performs the isolation domain switch if provider and client are in different Tier 1 domains
- Handles the user-kernel transition if one side is Tier 2
- The registry can revoke a service link at any time (e.g., when the provider crashes).
- The registry tracks all active links for PM ordering (clients must suspend before providers).
- Drivers never hold direct pointers to each other's memory.
10.5.8.5 Service Recovery
When a provider driver crashes and is reloaded:
- The registry invalidates all service handles pointing to the crashed provider.
- Client drivers that call the service vtable receive -ENODEV from the trampoline.
- After the provider is reloaded and republishes its service, client drivers receive a service_recovered callback (an optional, appended addition to DriverEntry):
// Appended to DriverEntry (optional)
pub service_recovered: Option<unsafe extern "C" fn(
ctx: *mut c_void,
service_name: *const u8,
service_name_len: u32,
) -> InitResultCode>,
The client driver can then re-acquire the service handle and resume operations.
10.5.8.5a Service Handle Liveness Protocol
After a Tier 1 driver crashes, any ServiceHandle held by Tier 2 or user processes
points to a stale vtable. Calling through a stale vtable is a use-after-free (UAF)
vulnerability. UmkaOS prevents this via generation counters:
/// Kernel-internal service reference. NOT exposed at KABI boundary.
/// The KABI-stable token is `ServiceHandle` (a newtype over `u64`).
/// Mapping: `ServiceHandle::id` → kernel looks up `InternalServiceRef` via service registry.
///
/// Contains a generation counter that is checked on every dispatch
/// to detect stale handles pointing to crashed providers.
pub struct InternalServiceRef {
/// Provider descriptor pointer (points into umka-core memory, not driver memory).
provider: *const ProviderDescriptor,
/// Generation of the provider at handle creation time.
/// Must match provider.state_generation on dispatch or the call fails.
generation: u64,
/// Rights granted to the holder of this handle.
rights: Rights,
}
/// Per-provider state generation counter. Incremented when:
/// 1. The provider crashes and is reloaded.
/// 2. The provider explicitly invalidates all handles (e.g., after a
/// security-relevant config change).
/// Stored in umka-core memory (not in the driver's memory domain) so it
/// remains valid even after the driver domain is destroyed.
pub struct ProviderDescriptor {
/// Monotonically increasing. Odd = active; even = inactive/crashed.
/// Updated atomically by umka-core on crash detection.
pub state_generation: AtomicU64,
// ... vtable pointer and other registry fields follow
}
Dispatch check (in the trampoline layer, before every cross-domain call):
fn trampoline_dispatch(handle: &InternalServiceRef, request: &Request) -> Result<Response, Error> {
// Check liveness: read the provider's current generation.
// Ordering::Acquire: ensures we see any writes made by the crash handler
// that incremented state_generation.
let current_gen = unsafe {
(*handle.provider).state_generation.load(Ordering::Acquire)
};
if current_gen != handle.generation {
return Err(Error::ProviderDead);
}
// Generation matched: safe to call through vtable.
// (Note: generation can still change between the check and the call.
// The domain fault handler catches this and returns ProviderDead to
// the caller via the normal crash-recovery path.)
dispatch_to_tier1(handle, request)
}
Handle invalidation on crash:
When a Tier 1 driver panics:
1. The domain fault handler (already specified in Section 10.8) catches the fault.
2. It atomically increments provider.state_generation (odd → even, marking inactive).
3. All subsequent dispatch attempts to this provider return Err(ProviderDead).
4. After driver reload, the new provider instance starts with state_generation + 1
(odd, active). Old handles (with old generation) remain permanently stale.
5. Callers that receive Err(ProviderDead) must re-open the service to get a
new ServiceHandle with the current generation (the kernel creates a new
InternalServiceRef with the updated generation and maps it to a fresh ServiceHandle::id).
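The whole protocol can be exercised in a few lines. This is a host-side sketch using std atomics (function names `open`/`dispatch`/`crash`/`reload` are illustrative, not the kernel API), preserving the odd = active, even = inactive invariant:

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Host-side sketch of the generation-counter liveness protocol.
/// Odd = active, even = inactive/crashed.
pub struct Provider {
    pub state_generation: AtomicU64,
}

pub struct Handle {
    pub generation: u64,
}

/// Open a handle against the provider's current (active) generation.
pub fn open(p: &Provider) -> Option<Handle> {
    let g = p.state_generation.load(Ordering::Acquire);
    if g % 2 == 1 { Some(Handle { generation: g }) } else { None }
}

/// Trampoline check: the handle's generation must still match.
pub fn dispatch(p: &Provider, h: &Handle) -> Result<(), &'static str> {
    if p.state_generation.load(Ordering::Acquire) != h.generation {
        return Err("ProviderDead");
    }
    Ok(())
}

/// Crash handler: odd -> even marks the provider inactive; all
/// outstanding handles become permanently stale.
pub fn crash(p: &Provider) {
    p.state_generation.fetch_add(1, Ordering::Release);
}

/// Reload: even -> odd activates the new instance under a new generation.
pub fn reload(p: &Provider) {
    p.state_generation.fetch_add(1, Ordering::Release);
}
```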
Invariant: ProviderDescriptor is always allocated in umka-core memory, never in
the driver's isolation domain. This ensures the descriptor (including state_generation)
remains accessible and uncorrupted after the driver domain is torn down during crash
recovery.
Design intent: InternalServiceRef cannot be "refreshed" — a crashed provider's
internal reference cannot be upgraded to point at the new instance. This is intentional:
the crash may indicate a security event, and forcing callers to explicitly re-open
(obtaining a new ServiceHandle) ensures they notice the crash and can apply any required
policy (e.g., re-authenticate, validate new driver version). The generation counter is
the minimal mechanism; it adds one Acquire load (~3-5 cycles, L1-resident) per
cross-domain call.
10.5.8.6 Registry Event Notifications
Beyond driver-to-driver service recovery, kernel subsystems need to react to device lifecycle events. The registry provides an internal notification mechanism (not exposed through KABI — this is kernel-to-kernel only).
/// Registry event types that kernel subsystems can subscribe to.
#[repr(u32)]
pub enum RegistryEvent {
/// A new device node was created (after bus enumeration).
DeviceDiscovered = 0,
/// A device transitioned to Active (driver bound and initialized).
DeviceActive = 1,
/// A device is being removed (before teardown begins).
DeviceRemoving = 2,
/// A device's driver crashed and recovery is starting.
DeviceRecovering = 3,
/// A device's power state changed.
PowerStateChanged = 4,
/// IOMMU group assignment changed (passthrough ↔ kernel domain).
IommuGroupChanged = 5,
/// A service was published or unpublished.
ServiceChanged = 6,
}
/// Callback type for registry event notifications.
pub type RegistryNotifyFn = fn(
event: RegistryEvent,
node_id: DeviceNodeId,
context: *mut c_void,
);
Subscribers:
| Kernel Subsystem | Events | Purpose |
|---|---|---|
| Memory manager (Section 4.1) | DeviceDiscovered, DeviceRemoving | Update NUMA topology when devices with local memory appear/disappear |
| Scheduler (Section 6.1) | DeviceActive, DeviceRemoving | Update IRQ affinity recommendations |
| FMA engine (Section 19.1) | DeviceRecovering | Log fault management events, track failure patterns |
| AccelScheduler (Section 21.1) | DeviceActive, DeviceRecovering, PowerStateChanged | Manage accelerator context lifecycle |
| Sysfs compat (Section 10.5.12) | All events | Update /sys filesystem in real-time |
Notifications are dispatched synchronously during registry state transitions. Subscribers must not block — they record the event and defer heavy work to a workqueue. This prevents a slow subscriber from delaying device bring-up.
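The record-and-defer pattern can be sketched with a small queue (host-side sketch using std locking; a trimmed `RegistryEvent` is re-declared for self-containment, and `DeferredQueue` is a hypothetical name):

```rust
use std::collections::VecDeque;
use std::sync::Mutex;

/// Trimmed stand-in for the RegistryEvent enum above.
#[derive(Clone, Copy, PartialEq, Debug)]
pub enum RegistryEvent {
    DeviceDiscovered,
    DeviceActive,
    DeviceRemoving,
}

/// Sketch of the non-blocking subscriber pattern: the synchronous
/// callback only records the event; heavy work is drained later
/// from a workqueue.
pub struct DeferredQueue {
    pending: Mutex<VecDeque<(RegistryEvent, u64)>>,
}

impl DeferredQueue {
    pub fn new() -> Self {
        Self { pending: Mutex::new(VecDeque::new()) }
    }

    /// Called synchronously during a registry state transition:
    /// must not block beyond the short queue lock.
    pub fn notify(&self, event: RegistryEvent, node_id: u64) {
        self.pending.lock().unwrap().push_back((event, node_id));
    }

    /// Called later from workqueue context to perform the deferred work.
    pub fn drain(&self) -> Vec<(RegistryEvent, u64)> {
        self.pending.lock().unwrap().drain(..).collect()
    }
}
```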
10.5.9 KABI Integration
10.5.9.1 New Methods Appended to KernelServicesVTable
All new methods are Option<...> for backward compatibility. Older kernels that do not
have the registry will have these as None. Drivers must check for None before calling.
// === Device Registry (appended to KernelServicesVTable) ===
/// Report a newly discovered device to the registry.
/// Called by bus drivers (PCI enumeration, USB hub, etc.).
pub registry_report_device: Option<unsafe extern "C" fn(
parent_handle: DeviceHandle,
bus_type: BusType,
bus_identity: *const u8,
bus_identity_len: u32,
properties: *const PropertyEntry,
property_count: u32,
out_handle: *mut DeviceHandle,
) -> IoResultCode>,
/// Report that a device has been physically removed.
pub registry_report_removal: Option<unsafe extern "C" fn(
device_handle: DeviceHandle,
) -> IoResultCode>,
/// Get a property value from a device node.
pub registry_get_property: Option<unsafe extern "C" fn(
device_handle: DeviceHandle,
key: *const u8,
key_len: u32,
out_value: *mut PropertyValueC,
out_value_size: *mut u32,
) -> IoResultCode>,
/// Set a property on a device node.
pub registry_set_property: Option<unsafe extern "C" fn(
device_handle: DeviceHandle,
key: *const u8,
key_len: u32,
value: *const PropertyValueC,
value_size: u32,
) -> IoResultCode>,
/// Publish a named service on this device node.
pub registry_publish_service: Option<unsafe extern "C" fn(
device_handle: DeviceHandle,
service_name: *const u8,
service_name_len: u32,
service_vtable: *const c_void,
service_vtable_size: u64,
) -> IoResultCode>,
/// Look up a named service.
pub registry_lookup_service: Option<unsafe extern "C" fn(
device_handle: DeviceHandle,
service_name: *const u8,
service_name_len: u32,
scope: u32,
out_service_vtable: *mut *const c_void,
out_service_handle: *mut ServiceHandle,
) -> IoResultCode>,
/// Release a previously acquired service handle.
pub registry_release_service: Option<unsafe extern "C" fn(
service_handle: ServiceHandle,
) -> IoResultCode>,
/// Get the device handle for the current driver instance.
pub registry_get_device_handle: Option<unsafe extern "C" fn(
out_handle: *mut DeviceHandle,
) -> IoResultCode>,
/// Enumerate children of a device node.
pub registry_enumerate_children: Option<unsafe extern "C" fn(
device_handle: DeviceHandle,
out_handles: *mut DeviceHandle,
max_count: u32,
out_count: *mut u32,
) -> IoResultCode>,
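A driver-side call through one of these optional entries might look as follows. This is a simplified sketch: the function pointer is a safe `fn` stand-in with a reduced signature (the real entry is `unsafe extern "C"` with a raw pointer), and `IO_ERR_UNSUPPORTED` is an illustrative error code for the None fallback.

```rust
// Simplified stand-in for the real vtable, enough to show the mandatory
// None check before calling a registry entry on a possibly-older kernel.

pub type IoResultCode = i32;
pub const IO_SUCCESS: IoResultCode = 0;
pub const IO_ERR_UNSUPPORTED: IoResultCode = -95; // illustrative code

pub struct KernelServicesVTableStub {
    // Older kernels: None. Newer kernels: Some(fn).
    pub registry_get_device_handle: Option<fn(out_handle: &mut u64) -> IoResultCode>,
}

/// Driver-side wrapper: returns the handle, or an error if the kernel
/// predates the registry. Drivers must take this path before every call.
pub fn my_device_handle(services: &KernelServicesVTableStub) -> Result<u64, IoResultCode> {
    match services.registry_get_device_handle {
        Some(f) => {
            let mut h = 0u64;
            let rc = f(&mut h);
            if rc == IO_SUCCESS { Ok(h) } else { Err(rc) }
        }
        None => Err(IO_ERR_UNSUPPORTED), // old kernel: registry absent
    }
}
```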
10.5.9.2 New ABI Types
/// Opaque handle to a device node in the registry.
#[repr(C)]
pub struct DeviceHandle {
pub id: u64,
}
impl DeviceHandle {
pub const INVALID: Self = Self { id: 0 };
}
/// Stable C-ABI service token. Passed across isolation domain boundaries.
/// Kernel resolves this id to an `InternalServiceRef` at each call site.
/// Liveness: the module providing this service cannot be unloaded while any
/// active `ServiceHandle` referring to it is held by a capability.
#[repr(C)]
pub struct ServiceHandle {
pub id: u64,
}
/// A property entry for C ABI transport.
#[repr(C)]
pub struct PropertyEntry {
pub key: *const u8,
pub key_len: u32,
pub value_type: PropertyType,
pub value_data: *const u8,
pub value_len: u32,
pub _pad: u32,
}
#[repr(u32)]
pub enum PropertyType {
U64 = 0,
I64 = 1,
String = 2,
Bytes = 3,
Bool = 4,
StringArray = 5,
}
/// C-ABI-safe property value output buffer.
#[repr(C)]
pub struct PropertyValueC {
pub value_type: PropertyType,
pub _pad: u32,
pub data: [u8; 256],
}
// `KabiVersion` — defined in Section 11.1.9.3 (11-kabi.md).
// Layout: { major: u16, minor: u16, patch: u16, _pad: u16 } — repr(C), 8 bytes.
// Key methods: new(major,minor,patch), is_compatible_with(kernel), as_u64(), from_u64(v).
// Constant: KABI_CURRENT = 1.0.0.
// The vtable wire format stores KabiVersion::as_u64() in the first 8 bytes of each vtable.
10.5.9.3 DeviceDescriptor Extension
The existing DeviceDescriptor gains new fields (appended):
// Appended to DeviceDescriptor
pub device_handle: DeviceHandle, // Registry handle for this device
pub numa_node: i32, // NUMA node (-1 = unknown)
pub _pad: u32,
The DeviceDescriptor passed to driver_entry.init() is now populated from the
registry node's properties, ensuring consistency between what the registry knows and
what the driver sees.
10.5.9.4 Memory Management KABI (memory_v1)
The memory_v1 KABI table provides driver-callable memory management functions
appended to KernelServicesVTable starting at KABI version 2 (the initial
KernelServicesVTable layout is version 1). Per Section 11.1.4 versioning rules, these four
Option<fn> fields are tail-appended; drivers compiled against KABI v1 see a
shorter vtable_size and never access these offsets. Drivers compiled against v2+
check vtable_size >= offset_of!(memory_v1 fields) before calling, and fall back
to non-NUMA allocation if the kernel does not expose memory_v1.
These extend the existing DMA allocation functions (Section 10.4, Tier 2 syscall table) with NUMA-aware operations for Tier 1 drivers.
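The version gate described above can be sketched with `core::mem::offset_of` (stable since Rust 1.77). The struct is a reduced stand-in for `KernelServicesVTable`; the field set and the in-struct `vtable_size` header are assumptions for illustration.

```rust
// Reduced vtable stand-in: v1 ends after `alloc_dma_buffer`; memory_v1
// fields are tail-appended in v2. The kernel reports its vtable size at
// vtable exchange; drivers gate access on that size before dereferencing.
use core::mem::offset_of;

#[repr(C)]
pub struct VTableV2 {
    pub vtable_size: u64,                                   // kernel-reported
    pub alloc_dma_buffer: Option<fn() -> i32>,              // since v1
    pub driver_request_numa_migration: Option<fn() -> i32>, // memory_v1, v2+
}

/// True only if the running kernel's vtable is long enough to contain the
/// memory_v1 field AND the kernel populated the entry.
pub fn has_numa_migration(vt: &VTableV2) -> bool {
    (vt.vtable_size as usize) > offset_of!(VTableV2, driver_request_numa_migration)
        && vt.driver_request_numa_migration.is_some()
}
```

A driver falls back to non-NUMA allocation whenever `has_numa_migration` is false.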
// === Memory Management (appended to KernelServicesVTable, memory_v1) ===
/// Request explicit NUMA page migration for driver-private pages.
///
/// Moves the specified physical pages to the target NUMA node. Only callable
/// on pages within the calling driver's isolation domain (Tier 1 protection
/// key match required). The kernel validates ownership before migration.
///
/// Migration is **synchronous**: this function blocks until all pages have
/// been physically moved to the target node (or an error occurs). The
/// driver's virtual mappings are updated transparently — existing virtual
/// addresses remain valid after migration, only the underlying physical
/// frames change.
///
/// # Arguments
/// - `pages`: Pointer to a caller-allocated array of physical page addresses
/// (page-aligned, 4 KiB granularity). Each address must be within the
/// caller's isolation domain.
/// - `page_count`: Number of entries in the `pages` array. Maximum 512 pages
/// per call (2 MiB). For larger migrations, issue multiple calls.
/// - `target_node`: NUMA node ID to migrate to. Must be a valid node with
/// available memory. Use `numa_node_count()` to discover topology.
///
/// # Returns
/// - `IO_SUCCESS` (0): All pages successfully migrated.
/// - `IO_ERR_INVALID_ADDR` (-EFAULT, -14): One or more page addresses are
/// outside the caller's isolation domain or not page-aligned. No pages
/// are migrated (atomic failure).
/// - `IO_ERR_INVALID_NODE` (-EINVAL, -22): `target_node` does not exist in
/// the NUMA topology or has no allocatable memory.
/// - `IO_ERR_DMA_PINNED` (-EBUSY, -16): One or more pages have an active
/// DMA mapping (`PG_dma_pinned` flag set, Section 10.5.3.7). DMA-pinned pages cannot
/// be migrated because a device holds their physical address. The driver
/// must unpin DMA buffers (`free_dma_buffer`) before migrating. No pages
/// are migrated (atomic failure).
/// - `IO_ERR_NOMEM` (-ENOMEM, -12): Target node has insufficient free memory
/// to accept the migrated pages. No pages are migrated (atomic failure).
/// - `IO_ERR_PERM` (-EPERM, -1): Caller does not hold `CAP_NUMA_MIGRATE`
/// capability (required for explicit NUMA migration).
///
/// # Atomicity
/// Migration is all-or-nothing: either all pages in the request are migrated,
/// or none are. The kernel pre-validates all pages and pre-allocates target
/// frames before beginning the move. If any page fails validation, the entire
/// request is rejected before any migration occurs.
///
/// # Concurrency
/// The kernel holds the per-page migration lock during the move, serializing
/// with concurrent NUMA balancer scans and other migration requests for the
/// same pages. Other pages in the driver's domain remain accessible during
/// migration. The migrating pages are briefly unmapped (~1-5 µs per page);
/// concurrent access from the driver's other threads will fault and block
/// until migration completes.
///
/// # Safety
/// - `pages` must point to a valid array of at least `page_count` elements.
/// - All addresses in the array must be page-aligned (4096-byte boundary).
/// - Caller must ensure no device DMA is in flight to the specified pages
/// (the `PG_dma_pinned` check catches registered DMA buffers, but the
/// driver is responsible for not issuing new DMA to these addresses
/// concurrently with migration).
pub driver_request_numa_migration: Option<unsafe extern "C" fn(
pages: *const u64,
page_count: u32,
target_node: i32,
) -> IoResultCode>,
/// Query the NUMA node for a set of physical pages.
///
/// Returns the NUMA node ID for each page in the input array.
/// Useful for drivers that want to check data locality before deciding
/// whether to migrate.
///
/// # Arguments
/// - `pages`: Pointer to array of physical page addresses (page-aligned).
/// - `page_count`: Number of entries in `pages`.
/// - `out_nodes`: Pointer to caller-allocated array of `page_count` `i32`
/// values. On success, `out_nodes[i]` contains the NUMA node ID for
/// `pages[i]`.
///
/// # Returns
/// - `IO_SUCCESS`: All node IDs written to `out_nodes`.
/// - `IO_ERR_INVALID_ADDR` (-EFAULT): One or more pages outside caller's domain.
pub driver_query_numa_node: Option<unsafe extern "C" fn(
pages: *const u64,
page_count: u32,
out_nodes: *mut i32,
) -> IoResultCode>,
/// Query NUMA topology: number of NUMA nodes in the system.
///
/// # Returns
/// - Positive value: number of NUMA nodes (1 on non-NUMA systems).
/// - Negative value: error (should not occur; returns 1 as fallback).
pub numa_node_count: Option<unsafe extern "C" fn() -> i32>,
/// Query available memory on a NUMA node.
///
/// # Arguments
/// - `node`: NUMA node ID.
/// - `out_total_bytes`: Total physical memory on this node.
/// - `out_free_bytes`: Currently free memory on this node.
///
/// # Returns
/// - `IO_SUCCESS`: Values written to output pointers.
/// - `IO_ERR_INVALID_NODE` (-EINVAL): Node does not exist.
pub numa_node_memory: Option<unsafe extern "C" fn(
node: i32,
out_total_bytes: *mut u64,
out_free_bytes: *mut u64,
) -> IoResultCode>,
Usage pattern — A NUMA-aware NIC driver migrates receive buffer pages to the NUMA node closest to the NIC's PCIe attachment point:
1. Driver probes device, reads `DeviceDescriptor.numa_node` (Section 10.5.9.3).
2. Driver allocates receive ring buffers (via `alloc_dma_buffer`).
3. On each received packet, driver calls `driver_query_numa_node` to check
if the destination process's pages are local.
4. If remote, driver calls `driver_request_numa_migration` to pull hot pages
to the NIC's node, reducing memory access latency for subsequent packets.
5. Migration frequency is rate-limited by the driver to avoid migration storms
(recommended: at most once per page per 100ms).
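Step 5's rate limiting can be sketched as a per-page cooldown tracker. The `BTreeMap` is a host-side stand-in; a real Tier 1 driver would use a fixed-size structure keyed by physical page address.

```rust
// Illustrative rate limiter: allow at most one migration request per page
// per 100 ms, as recommended above.
use std::collections::BTreeMap;

const MIGRATION_COOLDOWN_NS: u64 = 100_000_000; // 100 ms

pub struct MigrationLimiter {
    last_attempt_ns: BTreeMap<u64, u64>, // phys page addr -> timestamp
}

impl MigrationLimiter {
    pub fn new() -> Self {
        Self { last_attempt_ns: BTreeMap::new() }
    }

    /// Returns true if the page may be migrated now, and records the attempt.
    pub fn may_migrate(&mut self, page: u64, now_ns: u64) -> bool {
        match self.last_attempt_ns.get(&page) {
            Some(&t) if now_ns.saturating_sub(t) < MIGRATION_COOLDOWN_NS => false,
            _ => {
                self.last_attempt_ns.insert(page, now_ns);
                true
            }
        }
    }
}
```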
10.5.10 Crash Recovery Integration
The registry participates in the crash recovery sequence defined in Section 10.8.
10.5.10.1 When a Driver Crashes
1. Detection: UmkaOS Core detects the fault (hardware exception in isolation domain, watchdog timeout, Tier 2 process crash).
2. Registry notification: UmkaOS Core identifies the faulting driver's device node. The registry transitions it to Recovering.
3. Service invalidation: All service handles pointing to the crashed driver are invalidated. Client drivers receive -ENODEV on subsequent service calls.
4. Child cascade: If the crashed driver is a bus driver with children, the registry processes children bottom-up:
   - For each child: stop driver, release capabilities, transition to Stopping.
   - Children are re-probed after the bus driver recovers.
5. I/O drain + DMA fence: All pending I/O is completed with -EIO. Critically, before freeing any driver memory, UmkaOS must ensure no in-flight DMA operations can write to those pages. The IOMMU mapping for the driver's DMA regions is revoked (set to fault-on-access) immediately at the ISOLATE step of the Section 10.8 recovery sequence; any in-flight DMA that completes after this point hits an IOMMU fault (harmless — the write is dropped by the IOMMU).

   DMA teardown sequence (before IOTLB unmap):
   1. Assert device-class DMA stop:
      - PCIe devices with FLR (Function Level Reset) support: issue FLR via the PCIe Device Control register (capability offset + 0x08, bit 15). FLR resets the device state and stops all outstanding DMA.
      - NVMe: issue Admin Command ABORT for all outstanding I/Os, then clear CC.EN (Controller Enable) to halt the NVMe controller.
      - AHCI: clear PORT_CMD_FIS_RX and PORT_CMD_ST per port.
      - USB devices: send a USB port reset to the host controller.
      - Devices without a DMA-stop mechanism: skip to step 2 (fallback only).
   2. Wait for DMA quiescence:
      - Poll the device's DMA-active indicator (device-class specific) until it reports no outstanding DMA, or until 100ms has elapsed.
      - For FLR: the PCIe spec requires FLR completion within 100ms. After FLR, DMA is guaranteed stopped by hardware.
   3. If step 2 does not complete within 100ms:
      - Increment driver.dma_timeout_count (exposed via /sys/devices/.../dma_timeouts).
      - The FMA subsystem (Section 19.1) receives a FaultEvent::DmaTimeout event.
      - Issue PCIe Function Level Reset (FLR) via the device's FLR capability register (Device Control register bit 15). FLR is a hard device reset that stops all outstanding DMA by definition.
      - Wait up to 500ms for FLR completion (poll config space; the device returns 0xFFFF during reset; the PCIe Base Spec requires FLR to complete within 100ms, so 500ms provides a conservative margin).
      - If FLR is unsupported by the device, or if FLR also times out:
        - Do not free memory. Mark the IOMMU group as quarantined: the existing IOMMU mappings are left in place (fault-on-access) but no new mappings are granted. Memory backing those mappings is pinned and excluded from the allocator until the quarantine is lifted.
        - Return Err(DmaQuiescenceTimeout) to the crash recovery path.
        - The quarantined IOMMU group is reset on the next system suspend/resume cycle (which performs a full bus reset), at which point the pinned memory is released.
        - Log: "DMA quiescence failed on [bus:dev.fn] after FLR — IOMMU group quarantined; memory pinned until suspend/resume reset"
   4. IOTLB invalidate: Only after confirmed device quiescence (step 1 DMA stop + step 2 poll, or step 3 FLR), invalidate the IOMMU TLB entries for the unmapped region. On Intel VT-d, this uses the Invalidation Wait Descriptor with IWD=1 to wait for invalidation completion. On AMD, the COMPLETION_WAIT command provides equivalent functionality. Only after IOTLB invalidation completes is it safe to free physical pages.

   Design note: Linux's default driver teardown does not always issue FLR, relying on IOMMU timeouts and trusting drivers to drain their own DMA. UmkaOS enforces the explicit stop sequence — it is the kernel's responsibility to ensure hardware is quiesced, not the driver's.

   Driver private memory is freed only after confirmed device quiescence and completed IOTLB invalidation. If quiescence cannot be confirmed (FLR also fails), the memory is quarantined rather than freed — no use-after-free path is permitted.

   Why this matters: without confirmed DMA quiescence, a device still mid-DMA could write to pages that have been freed and reallocated to another driver or to userspace — a use-after-free via hardware. Proceeding past a timeout without hardware confirmation of DMA stop is not an acceptable fallback for a production kernel; quarantine is the safe alternative when quiescence cannot be established.
6. Device reset: FLR for PCIe, port reset for USB, etc.
7. Driver reload: Fresh binary loaded, new vtable exchange. The DeviceDescriptor retains the same DeviceHandle — the device's identity in the registry is preserved across crashes.
8. Service re-publication: The reloaded driver publishes its services again. The registry notifies clients via the service_recovered callback.
9. Child re-probe: If this was a bus driver, the registry re-enumerates and re-probes child devices.
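Steps 2-3 of the DMA teardown (poll, then FLR fallback, then quarantine) reduce to a policy function. In this sketch the clock and the device probes are injected so the timeout logic is testable on a host; all names are illustrative, and the budgets mirror the 100ms/500ms figures above.

```rust
// Host-side sketch of the quiescence policy. `dma_idle` is the device-class
// DMA-active probe; `flr_done` is the post-FLR config-space poll.

#[derive(Debug, PartialEq)]
pub enum QuiesceOutcome { Quiesced, QuiescedViaFlr, Quarantine }

pub fn wait_for_dma_quiescence(
    mut now_ns: impl FnMut() -> u64,
    mut dma_idle: impl FnMut() -> bool,
    mut flr_done: impl FnMut() -> bool,
    flr_supported: bool,
) -> QuiesceOutcome {
    let start = now_ns();
    while now_ns() - start < 100_000_000 {          // step 2: 100 ms poll
        if dma_idle() { return QuiesceOutcome::Quiesced; }
    }
    if flr_supported {                              // step 3: FLR fallback
        let flr_start = now_ns();
        while now_ns() - flr_start < 500_000_000 {  // 500 ms FLR budget
            if flr_done() { return QuiesceOutcome::QuiescedViaFlr; }
        }
    }
    QuiesceOutcome::Quarantine // never free memory on this path
}
```

The crucial property is that `Quarantine` is the only exit that does not confirm hardware quiescence, and the caller must pin (not free) the driver's memory on that path.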
10.5.10.2 Failure Counter Integration
/// Sliding-window failure tracker. Records timestamps of recent failures
/// in a circular buffer. Used by the auto-demotion policy to count failures
/// within a configurable time window.
pub struct FailureWindow {
/// Circular buffer of failure timestamps (monotonic nanoseconds).
timestamps: [u64; 16],
/// Index of the next write position (wraps at 16).
head: u32,
/// Total number of failures recorded (may exceed 16; only the last 16
/// timestamps are retained).
total_count: u32,
}
impl FailureWindow {
/// Count failures within the last `window_ns` nanoseconds.
pub fn count_within(&self, window_ns: u64) -> u32 { /* ... */ }
/// Record a failure at the current time.
pub fn record(&mut self, now_ns: u64) { /* ... */ }
}
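One possible implementation of the two stubbed methods, restated self-contained (the struct is repeated so the sketch compiles on its own). Unlike the stub's signature, `count_within` here takes the current time explicitly rather than reading a clock, which keeps it testable.

```rust
// Sliding-window failure tracker: 16-entry circular buffer of timestamps,
// as described in the struct comments above.

pub struct FailureWindow {
    timestamps: [u64; 16],
    head: u32,
    total_count: u32,
}

impl FailureWindow {
    pub fn new() -> Self {
        Self { timestamps: [0; 16], head: 0, total_count: 0 }
    }

    /// Record a failure at `now_ns` (monotonic nanoseconds).
    pub fn record(&mut self, now_ns: u64) {
        self.timestamps[self.head as usize] = now_ns;
        self.head = (self.head + 1) % 16;
        self.total_count += 1;
    }

    /// Count retained failures (at most the last 16) within `window_ns`.
    pub fn count_within(&self, now_ns: u64, window_ns: u64) -> u32 {
        let retained = self.total_count.min(16) as usize;
        (0..retained)
            .filter(|&i| now_ns.saturating_sub(self.timestamps[i]) <= window_ns)
            .count() as u32
    }
}
```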
The registry's per-node failure_window (a FailureWindow sliding-window counter)
feeds into the existing auto-demotion policy. The counter records timestamps in a
16-entry circular buffer; the policy query asks "how many entries fall within the
last N seconds?" (default window: 1 hour):
failure_window.count_within(1 hour):
0-2: Reload at same tier
3-4: Demote to next lower tier (if minimum_tier allows)
5+: Transition to Quarantined state (driver permanently disabled, device
unbound); requires manual administrator re-enable via sysfs. Log critical alert.
This is the same policy described in Section 10.8, now with the registry as the tracking mechanism.
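The threshold bands reduce to a pure decision function; `RecoveryAction` variant names here are illustrative, not the registry's actual type.

```rust
// Auto-demotion policy: failure count in the sliding window -> action.

#[derive(Debug, PartialEq)]
pub enum RecoveryAction { ReloadSameTier, DemoteOneTier, Quarantine }

pub fn recovery_action(failures_in_window: u32) -> RecoveryAction {
    match failures_in_window {
        0..=2 => RecoveryAction::ReloadSameTier, // transient fault: retry
        3..=4 => RecoveryAction::DemoteOneTier,  // if minimum_tier allows
        _ => RecoveryAction::Quarantine,         // manual re-enable required
    }
}
```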
How auto-demotion works without recompilation — A driver that can run in both Tier 1
(isolation domain, Ring 0) and Tier 2 (process, Ring 3) does not need two separate
binaries. The KABI vtable abstraction (Section 11.1) provides identical function signatures
regardless of tier. The difference is in the hosting environment: Tier 1 drivers are
loaded as shared objects into a kernel isolation domain; Tier 2 drivers are loaded as
processes. The same .umka binary is valid in both contexts because KABI syscalls (ring
buffer operations, capability invocations) are designed to work from either Ring 0 or
Ring 3 — the Tier 1 path uses direct function calls via the vtable, while the Tier 2 path
uses syscall wrappers that implement the same vtable interface. Auto-demotion simply means
"restart this driver binary in a Tier 2 process instead of a Tier 1 isolation domain."
The driver code is unaware of the change; only the hosting environment differs.
10.5.11 Boot Sequence Integration
The registry integrates into the boot sequence (Section 2.1.3):
4. UmkaOS Core initialization:
a. Parse boot parameters and ACPI tables
b. Initialize physical memory allocator
c. Initialize virtual memory
d. Initialize per-CPU data structures
e. Initialize Tier 0 drivers: APIC, timer, early console
f. Initialize capability system
g. Initialize device registry <-- NEW
h. Register Tier 0 devices in registry <-- NEW
i. Initialize scheduler
j. Mount initramfs
5. ACPI/DT enumeration: populate registry <-- NEW
6. PCI enumeration: create device nodes <-- NEW
7. Registry runs match engine, loads storage driver <-- REPLACES ad-hoc loading
8. Mount real root filesystem
9. Continue device enumeration (USB, etc.) <-- NEW
10. Execute /sbin/init
10.5.11.1 Tier 0 Devices
Tier 0 drivers (APIC, timer, serial) are statically linked and initialized before the registry exists. After registry init, they are registered retroactively:
registry.register_tier0_device("apic", ...);
registry.register_tier0_device("timer", ...);
registry.register_tier0_device("serial0", ...);
These nodes are created directly in Active state with no match/load cycle.
10.5.11.2 Console Handoff
The display and input stack transitions through multiple phases during boot. The handoff protocol ensures zero message loss and graceful degradation.
Phase 1 — Tier 0 (early boot):
- Serial console (COM1/PL011/16550) is active from the first instruction.
- VGA text mode (80×25) initialized by BIOS/UEFI firmware on x86-64.
- All kernel output goes to the ring buffer (klog), serial, and VGA text mode
simultaneously. The ring buffer captures every message from the first printk.
Phase 2 — Tier 1 loaded (DRM/KMS driver):
- The DRM/KMS display driver initializes, performs modeset, and allocates a framebuffer.
- A framebuffer console renderer (fbcon) is initialized with the target resolution.
Handoff protocol:
1. DRM driver completes modeset, signals "console ready" via KABI callback:
driver_event(CONSOLE_READY, framebuffer_info)
2. Kernel console subsystem:
a. Locks the console output path (brief pause, <1ms)
b. Replays the full ring buffer contents onto the framebuffer console
— no boot messages are lost, the user sees the complete boot log
c. Registers fbcon as the primary console output
d. Unlocks the console output path
3. Serial console remains active — never disabled. All output goes to BOTH
serial and framebuffer. This ensures remote management always works.
4. VGA text mode driver is deregistered as the *primary* console backend.
The VGA text mode memory region (0xB8000) is NOT released to the physical
memory allocator — it is reserved as a panic-only fallback (see below).
The region is small (4000 bytes) and the cost of keeping it reserved is
negligible compared to the benefit of having a guaranteed crash output path.
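Step 2b's replay can be sketched as a bounded ring that keeps the most recent messages and replays them oldest-first onto a late-arriving backend. The string-based ring is a simplification; the real klog stores binary records.

```rust
// Bounded message ring: old messages are overwritten, and replay() emits
// the retained messages in arrival order — the behavior fbcon relies on
// to show the complete (recent) boot log after handoff.

pub struct KlogRing {
    slots: Vec<Option<String>>,
    next: usize, // total messages ever written
}

impl KlogRing {
    pub fn new(capacity: usize) -> Self {
        Self { slots: vec![None; capacity], next: 0 }
    }

    pub fn push(&mut self, msg: &str) {
        let cap = self.slots.len();
        self.slots[self.next % cap] = Some(msg.to_string());
        self.next += 1;
    }

    /// Replay retained messages, oldest first, onto a new console backend.
    pub fn replay(&self, mut emit: impl FnMut(&str)) {
        let cap = self.slots.len();
        let start = self.next.saturating_sub(cap);
        for i in start..self.next {
            if let Some(m) = &self.slots[i % cap] {
                emit(m);
            }
        }
    }
}
```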
Keyboard handoff:
- Early boot: the PS/2 scan code handler (Tier 0) captures keystrokes into a buffer. This allows emergency interaction (e.g., boot parameter editing) before USB is up.
- Tier 1 loaded: the USB HID driver initializes and registers as an input device. The input subsystem drains the PS/2 keystroke buffer — no keystrokes are lost.
- The PS/2 handler remains active for keyboards physically connected via PS/2.
Virtual terminals:
- VT switching (Ctrl+Alt+F1–F6) is implemented in umka-core's input multiplexer,
NOT in the display driver. The display driver is a passive renderer.
- On VT switch, the input multiplexer sends a SWITCH_VT(n) command to the
display driver via KABI. The driver switches which virtual framebuffer is scanned
out.
- This design means a crashing display driver doesn't break VT switching logic —
on driver recovery, the multiplexer re-sends the current VT state.
Crash fallback:
- If the DRM driver faults, the core reverts to VGA text mode (x86-64) or serial-only (AArch64/RISC-V/PPC) for panic output. Tier 0 console backends are always available.
- The panic handler bypasses the normal console locking path and writes directly to the Tier 0 backends (serial + VGA text if available).
10.5.11.3 PCI Enumeration
PCI enumeration is part of UmkaOS Core (Tier 0 functionality in early boot). It walks PCI configuration space and creates device nodes:
For each PCI bus (starting from bus 0):
For each device 0-31, function 0-7:
If device present:
1. Create DeviceNode with PCI bus identity
2. Populate properties: vendor-id, device-id, class-code, BARs, IRQs
3. If this is a bridge: create a bus node, recurse into secondary bus
4. Set numa_node from ACPI SRAT proximity domain
5. Registry runs match engine for this node
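The config-space walk relies on standard ECAM address arithmetic: each (bus, device, function) gets a 4 KiB window at a fixed offset from the MCFG-reported base. This layout is defined by the PCIe specification.

```rust
// ECAM: bus in bits 27:20, device in 19:15, function in 14:12, and the
// register offset in the low 12 bits, all relative to the segment's base.

/// Physical address of a config register for (bus, device, function).
pub fn ecam_address(ecam_base: u64, bus: u8, device: u8, function: u8, reg: u16) -> u64 {
    assert!(device < 32 && function < 8 && reg < 4096);
    ecam_base
        + ((bus as u64) << 20)
        + ((device as u64) << 15)
        + ((function as u64) << 12)
        + reg as u64
}
```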
10.5.11.4 NUMA Awareness
ACPI SRAT (System Resource Affinity Table) provides NUMA topology. The registry uses
this to set numa_node on each device node based on the device's proximity domain (PCI
devices inherit from their root port's NUMA node).
This information is available for:
- Driver memory allocation: prefer the device's NUMA node.
- DMA buffer allocation: prefer the device's NUMA node.
- IRQ affinity: suggest CPU affinity matching the device's NUMA node.
- Tier 1 domain assignment: prefer grouping NUMA-local devices when isolation domains are shared.
10.5.11.5 ACPI Enumerator
The ACPI enumerator is Tier 0 kernel-internal code that walks the ACPI namespace and creates platform device nodes in the registry. It handles the tables that define hardware topology:
| ACPI Table | Registry Impact |
|---|---|
| MCFG (PCI Express Memory Mapped Config) | Defines PCI segment groups and ECAM base addresses. The PCI enumerator uses these to access PCI config space. |
| SRAT (System Resource Affinity) | Maps PCI bus ranges and memory ranges to NUMA proximity domains. Sets numa_node on device nodes. |
| DMAR / IVRS (DMA Remapping) | Defines IOMMU hardware. Creates IOMMU group assignments (Section 10.5.3.8). Intel DMAR for VT-d, AMD IVRS for AMD-Vi. |
| DSDT / SSDT (Differentiated System Description) | Defines platform devices (embedded controllers, power buttons, battery, thermal zones). Each ACPI device object becomes a platform device node. |
| HPET / MADT | Timer and interrupt controller topology. Creates Tier 0 device nodes for APIC, I/O APIC, HPET. |
AML evaluation: The ACPI enumerator includes an AML (ACPI Machine Language)
interpreter for evaluating _STA (device status), _CRS (current resources), and
_HID (hardware ID) methods. This is a significant subsystem but is
required for correct hardware enumeration on any x86 system. The AML interpreter runs
in Tier 0 with full kernel privileges because it accesses hardware registers directly.
Device Tree enumerator (AArch64/RISC-V/PPC): Parses the flattened device tree (FDT)
passed by the bootloader. Each DT node with a compatible property becomes a platform
device node. The reg property populates DeviceResources.bars (as MMIO regions), and
the interrupts property populates DeviceResources.irqs. DT phandle references
become provider-client service links.
10.5.11.6 Firmware Quirk Framework
ACPI tables and Device Trees are authored by firmware engineers and are notoriously
buggy. Linux has accumulated thousands of firmware workarounds scattered across
subsystem-specific code (drivers/acpi/, arch/x86/kernel/, DMI match tables,
ACPI override tables). UmkaOS centralizes firmware workarounds into a structured quirk
framework, similar to the CPU errata framework (Section 2.1.4).
The problem is real — common firmware bugs observed in the wild:
- ACPI _CRS (Current Resources) reports incorrect MMIO ranges for PCI bridges,
causing resource conflicts
- SRAT (NUMA affinity) tables claim all memory belongs to NUMA node 0 on multi-socket
systems (broken BIOS update)
- DMAR (IOMMU) tables omit devices or report wrong scope, causing IOMMU group
misassignment
- Device Tree interrupt-map entries with wrong parent phandle references (ARM SoC
vendor bugs)
- DSDT/SSDT AML code with infinite loops, incorrect register addresses, or methods that
return wrong types
- MADT reports non-existent APIC IDs (causes boot failure if kernel trusts them)
- ECAM (PCI config space) base address wrong in MCFG table
UmkaOS's firmware quirk table:
/// Firmware quirk entry — matches a system to its required workarounds.
struct FirmwareQuirk {
/// System identification (DMI vendor + product + BIOS version).
match_id: DmiMatch,
/// ACPI table match (optional — match specific table revision).
table_match: Option<AcpiTableMatch>,
/// Human-readable quirk identifier.
quirk_id: &'static str,
/// Workaround: override, ignore, or patch firmware data.
action: QuirkAction,
}
enum QuirkAction {
/// Override a specific ACPI table with a corrected version (ACPI override).
OverrideTable { table_signature: [u8; 4], replacement: &'static [u8] },
/// Ignore a specific device entry in DMAR/IVRS (broken IOMMU scope).
IgnoreIommuDevice { segment: u16, bus: u8, device: u8, function: u8 },
/// Override NUMA affinity for a memory range (broken SRAT).
OverrideNumaAffinity { phys_start: u64, phys_end: u64, node: u32 },
/// Ignore an APIC ID in MADT (non-existent CPU).
IgnoreApicId { apic_id: u32 },
/// Patch a specific AML method (replace bytecode).
PatchAml { path: &'static str, replacement: &'static [u8] },
/// Skip enumeration for a device matching this HID (broken _CRS).
SkipDevice { hid: &'static str },
/// Custom workaround function.
Custom(fn() -> Result<()>),
}
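Quirk selection reduces to matching firmware-reported identity strings against the table. This sketch shrinks `DmiMatch` to vendor+product; the real matcher also checks BIOS version and optional ACPI table revisions.

```rust
// Minimal quirk-table matcher: applicable quirks are returned in table
// order, which is also the order their actions would be applied.

pub struct DmiMatch {
    pub vendor: &'static str,
    pub product: &'static str,
}

pub struct FirmwareQuirk {
    pub match_id: DmiMatch,
    pub quirk_id: &'static str,
}

/// Return the quirk_ids that apply to this system, in table order.
pub fn applicable_quirks(
    table: &[FirmwareQuirk],
    vendor: &str,
    product: &str,
) -> Vec<&'static str> {
    table
        .iter()
        .filter(|q| q.match_id.vendor == vendor && q.match_id.product == product)
        .map(|q| q.quirk_id)
        .collect()
}
```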
Quirk database population — the initial quirk database is seeded from:
1. Linux's existing DMI quirk tables (drivers/acpi/, arch/x86/pci/) — these
document decades of firmware workarounds with specific DMI match strings
2. Community-reported firmware bugs (same mechanism as Linux's bugzilla)
3. Vendor-provided errata sheets (when available)
ACPI table override — Linux supports loading replacement ACPI tables from initramfs
(CONFIG_ACPI_TABLE_UPGRADE). UmkaOS supports the same mechanism: if a corrected DSDT
is placed in the initramfs at /lib/firmware/acpi/, it replaces the firmware-provided
table at boot. This allows users to fix firmware bugs without waiting for a BIOS update.
Boot-time quirk logging — all applied quirks are logged at boot:
umka: Firmware quirk applied: DELL-POWEREDGE-R740-BIOS-2.12 — DMAR ignore device 0000:00:14.0 (broken IOMMU scope)
umka: Firmware quirk applied: LENOVO-T14S-BIOS-1.38 — SRAT override node 0→1 for range 0x100000000-0x200000000
Why UmkaOS is more sensitive to firmware bugs than Linux — UmkaOS's topology-aware
device registry derives NUMA affinity, IOMMU groups, power management ordering, and
driver isolation domains from firmware-reported topology. A firmware bug that reports
wrong NUMA affinity causes UmkaOS to place a driver on the wrong NUMA node (performance
degradation). In Linux, the same bug might cause a suboptimal numactl suggestion but
doesn't affect driver placement (Linux doesn't have topology-aware driver isolation).
This means UmkaOS must invest more heavily in firmware workarounds than Linux for the
same set of hardware. The structured quirk framework makes this manageable — adding a
new workaround is a single table entry, not scattered if (dmi_match(...)) checks
across the codebase.
Defensive parsing — beyond per-system quirks, all firmware table parsers are defensively coded:
- ACPI table lengths are validated against the RSDP/XSDT-reported size.
- The AML interpreter has an instruction count limit (prevents infinite loops in AML code).
- The Device Tree parser validates all phandle references before dereferencing.
- PCI config space reads are bounds-checked against MCFG-reported ECAM regions.
- Any parse failure is logged as an FMA event (Section 19.1) and the offending entry is skipped rather than causing a boot failure.
10.5.11.7 Resource Assignment
During PCI enumeration, the registry assigns hardware resources to each device:
For each PCI device:
1. Read BAR registers to determine resource requirements (size, type).
2. Assign physical address ranges from the PCI memory/IO space allocator.
- MMIO BARs: allocate from PCI MMIO window (defined by ACPI `_CRS`
method on the PCI host bridge device; MCFG defines only the ECAM base
address for PCIe configuration space access).
- I/O BARs: allocate from PCI I/O window (legacy x86, rare).
3. Write assigned addresses back to BAR registers.
4. Populate DeviceResources.bars with the assigned mappings.
5. Allocate MSI/MSI-X vectors:
- If device supports MSI-X: allocate up to min(device_max, driver_requested) vectors.
- If MSI only: allocate power-of-2 vectors up to device limit.
- Fallback: assign legacy INTx pin.
6. Populate DeviceResources.irqs.
Resource conflicts (overlapping BAR assignments, IRQ vector exhaustion) are detected
during enumeration and logged as FMA events (Section 19.1). Conflicting devices remain
in Discovered state with no driver bound.
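The vector-count rules in step 5 can be sketched as two small policy functions; the power-of-two rounding for plain MSI follows the "power-of-2 vectors up to device limit" rule above.

```rust
// MSI-X grants exactly min(device_max, requested); plain MSI can only grant
// power-of-2 counts, so the request is rounded down to the largest power of
// two within the device limit.

pub fn msix_vectors(device_max: u32, requested: u32) -> u32 {
    device_max.min(requested)
}

pub fn msi_vectors(device_max: u32, requested: u32) -> u32 {
    let limit = device_max.min(requested).max(1);
    // Largest power of two <= limit.
    1 << (31 - limit.leading_zeros())
}
```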
10.5.12 Sysfs Compatibility
The registry is the single source of truth for the /sys filesystem required by Linux
compatibility (Section 18.1.3).
10.5.12.1 Mapping
| Sysfs Path | Registry Source |
|---|---|
| /sys/devices/ | Device tree traversal (parent-child edges) |
| /sys/bus/pci/devices/ | All nodes with bus_type == Pci |
| /sys/bus/usb/devices/ | All nodes with bus_type == Usb |
| /sys/class/block/ | Nodes publishing "block" service |
| /sys/class/net/ | Nodes publishing "net" service |
| /sys/devices/.../driver | driver_binding.driver_name |
| /sys/devices/.../power/ | Power state and runtime PM policy |
| /sys/devices/.../uevent | Generated from node properties |
10.5.12.2 Attribute Files
Each standard property maps to the expected sysfs attribute format:
- vendor → property "vendor-id" formatted as 0x%04x
- device → property "device-id" formatted as 0x%04x
- class → property "class-code" formatted as 0x%06x
Custom driver-set properties appear under a properties/ subdirectory.
10.5.12.3 Device Class via Service Names
Linux's /sys/class/ directories are derived from service publication:
- A driver that publishes a "net" service → device appears under /sys/class/net/
- A driver that publishes a "block" service → device appears under /sys/class/block/
- A driver that publishes an "input" service → device appears under /sys/class/input/
This is more principled than Linux's explicit class_create() calls because the
classification falls naturally out of what the driver actually does.
10.5.13 Concurrency and Performance
10.5.13.1 Locking Strategy
- Read path (hot): Property queries, service lookups, sysfs reads. Reader-writer lock allows concurrent reads.
- Write path (cold): Node creation, state transitions, driver binding, hotplug. Takes exclusive write lock.
- Per-node state: Atomic field for lock-free state checks ("is this device active?" does not need the tree lock).
- PM ordering cache: Computed once per PM transition. Invalidated when tree topology changes (hotplug).
10.5.13.2 Scalability
- Device enumeration: O(n*m) where n = match rules, m = unmatched devices. With <1000 drivers and <200 devices on a typical system, this completes in microseconds. Runs once at boot + on hotplug.
- Service lookup: Hash-indexed by service name. O(1) amortized.
- Property query: Binary search on sorted PropertyTable. O(log n), n < 30.
- PM ordering: Topological sort is O(V+E) where V = nodes, E = edges. Computed once, cached.
10.5.13.3 Memory Budget
| Component | Per Node | Notes |
|---|---|---|
| DeviceNode struct | ~512 bytes | Fixed-size fields |
| PropertyTable (avg 15 props) | ~1 KB | Key strings + values |
| Children/providers/clients | ~128 bytes | Vec overhead |
| Total per node | ~1.7 KB | Sum of the above |
A typical desktop with ~200 devices: ~340 KB. A busy server with ~1000 devices: ~1.7 MB. Well within kernel memory budget.
10.5.14 Resolved Design Decisions
The following design questions have been resolved:
1. USB topology depth: full topology.
The registry represents the full USB hub topology (up to 7 levels). Hub nodes carry a
UsbHub property struct with port count and per-port power control. This is required
for correct power-management ordering (suspend leaf-first, resume root-first) and
surprise-removal cascading (removing a hub invalidates all downstream devices). The node
overhead is trivial — one DeviceNode per hub.
2. GPU sub-device modeling: child nodes.
Each GPU sub-function (display controller, compute engine, video encoder, copy engine)
is a child DeviceNode with its own BusIdentity::PciFunction and capability flags.
The parent GPU node holds shared state (VRAM, power domain). Each child binds its own
extension vtable (AccelComputeVTable, AccelDisplayVTable per Section 21.1.2) while
sharing the parent's AccelBaseVTable. This enables independent driver binding per
sub-function (e.g., a display driver and a compute driver on the same GPU).
3. Firmware enumerators: pluggable Tier 0 backends.
A FirmwareEnumerator trait defines two methods: enumerate(registry: &mut DeviceRegistry)
and match_device(node: &DeviceNode) -> Option<DeviceProperties>. Two implementations:
- AcpiEnumerator — walks the ACPI namespace (_STA, _HID, _CRS), creates platform device nodes.
- DtEnumerator — walks the flattened device tree compatible strings, creates platform device nodes.
Architecture selection is compile-time via arch::current::firmware_enumerator():
x86 → ACPI, ARM/RISC-V → DT, ARM server → both. Both enumerators are kernel-internal
(Tier 0), never exposed through KABI.
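The trait shape described above can be rendered in Rust as follows. The registry types are stubbed here; only the two method signatures follow the text.

```rust
// Stub types standing in for the real registry structures.
pub struct DeviceRegistry;
pub struct DeviceNode;
pub struct DeviceProperties;

/// Pluggable Tier 0 firmware enumeration backend (sketch).
pub trait FirmwareEnumerator {
    /// Walk the firmware tables and create platform device nodes.
    fn enumerate(&self, registry: &mut DeviceRegistry);
    /// Firmware-derived properties for a node this backend recognizes
    /// (by ACPI _HID or device-tree `compatible`), if any.
    fn match_device(&self, node: &DeviceNode) -> Option<DeviceProperties>;
}

/// ACPI backend stub: the real implementation walks _STA, _HID, _CRS.
pub struct AcpiEnumerator;
impl FirmwareEnumerator for AcpiEnumerator {
    fn enumerate(&self, _registry: &mut DeviceRegistry) { /* walk namespace */ }
    fn match_device(&self, _node: &DeviceNode) -> Option<DeviceProperties> {
        None // real backend evaluates the ACPI namespace here
    }
}
```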
4. Multi-function PCI devices: one node per function.
The topology is: PciBridge → PciSlot → PciFunction(0..N). The PciSlot node is a
lightweight grouping node (no driver binding) that carries the slot's physical identity
(segment/bus/device). Each PciFunction child has its own BAR resources, MSI vectors,
and IOMMU group assignment. This matches Linux's sysfs model and makes SR-IOV VF
creation (Decision 8) natural — VFs are additional function children.
Recovery ordering for multi-function devices follows the device tree: if function 0 crashes, sibling functions (1, 2, ...) are notified via the registry's DeviceEvent::SiblingReset event. Each sibling driver independently decides whether to re-probe its function or wait for the parent slot to stabilize. The parent PciSlot node coordinates FLR (Function Level Reset) if the failing function requests it.
5. Service versioning: yes, using InterfaceVersion.
registry_publish_service requires the service vtable to start with the standard
vtable_size: u64, version: u32 header, same as all KABI vtables (Section 11.1.3).
Lookup performs major-version matching; minor-version differences are handled by
vtable_size-based field presence detection. No new mechanism — reuses the existing
KABI version negotiation protocol.
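The matching rule can be sketched as below. The header layout follows the standard vtable header; packing major/minor into the single `version: u32` (major in the high 16 bits) is an assumption for illustration.

```rust
/// Standard KABI vtable header (layout per the text; encoding assumed).
#[repr(C)]
struct VtableHeader {
    vtable_size: u64, // total bytes; used for minor-version field presence
    version: u32,     // ASSUMED encoding: major in high 16 bits, minor in low 16
}

/// Lookup succeeds only on a major-version match.
fn versions_compatible(provider: &VtableHeader, wanted_major: u16) -> bool {
    (provider.version >> 16) as u16 == wanted_major
}

/// Minor-version field presence: an optional method at byte offset `off`
/// exists only if the provider's vtable is large enough to contain it.
fn has_field(provider: &VtableHeader, off: u64, field_size: u64) -> bool {
    provider.vtable_size >= off + field_size
}
```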
6. Multi-provider services: topology-aware lookup + enumeration variant.
registry_lookup_service(name) returns the closest provider by walking: same device →
sibling nodes → parent subtree → global. registry_lookup_all_services(name) returns
an iterator over all providers, ordered by topological distance. The "closest" heuristic
covers the common case (e.g., an I2C client finding its controller); the enumeration
variant handles multi-path cases (RAID member discovery, network bonding).
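The "closest provider" ordering can be sketched by ranking candidates on topological distance (the scope tiers mirror the walk order in the text; the types are illustrative):

```rust
/// Where a candidate provider sits relative to the requesting node.
#[derive(Clone, Copy, PartialEq, Eq)]
enum Scope { SameDevice, Sibling, ParentSubtree, Global }

/// Topological distance: lower is closer, per the walk order above.
fn distance(scope: Scope) -> u8 {
    match scope {
        Scope::SameDevice => 0,
        Scope::Sibling => 1,
        Scope::ParentSubtree => 2,
        Scope::Global => 3,
    }
}

/// registry_lookup_service returns the minimum-distance provider;
/// registry_lookup_all_services would yield all, sorted by distance.
fn closest(mut providers: Vec<(u32, Scope)>) -> Option<u32> {
    providers.sort_by_key(|&(_, s)| distance(s));
    providers.first().map(|&(id, _)| id)
}
```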
7. Persistent device naming: yes, bus-identity + serial derived. The registry generates a stable device path from bus-specific identity:
| Bus | Stable Path Source |
|---|---|
| PCI | segment:bus:device.function (stable if ACPI/DT provides _BBN/_SEG) |
| USB | Hub chain + port number (stable as long as physical topology unchanged) |
| NVMe | PCI path + namespace ID |
| SCSI | WWID / VPD page 83 |
The stable path is stored as a stable_path: ArrayString<128> property on each
DeviceNode. The compat layer creates /dev/disk/by-id/, /dev/disk/by-path/ etc. as
symlinks. The kernel itself never uses these names — they are purely for userspace
convenience.
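For the PCI row of the table, stable-path generation might look like this. The `pci-` prefix and field layout are modeled on the familiar /dev/disk/by-path convention and are illustrative, not the registry's defined format:

```rust
/// Sketch of PCI stable-path generation: segment:bus:device.function.
/// Prefix and formatting are ASSUMED for illustration.
fn pci_stable_path(seg: u16, bus: u8, dev: u8, func: u8) -> String {
    format!("pci-{seg:04x}:{bus:02x}:{dev:02x}.{func}")
}
```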
8. IOMMU group granularity for SR-IOV: PF driver creates VF nodes via KABI.
The PF driver calls registry_create_vf_nodes(pf_handle: DeviceHandle, count: u32)
which:
- Validates the PF has ACS on its upstream port (required for per-VF IOMMU groups).
- Creates count child DeviceNodes with BusIdentity::Pci entries for each VF BDF.
- Assigns each VF its own IOMMU group (if ACS permits) or groups them with the PF.
- Triggers driver matching on each new VF node (both the PF's own driver and the VFIO passthrough driver are valid matches).
Destruction: registry_destroy_vf_nodes(pf_handle) tears down all VFs, unmapping their
IOMMU entries and revoking any VFIO leases. Fails with IO_RESULT_BUSY if any VF is
actively in use by a guest VM.
9. AML interpreter scope: minimal production subset, growth-on-demand.
The initial interpreter supports the following ACPI methods (the minimum for real x86
server/desktop boot): _STA, _CRS, _HID, _UID, _BBN (base bus number),
_SEG (PCI segment), _PRT (PCI routing table), _OSI (OS identification — most
DSDTs gate behavior on this), _DSM (device-specific method — used by PCIe, NVMe,
USB controllers), _PS0/_PS3 (power state transitions), _INI (device
initialization), _REG (operation region handler registration), and _CBA (ECAM base
for PCIe config space on modern systems).
Required AML bytecode opcodes: Store, If/Else, Return, Buffer, Package, Integer/String/
Buffer operations, Method invocation, OperationRegion, Field. Without _OSI and _DSM,
most x86 laptops and many servers fail to enumerate devices correctly. Extend only when
real hardware fails to enumerate — do not speculatively implement unused methods.
10. Resource reservation for hot-plug: configurable per-slot defaults, ACPI-guided.
Default reservation per hot-plug capable slot: 256MB MMIO, 256MB prefetchable MMIO,
and 8 bus numbers (matching Linux's heuristic). Configurable via kernel command-line
parameters (pci_hp_mmio=128M, pci_hp_prefetch=256M, pci_hp_buses=4). The PCI
allocator reads ACPI _HPP (Hot Plug Parameters) and _HPX (Hot Plug Extensions)
methods if present — these override the defaults with firmware-provided values. Reserved
regions are tracked as "allocated but unoccupied" to prevent other devices from claiming
them.
11. KABI long-term evolution: 5 releases default, LTS KABI opt-in. The support window is 5 major releases. A KABI version may be designated LTS at release time (not retroactively), extending its support to 7 releases. LTS designation requires that at least one major driver ecosystem (storage, network, or accelerator) has certified against that KABI version.
Lifecycle:
- At KABI_vN+3 (or +5 for LTS): deprecated methods gain #[deprecated(since = "KABI_vN")]
and emit a kernel log warning when called.
- At KABI_vN+5 (or +7 for LTS): deprecated methods are removed from the vtable.
- Dead method cleanup reduces vtable size, reclaiming the bloat from append-only evolution.
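The lifecycle arithmetic above reduces to a small helper (a sketch; the function name is illustrative): given the release in which a method was deprecated and whether that KABI is LTS, compute the warning and removal releases.

```rust
/// Returns (warning_release, removal_release) for a method deprecated
/// in KABI_vN. Offsets follow the text: +3/+5, or +5/+7 for LTS.
fn kabi_lifecycle(deprecated_in: u32, lts: bool) -> (u32, u32) {
    let (warn, remove) = if lts { (5, 7) } else { (3, 5) };
    (deprecated_in + warn, deprecated_in + remove)
}
```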
12. IOMMU nested translation performance: proactive large page promotion. The IOMMU mapper always selects the largest page size that fits the DMA mapping alignment and size:
| Condition | IOMMU Page Size |
|---|---|
| Mapping ≥ 1GB and 1GB-aligned | 1GB (rare; occurs for GPU BAR mappings) |
| Mapping ≥ 2MB and 2MB-aligned | 2MB |
| All other cases | 4KB |
This is a policy in the IOMMU mapping path, not a reactive monitor. Per-device IOMMU stats (IOTLB miss rate via performance counters, if available) are exposed through the FMA health telemetry path (Section 19.1) for observability, but the promotion decision itself is always proactive.
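The table's selection policy is a pure function of mapping alignment and size, sketched here (helper name illustrative, not the mapper's real entry point):

```rust
const SZ_2M: u64 = 2 << 20; // 2 MiB
const SZ_1G: u64 = 1 << 30; // 1 GiB

/// Proactive large-page policy: pick the largest IOMMU page size
/// compatible with the mapping's base alignment and length.
fn iommu_page_size(iova: u64, len: u64) -> u64 {
    if len >= SZ_1G && iova % SZ_1G == 0 {
        SZ_1G // rare; GPU BAR mappings
    } else if len >= SZ_2M && iova % SZ_2M == 0 {
        SZ_2M
    } else {
        4096
    }
}
```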
10.5.15 Firmware Management
Devices need firmware updates. The kernel provides infrastructure for loading and updating device firmware without requiring device-specific userspace tools.
10.5.15.1 Firmware Loading
Firmware loading flow (boot and runtime):
1. Driver calls kabi_request_firmware(name, device_id).
2. Kernel searches firmware paths in order:
a. /lib/firmware/updates/<name> (admin overrides)
b. /lib/firmware/<name> (distro-provided)
c. Initramfs embedded firmware (for boot-critical devices)
3. If found: kernel maps the firmware blob read-only into the
driver's isolation domain. Driver receives a FirmwareBlob handle
with .data() and .size() accessors.
4. Driver loads firmware to device via its own mechanism
(MMIO, DMA upload, vendor mailbox).
5. Driver releases the handle; kernel unmaps the blob.
Same semantics as Linux request_firmware() / request_firmware_nowait().
The async variant (kabi_request_firmware_async) does not block the
driver's probe path — useful for large firmware blobs (>10MB).
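The step-2 search order can be sketched as an ordered candidate list (paths follow the text; the helper is illustrative, since the real kernel resolves these through its own VFS rather than returning strings):

```rust
/// Ordered firmware search candidates for kabi_request_firmware (sketch).
fn firmware_search_paths(name: &str) -> Vec<String> {
    vec![
        format!("/lib/firmware/updates/{name}"), // admin overrides win
        format!("/lib/firmware/{name}"),         // distro-provided
        format!("initramfs:{name}"),             // boot-critical fallback
    ]
}
```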
10.5.15.2 Firmware Update (Runtime)
Runtime firmware update (fwupd / vendor tools):
1. Userspace writes firmware capsule to /sys/class/firmware/<device>/loading.
2. Kernel validates:
a. Signature (mandatory: Ed25519 or PQC if enabled).
The signing key must match the device's firmware trust anchor
(embedded in device or provided by vendor via UEFI db).
b. Version (must be >= current version, prevents downgrade attacks
unless admin explicitly overrides via firmware.allow_downgrade=1).
3. Kernel notifies driver via KABI callback:
update_firmware(blob, blob_size) -> FirmwareUpdateResult.
4. Driver performs the device-specific update procedure:
- NVMe: Firmware Download + Firmware Commit (NVMe admin commands).
- GPU: vendor-specific update mechanism.
- NIC: flash update via vendor mailbox.
5. Driver returns result: Success, NeedsReset, Failed(error_code).
6. If NeedsReset: kernel marks device for reset. Reset can be
triggered immediately (if no active I/O) or deferred to next
maintenance window (admin-configurable).
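The result handling in steps 5-6 can be sketched as follows. The enum variants mirror the text; the dispatch function and its string outcomes are illustrative stand-ins for the kernel's reset scheduling.

```rust
/// Driver-reported outcome of update_firmware() (per the text).
enum FirmwareUpdateResult { Success, NeedsReset, Failed(u32) }

/// Sketch of the kernel's step-6 decision: reset immediately only when
/// no I/O is active, otherwise defer to the maintenance window.
fn handle_update_result(r: FirmwareUpdateResult, active_io: bool) -> &'static str {
    match r {
        FirmwareUpdateResult::Success => "activated",
        FirmwareUpdateResult::NeedsReset if !active_io => "reset-now",
        FirmwareUpdateResult::NeedsReset => "reset-deferred",
        FirmwareUpdateResult::Failed(_) => "failed",
    }
}
```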
UEFI capsule updates (system firmware):
Kernel writes capsule to EFI System Resource Table (ESRT) via
efi_capsule_update(). Actual update happens on next reboot.
Same mechanism as Linux (CONFIG_EFI_CAPSULE_LOADER).
Exposes /dev/efi_capsule_loader for userspace tools (fwupd).
10.5.15.3 Linux Compatibility
/sys/class/firmware/<device>/loading — firmware loading trigger
/sys/class/firmware/<device>/data — firmware blob upload
/sys/class/firmware/<device>/status — update status
/sys/bus/*/devices/*/firmware_node/ — ACPI firmware node link
/dev/efi_capsule_loader — UEFI capsule interface
fwupd works unmodified — it uses the standard sysfs firmware update interface and UEFI capsule loader, both of which are provided.
10.5.16 Appendix: Comparison with Prior Art
| Aspect | Linux | IOKit | Windows PnP | Fuchsia DF | UmkaOS |
|---|---|---|---|---|---|
| Tree owner | Kernel (kobject) | Kernel (IORegistry) | Kernel (devnode) | Userspace (devmgr) | Kernel (DeviceRegistry) |
| Matching | Per-bus (module_alias) | Property dict match | INF file rules | Bind rules | MatchRule in ELF .kabi_match |
| PM ordering | Heuristic (dpm_list) | IOPMPowerState tree | IRP tree walk | Component PM | Topological sort of device tree |
| Service discovery | Per-subsystem APIs | IOService matching | WDF target objects | Protocol/service | Unified registry_publish/lookup |
| Hot-plug | Per-bus callbacks | IOService terminate | PnP IRP dispatch | devmgr events | Registry-mediated events |
| Crash recovery | Kernel panic | IOService terminate | Bugcheck | Component restart | Registry-orchestrated reload |
| ABI coupling | Tight (kobject in driver) | Tight (C++ inheritance) | Tight (WDM/WDF) | Protocol-only | None (KABI vtable only) |
| Isolation | None | None | None | Process boundary | Domain isolation + process + capability |
10.6 Zero-Copy I/O Path
The entire I/O path from user space to device and back avoids all data copies. This is essential for matching Linux performance.
10.6.1 NVMe Read Example (io_uring SQPOLL + Registered Buffers)
Step 1: User writes SQE to io_uring submission ring
[User space, shared memory, 0 transitions]
Step 2: SQPOLL kernel thread reads SQE from ring
[UmkaOS Core, shared memory read, 0 copies]
Step 3: Domain switch to NVMe driver domain (~23 cycles on x86 MPK)
[Single WRPKRU on x86; MSR POR_EL0+ISB on AArch64 POE; MCR DACR on ARMv7]
Step 4: NVMe driver writes command to hardware submission queue
[Pre-computed DMA address from registered buffer]
Step 5: Domain switch back to UmkaOS Core (~23 cycles on x86 MPK)
[Submit path complete, return to core domain]
Step 6: NVMe device DMAs data directly to user buffer
[IOMMU-validated, zero-copy, device -> user memory]
Step 7: NVMe device writes completion to hardware CQ, raises interrupt
Step 8: Interrupt routes to NVMe driver (domain switch, ~23 cycles on x86 MPK)
Driver reads hardware CQE
Step 9: Domain switch back to UmkaOS Core (~23 cycles on x86 MPK)
Step 10: UmkaOS Core writes CQE to io_uring completion ring
[Shared memory write, 0 copies]
Step 11: User reads CQE from completion ring
[User space, shared memory, 0 transitions]
Summary:
- Total data copies: 0
- Total domain switches: 4 (steps 3+5 on submit path, steps 8+9 on completion path)
- Total domain switch overhead: ~92 cycles on x86 MPK (4 x ~23 cycles; see the Section 10.2 table for other architectures)
- Device latency: ~3-10 us
- Overhead percentage: < 1%
10.6.1.1 NVMe Doorbell Coalescing (Mandatory)
NVMe hardware uses doorbell registers (MMIO writes) to notify the controller that
new commands are available in the submission queue. Each doorbell write is an
uncacheable MMIO store — ~100-200 cycles on x86-64 (PCIe posted write), ~150-300
cycles on ARM (device memory type). In the naive case, every submit_io() call
writes the doorbell immediately, which means one MMIO write per I/O command.
UmkaOS coalesces doorbell writes as a core design decision. When multiple I/O commands are submitted in a batch (common with io_uring SQPOLL, which drains multiple SQEs per poll cycle), the NVMe driver writes all commands to the submission queue first, then issues a single doorbell write for the entire batch. The NVMe specification explicitly supports this: the doorbell value is the new SQ tail index, and the controller processes all entries between the previous tail and the new tail.
/// NVMe submission batch context. Accumulates commands and defers the
/// doorbell write until `flush()` is called. Created by the KABI dispatch
/// trampoline when it detects multiple pending SQEs in the domain ring buffer.
///
/// # Invariants
///
/// - `pending_count` tracks commands written to the hardware SQ since the
/// last doorbell write.
/// - `flush()` must be called before returning from the KABI dispatch
/// to ensure all commands are visible to the controller. The KABI
/// trampoline enforces this via Drop (flush on drop as safety net).
pub struct NvmeSubmitBatch<'sq> {
/// Reference to the submission queue (hardware memory).
sq: &'sq mut NvmeSubmissionQueue,
/// Number of commands written since last doorbell.
pending_count: u32,
/// Maximum batch size before auto-flush (tunable, default: 32).
/// Prevents unbounded batching that could increase per-command latency.
max_batch: u32,
}
impl<'sq> NvmeSubmitBatch<'sq> {
/// Write a command to the SQ without ringing the doorbell.
/// If `pending_count` reaches `max_batch`, auto-flushes.
pub fn submit(&mut self, cmd: &NvmeCommand) {
self.sq.write_entry(cmd);
self.pending_count += 1;
if self.pending_count >= self.max_batch {
self.flush();
}
}
/// Ring the doorbell once for all pending commands.
/// Cost: one MMIO write (~100-200 cycles) regardless of batch size.
pub fn flush(&mut self) {
if self.pending_count > 0 {
// SAFETY: doorbell is an MMIO register in the driver's private
// domain. Writes the new SQ tail index.
unsafe { self.sq.ring_doorbell() };
self.pending_count = 0;
}
}
}
impl Drop for NvmeSubmitBatch<'_> {
fn drop(&mut self) {
// Safety net: ensure all commands are submitted even if the caller
// forgets to call flush(). This is a correctness guarantee, not a
// performance path — callers should flush() explicitly.
self.flush();
}
}
Batch size selection: The default max_batch of 32 balances throughput and
latency. With io_uring SQPOLL draining at ~32-64 SQEs per poll cycle, this
typically results in 1-2 doorbell writes per poll cycle instead of 32-64. The
value is tunable per-device to accommodate different workload patterns.
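A toy stand-in for NvmeSubmitBatch makes the flush-on-threshold and final-flush behavior concrete by counting doorbell MMIO writes (purely a model; the real type wraps hardware queue memory):

```rust
/// Counts doorbell writes for a sequence of submissions (model only).
struct BatchModel {
    pending: u32,
    max_batch: u32,
    doorbell_writes: u32,
}

impl BatchModel {
    fn new(max_batch: u32) -> Self {
        Self { pending: 0, max_batch, doorbell_writes: 0 }
    }
    /// Mirror of NvmeSubmitBatch::submit: auto-flush at max_batch.
    fn submit(&mut self) {
        self.pending += 1;
        if self.pending >= self.max_batch {
            self.flush();
        }
    }
    /// Mirror of NvmeSubmitBatch::flush: one doorbell write per flush.
    fn flush(&mut self) {
        if self.pending > 0 {
            self.doorbell_writes += 1;
            self.pending = 0;
        }
    }
}
```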
Cost savings:
| Scenario | Without coalescing | With coalescing | Savings |
|---|---|---|---|
| io_uring SQPOLL, 32 SQEs/batch | 32 × ~150 cycles = ~4800 cycles | 1 × ~150 cycles = ~150 cycles | ~4650 cycles (~97%) |
| io_uring SQPOLL, 1 SQE (fsync) | 1 × ~150 cycles | 1 × ~150 cycles | 0 (no batching opportunity) |
| Direct submit (non-SQPOLL) | 1 × ~150 cycles | 1 × ~150 cycles | 0 (single command) |
Per-I/O amortized doorbell cost with batch-32: ~150/32 = ~5 cycles/command, down from ~150 cycles/command. On a 10μs NVMe read (~25,000 cycles), this reduces doorbell overhead from ~0.6% to ~0.02%.
Applicability beyond NVMe: The same coalescing pattern applies to any device
with doorbell-style notification: virtio (virtqueue kick), network TX (NIC
doorbell/tail pointer write), and accelerator command queues. The KABI dispatch
trampoline detects batch opportunities for any device type and uses the same
flush-on-last-command pattern.
10.6.2 TCP Receive Path
Step 1: NIC DMAs packet to pre-posted receive buffer
[IOMMU-validated, zero-copy]
Step 2: NIC raises interrupt -> domain switch to NIC driver (~23 cycles on x86 MPK)
Step 3: NIC driver processes descriptor, identifies packet
Domain switch back to UmkaOS Core (~23 cycles on x86 MPK)
Step 4: UmkaOS Core dispatches to umka-net -> domain switch to umka-net (~23 cycles on x86 MPK)
Step 5: umka-net processes TCP headers, copies payload to socket buffer
(This is the one "copy" -- same as Linux. Technically a move
of ownership, not a memcpy, when using page-flipping.)
Step 6: Domain switch back to UmkaOS Core (~23 cycles on x86 MPK)
UmkaOS Core signals epoll/io_uring waiters
Step 7: User reads from socket via read()/recvmsg()/io_uring
Data delivered from socket buffer (zero-copy with MSG_ZEROCOPY)
Total domain switches: 4 (2 domain entries x 2 switches each: enter NIC driver + exit, enter umka-net + exit)
Total domain switch overhead: ~92 cycles on x86 MPK (~20ns) on a ~5 us path = ~0.4% (see Section 10.2 for other architectures)
10.7 IPC Architecture and Message Passing
Section 10.6 describes the data plane -- how bytes flow from user space through Tier 1 drivers to devices and back with zero copies. This section describes the control plane that Section 10.6's data plane relies on: the IPC primitives that carry commands, completions, capability transfers, and event notifications between isolation domains.
10.7.1 IPC Primitives
UmkaOS's IPC model has three distinct layers, each serving a different boundary:
1. Intra-kernel IPC (between isolation domains within Ring 0): domain ring buffers. Shared memory regions with per-domain access controlled by the isolation domain register (WRPKRU on x86, POR_EL0 on AArch64, DACR on ARMv7, etc.). Zero-copy, zero-syscall. This is the transport for all umka-core to Tier 1 driver communication — the command/completion flow shown in Section 10.6's NVMe and TCP examples. The "domain switch" at each step in those diagrams crosses a domain ring buffer boundary.
2. Kernel-user IPC (between kernel and user space): io_uring submission/completion rings. Standard Linux ABI (Section 18.1.5). Applications submit SQEs to the io_uring submission ring and receive CQEs from the completion ring. This is the only I/O interface that user space sees. UmkaOS's io_uring implementation is fully compatible with Linux 6.x semantics -- unmodified applications work without changes.
3. Inter-process IPC (between user processes): POSIX IPC.
Pipes, Unix domain sockets, POSIX message queues, and POSIX shared memory -- implemented
via the syscall interface (Section 18.1). These are not performance-critical kernel
paths; they exist for application compatibility. System V IPC (shmget, msgget, semget)
is supported but deprecated in favor of POSIX equivalents.
4. Hardware peer IPC (between the host kernel and a device running UmkaOS firmware): domain ring buffers over PCIe P2P.
A device that participates as a first-class cluster member (Section 5.1.2.2) communicates
with the host kernel via the same domain ring buffer protocol used for intra-kernel IPC
(Layer 1), transported over PCIe peer-to-peer MMIO and MSI-X interrupts instead of
in-process memory. From the host kernel's perspective, the device firmware endpoint is
just another ring buffer pair — the same DomainRingBuffer structure, the same
ClusterMessageHeader wire format, the same message-passing discipline. The transport
medium changes (PCIe instead of cache-coherent RAM); the abstraction does not.
This is not a compatibility shim. It is the intended model for first-class hardware
participation: a SmartNIC, DPU, computational storage device, or RISC-V accelerator
running UmkaOS presents an IPC endpoint identical in structure to an in-kernel Tier 1
driver, while owning its own scheduler, memory manager, and capability space.
See Section 5.1.2.2 for the wire protocol, implementation paths (A/B/C), and near-term
hardware targets.
The terms are not interchangeable. When this document says "io_uring", it means the userspace-facing async I/O interface. When it says "domain ring buffer", it means the internal kernel transport between isolation domains. An io_uring SQE from userspace triggers an isolation domain switch to a Tier 1 driver via a domain ring buffer — the two mechanisms are connected but architecturally distinct.
User space Kernel (Ring 0)
+-----------+ +------------------------------------------+
| App | | umka-core Tier 1 driver |
| | io_uring SQE | |
| SQ ring -|-------------------->|-> dispatch -----> domain cmd ring --------->|
| | | (WRPKRU) |
| | io_uring CQE | |
| CQ ring <|--------------------<|<- collect <----- domain cpl ring <---------|
| | | (WRPKRU) |
+-----------+ +------------------------------------------+
Layer 2 Layer 1 (internal)
(Linux ABI) (domain ring buffers)
10.7.2 Domain Ring Buffer Design
Each Tier 1 driver has a pair of ring buffers shared with umka-core: a command ring (umka-core produces, driver consumes) and a completion ring (driver produces, umka-core consumes). Both use the same underlying structure:
Weak-isolation fast path (isolation=performance or no fast isolation mechanism): When drivers are promoted to Tier 0 (no CPU-side isolation), domain ring buffers remain the IPC mechanism — the data structure and lock-free protocol are unchanged — but the domain register switches are elided. On architectures with hardware domains (MPK, POE, DACR), each ring buffer access requires toggling the domain register to grant access to the shared region (~23-80 cycles per switch, 4 switches per I/O round-trip = ~92-320 cycles). Without hardware domains, the ring buffer memory is mapped with normal kernel permissions and no domain switch is needed: the producer writes directly, the consumer reads directly, and the only synchronization is the existing atomic head/published/tail protocol. This eliminates the dominant per-I/O isolation overhead on RISC-V (~800-2000 cycles saved per I/O) and on any platform running isolation=performance. The ring buffer structure itself is unchanged — only the access-control wrapper is bypassed.
/// A lock-free single-producer single-consumer ring buffer that lives in
/// a shared memory region accessible to exactly two isolation domains.
///
/// The header occupies two cache lines (one producer-owned, one
/// consumer-owned). Ring data follows immediately after the header,
/// aligned to `entry_size`.
#[repr(C, align(64))]
pub struct DomainRingBuffer {
/// Write claim position. Producers CAS this to claim slots (MPSC mode).
/// In SPSC mode, only the single producer increments this.
///
/// `AtomicU64`: u32 would wrap in ~29 seconds at 148 Mpps (100 Gbps with
/// 64-byte packets); u64 wraps after ~4 billion years at the same rate.
/// u64 counters eliminate the need for modular wrap-around logic in the hot path.
pub head: AtomicU64,
/// Published position. In MPSC mode, a producer increments this (in order)
/// AFTER writing data to the claimed slot. The consumer reads `published`
/// (not `head`) to determine how many entries are ready. In SPSC mode,
/// `published` always equals `head` (the single producer updates both).
/// In broadcast mode, this field is NOT the source of truth —
/// `last_enqueued_seq` (u64) is the authoritative write position. The
/// `published` field is derived (`write_seq / 2`) for diagnostic
/// compatibility only. Implementations MUST NOT increment `published`
/// independently in broadcast mode.
pub published: AtomicU64,
/// Number of entries. Must be a power of two.
pub size: u32,
/// Bytes per entry. Fixed at ring creation time.
pub entry_size: u32,
/// Number of entries dropped due to ring-full condition.
/// Monotonically increasing. Exposed via umkafs diagnostics (Section 19.4).
pub dropped_count: AtomicU64,
/// Sequence number of the last successfully enqueued entry.
/// Consumers use this to detect gaps: if the consumer's last-seen
/// sequence is less than `last_enqueued_seq - ring_size`, entries
/// were lost.
/// In broadcast mode, this field serves as `write_seq` for torn-read
/// prevention (incremented by 2 per entry; odd = write-in-progress,
/// even = stable). See "Broadcast channels" below.
pub last_enqueued_seq: AtomicU64,
/// Ring lifecycle state. Written by crash recovery or graceful shutdown;
/// read by producers in spin loops to detect partner death.
/// 0 = Active (normal operation)
/// 1 = Disconnected (producer died or ring being torn down)
/// Producers check this in every spin iteration and bail with
/// `Err(Disconnected)` if set. The crash recovery path (Section 10.8)
/// sets this AFTER publishing poison markers for any in-flight
/// slots (see "Producer death recovery" below).
pub state: AtomicU8,
/// Padding to fill the producer cache line to exactly 64 bytes.
/// Layout: head(8) + published(8) + size(4) + entry_size(4)
/// + dropped_count(8) + last_enqueued_seq(8) + state(1)
/// + _pad(23) = 64.
_pad_producer: [u8; 23],
/// Read position. Only the consumer increments this.
/// On a separate cache line from head/published to avoid false sharing.
///
/// `AtomicU64`: same rationale as `head` — no wrap-around at any realistic rate.
pub tail: AtomicU64,
/// Padding to fill the consumer cache line to exactly 64 bytes.
_pad_consumer: [u8; 56],
// Ring data follows: `size * entry_size` bytes.
}
/// Errors returned by ring buffer produce operations.
pub enum RingError {
/// Ring is full — no free slots available.
Full,
/// Ring partner has died (crash recovery set `state = Disconnected`).
/// Caller must not retry; propagate the error.
Disconnected,
/// System severely overloaded — entry was discarded (poison marker written).
/// The entry was lost but the ring remains operational.
Overloaded,
}
Note on false sharing: size and entry_size are read-only after initialization and
are read by both producer and consumer. They are placed on the producer's cache line for
layout simplicity, but implementations SHOULD duplicate these values on the consumer's
cache line (as consumer_size and consumer_entry_size) to avoid false sharing. The
consumer reads only from its own cache line.
Lock-free SPSC protocol. The producer writes an entry at data[head % size], then
increments head and published together (in SPSC mode they are always equal). The
consumer reads the entry at data[tail % size] when published > tail, then increments
tail. If the first byte of an entry is 0xFF (poison marker), the consumer skips the
entry and increments tail without processing — this occurs only when a producer hit
the Err(Overloaded) path and had to force-publish a discarded slot.
No locks, no CAS, no contention. The head/published fields are on one cache
line (producer-owned); tail is on a separate cache line (consumer-owned). This
eliminates false sharing on hot paths.
Memory ordering. The producer uses Release ordering on the published store. The
consumer uses Acquire ordering on the published load. This pair ensures that the
entry data written by the producer is visible to the consumer before the consumer sees
the updated published counter.
On x86-64 this compiles to plain MOV instructions (TSO provides the required ordering
for free). On AArch64, RISC-V, and PowerPC, the compiler emits the appropriate barriers
(stlr/ldar on ARM, fence-qualified atomics on RISC-V, lwsync/isync on PPC).
| Architecture | Producer (Release store) | Consumer (Acquire load) | Notes |
|---|---|---|---|
| x86-64 | MOV (TSO) | MOV (TSO) | No explicit barriers needed |
| AArch64 | STLR | LDAR | ARM's acquire/release instructions |
| RISC-V 64 | amoswap.w.rl or fence rw,w + sw | lw + fence r,rw | RVWMO requires explicit fencing |
| PPC32 | lwsync + stw | lwz + isync | Weak ordering; lwsync = lightweight sync |
| PPC64LE | lwsync + std | ld + isync | Same model as PPC32; lwsync preferred over sync |
Backpressure. When the ring is full (head - tail == size), the producer cannot write.
For SPSC rings (command and completion channels), umka-core handles this in two stages:
(1) spin for up to 64 iterations checking whether the consumer has advanced tail — this
covers the common case where the driver is actively draining; (2) if the ring is still full
after spinning, yield to the scheduler via sched_yield_current() and retry on the next
scheduling quantum. Both stages check state on each iteration — if the ring is
Disconnected (partner driver died), the producer returns Err(Disconnected) immediately
rather than waiting for a dead consumer to drain. This avoids wasting CPU on a stalled
driver while keeping the fast path lock-free. For MPSC rings (event channels), backpressure behavior depends on the calling
context — see the MPSC producer API contract in Section 10.7.3 for the distinction between
blocking (mpsc_produce_blocking(), thread context only) and non-blocking
(mpsc_try_produce(), safe in any context) variants.
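The two-stage SPSC backpressure policy can be sketched with the ring and scheduler hooks abstracted behind closures (names illustrative; the real producer checks the ring's `state` field and calls sched_yield_current()):

```rust
/// Error surfaced when the ring partner has died.
enum ProduceError { Disconnected }

/// Two-stage backpressure: bounded spin, then yield and retry.
fn produce_with_backpressure(
    mut ring_full: impl FnMut() -> bool,
    mut disconnected: impl FnMut() -> bool,
    mut yield_to_scheduler: impl FnMut(),
) -> Result<(), ProduceError> {
    loop {
        // Stage 1: spin up to 64 iterations, checking `state` each time.
        for _ in 0..64 {
            if disconnected() {
                return Err(ProduceError::Disconnected);
            }
            if !ring_full() {
                return Ok(()); // slot free: write the entry here
            }
            core::hint::spin_loop();
        }
        // Stage 2: consumer is not draining; give up the CPU and retry
        // on the next scheduling quantum.
        yield_to_scheduler();
    }
}
```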
10.7.3 Channel Types and Capability Passing
The ring buffer primitive from Section 10.7.2 is instantiated in four channel configurations:
Command channels (SPSC): umka-core -> driver. One per driver instance. Carries I/O requests (read, write, discard), configuration commands (set queue depth, enable feature), and health queries (heartbeat, statistics request). Umka-core is the sole producer; the driver is the sole consumer.
Completion channels (SPSC): driver -> umka-core. One per driver instance. Carries I/O completions (success, error, partial), interrupt notifications (forwarded from the hardware interrupt handler), and error reports (device errors, internal driver faults). The driver is the sole producer; umka-core is the sole consumer.
Event channels (MPSC): multiple drivers -> umka-core event loop. Used for asynchronous
events that do not belong to a specific I/O flow: device hotplug notifications, link state
changes (NIC up/down), thermal throttle alerts, error notifications requiring global
coordination. Multiple drivers may need to signal the same event loop, so the MPSC variant
uses a compare-and-swap on head to coordinate multiple producers. The two-phase
commit protocol appears in mpsc_try_produce() below.
MPSC scaling limits: For event channels with >10 concurrent producers (unusual but possible in systems with many independent drivers signaling a single event loop), CAS contention on the ring head can degrade performance. In this regime, hierarchical fanout is recommended: drivers signal per-device intermediate rings, and an aggregator thread (or softirq batch) forwards events to the central ring. This reduces contention from O(producers) to O(1) at the cost of one additional indirection. The default single-ring design is optimized for the common case of 2-5 active producers per channel.
Per-CPU deferred publish buffer — When Phase 2 publication would require spinning for too long (>64 iterations, meaning an earlier producer is slow), the producer defers its publication by storing the ring pointer and slot into a small per-CPU buffer, then re-enables interrupts. This ensures interrupt-disabled windows remain bounded to ~1-2μs.
/// Per-CPU buffer for deferred MPSC ring publications.
///
/// When Phase 2 cannot complete within 64 spin iterations (because an earlier
/// producer has not yet written its data), the producer stores its pending
/// publication here and re-enables interrupts immediately. The drain function
/// is called at the start of every subsequent `send()` and at idle entry,
/// so deferred publications are completed within bounded time.
///
/// Capacity 16: supports up to 16 simultaneously stalled producers across
/// different rings. Under normal load, 0-2 entries are pending; 16 is
/// reached only under extreme contention or scheduling stalls.
pub struct DeferredPublishBuf {
/// Ring of (published_counter_ptr, slot_index) pairs awaiting Phase 2.
/// `published_ptr` is a pointer into the ring's AtomicU64 `published` field.
/// `slot` is the index this producer claimed in Phase 1.
pub entries: [Option<DeferredEntry>; 16],
/// Head index (next slot to fill).
pub head: u8,
/// Tail index (next slot to drain).
pub tail: u8,
}
pub struct DeferredEntry {
/// Pointer to the ring's `published` counter (the one this entry must advance).
pub published_ptr: *const AtomicU64,
/// Slot index claimed by Phase 1 CAS.
pub slot: u64,
}
DeferredPublishBuf is stored in the per-CPU data structure alongside the other
CpuLocal fields. deferred_publish_drain() iterates tail..head and, for each
entry, attempts Phase 2 publication: if published == slot, it advances
published to slot + 1 (success -- remove the entry from the buffer); otherwise
it leaves the entry in place for the next drain pass.
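The drain logic can be sketched as follows. This is a simplified model: a Vec stands in for the fixed 16-slot per-CPU ring, and the publish semantics match mpsc_try_produce() (publishing slot S advances the ring's `published` counter from S to S + 1).

```rust
use std::sync::atomic::{AtomicU64, Ordering};

struct Deferred<'a> {
    published: &'a AtomicU64, // the target ring's `published` counter
    slot: u64,                // slot claimed by this producer in Phase 1
}

/// Retry Phase 2 for every pending entry. An entry publishes only when all
/// earlier slots on its ring are already published (published == slot);
/// entries that still cannot publish stay queued for the next drain pass.
/// Returns the number of entries successfully published.
fn deferred_publish_drain(pending: &mut Vec<Deferred>) -> usize {
    let before = pending.len();
    pending.retain(|d| {
        d.published
            .compare_exchange(d.slot, d.slot.wrapping_add(1),
                              Ordering::Release, Ordering::Relaxed)
            .is_err() // keep only the entries whose publish still failed
    });
    before - pending.len()
}
```

Because each entry carries its own ring pointer, one drain pass correctly services deferrals that target different rings.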
Overflow behavior: When DeferredPublishBuf reaches capacity (16 entries), the
producer performs an eager flush: all 16 pending entries are written to the domain
ring buffer before adding the new entry. If the ring buffer is full (consumer is
behind), the flush blocks until sufficient space is available — this provides natural
backpressure. A stalled Tier 1 consumer will stall its producer, preventing unbounded
deferred entry accumulation. The 16-entry buffer is a coalescing optimization, not a
queue; it is never intended to hold more than a few entries in steady state.
impl DomainRingBuffer {
/// MPSC non-blocking produce: multiple producers coordinate via CAS on head.
/// Returns Err(RingError::Full) immediately if the ring is full, or
/// Err(RingError::Disconnected) if the ring partner has died.
/// Safe to call from any context (thread, IRQ, softirq).
/// See "MPSC producer API contract" below for the blocking variant.
///
/// Two-phase commit protocol:
/// Phase 1 (claim): CAS on `head` to reserve a slot. After CAS success,
/// the slot is exclusively ours but NOT yet visible to the consumer.
/// Phase 2 (publish): After writing data, wait until `published` catches
/// up to our slot (ensuring in-order publication), then advance `published`.
///
/// The consumer reads `published` (not `head`) to determine ready entries.
/// This eliminates the data race where a consumer sees an incremented `head`
/// but reads a slot whose data has not yet been written.
pub fn mpsc_try_produce(&self, entry: &[u8]) -> Result<(), RingError> {
// --- BEGIN interrupt-disabled section ---
// Disable interrupts BEFORE the Phase 1 CAS to prevent a deadlock:
// if an interrupt fires between a successful CAS (slot claimed) and
// Phase 2 (published advanced), an interrupt handler calling
// mpsc_try_produce on the same ring would spin forever in Phase 2
// waiting for the interrupted thread's slot to be published. Moving
// local_irq_save() here eliminates that race window entirely.
// The CAS loop is bounded (succeeds or returns RingError::Full), so the
// additional interrupt-disabled time is minimal.
let irq_state = arch::current::interrupts::local_irq_save();
// Phase 1: Claim a slot by advancing head (interrupts already disabled).
let my_slot;
loop {
let current_head = self.head.load(Ordering::Relaxed);
let current_tail = self.tail.load(Ordering::Acquire);
// Ring disconnected?
if self.state.load(Ordering::Acquire) != 0 {
arch::current::interrupts::local_irq_restore(irq_state);
return Err(RingError::Disconnected);
}
// Ring full?
if current_head.wrapping_sub(current_tail) >= self.size as u64 {
arch::current::interrupts::local_irq_restore(irq_state);
return Err(RingError::Full);
}
// Strong CAS required: on AArch64 LL/SC architectures, compare_exchange_weak
// permits spurious failures. In an interrupt-disabled window, spurious failures
// cause unbounded spinning — use compare_exchange (strong) to prevent this.
// Attempt to claim the slot.
if self
.head
.compare_exchange(
current_head,
current_head.wrapping_add(1),
Ordering::AcqRel,
Ordering::Relaxed,
)
.is_ok()
{
my_slot = current_head;
break;
}
core::hint::spin_loop();
}
// Write entry data to the claimed slot.
let offset = (my_slot % self.size as u64) as usize * self.entry_size as usize;
// SAFETY: offset is within bounds (power-of-two size, fixed entry_size).
// The slot is exclusively ours because we won the CAS race.
unsafe {
core::ptr::copy_nonoverlapping(
entry.as_ptr(),
self.data_ptr().add(offset),
self.entry_size as usize,
);
}
// Phase 2: Publish. Wait until all prior slots are published, then
// advance `published` to make our slot visible to the consumer.
// This spin is brief: it only waits for producers that claimed earlier
// slots to finish their writes. Under normal operation, this completes
// in 1-2 iterations.
//
// Drain deferred publications from previous calls. Before attempting
// our own Phase 2, drain ALL entries from the per-CPU deferred publish
// ring buffer. This ensures that deferrals from prior send() calls are
// re-attempted (and completed) before new entries are published, preventing
// silent loss if multiple producers defer in succession.
//
// The drain takes no arguments — each deferred entry stores a pointer to
// the ring's `published` counter alongside the slot index, so the drain
// correctly targets the ring that each slot belongs to (a producer may
// have deferred on ring A and now be calling send() on ring B).
arch::current::cpu::deferred_publish_drain();
// **IRQ-disabled window**: Interrupts are disabled only during Phase 1
// CAS + Phase 2 publication attempt (bounded at 64 iterations). If Phase 2
// exceeds 64 spins, the entry is deferred and interrupts are restored
// immediately. The 256-iteration fallback spin (if the defer buffer is full)
// runs with interrupts **re-enabled**. Worst-case IRQ-disabled duration:
// ~64 CAS operations ≈ 1-2μs.
//
// **Phase 2 uses compare_exchange (strong), not compare_exchange_weak.**
// On AArch64 LL/SC architectures, compare_exchange_weak can fail spuriously
// (no actual contention — just LL/SC interference from an unrelated store).
// In Phase 2, spurious failures increment spin_count, potentially exhausting
// the 64-iteration budget and triggering unnecessary deferred-publish overhead.
// Strong CAS ensures the spin count only advances on genuine contention (another
// producer with an earlier slot has not yet published), keeping the common-case
// IRQ-disabled window at the expected 1-3 iterations.
//
// Bounded publish wait: To prevent unbounded interrupt-disabled spinning,
// Phase 2 uses a bounded spin of 64 iterations. If `self.published` has
// not advanced to `my_slot` within 64 iterations, the producer stores the
// ring's `published` pointer and its slot index as a pair into a per-CPU
// deferred publish ring buffer, then re-enables interrupts. The drain path
// (at the start of the next `send()` call and on the consumer side)
// re-attempts publication on behalf of the stalled producer, using the
// stored ring pointer to target the correct ring. The per-CPU deferred
// buffer is a ring (`[Option<(*const AtomicU64, u64)>; 16]` with
// head/tail indices) rather than a single `Option<u64>`, so multiple
// consecutive deferrals (potentially targeting different rings) can queue
// without silently losing earlier deferred values. The buffer holds 16
// entries (increased from an earlier 4-entry design to ensure bounded-time
// behavior under heavy contention). If the deferred buffer itself is full
// (16 outstanding deferrals — an extreme edge case indicating severe system
// overload), the producer re-enables interrupts before falling back to a
// bounded spin (up to 256 iterations with `core::hint::spin_loop()`). If the
// bounded spin also fails, the producer returns `Err(Overloaded)` to the
// caller, which applies backpressure (increment `dropped_count` for IRQ
// producers, or yield and retry for thread-context producers). This ensures
// the interrupt-disabled window is always bounded. The common-case bound is:
// Phase 1 CAS (~5ns, usually 1 attempt) + data write +
// drain (up to 16 entries * CAS each = ~80ns) + Phase 2 spin
// (up to 64 * ~5ns = ~320ns) = ~410ns in the common case.
let mut spin_count = 0u32;
loop {
if self
.published
.compare_exchange(
my_slot,
my_slot.wrapping_add(1),
Ordering::Release,
Ordering::Relaxed,
)
.is_ok()
{
break;
}
spin_count += 1;
if spin_count >= 64 {
// Exceeded bounded spin — defer completion to the consumer drain
// path and re-enable interrupts to avoid unbounded IRQ latency.
// The deferred buffer holds up to 16 entries; if it is full,
// re-enable IRQs and fall through to bounded spin (system overloaded).
// Fence ensures entry data written at the slot is visible to
// all CPUs before the slot can be published by a deferred drain
// on any CPU. Without this, on weakly-ordered architectures
// (AArch64, RISC-V, PPC), a different CPU draining and publishing
// via CAS(Release) would only order its own stores, not the
// original writer's stores.
core::sync::atomic::fence(Ordering::Release);
if arch::current::cpu::deferred_publish_enqueue(&self.published, my_slot) {
arch::current::interrupts::local_irq_restore(irq_state);
return Ok(());
}
// Deferred buffer full — re-enable IRQs to preserve RT
// guarantees, then bounded spin outside the IRQ-disabled window.
arch::current::interrupts::local_irq_restore(irq_state);
let mut fallback_spin = 0u32;
loop {
if self.state.load(Ordering::Acquire) != 0 {
return Err(RingError::Disconnected);
}
if self.published.compare_exchange(
my_slot, my_slot.wrapping_add(1),
Ordering::Release, Ordering::Relaxed,
).is_ok() {
return Ok(());
}
fallback_spin += 1;
if fallback_spin >= 256 {
// System severely overloaded. We must still advance `published`
// past our slot to prevent permanently wedging the ring.
// Write a poison marker (entry_type = 0xFF) into the slot so
// the consumer knows to skip it, then spin until we can
// advance `published`. This spin waits for earlier producers
// to publish. If an earlier producer has died (Tier 1/2 crash
// between Phase 1 and Phase 2), the crash recovery path will
// have set `state = Disconnected` and published poison markers
// for the dead producer's slots, unblocking this spin. We
// check `state` on every iteration to detect this case.
let offset = (my_slot % self.size as u64) as usize * self.entry_size as usize;
// SAFETY: slot is ours (won the Phase 1 CAS); offset in bounds.
unsafe { *self.data_ptr().add(offset) = 0xFF; } // poison marker
let mut publish_spin = 0u32;
while self.published.compare_exchange(
my_slot, my_slot.wrapping_add(1),
Ordering::Release, Ordering::Relaxed,
).is_err() {
if self.state.load(Ordering::Acquire) != 0 {
return Err(RingError::Disconnected);
}
publish_spin += 1;
if publish_spin >= 4096 {
// Earlier producer is alive but severely delayed.
// Yield the CPU to allow it to make progress.
// This prevents livelock on the same core.
arch::current::cpu::yield_cpu();
publish_spin = 0;
}
core::hint::spin_loop();
}
return Err(RingError::Overloaded);
}
core::hint::spin_loop();
}
}
core::hint::spin_loop();
}
// --- END interrupt-disabled section ---
arch::current::interrupts::local_irq_restore(irq_state);
Ok(())
}
}
To prevent data loss when no future send() occurs, the per-CPU idle entry hook
(cpu_idle_enter(), Section 7.2) drains the deferred publish buffer for all MPSC rings
registered on that CPU. Additionally, when a thread that performed a deferred publish is
migrated to a different CPU, the migration path drains the source CPU's deferred buffer.
These hooks ensure deferred entries are published within a bounded window (at most one
scheduler tick, ~4ms).
MPSC Phase 2 preemption hazard and mitigation. The Phase 2 publish spin in
mpsc_try_produce() can stall if a producer is preempted (by an interrupt or scheduler)
between Phase 1 (CAS on head) and Phase 2 (advancing published). While preempted,
the published counter is stuck at the preempted producer's slot, blocking all
subsequent producers from making their entries visible to the consumer -- even though
their data is already written. This is not a deadlock (the preempted producer will
eventually resume and complete Phase 2), but it can cause unbounded latency spikes on
the consumer side.
Mitigation: UmkaOS addresses this in three ways:
1. Interrupts disabled from before Phase 1 through Phase 2. The MPSC produce
path disables interrupts (not just preemption) BEFORE the Phase 1 CAS, keeping
them disabled through the Phase 2 published counter advancement. This prevents
the following deadlock scenario: on a uniprocessor (or, in general, on any
single CPU), thread T1 claims slot N via CAS, then an IRQ fires and the IRQ
handler claims slot N+1 via CAS. The IRQ handler's Phase 2 spin waits for
published to reach N, but T1 cannot advance published because it is
interrupted -- deadlock. Disabling interrupts before Phase 1 eliminates this
window entirely (there is no gap between CAS success and interrupt disabling).
The interrupt-disabled region covers: the Phase 1 CAS (bounded -- succeeds or
returns Full/Disconnected, typically 1 attempt = ~5ns), the data write, the
deferred drain (up to 16 entries x ~5ns CAS = ~80ns), and the Phase 2 publish
CAS (up to 64 iterations = ~320ns), totaling ~410ns in the common case. On
multiprocessor systems, disabling preemption alone would suffice (another CPU
could run the interrupted producer), but disabling interrupts is correct on
all configurations and the cost is negligible.
2. Consumer-side stuck detection and recovery (defense in depth). The consumer
(umka-core event loop) maintains a watchdog: if head > published for more than
1000 consecutive poll iterations (~10μs), the consumer treats the gap as a
stalled producer. If the gap persists beyond 10ms (configurable), the consumer
initiates forced slot recovery: for each unpublished slot from published to
head, write a poison marker (0xFF) and advance published. This unblocks any
live producers spinning in Phase 2 while discarding the dead producer's
incomplete entries. The consumer logs a diagnostic event with the number of
force-published slots and the ring identity.
   This consumer-side recovery is a safety net for the case where the crash
recovery path (Section 10.8, step 5a) has not yet run -- e.g., the driver
faulted but the FMA detection latency exceeds 10ms, or the fault was a silent
hang rather than a trap. Under normal operation, the crash recovery path
handles slot recovery before the consumer watchdog fires.
3. Interrupt handlers use bounded produce. Interrupt handlers that produce to
MPSC rings use mpsc_try_produce(), which fails with Err(Full) if the ring is
full rather than spinning. This prevents interrupt handlers from spinning on a
full ring while the consumer (which runs in thread context) cannot drain it --
if the consumer needs to be scheduled to make progress, a spinning IRQ handler
creates an unbounded spin or deadlock.
MPSC producer API contract. The MPSC ring exposes two producer entry points with distinct calling context requirements:
impl DomainRingBuffer {
/// Non-blocking produce. Returns immediately if the ring is full or
/// disconnected. Safe to call from ANY context (thread, IRQ, softirq).
///
/// On success: entry is enqueued and will be visible to the consumer
/// after Phase 2 publish completes.
/// On Err(Full): the ring has no free slots. The caller is responsible
/// for handling the overflow (see overflow accounting below).
/// On Err(Disconnected): the ring partner has died. Caller must not retry.
pub fn mpsc_try_produce(&self, entry: &[u8]) -> Result<(), RingError>;
/// Blocking produce. Spins (with bounded spin + yield) until a slot
/// becomes available, then enqueues the entry.
///
/// MUST NOT be called with interrupts disabled. If the ring is full,
/// this function spins waiting for the consumer to drain entries. If
/// interrupts are disabled, the consumer (which runs in thread context)
/// may never be scheduled, causing an unbounded spin.
///
/// In debug builds: panics immediately if called with interrupts
/// disabled (detected via arch::current::interrupts::are_enabled()).
/// In release builds: falls back to mpsc_try_produce() with overflow
/// accounting if interrupts are disabled (defense in depth — the debug
/// panic should catch all such call sites during development).
    pub fn mpsc_produce_blocking(&self, entry: &[u8]) -> Result<(), RingError>;
}
Calling mpsc_produce_blocking() with interrupts disabled is a BUG. Debug builds
panic to catch the error during development; release builds fall back to
mpsc_try_produce() with overflow accounting to avoid a hard hang in production. The
release fallback is a safety net, not a license to call the blocking variant from IRQ
context — all such call sites must be fixed.
Overflow accounting. When mpsc_try_produce() returns Err(Full) or
Err(Overloaded) (whether called directly from IRQ context or as the release-build
fallback), the caller increments a per-ring atomic overflow counter. Err(Disconnected)
is not counted as overflow — it indicates the ring is being torn down and the caller
should propagate the error to its own caller rather than retrying. The overflow statistics are stored directly in the
DomainRingBuffer producer cache line as dropped_count and last_enqueued_seq
(see struct definition in Section 10.7.2). Inlining these fields into the ring header
avoids an extra pointer dereference on the drop path and keeps both fields on the same
cache line as head and published, which are already hot during produce operations.
Each MPSC entry includes a monotonic sequence number in its header. The consumer detects dropped entries by checking for gaps in the sequence: if the sequence jumps from N to N+K (where K > 1), then K-1 entries were dropped due to overflow. The consumer logs a diagnostic event on gap detection, including the ring identity and gap size, so operators can identify rings that need larger depth configuration (Section 10.7.4 channel depths).
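The gap-detection rule described above is mechanical enough to show directly. A minimal sketch (names are illustrative, not the umka-core API):

```rust
/// Consumer-side drop detection: each MPSC entry header carries a monotonic
/// sequence number; a jump from N to N+K with K > 1 means K-1 entries were
/// dropped at the producer due to overflow.
struct GapDetector {
    last_seq: Option<u64>,
    dropped_total: u64,
}

impl GapDetector {
    fn new() -> Self {
        GapDetector { last_seq: None, dropped_total: 0 }
    }

    /// Feed the sequence number of the next consumed entry.
    /// Returns the number of entries dropped immediately before it;
    /// the caller logs a diagnostic event when this is nonzero.
    fn observe(&mut self, seq: u64) -> u64 {
        let dropped = match self.last_seq {
            // gap size K-1, where K = seq - prev
            Some(prev) => seq.wrapping_sub(prev).saturating_sub(1),
            None => 0, // first entry: no baseline to compare against
        };
        self.dropped_total += dropped;
        self.last_seq = Some(seq);
        dropped
    }
}
```

The running `dropped_total` is what operators compare against the channel depth table in Section 10.7.4 when deciding whether a ring needs a larger depth.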
Summary of context rules:
| Producer context | Permitted API | On ring full | Notes |
|---|---|---|---|
| Thread context (IRQs enabled) | mpsc_produce_blocking() or mpsc_try_produce() | Blocking: spin + yield until space (or Err(Disconnected) on partner death). Try: return Err(Full) or Err(Disconnected). | Blocking variant is the normal path for thread-context producers. |
| IRQ handler / softirq | mpsc_try_produce() ONLY | Return Err(Full) or Err(Disconnected), increment dropped_count, drop message. | Calling the blocking variant is a BUG (debug panic / release fallback). |
| NMI / MCE handler | NEITHER — use per-CPU buffer | N/A | See NMI/MCE safety below. |
NMI/MCE safety: NMI handlers and Machine Check Exception (MCE) handlers MUST NOT produce to MPSC rings. Mitigation 1 (disabling interrupts) does NOT protect against NMIs or MCEs — both are non-maskable architectural exceptions that fire regardless of the interrupt flag state. If an NMI or MCE handler needs to log data, it must use a dedicated per-CPU single-producer buffer (not shared with normal interrupt context) that is drained by the main kernel after the exception returns. On x86, MCE handlers additionally run on a dedicated IST (Interrupt Stack Table) stack, so they must not access per-CPU data structures that assume the normal kernel stack.
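The dedicated per-CPU NMI log buffer can be as simple as the following sketch. Everything here is illustrative (sizes, names): the essential properties are that only the NMI handler writes, only the post-exception drain reads, and the record path never waits.

```rust
const NMI_LOG_SLOTS: usize = 32; // illustrative capacity

/// Single-producer per-CPU buffer: the NMI/MCE handler on this CPU is the
/// sole writer, and the main kernel drains it after the exception returns.
/// No CAS and no shared-ring access -- safe even though NMIs ignore the
/// interrupt flag entirely.
struct NmiLogBuf {
    entries: [u64; NMI_LOG_SLOTS],
    head: usize, // advanced only by the NMI handler
    tail: usize, // advanced only by the post-exception drain
}

impl NmiLogBuf {
    fn new() -> Self {
        NmiLogBuf { entries: [0; NMI_LOG_SLOTS], head: 0, tail: 0 }
    }

    /// NMI context: record one word, dropping on overflow.
    /// An NMI handler must never spin or block.
    fn nmi_record(&mut self, word: u64) -> bool {
        if self.head.wrapping_sub(self.tail) >= NMI_LOG_SLOTS {
            return false; // full: drop rather than wait
        }
        self.entries[self.head % NMI_LOG_SLOTS] = word;
        self.head = self.head.wrapping_add(1);
        true
    }

    /// Normal kernel context, after the exception returns.
    fn drain(&mut self) -> Vec<u64> {
        let mut out = Vec::new();
        while self.tail != self.head {
            out.push(self.entries[self.tail % NMI_LOG_SLOTS]);
            self.tail = self.tail.wrapping_add(1);
        }
        out
    }
}
```

Once drained into thread context, the entries may then be forwarded through the normal MPSC event path.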
Producer death recovery. If a producer (any tier) dies between MPSC Phase 1
(CAS claim) and Phase 2 (advancing published), the published counter is stuck
at the dead producer's slot, blocking all subsequent producers. Three mechanisms
ensure recovery:
1. Crash recovery step 5a (Tier 1, Section 10.8.2) / step 4 (Tier 2, Section
10.8.3): The crash handler identifies all MPSC rings where the dead driver was
a producer. For each ring with head > published, it writes poison markers
(0xFF) into all slots from published to head, then advances published to head.
Finally, it sets state = Disconnected. Any live producer currently spinning in
Phase 2 observes the state change and returns Err(Disconnected). This is the
primary recovery mechanism and handles the vast majority of cases.
2. Consumer-side watchdog (mitigation 2 above): If head > published persists
beyond 10ms (the crash handler has not run yet -- e.g., silent hang, FMA
detection latency), the consumer force-publishes poison markers and advances
published. Safety net only.
3. Spin loop state checks: Every spin loop in mpsc_try_produce() (the
256-iteration fallback and the final unbounded spin) checks state on each
iteration. On Disconnected, the spin exits immediately with
Err(Disconnected) rather than waiting for published to advance.
These mechanisms are tier-independent for Tier 1 and Tier 2: the ring protocol handles producer death the same way regardless of whether the producer was Tier 1 (MPK fault) or Tier 2 (process death). The tier determines detection latency (Tier 1: <1ms via fault handler; Tier 2: immediate via process exit), but the ring recovery sequence is identical.
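The crash-handler side of this sequence (mechanism 1) reduces to a short routine per affected ring. A simplified model follows: one byte per slot stands in for the full entry, and the names are illustrative rather than the umka-core code.

```rust
use std::sync::atomic::{AtomicU64, AtomicU8, Ordering};

const POISON: u8 = 0xFF; // entry_type value the consumer skips

struct MpscRing {
    head: AtomicU64,
    published: AtomicU64,
    state: AtomicU64,     // 0 = connected, nonzero = Disconnected
    slots: Vec<AtomicU8>, // first byte (entry_type) of each slot
}

/// Recovery for one MPSC ring whose dead driver was a producer.
fn recover_dead_producer(ring: &MpscRing) {
    let head = ring.head.load(Ordering::Acquire);
    let published = ring.published.load(Ordering::Acquire);
    // Poison every slot the dead producer claimed (Phase 1) but never
    // published (Phase 2), so the consumer knows to skip them.
    for slot in published..head {
        let idx = (slot % ring.slots.len() as u64) as usize;
        ring.slots[idx].store(POISON, Ordering::Relaxed);
    }
    // Publishing the poisoned range unblocks any live producers whose
    // Phase 2 spin is waiting for `published` to reach their slot.
    ring.published.store(head, Ordering::Release);
    // Finally tear the ring down: producers spinning anywhere in the
    // produce path observe this and return Err(Disconnected).
    ring.state.store(1, Ordering::Release);
}
```

The ordering matters: poison first, publish second, disconnect last, so a consumer never observes a published-but-unpoisoned dead slot.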
Tier 0 (in-kernel) drivers: The recovery mechanisms above do not apply to Tier 0. A Tier 0 driver runs without isolation — if it crashes between Phase 1 and Phase 2 (or anywhere), the kernel is already in a panic state. Corrupted kernel memory makes ring recovery meaningless; the system is going down. The MPSC produce path mitigates the window by disabling interrupts before Phase 1 (preventing preemption between CAS and publication), but no software mechanism can recover from a Tier 0 fault — only hardware isolation provides that.
This is the explicit trade-off of Tier 0 promotion: zero isolation overhead in
exchange for accepting that any driver bug is a kernel panic. On platforms that lack
hardware domain isolation (e.g., RISC-V without a fast isolation mechanism, or when
isolation=performance is set), all Tier 1 drivers are effectively promoted to
Tier 0. Operators choosing this configuration accept the reduced fault containment.
The ring buffer's state/poison-marker recovery remains compiled in (zero cost when
not triggered) but cannot fire because no crash recovery path exists to set
state = Disconnected — the kernel has already panicked.
Broadcast channels (SPMC): umka-core -> all drivers. Used for system-wide notifications
(suspend imminent, memory pressure, clock change). Umka-core writes once; each driver
reads independently. The broadcast channel uses a sequence-numbered ring with a single
sequencing mechanism: the last_enqueued_seq field (hereafter write_seq in broadcast
mode), a u64 in the ring header. write_seq increments by 2 for each published entry
(odd values indicate a write in progress; even values indicate a stable, readable entry —
see torn-read prevention below). The logical entry count is write_seq / 2. The
DomainRingBuffer's published field is not used independently in broadcast mode;
if read, it is derived as write_seq / 2 for compatibility with diagnostic code that
inspects published. Implementations must not increment published separately from
write_seq in broadcast mode — write_seq is the sole source of truth.
Each consumer tracks its own read position (a u64 sequence number stored in the
consumer's private memory, not in the shared ring header). To read, a consumer scans
from its last-seen sequence to the ring's current write_seq (even values only). The
ring's tail field is unused in broadcast mode — the producer never needs to know
individual consumer positions. Instead, the producer overwrites the oldest entry when
the ring is full (broadcast semantics: slow consumers miss entries rather than blocking
the producer). Consumers detect missed entries by checking for sequence gaps.
Torn-read prevention: Each broadcast ring entry is bracketed by a u64 sequence
stamp. Layout: [seq_start: u64 | payload: [u8; entry_size - 16] | seq_end: u64]. The
producer writes seq_start = write_seq | 1 (odd = write in progress), then the payload,
then seq_end = write_seq (even = complete), then advances write_seq by 2. The
consumer reads seq_start, copies the payload, reads seq_end. If
seq_end != (seq_start ^ 1), the read is torn — seq_start and seq_end are not a
matched pair from the same write (a concurrent write changes seq_start to a different
odd value, causing this check to fail). Additionally, if seq_start < consumer.last_seq,
the entry is stale. In either case, the consumer detects the gap, increments gap_count,
and advances to the next entry. All sequence accesses use Ordering::Acquire (reads)
and Ordering::Release (writes).
/// Per-consumer broadcast state (stored in consumer's private memory).
pub struct BroadcastConsumer {
/// Last sequence number consumed by this consumer.
pub last_seq: u64,
}
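The seq_start/seq_end bracketing above is a seqlock-style protocol, and the torn-read check is compact enough to show. A single-entry sketch for illustration (an AtomicU64 payload stands in for the `[u8; entry_size - 16]` payload):

```rust
use std::sync::atomic::{AtomicU64, Ordering};

struct BroadcastEntry {
    seq_start: AtomicU64,
    payload: AtomicU64, // stands in for [u8; entry_size - 16]
    seq_end: AtomicU64,
}

/// Producer: seq_start = write_seq | 1 (odd: write in progress), payload,
/// seq_end = write_seq (even: complete). Returns the advanced write_seq
/// (incremented by 2; logical entry count = write_seq / 2).
fn publish(e: &BroadcastEntry, write_seq: u64, value: u64) -> u64 {
    e.seq_start.store(write_seq | 1, Ordering::Release);
    e.payload.store(value, Ordering::Release);
    e.seq_end.store(write_seq, Ordering::Release);
    write_seq + 2
}

/// Consumer: Some(payload) for a stable read, None for a torn read.
/// A matched pair satisfies seq_end == seq_start ^ 1; a concurrent write
/// changes seq_start to a different odd value, failing the check.
fn try_read(e: &BroadcastEntry) -> Option<u64> {
    let start = e.seq_start.load(Ordering::Acquire);
    let value = e.payload.load(Ordering::Acquire);
    let end = e.seq_end.load(Ordering::Acquire);
    if end == (start ^ 1) { Some(value) } else { None }
}
```

A zero-initialized entry fails the check (seq_end 0 vs seq_start^1 = 1), so unwritten slots read as torn rather than as valid data.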
Capability passing. Capabilities (Section 8.1) can be transferred over any IPC channel.
The sending domain writes a CapabilityHandle (an opaque 64-bit token) into a ring buffer
entry. Umka-core intercepts the transfer at the domain boundary and validates the capability:
does the sender actually hold this capability? Is the capability transferable? Is the
receiver permitted to hold capabilities of this type? If validation passes, umka-core
translates the handle into the receiving domain's capability space -- the receiver gets a
new handle that maps to the same underlying resource but exists in its own namespace. Raw
capability data (kernel pointers, permission bitmasks) never crosses domain boundaries;
only validated, translated handles do.
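The validate-and-translate step at the domain boundary can be sketched as below. All names here (Domain, Capability, the specific checks) are illustrative; the real checks run inside umka-core against the capability tables of Section 8.1 and include receiver-policy checks omitted here.

```rust
use std::collections::HashMap;

type CapabilityHandle = u64; // opaque per-domain token
type ResourceId = u64;       // kernel-internal resource identity

struct Capability {
    resource: ResourceId,
    transferable: bool,
}

struct Domain {
    caps: HashMap<CapabilityHandle, Capability>,
    next_handle: CapabilityHandle,
}

/// Validate a transfer and mint a NEW handle in the receiver's namespace.
/// Raw capability data never crosses the boundary -- only the translated
/// handle does.
fn transfer_capability(
    sender: &Domain,
    receiver: &mut Domain,
    handle: CapabilityHandle,
) -> Result<CapabilityHandle, &'static str> {
    // Does the sender actually hold this capability?
    let cap = sender.caps.get(&handle).ok_or("sender does not hold capability")?;
    // Is the capability transferable?
    if !cap.transferable {
        return Err("capability is not transferable");
    }
    // (A real kernel also checks: may the receiver hold this type at all?)
    let new_handle = receiver.next_handle;
    receiver.next_handle += 1;
    receiver.caps.insert(new_handle, Capability {
        resource: cap.resource,
        transferable: cap.transferable,
    });
    Ok(new_handle)
}
```

Note that the returned handle is meaningful only inside the receiver's namespace; the sender's handle for the same resource remains distinct.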
10.7.4 Flow Control and Ordering
Ordering within a channel. Ring buffer entries are processed in strict FIFO order within a single channel. If umka-core submits commands A, B, C to a driver's command ring, the driver sees them in A, B, C order. Completions flow back in the order the driver produces them (which may differ from submission order -- a driver may complete a fast read before a slow write).
No ordering across channels. There is no ordering guarantee between different channels.
Driver A's completion may arrive at umka-core before driver B's completion, regardless of
which command was submitted first. Applications that need cross-device ordering must
enforce it at the io_uring level (using IOSQE_IO_LINK or IOSQE_IO_DRAIN), which
umka-core translates into sequencing constraints on the domain command rings.
Channel depths. Each channel has a configurable entry count, set at ring creation time via the device registry (Section 10.5):
| Channel type | Default depth | Typical entry size | Notes |
|---|---|---|---|
| Command (SPSC) | 256 | 64 bytes | Matches NVMe SQ depth default |
| Completion (SPSC) | 1024 | 16 bytes | 4x command depth for batched completions |
| Event (MPSC) | 512 | 32 bytes | Shared across all drivers on this event loop |
| Broadcast (SPMC) | 64 | 32 bytes | Low-frequency system events |
The minimum useful broadcast entry size is 24 bytes (8 bytes payload with 16 bytes of
sequence stamps for torn-read prevention). The default of 32 bytes provides 16 bytes of
payload, suitable for most event notifications. Umka-core rejects broadcast ring creation
requests with entry_size < 24.
Depths are tunable per-driver via the device registry's ring_config property. Drivers
that handle high-throughput workloads (NVMe, high-speed NIC) typically increase command
depth to 1024 or 4096 to match hardware queue depths.
Priority channels. Real-time I/O (Section 7.2) uses a separate high-priority command ring per driver. The driver polls the priority ring before the normal ring on every iteration. This ensures RT I/O is not head-of-line blocked behind bulk I/O. Priority rings use the same SPSC structure but are typically shallow (32-64 entries) since RT workloads are low-volume, latency-sensitive flows.
umka-core dispatch logic (per driver, per poll iteration):
1. Check priority command ring -> process all pending entries
2. Check normal command ring -> process up to batch_limit entries
3. Check event ring (MPSC) -> process system events
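The dispatch order above can be modeled directly. In this sketch the three rings are simple queues and the names are illustrative; the point is the ordering and the bounded batch on the normal ring.

```rust
use std::collections::VecDeque;

struct DriverChannels {
    priority: VecDeque<u64>, // shallow RT command ring (32-64 entries)
    normal: VecDeque<u64>,   // bulk command ring
    events: VecDeque<u64>,   // MPSC event ring
}

/// One poll iteration for one driver; returns entries in processing order.
fn poll_once(ch: &mut DriverChannels, batch_limit: usize) -> Vec<u64> {
    let mut processed = Vec::new();
    // 1. Priority ring: process ALL pending entries (RT is never queued
    //    behind bulk I/O).
    while let Some(e) = ch.priority.pop_front() {
        processed.push(e);
    }
    // 2. Normal ring: bounded batch so one busy driver cannot starve the
    //    rest of the poll loop.
    for _ in 0..batch_limit {
        match ch.normal.pop_front() {
            Some(e) => processed.push(e),
            None => break,
        }
    }
    // 3. Event ring: system events last.
    while let Some(e) = ch.events.pop_front() {
        processed.push(e);
    }
    processed
}
```

Entries left in the normal ring simply wait for the next iteration, which is what gives RT traffic its bounded head-of-line delay.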
Comparison with Linux. Linux has no equivalent to the intra-kernel domain ring buffer.
Subsystem communication within the Linux kernel uses direct function calls with no
isolation boundary. The closest analogy is Linux's io_uring internal implementation
(the SQ/CQ ring structure), but that serves a different purpose (kernel-to-userspace
communication). UmkaOS effectively uses an io_uring-inspired ring structure inside the
kernel to connect isolated subsystems that Linux connects via unprotected function calls.
10.7.5 Terminology Reference
The following terms are used precisely throughout this document. This reference resolves ambiguity that arises from the word "ring" appearing in multiple contexts:
| Term | Meaning | Where used |
|---|---|---|
| io_uring | Linux-compatible userspace async I/O interface. SQ/CQ rings mapped into user space. | Section 18.1.5, user-facing I/O API |
| domain ring buffer | Internal kernel IPC mechanism between isolation domains. SPSC or MPSC lock-free rings in shared memory. | Section 10.7, driver architecture |
| MPSC ring | A domain ring buffer variant with CAS-based multi-producer support. Used for event aggregation. | Section 10.7.3, event channels |
| Hardware queue | Device-specific command/completion queues (e.g., NVMe SQ/CQ, virtio virtqueue). Mapped via MMIO. | Section 10.6, device I/O paths |
| SPSC | Single-Producer Single-Consumer. The default domain ring buffer mode. | Section 10.7.2 |
| SPMC | Single-Producer Multi-Consumer. Used for broadcast channels (umka-core -> all drivers). | Section 10.7.3 |
Any unqualified reference to "ring buffer" in the driver architecture sections (Sections 10.5-10.9) means a domain ring buffer. Any reference to "io_uring" means the userspace interface. Hardware queues are always qualified by device type (e.g., "NVMe submission queue", "virtio virtqueue").
10.8 Crash Recovery and State Preservation
This is UmkaOS's killer feature -- the primary reason to choose it over Linux.
Scope: This section covers Tier 1 and Tier 2 driver crash recovery where the host kernel acts as supervisor. For peer kernel crash recovery (devices running UmkaOS as a first-class multikernel peer), see Section 5.1.3, which uses a different isolation model (IOMMU hard boundary + PCIe unilateral controls rather than software domain supervision).
10.8.1 The Linux Problem
In Linux, all drivers run in the same address space with no isolation. A single bug in any driver -- null pointer dereference, buffer overflow, use-after-free -- triggers a kernel panic. Recovery requires a full system reboot: 30-60 seconds of downtime, loss of all in-flight state, and potential filesystem corruption if writes were in progress.
10.8.2 UmkaOS Tier 1 Recovery Sequence
When a Tier 1 (domain-isolated) driver faults:
1. FAULT DETECTED
- Hardware exception (page fault, GPF) within a Tier 1 isolation domain
   - OR watchdog timer expires (driver stalled for >N ms)
- OR driver returns invalid result / corrupts its ring buffer
2. ISOLATE
- UmkaOS Core revokes the faulting driver's isolation domain by setting
the access-disable bit for that domain's key in the domain register
(x86: set AD bit in PKRU; AArch64: clear overlay permissions in POR_EL0;
ARMv7: set domain to "No Access" in DACR; PPC32: invalidate segment;
PPC64LE: switch to revoked PID; RISC-V/fallback: unmap driver pages)
- Driver can no longer access any memory in its domain
- Interrupt lines for this driver are masked
3. DRAIN PENDING I/O
- All pending requests from user space are completed with -EIO
- Applications receive error codes, not crashes
- io_uring CQEs are posted with error status
4. DEVICE RESET
- Issue Function-Level Reset (FLR) via PCIe
- OR vendor-specific device reset sequence
- Device returns to known-good state
5. RELEASE KABI LOCKS
- The KABI lock registry tracks all Core kernel locks currently held on
behalf of this driver. Every lock-acquiring KABI call (e.g., mutex_lock,
rw_lock_read) pushes a (lock_ptr, lock_type) entry onto a per-driver,
per-CPU lock stack (max depth 8, statically allocated in the driver
descriptor). On normal unlock, the entry is popped.
- On crash recovery, the registry is walked in reverse order (LIFO):
each held lock is force-released (mutex: set owner to NONE and wake
waiters; rwlock: decrement reader count or clear writer; spinlock:
release). This prevents deadlock when a driver panics mid-critical-section.
- After lock release, per-CPU borrow states held by the driver are reset
to 0 (free), matching the PerCpu borrow-state tracking in Section 3.1.1.
- **Invariant**: KABI calls that acquire Core locks MUST be non-reentrant
and hold at most one Core lock at a time (enforced by the KABI vtable
wrappers). This bounds the lock stack depth and ensures reverse-order
release is always safe.
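The lock-stack bookkeeping behind step 5 can be sketched as follows. This is an illustrative model, not the actual KABI types: `LockStack`, the raw `usize` lock pointer, and the release callback are simplifications, and the real force-release actions (waking mutex waiters, adjusting rwlock counts) are reduced to a closure.

```rust
/// Illustrative per-driver, per-CPU KABI lock stack (simplified model).
#[derive(Clone, Copy, PartialEq, Debug)]
enum LockType { Mutex, RwRead, RwWrite, Spin }

/// Max depth 8, statically allocated in the driver descriptor.
const LOCK_STACK_DEPTH: usize = 8;

struct LockStack {
    entries: [Option<(usize, LockType)>; LOCK_STACK_DEPTH], // (lock_ptr, lock_type)
    top: usize,
}

impl LockStack {
    fn new() -> Self {
        Self { entries: [None; LOCK_STACK_DEPTH], top: 0 }
    }

    /// Called by every lock-acquiring KABI wrapper (e.g., mutex_lock).
    fn push(&mut self, lock_ptr: usize, ty: LockType) -> Result<(), ()> {
        if self.top == LOCK_STACK_DEPTH {
            return Err(()); // depth bound exceeded: KABI invariant violated
        }
        self.entries[self.top] = Some((lock_ptr, ty));
        self.top += 1;
        Ok(())
    }

    /// Called on normal unlock: pops the matching entry.
    fn pop(&mut self) -> Option<(usize, LockType)> {
        if self.top == 0 { return None; }
        self.top -= 1;
        self.entries[self.top].take()
    }

    /// Crash path: walk in reverse order (LIFO) and force-release each
    /// held lock via the supplied release action.
    fn force_release_all(&mut self, release: &mut dyn FnMut(usize, LockType)) {
        while let Some((ptr, ty)) = self.pop() {
            release(ptr, ty);
        }
    }
}
```

The LIFO walk mirrors normal unlock ordering, so force-release never drops an outer lock while an inner one is still held.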
5a. RECOVER RING BUFFER IN-FLIGHT SLOTS
- For each MPSC ring where the dead driver was a producer: if
`head > published` (indicating the driver may have claimed a slot
via Phase 1 CAS but died before Phase 2 publication), write poison
markers (0xFF) into all unpublished slots from `published` to `head`
and advance `published` to `head`. This unblocks any live producers
spinning in Phase 2 waiting for the dead driver's slot to be published.
- For SPSC completion rings (driver -> core): the ring is drained of all
valid entries up to `published`, then the ring is reset (`head = tail
= published = 0`) for the replacement driver instance.
- Set `state = Disconnected` (AtomicU8, value 1) on all rings owned by
the dead driver. Any producer currently spinning in a Phase 2 loop
will observe this on its next `state.load()` and return
`Err(Disconnected)`. This field is reset to `Active` (0) when the
replacement driver re-initializes the ring.
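Step 5a can be sketched as follows. This is a simplified model: indices are not wrapped, slots are plain byte arrays, and the struct name `MpscRing` and method `recover_dead_producer` are illustrative rather than the actual kernel types.

```rust
use std::sync::atomic::{AtomicU32, AtomicU8, Ordering};

const SLOT_SIZE: usize = 64;
const POISON: u8 = 0xFF;          // poison marker byte from step 5a
const STATE_ACTIVE: u8 = 0;
const STATE_DISCONNECTED: u8 = 1; // AtomicU8 value 1 per the text

/// Simplified model of an MPSC domain ring (indices unwrapped for clarity).
struct MpscRing {
    slots: Vec<[u8; SLOT_SIZE]>,
    head: AtomicU32,      // advanced by Phase 1 CAS claims
    published: AtomicU32, // advanced by Phase 2 publication
    state: AtomicU8,
}

impl MpscRing {
    /// Step 5a: poison any claimed-but-unpublished slots left by a dead
    /// producer, advance `published` to `head`, then mark the ring
    /// Disconnected so live producers spinning in Phase 2 bail out.
    fn recover_dead_producer(&mut self) {
        let head = self.head.load(Ordering::Acquire) as usize;
        let published = self.published.load(Ordering::Acquire) as usize;
        if head > published {
            for slot in &mut self.slots[published..head] {
                slot.fill(POISON); // consumers skip poisoned entries
            }
            self.published.store(head as u32, Ordering::Release);
        }
        self.state.store(STATE_DISCONNECTED, Ordering::Release);
    }
}
```

A live producer's Phase 2 spin loop checks `state.load()` each iteration and returns `Err(Disconnected)` once it observes the store.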
6. UNLOAD DRIVER
- Free all driver-private memory
- Release all driver capabilities
- Unmap driver MMIO regions
7. RELOAD DRIVER
- Load fresh copy of driver binary
- New bilateral vtable exchange
- Device re-initialization
- Re-register interrupt handlers
8. RESUME
- New driver begins accepting I/O requests
- Applications retry failed operations (standard I/O error handling)
TOTAL RECOVERY TIME: ~50ms typical (soft-reset path) to ~150ms (FLR path)
(design target; validation requires hardware prototype — actual timing depends
on driver state snapshot complexity and memory domain reset cost)
10.8.2a Reload Failure Handling
If the new driver instance fails to initialize after a crash, UmkaOS handles the failure as follows:
- Detection: reload failure is defined as the new driver instance crashing during initialization, OR initialization not completing within 500 ms (hard timeout).
- Device offline: the device is marked DeviceState::Failed; no new I/O is accepted.
- Client notification: all processes with open file descriptors to this device receive SIGHUP; any pending I/O syscalls return -EIO.
- Kernel continues: a Tier 1 reload failure does not panic the kernel — the device is simply unavailable. All other drivers and subsystems continue operating normally.
- Audit: a kernel warning is logged with the device canonical name, failure reason (crash vs timeout), and driver version.
- Manual recovery: an operator can trigger a fresh reload attempt via the umkafs control interface at /System/Kernel/drivers/<name>/reload after investigating the cause; the failure counter (Section 10.5.10.2) may also trigger automatic demotion to Tier 2 on repeated failures.
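The decision logic above can be sketched as a small state transition. The names `ReloadOutcome` and `handle_reload` are hypothetical, and the watchdog/init machinery is reduced to a precomputed outcome; only the transitions described in the bullets are modeled.

```rust
use std::time::Duration;

#[derive(Debug, PartialEq)]
enum DeviceState { Active, Failed }

/// Outcome of one reload attempt (hypothetical enum; stands in for
/// "init crashed" vs "init exceeded the hard timeout" vs success).
#[derive(Debug, PartialEq)]
enum ReloadOutcome { Ok, InitCrash, InitTimeout }

/// Hard timeout on driver initialization, per the detection rule above.
const INIT_TIMEOUT: Duration = Duration::from_millis(500);

/// Returns true if the device came back online.
fn handle_reload(
    outcome: ReloadOutcome,
    state: &mut DeviceState,
    failure_count: &mut u32,
) -> bool {
    match outcome {
        ReloadOutcome::Ok => {
            *state = DeviceState::Active;
            true
        }
        ReloadOutcome::InitCrash | ReloadOutcome::InitTimeout => {
            // Device offline: no new I/O accepted. Clients get SIGHUP and
            // pending I/O returns -EIO (not modeled here). Kernel continues.
            *state = DeviceState::Failed;
            *failure_count += 1; // may trigger auto-demotion (Section 10.5.10.2)
            false
        }
    }
}
```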
Recovery timing breakdown — The ~50ms figure applies to the soft-reset path where the driver performs a vendor-specific device reset (register write + status poll) without a full PCIe Function Level Reset. Many devices (Intel NICs, AHCI controllers) support fast software reset in 1-10ms. The full PCIe FLR path takes longer: the PCIe spec requires the function to complete FLR within 100ms (the device must not be accessed until FLR completes; software polls the device's configuration space to detect completion). With driver reload overhead, the FLR path totals ~150ms. UmkaOS prefers the soft-reset path when the driver crash was a software bug (the device hardware is fine); FLR is used when the device itself appears hung (no response to MMIO reads, completion timeout). In either case, the recovery is 100-1000x faster than a full Linux reboot (30-60s).
10.8.2b FLR Timeout Recovery
The PCIe Base Specification requires that a function complete FLR within 100 ms. UmkaOS enforces this deadline and defines an escalating recovery sequence for the case where FLR does not complete in time.
FLR with timeout enforcement:
/// Poll interval while waiting for FLR completion.
const FLR_POLL_INTERVAL_US: u64 = 1_000; // 1 ms

/// Maximum wait for FLR per PCIe Base Spec (Section 6.6.2).
const FLR_TIMEOUT_MS: u64 = 100;

/// Initiate FLR on a PCIe function and poll for completion.
/// Returns Ok(()) when the function's config space is accessible again.
/// Returns Err(PcieError::FlrTimeout) if the deadline elapses.
fn pcie_flr_with_timeout(dev: &mut PcieDevice) -> Result<(), PcieError> {
    // Initiate FLR: set bit 15 (Initiate Function Level Reset) of the Device
    // Control register, preserving the other control bits (read-modify-write).
    // Cap offset is discovered via the PCIe Capability structure pointer.
    let devctl_offset = dev.pcie_cap_offset + PCI_EXP_DEVCTL;
    let devctl = dev.config_read_u16(devctl_offset);
    dev.config_write_u16(devctl_offset, devctl | PCI_EXP_DEVCTL_BCR_FLR);

    let deadline_ns = monotonic_ns() + FLR_TIMEOUT_MS * 1_000_000;
    loop {
        delay_us(FLR_POLL_INTERVAL_US);
        // FLR completion is indicated by config space returning valid data.
        // A device undergoing FLR returns 0xFFFF for any config read.
        if dev.config_read_u16(PCI_VENDOR_ID) != 0xFFFF {
            return Ok(());
        }
        if monotonic_ns() >= deadline_ns {
            break;
        }
    }
    Err(PcieError::FlrTimeout)
}
Escalation sequence on PcieError::FlrTimeout:
When FLR does not complete within 100 ms, UmkaOS escalates through the following steps in order, stopping at the first step that succeeds:
1. IOMMU quarantine (immediate, before attempting any escalation): the device's IOMMU domain is placed in fault mode — all further DMA from the device is blocked by the IOMMU. This prevents the hung device from corrupting memory during the escalation sequence, regardless of how long escalation takes.
2. Secondary bus reset: if the device is behind a PCIe bridge (not directly attached to the root complex), assert the bridge's secondary bus reset bit (PCI_BRIDGE_CTL_BUS_RESET, bit 6 of the Bridge Control register at config offset 0x3E). Hold for 1 ms, then deassert and wait up to 100 ms for the device's Vendor ID to become valid. A secondary bus reset resets all functions on the secondary bus, so sibling functions receive DeviceEvent::SiblingReset.
3. Hot-plug slot power cycle: if the slot exposes Hot-Plug capability and the HPC_POWER_CTRL bit is set in the Slot Capabilities register, toggle slot power off and on. Wait up to 1 s for the slot's Presence Detect State to return to present and the device's config space to become accessible.
4. Permanent fault: if neither secondary bus reset nor hot-plug power cycle recovers the device:
   - Transition the device to DeviceState::FaultedUnrecoverable.
   - Remove the device from the active device registry (it is retained as a tombstone entry for diagnostic purposes, accessible via umkafs).
   - Invoke the Tier 1 driver's teardown path (unload the driver, release its memory domain and capabilities) as if a crash occurred, but without attempting reload.
   - Log: pcie: FLR timeout on [bus:dev.fn] (vid={vid} did={did}), secondary bus reset {"succeeded"|"failed"}, slot power cycle {"succeeded"|"failed"|"unavailable"}, device faulted permanently.
   - The FMA subsystem (Section 19.1) receives a FaultEvent::PcieFlrTimeout event carrying the BDF, the vendor/device ID, and the escalation result. FMA may trigger a predictive replacement recommendation.
   - User notification: after the fault is recorded, send a uevent to userspace (ACTION=change, SUBSYSTEM=pci, PCIE_EVENT=FLR_TIMEOUT, PCI_SLOT_NAME=<bdf>). Device manager daemons (udev, systemd-udevd) can trigger operator alerts or automated replacement workflows.
Invariants:
- IOMMU quarantine (step 1) is unconditional and runs before any escalation attempt.
The device must not be able to DMA during escalation.
- Steps 2 and 3 each have their own 100 ms and 1 s timeouts respectively. Total
worst-case escalation time before permanent fault: ~1.2 s.
- No driver code runs after FlrTimeout is returned. The escalation sequence is
entirely in the kernel's PCIe subsystem (Tier 0), not in the Tier 1 driver.
- If a secondary bus reset is performed, the sibling functions' drivers are notified
via DeviceEvent::SiblingReset before the reset is asserted, giving them 5 ms to
quiesce outstanding I/O.
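The stop-at-first-success ordering of the escalation sequence can be sketched as follows. This is an illustrative control-flow model: `Escalation` and `escalate_flr_timeout` are hypothetical names, and the real bus-reset and power-cycle primitives (with their 100 ms and 1 s timeouts) are reduced to a callback.

```rust
#[derive(Debug, PartialEq, Clone, Copy)]
enum Escalation { SecondaryBusReset, SlotPowerCycle, PermanentFault }

/// Try each applicable escalation step in order; stop at the first success.
/// `try_step` stands in for the real reset/power-cycle primitives and
/// returns true if the device's config space became accessible again.
/// Step 1 (IOMMU quarantine) is unconditional and precedes this sequence,
/// so it is not modeled here.
fn escalate_flr_timeout(
    behind_bridge: bool,       // device sits behind a PCIe bridge
    hotplug_power_ctrl: bool,  // slot exposes HPC_POWER_CTRL
    try_step: &mut dyn FnMut(Escalation) -> bool,
) -> Escalation {
    if behind_bridge && try_step(Escalation::SecondaryBusReset) {
        return Escalation::SecondaryBusReset;
    }
    if hotplug_power_ctrl && try_step(Escalation::SlotPowerCycle) {
        return Escalation::SlotPowerCycle;
    }
    // Neither step applied or succeeded: fault the device permanently.
    Escalation::PermanentFault
}
```

Note that inapplicable steps are skipped entirely: a root-complex-attached device never attempts a secondary bus reset, and a slot without power control never attempts a power cycle.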
10.8.2c Crash State Buffer Wire Format
When a Tier 1 driver panics, a pre-allocated crash state buffer is filled before the driver's isolation domain is destroyed. This buffer is stored in umka-core memory and remains accessible after teardown. It is used for post-mortem diagnostics, FMA fault reporting, and optionally for warm-restart state recovery.
/// Wire format of the crash state buffer saved when a Tier 1 driver panics.
/// Saved to a pre-allocated crash buffer in umka-core memory so it remains
/// accessible after the driver's memory domain is destroyed.
///
/// Total size: 512 bytes. Aligned to 64 bytes (cache-line boundary).
#[repr(C, align(64))]
pub struct DriverCrashState {
    /// Magic number for validation: 0x49534C4352415348 ("ISLCRASH" in ASCII).
    pub magic: u64,
    /// Format version. Current: 1.
    pub version: u16,
    _pad0: [u8; 2],
    /// Driver ID (same as in the driver registry).
    pub driver_id: u32,
    /// TSC value at the time of crash (monotonic, CPU-local).
    pub crash_tsc: u64,
    /// Program counter (instruction pointer) at crash.
    pub crash_pc: u64,
    /// Stack pointer at crash.
    pub crash_sp: u64,
    /// Frame pointer at crash (for stack unwinding).
    pub crash_fp: u64,
    /// Crash reason code.
    pub crash_reason: CrashReason,
    _pad1: [u8; 4],
    /// Ring buffer head index at crash time.
    pub ring_head: u32,
    /// Ring buffer tail index at crash time.
    pub ring_tail: u32,
    /// First 256 bytes of the request being processed when the crash occurred
    /// (zero-padded if the request is shorter or unavailable).
    pub partial_request: [u8; 256],
    /// Crash backtrace: first 128 bytes. Symbolicated if DWARF debug info is
    /// available at the time of crash; raw 8-byte addresses otherwise.
    pub backtrace: [u8; 128],
    _pad2: [u8; 64],
    // Field layout (repr(C), no implicit padding):
    // magic(8) + version(2) + _pad0(2) + driver_id(4) + crash_tsc(8)
    // + crash_pc(8) + crash_sp(8) + crash_fp(8) + crash_reason(4)
    // + _pad1(4) + ring_head(4) + ring_tail(4) + partial_request(256)
    // + backtrace(128) + _pad2(64) = 512 bytes total.
}
/// Crash reason codes stored in DriverCrashState.
#[repr(u32)]
pub enum CrashReason {
    /// Driver code invoked panic!() or hit an assertion failure.
    Panic = 0,
    /// Page fault (null dereference, stack overflow, bad pointer).
    PageFault = 1,
    /// Invalid opcode (#UD fault — executed an undefined instruction).
    InvalidOpcode = 2,
    /// Divide-by-zero (#DE fault).
    DivByZero = 3,
    /// Capability access violation (attempted to cross an isolation boundary
    /// without a valid capability token).
    CapViolation = 4,
    /// Watchdog timer expired (driver did not make forward progress).
    Timeout = 5,
    /// Stack overflow detected (guard page fault at the bottom of the driver stack).
    StackOverflow = 6,
}
The crash buffer is pre-allocated per driver at load time (no allocation during the crash path). The domain fault handler fills it with whatever register state is available at fault entry, then proceeds with the normal recovery sequence.
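A post-mortem consumer of the buffer validates the header before trusting the rest. A minimal sketch, with the expected magic and version passed as parameters (the helper name `crash_buffer_valid` is hypothetical; the offsets follow the repr(C) layout above, with magic at byte 0 and version at byte 8, both in native byte order):

```rust
/// Validate the header of a 512-byte crash state buffer.
/// Returns false if the magic or version does not match, indicating a
/// buffer that was never filled, torn, or written by an unknown format.
fn crash_buffer_valid(buf: &[u8; 512], expected_magic: u64, expected_version: u16) -> bool {
    // magic occupies bytes 0..8, version bytes 8..10 (native endianness,
    // since the buffer is read on the same machine that wrote it).
    let magic = u64::from_ne_bytes(buf[0..8].try_into().unwrap());
    let version = u16::from_ne_bytes(buf[8..10].try_into().unwrap());
    magic == expected_magic && version == expected_version
}
```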
10.8.3 UmkaOS Tier 2 Recovery Sequence
Tier 2 (user-space process) driver recovery is even simpler:
1. Driver process crashes (SIGSEGV, SIGABRT, etc.)
2. UmkaOS Core's driver supervisor detects process exit
3. REVOKE DEVICE ACCESS
- Mark the device as "in recovery" in the device registry, preventing
any new MMIO mappings or device access grants for this device.
- Revoke the driver's IOMMU entries (tear down the device's IOMMU
domain mappings). Any in-flight DMA that completes after this point
hits an IOMMU fault and is dropped.
- If the dying process's teardown has not yet completed MMIO unmapping
(page table entry removal + TLB shootdown), force-invalidate the
relevant page table entries. In practice, the process is already
exiting at step 2, so MMIO unmapping is a cleanup operation — the
device registry marking and IOMMU revocation are what actually
prevent further device access.
4. RECOVER RING BUFFER IN-FLIGHT SLOTS
- Same as Tier 1 step 5a: publish poison markers for any MPSC slots
claimed but unpublished by the dead driver, set ring `state =
Disconnected`. Unblocks live producers spinning on Phase 2.
5. Pending I/O completed with -EIO
6. Supervisor restarts driver process
7. New process re-initializes device, resumes service
TOTAL RECOVERY TIME: ~10ms
Why Tier 2 is faster than Tier 1 -- Counter-intuitively, the "weaker" isolation tier recovers faster. The reason is that Tier 2 recovery skips the most expensive step in the Tier 1 sequence: no device FLR in the normal case. Tier 2 drivers have direct MMIO access to their device's BAR regions (for performance), but MMIO revocation (step 3 above) cuts off device access immediately. The IOMMU prevents any DMA initiated through those MMIO registers from reaching non-driver memory, so there is no DMA safety hazard even if the device has in-flight operations.
IOTLB coherence and DMA page lifetime -- A lightweight IOMMU invalidation (not a
full drain fence) suffices at step 3 because Tier 2 recovery defers freeing the
crashed driver's DMA pages rather than draining all in-flight DMA. After IOMMU entry
revocation, stale IOTLB entries may still allow in-flight DMA to complete to the old
physical addresses. If those pages were freed immediately, this would be a
use-after-free via hardware. Instead, the old DMA pages remain allocated (owned by the
kernel, not the dead process) until the replacement driver instance calls init() and
either reuses them (warm restart via the state buffer) or explicitly releases them back
to the allocator. By the time pages are actually freed, the IOTLB has long since been
flushed — either by the invalidation at step 3, by natural IOTLB eviction, or by the
new driver's own IOMMU setup. This makes the IOTLB coherence window moot without
requiring a synchronous drain fence.
DMA deferred-free lifetime bound -- The deferred-free strategy described above has
a resource exhaustion risk: if the replacement driver never loads (or loads but never
calls init()), the old DMA pages remain allocated indefinitely. An attacker could
repeatedly crash Tier 2 drivers to exhaust DMA-capable memory (typically ZONE_DMA /
ZONE_DMA32 on x86-64, or CMA regions on ARM). To bound this exposure, every deferred
DMA page set carries a reclaim deadline:
- When a Tier 2 driver crashes and its DMA pages are moved to deferred-free status, each page set is tagged with deferred_deadline = now + 30_seconds.
- A kernel background task (dma_reclaim_worker, period = 10 seconds) scans all deferred-free DMA page sets. Any page set whose deadline has passed is reclaimed immediately — the "wait for replacement driver" check is bypassed. The reclaim frees the physical pages back to the allocator and logs a warning identifying the driver and number of pages force-reclaimed.
- Rationale: 30 seconds is ample time for the driver supervisor to restart the replacement process and for the new driver to call init() and either reuse or release the preserved pages. If no replacement has loaded after 30 seconds, the driver is presumed permanently crashed (or its supervisor has given up), and the pages are safe to reclaim. By the 30-second mark, any stale IOTLB entries have long since been flushed (IOTLB eviction typically occurs within microseconds to milliseconds), so reclaiming the pages carries no DMA safety hazard.
/// Maximum DMA pages preserved across a Tier 2 driver crash.
/// 512 pages × 4 KB = 2 MB maximum preserved DMA state per driver.
/// Drivers requiring more than 2 MB of preserved DMA state should use
/// persistent memory (DAX) or external state servers.
pub const MAX_DEFERRED_DMA_PAGES: usize = 512;
/// DMA pages held in deferred-free state after a Tier 2 driver crash.
///
/// These pages are preserved so the replacement driver can reuse them
/// (warm restart via the state buffer). If no replacement loads before
/// `deadline`, the `dma_reclaim_worker` force-reclaims them.
pub struct DeferredDmaPages {
    /// Physical pages held for replacement driver use after crash recovery.
    /// Fixed-size array: crash handlers MUST NOT allocate from heap.
    /// Pre-allocated at driver initialization time.
    pub pages: ArrayVec<PhysPage, MAX_DEFERRED_DMA_PAGES>,
    pub page_count: u16, // actual count; u16 sufficient (max 512 < 65535)
    /// Deadline after which pages are reclaimed regardless of driver state.
    pub deadline: Instant,
    /// Which driver's state these pages belong to (for logging on forced reclaim).
    pub driver_name: DriverName,
}
The DriverRegistry maintains a counter for observability:
/// Number of times the 30-second deadline triggered forced DMA page
/// reclaim. Exposed via umkafs at `/Devices/<device>/dma_forced_reclaims`.
/// A sustained non-zero rate indicates drivers that crash without timely
/// replacement — investigate the driver supervisor and restart policy.
pub dma_forced_reclaims: AtomicU64,
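The reclaim worker's scan can be sketched as follows. This is a simplified model: `DeferredSet` stands in for DeferredDmaPages, the page free is reduced to a count, and the returned tuple is what the caller would use to bump dma_forced_reclaims and emit the per-set warning.

```rust
use std::time::Instant;

/// Simplified stand-in for DeferredDmaPages (name is illustrative).
struct DeferredSet {
    deadline: Instant,       // reclaim deadline (crash time + 30 s)
    pages: usize,            // number of preserved physical pages
    driver: &'static str,    // for the forced-reclaim warning log
}

/// One pass of the dma_reclaim_worker (period = 10 s in the text):
/// reclaim every set whose deadline has passed, keep the rest for a
/// possible warm restart. Returns (pages_reclaimed, forced_reclaims).
fn reclaim_scan(sets: &mut Vec<DeferredSet>, now: Instant) -> (usize, u64) {
    let mut pages_reclaimed = 0usize;
    let mut forced = 0u64;
    sets.retain(|s| {
        if now >= s.deadline {
            pages_reclaimed += s.pages; // free pages back to the allocator
            forced += 1;                // counted in dma_forced_reclaims
            false                       // drop the set
        } else {
            true // replacement driver may still reuse these pages
        }
    });
    (pages_reclaimed, forced)
}
```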
If the device appears hung after the Tier 2 crash (the replacement driver's init()
detects an unresponsive device), the registry escalates to FLR, but this fallback is
rare. Tier 2 recovery is typically "revoke mappings, restart the process, reconnect to
the ring buffer" -- a ~10ms operation dominated by process creation and driver
init().
10.8.4 State Preservation and Checkpointing
Driver recovery (Section 10.8 steps 1–6) restarts a new driver instance, but without state preservation the new instance starts cold — losing in-flight I/O, device configuration, and connection state. UmkaOS uses a Theseus-inspired state spill design to enable warm restarts.
State buffer — Each Tier 1 driver has an associated kernel-managed "state buffer" that resides outside the driver's isolation domain. The buffer is allocated by umka-core and mapped read-write into the driver's address space. On crash, the isolation domain is destroyed but the state buffer survives (it belongs to umka-core).
Driver Isolation Domain (destroyed on crash) umka-core (survives)
┌─────────────────────────┐ ┌──────────────────────┐
│ Driver code + heap │ checkpoint → │ State Buffer │
│ Internal caches │ ──────────→ │ ┌────────────────┐ │
│ (NOT preserved) │ │ │ Version: 3 │ │
│ │ │ │ DevCmdQueue[] │ │
│ │ │ │ RingBufPos │ │
│ │ │ │ ConnState[] │ │
│ │ │ │ HMAC Tag │ │
└─────────────────────────┘ │ └────────────────┘ │
└──────────────────────┘
State buffer format:
- Driver-defined structure (the driver author decides what to checkpoint).
- Versioned via KABI version field — the state buffer header includes a format version number so a newer driver binary can detect and handle (or reject) state from an older version.
- HMAC-SHA256 integrity tag — computed by umka-core using a per-driver key, verified before handing to the new driver instance. Corrupt or tampered buffers are discarded.

The HMAC key is generated by umka-core on the first load of a driver for a given DeviceHandle. The key is stored in the DeviceNode (Section 10.6 Device Registry) and persists across driver crash/reload cycles. The key is only discarded when the DeviceHandle is removed from the registry (device unplugged or explicitly deregistered). On reload, umka-core verifies the existing state buffer using the persisted key, then continues using the same key for the new driver instance. The driver writes state data, but only umka-core can produce valid integrity tags, preventing a buggy driver from poisoning the state buffer with corrupted data.

Note: Tier 1 drivers run in Ring 0, so a deliberately compromised driver (with arbitrary code execution) could read the HMAC key from umka-core memory by bypassing MPK via WRPKRU (Section 10.2, WRPKRU threat model). This is within the documented Tier 1 threat model — MPK provides crash containment, not exploitation prevention. The HMAC protects state buffer integrity against bugs (the common case), not against active exploitation (which requires Tier 2 for defense).
Checkpoint frequency:
- Configurable per-driver. Default: checkpoint after every I/O batch completion, or every 1ms, whichever comes first.
- Checkpoint is a memcpy from driver-local structures to the inactive state buffer slot (~1–4 KB typical) plus an atomic doorbell write. At 1ms intervals, the overhead is negligible.
Torn checkpoint protection (double buffering):
The driver cannot compute the HMAC (only umka-core can), so a driver crash mid-write would leave a torn (partially written) state buffer. To prevent this, the state buffer uses a double-buffering protocol:
- The state buffer contains two slots (A and B). At any time, one slot is active (the last successfully checkpointed state) and the other is inactive (the write target for the next checkpoint).
- The driver writes its checkpoint data to the inactive slot. When the write is complete, the driver signals umka-core by writing a completion flag to a shared doorbell — a single atomic write visible to umka-core.
- Umka-core, on observing the doorbell (polled during periodic work or on driver crash), computes HMAC-SHA256 over the completed slot and atomically swaps the active slot pointer.
- On crash recovery, umka-core verifies the active slot's HMAC. If valid, that state is used for the new driver instance. If invalid (corruption or incomplete swap), umka-core falls back to the previous active slot, which still holds the last known-good checkpoint.
- The double-buffer swap is an atomic pointer update. There is no race with driver writes because the driver only ever writes to the inactive slot.
- After ringing the doorbell, the inactive slot is considered "pending" -- the driver must not begin a new checkpoint until umka-core completes the swap and clears the doorbell flag. If the next 1 ms checkpoint interval arrives while a swap is still pending, the driver skips that checkpoint cycle. In practice, umka-core processes the doorbell within a few microseconds (HMAC-SHA256 on 4 KB takes ~2–5 µs with hardware SHA acceleration, ~15–30 µs without — see HMAC-SHA256 performance note below), so skipped checkpoints are rare.
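The double-buffering protocol above can be sketched as a single-threaded model. This is illustrative only: `StateBuffer` is a hypothetical name, slots are `Vec<u8>` rather than fixed shared-memory regions, the doorbell is a plain bool rather than an atomic, and the HMAC computation on swap is elided.

```rust
/// Illustrative model of the double-buffered checkpoint protocol.
struct StateBuffer {
    slots: [Vec<u8>; 2],
    active: usize,  // index of the last verified (known-good) checkpoint
    doorbell: bool, // set by the driver after completing a slot write
}

impl StateBuffer {
    /// Driver side: write the checkpoint to the inactive slot, then ring
    /// the doorbell. Returns false (cycle skipped) if a swap is pending.
    fn checkpoint(&mut self, data: &[u8]) -> bool {
        if self.doorbell {
            return false; // umka-core has not processed the last doorbell yet
        }
        let inactive = 1 - self.active;
        self.slots[inactive].clear();
        self.slots[inactive].extend_from_slice(data);
        self.doorbell = true; // single atomic write in the real protocol
        true
    }

    /// umka-core side: on observing the doorbell, compute the HMAC over the
    /// completed slot (elided here) and atomically swap the active pointer.
    fn process_doorbell(&mut self) {
        if self.doorbell {
            self.active = 1 - self.active;
            self.doorbell = false;
        }
    }

    /// The state a crash-recovery path would verify and hand to the new
    /// driver instance (falling back to the other slot on HMAC failure).
    fn active_state(&self) -> &[u8] {
        &self.slots[self.active]
    }
}
```

Because the driver only ever writes the inactive slot and the swap is a single pointer update, a crash at any point leaves the active slot holding a complete, verifiable checkpoint.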
TOCTOU mitigation (verify-then-use atomicity):
The state buffer is mapped read-write into the driver's address space, which creates a potential Time-Of-Check-Time-Of-Use (TOCTOU) vulnerability: a compromised driver could modify the active slot after umka-core verifies the HMAC but before the new driver instance reads it. UmkaOS prevents this attack through the following mechanisms:
1. Slot revocation on crash: When a driver crashes, umka-core immediately revokes the crashed driver's write access to both state buffer slots by unmapping the entire state buffer from the old isolation domain. This is step 2 of the recovery sequence (Section 10.8) — it happens before HMAC verification (step 4). After revocation, the crashed driver's code cannot execute and its page tables are destroyed, so there is no entity that can modify the buffer between verification and use.
2. Copy-on-verify to kernel-private storage: After HMAC verification succeeds, umka-core copies the verified slot contents to a kernel-private buffer (not mapped into any driver's address space). The new driver instance receives a read-only snapshot of this copy, not a pointer to the original state buffer. This ensures that even if an attacker could somehow gain write access to the original buffer (which they cannot, per point 1), the verified data cannot be altered.
3. New driver isolation: The new driver instance is created with a fresh isolation domain. The state buffer is not mapped into this new domain until after the new driver calls init() and signals that it has finished consuming the checkpoint data. During initialization, the driver reads from the kernel-private copy (provided via a read-only mapping or explicit copy to the driver's local heap). Only after init() returns successfully does umka-core map the state buffer (both slots) read-write into the new driver's address space for future checkpoints.
4. Atomicity guarantee: The sequence — unmap from old domain, verify HMAC, copy to kernel-private storage, create new domain — is performed with preemption disabled on the recovery CPU. There is no window during which any user-space code (driver or otherwise) can execute while holding write access to the verified buffer.
This design ensures that HMAC verification and data consumption are effectively atomic: once verified, the data cannot be modified by any entity before the new driver reads it. The cost is one additional memcpy (~4 KB) per recovery, which is negligible compared to the overall recovery latency (~50-150 ms).
HMAC-SHA256 performance:
HMAC-SHA256 for a 4 KB message:
- With hardware SHA acceleration (SHA-NI on x86-64 Skylake+, SHA1/SHA256 extensions on AArch64/ARMv7, Zknh on RISC-V): ~2.1 cycles/byte → ~8,600 cycles → ~2–5 µs at 3 GHz
- Without hardware acceleration (software implementation, SSSE3, or generic): ~13 cycles/byte → ~53,000 cycles → ~15–30 µs at 3 GHz
UmkaOS selects the optimal implementation at boot via algorithm priority:
hardware-SHA > SSSE3 > generic. The crypto_shash_alloc() API transparently
selects the fastest available implementation for the running CPU.
HMAC-SHA256 computation is performed by umka-core asynchronously — not on the driver's hot path. The driver's checkpoint cost is limited to the memcpy plus an atomic doorbell write.
What is preserved vs. rebuilt:
| Preserved (in state buffer) | NOT preserved (rebuilt from scratch) |
|---|---|
| Device command queue positions | Driver-internal caches |
| Hardware register snapshots | Deferred work queues |
| In-flight I/O descriptors | Timers and timeout state |
| Ring buffer head/tail pointers | Debug/logging state |
| Connection/session state | Statistics counters (reset to zero) |
| Device configuration (MTU, features, etc.) |
NVMe example:
- Checkpointed: submission queue tail doorbell position, completion queue head position, in-flight command IDs with their scatter-gather lists, namespace configuration.
- On reload: new driver reads state buffer, re-maps device BARs, verifies queue state against hardware registers, and resumes submission. In-flight commands that were submitted but not completed are re-issued.

NIC example:
- Checkpointed: active flow table entries, RSS (Receive Side Scaling) indirection table and hash key, interrupt coalescing settings, VLAN filter table, MAC address list.
- On reload: new driver re-programs the NIC with the checkpointed configuration. Active TCP connections see a brief pause (~50-150ms) but do not reset — the connection state lives in umka-net (Tier 1), not in the NIC driver.

Fallback:
- If HMAC verification of the state buffer fails, or the version is incompatible, the new driver instance performs a cold restart (current behavior: full device reset, all in-flight I/O returned as -EIO).
- Cold restart is always safe — state preservation is an optimization, not a requirement.
10.8.5 Crash Dump Infrastructure
When umka-core itself faults (not a driver — the core kernel), the system needs to capture diagnostic state for post-mortem analysis. Unlike driver crashes (which are recoverable), a core panic is fatal.
Reserved memory region:
- At boot, UmkaOS reserves a contiguous physical memory region for crash dumps, configured
via boot parameter: umka.crashkernel=256M (similar to Linux crashkernel=).
- This region is excluded from the normal physical memory allocator — it survives a
warm reboot if the firmware doesn't clear RAM.
Panic sequence:
1. Core panic triggered (null deref, assertion failure, double fault, etc.)
2. Disable interrupts on all CPUs (IPI NMI broadcast)
3. Panic handler (Tier 0 code, always resident, minimal dependencies):
a. Save register state for the faulting CPU:
- x86-64: GPRs, CR3, IDTR, RSP, RFLAGS, RIP, segment selectors
- AArch64: GPRs (x0-x30), SP_EL1, ELR_EL1, SPSR_EL1, ESR_EL1, FAR_EL1
- ARMv7: GPRs (r0-r15), CPSR, DFAR, DFSR, IFAR, IFSR
- RISC-V: GPRs (x0-x31), sepc, scause, stval, sstatus, satp
b. Walk the stack, generate backtrace (using .eh_frame / DWARF unwind info)
c. Snapshot key data structures:
- Active process list + their states
- Capability table summary
- Driver registry state
- IRQ routing table
- Recent ring buffer entries (last 64KB of klog)
d. Write all of the above into the reserved crash region as an ELF core dump
4. Flush panic message to serial console (already works in current implementation)
5. If a pre-registered NVMe region exists (configured at boot):
a. Use the NVMe driver's Tier 0 "panic write" path (polled mode, no interrupts)
b. Write the crash dump from reserved memory to the NVMe region
6. Halt or reboot (configurable: `umka.panic=halt|reboot`, default: halt)
Crash stub:
- The panic handler is Tier 0 code: statically linked, no dynamic dispatch, no allocation, no locks (or only try-lock with immediate fallback). It must work even if the heap, scheduler, or interrupt subsystem is corrupted.
- Serial output always works (Tier 0 serial driver, polled mode).
- NVMe panic write uses polled I/O (no interrupts, no completion queues) — a simplified write path that can function with a partially-corrupted kernel.
Next boot recovery:
1. Bootloader loads UmkaOS kernel
2. Early init checks the reserved crash region for a valid dump header
3. If found:
a. Copy dump to a temporary in-memory buffer
b. After filesystem mount, write to /var/crash/umka-dump-<timestamp>.elf
c. Log "Previous crash dump saved to /var/crash/umka-dump-<timestamp>.elf"
d. Clear the reserved crash region
4. The dump can be analyzed with standard tools:
- `crash` utility (same as Linux kdump analysis)
- GDB with the UmkaOS kernel debug symbols
- `umka-crashdump` tool (UmkaOS-specific, extracts structured summaries)
Dump format:
- ELF core dump format, compatible with the crash utility and GDB.
- Contains: register state, memory regions (kernel text, data, stack pages for active
threads, page tables), and a note section with UmkaOS-specific metadata (kernel version,
boot parameters, uptime, driver state).
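Step 2 of the next-boot recovery flow hinges on validating the reserved region's dump header before trusting its length fields. A minimal sketch, with an illustrative (not spec'd) header layout and magic value:

```rust
/// Hypothetical layout of the header at the start of the reserved crash
/// region. Field set and magic value are illustrative, not the spec'd format.
#[repr(C)]
pub struct CrashRegionHeader {
    /// Magic marker written by the panic handler.
    pub magic: [u8; 8],
    /// Header format version.
    pub version: u32,
    /// Length of the ELF core dump that follows the header, in bytes.
    pub dump_len: u32,
    /// Checksum over the dump bytes (algorithm unspecified in this sketch).
    pub checksum: u64,
}

pub const CRASH_MAGIC: [u8; 8] = *b"UMKADUMP";

/// Early-init validity check: magic, version, and length must be sane before
/// the dump is copied out; a stale or corrupt region is treated as empty.
pub fn crash_header_valid(h: &CrashRegionHeader, region_len: usize) -> bool {
    h.magic == CRASH_MAGIC
        && h.version == 1
        && (h.dump_len as usize)
            <= region_len.saturating_sub(core::mem::size_of::<CrashRegionHeader>())
}
```

A length check against the region size (rather than trusting `dump_len` blindly) matters because the dump was written by a possibly-corrupted kernel.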
No kexec on day one:
- Linux uses kexec to boot a second "crash kernel" that writes the dump. This is reliable but complex.
- UmkaOS uses a simpler "in-place dump" to reserved memory: the panic handler writes directly to the reserved region without booting a second kernel.
- kexec-based crash dump is a future enhancement for systems where the in-place approach is insufficient (e.g., very large memory dumps requiring a full kernel to compress and transmit).
10.8.6 Recovery Comparison
| Scenario | Linux | UmkaOS |
|---|---|---|
| NVMe driver null deref | Kernel panic, full reboot | Reload driver, ~50-150ms (design target) |
| NIC driver infinite loop | System freeze | Watchdog kill, reload, ~50-150ms (design target) |
| USB driver buffer overflow | Kernel panic | Restart process, ~10ms |
| FS driver corruption | Kernel panic + fsck | Reload driver, fsck on mount |
| Audio driver crash | Kernel panic | Restart process, ~10ms |
10.8.7 Crash History and Auto-Demotion
The kernel tracks per-driver crash statistics:
crash_count[driver_id] within window (default: 1 hour)
0-2 crashes: Reload at same tier
3+ crashes: Demote to next lower tier (if minimum_tier allows)
Log warning, notify admin
5+ crashes: Transition to Quarantined (driver permanently disabled); manual re-enable via sysfs. Log critical alert
A Tier 1 driver that crashes 3 times is demoted to Tier 2 (full process isolation), accepting the performance penalty for increased safety. An administrator can override this policy.
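The thresholds above are a pure function of the windowed crash count, which makes the policy easy to test in isolation. A sketch with illustrative names:

```rust
/// Recovery decision for a driver, given its crash count in the current window.
#[derive(Debug, PartialEq, Eq)]
pub enum RecoveryAction {
    /// 0-2 crashes: reload at the same tier.
    ReloadSameTier,
    /// 3-4 crashes: demote to the next lower tier (if minimum_tier allows).
    DemoteOneTier,
    /// 5+ crashes: quarantine (driver disabled); manual re-enable via sysfs.
    Quarantine,
}

/// Pure policy function mirroring the thresholds in the text.
pub fn recovery_action(crashes_in_window: u32) -> RecoveryAction {
    match crashes_in_window {
        0..=2 => RecoveryAction::ReloadSameTier,
        3..=4 => RecoveryAction::DemoteOneTier,
        _ => RecoveryAction::Quarantine,
    }
}
```

Keeping the decision separate from the crash bookkeeping also makes the administrator override a matter of substituting a different policy function.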
See also: Section 12.6 (Live Kernel Evolution) extends crash recovery to proactively replace core kernel components at runtime, reusing the same state-export/reload mechanism. Section 19.1 (Fault Management) adds predictive telemetry and diagnosis before crashes occur.
10.9 USB Class Drivers and Mass Storage
USB devices follow a class-based driver model. The USB host controller driver (xHCI for USB 3.x, EHCI for USB 2.0) is a Tier 1 platform driver that manages host controller hardware and the root hub. Class drivers are layered above it and bind to devices by USB class code, subclass, and protocol — not by vendor/product ID — giving a single driver coverage across all standards-compliant devices of a class.
10.9.1 USB Host Controller (xHCI, Tier 1)
The xHCI driver (USB 3.2 specification) manages:
- Transfer ring management: each endpoint has a ring buffer (producer/consumer pointers in memory). The driver enqueues Transfer Request Blocks (TRBs); the controller processes them and posts Transfer Event TRBs to the Event Ring.
- Command ring: host-issued commands (Enable Slot, Disable Slot, Configure Endpoint, Reset Device) use a separate command ring.
- Interrupt moderation: MSI-X per-interrupter; Event Ring Segment Table (ERST) maps event ring memory to the controller.
Device enumeration: root hub port status change → enumerate device at default
address 0 → GET_DESCRIPTOR (device, configuration, interface, endpoint) →
assign address via SET_ADDRESS → bind class driver based on bDeviceClass or
bInterfaceClass.
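Class binding starts from fields of the 18-byte device descriptor returned by GET_DESCRIPTOR. A sketch of extracting the binding-relevant fields, treating the device's bytes as untrusted (struct name is illustrative; offsets follow the standard USB device descriptor layout):

```rust
/// Fields of the standard 18-byte USB device descriptor that drive
/// class-driver binding (offsets per USB 2.0 spec §9.6.1).
#[derive(Debug, PartialEq, Eq)]
pub struct DeviceDescriptorSummary {
    pub device_class: u8,    // bDeviceClass (0x00 = class defined per-interface)
    pub device_subclass: u8, // bDeviceSubClass
    pub device_protocol: u8, // bDeviceProtocol
    pub vendor_id: u16,      // idVendor (little-endian on the wire)
    pub product_id: u16,     // idProduct
}

/// Parse a GET_DESCRIPTOR(device) response. Returns None if the length or
/// descriptor-type byte is wrong; the device is untrusted input.
pub fn parse_device_descriptor(d: &[u8]) -> Option<DeviceDescriptorSummary> {
    if d.len() < 18 || d[0] != 18 || d[1] != 0x01 {
        return None; // bLength must be 18, bDescriptorType must be DEVICE (1)
    }
    Some(DeviceDescriptorSummary {
        device_class: d[4],
        device_subclass: d[5],
        device_protocol: d[6],
        vendor_id: u16::from_le_bytes([d[8], d[9]]),
        product_id: u16::from_le_bytes([d[10], d[11]]),
    })
}
```

When `device_class` is 0x00, binding proceeds per-interface using bInterfaceClass from the configuration descriptor, as noted above.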
10.9.2 USB Mass Storage (UMS) and USB Attached SCSI (UAS)
Both protocols expose USB storage devices as block devices to umka-block.
UMS (USB Mass Storage, Bulk-Only Transport):
- Wraps SCSI commands in a Command Block Wrapper (CBW) sent over a bulk-out
endpoint; device responds with data and a Command Status Wrapper (CSW) on
bulk-in. One outstanding command at a time.
- Device registers as BlockDevice with umka-block upon successful SCSI
INQUIRY → READ CAPACITY(16) sequence.
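The CBW framing described above follows the Bulk-Only Transport wire format (31 bytes). A layout sketch; the `inquiry_cbw` helper is illustrative:

```rust
/// USB Mass Storage Bulk-Only Transport Command Block Wrapper (31 bytes,
/// per the USB MSC BOT 1.0 specification). Sent on the bulk-out endpoint.
#[repr(C, packed)]
pub struct CommandBlockWrapper {
    pub d_cbw_signature: u32,            // 0x43425355 ("USBC", little-endian)
    pub d_cbw_tag: u32,                  // echoed back in the matching CSW
    pub d_cbw_data_transfer_length: u32, // expected data-phase length
    pub bm_cbw_flags: u8,                // bit 7: 1 = data-in (device to host)
    pub b_cbw_lun: u8,                   // logical unit number (bits 3:0)
    pub b_cbwcb_length: u8,              // SCSI CDB length, 1..=16
    pub cbwcb: [u8; 16],                 // SCSI command descriptor block
}

pub const CBW_SIGNATURE: u32 = 0x4342_5355;

/// Build a CBW for SCSI INQUIRY (opcode 0x12), the first command of the
/// INQUIRY → READ CAPACITY(16) registration sequence described above.
pub fn inquiry_cbw(tag: u32, alloc_len: u8) -> CommandBlockWrapper {
    let mut cdb = [0u8; 16];
    cdb[0] = 0x12;      // INQUIRY opcode
    cdb[4] = alloc_len; // allocation length
    CommandBlockWrapper {
        d_cbw_signature: CBW_SIGNATURE,
        d_cbw_tag: tag,
        d_cbw_data_transfer_length: alloc_len as u32,
        bm_cbw_flags: 0x80, // data-in
        b_cbw_lun: 0,
        b_cbwcb_length: 6, // INQUIRY uses a 6-byte CDB
        cbwcb: cdb,
    }
}
```

`#[repr(C, packed)]` is what makes the struct exactly 31 bytes; the device rejects CBWs of any other length.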
UAS (USB Attached SCSI, USB 3.0+):
- Four-endpoint protocol (command, status, data-in, data-out). Multiple
outstanding commands (up to 65535 via stream IDs). Significantly higher
throughput and lower latency than UMS for fast SSDs.
- Preferred over UMS when both are supported (bInterfaceProtocol = 0x62).
- Same BlockDevice registration as UMS; umka-block sees no difference.
Hotplug: USB device removal triggers an Unregister event in the device
registry (Section 10.5). The volume layer (Section 14.3) transitions dependent block devices to
DEVICE_FAILED state. Auto-mount/unmount policy is handled by a userspace
daemon (udev-compatible via umka-compat) reacting to device registry events.
Tier classification: UMS/UAS drivers are Tier 2 — they communicate over USB (inherently higher latency than PCIe), and the attack surface of USB storage firmware justifies full process isolation over the modest CPU overhead.
10.9.3 USB4 and Thunderbolt
USB4 (based on Thunderbolt 3 protocol) and Thunderbolt 3/4 are high-bandwidth interconnects (40 Gbps) that tunnel multiple protocols — PCIe, DisplayPort, USB — over a single cable. They are relevant across server (external NVMe enclosures, 40GbE NICs), workstation (external GPUs), and embedded (dock stations) contexts.
Architecture: A USB4/Thunderbolt port is controlled by a retimer/router chip with its own firmware. The host-side driver configures the router and establishes tunnels. The tunneled protocols then appear as native devices:
Physical cable (USB4/TB4)
└── USB4 router (host controller + retimer firmware)
├── PCIe tunnel → appears as PCIe device (NVMe, GPU, NIC)
├── DisplayPort tunnel → appears as DP connector (Section 20.4.3, `20-user-io.md`)
└── USB tunnel → appears as USB hub → USB class devices
Kernel responsibilities:
- Router enumeration: Discover USB4 routers via their management interface (MMIO registers or USB control endpoint). Read the router topology descriptor to find upstream/downstream adapters and their capabilities.
- IOMMU enforcement (mandatory for PCIe tunnels): Before establishing a PCIe tunnel to an external device, the kernel allocates an IOMMU domain for the tunneled device. The PCIe device behind the tunnel is treated identically to a native PCIe device — it gets its own IOMMU domain, its own device registry entry, and its driver follows the normal Tier 1/2 model. IOMMU protection is not optional; external PCIe devices are untrusted by definition.
- Tunnel authorization: The kernel blocks PCIe tunnel establishment until an authorization signal is received via sysfs: writing 1 to /sys/bus/thunderbolt/devices/<device>/authorized authorizes the device; writing 0 de-authorizes and tears down the tunnel. This is the kernel's policy interface — what triggers the write (user prompt, pre-approved list, automatic trust) is userspace policy.
- Hotplug lifecycle:
  - Connect: router detects device → kernel enumerates → IOMMU domain allocated → authorization check → tunnel established → PCIe/DP/USB device appears
  - Disconnect: router reports link-down → kernel tears down tunnel → IOMMU domain revoked → device registry Unregister event → volume/display/USB layers handle disappearance gracefully
/// USB4/Thunderbolt router state.
pub struct Usb4Router {
/// Router hardware generation and capabilities.
pub gen: Usb4Generation,
/// Upstream adapter (host-facing port).
pub upstream: Usb4Adapter,
/// Downstream adapters (device-facing ports).
pub downstream: Vec<Usb4Adapter>,
/// Currently active tunnels.
pub tunnels: Vec<Usb4Tunnel>,
/// IOMMU domains for active PCIe tunnels.
pub pcie_domains: BTreeMap<Usb4AdapterId, IommuDomain>,
}
#[repr(u32)]
pub enum Usb4Generation {
Usb4Gen2 = 2, // 20 Gbps
Usb4Gen3 = 3, // 40 Gbps
Tb3 = 30, // Thunderbolt 3 (40 Gbps)
Tb4 = 40, // Thunderbolt 4 (40 Gbps, mandatory PCIe + DP)
}
pub struct Usb4Tunnel {
pub kind: Usb4TunnelKind,
pub adapter_id: Usb4AdapterId,
pub iommu_domain: Option<IommuDomain>, // Some for PCIe tunnels
}
#[repr(u32)]
pub enum Usb4TunnelKind {
Pcie = 0,
DisplayPort = 1,
Usb3 = 2,
}
IOMMU domain lifecycle on disconnect/reconnect:
To prevent IOMMU domain reuse on rapid disconnect/reconnect sequences:
- On disconnect: the device's IOMMU domain is immediately invalidated (all IOMMU mappings flushed). The domain ID enters a quarantine period (TTL = 5 seconds). The device's CAP_DMA capability is revoked immediately via cap_revoke(device_cap_handle).
- Quarantine: the quarantined domain ID is reserved and cannot be assigned to any new device until TTL expires and all in-flight DMA transactions are confirmed drained (via iommu_domain_drain_wait()).
- On reconnect: the reconnecting device receives a fresh IOMMU domain with a new domain ID. It never inherits the quarantined domain. Authorization re-runs from scratch (user prompt or policy check).
- DMA capability binding: CAP_DMA is bound to the IOMMU domain ID, not the device identity. A reconnecting device gets a new CAP_DMA capability after authorization; the old capability is permanently revoked.
This prevents the race where old IOMMU mappings remain active when a new device appears at the same slot, and prevents capability reuse across device identities.
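The quarantine bookkeeping reduces to a small map from domain ID to expiry time. A host-testable sketch with illustrative names (the real allocator would additionally gate on iommu_domain_drain_wait() completing):

```rust
use std::collections::BTreeMap; // alloc::collections::BTreeMap in the kernel

/// Quarantine TTL from the text: 5 seconds, in nanoseconds.
pub const DOMAIN_QUARANTINE_TTL_NS: u64 = 5_000_000_000;

/// Tracks quarantined IOMMU domain IDs (illustrative structure).
pub struct DomainQuarantine {
    until_ns: BTreeMap<u32, u64>, // domain_id → quarantine expiry (monotonic ns)
}

impl DomainQuarantine {
    pub fn new() -> Self {
        Self { until_ns: BTreeMap::new() }
    }

    /// Called on disconnect, after mappings are flushed and CAP_DMA revoked.
    pub fn quarantine(&mut self, domain_id: u32, now_ns: u64) {
        self.until_ns.insert(domain_id, now_ns + DOMAIN_QUARANTINE_TTL_NS);
    }

    /// May this domain ID be handed to a newly attached device?
    pub fn may_allocate(&mut self, domain_id: u32, now_ns: u64) -> bool {
        match self.until_ns.get(&domain_id) {
            Some(&expiry) if now_ns < expiry => false, // still quarantined
            Some(_) => {
                self.until_ns.remove(&domain_id); // TTL expired: release
                true
            }
            None => true,
        }
    }
}
```

Because reconnecting devices always get a fresh domain ID, `may_allocate` failing for a quarantined ID never blocks a reconnect; it only prevents the stale ID from being recycled early.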
Firmware updates for TB controllers: Controller firmware is updatable via
the NVM update protocol (vendor-specific, typically via the thunderbolt sysfs
interface). The kernel exposes the firmware version and provides a write interface
for firmware blobs. Actual firmware image selection and update policy is userspace.
Relationship to Section 5.1.3: External PCIe devices attached via USB4/Thunderbolt use the same IOMMU hard boundary and unilateral controls (bus master disable, FLR, slot power) as internal PCIe devices. If the external device runs an UmkaOS peer kernel (Section 5.1.2.2), it participates in the cluster exactly as an internal device would — the tunnel is transparent to the cluster protocol.
10.9.3.1 Authorization TOCTOU Prevention
The authorization flow described above has a time-of-check / time-of-use (TOCTOU) window: a malicious device could present one identity at authorization time and then swap its firmware or topology between authorization and PCIe tunnel enumeration, gaining access to an authorized tunnel under a different identity. UmkaOS closes this window with a cryptographic authorization token that binds to immutable hardware identifiers, plus a mandatory re-verification step at the start of enumeration.
Authorization token:
/// Default authorization token lifetime for interactive sessions.
/// After this duration, the token expires and a new authorization is required.
/// Balances security (limits damage window if device is swapped) with usability.
/// Overridable via the `umka.thunderbolt_auth_timeout_s` kernel parameter.
pub const TBT_AUTH_DEFAULT_TIMEOUT_S: u64 = 1800; // 30 minutes
/// Timeout waiting for the userspace authorization daemon to respond.
/// If no response within this window, the tunnel request is denied (fail-closed).
/// Prevents hangs when the authorization daemon is unresponsive.
pub const TBT_AUTH_DAEMON_RESPONSE_TIMEOUT_S: u64 = 30;
/// Cryptographic authorization token for a USB4/Thunderbolt PCIe tunnel.
/// Generated by the security manager when the user authorizes a device.
/// Stored in umka-core memory (not in the driver's isolation domain).
pub struct TbtAuthToken {
/// HMAC-SHA256(auth_key, device_uuid || device_serial || topology_path_bytes).
/// `auth_key` is a kernel-private key generated at boot (never exported).
pub token: [u8; 32],
/// Thunderbolt device UUID as reported by router firmware (immutable field).
pub device_uuid: [u8; 16],
/// Thunderbolt device serial number as reported by router firmware (immutable).
pub device_serial: u64,
/// Topology path (upstream router UIDs + adapter indices) at authorization time.
pub topology_path: TbtTopologyPath,
/// Monotonic nanosecond timestamp when authorization was granted.
pub authorized_at_ns: u64,
/// Expiry timestamp (monotonic ns). 0 means valid until disconnect.
/// Defaults to authorized_at_ns + TBT_AUTH_DEFAULT_TIMEOUT_S * 1_000_000_000
/// for interactive sessions. Set to 0 for explicit "valid until disconnect" policy.
pub expires_at_ns: u64,
}
/// Topology path: ordered list of (router_uid, adapter_index) pairs from the
/// host controller down to the authorized device. Max depth = 6 hops (USB4 spec).
pub struct TbtTopologyPath {
pub hops: [(u64, u8); 6], // (router_uid, adapter_index)
pub depth: u8,
}
Headless and daemon-less authorization policy: If no USB4/Thunderbolt
authorization daemon is registered (headless server, container, or daemon crash),
ALL PCIe tunnel requests are denied by default until explicit authorization via
umka-tbtctl authorize <uuid>.
USB-class endpoints (not PCIe tunnels) are unaffected by this policy.
The deny-default is logged at KERN_INFO level.
Daemon response timeout: If no daemon response arrives within
TBT_AUTH_DAEMON_RESPONSE_TIMEOUT_S seconds, the kernel auto-denies the tunnel
and logs the timeout event at KERN_WARNING.
Token generation at authorization time:
When the security manager grants authorization (in response to a write of 1 to
/sys/bus/thunderbolt/devices/<device>/authorized):
- Read the device's UUID and serial from the router firmware via the Thunderbolt management interface (read-only fields in the router topology descriptor; these fields are populated at cable plug-in by the router firmware from the device's identity block and cannot be modified by software).
- Snapshot the current topology path (upstream router UIDs + adapter indices from host controller down to this device).
- Compute HMAC-SHA256(auth_key, device_uuid || device_serial || topology_path_bytes) using the kernel's boot-time-generated auth_key.
- Store the resulting TbtAuthToken in umka-core memory, associated with the device's Usb4Router entry.
Re-verification at enumeration time:
Before the USB4/TBT driver establishes the PCIe tunnel and presents the tunneled device to the PCIe bus, the kernel performs a mandatory re-verification:
PCIe tunnel enumeration protocol (enforced by umka-core, not the driver):
1. USB4/TBT driver requests PCIe tunnel enumeration for adapter <id>.
2. Security manager retrieves the stored TbtAuthToken for that adapter.
3. If no token exists: enumeration denied (PermissionDenied).
4. If token has expired (expires_at_ns != 0 && monotonic_ns() > expires_at_ns):
revoke authorization, log security event, return PermissionDenied.
5. Re-read device UUID + serial from router firmware.
6. Re-snapshot current topology path.
7. Recompute HMAC-SHA256(auth_key, uuid || serial || path_bytes).
8. Compare computed token with stored token — must match byte-for-byte.
9. If mismatch:
a. Log security event: "TBT TOCTOU: device identity changed after authorization,
adapter <id>, expected UUID <stored_uuid>, got <current_uuid>"
b. Revoke authorization (clear authorized bit, destroy stored token).
c. Disconnect: instruct the router firmware to disable the PCIe adapter.
d. Return SecurityViolation to the caller.
10. If match: proceed with PCIe tunnel establishment and enumeration,
holding a reference to the auth token for the duration of enumeration.
11. During enumeration: verify that the PCIe device hierarchy rooted at the
tunnel matches the topology snapshot (router count, UIDs at each hop).
Any discrepancy aborts enumeration with the same TOCTOU revocation sequence.
12. Post-enumeration: associate the PCIe device nodes with this auth token.
Store the token reference in each `DeviceDescriptor` for the tunneled devices.
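Step 8's byte-for-byte comparison should be constant-time: a short-circuiting comparison leaks, via timing, how many leading bytes of the stored token a forged value matches. A minimal sketch of the comparison primitive:

```rust
/// Constant-time equality for 32-byte HMAC tokens. OR-accumulating the XOR
/// of every byte pair makes the running time independent of where (or
/// whether) the inputs differ, unlike a short-circuiting `==`.
pub fn token_eq(a: &[u8; 32], b: &[u8; 32]) -> bool {
    let mut diff = 0u8;
    for i in 0..32 {
        diff |= a[i] ^ b[i];
    }
    diff == 0
}
```

In practice the compiler must not be allowed to re-introduce an early exit; kernels typically route such comparisons through a volatile read or an optimization barrier.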
Topology change monitoring after enumeration:
After the PCIe tunnel is established, the USB4 driver monitors router firmware events (hotplug notifications, link-state change interrupts from the host controller):
- Router added or removed: any unexpected change in the topology between the host controller and the authorized device triggers re-verification. If the re-verify fails (token mismatch due to topology change), the kernel disconnects the PCIe tunnel and revokes authorization.
- Link-down on authorized adapter: treated as a disconnect event. The auth token is destroyed. Reconnection requires a fresh authorization cycle.
- Router UID mismatch: if a router at any hop in the stored topology path reports a different UID than the token recorded, the kernel disconnects immediately. This catches the attack where an intermediate router (not the endpoint device) is replaced.
The topology monitoring event loop runs in the USB4 host controller driver (Tier 1). Events are delivered via the host controller's interrupt, processed in the driver's interrupt handler, and dispatched to the security manager via an MPSC ring.
10.10 I2C/SMBus Bus Framework
I2C (Inter-Integrated Circuit) and SMBus (System Management Bus, a subset of I2C) are low-speed serial buses used throughout the hardware stack — in servers as well as consumer and embedded devices:
Server / datacenter uses:
- BMC (Baseboard Management Controller) sensor buses: CPU, DIMM, and VRM temperature sensors; fan speed controllers; PSU monitoring
- PMBus (Power Management Bus, layered on SMBus): voltage regulators, power sequencing, power rail telemetry
- SPD (Serial Presence Detect): JEDEC EEPROM on each DIMM, read at boot for memory training; JEDEC JEP106 manufacturer ID, capacity, speed grade, thermal sensor register on DDR4/5 DIMMs
- IPMI satellite controllers (IPMB — IPMI over I2C)
Consumer / embedded uses:
- Touchpads and touchscreens (I2C-HID protocol, Section 10.10.3 below)
- Audio codecs (I2C control path for volume, routing, power state)
- Ambient light sensors, accelerometers (shock/vibration detection)
- Battery and charger controllers (Smart Battery System over SMBus)
10.10.1 I2C Bus Trait
Platform I2C controller drivers (Intel LPSS, AMD FCH, Synopsys DesignWare,
Broadcom BCM2835, Aspeed AST2600 BMC) implement the I2cBus trait. The trait
is in umka-core/src/bus/i2c.rs.
/// I2C device address (7-bit, right-aligned; 0x00–0x7F).
pub type I2cAddr = u8;
/// I2C transfer result. Derives allow `==`/`!=` comparisons at call sites
/// (e.g., checking for `I2cResult::Ok` in interrupt paths).
#[repr(u32)]
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
pub enum I2cResult {
    Ok = 0,
    /// No ACK (device not present or not responding).
    NoAck = 1,
    /// Bus arbitration lost (multi-master collision).
    ArbitrationLost = 2,
    /// Timeout (clock stretching exceeded or device hung).
    Timeout = 3,
    InvalidParam = 4,
}
/// I2C bus trait. Implemented by platform-specific controller drivers.
/// Used only within Rust-internal code (same compilation unit). For KABI
/// boundaries between separately-compiled modules, use `I2cBusVTable` below.
pub trait I2cBus: Send + Sync {
/// Combined write-then-read (I2C repeated START).
/// Typical pattern: write register address, read value.
fn transfer(&self, addr: I2cAddr, write: &[u8], read: &mut [u8]) -> I2cResult;
fn write(&self, addr: I2cAddr, data: &[u8]) -> I2cResult {
self.transfer(addr, data, &mut [])
}
fn read(&self, addr: I2cAddr, buf: &mut [u8]) -> I2cResult {
self.transfer(addr, &[], buf)
}
}
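As a usage sketch, a mock bus implementing the trait shows the write-then-read register pattern the default methods build on. The trait and result enum are restated in minimal form so the example is self-contained; `MockSensor` and its register-window behavior are hypothetical:

```rust
use std::sync::Mutex;

/// Minimal restatement of the result enum and trait from above.
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
pub enum I2cResult {
    Ok,
    NoAck,
}

pub trait I2cBus: Send + Sync {
    /// Combined write-then-read (I2C repeated START).
    fn transfer(&self, addr: u8, write: &[u8], read: &mut [u8]) -> I2cResult;
}

/// Mock device at address 0x48 with 256 eight-bit registers. The first
/// written byte selects the register, mimicking the common
/// write-register-address-then-read pattern.
pub struct MockSensor {
    pub regs: Mutex<[u8; 256]>,
}

impl I2cBus for MockSensor {
    fn transfer(&self, addr: u8, write: &[u8], read: &mut [u8]) -> I2cResult {
        if addr != 0x48 {
            return I2cResult::NoAck; // only one device on this mock bus
        }
        let regs = self.regs.lock().unwrap();
        let base = write.first().copied().unwrap_or(0) as usize;
        for (i, b) in read.iter_mut().enumerate() {
            *b = regs[(base + i) % 256]; // sequential read with wraparound
        }
        I2cResult::Ok
    }
}
```

A mock like this lets device drivers (hwmon chips, touchpads) be unit-tested on the host without controller hardware.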
/// C-ABI vtable for I2C bus controller operations, used at KABI boundaries.
/// When a Tier 1 HID/sensor driver needs to call the I2C bus controller (which
/// may be a separately-compiled Tier 0 module), it receives an `I2cDevice`
/// (below) containing a pointer to this vtable rather than an `Arc<dyn I2cBus>`.
#[repr(C)]
pub struct I2cBusVTable {
/// Vtable size in bytes. Always `core::mem::size_of::<I2cBusVTable>()` for
/// the implementing driver; receivers use it for version compatibility.
pub vtable_size: u64,
/// Combined write-then-read (I2C repeated START).
/// `ctx`: opaque per-bus context pointer (first arg to all operations).
pub transfer: unsafe extern "C" fn(
ctx: *mut c_void,
addr: I2cAddr,
write: *const u8,
write_len: u32,
read: *mut u8,
read_len: u32,
) -> I2cResult,
}
/// Handle to a device at a fixed address on a specific I2C bus.
/// Uses C-ABI compatible vtable pointer + opaque context instead of
/// `Arc<dyn I2cBus>` to allow use across KABI boundaries between separately
/// compiled Tier 0 bus controller and Tier 1 device driver modules.
pub struct I2cDevice {
/// Pointer to the bus controller's operation vtable. Points to a static
/// vtable allocated in the bus controller module; never null.
pub bus_ops: *const I2cBusVTable,
/// Opaque per-bus context pointer passed as the first argument to every
/// vtable function. Points to the controller driver's internal bus state.
pub bus_ctx: *mut c_void,
pub addr: I2cAddr,
}
impl I2cDevice {
pub fn read_reg(&self, reg: u8) -> Result<u8, I2cResult> {
let mut buf = [0u8];
// SAFETY: bus_ops and bus_ctx come from the bus controller at probe time.
let result = unsafe {
((*self.bus_ops).transfer)(
self.bus_ctx, self.addr, &reg as *const u8, 1, buf.as_mut_ptr(), 1,
)
};
match result {
I2cResult::Ok => Ok(buf[0]),
e => Err(e),
}
}
pub fn write_reg(&self, reg: u8, val: u8) -> I2cResult {
let data = [reg, val];
// SAFETY: bus_ops and bus_ctx are valid; data is stack-local and valid for transfer duration.
unsafe {
((*self.bus_ops).transfer)(
self.bus_ctx, self.addr, data.as_ptr(), 2, core::ptr::null_mut(), 0,
)
}
}
/// Read a 16-bit little-endian register (common on SMBus devices).
pub fn read_reg16_le(&self, reg: u8) -> Result<u16, I2cResult> {
let mut buf = [0u8; 2];
// SAFETY: bus_ops/bus_ctx valid; buf is stack-local and valid for transfer duration.
let result = unsafe {
((*self.bus_ops).transfer)(
self.bus_ctx, self.addr, &reg as *const u8, 1, buf.as_mut_ptr(), 2,
)
};
match result {
I2cResult::Ok => Ok(u16::from_le_bytes(buf)),
e => Err(e),
}
}
}
Tier classification: I2C controller drivers are Tier 1 — they are platform-integrated and accessed from multiple other Tier 1 drivers (audio, sensor, battery). Device drivers using I2C (touchpads, sensors) follow their own tier classification based on their function.
Device enumeration: I2C devices are enumerated from ACPI (_HID, _CRS
with I2cSerialBusV2 resource) or device-tree compatible strings. The bus
manager matches each ACPI/DT node to a registered I2C device driver.
10.10.2 SMBus and Hardware Sensors
SMBus restricts I2C to well-defined transaction types (Quick Command, Send Byte,
Read Byte, Read Word, Block Read) and adds a PEC (Packet Error Code) byte for
data integrity. The UmkaOS SMBus layer wraps I2cBus and enforces SMBus
transaction semantics.
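The PEC byte is a CRC-8 over the addressed transaction bytes (polynomial x^8 + x^2 + x + 1, i.e. 0x07; initial value 0x00; no bit reflection), covering the slave address with R/W bit plus all command and data bytes. A bitwise sketch of the checksum the SMBus layer would append and verify:

```rust
/// SMBus Packet Error Code: CRC-8 with polynomial 0x07, init 0x00,
/// no reflection, no final XOR (the standard CRC-8/SMBUS parameters).
pub fn smbus_pec(bytes: &[u8]) -> u8 {
    let mut crc = 0u8;
    for &b in bytes {
        crc ^= b;
        for _ in 0..8 {
            // Shift left; on carry-out, fold in the polynomial.
            crc = if crc & 0x80 != 0 { (crc << 1) ^ 0x07 } else { crc << 1 };
        }
    }
    crc
}
```

On a Write Byte transaction, for example, the PEC is computed over `[addr_w, command, data]` and sent as a fourth byte; the device recomputes it and NACKs the transfer on mismatch.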
10.10.2.1 Hardware Monitoring (hwmon) Interface
Server and workstation motherboards expose dozens of sensors over I2C/SMBus.
UmkaOS provides a HwmonDevice trait analogous to Linux's hwmon subsystem:
/// A hardware monitor device (temperature, voltage, fan, current sensors).
pub trait HwmonDevice: Send + Sync {
/// Device name (e.g., "nct6779", "ina3221", "max31790").
fn name(&self) -> &str;
/// Read a temperature sensor in millidegrees Celsius.
/// Returns None if the sensor index is not present.
fn temperature_mc(&self, index: u8) -> Option<i32>;
/// Read a fan speed in RPM.
fn fan_rpm(&self, index: u8) -> Option<u32>;
/// Read a voltage in millivolts.
fn voltage_mv(&self, index: u8) -> Option<i32>;
/// Read a current in milliamperes.
fn current_ma(&self, index: u8) -> Option<i32>;
/// Set a fan PWM duty cycle (0–255).
fn set_fan_pwm(&self, index: u8, pwm: u8) -> Result<(), I2cResult>;
}
Registered HwmonDevice instances are exposed via sysfs under
/sys/class/hwmon/hwmon<N>/. Userspace daemons (fancontrol, lm-sensors,
IPMI daemons, monitoring agents like Prometheus node-exporter) read these
paths without kernel modifications. UmkaOS's hwmon sysfs layout is compatible
with Linux's hwmon ABI.
10.10.2.2 PMBus (Power Management Bus)
PMBus is a layered protocol over SMBus for communicating with power conversion devices (VRMs, PSUs, battery chargers). PMBus defines a standardised command set (PMBUS_READ_VIN, PMBUS_READ_VOUT, PMBUS_READ_IOUT, PMBUS_READ_TEMPERATURE_1, etc.) with standardised data formats.
The UmkaOS PMBus driver:
1. Probes devices via ACPI/DT with pmbus compatible string.
2. Reads PMBUS_MFR_ID, PMBUS_MFR_MODEL for identification.
3. Registers a HwmonDevice exposing all PMBus telemetry channels.
4. Monitors STATUS_WORD for fault conditions (over-voltage, over-current,
over-temperature, fan fault) and posts HwmonFaultEvent to the event
subsystem (Section 6.6, 06-scheduling.md) so userspace daemons can react.
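Most of the PMBus read commands above (READ_VIN, READ_IOUT, READ_TEMPERATURE_1) return values in the LINEAR11 format: an 11-bit two's-complement mantissa Y in bits [10:0] and a 5-bit two's-complement exponent N in bits [15:11], with value = Y × 2^N. A decoding sketch into integer milli-units (READ_VOUT uses a separate, VOUT_MODE-dependent format not shown here):

```rust
/// Decode a PMBus LINEAR11 word into milli-units (e.g., millivolts for
/// READ_VIN, milliamps for READ_IOUT).
pub fn linear11_to_milli(raw: u16) -> i64 {
    // Sign-extend the 5-bit exponent from bits [15:11].
    let exp = ((raw as i16) >> 11) as i32;
    // Sign-extend the 11-bit mantissa from bits [10:0].
    let mant = (((raw & 0x07FF) << 5) as i16 >> 5) as i64;
    let milli = mant * 1000;
    if exp >= 0 {
        milli << exp
    } else {
        milli >> (-exp) // arithmetic shift: rounds toward negative infinity
    }
}
```

Integer milli-units keep the hwmon path float-free, matching the millivolt/milliamp units of the HwmonDevice trait above.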
10.10.2.3 DIMM SPD and Thermal Sensors
DDR4/DDR5 DIMMs have an SPD EEPROM at I2C address 0x50–0x57 (slot-indexed). The memory controller driver reads SPD at boot for training parameters. DDR4 DIMMs also expose a thermal sensor at address 0x18–0x1F per the JEDEC JC42.4 interface (TSE2004-class combined EEPROM and thermal sensor parts).
/// SPD EEPROM read (partial — first 256 bytes sufficient for JEDEC training).
pub fn read_spd(bus: &dyn I2cBus, slot: u8) -> Result<[u8; 256], I2cResult> {
    let addr = 0x50u8 | (slot & 0x07);
    let mut buf = [0u8; 256];
    // SPD page select not needed for first 256 bytes on DDR4.
    // `I2cResult` is a plain status enum, not a `Result`, so map it explicitly
    // instead of using `?`.
    match bus.transfer(addr, &[0x00], &mut buf) {
        I2cResult::Ok => Ok(buf),
        e => Err(e),
    }
}
/// DDR4 thermal sensor read (JC42.4 TS register 0x05: 13-bit two's complement,
/// 0.0625°C LSB, flag bits in [15:13]).
pub fn read_dimm_temp_mc(bus: &dyn I2cBus, slot: u8) -> Result<i32, I2cResult> {
    let addr = 0x18u8 | (slot & 0x07);
    let mut buf = [0u8; 2];
    // JEDEC JC42.4 thermal sensors transmit MSB first (big-endian).
    match bus.transfer(addr, &[0x05], &mut buf) {
        I2cResult::Ok => {}
        e => return Err(e),
    }
    let raw = u16::from_be_bytes(buf);
    // Mask the flag bits [15:13], then sign-extend the 13-bit value from bit 12.
    let temp_raw = ((((raw & 0x1FFF) << 3) as i16) >> 3) as i32;
    Ok(temp_raw * 625 / 10) // 1/16°C units → millidegrees Celsius
}
10.10.3 I2C-HID Protocol
I2C-HID (HID over I2C, HIDI2C v1.0 specification) is used for touchpads, touchscreens, fingerprint readers, and other HID devices with I2C interfaces. The kernel implements the transport layer; HID report parsing is shared with the USB HID stack (Section 10.9.1).
Protocol flow:
1. ACPI reports device with PNP0C50 (_HID) or ACPI0C50; _CRS provides
I2C address, IRQ GPIO line, and descriptor register address.
2. Driver reads HID descriptor (30 bytes) from the descriptor register.
3. Driver reads HID Report Descriptor and passes it to the shared HidParser.
4. Device asserts IRQ GPIO (falling edge) when a new input report is ready.
5. ISR: reads input report from the input register address specified in
descriptor; parses via HidParser; posts InputEvent to the input
subsystem ring buffer (Section 20.1, 20-user-io.md).
#[repr(C, packed)]
pub struct I2cHidDescriptor {
pub length: u16, // Must be 30 (per HIDI2C v1.0 spec)
pub bcd_version: u16, // 0x0100 for v1.0
pub report_desc_len: u16,
pub report_desc_reg: u16,
pub input_reg: u16,
pub max_input_len: u16,
pub output_reg: u16,
pub max_output_len: u16,
pub cmd_reg: u16,
pub data_reg: u16,
pub vendor_id: u16,
pub product_id: u16,
pub version_id: u16,
    /// Reserved: the HIDI2C v1.0 descriptor is 30 bytes total; the 13 u16
    /// fields above account for 26 bytes, and the spec reserves the final
    /// 4 bytes. With #[repr(C, packed)] the struct is exactly 30 bytes —
    /// when reading from the device, read exactly 30 bytes into this struct.
    pub _reserved: [u8; 4],
}
HID parser security bounds (all input from the HID device — USB or I2C — is UNTRUSTED):
/// Maximum HID report descriptor byte length.
/// USB HID spec §6.2.1 recommends keeping descriptors under 4096 bytes.
/// UmkaOS enforces this as a hard limit to prevent parser state explosion
/// from untrusted (potentially malicious) USB devices.
pub const HID_REPORT_DESC_MAX_BYTES: usize = 4096;
/// Maximum number of usage/field items per HID report ID.
/// Limits parser memory to HID_MAX_FIELDS_PER_REPORT × sizeof(HidField) per report.
pub const HID_MAX_FIELDS_PER_REPORT: usize = 256;
/// Maximum number of report descriptors per HID device.
/// (Enforced structurally by ArrayVec<HidReport, HID_MAX_REPORTS>.)
pub const HID_MAX_REPORTS: usize = 16;
HID descriptor parsing error handling (all input is UNTRUSTED — it comes from the device):
- Descriptor exceeds HID_REPORT_DESC_MAX_BYTES → return Err(HidError::DescriptorTooLong)
- Unknown item tag → skip item per USB HID §6.2.2.7 (long-item skipping) and continue
parsing (permissive, for hardware compatibility with quirky devices)
- Fields exceed HID_MAX_FIELDS_PER_REPORT → truncate excess fields, log KERN_WARNING
- report_count × report_size overflows u32 → return Err(HidError::ReportSizeOverflow)
- Descriptor ends mid-item → return Err(HidError::TruncatedDescriptor)
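The bounds above apply at the item-framing layer of the report descriptor: short items carry a 1-byte prefix encoding size (bits [1:0]), type (bits [3:2]), and tag (bits [7:4]), while long items (prefix 0xFE) are skipped. A sketch of bounds-checked item iteration (framing only; building HidReport/HidField state from the items is omitted):

```rust
/// Decode the short items of a HID report descriptor as (tag, type, data)
/// tuples. Long items (prefix 0xFE) are skipped per USB HID §6.2.2.7.
pub fn hid_short_items(desc: &[u8]) -> Result<Vec<(u8, u8, u32)>, &'static str> {
    if desc.len() > 4096 {
        return Err("DescriptorTooLong"); // HID_REPORT_DESC_MAX_BYTES
    }
    let mut items = Vec::new();
    let mut i = 0;
    while i < desc.len() {
        let prefix = desc[i];
        if prefix == 0xFE {
            // Long item: bDataSize at i+1, bLongItemTag at i+2, then payload.
            if i + 2 >= desc.len() {
                return Err("TruncatedDescriptor");
            }
            i += 3 + desc[i + 1] as usize; // skip and continue (permissive)
            if i > desc.len() {
                return Err("TruncatedDescriptor");
            }
            continue;
        }
        let size = match prefix & 0x03 {
            3 => 4, // bSize encoding: 0, 1, 2 bytes, or 3 meaning 4 bytes
            s => s as usize,
        };
        if i + 1 + size > desc.len() {
            return Err("TruncatedDescriptor"); // item ends mid-descriptor
        }
        let mut data = 0u32;
        for (k, &b) in desc[i + 1..i + 1 + size].iter().enumerate() {
            data |= (b as u32) << (8 * k); // item data is little-endian
        }
        items.push((prefix >> 4, (prefix >> 2) & 0x03, data));
        i += 1 + size;
    }
    Ok(items)
}
```

Every advance of `i` is checked against the descriptor length before any byte is read, which is what turns a malicious descriptor into an `Err` rather than an out-of-bounds access.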
The full I2cHidDevice implementation and interrupt handler:
// umka-core/src/hid/i2c_hid.rs
/// I2C-HID driver state.
pub struct I2cHidDevice {
/// I2C device handle.
pub i2c: I2cDevice,
/// Descriptor (fetched at probe time).
pub desc: I2cHidDescriptor,
/// Interrupt GPIO line (from ACPI `_CRS` GpioInt resource).
pub irq_gpio: GpioLine,
/// HID report descriptor (fetched once at probe time). `Box<[u8]>` over
/// `Vec<u8>`: the slice is allocated at probe with the exact length returned
/// by the device and never resized. Prevents accidental reallocation if a
/// method on `Vec` is called after probe.
pub report_desc: Box<[u8]>,
/// Pre-allocated input report buffer sized to `desc.max_input_len` at probe.
/// `Box<[u8]>` over `Vec<u8>`: the fixed-capacity slice prevents reallocation
/// in interrupt context. The interrupt handler writes into `&mut report_buf[..]`
/// via a pre-sized slice — no heap allocation occurs during IRQ handling.
pub report_buf: Box<[u8]>,
/// Parsed HID report parser state. Parses a HID report descriptor
/// (sequence of items per USB HID spec Section 6.2.2) into a structured
/// representation of reports, fields, and usages.
/// Bounds enforced: HID_MAX_REPORTS, HID_MAX_FIELDS_PER_REPORT, HID_REPORT_DESC_MAX_BYTES.
///
/// ```rust
/// pub struct HidParser {
/// /// Parsed report descriptors, indexed by report ID.
/// pub reports: ArrayVec<HidReport, HID_MAX_REPORTS>,
/// }
/// pub struct HidReport {
/// pub report_id: u8,
/// pub report_type: HidReportType, // Input, Output, Feature
/// pub fields: ArrayVec<HidField, HID_MAX_FIELDS_PER_REPORT>,
/// pub total_bits: u32,
/// }
/// pub struct HidField {
/// pub usage_page: u16,
/// pub usage_min: u16,
/// pub usage_max: u16,
/// pub logical_min: i32,
/// pub logical_max: i32,
/// pub bit_offset: u32,
/// pub bit_size: u32,
/// pub count: u32,
/// pub flags: u32, // Variable, Array, Absolute, Wrap, etc.
/// }
/// ```
pub parser: HidParser,
}
impl I2cHidDevice {
    /// Combined write-then-read via the bus controller vtable (same pattern
    /// as `I2cDevice::read_reg` above; `I2cDevice` has no `bus` field, only
    /// the `bus_ops`/`bus_ctx` pair).
    fn xfer(i2c: &I2cDevice, write: &[u8], read: &mut [u8]) -> Result<(), ProbeError> {
        // SAFETY: bus_ops and bus_ctx come from the bus controller at probe
        // time; both slices remain valid for the duration of the transfer.
        let r = unsafe {
            ((*i2c.bus_ops).transfer)(
                i2c.bus_ctx, i2c.addr,
                write.as_ptr(), write.len() as u32,
                read.as_mut_ptr(), read.len() as u32,
            )
        };
        match r {
            I2cResult::Ok => Ok(()),
            e => Err(ProbeError::I2c(e)), // illustrative variant wrapping the bus error
        }
    }
    /// Probe an I2C-HID device. Called when ACPI reports `PNP0C50` (I2C-HID).
    /// Returns the device behind `Arc<SpinLock<..>>` (SpinLock: the kernel's
    /// IRQ-safe spinlock) so the interrupt closure shares, rather than
    /// consumes, the state it needs.
    pub fn probe(i2c: I2cDevice, irq_gpio: GpioLine) -> Result<Arc<SpinLock<Self>>, ProbeError> {
        // Read the 30-byte HID descriptor from register 0x0001 (register
        // numbers are little-endian on the wire, hence the [0x01, 0x00] write).
        let mut desc_buf = [0u8; 30];
        Self::xfer(&i2c, &[0x01, 0x00], &mut desc_buf)?;
        // SAFETY: I2cHidDescriptor is #[repr(C, packed)], matching the 30-byte
        // wire format. read_unaligned is required because the I2C transfer
        // buffer may not be 2-byte aligned.
        let desc: I2cHidDescriptor =
            unsafe { core::ptr::read_unaligned(desc_buf.as_ptr() as *const _) };
        // Read the HID report descriptor (length bounded by
        // HID_REPORT_DESC_MAX_BYTES, enforced inside HidParser::parse).
        let mut report_desc = vec![0u8; desc.report_desc_len as usize];
        Self::xfer(&i2c, &desc.report_desc_reg.to_le_bytes(), &mut report_desc)?;
        // Parse the HID report descriptor to build the parser.
        let parser = HidParser::parse(&report_desc)?;
        // Pre-allocate the input report buffer; the IRQ path must not allocate.
        let report_buf = vec![0u8; desc.max_input_len as usize].into_boxed_slice();
        let dev = Arc::new(SpinLock::new(Self {
            i2c,
            desc,
            irq_gpio,
            report_desc: report_desc.into_boxed_slice(),
            report_buf,
            parser,
        }));
        // Register the interrupt handler after construction: the closure
        // captures an Arc clone, so the state is shared with the caller
        // instead of being moved into the closure and becoming unusable.
        let irq_dev = Arc::clone(&dev);
        dev.lock().irq_gpio.enable_interrupt(GpioInterruptMode::FallingEdge, move || {
            irq_dev.lock().handle_interrupt();
        })?;
        Ok(dev)
    }
    /// Interrupt handler: read HID report, parse, deliver events.
    /// Writes into the pre-allocated `report_buf` — no heap allocation here.
    fn handle_interrupt(&mut self) {
        let reg_bytes = self.desc.input_reg.to_le_bytes();
        if Self::xfer(&self.i2c, &reg_bytes, &mut self.report_buf).is_err() {
            return; // Ignore read errors (spurious interrupt or device glitch).
        }
        // Parse HID report → InputEvent structs.
        for event in self.parser.parse_input_report(&self.report_buf) {
            umka_input::post_event(event); // Input subsystem ring buffer (Section 20.1).
        }
    }
}
10.10.4 Precision Touchpad (PTP)
Windows Precision Touchpad devices use HID Usage Page 0x0D (Digitizers), Usage 0x05 (Touch Pad). The HID report contains:
- Contact count: Number of active touches (0-10+).
- Per-contact data: X/Y position (absolute, in logical units), contact width/height, pressure, contact ID.
- Button state: Physical button click (if present), pad click (tap-to-click handled in userspace).
// umka-core/src/hid/touchpad.rs
/// Parsed Precision Touchpad report.
pub struct PtpReport {
/// Number of active contacts.
pub contact_count: u8,
/// Per-contact data (up to 10 simultaneous touches).
pub contacts: [PtpContact; 10],
/// Button state (bit 0 = left button, bit 1 = right button).
pub buttons: u8,
}
/// Single touch contact on a Precision Touchpad.
#[derive(Clone, Copy)]
pub struct PtpContact {
/// Contact ID (persistent across reports while finger is down).
pub id: u8,
/// Tip switch (1 = finger down, 0 = finger lifted).
pub tip: bool,
/// X position (logical units, 0 = left edge).
pub x: u16,
/// Y position (logical units, 0 = top edge).
pub y: u16,
/// Width (logical units, or 0 if not reported).
pub width: u16,
/// Height (logical units, or 0 if not reported).
pub height: u16,
}
Gesture recognition: Kernel delivers raw multi-touch HID reports via the input ring buffer. Gesture recognition (palm rejection, tap-to-click, multi-finger swipes) is handled by a userspace input library (libinput or equivalent).