Chapter 10: Driver Architecture and Isolation
Three-tier protection model, isolation mechanisms, KABI, driver model, device registry, zero-copy I/O, IPC, crash recovery, driver subsystems
10.1 Three-Tier Protection Model
UmkaOS organizes code into three driver tiers — the UmkaOS Core microkernel (Tier 0), Tier 1 kernel-adjacent drivers, and Tier 2 user-space drivers — plus standard user space. The "three tiers" refer to the three levels at which kernel/driver code executes (Core, Tier 1, Tier 2); user space is not counted as a tier because it uses the standard Linux process model unchanged.
A fourth class, Tier M (Multikernel Peer), emerges when attached hardware runs its own UmkaOS kernel instance or an UmkaOS-compatible shim. Tier M is not a tier within a single UmkaOS instance — it is a physically separate execution environment with isolation stronger than Tier 2 and near-zero host-side driver complexity. See Section 10.1.2.
+======================================================================+
| UmkaOS CORE (Ring 0) |
| Microkernel: Rust + C/asm for arch boot |
| |
| - Capability manager - Physical memory allocator |
| - Thread/process management - Scheduler (CFS/EEVDF + RT + DL) |
| - IPC primitives - MMU / IOMMU programming |
| - Interrupt routing - vDSO maintenance |
| - Virtual memory manager - Page cache |
| - Timer management - Linux syscall interface |
+======================================================================+
| MPK switch (~23 cycles) | Shared memory (0 copies)
v v
+======================================================================+
| TIER 1: Kernel-Adjacent Drivers |
| Ring 0, MPK-isolated (Intel Memory Protection Keys) |
| |
| - NVMe, AHCI/SATA - High-perf NICs (Intel, Mellanox) |
| - TCP/IP + UDP stack - GPU compute drivers |
| - Block I/O layer - Filesystem impls (ext4, XFS, btrfs) |
| - VirtIO drivers - Crypto subsystem |
| - KVM hypervisor (*) - Netfilter/nftables engine |
+======================================================================+
(*) KVM runs as a Tier 1 driver with extended hardware privileges (KvmHardwareCapability),
which authorizes umka-core to execute VMX/VHE/H-extension operations on KVM's behalf
via a validated VMX/VHE trampoline. KVM retains full Tier 1 crash-recovery semantics.
See Section 18.1.4.5 for the full classification rationale.
| Address-space switch | IOMMU-isolated
| (~200-500 cycles, PCID/ASID) | DMA fencing
v v
+======================================================================+
| TIER 2: User-Space Drivers |
| Ring 3, separate address space, IOMMU-protected DMA |
| |
| - USB drivers - Audio (HDA, USB Audio) |
| - Input devices - Bluetooth, WiFi control plane |
| - Printers, scanners - Third-party / vendor drivers |
| - Display server drivers - Non-performance-critical devices |
+======================================================================+
| Standard Linux syscall interface (100% compatible)
v
+======================================================================+
| USER SPACE (Ring 3) |
| Unmodified Linux binaries: glibc, musl, systemd, Docker, K8s, etc. |
+======================================================================+
════════════ Hardware Fabric Boundary (PCIe / CXL / coherent on-chip) ════════════
+======================================================================+
| TIER M: Multikernel Peer (separate kernel instance) |
| Own CPU complex · own memory · UmkaOS kernel or UmkaOS-compatible shim |
| |
| - SmartNIC / DPU (BlueField, Pensando, Marvell OCTEON) |
| - Computational storage SoC (Arm Cortex-R, Zynq UltraScale+) |
| - On-chip hardware partition (ARM CCA Realm, RISC-V WorldGuard) |
| - GPU / NIC with UmkaOS-compatible firmware shim |
+======================================================================+
(*) Tier M: deployment-time property — 0 to N peers per host.
Host representation: umka-peer-transport (~2,000 lines, device-agnostic).
Complexity management — The core-of-core (scheduler + memory + caps + IPC) should be as small as feasible. For reference: seL4's verified microkernel is ~10K SLOC (but provides far fewer services), QNX's microkernel is ~100K, and the Zircon kernel (Fuchsia) is ~200K. Any subsystem that grows beyond the minimum necessary for its function should be re-evaluated for extraction to Tier 1.
10.1.1 How the Tiers Interact
UmkaOS Core to Tier 1: The core switches the MPK protection domain via WRPKRU (a
single unprivileged instruction, approximately 23 cycles). Both run in Ring 0 and share
the same address space, but MPK keys prevent a Tier 1 driver from reading or writing
memory belonging to the core or to other Tier 1 domains. Communication uses shared-memory
ring buffers -- zero copies, zero transitions for data.
UmkaOS Core to Tier 2: Standard process-based isolation. Tier 2 drivers run in Ring 3 with their own address space. Communication uses mapped shared-memory rings for data (zero copy) and lightweight syscall-based notifications. IOMMU restricts DMA to driver-allocated regions.
How Tier 2 zero-copy works — dual physical page mapping: The shared ring buffers between UmkaOS Core and a Tier 2 driver are backed by a single set of physical pages mapped into two virtual address spaces simultaneously. The kernel side holds a VmArea covering these pages (with VM_SHARED | VM_IO flags) and accesses them through its own virtual address. The Tier 2 driver side calls mmap(UMKA_RING_FD, ...) on a special file descriptor issued at driver registration; the kernel maps the same physical frames into the Tier 2 process address space as a read-write shared mapping. No copy occurs on either side: the kernel writes to its virtual address and the Tier 2 driver reads from its virtual address, both resolving to the same physical frames. The shared region is bounded to the ring buffer size; the Tier 2 driver cannot access kernel memory outside the mapped ring (enforced by VMA boundaries and IOMMU). Cache coherency is guaranteed by the CPU coherency protocol on x86 and ARM; non-coherent platforms add a memory fence before and after ring accesses.
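A minimal sketch of the ring discipline that runs over those doubly-mapped pages — a single-producer/single-consumer queue with release/acquire index publication. A plain array stands in for the shared physical frames, and all names (Ring, push, pop) are illustrative rather than the actual UmkaOS ring API:

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

// Illustrative SPSC ring over memory visible to both sides. In UmkaOS the
// backing store would be the doubly-mapped physical frames; here a plain
// array stands in for them. Names are hypothetical, not the real API.
const SLOTS: usize = 8;

struct Ring {
    head: AtomicUsize, // next slot the producer (kernel side) writes
    tail: AtomicUsize, // next slot the consumer (Tier 2 driver) reads
    slots: [u64; SLOTS], // descriptor payloads, one per slot
}

impl Ring {
    fn new() -> Self {
        Ring { head: AtomicUsize::new(0), tail: AtomicUsize::new(0), slots: [0; SLOTS] }
    }

    // Producer side: fill the slot, then advance head with Release ordering
    // so the consumer observes the slot contents before the new index.
    fn push(&mut self, desc: u64) -> bool {
        let head = self.head.load(Ordering::Relaxed);
        let tail = self.tail.load(Ordering::Acquire);
        if head - tail == SLOTS {
            return false; // ring full — producer must back off
        }
        self.slots[head % SLOTS] = desc;
        self.head.store(head + 1, Ordering::Release);
        true
    }

    // Consumer side: Acquire on head pairs with the producer's Release.
    fn pop(&mut self) -> Option<u64> {
        let tail = self.tail.load(Ordering::Relaxed);
        if tail == self.head.load(Ordering::Acquire) {
            return None; // ring empty
        }
        let desc = self.slots[tail % SLOTS];
        self.tail.store(tail + 1, Ordering::Release);
        Some(desc)
    }
}

fn main() {
    let mut ring = Ring::new();
    assert!(ring.push(0xABCD));
    assert_eq!(ring.pop(), Some(0xABCD));
    assert_eq!(ring.pop(), None);
}
```

Because both indices and slots live inside the bounded mapped region, neither side needs to trust the other beyond the ring protocol itself, which is exactly what the VMA/IOMMU bounds enforce.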
Tier 1/2 to User Space: No direct interaction for control paths — all user-space requests go through UmkaOS Core's syscall layer, which dispatches to the appropriate tier. However, the data path does allow direct shared memory: UmkaOS Core sets up shared ring buffers (Section 10.6) that are mapped into both the driver and user-space address spaces. Once established, data flows through these rings without UmkaOS Core mediation (zero-copy). UmkaOS Core mediates only the ring setup, teardown, and error paths.
UmkaOS Core to Tier M (Peer Kernel): Communication uses typed capability channels
over the hardware fabric — PCIe P2P ring buffers, CXL shared memory, or coherent
on-chip SRAM. No UmkaOS Core data paths cross the hardware boundary. The host-side
umka-peer-transport module (~2,000 lines) manages cluster membership, capability
negotiation, and crash recovery. The peer kernel runs its own scheduler, memory
manager, and capability space independently; the host CPU is not in the device's
data path.
10.1.2 Tier M: Multikernel Peer Isolation
Tier M describes the isolation class of devices running their own UmkaOS kernel instance (or UmkaOS-compatible shim) as cluster peers. The three-tier model (Tiers 0–2) describes isolation within a single UmkaOS instance. Tier M is a between-kernel isolation class.
Isolation properties:
- No shared kernel address space. Tiers 0–2 all execute within the host UmkaOS kernel (Ring 0 or Ring 3) and share kernel data structures at various depths. A Tier M peer has an entirely separate address space, CPU state, and capability namespace. The host UmkaOS Core never maps peer memory.
- Hardware boundary, not software policy. Tier 1 isolation (MPK/POE/DACR) and Tier 2 isolation (IOMMU) are policies enforced by software-programmable hardware registers — a sufficiently privileged exploit can alter them. The Tier M boundary is a hardware fabric (PCIe, CXL, on-chip partition fence); crossing it requires physical access or firmware compromise of the device, a categorically different threat model.
- Isolation stronger than Tier 2. Tier 2 is Ring 3 + IOMMU on the host — the driver still shares the host kernel for syscall dispatch, signal delivery, and page table management. A Tier M peer shares none of these. The only communication surface is the typed capability channel, a significantly smaller attack surface.
- Ordered crash recovery. On peer kernel crash: IOMMU lockout and PCIe bus master disable within 2ms, then distributed state cleanup, then optional FLR and reboot. The host kernel never panics; applications see a brief I/O stall. See Section 5.1.3.
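As a sketch, the ordered recovery sequence from the last bullet can be written down as data, which makes the DMA-fencing-before-cleanup ordering explicit (step names are illustrative, not kernel APIs):

```rust
// Hypothetical sketch of the ordered Tier M peer-crash sequence described
// above. Step names mirror the text; none of these are real UmkaOS APIs.
#[derive(Debug, PartialEq)]
enum RecoveryStep {
    IommuLockout,       // revoke all DMA mappings for the peer (within 2ms)
    BusMasterDisable,   // clear PCIe Bus Master Enable — device can no longer DMA
    DistributedCleanup, // tear down capability channels, fail outstanding I/O
    FunctionLevelReset, // optional: FLR and peer kernel reboot
}

fn recover_peer(attempt_reboot: bool) -> Vec<RecoveryStep> {
    let mut steps = vec![
        RecoveryStep::IommuLockout,
        RecoveryStep::BusMasterDisable,
        RecoveryStep::DistributedCleanup,
    ];
    if attempt_reboot {
        steps.push(RecoveryStep::FunctionLevelReset);
    }
    steps // the host kernel never panics; applications see a bounded I/O stall
}

fn main() {
    let steps = recover_peer(true);
    // DMA must be fenced before any distributed state cleanup runs.
    assert_eq!(steps[0], RecoveryStep::IommuLockout);
    assert_eq!(steps.len(), 4);
}
```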
Performance properties:
- Host-side complexity: ~2,000 lines of device-agnostic umka-peer-transport regardless of device class, vs. 100K–700K lines of device-specific Ring 0 driver code for equivalent traditional devices.
- Host CPU out of data path: the peer kernel manages its own scheduler and I/O independently. Host CPU overhead is proportional to control-path events (peer joins, leaves, crashes, capability renegotiation), not to data throughput.
- Communication latency by fabric:
| Fabric | Example hardware | Round-trip latency |
|---|---|---|
| PCIe P2P | Discrete SmartNIC, DPU | ~1–2 μs |
| CXL coherent | Attached memory/compute expander | ~100–300 ns |
| On-chip hardware partition | ARM CCA Realm, RISC-V WorldGuard | ~10–50 ns |
Hardware forms:
| Form | Examples | Notes |
|---|---|---|
| Discrete device | BlueField-3, Pensando Elba, Marvell OCTEON 10 | Full UmkaOS kernel port |
| Computational storage | Arm Cortex-R NVMe SoC, Zynq UltraScale+ | Full UmkaOS kernel port |
| On-chip partition | ARM CCA Realm (Neoverse V3+), RISC-V WorldGuard | Same physical package; coherent fabric |
| UmkaOS shim | GPU or NIC firmware implementing the UmkaOS peer protocol | Device need not run the full kernel |
The on-chip partition form — ARM CCA Realms (shipping in Neoverse V3+, Cortex-X4+) and RISC-V WorldGuard (specification in progress) — is conceptually identical to the discrete device form. The architectural pattern is the same: a separate UmkaOS instance, a hardware-enforced boundary, typed capability channels as the sole communication surface. The difference is physical proximity, which determines communication latency but not the isolation model.
Availability: Tier M is a deployment-time property. A host may have zero, one, or
many Tier M peers depending on attached hardware. The host kernel supports the
traditional driver model (Tiers 0–2) and the peer model simultaneously with no
configuration distinction — umka-peer-transport loads on demand when a peer device
is detected.
10.2 Isolation Mechanisms and Performance Modes
Hardware-assisted memory isolation enables UmkaOS's three-tier driver model (Section 10.1) at near-zero overhead on platforms with hardware isolation support (x86 MPK ~23 cycles, ARMv7 DACR ~10-20 cycles, AArch64 page-table ~150-300 cycles). On RISC-V, Tier 1 isolation is not available — Tier 1 drivers run as Tier 0 (in-kernel, fully trusted) until RISC-V hardware provides suitable isolation primitives. This section covers the mechanisms, their costs, the threat model, and the adaptive policy that lets UmkaOS run on hardware ranging from x86_64 with MPK (~23 cycles) to RISC-V, where Tier 1 is absent entirely. Isolation is one of eight core capabilities — see Section 1.1 for the full list.
10.2.1 Isolation Philosophy: Best Effort Within Performance Budget
Key principle: Driver isolation in UmkaOS is not a single fixed design point. It is a spectrum that varies across hardware architectures, and the approach is deliberately "best effort within the performance budget" rather than "maximum isolation everywhere."
Why this matters:
- Hardware capability varies widely: x86_64 has MPK (16 domains, ~23 cycles). AArch64 uses page-table + ASID isolation (~150-300 cycles) as the standard mechanism on all current deployed hardware (Graviton 2/3/4, Neoverse V1/V2, Ampere Altra, Kunpeng 920). POE (ARMv8.9+/ARMv9.4+, ~40-80 cycles) is an optional hardware acceleration available on newer silicon (Neoverse V3+, Cortex-X4+) that provides 2-4x speedup when present. ARMv7 has DACR (16 domains, ~10-20 cycles). RISC-V has no suitable isolation mechanism — Tier 1 is not available on RISC-V. A design that mandates uniform isolation would either (a) impose unacceptable overhead on some architectures, or (b) fail to leverage better isolation on architectures that support it.
- Performance is a requirement, not a nice-to-have: The 5% overhead target is non-negotiable. UmkaOS must be a drop-in replacement for Linux — if I/O latency increases by 20%, users will not adopt it regardless of how strong the isolation is.
- The escape hatch always exists: Any Tier 1 driver can be demoted to Tier 2 (full process isolation) at any time — via per-driver manifest, sysfs knob, or automatic crash-count policy. If an administrator values isolation over performance for a specific workload or hardware configuration, that choice is always available. The tradeoff is explicit and user-controlled.
- This is not a bug, it's a feature: Some reviewers may see varying isolation strength across architectures as a "flaw" or "inconsistency." It is neither. It is an honest acknowledgment of hardware reality. The alternative — pretending all architectures have identical isolation capabilities, or mandating full process isolation everywhere (and accepting 20-50% overhead) — would make UmkaOS impractical for its intended use case as a Linux replacement.
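The escape-hatch bullet above mentions an automatic crash-count policy. A minimal sketch of such a policy follows, with an assumed threshold of 3 crashes per 60-second window and hypothetical names (CrashPolicy, record_crash) — the actual UmkaOS policy is specified elsewhere:

```rust
use std::collections::VecDeque;

// Illustrative crash-count demotion policy: after `max_crashes` crashes
// within `window_secs`, a Tier 1 driver is pinned to Tier 2. The threshold
// and names are assumptions for this sketch, not the shipped policy.
struct CrashPolicy {
    window_secs: u64,
    max_crashes: usize,
    crash_times: VecDeque<u64>, // monotonic timestamps of recent crashes
}

impl CrashPolicy {
    fn new(window_secs: u64, max_crashes: usize) -> Self {
        CrashPolicy { window_secs, max_crashes, crash_times: VecDeque::new() }
    }

    // Record a crash at time `now`; returns true when the driver should be
    // demoted from Tier 1 to Tier 2 (full process isolation).
    fn record_crash(&mut self, now: u64) -> bool {
        self.crash_times.push_back(now);
        // Drop crashes that have aged out of the sliding window.
        while let Some(&t) = self.crash_times.front() {
            if now - t > self.window_secs {
                self.crash_times.pop_front();
            } else {
                break;
            }
        }
        self.crash_times.len() >= self.max_crashes
    }
}

fn main() {
    let mut policy = CrashPolicy::new(60, 3);
    assert!(!policy.record_crash(0));
    assert!(!policy.record_crash(10));
    assert!(policy.record_crash(20)); // third crash within 60s: demote
}
```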
The design contract:
| Hardware | Tier 1 Isolation | Overhead | Alternative |
|---|---|---|---|
| x86_64 with MPK | Strong (MPK domains) | ~1-2% | Demote to Tier 2 for stronger isolation |
| AArch64 (mainstream: page-table + ASID) | Moderate (page-table domains) | ~6-12% | Demote to Tier 2, or promote to Tier 0 for performance |
| AArch64 with POE (ARMv8.9+/ARMv9.4+) | Strong (POE indices) | ~2-4% | Demote to Tier 2 for stronger isolation |
| ARMv7 with DACR | Strong (DACR domains) | ~0.5-1% | Demote to Tier 2 for stronger isolation |
| RISC-V | Tier 1 unavailable — Tier 1 drivers run as Tier 0 | 0% overhead (no isolation boundary) | Demote to Tier 2 for isolation |
| PPC32/PPC64LE | Strong-Moderate | ~1-5% | Demote to Tier 2 for stronger isolation |
Summary: UmkaOS provides the best isolation the hardware can deliver within the performance budget, with a user-controlled escape hatch to stronger isolation (Tier 2) when security requirements exceed what the hardware can efficiently provide. This is a pragmatic engineering tradeoff, not a design flaw.
AArch64 deployment note: The global UmkaOS performance budget (≤5% overhead vs Linux) requires POE (ARMv8.9+/ARMv9.4-A, FEAT_S1POE) to be met with Tier 1 on AArch64. Without POE, page-table + ASID isolation costs 6-12% per domain switch, which exceeds the budget for high-throughput workloads (NVMe, network). On current mainstream AArch64 servers (Graviton 2/3/4, Neoverse V1/V2, Ampere Altra) that lack POE, operators have two options:
1. Use Tier 1 and accept the higher overhead (appropriate when crash containment is the priority and workloads have low I/O frequency — e.g., compute-heavy, GPU inference).
2. Prefer Tier 2 for I/O-intensive drivers (USB, SATA, fast storage) and promote only low-frequency drivers to Tier 1. This keeps per-request overhead within budget.
POE support detection is automatic at boot (ID_AA64MMFR3_EL1.S1POE). Operators can also force Tier 2 globally on AArch64 without POE via umka.tier1_aarch64=0.
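The operator decision above can be sketched as a small placement function. The 10,000-ops/sec cutover and all names (DriverProfile, choose_tier) are illustrative assumptions, not the shipped policy engine:

```rust
// Sketch of the per-driver tier decision on AArch64. Thresholds and names
// are assumptions for illustration, not the actual UmkaOS policy.
#[derive(Debug, PartialEq, Clone, Copy)]
enum Tier {
    Tier1,
    Tier2,
}

struct DriverProfile {
    io_ops_per_sec: u64, // expected request rate for this driver
}

fn choose_tier(poe_present: bool, force_tier2: bool, d: &DriverProfile) -> Tier {
    if force_tier2 {
        return Tier::Tier2; // umka.tier1_aarch64=0 forces Tier 2 globally
    }
    if poe_present {
        return Tier::Tier1; // ~40-80 cycles/switch: within the 5% budget
    }
    // Without POE, each round-trip pays page-table switches at ~150-300
    // cycles each, so keep Tier 1 only for low-frequency drivers.
    if d.io_ops_per_sec < 10_000 {
        Tier::Tier1
    } else {
        Tier::Tier2
    }
}

fn main() {
    let nvme = DriverProfile { io_ops_per_sec: 500_000 };
    let gpu_ctl = DriverProfile { io_ops_per_sec: 100 };
    assert_eq!(choose_tier(false, false, &nvme), Tier::Tier2);
    assert_eq!(choose_tier(false, false, &gpu_ctl), Tier::Tier1);
    assert_eq!(choose_tier(true, false, &nvme), Tier::Tier1);
}
```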
10.2.2 How MPK Works
Each page table entry contains a 4-bit protection key (PKEY), assigning the page to one
of 16 domains (0-15). The PKRU register holds per-domain read/write permission bits.
The WRPKRU instruction updates these permissions in approximately 23 cycles (measured:
~23 cycles on Skylake [libmpk, USENIX ATC '19], ~28 cycles on Skylake-SP [EPK, USENIX
ATC '22]) -- no TLB flush, no privilege transition, no system call.
10.2.3 Cost Comparison
| Mechanism | Cost per transition | Isolation strength | Used for |
|---|---|---|---|
| Function call | ~1-5 cycles | None | Linux monolithic |
| Intel MPK WRPKRU | ~23 cycles | Memory domain | Tier 1 drivers |
| Full IPC (seL4-style) | ~600-1000 cycles | Full address space | Too expensive |
| Address-space switch | ~200-600 cycles | Full process | Tier 2 drivers |
MPK gives meaningful isolation -- a Tier 1 driver cannot read or write kernel private data, other driver data, or memory in other MPK domains -- at only approximately 23 cycles per boundary crossing. Combined with IOMMU for DMA fencing, this is the foundation of our performance story.
10.2.4 MPK Domain Allocation
With 16 available domains (PKEY 0-15), the allocation strategy is:
| PKEY | Assignment |
|---|---|
| 0 | UmkaOS Core (kernel private data) |
| 1 | Shared read-only (ring buffer descriptors) |
| 2-13 | Tier 1 driver domains (12 available) |
| 14 | Shared DMA buffer pool |
| 15 | Guard / unmapped |
When more than 12 Tier 1 domains are needed, related drivers are grouped into the same domain (for example, all block drivers share one domain, all network drivers share another). This grouping is configurable via policy.
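The allocation table maps directly onto PKRU values. The sketch below computes a Tier 1 driver's PKRU from it using the architectural PKRU layout (bit 2k = access-disable, bit 2k+1 = write-disable for key k, per the Intel SDM); the helper names are illustrative:

```rust
// Compute a Tier 1 driver's PKRU value from the domain allocation table.
// PKRU layout (architectural): bit 2k = ADk (access disable), bit 2k+1 =
// WDk (write disable). Helper names are illustrative, not the real API.
const PKEY_CORE: u32 = 0; // UmkaOS Core private data — never granted
const PKEY_SHARED_RO: u32 = 1; // ring buffer descriptors, read-only
const PKEY_SHARED_DMA: u32 = 14; // shared DMA buffer pool

fn pkru_deny_all() -> u32 {
    0xFFFF_FFFF // AD|WD set for all 16 keys
}

fn grant_rw(pkru: u32, key: u32) -> u32 {
    pkru & !(0b11 << (2 * key)) // clear both AD and WD
}

fn grant_ro(pkru: u32, key: u32) -> u32 {
    (pkru & !(0b01 << (2 * key))) | (0b10 << (2 * key)) // clear AD, set WD
}

// PKRU for a Tier 1 driver assigned to `own_key` (2-13): read/write its own
// domain and the shared DMA pool, read-only shared descriptors, and no
// access to UmkaOS Core (PKEY 0) or any other domain.
fn tier1_pkru(own_key: u32) -> u32 {
    let mut pkru = pkru_deny_all();
    pkru = grant_rw(pkru, own_key);
    pkru = grant_rw(pkru, PKEY_SHARED_DMA);
    pkru = grant_ro(pkru, PKEY_SHARED_RO);
    pkru
}

fn main() {
    let pkru = tier1_pkru(2);
    // Own domain (key 2): both disable bits clear.
    assert_eq!((pkru >> 4) & 0b11, 0b00);
    // Core (key 0): fully disabled.
    assert_eq!((pkru >> (2 * PKEY_CORE)) & 0b11, 0b11);
    // Shared descriptors (key 1): readable but not writable.
    assert_eq!((pkru >> 2) & 0b11, 0b10);
}
```

A grouped domain (e.g., "all block drivers") simply means several drivers load with the same own_key, which is why grouping trades fault-isolation granularity for domain count.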
10.2.5 WRPKRU Threat Model: Crash Containment, Not Exploitation Prevention
Critical design constraint: WRPKRU is an unprivileged instruction. Any code
running in Ring 0 — including Tier 1 driver code — can execute WRPKRU to modify its
own MPK permission register, granting access to any MPK domain including UmkaOS Core
(PKEY 0). This means MPK isolation provides crash containment (preventing buggy
drivers from corrupting kernel memory) but does not provide exploitation
prevention (compromised Ring 0 code can execute WRPKRU to escape).
Security model — UmkaOS's Tier 1 isolation is designed to survive driver bugs, not driver exploitation. The rationale: the vast majority of kernel crashes are caused by bugs (null dereference, use-after-free, buffer overrun), not by attackers with arbitrary code execution inside a specific driver. For environments requiring defense against compromised Ring 0 code, Tier 2 (full process isolation) provides the strong boundary — at higher latency cost.
What MPK actually protects against:
- Accidental memory corruption: Null pointer dereferences, buffer overruns, and similar bugs that write to wrong addresses are contained — the hardware fault triggers before the driver can corrupt kernel memory.
- Crash recovery: When a driver faults, UmkaOS Core can safely restart it without system panic because driver memory is isolated from core state.
- Fault propagation containment: A bug in one Tier 1 driver cannot corrupt data belonging to other drivers or to UmkaOS Core.
What MPK does NOT protect against:
- Deliberate exploitation: An attacker who achieves arbitrary code execution within a Tier 1 driver can execute WRPKRU to escape isolation. The instruction is unprivileged by design and the sanctioned switch_domain() trampoline uses it legitimately — it cannot be detected or blocked.
- Runtime code injection: JIT code or ROP gadgets that contain WRPKRU can execute the instruction directly.
Driver signing — All Tier 1 drivers must be signed (Section 8.2). An attacker cannot load a malicious driver binary without a valid signature. The attack surface is limited to exploiting bugs in legitimately signed driver code. Combined with Rust's memory safety guarantees and standard Linux hardening (CFI, CET), this raises the bar for achieving arbitrary code execution, but does not eliminate the WRPKRU escape vector.
Tier 2 for exploitation-sensitive workloads — For environments where defense against compromised Ring 0 code is required, drivers should run at Tier 2 (full process isolation). The auto-demotion mechanism (Section 10.5.10.2) allows administrators to pin specific drivers to Tier 2 via policy, trading higher I/O latency for stronger isolation.
10.2.5.1 PKRU Write Elision (Mandatory)
The ~23-cycle WRPKRU cost is per instruction, not per domain crossing. When an I/O
path traverses multiple domains in sequence (e.g., NIC driver → TCP stack → socket
layer), a naive implementation issues a WRPKRU at every boundary — 6 writes for a
3-boundary round-trip. UnderBridge (Gu et al., USENIX ATC '20) demonstrated that many
of these writes are redundant and must be elided.
WRPKRU elision is a mandatory core design decision, not a deferred optimization.
Every WRPKRU instruction in UmkaOS goes through the switch_domain() trampoline
(defined below), which enforces shadow comparison before any hardware write. There
is no code path in the kernel that issues a raw WRPKRU without shadow checking —
this invariant is enforced at the API level (the x86::wrpkru() function is unsafe
and only called from switch_domain()).
The three elision techniques (all implemented from day one):
- Same-permission transition: if domain A and domain B both need read access to a shared buffer, and the only permission change is adding write access to B's private region, the WRPKRU write may be unnecessary if A's private region is already read-disabled. The key insight: WRPKRU sets all 16 domain permissions simultaneously — if the new permission bitmap happens to be identical to the current one, the write is redundant.
- Batched transitions: when crossing A → B → C in rapid succession (e.g., NIC driver → TCP → socket), instead of writing PKRU three times (disable A/enable B, disable B/enable C), compute the final PKRU state and write once. The intermediate states are unnecessary if no untrusted code executes between transitions.
- Cached PKRU shadow: a per-CPU shadow of the current PKRU value (stored in CpuLocalBlock, see Section 3.1.2). Before issuing WRPKRU, switch_domain() compares the desired value against the shadow. If identical, the instruction is skipped entirely. This is a single register comparison (~1 cycle) versus the ~23-cycle WRPKRU.
UmkaOS implementation — every domain switch goes through this trampoline. The
pkru_shadow is stored in CpuLocalBlock for single-instruction access.
No code path in the kernel issues WRPKRU outside this function. The
switch_domain() inline function:
#[inline(always)]
fn switch_domain(target_pkru: u32) {
let shadow = per_cpu::pkru_shadow();
if shadow != target_pkru {
// SAFETY: WRPKRU updates permission bits for all 16 MPK domains.
// target_pkru is computed from the domain allocation table and
// validated at driver load time — only valid permission sets are
// reachable. The caller runs with preemption disabled, so the shadow
// cannot go stale between the comparison and the write.
unsafe { x86::wrpkru(target_pkru) };
per_cpu::set_pkru_shadow(target_pkru);
}
}
Context switch coherence: On every context switch, the scheduler calls
arch::x86_64::isolation::save_pkru(prev_task) to save the outgoing task's PKRU
register value into prev_task.saved_pkru, then calls
arch::x86_64::isolation::restore_pkru(next_task) to load next_task.saved_pkru via
WRPKRU. The per-CPU CpuLocalBlock.isolation_shadow field
(Section 3.1.2) is
updated to next_task.saved_pkru atomically with the WRPKRU execution — the shadow
always reflects the actual hardware PKRU register value on this CPU. This invariant is
required for the validate_current_domain() fast path which reads the shadow without
executing RDPKRU. Any code path that issues WRPKRU outside switch_domain() or the
context switch save/restore functions is a bug: it would desync the shadow from the
hardware register, causing switch_domain() to skip necessary WRPKRU writes on
subsequent domain transitions.
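The coherence invariant can be made concrete with a small mock in which a plain field stands in for the hardware PKRU register; the names mirror the text but are not the actual kernel functions:

```rust
// Mock of the save/restore invariant described above: the per-CPU shadow
// must always equal the hardware PKRU. `hw_pkru` stands in for the real
// register; names are illustrative, not the actual kernel API.
struct Cpu {
    hw_pkru: u32,     // stands in for the hardware PKRU register
    pkru_shadow: u32, // CpuLocalBlock.isolation_shadow
}

struct Task {
    saved_pkru: u32,
}

impl Cpu {
    // switch_domain(): skip the hardware write when the shadow matches.
    // Returns true if a real WRPKRU was issued.
    fn switch_domain(&mut self, target: u32) -> bool {
        if self.pkru_shadow == target {
            return false; // WRPKRU elided
        }
        self.hw_pkru = target; // WRPKRU
        self.pkru_shadow = target; // shadow updated together with it
        true
    }

    // Context switch: save outgoing PKRU, restore incoming, and keep the
    // shadow coherent — a desync would make later switch_domain() calls
    // elide writes that are actually needed.
    fn context_switch(&mut self, prev: &mut Task, next: &Task) {
        prev.saved_pkru = self.hw_pkru;  // save_pkru(prev_task)
        self.hw_pkru = next.saved_pkru;  // restore_pkru(next_task): WRPKRU
        self.pkru_shadow = next.saved_pkru;
    }
}

fn main() {
    let mut cpu = Cpu { hw_pkru: 0xFFFF_FFFF, pkru_shadow: 0xFFFF_FFFF };
    let mut a = Task { saved_pkru: 0x0000_0003 };
    let b = Task { saved_pkru: 0x0000_000C };
    cpu.context_switch(&mut a, &b);
    assert_eq!(cpu.pkru_shadow, cpu.hw_pkru); // invariant holds
    assert!(!cpu.switch_domain(0x0000_000C)); // elided: already current
    assert!(cpu.switch_domain(0x0000_0003));  // real WRPKRU needed
}
```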
Guaranteed savings — on a typical TCP receive path (4 WRPKRU instructions in the
naive case: 2 boundary crossings × 2 switches each), shadow comparison eliminates 1-2
redundant writes (the intermediate transitions where permissions don't actually change).
At ~23 cycles per elided write, this saves ~23-46 cycles per packet — reducing TCP
path overhead from ~2% to ~1-1.5%. On NVMe paths, back-to-back domain transitions
(submit→complete with no intervening domain change) hit the shadow cache and skip the
second WRPKRU pair entirely, saving ~46 cycles.
Generalization to other architectures: The shadow-comparison pattern applies to every architecture's isolation register, not just x86 PKRU:
| Architecture | Register | Shadow location | Skip cost | Hardware write cost |
|---|---|---|---|---|
| x86-64 | PKRU (WRPKRU) | CpuLocalBlock.pkru_shadow | ~1 cycle (compare) | ~23 cycles |
| AArch64 | POR_EL0 (MSR) | CpuLocalBlock.por_shadow | ~1 cycle | ~40-80 cycles |
| ARMv7 | DACR (MCR p15) | CpuLocalBlock.dacr_shadow | ~1 cycle | ~10-20 cycles |
| PPC64 | Radix PID (mtspr) | CpuLocalBlock.rpid_shadow | ~1 cycle | ~30-60 cycles |
| PPC32 | Segment regs (mtsr) | CpuLocalBlock.sr_shadow[16] | ~1 cycle | ~10-30 cycles |
RISC-V has no isolation register (Tier 1 is not available on RISC-V) — shadow elision is not applicable. AArch64 uses POR_EL0 when POE hardware is present; on mainstream AArch64 without POE, the shadow tracks ASID/TTBR0 to elide redundant page-table switches. The shadow pattern provides the largest benefit on x86-64 and AArch64 POE, where the hardware write cost is highest relative to the comparison cost.
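A minimal sketch of that generalization: one shadowed wrapper over an architecture-specific register write, with a counting mock standing in for the hardware (the trait and type names are illustrative):

```rust
// Generic form of the shadow-elision pattern: any architecture's isolation
// register sits behind the same compare-before-write wrapper. Trait and
// type names are illustrative assumptions, not the real UmkaOS API.
trait IsolationReg {
    fn write_hw(&mut self, value: u32); // WRPKRU / MSR POR_EL0 / MCR DACR / mtspr
}

struct Shadowed<R: IsolationReg> {
    reg: R,
    shadow: u32, // must start equal to the actual hardware register value
}

impl<R: IsolationReg> Shadowed<R> {
    // Returns true if a hardware write was actually issued.
    fn set(&mut self, value: u32) -> bool {
        if self.shadow == value {
            return false; // ~1-cycle compare; hardware write elided
        }
        self.reg.write_hw(value); // ~23 cycles on x86, ~10-80 elsewhere
        self.shadow = value;
        true
    }
}

// Mock "hardware register" that counts real writes, for demonstration.
struct CountingReg {
    writes: u32,
}

impl IsolationReg for CountingReg {
    fn write_hw(&mut self, _value: u32) {
        self.writes += 1;
    }
}

fn main() {
    let mut r = Shadowed { reg: CountingReg { writes: 0 }, shadow: 0 };
    assert!(r.set(0xF));  // first transition: real write
    assert!(!r.set(0xF)); // same value: elided
    assert!(r.set(0x3));  // new value: real write
    assert_eq!(r.reg.writes, 2);
}
```

The payoff scales with the ratio of write cost to compare cost, which is why the table shows the largest benefit on x86-64 and AArch64 POE.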
10.2.6 Isolation on Other Architectures
Each supported architecture uses its best available fast isolation mechanism:
| Architecture | Mechanism | Switch Cost | Domains |
|---|---|---|---|
| x86_64 | MPK (WRPKRU) | ~23 cycles | 12 for drivers |
| AArch64 (mainstream) | Page-table + ASID | ~150-300 cycles | Unlimited |
| AArch64 + POE (ARMv8.9+/ARMv9.4+) | POE (MSR POR_EL0) | ~40-80 cycles | 7 usable (3 for drivers after infra deductions; see Section 23.4.3) |
| ARMv7 | DACR (MCR p15) | ~10-20 cycles | 15 usable |
| PPC32 | Segment registers (mtsr) | ~10-30 cycles | 15 usable |
| PPC64LE | Radix PID (mtspr PIDR) | ~30-60 cycles | Per-process |
| RISC-V 64 | None — Tier 1 unavailable | N/A | N/A |
Page-table + ASID isolation is the standard AArch64 mechanism and runs on all current ARM datacenter deployments: Graviton 2/3/4, Neoverse V1/V2, Ampere Altra, Kunpeng 920. POE (ARM FEAT_S1POE) is a hardware acceleration available on ARMv8.9+/ARMv9.4+ silicon (Neoverse V3+, Cortex-X4+) that reduces switch cost to ~40-80 cycles; it is an optional optimization, not the primary mechanism. When domain counts are exhausted, architectures with register-based isolation fall back to page-table switches. ARMv7 DACR is universally available on all Cortex-A cores and matches MPK in both cost and domain count.
10.2.6.1 Per-Architecture Mechanism Details
- aarch64: ARM Memory Domains (up to 16 domains via DACR on ARMv7) are not available on ARMv8/AArch64 in the same form. The standard AArch64 isolation mechanism is page-table-based domain isolation with ASID-preserving switches (~150-300 cycles per TTBR0_EL1 write + ISB + TLBI ASIDE1IS). This is what runs on all current deployed ARM servers: Graviton 2/3/4, Neoverse V1/V2, Ampere Altra, Kunpeng 920. On hardware with ARM FEAT_S1POE (optional from ARMv8.9/ARMv9.4, available on Neoverse V3+ and Cortex-X4+), UmkaOS activates the Permission Overlay Extension as an acceleration: POE provides 8 overlay indices (3 bits from PTE bits [62:60]), with index 0 reserved, giving 7 usable domains — fewer than x86 MPK's 12 driver domains. After infrastructure deductions (index 1: shared read-only, index 2: shared DMA, index 6: userspace, index 7: temporary/debug), only 3 indices remain for Tier 1 driver domains (indices 3-5); see Section 23.4.3 for the full AArch64 grouping scheme. Domain grouping is therefore much more aggressive on AArch64 when POE is active. POE is an optimization that reduces switch cost to ~40-80 cycles (~2-4x improvement); the system operates correctly without it using the page-table path.
- armv7: ARMv7 provides hardware Domain Access Control via the DACR register, supporting 16 memory domains (15 usable — domain 0 reserved for kernel). Each domain can be set to No Access, Client (checked against page permissions), or Manager (unchecked access) via a single MCR instruction to update DACR. This is the closest hardware analogue to x86 MPK on 32-bit ARM — a single privileged (MCR p15) register write switches domain permissions without TLB flushes. Unlike x86 WRPKRU (which is unprivileged and executable from Ring 3), DACR writes require PL1 — this is a security advantage: user-space code cannot forge domain switches.
- riscv64: RISC-V currently has no hardware isolation primitive suitable for Tier 1. SPMP (S-mode Physical Memory Protection) is only active when paging is disabled (satp.mode == Bare) and cannot be used in a kernel with virtual memory enabled. Smmtt (Supervisor Domain Access Protection) targets confidential computing, not MPK-style fast domain switching. Pointer Masking (Smnpm/Ssnpm, ratified Oct 2024) is not a domain isolation mechanism. Tier 1 isolation is not available on RISC-V. Tier 1 drivers on RISC-V platforms are promoted to Tier 0 (in-kernel, statically linked, fully trusted) — the same model as Linux. This is an accepted hardware constraint, not a design flaw. Tier 2 (Ring 3 + IOMMU) remains available on RISC-V for untrusted drivers where isolation is required. When RISC-V ISA extensions provide suitable isolation primitives (e.g., future Smpmp or custom domain extensions), UmkaOS will support Tier 1 on RISC-V without requiring architectural changes — the driver model is designed for this upgrade path.
- ppc32: PPC32 uses segment registers for memory domain isolation. The 32-bit PowerPC architecture provides 16 segment registers (SR0–SR15), each controlling access to a 256 MB virtual address region. Updating a segment register via mtsr is a single supervisor-mode instruction with low overhead (~10-30 cycles). When segments are insufficient, UmkaOS falls back to page-table-based isolation.
- ppc64le: PPC64LE on POWER9+ uses the Radix MMU with partition table entries (process table / PID) for isolation. On POWER8, the Hashed Page Table (HPT) with LPAR (Logical Partitioning) provides hardware-assisted isolation. The Radix MMU's PID-based isolation switches via mtspr PIDR (~30-60 cycles). HPT fallback uses full page table switches (~200-400 cycles).
10.2.6.2 Per-Architecture Isolation Cost Analysis
The x86_64 MPK WRPKRU instruction provides ~23-cycle domain switches (measured on
Skylake-class server cores; varies by microarchitecture — see Section 18.7.8
for full range: 11 cycles on Alder Lake, up to 260 cycles on Atom). Other architectures
use different mechanisms with different cost profiles:
| Architecture | Mechanism | Domain Switch Cost | Domains | Notes |
|---|---|---|---|---|
| x86_64 | MPK (WRPKRU) |
~23 cycles | 12 for drivers | 16 total keys. PKEY 0 (core), 1 (shared descriptors), 14 (shared DMA), 15 (guard) reserved for infrastructure. |
| x86_64 (no MPK) | Page table switch + ASID | ~200-400 cycles | Unlimited | Used when MPK unavailable (pre-Skylake). Full CR3 write + TLB management. |
| aarch64 (mainstream) | Page table switch + ASID | ~150-300 cycles | Unlimited | Standard mechanism on all current ARM servers: Graviton 2/3/4, Neoverse V1/V2, Ampere Altra, Kunpeng 920. TTBR0_EL1 write + ISB + TLBI ASIDE1IS. |
| aarch64 + POE (ARMv8.9+/ARMv9.4+) | MSR POR_EL0 + ISB |
~40-80 cycles | 7 usable | Optional acceleration: ARM FEAT_S1POE available on Neoverse V3+, Cortex-X4+. ISB barrier required (~20-40 cycles). Provides ~2-4x improvement over page-table path. |
| aarch64 + MTE | (not viable for domain isolation) | N/A | N/A | MTE assigns 4-bit tags per 16-byte granule, but tags are compared per-pointer — no single-register switch exists. Valuable for memory safety, not domain isolation. |
| armv7 | DACR (MCR p15) |
~10-20 cycles | 15 usable | Single MCR p15, 0, Rd, c3, c0, 0 writes all 16 domain permissions. No barrier required. Comparable to MPK cost. |
| armv7 (fallback) | Page table switch + CONTEXTIDR | ~150-300 cycles | Unlimited | MCR to TTBR0 + ISB + TLBI. Similar cost profile to aarch64 page-table path. |
| riscv64 | Tier 1 not available | N/A — Tier 1 drivers run as Tier 0 | N/A | No suitable hardware isolation exists with paging enabled. SPMP requires paging disabled; Smmtt targets confidential computing. Tier 1 drivers are promoted to Tier 0 (no isolation overhead). |
| ppc32 | Segment registers (mtsr) | ~10-30 cycles | 15 usable | Single mtsr instruction per 256 MB segment. No barrier required. Comparable to armv7 DACR cost. |
| ppc32 (fallback) | Page table switch | ~200-400 cycles | Unlimited | Full TLB invalidation + page table base update. |
| ppc64le (Radix) | PID switch (mtspr PIDR) | ~30-60 cycles | Process-table scoped | POWER9+ Radix MMU. mtspr PIDR + isync. ~2-3x MPK cost. |
| ppc64le (HPT) | HPT + LPAR switch | ~200-400 cycles | Unlimited | POWER8 Hashed Page Table. tlbie + table update. |
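The dispatch implied by this table can be sketched as a boot-time capability probe. This is an illustrative sketch only: the names `IsolationMechanism`, `HwCaps`, and `select_tier1_mechanism` are hypothetical, not UmkaOS APIs, and the capability flags stand in for the real CPUID/ID-register probes.

```rust
// Hypothetical sketch of boot-time Tier 1 mechanism selection.
// All names here are illustrative, not UmkaOS APIs.

#[derive(Debug, PartialEq)]
enum IsolationMechanism {
    Mpk,       // x86_64 WRPKRU, ~23 cycles
    Poe,       // aarch64 FEAT_S1POE, ~40-80 cycles
    Dacr,      // armv7 domain access control, ~10-20 cycles
    Segments,  // ppc32 mtsr, ~10-30 cycles
    RadixPid,  // ppc64le POWER9+ Radix, ~30-60 cycles
    PageTable, // generic fallback, ~150-400 cycles
    None,      // riscv64: Tier 1 promoted to Tier 0
}

struct HwCaps {
    arch: &'static str,
    has_mpk: bool,   // stand-in for the x86 PKU CPUID probe
    has_poe: bool,   // stand-in for the aarch64 ID-register probe
    radix_mmu: bool, // ppc64le: POWER9+ Radix vs POWER8 HPT
}

fn select_tier1_mechanism(caps: &HwCaps) -> IsolationMechanism {
    use IsolationMechanism::*;
    match caps.arch {
        "x86_64" => if caps.has_mpk { Mpk } else { PageTable },
        "aarch64" => if caps.has_poe { Poe } else { PageTable },
        "armv7" => Dacr,
        "ppc32" => Segments,
        "ppc64le" => if caps.radix_mmu { RadixPid } else { PageTable },
        // No suitable intra-address-space mechanism: Tier 1 runs as Tier 0.
        "riscv64" => None,
        _ => PageTable,
    }
}
```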
Impact on performance budget — The Section 1.2 overhead analysis uses x86_64 MPK (~23 cycles per switch, ~92 cycles per I/O round-trip):
| Architecture | Overhead per NVMe 4KB read | Overhead per TCP RX |
|---|---|---|
| x86_64 MPK | +1% (92 cycles / 10μs) | +2% (~92 cycles / 5μs, with NAPI batching; naive per-packet is ~17-26%, see Section 15.1.7) |
| aarch64 page-table (mainstream) | +6-12% (600-1200 cycles / 10μs) | +12-24% (600-1200 cycles / 5μs) |
| aarch64 + POE (ARMv8.9+/ARMv9.4+) | +2-3% (160-320 cycles / 10μs) | +3-6% (160-320 cycles / 5μs) |
| armv7 DACR | +0.5-1% (40-80 cycles / 10μs) | +1-2% (40-80 cycles / 5μs) |
| riscv64 | N/A — Tier 1 not available; Tier 1 drivers run as Tier 0 (zero isolation overhead, same as Linux) | N/A |
| ppc32 segments | +0.5-1% (40-120 cycles / 10μs) | +1-2% (40-120 cycles / 5μs) |
| ppc64le Radix | +1-2% (120-240 cycles / 10μs) | +2-5% (120-240 cycles / 5μs) |
For armv7 with DACR and ppc32 with segment registers, the overhead is comparable to or better than x86 MPK. For aarch64 with POE and ppc64le with Radix PID, the overhead remains within the 5% budget for storage workloads. On mainstream AArch64 (page-table path), Tier 1 overhead reaches 6-12%, which exceeds the 5% budget for I/O-heavy workloads; administrators can promote performance-critical drivers to Tier 0 or demote to Tier 2 as appropriate.
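The budget arithmetic behind these percentages can be reproduced with a small helper. This sketch treats 1 cycle as roughly 1 ns (so a 10 μs operation is a ~10,000-cycle budget), which is the normalization the table's figures imply rather than a stated assumption of the design; the function names are illustrative.

```rust
// Sketch of the overhead arithmetic used in the tables above.
// Two domain crossings per call, two calls per I/O round-trip,
// so a 23-cycle switch costs ~92 cycles per round-trip.
fn round_trip_overhead_cycles(switch_cost: u64) -> u64 {
    switch_cost * 4
}

/// Overhead as a percentage of the operation's cycle budget
/// (e.g. a 10 us NVMe read ~ 10_000 cycles at ~1 cycle/ns).
fn overhead_percent(switch_cost: u64, op_budget_cycles: u64) -> f64 {
    round_trip_overhead_cycles(switch_cost) as f64 / op_budget_cycles as f64 * 100.0
}
```

With `switch_cost = 23` and a 10,000-cycle budget this reproduces the ~1% x86_64 MPK figure; the aarch64 page-table rows follow from switch costs of 150-300 cycles.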
ARM server reality — Page-table + ASID isolation (~150-300 cycles) is the mechanism that runs on nearly all currently deployed ARM servers. FEAT_S1POE is optional from ARMv8.9/ARMv9.4. Current mainstream datacenter cores — Neoverse V2 (ARMv9.0, AWS Graviton 4, Google Axion), Neoverse V3 (ARMv9.2, AWS Graviton 5, Azure Cobalt 200), Ampere Altra (ARMv8.2), and Kunpeng 920 (ARMv8.2) — do not implement POE. The page-table path is not a fallback; it is the standard operating mode for AArch64. POE is a hardware acceleration that becomes available on ARMv8.9+/ARMv9.4+ silicon (Neoverse V3+, Cortex-X4+) and reduces per-switch cost by ~2-4x when present.
RISC-V reality — RISC-V currently has no hardware isolation mechanism suitable for Tier 1. Tier 1 isolation is not available on RISC-V; all Tier 1 drivers run as Tier 0 (in-kernel, fully trusted, zero isolation overhead). The 5% overhead budget applies to operations that do run — without the Tier 1 isolation layer, there is no overhead to measure on that path. Tier 2 remains available for drivers where isolation is required. When RISC-V ISA extensions provide suitable isolation primitives, UmkaOS will support Tier 1 on RISC-V without architectural changes to the driver model.
10.2.7 Adaptive Isolation Policy (Graceful Degradation)
UmkaOS targets six architectures with fundamentally different isolation capabilities. The design philosophy: use the best isolation the hardware provides; when the hardware provides nothing, degrade gracefully — don't refuse to run. This mirrors Linux's approach to every hardware feature.
Three boot-time modes, selectable via the `umka.isolation=` kernel parameter or runtime sysfs:

- `strict` (default when fast isolation is available): All Tier 1 drivers run in hardware-isolated domains. Full isolation at ~23-80 cycle cost per switch (register-based) or ~150-300 cycles (page-table, AArch64 mainstream).
- `degraded` (default on AArch64 mainstream): Page-table isolation operates the three-tier model with ~150-300 cycle overhead per crossing. This is the normal operating mode for current ARM server deployments, not a degraded state.
- `performance`: Tier 1 drivers promoted to Tier 0 — zero boundary-crossing overhead, matching Linux exactly. IOMMU DMA fencing and capability checks remain active. Appropriate for I/O-heavy workloads where the page-table path overhead is unacceptable.
On RISC-V, the adaptive policy always selects Tier 0 for all Tier 1 drivers — Tier 1 isolation is not available on RISC-V due to hardware capability limitations. This is not a performance mode selection; it is a platform capability constraint that will be resolved when RISC-V hardware provides suitable isolation primitives.
Per-driver overrides are available via driver manifests (the `no_fast_isolation` policy): drivers can individually choose `promote_tier0`, `page_table` (the default), or `demote_tier2` regardless of the global mode.
10.2.7.1 Performance Mode Details
On hardware without fast isolation, Tier 1 drivers are promoted to Tier 0 — they run in the same protection domain as umka-core with zero boundary-crossing overhead. Performance matches Linux exactly. The system logs a prominent warning:
umka: isolation=performance: Tier 1 drivers running WITHOUT memory isolation
umka: Driver crashes may cause kernel panic (same as Linux monolithic behavior)
umka: IOMMU DMA fencing is still active — DMA isolation preserved
Key properties of performance mode:

- IOMMU DMA fencing remains active — even without MPK memory isolation, DMA operations are still restricted to driver-allocated regions.
- Crash recovery is best-effort — without memory isolation, a crashing driver may corrupt umka-core state, making recovery impossible.
- Capability system still enforced — the software-level capability model remains active. Only the memory enforcement is relaxed.
- Security model partially degraded — a malicious driver could exploit the shared address space. This mode is appropriate for trusted environments with known drivers.
Per-driver tier pinning via driver manifest:
```toml
# umka-nvme driver manifest
[driver]
name = "umka-nvme"
preferred_tier = 1
minimum_tier = 1

# Override: on hardware without MPK, run this driver as Tier 0
# instead of using the slow page-table fallback
[driver.isolation_fallback]
no_fast_isolation = "promote_tier0"  # "promote_tier0" | "page_table" | "demote_tier2"
```
Options per driver:
- promote_tier0: run in Tier 0 (fast, no isolation) — for performance-critical drivers
- page_table: use page-table fallback (slow, but isolated) — default
- demote_tier2: move to Tier 2 userspace (full process isolation) — for untrusted or
crash-prone drivers
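The fallback resolution these options describe can be sketched as follows. The types and the `resolve_tier` function are illustrative, not the registry implementation; the policy names mirror the manifest values above.

```rust
// Illustrative sketch of per-driver tier resolution -- not UmkaOS registry code.

#[derive(Debug, PartialEq, Clone, Copy)]
enum Tier { Tier0, Tier1, Tier2 }

#[derive(Debug, PartialEq)]
enum FallbackPolicy { PromoteTier0, PageTable, DemoteTier2 }

/// Resolve the tier a driver actually runs in, given its manifest
/// preference, its no_fast_isolation policy, and whether the platform
/// has a fast (register-based) isolation mechanism.
fn resolve_tier(preferred: Tier, fast_isolation: bool, policy: FallbackPolicy) -> Tier {
    if preferred != Tier::Tier1 || fast_isolation {
        return preferred; // fast hardware isolation available: run as requested
    }
    match policy {
        FallbackPolicy::PromoteTier0 => Tier::Tier0, // fast, no isolation
        FallbackPolicy::PageTable => Tier::Tier1,    // slow page-table domains (default)
        FallbackPolicy::DemoteTier2 => Tier::Tier2,  // full process isolation
    }
}
```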
Historical context — Apple's transition from kexts (Ring 0, no isolation) to DriverKit (userspace, full isolation) took 5 years. UmkaOS's approach is more nuanced: rather than a binary choice between "fast and dangerous" and "safe and slow," hardware-assisted isolation (MPK, POE, DACR, segments, Radix PID) provides a third option — "fast and safe" — on modern hardware. The adaptive isolation policy ensures UmkaOS remains viable on older hardware by honestly trading off isolation for performance when the hardware cannot support both simultaneously.
Isolation is one of eight core capabilities, not the only one. Even on hardware without fast isolation (RISC-V where Tier 1 is unavailable, older x86 without MPK), UmkaOS still provides: driver crash recovery (best-effort in Tier 0, full in Tier 2), distributed kernel primitives, heterogeneous compute management, structured observability, power budgeting, post-quantum security, live kernel evolution, and a stable driver ABI. A RISC-V server operating without Tier 1 isolation retains all seven other capabilities. This is a hardware-imposed constraint, not a design failure — and it is resolved when suitable RISC-V hardware becomes available.
10.4 Driver Isolation Tiers
10.4.1 Tier Classification
Tier 0 has two sub-forms: static (compiled into the kernel binary) and loadable (dynamically loaded but running in the Core domain with no isolation). Both are Tier 0 in the trust and crash-consequence sense; they differ in deployment. See Section 10.4.2 for details.
| Property | Tier 0 Static | Tier 0 Loadable | Tier 1 | Tier 2 |
|---|---|---|---|---|
| Location | Compiled into kernel binary | Ring 0, Core domain, dynamically loaded | Ring 0, dynamically loaded, domain-isolated | Ring 3, separate process |
| KABI transport | Direct vtable call (T0) | Direct vtable call (T0) | Ring buffer (T1) | Ring buffer (T2) |
| Isolation | None | None (same address space) | Hardware memory domains + IOMMU | Full address space + IOMMU |
| Crash behavior | Kernel panic | Kernel panic | Reload module (~50-150ms, design target) | Restart process (~10ms) |
| DMA access | Unrestricted | Unrestricted | IOMMU-fenced | IOMMU-fenced |
| Performance | Zero overhead | ~2–5 cycles (vtable dispatch) | ~23 cycles domain switch + marshaling (x86 MPK) | ~200-500 cycles per crossing |
| Trust level | Maximum (core kernel) | Maximum (signed, sealed index) | High (verified, signed) | Low (untrusted acceptable) |
| Unloadable | No (static) | No (`load_once: true`) | Yes (domain revocation) | Yes (process exit) |
| Examples | APIC, timer, early console, Core allocator | SCSI mid-layer, MDIO bus, SPI bus core, cfg80211 framework, V4L2 core | NVMe, NIC, TCP/IP, FS, GPU, KVM, audio (default), WiFi driver | USB, input, BT, audio (optional demotion), HID |
Tier 1 isolation mechanism per architecture:
The "hardware memory domains" used for Tier 1 isolation are architecture-specific. Not all architectures have a fast isolation mechanism; RISC-V has none at all and runs Tier 1 drivers as Tier 0. See Section 10.2.6.2 for per-architecture cycle costs and Section 10.2.7 for the adaptive policy.
| Architecture | Tier 1 Mechanism | Switch Cost | Domains | Availability |
|---|---|---|---|---|
| x86-64 | MPK (WRPKRU) | ~23 cycles | 12 usable | Intel Skylake+ / AMD Zen 3+ |
| x86-64 (no MPK) | Page table + ASID | ~200-400 cycles | Unlimited | All x86-64 |
| AArch64 (mainstream) | Page table + ASID | ~150-300 cycles | Unlimited | All AArch64 — standard mechanism on Graviton 2/3/4, Neoverse V1/V2, Ampere Altra, Kunpeng 920 |
| AArch64 + POE (ARMv8.9+/ARMv9.4+) | POE (MSR POR_EL0 + ISB) | ~40-80 cycles | 7 usable (3 for drivers after infra deductions; see Section 23.4.3) | Optional acceleration: FEAT_S1POE on Neoverse V3+, Cortex-X4+ |
| ARMv7 | DACR (MCR p15) | ~10-20 cycles | 15 usable | All ARMv7 (universal) |
| RISC-V 64 | Tier 1 not available — Tier 1 drivers run as Tier 0 | N/A (no isolation boundary) | N/A | Hardware capability not yet available on any RISC-V silicon |
| PPC32 | Segment registers (mtsr) | ~10-30 cycles | 15 usable | All PPC32 |
| PPC64LE (POWER9+) | Radix PID (mtspr PIDR) | ~30-60 cycles | Process-scoped | POWER9+ with Radix MMU |
| PPC64LE (POWER8) | HPT + LPAR | ~200-400 cycles | Unlimited | POWER8 |
On RISC-V 64, Tier 1 isolation is not available. As of early 2026, no ratified RISC-V
extension provides a suitable intra-address-space isolation mechanism with paging
enabled (SPMP requires paging disabled; Smmtt targets confidential computing; Pointer
Masking Smnpm/Ssnpm, ratified Oct 2024, is not a domain isolation mechanism).
All Tier 1 drivers on RISC-V are promoted to Tier 0 — they run in-kernel with no
hardware isolation boundary, identical to the Linux monolithic driver model. Tier 2
(Ring 3 + IOMMU) remains available for RISC-V drivers where isolation is required.
When RISC-V hardware provides suitable isolation primitives, UmkaOS will activate Tier 1
on RISC-V without requiring changes to the driver model or driver manifests.
10.4.2 Tier 0: Boot-Critical and Core Framework Code
Tier 0 encompasses all kernel code that runs in Ring 0 inside the Core memory domain, with no hardware isolation boundary between it and the static kernel binary. A crash in any Tier 0 code — static or loadable — causes a kernel panic. Tier 0 is split into two deployment forms.
10.4.2.1 Tier 0 Static
Compiled directly into the kernel binary. Required before any dynamic loading infrastructure is available:
- Local APIC and I/O APIC
- PIT/HPET/TSC timer
- Early serial/VGA console
- ACPI table parsing (early boot only). Security trade-off: ACPI tables are firmware-provided data that the kernel must trust at boot. A malicious or buggy BIOS can supply corrupt ACPI tables (malformed AML, overlapping MMIO regions, impossible NUMA topologies). UmkaOS's Tier 0 ACPI parser performs defensive parsing: all table lengths are bounds-checked, AML interpretation uses a sandboxed evaluator with a cycle limit (no infinite loops), and MMIO regions claimed by ACPI are validated against the e820/UEFI memory map before being mapped. Despite these defenses, ACPI parsing remains the largest attack surface in Tier 0. The firmware quirk framework (Section 10.5.11.6) provides per-platform overrides for known-buggy tables.
Tier 0 static code is held to the highest review standard and kept minimal. Only code that is genuinely required before the module loader and isolation infrastructure are operational belongs here.
10.4.2.2 Tier 0 Loadable Modules
Dynamically loaded into the Core domain after the module loader initialises, but before or during device enumeration. Tier 0 loadable modules:
- Run in Ring 0 in the same memory domain as static Core
- Communicate with static Core and other Tier 0 modules via direct vtable calls (Transport T0, Section 11.1.8) — not ring buffers
- Are loaded by the kernel-internal module loader (Section 11.1.9.6) without requiring userspace
- Are never unloaded (`load_once: true`) — safe Tier 0 unloading would require auditing every execution context in the kernel for stale function pointers, which is not tractable
- Declare their `requires` and `provides` dependencies in their `.kabi` file (Section 11.1.9.2)
- Are transparent to Tier 1 callers — from Tier 1's perspective, calling a Tier 0 loadable service is identical to calling static Core
When to use Tier 0 loadable vs Tier 1: Use Tier 0 loadable for kernel framework modules that provide services to many other drivers and whose correctness can be established by inspection — bus frameworks (SCSI mid-layer, MDIO, SPI bus core), protocol framework layers (cfg80211, V4L2 core, media controller), and subsystem registries (backlight, NVMEM, rfkill). These are trusted, signed, reviewed code that is too large to statically link but too fundamental to pay Tier 1 ring buffer overhead on every call. Use Tier 1 for hardware drivers and any code that benefits from crash containment — if a bug is plausible, it should be Tier 1.
10.4.3 Tier 1: Kernel-Adjacent Drivers (Hardware Memory Domain Isolated)
Tier 1 is crash containment, not a security boundary. Hardware memory domain isolation (MPK, POE, DACR) prevents a buggy driver from corrupting kernel memory by accident. It does not prevent a compromised driver from escaping deliberately — on x86-64, `WRPKRU` is unprivileged and any Ring 0 code can execute it. This is a known architectural property of Intel MPK, not an UmkaOS design flaw, and is documented as an intentional tradeoff. Tier 2 (Ring 3 + IOMMU) is the security boundary for untrusted or third-party drivers. Do not flag Tier 1 MPK escape as a vulnerability — the threat model is stated and the mitigation is Tier 2. See Section 10.4.5 for Tier 2.
Performance-critical drivers run in Ring 0 but are isolated via hardware memory domains (MPK on x86-64, DACR on ARMv7, segment registers on PPC32, Radix PID on PPC64LE, page-table + ASID on AArch64 mainstream, POE on AArch64 ARMv8.9+/ARMv9.4+ when available — see "Tier 1 isolation mechanism per architecture" table above). Note: Tier 1 isolation is not available on RISC-V; Tier 1 drivers on RISC-V run as Tier 0. Each driver (or driver group) that does have Tier 1 isolation is assigned a protection domain. The driver can only access:
- Its own private memory (tagged with its domain key)
- Shared ring buffers (tagged with the shared domain, read-write)
- Shared DMA buffers (tagged with DMA domain, read-write)
- Its MMIO regions (mapped with its domain key)
It cannot access:

- UmkaOS Core private memory
- Other Tier 1 drivers' private memory
- Page tables, capability tables, or scheduler state
- Arbitrary physical memory
Security limitation: Tier 1 isolation protects against bugs, not exploitation.
On x86-64, MPK isolation uses the WRPKRU instruction, which is unprivileged --
any Ring 0 code (including Tier 1 driver code) can execute it to modify its own
domain permissions and access any MPK-protected memory, including UmkaOS Core (PKEY 0).
This means a compromised Tier 1 driver with arbitrary code execution can trivially
bypass MPK isolation. On ARMv7, MCR to DACR is privileged (PL1), which is stronger
-- user-space cannot forge domain switches, but kernel-mode drivers still can. On
PPC32 and PPC64LE, segment register and AMR updates are similarly supervisor-mode.
Tier 1 threat model: MPK (and its architectural equivalents) provides defense against accidental corruption -- buffer overflows, use-after-free, null dereferences that happen to write to the wrong address. It does not defend against deliberate exploitation where an attacker achieves arbitrary code execution within a Tier 1 driver and intentionally escapes the domain. For the exploitation case, Tier 2 (full process isolation in Ring 3) is the appropriate boundary.
Tier 1 trust requirement: Tier 1 drivers run in Ring 0 with only domain isolation (not address space isolation). They must be treated as trusted code: cryptographically signed, manifest-verified (Section 1.2), and subject to the same security review standard as Core kernel code. Tier 1 is not appropriate for third-party, untrusted, or unaudited drivers. Untrusted drivers must use Tier 2 (Ring 3 process isolation) where a compromised driver cannot escalate to kernel privilege regardless of the exploit technique. See Section 10.4.8 (Signal Delivery Across Isolation Boundaries) for the complete domain crossing specification during signal handling.
Mitigations that raise the bar for exploitation are detailed in Section 10.2
("WRPKRU Threat Model: Unprivileged Domain Escape"): binary scanning for unauthorized
WRPKRU/XRSTOR instructions at load time, W^X enforcement on driver code pages,
forward-edge CFI (Clang -fsanitize=cfi-icall), and the NMI watchdog for detecting
PKRU state mismatches.
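The binary-scanning mitigation can be illustrated with a byte-pattern search: WRPKRU encodes as the 3-byte sequence 0F 01 EF. The function name below is hypothetical, and a production scanner must decode instruction boundaries (a naive byte scan can both miss sequences and flag bytes inside an unrelated instruction's immediate); this shows only the pattern search itself.

```rust
// Illustrative load-time scan for the WRPKRU opcode (0F 01 EF).
// XRSTOR (0F AE /5) would be handled similarly but needs ModRM decoding.
const WRPKRU: [u8; 3] = [0x0f, 0x01, 0xef];

/// Return the offsets of every candidate WRPKRU byte sequence in `text`.
fn scan_for_wrpkru(text: &[u8]) -> Vec<usize> {
    let mut hits = Vec::new();
    for (i, w) in text.windows(WRPKRU.len()).enumerate() {
        if w == WRPKRU {
            hits.push(i); // candidate only: must be confirmed on decoded boundaries
        }
    }
    hits
}
```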
Tier 0 fast path: On RISC-V (where Tier 1 is not available), POWER8, or when `isolation=performance` promotes all drivers to Tier 0, MPK-specific mitigations are automatically skipped:

- Binary scanning for `WRPKRU`/`XRSTOR`: skipped (no MPK → no WRPKRU exploit).
- NMI PKRU watchdog: disabled (no PKRU state to verify).
- W^X enforcement and forward-edge CFI remain active — these defend against code injection and control-flow hijacking regardless of isolation tier and are standard hardening measures, not isolation-specific overhead.
Future: PKS (Protection Keys for Supervisor) -- Intel's PKS extension provides
supervisor-mode protection keys that are controlled via MSR writes (privileged
operations that require Ring 0 + CPL 0 MSR access). Unlike WRPKRU (which any Ring 0
code can execute), PKS key modifications go through WRMSR to IA32_PKS, which can
be trapped by a hypervisor or controlled by umka-core. When PKS-capable hardware is
available, UmkaOS will use PKS for Tier 1 isolation, closing the unprivileged-WRPKRU
escape path. PKS is available on Intel Sapphire Rapids and later server CPUs.
10.4.3.1 VirtIO Device Hosting
VirtIO devices in UmkaOS run as Tier 1 drivers (Ring 0, hardware memory domain isolated). Rationale: VirtIO devices are almost always used in virtualized environments where high-throughput I/O is required; Tier 1 gives them direct access to the network and block stacks without ring-crossing overhead, while the MPK/POE/DACR isolation boundary still contains crashes.
- The VirtIO transport layer (PCI or MMIO config space, virtqueue management) is implemented inside the Tier 1 driver domain.
- Virtqueues (split or packed ring format) are backed by `RingBuffer<VirtqDesc>` — the same infrastructure used for other UmkaOS driver rings, providing unified memory accounting across all device types.
- The Linux VirtIO userspace API (vhost-user, vDPA) is surfaced through UmkaOS's compat layer unchanged — guest VMs and containers see standard VirtIO PCI/MMIO devices.
- Tier 2 option: a VirtIO device MAY be hosted as Tier 2 (full userspace process) via vhost-user if the operator prioritizes fault isolation over latency; this adds approximately 5–15 μs of ring-crossing overhead per batch.
10.4.4 Protection Key Exhaustion (Hardware Domain Limit)
Intel MPK provides only 16 protection keys (PKEY 0-15). With PKEY 0 reserved for UmkaOS Core, PKEY 1 for shared read-only descriptors, PKEY 14 for shared DMA, and PKEY 15 as guard, only 12 keys (PKEY 2-13) are available for Tier 1 driver domains (see Section 10.2, "MPK Domain Allocation"). This limits the number of independently isolated Tier 1 drivers to 12 on x86-64 with MPK. Architectures with equivalent mechanisms (AArch64 POE: 7 usable domains, ARMv7 DACR: 15 usable domains, PPC32 segments: 15 usable) face the same constraint. This is a hard hardware limit that cannot be worked around without changing the isolation granularity. PPC64LE (Radix PID) uses process-scoped isolation without a fixed small domain budget, so domain exhaustion does not apply there — but it pays higher per-switch costs (see the Section 10.2.6.2 cost table). RISC-V has no Tier 1 isolation at all; domain exhaustion does not apply.
When domains are exhausted (more concurrent Tier 1 drivers than available hardware domains — 12 on x86 MPK, 7 on AArch64 POE, 15 on ARMv7 DACR, 15 on PPC32 segments), UmkaOS applies three strategies in priority order:
1. Domain grouping (default): Related drivers share a protection key. For example, all block storage drivers (NVMe, AHCI, virtio-blk) share one key, all network drivers (NIC, TCP/IP stack) share another. Grouping reduces isolation granularity -- a bug in one block driver can corrupt another block driver's memory within the same group -- but preserves isolation between groups (network cannot corrupt storage). Grouping policy is configurable via the driver manifest:

   ```toml
   [driver.isolation]
   isolation_group = "block"  # Share isolation domain with other "block" group drivers
   ```

2. Automatic Tier 2 demotion: Drivers below a configurable priority threshold are demoted to Tier 2 (process isolation) when all hardware isolation domains are consumed. Only the most performance-critical drivers retain Tier 1 placement. The priority is determined by `match_priority` in the driver manifest -- higher priority retains Tier 1.

3. Domain virtualization (future): On context switch, the scheduler can save and restore the isolation domain register (PKRU on x86, POR_EL0 on AArch64, DACR on ARMv7, segment registers on PPC32) along with a remapped domain assignment table, allowing more logical domains than hardware provides by time-multiplexing physical domains. Domain virtualization adds overhead to context switches: ~50-100 cycles on the warm-cache fast path (WRPKRU ~20 cycles plus an L1-resident domain table lookup), with cold-cache misses adding ~100-200 cycles to the domain table access. It is used only when strategies 1 and 2 are insufficient. This is a future optimization -- domain grouping and Tier 2 demotion handle all current deployment scenarios.
POE + ASID domains (AArch64 systems with POE support)

On AArch64 systems with Permission Overlay Extensions (ARMv8.9+ / FEAT_S1POE):

- Each Tier 1 driver domain is assigned a POE domain (POR_EL0 register field, up to 8 domains).
- Domain switch: `MSR POR_EL0, x0` (single instruction, ~40-80 cycles, no TLB flush).
- Domain assignment: PKEY 0 = UmkaOS Core private, PKEYs 1-6 = Tier 1 driver domains, PKEY 7 = shared DMA pool. (POE supports 8 domains -- half of x86 MPK's 16.)
- Fallback: If the hardware supports POE but a driver requires exclusive ASID isolation (e.g., a cryptographic device handling key material), the page-table + ASID path is used for that driver even on POE-capable hardware. The driver registers `require_asid_isolation: true` in its `.kabi` manifest.
- Combined POE+ASID: For the highest isolation guarantee on ARMv8.9+, use both: POE for fast memory-domain switching plus a dedicated ASID for the driver domain. This prevents both memory domain escapes (POE) and TLB side-channel attacks (ASID). Cost: ~80-150 cycles per domain switch (ASID flush + POE switch); used for Tier 1 drivers handling sensitive key material.
- Detection: POE availability is checked at boot via `ID_AA64MMFR3_EL1.S1POE != 0`. Exposed via `IsolationCapabilities::poe_available: bool` to the driver subsystem.
When domain grouping is applied, the kernel logs a warning (umka: isolation domain exhausted, grouping {driver_a} with {driver_b}) and exposes the current domain allocation in /sys/kernel/umka/isolation/domains for admin visibility.
Practical impact: A typical server has 5-8 performance-critical driver types (NVMe, NIC, TCP/IP, filesystem, GPU, KVM, virtio, crypto). With grouping, these fit within the hardware domain budget on x86 (12 domains), ARMv7 (15), and PPC32 (15) with room to spare. On AArch64 with POE (7 total usable indices, of which only 3 are available for Tier 1 drivers after infrastructure reservations — see Section 23.4.3 in 23-roadmap.md for the full index allocation), a typical 5-8 driver configuration requires at least one grouping (e.g., NVMe + filesystem share a domain). Systems with unusually many distinct Tier 1 drivers (e.g., multi-vendor NIC + storage + GPU + FPGA configurations) trigger Tier 2 demotion for the lowest-priority drivers.
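Strategies 1 and 2 compose into a single allocation pass, sketched below. The types, the key numbering, and `assign_domains` are illustrative, not the registry's allocator; the real system also honors per-driver manifest pinning.

```rust
use std::collections::HashMap;

// Illustrative sketch of domain-budget strategies 1 and 2: drivers in the
// same isolation_group share one protection key; when keys run out, the
// lowest-priority remaining drivers are demoted to Tier 2.

struct DriverReq<'a> {
    name: &'a str,
    group: &'a str, // manifest [driver.isolation] isolation_group
    priority: u32,  // manifest match_priority; higher keeps Tier 1
}

/// Returns (driver -> key) for Tier 1 placements and the demoted drivers.
fn assign_domains<'a>(
    mut drivers: Vec<DriverReq<'a>>,
    keys_available: usize,
) -> (HashMap<&'a str, usize>, Vec<&'a str>) {
    // Highest priority first, so demotion trims from the bottom.
    drivers.sort_by(|a, b| b.priority.cmp(&a.priority));
    let mut group_key: HashMap<&str, usize> = HashMap::new();
    let mut placed = HashMap::new();
    let mut demoted = Vec::new();
    for d in &drivers {
        if let Some(&k) = group_key.get(d.group) {
            placed.insert(d.name, k); // strategy 1: share the group's key
        } else if group_key.len() < keys_available {
            let k = group_key.len();
            group_key.insert(d.group, k);
            placed.insert(d.name, k);
        } else {
            demoted.push(d.name); // strategy 2: Tier 2 demotion
        }
    }
    (placed, demoted)
}
```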
Long-term trajectory: the domain budget pressure diminishes as devices become
peers — but this is a multi-year ecosystem shift, not a near-term fix. The
devices that consume the most Tier 1 domain slots today — GPU (~700K lines of
handwritten driver code, excluding auto-generated headers), high-end NIC/DPU (~150K lines), and high-throughput storage
controllers — are exactly the devices most suited to become UmkaOS multikernel
peers (Section 5.1.2.2). When a device runs its own UmkaOS kernel and participates as a
cluster peer, it is handled entirely by umka-peer-transport (~2K lines) and
consumes zero MPK domains; it exits the Tier 1 population entirely and is
contained by the IOMMU hard boundary instead.
However, UmkaOS cannot assume vendor adoption. Rewriting device firmware to implement UmkaOS message passing requires vendor investment, ecosystem tooling, and standardization effort that will take years to mature. For the foreseeable future, most devices will continue to use traditional Tier 1 and Tier 2 drivers, and the domain budget strategies above (grouping, Tier 2 demotion, domain virtualization) are the primary long-term solution — not a temporary workaround. Domain virtualization (strategy 3) and PKS (Section 10.4, future work) remain genuinely important during this extended transition window and must be implemented correctly. They cannot be dismissed as "probably never needed."
The peer kernel model is the correct direction — it reduces the Tier 1 population, eliminates device-specific Ring 0 code, and strengthens the isolation boundary — but UmkaOS must operate correctly and efficiently with today's hardware for years before that future materializes. Domain grouping and automatic Tier 2 demotion are therefore the primary and durable strategies. The ecosystem shift toward peer kernels is a beneficial long-term trend that will progressively ease the domain budget, not a solution that UmkaOS can depend on today.
10.4.5 Tier 2: User-Space Drivers (Process-Isolated)
Non-performance-critical drivers run as user-space processes with full address space isolation. Communication with UmkaOS Core uses:
- Shared-memory ring buffers (mapped into both address spaces)
- Lightweight notification via eventfd-like mechanism
- IOMMU-restricted DMA (driver can only DMA to its allocated regions)
Tier 2 MMIO access model. Tier 2 drivers access device MMIO registers via
umka_driver_mmio_map (Section 10.4, KABI syscall table), which maps a device BAR region
into the driver process's address space. This mapping is direct -- the driver reads
and writes device registers without kernel mediation on each access, avoiding per-access
syscall overhead. However, the mapping is kernel-controlled and revocable:
- Setup-time validation. The kernel validates every `umka_driver_mmio_map` request: the BAR index must belong to the driver's assigned device, the offset and size must fall within the BAR's bounds, and the driver must hold the appropriate device capability. The kernel never maps BARs belonging to other devices or kernel-reserved MMIO regions.
- IOMMU containment. Even though the driver can program device registers via MMIO (including registers that initiate DMA), all DMA transactions from the device pass through the IOMMU. The device's IOMMU domain restricts DMA to regions explicitly allocated by the kernel on behalf of the driver (`umka_driver_dma_alloc`). A compromised Tier 2 driver that programs arbitrary DMA addresses into device registers will trigger IOMMU faults -- the DMA is blocked by hardware, not by software trust. This is the same IOMMU fencing applied to Tier 1 drivers, and it is the primary defense against DMA-based attacks from any driver tier.
- MMIO revocation on containment. When the kernel needs to contain a Tier 2 driver (crash, fault, admin action, or auto-demotion), it unmaps all MMIO regions from the driver process's address space as part of the containment sequence. This is a standard virtual memory operation (page table entry removal + TLB invalidation) that completes in microseconds. After MMIO revocation, any subsequent MMIO access by the driver process triggers a page fault and process termination -- the driver cannot issue further device commands. Combined with IOMMU fencing (which blocks DMA initiated before revocation from reaching non-driver memory), MMIO revocation provides a complete device access cutoff without requiring Function Level Reset.
PCIe peer-to-peer DMA and IOMMU group policy -- The "complete device access cutoff" guarantee above depends on all DMA traffic passing through the IOMMU. This holds when the device is in its own IOMMU group (ACS enabled on all upstream PCIe switches). However, devices behind a non-ACS PCIe switch can perform peer-to-peer DMA that bypasses the IOMMU entirely — a contained device could still DMA to a peer device's memory regions without IOMMU interception. UmkaOS addresses this by enforcing an IOMMU group co-isolation policy: when devices share an IOMMU group (no ACS), UmkaOS places all devices in that group under the same Tier 2 driver process (or co-isolates them in the same Tier 1 domain). IOMMU revocation during containment therefore affects the entire group atomically — there is no "partially contained" state where one device in the group is fenced but a peer is not. See Section 10.5.3.8 (IOMMU Groups) for the full ACS detection and group assignment policy.
Synchronous vs. asynchronous revocation -- For deliberate containment actions (admin-initiated revocation, auto-demotion, fault-triggered isolation), MMIO revocation is synchronous: the kernel performs the TLB shootdown and waits for acknowledgment from all CPUs before the containment call returns. This guarantees that no MMIO access from the driver process is possible after the containment operation completes. For the crash case (driver process dies due to SIGSEGV/SIGABRT), the dying process's threads are killed first, so the TLB shootdown is a cleanup operation -- the driver threads are no longer executing, making the timing of the shootdown a correctness concern only for the page allocator (which must not reuse the MMIO-mapped pages until the shootdown completes).
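The synchronous containment path can be modeled as a publish-and-wait acknowledgment protocol. In the toy sketch below, user-space threads stand in for CPUs and an atomic counter stands in for the TLB-shootdown acknowledgments; `synchronous_revoke` and the protocol shape are illustrative, not the kernel's IPI implementation.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;
use std::thread;

// Toy model of synchronous MMIO revocation: the revoking CPU publishes a
// shootdown request and spins until every other "CPU" acknowledges, so the
// containment call cannot return while any CPU might still have stale
// MMIO translations. Illustrative only.
fn synchronous_revoke(n_cpus: usize) -> usize {
    let acks = Arc::new(AtomicUsize::new(0));
    let handles: Vec<_> = (0..n_cpus)
        .map(|_| {
            let acks = Arc::clone(&acks);
            thread::spawn(move || {
                // Each "CPU" invalidates its stale entries, then acknowledges.
                acks.fetch_add(1, Ordering::Release);
            })
        })
        .collect();
    // The revoking CPU waits for all acks before returning -- after this
    // point, no MMIO access from the contained driver is possible.
    while acks.load(Ordering::Acquire) < n_cpus {
        std::hint::spin_loop();
    }
    for h in handles {
        h.join().unwrap();
    }
    acks.load(Ordering::Acquire)
}
```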
- FLR-free recovery (optimistic path). In the normal case, Tier 2 recovery does
not require Function Level Reset. Tier 1 recovery requires FLR because the driver
runs in Ring 0 and may have left the device in an arbitrary hardware state that only
a full reset can clear. Tier 2 recovery can typically avoid FLR because: (a) IOMMU
containment prevents DMA escapes regardless of device state, (b) MMIO revocation
prevents further device manipulation, and (c) the device's hardware state can be
re-initialized by the replacement driver instance during its
`init()` call. However, devices with complex internal state machines (GPUs, SmartNICs, FPGAs) may not be safely re-initializable without a full reset. If the replacement driver's `init()` detects an unresponsive or inconsistent device (no response to MMIO reads, unexpected register state, completion timeout), the registry escalates to FLR. This fallback is not the common case for simple devices (NICs, HID, storage controllers), but should be expected for complex devices with substantial internal firmware state.
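The escalation decision reduces to: skip FLR on the optimistic path, escalate when the replacement driver's probe finds the device unhealthy. A minimal sketch, with invented names (`ProbeResult`, `needs_flr`) standing in for whatever the registry actually uses:

```rust
#[derive(Debug, PartialEq, Clone, Copy)]
pub enum ProbeResult {
    Healthy,
    MmioUnresponsive,    // no response to MMIO reads
    UnexpectedRegisters, // inconsistent register state
    CompletionTimeout,
}

/// FLR-free recovery is the optimistic default for Tier 2; the registry
/// escalates to Function Level Reset only when the replacement driver's
/// init-time probe reports an unresponsive or inconsistent device.
pub fn needs_flr(probe: ProbeResult) -> bool {
    probe != ProbeResult::Healthy
}
```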
10.4.6 Tier Mobility and Auto-Demotion
Key principle: UmkaOS's isolation model is designed for flexibility, not dogma. Different hardware has different isolation capabilities (see Section 10.2 in README.md for the full architecture-specific analysis). The tier system allows administrators to make explicit tradeoffs between isolation and performance:
- Tier 1 provides isolation using the best available hardware mechanism: register-based on x86-64/ARMv7/PPC32/PPC64LE (~1-4% overhead), or page-table-based on AArch64 mainstream (~6-12% overhead), or POE-accelerated on AArch64 ARMv8.9+/ARMv9.4+ (~2-4% overhead). On RISC-V, Tier 1 isolation is not available — Tier 1 drivers run as Tier 0.
- Tier 2 provides strong process-level isolation on all architectures, at the cost of higher latency (~200-600 cycles per domain crossing vs ~23-80 cycles for Tier 1).
- The escape hatch is always available: Any Tier 1 driver can be manually demoted to Tier 2 by the administrator, or automatically demoted after repeated crashes. This allows environments that prioritize security over performance to opt into stronger isolation regardless of hardware capabilities.
Design intent: The system does not force a one-size-fits-all choice. A high-frequency trading system on x86_64 might run all drivers in Tier 1 for maximum performance. A secure enclave handling sensitive data on a RISC-V system might run all drivers in Tier 2 for maximum isolation. Both are valid deployments of the same kernel.
Drivers declare a preferred tier and a minimum tier in their manifest:
# drivers/tier1/nvme/manifest.toml
[driver]
name = "umka-nvme"
preferred_tier = 1
minimum_tier = 1 # NVMe cannot function well in Tier 2
# drivers/tier2/usb-hid/manifest.toml
[driver]
name = "umka-usb-hid"
preferred_tier = 2
minimum_tier = 2
The kernel's policy engine decides the actual tier based on:
- Trust level: Unsigned drivers are forced to Tier 2.
- Crash history: After 3 crashes within a configurable window, a Tier 1 driver is automatically demoted to Tier 2 (if minimum_tier allows).
- Admin overrides: System administrator can force any tier via configuration.
- Signature verification: Cryptographically signed drivers can be granted Tier 1.
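The four rules above compose into a single decision. The sketch below models them in order of precedence (admin override, then signature, then crash-driven demotion); the struct and function names are invented, and the interpretation that "`minimum_tier` allows" demotion only when `minimum_tier = 2` is an assumption based on the manifest examples:

```rust
/// Illustrative inputs to the tier decision; field names are assumptions.
pub struct TierPolicyInput {
    pub preferred_tier: u8,
    pub minimum_tier: u8,          // demotion floor from the manifest
    pub signed: bool,              // cryptographic signature verified
    pub crashes_in_window: u32,    // crash history within the configured window
    pub admin_override: Option<u8>,
}

/// Sketch of the policy-engine rules: admin override wins; unsigned
/// drivers are forced to Tier 2; after 3 crashes in the window, a
/// Tier 1 driver is demoted to Tier 2 if minimum_tier allows.
pub fn decide_tier(p: &TierPolicyInput) -> u8 {
    if let Some(t) = p.admin_override {
        return t;
    }
    if !p.signed {
        return 2; // unsigned drivers never get Tier 1
    }
    let mut tier = p.preferred_tier;
    if tier == 1 && p.crashes_in_window >= 3 && p.minimum_tier >= 2 {
        tier = 2; // auto-demotion
    }
    tier
}
```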
10.4.7 Debugging Across Isolation Domains (ptrace)
ptrace(PTRACE_PEEKDATA) on a Tier 1 driver thread must read memory tagged with the
driver's PKEY, which the debugger process does not have access to. The kernel handles
this by performing the read on behalf of the debugger:
ptrace access flow for MPK-isolated memory (high-level overview):
1. Debugger calls ptrace(PTRACE_PEEKDATA/POKEDATA, target_tid, addr).
2. Kernel checks: does `addr` belong to an MPK-protected region?
3. If yes: kernel performs a TOCTOU-safe PKRU manipulation
(see Security Note below) to grant temporary access,
performs the copy, then restores PKRU. This happens in kernel mode,
so the debugger process never gains direct access.
4. If no: standard ptrace read/write path (no MPK involvement).
ptrace write flow:
Same as read, but with write permission instead of read.
PKRU manipulation is a single WRPKRU instruction (~23 cycles; see
[Section 18.7.8](18-compat.md#1878-performance-impact) for detailed WRPKRU cycle count
analysis (11–260 cycles depending on pipeline state and microarchitecture)).
PTRACE_ATTACH to a Tier 1 driver thread:
Requires CAP_SYS_PTRACE (same as Linux).
The debugger can single-step, set breakpoints, and inspect registers.
Memory access goes through the kernel-mediated PKRU path above.
#### 10.4.7.1 Security Note: TOCTOU Mitigation
The ptrace PKRU manipulation flow has a Time-Of-Check-Time-Of-Use (TOCTOU) concern:
the kernel checks access, changes PKRU, performs the copy, then restores PKRU.
Between the PKRU change and restore, if the traced driver could execute arbitrary code,
it could issue its own `WRPKRU` and escape isolation.
**Mitigation strategy:**
ptrace PKRU-protected access (TOCTOU-safe):
1. Acquire `pt_reg_lock(target_tid)` — traced thread cannot run.
2. Verify debugger holds CAP_SYS_PTRACE and the ptrace relationship is authorized. This check happens before any PKRU state change.
3. Verify the address belongs to a valid MPK region owned by the target.
4. With IRQs disabled and `pt_reg_lock` held:
   a) Save current PKRU
   b) Set PKRU to grant temporary access to the target's PKEY
   c) Perform the copy (read or write)
   d) Restore saved PKRU
5. Release `pt_reg_lock(target_tid)`
This approach creates a **locked validation window**: the traced process cannot execute
between authorization and data copy, and cannot escape by issuing its own `WRPKRU`
because it is blocked by `pt_reg_lock`. The authorization check occurs before any
PKRU manipulation, ensuring that unauthorized debuggers cannot exploit the window.
**Alternative approaches considered:**
1. **Permanently grant debugger PKRU access**: Rejected — violates isolation principle.
2. **Copy through a bounce buffer with kernel mapping**: Adds overhead but would work;
however, PKRU manipulation is fast (~23 cycles) and the lock-based approach is
simpler when the debugger is already ptrace-attached.
3. **Disable PTRACE_PEEKDATA on Tier 1 drivers**: Would compromise debuggability;
the lock-based approach provides security without removing functionality.
The key invariant is: *no user-space code from the traced process runs between PKRU
authorization and PKRU restoration*. `pt_reg_lock` enforces this invariant.
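The locked-window flow can be modeled with plain state in a few lines. This is a simulation, not kernel code: `TracedThread`, `Pkru`, and `ptrace_mediated_copy` are invented names, the lock is modeled as a `runnable` flag, and the PKRU grant encoding is simplified to one bit per PKEY:

```rust
pub struct TracedThread {
    pub runnable: bool, // false while pt_reg_lock is held
}

pub struct Pkru(pub u32); // simplified: one grant bit per PKEY

/// Models the TOCTOU-safe window: authorization is checked before any
/// PKRU change, the traced thread is blocked for the whole window, and
/// PKRU is restored before the lock is released.
pub fn ptrace_mediated_copy(
    target: &mut TracedThread,
    pkru: &mut Pkru,
    authorized: bool, // CAP_SYS_PTRACE + ptrace relationship verified
    target_pkey: u32,
    copy: impl FnOnce(),
) -> Result<(), &'static str> {
    // 1. Acquire pt_reg_lock: the traced thread cannot run.
    target.runnable = false;
    // 2-3. Authorization and address validation happen *before* any
    //      PKRU change; failing here never opens the window.
    if !authorized {
        target.runnable = true;
        return Err("EPERM");
    }
    // 4. Save PKRU, grant temporary access, copy, restore.
    let saved = pkru.0;
    pkru.0 |= 1 << target_pkey; // temporary grant
    copy();                     // kernel-mode copy on the debugger's behalf
    pkru.0 = saved;
    // 5. Release pt_reg_lock.
    target.runnable = true;
    Ok(())
}
```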
Weak-isolation fast path: On platforms without MPK (or equivalent domain registers), the entire PKRU manipulation flow is unnecessary. `ptrace` uses the standard kernel read/write path — the driver's memory is in the same address space with no domain protection, so no temporary access grant is needed. The `pt_reg_lock` and TOCTOU-safe window are only instantiated when the architecture reports hardware domain support.
10.4.8 Signal Delivery Across Isolation Boundaries
When a signal targets a thread running in a Tier 1 (domain-isolated) driver:
Signal delivery to Tier 1 driver thread:
SIGKILL / SIGSTOP (non-catchable):
Kernel handles these directly — no signal frame is pushed.
For SIGKILL: the driver thread is terminated. The kernel runs
the driver's cleanup handler (if registered via KABI) in a
bounded context (timeout: 100ms). If cleanup doesn't complete,
the driver's isolation domain is revoked and all its memory freed.
Catchable signals (SIGSEGV, SIGUSR1, etc.):
1. Kernel saves driver's PKRU state.
2. Kernel sets PKRU to the process's default domain (no driver
memory access) before pushing the signal frame to the user stack.
3. Signal handler runs in the process's normal domain — it cannot
access driver-private memory.
4. On sigreturn: kernel restores the saved PKRU and resumes the
driver code with its original domain permissions.
This ensures a signal handler in application code cannot accidentally
(or maliciously) access driver-private memory while handling a signal
that interrupted driver execution.
> **Weak-isolation fast path**: Without hardware domain registers (no MPK/POE/DACR),
> the PKRU save/restore steps are elided. Signals are delivered using the standard
> kernel signal path without domain register manipulation. The signal handler runs
> with normal kernel permissions — on these platforms, the driver memory is not
> domain-protected anyway, so there is nothing to save or restore.
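The save/switch/restore steps around signal delivery can be sketched as two transitions on a per-thread context. This is a model, not the real signal path: `DriverContext` and the `DEFAULT_DOMAIN_PKRU` value are invented for illustration:

```rust
pub struct DriverContext {
    pub pkru: u32,              // live PKRU while driver code runs
    pub saved_pkru: Option<u32>, // set between signal entry and sigreturn
}

/// Hypothetical encoding: a PKRU value granting only the process's
/// default domain (no driver PKEYs).
pub const DEFAULT_DOMAIN_PKRU: u32 = 0xFFFF_FFFC;

/// Steps 1-2: save the driver's PKRU and switch to the process's
/// default domain before the signal frame is pushed, so the handler
/// cannot touch driver-private memory.
pub fn enter_signal_handler(ctx: &mut DriverContext) {
    ctx.saved_pkru = Some(ctx.pkru);
    ctx.pkru = DEFAULT_DOMAIN_PKRU;
}

/// Step 4: sigreturn restores the saved PKRU so the interrupted driver
/// code resumes with its original domain permissions.
pub fn sigreturn(ctx: &mut DriverContext) {
    if let Some(p) = ctx.saved_pkru.take() {
        ctx.pkru = p;
    }
}
```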
See also: Section 5.2 (SmartNIC and DPU Integration) adds an offload tier where driver data-plane operations are proxied to a DPU over PCIe or shared memory, using the same tier classification and IOMMU fencing model.
10.4.9 eBPF Interaction with Driver Isolation Domains
eBPF programs are a cross-cutting kernel extensibility mechanism used for tracing (kprobe, tracepoint), networking (XDP, tc), security (LSM, seccomp), and scheduling (struct_ops). Because eBPF programs execute in kernel mode with access to kernel data structures, their interaction with driver isolation domains requires explicit specification to prevent isolation domain circumvention.
Threat model: An eBPF program, if not properly constrained, could:
1. Access Tier 1/Tier 2 driver memory directly without going through the isolation boundary
2. Bypass MPK/POE protections by running in the same domain as umka-core
3. Modify driver state without proper capability checks
4. Exfiltrate data from isolated driver memory to user space via BPF maps
Isolation architecture: eBPF programs do not run in the same isolation domain as umka-core (PKEY 0). Each loaded eBPF program is assigned to a dedicated BPF isolation domain that is distinct from:
- umka-core (PKEY 0)
- All Tier 1 driver domains (PKEY 2-13 on x86-64)
- The shared DMA domain (PKEY 14)
- The guard domain (PKEY 15)
This means eBPF programs cannot directly access driver-private memory, umka-core internal state, or any isolation domain's memory without explicit kernel mediation.
Access rules for eBPF programs:
- No direct driver memory access: An eBPF program attached to a kprobe or tracepoint within a Tier 1 driver's code path executes in its own BPF domain, not the driver's domain. The BPF program cannot read or write the driver's private heap, stack, or MMIO-mapped device registers. Any access to driver state must go through BPF helper functions that perform cross-domain access on the program's behalf.
- BPF helper mediation: All BPF helpers that access kernel or driver state (e.g., `bpf_probe_read_kernel()`, `bpf_sk_lookup()`, `bpf_ct_lookup()`) are implemented as kernel-mediated cross-domain operations. The helper:
  - Validates that the target memory region belongs to a domain for which the BPF program's domain holds the appropriate capability (see rule 4)
  - Copies data between the target domain and the BPF program's stack or map memory using kernel-internal mappings that bypass domain restrictions
  - Returns an error if the capability check fails or the access is out of bounds
- Map isolation: BPF maps created by an eBPF program are owned by that program's BPF domain. Other isolation domains (including drivers) cannot access these maps without an explicit capability grant. Cross-domain map sharing follows the standard capability delegation mechanism (Section 8.1.1): the BPF domain must grant `MAP_READ` and/or `MAP_WRITE` capabilities to the target domain. This prevents a compromised driver from exfiltrating data through BPF maps it does not own.
- Capability requirements for driver access: BPF helpers that query or modify driver state require the BPF domain to hold the appropriate capability:
  - `bpf_skb_adjust_room()` (modify packet buffer in NIC driver): requires `CAP_NET_RAW` in the caller's network namespace
  - `bpf_xdp_adjust_head()` / `bpf_xdp_adjust_tail()`: requires `CAP_NET_RAW`
  - Helpers that read driver statistics or state: require `CAP_SYS_ADMIN` or a subsystem-specific read capability

  The verifier rejects at load time any program that calls a helper for which the loading context (the process calling `bpf()`) does not hold the required capabilities. The eBPF runtime re-checks capabilities at helper invocation time to handle capability revocation after program load.
- XDP and driver datapath: XDP programs attached to a NIC driver's receive path do not execute in the NIC driver's isolation domain. Instead:
  - The driver's receive handler (running in the driver's domain) copies the packet descriptor into a shared bounce buffer accessible to the BPF domain
  - The XDP program runs in the BPF domain, reading from and writing to the bounce buffer
  - Return values (`XDP_PASS`, `XDP_DROP`, `XDP_TX`, `XDP_REDIRECT`) are communicated back to the driver via a shared-memory return code
  - If the XDP program modifies the packet (`XDP_TX` or `XDP_REDIRECT` with modified data), the driver copies the modified packet back to its own domain before transmission or redirect

  This bounce-buffer design ensures the XDP program never directly accesses driver-private state (DMA rings, completion queues, device registers).
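One receive iteration of the bounce-buffer flow can be sketched as follows. The domains are simulated with two separate buffers and the XDP program with a closure; `rx_with_xdp` and `XdpVerdict` are illustrative names, not the real driver API:

```rust
#[derive(Debug, PartialEq, Clone, Copy)]
pub enum XdpVerdict { Pass, Drop, Tx, Redirect }

/// Driver copies the packet into the shared bounce buffer, the XDP
/// program runs against that copy in the BPF domain, and only on
/// TX/REDIRECT is the (possibly modified) buffer copied back into
/// driver memory before transmit/redirect.
pub fn rx_with_xdp(
    driver_pkt: &mut Vec<u8>,
    bounce: &mut Vec<u8>,
    xdp_prog: impl FnOnce(&mut Vec<u8>) -> XdpVerdict,
) -> XdpVerdict {
    bounce.clear();
    bounce.extend_from_slice(driver_pkt); // driver-domain -> bounce copy
    let verdict = xdp_prog(bounce);       // runs in the BPF domain
    match verdict {
        XdpVerdict::Tx | XdpVerdict::Redirect => {
            // Copy the modified packet back before transmission/redirect.
            driver_pkt.clear();
            driver_pkt.extend_from_slice(bounce);
        }
        XdpVerdict::Pass | XdpVerdict::Drop => {} // no copy-back needed
    }
    verdict
}
```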
Performance note for 100Gbps+: At 100Gbps with 64-byte packets (~148 Mpps),
per-packet bounce copies become a bottleneck (~10ns each = ~1.5 CPU cores just
for memcpy). For high-speed NICs (≥25Gbps), UmkaOS supports a zero-copy XDP
fast path: the NIC driver maps its receive ring into the BPF isolation domain
as read-only (via the shared DMA buffer pool, PKEY 14 on x86 / domain 2 on
AArch64), allowing XDP programs to inspect packets in-place without a copy.
Modification still requires a copy-on-write to a BPF-writable buffer. This
zero-copy path is opt-in per driver (xdp_features flag XDP_F_ZEROCOPY_RX)
and requires IOMMU to fence the BPF domain's read-only mapping.
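The packet-rate arithmetic behind the ~148 Mpps figure is worth making explicit: a 64-byte frame occupies 84 bytes on the wire once the 7-byte preamble, 1-byte SFD, and 12-byte inter-frame gap are counted. A small helper (name invented) reproduces the numbers:

```rust
/// Line-rate packets-per-second for a given Ethernet frame size,
/// including the 20 bytes of per-frame wire overhead
/// (7B preamble + 1B SFD + 12B inter-frame gap).
pub fn line_rate_pps(link_bps: f64, frame_bytes: u64) -> f64 {
    let wire_bits = ((frame_bytes + 20) * 8) as f64;
    link_bps / wire_bits
}
```

At 100 Gbps with 64-byte frames this gives ~148.8 Mpps; at ~10 ns per bounce copy that is ~1.49 seconds of CPU time per second, i.e. roughly 1.5 cores spent on memcpy alone, matching the note above.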
Weak-isolation fast path: When running without hardware isolation domains (`isolation=performance` or architectures without fast isolation), the bounce buffer is bypassed. XDP/TC programs access the driver's packet buffer directly (true zero-copy, matching Linux's XDP model). The BPF verifier still enforces bounds checking and memory safety — only the domain separation between BPF and driver memory is lost. Since the driver code itself already has unrestricted kernel memory access on these platforms, the bounce buffer would be protecting the driver's memory from BPF while the driver can already read/write all of kernel memory. The per-packet `memcpy` savings are significant at high packet rates (100Gbps with 64-byte packets = ~148M copies/sec eliminated).
- TC (traffic control) BPF: Same model as XDP — TC programs execute in a BPF domain, not in the network driver's or umka-net's domain. Packet data is copied through a shared buffer; the program cannot access umka-net's socket buffers, routing tables, or connection tracking state except through verified BPF helpers (`bpf_fib_lookup()`, `bpf_ct_lookup()`, etc.) that perform capability-checked cross-domain access.
- Kprobe and tracepoint attachment to drivers: When a BPF program is attached to a kprobe within a Tier 1 driver's code:
  - The kprobe fires while the CPU is running in the driver's isolation domain
  - The BPF program is invoked after the kernel switches to the BPF domain
  - The program receives only the function arguments (copied to BPF stack) and cannot access the driver's heap, globals, or MMIO regions
  - Return probes (kretprobe) receive the return value copied to BPF stack

  The domain switch before BPF execution and the argument copy are performed by the kprobe infrastructure in umka-core, ensuring the BPF program is fully contained within its own domain.
- LSM BPF and security hooks: LSM BPF programs attached to security hooks (file open, socket create, etc.) run in a BPF domain. They cannot access the credentials, file descriptors, or socket state of the process that triggered the hook except through BPF helpers (`bpf_get_current_pid_tgid()`, `bpf_get_current_cred()`, etc.) that copy the relevant data into the BPF program's memory. Security decisions (allow/deny) are returned via an integer return code; the program cannot directly modify kernel security state.
Domain allocation for BPF: On x86-64, BPF domains are allocated from the same PKEY pool as Tier 1 drivers (PKEY 2-13). Typical systems run 5-8 Tier 1 driver domains, leaving 4-7 domains for BPF programs. When domain exhaustion occurs (drivers + BPF programs > 12 domains), BPF programs share a common BPF domain rather than each getting a dedicated domain. This reduces isolation granularity between BPF programs but preserves isolation between BPF and drivers and between BPF and umka-core. BPF-to-BPF isolation is a best-effort optimization, not a security guarantee — BPF programs are verified code with bounded execution, and their primary isolation boundary is BPF-to-driver and BPF-to-core, both of which are always maintained regardless of domain pressure. On architectures without a fixed domain limit (PPC64LE, AArch64 mainstream page-table path), each BPF program gets its own domain. On RISC-V (no Tier 1), BPF domains are not applicable — BPF programs run without isolation domains.
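The exhaustion fallback can be modeled as a small allocator over the shared PKEY pool. This is a sketch: the pool bounds (PKEY 2-13) come from the text, but the choice of which PKEY serves as the common shared BPF domain, and the function name, are assumptions:

```rust
pub const FIRST_POOL_PKEY: u32 = 2;
pub const LAST_POOL_PKEY: u32 = 13; // PKEYs 2-13 shared by Tier 1 drivers and BPF

/// Hypothetical: a designated PKEY used as the common BPF domain when
/// the pool is exhausted (all programs then share it).
pub const SHARED_BPF_PKEY: u32 = 13;

/// Give a new BPF program its own PKEY from the pool it shares with
/// Tier 1 drivers; on exhaustion, fall back to the shared BPF domain.
/// `in_use` is the set of PKEYs already held by drivers and programs.
pub fn alloc_bpf_domain(in_use: &[u32]) -> u32 {
    (FIRST_POOL_PKEY..=LAST_POOL_PKEY)
        .find(|k| !in_use.contains(k))
        .unwrap_or(SHARED_BPF_PKEY)
}
```

Note the fallback preserves the guarantees the text calls out: BPF-to-driver and BPF-to-core isolation are unaffected (those domains keep their own PKEYs); only BPF-to-BPF granularity degrades.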
Crash handling: A crash (verifier bug, JIT bug, or helper bug) within a
BPF program triggers the same containment as a Tier 1 driver crash:
- The BPF domain is revoked
- All maps owned by that domain are invalidated (subsequent lookups return
-ENOENT)
- Attached hooks are automatically detached
- The program is marked as faulted and cannot be re-attached without reload
Unlike Tier 1 drivers, BPF programs do not have a recovery path — they are considered stateless (persistent state lives in maps, which survive program reload). The administrator must reload the program manually or via orchestration.
Full specification: The complete BPF isolation model — domain confinement, map access control, capability-gated helpers, cross-domain packet redirect rules, and verifier enforcement — is specified in Section 15.2.2 (Packet Filtering, BPF-Based). Although Section 15.2.2 is located in the Networking part, its isolation rules apply to all BPF program types, not just networking hooks. The rules above are a driver-centric summary; Section 15.2.2 provides the canonical specification.
10.4.10 Tier 2 Interface and SDK
Tier 2 drivers run in separate user-space processes. They communicate with umka-core via dedicated KABI syscalls — not the domain ring buffers used by Tier 1.
KABI syscalls for Tier 2 drivers:
These syscalls use a dedicated syscall range (__NR_umka_driver_base + offset,
allocated from the UmkaOS-private syscall range defined in Section 18.1.2). They are
not Linux-compatible syscalls -- they are UmkaOS-specific and used only by the
Tier 2 driver SDK. The SDK wraps them behind the same KernelServicesVTable
interface that Tier 1 drivers use, so driver code is tier-agnostic.
| KABI Syscall | Syscall Offset | Arguments | Return | Purpose |
|---|---|---|---|---|
| `umka_driver_register` | 0 | `manifest: *const DriverManifest, manifest_size: u64, out_services: *mut KernelServicesVTable, out_device: *mut DeviceDescriptor` | `IoResultCode` | Register with device registry. Kernel validates manifest, assigns capabilities, returns kernel services vtable and device descriptor. Called once at driver process startup. |
| `umka_driver_mmio_map` | 1 | `device_handle: DeviceHandle, bar_index: u32, offset: u64, size: u64, out_vaddr: *mut u64` | `IoResultCode` | Map a device BAR (or portion) into driver address space. Kernel validates BAR ownership, IOMMU group, and capability before creating the mapping. The mapping is revocable: the kernel can unmap it at any time during driver containment (see "Tier 2 MMIO access model" above). |
| `umka_driver_dma_alloc` | 2 | `size: u64, align: u64, flags: AllocFlags, out_vaddr: *mut u64, out_dma_addr: *mut u64` | `IoResultCode` | Allocate DMA-capable memory. Kernel allocates physical pages, creates IOMMU mapping, maps into driver process. Returns both virtual and DMA (bus) addresses. |
| `umka_driver_dma_free` | 3 | `vaddr: u64, size: u64` | `IoResultCode` | Release a DMA buffer. Kernel tears down IOMMU mapping, unmaps from process, frees physical pages. |
| `umka_driver_irq_wait` | 4 | `irq_handle: u32, timeout_ns: u64` | `IoResultCode` | Block until the registered interrupt fires or timeout expires. Returns IO_SUCCESS on interrupt, IO_TIMEOUT on timeout. Uses eventfd internally for efficient wakeup. |
| `umka_driver_complete` | 5 | `request_id: u64, status: IoResultCode, bytes_transferred: u64` | `IoResultCode` | Post an I/O completion to umka-core. The completion is forwarded to the originating io_uring CQ or waiting syscall. |
Error codes: All Tier 2 KABI syscalls return IoResultCode (defined in
umka-driver-sdk/src/abi.rs). Common errors: IO_ERR_INVALID_HANDLE (bad device
handle), IO_ERR_PERMISSION (missing capability), IO_ERR_NO_MEMORY (allocation
failed), IO_ERR_BUSY (resource in use), IO_ERR_TIMEOUT.
Performance: Per-I/O overhead floor is ~200-400ns (two syscall transitions). For high-IOPS devices (NVMe, 100GbE), this is significant — those belong in Tier 1. Tier 2 suits devices where overhead is negligible: USB, printers, audio (~1-10ms periods), experimental drivers, and third-party binaries compiled against the stable SDK.
Security boundary: A Tier 2 driver crash is an ordinary process crash. It cannot corrupt kernel memory or issue DMA outside IOMMU-fenced regions. On containment, the kernel revokes all MMIO mappings (preventing further device register access) and tears down IOMMU entries (causing any residual in-flight DMA to fault). The kernel restarts the driver process if the restart policy permits (~10ms recovery).
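A Tier 2 driver's event loop built on these syscalls is short. The sketch below models the `umka_driver_irq_wait` / `umka_driver_complete` pair behind a trait so it can be exercised without a kernel; the trait, the mock, and the numeric `IoResultCode` values are invented for illustration (the SDK's real wrappers live behind `KernelServicesVTable`):

```rust
pub type IoResultCode = i32;
pub const IO_SUCCESS: IoResultCode = 0; // illustrative values,
pub const IO_TIMEOUT: IoResultCode = 1; // not the real ABI constants

/// Stand-in for the SDK wrappers around the Tier 2 KABI syscalls.
pub trait Kabi {
    fn irq_wait(&mut self, irq_handle: u32, timeout_ns: u64) -> IoResultCode;
    fn complete(&mut self, request_id: u64, status: IoResultCode, bytes: u64) -> IoResultCode;
}

/// Service one request: block on the device IRQ, then post the
/// completion back to umka-core for forwarding to the io_uring CQ
/// or waiting syscall.
pub fn service_one_request<K: Kabi>(
    kabi: &mut K,
    irq_handle: u32,
    request_id: u64,
    bytes: u64,
) -> IoResultCode {
    match kabi.irq_wait(irq_handle, 1_000_000_000) {
        IO_SUCCESS => kabi.complete(request_id, IO_SUCCESS, bytes),
        other => other, // IO_TIMEOUT etc.: caller decides whether to retry
    }
}

/// Test double that records posted completions.
pub struct MockKabi {
    pub completions: Vec<(u64, IoResultCode, u64)>,
    pub irq_result: IoResultCode,
}

impl Kabi for MockKabi {
    fn irq_wait(&mut self, _irq: u32, _timeout_ns: u64) -> IoResultCode {
        self.irq_result
    }
    fn complete(&mut self, request_id: u64, status: IoResultCode, bytes: u64) -> IoResultCode {
        self.completions.push((request_id, status, bytes));
        IO_SUCCESS
    }
}
```

Because the SDK presents the same vtable interface to both tiers, driver logic written against a trait like this stays tier-agnostic, which is the design point of the KABI wrapping described above.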
10.5 Device Registry and Bus Management
Summary: This section specifies the kernel-internal device registry — a topology-aware tree that tracks all hardware devices, their parent/child relationships, driver bindings, power states, and capabilities. It covers: bus enumeration and matching (Section 10.5.4), device lifecycle and hot-plug (Section 10.5.6-Section 10.5.7), power management ordering (Section 10.5.6), crash recovery integration (Section 10.5.10), sysfs compatibility (Section 10.5.12), and firmware management (Section 10.5.15). The registry is the single source of truth for "what hardware exists" and is used by the scheduler (Section 6.1), fault manager (Section 19.1), DPU offload layer (Section 5.2), and unified compute topology (Section 21.6). Readers needing only the API surface can skip to Section 10.5.3 (data model) and Section 10.5.9 (KABI integration).
10.5.1 Motivation and Prior Art
10.5.1.1 The Problem
UmkaOS's KABI provides a clean bilateral vtable exchange between kernel and driver. But the current design has no answer for:
- Device hierarchies: How does the kernel model that a USB keyboard is behind a hub, which is behind an XHCI controller, which sits on a PCI bus? The topology matters for power management ordering, hot-plug teardown, and fault propagation.
- Driver-to-device matching: When the kernel discovers a PCI device with vendor 0x8086 and device 0x2723, how does it know which driver to load? Currently there is no matching mechanism.
- Power management ordering: Suspending a PCI bridge before its child devices causes data loss. The kernel needs to know the topology to get the ordering right.
- Cross-driver services: A NIC may need a PHY driver. A GPU display pipeline may need an I2C controller. There is no way for drivers to discover and use services provided by other drivers.
- Hot-plug: When a USB device is yanked, the kernel must tear down the device, its driver, and all child devices in the correct order.
The key insight from macOS IOKit: the kernel should own the device relationship model. But IOKit's mistake was embedding the model in the driver's C++ class hierarchy, coupling it to the ABI. We build it as a kernel-internal service that drivers access through KABI methods.
10.5.1.2 What We Learn From Existing Systems
Linux (kobject / bus / device / driver / sysfs):
- Device model is a graph of kobject structures exposed via sysfs.
- Bus types (PCI, USB, platform) each implement their own match/probe/remove.
- Strengths: sysfs gives userspace introspection; uevent mechanism for hotplug.
- Weaknesses: driver matching is bus-specific with no unified property system; power
management ordering is heuristic (dpm_list), not topology-derived; the kobject model
is deeply entangled with kernel internals — drivers directly embed and manipulate
kobjects.
macOS IOKit (IORegistry):
- All devices modeled as a tree of C++ objects (IORegistryEntry → IOService → ...).
- Matching uses property dictionaries ("matching dictionaries").
- Power management tree mirrors the registry tree — IOPMPowerState arrays per driver.
- Strengths: property-based matching is elegant; PM ordering derives from the tree; service publication/lookup via IOService matching.
- Weaknesses: C++ class hierarchy is the ABI — changing a base class breaks all drivers (fragile base class problem). This is why Apple deprecated kexts and moved to DriverKit. The matching system is over-general (personality dictionaries are complex). Memory management is manual.
Windows PnP Manager:
- Kernel-mode PnP manager maintains a device tree. Device nodes have properties.
- INF files declare driver matching rules (declarative, external to the binary).
- Power management uses IRP_MN_SET_POWER directed through the tree.
- Strengths: INF-based declarative matching is clean; power IRPs propagate with correct ordering; robust hotplug.
- Weaknesses: IRP-based model is complex; WDM/WDF driver model is notoriously difficult.
Fuchsia (Driver Framework v2):
- "Bind rules" — a simple declarative language — match drivers to devices.
- Driver manager runs as a userspace component. Device topology is a tree of nodes in a namespace.
- Strengths: clean separation of concerns; bind rules are simple and composable; userspace driver manager can be restarted independently.
- Weaknesses: everything going through IPC adds latency; the DFv1-to-DFv2 migration shows that evolving the framework is painful.
10.5.1.3 UmkaOS's Position
We take the best ideas from each:
| Concept | Borrowed From | Adaptation |
|---|---|---|
| Property-based matching | IOKit | Declarative match rules in driver manifest, not runtime OOP matching |
| Registry as a tree | IOKit, Linux | Kernel-internal tree, drivers get opaque handles only |
| PM ordering from topology | IOKit, Windows | Topological sort of device tree, timeouts at each level |
| Service publication/lookup | IOKit | Mediated by registry through KABI, not direct object references |
| Sysfs-compatible output | Linux | Registry is the single source of truth for /sys |
| Uevent hotplug notifications | Linux | Registry emits Linux-compatible uevents |
| Declarative bind rules | Fuchsia | Match rules embedded in driver ELF binary |
What we take from none of them: the registry is a kernel-internal data structure.
Drivers never see it directly. They interact through opaque DeviceHandle values
and KABI vtable methods. No OOP inheritance, no C++ objects, no kobject embedding, no
global symbol tables. The flat, versioned, append-only KABI philosophy is fully preserved.
10.5.2 Design Principles
- Kernel owns the graph, drivers own the hardware logic. The registry manages topology, matching, lifecycle, and power ordering. Drivers manage hardware registers, DMA, and device-specific protocols. Clean separation.
- Drivers are leaves, not framework participants. A driver does not subclass a framework object. It fills in a vtable and receives callbacks. The registry decides when to call those callbacks based on topology and policy.
- No ABI coupling. The registry is kernel-internal. Drivers interact with it through KABI methods appended to `KernelServicesVTable`. If the registry's internal data structures change, no driver recompilation is needed.
- Topology drives policy. Power management ordering, hot-plug teardown, crash recovery cascading, and NUMA affinity are all derived from the device tree topology. No heuristics, no manually maintained ordering lists.
- Capability-mediated access. All cross-driver interactions go through the registry, which validates capabilities and handles tier transitions (isolation domain switches, user-kernel IPC). Drivers never communicate directly.
10.5.3 Registry Data Model
10.5.3.1 DeviceNode
The fundamental unit is a DeviceNode — a kernel-internal structure that drivers never
see directly.
Heap allocation requirement: `DeviceNode` and its child structures (`Vec`, `String`, `HashMap` in `PropertyTable` and `DeviceRegistry`) require heap allocation. The device registry is initialized at boot step 4g (Section 10.5.11), which is after the physical memory allocator and virtual memory subsystem are running (steps 4b-4c). Tier 0 devices (APIC, timer, serial) that are needed before heap init do not use the registry — they are registered retroactively after registry init (Section 10.5.11.1). No registry data structures are used during early boot before the heap is available.
// Kernel-internal — NOT part of KABI
pub struct DeviceNodeId(pub u64); // Unique, monotonically increasing, never reused
pub struct DeviceNode {
// Identity
id: DeviceNodeId,
name: ArrayString<64>, // e.g., "pci0000:00", "0000:00:1f.2", "usb1-1.3"
// Tree structure
parent: Option<DeviceNodeId>,
children: Vec<DeviceNodeId>, // Ordered by discovery time
// Service relationships (non-tree edges)
providers: Vec<ServiceLink>, // Services this node consumes
clients: Vec<ServiceLink>, // Nodes that consume services from this node
// Device identity
bus_type: BusType, // Reuses existing BusType from abi.rs
bus_identity: BusIdentity, // Bus-specific ID (PCI IDs, USB descriptors, etc.)
properties: PropertyTable, // Key-value property store
// Lifecycle
state: DeviceState,
driver_binding: Option<DriverBinding>,
// Placement
numa_node: i32, // -1 = unknown
// Power
power_state: PowerState,
runtime_pm: RuntimePmPolicy,
// Security
device_cap: CapHandle, // Capability for this device
// Resources
resources: DeviceResources, // BAR mappings, IRQs, DMA state
// IOMMU
iommu_group: Option<IommuGroupId>, // Shared IOMMU group (for passthrough)
// Reliability
/// Sliding-window failure tracker. Records timestamps of recent failures
/// in a circular buffer (capacity: 16 entries). The demotion policy checks
/// how many failures occurred within the configured window (default: 1 hour).
/// See `FailureWindow` definition below.
failure_window: FailureWindow,
last_transition_ns: u64,
// State buffer integrity
/// HMAC-SHA256 key for state buffer integrity verification.
/// Generated by umka-core on first driver load for this DeviceHandle.
/// Persists across driver crash/reload cycles; discarded only on
/// DeviceHandle removal (device unplugged or deregistered).
/// See `DriverHmacKey` below for the full key lifecycle specification.
state_hmac_key: Option<DriverHmacKey>,
}
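The `failure_window` field's doc comment describes a 16-entry circular buffer queried against the demotion window. A minimal sketch of such a tracker, under the stated assumptions (fixed capacity 16, timestamps in nanoseconds); the method names are illustrative:

```rust
/// Sliding-window failure tracker: a 16-entry circular buffer of
/// failure timestamps, queried for how many fall inside the
/// configured demotion window.
pub struct FailureWindow {
    timestamps_ns: [u64; 16],
    next: usize, // circular write index
    len: usize,  // number of valid entries (saturates at 16)
}

impl FailureWindow {
    pub fn new() -> Self {
        FailureWindow { timestamps_ns: [0; 16], next: 0, len: 0 }
    }

    /// Record a failure, overwriting the oldest entry when full.
    pub fn record(&mut self, now_ns: u64) {
        self.timestamps_ns[self.next] = now_ns;
        self.next = (self.next + 1) % 16;
        if self.len < 16 {
            self.len += 1;
        }
    }

    /// Failures within [now - window, now]. The demotion policy
    /// compares this count against its threshold (e.g. 3 failures
    /// within the default 1-hour window).
    pub fn failures_within(&self, now_ns: u64, window_ns: u64) -> usize {
        self.timestamps_ns[..self.len]
            .iter()
            .filter(|&&t| now_ns.saturating_sub(t) <= window_ns)
            .count()
    }
}
```

A fixed-capacity buffer is deliberate for a kernel structure: recording is O(1) with no allocation, and overwriting entries older than the window loses nothing the policy cares about.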
DriverHmacKey: Key Storage and Lifecycle
The state_hmac_key field above is backed by DriverHmacKey, which controls every
aspect of key material storage, protection, derivation, and rotation. The key must
reside exclusively in UmkaOS Core private memory so that a compromised Tier 1 driver
(which runs at Ring 0 and can execute WRPKRU) cannot extract it and forge state
buffer integrity tags. See the threat model discussion in Section 10.8 (TOCTOU
mitigation) for why Tier 2 isolation is required to prevent key extraction by an
actively exploited driver.
/// HMAC-SHA256 key for driver state buffer integrity verification.
///
/// Stored in UmkaOS Core private memory (protection key 0 — the PKEY 0 domain
/// is inaccessible to all driver code regardless of privilege level). Driver
/// code running in PKEY 2-13 domains cannot read this key even if it executes
/// arbitrary Ring 0 instructions, because the PKRU register in Core's execution
/// context grants read/write only to PKEY 0 when performing HMAC operations.
///
/// # Key derivation
///
/// Key material is derived via HKDF-SHA256 from:
/// IKM = 256 bits from RDRAND (or platform TRNG on non-x86)
/// Salt = TPM PCR[7] measurement (secure boot policy PCR, 32 bytes)
/// Info = b"umka-driver-hmac" || slot_id.to_le_bytes() || generation.to_le_bytes()
///
/// This binds each key to its driver slot and generation, preventing a key
/// generated for slot 3 generation 5 from being used to verify state produced
/// under slot 3 generation 4 (even if the generation counter wraps — see the
/// generation wrap policy in Section 11.1.5.3).
///
/// # Memory location
///
/// The containing `DeviceNode` is allocated from the `.data.pkey0` slab, which
/// is mapped exclusively to PKEY 0 in the UmkaOS Core page tables. On non-x86
/// architectures that lack MPK/POE, the equivalent protection is achieved via
/// a dedicated kernel-only page table entry that is never present in any driver
/// domain's address space.
pub struct DriverHmacKey {
/// Raw 256-bit key material. Zeroized on driver unload via volatile writes
/// (preventing the compiler from eliding the zeroing as a dead store).
key: Zeroize<[u8; 32]>,
/// Driver slot this key is bound to. Used for audit logging and for
/// verifying that the key is not accidentally applied to a different slot.
driver_slot: DriverSlot,
/// HKDF generation input that was used when this key was derived.
/// Checked before HMAC verification: a key with generation G will not
/// successfully verify state that was tagged under generation G' ≠ G,
/// because the HKDF `Info` field differs.
generation: u32,
}
/// Memory-safe zeroizing wrapper.
///
/// Uses volatile pointer writes via `core::ptr::write_volatile` to prevent
/// the compiler from treating the zeroing as a dead store and eliding it.
/// This is the same pattern used by the `zeroize` crate in userspace.
pub struct Zeroize<T: Copy + Default>(T);
impl<T: Copy + Default> Drop for Zeroize<T> {
fn drop(&mut self) {
// SAFETY: `self.0` is valid, aligned, and exclusively owned here.
// The volatile write prevents the optimizer from removing the zeroing.
unsafe {
core::ptr::write_volatile(&mut self.0, T::default());
}
}
}
Key lifecycle:
- Allocation — DriverHmacKey::new(slot, generation) is called under PKEY 0 protection during driver_load(). The call:
  1. Reads 32 bytes from the platform TRNG (RDRAND on x86-64; SoC TRNG on ARM/RISC-V).
  2. Reads TPM PCR[7] (32 bytes) via the TPM KABI call (Section 8.3).
  3. Derives the key via HKDF-SHA256:
     key = HKDF(IKM=trng_bytes, salt=pcr7, info="umka-driver-hmac" || slot || gen).
  4. Stores the result in DriverHmacKey.key within the PKEY 0 slab.
- Access — Only UmkaOS Core code executing with PKRU granting PKEY 0 read/write can dereference DriverHmacKey.key. Driver code (PKEY 2-13 domains) receives a page fault if it attempts to read the key's memory. The HMAC computation itself is performed by a dedicated Core function (driver_state_hmac_compute) that briefly acquires PKEY 0 access, performs the computation into a stack-local output buffer, then restores the caller's PKRU before returning. The key material is never copied to driver-accessible memory.
- Rotation — On every driver reload (crash recovery or explicit operator unload), generation increments, a new TRNG sample is drawn, and a fresh key is derived. The old DriverHmacKey is dropped, triggering Zeroize::drop, which overwrites the key material with zeros via volatile writes before the slab page is returned to the allocator.
- Storage location — umka_core::driver_registry::SLOT_KEYS[slot] is a static array in the .data.pkey0 linker section. The linker script maps this section to a physical page range that is exclusively assigned to PKEY 0 in the Core page tables. The array is indexed by DriverSlot (the same integer used in DeviceNode); the maximum number of concurrent driver slots is discovered at boot from the device count and the configured tier limits (no compile-time cap).
- Discarding — When a DeviceNode is removed from the registry (device unplugged or driver_deregister() called by the operator), state_hmac_key is set to None, dropping the DriverHmacKey value and zeroizing the key. Subsequent crash recovery for this slot (if a new device is hotplugged to the same slot) generates an entirely fresh key.
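The Allocation step builds the HKDF `Info` field from a fixed label plus the slot and generation counters, which is what binds each key to one slot/generation pair. A minimal sketch of just that byte-layout step (hmac_key_info is a hypothetical helper name; the layout follows the derivation formula quoted above):

```rust
/// Illustrative sketch of the HKDF Info construction described above:
/// "umka-driver-hmac" || slot_id (LE) || generation (LE). Two keys derived
/// for different slots or generations get different Info bytes, so HKDF
/// yields unrelated output even from the same IKM and salt.
fn hmac_key_info(slot_id: u32, generation: u32) -> Vec<u8> {
    let mut info = Vec::with_capacity(16 + 4 + 4);
    info.extend_from_slice(b"umka-driver-hmac"); // 16-byte domain label
    info.extend_from_slice(&slot_id.to_le_bytes());
    info.extend_from_slice(&generation.to_le_bytes());
    info
}

fn main() {
    let a = hmac_key_info(3, 5);
    let b = hmac_key_info(3, 4);
    assert_eq!(a.len(), 24);
    assert_ne!(a, b); // a generation bump changes the derivation context
    assert!(a.starts_with(b"umka-driver-hmac"));
}
```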
10.5.3.2 PropertyTable
Properties are the lingua franca of matching and introspection. They serve the same role as IOKit's property dictionaries and Linux's sysfs attributes.
// PropertyValue variants String, Bytes, and StringArray use heap-allocated
// containers. These are only constructed after heap init (boot step 4b+).
// For pre-heap device identification, Tier 0 devices use fixed-size
// ArrayString<64> in BusIdentity (Section 10.5.3.3) which is stack-allocated.
pub enum PropertyValue {
U64(u64),
I64(i64),
String(String),
Bytes(Vec<u8>),
Bool(bool),
StringArray(Vec<String>),
}
/// Stored as a sorted Vec for cache-friendly iteration and binary search.
/// Device nodes rarely have more than ~30 properties.
pub struct PropertyTable {
entries: Vec<(PropertyKey, String, PropertyValue)>,
}
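The sorted-Vec layout pays off because both lookup and insertion can use binary search over a contiguous array. A simplified userland sketch of that access pattern (the types here are illustrative stand-ins, not the real KABI definitions):

```rust
/// Simplified sketch of the sorted-Vec property table described above.
/// Entries stay sorted by key, so lookup is a binary search — O(log n)
/// and cache-friendly for the ~30 properties a typical node carries.
#[derive(Debug, Clone, PartialEq)]
enum PropertyValue {
    U64(u64),
    Str(String),
}

struct PropertyTable {
    entries: Vec<(String, PropertyValue)>, // kept sorted by key
}

impl PropertyTable {
    fn new() -> Self { Self { entries: Vec::new() } }

    fn set(&mut self, key: &str, value: PropertyValue) {
        match self.entries.binary_search_by(|(k, _)| k.as_str().cmp(key)) {
            Ok(i) => self.entries[i].1 = value, // overwrite in place
            Err(i) => self.entries.insert(i, (key.to_string(), value)),
        }
    }

    fn get(&self, key: &str) -> Option<&PropertyValue> {
        self.entries
            .binary_search_by(|(k, _)| k.as_str().cmp(key))
            .ok()
            .map(|i| &self.entries[i].1)
    }
}

fn main() {
    let mut t = PropertyTable::new();
    t.set("vendor-id", PropertyValue::U64(0x8086));
    t.set("bus-type", PropertyValue::Str("pci".into()));
    assert_eq!(t.get("vendor-id"), Some(&PropertyValue::U64(0x8086)));
    assert_eq!(t.get("missing"), None);
    // Keys end up sorted regardless of insertion order.
    assert_eq!(t.entries[0].0, "bus-type");
}
```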
Standard property keys (well-known constants):
| Key | Type | Description | Set By |
|---|---|---|---|
| "bus-type" | String | "pci", "usb", "platform", "virtio" | Bus enumerator |
| "vendor-id" | U64 | PCI/USB vendor ID | Bus enumerator |
| "device-id" | U64 | PCI/USB device ID | Bus enumerator |
| "subsystem-vendor-id" | U64 | PCI subsystem vendor | Bus enumerator |
| "subsystem-device-id" | U64 | PCI subsystem device | Bus enumerator |
| "class-code" | U64 | PCI class code / USB class | Bus enumerator |
| "revision-id" | U64 | Hardware revision | Bus enumerator |
| "compatible" | StringArray | DT/ACPI compatible strings | Firmware parser |
| "device-name" | String | Human-readable name | Bus enumerator |
| "driver-name" | String | Name of bound driver | Registry |
| "driver-tier" | U64 | Current isolation tier | Registry |
| "numa-node" | I64 | NUMA node ID | Topology scanner |
| "location" | String | Physical topology path (e.g., PCI BDF) | Bus enumerator |
| "serial-number" | String | Device serial if available | Bus enumerator |
Properties set by "Bus enumerator" are populated during device discovery by whatever code enumerates the bus (PCI config space scan, USB hub status, ACPI namespace walk). Properties set by "Registry" are managed by the kernel. Drivers can set custom properties on their own device node via KABI.
10.5.3.3 BusIdentity
A union-like enum holding bus-specific identification. The Pci variant reuses the
existing PciDeviceId type from the driver SDK.
pub enum BusIdentity {
Pci {
segment: u16,
bus: u8,
device: u8,
function: u8,
id: PciDeviceId, // Existing type from abi.rs
},
Usb {
bus_num: u16,
port_path: [u8; 8], // Hub topology chain
port_depth: u8,
vendor_id: u16,
product_id: u16,
device_class: u8,
device_subclass: u8,
device_protocol: u8,
interface_class: u8,
interface_subclass: u8,
interface_protocol: u8,
},
Platform {
compatible: ArrayString<64>, // ACPI _HID or DT compatible
unit_id: u64, // ACPI _UID or DT unit address
},
VirtIo {
device_type: u32,
vendor_id: u32,
device_id: u32,
},
}
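The Pci variant's segment/bus/device/function fields are what the registry renders into the "location" property. A small illustrative helper (pci_location is not part of the KABI) showing the canonical BDF formatting:

```rust
/// Illustrative helper (not from the KABI) rendering a Pci variant's
/// segment/bus/device/function into the canonical BDF form used by the
/// "location" property, e.g. "0000:03:00.0".
fn pci_location(segment: u16, bus: u8, device: u8, function: u8) -> String {
    // Segment and bus/device are zero-padded hex; function is a bare digit.
    format!("{:04x}:{:02x}:{:02x}.{}", segment, bus, device, function)
}

fn main() {
    assert_eq!(pci_location(0, 3, 0, 0), "0000:03:00.0");   // NVMe in the example tree
    assert_eq!(pci_location(0, 0, 0x1f, 2), "0000:00:1f.2"); // SATA controller
}
```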
10.5.3.4 Service Links
Non-tree edges representing provider-client relationships between devices:
pub struct ServiceLink {
service_name: ArrayString<64>, // e.g., "phy", "i2c", "gpio", "block"
node_id: DeviceNodeId,
cap_handle: CapHandle, // Capability for mediated access
}
10.5.3.5 Tree Structure Example
Root
+-- acpi0 (ACPI namespace root)
| +-- pci0000:00 (PCI host bridge, segment 0, bus 0)
| | +-- 0000:00:1f.0 (ISA bridge / LPC)
| | +-- 0000:00:1f.2 (SATA controller)
| | | +-- ata0 (ATA port 0)
| | | | +-- sda (disk)
| | | +-- ata1 (ATA port 1)
| | +-- 0000:00:14.0 (USB XHCI controller)
| | | +-- usb1 (USB bus)
| | | | +-- usb1-1 (hub)
| | | | | +-- usb1-1.1 (keyboard)
| | | | | +-- usb1-1.2 (mouse)
| | +-- 0000:03:00.0 (NVMe controller)
| | | +-- nvme0n1 (NVMe namespace 1)
| | +-- 0000:04:00.0 (NIC - Intel i225)
| | | ...provider-client link: "phy" --> phy0 (not a child)
+-- platform0 (Platform device root)
+-- serial0 (Platform UART)
+-- phy0 (Platform PHY device)
Two types of edges:
- Parent-Child (structural containment): A PCI device is a child of a PCI bridge. A USB device is a child of a USB hub. This is the primary tree structure.
- Provider-Client (service dependency): Lateral edges. A NIC is a client of a PHY's "phy" service. A GPU display driver is a client of an I2C controller's "i2c" service. These edges do not form cycles (enforced by the registry).
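The acyclicity guarantee means the registry must reject any new provider-client edge that would close a loop. A toy sketch of that check, assuming a simple reachability test (the real enforcement mechanism is not specified here; node IDs are plain integers for illustration):

```rust
use std::collections::HashMap;

/// Reachability over existing service links: can `from` reach `to`
/// by following client -> provider edges? Safe to recurse because the
/// invariant below guarantees the graph is already acyclic.
fn reaches(edges: &HashMap<u32, Vec<u32>>, from: u32, to: u32) -> bool {
    if from == to {
        return true;
    }
    edges
        .get(&from)
        .map_or(false, |next| next.iter().any(|&n| reaches(edges, n, to)))
}

/// Add a client -> provider service link, rejecting it if the provider
/// can already reach the client (which would create a cycle).
fn add_service_link(
    edges: &mut HashMap<u32, Vec<u32>>,
    client: u32,
    provider: u32,
) -> Result<(), &'static str> {
    if reaches(edges, provider, client) {
        return Err("service link would create a cycle");
    }
    edges.entry(client).or_default().push(provider);
    Ok(())
}

fn main() {
    let mut edges = HashMap::new();
    // NIC (1) is a client of PHY (2); PHY (2) is a client of I2C (3).
    assert!(add_service_link(&mut edges, 1, 2).is_ok());
    assert!(add_service_link(&mut edges, 2, 3).is_ok());
    // I2C becoming a client of the NIC would close a cycle — rejected.
    assert!(add_service_link(&mut edges, 3, 1).is_err());
}
```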
10.5.3.6 The Registry
// DeviceRegistry uses BTreeMap, HashMap, Vec, and VecDeque — all heap-allocated.
// The registry is initialized at boot step 4g (Section 10.5.11), after the heap
// is available. It is never accessed before heap init.
pub struct DeviceRegistry {
/// All nodes, indexed by ID.
nodes: BTreeMap<DeviceNodeId, DeviceNode>,
/// Next node ID (monotonically increasing).
next_id: AtomicU64,
/// Index: bus identity --> node ID (fast device lookup).
bus_index: HashMap<BusLookupKey, DeviceNodeId>,
/// Index: property key+value --> set of node IDs (for matching).
property_index: HashMap<PropertyKey, Vec<DeviceNodeId>>,
/// Index: driver name --> set of node IDs (for crash recovery).
driver_index: HashMap<ArrayString<64>, Vec<DeviceNodeId>>,
/// Registered match rules from all known driver manifests.
match_rules: Vec<MatchRegistration>,
/// Pending hotplug events.
hotplug_queue: VecDeque<HotplugEvent>,
/// Power management state.
power_manager: PowerManager,
}
The registry lives entirely within UmkaOS Core. It is never exposed as a data structure to drivers.
10.5.3.7 DeviceResources
Each device node tracks its allocated hardware resources. This is the kernel-internal
counterpart of what Linux spreads across struct resource, struct pci_dev fields,
and struct msi_desc lists.
/// Hardware resources allocated to a device. Kernel-internal, NOT part of KABI.
pub struct DeviceResources {
/// PCI Base Address Register mappings (up to 6 BARs per PCI function).
pub bars: [Option<BarMapping>; 6],
/// Interrupt allocations (legacy, MSI, or MSI-X vectors).
pub irqs: Vec<IrqAllocation>,
/// Number of pages currently pinned for DMA by this device.
/// Page reclaim (Section 4.2) checks this count before attempting to compress
/// or swap a page — DMA-pinned pages are never eligible.
pub dma_pin_count: AtomicU32,
/// Maximum DMA-pinnable pages for this device (enforced by cgroup and
/// per-device limits). 0 = unlimited.
pub dma_pin_limit: u32,
/// MMIO regions mapped for this device (non-BAR, e.g., firmware tables).
pub mmio_regions: Vec<MmioRegion>,
/// Legacy I/O port ranges (x86 only, rare in modern hardware).
pub io_ports: Vec<IoPortRange>,
/// DMA address mask — how many bits of physical address the device can
/// generate. Determines bounce buffer requirements.
pub dma_mask: u64, // e.g., 0xFFFFFFFF for 32-bit DMA
pub coherent_dma_mask: u64, // For coherent (non-streaming) DMA
}
pub struct BarMapping {
pub bar_index: u8,
pub phys_addr: u64,
pub size: u64,
pub flags: BarFlags,
/// Kernel virtual address if mapped. None = not yet mapped (lazy).
pub mapped_vaddr: Option<u64>,
}
bitflags::bitflags! {
#[repr(transparent)]
pub struct BarFlags: u32 {
const MEMORY_64 = 1 << 0; // 64-bit MMIO (vs 32-bit)
const IO_PORT = 1 << 1; // I/O port space (legacy x86)
const PREFETCHABLE = 1 << 2; // Can be mapped write-combining
}
}
pub struct IrqAllocation {
pub irq_type: IrqType,
pub vector: u32, // Global IRQ vector number
pub cpu_affinity: Option<u32>, // Preferred CPU for this interrupt
}
#[repr(u32)]
pub enum IrqType {
LegacyPin = 0, // INTx (shared, level-triggered)
Msi = 1, // Message Signaled Interrupt (single vector)
MsiX = 2, // MSI-X (independent vectors, per-queue)
}
pub struct MmioRegion {
pub phys_addr: u64,
pub size: u64,
pub cacheable: bool,
}
pub struct IoPortRange {
pub base: u16,
pub size: u16,
}
DMA pin counting is a critical safety mechanism:
- Every dma_map_*() call through KABI increments the device's dma_pin_count.
- Every dma_unmap_*() call decrements it.
- The page reclaim path (Section 4.2) checks whether a page's owning device has active DMA pins before attempting compression or swap-out. Pages with active DMA mappings are unconditionally skipped — moving a page while a device is DMAing to it would cause silent data corruption.
- On driver crash recovery (Section 10.5.10), all DMA mappings for the crashed driver are forcibly invalidated (IOMMU entries torn down), and dma_pin_count is reset to zero. This is safe because the device has been reset.
- The dma_pin_limit provides defense-in-depth: a buggy or malicious driver cannot pin all of physical memory for DMA. The limit is enforced by the kernel, not the driver.
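Because many CPUs can issue dma_map_*() concurrently, the count/limit pair has to be updated atomically or two racing maps could jointly overshoot the limit. A sketch of one way to do this with a lock-free compare-and-swap loop (try_pin/unpin are hypothetical helper names, not the KABI entry points):

```rust
use std::sync::atomic::{AtomicU32, Ordering};

/// Sketch of dma_pin_count accounting against dma_pin_limit (0 = unlimited),
/// using fetch_update so concurrent dma_map_*() calls cannot overshoot.
fn try_pin(count: &AtomicU32, limit: u32, pages: u32) -> bool {
    count
        .fetch_update(Ordering::AcqRel, Ordering::Acquire, |cur| {
            let new = cur.checked_add(pages)?; // reject on overflow
            if limit != 0 && new > limit {
                None // would exceed the per-device limit — abort the update
            } else {
                Some(new)
            }
        })
        .is_ok()
}

fn unpin(count: &AtomicU32, pages: u32) {
    count.fetch_sub(pages, Ordering::AcqRel);
}

fn main() {
    let count = AtomicU32::new(0);
    assert!(try_pin(&count, 4, 3));  // 3 of 4 permitted pages pinned
    assert!(!try_pin(&count, 4, 2)); // would exceed the limit — rejected
    unpin(&count, 3);
    assert_eq!(count.load(Ordering::Acquire), 0);
}
```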
Resource lifecycle:
Resources are allocated during device discovery (BARs, IRQs) and driver initialization (DMA mappings, additional MMIO). On device removal or driver crash, all resources are reclaimed by the registry in reverse order: DMA mappings first (IOMMU teardown), then IRQs (free vectors), then BAR unmappings, then MMIO unmappings.
10.5.3.8 IOMMU Groups
IOMMU groups model hardware isolation boundaries. An IOMMU group is the smallest unit of device isolation that the hardware can enforce — all devices in a group share the same IOMMU domain (page table).
pub struct IommuGroupId(pub u32);
pub enum IommuDomainType {
/// Kernel DMA domain — device DMA goes through kernel-managed IOMMU
/// page tables. Default for all devices.
Kernel,
/// Identity-mapped DMA domain — IOMMU programs 1:1 physical-to-bus
/// mapping. Device DMA addresses equal physical addresses. Requires
/// explicit admin opt-in per device. See Section 10.5.3.8 "Per-Device DMA
/// Identity Mapping" for constraints and security implications.
Identity {
/// Upper bound of the 1:1 mapping (typically max_phys_addr).
phys_range_end: u64,
},
/// VM passthrough domain — entire group assigned to a VM. The VM's
/// IOMMU page tables control device DMA. Used for VFIO passthrough.
VmPassthrough {
vm_id: u64,
/// Second-level page table root (EPT/NPT base).
page_table_root: u64,
},
/// Userspace DMA domain — for Tier 2 drivers that need direct DMA
/// (e.g., DPDK-style networking). IOMMU restricts DMA to the
/// driver process's permitted regions.
UserspaceDma {
owning_pid: u64,
},
}
Why IOMMU groups matter:
- VFIO passthrough: When assigning a device to a VM (GPU, NIC, NVMe controller, FPGA, etc.), the kernel must assign the entire IOMMU group. If two devices share a group (e.g., GPU and its audio function on the same PCI slot, or NIC and a co-located function), both must be assigned together. The registry validates this constraint before permitting passthrough. See Section 21.5.2.4 for GPU-specific passthrough details.
- ACS (Access Control Services): PCIe ACS capabilities determine group boundaries. With ACS, each PCI function can be its own group. Without ACS, all devices behind a non-ACS bridge form a single group (because they could DMA to each other without going through the IOMMU).
- Isolation guarantee: The IOMMU group is the hardware's isolation primitive. The registry enforces that no device in a passthrough group remains in the kernel domain — leaving one behind would allow the VM to DMA to the kernel device's memory.
Group discovery:
During PCI enumeration (Section 10.5.11.3), the registry determines IOMMU groups by walking the PCI topology and checking ACS capability bits:
For each PCI device:
1. Walk upstream to the root port, checking ACS at each bridge.
2. If all bridges have ACS: device is in its own group.
3. If a bridge lacks ACS: all devices below that bridge share a group.
4. Peer-to-peer devices behind the same non-ACS switch: same group.
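The walk above can be modeled with a toy function: each device carries the chain of bridges between it and the root port, and the deepest bridge lacking ACS becomes the shared group anchor. A sketch under those assumptions (Bridge/group_key are illustrative names, not registry types):

```rust
/// Toy model of the group-discovery walk: a device's upstream path is a
/// list of bridges, each with or without full ACS. The first non-ACS
/// bridge (walking up from the device) anchors a shared group; a fully
/// ACS-capable path yields a singleton group keyed by the device itself.
#[derive(Clone)]
struct Bridge {
    acs: bool,
}

fn group_key(upstream_path: &[Bridge], device_id: u32) -> (bool, u32) {
    for (i, b) in upstream_path.iter().enumerate() {
        if !b.acs {
            // Shared group anchored at bridge i — everything below it
            // could DMA peer-to-peer without IOMMU translation.
            return (false, i as u32);
        }
    }
    (true, device_id) // full ACS all the way up: device is its own group
}

fn main() {
    let acs_path = vec![Bridge { acs: true }, Bridge { acs: true }];
    let no_acs_path = vec![Bridge { acs: false }, Bridge { acs: true }];
    // Full ACS: two devices on the same path get distinct groups.
    assert_ne!(group_key(&acs_path, 1), group_key(&acs_path, 2));
    // Shared non-ACS bridge: both devices collapse into one group.
    assert_eq!(group_key(&no_acs_path, 1), group_key(&no_acs_path, 2));
}
```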
Passthrough assignment flow:
1. Admin requests device passthrough for VM (via /dev/vfio/N or umka-kvm API)
2. Registry looks up device's DeviceNode → iommu_group
3. Registry checks: all devices in group unbound or assignable?
4. If yes: unbind kernel drivers, switch group to VmPassthrough domain
5. Program IOMMU with VM's second-level page tables
6. VM's guest OS sees the device and loads its own driver
7. On VM teardown: switch back to Kernel domain, rebind kernel drivers
The registry prevents partial group assignment: if device A and device B share IOMMU
group 7, and only A is requested for passthrough, the request is rejected with
-EBUSY unless B is also unbound. This prevents a safety violation where the VM
could DMA to B's kernel-managed memory.
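Step 3 of the assignment flow reduces to a simple predicate over the group's members: every device other than the one requested must be unbound. A sketch of that validation (the Member type and EBUSY constant are illustrative, not the real KABI):

```rust
/// Sketch of the whole-group validation: passthrough of one device is
/// permitted only when every sibling in its IOMMU group is unbound.
const EBUSY: i32 = 16; // illustrative errno value

struct Member {
    bound_driver: Option<&'static str>,
}

fn check_group_passthrough(group: &[Member], requested: usize) -> Result<(), i32> {
    for (i, m) in group.iter().enumerate() {
        if i != requested && m.bound_driver.is_some() {
            return Err(-EBUSY); // a sibling is still kernel-bound — reject
        }
    }
    Ok(())
}

fn main() {
    // GPU (index 0) requested for passthrough; its audio function (index 1)
    // is still bound to a kernel driver — the request must fail.
    let group = [
        Member { bound_driver: None },
        Member { bound_driver: Some("hda") },
    ];
    assert_eq!(check_group_passthrough(&group, 0), Err(-16));

    // After unbinding the audio function, passthrough is permitted.
    let group = [Member { bound_driver: None }, Member { bound_driver: None }];
    assert!(check_group_passthrough(&group, 0).is_ok());
}
```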
IOMMU Group Assignment Algorithm (device discovery):
The following algorithm runs during device enumeration (Section 2.1, boot hardware discovery) to assign each device to an isolation domain:
For each PCIe device discovered during enumeration:
a. Query the IOMMU group ID for the device from the IOMMU driver.
(IOMMU groups are defined by hardware — devices sharing a stream ID
or lacking ACS isolation are in the same group.)
b. If the group ID is new (first device in this group):
- Allocate a new isolation domain for this group.
- Register: iommu_group_domains[group_id] = new_domain.
c. If the group ID already has a domain assignment:
- Assign this device to the existing domain.
- Log: "Device [bus:dev.fn] shares IOMMU group [id] with [other devices]
— assigned to same isolation domain [domain_id]."
ACS (Access Control Services) check:
PCIe ACS must be enabled on root ports and upstream ports/switches to allow
per-function IOMMU groups. If ACS is absent on an upstream bridge:
- All devices downstream of that bridge share one IOMMU group.
- They are assigned to a shared isolation domain.
- Log a warning: "PCIe switch at [bus:dev.fn] lacks ACS — [N] devices share
one IOMMU group. Per-device isolation not possible."
- UmkaOS does NOT disable the device — it runs with reduced isolation (shared
domain) and logs the degraded state to FMA (Section 19.1).
Singleton groups (preferred):
When ACS is present and hardware supports per-function translation,
each device gets its own IOMMU group and its own isolation domain.
This is the default and preferred configuration for Tier 1 drivers.
Driver cgroup co-isolation enforcement:
UmkaOS enforces that all devices in an IOMMU group belong to the same driver
cgroup. If a user attempts to assign two devices from the same IOMMU group
to different drivers, the second assignment fails with -EACCES and the error
message: "Device [bus:dev.fn] shares IOMMU group [id] with device [other
bus:dev.fn] — both must be assigned to the same driver."
IOMMU Group Formation: pci_device_group Algorithm
The pseudo-code above describes the consumer side of IOMMU group assignment —
how the device registry attaches devices to existing domains. The following specifies
the formation side: how pci_device_group determines which IOMMU group a newly
enumerated PCI device belongs to. This algorithm matches the Linux implementation
(drivers/iommu/iommu.c, intel/iommu.c, amd/iommu.c, arm/arm-smmu-v3/arm-smmu-v3.c) and is
the authoritative procedure for all three major IOMMU hardware families.
/// ACS flags that together guarantee DMA request isolation between PCIe peers.
/// Without all four bits set on an upstream bridge, devices behind that bridge
/// can issue peer-to-peer DMA that bypasses IOMMU translation entirely.
///
/// - SV (Source Validation): bridge verifies the requester ID is valid
/// - RR (Request Redirection): DMA requests are redirected through the IOMMU
/// - CR (Completion Redirection): completions return through the IOMMU
/// - UF (Upstream Forwarding): upstream traffic is forwarded to the RC
const REQ_ACS_FLAGS: AcsFlags =
AcsFlags::SV | AcsFlags::RR | AcsFlags::CR | AcsFlags::UF;
/// Determine the IOMMU group for a PCI device.
///
/// Called during driver registration and device hotplug. Returns the
/// `IommuGroup` to which this device must belong. A device can only
/// be assigned an `IommuDomain` that covers its entire group — partial
/// assignment is a hardware violation and is rejected by the registry.
///
/// # Algorithm
///
/// The four steps below are executed in order. The first step that
/// produces an existing group terminates the search and returns that group.
/// If no group is found, a fresh group is allocated in step 4.
pub fn pci_device_group(
dev: &PciDevice,
iommu: &IommuInstance,
) -> Arc<IommuGroup> {
// Step 1: DMA alias resolution.
//
// Conventional PCI devices behind a PCIe-to-PCI bridge have their
// requester ID rewritten to the bridge's BDF by the bridge — the IOMMU
// sees the bridge's requester ID, not the device's own BDF. Such devices
// are called "DMA aliases" of the bridge. All devices sharing the same
// alias must be in the same IOMMU group because the IOMMU cannot
// distinguish their DMA transactions.
//
// `resolve_dma_aliases` walks the alias set (via the PCIe alias capability
// and conventional PCI bridge topology) and returns the canonical anchor BDF.
let anchor = resolve_dma_aliases(dev);
if let Some(existing) = iommu.group_for_bdf(anchor.bdf()) {
return existing;
}
// Step 2: ACS boundary walk.
//
// Walk upstream bridges from the device toward the root complex. At each
// bridge, check whether all four REQ_ACS_FLAGS bits are set in the bridge's
// ACS capability register. The first bridge that lacks full ACS is the
// isolation failure point: it cannot prevent peer devices from issuing
// DMA to each other without going through the IOMMU. Move the group
// anchor up to that bridge — all devices below it must share one group.
//
// Stop walking when we reach a bridge that has all four ACS bits set;
// that bridge IS the isolation boundary. Devices on opposite sides of a
// fully ACS-capable bridge can have separate IOMMU groups.
let anchor = walk_acs_boundary(anchor, REQ_ACS_FLAGS);
if let Some(existing) = iommu.group_for_bdf(anchor.bdf()) {
return existing;
}
// Step 3: Multifunction slot grouping.
//
// PCI multifunction devices (multiple functions on the same device number,
// e.g., device 0, functions 0..7) can DMA-alias each other when ACS is
// absent. If the anchor is a multifunction device without full ACS, all
// sibling functions on the same slot must share an IOMMU group.
if anchor.is_multifunction() && !anchor.has_acs(REQ_ACS_FLAGS) {
if let Some(existing) = find_sibling_function_group(&anchor, iommu) {
return existing;
}
}
// Step 4: Allocate a new group.
//
// No existing group was found via aliases, ACS failures, or multifunction
// sharing. This device is hardware-isolated from all others and gets its
// own IOMMU group (the preferred configuration for Tier 1 isolation).
IommuGroup::new(iommu)
}
IommuGroup struct (canonical definition; replaces the forward declaration in Section 10.5.3.8):
/// Maximum PCIe devices in one IOMMU group. ACS-disabled PCIe switches can group
/// entire bus fabrics; 128 is a safe upper bound for realistic PCIe topologies.
pub const IOMMU_GROUP_MAX_DEVICES: usize = 128;
pub struct IommuGroup {
/// Unique group ID assigned at creation. Never reused after group destruction.
pub id: u32,
/// IOMMU hardware instance that manages this group.
pub iommu: Arc<IommuInstance>,
/// PCIe devices sharing this IOMMU domain.
/// Fixed capacity: avoids heap allocation during bus enumeration and device hotplug.
pub devices: ArrayVec<Arc<PciDevice>, IOMMU_GROUP_MAX_DEVICES>,
/// The currently active IOMMU domain. One domain covers the entire group —
/// it is not possible to assign different domains to devices in the same group.
pub domain: RwLock<Option<Arc<IommuDomain>>>,
}
Firmware table parsing (determines IommuInstance → device scope during early boot,
before pci_device_group is called per-device):
- Intel VT-d (ACPI DMAR table): Each DRHD (DMA Remapping Hardware Definition) record describes one VT-d engine and its device scope entries (BDF ranges it manages). Devices not covered by any explicit DRHD scope fall under the catch-all DRHD with the INCLUDE_PCI_ALL flag. RMRR (Reserved Memory Region Reporting) records list physical address ranges that must be identity-mapped in every domain — typically BIOS-owned USB buffers and legacy VGA regions. UmkaOS programs RMRR regions as immutable identity entries in every new IOMMU domain before handing it to a driver.
- AMD-Vi (ACPI IVRS table): IVHD (I/O Virtualization Hardware Definition) records list each AMD IOMMU and the BDF ranges it controls. UmkaOS builds a flat lookup table amd_iommu_dev_table[BDF] during IVRS parsing, giving O(1) device-to-IOMMU resolution at pci_device_group() call time. IVMD (I/O Virtualization Memory Definition) records specify unity-mapped regions (analogous to Intel RMRR).
- ARM SMMU v3 (ACPI IORT table or Device Tree): Stream IDs (SIDs) are assigned by firmware and recorded in IORT iommu-map table entries or DT iommus/iommu-map properties. Each non-PCI device (platform device, ACPI device) gets its own IOMMU group unconditionally — non-PCI devices cannot alias each other. PCI devices behind an SMMU use pci_device_group() as above, with the SMMU providing the IommuInstance.
UmkaOS driver isolation requirement at driver_register():
A Tier 2 driver receives an IommuDomain that covers its entire IOMMU group. The
registration sequence is:
1. Parse the firmware table (DMAR/IVRS/IORT) to find which IommuInstance owns the device, then call pci_device_group() to determine the device's IommuGroup.
2. If the group already has an active IommuDomain: attach this device to that domain (the entire group is now under the driver's control — the registry verifies that all other group members are either unbound or owned by the same driver process).
3. If the group has no active domain: allocate a new IommuDomain, program the IOMMU hardware page tables (initially empty — no DMA permitted), then attach all devices in the group to the new domain.
4. Grant the driver process DMA access via umka_driver_dma_alloc (Section 10.6); each allocation adds an IOVA→PA entry to the domain's page tables and the IOMMU issues an IOTLB invalidation.
5. Any device in the group that issues a DMA transaction to an address outside its domain's IOVA space triggers an IOMMU fault → driver crash recovery path (Section 10.8).
10.5.3.9 IOMMU Implementation Complexity
IOMMU management is one of the most complex subsystems in any OS kernel, and this complexity should not be understated. The following areas are known to be difficult and are called out explicitly as high-effort implementation items:
Nested/two-level translation (SR-IOV + VFIO) — when a VM uses VFIO passthrough with SR-IOV virtual functions, the IOMMU must perform two-level address translation: guest virtual → guest physical (first level, programmed by the guest's IOMMU driver) then guest physical → host physical (second level, programmed by the host). Intel VT-d calls this "scalable mode with first-level and second-level page tables"; AMD-Vi calls it "guest page tables with nested paging." The two-level walk doubles TLB pressure and introduces a multiplicative page table depth (4-level × 4-level = 16 potential memory accesses per translation miss). IOTLB sizing and invalidation granularity are critical performance levers.
Performance bottlenecks — known IOMMU performance traps:
- Map/unmap storm: high-throughput I/O paths (NVMe at millions of IOPS, 100GbE line-rate) can generate millions of IOMMU map/unmap operations per second. Each map/unmap involves IOTLB invalidation. UmkaOS mitigates this with: (1) persistent DMA mappings for ring buffers (map once at driver init, never unmap), (2) batched invalidation (accumulate invalidations, flush once per batch), (3) per-CPU IOMMU invalidation queues to avoid contention.
- IOTLB capacity: hardware IOTLB entries are scarce (~128-512 entries on typical Intel VT-d). Under heavy I/O with many DMA mappings, IOTLB misses add ~100-500ns per translation. Large pages (2MB, 1GB) in IOMMU page tables dramatically reduce IOTLB pressure — UmkaOS's DMA mapping interface prefers large-page-aligned allocations when possible.
- Invalidation latency: IOTLB invalidation on Intel VT-d is not instantaneous. Drain-all invalidation can take ~1-10μs. Page-selective invalidation is faster but not supported on all hardware. UmkaOS checks hardware capability registers and uses the finest granularity available.
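The batched-invalidation mitigation can be sketched as a small queue: unmaps accumulate their IOVA ranges, and one hardware flush drains the whole batch. A toy model under illustrative assumptions (queue type, batch size, and method names are invented for the sketch):

```rust
/// Toy model of batched IOTLB invalidation: unmap operations queue their
/// (iova, len) ranges; a single flush drains the batch, amortizing the
/// expensive hardware invalidation over many unmaps.
struct InvalidationQueue {
    pending: Vec<(u64, u64)>, // queued (iova, len) ranges
    flushes: u64,             // how many hardware invalidations were issued
}

impl InvalidationQueue {
    const BATCH: usize = 32; // illustrative flush threshold

    fn new() -> Self {
        Self { pending: Vec::new(), flushes: 0 }
    }

    fn queue_unmap(&mut self, iova: u64, len: u64) {
        self.pending.push((iova, len));
        if self.pending.len() >= Self::BATCH {
            self.flush();
        }
    }

    fn flush(&mut self) {
        if self.pending.is_empty() {
            return;
        }
        // Real code would issue one ranged (or drain-all) IOTLB
        // invalidation covering every queued range here.
        self.pending.clear();
        self.flushes += 1;
    }
}

fn main() {
    let mut q = InvalidationQueue::new();
    for i in 0..64u64 {
        q.queue_unmap(i * 4096, 4096);
    }
    q.flush(); // drain anything left (nothing here — batches were exact)
    assert_eq!(q.flushes, 2); // 64 unmaps cost only 2 hardware flushes
}
```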
ACS (Access Control Services) — PCIe ACS is required for proper IOMMU group
isolation. Without ACS on a PCIe switch, all devices behind that switch land in the
same IOMMU group (defeating per-device isolation). Many consumer motherboards lack ACS
on the root port or PCIe switch, causing all devices to share one IOMMU group. UmkaOS
detects this at boot and logs a warning. The pcie_acs_override kernel parameter
(Linux compatibility) allows overriding this for testing, but with an explicit security
warning.
Errata — IOMMU hardware has errata. Intel VT-d errata include broken interrupt remapping on certain steppings, incorrect IOTLB invalidation scope, and non-compliant default domain behavior. UmkaOS's errata framework (Section 2.1.4) includes IOMMU errata alongside CPU errata — detected at boot, with workarounds applied automatically.
10.5.3.10 Per-Device DMA Identity Mapping (Opt-In Escape Hatch)
UmkaOS's default IOMMU policy is translated DMA for all devices — every DMA transaction passes through IOMMU page tables. This is non-negotiable for the driver isolation model: crash recovery, DMA fencing, and containment all depend on the kernel's ability to revoke DMA access by reprogramming IOMMU entries.
However, certain scenarios require identity-mapped DMA (device DMA addresses = physical addresses, IOMMU programmed as 1:1 pass-through for that device's domain):
- Latency-critical bare-metal I/O. High-frequency trading NICs, ultra-low-latency NVMe, and RDMA HCAs where the ~100-500ns IOTLB miss penalty on unmapped addresses is unacceptable. Persistent DMA mappings (Section 10.5.3.7) mitigate this for ring buffers, but scatter-gather DMA with dynamic buffer addresses still pays the IOTLB miss cost.
- Broken IOMMU interactions. Devices with firmware or silicon bugs that produce incorrect DMA addresses under translation (e.g., devices that hardcode physical addresses in firmware descriptors, or devices that ignore bus addresses returned by the OS).
- Debug and development. Tracing raw DMA transactions with hardware analyzers is simpler when bus addresses equal physical addresses.
/// Per-device DMA translation policy. Set via admin sysfs or boot parameter.
/// Default is Translated for all devices.
#[repr(u32)]
pub enum DeviceDmaPolicy {
/// All DMA goes through IOMMU page tables (default). Full isolation.
Translated = 0,
/// IOMMU programmed with 1:1 identity mapping for this device's domain.
/// Device DMA addresses equal physical addresses. IOMMU is still active
/// (interrupt remapping, fault reporting) but provides no DMA containment.
Identity = 1,
}
Constraints and trade-offs:
| Property | Translated (default) | Identity |
|---|---|---|
| DMA containment | Full — device can only reach explicitly mapped regions | None — device can DMA to any physical address in its identity window |
| Crash recovery | IOMMU entries revoked → in-flight DMA faults | Identity mapping cannot be selectively revoked without full device reset |
| Driver tier | Any tier | Tier 1 only (kernel-space drivers with CAP_DMA_IDENTITY) |
| IOTLB miss cost | ~100-500ns per miss | Zero (1:1 mapping fits in a single large-page IOTLB entry) |
| Interrupt remapping | Active | Active (identity mapping does not affect interrupt remapping) |
| IOMMU group rule | Per-device | Entire IOMMU group must use Identity if any member does |
Identity mapping scope: The kernel programs a 1:1 IOMMU mapping covering the
physical address range [0, max_phys_addr) in the device's IOMMU domain, using the
largest available page size (typically 1GB pages). This is a single IOTLB entry per
GB of physical memory — effectively zero IOTLB miss overhead. The IOMMU remains
active for interrupt remapping and fault reporting; only DMA address translation is
bypassed.
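The "single IOTLB entry per GB" claim is simple arithmetic: a 1:1 window over [0, max_phys_addr) mapped with 1GB pages needs ceil(size / 1GB) entries. A sketch of that calculation (identity_entries is a hypothetical helper name):

```rust
/// Arithmetic behind the identity-mapping scope: the window
/// [0, max_phys_addr) mapped with 1GB pages needs ceil(size / 1GB)
/// entries — a few hundred large-page entries even on a 512GB machine,
/// versus millions of 4KB mappings under fine-grained translation.
const GB: u64 = 1 << 30;

fn identity_entries(max_phys_addr: u64) -> u64 {
    max_phys_addr.div_ceil(GB) // round up any partial last gigabyte
}

fn main() {
    assert_eq!(identity_entries(64 * GB), 64);
    assert_eq!(identity_entries(64 * GB + 1), 65); // partial GB rounds up
    assert_eq!(identity_entries(512 * GB), 512);
}
```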
Security implications: A device in Identity mode can DMA to any physical address. A compromised or buggy driver controlling such a device can corrupt arbitrary kernel memory. This is equivalent to running without an IOMMU for that device. The kernel mitigates the blast radius:
- Explicit admin opt-in required. Identity mode is set via:
  - Boot parameter: umka.dma_identity=0000:03:00.0 (PCI BDF notation)
  - Sysfs at runtime: /sys/bus/pci/devices/0000:03:00.0/dma_policy (requires CAP_SYS_ADMIN + CAP_DMA_IDENTITY)
  - There is no global iommu.passthrough=1 equivalent. Every device must be individually opted in. This prevents accidentally disabling isolation for all devices.
- Tier 1 restriction. Only Tier 1 (in-kernel) drivers may use Identity mode. Tier 2 (userspace) drivers are denied — a compromised userspace process with identity-mapped DMA would be a full kernel compromise.
- Audit logging. Every Identity mode activation is logged to the security audit subsystem (Section 19.2.9) with the device BDF, requesting process, and admin credential.
- IOMMU group enforcement. If device A is set to Identity and shares an IOMMU group with device B, device B is also switched to Identity (since devices in the same IOMMU group can perform peer-to-peer DMA without IOMMU translation). The kernel logs a warning identifying all affected devices.
- No crash recovery guarantee. The kernel marks devices in Identity mode with a NO_DMA_FENCE flag. On driver crash, the kernel performs a Function Level Reset (FLR) or secondary bus reset instead of relying on IOMMU revocation — this is slower (10-100ms vs microseconds) but is the only safe option without DMA fencing.
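The activation checks above can be condensed into a single policy gate. The following is a minimal host-side sketch; `DriverTier`, `IdentityError`, and `check_identity_optin` are illustrative names, not the real UmkaOS kernel API:

```rust
/// Hypothetical sketch of the per-device Identity-mode policy gate.
/// Names are illustrative, not the real kernel types.
#[derive(Clone, Copy, PartialEq, Debug)]
pub enum DriverTier {
    Tier1, // in-kernel, MPK-isolated
    Tier2, // userspace
}

#[derive(PartialEq, Debug)]
pub enum IdentityError {
    NotTier1,          // Tier 2 drivers are always denied
    MissingCapability, // runtime opt-in needs both capabilities
}

pub fn check_identity_optin(
    tier: DriverTier,
    has_cap_sys_admin: bool,
    has_cap_dma_identity: bool,
) -> Result<(), IdentityError> {
    // Tier 2 (userspace) drivers are denied: identity-mapped DMA from a
    // compromised userspace process would be a full kernel compromise.
    if tier != DriverTier::Tier1 {
        return Err(IdentityError::NotTier1);
    }
    // Runtime (sysfs) opt-in requires CAP_SYS_ADMIN + CAP_DMA_IDENTITY.
    if !(has_cap_sys_admin && has_cap_dma_identity) {
        return Err(IdentityError::MissingCapability);
    }
    Ok(())
}
```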
Implementation in IommuDomainType:
pub enum IommuDomainType {
/// Kernel DMA domain with full IOMMU translation.
Kernel,
/// Identity-mapped DMA domain. IOMMU programs 1:1 mapping.
/// Only for Tier 1 drivers with explicit admin opt-in.
Identity {
/// Physical address range covered by the 1:1 mapping.
phys_range_end: u64,
},
/// VM passthrough domain — VM's page tables control DMA.
VmPassthrough {
vm_id: u64,
page_table_root: u64,
},
/// Userspace DMA domain — for Tier 2 drivers with restricted DMA.
UserspaceDma {
owning_pid: u64,
},
}
Global identity mode on weak-isolation architectures:
On most architectures, per-device identity mapping (above) is the correct granularity: even if one device needs passthrough, the rest should remain IOMMU-translated. However, on architectures where Tier 1 CPU-side isolation is already absent or equivalent to Tier 0, the IOMMU is the only remaining isolation boundary — and if the admin has already accepted that Tier 1 drivers share the kernel address space without hardware memory protection, the IOMMU overhead protects only against rogue DMA from device firmware, not from the driver code itself.
For these cases, UmkaOS provides a global identity mode restricted to platforms where CPU-side Tier 1 isolation is weak or absent:
/// System-wide DMA translation policy. Boot parameter only —
/// cannot be changed at runtime.
///
/// Boot parameter: umka.dma_default_policy={translated,identity}
/// Default: translated (always)
#[repr(u32)]
pub enum SystemDmaPolicy {
/// All devices use IOMMU translation (default on all architectures).
Translated = 0,
/// All Tier 1 devices default to identity-mapped DMA. Tier 2
/// (userspace) devices always remain Translated regardless of this
/// setting. Individual devices can still be overridden to Translated
/// via sysfs. Requires umka.isolation=performance or equivalent
/// weak-isolation architecture.
IdentityDefault = 1,
}
Preconditions for umka.dma_default_policy=identity:
The kernel refuses this boot parameter unless at least one of:
1. umka.isolation=performance is also set (admin has explicitly opted out of
Tier 1 CPU-side isolation — drivers promoted to Tier 0).
2. The architecture has no fast isolation mechanism and Tier 1 uses page-table
switching with overhead equivalent to Tier 2 (currently: PPC64LE POWER8, AArch64
mainstream with I/O-heavy workloads). On RISC-V, Tier 1 is not available and all
Tier 1 drivers already run as Tier 0; this condition does not apply.
If neither condition is met, the kernel prints a boot warning and ignores the parameter:
umka: dma_default_policy=identity rejected: CPU-side Tier 1 isolation is active.
Use umka.isolation=performance or per-device umka.dma_identity=<BDF> instead.
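The precondition check can be sketched as a small boot-time gate. Function and parameter names here are illustrative; the real kernel would evaluate this during early boot parameter parsing:

```rust
/// Sketch of the boot-time gate for umka.dma_default_policy=identity.
/// Names are illustrative; the error string mirrors the boot warning above.
pub fn identity_default_permitted(
    isolation_performance: bool, // umka.isolation=performance was also set
    tier1_isolation_weak: bool,  // arch has no fast Tier 1 isolation mechanism
) -> Result<(), &'static str> {
    if isolation_performance || tier1_isolation_weak {
        Ok(())
    } else {
        // Neither precondition holds: the parameter is rejected and ignored.
        Err("dma_default_policy=identity rejected: CPU-side Tier 1 isolation is active")
    }
}
```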
What global identity mode does NOT affect:
- Tier 2 (userspace) drivers — always IOMMU-translated, regardless of policy.
A compromised userspace process with identity-mapped DMA would be a full kernel
compromise.
- VM passthrough (VmPassthrough) — VM IOMMU domains are unaffected; the
hypervisor's second-level page tables remain in control.
- Interrupt remapping — remains active on all devices. Identity mode disables
DMA address translation only, not interrupt remapping.
- Per-device overrides — individual devices can be set to Translated via sysfs
even when the global default is Identity. This allows an admin to protect specific
devices (e.g., an untrusted USB controller) while running most devices in identity
mode.
Rationale: On RISC-V 64, where Tier 1 isolation is not available and all Tier 1
drivers run as Tier 0 (sharing the kernel address space in Ring 0 with full memory
access), the IOMMU is protecting against a strictly weaker threat (device firmware DMA)
than the one already accepted (driver code CPU access). Paying ~100-500ns per IOTLB
miss on every DMA operation to defend against device firmware — while the driver itself
has unrestricted access to all of kernel memory — is a questionable trade-off for
performance-sensitive workloads. The same logic applies when isolation=performance
explicitly promotes all drivers to Tier 0 on any architecture.
Why this is not Linux's iommu.passthrough=1: Linux's global passthrough exists
for legacy compatibility — many Linux drivers assume physical addresses equal bus
addresses, and passthrough preserved that assumption. UmkaOS's global identity mode
exists for a different reason: to avoid paying IOMMU overhead on platforms where the
security benefit is already negated by the absence of CPU-side isolation. The
precondition check ensures it cannot be enabled on platforms where IOMMU translation
is the critical isolation boundary (x86-64 with MPK, AArch64 with POE, etc.).
10.5.4 Device Matching
10.5.4.1 Match Rules
Drivers declare what hardware they support through match rules embedded in the driver
binary. Match rules are stored in a dedicated ELF section (.kabi_match) and read by the
kernel loader before init() is called.
/// A single match rule. Drivers can declare multiple rules — any match
/// triggers binding.
#[repr(C)]
pub struct MatchRule {
pub rule_size: u32, // Forward compat
pub match_type: MatchType,
pub data: MatchData, // 128-byte union, interpreted per match_type
}
#[repr(u32)]
pub enum MatchType {
PciId = 0, // Match by PCI vendor/device ID (with wildcards)
PciClass = 1, // Match by PCI class code (with mask)
UsbId = 2, // Match by USB vendor/product ID
UsbClass = 3, // Match by USB class/subclass/protocol
VirtIoType = 4, // Match by VirtIO device type
Compatible = 5, // Match by "compatible" string (DT/ACPI)
Property = 6, // Match by arbitrary property key/value
}
/// Match data union — interpreted per MatchType variant.
/// 128 bytes to accommodate the largest variant (Compatible: 128-byte string).
#[repr(C)]
pub union MatchData {
pub pci_id: PciMatchData, // MatchType::PciId or PciClass
pub usb_id: UsbMatchData, // MatchType::UsbId
pub usb_class: UsbClassMatch, // MatchType::UsbClass
pub virtio: VirtIoMatchData, // MatchType::VirtIoType
pub compatible: [u8; 128], // MatchType::Compatible (NUL-terminated)
pub property: PropertyMatch, // MatchType::Property
pub _raw: [u8; 128], // Pad to 128 bytes
}
#[repr(C)]
pub struct UsbMatchData {
pub vendor_id: u16, // 0xFFFF = wildcard
pub product_id: u16, // 0xFFFF = wildcard
}
#[repr(C)]
pub struct UsbClassMatch {
pub class: u8, // USB class code
pub subclass: u8, // 0xFF = wildcard
pub protocol: u8, // 0xFF = wildcard
}
#[repr(C)]
pub struct VirtIoMatchData {
pub device_type: u32, // VirtIO device type ID
}
#[repr(C)]
pub struct PropertyMatch {
pub key: [u8; 64], // Property key (NUL-terminated)
pub value: [u8; 64], // Property value (NUL-terminated)
}
Example — PCI ID match:
#[repr(C)]
pub struct PciMatchData {
pub vendor_id: u16, // 0xFFFF = wildcard
pub device_id: u16, // 0xFFFF = wildcard
pub subsystem_vendor: u16, // 0xFFFF = wildcard
pub subsystem_device: u16, // 0xFFFF = wildcard
pub class_code: u32, // Class code value
pub class_mask: u32, // Bits to compare (0 = ignore class)
}
A match table header in the ELF binary:
#[repr(C)]
pub struct MatchTableHeader {
pub magic: u32, // 0x4D415443 ("MATC")
pub header_size: u32,
pub rule_count: u32,
pub rule_size: u32, // sizeof(MatchRule)
// Followed by `rule_count` MatchRule structs
}
10.5.4.2 Match Engine
The kernel runs a simple priority-ordered match algorithm:
For each DeviceNode in Discovered state:
1. Collect the node's properties and bus identity
2. For each registered driver (sorted by priority):
a. For each MatchRule in that driver's match table:
- Evaluate the rule against the node's properties
- If match: record (driver, node, specificity) as a candidate
3. Select the candidate with highest specificity
4. If found: begin driver loading for this node
5. If no match: node stays in Discovered state (deferred probe)
Match specificity ranking (highest first):
| Rank | Match Type | Score | Example |
|---|---|---|---|
| 1 | Exact vendor + device + subsystem | 100 | This exact card from this exact OEM |
| 2 | Exact vendor + device ID | 80 | Any board with this chip |
| 3 | Full class code match | 60 | Any NVMe controller (class 01:08:02) |
| 4 | Partial class code (masked) | 40 | Any mass storage controller (class 01:xx:xx) |
| 5 | Compatible string (position-weighted) | 20+ | DT/ACPI compatible, first entry scores higher |
| 6 | Generic property match | 10 | Fallback / catchall |
Combination rule: When a single driver has multiple match rules and more than one matches a device, the driver's effective specificity is the maximum of all matching rule scores (not a sum). This ensures an exact vendor/device ID match (score 100) always dominates a class-code match (score 60) from the same driver, reflecting "most specific match wins" semantics.
When two drivers match with equal specificity, the driver with higher match_priority
(declared in its manifest) wins. If still tied, first-registered wins.
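The combination and tie-break rules can be sketched as follows (the `Candidate` record and `select` function are hypothetical names for internal match-engine state):

```rust
/// Hypothetical candidate record produced by step 2 of the match algorithm.
pub struct Candidate {
    pub driver: &'static str,
    pub rule_scores: Vec<u32>, // specificity score of each matching rule
    pub match_priority: u32,   // from the driver's manifest
}

/// Combination rule: max of all matching rule scores, not their sum.
pub fn effective_specificity(c: &Candidate) -> u32 {
    c.rule_scores.iter().copied().max().unwrap_or(0)
}

/// Selection: highest specificity, then match_priority, then first-registered.
pub fn select<'a>(candidates: &'a [Candidate]) -> Option<&'a Candidate> {
    let mut best: Option<&Candidate> = None;
    for c in candidates {
        let key = (effective_specificity(c), c.match_priority);
        match best {
            // `>=` keeps the earlier (first-registered) candidate on full ties.
            Some(b) if (effective_specificity(b), b.match_priority) >= key => {}
            _ => best = Some(c),
        }
    }
    best
}
```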
10.5.4.3 Deferred Matching
Some devices cannot be matched immediately — their driver may not yet be loaded (e.g., initramfs not yet mounted, or driver installed later by package manager).
- Devices with no match stay in Discovered state indefinitely.
- When a new driver is registered (loaded from initramfs, installed at runtime), all Discovered devices are re-evaluated against the new match rules.
- A KABI method registry_rescan() triggers manual re-evaluation.
This is analogous to Linux's deferred probe mechanism, but simpler because the matching is centralized rather than spread across per-bus probe functions.
10.5.4.4 DriverManifest Extensions
The DriverManifest (defined in umka-driver-sdk/src/capability.rs) gains match-related
fields (appended per ABI rules):
// Appended to DriverManifest
pub match_rule_count: u32, // Number of match rules in .kabi_match section
pub is_bus_driver: u32, // 1 = this driver discovers child devices
pub match_priority: u32, // Higher = preferred when specificity ties
pub _pad: u32,
10.5.4.5 Module Loader Queue
When the match engine selects a driver for a device (step 4 in Section 10.5.4.2),
it submits a DriverLoadRequest to the module loader work queue. The module loader
runs as a set of kernel worker threads and serializes concurrent loading, signature
verification, and domain allocation.
LoadReason is the shared type defined in Section 11.1.9.6
(11-kabi.md). Variants used by the device driver loader: HotPlug (device enumeration),
Boot (initramfs/cmdline), Dependency, CrashRecovery, UserRequest.
/// A device-driver-specific load request. More detailed than the KABI-level
/// `ModuleLoadRequest` (Section 11.1.9.6): includes the trigger device, result
/// type `DriverHandle`, priority override, and timeout. Uses the shared
/// `LoadReason` enum from Section 11.1.9.6.
pub struct DriverLoadRequest {
/// Absolute path to the `.kabi` manifest file in the umkafs namespace,
/// e.g., `/System/Kernel/drivers/nvme/nvme.kabi`.
pub manifest_path: Box<str>,
/// Reason for this load request (determines scheduling priority).
pub reason: LoadReason,
/// Device that triggered this load when `reason == LoadReason::HotPlug`.
/// `None` for dependency loads, user requests, and boot-time loads.
pub trigger_device: Option<DeviceHandle>,
/// Completion channel: the loader sends `Ok(DriverHandle)` on success
/// or `Err(KernelError)` on failure (bad signature, manifest error,
/// domain allocation failure, driver `init()` returning an error, etc.).
pub result_tx: oneshot::Sender<Result<DriverHandle, KernelError>>,
/// Priority override. `0` = derive from `LoadReason` (default).
/// `1`–`255` = explicit override (higher = higher priority).
pub priority_override: u8,
/// Load timeout in milliseconds. `0` = system default (30 000 ms).
/// The loader cancels and returns `Err(KernelError::Timeout)` if the
/// driver does not complete `init()` within this window.
pub timeout_ms: u32,
}
/// Priority-ordered work queue for driver module loads.
///
/// Bounded capacity prevents memory exhaustion from a flood of hotplug events
/// (e.g., enumerating a USB hub with 127 devices simultaneously).
/// Default capacity: 256 pending requests.
pub struct ModuleLoaderQueue {
/// Pending load requests ordered by effective priority (highest first).
queue: SpinLock<BinaryHeap<PrioritizedLoadRequest>>,
/// Limits the number of concurrently executing module loads.
/// Default: 4 concurrent loads (one per loader worker thread).
concurrency: Semaphore,
/// Total requests enqueued since boot.
pub total_enqueued: AtomicU64,
/// Total requests that completed successfully.
pub total_loaded: AtomicU64,
/// Total requests that failed (signature rejection, manifest error,
/// driver init failure, timeout, or domain allocation failure).
pub total_failed: AtomicU64,
}
/// Internal wrapper that adds an effective priority to a `DriverLoadRequest`
/// for ordering in the `BinaryHeap` inside `ModuleLoaderQueue`.
struct PrioritizedLoadRequest {
/// Effective priority: `priority_override` if non-zero, else derived from
/// `reason` (HotPlug/Boot = 200, CrashRecovery = 180, Dependency = 150,
/// UserRequest = 100).
pub priority: u8,
pub request: DriverLoadRequest,
}
impl PartialOrd for PrioritizedLoadRequest {
fn partial_cmp(&self, other: &Self) -> Option<core::cmp::Ordering> {
Some(self.cmp(other))
}
}
impl Ord for PrioritizedLoadRequest {
fn cmp(&self, other: &Self) -> core::cmp::Ordering {
        // BinaryHeap is a max-heap, so natural ordering on `priority`
        // already pops the highest-priority request first.
self.priority.cmp(&other.priority)
}
}
impl PartialEq for PrioritizedLoadRequest {
fn eq(&self, other: &Self) -> bool { self.priority == other.priority }
}
impl Eq for PrioritizedLoadRequest {}
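A quick host-side demonstration of this ordering, using a simplified stand-in type (std collections instead of the kernel's allocator):

```rust
use std::cmp::Ordering;
use std::collections::BinaryHeap;

// Simplified stand-in for PrioritizedLoadRequest.
struct Prioritized {
    priority: u8,
    name: &'static str,
}

impl Ord for Prioritized {
    fn cmp(&self, other: &Self) -> Ordering {
        // BinaryHeap is a max-heap: natural ordering on `priority`
        // pops the highest-priority request first.
        self.priority.cmp(&other.priority)
    }
}
impl PartialOrd for Prioritized {
    fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
        Some(self.cmp(other))
    }
}
impl PartialEq for Prioritized {
    fn eq(&self, other: &Self) -> bool {
        self.priority == other.priority
    }
}
impl Eq for Prioritized {}

fn pop_order(requests: Vec<Prioritized>) -> Vec<&'static str> {
    let mut heap: BinaryHeap<Prioritized> = requests.into_iter().collect();
    let mut order = Vec::new();
    while let Some(r) = heap.pop() {
        order.push(r.name);
    }
    order
}
```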
Module loading sequence (executed by a loader worker thread after dequeuing):
1. Verify driver binary signature (ML-DSA-44 or SLH-DSA-128f per Section 8.2).
Reject if signature is absent or invalid.
2. Parse .kabi manifest: validate fields, check KabiVersion compatibility.
3. Allocate an isolation domain (MPK PKEY, POE overlay, or equivalent per arch).
If no domains are available: reject with KernelError::ResourceExhausted.
4. Map driver binary into the new domain (read+execute, no write).
5. Call driver_entry.init(services, descriptor). Apply timeout_ms watchdog.
6. On success: transition device state to Active, send Ok(handle) to result_tx.
7. On failure: free domain, unmap binary, send Err(...) to result_tx.
Registry transitions device state to Error.
10.5.5 Device Lifecycle
10.5.5.1 State Machine
The registry manages each device through a well-defined state machine. Only the kernel initiates transitions — drivers cannot set their own state.
+-> [Error] ------+----> [Quarantined]
| | |
[Discovered] -> [Matching] -> [Loading] -> [Initializing] -> [Active]
^ ^ | |
| | | v
| +--- (no match) -----------+ [Suspending]
| | |
| +-- (admin re-enable) -- [Quarantined] v
+-- (hotplug rescan) ---- [Removed] [Suspended]
| ^ |
| | v
+-- (driver reload) ----- [Stopping] <-------------- [Resuming]
^ |
| v
[Recovering] <------------- [Active]
#[repr(u32)]
pub enum DeviceState {
Discovered = 0, // Node exists, no driver bound
Matching = 1, // Match engine evaluating
Loading = 2, // Driver binary being loaded
Initializing = 3, // driver init() called, waiting for result
Active = 4, // Driver running normally
Suspending = 5, // Suspend requested, waiting for driver ack
Suspended = 6, // Driver has acknowledged suspend
Resuming = 7, // Resume requested, waiting for driver ack
Stopping = 8, // Driver being stopped (unload, removal, admin)
Recovering = 9, // Driver crashed, recovery in progress
Removed = 10, // Device physically removed (hotplug)
Error = 11, // Fatal error, non-functional
Quarantined = 12, // Driver permanently disabled (crash threshold exceeded);
// requires manual re-enable via sysfs
}
10.5.5.2 Transition Table
| From | To | Trigger | Driver Callback |
|---|---|---|---|
| Discovered | Matching | New device or new driver registered | None |
| Matching | Loading | Match found | None |
| Matching | Discovered | No match | None |
| Loading | Initializing | Binary loaded, vtable exchange begins | init() |
| Initializing | Active | init() returns success | None |
| Initializing | Error | init() returns error or timeout | None |
| Active | Suspending | PM suspend request | suspend() |
| Suspending | Suspended | suspend() returns success | None |
| Suspending | Error | suspend() timeout or failure | shutdown() (force) |
| Suspended | Resuming | PM resume request | resume() |
| Resuming | Active | resume() returns success | None |
| Resuming | Recovering | resume() failure | None |
| Active | Stopping | Admin request, unload, or hotplug removal | shutdown() |
| Active | Recovering | Fault detected (domain violation, watchdog, crash) | None |
| Recovering | Loading | Recovery initiated, fresh binary load | (fresh init()) |
| Error | Quarantined | Crash threshold exceeded (5+ failures in window) | None |
| Quarantined | Matching | Manual administrator re-enable via sysfs | None |
| Any | Removed | Physical device gone + teardown complete | shutdown() if possible |
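The transition table lends itself to a compact kernel-side validator. This is a sketch under illustrative names (`is_valid_transition` is not the real registry API); Removed is reachable from any state once teardown completes:

```rust
#[derive(Clone, Copy, PartialEq, Debug)]
pub enum DeviceState {
    Discovered, Matching, Loading, Initializing, Active, Suspending,
    Suspended, Resuming, Stopping, Recovering, Removed, Error, Quarantined,
}

/// Sketch of a transition validator encoding the table above.
pub fn is_valid_transition(from: DeviceState, to: DeviceState) -> bool {
    use DeviceState::*;
    // Removed is a terminal state reachable from anywhere (hotplug yank).
    if to == Removed {
        return true;
    }
    matches!(
        (from, to),
        (Discovered, Matching)
            | (Matching, Loading)
            | (Matching, Discovered)
            | (Loading, Initializing)
            | (Initializing, Active)
            | (Initializing, Error)
            | (Active, Suspending)
            | (Active, Stopping)
            | (Active, Recovering)
            | (Suspending, Suspended)
            | (Suspending, Error)
            | (Suspended, Resuming)
            | (Resuming, Active)
            | (Resuming, Recovering)
            | (Recovering, Loading)
            | (Error, Quarantined)
            | (Quarantined, Matching)
    )
}
```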
10.5.5.3 Timeouts
Every callback has a timeout. If the driver does not respond within the timeout, the kernel force-stops it (same mechanism as crash recovery: revoke isolation domain / kill process).
| Callback | Tier 1 Timeout | Tier 2 Timeout |
|---|---|---|
| init() | 5 seconds | 10 seconds |
| shutdown() | 3 seconds | 5 seconds |
| suspend() | 2 seconds | 5 seconds |
| resume() | 2 seconds | 5 seconds |
All timeouts are configurable via kernel parameters.
10.5.6 Power Management
10.5.6.1 Power States
#[repr(u32)]
pub enum PowerState {
D0Active = 0, // Fully operational
D1LowPower = 1, // Low-power idle (quick resume)
D2DeepSleep = 2, // Deeper sleep (longer resume, less power)
D3Off = 3, // Powered off (full re-init on resume)
}
10.5.6.2 Topology-Driven Ordering
This is the primary advantage of having a kernel-owned device tree. Suspend/resume ordering is derived from topology, not maintained as a separate list.
Suspend order (depth-first, leaves first):
For each subtree rooted at device D:
1. Suspend all clients of D (provider-client links)
2. Recursively suspend all children of D (bottom-up)
3. Suspend D itself
Resume order (exact reverse):
For each subtree rooted at device D:
1. Resume D itself
2. Recursively resume all children of D (top-down)
3. Resume all clients of D
This is computed once by topological sort when a system PM transition begins. Provider-client edges are treated as additional dependency edges in the sort. The result is cached and invalidated when the tree topology changes.
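The parent-child half of the ordering can be sketched as a plain depth-first walk; provider-client edges (the extra dependency edges in the real topological sort) are omitted for brevity, and the `Node` type is illustrative:

```rust
/// Minimal sketch of topology-driven suspend ordering: depth-first,
/// leaves before parents. Provider-client edges are omitted.
pub struct Node {
    pub name: &'static str,
    pub children: Vec<Node>,
}

pub fn suspend_order(node: &Node, out: &mut Vec<&'static str>) {
    for child in &node.children {
        suspend_order(child, out); // recurse: deepest devices suspend first
    }
    out.push(node.name); // the parent suspends after all its children
}

/// Resume order is the exact reverse of suspend order.
pub fn resume_order(node: &Node) -> Vec<&'static str> {
    let mut order = Vec::new();
    suspend_order(node, &mut order);
    order.reverse();
    order
}
```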
Why this is better than Linux: Linux maintains a dpm_list that approximates
topological order but can get it wrong. The ordering is based on registration order and
heuristic adjustments, not the actual device tree. UmkaOS computes the correct order
directly from the tree.
10.5.6.3 PM Failure Handling
When a driver fails to suspend within its timeout:
- Registry marks the node as Error.
- Driver is force-stopped (revoke isolation domain / kill process).
- Suspend continues for remaining devices — one broken driver does not block the entire system.
- On resume, the failed device's driver is reloaded fresh (leveraging crash recovery from Section 10.8).
- Failure is logged with context for admin diagnosis.
This directly implements the principle from Section 17.2: "Tier 1 and Tier 2 drivers that fail to suspend within a timeout are forcibly stopped and restarted on resume."
10.5.6.4 Runtime Power Management
Beyond system suspend, individual devices can enter low-power states when idle:
pub struct RuntimePmPolicy {
pub enabled: bool,
pub idle_timeout_ms: u32, // Enter D1 after this idle period
pub min_state: PowerState, // Deepest state allowed during runtime PM
}
The registry tracks I/O activity per device (through KABI call frequency). When a device
has been idle for idle_timeout_ms, the registry initiates a runtime suspend of that
device alone. Children are only suspended if they are also idle.
Runtime PM is independent of system PM. A device can be in D1 (runtime idle) while the system is fully running.
10.5.7 Hot-Plug
10.5.7.1 Bus Drivers as Event Sources
Bus drivers (PCI host bridge, USB XHCI, USB hub) are the source of hotplug events. They detect device arrival/departure and report to the registry through KABI methods.
A bus driver is identified by is_bus_driver = 1 in its DriverManifest. It has the
HOTPLUG_NOTIFY capability (already defined in capability.rs).
10.5.7.2 Device Arrival
1. Bus driver detects new device
(PCIe hot-add interrupt, USB port status change, ACPI _STA change)
2. Bus driver calls registry_report_device() via KABI
- Passes: parent handle, bus type, bus-specific identity, initial properties
3. Registry creates a new DeviceNode in Discovered state
4. Registry populates properties from the bus driver's report
5. Registry runs the match engine on the new node
6. If match found: load driver, init, transition to Active
7. Registry emits uevent for Linux compatibility (udev/systemd)
10.5.7.3 Device Removal (Orderly)
1. Bus driver detects device departure (link down, port status change)
2. Bus driver calls registry_report_removal() via KABI
3. Registry processes the subtree bottom-up:
a. For each child (deepest first):
- Stop the child's driver (shutdown callback)
- Release capabilities
- Remove child node
b. Stop the target device's driver
c. Release all capabilities
d. Remove the DeviceNode
4. Registry emits uevent (removal)
10.5.7.4 Surprise Removal
When a device is physically yanked without warning (e.g., USB unplug during I/O):
- Bus driver detects absence (failed transaction, link down).
- Registry receives the removal report.
- All pending I/O for the device and its children is completed with -EIO.
- shutdown() is called on the driver — it may fail quickly because the hardware is gone. This is expected and handled gracefully (timeout → force-stop).
- The node subtree is torn down.
This mirrors crash recovery but is initiated by the bus driver rather than by a fault.
10.5.7.5 Uevent Compatibility
For Linux userspace compatibility (udev, systemd-udevd), the registry emits uevent notifications matching the Linux format:
ACTION=add
DEVPATH=/devices/pci0000:00/0000:03:00.0
SUBSYSTEM=pci
PCI_ID=8086:2723
PCI_CLASS=028000
DRIVER=umka-iwlwifi
This feeds into umka-compat/src/sys/ for sysfs and umka-compat/src/dev/ for
devtmpfs, as outlined in Section 18.1.3.
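A trivial sketch of assembling such a payload (illustrative only; the display form above uses newlines, while the actual netlink uevent wire format separates key=value pairs with NUL bytes — the separator is a parameter here for that reason):

```rust
/// Assemble a uevent payload from key=value pairs.
/// `separator` is '\n' for display, '\0' for the netlink wire form.
pub fn format_uevent(pairs: &[(&str, &str)], separator: char) -> String {
    pairs
        .iter()
        .map(|(k, v)| format!("{}={}", k, v))
        .collect::<Vec<_>>()
        .join(&separator.to_string())
}
```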
10.5.8 Service Discovery
10.5.8.1 The Problem
Drivers sometimes need services from other drivers — not through direct communication, but through mediated access. Examples:
- NIC needs a PHY driver (MII bus)
- GPU display pipeline needs I2C controller for DDC/EDID
- RAID controller needs to discover member disks
- Filesystem driver needs its underlying block device
In Linux, each of these has a subsystem-specific mechanism (phylib, i2c_adapter, md_personality, etc.) with its own registration/lookup API. In IOKit, it is done through IOService matching. UmkaOS unifies service discovery through the registry.
10.5.8.2 Service Publication
A driver can publish a named service on its device node:
Driver A (e.g., PHY driver):
1. Completes init, device node is Active
2. Calls registry_publish_service("phy", &phy_vtable)
3. Registry records: node A provides service "phy" with given vtable
The phy_vtable is a service-specific C-ABI vtable (same flat, versioned approach as
all other KABI vtables). The registry stores a reference to it.
10.5.8.3 Service Lookup
A driver can look up a named service:
Driver B (e.g., NIC driver):
1. Needs PHY service
2. Calls registry_lookup_service("phy", scope=ParentSubtree)
3. Registry searches for a node in scope that publishes "phy"
4. Registry validates Driver B has PEER_DRIVER_IPC capability
5. Registry creates a provider-client link (B consumes A's "phy")
6. Registry returns a wrapped service vtable and a ServiceHandle
Lookup scope options:
#[repr(u32)]
pub enum ServiceLookupScope {
Siblings = 0, // Same parent only
ParentSubtree = 1, // Parent and all its descendants
Global = 2, // Entire registry (expensive, rare)
Specific = 3, // A specific node (by DeviceHandle)
}
10.5.8.4 Mediated Access
The registry mediates all cross-driver service access. This is critical:
- The registry validates capabilities before returning a service handle.
- The returned vtable is wrapped by the registry — calls go through a trampoline that:
- Validates the service handle is still valid
- Performs the isolation domain switch if provider and client are in different Tier 1 domains
- Handles the user-kernel transition if one side is Tier 2
- The registry can revoke a service link at any time (e.g., when the provider crashes).
- The registry tracks all active links for PM ordering (clients must suspend before providers).
- Drivers never hold direct pointers to each other's memory.
10.5.8.5 Service Recovery
When a provider driver crashes and is reloaded:
- The registry invalidates all service handles pointing to the crashed provider.
- Client drivers that call the service vtable receive -ENODEV from the trampoline.
- After the provider is reloaded and republishes its service, client drivers receive a service_recovered callback (an optional, appended addition to DriverEntry):
// Appended to DriverEntry (optional)
pub service_recovered: Option<unsafe extern "C" fn(
ctx: *mut c_void,
service_name: *const u8,
service_name_len: u32,
) -> InitResultCode>,
The client driver can then re-acquire the service handle and resume operations.
10.5.8.5a Service Handle Liveness Protocol
After a Tier 1 driver crashes, any ServiceHandle held by Tier 2 or user processes
points to a stale vtable. Calling through a stale vtable is a use-after-free (UAF)
vulnerability. UmkaOS prevents this via generation counters:
/// Kernel-internal service reference. NOT exposed at KABI boundary.
/// The KABI-stable token is `ServiceHandle` (a newtype over `u64`).
/// Mapping: `ServiceHandle::id` → kernel looks up `InternalServiceRef` via service registry.
///
/// Contains a generation counter that is checked on every dispatch
/// to detect stale handles pointing to crashed providers.
pub struct InternalServiceRef {
/// Provider descriptor pointer (points into umka-core memory, not driver memory).
provider: *const ProviderDescriptor,
/// Generation of the provider at handle creation time.
/// Must match provider.state_generation on dispatch or the call fails.
generation: u64,
/// Rights granted to the holder of this handle.
rights: Rights,
}
/// Per-provider state generation counter. Incremented when:
/// 1. The provider crashes and is reloaded.
/// 2. The provider explicitly invalidates all handles (e.g., after a
/// security-relevant config change).
/// Stored in umka-core memory (not in the driver's memory domain) so it
/// remains valid even after the driver domain is destroyed.
pub struct ProviderDescriptor {
/// Monotonically increasing. Odd = active; even = inactive/crashed.
/// Updated atomically by umka-core on crash detection.
pub state_generation: AtomicU64,
// ... vtable pointer and other registry fields follow
}
Dispatch check (in the trampoline layer, before every cross-domain call):
fn trampoline_dispatch(handle: &InternalServiceRef, request: &Request) -> Result<Response, Error> {
// Check liveness: read the provider's current generation.
// Ordering::Acquire: ensures we see any writes made by the crash handler
// that incremented state_generation.
let current_gen = unsafe {
(*handle.provider).state_generation.load(Ordering::Acquire)
};
if current_gen != handle.generation {
return Err(Error::ProviderDead);
}
// Generation matched: safe to call through vtable.
// (Note: generation can still change between the check and the call.
// The domain fault handler catches this and returns ProviderDead to
// the caller via the normal crash-recovery path.)
dispatch_to_tier1(handle, request)
}
Handle invalidation on crash:
When a Tier 1 driver panics:
1. The domain fault handler (already specified in Section 10.8) catches the fault.
2. It atomically increments provider.state_generation (odd → even, marking inactive).
3. All subsequent dispatch attempts to this provider return Err(ProviderDead).
4. After driver reload, the new provider instance starts with state_generation + 1
(odd, active). Old handles (with old generation) remain permanently stale.
5. Callers that receive Err(ProviderDead) must re-open the service to get a
new ServiceHandle with the current generation (the kernel creates a new
InternalServiceRef with the updated generation and maps it to a fresh ServiceHandle::id).
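The whole protocol can be exercised in a few lines. This is a host-side sketch using std atomics (function names `open`/`dispatch`/`crash`/`reload` are illustrative, not the kernel API), preserving the odd = active, even = inactive invariant:

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Host-side sketch of the generation-counter liveness protocol.
/// Odd = active, even = inactive/crashed.
pub struct Provider {
    pub state_generation: AtomicU64,
}

pub struct Handle {
    pub generation: u64,
}

/// Open a handle against the provider's current (active) generation.
pub fn open(p: &Provider) -> Option<Handle> {
    let g = p.state_generation.load(Ordering::Acquire);
    if g % 2 == 1 { Some(Handle { generation: g }) } else { None }
}

/// Trampoline check: the handle's generation must still match.
pub fn dispatch(p: &Provider, h: &Handle) -> Result<(), &'static str> {
    if p.state_generation.load(Ordering::Acquire) != h.generation {
        return Err("ProviderDead");
    }
    Ok(())
}

/// Crash handler: odd -> even marks the provider inactive; all
/// outstanding handles become permanently stale.
pub fn crash(p: &Provider) {
    p.state_generation.fetch_add(1, Ordering::Release);
}

/// Reload: even -> odd activates the new instance under a new generation.
pub fn reload(p: &Provider) {
    p.state_generation.fetch_add(1, Ordering::Release);
}
```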
Invariant: ProviderDescriptor is always allocated in umka-core memory, never in
the driver's isolation domain. This ensures the descriptor (including state_generation)
remains accessible and uncorrupted after the driver domain is torn down during crash
recovery.
Design intent: InternalServiceRef cannot be "refreshed" — a crashed provider's
internal reference cannot be upgraded to point at the new instance. This is intentional:
the crash may indicate a security event, and forcing callers to explicitly re-open
(obtaining a new ServiceHandle) ensures they notice the crash and can apply any required
policy (e.g., re-authenticate, validate new driver version). The generation counter is
the minimal mechanism; it adds one Acquire load (~3-5 cycles, L1-resident) per
cross-domain call.
10.5.8.6 Registry Event Notifications
Beyond driver-to-driver service recovery, kernel subsystems need to react to device lifecycle events. The registry provides an internal notification mechanism (not exposed through KABI — this is kernel-to-kernel only).
/// Registry event types that kernel subsystems can subscribe to.
#[repr(u32)]
pub enum RegistryEvent {
/// A new device node was created (after bus enumeration).
DeviceDiscovered = 0,
/// A device transitioned to Active (driver bound and initialized).
DeviceActive = 1,
/// A device is being removed (before teardown begins).
DeviceRemoving = 2,
/// A device's driver crashed and recovery is starting.
DeviceRecovering = 3,
/// A device's power state changed.
PowerStateChanged = 4,
/// IOMMU group assignment changed (passthrough ↔ kernel domain).
IommuGroupChanged = 5,
/// A service was published or unpublished.
ServiceChanged = 6,
}
/// Callback type for registry event notifications.
pub type RegistryNotifyFn = fn(
event: RegistryEvent,
node_id: DeviceNodeId,
context: *mut c_void,
);
Subscribers:
| Kernel Subsystem | Events | Purpose |
|---|---|---|
| Memory manager (Section 4.1) | DeviceDiscovered, DeviceRemoving | Update NUMA topology when devices with local memory appear/disappear |
| Scheduler (Section 6.1) | DeviceActive, DeviceRemoving | Update IRQ affinity recommendations |
| FMA engine (Section 19.1) | DeviceRecovering | Log fault management events, track failure patterns |
| AccelScheduler (Section 21.1) | DeviceActive, DeviceRecovering, PowerStateChanged | Manage accelerator context lifecycle |
| Sysfs compat (Section 10.5.12) | All events | Update /sys filesystem in real-time |
Notifications are dispatched synchronously during registry state transitions. Subscribers must not block — they record the event and defer heavy work to a workqueue. This prevents a slow subscriber from delaying device bring-up.
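The record-and-defer pattern can be sketched with a small queue (host-side sketch using std locking; a trimmed `RegistryEvent` is re-declared for self-containment, and `DeferredQueue` is a hypothetical name):

```rust
use std::collections::VecDeque;
use std::sync::Mutex;

/// Trimmed stand-in for the RegistryEvent enum above.
#[derive(Clone, Copy, PartialEq, Debug)]
pub enum RegistryEvent {
    DeviceDiscovered,
    DeviceActive,
    DeviceRemoving,
}

/// Sketch of the non-blocking subscriber pattern: the synchronous
/// callback only records the event; heavy work is drained later
/// from a workqueue.
pub struct DeferredQueue {
    pending: Mutex<VecDeque<(RegistryEvent, u64)>>,
}

impl DeferredQueue {
    pub fn new() -> Self {
        Self { pending: Mutex::new(VecDeque::new()) }
    }

    /// Called synchronously during a registry state transition:
    /// must not block beyond the short queue lock.
    pub fn notify(&self, event: RegistryEvent, node_id: u64) {
        self.pending.lock().unwrap().push_back((event, node_id));
    }

    /// Called later from workqueue context to perform the deferred work.
    pub fn drain(&self) -> Vec<(RegistryEvent, u64)> {
        self.pending.lock().unwrap().drain(..).collect()
    }
}
```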
10.5.9 KABI Integration
10.5.9.1 New Methods Appended to KernelServicesVTable
All new methods are Option<...> for backward compatibility. Older kernels that do not
have the registry will have these as None. Drivers must check for None before calling.
// === Device Registry (appended to KernelServicesVTable) ===
/// Report a newly discovered device to the registry.
/// Called by bus drivers (PCI enumeration, USB hub, etc.).
pub registry_report_device: Option<unsafe extern "C" fn(
parent_handle: DeviceHandle,
bus_type: BusType,
bus_identity: *const u8,
bus_identity_len: u32,
properties: *const PropertyEntry,
property_count: u32,
out_handle: *mut DeviceHandle,
) -> IoResultCode>,
/// Report that a device has been physically removed.
pub registry_report_removal: Option<unsafe extern "C" fn(
device_handle: DeviceHandle,
) -> IoResultCode>,
/// Get a property value from a device node.
pub registry_get_property: Option<unsafe extern "C" fn(
device_handle: DeviceHandle,
key: *const u8,
key_len: u32,
out_value: *mut PropertyValueC,
out_value_size: *mut u32,
) -> IoResultCode>,
/// Set a property on a device node.
pub registry_set_property: Option<unsafe extern "C" fn(
device_handle: DeviceHandle,
key: *const u8,
key_len: u32,
value: *const PropertyValueC,
value_size: u32,
) -> IoResultCode>,
/// Publish a named service on this device node.
pub registry_publish_service: Option<unsafe extern "C" fn(
device_handle: DeviceHandle,
service_name: *const u8,
service_name_len: u32,
service_vtable: *const c_void,
service_vtable_size: u64,
) -> IoResultCode>,
/// Look up a named service.
pub registry_lookup_service: Option<unsafe extern "C" fn(
device_handle: DeviceHandle,
service_name: *const u8,
service_name_len: u32,
scope: u32,
out_service_vtable: *mut *const c_void,
out_service_handle: *mut ServiceHandle,
) -> IoResultCode>,
/// Release a previously acquired service handle.
pub registry_release_service: Option<unsafe extern "C" fn(
service_handle: ServiceHandle,
) -> IoResultCode>,
/// Get the device handle for the current driver instance.
pub registry_get_device_handle: Option<unsafe extern "C" fn(
out_handle: *mut DeviceHandle,
) -> IoResultCode>,
/// Enumerate children of a device node.
pub registry_enumerate_children: Option<unsafe extern "C" fn(
device_handle: DeviceHandle,
out_handles: *mut DeviceHandle,
max_count: u32,
out_count: *mut u32,
) -> IoResultCode>,
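A driver-side call through one of these optional entries might look as follows. This is a simplified sketch: the function pointer is a safe `fn` stand-in with a reduced signature (the real entry is `unsafe extern "C"` with a raw pointer), and `IO_ERR_UNSUPPORTED` is an illustrative error code for the None fallback.

```rust
// Simplified stand-in for the real vtable, enough to show the mandatory
// None check before calling a registry entry on a possibly-older kernel.

pub type IoResultCode = i32;
pub const IO_SUCCESS: IoResultCode = 0;
pub const IO_ERR_UNSUPPORTED: IoResultCode = -95; // illustrative code

pub struct KernelServicesVTableStub {
    // Older kernels: None. Newer kernels: Some(fn).
    pub registry_get_device_handle: Option<fn(out_handle: &mut u64) -> IoResultCode>,
}

/// Driver-side wrapper: returns the handle, or an error if the kernel
/// predates the registry. Drivers must take this path before every call.
pub fn my_device_handle(services: &KernelServicesVTableStub) -> Result<u64, IoResultCode> {
    match services.registry_get_device_handle {
        Some(f) => {
            let mut h = 0u64;
            let rc = f(&mut h);
            if rc == IO_SUCCESS { Ok(h) } else { Err(rc) }
        }
        None => Err(IO_ERR_UNSUPPORTED), // old kernel: registry absent
    }
}
```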
10.5.9.2 New ABI Types
/// Opaque handle to a device node in the registry.
#[repr(C)]
pub struct DeviceHandle {
pub id: u64,
}
impl DeviceHandle {
pub const INVALID: Self = Self { id: 0 };
}
/// Stable C-ABI service token. Passed across isolation domain boundaries.
/// Kernel resolves this id to an `InternalServiceRef` at each call site.
/// Liveness: the module providing this service cannot be unloaded while any
/// active `ServiceHandle` referring to it is held by a capability.
#[repr(C)]
pub struct ServiceHandle {
pub id: u64,
}
/// A property entry for C ABI transport.
#[repr(C)]
pub struct PropertyEntry {
pub key: *const u8,
pub key_len: u32,
pub value_type: PropertyType,
pub value_data: *const u8,
pub value_len: u32,
pub _pad: u32,
}
#[repr(u32)]
pub enum PropertyType {
U64 = 0,
I64 = 1,
String = 2,
Bytes = 3,
Bool = 4,
StringArray = 5,
}
/// C-ABI-safe property value output buffer.
#[repr(C)]
pub struct PropertyValueC {
pub value_type: PropertyType,
pub _pad: u32,
pub data: [u8; 256],
}
// `KabiVersion` — defined in Section 11.1.9.3 (11-kabi.md).
// Layout: { major: u16, minor: u16, patch: u16, _pad: u16 } — repr(C), 8 bytes.
// Key methods: new(major,minor,patch), is_compatible_with(kernel), as_u64(), from_u64(v).
// Constant: KABI_CURRENT = 1.0.0.
// The vtable wire format stores KabiVersion::as_u64() in the first 8 bytes of each vtable.
10.5.9.3 DeviceDescriptor Extension
The existing DeviceDescriptor gains new fields (appended):
// Appended to DeviceDescriptor
pub device_handle: DeviceHandle, // Registry handle for this device
pub numa_node: i32, // NUMA node (-1 = unknown)
pub _pad: u32,
The DeviceDescriptor passed to driver_entry.init() is now populated from the
registry node's properties, ensuring consistency between what the registry knows and
what the driver sees.
10.5.9.4 Memory Management KABI (memory_v1)
The memory_v1 KABI table provides driver-callable memory management functions
appended to KernelServicesVTable starting at KABI version 2 (the initial
KernelServicesVTable layout is version 1). Per Section 11.1.4 versioning rules, these four
Option<fn> fields are tail-appended; drivers compiled against KABI v1 see a
shorter vtable_size and never access these offsets. Drivers compiled against v2+
check vtable_size >= offset_of!(memory_v1 fields) before calling, and fall back
to non-NUMA allocation if the kernel does not expose memory_v1.
These extend the existing DMA allocation functions (Section 10.4, Tier 2 syscall table) with NUMA-aware operations for Tier 1 drivers.
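The version gate described above can be sketched with `core::mem::offset_of` (stable since Rust 1.77). The struct is a reduced stand-in for `KernelServicesVTable`; the field set and the in-struct `vtable_size` header are assumptions for illustration.

```rust
// Reduced vtable stand-in: v1 ends after `alloc_dma_buffer`; memory_v1
// fields are tail-appended in v2. The kernel reports its vtable size at
// vtable exchange; drivers gate access on that size before dereferencing.
use core::mem::offset_of;

#[repr(C)]
pub struct VTableV2 {
    pub vtable_size: u64,                                   // kernel-reported
    pub alloc_dma_buffer: Option<fn() -> i32>,              // since v1
    pub driver_request_numa_migration: Option<fn() -> i32>, // memory_v1, v2+
}

/// True only if the running kernel's vtable is long enough to contain the
/// memory_v1 field AND the kernel populated the entry.
pub fn has_numa_migration(vt: &VTableV2) -> bool {
    (vt.vtable_size as usize) > offset_of!(VTableV2, driver_request_numa_migration)
        && vt.driver_request_numa_migration.is_some()
}
```

A driver falls back to non-NUMA allocation whenever `has_numa_migration` is false.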
// === Memory Management (appended to KernelServicesVTable, memory_v1) ===
/// Request explicit NUMA page migration for driver-private pages.
///
/// Moves the specified physical pages to the target NUMA node. Only callable
/// on pages within the calling driver's isolation domain (Tier 1 protection
/// key match required). The kernel validates ownership before migration.
///
/// Migration is **synchronous**: this function blocks until all pages have
/// been physically moved to the target node (or an error occurs). The
/// driver's virtual mappings are updated transparently — existing virtual
/// addresses remain valid after migration, only the underlying physical
/// frames change.
///
/// # Arguments
/// - `pages`: Pointer to a caller-allocated array of physical page addresses
/// (page-aligned, 4 KiB granularity). Each address must be within the
/// caller's isolation domain.
/// - `page_count`: Number of entries in the `pages` array. Maximum 512 pages
/// per call (2 MiB). For larger migrations, issue multiple calls.
/// - `target_node`: NUMA node ID to migrate to. Must be a valid node with
/// available memory. Use `numa_node_count()` to discover topology.
///
/// # Returns
/// - `IO_SUCCESS` (0): All pages successfully migrated.
/// - `IO_ERR_INVALID_ADDR` (-EFAULT, -14): One or more page addresses are
/// outside the caller's isolation domain or not page-aligned. No pages
/// are migrated (atomic failure).
/// - `IO_ERR_INVALID_NODE` (-EINVAL, -22): `target_node` does not exist in
/// the NUMA topology or has no allocatable memory.
/// - `IO_ERR_DMA_PINNED` (-EBUSY, -16): One or more pages have an active
/// DMA mapping (`PG_dma_pinned` flag set, Section 10.5.3.7). DMA-pinned pages cannot
/// be migrated because a device holds their physical address. The driver
/// must unpin DMA buffers (`free_dma_buffer`) before migrating. No pages
/// are migrated (atomic failure).
/// - `IO_ERR_NOMEM` (-ENOMEM, -12): Target node has insufficient free memory
/// to accept the migrated pages. No pages are migrated (atomic failure).
/// - `IO_ERR_PERM` (-EPERM, -1): Caller does not hold `CAP_NUMA_MIGRATE`
/// capability (required for explicit NUMA migration).
///
/// # Atomicity
/// Migration is all-or-nothing: either all pages in the request are migrated,
/// or none are. The kernel pre-validates all pages and pre-allocates target
/// frames before beginning the move. If any page fails validation, the entire
/// request is rejected before any migration occurs.
///
/// # Concurrency
/// The kernel holds the per-page migration lock during the move, serializing
/// with concurrent NUMA balancer scans and other migration requests for the
/// same pages. Other pages in the driver's domain remain accessible during
/// migration. The migrating pages are briefly unmapped (~1-5 µs per page);
/// concurrent access from the driver's other threads will fault and block
/// until migration completes.
///
/// # Safety
/// - `pages` must point to a valid array of at least `page_count` elements.
/// - All addresses in the array must be page-aligned (4096-byte boundary).
/// - Caller must ensure no device DMA is in flight to the specified pages
/// (the `PG_dma_pinned` check catches registered DMA buffers, but the
/// driver is responsible for not issuing new DMA to these addresses
/// concurrently with migration).
pub driver_request_numa_migration: Option<unsafe extern "C" fn(
pages: *const u64,
page_count: u32,
target_node: i32,
) -> IoResultCode>,
/// Query the NUMA node for a set of physical pages.
///
/// Returns the NUMA node ID for each page in the input array.
/// Useful for drivers that want to check data locality before deciding
/// whether to migrate.
///
/// # Arguments
/// - `pages`: Pointer to array of physical page addresses (page-aligned).
/// - `page_count`: Number of entries in `pages`.
/// - `out_nodes`: Pointer to caller-allocated array of `page_count` `i32`
/// values. On success, `out_nodes[i]` contains the NUMA node ID for
/// `pages[i]`.
///
/// # Returns
/// - `IO_SUCCESS`: All node IDs written to `out_nodes`.
/// - `IO_ERR_INVALID_ADDR` (-EFAULT): One or more pages outside caller's domain.
pub driver_query_numa_node: Option<unsafe extern "C" fn(
pages: *const u64,
page_count: u32,
out_nodes: *mut i32,
) -> IoResultCode>,
/// Query NUMA topology: number of NUMA nodes in the system.
///
/// # Returns
/// - Positive value: number of NUMA nodes (1 on non-NUMA systems).
/// - Negative value: error (should not occur; returns 1 as fallback).
pub numa_node_count: Option<unsafe extern "C" fn() -> i32>,
/// Query available memory on a NUMA node.
///
/// # Arguments
/// - `node`: NUMA node ID.
/// - `out_total_bytes`: Total physical memory on this node.
/// - `out_free_bytes`: Currently free memory on this node.
///
/// # Returns
/// - `IO_SUCCESS`: Values written to output pointers.
/// - `IO_ERR_INVALID_NODE` (-EINVAL): Node does not exist.
pub numa_node_memory: Option<unsafe extern "C" fn(
node: i32,
out_total_bytes: *mut u64,
out_free_bytes: *mut u64,
) -> IoResultCode>,
Usage pattern — A NUMA-aware NIC driver migrates receive buffer pages to the NUMA node closest to the NIC's PCIe attachment point:
1. Driver probes device, reads `DeviceDescriptor.numa_node` (Section 10.5.9.3).
2. Driver allocates receive ring buffers (via `alloc_dma_buffer`).
3. On each received packet, driver calls `driver_query_numa_node` to check
if the destination process's pages are local.
4. If remote, driver calls `driver_request_numa_migration` to pull hot pages
to the NIC's node, reducing memory access latency for subsequent packets.
5. Migration frequency is rate-limited by the driver to avoid migration storms
(recommended: at most once per page per 100ms).
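Step 5's rate limiting can be sketched as a per-page cooldown tracker. The `BTreeMap` is a host-side stand-in; a real Tier 1 driver would use a fixed-size structure keyed by physical page address.

```rust
// Illustrative rate limiter: allow at most one migration request per page
// per 100 ms, as recommended above.
use std::collections::BTreeMap;

const MIGRATION_COOLDOWN_NS: u64 = 100_000_000; // 100 ms

pub struct MigrationLimiter {
    last_attempt_ns: BTreeMap<u64, u64>, // phys page addr -> timestamp
}

impl MigrationLimiter {
    pub fn new() -> Self {
        Self { last_attempt_ns: BTreeMap::new() }
    }

    /// Returns true if the page may be migrated now, and records the attempt.
    pub fn may_migrate(&mut self, page: u64, now_ns: u64) -> bool {
        match self.last_attempt_ns.get(&page) {
            Some(&t) if now_ns.saturating_sub(t) < MIGRATION_COOLDOWN_NS => false,
            _ => {
                self.last_attempt_ns.insert(page, now_ns);
                true
            }
        }
    }
}
```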
10.5.10 Crash Recovery Integration
The registry participates in the crash recovery sequence defined in Section 10.8.
10.5.10.1 When a Driver Crashes
1. Detection: UmkaOS Core detects the fault (hardware exception in isolation domain, watchdog timeout, Tier 2 process crash).
2. Registry notification: UmkaOS Core identifies the faulting driver's device node. The registry transitions it to Recovering.
3. Service invalidation: All service handles pointing to the crashed driver are invalidated. Client drivers receive -ENODEV on subsequent service calls.
4. Child cascade: If the crashed driver is a bus driver with children, the registry processes children bottom-up:
   - For each child: stop driver, release capabilities, transition to Stopping.
   - Children are re-probed after the bus driver recovers.
5. I/O drain + DMA fence: All pending I/O is completed with -EIO. Critically, before freeing any driver memory, UmkaOS must ensure no in-flight DMA operations can write to those pages. The IOMMU mapping for the driver's DMA regions is revoked (set to fault-on-access) immediately at the ISOLATE step of the Section 10.8 recovery sequence; any in-flight DMA that completes after this point hits an IOMMU fault (harmless — the write is dropped by the IOMMU).

   DMA teardown sequence (before IOTLB unmap):
   1. Assert device-class DMA stop:
      - PCIe devices with FLR (Function Level Reset) support: issue FLR via the PCIe Device Control register (capability offset + 0x08, bit 15). FLR resets the device state and stops all outstanding DMA.
      - NVMe: issue Admin Command ABORT for all outstanding I/Os, then clear CC.EN (Controller Enable) to halt the NVMe controller.
      - AHCI: clear PORT_CMD_FIS_RX and PORT_CMD_ST per port.
      - USB devices: send a USB port reset to the host controller.
      - Devices without a DMA-stop mechanism: skip to step 2 (fallback only).
   2. Wait for DMA quiescence:
      - Poll the device's DMA-active indicator (device-class specific) until it reports no outstanding DMA, or until 100ms has elapsed.
      - For FLR: the PCIe spec requires FLR completion within 100ms. After FLR, DMA is guaranteed stopped by hardware.
   3. If step 2 does not complete within 100ms:
      - Increment driver.dma_timeout_count (exposed via /sys/devices/.../dma_timeouts).
      - The FMA subsystem (Section 19.1) receives a FaultEvent::DmaTimeout event.
      - Issue PCIe Function Level Reset (FLR) via the device's FLR capability register (Device Control register bit 15). FLR is a hard device reset that stops all outstanding DMA by definition.
      - Wait up to 500ms for FLR completion (poll config space; the device returns 0xFFFF during reset; the PCIe Base Spec requires FLR to complete within 100ms, so 500ms provides a conservative margin).
      - If FLR is unsupported by the device, or if FLR also times out:
        - Do not free memory. Mark the IOMMU group as quarantined: the existing IOMMU mappings are left in place (fault-on-access) but no new mappings are granted. Memory backing those mappings is pinned and excluded from the allocator until the quarantine is lifted.
        - Return Err(DmaQuiescenceTimeout) to the crash recovery path.
        - The quarantined IOMMU group is reset on the next system suspend/resume cycle (which performs a full bus reset), at which point the pinned memory is released.
        - Log: "DMA quiescence failed on [bus:dev.fn] after FLR — IOMMU group quarantined; memory pinned until suspend/resume reset"
   4. IOTLB invalidate: Only after confirmed device quiescence (step 1 DMA stop + step 2 poll, or step 3 FLR), invalidate the IOMMU TLB entries for the unmapped region. On Intel VT-d, this uses the Invalidation Wait Descriptor with IWD=1 to wait for invalidation completion. On AMD, the COMPLETION_WAIT command provides equivalent functionality. Only after IOTLB invalidation completes is it safe to free physical pages.

   Design note: Linux's default driver teardown does not always issue FLR, relying on IOMMU timeouts and trusting drivers to drain their own DMA. UmkaOS enforces the explicit stop sequence — it is the kernel's responsibility to ensure hardware is quiesced, not the driver's.

   Driver private memory is freed only after confirmed device quiescence and completed IOTLB invalidation. If quiescence cannot be confirmed (FLR also fails), the memory is quarantined rather than freed — no use-after-free path is permitted.

   Why this matters: without confirmed DMA quiescence, a device still mid-DMA could write to pages that have been freed and reallocated to another driver or to userspace — a use-after-free via hardware. Proceeding past a timeout without hardware confirmation of DMA stop is not an acceptable fallback for a production kernel; quarantine is the safe alternative when quiescence cannot be established.
6. Device reset: FLR for PCIe, port reset for USB, etc.
7. Driver reload: Fresh binary loaded, new vtable exchange. The DeviceDescriptor retains the same DeviceHandle — the device's identity in the registry is preserved across crashes.
8. Service re-publication: The reloaded driver publishes its services again. The registry notifies clients via the service_recovered callback.
9. Child re-probe: If this was a bus driver, the registry re-enumerates and re-probes child devices.
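Steps 2-3 of the DMA teardown (poll, then FLR fallback, then quarantine) reduce to a policy function. In this sketch the clock and the device probes are injected so the timeout logic is testable on a host; all names are illustrative, and the budgets mirror the 100ms/500ms figures above.

```rust
// Host-side sketch of the quiescence policy. `dma_idle` is the device-class
// DMA-active probe; `flr_done` is the post-FLR config-space poll.

#[derive(Debug, PartialEq)]
pub enum QuiesceOutcome { Quiesced, QuiescedViaFlr, Quarantine }

pub fn wait_for_dma_quiescence(
    mut now_ns: impl FnMut() -> u64,
    mut dma_idle: impl FnMut() -> bool,
    mut flr_done: impl FnMut() -> bool,
    flr_supported: bool,
) -> QuiesceOutcome {
    let start = now_ns();
    while now_ns() - start < 100_000_000 {          // step 2: 100 ms poll
        if dma_idle() { return QuiesceOutcome::Quiesced; }
    }
    if flr_supported {                              // step 3: FLR fallback
        let flr_start = now_ns();
        while now_ns() - flr_start < 500_000_000 {  // 500 ms FLR budget
            if flr_done() { return QuiesceOutcome::QuiescedViaFlr; }
        }
    }
    QuiesceOutcome::Quarantine // never free memory on this path
}
```

The crucial property is that `Quarantine` is the only exit that does not confirm hardware quiescence, and the caller must pin (not free) the driver's memory on that path.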
10.5.10.2 Failure Counter Integration
/// Sliding-window failure tracker. Records timestamps of recent failures
/// in a circular buffer. Used by the auto-demotion policy to count failures
/// within a configurable time window.
pub struct FailureWindow {
/// Circular buffer of failure timestamps (monotonic nanoseconds).
timestamps: [u64; 16],
/// Index of the next write position (wraps at 16).
head: u32,
/// Total number of failures recorded (may exceed 16; only the last 16
/// timestamps are retained).
total_count: u32,
}
impl FailureWindow {
/// Count failures within the last `window_ns` nanoseconds.
pub fn count_within(&self, window_ns: u64) -> u32 { /* ... */ }
/// Record a failure at the current time.
pub fn record(&mut self, now_ns: u64) { /* ... */ }
}
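One possible implementation of the two stubbed methods, restated self-contained (the struct is repeated so the sketch compiles on its own). Unlike the stub's signature, `count_within` here takes the current time explicitly rather than reading a clock, which keeps it testable.

```rust
// Sliding-window failure tracker: 16-entry circular buffer of timestamps,
// as described in the struct comments above.

pub struct FailureWindow {
    timestamps: [u64; 16],
    head: u32,
    total_count: u32,
}

impl FailureWindow {
    pub fn new() -> Self {
        Self { timestamps: [0; 16], head: 0, total_count: 0 }
    }

    /// Record a failure at `now_ns` (monotonic nanoseconds).
    pub fn record(&mut self, now_ns: u64) {
        self.timestamps[self.head as usize] = now_ns;
        self.head = (self.head + 1) % 16;
        self.total_count += 1;
    }

    /// Count retained failures (at most the last 16) within `window_ns`.
    pub fn count_within(&self, now_ns: u64, window_ns: u64) -> u32 {
        let retained = self.total_count.min(16) as usize;
        (0..retained)
            .filter(|&i| now_ns.saturating_sub(self.timestamps[i]) <= window_ns)
            .count() as u32
    }
}
```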
The registry's per-node failure_window (a FailureWindow sliding-window counter)
feeds into the existing auto-demotion policy. The counter records timestamps in a
16-entry circular buffer; the policy query asks "how many entries fall within the
last N seconds?" (default window: 1 hour):
failure_window.count_within(1 hour):
0-2: Reload at same tier
3-4: Demote to next lower tier (if minimum_tier allows)
5+: Transition to Quarantined state (driver permanently disabled, device
unbound); requires manual administrator re-enable via sysfs. Log critical alert.
This is the same policy described in Section 10.8, now with the registry as the tracking mechanism.
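The threshold bands reduce to a pure decision function; `RecoveryAction` variant names here are illustrative, not the registry's actual type.

```rust
// Auto-demotion policy: failure count in the sliding window -> action.

#[derive(Debug, PartialEq)]
pub enum RecoveryAction { ReloadSameTier, DemoteOneTier, Quarantine }

pub fn recovery_action(failures_in_window: u32) -> RecoveryAction {
    match failures_in_window {
        0..=2 => RecoveryAction::ReloadSameTier, // transient fault: retry
        3..=4 => RecoveryAction::DemoteOneTier,  // if minimum_tier allows
        _ => RecoveryAction::Quarantine,         // manual re-enable required
    }
}
```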
How auto-demotion works without recompilation — A driver that can run in both Tier 1
(isolation domain, Ring 0) and Tier 2 (process, Ring 3) does not need two separate
binaries. The KABI vtable abstraction (Section 11.1) provides identical function signatures
regardless of tier. The difference is in the hosting environment: Tier 1 drivers are
loaded as shared objects into a kernel isolation domain; Tier 2 drivers are loaded as
processes. The same .umka binary is valid in both contexts because KABI syscalls (ring
buffer operations, capability invocations) are designed to work from either Ring 0 or
Ring 3 — the Tier 1 path uses direct function calls via the vtable, while the Tier 2 path
uses syscall wrappers that implement the same vtable interface. Auto-demotion simply means
"restart this driver binary in a Tier 2 process instead of a Tier 1 isolation domain."
The driver code is unaware of the change; only the hosting environment differs.
10.5.11 Boot Sequence Integration
The registry integrates into the boot sequence (Section 2.1.3):
4. UmkaOS Core initialization:
a. Parse boot parameters and ACPI tables
b. Initialize physical memory allocator
c. Initialize virtual memory
d. Initialize per-CPU data structures
e. Initialize Tier 0 drivers: APIC, timer, early console
f. Initialize capability system
g. Initialize device registry <-- NEW
h. Register Tier 0 devices in registry <-- NEW
i. Initialize scheduler
j. Mount initramfs
5. ACPI/DT enumeration: populate registry <-- NEW
6. PCI enumeration: create device nodes <-- NEW
7. Registry runs match engine, loads storage driver <-- REPLACES ad-hoc loading
8. Mount real root filesystem
9. Continue device enumeration (USB, etc.) <-- NEW
10. Execute /sbin/init
10.5.11.1 Tier 0 Devices
Tier 0 drivers (APIC, timer, serial) are statically linked and initialized before the registry exists. After registry init, they are registered retroactively:
registry.register_tier0_device("apic", ...);
registry.register_tier0_device("timer", ...);
registry.register_tier0_device("serial0", ...);
These nodes are created directly in Active state with no match/load cycle.
10.5.11.2 Console Handoff
The display and input stack transitions through multiple phases during boot. The handoff protocol ensures zero message loss and graceful degradation.
Phase 1 — Tier 0 (early boot):
- Serial console (COM1/PL011/16550) is active from the first instruction.
- VGA text mode (80×25) initialized by BIOS/UEFI firmware on x86-64.
- All kernel output goes to the ring buffer (klog), serial, and VGA text mode
simultaneously. The ring buffer captures every message from the first printk.
Phase 2 — Tier 1 loaded (DRM/KMS driver):
- The DRM/KMS display driver initializes, performs modeset, and allocates a framebuffer.
- A framebuffer console renderer (fbcon) is initialized with the target resolution.
Handoff protocol:
1. DRM driver completes modeset, signals "console ready" via KABI callback:
driver_event(CONSOLE_READY, framebuffer_info)
2. Kernel console subsystem:
a. Locks the console output path (brief pause, <1ms)
b. Replays the full ring buffer contents onto the framebuffer console
— no boot messages are lost, the user sees the complete boot log
c. Registers fbcon as the primary console output
d. Unlocks the console output path
3. Serial console remains active — never disabled. All output goes to BOTH
serial and framebuffer. This ensures remote management always works.
4. VGA text mode driver is deregistered as the *primary* console backend.
The VGA text mode memory region (0xB8000) is NOT released to the physical
memory allocator — it is reserved as a panic-only fallback (see below).
The region is small (4000 bytes) and the cost of keeping it reserved is
negligible compared to the benefit of having a guaranteed crash output path.
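Step 2b's replay can be sketched as a bounded ring that keeps the most recent messages and replays them oldest-first onto a late-arriving backend. The string-based ring is a simplification; the real klog stores binary records.

```rust
// Bounded message ring: old messages are overwritten, and replay() emits
// the retained messages in arrival order — the behavior fbcon relies on
// to show the complete (recent) boot log after handoff.

pub struct KlogRing {
    slots: Vec<Option<String>>,
    next: usize, // total messages ever written
}

impl KlogRing {
    pub fn new(capacity: usize) -> Self {
        Self { slots: vec![None; capacity], next: 0 }
    }

    pub fn push(&mut self, msg: &str) {
        let cap = self.slots.len();
        self.slots[self.next % cap] = Some(msg.to_string());
        self.next += 1;
    }

    /// Replay retained messages, oldest first, onto a new console backend.
    pub fn replay(&self, mut emit: impl FnMut(&str)) {
        let cap = self.slots.len();
        let start = self.next.saturating_sub(cap);
        for i in start..self.next {
            if let Some(m) = &self.slots[i % cap] {
                emit(m);
            }
        }
    }
}
```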
Keyboard handoff:
- Early boot: the PS/2 scan code handler (Tier 0) captures keystrokes into a buffer. This allows emergency interaction (e.g., boot parameter editing) before USB is up.
- Tier 1 loaded: the USB HID driver initializes and registers as an input device. The input subsystem drains the PS/2 keystroke buffer — no keystrokes are lost.
- The PS/2 handler remains active for keyboards physically connected via PS/2.
Virtual terminals:
- VT switching (Ctrl+Alt+F1–F6) is implemented in umka-core's input multiplexer,
NOT in the display driver. The display driver is a passive renderer.
- On VT switch, the input multiplexer sends a SWITCH_VT(n) command to the
display driver via KABI. The driver switches which virtual framebuffer is scanned
out.
- This design means a crashing display driver doesn't break VT switching logic —
on driver recovery, the multiplexer re-sends the current VT state.
Crash fallback:
- If the DRM driver faults, the core reverts to VGA text mode (x86-64) or serial-only (AArch64/RISC-V/PPC) for panic output. Tier 0 console backends are always available.
- The panic handler bypasses the normal console locking path and writes directly to the Tier 0 backends (serial + VGA text if available).
10.5.11.3 PCI Enumeration
PCI enumeration is part of UmkaOS Core (Tier 0 functionality in early boot). It walks PCI configuration space and creates device nodes:
For each PCI bus (starting from bus 0):
For each device 0-31, function 0-7:
If device present:
1. Create DeviceNode with PCI bus identity
2. Populate properties: vendor-id, device-id, class-code, BARs, IRQs
3. If this is a bridge: create a bus node, recurse into secondary bus
4. Set numa_node from ACPI SRAT proximity domain
5. Registry runs match engine for this node
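The config-space walk relies on standard ECAM address arithmetic: each (bus, device, function) gets a 4 KiB window at a fixed offset from the MCFG-reported base. This layout is defined by the PCIe specification.

```rust
// ECAM: bus in bits 27:20, device in 19:15, function in 14:12, and the
// register offset in the low 12 bits, all relative to the segment's base.

/// Physical address of a config register for (bus, device, function).
pub fn ecam_address(ecam_base: u64, bus: u8, device: u8, function: u8, reg: u16) -> u64 {
    assert!(device < 32 && function < 8 && reg < 4096);
    ecam_base
        + ((bus as u64) << 20)
        + ((device as u64) << 15)
        + ((function as u64) << 12)
        + reg as u64
}
```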
10.5.11.4 NUMA Awareness
ACPI SRAT (System Resource Affinity Table) provides NUMA topology. The registry uses
this to set numa_node on each device node based on the device's proximity domain (PCI
devices inherit from their root port's NUMA node).
This information is available for:
- Driver memory allocation: prefer the device's NUMA node.
- DMA buffer allocation: prefer the device's NUMA node.
- IRQ affinity: suggest CPU affinity matching the device's NUMA node.
- Tier 1 domain assignment: prefer grouping NUMA-local devices when isolation domains are shared.
10.5.11.5 ACPI Enumerator
The ACPI enumerator is Tier 0 kernel-internal code that walks the ACPI namespace and creates platform device nodes in the registry. It handles the tables that define hardware topology:
| ACPI Table | Registry Impact |
|---|---|
| MCFG (PCI Express Memory Mapped Config) | Defines PCI segment groups and ECAM base addresses. The PCI enumerator uses these to access PCI config space. |
| SRAT (System Resource Affinity) | Maps PCI bus ranges and memory ranges to NUMA proximity domains. Sets numa_node on device nodes. |
| DMAR / IVRS (DMA Remapping) | Defines IOMMU hardware. Creates IOMMU group assignments (Section 10.5.3.8). Intel DMAR for VT-d, AMD IVRS for AMD-Vi. |
| DSDT / SSDT (Differentiated System Description) | Defines platform devices (embedded controllers, power buttons, battery, thermal zones). Each ACPI device object becomes a platform device node. |
| HPET / MADT | Timer and interrupt controller topology. Creates Tier 0 device nodes for APIC, I/O APIC, HPET. |
AML evaluation: The ACPI enumerator includes an AML (ACPI Machine Language)
interpreter for evaluating _STA (device status), _CRS (current resources), and
_HID (hardware ID) methods. This is a significant subsystem but is
required for correct hardware enumeration on any x86 system. The AML interpreter runs
in Tier 0 with full kernel privileges because it accesses hardware registers directly.
Device Tree enumerator (AArch64/RISC-V/PPC): Parses the flattened device tree (FDT)
passed by the bootloader. Each DT node with a compatible property becomes a platform
device node. The reg property populates DeviceResources.bars (as MMIO regions), and
the interrupts property populates DeviceResources.irqs. DT phandle references
become provider-client service links.
10.5.11.6 Firmware Quirk Framework
ACPI tables and Device Trees are authored by firmware engineers and are notoriously
buggy. Linux has accumulated thousands of firmware workarounds scattered across
subsystem-specific code (drivers/acpi/, arch/x86/kernel/, DMI match tables,
ACPI override tables). UmkaOS centralizes firmware workarounds into a structured quirk
framework, similar to the CPU errata framework (Section 2.1.4).
The problem is real — common firmware bugs observed in the wild:
- ACPI _CRS (Current Resources) reports incorrect MMIO ranges for PCI bridges,
causing resource conflicts
- SRAT (NUMA affinity) tables claim all memory belongs to NUMA node 0 on multi-socket
systems (broken BIOS update)
- DMAR (IOMMU) tables omit devices or report wrong scope, causing IOMMU group
misassignment
- Device Tree interrupt-map entries with wrong parent phandle references (ARM SoC
vendor bugs)
- DSDT/SSDT AML code with infinite loops, incorrect register addresses, or methods that
return wrong types
- MADT reports non-existent APIC IDs (causes boot failure if kernel trusts them)
- ECAM (PCI config space) base address wrong in MCFG table
UmkaOS's firmware quirk table:
/// Firmware quirk entry — matches a system to its required workarounds.
struct FirmwareQuirk {
/// System identification (DMI vendor + product + BIOS version).
match_id: DmiMatch,
/// ACPI table match (optional — match specific table revision).
table_match: Option<AcpiTableMatch>,
/// Human-readable quirk identifier.
quirk_id: &'static str,
/// Workaround: override, ignore, or patch firmware data.
action: QuirkAction,
}
enum QuirkAction {
/// Override a specific ACPI table with a corrected version (ACPI override).
OverrideTable { table_signature: [u8; 4], replacement: &'static [u8] },
/// Ignore a specific device entry in DMAR/IVRS (broken IOMMU scope).
IgnoreIommuDevice { segment: u16, bus: u8, device: u8, function: u8 },
/// Override NUMA affinity for a memory range (broken SRAT).
OverrideNumaAffinity { phys_start: u64, phys_end: u64, node: u32 },
/// Ignore an APIC ID in MADT (non-existent CPU).
IgnoreApicId { apic_id: u32 },
/// Patch a specific AML method (replace bytecode).
PatchAml { path: &'static str, replacement: &'static [u8] },
/// Skip enumeration for a device matching this HID (broken _CRS).
SkipDevice { hid: &'static str },
/// Custom workaround function.
Custom(fn() -> Result<()>),
}
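Quirk selection reduces to matching firmware-reported identity strings against the table. This sketch shrinks `DmiMatch` to vendor+product; the real matcher also checks BIOS version and optional ACPI table revisions.

```rust
// Minimal quirk-table matcher: applicable quirks are returned in table
// order, which is also the order their actions would be applied.

pub struct DmiMatch {
    pub vendor: &'static str,
    pub product: &'static str,
}

pub struct FirmwareQuirk {
    pub match_id: DmiMatch,
    pub quirk_id: &'static str,
}

/// Return the quirk_ids that apply to this system, in table order.
pub fn applicable_quirks(
    table: &[FirmwareQuirk],
    vendor: &str,
    product: &str,
) -> Vec<&'static str> {
    table
        .iter()
        .filter(|q| q.match_id.vendor == vendor && q.match_id.product == product)
        .map(|q| q.quirk_id)
        .collect()
}
```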
Quirk database population — the initial quirk database is seeded from:
1. Linux's existing DMI quirk tables (drivers/acpi/, arch/x86/pci/) — these
document decades of firmware workarounds with specific DMI match strings
2. Community-reported firmware bugs (same mechanism as Linux's bugzilla)
3. Vendor-provided errata sheets (when available)
ACPI table override — Linux supports loading replacement ACPI tables from initramfs
(CONFIG_ACPI_TABLE_UPGRADE). UmkaOS supports the same mechanism: if a corrected DSDT
is placed in the initramfs at /lib/firmware/acpi/, it replaces the firmware-provided
table at boot. This allows users to fix firmware bugs without waiting for a BIOS update.
Boot-time quirk logging — all applied quirks are logged at boot:
umka: Firmware quirk applied: DELL-POWEREDGE-R740-BIOS-2.12 — DMAR ignore device 0000:00:14.0 (broken IOMMU scope)
umka: Firmware quirk applied: LENOVO-T14S-BIOS-1.38 — SRAT override node 0→1 for range 0x100000000-0x200000000
Why UmkaOS is more sensitive to firmware bugs than Linux — UmkaOS's topology-aware
device registry derives NUMA affinity, IOMMU groups, power management ordering, and
driver isolation domains from firmware-reported topology. A firmware bug that reports
wrong NUMA affinity causes UmkaOS to place a driver on the wrong NUMA node (performance
degradation). In Linux, the same bug might cause a suboptimal numactl suggestion but
doesn't affect driver placement (Linux doesn't have topology-aware driver isolation).
This means UmkaOS must invest more heavily in firmware workarounds than Linux for the
same set of hardware. The structured quirk framework makes this manageable — adding a
new workaround is a single table entry, not scattered if (dmi_match(...)) checks
across the codebase.
Defensive parsing — beyond per-system quirks, all firmware table parsers are defensively coded:
- ACPI table lengths are validated against the RSDP/XSDT-reported size.
- The AML interpreter has an instruction count limit (prevents infinite loops in AML code).
- The Device Tree parser validates all phandle references before dereferencing.
- PCI config space reads are bounds-checked against MCFG-reported ECAM regions.
- Any parse failure is logged as an FMA event (Section 19.1) and the offending entry is skipped rather than causing a boot failure.
10.5.11.7 Resource Assignment
During PCI enumeration, the registry assigns hardware resources to each device:
For each PCI device:
1. Read BAR registers to determine resource requirements (size, type).
2. Assign physical address ranges from the PCI memory/IO space allocator.
- MMIO BARs: allocate from PCI MMIO window (defined by ACPI `_CRS`
method on the PCI host bridge device; MCFG defines only the ECAM base
address for PCIe configuration space access).
- I/O BARs: allocate from PCI I/O window (legacy x86, rare).
3. Write assigned addresses back to BAR registers.
4. Populate DeviceResources.bars with the assigned mappings.
5. Allocate MSI/MSI-X vectors:
- If device supports MSI-X: allocate up to min(device_max, driver_requested) vectors.
- If MSI only: allocate power-of-2 vectors up to device limit.
- Fallback: assign legacy INTx pin.
6. Populate DeviceResources.irqs.
Resource conflicts (overlapping BAR assignments, IRQ vector exhaustion) are detected
during enumeration and logged as FMA events (Section 19.1). Conflicting devices remain
in Discovered state with no driver bound.
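The vector-count rules in step 5 can be sketched as two small policy functions; the power-of-two rounding for plain MSI follows the "power-of-2 vectors up to device limit" rule above.

```rust
// MSI-X grants exactly min(device_max, requested); plain MSI can only grant
// power-of-2 counts, so the request is rounded down to the largest power of
// two within the device limit.

pub fn msix_vectors(device_max: u32, requested: u32) -> u32 {
    device_max.min(requested)
}

pub fn msi_vectors(device_max: u32, requested: u32) -> u32 {
    let limit = device_max.min(requested).max(1);
    // Largest power of two <= limit.
    1 << (31 - limit.leading_zeros())
}
```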
10.5.12 Sysfs Compatibility
The registry is the single source of truth for the /sys filesystem required by Linux
compatibility (Section 18.1.3).
10.5.12.1 Mapping
| Sysfs Path | Registry Source |
|---|---|
| /sys/devices/ | Device tree traversal (parent-child edges) |
| /sys/bus/pci/devices/ | All nodes with bus_type == Pci |
| /sys/bus/usb/devices/ | All nodes with bus_type == Usb |
| /sys/class/block/ | Nodes publishing "block" service |
| /sys/class/net/ | Nodes publishing "net" service |
| /sys/devices/.../driver | driver_binding.driver_name |
| /sys/devices/.../power/ | Power state and runtime PM policy |
| /sys/devices/.../uevent | Generated from node properties |
10.5.12.2 Attribute Files
Each standard property maps to the expected sysfs attribute format:
- vendor → property "vendor-id" formatted as 0x%04x
- device → property "device-id" formatted as 0x%04x
- class → property "class-code" formatted as 0x%06x
Custom driver-set properties appear under a properties/ subdirectory.
10.5.12.3 Device Class via Service Names
Linux's /sys/class/ directories are derived from service publication:
- A driver that publishes a "net" service → device appears under /sys/class/net/
- A driver that publishes a "block" service → device appears under /sys/class/block/
- A driver that publishes an "input" service → device appears under /sys/class/input/
This is more principled than Linux's explicit class_create() calls because the
classification falls naturally out of what the driver actually does.
10.5.13 Concurrency and Performance
10.5.13.1 Locking Strategy
- Read path (hot): Property queries, service lookups, sysfs reads. Reader-writer lock allows concurrent reads.
- Write path (cold): Node creation, state transitions, driver binding, hotplug. Takes exclusive write lock.
- Per-node state: Atomic field for lock-free state checks ("is this device active?" does not need the tree lock).
- PM ordering cache: Computed once per PM transition. Invalidated when tree topology changes (hotplug).
10.5.13.2 Scalability
- Device enumeration: O(n*m) where n = match rules, m = unmatched devices. With <1000 drivers and <200 devices on a typical system, this completes in microseconds. Runs once at boot + on hotplug.
- Service lookup: Hash-indexed by service name. O(1) amortized.
- Property query: Binary search on sorted PropertyTable. O(log n), n < 30.
- PM ordering: Topological sort is O(V+E) where V = nodes, E = edges. Computed once, cached.
10.5.13.3 Memory Budget
| Component | Per Node | Notes |
|---|---|---|
| DeviceNode struct | ~512 bytes | Fixed-size fields |
| PropertyTable (avg 15 props) | ~1 KB | Key strings + values |
| Children/providers/clients | ~128 bytes | Vec overhead |
| Total per node | ~1.7 KB | Sum of the above |
A typical desktop with ~200 devices: ~340 KB. A busy server with ~1000 devices: ~1.7 MB. Well within kernel memory budget.
10.5.14 Resolved Design Decisions
The following design questions have been resolved:
1. USB topology depth: full topology.
The registry represents the full USB hub topology (up to 7 levels). Hub nodes carry a
UsbHub property struct with port count and per-port power control. This is required
for correct power-management ordering (suspend leaf-first, resume root-first) and
surprise-removal cascading (removing a hub invalidates all downstream devices). The node
overhead is trivial — one DeviceNode per hub.
2. GPU sub-device modeling: child nodes.
Each GPU sub-function (display controller, compute engine, video encoder, copy engine)
is a child DeviceNode with its own BusIdentity::PciFunction and capability flags.
The parent GPU node holds shared state (VRAM, power domain). Each child binds its own
extension vtable (AccelComputeVTable, AccelDisplayVTable per Section 21.1.2) while
sharing the parent's AccelBaseVTable. This enables independent driver binding per
sub-function (e.g., a display driver and a compute driver on the same GPU).
3. Firmware enumerators: pluggable Tier 0 backends.
A FirmwareEnumerator trait defines two methods: enumerate(registry: &mut DeviceRegistry)
and match_device(node: &DeviceNode) -> Option<DeviceProperties>. Two implementations:
- AcpiEnumerator — walks the ACPI namespace (_STA, _HID, _CRS), creates platform device nodes.
- DtEnumerator — walks the flattened device tree compatible strings, creates platform device nodes.
Architecture selection is compile-time via arch::current::firmware_enumerator():
x86 → ACPI, ARM/RISC-V → DT, ARM server → both. Both enumerators are kernel-internal
(Tier 0), never exposed through KABI.
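The trait shape described above can be rendered in Rust as follows. The registry types are stubbed here; only the two method signatures follow the text.

```rust
// Stub types standing in for the real registry structures.
pub struct DeviceRegistry;
pub struct DeviceNode;
pub struct DeviceProperties;

/// Pluggable Tier 0 firmware enumeration backend (sketch).
pub trait FirmwareEnumerator {
    /// Walk the firmware tables and create platform device nodes.
    fn enumerate(&self, registry: &mut DeviceRegistry);
    /// Firmware-derived properties for a node this backend recognizes
    /// (by ACPI _HID or device-tree `compatible`), if any.
    fn match_device(&self, node: &DeviceNode) -> Option<DeviceProperties>;
}

/// ACPI backend stub: the real implementation walks _STA, _HID, _CRS.
pub struct AcpiEnumerator;
impl FirmwareEnumerator for AcpiEnumerator {
    fn enumerate(&self, _registry: &mut DeviceRegistry) { /* walk namespace */ }
    fn match_device(&self, _node: &DeviceNode) -> Option<DeviceProperties> {
        None // real backend evaluates the ACPI namespace here
    }
}
```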
4. Multi-function PCI devices: one node per function.
The topology is: PciBridge → PciSlot → PciFunction(0..N). The PciSlot node is a
lightweight grouping node (no driver binding) that carries the slot's physical identity
(segment/bus/device). Each PciFunction child has its own BAR resources, MSI vectors,
and IOMMU group assignment. This matches Linux's sysfs model and makes SR-IOV VF
creation (Decision 8) natural — VFs are additional function children.
Recovery ordering for multi-function devices follows the device tree: if function 0 crashes, sibling functions (1, 2, ...) are notified via the registry's DeviceEvent::SiblingReset event. Each sibling driver independently decides whether to re-probe its function or wait for the parent slot to stabilize. The parent PciSlot node coordinates FLR (Function Level Reset) if the failing function requests it.
5. Service versioning: yes, using InterfaceVersion.
registry_publish_service requires the service vtable to start with the standard
vtable_size: u64, version: u32 header, same as all KABI vtables (Section 11.1.3).
Lookup performs major-version matching; minor-version differences are handled by
vtable_size-based field presence detection. No new mechanism — reuses the existing
KABI version negotiation protocol.
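The matching rule can be sketched as below. The header layout follows the standard vtable header; packing major/minor into the single `version: u32` (major in the high 16 bits) is an assumption for illustration.

```rust
/// Standard KABI vtable header (layout per the text; encoding assumed).
#[repr(C)]
struct VtableHeader {
    vtable_size: u64, // total bytes; used for minor-version field presence
    version: u32,     // ASSUMED encoding: major in high 16 bits, minor in low 16
}

/// Lookup succeeds only on a major-version match.
fn versions_compatible(provider: &VtableHeader, wanted_major: u16) -> bool {
    (provider.version >> 16) as u16 == wanted_major
}

/// Minor-version field presence: an optional method at byte offset `off`
/// exists only if the provider's vtable is large enough to contain it.
fn has_field(provider: &VtableHeader, off: u64, field_size: u64) -> bool {
    provider.vtable_size >= off + field_size
}
```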
6. Multi-provider services: topology-aware lookup + enumeration variant.
registry_lookup_service(name) returns the closest provider by walking: same device →
sibling nodes → parent subtree → global. registry_lookup_all_services(name) returns
an iterator over all providers, ordered by topological distance. The "closest" heuristic
covers the common case (e.g., an I2C client finding its controller); the enumeration
variant handles multi-path cases (RAID member discovery, network bonding).
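The "closest provider" ordering can be sketched by ranking candidates on topological distance (the scope tiers mirror the walk order in the text; the types are illustrative):

```rust
/// Where a candidate provider sits relative to the requesting node.
#[derive(Clone, Copy, PartialEq, Eq)]
enum Scope { SameDevice, Sibling, ParentSubtree, Global }

/// Topological distance: lower is closer, per the walk order above.
fn distance(scope: Scope) -> u8 {
    match scope {
        Scope::SameDevice => 0,
        Scope::Sibling => 1,
        Scope::ParentSubtree => 2,
        Scope::Global => 3,
    }
}

/// registry_lookup_service returns the minimum-distance provider;
/// registry_lookup_all_services would yield all, sorted by distance.
fn closest(mut providers: Vec<(u32, Scope)>) -> Option<u32> {
    providers.sort_by_key(|&(_, s)| distance(s));
    providers.first().map(|&(id, _)| id)
}
```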
7. Persistent device naming: yes, bus-identity + serial derived. The registry generates a stable device path from bus-specific identity:
| Bus | Stable Path Source |
|---|---|
| PCI | segment:bus:device.function (stable if ACPI/DT provides _BBN/_SEG) |
| USB | Hub chain + port number (stable as long as physical topology unchanged) |
| NVMe | PCI path + namespace ID |
| SCSI | WWID / VPD page 83 |
The stable path is stored as a stable_path: ArrayString<128> property on each
DeviceNode. The compat layer creates /dev/disk/by-id/, /dev/disk/by-path/ etc. as
symlinks. The kernel itself never uses these names — they are purely for userspace
convenience.
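For the PCI row of the table, stable-path generation might look like this. The `pci-` prefix and field layout are modeled on the familiar /dev/disk/by-path convention and are illustrative, not the registry's defined format:

```rust
/// Sketch of PCI stable-path generation: segment:bus:device.function.
/// Prefix and formatting are ASSUMED for illustration.
fn pci_stable_path(seg: u16, bus: u8, dev: u8, func: u8) -> String {
    format!("pci-{seg:04x}:{bus:02x}:{dev:02x}.{func}")
}
```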
8. IOMMU group granularity for SR-IOV: PF driver creates VF nodes via KABI.
The PF driver calls registry_create_vf_nodes(pf_handle: DeviceHandle, count: u32)
which:
- Validates the PF has ACS on its upstream port (required for per-VF IOMMU groups).
- Creates count child DeviceNodes with BusIdentity::Pci entries for each VF BDF.
- Assigns each VF its own IOMMU group (if ACS permits) or groups them with the PF.
- Triggers driver matching on each new VF node (both the PF's own driver and the VFIO passthrough driver are valid matches).
Destruction: registry_destroy_vf_nodes(pf_handle) tears down all VFs, unmapping their
IOMMU entries and revoking any VFIO leases. Fails with IO_RESULT_BUSY if any VF is
actively in use by a guest VM.
9. AML interpreter scope: minimal production subset, growth-on-demand.
The initial interpreter supports the following ACPI methods (the minimum for real x86
server/desktop boot): _STA, _CRS, _HID, _UID, _BBN (base bus number),
_SEG (PCI segment), _PRT (PCI routing table), _OSI (OS identification — most
DSDTs gate behavior on this), _DSM (device-specific method — used by PCIe, NVMe,
USB controllers), _PS0/_PS3 (power state transitions), _INI (device
initialization), _REG (operation region handler registration), and _CBA (ECAM base
for PCIe config space on modern systems).
Required AML bytecode opcodes: Store, If/Else, Return, Buffer, Package, Integer/String/
Buffer operations, Method invocation, OperationRegion, Field. Without _OSI and _DSM,
most x86 laptops and many servers fail to enumerate devices correctly. Extend only when
real hardware fails to enumerate — do not speculatively implement unused methods.
10. Resource reservation for hot-plug: configurable per-slot defaults, ACPI-guided.
Default reservation per hot-plug capable slot: 256MB MMIO, 256MB prefetchable MMIO,
and 8 bus numbers (matching Linux's heuristic). Configurable via kernel command-line
parameters (pci_hp_mmio=128M, pci_hp_prefetch=256M, pci_hp_buses=4). The PCI
allocator reads ACPI _HPP (Hot Plug Parameters) and _HPX (Hot Plug Extensions)
methods if present — these override the defaults with firmware-provided values. Reserved
regions are tracked as "allocated but unoccupied" to prevent other devices from claiming
them.
11. KABI long-term evolution: 5 releases default, LTS KABI opt-in. The support window is 5 major releases. A KABI version may be designated LTS at release time (not retroactively), extending its support to 7 releases. LTS designation requires that at least one major driver ecosystem (storage, network, or accelerator) has certified against that KABI version.
Lifecycle:
- At KABI_vN+3 (or +5 for LTS): deprecated methods gain #[deprecated(since = "KABI_vN")]
and emit a kernel log warning when called.
- At KABI_vN+5 (or +7 for LTS): deprecated methods are removed from the vtable.
- Dead method cleanup reduces vtable size, reclaiming the bloat from append-only evolution.
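The lifecycle arithmetic above reduces to a small helper (a sketch; the function name is illustrative): given the release in which a method was deprecated and whether that KABI is LTS, compute the warning and removal releases.

```rust
/// Returns (warning_release, removal_release) for a method deprecated
/// in KABI_vN. Offsets follow the text: +3/+5, or +5/+7 for LTS.
fn kabi_lifecycle(deprecated_in: u32, lts: bool) -> (u32, u32) {
    let (warn, remove) = if lts { (5, 7) } else { (3, 5) };
    (deprecated_in + warn, deprecated_in + remove)
}
```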
12. IOMMU nested translation performance: proactive large page promotion. The IOMMU mapper always selects the largest page size that fits the DMA mapping alignment and size:
| Condition | IOMMU Page Size |
|---|---|
| Mapping ≥ 1GB and 1GB-aligned | 1GB (rare; occurs for GPU BAR mappings) |
| Mapping ≥ 2MB and 2MB-aligned | 2MB |
| All other cases | 4KB |
This is a policy in the IOMMU mapping path, not a reactive monitor. Per-device IOMMU stats (IOTLB miss rate via performance counters, if available) are exposed through the FMA health telemetry path (Section 19.1) for observability, but the promotion decision itself is always proactive.
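The table's selection policy is a pure function of mapping alignment and size, sketched here (helper name illustrative, not the mapper's real entry point):

```rust
const SZ_2M: u64 = 2 << 20; // 2 MiB
const SZ_1G: u64 = 1 << 30; // 1 GiB

/// Proactive large-page policy: pick the largest IOMMU page size
/// compatible with the mapping's base alignment and length.
fn iommu_page_size(iova: u64, len: u64) -> u64 {
    if len >= SZ_1G && iova % SZ_1G == 0 {
        SZ_1G // rare; GPU BAR mappings
    } else if len >= SZ_2M && iova % SZ_2M == 0 {
        SZ_2M
    } else {
        4096
    }
}
```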
10.5.15 Firmware Management
Devices need firmware updates. The kernel provides infrastructure for loading and updating device firmware without requiring device-specific userspace tools.
10.5.15.1 Firmware Loading
Firmware loading flow (boot and runtime):
1. Driver calls kabi_request_firmware(name, device_id).
2. Kernel searches firmware paths in order:
a. /lib/firmware/updates/<name> (admin overrides)
b. /lib/firmware/<name> (distro-provided)
c. Initramfs embedded firmware (for boot-critical devices)
3. If found: kernel maps the firmware blob read-only into the
driver's isolation domain. Driver receives a FirmwareBlob handle
with .data() and .size() accessors.
4. Driver loads firmware to device via its own mechanism
(MMIO, DMA upload, vendor mailbox).
5. Driver releases the handle; kernel unmaps the blob.
Same semantics as Linux request_firmware() / request_firmware_nowait().
The async variant (kabi_request_firmware_async) does not block the
driver's probe path — useful for large firmware blobs (>10MB).
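The step-2 search order can be sketched as an ordered candidate list (paths follow the text; the helper is illustrative, since the real kernel resolves these through its own VFS rather than returning strings):

```rust
/// Ordered firmware search candidates for kabi_request_firmware (sketch).
fn firmware_search_paths(name: &str) -> Vec<String> {
    vec![
        format!("/lib/firmware/updates/{name}"), // admin overrides win
        format!("/lib/firmware/{name}"),         // distro-provided
        format!("initramfs:{name}"),             // boot-critical fallback
    ]
}
```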
10.5.15.2 Firmware Update (Runtime)
Runtime firmware update (fwupd / vendor tools):
1. Userspace writes firmware capsule to /sys/class/firmware/<device>/loading.
2. Kernel validates:
a. Signature (mandatory: Ed25519 or PQC if enabled).
The signing key must match the device's firmware trust anchor
(embedded in device or provided by vendor via UEFI db).
b. Version (must be >= current version, prevents downgrade attacks
unless admin explicitly overrides via firmware.allow_downgrade=1).
3. Kernel notifies driver via KABI callback:
update_firmware(blob, blob_size) -> FirmwareUpdateResult.
4. Driver performs the device-specific update procedure:
- NVMe: Firmware Download + Firmware Commit (NVMe admin commands).
- GPU: vendor-specific update mechanism.
- NIC: flash update via vendor mailbox.
5. Driver returns result: Success, NeedsReset, Failed(error_code).
6. If NeedsReset: kernel marks device for reset. Reset can be
triggered immediately (if no active I/O) or deferred to next
maintenance window (admin-configurable).
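The result handling in steps 5-6 can be sketched as follows. The enum variants mirror the text; the dispatch function and its string outcomes are illustrative stand-ins for the kernel's reset scheduling.

```rust
/// Driver-reported outcome of update_firmware() (per the text).
enum FirmwareUpdateResult { Success, NeedsReset, Failed(u32) }

/// Sketch of the kernel's step-6 decision: reset immediately only when
/// no I/O is active, otherwise defer to the maintenance window.
fn handle_update_result(r: FirmwareUpdateResult, active_io: bool) -> &'static str {
    match r {
        FirmwareUpdateResult::Success => "activated",
        FirmwareUpdateResult::NeedsReset if !active_io => "reset-now",
        FirmwareUpdateResult::NeedsReset => "reset-deferred",
        FirmwareUpdateResult::Failed(_) => "failed",
    }
}
```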
UEFI capsule updates (system firmware):
Kernel writes capsule to EFI System Resource Table (ESRT) via
efi_capsule_update(). Actual update happens on next reboot.
Same mechanism as Linux (CONFIG_EFI_CAPSULE_LOADER).
Exposes /dev/efi_capsule_loader for userspace tools (fwupd).
10.5.15.3 Linux Compatibility
/sys/class/firmware/<device>/loading — firmware loading trigger
/sys/class/firmware/<device>/data — firmware blob upload
/sys/class/firmware/<device>/status — update status
/sys/bus/*/devices/*/firmware_node/ — ACPI firmware node link
/dev/efi_capsule_loader — UEFI capsule interface
fwupd works unmodified — it uses the standard sysfs firmware update interface and UEFI capsule loader, both of which are provided.
10.5.16 Appendix: Comparison with Prior Art
| Aspect | Linux | IOKit | Windows PnP | Fuchsia DF | UmkaOS |
|---|---|---|---|---|---|
| Tree owner | Kernel (kobject) | Kernel (IORegistry) | Kernel (devnode) | Userspace (devmgr) | Kernel (DeviceRegistry) |
| Matching | Per-bus (module_alias) | Property dict match | INF file rules | Bind rules | MatchRule in ELF .kabi_match |
| PM ordering | Heuristic (dpm_list) | IOPMPowerState tree | IRP tree walk | Component PM | Topological sort of device tree |
| Service discovery | Per-subsystem APIs | IOService matching | WDF target objects | Protocol/service | Unified registry_publish/lookup |
| Hot-plug | Per-bus callbacks | IOService terminate | PnP IRP dispatch | devmgr events | Registry-mediated events |
| Crash recovery | Kernel panic | IOService terminate | Bugcheck | Component restart | Registry-orchestrated reload |
| ABI coupling | Tight (kobject in driver) | Tight (C++ inheritance) | Tight (WDM/WDF) | Protocol-only | None (KABI vtable only) |
| Isolation | None | None | None | Process boundary | Domain isolation + process + capability |
10.6 Zero-Copy I/O Path
The entire I/O path from user space to device and back avoids all data copies. This is essential for matching Linux performance.
10.6.1 NVMe Read Example (io_uring SQPOLL + Registered Buffers)
Step 1: User writes SQE to io_uring submission ring
[User space, shared memory, 0 transitions]
Step 2: SQPOLL kernel thread reads SQE from ring
[UmkaOS Core, shared memory read, 0 copies]
Step 3: Domain switch to NVMe driver domain (~23 cycles on x86 MPK)
[Single WRPKRU on x86; MSR POR_EL0+ISB on AArch64 POE; MCR DACR on ARMv7]
Step 4: NVMe driver writes command to hardware submission queue
[Pre-computed DMA address from registered buffer]
Step 5: Domain switch back to UmkaOS Core (~23 cycles on x86 MPK)
[Submit path complete, return to core domain]
Step 6: NVMe device DMAs data directly to user buffer
[IOMMU-validated, zero-copy, device -> user memory]
Step 7: NVMe device writes completion to hardware CQ, raises interrupt
Step 8: Interrupt routes to NVMe driver (domain switch, ~23 cycles on x86 MPK)
Driver reads hardware CQE
Step 9: Domain switch back to UmkaOS Core (~23 cycles on x86 MPK)
Step 10: UmkaOS Core writes CQE to io_uring completion ring
[Shared memory write, 0 copies]
Step 11: User reads CQE from completion ring
[User space, shared memory, 0 transitions]
Summary:
- Total data copies: 0
- Total domain switches: 4 (steps 3+5 on submit path, steps 8+9 on completion path)
- Total domain switch overhead: ~92 cycles on x86 MPK (4 x ~23 cycles; see the Section 10.2 table for other architectures)
- Device latency: ~3-10 us
- Overhead percentage: < 1%
10.6.1.1 NVMe Doorbell Coalescing (Mandatory)
NVMe hardware uses doorbell registers (MMIO writes) to notify the controller that
new commands are available in the submission queue. Each doorbell write is an
uncacheable MMIO store — ~100-200 cycles on x86-64 (PCIe posted write), ~150-300
cycles on ARM (device memory type). In the naive case, every submit_io() call
writes the doorbell immediately, which means one MMIO write per I/O command.
UmkaOS coalesces doorbell writes as a core design decision. When multiple I/O commands are submitted in a batch (common with io_uring SQPOLL, which drains multiple SQEs per poll cycle), the NVMe driver writes all commands to the submission queue first, then issues a single doorbell write for the entire batch. The NVMe specification explicitly supports this: the doorbell value is the new SQ tail index, and the controller processes all entries between the previous tail and the new tail.
/// NVMe submission batch context. Accumulates commands and defers the
/// doorbell write until `flush()` is called. Created by the KABI dispatch
/// trampoline when it detects multiple pending SQEs in the domain ring buffer.
///
/// # Invariants
///
/// - `pending_count` tracks commands written to the hardware SQ since the
/// last doorbell write.
/// - `flush()` must be called before returning from the KABI dispatch
/// to ensure all commands are visible to the controller. The KABI
/// trampoline enforces this via Drop (flush on drop as safety net).
pub struct NvmeSubmitBatch<'sq> {
/// Reference to the submission queue (hardware memory).
sq: &'sq mut NvmeSubmissionQueue,
/// Number of commands written since last doorbell.
pending_count: u32,
/// Maximum batch size before auto-flush (tunable, default: 32).
/// Prevents unbounded batching that could increase per-command latency.
max_batch: u32,
}
impl<'sq> NvmeSubmitBatch<'sq> {
/// Write a command to the SQ without ringing the doorbell.
/// If `pending_count` reaches `max_batch`, auto-flushes.
pub fn submit(&mut self, cmd: &NvmeCommand) {
self.sq.write_entry(cmd);
self.pending_count += 1;
if self.pending_count >= self.max_batch {
self.flush();
}
}
/// Ring the doorbell once for all pending commands.
/// Cost: one MMIO write (~100-200 cycles) regardless of batch size.
pub fn flush(&mut self) {
if self.pending_count > 0 {
// SAFETY: doorbell is an MMIO register in the driver's private
// domain. Writes the new SQ tail index.
unsafe { self.sq.ring_doorbell() };
self.pending_count = 0;
}
}
}
impl Drop for NvmeSubmitBatch<'_> {
fn drop(&mut self) {
// Safety net: ensure all commands are submitted even if the caller
// forgets to call flush(). This is a correctness guarantee, not a
// performance path — callers should flush() explicitly.
self.flush();
}
}
Batch size selection: The default max_batch of 32 balances throughput and
latency. With io_uring SQPOLL draining at ~32-64 SQEs per poll cycle, this
typically results in 1-2 doorbell writes per poll cycle instead of 32-64. The
value is tunable per-device to accommodate different workload patterns.
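A toy stand-in for NvmeSubmitBatch makes the flush-on-threshold and final-flush behavior concrete by counting doorbell MMIO writes (purely a model; the real type wraps hardware queue memory):

```rust
/// Counts doorbell writes for a sequence of submissions (model only).
struct BatchModel {
    pending: u32,
    max_batch: u32,
    doorbell_writes: u32,
}

impl BatchModel {
    fn new(max_batch: u32) -> Self {
        Self { pending: 0, max_batch, doorbell_writes: 0 }
    }
    /// Mirror of NvmeSubmitBatch::submit: auto-flush at max_batch.
    fn submit(&mut self) {
        self.pending += 1;
        if self.pending >= self.max_batch {
            self.flush();
        }
    }
    /// Mirror of NvmeSubmitBatch::flush: one doorbell write per flush.
    fn flush(&mut self) {
        if self.pending > 0 {
            self.doorbell_writes += 1;
            self.pending = 0;
        }
    }
}
```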
Cost savings:
| Scenario | Without coalescing | With coalescing | Savings |
|---|---|---|---|
| io_uring SQPOLL, 32 SQEs/batch | 32 × ~150 cycles = ~4800 cycles | 1 × ~150 cycles = ~150 cycles | ~4650 cycles (~97%) |
| io_uring SQPOLL, 1 SQE (fsync) | 1 × ~150 cycles | 1 × ~150 cycles | 0 (no batching opportunity) |
| Direct submit (non-SQPOLL) | 1 × ~150 cycles | 1 × ~150 cycles | 0 (single command) |
Per-I/O amortized doorbell cost with batch-32: ~150/32 = ~5 cycles/command, down from ~150 cycles/command. On a 10μs NVMe read (~25,000 cycles), this reduces doorbell overhead from ~0.6% to ~0.02%.
Applicability beyond NVMe: The same coalescing pattern applies to any device
with doorbell-style notification: virtio (virtqueue kick), network TX (NIC
doorbell/tail pointer write), and accelerator command queues. The KABI dispatch
trampoline detects batch opportunities for any device type and uses the same
flush-on-last-command pattern.
10.6.2 TCP Receive Path
Step 1: NIC DMAs packet to pre-posted receive buffer
[IOMMU-validated, zero-copy]
Step 2: NIC raises interrupt -> domain switch to NIC driver (~23 cycles on x86 MPK)
Step 3: NIC driver processes descriptor, identifies packet
Domain switch back to UmkaOS Core (~23 cycles on x86 MPK)
Step 4: UmkaOS Core dispatches to umka-net -> domain switch to umka-net (~23 cycles on x86 MPK)
Step 5: umka-net processes TCP headers, copies payload to socket buffer
(This is the one "copy" -- same as Linux. Technically a move
of ownership, not a memcpy, when using page-flipping.)
Step 6: Domain switch back to UmkaOS Core (~23 cycles on x86 MPK)
UmkaOS Core signals epoll/io_uring waiters
Step 7: User reads from socket via read()/recvmsg()/io_uring
Data delivered from socket buffer (zero-copy with MSG_ZEROCOPY)
Total domain switches: 4 (2 domain entries x 2 switches each: enter NIC driver + exit, enter umka-net + exit)
Total domain switch overhead: ~92 cycles on x86 MPK (~20ns) on a ~5 us path = ~0.4% (see Section 10.2 for other architectures)
10.7 IPC Architecture and Message Passing
Section 10.6 describes the data plane -- how bytes flow from user space through Tier 1 drivers to devices and back with zero copies. This section describes the control plane that Section 10.6's data plane relies on: the IPC primitives that carry commands, completions, capability transfers, and event notifications between isolation domains.
10.7.1 IPC Primitives
UmkaOS's IPC model has three distinct layers, each serving a different boundary:
1. Intra-kernel IPC (between isolation domains within Ring 0): domain ring buffers. Shared memory regions with per-domain access controlled by the isolation domain register (WRPKRU on x86, POR_EL0 on AArch64, DACR on ARMv7, etc.). Zero-copy, zero-syscall. This is the transport for all umka-core to Tier 1 driver communication — the command/completion flow shown in Section 10.6's NVMe and TCP examples. The "domain switch" at each step in those diagrams crosses a domain ring buffer boundary.
2. Kernel-user IPC (between kernel and user space): io_uring submission/completion rings. Standard Linux ABI (Section 18.1.5). Applications submit SQEs to the io_uring submission ring and receive CQEs from the completion ring. This is the only I/O interface that user space sees. UmkaOS's io_uring implementation is fully compatible with Linux 6.x semantics -- unmodified applications work without changes.
3. Inter-process IPC (between user processes): POSIX IPC.
Pipes, Unix domain sockets, POSIX message queues, and POSIX shared memory -- implemented
via the syscall interface (Section 18.1). These are not performance-critical kernel
paths; they exist for application compatibility. System V IPC (shmget, msgget, semget)
is supported but deprecated in favor of POSIX equivalents.
4. Hardware peer IPC (between the host kernel and a device running UmkaOS firmware): domain ring buffers over PCIe P2P.
A device that participates as a first-class cluster member (Section 5.1.2.2) communicates
with the host kernel via the same domain ring buffer protocol used for intra-kernel IPC
(Layer 1), transported over PCIe peer-to-peer MMIO and MSI-X interrupts instead of
in-process memory. From the host kernel's perspective, the device firmware endpoint is
just another ring buffer pair — the same DomainRingBuffer structure, the same
ClusterMessageHeader wire format, the same message-passing discipline. The transport
medium changes (PCIe instead of cache-coherent RAM); the abstraction does not.
This is not a compatibility shim. It is the intended model for first-class hardware
participation: a SmartNIC, DPU, computational storage device, or RISC-V accelerator
running UmkaOS presents an IPC endpoint identical in structure to an in-kernel Tier 1
driver, while owning its own scheduler, memory manager, and capability space.
See Section 5.1.2.2 for the wire protocol, implementation paths (A/B/C), and near-term
hardware targets.
The terms are not interchangeable. When this document says "io_uring", it means the userspace-facing async I/O interface. When it says "domain ring buffer", it means the internal kernel transport between isolation domains. An io_uring SQE from userspace triggers an isolation domain switch to a Tier 1 driver via a domain ring buffer — the two mechanisms are connected but architecturally distinct.
User space Kernel (Ring 0)
+-----------+ +------------------------------------------+
| App | | umka-core Tier 1 driver |
| | io_uring SQE | |
| SQ ring -|-------------------->|-> dispatch -----> domain cmd ring --------->|
| | | (WRPKRU) |
| | io_uring CQE | |
| CQ ring <|--------------------<|<- collect <----- domain cpl ring <---------|
| | | (WRPKRU) |
+-----------+ +------------------------------------------+
Layer 2 Layer 1 (internal)
(Linux ABI) (domain ring buffers)
10.7.2 Domain Ring Buffer Design
Each Tier 1 driver has a pair of ring buffers shared with umka-core: a command ring (umka-core produces, driver consumes) and a completion ring (driver produces, umka-core consumes). Both use the same underlying structure:
Weak-isolation fast path (isolation=performance or no fast isolation mechanism): When drivers are promoted to Tier 0 (no CPU-side isolation), domain ring buffers remain the IPC mechanism — the data structure and lock-free protocol are unchanged — but the domain register switches are elided. On architectures with hardware domains (MPK, POE, DACR), each ring buffer access requires toggling the domain register to grant access to the shared region (~23-80 cycles per switch, 4 switches per I/O round-trip = ~92-320 cycles). Without hardware domains, the ring buffer memory is mapped with normal kernel permissions and no domain switch is needed: the producer writes directly, the consumer reads directly, and the only synchronization is the existing atomic head/published/tail protocol. This eliminates the dominant per-I/O isolation overhead on RISC-V (~800-2000 cycles saved per I/O) and on any platform running isolation=performance. The ring buffer structure itself is unchanged — only the access-control wrapper is bypassed.
/// A lock-free single-producer single-consumer ring buffer that lives in
/// a shared memory region accessible to exactly two isolation domains.
///
/// The header occupies two cache lines (one producer-owned, one
/// consumer-owned). Ring data follows immediately after the header,
/// aligned to `entry_size`.
#[repr(C, align(64))]
pub struct DomainRingBuffer {
/// Write claim position. Producers CAS this to claim slots (MPSC mode).
/// In SPSC mode, only the single producer increments this.
///
/// `AtomicU64`: u32 would wrap in ~29 seconds at 148 Mpps (100 Gbps with
/// 64-byte packets); u64 wraps after ~4 billion years at the same rate.
/// u64 counters eliminate the need for modular wrap-around logic in the hot path.
pub head: AtomicU64,
/// Published position. In MPSC mode, a producer increments this (in order)
/// AFTER writing data to the claimed slot. The consumer reads `published`
/// (not `head`) to determine how many entries are ready. In SPSC mode,
/// `published` always equals `head` (the single producer updates both).
/// In broadcast mode, this field is NOT the source of truth —
/// `last_enqueued_seq` (u64) is the authoritative write position. The
/// `published` field is derived (`write_seq / 2`) for diagnostic
/// compatibility only. Implementations MUST NOT increment `published`
/// independently in broadcast mode.
pub published: AtomicU64,
/// Number of entries. Must be a power of two.
pub size: u32,
/// Bytes per entry. Fixed at ring creation time.
pub entry_size: u32,
/// Number of entries dropped due to ring-full condition.
/// Monotonically increasing. Exposed via umkafs diagnostics (Section 19.4).
pub dropped_count: AtomicU64,
/// Sequence number of the last successfully enqueued entry.
/// Consumers use this to detect gaps: if the consumer's last-seen
/// sequence is less than `last_enqueued_seq - ring_size`, entries
/// were lost.
/// In broadcast mode, this field serves as `write_seq` for torn-read
/// prevention (incremented by 2 per entry; odd = write-in-progress,
/// even = stable). See "Broadcast channels" below.
pub last_enqueued_seq: AtomicU64,
/// Ring lifecycle state. Written by crash recovery or graceful shutdown;
/// read by producers in spin loops to detect partner death.
/// 0 = Active (normal operation)
/// 1 = Disconnected (producer died or ring being torn down)
/// Producers check this in every spin iteration and bail with
/// `Err(Disconnected)` if set. The crash recovery path (Section 10.8)
/// sets this AFTER publishing poison markers for any in-flight
/// slots (see "Producer death recovery" below).
pub state: AtomicU8,
/// Padding to fill the producer cache line to exactly 64 bytes.
/// Layout: head(8) + published(8) + size(4) + entry_size(4)
/// + dropped_count(8) + last_enqueued_seq(8) + state(1)
/// + _pad(23) = 64.
_pad_producer: [u8; 23],
/// Read position. Only the consumer increments this.
/// On a separate cache line from head/published to avoid false sharing.
///
/// `AtomicU64`: same rationale as `head` — no wrap-around at any realistic rate.
pub tail: AtomicU64,
/// Padding to fill the consumer cache line to exactly 64 bytes.
_pad_consumer: [u8; 56],
// Ring data follows: `size * entry_size` bytes.
}
/// Errors returned by ring buffer produce operations.
pub enum RingError {
/// Ring is full — no free slots available.
Full,
/// Ring partner has died (crash recovery set `state = Disconnected`).
/// Caller must not retry; propagate the error.
Disconnected,
/// System severely overloaded — entry was discarded (poison marker written).
/// The entry was lost but the ring remains operational.
Overloaded,
}
Note on false sharing: size and entry_size are read-only after initialization and
are read by both producer and consumer. They are placed on the producer's cache line for
layout simplicity, but implementations SHOULD duplicate these values on the consumer's
cache line (as consumer_size and consumer_entry_size) to avoid false sharing. The
consumer reads only from its own cache line.
Lock-free SPSC protocol. The producer writes an entry at data[head % size], then
increments head and published together (in SPSC mode they are always equal). The
consumer reads the entry at data[tail % size] when published > tail, then increments
tail. If the first byte of an entry is 0xFF (poison marker), the consumer skips the
entry and increments tail without processing — this occurs only when a producer hit
the Err(Overloaded) path and had to force-publish a discarded slot.
No locks, no CAS, no contention. The head/published fields are on one cache
line (producer-owned); tail is on a separate cache line (consumer-owned). This
eliminates false sharing on hot paths.
Memory ordering. The producer uses Release ordering on the published store. The
consumer uses Acquire ordering on the published load. This pair ensures that the
entry data written by the producer is visible to the consumer before the consumer sees
the updated published counter.
On x86-64 this compiles to plain MOV instructions (TSO provides the required ordering
for free). On AArch64, RISC-V, and PowerPC, the compiler emits the appropriate barriers
(stlr/ldar on ARM, fence-qualified atomics on RISC-V, lwsync/isync on PPC).
| Architecture | Producer (Release store) | Consumer (Acquire load) | Notes |
|---|---|---|---|
| x86-64 | MOV (TSO) | MOV (TSO) | No explicit barriers needed |
| AArch64 | STLR | LDAR | ARM's acquire/release instructions |
| RISC-V 64 | amoswap.w.rl or fence rw,w + sw | lw + fence r,rw | RVWMO requires explicit fencing |
| PPC32 | lwsync + stw | lwz + isync | Weak ordering; lwsync = lightweight sync |
| PPC64LE | lwsync + std | ld + isync | Same model as PPC32; lwsync preferred over sync |
Backpressure. When the ring is full (head - tail == size), the producer cannot write.
For SPSC rings (command and completion channels), umka-core handles this in two stages:
(1) spin for up to 64 iterations checking whether the consumer has advanced tail — this
covers the common case where the driver is actively draining; (2) if the ring is still full
after spinning, yield to the scheduler via sched_yield_current() and retry on the next
scheduling quantum. Both stages check state on each iteration — if the ring is
Disconnected (partner driver died), the producer returns Err(Disconnected) immediately
rather than waiting for a dead consumer to drain. This avoids wasting CPU on a stalled
driver while keeping the fast path lock-free. For MPSC rings (event channels), backpressure behavior depends on the calling
context — see the MPSC producer API contract in Section 10.7.3 for the distinction between
blocking (mpsc_produce_blocking(), thread context only) and non-blocking
(mpsc_try_produce(), safe in any context) variants.
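The two-stage SPSC backpressure policy can be sketched with the ring and scheduler hooks abstracted behind closures (names illustrative; the real producer checks the ring's `state` field and calls sched_yield_current()):

```rust
/// Error surfaced when the ring partner has died.
enum ProduceError { Disconnected }

/// Two-stage backpressure: bounded spin, then yield and retry.
fn produce_with_backpressure(
    mut ring_full: impl FnMut() -> bool,
    mut disconnected: impl FnMut() -> bool,
    mut yield_to_scheduler: impl FnMut(),
) -> Result<(), ProduceError> {
    loop {
        // Stage 1: spin up to 64 iterations, checking `state` each time.
        for _ in 0..64 {
            if disconnected() {
                return Err(ProduceError::Disconnected);
            }
            if !ring_full() {
                return Ok(()); // slot free: write the entry here
            }
            core::hint::spin_loop();
        }
        // Stage 2: consumer is not draining; give up the CPU and retry
        // on the next scheduling quantum.
        yield_to_scheduler();
    }
}
```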
10.7.3 Channel Types and Capability Passing
The ring buffer primitive from Section 10.7.2 is instantiated in four channel configurations:
Command channels (SPSC): umka-core -> driver. One per driver instance. Carries I/O requests (read, write, discard), configuration commands (set queue depth, enable feature), and health queries (heartbeat, statistics request). Umka-core is the sole producer; the driver is the sole consumer.
Completion channels (SPSC): driver -> umka-core. One per driver instance. Carries I/O completions (success, error, partial), interrupt notifications (forwarded from the hardware interrupt handler), and error reports (device errors, internal driver faults). The driver is the sole producer; umka-core is the sole consumer.
Event channels (MPSC): multiple drivers -> umka-core event loop. Used for asynchronous
events that do not belong to a specific I/O flow: device hotplug notifications, link state
changes (NIC up/down), thermal throttle alerts, error notifications requiring global
coordination. Multiple drivers may need to signal the same event loop, so the MPSC variant
uses a compare-and-swap on head to coordinate multiple producers. The two-phase
commit protocol appears in mpsc_try_produce() below.
MPSC scaling limits: For event channels with >10 concurrent producers (unusual but possible in systems with many independent drivers signaling a single event loop), CAS contention on the ring head can degrade performance. In this regime, hierarchical fanout is recommended: drivers signal per-device intermediate rings, and an aggregator thread (or softirq batch) forwards events to the central ring. This reduces contention from O(producers) to O(1) at the cost of one additional indirection. The default single-ring design is optimized for the common case of 2-5 active producers per channel.
Per-CPU deferred publish buffer — When Phase 2 publication would require spinning for too long (>64 iterations, meaning an earlier producer is slow), the producer defers its publication by storing the ring pointer and slot into a small per-CPU buffer, then re-enables interrupts. This ensures interrupt-disabled windows remain bounded to ~1-2μs.
/// Per-CPU buffer for deferred MPSC ring publications.
///
/// When Phase 2 cannot complete within 64 spin iterations (because an earlier
/// producer has not yet written its data), the producer stores its pending
/// publication here and re-enables interrupts immediately. The drain function
/// is called at the start of every subsequent `send()` and at idle entry,
/// so deferred publications are completed within bounded time.
///
/// Capacity 16: supports up to 16 simultaneously stalled producers across
/// different rings. Under normal load, 0-2 entries are pending; 16 is
/// reached only under extreme contention or scheduling stalls.
pub struct DeferredPublishBuf {
/// Ring of (published_counter_ptr, slot_index) pairs awaiting Phase 2.
/// `published_ptr` is a pointer into the ring's AtomicU64 `published` field.
/// `slot` is the index this producer claimed in Phase 1.
pub entries: [Option<DeferredEntry>; 16],
/// Head index (next slot to fill).
pub head: u8,
/// Tail index (next slot to drain).
pub tail: u8,
}
pub struct DeferredEntry {
/// Pointer to the ring's `published` counter (the one this entry must advance).
pub published_ptr: *const AtomicU64,
/// Slot index claimed by Phase 1 CAS.
pub slot: u64,
}
DeferredPublishBuf is stored in the per-CPU data structure alongside the other
CpuLocal fields. deferred_publish_drain() iterates tail..head and, for each
entry, attempts Phase 2 publication: if published == slot, it advances
published to slot + 1 (success -- remove the entry from the buffer); otherwise
it leaves the entry in place for the next drain pass.
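The drain logic can be sketched as follows. This is a simplified model: a Vec stands in for the fixed 16-slot per-CPU ring, and the publish semantics match mpsc_try_produce() (publishing slot S advances the ring's `published` counter from S to S + 1).

```rust
use std::sync::atomic::{AtomicU64, Ordering};

struct Deferred<'a> {
    published: &'a AtomicU64, // the target ring's `published` counter
    slot: u64,                // slot claimed by this producer in Phase 1
}

/// Retry Phase 2 for every pending entry. An entry publishes only when all
/// earlier slots on its ring are already published (published == slot);
/// entries that still cannot publish stay queued for the next drain pass.
/// Returns the number of entries successfully published.
fn deferred_publish_drain(pending: &mut Vec<Deferred>) -> usize {
    let before = pending.len();
    pending.retain(|d| {
        d.published
            .compare_exchange(d.slot, d.slot.wrapping_add(1),
                              Ordering::Release, Ordering::Relaxed)
            .is_err() // keep only the entries whose publish still failed
    });
    before - pending.len()
}
```

Because each entry carries its own ring pointer, one drain pass correctly services deferrals that target different rings.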
Overflow behavior: When DeferredPublishBuf reaches capacity (16 entries), the
producer performs an eager flush: all 16 pending entries are written to the domain
ring buffer before adding the new entry. If the ring buffer is full (consumer is
behind), the flush blocks until sufficient space is available — this provides natural
backpressure. A stalled Tier 1 consumer will stall its producer, preventing unbounded
deferred entry accumulation. The 16-entry buffer is a coalescing optimization, not a
queue; it is never intended to hold more than a few entries in steady state.
impl DomainRingBuffer {
/// MPSC non-blocking produce: multiple producers coordinate via CAS on head.
/// Returns Err(RingError::Full) immediately if the ring is full, or
/// Err(RingError::Disconnected) if the ring partner has died.
/// Safe to call from any context (thread, IRQ, softirq).
/// See "MPSC producer API contract" below for the blocking variant.
///
/// Two-phase commit protocol:
/// Phase 1 (claim): CAS on `head` to reserve a slot. After CAS success,
/// the slot is exclusively ours but NOT yet visible to the consumer.
/// Phase 2 (publish): After writing data, wait until `published` catches
/// up to our slot (ensuring in-order publication), then advance `published`.
///
/// The consumer reads `published` (not `head`) to determine ready entries.
/// This eliminates the data race where a consumer sees an incremented `head`
/// but reads a slot whose data has not yet been written.
pub fn mpsc_try_produce(&self, entry: &[u8]) -> Result<(), RingError> {
// --- BEGIN interrupt-disabled section ---
// Disable interrupts BEFORE the Phase 1 CAS to prevent a deadlock:
// if an interrupt fires between a successful CAS (slot claimed) and
// Phase 2 (published advanced), an interrupt handler calling
// mpsc_try_produce on the same ring would spin forever in Phase 2
// waiting for the interrupted thread's slot to be published. Moving
// local_irq_save() here eliminates that race window entirely.
// The CAS loop is bounded (succeeds or returns RingError::Full), so the
// additional interrupt-disabled time is minimal.
let irq_state = arch::current::interrupts::local_irq_save();
// Phase 1: Claim a slot by advancing head (interrupts already disabled).
let my_slot;
loop {
let current_head = self.head.load(Ordering::Relaxed);
let current_tail = self.tail.load(Ordering::Acquire);
// Ring disconnected?
if self.state.load(Ordering::Acquire) != 0 {
arch::current::interrupts::local_irq_restore(irq_state);
return Err(RingError::Disconnected);
}
// Ring full?
if current_head.wrapping_sub(current_tail) >= self.size as u64 {
arch::current::interrupts::local_irq_restore(irq_state);
return Err(RingError::Full);
}
// Strong CAS required: on AArch64 LL/SC architectures, compare_exchange_weak
// permits spurious failures. In an interrupt-disabled window, spurious failures
// cause unbounded spinning — use compare_exchange (strong) to prevent this.
// Attempt to claim the slot.
if self
.head
.compare_exchange(
current_head,
current_head.wrapping_add(1),
Ordering::AcqRel,
Ordering::Relaxed,
)
.is_ok()
{
my_slot = current_head;
break;
}
core::hint::spin_loop();
}
// Write entry data to the claimed slot.
let offset = (my_slot % self.size as u64) as usize * self.entry_size as usize;
// SAFETY: offset is within bounds (power-of-two size, fixed entry_size).
// The slot is exclusively ours because we won the CAS race.
unsafe {
core::ptr::copy_nonoverlapping(
entry.as_ptr(),
self.data_ptr().add(offset),
self.entry_size as usize,
);
}
// Phase 2: Publish. Wait until all prior slots are published, then
// advance `published` to make our slot visible to the consumer.
// This spin is brief: it only waits for producers that claimed earlier
// slots to finish their writes. Under normal operation, this completes
// in 1-2 iterations.
//
// Drain deferred publications from previous calls. Before attempting
// our own Phase 2, drain ALL entries from the per-CPU deferred publish
// ring buffer. This ensures that deferrals from prior send() calls are
// re-attempted (and completed) before new entries are published, preventing
// silent loss if multiple producers defer in succession.
//
// The drain takes no arguments — each deferred entry stores a pointer to
// the ring's `published` counter alongside the slot index, so the drain
// correctly targets the ring that each slot belongs to (a producer may
// have deferred on ring A and now be calling send() on ring B).
arch::current::cpu::deferred_publish_drain();
// **IRQ-disabled window**: Interrupts are disabled only during Phase 1
// CAS + Phase 2 publication attempt (bounded at 64 iterations). If Phase 2
// exceeds 64 spins, the entry is deferred and interrupts are restored
// immediately. The 256-iteration fallback spin (if the defer buffer is full)
// runs with interrupts **re-enabled**. Worst-case IRQ-disabled duration:
// ~64 CAS operations ≈ 1-2μs.
//
// **Phase 2 uses compare_exchange (strong), not compare_exchange_weak.**
// On AArch64 LL/SC architectures, compare_exchange_weak can fail spuriously
// (no actual contention — just LL/SC interference from an unrelated store).
// In Phase 2, spurious failures increment spin_count, potentially exhausting
// the 64-iteration budget and triggering unnecessary deferred-publish overhead.
// Strong CAS ensures the spin count only advances on genuine contention (another
// producer with an earlier slot has not yet published), keeping the common-case
// IRQ-disabled window at the expected 1-3 iterations.
//
// Bounded publish wait: To prevent unbounded interrupt-disabled spinning,
// Phase 2 uses a bounded spin of 64 iterations. If `self.published` has
// not advanced to `my_slot` within 64 iterations, the producer stores the
// ring's `published` pointer and its slot index as a pair into a per-CPU
// deferred publish ring buffer, then re-enables interrupts. The drain path
// (at the start of the next `send()` call and on the consumer side)
// re-attempts publication on behalf of the stalled producer, using the
// stored ring pointer to target the correct ring. The per-CPU deferred
// buffer is a ring (`[Option<(*const AtomicU64, u64)>; 16]` with
// head/tail indices) rather than a single `Option<u64>`, so multiple
// consecutive deferrals (potentially targeting different rings) can queue
// without silently losing earlier deferred values. The buffer holds 16
// entries (increased from an earlier 4-entry design to ensure bounded-time
// behavior under heavy contention). If the deferred buffer itself is full
// (16 outstanding deferrals — an extreme edge case indicating severe system
// overload), the producer re-enables interrupts before falling back to a
// bounded spin (up to 256 iterations with `core::hint::spin_loop()`). If the
// bounded spin also fails, the producer returns `Err(Overloaded)` to the
// caller, which applies backpressure (increment `dropped_count` for IRQ
// producers, or yield and retry for thread-context producers). This ensures
// the interrupt-disabled window is always bounded. The common-case bound is:
// Phase 1 CAS (~5ns, usually 1 attempt) + data write +
// drain (up to 16 entries * CAS each = ~80ns) + Phase 2 spin
// (up to 64 * ~5ns = ~320ns) = ~410ns in the common case.
let mut spin_count = 0u32;
loop {
if self
.published
.compare_exchange(
my_slot,
my_slot.wrapping_add(1),
Ordering::Release,
Ordering::Relaxed,
)
.is_ok()
{
break;
}
spin_count += 1;
if spin_count >= 64 {
// Exceeded bounded spin — defer completion to the consumer drain
// path and re-enable interrupts to avoid unbounded IRQ latency.
// The deferred buffer holds up to 16 entries; if it is full,
// re-enable IRQs and fall through to bounded spin (system overloaded).
// Fence ensures entry data written at the slot is visible to
// all CPUs before the slot can be published by a deferred drain
// on any CPU. Without this, on weakly-ordered architectures
// (AArch64, RISC-V, PPC), a different CPU draining and publishing
// via CAS(Release) would only order its own stores, not the
// original writer's stores.
core::sync::atomic::fence(Ordering::Release);
if arch::current::cpu::deferred_publish_enqueue(&self.published, my_slot) {
arch::current::interrupts::local_irq_restore(irq_state);
return Ok(());
}
// Deferred buffer full — re-enable IRQs to preserve RT
// guarantees, then bounded spin outside the IRQ-disabled window.
arch::current::interrupts::local_irq_restore(irq_state);
let mut fallback_spin = 0u32;
loop {
if self.state.load(Ordering::Acquire) != 0 {
return Err(RingError::Disconnected);
}
if self.published.compare_exchange(
my_slot, my_slot.wrapping_add(1),
Ordering::Release, Ordering::Relaxed,
).is_ok() {
return Ok(());
}
fallback_spin += 1;
if fallback_spin >= 256 {
// System severely overloaded. We must still advance `published`
// past our slot to prevent permanently wedging the ring.
// Write a poison marker (entry_type = 0xFF) into the slot so
// the consumer knows to skip it, then spin until we can
// advance `published`. This spin waits for earlier producers
// to publish. If an earlier producer has died (Tier 1/2 crash
// between Phase 1 and Phase 2), the crash recovery path will
// have set `state = Disconnected` and published poison markers
// for the dead producer's slots, unblocking this spin. We
// check `state` on every iteration to detect this case.
let offset = (my_slot % self.size as u64) as usize * self.entry_size as usize;
// SAFETY: slot is ours (won the Phase 1 CAS); offset in bounds.
unsafe { *self.data_ptr().add(offset) = 0xFF; } // poison marker
let mut publish_spin = 0u32;
while self.published.compare_exchange(
my_slot, my_slot.wrapping_add(1),
Ordering::Release, Ordering::Relaxed,
).is_err() {
if self.state.load(Ordering::Acquire) != 0 {
return Err(RingError::Disconnected);
}
publish_spin += 1;
if publish_spin >= 4096 {
// Earlier producer is alive but severely delayed.
// Yield the CPU to allow it to make progress.
// This prevents livelock on the same core.
arch::current::cpu::yield_cpu();
publish_spin = 0;
}
core::hint::spin_loop();
}
return Err(RingError::Overloaded);
}
core::hint::spin_loop();
}
}
core::hint::spin_loop();
}
// --- END interrupt-disabled section ---
arch::current::interrupts::local_irq_restore(irq_state);
Ok(())
}
}
To prevent data loss when no future send() occurs, the per-CPU idle entry hook
(cpu_idle_enter(), Section 7.2) drains the deferred publish buffer for all MPSC rings
registered on that CPU. Additionally, when a thread that performed a deferred publish is
migrated to a different CPU, the migration path drains the source CPU's deferred buffer.
These hooks ensure deferred entries are published within a bounded window (at most one
scheduler tick, ~4ms).
MPSC Phase 2 preemption hazard and mitigation. The Phase 2 publish spin in
mpsc_try_produce() can stall if a producer is preempted (by an interrupt or scheduler)
between Phase 1 (CAS on head) and Phase 2 (advancing published). While preempted,
the published counter is stuck at the preempted producer's slot, blocking all
subsequent producers from making their entries visible to the consumer -- even though
their data is already written. This is not a deadlock (the preempted producer will
eventually resume and complete Phase 2), but it can cause unbounded latency spikes on
the consumer side.
Mitigation: UmkaOS addresses this in three ways:
1. Interrupts disabled from before Phase 1 through Phase 2. The MPSC produce
path disables interrupts (not just preemption) BEFORE the Phase 1 CAS, keeping
them disabled through the Phase 2 published counter advancement. This prevents
the following deadlock scenario: on a uniprocessor (or, in general, on any
single CPU), thread T1 claims slot N via CAS, then an IRQ fires and the IRQ
handler claims slot N+1 via CAS. The IRQ handler's Phase 2 spin waits for
published to reach N, but T1 cannot advance published because it is
interrupted -- deadlock. Disabling interrupts before Phase 1 eliminates this
window entirely (there is no gap between CAS success and interrupt disabling).
The interrupt-disabled region covers: the Phase 1 CAS (bounded -- succeeds or
returns Full/Disconnected, typically 1 attempt = ~5ns), the data write, the
deferred drain (up to 16 entries x ~5ns CAS = ~80ns), and the Phase 2 publish
CAS (up to 64 iterations = ~320ns), totaling ~410ns in the common case. On
multiprocessor systems, disabling preemption alone would suffice (another CPU
could run the interrupted producer), but disabling interrupts is correct on
all configurations and the cost is negligible.
2. Consumer-side stuck detection and recovery (defense in depth). The consumer
(umka-core event loop) maintains a watchdog: if head > published for more than
1000 consecutive poll iterations (~10μs), the consumer treats the gap as a
stalled producer. If the gap persists beyond 10ms (configurable), the consumer
initiates forced slot recovery: for each unpublished slot from published to
head, write a poison marker (0xFF) and advance published. This unblocks any
live producers spinning in Phase 2 while discarding the dead producer's
incomplete entries. The consumer logs a diagnostic event with the number of
force-published slots and the ring identity.
   This consumer-side recovery is a safety net for the case where the crash
recovery path (Section 10.8, step 5a) has not yet run -- e.g., the driver
faulted but the FMA detection latency exceeds 10ms, or the fault was a silent
hang rather than a trap. Under normal operation, the crash recovery path
handles slot recovery before the consumer watchdog fires.
3. Interrupt handlers use bounded produce. Interrupt handlers that produce to
MPSC rings use mpsc_try_produce(), which fails with Err(Full) if the ring is
full rather than spinning. This prevents interrupt handlers from spinning on a
full ring while the consumer (which runs in thread context) cannot drain it --
if the consumer needs to be scheduled to make progress, a spinning IRQ handler
creates an unbounded spin or deadlock.
MPSC producer API contract. The MPSC ring exposes two producer entry points with distinct calling context requirements:
impl DomainRingBuffer {
/// Non-blocking produce. Returns immediately if the ring is full or
/// disconnected. Safe to call from ANY context (thread, IRQ, softirq).
///
/// On success: entry is enqueued and will be visible to the consumer
/// after Phase 2 publish completes.
/// On Err(Full): the ring has no free slots. The caller is responsible
/// for handling the overflow (see overflow accounting below).
/// On Err(Disconnected): the ring partner has died. Caller must not retry.
pub fn mpsc_try_produce(&self, entry: &[u8]) -> Result<(), RingError>;
/// Blocking produce. Spins (with bounded spin + yield) until a slot
/// becomes available, then enqueues the entry.
///
/// MUST NOT be called with interrupts disabled. If the ring is full,
/// this function spins waiting for the consumer to drain entries. If
/// interrupts are disabled, the consumer (which runs in thread context)
/// may never be scheduled, causing an unbounded spin.
///
/// In debug builds: panics immediately if called with interrupts
/// disabled (detected via arch::current::interrupts::are_enabled()).
/// In release builds: falls back to mpsc_try_produce() with overflow
/// accounting if interrupts are disabled (defense in depth — the debug
/// panic should catch all such call sites during development).
    pub fn mpsc_produce_blocking(&self, entry: &[u8]) -> Result<(), RingError>;
}
Calling mpsc_produce_blocking() with interrupts disabled is a BUG. Debug builds
panic to catch the error during development; release builds fall back to
mpsc_try_produce() with overflow accounting to avoid a hard hang in production. The
release fallback is a safety net, not a license to call the blocking variant from IRQ
context — all such call sites must be fixed.
Overflow accounting. When mpsc_try_produce() returns Err(Full) or
Err(Overloaded) (whether called directly from IRQ context or as the release-build
fallback), the caller increments a per-ring atomic overflow counter. Err(Disconnected)
is not counted as overflow — it indicates the ring is being torn down and the caller
should propagate the error to its own caller rather than retrying. The overflow statistics are stored directly in the
DomainRingBuffer producer cache line as dropped_count and last_enqueued_seq
(see struct definition in Section 10.7.2). Inlining these fields into the ring header
avoids an extra pointer dereference on the drop path and keeps both fields on the same
cache line as head and published, which are already hot during produce operations.
Each MPSC entry includes a monotonic sequence number in its header. The consumer detects dropped entries by checking for gaps in the sequence: if the sequence jumps from N to N+K (where K > 1), then K-1 entries were dropped due to overflow. The consumer logs a diagnostic event on gap detection, including the ring identity and gap size, so operators can identify rings that need larger depth configuration (Section 10.7.4 channel depths).
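The gap-detection rule described above is mechanical enough to show directly. A minimal sketch (names are illustrative, not the umka-core API):

```rust
/// Consumer-side drop detection: each MPSC entry header carries a monotonic
/// sequence number; a jump from N to N+K with K > 1 means K-1 entries were
/// dropped at the producer due to overflow.
struct GapDetector {
    last_seq: Option<u64>,
    dropped_total: u64,
}

impl GapDetector {
    fn new() -> Self {
        GapDetector { last_seq: None, dropped_total: 0 }
    }

    /// Feed the sequence number of the next consumed entry.
    /// Returns the number of entries dropped immediately before it;
    /// the caller logs a diagnostic event when this is nonzero.
    fn observe(&mut self, seq: u64) -> u64 {
        let dropped = match self.last_seq {
            // gap size K-1, where K = seq - prev
            Some(prev) => seq.wrapping_sub(prev).saturating_sub(1),
            None => 0, // first entry: no baseline to compare against
        };
        self.dropped_total += dropped;
        self.last_seq = Some(seq);
        dropped
    }
}
```

The running `dropped_total` is what operators compare against the channel depth table in Section 10.7.4 when deciding whether a ring needs a larger depth.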
Summary of context rules:
| Producer context | Permitted API | On ring full | Notes |
|---|---|---|---|
| Thread context (IRQs enabled) | mpsc_produce_blocking() or mpsc_try_produce() | Blocking: spin + yield until space (or Err(Disconnected) on partner death). Try: return Err(Full) or Err(Disconnected). | Blocking variant is the normal path for thread-context producers. |
| IRQ handler / softirq | mpsc_try_produce() ONLY | Return Err(Full) or Err(Disconnected), increment dropped_count, drop message. | Calling the blocking variant is a BUG (debug panic / release fallback). |
| NMI / MCE handler | NEITHER — use per-CPU buffer | N/A | See NMI/MCE safety below. |
NMI/MCE safety: NMI handlers and Machine Check Exception (MCE) handlers MUST NOT produce to MPSC rings. Mitigation 1 (disabling interrupts) does NOT protect against NMIs or MCEs — both are non-maskable architectural exceptions that fire regardless of the interrupt flag state. If an NMI or MCE handler needs to log data, it must use a dedicated per-CPU single-producer buffer (not shared with normal interrupt context) that is drained by the main kernel after the exception returns. On x86, MCE handlers additionally run on a dedicated IST (Interrupt Stack Table) stack, so they must not access per-CPU data structures that assume the normal kernel stack.
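The dedicated per-CPU NMI log buffer can be as simple as the following sketch. Everything here is illustrative (sizes, names): the essential properties are that only the NMI handler writes, only the post-exception drain reads, and the record path never waits.

```rust
const NMI_LOG_SLOTS: usize = 32; // illustrative capacity

/// Single-producer per-CPU buffer: the NMI/MCE handler on this CPU is the
/// sole writer, and the main kernel drains it after the exception returns.
/// No CAS and no shared-ring access -- safe even though NMIs ignore the
/// interrupt flag entirely.
struct NmiLogBuf {
    entries: [u64; NMI_LOG_SLOTS],
    head: usize, // advanced only by the NMI handler
    tail: usize, // advanced only by the post-exception drain
}

impl NmiLogBuf {
    fn new() -> Self {
        NmiLogBuf { entries: [0; NMI_LOG_SLOTS], head: 0, tail: 0 }
    }

    /// NMI context: record one word, dropping on overflow.
    /// An NMI handler must never spin or block.
    fn nmi_record(&mut self, word: u64) -> bool {
        if self.head.wrapping_sub(self.tail) >= NMI_LOG_SLOTS {
            return false; // full: drop rather than wait
        }
        self.entries[self.head % NMI_LOG_SLOTS] = word;
        self.head = self.head.wrapping_add(1);
        true
    }

    /// Normal kernel context, after the exception returns.
    fn drain(&mut self) -> Vec<u64> {
        let mut out = Vec::new();
        while self.tail != self.head {
            out.push(self.entries[self.tail % NMI_LOG_SLOTS]);
            self.tail = self.tail.wrapping_add(1);
        }
        out
    }
}
```

Once drained into thread context, the entries may then be forwarded through the normal MPSC event path.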
Producer death recovery. If a producer (any tier) dies between MPSC Phase 1
(CAS claim) and Phase 2 (advancing published), the published counter is stuck
at the dead producer's slot, blocking all subsequent producers. Three mechanisms
ensure recovery:
1. Crash recovery step 5a (Tier 1, Section 10.8.2) / step 4 (Tier 2, Section
10.8.3): The crash handler identifies all MPSC rings where the dead driver was
a producer. For each ring with head > published, it writes poison markers
(0xFF) into all slots from published to head, then advances published to head.
Finally, it sets state = Disconnected. Any live producer currently spinning in
Phase 2 observes the state change and returns Err(Disconnected). This is the
primary recovery mechanism and handles the vast majority of cases.
2. Consumer-side watchdog (mitigation 2 above): If head > published persists
beyond 10ms (the crash handler has not run yet -- e.g., silent hang, FMA
detection latency), the consumer force-publishes poison markers and advances
published. Safety net only.
3. Spin loop state checks: Every spin loop in mpsc_try_produce() (the
256-iteration fallback and the final unbounded spin) checks state on each
iteration. On Disconnected, the spin exits immediately with
Err(Disconnected) rather than waiting for published to advance.
These mechanisms are tier-independent for Tier 1 and Tier 2: the ring protocol handles producer death the same way regardless of whether the producer was Tier 1 (MPK fault) or Tier 2 (process death). The tier determines detection latency (Tier 1: <1ms via fault handler; Tier 2: immediate via process exit), but the ring recovery sequence is identical.
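The crash-handler side of this sequence (mechanism 1) reduces to a short routine per affected ring. A simplified model follows: one byte per slot stands in for the full entry, and the names are illustrative rather than the umka-core code.

```rust
use std::sync::atomic::{AtomicU64, AtomicU8, Ordering};

const POISON: u8 = 0xFF; // entry_type value the consumer skips

struct MpscRing {
    head: AtomicU64,
    published: AtomicU64,
    state: AtomicU64,     // 0 = connected, nonzero = Disconnected
    slots: Vec<AtomicU8>, // first byte (entry_type) of each slot
}

/// Recovery for one MPSC ring whose dead driver was a producer.
fn recover_dead_producer(ring: &MpscRing) {
    let head = ring.head.load(Ordering::Acquire);
    let published = ring.published.load(Ordering::Acquire);
    // Poison every slot the dead producer claimed (Phase 1) but never
    // published (Phase 2), so the consumer knows to skip them.
    for slot in published..head {
        let idx = (slot % ring.slots.len() as u64) as usize;
        ring.slots[idx].store(POISON, Ordering::Relaxed);
    }
    // Publishing the poisoned range unblocks any live producers whose
    // Phase 2 spin is waiting for `published` to reach their slot.
    ring.published.store(head, Ordering::Release);
    // Finally tear the ring down: producers spinning anywhere in the
    // produce path observe this and return Err(Disconnected).
    ring.state.store(1, Ordering::Release);
}
```

The ordering matters: poison first, publish second, disconnect last, so a consumer never observes a published-but-unpoisoned dead slot.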
Tier 0 (in-kernel) drivers: The recovery mechanisms above do not apply to Tier 0. A Tier 0 driver runs without isolation — if it crashes between Phase 1 and Phase 2 (or anywhere), the kernel is already in a panic state. Corrupted kernel memory makes ring recovery meaningless; the system is going down. The MPSC produce path mitigates the window by disabling interrupts before Phase 1 (preventing preemption between CAS and publication), but no software mechanism can recover from a Tier 0 fault — only hardware isolation provides that.
This is the explicit trade-off of Tier 0 promotion: zero isolation overhead in
exchange for accepting that any driver bug is a kernel panic. On platforms that lack
hardware domain isolation (e.g., RISC-V without a fast isolation mechanism, or when
isolation=performance is set), all Tier 1 drivers are effectively promoted to
Tier 0. Operators choosing this configuration accept the reduced fault containment.
The ring buffer's state/poison-marker recovery remains compiled in (zero cost when
not triggered) but cannot fire because no crash recovery path exists to set
state = Disconnected — the kernel has already panicked.
Broadcast channels (SPMC): umka-core -> all drivers. Used for system-wide notifications
(suspend imminent, memory pressure, clock change). Umka-core writes once; each driver
reads independently. The broadcast channel uses a sequence-numbered ring with a single
sequencing mechanism: the last_enqueued_seq field (hereafter write_seq in broadcast
mode), a u64 in the ring header. write_seq increments by 2 for each published entry
(odd values indicate a write in progress; even values indicate a stable, readable entry —
see torn-read prevention below). The logical entry count is write_seq / 2. The
DomainRingBuffer's published field is not used independently in broadcast mode;
if read, it is derived as write_seq / 2 for compatibility with diagnostic code that
inspects published. Implementations must not increment published separately from
write_seq in broadcast mode — write_seq is the sole source of truth.
Each consumer tracks its own read position (a u64 sequence number stored in the
consumer's private memory, not in the shared ring header). To read, a consumer scans
from its last-seen sequence to the ring's current write_seq (even values only). The
ring's tail field is unused in broadcast mode — the producer never needs to know
individual consumer positions. Instead, the producer overwrites the oldest entry when
the ring is full (broadcast semantics: slow consumers miss entries rather than blocking
the producer). Consumers detect missed entries by checking for sequence gaps.
Torn-read prevention: Each broadcast ring entry is bracketed by a u64 sequence
stamp. Layout: [seq_start: u64 | payload: [u8; entry_size - 16] | seq_end: u64]. The
producer writes seq_start = write_seq | 1 (odd = write in progress), then the payload,
then seq_end = write_seq (even = complete), then advances write_seq by 2. The
consumer reads seq_start, copies the payload, reads seq_end. If
seq_end != (seq_start ^ 1), the read is torn — seq_start and seq_end are not a
matched pair from the same write (a concurrent write changes seq_start to a different
odd value, causing this check to fail). Additionally, if seq_start < consumer.last_seq,
the entry is stale. In either case, the consumer detects the gap, increments gap_count,
and advances to the next entry. All sequence accesses use Ordering::Acquire (reads)
and Ordering::Release (writes).
/// Per-consumer broadcast state (stored in consumer's private memory).
pub struct BroadcastConsumer {
/// Last sequence number consumed by this consumer.
pub last_seq: u64,
}
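The seq_start/seq_end bracketing above is a seqlock-style protocol, and the torn-read check is compact enough to show. A single-entry sketch for illustration (an AtomicU64 payload stands in for the `[u8; entry_size - 16]` payload):

```rust
use std::sync::atomic::{AtomicU64, Ordering};

struct BroadcastEntry {
    seq_start: AtomicU64,
    payload: AtomicU64, // stands in for [u8; entry_size - 16]
    seq_end: AtomicU64,
}

/// Producer: seq_start = write_seq | 1 (odd: write in progress), payload,
/// seq_end = write_seq (even: complete). Returns the advanced write_seq
/// (incremented by 2; logical entry count = write_seq / 2).
fn publish(e: &BroadcastEntry, write_seq: u64, value: u64) -> u64 {
    e.seq_start.store(write_seq | 1, Ordering::Release);
    e.payload.store(value, Ordering::Release);
    e.seq_end.store(write_seq, Ordering::Release);
    write_seq + 2
}

/// Consumer: Some(payload) for a stable read, None for a torn read.
/// A matched pair satisfies seq_end == seq_start ^ 1; a concurrent write
/// changes seq_start to a different odd value, failing the check.
fn try_read(e: &BroadcastEntry) -> Option<u64> {
    let start = e.seq_start.load(Ordering::Acquire);
    let value = e.payload.load(Ordering::Acquire);
    let end = e.seq_end.load(Ordering::Acquire);
    if end == (start ^ 1) { Some(value) } else { None }
}
```

A zero-initialized entry fails the check (seq_end 0 vs seq_start^1 = 1), so unwritten slots read as torn rather than as valid data.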
Capability passing. Capabilities (Section 8.1) can be transferred over any IPC channel.
The sending domain writes a CapabilityHandle (an opaque 64-bit token) into a ring buffer
entry. Umka-core intercepts the transfer at the domain boundary and validates the capability:
does the sender actually hold this capability? Is the capability transferable? Is the
receiver permitted to hold capabilities of this type? If validation passes, umka-core
translates the handle into the receiving domain's capability space -- the receiver gets a
new handle that maps to the same underlying resource but exists in its own namespace. Raw
capability data (kernel pointers, permission bitmasks) never crosses domain boundaries;
only validated, translated handles do.
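The validate-and-translate step at the domain boundary can be sketched as below. All names here (Domain, Capability, the specific checks) are illustrative; the real checks run inside umka-core against the capability tables of Section 8.1 and include receiver-policy checks omitted here.

```rust
use std::collections::HashMap;

type CapabilityHandle = u64; // opaque per-domain token
type ResourceId = u64;       // kernel-internal resource identity

struct Capability {
    resource: ResourceId,
    transferable: bool,
}

struct Domain {
    caps: HashMap<CapabilityHandle, Capability>,
    next_handle: CapabilityHandle,
}

/// Validate a transfer and mint a NEW handle in the receiver's namespace.
/// Raw capability data never crosses the boundary -- only the translated
/// handle does.
fn transfer_capability(
    sender: &Domain,
    receiver: &mut Domain,
    handle: CapabilityHandle,
) -> Result<CapabilityHandle, &'static str> {
    // Does the sender actually hold this capability?
    let cap = sender.caps.get(&handle).ok_or("sender does not hold capability")?;
    // Is the capability transferable?
    if !cap.transferable {
        return Err("capability is not transferable");
    }
    // (A real kernel also checks: may the receiver hold this type at all?)
    let new_handle = receiver.next_handle;
    receiver.next_handle += 1;
    receiver.caps.insert(new_handle, Capability {
        resource: cap.resource,
        transferable: cap.transferable,
    });
    Ok(new_handle)
}
```

Note that the returned handle is meaningful only inside the receiver's namespace; the sender's handle for the same resource remains distinct.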
10.7.4 Flow Control and Ordering
Ordering within a channel. Ring buffer entries are processed in strict FIFO order within a single channel. If umka-core submits commands A, B, C to a driver's command ring, the driver sees them in A, B, C order. Completions flow back in the order the driver produces them (which may differ from submission order -- a driver may complete a fast read before a slow write).
No ordering across channels. There is no ordering guarantee between different channels.
Driver A's completion may arrive at umka-core before driver B's completion, regardless of
which command was submitted first. Applications that need cross-device ordering must
enforce it at the io_uring level (using IOSQE_IO_LINK or IOSQE_IO_DRAIN), which
umka-core translates into sequencing constraints on the domain command rings.
Channel depths. Each channel has a configurable entry count, set at ring creation time via the device registry (Section 10.5):
| Channel type | Default depth | Typical entry size | Notes |
|---|---|---|---|
| Command (SPSC) | 256 | 64 bytes | Matches NVMe SQ depth default |
| Completion (SPSC) | 1024 | 16 bytes | 4x command depth for batched completions |
| Event (MPSC) | 512 | 32 bytes | Shared across all drivers on this event loop |
| Broadcast (SPMC) | 64 | 32 bytes | Low-frequency system events |
The minimum useful broadcast entry size is 24 bytes (8 bytes payload with 16 bytes of
sequence stamps for torn-read prevention). The default of 32 bytes provides 16 bytes of
payload, suitable for most event notifications. Umka-core rejects broadcast ring creation
requests with entry_size < 24.
Depths are tunable per-driver via the device registry's ring_config property. Drivers
that handle high-throughput workloads (NVMe, high-speed NIC) typically increase command
depth to 1024 or 4096 to match hardware queue depths.
Priority channels. Real-time I/O (Section 7.2) uses a separate high-priority command ring per driver. The driver polls the priority ring before the normal ring on every iteration. This ensures RT I/O is not head-of-line blocked behind bulk I/O. Priority rings use the same SPSC structure but are typically shallow (32-64 entries) since RT workloads are low-volume, latency-sensitive flows.
umka-core dispatch logic (per driver, per poll iteration):
1. Check priority command ring -> process all pending entries
2. Check normal command ring -> process up to batch_limit entries
3. Check event ring (MPSC) -> process system events
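The dispatch order above can be modeled directly. In this sketch the three rings are simple queues and the names are illustrative; the point is the ordering and the bounded batch on the normal ring.

```rust
use std::collections::VecDeque;

struct DriverChannels {
    priority: VecDeque<u64>, // shallow RT command ring (32-64 entries)
    normal: VecDeque<u64>,   // bulk command ring
    events: VecDeque<u64>,   // MPSC event ring
}

/// One poll iteration for one driver; returns entries in processing order.
fn poll_once(ch: &mut DriverChannels, batch_limit: usize) -> Vec<u64> {
    let mut processed = Vec::new();
    // 1. Priority ring: process ALL pending entries (RT is never queued
    //    behind bulk I/O).
    while let Some(e) = ch.priority.pop_front() {
        processed.push(e);
    }
    // 2. Normal ring: bounded batch so one busy driver cannot starve the
    //    rest of the poll loop.
    for _ in 0..batch_limit {
        match ch.normal.pop_front() {
            Some(e) => processed.push(e),
            None => break,
        }
    }
    // 3. Event ring: system events last.
    while let Some(e) = ch.events.pop_front() {
        processed.push(e);
    }
    processed
}
```

Entries left in the normal ring simply wait for the next iteration, which is what gives RT traffic its bounded head-of-line delay.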
Comparison with Linux. Linux has no equivalent to the intra-kernel domain ring buffer.
Subsystem communication within the Linux kernel uses direct function calls with no
isolation boundary. The closest analogy is Linux's io_uring internal implementation
(the SQ/CQ ring structure), but that serves a different purpose (kernel-to-userspace
communication). UmkaOS effectively uses an io_uring-inspired ring structure inside the
kernel to connect isolated subsystems that Linux connects via unprotected function calls.
10.7.5 Terminology Reference
The following terms are used precisely throughout this document. This reference resolves ambiguity that arises from the word "ring" appearing in multiple contexts:
| Term | Meaning | Where used |
|---|---|---|
| io_uring | Linux-compatible userspace async I/O interface. SQ/CQ rings mapped into user space. | Section 18.1.5, user-facing I/O API |
| domain ring buffer | Internal kernel IPC mechanism between isolation domains. SPSC or MPSC lock-free rings in shared memory. | Section 10.7, driver architecture |
| MPSC ring | A domain ring buffer variant with CAS-based multi-producer support. Used for event aggregation. | Section 10.7.3, event channels |
| Hardware queue | Device-specific command/completion queues (e.g., NVMe SQ/CQ, virtio virtqueue). Mapped via MMIO. | Section 10.6, device I/O paths |
| SPSC | Single-Producer Single-Consumer. The default domain ring buffer mode. | Section 10.7.2 |
| SPMC | Single-Producer Multi-Consumer. Used for broadcast channels (umka-core -> all drivers). | Section 10.7.3 |
Any unqualified reference to "ring buffer" in the driver architecture sections (Sections 10.5-10.9) means a domain ring buffer. Any reference to "io_uring" means the userspace interface. Hardware queues are always qualified by device type (e.g., "NVMe submission queue", "virtio virtqueue").
10.8 Crash Recovery and State Preservation
This is UmkaOS's killer feature -- the primary reason to choose it over Linux.
Scope: This section covers Tier 1 and Tier 2 driver crash recovery where the host kernel acts as supervisor. For peer kernel crash recovery (devices running UmkaOS as a first-class multikernel peer), see Section 5.1.3, which uses a different isolation model (IOMMU hard boundary + PCIe unilateral controls rather than software domain supervision).
10.8.1 The Linux Problem
In Linux, all drivers run in the same address space with no isolation. A single bug in any driver -- null pointer dereference, buffer overflow, use-after-free -- triggers a kernel panic. Recovery requires a full system reboot: 30-60 seconds of downtime, loss of all in-flight state, and potential filesystem corruption if writes were in progress.
10.8.2 UmkaOS Tier 1 Recovery Sequence
When a Tier 1 (domain-isolated) driver faults:
1. FAULT DETECTED
- Hardware exception (page fault, GPF) within a Tier 1 isolation domain
   - OR watchdog timer expires (driver stalled for >N ms)
- OR driver returns invalid result / corrupts its ring buffer
2. ISOLATE
- UmkaOS Core revokes the faulting driver's isolation domain by setting
the access-disable bit for that domain's key in the domain register
(x86: set AD bit in PKRU; AArch64: clear overlay permissions in POR_EL0;
ARMv7: set domain to "No Access" in DACR; PPC32: invalidate segment;
PPC64LE: switch to revoked PID; RISC-V/fallback: unmap driver pages)
- Driver can no longer access any memory in its domain
- Interrupt lines for this driver are masked
3. DRAIN PENDING I/O
- All pending requests from user space are completed with -EIO
- Applications receive error codes, not crashes
- io_uring CQEs are posted with error status
4. DEVICE RESET
- Issue Function-Level Reset (FLR) via PCIe
- OR vendor-specific device reset sequence
- Device returns to known-good state
5. RELEASE KABI LOCKS
- The KABI lock registry tracks all Core kernel locks currently held on
behalf of this driver. Every lock-acquiring KABI call (e.g., mutex_lock,
rw_lock_read) pushes a (lock_ptr, lock_type) entry onto a per-driver,
per-CPU lock stack (max depth 8, statically allocated in the driver
descriptor). On normal unlock, the entry is popped.
- On crash recovery, the registry is walked in reverse order (LIFO):
each held lock is force-released (mutex: set owner to NONE and wake
waiters; rwlock: decrement reader count or clear writer; spinlock:
release). This prevents deadlock when a driver panics mid-critical-section.
- After lock release, per-CPU borrow states held by the driver are reset
to 0 (free), matching the PerCpu borrow-state tracking in Section 3.1.1.
- **Invariant**: KABI calls that acquire Core locks MUST be non-reentrant
and hold at most one Core lock at a time (enforced by the KABI vtable
wrappers). This bounds the lock stack depth and ensures reverse-order
release is always safe.
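The lock-stack bookkeeping behind step 5 can be sketched as follows. This is an illustrative model, not the actual KABI types: `LockStack`, the raw `usize` lock pointer, and the release callback are simplifications, and the real force-release actions (waking mutex waiters, adjusting rwlock counts) are reduced to a closure.

```rust
/// Illustrative per-driver, per-CPU KABI lock stack (simplified model).
#[derive(Clone, Copy, PartialEq, Debug)]
enum LockType { Mutex, RwRead, RwWrite, Spin }

/// Max depth 8, statically allocated in the driver descriptor.
const LOCK_STACK_DEPTH: usize = 8;

struct LockStack {
    entries: [Option<(usize, LockType)>; LOCK_STACK_DEPTH], // (lock_ptr, lock_type)
    top: usize,
}

impl LockStack {
    fn new() -> Self {
        Self { entries: [None; LOCK_STACK_DEPTH], top: 0 }
    }

    /// Called by every lock-acquiring KABI wrapper (e.g., mutex_lock).
    fn push(&mut self, lock_ptr: usize, ty: LockType) -> Result<(), ()> {
        if self.top == LOCK_STACK_DEPTH {
            return Err(()); // depth bound exceeded: KABI invariant violated
        }
        self.entries[self.top] = Some((lock_ptr, ty));
        self.top += 1;
        Ok(())
    }

    /// Called on normal unlock: pops the matching entry.
    fn pop(&mut self) -> Option<(usize, LockType)> {
        if self.top == 0 { return None; }
        self.top -= 1;
        self.entries[self.top].take()
    }

    /// Crash path: walk in reverse order (LIFO) and force-release each
    /// held lock via the supplied release action.
    fn force_release_all(&mut self, release: &mut dyn FnMut(usize, LockType)) {
        while let Some((ptr, ty)) = self.pop() {
            release(ptr, ty);
        }
    }
}
```

The LIFO walk mirrors normal unlock ordering, so force-release never drops an outer lock while an inner one is still held.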
5a. RECOVER RING BUFFER IN-FLIGHT SLOTS
- For each MPSC ring where the dead driver was a producer: if
`head > published` (indicating the driver may have claimed a slot
via Phase 1 CAS but died before Phase 2 publication), write poison
markers (0xFF) into all unpublished slots from `published` to `head`
and advance `published` to `head`. This unblocks any live producers
spinning in Phase 2 waiting for the dead driver's slot to be published.
- For SPSC completion rings (driver -> core): the ring is drained of all
valid entries up to `published`, then the ring is reset (`head = tail
= published = 0`) for the replacement driver instance.
- Set `state = Disconnected` (AtomicU8, value 1) on all rings owned by
the dead driver. Any producer currently spinning in a Phase 2 loop
will observe this on its next `state.load()` and return
`Err(Disconnected)`. This field is reset to `Active` (0) when the
replacement driver re-initializes the ring.
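Step 5a can be sketched as follows. This is a simplified model: indices are not wrapped, slots are plain byte arrays, and the struct name `MpscRing` and method `recover_dead_producer` are illustrative rather than the actual kernel types.

```rust
use std::sync::atomic::{AtomicU32, AtomicU8, Ordering};

const SLOT_SIZE: usize = 64;
const POISON: u8 = 0xFF;          // poison marker byte from step 5a
const STATE_ACTIVE: u8 = 0;
const STATE_DISCONNECTED: u8 = 1; // AtomicU8 value 1 per the text

/// Simplified model of an MPSC domain ring (indices unwrapped for clarity).
struct MpscRing {
    slots: Vec<[u8; SLOT_SIZE]>,
    head: AtomicU32,      // advanced by Phase 1 CAS claims
    published: AtomicU32, // advanced by Phase 2 publication
    state: AtomicU8,
}

impl MpscRing {
    /// Step 5a: poison any claimed-but-unpublished slots left by a dead
    /// producer, advance `published` to `head`, then mark the ring
    /// Disconnected so live producers spinning in Phase 2 bail out.
    fn recover_dead_producer(&mut self) {
        let head = self.head.load(Ordering::Acquire) as usize;
        let published = self.published.load(Ordering::Acquire) as usize;
        if head > published {
            for slot in &mut self.slots[published..head] {
                slot.fill(POISON); // consumers skip poisoned entries
            }
            self.published.store(head as u32, Ordering::Release);
        }
        self.state.store(STATE_DISCONNECTED, Ordering::Release);
    }
}
```

A live producer's Phase 2 spin loop checks `state.load()` each iteration and returns `Err(Disconnected)` once it observes the store.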
6. UNLOAD DRIVER
- Free all driver-private memory
- Release all driver capabilities
- Unmap driver MMIO regions
7. RELOAD DRIVER
- Load fresh copy of driver binary
- New bilateral vtable exchange
- Device re-initialization
- Re-register interrupt handlers
8. RESUME
- New driver begins accepting I/O requests
- Applications retry failed operations (standard I/O error handling)
TOTAL RECOVERY TIME: ~50ms typical (soft-reset path) to ~150ms (FLR path)
(design target; validation requires hardware prototype — actual timing depends
on driver state snapshot complexity and memory domain reset cost)
10.8.2a Reload Failure Handling
If the new driver instance fails to initialize after a crash, UmkaOS handles the failure as follows:
- Detection: reload failure is defined as the new driver instance crashing during initialization, OR initialization not completing within 500 ms (hard timeout).
- Device offline: the device is marked DeviceState::Failed; no new I/O is accepted.
- Client notification: all processes with open file descriptors to this device receive SIGHUP; any pending I/O syscalls return -EIO.
- Kernel continues: a Tier 1 reload failure does not panic the kernel — the device is simply unavailable. All other drivers and subsystems continue operating normally.
- Audit: a kernel warning is logged with the device canonical name, failure reason (crash vs timeout), and driver version.
- Manual recovery: an operator can trigger a fresh reload attempt via the umkafs control interface at /System/Kernel/drivers/<name>/reload after investigating the cause; the failure counter (Section 10.5.10.2) may also trigger automatic demotion to Tier 2 on repeated failures.
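The decision logic above can be sketched as a small state transition. The names `ReloadOutcome` and `handle_reload` are hypothetical, and the watchdog/init machinery is reduced to a precomputed outcome; only the transitions described in the bullets are modeled.

```rust
use std::time::Duration;

#[derive(Debug, PartialEq)]
enum DeviceState { Active, Failed }

/// Outcome of one reload attempt (hypothetical enum; stands in for
/// "init crashed" vs "init exceeded the hard timeout" vs success).
#[derive(Debug, PartialEq)]
enum ReloadOutcome { Ok, InitCrash, InitTimeout }

/// Hard timeout on driver initialization, per the detection rule above.
const INIT_TIMEOUT: Duration = Duration::from_millis(500);

/// Returns true if the device came back online.
fn handle_reload(
    outcome: ReloadOutcome,
    state: &mut DeviceState,
    failure_count: &mut u32,
) -> bool {
    match outcome {
        ReloadOutcome::Ok => {
            *state = DeviceState::Active;
            true
        }
        ReloadOutcome::InitCrash | ReloadOutcome::InitTimeout => {
            // Device offline: no new I/O accepted. Clients get SIGHUP and
            // pending I/O returns -EIO (not modeled here). Kernel continues.
            *state = DeviceState::Failed;
            *failure_count += 1; // may trigger auto-demotion (Section 10.5.10.2)
            false
        }
    }
}
```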
Recovery timing breakdown — The ~50ms figure applies to the soft-reset path where the driver performs a vendor-specific device reset (register write + status poll) without a full PCIe Function Level Reset. Many devices (Intel NICs, AHCI controllers) support fast software reset in 1-10ms. The full PCIe FLR path takes longer: the PCIe spec requires the function to complete FLR within 100ms (the device must not be accessed until FLR completes; software polls the device's configuration space to detect completion). With driver reload overhead, the FLR path totals ~150ms. UmkaOS prefers the soft-reset path when the driver crash was a software bug (the device hardware is fine); FLR is used when the device itself appears hung (no response to MMIO reads, completion timeout). In either case, the recovery is 100-1000x faster than a full Linux reboot (30-60s).
10.8.2b FLR Timeout Recovery
The PCIe Base Specification requires that a function complete FLR within 100 ms. UmkaOS enforces this deadline and defines an escalating recovery sequence for the case where FLR does not complete in time.
FLR with timeout enforcement:
/// Poll interval while waiting for FLR completion.
const FLR_POLL_INTERVAL_US: u64 = 1_000; // 1 ms

/// Maximum wait for FLR per PCIe Base Spec (Section 6.6.2).
const FLR_TIMEOUT_MS: u64 = 100;

/// Initiate FLR on a PCIe function and poll for completion.
/// Returns Ok(()) when the function's config space is accessible again.
/// Returns Err(PcieError::FlrTimeout) if the deadline elapses.
fn pcie_flr_with_timeout(dev: &mut PcieDevice) -> Result<(), PcieError> {
    // Initiate FLR: set bit 15 (Initiate Function Level Reset) of the Device
    // Control register, preserving the other control bits (read-modify-write).
    // Cap offset is discovered via the PCIe Capability structure pointer.
    let devctl_offset = dev.pcie_cap_offset + PCI_EXP_DEVCTL;
    let devctl = dev.config_read_u16(devctl_offset);
    dev.config_write_u16(devctl_offset, devctl | PCI_EXP_DEVCTL_BCR_FLR);

    let deadline_ns = monotonic_ns() + FLR_TIMEOUT_MS * 1_000_000;
    loop {
        delay_us(FLR_POLL_INTERVAL_US);
        // FLR completion is indicated by config space returning valid data.
        // A device undergoing FLR returns 0xFFFF for any config read.
        if dev.config_read_u16(PCI_VENDOR_ID) != 0xFFFF {
            return Ok(());
        }
        if monotonic_ns() >= deadline_ns {
            break;
        }
    }
    Err(PcieError::FlrTimeout)
}
Escalation sequence on PcieError::FlrTimeout:
When FLR does not complete within 100 ms, UmkaOS escalates through the following steps in order, stopping at the first step that succeeds:
1. IOMMU quarantine (immediate, before attempting any escalation): the device's IOMMU domain is placed in fault mode — all further DMA from the device is blocked by the IOMMU. This prevents the hung device from corrupting memory during the escalation sequence, regardless of how long escalation takes.
2. Secondary bus reset: if the device is behind a PCIe bridge (not directly attached to the root complex), assert the bridge's secondary bus reset bit (PCI_BRIDGE_CTL_BUS_RESET, bit 6 of the Bridge Control register at config offset 0x3E). Hold for 1 ms, then deassert and wait up to 100 ms for the device's Vendor ID to become valid. A secondary bus reset resets all functions on the secondary bus, so sibling functions receive DeviceEvent::SiblingReset.
3. Hot-plug slot power cycle: if the slot exposes Hot-Plug capability and the HPC_POWER_CTRL bit is set in the Slot Capabilities register, toggle slot power off and on. Wait up to 1 s for the slot's Presence Detect State to return to present and the device's config space to become accessible.
4. Permanent fault: if neither secondary bus reset nor hot-plug power cycle recovers the device:
   - Transition the device to DeviceState::FaultedUnrecoverable.
   - Remove the device from the active device registry (it is retained as a tombstone entry for diagnostic purposes, accessible via umkafs).
   - Invoke the Tier 1 driver's teardown path (unload the driver, release its memory domain and capabilities) as if a crash occurred, but without attempting reload.
   - Log: pcie: FLR timeout on [bus:dev.fn] (vid={vid} did={did}), secondary bus reset {"succeeded"|"failed"}, slot power cycle {"succeeded"|"failed"|"unavailable"}, device faulted permanently.
   - The FMA subsystem (Section 19.1) receives a FaultEvent::PcieFlrTimeout event carrying the BDF, the vendor/device ID, and the escalation result. FMA may trigger a predictive replacement recommendation.
   - User notification: after the fault is recorded, send a uevent to userspace (ACTION=change, SUBSYSTEM=pci, PCIE_EVENT=FLR_TIMEOUT, PCI_SLOT_NAME=<bdf>). Device manager daemons (udev, systemd-udevd) can trigger operator alerts or automated replacement workflows.
Invariants:
- IOMMU quarantine (step 1) is unconditional and runs before any escalation attempt.
The device must not be able to DMA during escalation.
- Steps 2 and 3 each have their own 100 ms and 1 s timeouts respectively. Total
worst-case escalation time before permanent fault: ~1.2 s.
- No driver code runs after FlrTimeout is returned. The escalation sequence is
entirely in the kernel's PCIe subsystem (Tier 0), not in the Tier 1 driver.
- If a secondary bus reset is performed, the sibling functions' drivers are notified
via DeviceEvent::SiblingReset before the reset is asserted, giving them 5 ms to
quiesce outstanding I/O.
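The stop-at-first-success ordering of the escalation sequence can be sketched as follows. This is an illustrative control-flow model: `Escalation` and `escalate_flr_timeout` are hypothetical names, and the real bus-reset and power-cycle primitives (with their 100 ms and 1 s timeouts) are reduced to a callback.

```rust
#[derive(Debug, PartialEq, Clone, Copy)]
enum Escalation { SecondaryBusReset, SlotPowerCycle, PermanentFault }

/// Try each applicable escalation step in order; stop at the first success.
/// `try_step` stands in for the real reset/power-cycle primitives and
/// returns true if the device's config space became accessible again.
/// Step 1 (IOMMU quarantine) is unconditional and precedes this sequence,
/// so it is not modeled here.
fn escalate_flr_timeout(
    behind_bridge: bool,       // device sits behind a PCIe bridge
    hotplug_power_ctrl: bool,  // slot exposes HPC_POWER_CTRL
    try_step: &mut dyn FnMut(Escalation) -> bool,
) -> Escalation {
    if behind_bridge && try_step(Escalation::SecondaryBusReset) {
        return Escalation::SecondaryBusReset;
    }
    if hotplug_power_ctrl && try_step(Escalation::SlotPowerCycle) {
        return Escalation::SlotPowerCycle;
    }
    // Neither step applied or succeeded: fault the device permanently.
    Escalation::PermanentFault
}
```

Note that inapplicable steps are skipped entirely: a root-complex-attached device never attempts a secondary bus reset, and a slot without power control never attempts a power cycle.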
10.8.2c Crash State Buffer Wire Format
When a Tier 1 driver panics, a pre-allocated crash state buffer is filled before the driver's isolation domain is destroyed. This buffer is stored in umka-core memory and remains accessible after teardown. It is used for post-mortem diagnostics, FMA fault reporting, and optionally for warm-restart state recovery.
/// Wire format of the crash state buffer saved when a Tier 1 driver panics.
/// Saved to a pre-allocated crash buffer in umka-core memory so it remains
/// accessible after the driver's memory domain is destroyed.
///
/// Total size: 512 bytes. Aligned to 64 bytes (cache-line boundary).
#[repr(C, align(64))]
pub struct DriverCrashState {
    /// Magic number for validation: 0x49534C4352415348 ("ISLCRASH" in ASCII).
    pub magic: u64,
    /// Format version. Current: 1.
    pub version: u16,
    _pad0: [u8; 2],
    /// Driver ID (same as in the driver registry).
    pub driver_id: u32,
    /// TSC value at the time of crash (monotonic, CPU-local).
    pub crash_tsc: u64,
    /// Program counter (instruction pointer) at crash.
    pub crash_pc: u64,
    /// Stack pointer at crash.
    pub crash_sp: u64,
    /// Frame pointer at crash (for stack unwinding).
    pub crash_fp: u64,
    /// Crash reason code.
    pub crash_reason: CrashReason,
    _pad1: [u8; 4],
    /// Ring buffer head index at crash time.
    pub ring_head: u32,
    /// Ring buffer tail index at crash time.
    pub ring_tail: u32,
    /// First 256 bytes of the request being processed when the crash occurred
    /// (zero-padded if the request is shorter or unavailable).
    pub partial_request: [u8; 256],
    /// Crash backtrace: first 128 bytes. Symbolicated if DWARF debug info is
    /// available at the time of crash; raw 8-byte addresses otherwise.
    pub backtrace: [u8; 128],
    _pad2: [u8; 64],
    // Field layout (repr(C), no implicit padding):
    // magic(8) + version(2) + _pad0(2) + driver_id(4) + crash_tsc(8)
    // + crash_pc(8) + crash_sp(8) + crash_fp(8) + crash_reason(4)
    // + _pad1(4) + ring_head(4) + ring_tail(4) + partial_request(256)
    // + backtrace(128) + _pad2(64) = 512 bytes total.
}
/// Crash reason codes stored in DriverCrashState.
#[repr(u32)]
pub enum CrashReason {
    /// Driver code invoked panic!() or hit an assertion failure.
    Panic = 0,
    /// Page fault (null dereference, stack overflow, bad pointer).
    PageFault = 1,
    /// Invalid opcode (#UD fault — executed an undefined instruction).
    InvalidOpcode = 2,
    /// Divide-by-zero (#DE fault).
    DivByZero = 3,
    /// Capability access violation (attempted to cross an isolation boundary
    /// without a valid capability token).
    CapViolation = 4,
    /// Watchdog timer expired (driver did not make forward progress).
    Timeout = 5,
    /// Stack overflow detected (guard page fault at the bottom of the driver stack).
    StackOverflow = 6,
}
The crash buffer is pre-allocated per driver at load time (no allocation during the crash path). The domain fault handler fills it with whatever register state is available at fault entry, then proceeds with the normal recovery sequence.
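A post-mortem consumer of the buffer validates the header before trusting the rest. A minimal sketch, with the expected magic and version passed as parameters (the helper name `crash_buffer_valid` is hypothetical; the offsets follow the repr(C) layout above, with magic at byte 0 and version at byte 8, both in native byte order):

```rust
/// Validate the header of a 512-byte crash state buffer.
/// Returns false if the magic or version does not match, indicating a
/// buffer that was never filled, torn, or written by an unknown format.
fn crash_buffer_valid(buf: &[u8; 512], expected_magic: u64, expected_version: u16) -> bool {
    // magic occupies bytes 0..8, version bytes 8..10 (native endianness,
    // since the buffer is read on the same machine that wrote it).
    let magic = u64::from_ne_bytes(buf[0..8].try_into().unwrap());
    let version = u16::from_ne_bytes(buf[8..10].try_into().unwrap());
    magic == expected_magic && version == expected_version
}
```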
10.8.3 UmkaOS Tier 2 Recovery Sequence
Tier 2 (user-space process) driver recovery is even simpler:
1. Driver process crashes (SIGSEGV, SIGABRT, etc.)
2. UmkaOS Core's driver supervisor detects process exit
3. REVOKE DEVICE ACCESS
- Mark the device as "in recovery" in the device registry, preventing
any new MMIO mappings or device access grants for this device.
- Revoke the driver's IOMMU entries (tear down the device's IOMMU
domain mappings). Any in-flight DMA that completes after this point
hits an IOMMU fault and is dropped.
- If the dying process's teardown has not yet completed MMIO unmapping
(page table entry removal + TLB shootdown), force-invalidate the
relevant page table entries. In practice, the process is already
exiting at step 2, so MMIO unmapping is a cleanup operation — the
device registry marking and IOMMU revocation are what actually
prevent further device access.
4. RECOVER RING BUFFER IN-FLIGHT SLOTS
- Same as Tier 1 step 5a: publish poison markers for any MPSC slots
claimed but unpublished by the dead driver, set ring `state =
Disconnected`. Unblocks live producers spinning on Phase 2.
5. Pending I/O completed with -EIO
6. Supervisor restarts driver process
7. New process re-initializes device, resumes service
TOTAL RECOVERY TIME: ~10ms
Why Tier 2 is faster than Tier 1 -- Counter-intuitively, the "weaker" isolation tier recovers faster. The reason is that Tier 2 recovery skips the most expensive step in the Tier 1 sequence: no device FLR in the normal case. Tier 2 drivers have direct MMIO access to their device's BAR regions (for performance), but MMIO revocation (step 3 above) cuts off device access immediately. The IOMMU prevents any DMA initiated through those MMIO registers from reaching non-driver memory, so there is no DMA safety hazard even if the device has in-flight operations.
IOTLB coherence and DMA page lifetime -- A lightweight IOMMU invalidation (not a
full drain fence) suffices at step 3 because Tier 2 recovery defers freeing the
crashed driver's DMA pages rather than draining all in-flight DMA. After IOMMU entry
revocation, stale IOTLB entries may still allow in-flight DMA to complete to the old
physical addresses. If those pages were freed immediately, this would be a
use-after-free via hardware. Instead, the old DMA pages remain allocated (owned by the
kernel, not the dead process) until the replacement driver instance calls init() and
either reuses them (warm restart via the state buffer) or explicitly releases them back
to the allocator. By the time pages are actually freed, the IOTLB has long since been
flushed — either by the invalidation at step 3, by natural IOTLB eviction, or by the
new driver's own IOMMU setup. This makes the IOTLB coherence window moot without
requiring a synchronous drain fence.
DMA deferred-free lifetime bound -- The deferred-free strategy described above has
a resource exhaustion risk: if the replacement driver never loads (or loads but never
calls init()), the old DMA pages remain allocated indefinitely. An attacker could
repeatedly crash Tier 2 drivers to exhaust DMA-capable memory (typically ZONE_DMA /
ZONE_DMA32 on x86-64, or CMA regions on ARM). To bound this exposure, every deferred
DMA page set carries a reclaim deadline:
- When a Tier 2 driver crashes and its DMA pages are moved to deferred-free status, each page set is tagged with deferred_deadline = now + 30_seconds.
- A kernel background task (dma_reclaim_worker, period = 10 seconds) scans all deferred-free DMA page sets. Any page set whose deadline has passed is reclaimed immediately — the "wait for replacement driver" check is bypassed. The reclaim frees the physical pages back to the allocator and logs a warning identifying the driver and number of pages force-reclaimed.
- Rationale: 30 seconds is ample time for the driver supervisor to restart the replacement process and for the new driver to call init() and either reuse or release the preserved pages. If no replacement has loaded after 30 seconds, the driver is presumed permanently crashed (or its supervisor has given up), and the pages are safe to reclaim. By the 30-second mark, any stale IOTLB entries have long since been flushed (IOTLB eviction typically occurs within microseconds to milliseconds), so reclaiming the pages carries no DMA safety hazard.
/// Maximum DMA pages preserved across a Tier 2 driver crash.
/// 512 pages × 4 KB = 2 MB maximum preserved DMA state per driver.
/// Drivers requiring more than 2 MB of preserved DMA state should use
/// persistent memory (DAX) or external state servers.
pub const MAX_DEFERRED_DMA_PAGES: usize = 512;
/// DMA pages held in deferred-free state after a Tier 2 driver crash.
///
/// These pages are preserved so the replacement driver can reuse them
/// (warm restart via the state buffer). If no replacement loads before
/// `deadline`, the `dma_reclaim_worker` force-reclaims them.
pub struct DeferredDmaPages {
    /// Physical pages held for replacement driver use after crash recovery.
    /// Fixed-size array: crash handlers MUST NOT allocate from heap.
    /// Pre-allocated at driver initialization time.
    pub pages: ArrayVec<PhysPage, MAX_DEFERRED_DMA_PAGES>,
    pub page_count: u16, // actual count; u16 sufficient (max 512 < 65535)
    /// Deadline after which pages are reclaimed regardless of driver state.
    pub deadline: Instant,
    /// Which driver's state these pages belong to (for logging on forced reclaim).
    pub driver_name: DriverName,
}
The DriverRegistry maintains a counter for observability:
/// Number of times the 30-second deadline triggered forced DMA page
/// reclaim. Exposed via umkafs at `/Devices/<device>/dma_forced_reclaims`.
/// A sustained non-zero rate indicates drivers that crash without timely
/// replacement — investigate the driver supervisor and restart policy.
pub dma_forced_reclaims: AtomicU64,
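The reclaim worker's scan can be sketched as follows. This is a simplified model: `DeferredSet` stands in for DeferredDmaPages, the page free is reduced to a count, and the returned tuple is what the caller would use to bump dma_forced_reclaims and emit the per-set warning.

```rust
use std::time::Instant;

/// Simplified stand-in for DeferredDmaPages (name is illustrative).
struct DeferredSet {
    deadline: Instant,       // reclaim deadline (crash time + 30 s)
    pages: usize,            // number of preserved physical pages
    driver: &'static str,    // for the forced-reclaim warning log
}

/// One pass of the dma_reclaim_worker (period = 10 s in the text):
/// reclaim every set whose deadline has passed, keep the rest for a
/// possible warm restart. Returns (pages_reclaimed, forced_reclaims).
fn reclaim_scan(sets: &mut Vec<DeferredSet>, now: Instant) -> (usize, u64) {
    let mut pages_reclaimed = 0usize;
    let mut forced = 0u64;
    sets.retain(|s| {
        if now >= s.deadline {
            pages_reclaimed += s.pages; // free pages back to the allocator
            forced += 1;                // counted in dma_forced_reclaims
            false                       // drop the set
        } else {
            true // replacement driver may still reuse these pages
        }
    });
    (pages_reclaimed, forced)
}
```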
If the device appears hung after the Tier 2 crash (the replacement driver's init()
detects an unresponsive device), the registry escalates to FLR, but this fallback is
rare. Tier 2 recovery is typically "revoke mappings, restart the process, reconnect to
the ring buffer" -- a ~10ms operation dominated by process creation and driver
init().
10.8.4 State Preservation and Checkpointing
Driver recovery (Section 10.8 steps 1–6) restarts a new driver instance, but without state preservation the new instance starts cold — losing in-flight I/O, device configuration, and connection state. UmkaOS uses a Theseus-inspired state spill design to enable warm restarts.
State buffer — Each Tier 1 driver has an associated kernel-managed "state buffer" that resides outside the driver's isolation domain. The buffer is allocated by umka-core and mapped read-write into the driver's address space. On crash, the isolation domain is destroyed but the state buffer survives (it belongs to umka-core).
Driver Isolation Domain (destroyed on crash) umka-core (survives)
┌─────────────────────────┐ ┌──────────────────────┐
│ Driver code + heap │ checkpoint → │ State Buffer │
│ Internal caches │ ──────────→ │ ┌────────────────┐ │
│ (NOT preserved) │ │ │ Version: 3 │ │
│ │ │ │ DevCmdQueue[] │ │
│ │ │ │ RingBufPos │ │
│ │ │ │ ConnState[] │ │
│ │ │ │ HMAC Tag │ │
└─────────────────────────┘ │ └────────────────┘ │
└──────────────────────┘
State buffer format:
- Driver-defined structure (the driver author decides what to checkpoint).
- Versioned via KABI version field — the state buffer header includes a format version number so a newer driver binary can detect and handle (or reject) state from an older version.
- HMAC-SHA256 integrity tag — computed by umka-core using a per-driver key, verified before handing to the new driver instance. Corrupt or tampered buffers are discarded.

The HMAC key is generated by umka-core on the first load of a driver for a given DeviceHandle. The key is stored in the DeviceNode (Section 10.6 Device Registry) and persists across driver crash/reload cycles. The key is only discarded when the DeviceHandle is removed from the registry (device unplugged or explicitly deregistered). On reload, umka-core verifies the existing state buffer using the persisted key, then continues using the same key for the new driver instance. The driver writes state data, but only umka-core can produce valid integrity tags, preventing a buggy driver from poisoning the state buffer with corrupted data.

Note: Tier 1 drivers run in Ring 0, so a deliberately compromised driver (with arbitrary code execution) could read the HMAC key from umka-core memory by bypassing MPK via WRPKRU (Section 10.2, WRPKRU threat model). This is within the documented Tier 1 threat model — MPK provides crash containment, not exploitation prevention. The HMAC protects state buffer integrity against bugs (the common case), not against active exploitation (which requires Tier 2 for defense).
Checkpoint frequency:
- Configurable per-driver. Default: checkpoint after every I/O batch completion, or every 1ms, whichever comes first.
- Checkpoint is a memcpy from driver-local structures to the inactive state buffer slot (~1–4 KB typical) plus an atomic doorbell write. At 1ms intervals, the overhead is negligible.
Torn checkpoint protection (double buffering):
The driver cannot compute the HMAC (only umka-core can), so a driver crash mid-write would leave a torn (partially written) state buffer. To prevent this, the state buffer uses a double-buffering protocol:
- The state buffer contains two slots (A and B). At any time, one slot is active (the last successfully checkpointed state) and the other is inactive (the write target for the next checkpoint).
- The driver writes its checkpoint data to the inactive slot. When the write is complete, the driver signals umka-core by writing a completion flag to a shared doorbell — a single atomic write visible to umka-core.
- Umka-core, on observing the doorbell (polled during periodic work or on driver crash), computes HMAC-SHA256 over the completed slot and atomically swaps the active slot pointer.
- On crash recovery, umka-core verifies the active slot's HMAC. If valid, that state is used for the new driver instance. If invalid (corruption or incomplete swap), umka-core falls back to the previous active slot, which still holds the last known-good checkpoint.
- The double-buffer swap is an atomic pointer update. There is no race with driver writes because the driver only ever writes to the inactive slot.
- After ringing the doorbell, the inactive slot is considered "pending" -- the driver must not begin a new checkpoint until umka-core completes the swap and clears the doorbell flag. If the next 1 ms checkpoint interval arrives while a swap is still pending, the driver skips that checkpoint cycle. In practice, umka-core processes the doorbell within a few microseconds (HMAC-SHA256 on 4 KB takes ~2–5 µs with hardware SHA acceleration, ~15–30 µs without — see HMAC-SHA256 performance note below), so skipped checkpoints are rare.
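The double-buffering protocol above can be sketched as a single-threaded model. This is illustrative only: `StateBuffer` is a hypothetical name, slots are `Vec<u8>` rather than fixed shared-memory regions, the doorbell is a plain bool rather than an atomic, and the HMAC computation on swap is elided.

```rust
/// Illustrative model of the double-buffered checkpoint protocol.
struct StateBuffer {
    slots: [Vec<u8>; 2],
    active: usize,  // index of the last verified (known-good) checkpoint
    doorbell: bool, // set by the driver after completing a slot write
}

impl StateBuffer {
    /// Driver side: write the checkpoint to the inactive slot, then ring
    /// the doorbell. Returns false (cycle skipped) if a swap is pending.
    fn checkpoint(&mut self, data: &[u8]) -> bool {
        if self.doorbell {
            return false; // umka-core has not processed the last doorbell yet
        }
        let inactive = 1 - self.active;
        self.slots[inactive].clear();
        self.slots[inactive].extend_from_slice(data);
        self.doorbell = true; // single atomic write in the real protocol
        true
    }

    /// umka-core side: on observing the doorbell, compute the HMAC over the
    /// completed slot (elided here) and atomically swap the active pointer.
    fn process_doorbell(&mut self) {
        if self.doorbell {
            self.active = 1 - self.active;
            self.doorbell = false;
        }
    }

    /// The state a crash-recovery path would verify and hand to the new
    /// driver instance (falling back to the other slot on HMAC failure).
    fn active_state(&self) -> &[u8] {
        &self.slots[self.active]
    }
}
```

Because the driver only ever writes the inactive slot and the swap is a single pointer update, a crash at any point leaves the active slot holding a complete, verifiable checkpoint.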
TOCTOU mitigation (verify-then-use atomicity):
The state buffer is mapped read-write into the driver's address space, which creates a potential Time-Of-Check-Time-Of-Use (TOCTOU) vulnerability: a compromised driver could modify the active slot after umka-core verifies the HMAC but before the new driver instance reads it. UmkaOS prevents this attack through the following mechanisms:
1. Slot revocation on crash: When a driver crashes, umka-core immediately revokes the crashed driver's write access to both state buffer slots by unmapping the entire state buffer from the old isolation domain. This is step 2 of the recovery sequence (Section 10.8) — it happens before HMAC verification (step 4). After revocation, the crashed driver's code cannot execute and its page tables are destroyed, so there is no entity that can modify the buffer between verification and use.
2. Copy-on-verify to kernel-private storage: After HMAC verification succeeds, umka-core copies the verified slot contents to a kernel-private buffer (not mapped into any driver's address space). The new driver instance receives a read-only snapshot of this copy, not a pointer to the original state buffer. This ensures that even if an attacker could somehow gain write access to the original buffer (which they cannot, per point 1), the verified data cannot be altered.
3. New driver isolation: The new driver instance is created with a fresh isolation domain. The state buffer is not mapped into this new domain until after the new driver calls init() and signals that it has finished consuming the checkpoint data. During initialization, the driver reads from the kernel-private copy (provided via a read-only mapping or explicit copy to the driver's local heap). Only after init() returns successfully does umka-core map the state buffer (both slots) read-write into the new driver's address space for future checkpoints.
4. Atomicity guarantee: The sequence — unmap from old domain, verify HMAC, copy to kernel-private storage, create new domain — is performed with preemption disabled on the recovery CPU. There is no window during which any user-space code (driver or otherwise) can execute while holding write access to the verified buffer.
This design ensures that HMAC verification and data consumption are effectively atomic: once verified, the data cannot be modified by any entity before the new driver reads it. The cost is one additional memcpy (~4 KB) per recovery, which is negligible compared to the overall recovery latency (~50-150 ms).
HMAC-SHA256 performance:
HMAC-SHA256 for a 4 KB message:
- With hardware SHA acceleration (SHA-NI on x86-64 Skylake+, SHA1/SHA256 extensions on AArch64/ARMv7, Zknh on RISC-V): ~2.1 cycles/byte → ~8,600 cycles → ~2–5 µs at 3 GHz
- Without hardware acceleration (software implementation, SSSE3, or generic): ~13 cycles/byte → ~53,000 cycles → ~15–30 µs at 3 GHz
UmkaOS selects the optimal implementation at boot via algorithm priority:
hardware-SHA > SSSE3 > generic. The crypto_shash_alloc() API transparently
selects the fastest available implementation for the running CPU.
HMAC-SHA256 computation is performed by umka-core asynchronously — not on the driver's hot path. The driver's checkpoint cost is limited to the memcpy plus an atomic doorbell write.
What is preserved vs. rebuilt:
| Preserved (in state buffer) | NOT preserved (rebuilt from scratch) |
|---|---|
| Device command queue positions | Driver-internal caches |
| Hardware register snapshots | Deferred work queues |
| In-flight I/O descriptors | Timers and timeout state |
| Ring buffer head/tail pointers | Debug/logging state |
| Connection/session state | Statistics counters (reset to zero) |
| Device configuration (MTU, features, etc.) |
NVMe example:
- Checkpointed: submission queue tail doorbell position, completion queue head position, in-flight command IDs with their scatter-gather lists, namespace configuration.
- On reload: new driver reads state buffer, re-maps device BARs, verifies queue state against hardware registers, and resumes submission. In-flight commands that were submitted but not completed are re-issued.

NIC example:
- Checkpointed: active flow table entries, RSS (Receive Side Scaling) indirection table and hash key, interrupt coalescing settings, VLAN filter table, MAC address list.
- On reload: new driver re-programs the NIC with the checkpointed configuration. Active TCP connections see a brief pause (~50-150ms) but do not reset — the connection state lives in umka-net (Tier 1), not in the NIC driver.

Fallback:
- If HMAC verification of the state buffer fails, or the version is incompatible, the new driver instance performs a cold restart (current behavior: full device reset, all in-flight I/O returned as -EIO).
- Cold restart is always safe — state preservation is an optimization, not a requirement.
10.8.5 Crash Dump Infrastructure
When umka-core itself faults (not a driver — the core kernel), the system needs to capture diagnostic state for post-mortem analysis. Unlike driver crashes (which are recoverable), a core panic is fatal.
Reserved memory region:
- At boot, UmkaOS reserves a contiguous physical memory region for crash dumps, configured
via boot parameter: umka.crashkernel=256M (similar to Linux crashkernel=).
- This region is excluded from the normal physical memory allocator — it survives a
warm reboot if the firmware doesn't clear RAM.
Panic sequence:
1. Core panic triggered (null deref, assertion failure, double fault, etc.)
2. Disable interrupts on all CPUs (IPI NMI broadcast)
3. Panic handler (Tier 0 code, always resident, minimal dependencies):
a. Save register state for the faulting CPU:
- x86-64: GPRs, CR3, IDTR, RSP, RFLAGS, RIP, segment selectors
- AArch64: GPRs (x0-x30), SP_EL1, ELR_EL1, SPSR_EL1, ESR_EL1, FAR_EL1
- ARMv7: GPRs (r0-r15), CPSR, DFAR, DFSR, IFAR, IFSR
- RISC-V: GPRs (x0-x31), sepc, scause, stval, sstatus, satp
b. Walk the stack, generate backtrace (using .eh_frame / DWARF unwind info)
c. Snapshot key data structures:
- Active process list + their states
- Capability table summary
- Driver registry state
- IRQ routing table
- Recent ring buffer entries (last 64KB of klog)
d. Write all of the above into the reserved crash region as an ELF core dump
4. Flush panic message to serial console (already works in current implementation)
5. If a pre-registered NVMe region exists (configured at boot):
a. Use the NVMe driver's Tier 0 "panic write" path (polled mode, no interrupts)
b. Write the crash dump from reserved memory to the NVMe region
6. Halt or reboot (configurable: `umka.panic=halt|reboot`, default: halt)
Crash stub:
- The panic handler is Tier 0 code: statically linked, no dynamic dispatch, no allocation, no locks (or only try-lock with immediate fallback). It must work even if the heap, scheduler, or interrupt subsystem is corrupted.
- Serial output always works (Tier 0 serial driver, polled mode).
- NVMe panic write uses polled I/O (no interrupts, no completion queues) — a simplified write path that can function with a partially-corrupted kernel.
Next boot recovery:
1. Bootloader loads UmkaOS kernel
2. Early init checks the reserved crash region for a valid dump header
3. If found:
a. Copy dump to a temporary in-memory buffer
b. After filesystem mount, write to /var/crash/umka-dump-<timestamp>.elf
c. Log "Previous crash dump saved to /var/crash/umka-dump-<timestamp>.elf"
d. Clear the reserved crash region
4. The dump can be analyzed with standard tools:
- `crash` utility (same as Linux kdump analysis)
- GDB with the UmkaOS kernel debug symbols
- `umka-crashdump` tool (UmkaOS-specific, extracts structured summaries)
Dump format:
- ELF core dump format, compatible with the crash utility and GDB.
- Contains: register state, memory regions (kernel text, data, stack pages for active
threads, page tables), and a note section with UmkaOS-specific metadata (kernel version,
boot parameters, uptime, driver state).
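Step 2 of the next-boot recovery flow hinges on validating the reserved region's dump header before trusting its length fields. A minimal sketch, with an illustrative (not spec'd) header layout and magic value:

```rust
/// Hypothetical layout of the header at the start of the reserved crash
/// region. Field set and magic value are illustrative, not the spec'd format.
#[repr(C)]
pub struct CrashRegionHeader {
    /// Magic marker written by the panic handler.
    pub magic: [u8; 8],
    /// Header format version.
    pub version: u32,
    /// Length of the ELF core dump that follows the header, in bytes.
    pub dump_len: u32,
    /// Checksum over the dump bytes (algorithm unspecified in this sketch).
    pub checksum: u64,
}

pub const CRASH_MAGIC: [u8; 8] = *b"UMKADUMP";

/// Early-init validity check: magic, version, and length must be sane before
/// the dump is copied out; a stale or corrupt region is treated as empty.
pub fn crash_header_valid(h: &CrashRegionHeader, region_len: usize) -> bool {
    h.magic == CRASH_MAGIC
        && h.version == 1
        && (h.dump_len as usize)
            <= region_len.saturating_sub(core::mem::size_of::<CrashRegionHeader>())
}
```

A length check against the region size (rather than trusting `dump_len` blindly) matters because the dump was written by a possibly-corrupted kernel.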
No kexec on day one:
- Linux uses kexec to boot a second "crash kernel" that writes the dump. This is reliable but complex.
- UmkaOS uses a simpler "in-place dump" to reserved memory: the panic handler writes directly to the reserved region without booting a second kernel.
- kexec-based crash dump is a future enhancement for systems where the in-place approach is insufficient (e.g., very large memory dumps requiring a full kernel to compress and transmit).
10.8.6 Recovery Comparison
| Scenario | Linux | UmkaOS |
|---|---|---|
| NVMe driver null deref | Kernel panic, full reboot | Reload driver, ~50-150ms (design target) |
| NIC driver infinite loop | System freeze | Watchdog kill, reload, ~50-150ms (design target) |
| USB driver buffer overflow | Kernel panic | Restart process, ~10ms |
| FS driver corruption | Kernel panic + fsck | Reload driver, fsck on mount |
| Audio driver crash | Kernel panic | Restart process, ~10ms |
10.8.7 Crash History and Auto-Demotion
The kernel tracks per-driver crash statistics:
crash_count[driver_id] within window (default: 1 hour)
0-2 crashes: Reload at same tier
3+ crashes: Demote to next lower tier (if minimum_tier allows)
Log warning, notify admin
5+ crashes: Transition to Quarantined (driver permanently disabled); manual re-enable via sysfs. Log critical alert
A Tier 1 driver that crashes 3 times is demoted to Tier 2 (full process isolation), accepting the performance penalty for increased safety. An administrator can override this policy.
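The thresholds above are a pure function of the windowed crash count, which makes the policy easy to test in isolation. A sketch with illustrative names:

```rust
/// Recovery decision for a driver, given its crash count in the current window.
#[derive(Debug, PartialEq, Eq)]
pub enum RecoveryAction {
    /// 0-2 crashes: reload at the same tier.
    ReloadSameTier,
    /// 3-4 crashes: demote to the next lower tier (if minimum_tier allows).
    DemoteOneTier,
    /// 5+ crashes: quarantine (driver disabled); manual re-enable via sysfs.
    Quarantine,
}

/// Pure policy function mirroring the thresholds in the text.
pub fn recovery_action(crashes_in_window: u32) -> RecoveryAction {
    match crashes_in_window {
        0..=2 => RecoveryAction::ReloadSameTier,
        3..=4 => RecoveryAction::DemoteOneTier,
        _ => RecoveryAction::Quarantine,
    }
}
```

Keeping the decision separate from the crash bookkeeping also makes the administrator override a matter of substituting a different policy function.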
See also: Section 12.6 (Live Kernel Evolution) extends crash recovery to proactively replace core kernel components at runtime, reusing the same state-export/reload mechanism. Section 19.1 (Fault Management) adds predictive telemetry and diagnosis before crashes occur.
10.9 USB Class Drivers and Mass Storage
USB devices follow a class-based driver model. The USB host controller driver (xHCI for USB 3.x, EHCI for USB 2.0) is a Tier 1 platform driver that manages host controller hardware and the root hub. Class drivers are layered above it and bind to devices by USB class code, subclass, and protocol — not by vendor/product ID — giving a single driver coverage across all standards-compliant devices of a class.
10.9.1 USB Host Controller (xHCI, Tier 1)
The xHCI driver (USB 3.2 specification) manages:
- Transfer ring management: each endpoint has a ring buffer (producer/consumer pointers in memory). The driver enqueues Transfer Request Blocks (TRBs); the controller processes them and posts Transfer Event TRBs to the Event Ring.
- Command ring: host-issued commands (Enable Slot, Disable Slot, Configure Endpoint, Reset Device) use a separate command ring.
- Interrupt moderation: MSI-X per-interrupter; Event Ring Segment Table (ERST) maps event ring memory to the controller.
Device enumeration: root hub port status change → enumerate device at default
address 0 → GET_DESCRIPTOR (device, configuration, interface, endpoint) →
assign address via SET_ADDRESS → bind class driver based on bDeviceClass or
bInterfaceClass.
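Class binding starts from fields of the 18-byte device descriptor returned by GET_DESCRIPTOR. A sketch of extracting the binding-relevant fields, treating the device's bytes as untrusted (struct name is illustrative; offsets follow the standard USB device descriptor layout):

```rust
/// Fields of the standard 18-byte USB device descriptor that drive
/// class-driver binding (offsets per USB 2.0 spec §9.6.1).
#[derive(Debug, PartialEq, Eq)]
pub struct DeviceDescriptorSummary {
    pub device_class: u8,    // bDeviceClass (0x00 = class defined per-interface)
    pub device_subclass: u8, // bDeviceSubClass
    pub device_protocol: u8, // bDeviceProtocol
    pub vendor_id: u16,      // idVendor (little-endian on the wire)
    pub product_id: u16,     // idProduct
}

/// Parse a GET_DESCRIPTOR(device) response. Returns None if the length or
/// descriptor-type byte is wrong; the device is untrusted input.
pub fn parse_device_descriptor(d: &[u8]) -> Option<DeviceDescriptorSummary> {
    if d.len() < 18 || d[0] != 18 || d[1] != 0x01 {
        return None; // bLength must be 18, bDescriptorType must be DEVICE (1)
    }
    Some(DeviceDescriptorSummary {
        device_class: d[4],
        device_subclass: d[5],
        device_protocol: d[6],
        vendor_id: u16::from_le_bytes([d[8], d[9]]),
        product_id: u16::from_le_bytes([d[10], d[11]]),
    })
}
```

When `device_class` is 0x00, binding proceeds per-interface using bInterfaceClass from the configuration descriptor, as noted above.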
10.9.2 USB Mass Storage (UMS) and USB Attached SCSI (UAS)
Both protocols expose USB storage devices as block devices to umka-block.
UMS (USB Mass Storage, Bulk-Only Transport):
- Wraps SCSI commands in a Command Block Wrapper (CBW) sent over a bulk-out
endpoint; device responds with data and a Command Status Wrapper (CSW) on
bulk-in. One outstanding command at a time.
- Device registers as BlockDevice with umka-block upon successful SCSI
INQUIRY → READ CAPACITY(16) sequence.
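The CBW framing described above follows the Bulk-Only Transport wire format (31 bytes). A layout sketch; the `inquiry_cbw` helper is illustrative:

```rust
/// USB Mass Storage Bulk-Only Transport Command Block Wrapper (31 bytes,
/// per the USB MSC BOT 1.0 specification). Sent on the bulk-out endpoint.
#[repr(C, packed)]
pub struct CommandBlockWrapper {
    pub d_cbw_signature: u32,            // 0x43425355 ("USBC", little-endian)
    pub d_cbw_tag: u32,                  // echoed back in the matching CSW
    pub d_cbw_data_transfer_length: u32, // expected data-phase length
    pub bm_cbw_flags: u8,                // bit 7: 1 = data-in (device to host)
    pub b_cbw_lun: u8,                   // logical unit number (bits 3:0)
    pub b_cbwcb_length: u8,              // SCSI CDB length, 1..=16
    pub cbwcb: [u8; 16],                 // SCSI command descriptor block
}

pub const CBW_SIGNATURE: u32 = 0x4342_5355;

/// Build a CBW for SCSI INQUIRY (opcode 0x12), the first command of the
/// INQUIRY → READ CAPACITY(16) registration sequence described above.
pub fn inquiry_cbw(tag: u32, alloc_len: u8) -> CommandBlockWrapper {
    let mut cdb = [0u8; 16];
    cdb[0] = 0x12;      // INQUIRY opcode
    cdb[4] = alloc_len; // allocation length
    CommandBlockWrapper {
        d_cbw_signature: CBW_SIGNATURE,
        d_cbw_tag: tag,
        d_cbw_data_transfer_length: alloc_len as u32,
        bm_cbw_flags: 0x80, // data-in
        b_cbw_lun: 0,
        b_cbwcb_length: 6, // INQUIRY uses a 6-byte CDB
        cbwcb: cdb,
    }
}
```

`#[repr(C, packed)]` is what makes the struct exactly 31 bytes; the device rejects CBWs of any other length.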
UAS (USB Attached SCSI, USB 3.0+):
- Four-endpoint protocol (command, status, data-in, data-out). Multiple
outstanding commands (up to 65535 via stream IDs). Significantly higher
throughput and lower latency than UMS for fast SSDs.
- Preferred over UMS when both are supported (bInterfaceProtocol = 0x62).
- Same BlockDevice registration as UMS; umka-block sees no difference.
Hotplug: USB device removal triggers an Unregister event in the device
registry (Section 10.5). The volume layer (Section 14.3) transitions dependent block devices to
DEVICE_FAILED state. Auto-mount/unmount policy is handled by a userspace
daemon (udev-compatible via umka-compat) reacting to device registry events.
Tier classification: UMS/UAS drivers are Tier 2 — they communicate over USB (inherently higher latency than PCIe), and the attack surface of USB storage firmware justifies full process isolation over the modest CPU overhead.
10.9.3 USB4 and Thunderbolt
USB4 (based on Thunderbolt 3 protocol) and Thunderbolt 3/4 are high-bandwidth interconnects (40 Gbps) that tunnel multiple protocols — PCIe, DisplayPort, USB — over a single cable. They are relevant across server (external NVMe enclosures, 40GbE NICs), workstation (external GPUs), and embedded (dock stations) contexts.
Architecture: A USB4/Thunderbolt port is controlled by a retimer/router chip with its own firmware. The host-side driver configures the router and establishes tunnels. The tunneled protocols then appear as native devices:
Physical cable (USB4/TB4)
└── USB4 router (host controller + retimer firmware)
├── PCIe tunnel → appears as PCIe device (NVMe, GPU, NIC)
├── DisplayPort tunnel → appears as DP connector (Section 20.4.3, `20-user-io.md`)
└── USB tunnel → appears as USB hub → USB class devices
Kernel responsibilities:
- Router enumeration: Discover USB4 routers via their management interface (MMIO registers or USB control endpoint). Read the router topology descriptor to find upstream/downstream adapters and their capabilities.
- IOMMU enforcement (mandatory for PCIe tunnels): Before establishing a PCIe tunnel to an external device, the kernel allocates an IOMMU domain for the tunneled device. The PCIe device behind the tunnel is treated identically to a native PCIe device — it gets its own IOMMU domain, its own device registry entry, and its driver follows the normal Tier 1/2 model. IOMMU protection is not optional; external PCIe devices are untrusted by definition.
- Tunnel authorization: The kernel blocks PCIe tunnel establishment until an authorization signal is received via sysfs: writing 1 to /sys/bus/thunderbolt/devices/<device>/authorized authorizes the device; writing 0 de-authorizes and tears down the tunnel. This is the kernel's policy interface — what triggers the write (user prompt, pre-approved list, automatic trust) is userspace policy.
- Hotplug lifecycle:
  - Connect: router detects device → kernel enumerates → IOMMU domain allocated → authorization check → tunnel established → PCIe/DP/USB device appears
  - Disconnect: router reports link-down → kernel tears down tunnel → IOMMU domain revoked → device registry Unregister event → volume/display/USB layers handle disappearance gracefully
/// USB4/Thunderbolt router state.
pub struct Usb4Router {
/// Router hardware generation and capabilities.
pub gen: Usb4Generation,
/// Upstream adapter (host-facing port).
pub upstream: Usb4Adapter,
/// Downstream adapters (device-facing ports).
pub downstream: Vec<Usb4Adapter>,
/// Currently active tunnels.
pub tunnels: Vec<Usb4Tunnel>,
/// IOMMU domains for active PCIe tunnels.
pub pcie_domains: BTreeMap<Usb4AdapterId, IommuDomain>,
}
#[repr(u32)]
pub enum Usb4Generation {
Usb4Gen2 = 2, // 20 Gbps
Usb4Gen3 = 3, // 40 Gbps
Tb3 = 30, // Thunderbolt 3 (40 Gbps)
Tb4 = 40, // Thunderbolt 4 (40 Gbps, mandatory PCIe + DP)
}
pub struct Usb4Tunnel {
pub kind: Usb4TunnelKind,
pub adapter_id: Usb4AdapterId,
pub iommu_domain: Option<IommuDomain>, // Some for PCIe tunnels
}
#[repr(u32)]
pub enum Usb4TunnelKind {
Pcie = 0,
DisplayPort = 1,
Usb3 = 2,
}
IOMMU domain lifecycle on disconnect/reconnect:
To prevent IOMMU domain reuse on rapid disconnect/reconnect sequences:
- On disconnect: the device's IOMMU domain is immediately invalidated (all IOMMU mappings flushed). The domain ID enters a quarantine period (TTL = 5 seconds). The device's CAP_DMA capability is revoked immediately via cap_revoke(device_cap_handle).
- Quarantine: the quarantined domain ID is reserved and cannot be assigned to any new device until TTL expires and all in-flight DMA transactions are confirmed drained (via iommu_domain_drain_wait()).
- On reconnect: the reconnecting device receives a fresh IOMMU domain with a new domain ID. It never inherits the quarantined domain. Authorization re-runs from scratch (user prompt or policy check).
- DMA capability binding: CAP_DMA is bound to the IOMMU domain ID, not the device identity. A reconnecting device gets a new CAP_DMA capability after authorization; the old capability is permanently revoked.
This prevents the race where old IOMMU mappings remain active when a new device appears at the same slot, and prevents capability reuse across device identities.
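The quarantine bookkeeping reduces to a small map from domain ID to expiry time. A host-testable sketch with illustrative names (the real allocator would additionally gate on iommu_domain_drain_wait() completing):

```rust
use std::collections::BTreeMap; // alloc::collections::BTreeMap in the kernel

/// Quarantine TTL from the text: 5 seconds, in nanoseconds.
pub const DOMAIN_QUARANTINE_TTL_NS: u64 = 5_000_000_000;

/// Tracks quarantined IOMMU domain IDs (illustrative structure).
pub struct DomainQuarantine {
    until_ns: BTreeMap<u32, u64>, // domain_id → quarantine expiry (monotonic ns)
}

impl DomainQuarantine {
    pub fn new() -> Self {
        Self { until_ns: BTreeMap::new() }
    }

    /// Called on disconnect, after mappings are flushed and CAP_DMA revoked.
    pub fn quarantine(&mut self, domain_id: u32, now_ns: u64) {
        self.until_ns.insert(domain_id, now_ns + DOMAIN_QUARANTINE_TTL_NS);
    }

    /// May this domain ID be handed to a newly attached device?
    pub fn may_allocate(&mut self, domain_id: u32, now_ns: u64) -> bool {
        match self.until_ns.get(&domain_id) {
            Some(&expiry) if now_ns < expiry => false, // still quarantined
            Some(_) => {
                self.until_ns.remove(&domain_id); // TTL expired: release
                true
            }
            None => true,
        }
    }
}
```

Because reconnecting devices always get a fresh domain ID, `may_allocate` failing for a quarantined ID never blocks a reconnect; it only prevents the stale ID from being recycled early.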
Firmware updates for TB controllers: Controller firmware is updatable via
the NVM update protocol (vendor-specific, typically via the thunderbolt sysfs
interface). The kernel exposes the firmware version and provides a write interface
for firmware blobs. Actual firmware image selection and update policy is userspace.
Relationship to Section 5.1.3: External PCIe devices attached via USB4/Thunderbolt use the same IOMMU hard boundary and unilateral controls (bus master disable, FLR, slot power) as internal PCIe devices. If the external device runs an UmkaOS peer kernel (Section 5.1.2.2), it participates in the cluster exactly as an internal device would — the tunnel is transparent to the cluster protocol.
10.9.3.1 Authorization TOCTOU Prevention
The authorization flow described above has a time-of-check / time-of-use (TOCTOU) window: a malicious device could present one identity at authorization time and then swap its firmware or topology between authorization and PCIe tunnel enumeration, gaining access to an authorized tunnel under a different identity. UmkaOS closes this window with a cryptographic authorization token that binds to immutable hardware identifiers, plus a mandatory re-verification step at the start of enumeration.
Authorization token:
/// Default authorization token lifetime for interactive sessions.
/// After this duration, the token expires and a new authorization is required.
/// Balances security (limits damage window if device is swapped) with usability.
/// Overridable via the `umka.thunderbolt_auth_timeout_s` kernel parameter.
pub const TBT_AUTH_DEFAULT_TIMEOUT_S: u64 = 1800; // 30 minutes
/// Timeout waiting for the userspace authorization daemon to respond.
/// If no response within this window, the tunnel request is denied (fail-closed).
/// Prevents hangs when the authorization daemon is unresponsive.
pub const TBT_AUTH_DAEMON_RESPONSE_TIMEOUT_S: u64 = 30;
/// Cryptographic authorization token for a USB4/Thunderbolt PCIe tunnel.
/// Generated by the security manager when the user authorizes a device.
/// Stored in umka-core memory (not in the driver's isolation domain).
pub struct TbtAuthToken {
/// HMAC-SHA256(auth_key, device_uuid || device_serial || topology_path_bytes).
/// `auth_key` is a kernel-private key generated at boot (never exported).
pub token: [u8; 32],
/// Thunderbolt device UUID as reported by router firmware (immutable field).
pub device_uuid: [u8; 16],
/// Thunderbolt device serial number as reported by router firmware (immutable).
pub device_serial: u64,
/// Topology path (upstream router UIDs + adapter indices) at authorization time.
pub topology_path: TbtTopologyPath,
/// Monotonic nanosecond timestamp when authorization was granted.
pub authorized_at_ns: u64,
/// Expiry timestamp (monotonic ns). 0 means valid until disconnect.
/// Defaults to authorized_at_ns + TBT_AUTH_DEFAULT_TIMEOUT_S * 1_000_000_000
/// for interactive sessions. Set to 0 for explicit "valid until disconnect" policy.
pub expires_at_ns: u64,
}
/// Topology path: ordered list of (router_uid, adapter_index) pairs from the
/// host controller down to the authorized device. Max depth = 6 hops (USB4 spec).
pub struct TbtTopologyPath {
pub hops: [(u64, u8); 6], // (router_uid, adapter_index)
pub depth: u8,
}
Headless and daemon-less authorization policy: If no USB4/Thunderbolt
authorization daemon is registered (headless server, container, or daemon crash),
ALL PCIe tunnel requests are denied by default until explicit authorization via
umka-tbtctl authorize <uuid>.
USB-class endpoints (not PCIe tunnels) are unaffected by this policy.
The deny-default is logged at KERN_INFO level.
Daemon response timeout: If no daemon response arrives within
TBT_AUTH_DAEMON_RESPONSE_TIMEOUT_S seconds, the kernel auto-denies the tunnel
and logs the timeout event at KERN_WARNING.
Token generation at authorization time:
When the security manager grants authorization (in response to a write of 1 to
/sys/bus/thunderbolt/devices/<device>/authorized):
- Read the device's UUID and serial from the router firmware via the Thunderbolt management interface (read-only fields in the router topology descriptor; these fields are populated at cable plug-in by the router firmware from the device's identity block and cannot be modified by software).
- Snapshot the current topology path (upstream router UIDs + adapter indices from host controller down to this device).
- Compute HMAC-SHA256(auth_key, device_uuid || device_serial || topology_path_bytes) using the kernel's boot-time-generated auth_key.
- Store the resulting TbtAuthToken in umka-core memory, associated with the device's Usb4Router entry.
Re-verification at enumeration time:
Before the USB4/TBT driver establishes the PCIe tunnel and presents the tunneled device to the PCIe bus, the kernel performs a mandatory re-verification:
PCIe tunnel enumeration protocol (enforced by umka-core, not the driver):
1. USB4/TBT driver requests PCIe tunnel enumeration for adapter <id>.
2. Security manager retrieves the stored TbtAuthToken for that adapter.
3. If no token exists: enumeration denied (PermissionDenied).
4. If token has expired (expires_at_ns != 0 && monotonic_ns() > expires_at_ns):
revoke authorization, log security event, return PermissionDenied.
5. Re-read device UUID + serial from router firmware.
6. Re-snapshot current topology path.
7. Recompute HMAC-SHA256(auth_key, uuid || serial || path_bytes).
8. Compare computed token with stored token — must match byte-for-byte.
9. If mismatch:
a. Log security event: "TBT TOCTOU: device identity changed after authorization,
adapter <id>, expected UUID <stored_uuid>, got <current_uuid>"
b. Revoke authorization (clear authorized bit, destroy stored token).
c. Disconnect: instruct the router firmware to disable the PCIe adapter.
d. Return SecurityViolation to the caller.
10. If match: proceed with PCIe tunnel establishment and enumeration,
holding a reference to the auth token for the duration of enumeration.
11. During enumeration: verify that the PCIe device hierarchy rooted at the
tunnel matches the topology snapshot (router count, UIDs at each hop).
Any discrepancy aborts enumeration with the same TOCTOU revocation sequence.
12. Post-enumeration: associate the PCIe device nodes with this auth token.
Store the token reference in each `DeviceDescriptor` for the tunneled devices.
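Step 8's byte-for-byte comparison should be constant-time: a short-circuiting comparison leaks, via timing, how many leading bytes of the stored token a forged value matches. A minimal sketch of the comparison primitive:

```rust
/// Constant-time equality for 32-byte HMAC tokens. OR-accumulating the XOR
/// of every byte pair makes the running time independent of where (or
/// whether) the inputs differ, unlike a short-circuiting `==`.
pub fn token_eq(a: &[u8; 32], b: &[u8; 32]) -> bool {
    let mut diff = 0u8;
    for i in 0..32 {
        diff |= a[i] ^ b[i];
    }
    diff == 0
}
```

In practice the compiler must not be allowed to re-introduce an early exit; kernels typically route such comparisons through a volatile read or an optimization barrier.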
Topology change monitoring after enumeration:
After the PCIe tunnel is established, the USB4 driver monitors router firmware events (hotplug notifications, link-state change interrupts from the host controller):
- Router added or removed: any unexpected change in the topology between the host controller and the authorized device triggers re-verification. If the re-verify fails (token mismatch due to topology change), the kernel disconnects the PCIe tunnel and revokes authorization.
- Link-down on authorized adapter: treated as a disconnect event. The auth token is destroyed. Reconnection requires a fresh authorization cycle.
- Router UID mismatch: if a router at any hop in the stored topology path reports a different UID than the token recorded, the kernel disconnects immediately. This catches the attack where an intermediate router (not the endpoint device) is replaced.
The topology monitoring event loop runs in the USB4 host controller driver (Tier 1). Events are delivered via the host controller's interrupt, processed in the driver's interrupt handler, and dispatched to the security manager via an MPSC ring.
10.10 I2C/SMBus Bus Framework
I2C (Inter-Integrated Circuit) and SMBus (System Management Bus, a subset of I2C) are low-speed serial buses used throughout the hardware stack — in servers as well as consumer and embedded devices:
Server / datacenter uses:
- BMC (Baseboard Management Controller) sensor buses: CPU, DIMM, and VRM temperature sensors; fan speed controllers; PSU monitoring
- PMBus (Power Management Bus, layered on SMBus): voltage regulators, power sequencing, power rail telemetry
- SPD (Serial Presence Detect): JEDEC EEPROM on each DIMM, read at boot for memory training; JEDEC JEP106 manufacturer ID, capacity, speed grade, thermal sensor register on DDR4/5 DIMMs
- IPMI satellite controllers (IPMB — IPMI over I2C)
Consumer / embedded uses:
- Touchpads and touchscreens (I2C-HID protocol, Section 10.10.3 below)
- Audio codecs (I2C control path for volume, routing, power state)
- Ambient light sensors, accelerometers (shock/vibration detection)
- Battery and charger controllers (Smart Battery System over SMBus)
10.10.1 I2C Bus Trait
Platform I2C controller drivers (Intel LPSS, AMD FCH, Synopsys DesignWare,
Broadcom BCM2835, Aspeed AST2600 BMC) implement the I2cBus trait. The trait
is in umka-core/src/bus/i2c.rs.
/// I2C device address (7-bit, right-aligned; 0x00–0x7F).
pub type I2cAddr = u8;
/// I2C transfer result. Derives allow `==`/`!=` comparisons at call sites
/// (e.g., checking for `I2cResult::Ok` in interrupt paths).
#[repr(u32)]
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
pub enum I2cResult {
    Ok = 0,
    /// No ACK (device not present or not responding).
    NoAck = 1,
    /// Bus arbitration lost (multi-master collision).
    ArbitrationLost = 2,
    /// Timeout (clock stretching exceeded or device hung).
    Timeout = 3,
    InvalidParam = 4,
}
/// I2C bus trait. Implemented by platform-specific controller drivers.
/// Used only within Rust-internal code (same compilation unit). For KABI
/// boundaries between separately-compiled modules, use `I2cBusVTable` below.
pub trait I2cBus: Send + Sync {
/// Combined write-then-read (I2C repeated START).
/// Typical pattern: write register address, read value.
fn transfer(&self, addr: I2cAddr, write: &[u8], read: &mut [u8]) -> I2cResult;
fn write(&self, addr: I2cAddr, data: &[u8]) -> I2cResult {
self.transfer(addr, data, &mut [])
}
fn read(&self, addr: I2cAddr, buf: &mut [u8]) -> I2cResult {
self.transfer(addr, &[], buf)
}
}
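As a usage sketch, a mock bus implementing the trait shows the write-then-read register pattern the default methods build on. The trait and result enum are restated in minimal form so the example is self-contained; `MockSensor` and its register-window behavior are hypothetical:

```rust
use std::sync::Mutex;

/// Minimal restatement of the result enum and trait from above.
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
pub enum I2cResult {
    Ok,
    NoAck,
}

pub trait I2cBus: Send + Sync {
    /// Combined write-then-read (I2C repeated START).
    fn transfer(&self, addr: u8, write: &[u8], read: &mut [u8]) -> I2cResult;
}

/// Mock device at address 0x48 with 256 eight-bit registers. The first
/// written byte selects the register, mimicking the common
/// write-register-address-then-read pattern.
pub struct MockSensor {
    pub regs: Mutex<[u8; 256]>,
}

impl I2cBus for MockSensor {
    fn transfer(&self, addr: u8, write: &[u8], read: &mut [u8]) -> I2cResult {
        if addr != 0x48 {
            return I2cResult::NoAck; // only one device on this mock bus
        }
        let regs = self.regs.lock().unwrap();
        let base = write.first().copied().unwrap_or(0) as usize;
        for (i, b) in read.iter_mut().enumerate() {
            *b = regs[(base + i) % 256]; // sequential read with wraparound
        }
        I2cResult::Ok
    }
}
```

A mock like this lets device drivers (hwmon chips, touchpads) be unit-tested on the host without controller hardware.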
/// C-ABI vtable for I2C bus controller operations, used at KABI boundaries.
/// When a Tier 1 HID/sensor driver needs to call the I2C bus controller (which
/// may be a separately-compiled Tier 0 module), it receives an `I2cDevice`
/// (below) containing a pointer to this vtable rather than an `Arc<dyn I2cBus>`.
#[repr(C)]
pub struct I2cBusVTable {
/// Vtable size in bytes. Always `core::mem::size_of::<I2cBusVTable>()` for
/// the implementing driver; receivers use it for version compatibility.
pub vtable_size: u64,
/// Combined write-then-read (I2C repeated START).
/// `ctx`: opaque per-bus context pointer (first arg to all operations).
pub transfer: unsafe extern "C" fn(
ctx: *mut c_void,
addr: I2cAddr,
write: *const u8,
write_len: u32,
read: *mut u8,
read_len: u32,
) -> I2cResult,
}
/// Handle to a device at a fixed address on a specific I2C bus.
/// Uses C-ABI compatible vtable pointer + opaque context instead of
/// `Arc<dyn I2cBus>` to allow use across KABI boundaries between separately
/// compiled Tier 0 bus controller and Tier 1 device driver modules.
pub struct I2cDevice {
/// Pointer to the bus controller's operation vtable. Points to a static
/// vtable allocated in the bus controller module; never null.
pub bus_ops: *const I2cBusVTable,
/// Opaque per-bus context pointer passed as the first argument to every
/// vtable function. Points to the controller driver's internal bus state.
pub bus_ctx: *mut c_void,
pub addr: I2cAddr,
}
impl I2cDevice {
pub fn read_reg(&self, reg: u8) -> Result<u8, I2cResult> {
let mut buf = [0u8];
// SAFETY: bus_ops and bus_ctx come from the bus controller at probe time.
let result = unsafe {
((*self.bus_ops).transfer)(
self.bus_ctx, self.addr, &reg as *const u8, 1, buf.as_mut_ptr(), 1,
)
};
match result {
I2cResult::Ok => Ok(buf[0]),
e => Err(e),
}
}
pub fn write_reg(&self, reg: u8, val: u8) -> I2cResult {
let data = [reg, val];
// SAFETY: bus_ops and bus_ctx are valid; data is stack-local and valid for transfer duration.
unsafe {
((*self.bus_ops).transfer)(
self.bus_ctx, self.addr, data.as_ptr(), 2, core::ptr::null_mut(), 0,
)
}
}
/// Read a 16-bit little-endian register (common on SMBus devices).
pub fn read_reg16_le(&self, reg: u8) -> Result<u16, I2cResult> {
let mut buf = [0u8; 2];
// SAFETY: bus_ops/bus_ctx valid; buf is stack-local and valid for transfer duration.
let result = unsafe {
((*self.bus_ops).transfer)(
self.bus_ctx, self.addr, &reg as *const u8, 1, buf.as_mut_ptr(), 2,
)
};
match result {
I2cResult::Ok => Ok(u16::from_le_bytes(buf)),
e => Err(e),
}
}
}
Tier classification: I2C controller drivers are Tier 1 — they are platform-integrated and accessed from multiple other Tier 1 drivers (audio, sensor, battery). Device drivers using I2C (touchpads, sensors) follow their own tier classification based on their function.
Device enumeration: I2C devices are enumerated from ACPI (_HID, _CRS
with I2cSerialBusV2 resource) or device-tree compatible strings. The bus
manager matches each ACPI/DT node to a registered I2C device driver.
10.10.2 SMBus and Hardware Sensors
SMBus restricts I2C to well-defined transaction types (Quick Command, Send Byte,
Read Byte, Read Word, Block Read) and adds a PEC (Packet Error Code) byte for
data integrity. The UmkaOS SMBus layer wraps I2cBus and enforces SMBus
transaction semantics.
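The PEC byte is a CRC-8 over the addressed transaction bytes (polynomial x^8 + x^2 + x + 1, i.e. 0x07; initial value 0x00; no bit reflection), covering the slave address with R/W bit plus all command and data bytes. A bitwise sketch of the checksum the SMBus layer would append and verify:

```rust
/// SMBus Packet Error Code: CRC-8 with polynomial 0x07, init 0x00,
/// no reflection, no final XOR (the standard CRC-8/SMBUS parameters).
pub fn smbus_pec(bytes: &[u8]) -> u8 {
    let mut crc = 0u8;
    for &b in bytes {
        crc ^= b;
        for _ in 0..8 {
            // Shift left; on carry-out, fold in the polynomial.
            crc = if crc & 0x80 != 0 { (crc << 1) ^ 0x07 } else { crc << 1 };
        }
    }
    crc
}
```

On a Write Byte transaction, for example, the PEC is computed over `[addr_w, command, data]` and sent as a fourth byte; the device recomputes it and NACKs the transfer on mismatch.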
10.10.2.1 Hardware Monitoring (hwmon) Interface
Server and workstation motherboards expose dozens of sensors over I2C/SMBus.
UmkaOS provides a HwmonDevice trait analogous to Linux's hwmon subsystem:
/// A hardware monitor device (temperature, voltage, fan, current sensors).
pub trait HwmonDevice: Send + Sync {
/// Device name (e.g., "nct6779", "ina3221", "max31790").
fn name(&self) -> &str;
/// Read a temperature sensor in millidegrees Celsius.
/// Returns None if the sensor index is not present.
fn temperature_mc(&self, index: u8) -> Option<i32>;
/// Read a fan speed in RPM.
fn fan_rpm(&self, index: u8) -> Option<u32>;
/// Read a voltage in millivolts.
fn voltage_mv(&self, index: u8) -> Option<i32>;
/// Read a current in milliamperes.
fn current_ma(&self, index: u8) -> Option<i32>;
/// Set a fan PWM duty cycle (0–255).
fn set_fan_pwm(&self, index: u8, pwm: u8) -> Result<(), I2cResult>;
}
Registered HwmonDevice instances are exposed via sysfs under
/sys/class/hwmon/hwmon<N>/. Userspace daemons (fancontrol, lm-sensors,
IPMI daemons, monitoring agents like Prometheus node-exporter) read these
paths without kernel modifications. UmkaOS's hwmon sysfs layout is compatible
with Linux's hwmon ABI.
10.10.2.2 PMBus (Power Management Bus)
PMBus is a layered protocol over SMBus for communicating with power conversion devices (VRMs, PSUs, battery chargers). PMBus defines a standardised command set (PMBUS_READ_VIN, PMBUS_READ_VOUT, PMBUS_READ_IOUT, PMBUS_READ_TEMPERATURE_1, etc.) with standardised data formats.
The UmkaOS PMBus driver:
1. Probes devices via ACPI/DT with pmbus compatible string.
2. Reads PMBUS_MFR_ID, PMBUS_MFR_MODEL for identification.
3. Registers a HwmonDevice exposing all PMBus telemetry channels.
4. Monitors STATUS_WORD for fault conditions (over-voltage, over-current,
over-temperature, fan fault) and posts HwmonFaultEvent to the event
subsystem (Section 6.6, 06-scheduling.md) so userspace daemons can react.
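Most of the PMBus read commands above (READ_VIN, READ_IOUT, READ_TEMPERATURE_1) return values in the LINEAR11 format: an 11-bit two's-complement mantissa Y in bits [10:0] and a 5-bit two's-complement exponent N in bits [15:11], with value = Y × 2^N. A decoding sketch into integer milli-units (READ_VOUT uses a separate, VOUT_MODE-dependent format not shown here):

```rust
/// Decode a PMBus LINEAR11 word into milli-units (e.g., millivolts for
/// READ_VIN, milliamps for READ_IOUT).
pub fn linear11_to_milli(raw: u16) -> i64 {
    // Sign-extend the 5-bit exponent from bits [15:11].
    let exp = ((raw as i16) >> 11) as i32;
    // Sign-extend the 11-bit mantissa from bits [10:0].
    let mant = (((raw & 0x07FF) << 5) as i16 >> 5) as i64;
    let milli = mant * 1000;
    if exp >= 0 {
        milli << exp
    } else {
        milli >> (-exp) // arithmetic shift: rounds toward negative infinity
    }
}
```

Integer milli-units keep the hwmon path float-free, matching the millivolt/milliamp units of the HwmonDevice trait above.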
10.10.2.3 DIMM SPD and Thermal Sensors
DDR4/DDR5 DIMMs have an SPD EEPROM at I2C address 0x50–0x57 (slot-indexed). The memory controller driver reads SPD at boot for training parameters. DDR4 DIMMs also expose a thermal sensor at address 0x18–0x1F per the JEDEC JC42.4 interface (TSE2004-class combined EEPROM and thermal sensor parts).
/// SPD EEPROM read (partial — first 256 bytes sufficient for JEDEC training).
pub fn read_spd(bus: &dyn I2cBus, slot: u8) -> Result<[u8; 256], I2cResult> {
    let addr = 0x50u8 | (slot & 0x07);
    let mut buf = [0u8; 256];
    // SPD page select not needed for first 256 bytes on DDR4.
    // `I2cResult` is a plain status enum, not a `Result`, so map it explicitly
    // instead of using `?`.
    match bus.transfer(addr, &[0x00], &mut buf) {
        I2cResult::Ok => Ok(buf),
        e => Err(e),
    }
}
/// DDR4 thermal sensor read (JC42.4 TS register 0x05: 13-bit two's complement,
/// 0.0625°C LSB, flag bits in [15:13]).
pub fn read_dimm_temp_mc(bus: &dyn I2cBus, slot: u8) -> Result<i32, I2cResult> {
    let addr = 0x18u8 | (slot & 0x07);
    let mut buf = [0u8; 2];
    // JEDEC JC42.4 thermal sensors transmit MSB first (big-endian).
    match bus.transfer(addr, &[0x05], &mut buf) {
        I2cResult::Ok => {}
        e => return Err(e),
    }
    let raw = u16::from_be_bytes(buf);
    // Mask the flag bits [15:13], then sign-extend the 13-bit value from bit 12.
    let temp_raw = ((((raw & 0x1FFF) << 3) as i16) >> 3) as i32;
    Ok(temp_raw * 625 / 10) // 1/16°C units → millidegrees Celsius
}
10.10.3 I2C-HID Protocol
I2C-HID (HID over I2C, HIDI2C v1.0 specification) is used for touchpads, touchscreens, fingerprint readers, and other HID devices with I2C interfaces. The kernel implements the transport layer; HID report parsing is shared with the USB HID stack (Section 10.9.1).
Protocol flow:
1. ACPI reports device with PNP0C50 (_HID) or ACPI0C50; _CRS provides
I2C address, IRQ GPIO line, and descriptor register address.
2. Driver reads HID descriptor (30 bytes) from the descriptor register.
3. Driver reads HID Report Descriptor and passes it to the shared HidParser.
4. Device asserts IRQ GPIO (falling edge) when a new input report is ready.
5. ISR: reads input report from the input register address specified in
descriptor; parses via HidParser; posts InputEvent to the input
subsystem ring buffer (Section 20.1, 20-user-io.md).
#[repr(C, packed)]
pub struct I2cHidDescriptor {
pub length: u16, // Must be 30 (per HIDI2C v1.0 spec)
pub bcd_version: u16, // 0x0100 for v1.0
pub report_desc_len: u16,
pub report_desc_reg: u16,
pub input_reg: u16,
pub max_input_len: u16,
pub output_reg: u16,
pub max_output_len: u16,
pub cmd_reg: u16,
pub data_reg: u16,
pub vendor_id: u16,
pub product_id: u16,
pub version_id: u16,
    /// Reserved: the HIDI2C v1.0 descriptor is 30 bytes total; the 13 u16
    /// fields above account for 26 bytes, and the spec reserves the final
    /// 4 bytes. With #[repr(C, packed)] the struct is exactly 30 bytes —
    /// when reading from the device, read exactly 30 bytes into this struct.
    pub _reserved: [u8; 4],
}
HID parser security bounds (all input from the HID device — USB or I2C — is UNTRUSTED):
/// Maximum HID report descriptor byte length.
/// USB HID spec §6.2.1 recommends keeping descriptors under 4096 bytes.
/// UmkaOS enforces this as a hard limit to prevent parser state explosion
/// from untrusted (potentially malicious) USB devices.
pub const HID_REPORT_DESC_MAX_BYTES: usize = 4096;
/// Maximum number of usage/field items per HID report ID.
/// Limits parser memory to HID_MAX_FIELDS_PER_REPORT × sizeof(HidField) per report.
pub const HID_MAX_FIELDS_PER_REPORT: usize = 256;
/// Maximum number of report descriptors per HID device.
/// (Enforced structurally by ArrayVec<HidReport, HID_MAX_REPORTS>.)
pub const HID_MAX_REPORTS: usize = 16;
HID descriptor parsing error handling (all input is UNTRUSTED — it comes from the device):
- Descriptor exceeds HID_REPORT_DESC_MAX_BYTES → return Err(HidError::DescriptorTooLong)
- Unknown item tag → skip item per USB HID §6.2.2.7 (long-item skipping) and continue
parsing (permissive, for hardware compatibility with quirky devices)
- Fields exceed HID_MAX_FIELDS_PER_REPORT → truncate excess fields, log KERN_WARNING
- report_count × report_size overflows u32 → return Err(HidError::ReportSizeOverflow)
- Descriptor ends mid-item → return Err(HidError::TruncatedDescriptor)
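The bounds above apply at the item-framing layer of the report descriptor: short items carry a 1-byte prefix encoding size (bits [1:0]), type (bits [3:2]), and tag (bits [7:4]), while long items (prefix 0xFE) are skipped. A sketch of bounds-checked item iteration (framing only; building HidReport/HidField state from the items is omitted):

```rust
/// Decode the short items of a HID report descriptor as (tag, type, data)
/// tuples. Long items (prefix 0xFE) are skipped per USB HID §6.2.2.7.
pub fn hid_short_items(desc: &[u8]) -> Result<Vec<(u8, u8, u32)>, &'static str> {
    if desc.len() > 4096 {
        return Err("DescriptorTooLong"); // HID_REPORT_DESC_MAX_BYTES
    }
    let mut items = Vec::new();
    let mut i = 0;
    while i < desc.len() {
        let prefix = desc[i];
        if prefix == 0xFE {
            // Long item: bDataSize at i+1, bLongItemTag at i+2, then payload.
            if i + 2 >= desc.len() {
                return Err("TruncatedDescriptor");
            }
            i += 3 + desc[i + 1] as usize; // skip and continue (permissive)
            if i > desc.len() {
                return Err("TruncatedDescriptor");
            }
            continue;
        }
        let size = match prefix & 0x03 {
            3 => 4, // bSize encoding: 0, 1, 2 bytes, or 3 meaning 4 bytes
            s => s as usize,
        };
        if i + 1 + size > desc.len() {
            return Err("TruncatedDescriptor"); // item ends mid-descriptor
        }
        let mut data = 0u32;
        for (k, &b) in desc[i + 1..i + 1 + size].iter().enumerate() {
            data |= (b as u32) << (8 * k); // item data is little-endian
        }
        items.push((prefix >> 4, (prefix >> 2) & 0x03, data));
        i += 1 + size;
    }
    Ok(items)
}
```

Every advance of `i` is checked against the descriptor length before any byte is read, which is what turns a malicious descriptor into an `Err` rather than an out-of-bounds access.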
The full I2cHidDevice implementation and interrupt handler:
// umka-core/src/hid/i2c_hid.rs
/// I2C-HID driver state.
pub struct I2cHidDevice {
/// I2C device handle.
pub i2c: I2cDevice,
/// Descriptor (fetched at probe time).
pub desc: I2cHidDescriptor,
/// Interrupt GPIO line (from ACPI `_CRS` GpioInt resource).
pub irq_gpio: GpioLine,
/// HID report descriptor (fetched once at probe time). `Box<[u8]>` over
/// `Vec<u8>`: the slice is allocated at probe with the exact length returned
/// by the device and never resized. Prevents accidental reallocation if a
/// method on `Vec` is called after probe.
pub report_desc: Box<[u8]>,
/// Pre-allocated input report buffer sized to `desc.max_input_len` at probe.
/// `Box<[u8]>` over `Vec<u8>`: the fixed-capacity slice prevents reallocation
/// in interrupt context. The interrupt handler writes into `&mut report_buf[..]`
/// via a pre-sized slice — no heap allocation occurs during IRQ handling.
pub report_buf: Box<[u8]>,
/// Parsed HID report parser state. Parses a HID report descriptor
/// (sequence of items per USB HID spec Section 6.2.2) into a structured
/// representation of reports, fields, and usages.
/// Bounds enforced: HID_MAX_REPORTS, HID_MAX_FIELDS_PER_REPORT, HID_REPORT_DESC_MAX_BYTES.
///
/// ```rust
/// pub struct HidParser {
/// /// Parsed report descriptors, indexed by report ID.
/// pub reports: ArrayVec<HidReport, HID_MAX_REPORTS>,
/// }
/// pub struct HidReport {
/// pub report_id: u8,
/// pub report_type: HidReportType, // Input, Output, Feature
/// pub fields: ArrayVec<HidField, HID_MAX_FIELDS_PER_REPORT>,
/// pub total_bits: u32,
/// }
/// pub struct HidField {
/// pub usage_page: u16,
/// pub usage_min: u16,
/// pub usage_max: u16,
/// pub logical_min: i32,
/// pub logical_max: i32,
/// pub bit_offset: u32,
/// pub bit_size: u32,
/// pub count: u32,
/// pub flags: u32, // Variable, Array, Absolute, Wrap, etc.
/// }
/// ```
pub parser: HidParser,
}
impl I2cHidDevice {
    /// Combined write-then-read via the bus controller vtable (same pattern
    /// as `I2cDevice::read_reg` above; `I2cDevice` has no `bus` field, only
    /// the `bus_ops`/`bus_ctx` pair).
    fn xfer(i2c: &I2cDevice, write: &[u8], read: &mut [u8]) -> Result<(), ProbeError> {
        // SAFETY: bus_ops and bus_ctx come from the bus controller at probe
        // time; both slices remain valid for the duration of the transfer.
        let r = unsafe {
            ((*i2c.bus_ops).transfer)(
                i2c.bus_ctx, i2c.addr,
                write.as_ptr(), write.len() as u32,
                read.as_mut_ptr(), read.len() as u32,
            )
        };
        match r {
            I2cResult::Ok => Ok(()),
            e => Err(ProbeError::I2c(e)), // illustrative variant wrapping the bus error
        }
    }
    /// Probe an I2C-HID device. Called when ACPI reports `PNP0C50` (I2C-HID).
    /// Returns the device behind `Arc<SpinLock<..>>` (SpinLock: the kernel's
    /// IRQ-safe spinlock) so the interrupt closure shares, rather than
    /// consumes, the state it needs.
    pub fn probe(i2c: I2cDevice, irq_gpio: GpioLine) -> Result<Arc<SpinLock<Self>>, ProbeError> {
        // Read the 30-byte HID descriptor from register 0x0001 (register
        // numbers are little-endian on the wire, hence the [0x01, 0x00] write).
        let mut desc_buf = [0u8; 30];
        Self::xfer(&i2c, &[0x01, 0x00], &mut desc_buf)?;
        // SAFETY: I2cHidDescriptor is #[repr(C, packed)], matching the 30-byte
        // wire format. read_unaligned is required because the I2C transfer
        // buffer may not be 2-byte aligned.
        let desc: I2cHidDescriptor =
            unsafe { core::ptr::read_unaligned(desc_buf.as_ptr() as *const _) };
        // Read the HID report descriptor (length bounded by
        // HID_REPORT_DESC_MAX_BYTES, enforced inside HidParser::parse).
        let mut report_desc = vec![0u8; desc.report_desc_len as usize];
        Self::xfer(&i2c, &desc.report_desc_reg.to_le_bytes(), &mut report_desc)?;
        // Parse the HID report descriptor to build the parser.
        let parser = HidParser::parse(&report_desc)?;
        // Pre-allocate the input report buffer; the IRQ path must not allocate.
        let report_buf = vec![0u8; desc.max_input_len as usize].into_boxed_slice();
        let dev = Arc::new(SpinLock::new(Self {
            i2c,
            desc,
            irq_gpio,
            report_desc: report_desc.into_boxed_slice(),
            report_buf,
            parser,
        }));
        // Register the interrupt handler after construction: the closure
        // captures an Arc clone, so the state is shared with the caller
        // instead of being moved into the closure and becoming unusable.
        let irq_dev = Arc::clone(&dev);
        dev.lock().irq_gpio.enable_interrupt(GpioInterruptMode::FallingEdge, move || {
            irq_dev.lock().handle_interrupt();
        })?;
        Ok(dev)
    }
    /// Interrupt handler: read HID report, parse, deliver events.
    /// Writes into the pre-allocated `report_buf` — no heap allocation here.
    fn handle_interrupt(&mut self) {
        let reg_bytes = self.desc.input_reg.to_le_bytes();
        if Self::xfer(&self.i2c, &reg_bytes, &mut self.report_buf).is_err() {
            return; // Ignore read errors (spurious interrupt or device glitch).
        }
        // Parse HID report → InputEvent structs.
        for event in self.parser.parse_input_report(&self.report_buf) {
            umka_input::post_event(event); // Input subsystem ring buffer (Section 20.1).
        }
    }
}
10.10.4 Precision Touchpad (PTP)
Windows Precision Touchpad devices use HID Usage Page 0x0D (Digitizers), Usage 0x05 (Touch Pad). The HID report contains:
- Contact count: Number of active touches (0-10+).
- Per-contact data: X/Y position (absolute, in logical units), contact width/height, pressure, contact ID.
- Button state: Physical button click (if present), pad click (tap-to-click handled in userspace).
// umka-core/src/hid/touchpad.rs
/// Parsed Precision Touchpad report.
pub struct PtpReport {
/// Number of active contacts.
pub contact_count: u8,
/// Per-contact data (up to 10 simultaneous touches).
pub contacts: [PtpContact; 10],
/// Button state (bit 0 = left button, bit 1 = right button).
pub buttons: u8,
}
/// Single touch contact on a Precision Touchpad.
#[derive(Clone, Copy)]
pub struct PtpContact {
/// Contact ID (persistent across reports while finger is down).
pub id: u8,
/// Tip switch (1 = finger down, 0 = finger lifted).
pub tip: bool,
/// X position (logical units, 0 = left edge).
pub x: u16,
/// Y position (logical units, 0 = top edge).
pub y: u16,
/// Width (logical units, or 0 if not reported).
pub width: u16,
/// Height (logical units, or 0 if not reported).
pub height: u16,
}
Gesture recognition: Kernel delivers raw multi-touch HID reports via the input ring buffer. Gesture recognition (palm rejection, tap-to-click, multi-finger swipes) is handled by a userspace input library (libinput or equivalent).